New Mexico State University

GraphAlign Help

If you use GraphAlign in a publication, please cite:

Spalding, J. B. and Lammers, P. J. 2004. BLAST Filter and GraphAlign: rule-based formation and analysis of sets of related DNA and protein sequences. Nucleic Acids Res. 32:W26-W32.

Contents

Sequence Formats and Limits
How GraphAlign Works
Submission Form Page
Analysis Results Pages

Overview

GraphAlign is a bioinformatics program for analyzing pairwise alignments of protein or nucleotide sequences in novel ways and presenting the alignments in a graphical form. You submit a query sequence and one or more subject sequences. The query sequence is then paired with each subject and a ClustalW global pairwise alignment is performed for each pair. These alignments are then analyzed to find sections (subalignments) of high quality, and graphs of each alignment are prepared that show which parts of the alignments are the most similar. The alignments may be sorted in a number of different ways based on what the user considers most important in the nature of the alignments.

GraphAlign can provide information about the similarity between a query and subject sequences that is different from that provided by sequence similarity search programs such as BLAST and FASTA, which match subject sequences by finding high quality local alignments. GraphAlign analyzes global pairwise alignments, which places different constraints on the alignment process than does local alignment. GraphAlign would normally be used after a BLAST-type search to further explore similarities between the query and subject sequences. In order to use GraphAlign, the full sequences must be obtained. A simple way to do this is to use BLAST Filter to perform the BLAST Search and return selected subject sequences.

The "quality" of alignments and subalignments is determined in one of three ways, which are explained in more detail below.

  • Percent identity: The percent of identical residues; appropriate for both nucleotide and protein sequences
  • Mean score: The average score; appropriate for protein sequences only. This method uses a protein substitution scoring matrix and gap penalties to assign a score for each position in the alignment. For help in understanding alignment concepts, scoring matrices, and gap penalties see the excellent Pittsburgh Supercomputing Center tutorial on Searching Sequence Databases and Sequence Scoring Methods.
  • Total score: The sum of scores; appropriate for protein sequences only. This is the same as Mean score except that the measure of alignment quality is the sum of the scores over all positions in the alignment.

GraphAlign performs its calculations in two stages. When you run it initially (stage 1), the ClustalW alignments are done and results are computed and output. You may then click on Modify Parameters to change analysis parameters and redisplay changed output tables and graphs (stage 2). This second stage reuses the existing ClustalW alignments, and its speed depends on which parameters you change: changing some parameters requires re-analysis of the alignments, while changing others does not (explained below).

You may bookmark any of the results pages and return later to continue a GraphAlign analysis. However, we cannot guarantee that results will be kept longer than two weeks.

ClustalW recognizes whether the sequences are nucleotide or protein and is run with default parameters. If 85% or more of the characters in the sequence are A, C, G, T, U or N, the sequence is assumed to be nucleotide.
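
The 85% rule can be illustrated with a minimal Python sketch. This is not ClustalW's code: the function name is hypothetical, and the choices to ignore case and to count only alphabetic characters (so gaps and whitespace are skipped) are assumptions.

```python
def looks_like_nucleotide(sequence: str, threshold: float = 0.85) -> bool:
    """Return True if at least `threshold` of the letters are A, C, G, T, U or N.

    Assumption: only alphabetic characters are counted, and case is ignored.
    """
    residues = [c for c in sequence.upper() if c.isalpha()]
    if not residues:
        return False
    nucleotide_count = sum(1 for c in residues if c in "ACGTUN")
    return nucleotide_count / len(residues) >= threshold


if __name__ == "__main__":
    print(looks_like_nucleotide("ACGTACGTNN"))   # all ten letters are nucleotide codes
    print(looks_like_nucleotide("MKVLAAGIVG"))   # only 4 of 10 letters are A or G
```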

Sequence Formats and Limits

If you have used BLAST Filter to generate a set of subject sequences from a query, you can import those files directly into GraphAlign. Simply paste the URL for the BLAST Filter results page into the top text box.

Otherwise, the query and subject sequences are either pasted into input boxes or specified as text files on your computer. Sequences must be in FASTA format, i.e., each sequence begins with a definition line (starting with the ">" character) followed by one or more sequence lines. There must be only one query sequence, and one or more subject sequences. The following limits apply:

  • Maximum number of subject sequences: 200
  • Maximum length of a sequence line: 1000 characters
  • Sequence length: no limit, but the resulting pairwise alignment cannot exceed the following limit
  • Maximum alignment length: 10,000 positions

These limits are designed to allow analysis of many short sequences, or a few very long sequences, but not both. Please do not submit a sequence set that approaches both these limits.
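
Before submitting, the input rules above can be checked locally. The following is a minimal Python sketch, not GraphAlign's own validation code: the function names are hypothetical, and applying the 1000-character limit only to sequence lines (not definition lines) is an assumption. The alignment-length limit is not checked because it is only known after ClustalW has run.

```python
from typing import List, Tuple

MAX_SUBJECTS = 200       # maximum number of subject sequences
MAX_LINE_LENGTH = 1000   # maximum length of a sequence line


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (definition line, sequence) pairs."""
    records = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data found before a '>' definition line")
            if len(line) > MAX_LINE_LENGTH:
                raise ValueError("sequence line longer than 1000 characters")
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_submission(query_text: str, subject_text: str):
    """Return (query record, subject records) or raise ValueError."""
    query = parse_fasta(query_text)
    subjects = parse_fasta(subject_text)
    if len(query) != 1:
        raise ValueError("exactly one query sequence is required")
    if not 1 <= len(subjects) <= MAX_SUBJECTS:
        raise ValueError("between 1 and 200 subject sequences are required")
    return query[0], subjects
```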

How GraphAlign Works

The global pairwise alignments are generated by the ClustalW version 1.83 "slow pairwise" (full dynamic programming) method with default parameters, which includes the BLOSUM and IUB scoring matrices for protein and nucleotide sequences, respectively. For more on how ClustalW works see the Baylor College of Medicine Help for ClustalW page. GraphAlign analyzes these alignments and reports results as statistics (in a table) and in graphs.

The following examples are based on an analysis of the ClustalW pairwise alignment of two dehydroquinate dehydratase protein sequences. The alignment position numbers are added by GraphAlign beneath the consensus lines to assist the user in identifying specific sections of the alignment from the information in the graphs.
To view this alignment click on the image.

Evaluation of alignment quality

The quality of alignments (both full alignments and subalignments) is computed in one of three ways:

  • Percent Identity (protein and nucleotide sequences): This measure is calculated by dividing the number of identities (pairs of identical letters, not case sensitive) in an alignment by the length of the alignment; therefore all positions with a gap in either sequence are non-identities.
  • Mean score (protein sequences only): This method assigns a score to each position in the alignment. If the position contains a gap ("-") in either sequence the gap penalty score specified by the user (default -1) is used. If not, the score (in units of bits) for the pair of amino acids is determined from the substitution scoring matrix selected by the user (default BLOSUM30). The mean score for an alignment is the sum of scores divided by the length of the alignment or subalignment.
  • Total score (protein sequences only): Same as Mean score, except it is the sum of scores of the alignment. This is the typical measure used by local alignment methods such as FASTA and BLAST, which find local alignments that maximize the sum of scores.

A GraphAlign analysis uses one of these three methods. To show results using a different method, you can select Modify Parameters and choose another method.
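
The three measures can be sketched in Python as follows. This is an illustration under stated assumptions, not GraphAlign's implementation: the function names are hypothetical, the two aligned sequences are assumed to be equal-length strings with "-" for gaps, and TOY_MATRIX holds made-up scores (it is not BLOSUM30).

```python
# Hypothetical toy substitution scores for illustration only; NOT BLOSUM30 values.
TOY_MATRIX = {("A", "A"): 4, ("G", "G"): 6, ("A", "G"): 0}


def percent_identity(query: str, subject: str) -> float:
    """Identities divided by alignment length; gap positions are non-identities."""
    if len(query) != len(subject) or not query:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    identities = sum(
        1 for a, b in zip(query.upper(), subject.upper()) if a == b and a != "-"
    )
    return 100.0 * identities / len(query)


def column_scores(query: str, subject: str, matrix, gap_penalty=-1):
    """Score each position: gap penalty for gaps, matrix score otherwise."""
    scores = []
    for a, b in zip(query.upper(), subject.upper()):
        if a == "-" or b == "-":
            scores.append(gap_penalty)
        elif (a, b) in matrix:
            scores.append(matrix[(a, b)])
        else:
            scores.append(matrix[(b, a)])  # matrix is symmetric
    return scores


def total_score(query: str, subject: str, matrix, gap_penalty=-1):
    return sum(column_scores(query, subject, matrix, gap_penalty))


def mean_score(query: str, subject: str, matrix, gap_penalty=-1) -> float:
    return total_score(query, subject, matrix, gap_penalty) / len(query)
```

For the toy alignment "AG-A" against "AGGG", the positions are identity, identity, gap, and mismatch, so the percent identity is 50.0, the total score is 4 + 6 - 1 + 0 = 9, and the mean score is 2.25.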

Percent identity, mean score, and total score curves

Each graph presents (at minimum) a curve that shows the percent identity, the mean score, or total score at each point in the alignment, depending on which evaluation method you selected. This example shows a percent identity curve:

The title of each graph is the definition line of the subject sequence, preceded by the number of that subject sequence (i.e., the rank of that sequence in the input order of sequences). The X axis is position in the alignment and the Y axis is percent identity. The percent identity at each position is that for a "window" of the alignment centered at that position. This percent identity is calculated by dividing the number of identities (pairs of identical letters) in the window by the window length; therefore all positions with a gap in either sequence are non-identities. By default the window is set at 21 alignment positions (the value used for the graph above), but may be set to other values to obtain less- or more-smoothed curves. In the next example the window is set at 7 positions:

Therefore, the curve represents a sliding window of percent identities. The highest points on the curve show the centers of regions of highest percent identity. In this example, the trailing end of the alignment contains only gaps for one sequence, which is shown by the red curve along the X axis, indicating 0% identity.
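
A sliding-window percent identity curve can be sketched in Python as below. The function name is hypothetical, and two details are assumptions: the window size is odd (so that it has a center position), and only positions where a full window fits are reported. How GraphAlign treats the ends of the alignment is not described here.

```python
def windowed_percent_identity(query: str, subject: str, window: int = 21):
    """Return a list of (center position, percent identity) pairs.

    Positions are 1-based alignment positions. Gap positions count as
    non-identities, as in the definition above.
    """
    n = len(query)
    if n != len(subject):
        raise ValueError("aligned sequences must have equal length")
    if window < 1 or window > n:
        raise ValueError("window size cannot be greater than the alignment length")
    matches = [
        1 if a == b and a != "-" else 0
        for a, b in zip(query.upper(), subject.upper())
    ]
    half = window // 2
    running = sum(matches[:window])          # identities in the first window
    curve = [(half + 1, 100.0 * running / window)]
    for start in range(1, n - window + 1):
        # Slide by one: add the entering position, drop the leaving one.
        running += matches[start + window - 1] - matches[start - 1]
        curve.append((start + half + 1, 100.0 * running / window))
    return curve
```

A smaller window produces a curve that reacts more sharply to individual mismatches, while a larger window averages over more positions and gives a smoother curve.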

If you selected to evaluate alignments by mean score, the results are similar: the curve represents a sliding window of mean scores. In the next example the default parameters are used: window = 21 positions, gap penalty = -1, and scoring matrix = BLOSUM30:

If the gap penalty is made more severe (changed from -1 to -4), the gapped sections of the alignment (centered on positions 35 and 195) become more apparent, and the mean scores for the trailing end of the alignment (all gaps) fall below the Y axis cutoff of -2:

If you selected to evaluate alignments by total score, the curves are sliding windows of total subalignment scores, and therefore depend on the window size. This is unlike the first two methods, which normalize the number of identities or the sum of scores by the subalignment length. Therefore, larger windows will produce higher total scores, and higher curves. The next example uses the same default parameters as above (window = 21 positions, gap penalty = -1, and scoring matrix = BLOSUM30).

If the window size is increased to 51, the curve is higher and smoother because scores are being totaled over 51 positions instead of 21:

Since the magnitude of the total score curves depends on window size, the Y axes of the graphs are not fixed, but depend on the range of Y values.
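
The difference between mean and total score curves can be seen in a short Python sketch. It takes a list of per-position scores (already assigned from a scoring matrix and gap penalty) and reports either the windowed mean or the windowed total. The function name and the made-up scores in the example are hypothetical, and reporting only full windows is an assumption.

```python
def windowed_scores(position_scores, window=21, mode="mean"):
    """Return (center position, value) pairs for a sliding window.

    mode="mean" divides each window sum by the window size;
    mode="total" reports the window sum itself.
    """
    n = len(position_scores)
    if window < 1 or window > n:
        raise ValueError("window size cannot be greater than the alignment length")
    if mode not in ("mean", "total"):
        raise ValueError("mode must be 'mean' or 'total'")
    half = window // 2
    running = sum(position_scores[:window])
    totals = [running]
    for start in range(1, n - window + 1):
        running += position_scores[start + window - 1] - position_scores[start - 1]
        totals.append(running)
    divisor = window if mode == "mean" else 1
    return [(i + half + 1, t / divisor) for i, t in enumerate(totals)]


if __name__ == "__main__":
    scores = [4, 4, -1, -1, 4]  # made-up per-position scores
    print(windowed_scores(scores, window=3, mode="total"))
    print(windowed_scores(scores, window=5, mode="total"))
```

With these made-up scores the 3-position totals are 7, 2, and 2, while the single 5-position total is 10, which shows why total score curves rise as the window grows.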

Best subalignments and Significant subalignments

GraphAlign also performs calculations to identify regions of the alignment that meet specified thresholds. These regions are determined in novel ways and are defined as follows.

For a given threshold percent identity, mean score, or total score, the Best subalignment is the longest section of the alignment that achieves this value. For example, if the threshold percent identity is 50%, GraphAlign finds the longest part of the alignment with at least 50% identity. The Best subalignment percent identity or mean or total score is specified when you choose to sort by Best subalignment length, and the length is shown in the output table. Significant subalignments are predefined Best subalignments with values set as follows:

  • percent identities between 30% and 100% in 5% intervals
  • mean scores between 0.0 and 4.0 in intervals of 0.5 bits
  • total scores between 0 and 300 in intervals of 50 bits
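
For percent identity, the Best subalignment definition can be sketched in Python as a search for the longest section that meets the threshold. This is a brute-force illustration, not GraphAlign's algorithm: the function names are hypothetical, ties between sections of equal length are broken in favor of the leftmost one (an assumption), and the quadratic search would be slow for alignments near the 10,000-position limit.

```python
def best_subalignment(query: str, subject: str, threshold_pct: float):
    """Return (start, end) of the longest section with percent identity
    >= threshold_pct, as 1-based inclusive positions, or None if no
    section qualifies."""
    matches = [
        1 if a == b and a != "-" else 0
        for a, b in zip(query.upper(), subject.upper())
    ]
    n = len(matches)
    prefix = [0]
    for m in matches:
        prefix.append(prefix[-1] + m)   # prefix[i] = identities in first i positions
    for length in range(n, 0, -1):      # longest candidates first
        for start in range(0, n - length + 1):
            identities = prefix[start + length] - prefix[start]
            if 100 * identities >= threshold_pct * length:
                return (start + 1, start + length)
    return None


def significant_subalignments(query: str, subject: str):
    """Best subalignments at the predefined thresholds 30%, 35%, ..., 100%."""
    return {t: best_subalignment(query, subject, t) for t in range(30, 101, 5)}
```

For the toy alignment "GGAAAAGG" against "CCAAAATT", only positions 3 to 6 are identities, so the 100% threshold gives positions 3 to 6, the 80% threshold gives positions 2 to 6, and the 50% threshold gives the whole alignment.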

As an option, the Significant subalignments may be displayed in the graphs. Here is an example showing Significant subalignments for percent identity:

Each Significant subalignment is shown as a horizontal line spanning the region of the alignment and with a Y axis position equal to that of the percent identity used as the threshold. In this example sections of high percent identity are centered around alignment position 170, although the highest percent identities are very short sections near alignment position 240. Graphs of Significant subalignments for mean score and total score intervals are similar, except that the Y axis is the mean or total score.

Critical subalignments, Critical length, and Critical percent identity, Critical mean score, or Critical total score

These statistics depend on a user-specified value called the Critical length. This length is a percentage of the alignment length. The Critical subalignment is defined as that region of the alignment with the highest percent identity or mean or total score that is at least as long as the Critical length. This highest percent identity or mean or total score is called the Critical percent identity, Critical mean score, or Critical total score. The next example shows a graph with a Critical length of 50% and evaluates subalignments based on percent identity.

The Critical subalignment is shown as a horizontal line spanning the region of the alignment and with a Y axis position (here 30%) equal to that of its percent identity (the Critical percent identity). This Critical subalignment can be understood as answering the question: "What is the highest percent identity that can be found for a subalignment whose length is equal to or greater than the Critical length?" Another way to interpret this result is that all subalignments with an identity > 30% are shorter than the Critical length.
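
The Critical subalignment question can be written as a brute-force Python sketch for the percent identity method. This is an illustration, not GraphAlign's algorithm: the function name is hypothetical, and three details are assumptions: the Critical length is rounded up to a whole number of positions, ties are broken in favor of the first section found, and results below the 30% minimum are returned as None.

```python
import math


def critical_subalignment(query: str, subject: str,
                          critical_length_pct: float = 50.0,
                          minimum_pct: float = 30.0):
    """Return (critical percent identity, start, end) for the section with
    the highest percent identity whose length is at least the Critical
    length, using 1-based inclusive positions. Returns None when the best
    value is below minimum_pct."""
    matches = [
        1 if a == b and a != "-" else 0
        for a, b in zip(query.upper(), subject.upper())
    ]
    n = len(matches)
    min_len = max(1, math.ceil(n * critical_length_pct / 100.0))
    prefix = [0]
    for m in matches:
        prefix.append(prefix[-1] + m)
    best = None
    for start in range(0, n - min_len + 1):
        for end in range(start + min_len, n + 1):   # end is exclusive
            pct = 100.0 * (prefix[end] - prefix[start]) / (end - start)
            if best is None or pct > best[0]:
                best = (pct, start + 1, end)
    if best is None or best[0] < minimum_pct:
        return None
    return best
```

For the toy alignment "GGAAAAGG" against "CCAAAATT", a Critical length of 50% (4 positions) gives a Critical percent identity of 100% at positions 3 to 6, while a Critical length of 100% forces the whole alignment and gives 50%.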

NOTE: The Critical subalignment may not be found. To reduce computer time, subalignments with less than 30% identity, or with a mean or total score below 0.0 (depending on how alignments are evaluated), are not identified. Therefore, no Critical subalignment is reported if the Critical percent identity, Critical mean score, or Critical total score is less than this minimum.

You may choose to produce graphs showing significant, critical, or no subalignments (i.e., percent identity or mean or total score curve only).

Sorting Alignments

There are six ways in which the alignments may be sorted in the output pages. The table results and graphs are shown in this order.

  • No sort: Alignments are output without sorting the order of subject sequences.
  • Global percent identity or mean or total score: Sorted by global pairwise percent identity or mean or total score, depending on which method of evaluating alignment quality was selected.
  • Best subalignment length: If selected, you are asked to specify a threshold percent identity or threshold mean or total score. For each alignment, the longest section that achieves this threshold is computed, and the output is sorted by this length (see Best subalignments and Significant subalignments). The best subalignment lengths are shown in the results table but not in the graphs.
  • Best subalignment length / total alignment length (%): This is the same as "Best subalignment length", except that the results are sorted by the longest section of the alignment expressed as a percentage of alignment length. These percentages are shown in the results table but not in the graphs.
  • Critical percent identity or mean or total score: Sorted by this value (see Critical subalignments).
  • Sequence length difference (%): This sort value is defined as the absolute value of the percent difference between the query and subject sequence lengths.
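
The sequence length difference sort can be sketched in Python as follows. The function names are hypothetical. Expressing the difference relative to the query length follows the Summary table convention in which +24.8% means the subject is 24.8% longer than the query; sorting in ascending order (most similar lengths first) is an assumption.

```python
def length_difference_pct(query_length: int, subject_length: int) -> float:
    """Signed percent difference of the subject length relative to the query."""
    return 100.0 * (subject_length - query_length) / query_length


def sort_by_length_difference(query_length: int, subject_lengths: dict):
    """Return subject names ordered by the absolute percent length difference.

    Assumption: ascending order, so subjects closest in length come first.
    """
    return sorted(
        subject_lengths,
        key=lambda name: abs(length_difference_pct(query_length, subject_lengths[name])),
    )


if __name__ == "__main__":
    lengths = {"s1": 312, "s2": 240, "s3": 251}  # made-up subject lengths
    print(sort_by_length_difference(250, lengths))
```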

Submission Form Page

The inputs to GraphAlign are as follows. See the above sections for clarification of formats and meanings of analysis parameters.

Query sequence

This sequence is paired with each subject sequence in the global pairwise alignments (Sequence format).

Subject sequences

These sequences are each paired with the single query sequence in the global pairwise alignments (Sequence format).

Critical Length (%)

This length is a percentage of the alignment length. The Critical subalignment is defined as that region of the alignment with the highest percent identity or mean or total score that is at least as long as the Critical length. See Critical subalignments for more.

Alignment quality evaluation method

You may choose to evaluate alignments (both full alignments and subalignments) based on percent identity or mean or total score. See Evaluation of alignment quality for more. The mean or total score methods should be used only with protein sequences.

Scoring matrix (protein sequences only, mean or total score method)

Choose one of the BLOSUM protein substitution scoring matrices to use. See Evaluation of alignment quality for more.

Gap penalty (protein sequences only, mean or total score method)

Select the gap penalty (zero or negative score) to use for gap positions in the alignment. See Evaluation of alignment quality for more.

Show which subalignments

You may choose to show in the graphs no subalignments, Significant subalignments, or Critical subalignments.

Window size for percent identity or mean or total score curves

The percent identity or mean or total score at each position in the graph curves is that for a "window" of the alignment centered at that position. See How GraphAlign Works for more information. NOTE: The window size cannot be greater than the length of an alignment. If this situation is encountered, an error message is given indicating the maximum window size allowed.

Sort results by

See Sorting Alignments.

Number of graphs per page

This parameter determines how many alignment graphs are shown in each web page of output.

Number of graph columns per page

This parameter determines how many columns of alignment graphs are shown in each web page of output.

Graph size

There are three choices: small, medium, and large. They are sized to allow three, two, and one column per page output, respectively, to fit on a printed page.

Perform Analysis

This button starts a GraphAlign analysis. After results are presented you may change any of the above parameters by selecting Modify Parameters.

Analysis Results Pages

Whenever new results are being computed, the progress of the analysis is shown in a status page. When the analyses are done, a link titled View Results is shown. Click on this link to view the output pages.

The results of a GraphAlign analysis are presented in one or more output pages. The number of pages output depends on the number of graphs per page value that you selected. If all of the graphs fit on one page, then all of the results are on one page. If not, the first page consists of the summary table of results for all alignments, and links are provided to navigate through the pages with the graphs. For example, if you submitted 50 subject sequences, and chose to show 10 graphs per page, there would be 6 output pages: the table of all results, plus 5 pages of results with graphs.
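
The page count rule described above amounts to the following arithmetic, shown here as a small Python sketch with a hypothetical function name.

```python
import math


def output_page_count(num_subjects: int, graphs_per_page: int) -> int:
    """Number of output pages: one page if every graph fits on it,
    otherwise one summary-table page plus the pages of graphs."""
    graph_pages = math.ceil(num_subjects / graphs_per_page)
    if graph_pages <= 1:
        return 1
    return graph_pages + 1


if __name__ == "__main__":
    print(output_page_count(50, 10))  # 5 graph pages plus the summary table page
```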

Analysis Parameters

Each output page contains a table at the top that shows which analysis parameters you chose.

Summary table

This table contains the analysis statistics for the graphs shown on the page, sorted by the parameter you chose. If there are multiple output pages (when not all graphs can fit on one page, see above), then the first page contains only this table with links to navigate to each output page. If graphs are on multiple output pages, the "Sort Number" (column 1) values are links that jump to the output page containing the graph for that subject sequence. In all cases the "Description" (column 2) value is a link that shows the ClustalW pairwise alignment for that subject sequence.

Click on the image for an example of the Summary table using the percent identity method.

The columns of this table contain the following information (note that the length of the query sequence is shown above the table):

  • Column 1: Sort Number - The number of the subject sequence in sorted order.
  • Column 2: Description - The initial part of the definition line for the subject sequence.
  • Column 3: Input Order - The number of the subject sequence in the order in which it was submitted. This is the same as the number shown in the title of each graph.
  • Column 4: Subject Length - The length of the subject sequence.
  • Column 5: Subject Diff. (%) - The difference in lengths between the query and subject sequences, expressed as a signed percentage. For example, +24.8% means that the subject sequence is 24.8% longer than the query sequence.
  • Column 6: Length - The length of the global pairwise alignment.
  • Column 7: % Identity - The global pairwise percent identity.
  • Column 8: Crit. % Identity - The critical percent identity. Note: to reduce computer time, subalignments with less than 30% identity are not identified. In this case the value shown is "< 30" (see Critical subalignments).
  • Column 9 (not shown above): This extra column shows values that depend on the sorting method (see Sorting Alignments). The sorting methods and their column 9 values are:
    • Sort by Best subalignment length: the best subalignment length (i.e., the sort value)
    • Sort by Best subalignment length / total alignment length (%): the best subalignment length as a percentage of the alignment (i.e., the sort value)
    • Sort by Critical percent identity or mean or total score: the critical subalignment length as a percentage of the total alignment length; this is not the sort value (column 8), but is additional information; if the critical subalignment is not identified, the value shown is "N/A" (see Critical subalignments)

If a GraphAlign analysis is done using the mean score method then column 7 shows the global pairwise mean score and column 8 shows the critical mean score. Click on the image for an example of the Summary table using the mean score method.

If a GraphAlign analysis is done using the total score method then column 7 shows the global pairwise total score and column 8 shows the critical total score.

Note on critical scores: To reduce computer time, subalignments with a mean or total score below 0 are not identified. In this case the value shown in column 8 is "< 0" (see Critical subalignments).

Graphical Alignments

The graphs themselves are shown below the Summary table. See How GraphAlign Works for a description of what the graphs show. Each graph is also a link that shows the ClustalW pairwise alignment for that subject sequence in a second window. If you wish to save these images, use your browser's "Save image as" capability to download the image files to your computer. All of the graphs on a page are part of a single image. If you want to download the image of a single graph only, choose to display one graph per page.

Modify Parameters

After results are presented you may change any of the above parameters by selecting Modify Parameters. If you change the Evaluation of alignment quality method, scoring matrix, gap penalty, or Critical Length parameters, the update will take longer because the pairwise alignments must be reanalyzed. Each time you modify parameters and reanalyze, the results are generated in a new set of web pages. Since the previous pages are retained on the server, you can bookmark any page at any time to be able to return to previous analysis results for the same query/subject sequence set.

Perform New Analysis

This begins a new GraphAlign analysis.