Input format for sequences
Note on treatment of punctuation characters
Blank, tab, and digit sequence-character options
Valid sequence-character options
Compositional analysis options
Definition line format options
Sequence line format options
Results: output files and information
SeqCheck is a sequence utility: it reformats, does a validation check on sequence characters, performs several types of compositional analyses, and optionally removes sequences that fail certain rules. The input is a set or file of sequences in FASTA format. After you have specified the options to use, click the SUBMIT button to run SeqCheck. Clicking the RESET button returns all fields to their default values, and clears the sequences. If there is an error (such as no sequence input), it will be reported and you may use the “back button” of your browser to return to the input page and correct the error. The output consists of summary statistics and compositional analyses over the entire set of sequences, and the reformatted sequences, ready to download to your computer.
Defaults: If you take all the default options, SeqCheck does the following:
- truncates (if necessary) all definition lines to 80 characters
- removes blanks, tabs, and digits from the sequence data
- performs a basic compositional analysis (not case sensitive)
- reformats sequence lines to 50 characters maximum
The simplest use of SeqCheck would be to reformat a set of sequences with varying line lengths so that all sequence lines are the same length. Other examples might also involve a sequence validation check (to make sure only legal characters are in the sequences) and make use of the compositional analysis.
- Number of sequences: no limit; if large, upload a file instead of pasting sequences into the text box
- Definition line: 1,000 characters
- Sequence line: 5,000 characters
- Total single sequence length: 25,000 characters
You may paste a set of sequences in the text box, or click the UPLOAD button to specify a sequence file on your computer. The sequences must be in FASTA format, i.e., each new sequence begins with a single line that has the “>” character in the first position, followed by descriptive text (the “definition line”). One or more lines containing the sequence then follow; these are the “sequence lines.” These lines may be of varying length and contain any characters. The following example of protein sequences is valid (note that empty lines are acceptable):
>seq123 this is description
MGIKALAGRDLLAIADLTIEEMKSLLQLAADLKSGVLKPHCRKILGLLFYKASTRTRVSF
TAAMYQLGGQ VLDLNPSVTQ VGRGEPIQ*? DTARVLDRYI DI
LAVRTFKQTDLQTFADHAKM
>
PIINALSDLE HPCqiLADLQ tikeCFGKLE GLTVTYLGDG NNvahslILG
GVMMGMTVRV
>seq125
ATPKNYEPLAEIVQQAQQIAAPGGKVELTDDPKAAAQGSHILYTDVWASMGQEDLADSRI
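As an illustration of the input format only, the following minimal Python sketch splits FASTA text into definition lines and raw sequence lines. The function name `parse_fasta` is hypothetical and is not part of SeqCheck; skipping text that appears before the first “>” is an assumption made here for simplicity.

```python
def parse_fasta(text):
    """Split FASTA text into (definition, sequence_lines) pairs.

    Hypothetical helper, not SeqCheck's own code. The definition is the
    text after ">"; sequence lines are kept unmodified.
    """
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            # A ">" in the first position starts a new sequence.
            records.append((line[1:].strip(), []))
        elif records and line.strip():
            # Non-empty lines belong to the most recent sequence.
            # Empty lines are accepted and simply skipped.
            records[-1][1].append(line)
    return records
```

A definition line holding only “>” yields an empty definition, as in the second sequence of the example above.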
Punctuation characters are allowed within sequences. There are no punctuation characters, however, in the built-in valid sequence character sets. Therefore, if you wish to do a valid sequence-character check on sequences with valid punctuation characters, you must supply your own “user-defined” character set (see Valid sequence-character options).
Also note that punctuation characters are ignored in the compositional analysis, which counts only letters (see Compositional analysis options).
SeqCheck does not consider blank, tab, and digit characters to be sequence characters. This assumes that blanks and tabs are “spacer” characters that make it easier to count positions in a sequence, and that digits are used only in numbers (positional counts) at the beginnings of sequence lines. They are stripped from the input sequences before compositional analysis and reformatting. They may be inserted in the reformatted sequence output, however, by the Sequence line format options. Use the next two options only if you want to check if any of these characters are in your sequences; if they are, sequences containing them will be deleted (see Results: output files and information).
Check if blank or tab characters exist? [default=NO]
Set this to YES if you believe these characters should not be in your sequences, and you wish to determine if they are.
Check if digit characters exist? [default=NO]
Set this to YES if you believe digits should not be in your sequences, and you wish to determine if they are.
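The two behaviors described above (always stripping blanks, tabs, and digits, and optionally flagging sequences that contain them) might be sketched as follows. The names `strip_spacers` and `spacer_failures` and the wording of the reason strings are assumptions for illustration, not SeqCheck's actual code or messages.

```python
# Deletion table for the characters that are not sequence characters.
SPACERS = str.maketrans("", "", " \t0123456789")


def strip_spacers(raw):
    """Remove blanks, tabs, and digits from raw sequence text."""
    return raw.translate(SPACERS)


def spacer_failures(raw, check_blank_tab=False, check_digit=False):
    """Return the reasons a raw sequence fails the optional checks.

    Both checks are off by default, mirroring the documented defaults.
    """
    reasons = []
    if check_blank_tab and any(c in " \t" for c in raw):
        reasons.append("blank or tab found")
    if check_digit and any(c in "0123456789" for c in raw):
        reasons.append("digit found")
    return reasons
```

With both checks off, a sequence such as `1 ACGT ACGT` is kept and reduced to `ACGTACGT`; with the digit check on, the same input would be reported as failing.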
These options control validity checks of sequence characters. A character is valid if it is in the specified character set. When invalid characters are found, sequences containing them will be deleted (see Results: output files and information). Any character (except blank, tab, and digits) may be in the user-defined character set.
Check for specific characters? [default=NO]
If YES, the validity check will be done. If NO, the following options are ignored.
Select a valid character set to use [default=User-defined]
The set of valid characters must be specified from this list or in the next option.
If User-defined type, enter the name of the file containing valid characters, or directly enter valid characters
You may upload a file containing the character set from your computer, or enter the characters in the text box. If you do both, the text box characters take precedence. Blanks, tabs, and digit characters are ignored.
Are letters case sensitive? [default=NO]
If YES, the validity check is case sensitive. For example, if the character set is “ATCG”, and this option is set to “YES”, the sequence letters “atcg” are all invalid.
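A validity check of this kind can be sketched in a few lines of Python. The function name `invalid_characters` is hypothetical, and the sequence is assumed to have already had blanks, tabs, and digits removed.

```python
def invalid_characters(sequence, valid_chars, case_sensitive=False):
    """Return the sorted list of characters not in the valid set.

    Illustrative only. Blanks, tabs, and digits in valid_chars are
    ignored, as the option description states.
    """
    allowed = {c for c in valid_chars if c not in " \t0123456789"}
    if not case_sensitive:
        # Accept either case of every character in the set.
        allowed = {c.upper() for c in allowed} | {c.lower() for c in allowed}
    return sorted({c for c in sequence if c not in allowed})
```

For instance, with the character set `ATCG`, the sequence `atcgN` yields only `N` as invalid when case is ignored, while `atcg` yields all four letters when the check is case sensitive. Punctuation such as `*` is valid only if it is included in the user-defined set.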
Compositional analyses are performed only on non-deleted sequences. A basic analysis is always performed, in which the percentage of each letter for the entire sequence set is calculated. Punctuation characters are ignored, so the total of the percentages will always equal 100%. If you choose to perform the special DNA analysis, the following percentages are also computed: GC (not case sensitive), ATCG (uppercase), and atcg (lowercase). The latter two are included to support base calling software that uses uppercase letters to indicate high confidence and lowercase for lower confidence. Also reported is the non-ATCG (not case sensitive) percentage.
Perform special DNA analysis? [default=NO]
DNA must be specified for the special DNA compositional analyses to be done.
Treat upper and lowercase letters as different? [default=NO]
If YES, the basic compositional analysis is case sensitive (e.g., “A” and “a” will be considered different letters). This option does not affect the special DNA compositional analyses.
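The two analyses might be computed roughly as in the sketch below. The function names are hypothetical, and using the total number of letters as the denominator for the DNA percentages is an assumption; the text above does not spell out the denominator.

```python
from collections import Counter


def basic_composition(sequences, case_sensitive=False):
    """Percentage of each letter over the whole sequence set.

    Non-letters (punctuation) are ignored, so the values sum to 100.
    """
    counts = Counter()
    for seq in sequences:
        letters = [c for c in seq if c.isalpha()]
        counts.update(letters if case_sensitive else [c.upper() for c in letters])
    total = sum(counts.values())
    return {letter: 100.0 * n / total for letter, n in counts.items()}


def dna_composition(sequences):
    """GC, uppercase ATCG, lowercase atcg, and non-ATCG percentages."""
    letters = [c for seq in sequences for c in seq if c.isalpha()]
    total = len(letters)
    if total == 0:
        return {}

    def pct(predicate):
        return 100.0 * sum(1 for c in letters if predicate(c)) / total

    return {
        "GC": pct(lambda c: c in "GCgc"),
        "ATCG": pct(lambda c: c in "ATCG"),
        "atcg": pct(lambda c: c in "atcg"),
        "non-ATCG": pct(lambda c: c.upper() not in "ATCG"),
    }
```

Under these assumptions, the uppercase ATCG, lowercase atcg, and non-ATCG percentages always add up to 100%, while GC overlaps with the first two.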
These options control how the definition line (starting with the “>” character) of each sequence will be reformatted.
Maximum length of definition line [default=80]
Each line will be truncated to this length. If one or more of the next options are selected, the line will still not exceed this length. Be careful to leave enough room in the line for the extra information added by the next three options. A length of “1” will result in definition lines containing only the “>” character.
Add sequence length to each definition line? [default=NO]
If YES, the sequence length will be appended in the form “LEN=nnn”.
Add CG percentage to each definition line? [default=NO]
If YES, the CG (any case) percentage will be appended in the form “CG=nn.n%” (useful for DNA sequences only).
Add uppercase ATCG percentage to each definition line? [default=NO]
If YES, the ATCG (uppercase) percentage will be appended in the form “ATCG=nn.n%” (useful for DNA sequences only).
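A rough sketch of how these options could combine is given below. The function name `format_definition`, the single-space separator between appended fields, the use of the full sequence length as the percentage denominator, and the choice to append first and truncate afterwards are all assumptions; only the `LEN=`, `CG=`, and `ATCG=` forms come from the descriptions above.

```python
def format_definition(definition, sequence, max_len=80,
                      add_len=False, add_cg=False, add_atcg=False):
    """Build a definition line and truncate it to max_len characters."""
    parts = [">" + definition]
    n = len(sequence)
    if add_len:
        parts.append(f"LEN={n}")
    if add_cg:
        # C and G in any case.
        cg = 100.0 * sum(c in "CGcg" for c in sequence) / n if n else 0.0
        parts.append(f"CG={cg:.1f}%")
    if add_atcg:
        # Uppercase A, T, C, G only.
        atcg = 100.0 * sum(c in "ATCG" for c in sequence) / n if n else 0.0
        parts.append(f"ATCG={atcg:.1f}%")
    # The ">" counts toward the limit, so max_len=1 leaves only ">".
    return " ".join(parts)[:max_len]
```

In this sketch, appended fields are cut off when the limit is too small, which is why the maximum length should leave room for them.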
These options control how the lines containing the sequence data will be reformatted. Note that blanks, tabs, and digits in the input sequences are removed on input, but blanks and digits may be reinserted in the output sequences through the last two options.
Maximum length of sequence lines [default=50]
The sequence lines will be limited to this number of characters, not counting inserted blanks and position counts.
Change the letter cases? [default=No change]
You may choose not to change the case of letters, or change all to uppercase or lowercase. If the case is changed, this is done after the valid sequence-character check and compositional analysis, therefore not affecting those results.
Insert position count at start of each line
If YES, the position in the sequence of the first character in each line will be inserted. For example:
1 YTATAATAAAaaTGCTCTtTTTGATGTTTTGATGATAAATTATTAATTAA
51 GGCAGCGAATGAAACTCCAGATGATTTAAGaGTATTTCCSGATGATGCTG
101 GGAGCAGCAcCACCGGGAGCAGCA
Insert blank after every [default=10] characters? [default=NO]
If YES, a blank will be inserted after the specified number of characters.
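The sequence line options can be sketched together as below. The function name `format_sequence` and its parameter names are hypothetical, and the exact spacing between the position count and the sequence is an assumption based on the example above.

```python
def format_sequence(sequence, line_len=50, case=None,
                    position_count=False, blank_every=0):
    """Return the reformatted sequence lines as a list of strings.

    line_len must be positive; it counts sequence characters only,
    not inserted blanks or position counts.
    """
    if case == "upper":
        sequence = sequence.upper()
    elif case == "lower":
        sequence = sequence.lower()
    lines = []
    for start in range(0, len(sequence), line_len):
        chunk = sequence[start:start + line_len]
        if blank_every > 0:
            # Insert a blank after every blank_every characters.
            chunk = " ".join(chunk[i:i + blank_every]
                             for i in range(0, len(chunk), blank_every))
        if position_count:
            # 1-based position of the first character on this line.
            chunk = f"{start + 1} {chunk}"
        lines.append(chunk)
    return lines
```

With a line length of 8, position counts on, and a blank every 4 characters, the sequence `ACGTACGTACGT` becomes the two lines `1 ACGT ACGT` and `9 ACGT`.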
These options are rules; sequences that fail a rule will be deleted (see Results: output files and information). To activate a rule, select YES and, if needed, change the default rule value.
Delete sequences with < [default=10] characters? [default=NO]
If YES, sequences shorter than the specified number of characters will be removed.
Delete sequences with > [default=9999] characters? [default=NO]
If YES, sequences longer than the specified number of characters (not including blanks) will be removed.
If DNA, delete sequences with < [default=95]% ATCG? [default=NO]
If YES, DNA sequences with less than the specified ATCG (uppercase) percentage will be removed.
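The three rules could be applied along the lines of the sketch below. The function name `filter_failures` and the reason strings are hypothetical; passing `None` stands for a rule left at its default of NO, and the sequence is assumed to be already stripped of blanks, tabs, and digits.

```python
def filter_failures(sequence, min_len=None, max_len=None, min_atcg_pct=None):
    """Return the list of filter rules that a sequence fails.

    A rule is active only when its threshold is not None.
    """
    reasons = []
    n = len(sequence)
    if min_len is not None and n < min_len:
        reasons.append(f"fewer than {min_len} characters")
    if max_len is not None and n > max_len:
        reasons.append(f"more than {max_len} characters")
    if min_atcg_pct is not None:
        # Uppercase A, T, C, G as a share of all sequence characters.
        pct = 100.0 * sum(c in "ATCG" for c in sequence) / n if n else 0.0
        if pct < min_atcg_pct:
            reasons.append(f"less than {min_atcg_pct}% uppercase ATCG")
    return reasons
```

A sequence with an empty result passes every active rule and would be kept; any non-empty result would send it to the deleted set.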
After clicking the SUBMIT button, SeqCheck will run and present a results page. If you want to change input options, you may use the “back button” of your browser to return to the input page, change options, and re-submit.
The results are divided into two categories: sequences kept and sequences deleted. Sequences are deleted if they fail any of the Blank, tab, and digit sequence-character options; Valid sequence-character options; or Sequence-filter options. By default, these options are turned off.
The total number of sequences is given, as well as the following statistics:
- Number of sequences
- Number of sequences with invalid characters
- Minimum sequence length
- Maximum sequence length
- Mean sequence length
- Total length, all sequences
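The length statistics in this list are straightforward to reproduce. The sketch below uses a hypothetical function name and dictionary keys, and assumes the sequences have already been stripped of blanks, tabs, and digits.

```python
def length_summary(sequences):
    """Count, minimum, maximum, mean, and total of sequence lengths."""
    lengths = [len(s) for s in sequences]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0, "total": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
        "total": sum(lengths),
    }
```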
If any sequences failed one of the sequence-character or sequence-filter options, they will be deleted and available in a second file. Otherwise, there will be a single file of sequences prepared (i.e., all of the input sequences). Output file(s) may be viewed in either “HTML” format (which shows errors in red), or in “text” format. To download a sequence file, use your browser’s “save file” option with the “text” link.
If there is a deleted-sequence file, the reason(s) a sequence was deleted are included before each sequence. If there were invalid characters, they are listed and shown in blinking red within the sequence in the “HTML” format version of the sequence file, with the exception of blank, tab, and digit characters, which are stripped from the input sequences. If a sequence failed a rule in the Sequence-filter options, the rules it failed are given. Also, by turning on the corresponding Definition line format options, you will be able to see (in the definition lines) the exact values that caused the filter rules to fail sequences. Another possible reason for deleting a sequence is that it is an empty sequence, i.e., one with a definition line only.
The results of the compositional analyses are also available in a file, in a computer-readable format. The format for each analysis is: type of analysis (basic and DNA), number of letters, and a line for each letter showing the percentage and the letter itself.
Basic compositional analysis
This is a table showing the percentages of each letter for all non-deleted sequences. Note that punctuation characters are ignored (see Compositional analysis options).
DNA compositional analysis
If the sequence type was specified to be DNA, the following statistics are reported:
- GC (any case) percentage
- ATCG (upper case) percentage
- atcg (lower case) percentage
- non-ATCG (any case) percentage