SeqMake Help
Contents
Sequence Composition Options
Sequence Number and Length Options
Results
Overview
SeqMake is a utility that generates 1-10,000 random sequences. It does this
in one of two fundamentally different ways:

Sequence shuffling: The characters in an input sequence are "shuffled", i.e.,
moved around randomly to produce sequences with exactly the same set of
characters and length as the input sequence. The algorithm is as follows:
each character in the sequence is swapped with another character selected
at random; this process is repeated five times.

Probability distribution: The characters of the output sequences are generated
randomly based on a probability distribution of characters. The distribution
(expressed as a percentage for each distinct character) may be one defined
for protein, DNA, or RNA, or it may be computed from an input sequence. In
this method, the lengths of the random sequences are controlled by the
sequence length options.
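
The shuffling procedure described above can be sketched in a few lines of
Python. This is only an illustration of the description, not SeqMake's actual
code: the function name, the `passes` parameter, and the choice to let a
character be paired with any position (including its own) are assumptions.

```python
import random


def swap_shuffle(sequence, passes=5, rng=None):
    """Shuffle a sequence by repeated random swaps.

    Each position is swapped with a randomly chosen position, and the whole
    pass is repeated `passes` times (five in the description above).
    """
    rng = rng or random.Random()
    chars = list(sequence)
    n = len(chars)
    for _ in range(passes):
        for i in range(n):
            # Assumption: the partner may be any position, including i itself.
            j = rng.randrange(n)
            chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)


shuffled = swap_shuffle("MKTAYIAKQRQISFVKSHFSRQ", rng=random.Random(42))
```

Because characters are only swapped, the output always has the same length
and the same character counts as the input. Python's built-in
`random.shuffle` (a Fisher-Yates shuffle) would be the usual alternative if
the exact swap procedure did not matter.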
The random sequences are provided in FASTA format labeled ">RND00001",
">RND00002", etc. The length of each sequence is also added to the definition
line. The length of each line of sequence is one of the format options.
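
As a rough illustration of the output format, the sketch below writes
sequences as FASTA records with numbered "RND" labels and a configurable line
length. The exact layout of the definition line (here the length follows the
label after a space) and the function name are assumptions; the help text
only states that the length is added to the definition line.

```python
def format_fasta(sequences, line_length=60, prefix="RND"):
    """Return the sequences as FASTA text with labels RND00001, RND00002, ..."""
    records = []
    for number, seq in enumerate(sequences, start=1):
        # Assumption: the sequence length is appended after the label.
        header = f">{prefix}{number:05d} {len(seq)}"
        # Break the sequence into lines of at most `line_length` characters.
        lines = [seq[i:i + line_length] for i in range(0, len(seq), line_length)]
        records.append("\n".join([header] + lines))
    return "\n".join(records) + "\n"


text = format_fasta(["ACGTACGTAC"], line_length=4)
# text == ">RND00001 10\nACGT\nACGT\nAC\n"
```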
Sequence Composition Options
These options determine whether the sequences will be generated by shuffling
or from a probability distribution. If the latter, the type of probability
distribution is set by the user.
Sequence for shuffling or sequence-based probability distribution
Paste or upload a sequence if one is required, either because you are
shuffling a sequence or because you want to generate random sequences with
the same distribution of characters as the input sequence. Enter a raw
sequence, without a FASTA-format definition line (indicated by a leading ">"
character). All characters are valid except blanks, which are stripped from
the input.
Are you shuffling a sequence?
Select Yes or No. No is the default. If Yes, you must enter a sequence and
specify the next two parameters.
If Yes, enter a sequence above, choose the following two parameters, and
then Submit
There are only two options for shuffled sequences: the number of sequences
to generate (1-10,000) and the sequence line length (10-999) in the
FASTA-format output.
Otherwise, choose a sequence type and base or amino acid probability distribution
If not shuffling, these options determine the probability distribution of
characters to be used to generate the random sequences. There are four types
of sequences: protein, DNA, RNA, and Input Sequence. The first three types
have predetermined alphabets (the 20 amino acid letters or the 4 DNA or RNA
base letters). There are three types of probability distributions available
for proteins: uniform (equal probability of each amino acid), user-defined
(see next section), and Swiss-Prot (probabilities determined from a recent
release of the Swiss-Prot protein sequence database). There are two types
of distributions for DNA and RNA sequences: uniform (equal probability of
each base) and user-defined. You may also choose to have protein sequences
begin with methionine (the amino acid encoded by the start codon), and DNA
or RNA sequences begin with a start codon (ATG or AUG).
The final type of sequence is "Input Sequence" (entered at the top of the
form); in this case the probabilities are calculated from the input sequence,
and random sequences are generated based on this distribution. No assumption
is made about the type of sequence in this option; only the characters in
the input sequence will appear in the random sequences.
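
The "Input Sequence" option can be pictured as two steps: count the
characters of the input to get percentages, then draw each output character
from that distribution. The following is a minimal sketch under that reading;
the function names are hypothetical and SeqMake's own random number handling
is not documented here.

```python
import random
from collections import Counter


def distribution_from_sequence(sequence):
    """Return the percentage of each character in the input sequence."""
    counts = Counter(sequence.replace(" ", ""))  # blanks are stripped
    total = sum(counts.values())
    return {ch: 100.0 * n / total for ch, n in counts.items()}


def random_sequence(percentages, length, rng=None):
    """Draw `length` characters independently from the given distribution."""
    rng = rng or random.Random()
    letters = list(percentages)
    weights = [percentages[ch] for ch in letters]
    return "".join(rng.choices(letters, weights=weights, k=length))


dist = distribution_from_sequence("AAB C")   # {'A': 50.0, 'B': 25.0, 'C': 25.0}
seq = random_sequence(dist, 50, rng=random.Random(1))
```

Only characters present in the input can appear in `seq`, which matches the
behavior described above. The same `random_sequence` function would work for
the predefined protein, DNA, or RNA distributions if given the appropriate
percentages.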
User-defined Probability Percentages
This table is used to define the percentages of each protein, DNA, or RNA
letter to be used to generate the random sequences. The first column (4 letters)
does triple duty: use it to enter values for DNA or RNA sequences, or the
percentages for the first four amino acids. If the sequence type is protein,
the remainder of the table is used. The values entered into this table are
used in the following way. Enter a percentage value in one or more boxes.
If any boxes are left blank, the remaining percentage is divided uniformly
among the blank boxes. For example, for a set of random DNA sequences,
suppose the user enters 20 for A and 30 for C. The percentages for the blank
boxes (T and G) are then calculated by splitting the remaining 50% evenly,
i.e., 25% for T and 25% for G. If all boxes are filled in, the total must
equal exactly 100.
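
The rule for blank boxes amounts to simple arithmetic, sketched below. The
function name is hypothetical, blank boxes are represented by letters missing
from the `entered` dictionary, and rejecting partial entries that already
exceed 100 is an assumption, since the help text does not describe that case.

```python
def fill_blank_percentages(entered, alphabet):
    """Complete a table of percentages, spreading the remainder over blanks.

    `entered` maps letters to the percentages typed in; letters of
    `alphabet` that are missing from it are treated as blank boxes.
    """
    filled_total = sum(entered.values())
    blanks = [ch for ch in alphabet if ch not in entered]
    if not blanks:
        # All boxes filled in: the total must equal exactly 100.
        if abs(filled_total - 100) > 1e-9:
            raise ValueError("percentages must total 100")
        return dict(entered)
    if filled_total > 100:
        # Assumption: over-full partial entries are treated as an error.
        raise ValueError("entered percentages exceed 100")
    share = (100 - filled_total) / len(blanks)
    return {ch: entered.get(ch, share) for ch in alphabet}


table = fill_blank_percentages({"A": 20, "C": 30}, "ACGT")
# table == {"A": 20, "C": 30, "G": 25.0, "T": 25.0}
```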
Sequence Number and Length Options
These options apply only to random sequences generated from a probability
distribution (not shuffling).
Number of sequences to generate
Select 1-10,000 sequences.
Sequence Length
There are four different methods for determining the lengths of the random
sequences.
Same as input sequence: all sequences are the length of the input
sequence (applies only to input-sequence-based generation).
Constant: all sequences are of the specified length.
Randomly distributed: lengths are uniformly randomly distributed between
a minimum and maximum length.
Normally distributed: lengths are normally randomly distributed around a
mean length and standard deviation.
Sequence lengths are constrained to the range 10-4,000. For normally
distributed lengths, if the mean plus or minus 3 times the standard deviation
falls outside these limits, an error is generated.
The final option is the length of each line of sequence in the FASTA-format
output file.
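
The length methods and the check on normally distributed lengths can be
sketched as follows. The limits are taken to be 10 and 4,000, and the
function names, the rounding of normal draws, and the clamping of rare draws
that fall beyond three standard deviations are all assumptions made for
illustration.

```python
import random

MIN_LENGTH, MAX_LENGTH = 10, 4000  # assumed limits on sequence length


def check_normal_parameters(mean, sd):
    """Raise an error if mean +/- 3 standard deviations leaves the limits."""
    if mean - 3 * sd < MIN_LENGTH or mean + 3 * sd > MAX_LENGTH:
        raise ValueError("mean +/- 3 sd must lie within the length limits")


def uniform_length(minimum, maximum, rng):
    """Length drawn uniformly between minimum and maximum (inclusive)."""
    return rng.randint(minimum, maximum)


def normal_length(mean, sd, rng):
    """Length drawn from a normal distribution, rounded to an integer."""
    check_normal_parameters(mean, sd)
    value = round(rng.gauss(mean, sd))
    # Assumption: rare draws beyond the limits are clamped into range.
    return min(max(value, MIN_LENGTH), MAX_LENGTH)


rng = random.Random(0)
lengths = [normal_length(100, 10, rng) for _ in range(5)]
```

The "same as input sequence" and "constant" methods need no random draw:
every sequence simply receives the input length or the specified length.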
Results
The set of randomly generated sequences is supplied via a link. You may view
the sequences, or use your browser's "save as" function to download this file
to your computer.
The head of the sequence file and the results page give the following
information:

the method by which the random sequences were generated

the desired (i.e., the probability distribution) and observed percentages
of each sequence character for the set of random sequences

statistics for the lengths of the sequences, which depend on the method for
sequence length selected
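
The observed percentages and length statistics listed above could be computed
from a set of generated sequences along these lines. Which statistics SeqMake
actually reports is not stated, so the minimum, maximum, and mean used here
are assumptions, as is the function name.

```python
from collections import Counter
from statistics import mean


def summarize(sequences):
    """Return observed character percentages and simple length statistics."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    observed = {ch: 100.0 * n / total for ch, n in sorted(counts.items())}
    lengths = [len(seq) for seq in sequences]
    # Assumption: min, max, and mean stand in for the reported statistics.
    stats = {"min": min(lengths), "max": max(lengths), "mean": mean(lengths)}
    return observed, stats


observed, stats = summarize(["AAC", "AG"])
# observed == {"A": 60.0, "C": 20.0, "G": 20.0}
# stats == {"min": 2, "max": 3, "mean": 2.5}
```

Comparing `observed` with the desired distribution shows how closely a given
set of random sequences follows the percentages that were requested.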