## SeqMake Help

### Contents

- Sequence Composition Options
- Sequence Number and Length Options
- Results

### Overview

SeqMake is a utility that generates between 1 and 10,000 random sequences, using one of two fundamentally different methods:

- Sequence shuffling: The characters in an input sequence are “shuffled”, i.e., moved around randomly, so each output sequence has exactly the same characters (and therefore the same composition and length) as the input sequence. The algorithm is as follows: each character in the sequence is swapped with another character selected at random; this pass over the sequence is repeated five times.
- Probability distribution: The characters of the output sequences are drawn randomly from a probability distribution of characters. The distribution (expressed as a percentage for each character) may be one defined for protein, DNA, or RNA, or one computed from an input sequence. In this method, the lengths of the random sequences are controlled by the sequence length options.
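The shuffling procedure described above can be sketched in a few lines of Python. This is an illustration only, not SeqMake's actual code: the function name `shuffle_sequence` and its parameters are hypothetical, and the sketch assumes the randomly selected swap partner may be any position in the sequence, including the character's own position.

```python
import random
from typing import Optional


def shuffle_sequence(seq: str, passes: int = 5,
                     rng: Optional[random.Random] = None) -> str:
    """Swap every position with a randomly chosen position, `passes` times."""
    rng = rng or random.Random()
    chars = list(seq)
    n = len(chars)
    for _ in range(passes):
        for i in range(n):
            # Assumption: the partner is drawn uniformly from all positions.
            j = rng.randrange(n)
            chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)


if __name__ == "__main__":
    print(shuffle_sequence("MKTAYIAKQRQISFVKSHFSRQ"))
```

Because characters are only ever swapped, the output always has the same length and character counts as the input, whatever the random choices.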

The random sequences are provided in FASTA format labeled “>RND00001”, “>RND00002”, etc. The length of each sequence is also added to the definition line. The length of each line of sequence is one of the format options.
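As a rough illustration of this output format, the sketch below writes sequences as FASTA records. The help text states only that the length is added to the definition line, so the `length=` layout used here is an assumption, as are the function name `format_fasta` and the default line length of 60.

```python
def format_fasta(sequences, line_length=60):
    """Return FASTA text with labels RND00001, RND00002, and so on."""
    records = []
    for index, seq in enumerate(sequences, start=1):
        # Assumption: the exact position and wording of the length field
        # on the definition line is not documented.
        header = f">RND{index:05d} length={len(seq)}"
        # Wrap the sequence into lines of at most `line_length` characters.
        lines = [seq[i:i + line_length] for i in range(0, len(seq), line_length)]
        records.append("\n".join([header] + lines))
    return "\n".join(records) + "\n"


if __name__ == "__main__":
    print(format_fasta(["ACGTACGTACGT", "AAAA"], line_length=10), end="")
```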

### Sequence Composition Options

These options determine whether the sequences will be generated by shuffling or from a probability distribution. If the latter, the type of probability distribution is set by the user.

#### Sequence for shuffling or sequence-based probability distribution

Paste or upload a sequence if one is required, that is, if you are shuffling a sequence or want to generate random sequences with the same distribution of characters as the input sequence. Enter a raw sequence, without a FASTA-format definition line (indicated by a leading “>” character). All characters are valid, except blanks, which are stripped from the input.

#### Are you shuffling a sequence?

Select Yes or No. No is the default. If Yes, you must enter a sequence and specify the next two parameters.

#### If Yes, enter a sequence above, choose the following two parameters, and then Submit

There are only two options for shuffled sequences: the number of sequences to generate (1-10,000) and the sequence line length (10-999) in the FASTA-format output.

#### Otherwise, choose a sequence type and base or amino acid probability distribution

If you are not shuffling, these options determine the probability distribution of characters used to generate the random sequences. There are four types of sequences: protein, DNA, RNA, and Input Sequence. The first three types have predetermined alphabets (the 20 amino acid letters or the 4 DNA or RNA base letters). Three types of probability distribution are available for proteins: uniform (equal probability of each amino acid), user-defined (see next section), and Swiss-Prot (probabilities determined from a recent release of the Swiss-Prot protein sequence database). Two types of distribution are available for DNA and RNA sequences: uniform (equal probability of each base) and user-defined. You may also choose to have protein sequences begin with methionine (the amino acid encoded by the start codon), and DNA or RNA sequences begin with a start codon (ATG or AUG).

The final type of sequence is “Input Sequence” (entered at the top of the form); in this case the probabilities are calculated from the input sequence, and random sequences are generated based on this distribution. No assumption is made about the type of sequence in this option; only the characters in the input sequence will appear in the random sequences.
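The “Input Sequence” option can be pictured as two steps: count the characters of the input to obtain percentages, then draw characters at random according to those percentages. The sketch below is a minimal illustration under that reading; the function names `composition` and `random_sequence` are hypothetical, and it strips all whitespace, which may be broader than the blanks SeqMake removes.

```python
import random
from collections import Counter


def composition(seq):
    """Return the percentage of each character in `seq`, ignoring whitespace."""
    cleaned = "".join(seq.split())
    if not cleaned:
        raise ValueError("input sequence is empty")
    counts = Counter(cleaned)
    total = len(cleaned)
    return {ch: 100.0 * n / total for ch, n in counts.items()}


def random_sequence(percentages, length, rng=None):
    """Draw `length` characters independently according to `percentages`."""
    rng = rng or random.Random()
    letters = list(percentages)
    weights = [percentages[ch] for ch in letters]
    return "".join(rng.choices(letters, weights=weights, k=length))


if __name__ == "__main__":
    dist = composition("AAB B")
    print(dist)
    print(random_sequence(dist, 20))
```

Only characters present in the input receive a non-zero percentage, so only those characters can appear in the generated sequences.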

#### User-defined Probability Percentages

This table defines the percentage of each protein, DNA, or RNA letter used to generate the random sequences. The first column (4 letters) does triple duty: use it to enter values for DNA or RNA sequences, or the percentages for the first four amino acids. If the sequence type is protein, the remainder of the table is also used. Enter a percentage value in one or more boxes. If any boxes are left blank, the remaining percentage is divided uniformly among the blank boxes. For example, for a set of random DNA sequences, suppose the user enters 20 for A and 30 for C. The percentages for the blank boxes (T and G) are then calculated by splitting the remaining 50% evenly, i.e., 25% for T and 25% for G. If all boxes are filled in, the total must equal exactly 100.
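The rule for blank boxes can be written out as a short sketch. The function name `fill_blank_percentages` is hypothetical, and two behaviors are assumptions not stated in this help page: a small floating-point tolerance when checking that a fully filled table totals 100, and an error when the entered values already exceed 100 while boxes remain blank.

```python
def fill_blank_percentages(entered, alphabet):
    """Fill blank boxes so that the remaining percentage is split evenly.

    `entered` maps letters to percentages; letters missing from it (or
    mapped to None) are treated as blank boxes.
    """
    filled = {ch: entered.get(ch) for ch in alphabet}
    blanks = [ch for ch, value in filled.items() if value is None]
    used = sum(value for value in filled.values() if value is not None)
    if not blanks:
        # Assumption: allow a tiny tolerance for floating-point input.
        if abs(used - 100.0) > 1e-9:
            raise ValueError("percentages must total exactly 100")
        return filled
    if used > 100.0:
        # Assumption: the real tool's behavior in this case is not documented.
        raise ValueError("entered percentages exceed 100")
    share = (100.0 - used) / len(blanks)
    for ch in blanks:
        filled[ch] = share
    return filled


if __name__ == "__main__":
    print(fill_blank_percentages({"A": 20, "C": 30}, "ACGT"))
```

With 20 entered for A and 30 for C, the call above assigns 25.0 to each of G and T, matching the example in the text.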

### Sequence Number and Length Options

These options apply only to random sequences generated from a probability distribution (not shuffling).

#### Number of sequences to generate

Select 1-10,000 sequences.

#### Sequence Length

There are four different methods for determining the lengths of the random sequences:

- **Same as input sequence**: all sequences are the length of the input sequence (applies only to input-sequence-based generation).
- **Constant**: all sequences are of the specified length.
- **Randomly distributed**: lengths are uniformly distributed between a minimum and a maximum length.
- **Normally distributed**: lengths are normally distributed with a specified mean and standard deviation.

Sequence lengths are constrained to the range 10-4,000. For normally distributed lengths, an error is generated if the mean minus 3 standard deviations falls below 10 or the mean plus 3 standard deviations exceeds 4,000.
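The uniform and normal length methods, together with the range check, might look like the following sketch. The function names are hypothetical. Rounding normal draws to the nearest integer and clamping the rare draws that fall more than 3 standard deviations from the mean are assumptions; this help page does not say how SeqMake handles them.

```python
import random

MIN_LEN, MAX_LEN = 10, 4000  # limits stated in this help page


def uniform_lengths(count, minimum, maximum, rng):
    """Draw lengths uniformly between `minimum` and `maximum`, inclusive."""
    if not (MIN_LEN <= minimum <= maximum <= MAX_LEN):
        raise ValueError("minimum and maximum must lie within 10-4,000")
    return [rng.randint(minimum, maximum) for _ in range(count)]


def normal_lengths(count, mean, sd, rng):
    """Draw lengths from a normal distribution after checking mean +/- 3 SD."""
    if mean - 3 * sd < MIN_LEN or mean + 3 * sd > MAX_LEN:
        raise ValueError("mean +/- 3 SD must lie within 10-4,000")
    lengths = []
    for _ in range(count):
        value = round(rng.gauss(mean, sd))
        # Assumption: clamp rare draws beyond 3 SD into the allowed range.
        lengths.append(min(MAX_LEN, max(MIN_LEN, value)))
    return lengths


if __name__ == "__main__":
    rng = random.Random()
    print(uniform_lengths(5, 50, 60, rng))
    print(normal_lengths(5, 100, 10, rng))
```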

The final option is the length of each line of sequence in the FASTA-format output file.

### Results

The set of randomly generated sequences is provided via a link. You may view the sequences, or use your browser’s “save as” function to download the file to your computer.

The head of the sequence file and the results page give the following information:

- the method by which the random sequences were generated
- the desired (i.e., the probability distribution) and observed percentages of each sequence character for the set of random sequences
- statistics for the lengths of the sequences, which depend on the method for sequence length selected