Module 3: DNA Databases and Sequence Queries
Sequence Search and Retrieval Methods
When a researcher sequences a new cDNA or genomic clone from their field of research (tissue-, organ-, disease- or
species-specific), the obvious next step is to search the databases for any sequence(s) that are identical or
similar to the new sequence. A variety of sequence search tools handle this task well. Most of the services described below
also let the user retrieve data other than DNA sequences.
- Entrez: This service from NCBI is a very powerful tool to search for any
nucleotide or protein sequence found in GenBank. It can also retrieve the
Medline research articles associated with a sequence. The components of the Entrez
database are linked by the following relationships:
- Sequence relationships are computed with BLAST (see Module 4 for discussion). One cannot
search Entrez directly for similarities between a test sequence and the database sequences, since Entrez does not accept a
sequence, in any format, as a search term. Instead, after a BLAST search, the sequences that appear highly similar to the
test sequence can be retrieved through Entrez.
- Relationships between DNA and protein sequences rely on accession numbers.
- Shared keywords and accession numbers can be used to move back and forth between sequences and the related Medline articles
and abstracts (approximately 11 million of the publications in the Medline database).
- Related articles can be accessed through shared keywords called "MeSH" (Medical Subject Headings) terms.
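As a rough illustration of accession-based retrieval, the sketch below builds a request URL for NCBI's E-utilities `efetch` endpoint, a programmatic interface to Entrez that this module does not itself describe. The function name `build_efetch_url` and the example accession are hypothetical choices made for illustration. Actually fetching the record (for example with `urllib.request.urlopen`) needs network access and is left out.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service (an assumption of this
# sketch; the module text does not name it).
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Map one-letter domain codes ("n", "p") to Entrez database names.
DOMAIN_TO_DB = {"n": "nucleotide", "p": "protein"}


def build_efetch_url(accession, domain="n"):
    """Return a URL requesting one record, in FASTA format, by accession number."""
    if domain not in DOMAIN_TO_DB:
        raise ValueError("domain must be 'n' (nucleotide) or 'p' (protein)")
    accession = accession.strip()
    if not accession:
        raise ValueError("accession number must not be empty")
    params = {
        "db": DOMAIN_TO_DB[domain],
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    # "U00001" is only a placeholder accession for demonstration.
    print(build_efetch_url("U00001", "n"))
```

Keeping URL construction separate from the network call makes the accession handling easy to check on its own before any request is sent.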
- Sequence Retrieval System (SRS): This service was created at the
EBI of EMBL. It is a global sequence retrieval system that, besides retrieving DNA and protein sequences, includes other
applications (such as searches for mutations and transcription factor binding sites).
- WWW search launcher: Baylor College of Medicine has a global portal for
searching any type of database (similar to the SRS). The nucleotide search site on this server allows the user to enter DNA
sequences (up to 7,000 bases) for similarity searches using BLAST. There is also an application that can convert sequences
in any format to the FASTA format (see Module 4 for more details) for similarity searches.
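To make the format conversion and the length limit concrete, here is a minimal sketch that turns a raw DNA string into FASTA text and rejects input longer than the 7,000-base limit mentioned above. It is not the launcher's own converter. The names `to_fasta` and `MAX_BASES`, the 60-character line width, and the set of accepted letters are assumptions chosen for illustration.

```python
import re

MAX_BASES = 7000  # limit quoted in the text for the nucleotide search form
LINE_WIDTH = 60   # assumed line width for the sequence lines


def to_fasta(raw, name="query", max_bases=MAX_BASES):
    """Convert a raw DNA string (possibly with digits, spaces, newlines) to FASTA."""
    # Drop whitespace and digits, which often come along when pasting
    # a numbered sequence listing.
    seq = re.sub(r"[\s\d]+", "", raw).upper()
    if not seq:
        raise ValueError("no sequence characters found")
    # Accept the four bases plus N for an unknown base (an assumption of this sketch).
    bad = set(seq) - set("ACGTN")
    if bad:
        raise ValueError("unexpected characters: " + "".join(sorted(bad)))
    if len(seq) > max_bases:
        raise ValueError(f"sequence has {len(seq)} bases; limit is {max_bases}")
    # Header line starting with ">", then the sequence wrapped to LINE_WIDTH.
    lines = [seq[i:i + LINE_WIDTH] for i in range(0, len(seq), LINE_WIDTH)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_fasta("1 atgcgt acgtta\n13 ggcc", name="clone_1"), end="")
```

Checking the length after cleaning, rather than before, means that position numbers and spacing in a pasted listing do not count against the base limit.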
- E-mail servers (Non-interactive): Each of the three major databases has some form of e-mail service through which one can
retrieve sequences. In these cases, the request to the database should contain a keyword (e.g., MeSH terms, gene name)
and/or a unique identifier (e.g., accession number).
- Query Server from NCBI can be searched by sending e-mail to [email protected]
with the following two-line message:
UID [Unique identifier or text term]
The domain can be "n" for nucleotide sequences or "p" for protein sequences. The unique identifier (UID) can be an accession
number. Optional search parameters and formatting specifications are placed on the lines that follow. For a fuller explanation of
the search terms and other help with this service, send an e-mail to the above address with the word "HELP" in the body of the
message to receive the help documentation.
- EMBL sequences can be obtained by sending e-mail to [email protected] with no subject
heading and the following message:
get nuc:[accession number]
This message returns the annotated DNA sequence with the requested accession number.
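The EMBL request described above can be sketched as a small helper that builds the message with Python's standard `email` library. The function name `build_embl_request` and the placeholder addresses and accession are hypothetical. The server address is passed in as a parameter rather than hard-coded, and actually sending the message (for example with `smtplib`) is left out.

```python
from email.message import EmailMessage


def build_embl_request(accession, server_address, sender):
    """Build an e-mail asking the EMBL server for one nucleotide entry."""
    accession = accession.strip()
    if not accession or any(ch.isspace() for ch in accession):
        raise ValueError("accession number must be a single non-empty token")
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = server_address
    # No Subject header is set, matching the "no subject heading" instruction.
    msg.set_content("get nuc:" + accession + "\n")
    return msg


if __name__ == "__main__":
    request = build_embl_request(
        "X00000",                  # placeholder accession, for illustration only
        "server@example.org",      # placeholder for the EMBL server address
        "researcher@example.org",  # placeholder sender
    )
    print(request.get_content())
```

Validating the accession before building the message avoids sending a malformed `get` line that a non-interactive server could only answer with an error mail.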