Searching Sequence Databases with Sequences
- Why Search a Sequence against a Database?
- What programs are there?
- Can you give any General Guidelines for Database Scanning?
Why Search a Sequence against a Database?
- Database similarity searching is one of the first and most important steps in analysing a new sequence. If a similar copy of your unknown sequence is already in the databases, a search will quickly reveal it, and if that copy is well annotated you need go little further in trying to identify your sequence.
- Database searches usually provide the first clues as to whether the sequence belongs to an already studied and well-known protein family.
- If there is a similarity to a sequence that is from another species, then they may be homologous (i.e. sequences that descended from a common ancestral sequence). Knowing the function of a similar/homologous sequence will often give a good indication of the identity of the unknown sequence.
- Bear in mind that, in order to identify homologous sequences, searches should be made at the protein sequence level, because protein comparison is about 5 times more sensitive at finding matches than DNA comparison.
What programs are there?
Many programs for database searching already exist, and many more are being developed. However, only the three most commonly used ones will be mentioned today, and more emphasis will be placed on one of them.
BLAST
BLAST (Basic Local Alignment Search Tool) performs fast database searching combined with rigorous statistics for judging the significance of matches. There are some BLAST services available at CBI, such as:
- BLASTn - Nucleotide-nucleotide BLAST
- BLASTp - Protein-protein BLAST
- BLASTx - Translated Query vs. Protein Database
FASTA
- FASTA can be used to compare either protein or DNA sequences, hence the name, which stands for Fast-All.
- When is a FASTA Match/Hit Significant? - If a high scoring sequence is found (i.e. initn>100, init1>60 and opt>150) then it is likely that these sequences are homologous.
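The rule of thumb above can be written as a small check. This is only a sketch: the function name is hypothetical, the three scores are assumed to have been read from a FASTA report already, and the thresholds are simply the ones quoted above, not a statistical test.

```python
def fasta_hit_looks_significant(initn, init1, opt):
    """Apply the rule-of-thumb thresholds quoted in the text.

    initn, init1 and opt are the scores reported by FASTA for one hit.
    The function name and this interface are illustrative assumptions.
    """
    return initn > 100 and init1 > 60 and opt > 150


# A hit well above all three thresholds is worth a closer look.
print(fasta_hit_looks_significant(initn=120, init1=80, opt=200))
# A hit that fails any one threshold is not flagged.
print(fasta_hit_looks_significant(initn=100, init1=80, opt=200))
```

A hit that passes this check is only likely to be homologous; the alignment itself should still be inspected.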
Both BLAST and FASTA try to find patches of local similarity rather than the best alignment between your entire query and an entire database sequence. This makes them fast, but it also means that they can miss some hits, so they are not guaranteed to find all the best matches.
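To give a feel for what "patches of local similarity" means, here is a minimal sketch that finds short words shared exactly between a query and one database sequence. It is a simplification and not the actual BLAST or FASTA algorithm: the function name, the word length k=3 and the example sequences are all assumptions made for illustration.

```python
def shared_words(query, subject, k=3):
    """Return (query_pos, subject_pos, word) for every exact k-letter
    word that occurs in both sequences (0-based positions)."""
    # Index every word of the query by its starting position.
    index = {}
    for i in range(len(query) - k + 1):
        index.setdefault(query[i:i + k], []).append(i)

    # Slide along the subject and look each word up in the index.
    hits = []
    for j in range(len(subject) - k + 1):
        word = subject[j:j + k]
        for i in index.get(word, []):
            hits.append((i, j, word))
    return hits


hits = shared_words("MKVLAAGIV", "TTKVLAAPQ")
print(hits)  # [(1, 2, 'KVL'), (2, 3, 'VLA'), (3, 4, 'LAA')]
```

The three hits lie on the same diagonal (subject position minus query position is 1 each time), which is the kind of local patch a fast method would then try to extend. Regions that share no exact word are never examined, which is one way such a method can miss a match.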
Can you give any General Guidelines for Database Scanning?
Which is the best method for database scanning? Sadly, there is not a straightforward answer to this question. Attempts have been made to make comparisons but the process is complicated by the difficulty of designing suitable test cases and the number of adjustable parameters. The most effective method of assessing the success of a scanning technique is to test its ability to find all the members of a known protein family from the database of all known sequences. The principle is simple:
- Record the identifier codes of all proteins known to be in the family.
- Select a member to scan with (the query).
- Perform the scan using the method of choice.
- Count how many of the known members are found with higher scores than known non-members.
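The four steps above can be sketched as follows, assuming the scan has already been run and its results are available as a mapping from identifier code to score. The function name, the identifiers and the scores are made up for illustration.

```python
def members_above_nonmembers(scores, family_ids, query_id):
    """Count family members that score higher than every non-member.

    scores     -- dict mapping identifier code -> score from the scan
    family_ids -- identifier codes of all proteins known to be in the family
    query_id   -- the family member used as the query (not counted)

    Returns (number found, number of members that could have been found).
    """
    family = set(family_ids)
    non_member_scores = [s for ident, s in scores.items() if ident not in family]
    best_non_member = max(non_member_scores, default=float("-inf"))

    found = [
        ident for ident, s in scores.items()
        if ident in family and ident != query_id and s > best_non_member
    ]
    return len(found), len(family - {query_id})


# Hypothetical scan: A is the query, B and C are family members that were
# reported, D is a family member the scan did not report at all.
scores = {"A": 300, "B": 200, "C": 90, "X": 120, "Y": 50}
print(members_above_nonmembers(scores, {"A", "B", "C", "D"}, "A"))  # (1, 3)
```

Repeating this with different query members and different methods gives numbers that can be compared directly, although the result still depends on the family chosen as the test case.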
When given a newly determined sequence, a search with BLAST or FASTA will quickly tell you if a close homologue exists. The sensitive BLITZ is also well worth trying. If no similar sequences are found, then alternative PAM matrices should be tried. Start with PAM120, then try PAM250, and in each case vary the gap penalty around the minimum value of the matrix. For PAM250 this is 8, so values of 7-10 are worth trying. Care should always be taken to consider the likely significance of an apparent match.
A low value for T (the word score threshold) reduces the possibility of missing MSPs (maximal segment pairs) with the required score S; however, lower T values also increase the size of the hit list generated in step 2, and hence the execution time and memory required. In practice, the BLASTP program used for protein searches sets compromise values of T and X (the score drop-off at which extension of a hit is abandoned) to balance the processor requirements and sensitivity.
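The effect of T on the size of the word list can be illustrated with a toy example. The sketch below lists every 3-letter word that scores at least T against a query word. The scoring scheme (+5 for an identical residue, -1 otherwise) is an assumption chosen to keep the arithmetic easy to follow; BLASTP scores words with a substitution matrix instead, so the actual counts will differ.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    # Assumed toy scoring: not a real substitution matrix.
    return 5 if a == b else -1


def neighbourhood_words(word, threshold):
    """Return all words of the same length scoring >= threshold against word."""
    words = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            words.append("".join(candidate))
    return words


for t in (15, 9, 3):
    print(t, len(neighbourhood_words("KVL", t)))
# 15 1      (only the word itself)
# 9 58      (up to one mismatch)
# 3 1141    (up to two mismatches)
```

Each step down in T lets more words through, so more database positions are reported as hits and have to be extended, which is where the extra time and memory go.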
BLAST is unlikely to be as sensitive for all protein searches as a full dynamic programming algorithm. However, the underlying statistics provide a direct estimate of the significance of any match found. The program was developed at the NCBI and benefits from strong technical support and continuing refinement. For example, filters have recently been developed to automatically exclude regions of the query sequence that have low compositional complexity or short-periodicity internal repeats. The presence of such sequences can yield extremely large numbers of statistically significant but biologically uninteresting MSPs. For example, searching with a sequence that contains a long section of hydrophobic residues will find many proteins with transmembrane helices.
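The idea of masking low-complexity regions before a search can be sketched with a simple sliding-window entropy measure. This is not the filter used by the BLAST programs; it is a stand-in to show the principle. The function names, the window length of 12 and the entropy cut-off of 2.2 bits are all assumptions.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in Counter(segment).values())


def mask_low_complexity(seq, window=12, min_entropy=2.2):
    """Replace with 'X' every residue that falls in a window whose
    composition entropy is below min_entropy."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# A made-up query with a run of 16 leucines in the middle.
query = "ACDEFGHIKLMN" + "L" * 16 + "PQRSTVWYACDE"
print(mask_low_complexity(query))
```

The leucine run is replaced by X characters (along with a few residues on either side, because windows overlapping the run are also low in entropy), while a sequence of varied composition is returned unchanged. A masked query can then no longer match other proteins through that region alone.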
Copyright © 1996-2008,