EXERCISE 1

Run a basic BLAST search using the following sequence as a query. Copy and paste the sequence into the correct position at the BLAST server. You may use the default settings, but you will have to choose a database and a program. Please run the search using an untranslated nucleotide query against a nucleotide database.

GTCCGGCCTGGGCGACAGAGCAAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Question 1. Examine the output from your BLAST search. Did you identify a unique sequence in the database? Is there only one sequence that exactly matches all the bases in your query?

In your opinion, is the match biologically relevant? Hint: Look carefully at the E values. Do the matches appear to be random?

Question 2. Do all the matches correspond to DNA from one organism? Do most of them? Of the sequences that match your query and are human, do they all correspond to the same chromosome?

Question 3. Formulate a hypothesis to account for these results.

Now open and carefully examine the following file. (You can use your browser's "find" facility to locate it in your BLAST result page. Note that this file is one of the many "matches" to your query with the same value.)

gb|U12581.1|HSU12581

Question 4. Is the information in this file consistent with your hypothesis? Why or why not? If not, would you now like to formulate a new hypothesis?

Return to the original BLAST output. The sequence you retrieved above was listed well down from the top. Click on the "score" for this sequence (68) to examine the alignment. (Hint: you may have to re-run the BLAST query, requesting that more alignments be returned.)

Question 5. What are the specific bases in the sequence that matched your query (please specify by number)? Compare these numbers to the "Features" information from the GenBank entry you retrieved above. Does your query align with any particular feature in the sequence?

Question 6. Look closely at the query sequence you used above to run the search. Why in your opinion do you find no alignments to the string of A's at the end of the query?

EXERCISE 2

Copy the above sequence again, but this time use the blastx program. This program translates the nucleotide sequence in all six reading frames (three on each strand). Paste the sequence into the BLAST query form, select blastx from the program menu and SWISS-PROT (PROT) from the database menu. Run the search and examine the results. The layout is slightly different, but essentially the same.
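To make the idea of "all six reading frames" concrete, here is a minimal Python sketch of a six-frame translation. It is not the blastx implementation; it assumes the standard genetic code, and the function names and the toy sequence are made up for illustration. Frames are labelled +1 to +3 for the sequence as given and -1 to -3 for its reverse complement, which is the usual convention and is assumed here to correspond to the "Frame" line of a blastx report.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the first base; unknown codons become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Map frame labels (+1..+3, -1..-3) to their conceptual translations."""
    frames = {}
    for strand_sign, strand in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            frames[strand_sign * (offset + 1)] = translate(strand[offset:])
    return frames


if __name__ == "__main__":
    # Toy sequence, not one of the exercise queries.
    for frame, protein in sorted(six_frame_translation("ATGGCCTAA").items()):
        print(f"{frame:+d} {protein}")
```

Each of the six translations is then compared against the protein database, which is why a single nucleotide query can produce hits in different frames.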

Question 7. How many sequences matched your query when searching the protein database? What do these sequences have in common?

Question 8. Based on the results of these two searches, which output do you feel was easier to interpret? Which search do you feel was more sensitive? Why?

Question 9. Retrieve the Swiss-Prot file corresponding to the best match from the most recent search (blastx). Read the "Comments" in the header. Given the results of the two searches you have performed, why are the managers of the Swiss-Prot database concerned about these sequences?

EXERCISE 3

Copy the sequence below, and run a BLAST search. Use blastn. Examine the results.

GAATTCTAATCTCCCTCTCAACCCTACAGTCACCCATTTGGTATATTAAAGATGTGTTGTCTACTGTCTA

GTATCCCTCAAGTAGTGTCAGGAATTAGTCATTTAAATAGTCTGCAAGCCAGGAGTGGTGGCTCATGTCT

Question 10. What do you believe the query sequence encodes?

Question 11. How many of the matches do you believe are biologically relevant?

Examine the alignments of the "best" matches. (Recall that you may do this by clicking on the score for each alignment).

Question 12. How many bases of your query match each of these sequences?

Question 13. How much did the E value rise with only one base difference?

EXERCISE 4

The sequence in the last alignment is clearly quite unique. What would happen if you didn't know any more than the first 13 bases? Here they are:

GAATTCTAATCTC

Repeat the blastn search with this sequence as a query.

Question 14. What is the result of the search? Why do you think you obtained this result? Does this result surprise you given that you know that the sequence is in the database?

You should have received the message:

No significant similarity found.

Question 15. What parameters of the search do you feel you could adjust to try to get a match to a sequence in the database? Could you modify the expect value (E value) threshold?

Try again, using some of the "Options for advanced blasting". Use the same 13-base query sequence:

GAATTCTAATCTC

This time, adjust the E value threshold to 100 (from the default of 10). Also, limit the search to human sequences only.
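To see what changing the threshold means, it may help to recall how an E value relates to a bit score. For a bit score S', a query of length m and a database of total length n, the expected number of chance alignments scoring at least that well is roughly E = m * n / 2^S'. The Python sketch below only illustrates that relationship: the function names are invented, the database size is a made-up round number, and a real BLAST server applies edge corrections ("effective" lengths) that are ignored here.

```python
import math


def expect_value(bit_score, query_length, database_length):
    """Approximate E value: E = m * n / 2**S' for bit score S'."""
    return query_length * database_length / 2.0 ** bit_score


def min_bit_score(e_threshold, query_length, database_length):
    """Bit score at which the E value equals the given threshold."""
    return math.log2(query_length * database_length / e_threshold)


if __name__ == "__main__":
    m = 13                 # length of the short query
    n = 3_000_000_000      # hypothetical database size in bases (an assumption)
    for threshold in (10, 100):
        needed = min_bit_score(threshold, m, n)
        print(f"E <= {threshold}: bit score of at least {needed:.1f}")
```

Raising the threshold from 10 to 100 lowers the required bit score by log2(10), a little over 3 bits, and restricting the search to one organism shrinks n, which lowers the E value of any given alignment.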

Question 16. Do you get any matches from the search now? Are any to the same two matching sequences you found when using the complete (140 bp) query sequence? Note that these two sequences were:

gb|U01317.1|HUMHBB

and

gb|L22754.1|HUMBGLOBC

Question 17. If you were to return to the BLAST server and run an advanced BLAST search with the same 13 base query sequence, and wanted to make the search more sensitive by altering the word length, would you use a smaller or larger word size? Recall that the default word size on nucleic acid searches is 11.

The Moral of the Story:

The point of this exercise was to demonstrate that although the database search returns a "No Hits" message, this may not be because there isn't a sequence to be found. It may be because the search criteria need to be optimized for the search. If a given query does not produce a significant match when searching a database, one can always adjust the search parameters or try a different algorithm.
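Two small back-of-the-envelope calculations illustrate the point. The Python sketch below is an illustration only: the function names are invented, and the chance-match estimate assumes a random database with equal base frequencies, which real genomes do not have. The first function counts how many words of a given size a query contains (these are the seeds BLAST can start an alignment from); the second estimates how often an exact copy of a k-base query would turn up purely by chance.

```python
def seed_word_count(query_length, word_size):
    """Number of overlapping words of the given size in a query."""
    return max(0, query_length - word_size + 1)


def expected_chance_matches(k, database_length, both_strands=True):
    """Expected exact occurrences of a k-mer in a random, uniform database."""
    positions = max(0, database_length - k + 1)
    strands = 2 if both_strands else 1
    return strands * positions / 4 ** k


if __name__ == "__main__":
    # A 13-base query with the default word size of 11, and with a smaller one.
    for word_size in (11, 7):
        print(word_size, seed_word_count(13, word_size))

    # Hypothetical database sizes in bases (assumptions, for illustration).
    for size in (10**6, 10**9):
        print(size, expected_chance_matches(13, size))
```

A short query offers very few seed words, and in a large database an exact 13-base match is not much rarer than would be expected by chance, so even a perfect hit can fall outside the default E value threshold.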

EXERCISE 5 NOTE: Please skip this exercise.

EXERCISE 6

Using the following sequence as a query, search the protein database Swiss-Prot using blastx.

acaggtaagc gcccctaaaa tccctttggg cacaatgtgt cctgagggga gaggcagcga cctgtagatg

ggacgggggc actaaccctc aggtttgggg cttctgaatg agtatcgcca tgtaagccca gtatggccaa

tctcagaaag ctcctggtcc ctggagggat ggagagagaa aaacaaacag ctcctggagc agggagagtg

ctggcctctt gctctccggc tccctctgtt gccctctggt ttctccccagg

Recall that protein alignments use a smaller word size (3 instead of 11) as a default. Also please recall that the alignments report both identical residues and "conservative substitutions." You should see that the best match is to the sequence shown below.

>gi|23396972|sp|Q62005|ZP1_MOUSE   Zona pellucida sperm-binding protein 1 precursor 
                                  (Zona pellucida glycoprotein 1) (Zp-1)
          Length = 623

 Score = 35.4 bits (80), Expect = 0.021
 Identities = 25/59 (42%), Positives = 29/59 (49%), Gaps = 1/59 (1%)
 Frame = -1

Query: 180 FSLHPSRDQELSEIGHTGLTWRYSFRSPKPEG*CPRPIYRSL-PLPSGHIVPKGILGAL 7
           F+LHP  D  L+  GHTGLT  Y    P+     P P   SL P P+G  VP    G L
Sbjct: 154 FALHPIPDHTLAGSGHTGLTTLY----PEQSFIHPTPAPPSLGPGPAGSTVPHSQWGTL 208

The sample output above shows that the reading frame used in your query sequence is -1 (the translation is read from the query's complementary strand). It notes that 42% of the aligned positions are identical and 49% are positives. The positives include all identical matched residues AND conservative substitutions (e.g., a lysine for an arginine).
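The "Identities" and "Positives" counts can be checked directly from the three alignment lines. The short Python sketch below is an illustration with invented names, not part of BLAST: it counts a column as an identity when the query and subject residues are the same, and relies on the middle line of the report, where BLAST prints a "+" for each conservative substitution.

```python
# The three alignment lines from the sample output, without coordinates.
QUERY = "FSLHPSRDQELSEIGHTGLTWRYSFRSPKPEG*CPRPIYRSL-PLPSGHIVPKGILGAL"
MIDLINE = "F+LHP  D  L+  GHTGLT  Y    P+     P P   SL P P+G  VP    G L"
SBJCT = "FALHPIPDHTLAGSGHTGLTTLY----PEQSFIHPTPAPPSLGPGPAGSTVPHSQWGTL"


def alignment_stats(query, midline, sbjct):
    """Return (identities, positives, columns) for one alignment block."""
    if len(query) != len(sbjct):
        raise ValueError("query and subject lines must have the same length")
    columns = len(query)
    # Identical residues in the same column; gap characters never count.
    identities = sum(1 for q, s in zip(query, sbjct) if q == s and q != "-")
    # Positives are the identities plus the columns marked '+'.
    positives = identities + midline.count("+")
    return identities, positives, columns


if __name__ == "__main__":
    identities, positives, columns = alignment_stats(QUERY, MIDLINE, SBJCT)
    print(f"Identities = {identities}/{columns} ({100 * identities / columns:.0f}%)")
    print(f"Positives = {positives}/{columns} ({100 * positives / columns:.0f}%)")
```

Run on the alignment above, this reproduces the 25/59 identities and 29/59 positives given in the report.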

Now repeat the search using blastn. You should see an exact match to the sequence used as a query.

Question 21. What does the query sequence match? What is the locus?

Question 22. Can you explain why the blastx search gave such a different result? Which do you believe and why? HINT: Retrieve the best matching sequence from the nucleotide database. Carefully examine which specific bases of the sequence were used as a query.

Question 23. What is the moral of the story in this exercise (make up your own!!)?

EXERCISE 7

1. Go to the Basic BLAST search page at NCBI. Copy the following human amino acid sequence (given in the one-letter code).

MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEE

VGALAKVLRLFEENDVNLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDI

GATVHELSRDKKKDTVPWFPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQ

FADIAYNYRHGQPIPRVEYMEEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCG

FHEDNIPQLEDVSQFLQTCTGFRLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPM

YTPEPDICHELLGHVPLFSDRSFAQFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLC

KQGDSIKAYGAGLLSSFGELQYCLSEKPKLLPLELEKTAIQNYTVTEFQPLYYVAESF

NDAKEKVRNFAATIPRPFSVRYDPYTQRIEVLDNTQQLKILADSINSEIGILCSALQKIK

2. Paste the sequence into the query sequence window and adjust the options as necessary. You won't need to specify advanced options, but you should choose a program and database. For simplicity, please use the main SWISS-PROT database. You may wish to try other databases, but you should return to SWISS-PROT when continuing with this exercise.

Run the search and identify the protein. Use the link provided to retrieve the best matching sequence and examine the Comments. Isn't it great to have this annotation!

You may need to look at pages that are linked from the SWISS-PROT report to answer the following questions. The idea is to get a feel for what is available.

Question 24. What is the SWISS-PROT name of the entry?

Question 25. What is the SWISS-PROT primary accession number?

Question 26. What is the most common name of the protein?

Question 27. What is the gene called?

Question 28. In which year was the crystal structure of the catalytic domain determined? Who is the first named author of this work?

Question 29. Does the enzyme require a co-factor to function? If so, what?

Question 30. Name the most common disease that arises as a result of deficiency of this enzyme.

Question 31. Which cytogenetic locus does the gene reside at? (e.g. 13p10.1)

Question 32. What is the PAHdb?

Question 33. How many amino acid residues are there in the protein?

Question 34. What is the molecular weight of the protein?