Question 1. Examine the output from your BLAST search. Did you identify a unique sequence in the database? Is there only one sequence that exactly matches all the bases in your query? In your opinion, is the match biologically relevant? Hint: Look carefully at the E values. Do the matches appear to be random?
Answer 1a. Using Blastn against the nr database, there are many sequences that match at the e-9 level. Notice that they are scattered across many human chromosomes.
Answer 1b. The matches are scattered across many chromosomes, so while they may not be exactly random, their biological relevance is at best rather non-specific. This suggests that an e-value in the e-9 range, taken by itself, may not indicate biological relevance.
Question 2. Do all the matches correspond to DNA from one organism? Do most of them? Of the sequences that match your query and are human, do they all correspond to the same chromosome?
Answer 2. While there are a few other primate hits, the vast majority are human, but from various chromosomes.
Question 3. Formulate a hypothesis to account for these results. Now open and carefully examine the following file. (Note that this file is one of the several "matches" to your query with an E value of 5e-10.)
gb|U12581.1|HSU12581
Question 4. Is the information in this file consistent with your hypothesis? Why or why not? If not, would you now like to formulate a new hypothesis?
Answers 3 and 4. The hit gb|U12581.1|HSU12581 is a member of the human Alu repeat family. Blasting the sequence against the NCBI alu db returns a large number of hits. Inspection of their alignments provides convincing evidence for the sample sequence's membership in the Alu family.
Question 5. What are the specific bases in the sequence that matched your query (please specify by number)?
Compare these numbers to the "Features" information from the sequence file you retrieved above. Does your query align with any particular feature in the sequence?
Answer 5. Bases 1 through 34 align exactly with bases 322-355 of the U12581 alu. The repeat region feature of U12581 runs from 69..368, encompassing our alignment.
(See under FEATURES in the Genbank entry for U12581 = gi:529348.)
Again, we are convinced the sample sequence is part of a human alu repeat.
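The containment check in Answer 5 can be written out as a small sketch. The function name contains is hypothetical; the coordinates are the ones quoted in the answer, treated as 1-based inclusive ranges.

```python
def contains(feature, hit):
    """Return True if the 1-based inclusive range `hit` lies inside `feature`."""
    f_start, f_end = feature
    h_start, h_end = hit
    return f_start <= h_start and h_end <= f_end

repeat_region = (69, 368)  # repeat region feature of U12581
subject_hit = (322, 355)   # bases of U12581 aligned to query bases 1-34

# Length of the aligned stretch on the subject, counting both end points.
hit_length = subject_hit[1] - subject_hit[0] + 1

print(contains(repeat_region, subject_hit), hit_length)
```

The aligned stretch on the subject has the same length as the 34 query bases, as expected for an ungapped exact match.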
Question 6. Look closely at the query sequence you used above to run the search. Why in your opinion do you find no alignments to the string of A's at the end of the query?
Answer 6. Because the Blast option to filter out low-complexity sequence was (by default) selected. Running the query with that filter unchecked produces very large numbers of very low e-value matches. (Don't try this unless you have a lot of time.) Given that poly-A tails are biologically common and not very informative, and that our query sequence matches part of a human alu repeat, hits without that filter may not be very useful, despite the relatively low e-values. Thus the significance of low e-value matches must be considered in context.
Question 7. How many sequences matched your query when searching the protein database? What do these sequences have in common?
Answer 7. Only one sequence matched: gi|728834|sp|P39191|ALU4_HUMAN Alu subfamily SB2 sequence
Question 8. Based on the results of these two searches, which output do you feel was easier to interpret? Which search do you feel was more sensitive? Why?
Answer 8. The blastx output, with a single match, is far easier to interpret. It is less sensitive in the sense that only a single result was returned, in part because the Swiss-Prot protein database is much smaller than the GenBank non-redundant (nr) database. It is also less sensitive in the sense that the query is translated in all 6 reading frames before comparison, which reduces the information carried by any given subsequence and thus the likelihood of less significant matches.
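The six reading frames mentioned in Answer 8 can be illustrated with a minimal sketch: three offsets on the given strand and three on its reverse complement. The function names and the toy sequence are hypothetical, and the sketch only cuts out the frames; it does not translate them or reproduce what blastx does internally.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Complement every base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]

def six_frames(dna):
    """Return the six reading frames, each trimmed to whole codons."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            frame = strand[offset:]
            # Drop the trailing 1 or 2 bases that do not fill a codon.
            frames.append(frame[: len(frame) - len(frame) % 3])
    return frames

for frame in six_frames("ATGGCCTAA"):  # toy sequence, not the workshop query
    print(frame)
```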
Question 9. Retrieve the Swiss-Prot file corresponding to the best match from the most recent search (blastx). Read the "Comments" in the header. Given the results of the two searches you have performed, why are the managers of the Swiss-Prot database concerned about these sequences?
Answer 9. The DEFINITION section reads "Alu subfamily SB2 sequence contamination warning entry." Most biologists, except possibly those studying the ALU family, would consider query sequences consisting of ALU repeats as contaminant, noise, not genetic signal.
Question 10. What do you believe the query sequence encodes?
Answer 10. Human beta-globin cluster gene, enhancers with repeat region
Question 11. How many of the matches do you believe are biologically relevant?
Answer 11. There are 4 relevant entries, the top 4. They each contain 140 nucleotides of homology. Additionally, they have Scores of 270 and E values 2e-72.
Question 12. How many bases of your query match each of these sequences?
Answer 12. 140 bp of the query matched the beta-globin sequence.
Question 13. How much did the E value rise with only one base difference?
Answer 13. From 2e-72 to 5e-70, or roughly two orders of magnitude, caused by only one base-pair mismatch (a T for a C) between the 3 clone hits and the beta-globin hit.
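The size of the change in Answer 13 is a one-line calculation. The sketch below takes the smaller and larger of the two E values quoted in the answer and reports the fold change and the number of orders of magnitude; the variable names are illustrative only.

```python
import math

e_low = 2e-72   # smaller E value quoted in Answer 13
e_high = 5e-70  # larger E value quoted in Answer 13

fold_change = e_high / e_low    # how many times larger the E value became
orders = math.log10(fold_change)  # the same change on a log10 scale

print(f"{fold_change:.0f}-fold, about {orders:.1f} orders of magnitude")
```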
Question 14. What is the result of the search? Why do you think you obtained this result? Does this result surprise you given that you know that the sequence is in the database?
You should have received the message:
No significant similarity found.
Answer 14. No significant similarity found. NCBI explanation:
Short Sequences: There is a special BLAST optimized for searching with small sequences. Go to the main BLAST web page and select the "Search for short nearly exact matches" link for Nucleotide - Nucleotide or Protein - Protein sections.
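One way to see why a very short query struggles to reach significance is the standard relation between bit score and expect value, E = m * n * 2^(-S'), where m is the query length, n the database length, and S' the bit score. The sketch below applies it. The database length and the bit score assigned to the 13-base query are hypothetical round numbers chosen for illustration, the score of 270 from Answer 11 is assumed to be in bits, and edge-effect corrections to m and n are ignored.

```python
def expect_value(query_len, db_len, bit_score):
    """Expected number of chance hits scoring at least `bit_score` bits."""
    return query_len * db_len * 2.0 ** (-bit_score)

DB_LEN = 1_000_000_000  # hypothetical database size in bases

# Hypothetical bit score for a perfect 13-base match.
short_query = expect_value(13, DB_LEN, 26.0)
# 140-base query with the score reported in Answer 11 (assumed to be bits).
long_query = expect_value(140, DB_LEN, 270.0)

print(f"13-base query:  E = {short_query:.3g}")
print(f"140-base query: E = {long_query:.3g}")
```

Under these assumptions even a perfect 13-base match has an E value above the default threshold of 10, so it is not reported.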
Question 15. What parameters of the search do you feel you could adjust to try to get a match to a sequence in the database? Could you modify the expect value (E value) threshold?
Try an Advanced BLAST search using the same 13 base query sequence:
GAATTCTAATCTC
This time, adjust the E value threshold to 100 (from the default of 10). Also, limit the search to human sequences only.
Question 16. Do you get any matches from the search now? Are any to the same two matching sequences you found when using the complete (140 bp) query sequence? (note that these two sequences were:
gb|U01317.1|HUMHBB and
gb|L22754.1|HUMBGLOBC )
Answers 15 & 16. With the Expect threshold set to 100, and matches limited to Homo sapiens, a Blastn search against the nr database returns many, many matches, but not both the gb|U01317.1|HUMHBB and gb|L22754.1|HUMBGLOBC entries matched before.
Question 17. If you were to return to the BLAST server and run an advanced BLAST search with the same 13 base query sequence, and want to make the search more sensitive by altering the word length, do you want to use a smaller or larger word size? Recall that the default word size on nucleic acid searches is 11.
Answer 17. Smaller word size (7) will return more hits. Larger (15) is longer than the query sequence, and returns no High-scoring Pairs (HSP) matches at all.
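The effect of word size in Answer 17 can be sketched by counting the words a query offers as seeds. This is a simplified picture of seeding (a query of length L contains L - W + 1 overlapping words of size W), not a reimplementation of BLAST; the function name is hypothetical.

```python
def words(query, word_size):
    """All overlapping words of the given size; empty if the query is too short."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]

query = "GAATTCTAATCTC"  # the 13-base query from Question 15

counts = {w: len(words(query, w)) for w in (7, 11, 15)}
for w, n in counts.items():
    print(f"word size {w:2d}: {n} seed word(s)")
```

A word size of 15 yields no words at all from a 13-base query, which is consistent with the absence of any high-scoring pairs in that run.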
The point of these exercises was to demonstrate that when the database query returns a "No Hits" message, this need not be because there isn't a sequence to be found. It may be because the search criteria need to be optimized for the search.
The Moral of the Story:
If a given query does not produce a significant match when searching a database one can always try different parameters or algorithms.
[FASTA now finds too many matches and times out. Skip this exercise for now.]
Question 21. What does the query sequence match?
Can you explain why the blastx search gave such a funny result?
Answer 21. gi|31906|emb|V00520.1|HSGROW2
Human germ line gene for growth hormone (presomatotropin) matches exactly, while the other high-scoring pairs have minor mismatches. Clicking on the L takes you to NCBI Locuslink:
2688 Hs GH1 growth hormone 1 on chromosome 17q24.2
Question 22. Can you explain why the blastx search gave such a different result? Which do you believe and why? HINT: Retrieve the best matching sequence from the nucleotide database. Carefully examine which specific bases of the sequence were used as a query.
Answer 22. Following the hint, we find that our query sequence matches bases 341-601 in the growth hormone entry, and looking a little deeper, under Features find also that bases 345 to 600 are known to constitute an intron region. Thus our query sequence is genomic DNA, mostly intron, so translating that as if it coded for amino acids, is not going to find good matches in a protein database. We should trust an exact alignment from the blastn search because blastx treats introns as if they are coding, thereby introducing extraneous signal into the system. The relatively low scores and poor alignments on the blastx results also suggest the appropriateness of blastn over blastx in this case.
Question 23. What is the moral of the story in this exercise (make up your own.)
Answer 23. It pays to select your alignment tool after consideration of sequence being queried, so the sequence matches the capabilities of the tool.
Question 24. What is the SWISS-PROT name of the entry?
Answer 24. PH4H_HUMAN
Question 25. What is the SWISS-PROT primary accession number?
Answer 25. P00439
Question 26. What is the most common name of the protein?
Answer 26. Phenylalanine-4-hydroxylase
Question 27. What is the gene called?
Answer 27. PAH
Question 28. Which year was the crystal structure of the catalytic domain determined? Name the first named author of this work.
Answer 28. 1997, Erlandsen,H.
Question 29. Does the enzyme require a co-factor to function? If so, what?
Answer 29. Ferrous Ion
Question 30. Name the most common disease that arises as a result of deficiency of this enzyme.
Answer 30. phenylketonuria (PKU), leading to mental retardation, etc.
Question 31. Which cytogenetic locus does the gene reside at? (e.g. 13p10.1)
Answer 31. Chromosome 12q22-q24.1 or terminus12q (obtained from OMIM entry 261600)
Question 32. What is the PAHdb?
Answer 32. Phenylalanine Hydroxylase locus and mutations database
Question 33. How many amino acid residues are there in the protein?
Answer 33. 452 aa
Question 34. What is the molecular weight of the protein?
Answer 34. 51862 Da (reported at top of Sequence Information, near bottom of Swissprot entry for PH4H_HUMAN)
Question 1. Global versus local alignment activity.
Answer 1. ALIGN produces no alignment, LALIGN returns,
47.6% identity in 143 aa overlap; score: 418
Question 2. Function identification
Answer 2. Using blastp vs swissprot, Genestream Blast2 produces:
| Sequences producing significant alignments: | Score (bits) | E Value |
| sp|P00325|ADHB_HUMAN Alcohol dehydrogenase beta chain (EC 1.1.1.1). | 758 | 0.0 |
for the 1st sequence and this for the 2nd:
| Sequences producing significant alignments: | Score (bits) | E Value |
| sp|P37005|LAST_ECOLI Hypothetical tRNA/rRNA methyltransferase la... | 418 | e-117 |
| sp|P37006|LAST_SERMA Hypothetical tRNA/rRNA methyltransferase la... | 268 | 8e-72 |
Question 1. What is the default scoring matrix for this search? What is the penalty for opening a gap? What is the penalty for extending a gap (per skipped AA?)
Answer 1, step by step:
Question 2. Does the program align all Amino acids in both proteins? Which specific residues were aligned in each protein? (i.e., 22-104 in cytochrome c2, with 12-94 in isocytochrome c2).
Answer 2: BL2SEQ aligns positions 8 through 138 of cytochrome c2 with 7 through 122 from isocytochrome.
Question 3. How many of the residues were identical in each protein? How many were "positives?"
HINT: These are either identical residues or conservative substitutions.
Answer 3: Identities = 51/132 (38%), Positives = 63/132 (47%), Gaps = 17/132 (12%)
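The percentages in Answer 3 follow directly from the raw counts. The sketch below assumes the figures are rounded down to whole percents; that rounding rule is inferred from the reported numbers, not taken from the BLAST documentation.

```python
def percent(count, length):
    """Whole-number percentage, rounded down (an assumption, see above)."""
    return 100 * count // length

alignment_length = 132
for label, count in (("Identities", 51), ("Positives", 63), ("Gaps", 17)):
    print(f"{label} = {count}/{alignment_length} ({percent(count, alignment_length)}%)")
```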
Question 4. Examine the alignment carefully. In how many regions of the alignment were gaps introduced?
Answer 4: There are 4 gaps, of length 7, 1, and 8 on isocytochrome and 1 on cytochrome.
Question 5. What color are the residues that are "similar" when your alignment is displayed with TEXSHADE? What color are residues that are identical?
Answer 5: step-by-step
Question 6. Returning to the comparison of the two alignments, one from BL2SEQ, the other from ALIGN. How much of the two sequences were aligned using ALIGN. That is, were all AA's used in alignment? What do you think the term "Global Alignment" means?
Answer 6: All the aas of one of the sequences were used. Global alignment is global in the sense that it seeks an alignment across the whole length of a sequence.
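The difference between global and local alignment in Answer 6 can be sketched with one dynamic-programming routine that runs in either mode (Needleman-Wunsch for global, Smith-Waterman for local). The scoring values (match +1, mismatch -1, linear gap -2) and the toy sequences are assumptions for illustration; they are not the parameters used by ALIGN or BL2SEQ.

```python
def alignment_score(a, b, local, match=1, mismatch=-1, gap=-2):
    """Score of the best global or local alignment of a and b."""
    # First row: a global alignment pays for leading gaps, a local one does not.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, cur[j - 1] + gap)
            if local:
                score = max(score, 0)  # a local alignment may restart anywhere
                best = max(best, score)
            cur.append(score)
        prev = cur
    # Global: score of aligning both sequences end to end.
    # Local: best score of any pair of subsequences.
    return best if local else prev[-1]

seq_a = "AAAACGT"  # toy sequences sharing only the stretch CGT
seq_b = "CGTTTTT"
global_score = alignment_score(seq_a, seq_b, local=False)
local_score = alignment_score(seq_a, seq_b, local=True)
print(global_score, local_score)
```

The local mode reports the shared stretch cleanly, while the global mode is dragged down by the unrelated flanks it is forced to align.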
Question 7. Are the gap penalties the same in ALIGN as they were in BL2SEQ (when both used a BLOSUM as a scoring matrix)? Which alignment program has a stiffer penalty?
Answer 7: Gap opening penalties are 11 for BL2SEQ and 12 for ALIGN; gap extension penalties are 1 and 2, respectively. So ALIGN penalizes gaps more severely.
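To make the comparison in Answer 7 concrete, the sketch below costs gaps of a few lengths under both parameter sets. It assumes the common affine convention, cost = open + extend * length; programs differ in whether the first gapped residue is charged the extension penalty, so the absolute numbers are illustrative.

```python
def gap_cost(length, open_penalty, extend_penalty):
    """Affine gap cost, assuming cost = open + extend * length."""
    return open_penalty + extend_penalty * length

PARAMS = {"BL2SEQ": (11, 1), "ALIGN": (12, 2)}  # (open, extend) from Answer 7

costs = {
    name: [gap_cost(n, open_p, ext_p) for n in (1, 7, 8)]
    for name, (open_p, ext_p) in PARAMS.items()
}
for name, values in costs.items():
    print(name, values)
```

Under this convention the gap between the two programs widens with gap length, since the extension penalty differs by a factor of two.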
Question 8. How are identical residues indicated in the output when using ALIGN? Which algorithm (ALIGN or BL2SEQ) gave the highest percentage identity between the two proteins (when both used a BLOSUM scoring matrix)? What is the percentage identity between these two proteins using each algorithm (ALIGN and BL2SEQ)?
Answer 8: ALIGN uses : to indicate identical aligned aas, BL2SEQ uses the amino acid code letter itself, as for a consensus sequence. BL2SEQ calculated a 38 percent identity, ALIGN yielded 34.2 percent identity.
Question 9. In your opinion, which algorithm gave the most relevant alignment? Why?
Answer 9: I feel BL2SEQ builds a better alignment, with smaller gaps, a few more aligned aas, and the start of a consensus sequence.
Question 10. What was the result? Why do you think you got this result given the previous results you got when comparing the protein sequences?
Answer 10: No significant similarity found. The DNA code is degenerate, in the sense that several different nucleotide codons can specify the same amino acid. So two proteins may have significant similarity, while their mRNAs have very little, and their genomic DNA sources even less.
Question 11. From what organism was the best aligning sequence?
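The degeneracy argument in Answer 10 can be demonstrated with two made-up coding sequences that specify the same peptide but share few bases. The sketch uses the standard genetic code; the two 9-base sequences are hypothetical examples built from synonymous codons for Leu, Ser and Arg.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate frame 1 of a DNA string, ignoring any incomplete final codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def identity(a, b):
    """Fraction of identical positions in two equal-length ungapped strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

gene_1 = "CTGTCTCGT"  # hypothetical: Leu Ser Arg
gene_2 = "TTAAGCAGA"  # hypothetical: Leu Ser Arg, synonymous codons

print(translate(gene_1), translate(gene_2))
print(f"protein identity: {identity(translate(gene_1), translate(gene_2)):.0%}")
print(f"DNA identity:     {identity(gene_1, gene_2):.0%}")
```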
Answer 11: On Feb 12, 2003, I got Agrobacterium tumefaciens (your mileage may vary.)
Question 12. Why do you believe the first 21 residues of the isocytochrome c2 protein sequence do not align with anything in the CY2_AGRTC sequence? HINT: There is a biologically relevant answer to this question. Carefully examine the DNA sequence file for the isocytochrome c2 protein. What do the first 21 AA's encode?
Answer 12: The first 21 aas of the protein are a signal sequence (a peptide that marks the protein for secretion or for insertion into a membrane).
Question 13. The optimal global alignment between the R. sphaeroides isocytochrome c2 and cytochrome c2 protein required two large gaps (7 and 8 residues) in the isocytochrome c2 sequence. Does the alignment between the isocytochrome c2 protein sequence and the CY2_AGRTC sequence require gaps at similar positions?
Answer 13: No such gaps in this alignment: only that at the beginning (as above), and another at position 127 to 136, near the end.
Question 14. Which sequence from the Swiss-Prot database scored highest in the Blast search using the R. sphaeroides cytochrome c2 protein sequence as a query? From what organism is this sequence derived?
Answer 14: On Feb. 12, 2003, I got C551_ROSDE, from Roseobacter denitrificans. It was previously listed as ErythroBacter sp, or C551_ERYSP. The DB may return something else by the time of your workshop. This should not affect the exercise.
Question 15. Does the alignment between the cytochrome c2 protein sequence and the C551_ERYSP sequence require gaps at similar positions as those introduced in the isocytochrome c2 protein (when aligned with isocytochrome c2)? Speculate on where within the 3-D structure of these small, globular proteins the AA's corresponding to those missing in the gapped regions are. Do you believe these residues would be buried within the protein's globular structure or on the surface near the aqueous environment? Why?
Answer 15: Again, no gaps in this alignment. The residues in question most likely reside interior to the folded protein. Otherwise, they might interfere with the function of the protein, becoming incapable of substituting for a knocked-out gene.
Question 16. Does your cytochrome c have these two loops of AA's (like the R. sphaeroides cytochrome c2 protein), or does it lack these loops (like the R. sphaeroides isocytochrome c2 protein)? Save (import) any alignments you use to answer this question.
Answer 16: Like the isocytochrome of the Rhodobacter (and unlike the cytochrome C2 precursor), human mitochondrial cytochrome has no loops, as indicated by the gaps in the alignment.
Question 17. Does the multiple sequence alignment confirm or refute your answer to question 16? Besides the lack of a signal sequence in the human cytochrome c, and the loops/gaps we have discussed, what feature stands out as being unique to the R. sphaeroides isocytochrome c2 protein?
Answer 17. Isocytochrome C2 may have a loop lacking in both human and bacterial cytochrome, from positions 123-128, indicated by corresponding gaps inserted in the other 2 sequences at that position.
Question 18. Given the results of your alignment, what type of mutation do you think has occurred to produce the mutant mRNA? (i.e., insertion, deletion, point mutation, etc). Retrieve the mutant version of the BRCA1 mRNA and check your answer.
Answer 18: A splice variant (deletion): 3 bases are missing from the mutant at positions 4478 to 4481 of the BRCA1 gene.
Question 19. What is the percentage identity between these two protein sequences? In your opinion, is this biologically significant or random?
Answer 19: 19% identity, probably not biologically significant.
Question 20. Do the regions of identity/similarity cluster in certain spots within the alignment or are they scattered throughout the alignment? Do you believe these proteins are "homologs" as your friend suggests?
Answer 20: The pattern of identical residues appears very scattered and random, with many gaps. The case for homology is not persuasive.
Question 21. Which two of the three putative homologs are most identical?
Answer 21: Using ClustalW to do a multiple alignment of all 3 sequences, we find more gaps in the human sequence, and more identical or conserved residues between the other two. ALIGNing the Arabidopsis and binding protein, we find 19.9 percent identity, versus 17.4 percent for BRCA1 and the Plasmodium protein. It appears the human BRCA1 is less closely related than the other two.