Practice Task

The purpose of this section is to help you become more familiar with using the various flavors of BLAST to characterize an "unknown" DNA sequence. You will determine whether the sequence below is genomic or mRNA.

 
>TRIAL
CCCACAGGGGGACCGGCCCTGTGACCCCTCACCGGGGCCGTGGGCCCGAGCCCCGGACTT
CCCTAAGCCGGCAATGACCGCCTGCGCCCGCCGAGCGGGTGGGCTTCCGGACCCCGGGCT
CTGCGGTCCCGCGTGGTGGGCTCCGTCCCTGCCCCGCCTCCCCCGGGCCCTGCCCCGGCT
CCCGCTCCTGCTGCTCCTGCTTCTGCTGCAGCCCCCCGCCCTCTCCGCCGTGTTCACGGT
GGGGGTCCTGGGCCCCTGGGCTTGCGACCCCATCTTCTCTCGGGCTCGCCCGGACCTGGC
CGCCCGCCTGGCCGCCGCCCGCCTGAACCGCGACCCCGGCCTGGCAGGCGGTCCCCGCTT
CGAGGTAGCGCTGCTGCCCGAGCCTTGCCGGACGCCGGGCTCGCTGGGGGCCGTGTCCTC
CGCGCTGGCCCGCGTGTCGGGCCTCGTGGGTCCGGTGAACCCTGCGGCCTGCCGGCCAGC

Go to NCBI's nucleotide BLASTN search site.

Perform a BLASTN search of the sequence against the NCBI non-redundant (nr) database. Paste the above sequence into the Search window on the BLAST page. Be sure to uncheck the Low complexity box in the second part of the form. After clicking the BLAST! button, a page is displayed for formatting the output.

[BLASTN nr database run - input format: fasta file]

Record the length of your sequence (given on the format page as the number of letters). Once the search has had time to run, click Format! to view your results.
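
The length reported on the format page can be cross-checked locally. The following is a minimal sketch, assuming the sequence is available as FastA text; the function name read_fasta and the shortened example record are illustrative only, so substitute the full TRIAL sequence.

```python
def read_fasta(text):
    """Parse FastA-formatted text into a dict of {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; everything after '>' is the header.
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical, shortened record used only to show the call.
example = """>TRIAL
CCCACAGGGG
GACCGG
"""

records = read_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```

The number printed for the full sequence should match the number of letters shown on the BLAST format page.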


Examine the BLASTN results and answer the following:

1. To what organism does the sequence belong?
2. What type of sequence is it (genomic or mRNA)?

Based on your observations, can these questions be answered from the results of a single BLAST search?

Record the accession number of the best hit.


From the BLASTN output, choose a full-length mRNA which significantly aligns with your fragment. Go to the GenBank annotation page for that mRNA by clicking on the gi#|version#|accession# link of the hit. Use the pull down menu on the GenBank page to display a FastA formatted version of the full-length mRNA. Save this FastA file to your local machine by clicking on "Save" and responding to the prompts with "Save File", storing the file with a name of your choice.

Go back to the GenBank annotation page for the full-length mRNA. Scroll down the page to locate the following pieces of information.

3. What is the mRNA's accession number?
4. What is the mRNA's GI number?
5. On what chromosome is the mRNA sequence located?
6. What part of the sequence actually codes for a protein?
7. What is the encoded protein's function?


Use your mRNA sequence in a BLASTX search against the nr protein database. Use only the coding region of the mRNA in the search. Be sure that filtering is turned off.
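
If you prefer to cut out the coding region yourself, the CDS feature on the GenBank page gives its location as start..end, counted from 1 with both ends included. The sketch below assumes a simple, unspliced CDS on the forward strand and the standard stop codons; the toy sequence, the coordinates, and the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code assumed


def extract_cds(mrna, start, end):
    """Return the coding region given 1-based, inclusive CDS coordinates."""
    if not 1 <= start <= end <= len(mrna):
        raise ValueError("CDS coordinates fall outside the sequence")
    # Convert the 1-based inclusive range to a Python slice.
    return mrna[start - 1:end].upper()


def looks_like_complete_cds(cds):
    """Quick sanity check: whole codons, ATG start, stop codon at the end."""
    return len(cds) % 3 == 0 and cds.startswith("ATG") and cds[-3:] in STOP_CODONS


# Hypothetical toy mRNA with a CDS annotated as 3..11.
mrna = "GGATGAAATAGCC"
cds = extract_cds(mrna, 3, 11)
print(cds, looks_like_complete_cds(cds))
```

If the sanity check fails for your real mRNA, recheck the coordinates you copied from the CDS line before running BLASTX.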

[BLASTX run - input format: accession number]

Explore the impact of the scoring matrix on your search by repeating the BLASTX search with a different matrix. If your first search found many high-quality hits, use BLOSUM80 for the second run. If it found few high-quality hits, use BLOSUM45.

Print the results of your last run. From this set of results, select at least 5 protein sequences from different species and save them as fasta files. Give the saved files names that reflect the species of origin. Be sure to save the protein which best represents your initial selected sequence. The goal is a small but varied set of sequences: saving too few will make for a poor multiple alignment later on.

Print each of your saved sequences.



The purpose of this section is to help you become more familiar with web-based protein characterization tools. The sequences to be used were derived during the previous section.

Understanding a protein starts with exploring its basic characteristics. Visually examine the main protein sequence derived in the previous section, or run it through one of EXPASY's ProtParam sites (Canada, China, Korea, Taiwan, USA), to see if there is anything unusual about its amino acid composition.

High percentages of any one amino acid can greatly influence the protein's behavior. METALLOTHIONEINs have about 30% cysteine and form metal cages using the -SH groups of the cysteines. Some ANTIFREEZE proteins are about 50% alanine. One current theory is that these proteins work through hydrophobic interactions between the protein and water.

8. Did visual inspection of your protein(s) detect any amino acids in obvious abundance?
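
Visual inspection can be backed up with a quick count. The following is a minimal sketch of a composition check, assuming a one-letter protein sequence; the function name and the example sequence are made up for illustration and are not taken from any database.

```python
from collections import Counter


def composition(sequence):
    """Return {residue: percent of sequence}, most abundant residue first."""
    sequence = sequence.upper()
    counts = Counter(sequence)
    total = len(sequence)
    return {res: 100.0 * n / total for res, n in counts.most_common()}


# Hypothetical cysteine-rich fragment, used only to show the output format.
example = "MCCKCACCGC"
for residue, percent in composition(example).items():
    print(f"{residue}: {percent:.1f}%")
```

Any residue that stands far above the rest of the list is a candidate answer for the question above.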


Protein families have been studied to identify functional patterns called motifs. Some of these motifs are based on text patterns and others on profile matrices.

Go to one of EXPASY's Prosite sites (Canada, China, Korea, Taiwan, USA) to find text pattern-based functional motifs in your full-length protein sequence.

[EXPASY's Prosite run - input format: raw sequence]

Paste your sequence into the Scan a protein for Prosite matches window. Choose the option to Exclude patterns with a high probability of occurrence. Run the process by clicking the START THE SCAN button.

The results page gives the sequence used in the search and then any found hits. Following the PDOC link gives detailed information about the located pattern including references. Explore the documentation on any hits. Repeat this process with at least two other protein files that you saved.
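
To see what a text pattern-based motif search does, it can help to run one by hand. The sketch below converts a Prosite-style pattern into a regular expression and scans a sequence with it. It handles only the common subset of the pattern syntax (x, [..], {..}, repeat counts, and the < and > anchors at the ends of a pattern), and the function names and test sequence are hypothetical. The pattern used is the N-glycosylation site pattern N-{P}-[ST]-{P}, a short pattern of the kind that matches many proteins by chance.

```python
import re


def prosite_to_regex(pattern):
    """Convert a Prosite-style pattern (common subset) to a regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {P} means "anything except P"; do this before reusing braces below.
        element = element.replace("{", "[^").replace("}", "]")
        # x(2,4) means "2 to 4 of any residue".
        element = element.replace("(", "{").replace(")", "}")
        element = element.replace("x", ".")
        # < and > anchor the pattern to the N- or C-terminus.
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


def scan(sequence, pattern):
    """Return (1-based start, matched text) for every hit, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.upper())]


# Hypothetical test sequence.
sequence = "MKNASGNPTAANGSA"
print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(scan(sequence, "N-{P}-[ST]-{P}"))
```

The positions returned are in the same 1-based coordinates you will record for the Prosite hits, which makes it easy to compare a manual scan with the site's output.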

Record the following:
9. Prosite data for the protein of interest (pattern name and location)
10. Prosite data for the second protein (pattern name and location)
11. Prosite data for the third protein (pattern name and location)

Mark the hard copies of the processed sequences with this information.


Next, check whether the protein has long stretches of hydrophobic residues. Such stretches could indicate the hydrophobic core of the folded protein or possible transmembrane segments.

Go to the Weizmann Institute's Hydropathic Profile site and determine your protein's profile.

[Weizmann's Hydropathic Profile run - input format: raw sequence]

In the resulting plot, check to see if there are any large sections of the plot below the zero line that are at least 20 residues long. Record the following:

12. Number of possible hydrophobic stretches
13. Location of possible hydrophobic stretches
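
A hydropathic profile is a sliding-window average of per-residue hydropathy values. The sketch below uses the Kyte-Doolittle scale, in which hydrophobic residues score positive, so its sign convention may be the opposite of the plot produced by the site. The window of 19 and the cutoff of 1.6 are commonly cited settings for spotting membrane-spanning stretches and are assumptions here, as are the function names and the toy sequence.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Mean hydropathy of each window; entry i covers residues i..i+window-1."""
    values = [KYTE_DOOLITTLE[res] for res in sequence.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def runs_above(profile, threshold=1.6):
    """Return (first, last) 0-based window indices of runs above threshold."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > threshold:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(profile) - 1))
    return runs


# Hypothetical toy sequence: charged flanks around a leucine stretch.
toy = "D" * 20 + "L" * 25 + "K" * 20
profile = hydropathy_profile(toy)
print(runs_above(profile))
```

Each run corresponds to one candidate hydrophobic stretch. Residues outside the 20 standard amino acids (for example X) will raise a KeyError in this sketch and would need to be handled before use on real sequences.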

Repeat this process with the two other protein files used in the Prosite section.


Predicting transmembrane segments can be considered a specialized form of secondary structure prediction. Such predictions can provide organizational information useful for understanding how transmembrane proteins function, along with some two-dimensional structural information about how the protein sits relative to the membrane.

Take your protein sequence and use it in the following transmembrane prediction sites to determine if it is a transmembrane protein.

HMMTOP
SOSUI
TMHMM

[HMMTOP run - input format: raw sequence]
[SOSUI run - input format: raw sequence]
[TMHMM run - input format: fasta file]

14. Does your protein contain any transmembrane segments?
15. How well do the prediction methods agree with one another?
16. Is there a consensus on areas of your protein that might be transmembrane segments?
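
One way to judge agreement is to count, residue by residue, how many methods place that residue in a transmembrane segment. The following is a minimal sketch under the assumption that each program's output has been transcribed by hand into 1-based, inclusive (start, end) pairs; the coordinates shown are invented and do not come from any real run.

```python
def consensus_segments(predictions, length, min_votes=2):
    """Return regions where at least min_votes predictors call a TM residue.

    predictions maps a predictor name to a list of 1-based, inclusive
    (start, end) segments; segments from one predictor must not overlap.
    """
    votes = [0] * (length + 1)  # index 0 is unused
    for segments in predictions.values():
        for start, end in segments:
            for pos in range(start, end + 1):
                votes[pos] += 1

    regions, run_start = [], None
    for pos in range(1, length + 1):
        if votes[pos] >= min_votes:
            if run_start is None:
                run_start = pos
        elif run_start is not None:
            regions.append((run_start, pos - 1))
            run_start = None
    if run_start is not None:
        regions.append((run_start, length))
    return regions


# Hypothetical predictions for an 80-residue protein.
predictions = {
    "HMMTOP": [(10, 30)],
    "SOSUI": [(12, 34)],
    "TMHMM": [(50, 70)],
}
print(consensus_segments(predictions, length=80))
```

Raising min_votes to 3 shows only the regions on which all three methods agree, which is a stricter reading of "consensus".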

Repeat this process with the two other protein files used in the Prosite section.



On your local machine, use a text editor to create a single file containing all of the fasta formatted protein sequences you saved previously. Trim the information after the > sign for each sequence to a single word reflecting the species.
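
The same file can also be assembled with a short script instead of an editor. The sketch below assumes one sequence per saved file and that each file was named after its species, so the file name (without extension) becomes the one-word header. The function name, file names, and sequences are hypothetical, and the demonstration writes only to a temporary directory.

```python
import tempfile
from pathlib import Path


def combine_fasta(paths, out_path, width=60):
    """Write one multi-sequence FastA file with single-word headers."""
    with open(out_path, "w") as out:
        for path in map(Path, paths):
            # Use the file name as the label; spaces would break the one-word rule.
            label = path.stem.replace(" ", "_")
            lines = path.read_text().splitlines()
            sequence = "".join(
                line.strip() for line in lines
                if line.strip() and not line.startswith(">")
            )
            out.write(f">{label}\n")
            for i in range(0, len(sequence), width):
                out.write(sequence[i:i + width] + "\n")


# Demonstration with made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "human.fasta").write_text(">long header for a human protein\nMKTLLV\nAG\n")
    (tmp / "mouse.fasta").write_text(">long header for a mouse protein\nMKSLLVAG\n")
    combined = tmp / "combined.fasta"
    combine_fasta([tmp / "human.fasta", tmp / "mouse.fasta"], combined)
    print(combined.read_text())
```

Open the combined file in an editor afterwards to confirm that every header is a single word before submitting it for alignment.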


Once your combined fasta file has been created, use it at the EBI site to do a ClustalW alignment on the sequences.

[EBI's ClustalW run - input format: single file containing multiple fasta sequences]

Change the OUTPUT FORMAT from the default value to GCG MSF prior to running the alignment process.

Print the resulting alignment page.

Examine your alignment.


Now go back and look at the information collected on your proteins. Compare the pattern data you recorded on the hard copies with the generated alignment.

17. Do the found motifs align?
18. How conserved is the protein family you are working with?
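
Both questions can be approached numerically once the aligned sequences have been copied out of the alignment page. The sketch below assumes the alignment is held as equal-length strings with - marking gaps; the function names and the tiny alignment are hypothetical. The first helper converts a recorded motif position in one sequence into an alignment column, and the second scores each column by the fraction of sequences sharing its most common residue.

```python
from collections import Counter


def residue_to_column(aligned_seq, residue_pos):
    """Map a 1-based residue position to its 1-based alignment column."""
    count = 0
    for column, char in enumerate(aligned_seq, start=1):
        if char != "-":
            count += 1
            if count == residue_pos:
                return column
    raise ValueError("residue position is beyond the end of the sequence")


def column_conservation(alignment):
    """Per column: fraction of sequences carrying the most common residue."""
    seqs = list(alignment.values())
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for column in zip(*seqs):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Hypothetical three-sequence alignment.
alignment = {
    "human": "MKT-LV",
    "mouse": "MKTALV",
    "chick": "MRT-LI",
}
scores = column_conservation(alignment)
identical = sum(1 for s in scores if s == 1.0)
print(residue_to_column(alignment["human"], 4))
print(f"{identical} of {len(scores)} columns fully conserved")
```

If the motif start positions recorded for different proteins map to the same alignment column, the motifs align. The share of fully conserved columns gives a rough, single-number impression of how conserved the family is.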


Copyright 2002 Regents of the University of California. All rights reserved.