The purpose of the following exercise is to help you become more familiar with web-based protein characterization tools. The sequences to be used were derived during the previous lab.
Understanding a protein starts with exploring its basic characteristics. Visually examine your main protein sequence derived in lab one, or run it through one of the EXPASY's ProtParam sites (Canada, China, Korea, Taiwan, USA), to see if there is anything unusual about its amino acid composition.
High percentages of any one amino acid can greatly influence the protein's behavior. METALLOTHIONEINs have about 30% cysteine and form metal cages using the -SH group from the cysteines. Some ANTIFREEZE proteins are about 50% alanine. One current theory is that these proteins work through possible hydrophobic interactions between the proteins and water.
1. Did visual inspection of your protein detect any amino acids in obvious abundance?
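The visual inspection asked for above can be cross-checked with a short script. The following is a minimal sketch, not a replacement for ProtParam: the function names, the 15% cut-off, and the example sequence are all assumptions made for illustration.

```python
from collections import Counter

def composition(seq):
    """Return {residue: percent} for a raw one-letter protein sequence."""
    seq = "".join(seq.split()).upper()   # drop whitespace and line breaks
    counts = Counter(seq)
    total = len(seq)
    return {aa: 100.0 * n / total for aa, n in counts.items()}

def abundant(seq, threshold=15.0):
    """List (residue, percent) pairs at or above an arbitrary threshold."""
    comp = composition(seq)
    hits = [(aa, pct) for aa, pct in comp.items() if pct >= threshold]
    return sorted(hits, key=lambda pair: -pair[1])

# Hypothetical cysteine-rich toy sequence, not a real protein.
example = "MACCKCAACC"
print(abundant(example))
```

The threshold is a matter of judgment; lowering it lists more residues, and a typical protein has few residues far above the others.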
Protein families have been studied to identify functional patterns called motifs. Some of these motifs are based on text patterns and others on profile matrices.
Go to one of the EXPASY's Prosite sites (Canada, China, Korea, Taiwan, USA) to find text pattern-based functional motifs in your full-length protein sequence.
[EXPASY's Prosite run - input format: raw sequence] Paste your sequence into the Scan a protein for Prosite matches window. Choose the option to Exclude patterns with a high probability of occurrence. Run the process by clicking the START THE SCAN button.
The results page gives the sequence used in the search, followed by any hits found. Following the PDOC link gives detailed information about the located pattern, including references. Explore the documentation on any hits. Repeat this process with at least two other protein files that you saved.
Record the following:
2. Prosite data for the protein of interest (pattern name and location)
3. Prosite data for the second protein (pattern name and location)
4. Prosite data for the third protein (pattern name and location)
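To get a feel for what a text pattern-based motif search does, a Prosite pattern can be translated into a regular expression and run against a sequence. The sketch below handles only the common pattern elements (x, [..], {..}, repeat counts, and the < and > anchors); the function names and the test sequence are assumptions, and the server itself may treat edge cases differently.

```python
import re

def prosite_to_regex(pattern):
    """Translate a Prosite text pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")        # patterns end with a period
    pieces = []
    for element in pattern.split("-"):           # elements are dash-separated
        prefix = "^" if element.startswith("<") else ""   # N-terminal anchor
        suffix = "$" if element.endswith(">") else ""     # C-terminal anchor
        element = element.lstrip("<").rstrip(">")
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":                          # x means any residue
            core = "."
        elif core.startswith("{"):               # {P} means anything but P
            core = "[^" + core[1:-1] + "]"
        if low and high:
            repeat = "{%s,%s}" % (low, high)     # x(2,4) -> .{2,4}
        elif low:
            repeat = "{%s}" % low                # x(3) -> .{3}
        else:
            repeat = ""
        pieces.append(prefix + core + repeat + suffix)
    return "".join(pieces)

def scan(sequence, pattern):
    """Return (start, end, matched text) with 1-based positions.
    The lookahead lets overlapping matches be reported as well."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.start() + len(m.group(1)), m.group(1))
            for m in regex.finditer(sequence)]

# N-glycosylation site pattern, run against a made-up sequence.
print(scan("AANGSAANPSA", "N-{P}-[ST]-{P}."))
```

Short patterns like this one match by chance in many proteins, which is why the scan offers the option to exclude patterns with a high probability of occurrence.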
To explore profile-based functional information, go to the ProfileScan server at ISREC. Paste in your sequence and select all the database options that you can.
[ISREC's PSCAN run - input format: raw or fasta sequence] Click scan to start the process.
Results from this run are very long. First, the sequence used is given along with a listing of the databases searched. Second, a summary of the resulting hits is given. An exclamation point is used to mark significant hits. There will be many more hits from the PROSITE database on this list than on the previous one, since this time the more frequent patterns weren't excluded. Record the name of any hit marked with an exclamation point.
The best way to explore any of the found hits of interest is to click on the link given for the desired hit in the portions of the results that follow the match location section. The PDOC or the QDOC links go off to prosite documentation (most of the time). The Pfam-site documentation is quite colorful and informative. Click on any significant Pfam hit's documentation link. Record the PDB name given in the box containing the structure.
Repeat this process with the same two additional proteins used in the prosite runs. Record the following:
5. Number of Pfam hits for main protein
6. The name, location and associated PDB structure for each Pfam hit of the main protein
7. The name and location of any other significant hit from another database for the main protein
8. Number of Pfam hits for second protein
9. The name, location and associated PDB structure for each Pfam hit of the second protein
10. The name and location of any other significant hit from another database for the second protein
11. Number of Pfam hits for third protein
12. The name, location and associated PDB structure for each Pfam hit of the third protein
13. The name and location of any other significant hit from another database for the third protein
Mark the locations of the found characterized patterns or profiles on the printouts for your three proteins. Name or color code them to help with identification later.
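Profile-based searches differ from text patterns in that every residue at every position receives a score, and a window of sequence is judged by its total. The toy sketch below illustrates only that idea: the three-column matrix, its scores, and the default penalty are invented, and real profiles also model insertions and deletions.

```python
# Hypothetical 3-column profile: one dict of residue scores per position.
PROFILE = [
    {"C": 5, "S": 1},
    {"G": 3, "A": 2},
    {"H": 4, "K": 1},
]
DEFAULT = -2   # assumed score for any residue not listed in a column

def score_window(window):
    """Sum the per-position scores for one window of sequence."""
    return sum(col.get(aa, DEFAULT) for col, aa in zip(PROFILE, window))

def best_hit(seq):
    """Return (score, 1-based start) of the best window, or None."""
    width = len(PROFILE)
    hits = [(score_window(seq[i:i + width]), i + 1)
            for i in range(len(seq) - width + 1)]
    return max(hits) if hits else None

print(best_hit("MKCGHSAK"))
```

Unlike a text pattern, a near match such as SAK still earns a positive score here, which is why profile searches can find more distant family members.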
If your chosen sequence lacks significant Pfam hits (using ProfileScan) or associated PDB structures, choose one of the following protein sequences only for motif, profile, and fingerprint analysis. All sequences are human in origin.
sequence 1: genpept format, fasta format
sequence 2: genpept format, fasta format
sequence 3: genpept format, fasta format
sequence 4: genpept format, fasta format
sequence 5: genpept format, fasta format
Another approach to finding functional information is to use fingerprints. A fingerprint is a group of conserved motifs used to characterize a protein family. Usually the motifs do not overlap, but are separated along the sequence, though they may be contiguous in 3D-space. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs.
The PRINTS database was established on this premise. To search this database, go to the FPSCAN site. Paste in your raw protein sequence and click on Send Query.
[FPScan run - input format: raw sequence] The results are organized by high-scoring fingerprints, the top ten scoring fingerprints and then detailed information on the top scoring fingerprints. Finally the sequence used in the search is given.
Explore any statistically significant matches, plus those with a P value of less than 10^-4, to get a feel for the nature of this resource. The start location of the found block is given in the Pos column of the "ten top scoring fingerprints for your query. Detailed by motif" section of the results. To find where the block ends, add the length of the block to the start value and subtract one.
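The end-position rule can be checked with a few lines of code. This is a small sketch; the start value, block length, and sequence are made-up numbers rather than output from a real FPScan run.

```python
def block_end(start, length):
    """End position of a block, with 1-based inclusive coordinates."""
    return start + length - 1

def block_residues(sequence, start, length):
    """Cut the block out of the sequence to confirm the coordinates."""
    return sequence[start - 1:block_end(start, length)]

# Hypothetical values: a 17-residue motif starting at position 45.
print(block_end(45, 17))                     # 61
print(block_residues("MKTAYIAKQR", 3, 4))    # TAYI
```

The "subtract one" is needed because the start position itself is the first residue of the block.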
Again, repeat the process with your additional two proteins.
Record the following:
14. Any found fingerprint pattern name(s) for the main protein and their location
15. Any found fingerprint pattern name(s) for the second protein and their location
16. Any found fingerprint pattern name(s) for the third protein and their location
On the local machine use Microsoft Word to create a file containing all the FASTA-formatted protein files you saved in lab 1. Trim the information after the > sign for each sequence to a single word reflecting the species. Save the file as plain text so that the alignment program can read it.
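The same combined file can be produced with a script instead of by hand. The sketch below is one possible approach, not part of the lab procedure: the function names, the species labels, and the example records are assumptions, and each input is expected to hold a single FASTA record.

```python
def relabel_fasta(fasta_text, label):
    """Replace the header of one FASTA record with a single-word label."""
    lines = [ln.strip() for ln in fasta_text.splitlines() if ln.strip()]
    seq = "".join(ln for ln in lines if not ln.startswith(">"))
    wrapped = [seq[i:i + 60] for i in range(0, len(seq), 60)]  # 60 per line
    return ">" + label + "\n" + "\n".join(wrapped) + "\n"

def combine(records):
    """Join (label, fasta_text) pairs into one multi-sequence FASTA string."""
    return "".join(relabel_fasta(text, label) for label, text in records)

# Hypothetical saved records; the real ones come from lab 1.
records = [
    ("human", ">gi|1|example protein [Homo sapiens]\nMKT\nAAA\n"),
    ("mouse", ">gi|2|example protein [Mus musculus]\nMKS\n"),
]
print(combine(records), end="")
```

In practice each fasta_text would be read from a saved file, and the combined string written to a new plain-text file for upload.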
Once your combined fasta file has been created, use it at the EBI site to do a Clustalw alignment on the sequences.
[EBI's Clustalw run - input format: single file containing multiple fasta sequences] Change the OUTPUT FORMAT from the default value to gcg MSF prior to running the alignment process.
From the results page click on the aln file and save it to your local machine. Again on the results page, save the dnd file to your local machine.
Check your alignment. This requires the GeneDoc program on the local machine. From the File menu select Open, and navigate to the desired input file location. Any file that is in MSF format will show up in the resulting folder window. Double click on the name of your alignment file.
The alignment is currently displayed with consensus information given below the actual sequences used. To remove the consensus line click on the upper case C in the first line of buttons. The resulting window contains the display settings for the program. At the top of the third column is the section that controls the consensus line. Click the No Consensus button followed by OK to close the window and make the change in your alignment.
Print off your alignment file by selecting the Print option of the File menu and clicking on the OK button of the Print window. Once satisfied with your alignment output, use the Exit option of the File menu to get out of the GeneDoc program. You can either save your alignment changes or not, depending on your own needs.
Examine your tree file with the local machine's program treeview32. From the File menu select Open and navigate to the desired input data location. Once there, click on the name of the saved tree file.
The program has a number of different viewing options for the tree. These are shown by the second block of four control buttons [radial tree, cladogram, rectangular cladogram and phylogram]. Explore all the viewing options, printing off the version of the tree that makes the most phylogenetic sense to you.
Print your desired tree by selecting the Print option of the File menu and clicking on the OK button of the Print window. Use the Exit option of the File menu to get out of the treeview32 program.
Now go back and look at the information collected on your proteins. Compare the pattern and profile locations you marked on your printouts with the generated alignment.
17. Do the found motifs align?
18. Just how conserved is the protein family being worked with?
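Both questions come down to translating positions in each ungapped sequence into alignment columns and then looking at those columns. The sketch below shows one way to do that; the three aligned sequences are invented, and the gap characters handled ("-", "." and "~") are an assumption about what the alignment file contains.

```python
from collections import Counter

GAPS = "-.~"   # assumed gap characters

def residue_to_column(aligned_seq, position):
    """Map a 1-based position in the ungapped sequence to a 1-based column."""
    seen = 0
    for col, ch in enumerate(aligned_seq, start=1):
        if ch not in GAPS:
            seen += 1
            if seen == position:
                return col
    raise ValueError("position lies beyond the end of the sequence")

def column_identity(alignment, col):
    """Fraction of sequences sharing the most common character in a column.
    Gaps are counted like any other character."""
    chars = [seq[col - 1] for seq in alignment.values()]
    return Counter(chars).most_common(1)[0][1] / len(chars)

# Hypothetical alignment; real rows come from the saved alignment file.
alignment = {
    "human": "MK-CGHSA",
    "mouse": "MKACGHSA",
    "fish":  "M--CGHTA",
}
# A motif starting at residue 3 (human), 4 (mouse) and 2 (fish):
starts = {"human": 3, "mouse": 4, "fish": 2}
print({name: residue_to_column(alignment[name], pos)
       for name, pos in starts.items()})
print(column_identity(alignment, 4), column_identity(alignment, 7))
```

If the motif start maps to the same column in every sequence, the motifs align; the share of columns with high identity gives a rough sense of how conserved the family is.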