This file contains a very detailed version of the exercise given in lab 2. It assumes that the user knows how to cut and paste text from a computer screen and can use a text editor (word processing program). It is divided up into sections so that a user can move to a specific area from the main lab page. To completely understand a given section, read all the material contained in its links. Detailed instructions are given in green text on the html page and are italicized in the hard copy.
Table of Contents
Introduction
Motif Analysis
Profile Analysis
Fingerprint Analysis
File for Alignment
Clustalw Alignment
Alignment processing
Looking at a tree
Conclusions
Lab Two
The purpose of the following exercise is to help you become more familiar with web-based protein characterization tools. The sequences to be used were derived during the previous lab.
Understanding a protein starts with exploring its basic characteristics. Visually examine your main protein sequence derived in lab one, or run it through one of EXPASY's ProtParam sites ( Canada, China, Korea, Taiwan, USA ), to see if there is anything unusual about its amino acid composition.
High percentages of any one amino acid can greatly influence a protein's behavior. METALLOTHIONEINs are about 30% cysteine and form metal cages using the -SH groups of the cysteines. Some ANTIFREEZE proteins are about 50% alanine. One current theory attributes the way these proteins work to possible hydrophobic interactions between the proteins and water.
Look at the GenBank version of your main protein sequence. Check to see if you notice any particular amino acid which seems to appear more frequently than the others.
1. Did visual inspection of your protein detect any amino acids in obvious abundance?
In the example main protein given below, there appear to be many leucines and alanines. However, this impression comes from the easily spotted repeating runs of A's and L's, not from an actual count of the residues.
>TRIAL associated full length protein
MTACARRAGGLPDPGLCGPAWWAPSLPRLPRALPRLPLLLLLLLLQPPALSAVFTVGVLG
PWACDPIFSRARPDLAARLAAARLNRDPGLAGGPRFEVALLPEPCRTPGSLGAVSSALAR
VSGLVGPVNPAACRPAELLAEEAGIALVPWGCPWTQAEGTTAPAVTPAADALYALLRAFG
WARVALVTAPQDLWVEAGRSLSTALRARGLPVASVTSMEPLDLSGAREALRKVRDGPRVT
AVIMVMHSVLLGGEEQRYLLEAAEELGLTDGSLVFLPFDTIHYALSPGPEALAALANSSQ
LRRAHDAVLTLTRHCPSEGSVLDSLRRAQERRELPSDLNLQQVSPLFGTIYDAVFLLARG
VAEARAAAGGRWVSGAAVARHIRDAQVPGFCGDLGGDEEPPFVLLDTDAAGDRLFATYML
DPARGSFLSAGTRMHFPRGGSAPGPDPSCWFDPNNICGGGLEPGLVFLGFLLVVGMGLAG
AFLAHYVRHRLLHMQMVSGPNKIILTVDDITFLHPHGGTSRKVAQGSRSSLGARSMSDIR
SGPSQHLDSPNIGVYEGDRVWLKKFPGDQHIAIRPATKTAFSKLQELRHENVALYLGLFL
ARGAEGPAALWEGNLAVVSEHCTRGSLQDLLAQREIKLDWMFKSSLLLDLIKGIRYLHHR
GVAHGRLKSRNCIVDGRFVLKITDHGHGRLLEAQKVLPEPPRAEDQLWTAPELLRDPALE
RRGTLAGDVFSLAIIMQEVVCRSAPYAMLELTPEEVVQRVRSPPPLCRPLVSMDQAPVEC
ILLMKQCWAEQPELRPSMDHTFDLFKNINKGRKTNIIDSMLRMLEQYSSNLEDLIRERTE
ELELEKQKTDRLLTQMLPPSVAEALKTGTPVEPEYFEQVTLYFSDIVGFTTISAMSEPIE
VVDLLNDLYTLFDAIIGSHDVYKVETIGDAYMVASGLPQRNGQRHAAEIANMSLDILSAV
GTFRMRHMPEVPVRIRIGLHSGPCVAGVVGLTMPRYCLFGDTVNTASRMESTGLPYRIHV
NLSTVGILRALDSGYQVELRGRTELKGKGAEDTFWLVGRRGFNKPIPKPPDLQPGSSNHG
ISLQEIPPERRRKLEKARPGQFS
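A visual impression like this can be checked with an actual count. The sketch below is only an illustration and is not part of the original lab: the function name `composition` is hypothetical. It tallies each one-letter residue code and reports it as a percentage of the sequence length, ignoring any spaces or line breaks left over from pasting.

```python
from collections import Counter


def composition(sequence):
    """Return {residue: percent}, ordered from most to least frequent."""
    # Keep letters only, so spaces and line breaks in a pasted sequence are ignored.
    residues = [ch for ch in sequence.upper() if ch.isalpha()]
    if not residues:
        return {}
    counts = Counter(residues)
    total = len(residues)
    return {aa: 100.0 * n / total for aa, n in counts.most_common()}


# First line of the example protein (the ">" header line is left out).
fragment = "MTACARRAGGLPDPGLCGPAWWAPSLPRLPRALPRLPLLLLLLLLQPPALSAVFTVGVLG"
for aa, percent in list(composition(fragment).items())[:5]:
    print(f"{aa}: {percent:.1f}%")
```

Pasting the whole sequence in place of `fragment` gives the composition of the full protein, which can then be compared with the ProtParam output.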
Protein families have been studied to identify functional patterns called motifs. Some of these motifs are based on text patterns and others on profile matrices.
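To give a feel for what a text pattern is, the sketch below translates a PROSITE-style pattern into a regular expression and scans a sequence with it. It is an illustration only and not the code the Prosite site runs: the function names `prosite_to_regex` and `scan` are hypothetical, the test sequence is made up, and the translation assumes a well-formed pattern. The pattern used, `N-{P}-[ST]-{P}`, is the short N-glycosylation site pattern, chosen only because it is easy to read; it is one of the frequently occurring patterns.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style text pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional (n) or (n,m) repeat.
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":                  # x means any residue
            core = "."
        elif core.startswith("{"):       # {ABC} means any residue except A, B or C
            core = "[^" + core[1:-1] + "]"
        # [ABC] (one of the listed residues) and single letters pass through unchanged.
        if repeat:
            core += "{" + repeat + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


def scan(sequence, pattern):
    """Return (start, end, matched_text) tuples using 1-based positions."""
    # The lookahead lets overlapping matches be reported as well.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    hits = []
    for m in regex.finditer(sequence):
        start = m.start() + 1
        hits.append((start, start + len(m.group(1)) - 1, m.group(1)))
    return hits


print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(scan("AANGSAANPSA", "N-{P}-[ST]-{P}"))  # made-up test sequence
```

Positions are reported starting from 1 so that they can be compared directly with the locations listed on a Prosite results page.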
Go to one of EXPASY's Prosite sites ( Canada, China, Korea, Taiwan, USA ) to find text pattern-based functional motifs in your full-length protein sequence.
[EXPASY's Prosite run - input format: raw sequence] The EXPASY Prosite form is given below.
![]()
![]()
Paste your sequence into the Scan a protein for Prosite matches window. Choose the option to Exclude patterns with a high probability of occurrence.
Use the editor to open the necessary file and copy the text into the machine's buffer.
Then paste the text into the form's pale yellow window in the browser.
A properly filled out page with the example data is given below.
![]()
Run the process by clicking the START THE SCAN button.
The results page gives the sequence used in the search and then any found hits. Following the PDOC link gives detailed information about the located motif including references. Explore the documentation on any hits.
The results page for the example data is given below. Three motifs were found. Links to the PDOC and PS documentation pages are given.
![]()
Links to the documentation for the example protein's matches.
Repeat this process with at least two other protein files that you saved.
Record the following:
In the case of the example data, this process was repeated with the stored sequences from fly and medaka.
2. Prosite data for the protein of interest (pattern name and location)
The main sequence has hits for the PROTEIN_KINASE_DOM motif (525-808), the GUANYLATE_CYCLASES_1 motif (987-1010) and the GUANYLATE_CYCLASES_2 motif (880-1010).
3. Prosite data for the second protein (pattern name and location)
The fly sequence has the same motifs as the main sequence, plus the MITOCH_CARRIER motif.
![]()
4. Prosite data for the third protein (pattern name and location)
The medaka sequence has the same hits as the main sequence.
![]()
To explore profile-based functional information, go to the ProfileScan server at ISREC.
The ProfileScan page is given below.
![]()
Paste in your sequence and select all the database options that you can.
[ISREC's PSCAN run - input format: raw or fasta sequence] The necessary data is in your editor window.
A properly filled out page is given below for the example data. Be sure to click on all the available databases on the form.
![]()
Click scan to start the process.
It takes a few minutes for this analysis process to run. The form puts up the databases searched so that you can keep track of the run's progress.
Results from this run are very long. First, the sequence used is given along with a listing of the databases searched. Second, a summary of the resulting hits is given. An exclamation point is used to mark significant hits. There will be many more hits from the PROSITE database on this list than on the previous one, since this time the more frequent patterns were not excluded. Record the name of any hit with an exclamation point.
The example data results are given below. Partial images are given for the various sections of the results. It is necessary to scroll down the entire list to see if there are any significant hits or not.
top of results
![]()
summary section
![]()
////////////////////////////
![]()
match location section
![]()
////////////////////////////
The best way to explore any of the found hits of interest is to click on the link given for the desired hit in the portions of the results that follow the match location section. The PDOC or the QDOC links go off to prosite documentation (most of the time).
pattern section
![]()
////////////////////////////
profile section
![]()
//////////////////////////
The Pfam-site documentation is quite colorful and informative. Click on any significant Pfam hit's documentation link. Record the PDB name given in the box containing the structure.
pfam section
![]()
Repeat this process with the same two additional proteins used in the prosite runs. Record the following:
5. Number of Pfam hits for main protein
There were 3 Pfam hits for the main protein.
6. The name, location and associated PDB structure for each Pfam hit of the main protein
The Pfam hits for the main protein were:
| Pfam name | Location | PDB code |
| --- | --- | --- |
| pkinase | 584-800 | 1qmz |
| guanylate_cyc | 871-1058 | 1azs |
| anf_receptor | 43-429 | 1ewk |

7. The name and location of any other significant hit from another database for the main protein
The other database hits for the main protein were:
| Database | Name | Location |
| --- | --- | --- |
| prosite profile | guanylate_cyclases_2 | 880-1010 |
| prosite profile | protein_kinase_dom | 525-808 |
| prosite pattern | guanylate_cyclases_1 | 987-1010 |

8. Number of Pfam hits for second protein
There were 3 Pfam hits for the fly protein.
9. The name, location and associated PDB structure for each Pfam hit of the second protein
The Pfam hits for the fly protein were:
| Pfam name | Location | PDB code |
| --- | --- | --- |
| pkinase | 580-812 | 1b6c |
| guanylate_cyc | 887-1074 | 1cjk |
| anf_receptor | 36-436 | 1ewk |

10. The name and location of any other significant hit from another database for the second protein
The other database hits for the fly protein were:
| Database | Name | Location |
| --- | --- | --- |
| prosite profile | guanylate_cyclases_2 | 896-1026 |
| prosite profile | protein_kinase_dom | 547-824 |
| prosite pattern | guanylate_cyclases_1 | 1003-1026 |
| prosite pattern | mitoch_carrier | 985-994 |

11. Number of Pfam hits for third protein
There were 3 Pfam hits for the medaka protein.
12. The name, location and associated PDB structure for each Pfam hit of the third protein
The Pfam hits for the medaka protein were:
| Pfam name | Location | PDB code |
| --- | --- | --- |
| pkinase | 593-806 | 1qdp |
| guanylate_cyc | 870-1057 | 1cju |
| anf_receptor | 43-433 | 1dp4 |

13. The name and location of any other significant hit from another database for the third protein
The other database hits for the medaka protein were:
| Database | Name | Location |
| --- | --- | --- |
| prosite profile | guanylate_cyclases_2 | 879-1009 |
| prosite profile | protein_kinase_dom | 537-807 |
| prosite pattern | guanylate_cyclases_1 | 986-1009 |

Mark the locations of the found characterized patterns or profiles on the printouts for your three proteins. Name or color code them to help with identification later.
Use the highlighter provided in your workshop materials to mark any found characterized regions of the three sequences you have been using. You may need to come up with a scheme to tell the various patterns apart since profiles and Pfam hits can be longer than Prosite patterns and might overlap them.
Note that the Pfam PDB codes may differ between runs even when the name of the found pattern is the same. This is because the Pfam site used by ISREC (the Sanger Centre) changes the structure presented on its page throughout the day, working through a list of PDB codes that contain the found pattern. By recording the PDB codes, you build a collection of solved structures that contain the Pfam pattern.
If your chosen sequence lacks significant Pfam hits (using ProfileScan) or associated PDB structures, choose one of the following protein sequences only for motif, profile, and fingerprint analysis. All sequences are human in origin.
sequence 1: genpept format, fasta format
sequence 2: genpept format, fasta format
sequence 3: genpept format, fasta format
sequence 4: genpept format, fasta format
sequence 5: genpept format, fasta format
Another approach to finding functional information is to use fingerprints. A fingerprint is a group of conserved motifs used to characterize a protein family. Usually the motifs do not overlap, but are separated along the sequence, though they may be contiguous in 3D-space. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs.
The PRINTS database was established on this premise. To search this database, go to the FPSCAN site.
The initial Prints form is given below.
![]()
Paste in your raw protein sequence and click on Send Query.
[FPScan run - input format: raw sequence] Your editor session is the source of the necessary sequence data.
A properly filled out Prints form is given below.
![]()
The results are organized by high-scoring fingerprints, the top ten scoring fingerprints and then detailed information on the top scoring fingerprints. Finally the sequence used in the search is given.
The example results are given below.
![]()
![]()
![]()
Explore any statistically significant matches, plus those with a P-value of less than 10^-4, to get a feel for the nature of this resource. The start location of the found block is given in the Pos column of the "ten top scoring fingerprints for your query. Detailed by motif" section of the results. To find where the block ends, add the length of the block to the start value and subtract one.
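The end-of-block arithmetic can be written out as a one-line function. This is a minimal sketch with a hypothetical function name, `block_end`. The block lengths used here are not copied from a results page; they are back-calculated from the start and end positions recorded for the example main protein.

```python
def block_end(start, length):
    """Last residue position of a block that begins at `start` (1-based)."""
    # The start residue itself counts as the first residue of the block,
    # which is why one is subtracted.
    return start + length - 1


# (start, length) pairs; lengths are back-calculated from the example data.
for start, length in [(619, 14), (656, 19), (776, 23)]:
    print(f"starts at {start} and ends at {block_end(start, length)}")
```

The subtraction of one matters when marking printouts: a block of length 14 starting at 619 covers residues 619 through 632, not 619 through 633.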
The example run resulted in the TYRKINASE hit (Tyrosine kinase catalytic domain signature) being most significant.
Again, repeat the process with your additional two proteins.
All three runs had TYRKINASE as their only significant hit.
Record the following:
14. Any found fingerprint pattern name(s) for the main protein and their location
The main protein has three TYRKINASE blocks.
TYRKINASE block 1 of 5 starts at 619 and ends at 632
TYRKINASE block 2 of 5 starts at 656 and ends at 674
TYRKINASE block 5 of 5 starts at 776 and ends at 798
15. Any found fingerprint pattern name(s) for the second protein and their location
The second protein from fly has two TYRKINASE blocks.
TYRKINASE block 1 of 5 starts at 627 and ends at 640
TYRKINASE block 5 of 5 starts at 792 and ends at 814
16. Any found fingerprint pattern name(s) for the third protein and their location
The third protein from medaka has three TYRKINASE blocks.
TYRKINASE block 1 of 5 starts at 618 and ends at 631
TYRKINASE block 2 of 5 starts at 655 and ends at 673
TYRKINASE block 5 of 5 starts at 775 and ends at 797
On the local machine use an editor to create a file containing all the fasta formatted protein files you saved in lab 1.
This created file should resemble the one given below in the example data file link.
Trim the information after the > sign for each sequence to a single word reflecting the species.
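The same combining and trimming can be done with a short script instead of an editor. The sketch below is one possible way, not a required part of the lab: the function names `relabel` and `combine` are hypothetical, the header text and sequences shown are made-up placeholders, and each saved file is assumed to hold exactly one fasta record.

```python
def relabel(fasta_text, label):
    """Replace the header of a single fasta record with '>label'."""
    lines = [ln.strip() for ln in fasta_text.splitlines() if ln.strip()]
    headers = [ln for ln in lines if ln.startswith(">")]
    if len(headers) != 1 or not lines[0].startswith(">"):
        raise ValueError("expected exactly one fasta record")
    return "\n".join([">" + label] + lines[1:]) + "\n"


def combine(records):
    """Join (label, fasta_text) pairs into one multi-sequence fasta string."""
    return "".join(relabel(text, label) for label, text in records)


# Placeholder records; in practice the text would be read from the saved files.
saved = [
    ("human", ">hypothetical human header with extra words\nMTACARRAGG\nLPDPGLCGPA\n"),
    ("fly", ">hypothetical fly header with extra words\nMKTLAAGG\n"),
]
print(combine(saved), end="")
```

Using a single word per header keeps the sequence names readable in the alignment and in the tree.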
Once your combined fasta file has been created, use it at the EBI site to do a Clustalw alignment on the sequences.
[EBI's Clustalw run - input format: single file containing multiple fasta sequences] The initial form for this site is given below.
![]()
Depending on how you left the file in the editor, either open the file again or simply copy its contents, then paste them into the form's input box.
A properly filled out form is given below.
![]()
Click the RUN CLUSTALW button to submit the job. It takes a while to run your job. The site has a dynamic window that keeps you occupied while you wait. Your results form a rather large html file. Only portions of the example output are shown here. The file starts with information on the data used to generate the alignment. The second section has a link to the downloadable output of your alignment and then goes on to print the data on the screen. Clustalw output files end with the extension aln.
![]()
![]()
At the end, in the third section after the alignment, is the generated tree file data. Clustalw tree files end with the extension dnd.
![]()
From the results page click on the aln file and save it to your local machine.
Scroll down the results page to the area with the term Your Multiple Sequence Alignment:. Click on the link there and get to the actual alignment file. From the browser's File menu, select the Save As option and save the file. Click on the browser's Back button to return to the results page.
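The saved aln file is plain text: a header line beginning with CLUSTAL, followed by blocks in which each line holds a sequence name and a piece of the aligned sequence, with a conservation line of symbols under each block. The sketch below shows one way such a file could be read back into full-length aligned sequences. It is an assumption-laden illustration, not a complete parser: the function name `parse_clustal` is hypothetical and the toy alignment is made up.

```python
def parse_clustal(text):
    """Return {name: aligned_sequence} from Clustal-format alignment text."""
    sequences = {}
    for line in text.splitlines():
        # Skip the header, blank lines, and conservation lines (which start with spaces).
        if not line.strip() or line[0].isspace() or line.startswith("CLUSTAL"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        name, segment = fields[0], fields[1]
        # Each block adds another piece to the end of the named sequence.
        sequences[name] = sequences.get(name, "") + segment
    return sequences


# Made-up toy alignment in the same layout as an aln file.
toy = "\n".join([
    "CLUSTAL W multiple sequence alignment",
    "",
    "human      MKT-A",
    "fly        MKTLA",
    "           *** *",
    "",
    "human      GG",
    "fly        G-",
    "           * ",
])
print(parse_clustal(toy))
```

All sequences returned this way should have the same length, since gaps are kept as dashes.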
Again on the results page, save the dnd file to your local machine.
Scroll down to the bottom of the results page. There is a link for the generated tree file. Click on that link and get to the actual tree file. From the browser's File menu, select the Save As option and save the file. Click on the browser's Back button to return to the results page.
Click on the browser's Back button until you return to the lab instructions.
Check out your alignment. This requires the local program GeneDoc. From the File menu select Open and navigate to the desired input file location. Any file that is in MSF format will show up in the resulting folder window. Double click on the name of your alignment file.
The alignment is currently displayed with consensus information given below the actual sequences used. To remove the consensus line click on the upper case C in the first line of buttons. The resulting window contains the display settings for the program. At the top of the third column is the section that controls the consensus line. Click the No Consensus button followed by OK to close the window and make the change in your alignment.
Print off your alignment file by selecting the Print option of the File menu and clicking on the OK button of the Print window. Once satisfied with your alignment output, use the Exit option of the File menu to get out of the GeneDoc program. You can either save your alignment changes or not, depending on your own needs.
The example alignment is given below. The images are rather large and you will need to scroll over to the right to see the entire alignment block.
The data was divided up into sequence blocks, each containing a portion of the alignment that is 97 residues long. This was the result of the size of the terminal screen being used when the alignment was generated.
In the default mode of the GeneDoc program highly similar columns in the alignment are colored black. Those slightly less similar are colored dark gray. A lighter shade of gray is used for even less similar and finally white for areas of variation.
The fly sequence is much larger than the other sequences in the alignment data set. After block 14 of the alignment the only sequence with any data is the fly one. Therefore, only data for the first fourteen blocks is given.
The alignment generated by the example data set shows the greatest areas of conserved residues in blocks 11 and 12, corresponding to the guanylate cyclase region. A less conserved area in blocks 8 and 9 corresponds to the kinase region.
block 1 block 2 block 3 block 4 block 5 block 6 block 7 block 8 block 9 block 10 block 11 block 12 block 13 block 14
Examine your tree file with the local machine's program treeview32. From the File menu select Open and navigate to the desired input data location. Once there, click on the name of the saved tree file.
The program has a number of different viewing options for the tree. These are shown by the second block of four control buttons [radial tree, cladogram, rectangular cladogram and phylogram]. Explore all the viewing options, printing off the version of the tree that makes the most phylogenetic sense to you.
Print your desired tree by selecting the Print option of the File menu and clicking on the OK button of the Print window. Use the Exit option of the File menu to get out of the treeview32 program.
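The dnd file is a plain text tree in Newick format, so it can be checked quickly before or after viewing it. The sketch below pulls out the leaf names so you can confirm that every species label from your combined fasta file made it into the tree. It is a minimal illustration with a hypothetical function name, `leaf_names`; the tree text is made up, and quoted labels containing punctuation are not handled.

```python
import re


def leaf_names(newick):
    """Return the leaf labels of a Newick tree, in the order they appear."""
    # A leaf label directly follows "(" or ","; internal node labels and
    # branch lengths follow ")" or ":" and are therefore skipped.
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)


# Made-up tree text laid out across several lines, as a dnd file often is.
toy_tree = "(\nhuman:0.10,\n(fly:0.30,\nmedaka:0.20)\n:0.05);"
print(leaf_names(toy_tree))
```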
The example run tree gives the following four variations.
radial tree, cladogram, rectangular cladogram, phylogram
Now go back and look at the information collected on your proteins. Compare your recorded characterized hard copy pattern data with the generated alignment.
17. Do the found motifs align?
Use your printed alignment and the notes you have about the three proteins that you analyzed for motifs and profiles. Check whether the found motifs and profiles fall in the same area of the alignment. They should; if they do not, return to the clustalw site and try the alignment again, this time using a different comparison table or different penalty values.
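Motif locations are recorded as positions in each unaligned protein, while the alignment contains gaps, so the same motif can sit at different residue numbers in different sequences and still line up. The sketch below converts a residue position into an alignment column and then into a block number. It is an illustration only: the function names are hypothetical, the toy sequence is made up, and the block width of 97 is taken from the example printout and may differ for yours.

```python
def residue_to_column(aligned, position):
    """Alignment column (1-based) holding residue number `position` (1-based)."""
    seen = 0
    for column, ch in enumerate(aligned, start=1):
        if ch not in "-.":           # gap characters do not count as residues
            seen += 1
            if seen == position:
                return column
    raise ValueError("position is beyond the end of the sequence")


def column_to_block(column, width=97):
    """Printed block number (1-based) containing the given alignment column."""
    return (column - 1) // width + 1


toy = "M-KT--A"                      # made-up aligned sequence with gaps
column = residue_to_column(toy, 4)   # the fourth residue, A
print(column, column_to_block(column))
```

If the start positions of the same motif in two sequences map to the same or nearly the same column, the motifs align.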
18. Just how conserved is the protein family being worked with?
A highly conserved protein family will have large blocks of the alignment with black backgrounds.
In the alignment generated by the example data set, the most highly conserved area corresponds to the guanylate cyclase region. A less conserved area corresponds to the kinase region.
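A rough number can be put on conservation by counting the alignment columns in which every sequence has the same residue. The sketch below does this for a list of aligned sequences. It is a simplification and does not reproduce GeneDoc's shading levels, which also mark partly similar columns; the function name `conserved_fraction` is hypothetical and the toy alignment is made up.

```python
def conserved_fraction(aligned_sequences):
    """Fraction of alignment columns that are identical in every sequence."""
    seqs = list(aligned_sequences)
    if not seqs or not seqs[0]:
        raise ValueError("need at least one non-empty aligned sequence")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must all have the same length")
    identical = sum(
        1
        for column in zip(*seqs)
        # A column of gaps in every sequence is not counted as conserved.
        if len(set(column)) == 1 and column[0] not in "-."
    )
    return identical / length


toy = ["MKT-A", "MKTLA", "MRT-A"]    # made-up three-sequence alignment
print(f"{100 * conserved_fraction(toy):.0f}% of columns are identical")
```

Running the same count on slices of the alignment, for example the columns covering one printed block, lets different regions be compared.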