Detailed Example for Lab 2

This file contains a very detailed version of the exercise given in lab 2. It assumes that the user knows how to cut and paste text from a computer screen and can use a text editor (word processing program). It is divided up into sections so that a user can move to a specific area from the main lab page. To completely understand a given section, read all the material contained in its links. Detailed instructions are given in green text on the html page and are italicized in the hard copy.



Introduction

Lab Two

The purpose of the following exercise is to help you become more familiar with web-based protein characterization tools. The sequences to be used were derived during the previous lab.


Understanding a protein starts with exploring its basic characteristics. Visually examine your main protein sequence derived in lab one, or run it through one of the EXPASY's ProtParam sites (Canada, China, Korea, Taiwan, USA), to see if there is anything unusual about its amino acid composition.

High percentages of any one amino acid can greatly influence the protein's behavior. METALLOTHIONEINs have about 30% cysteine and form metal cages using the -SH group from the cysteines. Some ANTIFREEZE proteins are about 50% alanine. One current theory holds that these proteins work through possible hydrophobic interactions between the proteins and water.

Look at the GenBank version of your main protein sequence. Check whether any particular amino acid seems to appear more frequently than the others.

1. Did visual inspection of your protein detect any amino acids in obvious abundance?

In the example main protein given below, there appear to be many leucines and alanines. However, this impression comes from the easily spotted runs of A's and L's and is not the result of any actual count of characters.

>TRIAL associated full length protein
MTACARRAGGLPDPGLCGPAWWAPSLPRLPRALPRLPLLLLLLLLQPPALSAVFTVGVLG
PWACDPIFSRARPDLAARLAAARLNRDPGLAGGPRFEVALLPEPCRTPGSLGAVSSALAR
VSGLVGPVNPAACRPAELLAEEAGIALVPWGCPWTQAEGTTAPAVTPAADALYALLRAFG
WARVALVTAPQDLWVEAGRSLSTALRARGLPVASVTSMEPLDLSGAREALRKVRDGPRVT
AVIMVMHSVLLGGEEQRYLLEAAEELGLTDGSLVFLPFDTIHYALSPGPEALAALANSSQ
LRRAHDAVLTLTRHCPSEGSVLDSLRRAQERRELPSDLNLQQVSPLFGTIYDAVFLLARG
VAEARAAAGGRWVSGAAVARHIRDAQVPGFCGDLGGDEEPPFVLLDTDAAGDRLFATYML
DPARGSFLSAGTRMHFPRGGSAPGPDPSCWFDPNNICGGGLEPGLVFLGFLLVVGMGLAG
AFLAHYVRHRLLHMQMVSGPNKIILTVDDITFLHPHGGTSRKVAQGSRSSLGARSMSDIR
SGPSQHLDSPNIGVYEGDRVWLKKFPGDQHIAIRPATKTAFSKLQELRHENVALYLGLFL
ARGAEGPAALWEGNLAVVSEHCTRGSLQDLLAQREIKLDWMFKSSLLLDLIKGIRYLHHR
GVAHGRLKSRNCIVDGRFVLKITDHGHGRLLEAQKVLPEPPRAEDQLWTAPELLRDPALE
RRGTLAGDVFSLAIIMQEVVCRSAPYAMLELTPEEVVQRVRSPPPLCRPLVSMDQAPVEC
ILLMKQCWAEQPELRPSMDHTFDLFKNINKGRKTNIIDSMLRMLEQYSSNLEDLIRERTE
ELELEKQKTDRLLTQMLPPSVAEALKTGTPVEPEYFEQVTLYFSDIVGFTTISAMSEPIE
VVDLLNDLYTLFDAIIGSHDVYKVETIGDAYMVASGLPQRNGQRHAAEIANMSLDILSAV
GTFRMRHMPEVPVRIRIGLHSGPCVAGVVGLTMPRYCLFGDTVNTASRMESTGLPYRIHV
NLSTVGILRALDSGYQVELRGRTELKGKGAEDTFWLVGRRGFNKPIPKPPDLQPGSSNHG
ISLQEIPPERRRKLEKARPGQFS
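A visual impression can be checked with an actual count. The following is a minimal sketch, not part of the lab's required steps, that tallies residues and reports each as a percentage. The function name `composition` is hypothetical, and only the first line of the example protein is used to keep the listing short; paste in the full sequence to check the whole protein.

```python
from collections import Counter


def composition(sequence):
    """Return (residue, count, percent) tuples, most frequent residue first."""
    # Keep letters only so that pasted line breaks or spaces are ignored.
    residues = [c for c in sequence.upper() if c.isalpha()]
    total = len(residues)
    if total == 0:
        return []
    counts = Counter(residues)
    return [(aa, n, 100.0 * n / total) for aa, n in counts.most_common()]


# First 60 residues of the example protein shown above.
fragment = "MTACARRAGGLPDPGLCGPAWWAPSLPRLPRALPRLPLLLLLLLLQPPALSAVFTVGVLG"

for aa, n, pct in composition(fragment)[:5]:
    print(f"{aa}\t{n}\t{pct:.1f}%")
```

ProtParam reports the same kind of table for the complete sequence, so this count is only a quick local cross-check.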


Motif Analysis

Protein families have been studied to identify functional patterns called motifs. Some of these motifs are based on text patterns and others on profile matrices.
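To make the idea of a text pattern concrete, the sketch below converts a pattern written in PROSITE-style syntax into a Python regular expression and reports 1-based match locations. The pattern and the short sequence are made up for illustration and are not actual PROSITE entries. The function names are hypothetical, and the converter covers only the common syntax elements (x, [..], {..}, repeat counts and the < > anchors).

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern such as C-x(2)-[LIVM]-{P}-G to a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split the element into its core and an optional (n) or (n,m) repeat.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":                  # x means any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # {..} lists excluded residues
        # [..] and single letters are already valid regex.
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


def find_pattern(sequence, pattern):
    """Return 1-based inclusive (start, end) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.end()) for m in regex.finditer(sequence)]


# Hypothetical pattern and sequence, for illustration only.
print(prosite_to_regex("C-x(2)-[LIVM]-{P}-G"))
print(find_pattern("AACKKLAGTT", "C-x(2)-[LIVM]-{P}-G"))
```

Profile-based motifs cannot be expressed this way, since they score every position with a matrix rather than testing a fixed text rule.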

Go to one of the EXPASY's Prosite sites (Canada, China, Korea, Taiwan, USA) to find text pattern-based functional motifs in your full-length protein sequence.

[EXPASY's Prosite run - input format: raw sequence]

The EXPASY Prosite form is given below.

Paste your sequence into the Scan a protein for Prosite matches window. Choose the option to Exclude patterns with a high probability of occurrence.

Use the editor to open the necessary file and copy the text into the machine's buffer.

Then paste the text into the form's pale yellow window in the browser.

A properly filled out page with the example data is given below.

Run the process by clicking the START THE SCAN button.

The results page gives the sequence used in the search and then any found hits. Following the PDOC link gives detailed information about the located motif including references. Explore the documentation on any hits.

The results page for the example data is given below. Three motifs were found. Links to the PDOC and PS documentation pages are given.

Links to the documentation for the example protein's matches.

PROTEIN_KINASE_DOM PDOC documentation PROTEIN_KINASE_DOM PS documentation
GUANYLATE_CYCLASES_1 PDOC documentation GUANYLATE_CYCLASES_1 PS documentation
GUANYLATE_CYCLASES_2 PDOC documentation GUANYLATE_CYCLASES_2 PS documentation

Repeat this process with at least two other protein files that you saved.

In the case of the example data, this process was repeated with the stored sequences from fly and medaka.

Record the following:

2. Prosite data for the protein of interest (pattern name and location)

The main sequence has hits for the PROTEIN_KINASE_DOM motif (525-808), the GUANYLATE_CYCLASES_1 motif (987-1010) and the GUANYLATE_CYCLASES_2 motif (880-1010).

3. Prosite data for the second protein (pattern name and location)

The fly sequence has the same motifs as the main sequence, plus the MITOCH_CARRIER motif.

4. Prosite data for the third protein (pattern name and location)

The medaka sequence has the same hits as the main sequence.



Profile Analysis

To explore profile-based functional information, go to the ProfileScan server at ISREC.

The ProfileScan page is given below.

Paste in your sequence and select all the database options that you can.

[ISREC's PSCAN run - input format: raw or fasta sequence]

The necessary data is in your editor window.

A properly filled out page is given below for the example data. Be sure to click on all the available databases on the form.

Click scan to start the process.

It takes a few minutes for this analysis process to run. The page lists the databases searched so that you can keep track of the run's progress.

Results from this run are very long. First, the sequence used is given along with a listing of the databases searched. Second, a summary of the resulting hits is given. An exclamation point marks significant hits. There will be many more hits from the PROSITE database on this list than on the previous one, since this time the more frequent patterns weren't excluded. Record the names of any hit with an exclamation point.

The example data results are given below. Partial images are given for the various sections of the results. It is necessary to scroll down the entire list to see if there are any significant hits or not.

top of results

summary section

////////////////////////////

match location section

////////////////////////////

The best way to explore a hit of interest is to click on its link in the portions of the results that follow the match location section. The PDOC or QDOC links lead to Prosite documentation (most of the time).

pattern section

////////////////////////////

profile section

//////////////////////////

The Pfam-site documentation is quite colorful and informative. Click on any significant Pfam hit's documentation link. Record the PDB name given in the box containing the structure.

pfam section

Repeat this process with the same two additional proteins used in the Prosite runs. Record the following:

5. Number of Pfam hits for main protein

There were 3 Pfam hits for the main protein.

6. The name, location and associated PDB structure for each Pfam hit of the main protein

The Pfam hits for the main protein were:

Pfam name       Location    PDB code
pkinase         584-800     1qmz
guanylate_cyc   871-1058    1azs
anf_receptor    43-429      1ewk

7. The name and location of any other significant hit from another database for the main protein

The other database hits for the main protein were:

database          name                   location
prosite profile   guanylate_cyclases_2   880-1010
prosite profile   protein_kinase_dom     525-808
prosite pattern   guanylate_cyclases_1   987-1010

8. Number of Pfam hits for second protein

There were 3 Pfam hits for the fly protein.

9. The name, location and associated PDB structure for each Pfam hit of the second protein

The Pfam hits for the fly protein were:

Pfam name       Location    PDB code
pkinase         580-812     1b6c
guanylate_cyc   887-1074    1cjk
anf_receptor    36-436      1ewk

10. The name and location of any other significant hit from another database for the second protein

The other database hits for the fly protein were:

database          name                   location
prosite profile   guanylate_cyclases_2   896-1026
prosite profile   protein_kinase_dom     547-824
prosite pattern   guanylate_cyclases_1   1003-1026
prosite pattern   mitoch_carrier         985-994

11. Number of Pfam hits for third protein

There were 3 Pfam hits for the medaka protein.

12. The name, location and associated PDB structure for each Pfam hit of the third protein

The Pfam hits for the medaka protein were:

Pfam name       Location    PDB code
pkinase         593-806     1qdp
guanylate_cyc   870-1057    1cju
anf_receptor    43-433      1dp4

13. The name and location of any other significant hit from another database for the third protein

The other database hits for the medaka protein were:

database          name                   location
prosite profile   guanylate_cyclases_2   879-1009
prosite profile   protein_kinase_dom     537-807
prosite pattern   guanylate_cyclases_1   986-1009

Mark the locations of the found characterized patterns or profiles on the printouts for your three proteins. Name or color code them to help with identification later.

Use the highlighter provided in your workshop materials to mark any found characterized regions of the three sequences you have been using. You may need to come up with a scheme to tell the various patterns apart since profiles and Pfam hits can be longer than Prosite patterns and might overlap them.
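Before marking the printouts, it can help to know which recorded hits overlap. The sketch below is one possible way to list them, using the Pfam and Prosite locations recorded above for the main example protein. The function names are hypothetical, and locations are treated as 1-based inclusive ranges.

```python
from itertools import combinations

# Hits recorded for the main example protein: (name, start, end).
hits = [
    ("Pfam pkinase", 584, 800),
    ("Pfam guanylate_cyc", 871, 1058),
    ("Pfam anf_receptor", 43, 429),
    ("Prosite profile protein_kinase_dom", 525, 808),
    ("Prosite profile guanylate_cyclases_2", 880, 1010),
    ("Prosite pattern guanylate_cyclases_1", 987, 1010),
]


def overlap(a, b):
    """Return the shared (start, end) range of two hits, or None if disjoint."""
    start = max(a[1], b[1])
    end = min(a[2], b[2])
    return (start, end) if start <= end else None


def overlapping_pairs(hit_list):
    """Return (name, name, shared range) for every pair of overlapping hits."""
    pairs = []
    for a, b in combinations(hit_list, 2):
        shared = overlap(a, b)
        if shared:
            pairs.append((a[0], b[0], shared))
    return pairs


for first, second, (start, end) in overlapping_pairs(hits):
    print(f"{first} overlaps {second} at {start}-{end}")
```

Each reported pair needs two distinguishable marks on the printout, for example a highlight for the longer hit and an underline for the shorter one.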

Note that the Pfam PDB codes may differ from run to run even if the name of the found pattern is the same. This is because the Pfam site being used by ISREC (the Sanger Centre) changes the structure presented on its page throughout the day, working through a list of PDB codes which have the found pattern. By recording the PDB code names you have made a collection of solved structures which contain the Pfam pattern.

If your chosen sequence lacks significant Pfam hits (using ProfileScan) or associated PDB structures, choose one of the following protein sequences only for motif, profile, and fingerprint analysis. All sequences are human in origin.

sequence 1 genpept format fasta format
sequence 2 genpept format fasta format
sequence 3 genpept format fasta format
sequence 4 genpept format fasta format
sequence 5 genpept format fasta format



Fingerprint Analysis

Another approach to finding functional information is to use fingerprints. A fingerprint is a group of conserved motifs used to characterize a protein family. Usually the motifs do not overlap, but are separated along the sequence, though they may be contiguous in 3D-space. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs.

The PRINTS database was established on this premise. To search this database, go to the FPSCAN site.

The initial Prints form is given below.

Paste in your raw protein sequence and click on Send Query.

[FPScan run - input format: raw sequence]

Your editor session is the source of the necessary sequence data.

A properly filled out Prints form is given below.

The results are organized by high-scoring fingerprints, the top ten scoring fingerprints and then detailed information on the top scoring fingerprints. Finally the sequence used in the search is given.

The example results are given below.

Explore any statistically significant matches, plus those with a P-value of less than 10^-4, to get a feel for the nature of this resource. The start location of the found block is given in the Pos column of the "ten top scoring fingerprints for your query. Detailed by motif" section of the results. To find where the block ends, add the length of the block to the start value and subtract one.
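The end-position arithmetic can be sketched as follows. The start values are the TYRKINASE block starts recorded for the main example protein in this section; the block lengths (14, 19 and 23) are an assumption inferred from those recorded start and end values rather than read from the FPScan output, and the function name is hypothetical.

```python
def block_end(start, length):
    """Last residue of a block: start + length - 1 (1-based, inclusive)."""
    return start + length - 1


# (block label, start from the Pos column, block length)
# Lengths are inferred from the recorded positions, not taken from FPScan.
tyrkinase_blocks = [
    ("block 1 of 5", 619, 14),
    ("block 2 of 5", 656, 19),
    ("block 5 of 5", 776, 23),
]

for label, start, length in tyrkinase_blocks:
    print(f"TYRKINASE {label} starts at {start} and ends at {block_end(start, length)}")
```

Subtracting one is needed because the start residue itself counts as the first position of the block.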

The example run resulted in the TYRKINASE hit (Tyrosine kinase catalytic domain signature) being most significant.

Again, repeat the process with your additional two proteins.

All three runs had TYRKINASE as their only significant hit.

Record the following:

14. Any found fingerprint pattern name(s) for the main protein and their location

The main protein has three TYRKINASE blocks.


TYRKINASE block 1 of 5 starts at 619 and ends at 632
TYRKINASE block 2 of 5 starts at 656 and ends at 674
TYRKINASE block 5 of 5 starts at 776 and ends at 798

15. Any found fingerprint pattern name(s) for the second protein and their location

The second protein from fly has two TYRKINASE blocks.


TYRKINASE block 1 of 5 starts at 627 and ends at 640
TYRKINASE block 5 of 5 starts at 792 and ends at 814

16. Any found fingerprint pattern name(s) for the third protein and their location

The third protein from medaka has three TYRKINASE blocks.


TYRKINASE block 1 of 5 starts at 618 and ends at 631
TYRKINASE block 2 of 5 starts at 655 and ends at 673
TYRKINASE block 5 of 5 starts at 775 and ends at 797



File for Alignment

On the local machine use an editor to create a file containing all the fasta formatted protein files you saved in lab 1.

This created file should resemble the one given below in the example data file link.

example data file

Trim the information after the > sign for each sequence to a single word reflecting the species.
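If you would rather not edit the headers by hand, the combining and trimming steps can be scripted. The following is a minimal sketch under stated assumptions: each saved file holds a single sequence, the species labels are chosen by you, and the function names and the label `species1` are hypothetical. It works on text already read from the files, so reading and writing the files is left to the editor or to a few extra lines of your own.

```python
def relabel_fasta(text, label):
    """Replace the header line of FASTA text with '>label' and drop blank lines."""
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            out.append(">" + label)      # keep only the one-word label
        elif line.strip():
            out.append(line.strip())     # sequence line, whitespace trimmed
    return "\n".join(out) + "\n"


def combine(records):
    """Join (label, fasta_text) pairs into one multi-sequence FASTA string."""
    return "".join(relabel_fasta(text, label) for label, text in records)


# Shortened stand-in for one saved file; the label is a placeholder.
saved = ">TRIAL associated full length protein\nMTACARRAGG\nLPDPGLCGPA\n"
print(combine([("species1", saved)]), end="")
```

Keeping the labels to a single short word matters later, because the alignment and tree displays have limited room for sequence names.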



Clustalw Alignment

Once your combined fasta file has been created, use it at the EBI site to do a Clustalw alignment on the sequences.

[EBI's Clustalw run - input format: single file containing multiple fasta sequences]

The initial form for this site is given below.

Depending on how you left the file in the editor, open the file again if necessary, then copy and paste its contents into the form's input box.

A properly filled out form is given below.

Click the RUN CLUSTALW button to submit the job. The job takes a while to run, and the site shows a dynamic window while you wait. Your results form a rather large html file, so only portions of the example output are shown here. The file starts with information on the data used to generate the alignment. The second section has a link to the downloadable output of your alignment and then prints the alignment on the screen. Clustalw output files end with the extension aln.

At the end, in the third section after the alignment, is the generated tree file data. Clustalw tree files end with the extension dnd.

From the results page click on the aln file and save it to your local machine.

Scroll down the results page to the area labeled Your Multiple Sequence Alignment. Click on the link there to get to the actual alignment file. From the browser's File menu, select the Save As option and save the file. Click on the browser's Back button to return to the results page.

Again on the results page, save the dnd file to your local machine.

Scroll down to the bottom of the results page. There is a link for the generated tree file. Click on that link and get to the actual tree file. From the browser's File menu, select the Save As option and save the file. Click on the browser's Back button to return to the results page.

Click on the browser's Back button until you return to the lab instructions.



Alignment processing

Check out your alignment using the local program GeneDoc. From the File menu select Open and navigate to the desired input file location. Any file that is in MSF format will show up in the resulting folder window. Double click on the name of your alignment file.

The alignment is currently displayed with consensus information given below the actual sequences used. To remove the consensus line click on the upper case C in the first line of buttons. The resulting window contains the display settings for the program. At the top of the third column is the section that controls the consensus line. Click the No Consensus button followed by OK to close the window and make the change in your alignment.

Print off your alignment file by selecting the Print option of the File menu and clicking on the OK button of the Print window. Once satisfied with your alignment output, use the Exit option of the File menu to get out of the GeneDoc program. You can either save your alignment changes or not, depending on your own needs.

The example alignment is given below. The images are rather large and you will need to scroll over to the right to see the entire alignment block.

The data was divided up into sequence blocks, each containing a portion of the alignment that is 97 residues long. This was the result of the size of the terminal screen being used when the alignment was generated.

In the default mode of the GeneDoc program, highly similar columns in the alignment are colored black. Those slightly less similar are colored dark gray. A lighter shade of gray is used for even less similar columns, and white marks areas of variation.

The fly sequence is much larger than the other sequences in the alignment data set. After block 14 of the alignment the only sequence with any data is the fly one. Therefore, only data for the first fourteen blocks is given.

The alignment generated by the example data set shows the greatest areas of conserved residues in blocks 11 and 12 corresponding to the guanylate cyclases region. A less conserved area in blocks 8 and 9 corresponds to the kinase region.

block 1 block 2 block 3 block 4
block 5 block 6 block 7 block 8
block 9 block 10 block 11 block 12
block 13 block 14    



Looking at a tree

Examine your tree file with the local machine's program treeview32. From the File menu select Open and navigate to the desired input data location. Once there, click on the name of the saved tree file.
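Clustalw dnd files are plain text in Newick format, so the saved tree can be sanity-checked before it is opened in the viewer. The sketch below pulls out the sequence names at the tips of the tree. The function name and the tree text are hypothetical; the labels stand in for whatever one-word names you gave your sequences.

```python
import re


def leaf_names(newick):
    """Return the tip labels of a Newick tree in the order they appear."""
    # A tip label follows "(" or "," and runs until ":", ",", ")" or ";".
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


# Hypothetical tree text laid out like a saved dnd file.
tree_text = """(
(
fly:0.3,
medaka:0.2)
:0.1,
main:0.4);
"""

print(leaf_names(tree_text))
```

If a name is missing or truncated here, fix the headers in the combined fasta file and rerun the alignment before printing the tree.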

The program has a number of different viewing options for the tree. These are shown by the second block of four control buttons [radial tree, cladogram, rectangular cladogram and phylogram]. Explore all the viewing options, printing off the version of the tree that makes the most phylogenetic sense to you.

Print your desired tree by selecting the Print option of the File menu and clicking on the OK button of the Print window. Use the Exit option of the File menu to get out of the treeview32 program.

The example run tree gives the following four variations.

radial tree cladogram rectangular cladogram phylogram


Conclusions

Now go back and look at the information collected on your proteins. Compare your recorded characterized hard copy pattern data with the generated alignment.

17. Do the found motifs align?

Use your printed alignment and the notes you have about the three proteins that you analyzed for motifs and profiles. Check whether the found motifs and profiles fall in the same area of the alignment. They should; if they do not, return to the Clustalw site and try the alignment again, this time using a different comparison table or changed penalty values.
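Motif locations are recorded as positions in each ungapped sequence, while the alignment contains gaps, so the same motif can sit at different residue numbers yet in the same alignment column. The sketch below shows one way to convert a residue number into an alignment column and then into a block number, assuming blocks of 97 columns as in the example printout. The function names and the two short aligned rows are hypothetical.

```python
def residue_to_column(aligned_row, residue):
    """Return the 1-based alignment column holding the given 1-based residue."""
    seen = 0
    for column, char in enumerate(aligned_row, start=1):
        if char not in "-.":             # skip gap characters
            seen += 1
            if seen == residue:
                return column
    raise ValueError("residue number exceeds the sequence length")


def column_to_block(column, block_width=97):
    """Return the 1-based printout block containing an alignment column."""
    return (column - 1) // block_width + 1


# Hypothetical aligned rows: residue 4 of row_a and residue 6 of row_b
# are both the C that follows the gap.
row_a = "MTA--CARR"
row_b = "MTAGGCAKR"

print(residue_to_column(row_a, 4), residue_to_column(row_b, 6))
print(column_to_block(residue_to_column(row_a, 4)))
```

Two motif starts that map to the same column, or to columns a few positions apart, can be considered aligned for the purposes of this question.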

18. Just how conserved is the protein family being worked with?

A highly conserved protein family will have large blocks of the alignment with black backgrounds.

In the alignment generated by the example data set, the most highly conserved area corresponds to the guanylate cyclases region. A less conserved area corresponds to the kinase region.
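A rough numeric counterpart to judging the black areas by eye is to count the columns in which every sequence has the same residue. The sketch below does this for rows copied out of an alignment. It uses strict identity, which is an assumption made here for simplicity and is stricter than the similarity shading GeneDoc applies; the function names and the three short rows are hypothetical.

```python
def identical_columns(rows):
    """Return the 1-based columns where all rows share the same non-gap residue."""
    columns = []
    for index, chars in enumerate(zip(*rows), start=1):
        first = chars[0]
        if first not in "-." and all(c == first for c in chars):
            columns.append(index)
    return columns


def fraction_identical(rows):
    """Fraction of alignment columns that are identical in every row."""
    return len(identical_columns(rows)) / len(rows[0])


# Hypothetical aligned rows of equal length.
rows = ["MTA-C", "MTGGC", "MSA-C"]

print(identical_columns(rows))
print(fraction_identical(rows))
```

Running the count separately on the columns of the guanylate cyclase region and the kinase region gives a simple way to compare how conserved each one is.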


 


Copyright 2002 Regents of the University of California. All rights reserved.