Web-exercises

 

Many sites on the Internet let you run different kinds of sequence analyses and sequence information searches. Below is a list of sites that may be of interest; many other sites do similar things in similar or different ways. Running these Web-based analyses is something of a lottery: a site may be inaccessible, and an analysis sometimes takes a preposterous amount of time. Don't forget to bookmark the sites that are useful to you.

 

Databases / Sequence retrieval / Alignments etc.

NCBI http://www.ncbi.nlm.nih.gov/

EMBL http://www.embl-heidelberg.de/Services/index.html

DDBJ http://www.ddbj.nig.ac.jp/

EMBL-EBI http://www.ebi.ac.uk/services/index.html

GENESTREAM http://pdb.igh.cnrs.fr/

EMBnet http://www.ch.embnet.org/

NPS http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_server.html

ISREC http://www.isrec.isb-sib.ch/software/software.html

EXPASY http://us.expasy.org/

 

BLAST servers

BLAST2 at NCBI http://www.ncbi.nlm.nih.gov/BLAST/

BLAST2 at EMBL http://dove.embl-heidelberg.de/Blast2/

Pairwise alignments: http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html

 

 

Protein patterns and families

Pfam and PROSITE  http://www.sanger.ac.uk/Software/Pfam/index.shtml

BLOCKS http://www.blocks.fhcrc.org/

PROSITE Scan http://us.expasy.org/tools/scanprosite/

 

GENE ONTOLOGY(TM) CONSORTIUM http://www.geneontology.org/

 

Exercises

 

The BLAST programs employ the SEG algorithm to filter low-complexity regions from proteins before executing a database search.

 

a. How many low-complexity regions can you find in the human PAX-6 protein?

b. Does this sequence contain any sequence motifs?
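
To get a feel for what "low complexity" means before using the web tools, the sketch below flags windows whose amino acid composition has low Shannon entropy. It is not the SEG algorithm itself, only a simplified stand-in: the function names, the 12-residue window, and the 2.2-bit cutoff are assumptions chosen for illustration, and the regions it reports will generally differ from what BLAST masks.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def low_complexity_regions(seq, window=12, threshold=2.2):
    """Return merged (start, end) regions, 1-based and inclusive, covered by
    windows whose entropy is at or below the threshold.

    Assumption: a plain sliding window, not the two-stage SEG procedure.
    """
    regions = []  # kept internally as 0-based, half-open intervals
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) <= threshold:
            end = start + window
            if regions and start <= regions[-1][1]:
                # Overlaps or touches the previous region: extend it.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return [(s + 1, e) for s, e in regions]


# Made-up test sequence: a serine run flanked by varied residues.
print(low_complexity_regions("ACDEFGHIKLMN" + "S" * 12 + "ACDEFGHIKLMN"))
```

Paste the PAX-6 sequence you retrieved into `low_complexity_regions` and compare the result with the regions masked by the BLAST server; differences are expected because the scoring is simpler here.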

 

 

Pattern searching. Use this sequence:

 

MMTAKAVDKIPVTLSGFVHQLSDNIYPVEDLAATSVTIFPNAELGGPFDQ

MNGVAGDGMINIDMTGEKRSLDLPYPSSFAPVSAPRNQTFTYMGKFSIDP

QYPGASCYPEGIINIVSAGILQGVTSPASTTASSSVTSASPNPLATGPLG

VCTMSQTQPDLDHLYSPPPPPPPYSGCAGDLYQDPSAFLSAATTSTSSSL

AYPPPPSYPSPKPATDPGLFPMIPDYPGFFPSQCQRDLHGTAGPDRKPFP

CPLDTLRVPPPLTPLSTIRNFTLGGPSAGMTGPGASGGSEGPRLPGSSSAA

AAAAAAAAYNPHHLPLRPILRPRKYPNRPSKTPVHERPYPCPAEGCDRRFS

RSDELTRHIRIHTGHKPFQCRICMRNFSRSDHLTTHIRTHTGEKPFACDYCGR

KFARSDERKRHTKIHLRQKERKSSAPSASVPAPSTASCSGGVQPGGTLCSS

NSSSLGGGPLAPCSSRTRTP.

 

The PROSITE and Pfam databases contain a lot of information on protein families and functional domains. These databases can often give a good hint of the function of a particular protein. You can, for example, search for known functional motifs and domains in your protein (or DNA).

 

a) Use the PROSITE Scan tool at http://www.expasy.ch/tools/scnpsite.html to find out whether the protein sequence contains any motifs from the PROSITE database. Follow the links to find out, for example, what function a motif could have and whether it is present in other proteins.

b) Use the Pfam database at http://www.sanger.ac.uk/Software/Pfam/ to find out whether the protein sequence contains any motifs from this database. Follow the links to find out, for example, what function a motif could have and whether it is present in other proteins.

c) Based on the results from these different analyses, which functional motifs/domains do you think the protein contains?

d) Find the "mystery" sequence in the database and compare its annotations with your results.
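
PROSITE patterns use a compact syntax of their own: elements are separated by "-", "x" stands for any residue, "[ST]" lists allowed residues, "{P}" lists forbidden ones, and "(2,4)" gives a repeat range. The sketch below shows one way such a pattern could be turned into a Python regular expression and scanned against a sequence. The helper names are hypothetical, only the common syntax elements are handled, and the example uses the well-known N-glycosylation pattern N-{P}-[ST]-{P} on a made-up test sequence rather than on the exercise sequence.

```python
import re

# One pattern element: a residue, x, [allowed] or {forbidden},
# optionally followed by a repeat count such as (3) or (2,4).
_ELEMENT = re.compile(r"([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a regular expression string."""
    pattern = pattern.strip().rstrip(".")
    regex = ""
    for element in pattern.split("-"):
        prefix = "^" if element.startswith("<") else ""   # N-terminal anchor
        suffix = "$" if element.endswith(">") else ""     # C-terminal anchor
        match = _ELEMENT.fullmatch(element.strip("<>"))
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                                    # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"                # forbidden residues
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        regex += prefix + core + suffix
    return regex


def find_motifs(sequence, pattern):
    """Return (1-based start, matched text) for every hit, overlaps included."""
    lookahead = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}."))           # N[^P][ST][^P]
print(find_motifs("MNGTAPNPSA", "N-{P}-[ST]-{P}"))   # [(2, 'NGTA')]
```

The second asparagine in the test sequence is followed by proline, which the pattern forbids, so only one hit is reported. Short patterns like this one match many proteins by chance, which is worth keeping in mind when you judge the hits reported by the web tools.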

 

Consider the following partial amino acid sequence of a protein from Saccharomyces cerevisiae:

 

MSSVAENIIQ HATHNSTLHQ

 

a) What is the likely function of this protein?

b) What is the molecular weight (in kilodaltons) and predicted isoelectric point (pI) of the protein?

c) Which chromosome is the gene located on?

d) Which genes are located upstream and downstream of the gene?

e) Does this protein have a sequence motif that is characteristic of a certain protein group (family)? Give the name of the protein group and the sequence of this motif.
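
For question b), it may help to see how a molecular weight is computed from a sequence: sum the average masses of the residues and add one water molecule for the free termini. The sketch below does this with a standard table of average residue masses. The function name is hypothetical, and applying it to the 20-residue fragment above gives only the mass of that fragment, so the full-length sequence from the database is needed for the real answer. The pI is not covered here, because it depends on the set of pK values a given tool assumes.

```python
# Average masses (Da) of amino acid residues, i.e. amino acid minus water.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one H2O for the free N- and C-termini


def molecular_weight(sequence):
    """Average molecular weight in daltons of an unmodified peptide chain."""
    residues = "".join(sequence.split()).upper()   # drop spaces and newlines
    unknown = set(residues) - set(AVERAGE_RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unknown residue codes: {sorted(unknown)}")
    return sum(AVERAGE_RESIDUE_MASS[r] for r in residues) + WATER_MASS


fragment = "MSSVAENIIQ HATHNSTLHQ"
print(round(molecular_weight(fragment) / 1000, 2), "kDa (fragment only)")
```

Web tools may report slightly different values depending on their mass table and on whether they account for modifications such as removal of the initiator methionine.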

 

Run the E. coli RecA protein against the yeast genome on the BLAST server. Choose basic BLAST and carefully review the various option windows on the page that comes up. Choose BLASTP as the program and yeast as the sequence database (all of the yeast proteins). Enter the sequence in FASTA format, or enter the PIR identifier of the query sequence, RQECA, into the input data window and indicate your choice in the small option window just above it. Otherwise, use the default parameters provided by the program.

 

Answer the following questions:

A. In the diagram that comes up, click on the yeast sequence that best matches the RecA query sequence. Identify the name and gi (GenInfo Identifier) of the highest-scoring sequence and its score in bits.

B. What scoring matrix and gap penalties were used?

C. What values of K and lambda were used for calculating the Expect scores for the gapped alignment? (Note that there are two sets of these parameters: one for ungapped and one for gapped alignments.) Where do these values come from?

D. How many database sequences were searched?

E. Is the alignment of the highest-scoring sequence with the RecA protein significant, and why? What biological information (protein structure and function) does this match suggest about the bacterial RecA protein and the yeast protein?

F. What was the lowest reported score in this search, and is this score significant?
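
The role of K and lambda in questions C, E, and F can be made concrete with the Karlin-Altschul relations: the bit score is S' = (lambda * S - ln K) / ln 2, and the Expect value is E = K * m * n * exp(-lambda * S), where S is the raw score and m and n are the query and database lengths. The sketch below implements just these two formulas. The function names are hypothetical, the numbers in the usage lines are placeholders rather than results from the RecA search, and BLAST itself uses length-corrected ("effective") values of m and n, so its reported E-values will not match exactly.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalised bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(raw_score, lam, k, query_len, db_len):
    """Expected number of chance hits scoring at least raw_score:
    E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Placeholder inputs: read lambda and K for gapped alignments from the
# statistics at the bottom of your own BLAST report.
lam, k = 0.267, 0.041
raw = 100
m, n = 350, 3_000_000   # query length and total database length (made up)

print("bit score:", round(bit_score(raw, lam, k), 1))
print("E-value:  ", expect_value(raw, lam, k, m, n))
```

Because E grows with the size of the search space, the same alignment is less significant in a large database than in a small one, which is relevant when you judge the lowest-scoring hit.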

 

Sequence databases often include experimental artifacts. They are known to contain vector sequences and other errors, including contaminants, chimeric sequences, and reading-frame shifts caused by insertions or deletions.

 

From a colleague you have obtained a stretch of DNA (see the sequence below) that is supposed to be from the bacterium Bacillus subtilis:

 

accgcacctgtggcgccggtgatgccggccacgatgcgtccggcgtagaggatcgagatctcgatcccgcgaaattaatacgactcactataggggaattgtgagcggataacaattcccctctagaaataattttgtttaactttaagaaggagatataccatgggacaatcgtttaacgcaccttatgaagcgattggagaggaacttctatcgcaacttgttgatactttttatgagcgtgtcgcgtctcatcctttgctgaagccgatttttccaagcgatttgacagaaaccgccaggaaacagaagcaattcttaactcagtatttaggcgggcctcctctttatactgaggaacacggccatcctatgctcagagcaaggcatcttccctttccaattacaaacgagagagctgatgcgtggctcagctgtatgaaggacgcaatggaccatgtagggctggagggcgaaattcgtgagtttttgtttggccggctggagttgacagcaaggcatatggtgaatcaaacggaagcggaggatcgatcatcttgacaagcttggatccggctgctaacaaagcccgaaaggaagctgagttggctgctgccaccgctgagcaataactagcataaccccttggggcctctaaacgggtcttgaggggttttttgctgaaaggaggaactatatccggatatcccgcaagaggcccggcagtaccggcataaccaagcctatgcctacagcatccagggtgacggtgccg

 

a) Is the information correct?

b) What gene is encoded on the fragment?

c) Which protein family is the gene product likely to be in?
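
One simple way to screen a DNA sequence for cloning-vector remnants, alongside a database search, is to look on both strands for short elements that are common in vectors. The sketch below does an exact-match search for a single motif and its reverse complement. The function names are hypothetical, the T7 promoter consensus is used only as an example of a vector element, and the test sequence is made up; exact matching will miss elements that carry mismatches.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(dna):
    """Reverse complement of a DNA string (a, c, g, t only)."""
    return dna.translate(COMPLEMENT)[::-1]


def find_motif_both_strands(dna, motif):
    """Return sorted (1-based start on the given strand, strand) hits."""
    dna, motif = dna.lower(), motif.lower()
    patterns = {"+": motif}
    rc = reverse_complement(motif)
    if rc != motif:               # avoid reporting palindromic motifs twice
        patterns["-"] = rc
    hits = []
    for strand, pattern in patterns.items():
        start = dna.find(pattern)
        while start != -1:
            hits.append((start + 1, strand))
            start = dna.find(pattern, start + 1)
    return sorted(hits)


# Example vector element: consensus T7 RNA polymerase promoter.
T7_PROMOTER = "TAATACGACTCACTATAGGG"

toy_sequence = "cc" + T7_PROMOTER.lower() + "cc"   # made-up test input
print(find_motif_both_strands(toy_sequence, T7_PROMOTER))   # [(3, '+')]
```

A hit on the "-" strand is reported at the position where the reverse-complemented motif starts on the strand you supplied.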

Using QuickGO, the GO browser at the EBI (http://www.ebi.ac.uk/ego/), research a protein or biological topic of interest to you. Browse up and down the GO trees. Now use the AmiGO browser (http://www.godatabase.org/cgi-bin/go.cgi) to view some of the same information. Compare and contrast the capabilities of each browser. For the protein you chose, is there a GO process, function, and location?

Using AmiGO Advanced Query, find the GO terms associated with specific gene products. Human genes can be found using LocusLink. Can you find a human ortholog of the protein you chose above?
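
"Browsing up" a GO tree means following a term to its parents until you reach the root. Because a term can have more than one parent, the structure is a directed acyclic graph rather than a strict tree. The sketch below collects all ancestors of a term in a small hand-written graph. The graph is a simplified illustration and is not taken from the real ontology files; the term names, the `PARENTS` table, and the `ancestors` function are assumptions made for this example.

```python
# Toy child -> parents table (simplified; not the real GO structure).
PARENTS = {
    "DNA repair": ["DNA metabolic process", "response to DNA damage"],
    "DNA metabolic process": ["metabolic process"],
    "response to DNA damage": ["response to stress"],
    "metabolic process": ["biological_process"],
    "response to stress": ["biological_process"],
    "biological_process": [],
}


def ancestors(term, parents):
    """Return every term reachable by repeatedly moving to a parent."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


print(sorted(ancestors("DNA repair", PARENTS)))
```

A gene product annotated with a term is implicitly associated with all of that term's ancestors, which is why a query on a general term in a GO browser also returns products annotated only with more specific terms.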