Web-exercises
Many sites on the Internet let you run different kinds of sequence analyses and sequence information searches. Below is a list of some sites that may be of interest, but many other sites do similar things in similar or different ways. Running these Web-based analyses is something of a lottery: a site may not be accessible, and sometimes an analysis takes a very long time. Don't forget to bookmark sites that are useful to you.
Databases / Sequence retrieval / Alignments etc.
NCBI http://www.ncbi.nlm.nih.gov/
EMBL http://www.embl-heidelberg.de/Services/index.html
DDBJ http://www.ddbj.nig.ac.jp/
EMBL-EBI http://www.ebi.ac.uk/services/index.html
GENESTREAM http://pdb.igh.cnrs.fr/
EMBnet http://www.ch.embnet.org/
NPS http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_server.html
ISREC http://www.isrec.isb-sib.ch/software/software.html
EXPASY http://us.expasy.org/
BLAST servers
BLAST2 at NCBI http://www.ncbi.nlm.nih.gov/BLAST/
BLAST2 at EMBL http://dove.embl-heidelberg.de/Blast2/
Pairwise alignments: http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html
Protein patterns and families
Pfam and PROSITE http://www.sanger.ac.uk/Software/Pfam/index.shtml
BLOCKS http://www.blocks.fhcrc.org/
PROSITE Scan http://us.expasy.org/tools/scanprosite/
GENE ONTOLOGY CONSORTIUM http://www.geneontology.org/
Exercises
The BLAST programs employ the SEG algorithm to filter low-complexity
regions from proteins before executing a database search.
a. How many low complexity regions can you find in the PAX-6 protein of humans?
b. Does this sequence contain any sequence motif?
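To get a feel for what "low complexity" means before looking at the BLAST output, the idea can be sketched with a sliding-window entropy measure. This is a simplified illustration, not the SEG algorithm itself: the function names are hypothetical, and the window length of 12 and cutoff of 2.2 bits are assumptions borrowed loosely from commonly quoted SEG settings.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits per residue) of the residue composition."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def low_complexity_regions(sequence, window=12, threshold=2.2):
    """Return merged (start, end) spans, 0-based and half-open, covered by
    windows whose entropy is at or below `threshold` (assumed cutoff)."""
    regions = []
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) <= threshold:
            end = start + window
            if regions and start <= regions[-1][1]:
                # Window overlaps or touches the previous span: extend it.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


if __name__ == "__main__":
    # Toy sequence: a run of alanines flanked by varied residues.
    toy = "MKTLVDEHRGWFNYQ" + "A" * 16 + "MKTLVDEHRGWFNYQ"
    for start, end in low_complexity_regions(toy):
        print(start + 1, end, toy[start:end])
```

The spans reported by a sketch like this will not match the regions that BLAST masks exactly, because SEG uses a different complexity measure and a second extension step.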
Pattern searching. Use this sequence:
MMTAKAVDKIPVTLSGFVHQLSDNIYPVEDLAATSVTIFPNAELGGPFDQ
MNGVAGDGMINIDMTGEKRSLDLPYPSSFAPVSAPRNQTFTYMGKFSIDP
QYPGASCYPEGIINIVSAGILQGVTSPASTTASSSVTSASPNPLATGPLG
VCTMSQTQPDLDHLYSPPPPPPPYSGCAGDLYQDPSAFLSAATTSTSSSL
AYPPPPSYPSPKPATDPGLFPMIPDYPGFFPSQCQRDLHGTAGPDRKPFP
CPLDTLRVPPPLTPLSTIRNFTLGGPSAGMTGPGASGGSEGPRLPGSSSAA
AAAAAAAAYNPHHLPLRPILRPRKYPNRPSKTPVHERPYPCPAEGCDRRFS
RSDELTRHIRIHTGHKPFQCRICMRNFSRSDHLTTHIRTHTGEKPFACDYCGR
KFARSDERKRHTKIHLRQKERKSSAPSASVPAPSTASCSGGVQPGGTLCSS
NSSSLGGGPLAPCSSRTRTP.
The PROSITE and Pfam databases contain a lot of information on protein
families and functional domains. These databases can often give a good
hint of the function of a particular protein. You can, for example, search for known
functional motifs and domains in your protein (or DNA).
a) Use the tool at "http://www.expasy.ch/tools/scnpsite.html" to find out if the
protein sequence contains any motifs from the PROSITE database. Follow the
different links to find out, for example, what function the motif could have and whether the
motif is present in other proteins.
b) Use the Pfam database at http://www.sanger.ac.uk/Software/Pfam/ to find
out if the protein sequence contains any motifs from this database. Follow the
different links to find out, for example, what function the motif could have and whether the
motif is present in other proteins.
c) Based on the results from these different analyses, which functional motifs/domains do you think the protein contains?
d) Find the "mystery" sequence in the database and compare the annotations
with your results.
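PROSITE patterns use a compact syntax that maps fairly directly onto regular expressions, which helps when reading the scan output. The sketch below translates a pattern and scans a sequence for it. The function names are hypothetical, the translation handles only the common syntax elements (x, [..], {..}, repeat counts, and the < and > anchors outside brackets), and the demonstration uses the N-glycosylation site pattern N-{P}-[ST]-{P} on a made-up toy sequence rather than the exercise sequence.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Simplified: anchors inside square brackets are not handled.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {ABC} means "any residue except A, B or C".
        element = element.replace("{", "[^").replace("}", "]")
        # (n) or (n,m) is a repeat count.
        element = element.replace("(", "{").replace(")", "}")
        # x is any residue; < and > anchor the N- and C-terminus.
        element = element.replace("x", ".")
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


def scan(sequence, pattern):
    """Return (1-based position, matched text) for every match,
    including overlapping ones."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    toy = "MNGSANPTQNRTA"  # made-up sequence for illustration
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(scan(toy, "N-{P}-[ST]-{P}"))
```

A short pattern like this one matches by chance in many proteins, which is one reason to compare the pattern hits with the domain hits from Pfam before drawing conclusions.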
Consider the following partial amino acid sequence of a protein from
Saccharomyces cerevisiae:
MSSVAENIIQ HATHNSTLHQ
a) What is the likely function of this protein?
b) What is the molecular weight (in kilodaltons) and predicted isoelectric
point (pI) for the protein?
c) Which chromosome is the gene located on?
d) Which genes are located upstream and downstream of the gene?
e) Does this protein have a sequence motif that belongs to a certain protein
group (family)? Give the name of the protein group and the sequence of this
motif.
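The values reported by the Web tools for question b) can be sanity-checked by hand. The following is a minimal sketch, assuming average residue masses and one commonly cited set of side-chain pKa values; different tools use slightly different pKa sets, so the pI in particular may differ somewhat from what a server reports. The function names are hypothetical. Note that the calculation must be run on the full-length protein retrieved from the database, not on the 20-residue fragment given above.

```python
# Average residue masses in daltons (free amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528

# Assumed pKa values; other published sets differ slightly.
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA, C_TERM_PKA = 8.6, 3.6


def clean(sequence):
    """Remove whitespace and convert to upper case."""
    return "".join(sequence.split()).upper()


def molecular_weight(sequence):
    """Average molecular weight in daltons of an unmodified chain."""
    return sum(RESIDUE_MASS[aa] for aa in clean(sequence)) + WATER


def net_charge(sequence, ph):
    """Net charge at a given pH (Henderson-Hasselbalch approximation)."""
    seq = clean(sequence)
    charge = 1 / (1 + 10 ** (ph - N_TERM_PKA)) - 1 / (1 + 10 ** (C_TERM_PKA - ph))
    for aa, pka in POSITIVE_PKA.items():
        charge += seq.count(aa) / (1 + 10 ** (ph - pka))
    for aa, pka in NEGATIVE_PKA.items():
        charge -= seq.count(aa) / (1 + 10 ** (pka - ph))
    return charge


def isoelectric_point(sequence, tolerance=0.001):
    """Find the pH at which the net charge is zero, by bisection."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2
        if net_charge(sequence, mid) > 0:
            low = mid   # still positively charged: pI is higher
        else:
            high = mid
    return round((low + high) / 2, 2)


if __name__ == "__main__":
    toy = "MKKDE"  # made-up peptide for illustration
    print(round(molecular_weight(toy) / 1000, 3), "kDa")
    print("pI", isoelectric_point(toy))
```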
Run the E. coli RecA protein against the yeast genome on the BLAST server.
Choose basic BLAST and carefully review the various option windows on the page
that comes up. Choose BLASTP as the choice of program and yeast as the sequence
database (all of the yeast proteins). Enter the sequence in FASTA format or
enter the PIR identifier of the query sequence, RQECA, into the input data window,
and indicate your choice in the small option window just above the input data
window. Otherwise, use the default parameters provided by the program.
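FASTA format is simply a header line starting with ">" followed by the sequence on one or more lines. If your sequence was copied from a page with spaces, digits or line breaks in it, a small helper can tidy it up before you paste it into the input window. This is a minimal sketch; the function name and the identifier "query1" are hypothetical, and the 60-character line width is a convention rather than a requirement.

```python
def to_fasta(identifier, sequence, description="", width=60):
    """Format a raw sequence as a FASTA record.

    Anything that is not a letter (spaces, digits, punctuation,
    line breaks) is dropped from the sequence.
    """
    cleaned = "".join(ch for ch in sequence if ch.isalpha()).upper()
    header = (">" + identifier + " " + description).rstrip()
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return "\n".join([header] + lines) + "\n"


if __name__ == "__main__":
    print(to_fasta("query1", "MSSVAENIIQ HATHNSTLHQ", "partial sequence"))
```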
Answer the following questions:
A. In the diagram that comes up, click on the yeast sequence that
best matches the RecA query sequence. Identify the name and GI (GenInfo Identifier) number
of the highest-scoring sequence and its score in bits.
B. What scoring matrix and gap penalties were used?
C. What values of K and λ (lambda) were used for calculating the Expect scores for the
gapped alignment (please note that there are two sets of these parameters: one
for ungapped and one for gapped alignments)? Where do these values come from?
D. How many database sequences were searched?
E. Is the alignment of the highest-scoring sequence with the RecA protein
significant, and why? What biological information (protein structure and
function) does this match suggest about the bacterial RecA protein and the yeast
protein?
F. What was the lowest reported score in this search, and is this score
significant?
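The relationship between the raw score, the bit score and the Expect value in the BLAST report follows the Karlin-Altschul statistics: E = K * m * n * exp(-lambda * S), and the bit score is S' = (lambda * S - ln K) / ln 2, so that E = m * n * 2^(-S'). The sketch below checks these relations numerically. The function names are hypothetical, and the K, lambda, raw score and length values in the demonstration are placeholders to be replaced with the numbers from your own BLAST output. BLAST also uses "effective" query and database lengths, so a calculation with the plain lengths will only approximate the reported E-values.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(raw_score, lam, k, query_len, db_len):
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def expect_from_bits(bits, query_len, db_len):
    """The same Expect value computed from the bit score."""
    return query_len * db_len * 2 ** (-bits)


if __name__ == "__main__":
    # Placeholder values: substitute the ones from your BLAST report.
    lam, k = 0.267, 0.041
    raw, m, n = 150, 353, 3_000_000
    bits = bit_score(raw, lam, k)
    print("bits:", round(bits, 1))
    print("E from raw score:", expect_value(raw, lam, k, m, n))
    print("E from bit score:", expect_from_bits(bits, m, n))
```

Because the bit score already incorporates K and lambda, bit scores from searches with different scoring systems can be compared directly, whereas raw scores cannot.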
Sequence databases often include experimental artifacts. They
are known to contain vector sequences and other errors, including
contaminants, chimeric sequences, and shifts in reading frame caused by insertions
or deletions.
You have obtained from a colleague a stretch of DNA (see sequence below) that is
supposed to be from the bacterium Bacillus subtilis.
a) Is the information correct?
b) What gene is encoded on the fragment?
c) Which protein family is the gene product likely to be in?
Using QuickGO, the GO browser at the EBI (http://www.ebi.ac.uk/ego/), research a protein or biological topic of interest to you. Browse up and down the GO trees. Now use the AmiGO browser (http://www.godatabase.org/cgi-bin/go.cgi) to view some of the same information. Compare and contrast the capabilities of each browser. For the protein you chose, is there a GO process, function, and location?
Using AmiGO Advanced Query, find the GO terms associated with specific gene
products. Human genes can be found using LocusLink. Can you find a human ortholog of the protein you chose above?