Alignment Exercise Set 4 Answers

 

1.  The BLAST programs employ the SEG algorithm to filter low-complexity regions from proteins before executing a database search.

 

a. How many low-complexity regions can you find in the PAX-6 protein of humans? Use Protein BLAST at NCBI.

 

Answer: 5 low-complexity regions. The results page shows an image of the detected motifs along the query, together with the Query ID. In the sequence comparison, the low-complexity regions of the query are masked and shown as runs of X.

 

Query: 1   HSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETG 60
           HSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETG
Sbjct: 5   HSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETG 64
 
Query: 61 SIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSIN 120
                SIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSIN
Sbjct: 65    SIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSIN 124
 
Query: 121 RVLRNLASEKQQMGADGMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPTXXXXXXXXXXX 180
               RVLRNLASEKQQMGADGMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPT           
Sbjct: 125 RVLRNLASEKQQMGADGMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPTQDGCQQQEGGG 184
 
Query: 181 XNTNSISSNGEDSDEAQMXXXXXXXXXXNRTSFTQEQIEALEKEFERTHYPDVFARERLA 240
                   NTNSISSNGEDSDEAQM                   NRTSFTQEQIEALEKEFERTHYPDVFARERLA
Sbjct: 185 ENTNSISSNGEDSDEAQMRLQLKRKLQRNRTSFTQEQIEALEKEFERTHYPDVFARERLA 244
 
Query: 241 AKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASNXXXXXXXXXXXXXXVYQPIPQPTT 300
                 AKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASN                          VYQPIPQPTT
Sbjct: 245  AKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASNTPSHIPISSSFSTSVYQPIPQPTT 304
 
Query: 301 PVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQPPVPSQTSSYSCMLPTSPSV 360
                 PVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQPPVPSQTSSYSCMLPTSPSV
Sbjct: 305  PVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQPPVPSQTSSYSCMLPTSPSV 364
 
Query: 361 NGRSYDTYTPPHMQTHMNSQPMXXXXXXXXXLIXXXXXXXXXXXXXXXDMSQYWPRLQ 418
                 NGRSYDTYTPPHMQTHMNSQPM                 LI                            DMSQYWPRLQ
Sbjct: 365  NGRSYDTYTPPHMQTHMNSQPMGTSGTTSTGLISPGVSVPVQVPGSEPDMSQYWPRLQ 422
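The count of masked regions can be checked by joining the Query rows and counting the runs of X. The sketch below is only a tally of the masking already present in the BLAST output, not an implementation of SEG. The variable and function names are illustrative, and the rows are copied from positions 121-418 of the alignment above.

```python
import re

# Query rows 121-418 copied from the alignment above (X = residue masked by the filter).
masked_query_rows = [
    "RVLRNLASEKQQMGADGMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPTXXXXXXXXXXX",
    "XNTNSISSNGEDSDEAQMXXXXXXXXXXNRTSFTQEQIEALEKEFERTHYPDVFARERLA",
    "AKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASNXXXXXXXXXXXXXXVYQPIPQPTT",
    "PVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQPPVPSQTSSYSCMLPTSPSV",
    "NGRSYDTYTPPHMQTHMNSQPMXXXXXXXXXLIXXXXXXXXXXXXXXXDMSQYWPRLQ",
]


def masked_runs(rows):
    """Return (start, end) offsets of each run of X in the joined rows."""
    query = "".join(rows)  # join first, so a run that wraps across two rows counts once
    return [(m.start(), m.end()) for m in re.finditer(r"X+", query)]


runs = masked_runs(masked_query_rows)
print(len(runs), "masked regions")
for start, end in runs:
    print("length", end - start)
```

Joining the rows before searching matters here, because the first masked region wraps from the end of one Query row onto the start of the next.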
 

 

 

b. Does this sequence contain any sequence motifs?

 

Answer: Yes, a homeobox domain and a paired box domain.

 

 

2.  Pattern searching. Use this sequence:

 

MMTAKAVDKIPVTLSGFVHQLSDNIYPVEDLAATSVTIFPNAELGGPFDQ

MNGVAGDGMINIDMTGEKRSLDLPYPSSFAPVSAPRNQTFTYMGKFSIDP

QYPGASCYPEGIINIVSAGILQGVTSPASTTASSSVTSASPNPLATGPLG

VCTMSQTQPDLDHLYSPPPPPPPYSGCAGDLYQDPSAFLSAATTSTSSSL

AYPPPPSYPSPKPATDPGLFPMIPDYPGFFPSQCQRDLHGTAGPDRKPFP

CPLDTLRVPPPLTPLSTIRNFTLGGPSAGMTGPGASGGSEGPRLPGSSSAA

AAAAAAAAYNPHHLPLRPILRPRKYPNRPSKTPVHERPYPCPAEGCDRRFS

RSDELTRHIRIHTGHKPFQCRICMRNFSRSDHLTTHIRTHTGEKPFACDYCGR

KFARSDERKRHTKIHLRQKERKSSAPSASVPAPSTASCSGGVQPGGTLCSS

NSSSLGGGPLAPCSSRTRTP.

 

The Prosite and Pfam databases contain a lot of information on protein families and functional domains. Often these databases can be used to get a good hint of the function of a particular protein. You can, for example, search for known functional motifs and domains in your protein (or DNA).

 

a) Use ScanProsite (http://www.expasy.ch/tools/scnpsite.html) to find out if the protein sequence contains any motifs from the Prosite database. Follow the different links to find out, for example, what function the motif could have and whether the motif is present in other proteins.

 

b) Use the Pfam database at http://www.sanger.ac.uk/Software/Pfam/ to find out if the protein sequence contains any motifs from this database. Follow the different links to find out, for example, what function the motif could have and whether the motif is present in other proteins.

 

c) Based on the results from these different analyses, which functional motifs/domains do you think the protein contains?

 

 

 

3.  Consider the following partial amino acid sequence of a protein from Saccharomyces cerevisiae:

 

MSSVAENIIQHATHNSTLHQ

 

a) What is the likely function of this protein?

 

Answer: Cytochrome P450 involved in the C-22 desaturation of the ergosterol side chain.

 

b) What is the molecular weight (in kilodaltons, kDa) and predicted isoelectric point (pI) for the protein?

 

Answer: Molecular weight 61334.39 daltons (about 61.3 kDa); theoretical pI 8.1. Use the services at ExPASy.
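As a rough illustration of how a sequence-based molecular weight can be computed, the sketch below sums average residue masses and adds one water molecule. The mass table holds standard average values that are supplied here as an assumption and are not taken from the exercise. Only the 20-residue fragment given above is used, so the printed mass is that of the fragment, not of the full protein. The function names are illustrative.

```python
# Average residue masses in daltons (standard values, supplied here as an assumption).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01524  # one water is added for the free N- and C-termini


def molecular_weight(sequence):
    """Average molecular weight in daltons of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER


def to_kda(daltons):
    """Convert daltons to kilodaltons."""
    return daltons / 1000.0


fragment = "MSSVAENIIQHATHNSTLHQ"  # partial sequence from the question
print("fragment mass (Da):", round(molecular_weight(fragment), 2))
print("full protein (kDa):", round(to_kda(61334.39), 1))
```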

 

c) Which chromosome is the gene located on?

 

Answer: Chromosome XIII; coordinates 302484-300868. One way to find this is to go to the SGD database at Stanford, which is devoted to Saccharomyces. Many organisms have community databases that are hand-curated. Another way is to go to the Genomes division of Entrez at NCBI.

 

d) Which genes are located upstream and downstream of the gene?

 

Answer: BUD22 and SOK2

 

e) Does this protein have a sequence motif that belongs to a certain protein group (family)? Give the name of the protein group and the sequence of this motif.

 

 

Answer: The protein belongs to the cytochrome P450 family. The PROSITE documentation entry PDOC00081 is reproduced below.

Cytochrome P450 cysteine heme-iron ligand signature

PROSITE cross-reference(s)

PS00086; CYTOCHROME_P450

Documentation

Cytochrome P450's [1,2,3,E1] are a group of enzymes involved in  the oxidative
metabolism  of a  high number of natural  compounds (such  as steroids,  fatty
acids, prostaglandins, leukotrienes, etc) as well  as drugs,  carcinogens  and
mutagens. Based on  sequence  similarities, P450's have  been  classified into
about forty different families [4,5].  P450's are proteins of 400 to 530 amino
acids; the only exception is Bacillus BM-3 (CYP102) which is a protein of 1048
residues that  contains  a  N-terminal  P450  domain  followed  by a reductase
domain. P450's  are  heme  proteins.  A  conserved  cysteine residue in the C-
terminal part  of  P450's  is  involved  in binding the heme iron in the fifth
coordination site.  From  a  region  around  this  residue, we developed a ten
residue signature specific to P450's.
 

Description

 

 

Consensus pattern

[FW]-[SGNH]-x-[GD]-x-[RKHPT]-x-C-[LIVMFAP]-[GAD] [C is the heme iron ligand]

Sequences known to belong to this class detected by the pattern

ALL, except for P450 IIB10 from mouse, which has Lys in the first position of the pattern.

Other sequence(s) detected in SWISS-PROT

9.

 

Note

The term 'cytochrome' P450, while commonly used, is incorrect, as P450s are not electron-transfer proteins; the appropriate name is P450 'heme-thiolate proteins'.
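To locate the motif sequence in a protein, the PROSITE consensus pattern above can be converted to a regular expression and used to scan the sequence. The sketch below handles only the basic pattern syntax (residue sets in square brackets, x for any residue, excluded residues in braces, and repeat counts in parentheses). The function name and the test peptide are made up for illustration. The full yeast protein sequence is not reproduced here.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern to a Python regular expression string."""
    regex_parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Split an element such as "x(2,4)" into its core and optional repeat counts.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                          # x = any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # {ABC} = any residue except A, B, C
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        regex_parts.append(core)
    return "".join(regex_parts)


P450_PATTERN = "[FW]-[SGNH]-x-[GD]-x-[RKHPT]-x-C-[LIVMFAP]-[GAD]"
p450_regex = prosite_to_regex(P450_PATTERN)

# Hypothetical peptide, made up to contain one match; not a real protein sequence.
test_peptide = "AAFGAGPRICLGAA"
hit = re.search(p450_regex, test_peptide)
print(p450_regex)
print(hit.start() + 1, hit.group())  # 1-based start position and matched motif
```

Running the same search on the full protein sequence retrieved from the database would give the position and sequence of the heme-binding signature asked for in part e).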

 

 

4.  Run the E. coli RecA protein against the yeast genome on the BLAST server. Choose basic BLAST and carefully review the various option windows on the page that comes up. Choose BLASTP as the program and yeast as the sequence database (all of the yeast proteins). Enter the sequence in FASTA format, or enter the PIR identifier of the query sequence, RECA, into the input data window, and indicate the choice in the small option window just above the input data window. Otherwise, use the default parameters provided by the program.

 

Answer the following questions:

a) In the diagram that comes up, click on the yeast sequence that best matches the RecA query sequence. Identify the name and GI number (GenInfo Identifier) of the highest-scoring sequence and its score in bits.

 

Answer: Rad51, gi|6320942; score 62.8 bits.

 

b) What scoring matrix and gap penalties were used?

 

Answer: BLOSUM62; -11 gap opening, -1 gap extension.

 

c) What values of K and lambda were used for calculating the Expect scores for the gapped alignment (please note that there are two sets of these parameters, one for ungapped and one for gapped alignments)? Where do these values come from?

 

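Answer:

Ungapped
Lambda     K      H
   0.314    0.134    0.367 
 
Gapped
Lambda     K      H
   0.267   0.0410    0.140 

These values are listed in the statistics at the end of the BLAST report. Lambda and K depend on the scoring matrix and the gap penalties. For ungapped alignments they are calculated from the scoring matrix and the residue frequencies; for gapped alignments they were estimated in advance by simulation with random sequences.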
 

 

d) How many database sequences were searched?

 

Answer: 6304 sequences from yeast.

 

e) Is the alignment of the highest scoring sequence with RecA protein significant and why? What biological information (protein structure and function) does this match suggest about the bacterial RecA protein and the yeast protein?

 

>pir||A44348 RAD51 protein - yeast (Saccharomyces cerevisiae)
 emb|CAA45563.1| similarities to procaryotic  RecA [Saccharomyces cerevisiae]
 dbj|BAA00913.1| Rad51 protein [Saccharomyces cerevisiae]
 ref|NP_011021.1| Involved in processing ds breaks, synaptonemal complex formation,
           meiotic gene conversion and reciprocal recombination.;
           Rad51p [Saccharomyces cerevisiae]
 sp|P25454|RA51_YEAST DNA repair protein RAD51
 gb|AAB64650.1| Rad51p: RecA-like protein [Saccharomyces cerevisiae]
 gb|AAA34948.1| RAD51 protein
          Length = 400
 
 Score = 62.8 bits (151), Expect = 8e-11
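 Identities = 62/217 (28%), Positives = 105/217 (48%), Gaps = 29/217 (13%)

Answer: Yes. The Expect value of 8e-11 is the number of alignments with this score or better expected by chance in a search of this database, and it is far below 1. The match suggests that the bacterial RecA protein and the yeast Rad51 protein are related proteins with similar roles in DNA repair and recombination.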

 

f) What was the lowest reported score in this search, and is this score significant?

 

 
>ref|NP_010287.1| Required for X-ray damage repair, meiotic recombination, wild-type
           levels of sporulation and viable spores; Rad57p
           [Saccharomyces cerevisiae]
 sp|P25301|RA57_YEAST DNA repair protein RAD57
 emb|CAA88064.1| Rad57p [Saccharomyces cerevisiae]
 gb|AAA34950.1| DNA repair protein
 pir||JQ1275 RAD57 protein - yeast (Saccharomyces cerevisiae)
          Length = 460
 
 Score = 37.7 bits (86), Expect = 0.003
 Identities = 36/133 (27%), Positives = 61/133 (45%), Gaps = 25/133 (18%)
 
Query: 38  ETISTGSLSLDIALGAGGLPMGRIVEIYGPESSGKTTLTLQVIAAAQRE------GKTCA 91
           E  +T  +++D  LG G    G I EI+G  S+GK+ L +Q+  + Q        G  C 
Sbjct: 98  ECFTTADVAMDELLGGGIFTHG-ITEIFGESSTGKSQLLMQLALSVQLSEPAGGLGGKCV 156
 
Query: 92  FIDAEHALD-----------PIYARKLGVDIDNLL---CSQPDTGEQAL--EICDALARS 135
           +I  E  L            P Y  KLG+   N+    C+     E  +  ++   L RS
Sbjct: 157 YITTEGDLPTQRLESMLSSRPAY-EKLGITQSNIFTVSCNDLINQEHIINVQLPILLERS 215
 
Query: 136 -GAVDVIVVDSVA 147
            G++ ++++DS++
Sbjct: 216 KGSIKLVIIDSIS 228

 

 

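Answer: Rad57, with a score of 37.7 bits (Expect = 0.003). The score is relatively low, but the Expect value is still well below 1 and the function of the protein (DNA repair and recombination) is similar to that of RecA, so the match is likely to be meaningful.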
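The bit scores reported for Rad51 and Rad57 can be reproduced from the raw scores shown in parentheses in the BLAST output (151 and 86) and the gapped Lambda and K values, using S' = (lambda * S - ln K) / ln 2. The sketch below is a minimal check of that relationship. The function names are illustrative, and the effective search space used by the server is not given in the report, so it is left as a parameter rather than guessed.

```python
import math

# Gapped statistical parameters reported by BLAST in this search.
GAPPED_LAMBDA = 0.267
GAPPED_K = 0.0410


def bit_score(raw_score, lam=GAPPED_LAMBDA, k=GAPPED_K):
    """Normalized bit score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect(bits, search_space):
    """E = search_space * 2**(-S'); search_space is the effective m * n (not given here)."""
    return search_space * 2.0 ** (-bits)


print(round(bit_score(151), 1))  # Rad51, raw score 151
print(round(bit_score(86), 1))   # Rad57, raw score 86
```

The two printed values agree with the 62.8 and 37.7 bits shown in the report, which confirms that the gapped parameters, not the ungapped ones, were used for these alignments.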

 

5.  In many cases sequence databases include experimental artifacts. Databases are known to include vector sequences and other sequencing errors including contaminants, chimeric sequences, and shifts in reading frame due to insertions or deletions.

 

From a colleague you have obtained a stretch of DNA (see sequence below) that is supposed to be from the bacterium Bacillus subtilis:

 

accgcacctgtggcgccggtgatgccggccacgatgcgtccggcgtagaggatcgagatctcgatcccgcgaaattaatacgactcactataggggaattgtgagcggataacaattcccctctagaaataattttgtttaactttaagaaggagatataccatgggacaatcgtttaacgcaccttatgaagcgattggagaggaacttctatcgcaacttgttgatactttttatgagcgtgtcgcgtctcatcctttgctgaagccgatttttccaagcgatttgacagaaaccgccaggaaacagaagcaattcttaactcagtatttaggcgggcctcctctttatactgaggaacacggccatcctatgctcagagcaaggcatcttccctttccaattacaaacgagagagctgatgcgtggctcagctgtatgaaggacgcaatggaccatgtagggctggagggcgaaattcgtgagtttttgtttggccggctggagttgacagcaaggcatatggtgaatcaaacggaagcggaggatcgatcatcttgacaagcttggatccggctgctaacaaagcccgaaaggaagctgagttggctgctgccaccgctgagcaataactagcataaccccttggggcctctaaacgggtcttgaggggttttttgctgaaaggaggaactatatccggatatcccgcaagaggcccggcagtaccggcataaccaagcctatgcctacagcatccagggtgacggtgccg

 

a) Is the information correct?

 

Answer: Only partly. The sequence contains B. subtilis DNA, but it also contains vector sequence.
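One quick way to spot vector sequence is to look for features that are common in expression vectors. The sketch below searches an excerpt of the DNA above for two well-known motifs, the T7 promoter and the lac operator. The motif table and the function name are illustrative assumptions; a real screen would use a dedicated vector database search such as VecScreen at NCBI.

```python
# Common expression-vector features (well-known sequences, supplied here as an assumption).
VECTOR_FEATURES = {
    "T7 promoter": "taatacgactcactataggg",
    "lac operator": "aattgtgagcggataacaatt",
}

# Excerpt copied from the colleague's sequence above, just upstream of the first atg.
excerpt = (
    "cgatcccgcgaaattaatacgactcactataggggaattgtgagcggataacaattcccc"
    "tctagaaataattttgtttaactttaagaaggagatataccatgggacaatcg"
)


def find_features(dna, features):
    """Return {feature name: 0-based position} for every feature found in dna."""
    dna = dna.lower()
    hits = {}
    for name, motif in features.items():
        position = dna.find(motif)
        if position != -1:
            hits[name] = position
    return hits


hits = find_features(excerpt, VECTOR_FEATURES)
for name, position in sorted(hits.items(), key=lambda item: item[1]):
    print(name, "found at offset", position)
```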

 

b) What gene is encoded on the fragment?

 

Answer: yjbI
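To identify the gene, the open reading frame that follows the vector sequence can be translated and used as a protein BLAST query. The sketch below translates the first codons after the atg in the sequence above with the standard genetic code. The function name is illustrative, and only a short stretch of the reading frame is shown.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate dna in frame 1, stopping at the first stop codon or incomplete codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Start of the reading frame, copied from the sequence above.
orf_start = "atgggacaatcgtttaacgcaccttatgaagcg"
peptide = translate(orf_start)
print(peptide)
```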

 

c) Which protein family is the gene product likely to be in?

 

Answer: GLOBINS

 

6. Using QuickGO, the GO browser at the EBI (http://www.ebi.ac.uk/ego/), research a protein or biological topic of interest to you. Browse up and down the GO trees. Now use the AmiGO browser (http://www.godatabase.org/cgi-bin/go.cgi) to view some of the same information. Compare and contrast the capabilities of each browser. For the protein you chose, is there a GO process, function, and location?

 

Using AmiGO Advanced Query, find the GO terms associated with specific gene products. Human genes can be found using LocusLink. Can you find a human ortholog of the protein you chose above?