Annotation – Exercise 1 Answers

Pick three gene finding programs and submit the sequence.

Question 1: How many exons are in the unknown sequence?

Question 2: What are the start and end points for each exon?

Question 3: Do the three gene finding programs agree on the above answers or are there discrepancies?

Question 4: What other elements could you identify with these programs? Look for Poly A sites, GC content, etc.

Question 5: Can you translate the sequence into a protein? What is the length of the protein sequence?

Question 6: What else can you say about the putative protein sequence? (Molecular weight, other characteristic attributes, matches in a database, structure)

Answers:

Fgeneh [apparently down, possibly removed]

GeneID v1.1:

1.      3 exons

2.      from 6028-7042, 7820-7946, 8304-8402

3.      agreement with all 3 Genscan exons,  two HMMgene and one GrailExp exon

4.      no other features

5.      413 aa

6.      no other information
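
As a rough consistency check between answers 2 and 5, the exon lengths can be summed and divided by three. The sketch below is illustrative only: it assumes the coordinates are 1-based and inclusive, and the variable names are not taken from any of the programs.

```python
# Exon coordinates reported by GeneID (assumed 1-based, inclusive).
exons = [(6028, 7042), (7820, 7946), (8304, 8402)]

# The length of an inclusive interval is end - start + 1.
lengths = [end - start + 1 for start, end in exons]
total_bp = sum(lengths)

# Number of complete codons that fit in the summed exon length.
complete_codons = total_bp // 3

print(lengths)          # [1015, 127, 99]
print(total_bp)         # 1241
print(complete_codons)  # 413
```

The 413 complete codons are consistent with the 413 aa reported above. How the stop codon is counted is not stated in the program output, so this is a sanity check rather than a proof.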

Genscan 1.0 :

1.      3 exons

2.       coordinates: 6028-7042, 7820-7946, and 8304-8402

3.      agreement with GeneID, GrailExp and HMMgene

4.      PolyA

5.      413 amino acids predicted

6.      Blastp vs Swiss-prot matched APOA4; see the GrailExp answer below for further information.
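
Translating a predicted gene (Question 5) amounts to joining the exon sequences, reverse-complementing them if the gene lies on the minus strand, and reading codons with the standard genetic code. A minimal sketch follows. The function names and the toy sequence are made up for illustration; the real input would be the spliced exon sequence taken from the unknown sequence, and the code assumes it contains only A, C, G and T.

```python
from itertools import product

# Standard genetic code, built from the usual TCAG ordering of codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(dna.upper()))


def translate(cds):
    """Translate a coding sequence in frame 1, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Toy example (hypothetical sequence): a tiny gene stored on the minus strand.
minus_strand = "TTAGGCCAT"
print(translate(reverse_complement(minus_strand)))  # MA
```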

Grail 2:

1.      5 exons

2.      predicted coordinates:

a.       forward strand: 2836-2983 and 11479-11553

b.      reverse strand: 5332-6303, 4415-4541, 3959-4057

3.      little agreement, but large overlap with GrailExp on 1st forward-strand exon

4.      PolyA sites, repetitive DNA elements, CpG islands with GC content, ORFs

5.      343 aa in the largest predicted exon

6.      Blastp vs Swiss-prot for each of the reverse-strand exons returns APOA4 (see below for more information).  The forward-strand predictions return no hits.

GrailExp “Perceval” analysis predicts

1.      7 exon candidates, 2 with “marginal”, 2 “good”, and 3 “excellent” quality

2.      predicted locations:

a.       on plus strand: 631-798, 1460-1573

b.      on minus strand: 2851-2974, 4254-4299, 6058-7042, 7820-7946 and 8304-8402

3.      The penultimate minus-strand exon was also found by GeneID 1.1; the last minus-strand exon was also found by Genscan 1.0 and GeneID 1.1

4.      PolyA

5.      367 aa in the Gawain-predicted gene

6.      no matches to ESTs, but Blastp vs Swiss-prot matches human Apolipoprotein A-IV precursor, gi|114006|sp|P06727|APA4_HUMAN, and variants from other mammals. Much information is available at the above link on function, location, tissue, polymorphisms, domains, and similarity. According to the Swiss-prot entry itself, APOA4's molecular weight is 45371 Daltons before processing.

MZEF

1.      1 exon on strand 1, 3 on strand 2

2.      exon coordinates:

a.       strand 1: 6920-7144

b.      strand 2: 6400-7029, 6820-7946 and 8304-8402

3.      agrees with GeneID, Genscan, GrailExp on the last exon

4.      G+C content (0.505), ORFs, CDS

5.      no aa prediction

6.      no other information

Procrustes:  [not currently available, in transition to new location]

HMMgene:

1.      4 exons on the plus strand and 3 on the minus strand

2.      coordinates:

a.       plus strand: 1445-1573,6181-6333, 6388-7002 and 8549-8656

b.      minus strand: 6028-7042, 7820-7946, 8304-8352

3.      agrees with GeneID and Genscan on the first two minus-strand exons

4.      no other biologically relevant info

5.      no aa prediction

6.      no other biologically useful information
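
The agreement between programs (Question 3) can also be checked mechanically by treating each prediction as a set of (start, end) pairs. The sketch below uses the GeneID exons and the minus-strand exons reported above for GrailExp and HMMgene. The dictionary layout and the `overlaps` helper are assumptions made for illustration, not output of any of the programs.

```python
from itertools import combinations

# Exon coordinates copied from the answers above (start, end).
predictions = {
    "GeneID": {(6028, 7042), (7820, 7946), (8304, 8402)},
    "GrailExp (minus)": {(2851, 2974), (4254, 4299), (6058, 7042),
                         (7820, 7946), (8304, 8402)},
    "HMMgene (minus)": {(6028, 7042), (7820, 7946), (8304, 8352)},
}

# Exons with identical start and end in every program.
shared_by_all = set.intersection(*predictions.values())
print(sorted(shared_by_all))  # [(7820, 7946)]

# Exact matches for each pair of programs.
for (name_a, exons_a), (name_b, exons_b) in combinations(predictions.items(), 2):
    print(name_a, "vs", name_b, sorted(exons_a & exons_b))


def overlaps(a, b):
    """True if two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


# Exons can overlap without matching exactly, e.g. a differing 5' boundary.
print(overlaps((6028, 7042), (6058, 7042)))  # True
```

Exact matching is strict: two predictions that differ by a single base at one boundary count as different exons, which is why the overlap test is useful as a second view.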

Annotation – Exercise 2 Answers

Question 1: On which human chromosome can you find the adrenoleukodystrophy gene ALD?

Adrenoleukodystrophy (ALD) is an inherited neurodegenerative disorder.  The ALD gene is found on the X chromosome.

Question 2: Are there any SNPs in the gene ACE?

Yes, there are SNPs within the ACE gene.

Pick one gene/protein from Exercise 1 and locate it in the human genome:

APOA4 is on 11q23.

Annotation – Exercise 3 Answers

Question 1: Is the annotation consistent among the various genome browsers? 

Not entirely.  For instance, the chromosome begin/end positions vary slightly among browsers.  The predicted protein sequence varies in size (e.g. Ensembl: 382 aa; UCSC Golden Path: 396 aa) and in composition.  The reported MW also differs among browsers.

Question 2: Is your gene in the same location/chromosome region/arm in these genome browsers? 

At least three give the following position for the human apolipoprotein A-IV gene: 11q23.3.

Question 3: What other information can you extract from the genome browsers regarding your gene?  

For the APO A-IV gene:

Physical Properties:

Molecular Weight: 45371

Theoretical pI: 5.28.

The protein can be split into 30 peptides > 500 Dalton with masses up to 3537.7 Dalton using Trypsin.

A Kyte-Doolittle hydropathy plot shows that this protein may have a transmembrane region close to residue 20.

The most common amino acid in this protein sequence is Leucine, which comprises 14.1% of the protein.

The net charge of this protein is –3.5%.
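
Two of the properties above, the leucine content and the hydrophobic stretch near the N-terminus, can be reproduced with a few lines of code. This is a minimal sketch, not the method used by the web tools: it uses the APA4_HUMAN precursor sequence as listed in the Nnpredict section of these answers, the published Kyte-Doolittle scale values, and a plain average over the first 20 residues (the window length and the function name are arbitrary choices).

```python
from collections import Counter

# APA4_HUMAN precursor sequence, as listed in the Nnpredict section.
SEQ = (
    "MFLKAVVLTLALVAVAGARAEVSADQVATVMWDYFSQLSNNAKEAVEHLQKSELTQQLNA"
    "LFQDKLGEVNTYAGDLQKKLVPFATELHERLAKDSEKLKEEIGKELEELRARLLPHANEV"
    "SQKIGDNLRELQQRLEPYADQLRTQVNTQAEQLRRQLTPYAQRMERVLRENADSLQASLR"
    "PHADELKAKIDQNVEELKGRLTPYADEFKVKIDQTVEELRRSLAPYAQDTQEKLNHQLEG"
    "LTFQMKKNAEELKARISASAEELRQRLAPLAEDVRGNLKGNTEGLQKSLAELGGHLDQQV"
    "EEFRRRVEPYGENFNKALVQQMEQLRQKLGPHAGDVEGHLSFLEKDLRDKVNSFFSTFKE"
    "KESQDKTLSLPELEQQQEQQQEQQQEQVQMLAPLES"
)

# Kyte-Doolittle hydropathy scale (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(seq, start, window):
    """Average Kyte-Doolittle value over seq[start:start + window]."""
    segment = seq[start:start + window]
    return sum(KD[aa] for aa in segment) / len(segment)


counts = Counter(SEQ)
leu_pct = 100 * counts["L"] / len(SEQ)
print(len(SEQ), counts["L"], round(leu_pct, 1))  # 396 56 14.1

# A clearly positive average marks the N-terminal region as hydrophobic.
print(round(mean_hydropathy(SEQ, 0, 20), 2))
```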

Protein Identity

A search on AACompIdent using the amino acid percentages determined above pulls up the correct protein in SwissProt! 

MW in SwissProt listed as 43375.

The sequence in SwissProt with the most similar aa composition to the protein is PEPL_HUMAN: Periplakin.

Motifs & Patterns

BLOCKS:  Two blocks were found.  The closest match was Apolipoprotein (E-Value 2.6e-54); a second, much weaker match to the TFIIE beta subunit core domain was found with a higher E-Value of 0.31.

Pfam:  This search revealed the structure of the protein to be a pair of alpha-helices.

Secondary Structure and Folding Classes

Nnpredict:

Tertiary structure class: none 
Sequence sp|P06727|APA4_HUMAN Apolipoprotein A-IV precursor (Apo-AIV) - Homo sapiens (Human).:
MFLKAVVLTLALVAVAGARAEVSADQVATVMWDYFSQLSNNAKEAVEHLQKSELTQQLNA
LFQDKLGEVNTYAGDLQKKLVPFATELHERLAKDSEKLKEEIGKELEELRARLLPHANEV
SQKIGDNLRELQQRLEPYADQLRTQVNTQAEQLRRQLTPYAQRMERVLRENADSLQASLR
PHADELKAKIDQNVEELKGRLTPYADEFKVKIDQTVEELRRSLAPYAQDTQEKLNHQLEG
LTFQMKKNAEELKARISASAEELRQRLAPLAEDVRGNLKGNTEGLQKSLAELGGHLDQQV
EEFRRRVEPYGENFNKALVQQMEQLRQKLGPHAGDVEGHLSFLEKDLRDKVNSFFSTFKE
KESQDKTLSLPELEQQQEQQQEQQQEQVQMLAPLES
Secondary structure prediction (H = helix, E = strand, - = no prediction):
--HHHHHHHHHHHHHH--HHHH---HHHHHEHHHH-H----HHHHHHHHH---HHHHHHH
HHH-----E--HHHHHHHH--HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH-----HH
-HH----HHHHHHH------H-------HHHHHHH----HHHHHHHHHHHHHHHHHH---
---HHHHH---H-HHHH---------HHEE----HHHHHHHH--------HHHHHHHHHH
HHHHHHH-HHHHHHHHH--HHHHHHHH----HHH--------HHHHHHHHHH------HH
HHHH--------HH-HHHHHHHHHHHH-----------HHHHHHH--HHH----H-----
------------H-HH--------HHHHHHH-----

DOE-UCLA fold predictor: Broken (week of Feb. 24-28, 2003)

Swiss-Model, “first approach”: 

AlignMaster output
=========================================================
Length of target sequence: 396 residues
Searching sequences of known 3D structures
Found 1le4_.pdb with P(N)=2e-13
Found 1h7iA.pdb with P(N)=9e-13
Found 1lpe_.pdb with P(N)=2e-12
Found 1av1A.pdb with P(N)=3e-12
Found 1av1B.pdb with P(N)=3e-12
Found 1av1C.pdb with P(N)=3e-12
Found 1av1D.pdb with P(N)=3e-12
Found 1le2_.pdb with P(N)=5e-12
Found 1bz4A.pdb with P(N)=2e-11
Found 1ea8A.pdb with P(N)=3e-11
Found 1b68A.pdb with P(N)=5e-11
Found 1nfn_.pdb with P(N)=3e-08
Found 1or2A.pdb with P(N)=4e-08
Found 1or3A.pdb with P(N)=8e-08
Found 1nfo_.pdb with P(N)=1e-07
Extracting template sequences
Running pair-wise alignments with target sequence
Sequence identity of templates with target:
1le4_.pdb: 30.2 % identity
1h7iA.pdb: 29.5 % identity
1lpe_.pdb: 29.5 % identity
1av1A.pdb: 24.3 % identity
1av1B.pdb: 24.3 % identity
1av1C.pdb: 24.3 % identity
1av1D.pdb: 24.3 % identity
1le2_.pdb: 29.5 % identity
1bz4A.pdb: 28.7 % identity
1ea8A.pdb: 28.7 % identity
1b68A.pdb: 29.95 % identity
1nfn_.pdb: 28.8 % identity
1or2A.pdb: 27 % identity
1or3A.pdb: 28.1 % identity
1nfo_.pdb: 28.8 % identity
Looking for template groups
Global alignment overview:
Taget Sequence:   |=====================================================|
1le4_.pdb |    ------------------                                 
1h7iA.pdb |    ------------------                                 
1lpe_.pdb |    ------------------                                 
1av1A.pdb |          --------------------------                   
1av1B.pdb |          --------------------------                   
1av1C.pdb |          --------------------------                   
1av1D.pdb |          --------------------------                   
1le2_.pdb |    ------------------                                 
1bz4A.pdb |    ------------------                                 
1ea8A.pdb |    ------------------                                 
1b68A.pdb |    ------------------                                 
1nfn_.pdb |    ------------------                                 
1or2A.pdb |    ----------------                                   
1or3A.pdb |    ------------------                                 
1nfo_.pdb |    ------------------                                 
AlignMaster found 1 regions to model separately:
   1: Using template(s)   1av1A.pdb 1av1B.pdb 1av1C.pdb 1av1D.pdb 1b68A.pdb 1bz4A.pdb 1ea8A.pdb 1h7iA.pdb 1le2_.pdb 1le4_.pdb 1lpe_.pdb 1nfn_.pdb 1nfo_.pdb 1or2A.pdb 1or3A.pdb
ProModII trace log for Batch.1
=========================================================
ProModII: ProMod     version 3.5 date Jul 19 1999 17:15
ProModII: SPDBV      version 3.5
ProModII: Loop       version 2.60
ProModII: LoopDB     version 2.60
ProModII: Parameters version 3.5
ProModII: Topologies version 3.5
ProModII: Loading Template: 1le4_.pdb
ProModII: Loading Template: 1h7iA.pdb
ProModII: Loading Template: 1lpe_.pdb
ProModII: Loading Template: 1av1A.pdb
ProModII: Loading Template: 1av1B.pdb
ProModII: Loading Raw Sequence
ProModII: Iterative Template Fitting
ProModII: Iterative Template Fitting
ProModII: Iterative Template Fitting
ProModII: Protein does not fit well; Removing Layer 1av1A
ProModII: (if you really want to include this template, use the optimise mode)
ProModII: Iterative Template Fitting
ProModII: Protein does not fit well; Removing Layer 1av1B
ProModII: (if you really want to include this template, use the optimise mode)
ProModII: Generating Structural Alignment
ProModII: Aligning Raw Sequence
ProModII: Refining Raw Sequence Alignment
ProModII: C-terminal overhang trimmed for chain ' '. End at residue: 134
ProModII: adding blocking groups
ProModII: Weighting Backbones
ProModII: Averaging Sidechains
ProModII: Adding Missing Sidechains
ProModII: Optimizing Sidechains
ProModII: Dumping Preliminary Model
ProModII: Adding Hydrogens
ProModII: Optimizing loops and OXT (nb = 1)
ProModII: Final Total Energy:       -6413.914 KJ/mol
ProModII: Removing Hydrogens
ProModII: Fixing Atom Nomenclature
ProModII: Dumping Sequence Alignment
***

Predict Protein:

PROF results (normal)
         ....,....1....,....2....,....3....,....4....,....5....,....6
AA       MFLKAVVLTLALVAVAGARAEVSADQVATVMWDYFSQLSNNAKEAVEHLQKSELTQQLNA
PROF_sec   HHHHHHHHHHHHHH    HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
Rel_sec  913688888898885301003457778898888898765314777875312145577877
SUB_sec  L..HHHHHHHHHHHH.......HHHHHHHHHHHHHHHHH...HHHHHH.....HHHHHHH
P_3_acc  eeb bbbbbbbbbbbbbbebe b e b ebbb bbe b e beebbe beeeeb e b e
Rel_acc  230379999999643342242200329322903484472626354453322312262523
SUB_acc  ....bbbbbbbbbb..b..b......b...b..bbeib.e.b.ebbe........e.b..
  
         ....,....7....,....8....,....9....,....10.1.,....11.1.,....1
AA       LFQDKLGEVNTYAGDLQKKLVPFATELHERLAKDSEKLKEEIGKELEELRARLLPHANEV
PROF_sec HHHHHHHHHHHHHHHHHHH   HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH  HHHHH
Rel_sec  765113347788888875317446878887778888888887777888886312206788
SUB_sec  HHH.....HHHHHHHHHH..L..HHHHHHHHHHHHHHHHHHHHHHHHHHHH.....HHHH
P_3_acc   b eebeeb ebbeeb eebbe be bbe b e beeb e b e beebbe b e beeb
Rel_acc  353622546351555624202305336244537343544555273335815311220535
SUB_acc  .b.e..eeb.e.beeb.e.....b..b.eib.e.b.ebieib.e...eb.e......e.b
         ....,....13.1.,....14.1.,....15.1.,....16.1.,....17.1.,....1
AA       SQKIGDNLRELQQRLEPYADQLRTQVNTQAEQLRRQLTPYAQRMERVLRENADSLQASLR
PROF_sec HHHHHHHHHHHHHHH HHHHHHHHHHHHHHHHHHHH   HHHHHHHHHHHHHHHHHHH  
Rel_sec  788888888888631014787887777778888862153478888878764777764116
SUB_sec  HHHHHHHHHHHHH.....HHHHHHHHHHHHHHHHH..L..HHHHHHHHHH.HHHHH...L
P_3_acc   eeb e beebbe beebbe b e b e beeb eeb e be b e b ebbeeb e bb
Rel_acc  263426354631645221362745333423447273123016394536431643624300
SUB_acc  .e.b.e.bee..eib....e.bie...e..eeb.e......e.bie.bi..be.b.e...
         ....,....19.1.,....20.1.,....21.1.,....22.1.,....23.1.,....2
AA       PHADELKAKIDQNVEELKGRLTPYADEFKVKIDQTVEELRRSLAPYAQDTQEKLNHQLEG
PROF_sec   HHHHHHHHHHHHHHHHH    HHHHHHHHHHHHHHHHHHHH  HHHHHHHHHHHHHHH
Rel_sec  515888888887688776126740688778777667777753022246688765555776
SUB_sec  L.HHHHHHHHHHHHHHHH..LL..HHHHHHHHHHHHHHHHH......HHHHHHHHHHHHH
P_3_acc  b be b eeb e beebeee  e beeb eeb e beeb eeb e be b eeb ebbee
Rel_acc  001437443535337463731232343243363426546263112124342634330342
SUB_acc  ...e.bie.b.e..eeb.e......e..i..b.e.beeb.e......e.b.e.b....e.
                    
         ....,....25.1.,....26.1.,....27.1.,....28.1.,....29.1.,....3
AA       LTFQMKKNAEELKARISASAEELRQRLAPLAEDVRGNLKGNTEGLQKSLAELGGHLDQQV
PROF_sec HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH  HHHHHHHHHHHHHHHHHHH
Rel_sec  666456778888887767788787644347887876410235778887887643478888
SUB_sec  HHH.HHHHHHHHHHHHHHHHHHHHH....HHHHHHH.....HHHHHHHHHHH...HHHHH
P_3_acc  b e b e beebee b e beeb e b ebbe b eebeee eeb e beeb e beebb
Rel_acc  323343624464354533226473634333354736213221636263544415333444
SUB_acc  ....b.e.beeb.eib....eeb.e.b....eib.e......e.b.e.beeb.e...ebb
         ....,....31.1.,....32.1.,....33.1.,....34.1.,....35.1.,....3
AA       EEFRRRVEPYGENFNKALVQQMEQLRQKLGPHAGDVEGHLSFLEKDLRDKVNSFFSTFKE
PROF_sec HHHHHHHH  HHHHHHHHHHHHHHHHHHH       HHHHHHHHHHHHHHHHHHHHHHHH
Rel_sec  888875200004688988878888887624543540102667767477566557754420
SUB_sec  HHHHHH......HHHHHHHHHHHHHHHH..L..L.....HHHHHH.HHHHHHHHHH....
P_3_acc  eeb e beeb e b ebbbebbeeb eebeeeeee eeb e bbe b e bbbbbbbbee
Rel_acc  438363443016242409232734836312203220523332605563336035112454
SUB_acc  e.b.e.be...e.b.e.b...b.eb.e.........e.....b.eib...b..b...bee
         ....,....37.1.,....38.1.,....39.1.,.
AA       KESQDKTLSLPELEQQQEQQQEQQQEQVQMLAPLES
PROF_sec             HHHHHHHHHHHHHH          
Rel_sec  147665445782678777788866430122265558
SUB_sec  ..LLLL..LLL.HHHHHHHHHHHH.......LLLLL
P_3_acc  ee eeee e eeeee  eeeeeee eebee ee ee
Rel_acc  330322112202223122333132064023221267
SUB_acc  .........................ee.......ee

Full PredictProtein result for APOA4: http://pga.lbl.gov/Workshop/Feb2003/Classwork/predictproApoa4.htm

Jpred: uses a java-based viewer, Jalview.  To see Donn’s 2/25/2003 prediction for APOA4 (should be around for a few days only) try:

 http://www.compbio.dundee.ac.uk/~www-jpred/results/jp_0803849/jp_0803849.results.html 

PSIPred:

PSIPRED PREDICTION RESULTS
Key
Conf: Confidence (0=low, 9=high)
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
  AA: Target sequence
 
Conf: 928999999999971369888718999999999999998769999998888999999999
Pred: CHHHHHHHHHHHHHHCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
  AA: MFLKAVVLTLALVAVAGARAEVSADQVATVMWDYFSQLSNNAKEAVEHLQKSELTQQLNA
              10        20        30        40        50        60
Conf: 998769999999999999853378998888999999999998688999999999999999
Pred: HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
  AA: LFQDKLGEVNTYAGDLQKKLVPFATELHERLAKDSEKLKEEIGKELEELRARLLPHANEV
              70        80        90       100       110       120
Conf: 999999999999999999999999999999999998899999999999999999999999
Pred: HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
  AA: SQKIGDNLRELQQRLEPYADQLRTQVNTQAEQLRRQLTPYAQRMERVLRENADSLQASLR
             130       140       150       160       170       180
Conf: 999999999988899998778999999999999999999999999999999999999999
Pred: HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
  AA: PHADELKAKIDQNVEELKGRLTPYADEFKVKIDQTVEELRRSLAPYAQDTQEKLNHQLEG
             190       200       210       220       230       240
Conf: 999999999999999999999999999999999999999888776578999999999999
Pred: HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
  AA: LTFQMKKNAEELKARISASAEELRQRLAPLAEDVRGNLKGNTEGLQKSLAELGGHLDQQV
             250       260       270       280       290       300
Conf: 999999999999999999999999999889999999999999999999888999999999
Pred: HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
  AA: EEFRRRVEPYGENFNKALVQQMEQLRQKLGPHAGDVEGHLSFLEKDLRDKVNSFFSTFKE
             310       320       330       340       350       360
Conf: 986646655689999999999999999997266889
Pred: HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHCCCCC
  AA: KESQDKTLSLPELEQQQEQQQEQQQEQVQMLAPLES
             370       380       390    
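
A prediction string like the one above can be summarised as the fraction of residues assigned to helix. The sketch below is only an illustration: `helix_fraction` is a made-up helper name, and the two strings are the first and last Pred lines copied from the PSIPRED output.

```python
def helix_fraction(pred):
    """Fraction of positions labelled H in a secondary-structure string."""
    if not pred:
        return 0.0
    return pred.count("H") / len(pred)


# First and last Pred lines of the PSIPRED output above, copied verbatim.
first_block = "CHHHHHHHHHHHHHHCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH"
last_block = "HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHCCCCC"

for name, block in [("first", first_block), ("last", last_block)]:
    print(name, round(helix_fraction(block), 2))
```

Applied to all seven Pred lines joined together, the same function gives the overall helix content, which can then be compared with the Nnpredict and PROF results.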


Predator: [no longer available as a web service].