Pick three gene finding programs and submit the sequence.
Question 1: How many exons are in the unknown sequence?
Question 2: What are the start and end points for each exon?
Question 3: Do the three gene finding programs agree on the above answers or are there discrepancies?
Question 4: What other elements could you identify with these programs? Look for Poly A sites, GC content, etc.
Question 5: Can you translate the sequence into a protein? What is the length of the protein sequence?
Question 6: What else can you say about the putative protein sequence? (Molecular weight, other characteristic attributes, matches in a database, structure)
Answers:
Fgenesh [apparently down, possibly removed]
GeneID v1.1:
1. 3 exons
2. from 6028-7042, 7820-7946, 8304-8402
3. agreement with all 3 Genscan exons, two HMMgene and one GrailExp exon
4. no other features
5. 413 aa
6. no other information
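As a rough consistency check on answers 2 and 5, the exon lengths can be summed and divided by three. This is only a sketch: it assumes the coordinates are 1-based and inclusive, which the answers above do not state, and the names used are illustrative.

```python
# Exon coordinates reported by GeneID above (assumed 1-based, inclusive).
GENEID_EXONS = [(6028, 7042), (7820, 7946), (8304, 8402)]


def exon_length(start, end):
    """Length of an exon given inclusive start and end coordinates."""
    return end - start + 1


lengths = [exon_length(start, end) for start, end in GENEID_EXONS]
total_nt = sum(lengths)
whole_codons, leftover_nt = divmod(total_nt, 3)

if __name__ == "__main__":
    print("exon lengths:", lengths)
    print("total coding length (nt):", total_nt)
    print("whole codons:", whole_codons, "leftover nt:", leftover_nt)
```

Under these assumptions the three exons give 413 whole codons, in line with the 413 aa reported. The two leftover nucleotides suggest the coordinate convention, or the handling of the stop codon, differs slightly from what is assumed here.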
Genscan 1.0 :
1. 3 exons
2. coordinates: 6028-7042, 7820-7946, and 8304-8402
3. agreement with GeneID, GrailExp and HMMGene
4. PolyA
5. 413 amino acids predicted
6. Blastp vs Swiss-prot matched APOA4; see the GrailExp answers below for further info.
Grail 2:
1. 5 exons
2. predicted coordinates:
a. forward strand: 2836-2983 and 11479-11553
b. reverse strand: 5332-6303, 4415-4541, 3959-4057
3. little agreement, but large overlap with GrailExp on 1st forward-strand exon
4. PolyA sites, repetitive DNA elements, CpG islands with GC content, ORFs
5. 343 aa in the largest predicted exon
6. Blastp vs Swiss-prot for each of the reverse-strand exons returns APOA4 (see below for more info). The forward-strand predictions return no hits.
GrailExp “Perceval” analysis predicts
1. 7 exon candidates, 2 with “marginal”, 2 “good”, and 3 “excellent” quality
2. predicted locations:
a. on plus strand: 631-798, 1460-1573
b. on minus strand: 2851-2974, 4254-4299, 6058-7042, 7820-7946 and 8304-8402
3. The penultimate minus-strand exon was also found by GeneID 1.1; the last minus-strand exon was also found by Genscan 1.0 and GeneID 1.1
4. PolyA
5. 367 aa in the Gawain-predicted gene
6. no matches to ESTs, but Blastp vs Swiss-prot matches human Apolipoprotein A-IV precursor, gi|114006|sp|P06727|APA4_HUMAN and variants from other mammals. Much information available at the above link on function, location, tissue, polymorphisms, domains, similarity. From the actual Swiss-prot entry, APOA4’s molecular wt is 45371 Daltons before processing.
MZEF
1. 1 exon on strand 1, 3 on strand 2
2. exon coordinates:
a. strand 1: 6920-7144
b. strand 2: 6400-7029, 6820-7946 and 8304-8402
3. agrees with GeneID, Genscan and GrailExp on the last exon
4. G+C content (0.505), ORFs, CDS
5. no aa prediction
6. no other information
Procrustes: [not currently available, in transition to new location]
HMMgene:
1. 4 exons on the plus strand and 3 on the minus strand
2. coordinates:
a. plus strand: 1445-1573,6181-6333, 6388-7002 and 8549-8656
b. minus strand: 6028-7042, 7820-7946, 8304-8352
3. agrees with GeneID and Genscan on the first two minus-strand exons
4. no other biologically relevant info
5. no aa prediction
6. no other biologically useful information
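The agreement checks in answer 3 can be done mechanically. The sketch below uses a subset of the coordinates listed above and reports, for each GeneID exon, which other programs predict exactly the same interval and which only overlap it. Strand labels are ignored, and the dictionary keys and function names are illustrative choices, not part of any tool's output.

```python
# Exon coordinates copied from the answers above (subset of programs).
# Strand is ignored here; exons are compared as plain inclusive intervals.
PREDICTIONS = {
    "GeneID": [(6028, 7042), (7820, 7946), (8304, 8402)],
    "GrailExp (minus)": [(2851, 2974), (4254, 4299), (6058, 7042),
                         (7820, 7946), (8304, 8402)],
    "HMMgene (minus)": [(6028, 7042), (7820, 7946), (8304, 8352)],
    "MZEF (strand 2)": [(6400, 7029), (6820, 7946), (8304, 8402)],
}


def overlaps(a, b):
    """True if two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def compare(reference, predictions):
    """For each reference exon, list programs with exact and partial matches."""
    report = {}
    for exon in predictions[reference]:
        exact, partial = [], []
        for program, exons in predictions.items():
            if program == reference:
                continue
            if exon in exons:
                exact.append(program)
            elif any(overlaps(exon, other) for other in exons):
                partial.append(program)
        report[exon] = (exact, partial)
    return report


if __name__ == "__main__":
    for exon, (exact, partial) in compare("GeneID", PREDICTIONS).items():
        print(exon, "exact:", exact, "overlap only:", partial)
```

Separating exact matches from overlaps is useful here because several programs agree on one exon boundary but not the other.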
Question 1: On which human chromosome can you find the adrenoleukodystrophy gene ALD?
Adrenoleukodystrophy (ALD) is an inherited neurodegenerative disorder. The ALD gene is found on the X chromosome.
Question 2: Are there any SNPs in the gene ACE?
Yes, there are SNPs within the ACE gene.
Pick one gene/protein from Exercise 1 and locate it in the human genome:
APOA4 is on 11q23.
Question 1: Is the annotation consistent among the various genome browsers?
Not entirely. For instance, the chromosome begin/end position varies slightly among browsers. The predicted protein sequence varies in size (e.g., Ensembl: 382 aa; UCSC Golden Path: 396 aa) and in composition. The reported MW also differs among browsers.
Question 2: Is your gene in the same location/chromosome region/arm in these genome browsers?
At least three give the following position for the human apolipoprotein A-IV gene: 11q23.3.
Question 3: What other information can you extract from the genome browsers regarding your gene?
For the APO A-IV gene:
Physical Properties:
Molecular Weight: 45371
Theoretical pI: 5.28.
The protein can be split into 30 peptides > 500 Dalton with masses up to 3537.7 Dalton using Trypsin.
A Kyte-Doolittle hydropathy plot shows that this protein may have a transmembrane region near residue 20.
The commonest aa in this protein sequence is Leucine, which comprises 14.1% of the protein.
The net charge of this protein is –3.5%.
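The composition and hydropathy figures above came from web tools, but both can be approximated in a few lines. The following is a minimal sketch, not the method those tools use: it takes the APOA4 precursor sequence as listed in this document, the published Kyte-Doolittle scale, and a 9-residue window. The window size and the function names are assumptions.

```python
from collections import Counter

# APOA4 precursor sequence as listed in this document (sp|P06727|APA4_HUMAN).
APOA4 = (
    "MFLKAVVLTLALVAVAGARAEVSADQVATVMWDYFSQLSNNAKEAVEHLQKSELTQQLNA"
    "LFQDKLGEVNTYAGDLQKKLVPFATELHERLAKDSEKLKEEIGKELEELRARLLPHANEV"
    "SQKIGDNLRELQQRLEPYADQLRTQVNTQAEQLRRQLTPYAQRMERVLRENADSLQASLR"
    "PHADELKAKIDQNVEELKGRLTPYADEFKVKIDQTVEELRRSLAPYAQDTQEKLNHQLEG"
    "LTFQMKKNAEELKARISASAEELRQRLAPLAEDVRGNLKGNTEGLQKSLAELGGHLDQQV"
    "EEFRRRVEPYGENFNKALVQQMEQLRQKLGPHAGDVEGHLSFLEKDLRDKVNSFFSTFKE"
    "KESQDKTLSLPELEQQQEQQQEQQQEQVQMLAPLES"
)

# Kyte-Doolittle hydropathy values (Kyte and Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq):
    """Fraction of each residue type, most frequent first."""
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.most_common()}


def hydropathy(seq, window=9):
    """Mean Kyte-Doolittle score of every full window along the sequence."""
    scores = [KD[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(seq) - window + 1)]


if __name__ == "__main__":
    print("length:", len(APOA4))
    for aa, frac in list(composition(APOA4).items())[:3]:
        print(f"{aa}: {frac:.1%}")
    profile = hydropathy(APOA4)
    peak = max(range(len(profile)), key=profile.__getitem__)
    print("most hydrophobic window starts at residue", peak + 1)
```

A wider window (for example 19) is often preferred when looking specifically for membrane-spanning segments; 9 is used here only to keep the profile responsive over a short N-terminal stretch.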
Protein Identity
A search on AACompIdent using the amino acid percentages determined above pulls up the correct protein in SwissProt!
MW in SwissProt listed as 43375.
The sequence in SwissProt with the most similar aa composition to the protein is PEPL_HUMAN: Periplakin.
Motifs & Patterns
BLOCKS: Two blocks were found. The closest match was Apolipoprotein (E-Value 2.6e-54), but a second match of TFIIE beta subunit core domain was found with a higher E-Value of 0.31.
Pfam: This search revealed the structure of the protein to be a pair of alpha-helices.
Secondary Structure and Folding Classes
Nnpredict:
Tertiary structure class: none
Sequence sp|P06727|APA4_HUMAN Apolipoprotein A-IV precursor (Apo-AIV) - Homo sapiens (Human).:
MFLKAVVLTLALVAVAGARAEVSADQVATVMWDYFSQLSNNAKEAVEHLQKSELTQQLNA
LFQDKLGEVNTYAGDLQKKLVPFATELHERLAKDSEKLKEEIGKELEELRARLLPHANEV
SQKIGDNLRELQQRLEPYADQLRTQVNTQAEQLRRQLTPYAQRMERVLRENADSLQASLR
PHADELKAKIDQNVEELKGRLTPYADEFKVKIDQTVEELRRSLAPYAQDTQEKLNHQLEG
LTFQMKKNAEELKARISASAEELRQRLAPLAEDVRGNLKGNTEGLQKSLAELGGHLDQQV
EEFRRRVEPYGENFNKALVQQMEQLRQKLGPHAGDVEGHLSFLEKDLRDKVNSFFSTFKE
KESQDKTLSLPELEQQQEQQQEQQQEQVQMLAPLES
Secondary structure prediction (H = helix, E = strand, - = no prediction):
--HHHHHHHHHHHHHH--HHHH---HHHHHEHHHH-H----HHHHHHHHH---HHHHHHH
HHH-----E--HHHHHHHH--HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH-----HH
-HH----HHHHHHH------H-------HHHHHHH----HHHHHHHHHHHHHHHHHH---
---HHHHH---H-HHHH---------HHEE----HHHHHHHH--------HHHHHHHHHH
HHHHHHH-HHHHHHHHH--HHHHHHHH----HHH--------HHHHHHHHHH------HH
HHHH--------HH-HHHHHHHHHHHH-----------HHHHHHH--HHH----H-----
------------H-HH--------HHHHHHH-----
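To summarize a per-residue prediction like the one above, the states can simply be tallied. This sketch copies the nnpredict string from this document and reports the fraction of each state; the function name is illustrative.

```python
from collections import Counter

# nnpredict output copied from above (H = helix, E = strand, - = no prediction).
NNPREDICT = (
    "--HHHHHHHHHHHHHH--HHHH---HHHHHEHHHH-H----HHHHHHHHH---HHHHHHH"
    "HHH-----E--HHHHHHHH--HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH-----HH"
    "-HH----HHHHHHH------H-------HHHHHHH----HHHHHHHHHHHHHHHHHH---"
    "---HHHHH---H-HHHH---------HHEE----HHHHHHHH--------HHHHHHHHHH"
    "HHHHHHH-HHHHHHHHH--HHHHHHHH----HHH--------HHHHHHHHHH------HH"
    "HHHH--------HH-HHHHHHHHHHHH-----------HHHHHHH--HHH----H-----"
    "------------H-HH--------HHHHHHH-----"
)


def state_fractions(prediction):
    """Fraction of residues assigned to each predicted state."""
    counts = Counter(prediction)
    return {state: n / len(prediction) for state, n in counts.items()}


if __name__ == "__main__":
    for state, frac in sorted(state_fractions(NNPREDICT).items()):
        print(f"{state}: {frac:.1%}")
```

The same function can be applied to the Pred lines of other predictors, provided their state alphabets (for example C for coil) are kept in mind when comparing.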
DOE-UCLA fold predictor: Broken (week of Feb. 24-28, 2003)
Swiss-Model, “first approach”:
AlignMaster output
=========================================================
Length of target sequence: 396 residues
Searching sequences of known 3D structures
Found 1le4_.pdb with P(N)=2e-13
Found 1h7iA.pdb with P(N)=9e-13
Found 1lpe_.pdb with P(N)=2e-12
Found 1av1A.pdb with P(N)=3e-12
Found 1av1B.pdb with P(N)=3e-12
Found 1av1C.pdb with P(N)=3e-12
Found 1av1D.pdb with P(N)=3e-12
Found 1le2_.pdb with P(N)=5e-12
Found 1bz4A.pdb with P(N)=2e-11
Found 1ea8A.pdb with P(N)=3e-11
Found 1b68A.pdb with P(N)=5e-11
Found 1nfn_.pdb with P(N)=3e-08
Found 1or2A.pdb with P(N)=4e-08
Found 1or3A.pdb with P(N)=8e-08
Found 1nfo_.pdb with P(N)=1e-07
Extracting template sequences
Running pair-wise alignments with target sequence
Sequence identity of templates with target:
1le4_.pdb: 30.2 % identity
1h7iA.pdb: 29.5 % identity
1lpe_.pdb: 29.5 % identity
1av1A.pdb: 24.3 % identity
1av1B.pdb: 24.3 % identity
1av1C.pdb: 24.3 % identity
1av1D.pdb: 24.3 % identity
1le2_.pdb: 29.5 % identity
1bz4A.pdb: 28.7 % identity
1ea8A.pdb: 28.7 % identity
1b68A.pdb: 29.95 % identity
1nfn_.pdb: 28.8 % identity
1or2A.pdb: 27 % identity
1or3A.pdb: 28.1 % identity
1nfo_.pdb: 28.8 % identity
Looking for template groups
Global alignment overview:
Target Sequence: |=====================================================|
1le4_.pdb | ------------------
1h7iA.pdb | ------------------
1lpe_.pdb | ------------------
1av1A.pdb | --------------------------
1av1B.pdb | --------------------------
1av1C.pdb | --------------------------
1av1D.pdb | --------------------------
1le2_.pdb | ------------------
1bz4A.pdb | ------------------
1ea8A.pdb | ------------------
1b68A.pdb | ------------------
1nfn_.pdb | ------------------
1or2A.pdb | ----------------
1or3A.pdb | ------------------
1nfo_.pdb | ------------------
AlignMaster found 1 regions to model separately:
1: Using template(s) 1av1A.pdb 1av1B.pdb 1av1C.pdb 1av1D.pdb 1b68A.pdb 1bz4A.pdb 1ea8A.pdb 1h7iA.pdb 1le2_.pdb 1le4_.pdb 1lpe_.pdb 1nfn_.pdb 1nfo_.pdb 1or2A.pdb 1or3A.pdb
ProModII trace log for Batch.1
=========================================================
ProModII: ProMod version 3.5 date Jul 19 1999 17:15
ProModII: SPDBV version 3.5
ProModII: Loop version 2.60
ProModII: LoopDB version 2.60
ProModII: Parameters version 3.5
ProModII: Topologies version 3.5
ProModII: Loading Template: 1le4_.pdb
ProModII: Loading Template: 1h7iA.pdb
ProModII: Loading Template: 1lpe_.pdb
ProModII: Loading Template: 1av1A.pdb
ProModII: Loading Template: 1av1B.pdb
ProModII: Loading Raw Sequence
ProModII: Iterative Template Fitting
ProModII: Iterative Template Fitting
ProModII: Iterative Template Fitting
ProModII: Protein does not fit well; Removing Layer 1av1A
ProModII: (if you really want to include this template, use the optimise mode)
ProModII: Iterative Template Fitting
ProModII: Protein does not fit well; Removing Layer 1av1B
ProModII: (if you really want to include this template, use the optimise mode)
ProModII: Generating Structural Alignment
ProModII: Aligning Raw Sequence
ProModII: Refining Raw Sequence Alignment
ProModII: C-terminal overhang trimmed for chain ' '. End at residue: 134
ProModII: adding blocking groups
ProModII: Weighting Backbones
ProModII: Averaging Sidechains
ProModII: Adding Missing Sidechains
ProModII: Optimizing Sidechains
ProModII: Dumping Preliminary Model
ProModII: Adding Hydrogens
ProModII: Optimizing loops and OXT (nb = 1)
ProModII: Final Total Energy: -6413.914 KJ/mol
ProModII: Removing Hydrogens
ProModII: Fixing Atom Nomenclature
ProModII: Dumping Sequence Alignment
***
Predict Protein:
PROF results (normal)

         ....,....1....,....2....,....3....,....4....,....5....,....6
AA       MFLKAVVLTLALVAVAGARAEVSADQVATVMWDYFSQLSNNAKEAVEHLQKSELTQQLNA
PROF_sec HHHHHHHHHHHHHH HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
Rel_sec  913688888898885301003457778898888898765314777875312145577877
SUB_sec  L..HHHHHHHHHHHH.......HHHHHHHHHHHHHHHHH...HHHHHH.....HHHHHHH
P_3_acc  eeb bbbbbbbbbbbbbbebe b e b ebbb bbe b e beebbe beeeeb e b e
Rel_acc  230379999999643342242200329322903484472626354453322312262523
SUB_acc  ....bbbbbbbbbb..b..b......b...b..bbeib.e.b.ebbe........e.b..

         ....,....7....,....8....,....9....,....10.1.,....11.1.,....1
AA       LFQDKLGEVNTYAGDLQKKLVPFATELHERLAKDSEKLKEEIGKELEELRARLLPHANEV
PROF_sec HHHHHHHHHHHHHHHHHHH HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH HHHHH
Rel_sec  765113347788888875317446878887778888888887777888886312206788
SUB_sec  HHH.....HHHHHHHHHH..L..HHHHHHHHHHHHHHHHHHHHHHHHHHHH.....HHHH
P_3_acc  b eebeeb ebbeeb eebbe be bbe b e beeb e b e beebbe b e beeb
Rel_acc  353622546351555624202305336244537343544555273335815311220535
SUB_acc  .b.e..eeb.e.beeb.e.....b..b.eib.e.b.ebieib.e...eb.e......e.b

         ....,....13.1.,....14.1.,....15.1.,....16.1.,....17.1.,....1
AA       SQKIGDNLRELQQRLEPYADQLRTQVNTQAEQLRRQLTPYAQRMERVLRENADSLQASLR
PROF_sec HHHHHHHHHHHHHHH HHHHHHHHHHHHHHHHHHHH HHHHHHHHHHHHHHHHHHH
Rel_sec  788888888888631014787887777778888862153478888878764777764116
SUB_sec  HHHHHHHHHHHHH.....HHHHHHHHHHHHHHHHH..L..HHHHHHHHHH.HHHHH...L
P_3_acc  eeb e beebbe beebbe b e b e beeb eeb e be b e b ebbeeb e bb
Rel_acc  263426354631645221362745333423447273123016394536431643624300
SUB_acc  .e.b.e.bee..eib....e.bie...e..eeb.e......e.bie.bi..be.b.e...

         ....,....19.1.,....20.1.,....21.1.,....22.1.,....23.1.,....2
AA       PHADELKAKIDQNVEELKGRLTPYADEFKVKIDQTVEELRRSLAPYAQDTQEKLNHQLEG
PROF_sec HHHHHHHHHHHHHHHHH HHHHHHHHHHHHHHHHHHHH HHHHHHHHHHHHHHH
Rel_sec  515888888887688776126740688778777667777753022246688765555776
SUB_sec  L.HHHHHHHHHHHHHHHH..LL..HHHHHHHHHHHHHHHHH......HHHHHHHHHHHHH
P_3_acc  b be b eeb e beebeee e beeb eeb e beeb eeb e be b eeb ebbee
Rel_acc  001437443535337463731232343243363426546263112124342634330342
SUB_acc  ...e.bie.b.e..eeb.e......e..i..b.e.beeb.e......e.b.e.b....e.

         ....,....25.1.,....26.1.,....27.1.,....28.1.,....29.1.,....3
AA       LTFQMKKNAEELKARISASAEELRQRLAPLAEDVRGNLKGNTEGLQKSLAELGGHLDQQV
PROF_sec HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH HHHHHHHHHHHHHHHHHHH
Rel_sec  666456778888887767788787644347887876410235778887887643478888
SUB_sec  HHH.HHHHHHHHHHHHHHHHHHHHH....HHHHHHH.....HHHHHHHHHHH...HHHHH
P_3_acc  b e b e beebee b e beeb e b ebbe b eebeee eeb e beeb e beebb
Rel_acc  323343624464354533226473634333354736213221636263544415333444
SUB_acc  ....b.e.beeb.eib....eeb.e.b....eib.e......e.b.e.beeb.e...ebb

         ....,....31.1.,....32.1.,....33.1.,....34.1.,....35.1.,....3
AA       EEFRRRVEPYGENFNKALVQQMEQLRQKLGPHAGDVEGHLSFLEKDLRDKVNSFFSTFKE
PROF_sec HHHHHHHH HHHHHHHHHHHHHHHHHHH HHHHHHHHHHHHHHHHHHHHHHHH
Rel_sec  888875200004688988878888887624543540102667767477566557754420
SUB_sec  HHHHHH......HHHHHHHHHHHHHHHH..L..L.....HHHHHH.HHHHHHHHHH....
P_3_acc  eeb e beeb e b ebbbebbeeb eebeeeeee eeb e bbe b e bbbbbbbbee
Rel_acc  438363443016242409232734836312203220523332605563336035112454
SUB_acc  e.b.e.be...e.b.e.b...b.eb.e.........e.....b.eib...b..b...bee

         ....,....37.1.,....38.1.,....39.1.,.
AA       KESQDKTLSLPELEQQQEQQQEQQQEQVQMLAPLES
PROF_sec HHHHHHHHHHHHHH
Rel_sec  147665445782678777788866430122265558
SUB_sec  ..LLLL..LLL.HHHHHHHHHHHH.......LLLLL
P_3_acc  ee eeee e eeeee eeeeeee eebee ee ee
Rel_acc  330322112202223122333132064023221267
SUB_acc  .........................ee.......ee
Full PredictProtein result for APOA4: http://pga.lbl.gov/Workshop/Feb2003/Classwork/predictproApoa4.htm
Jpred: uses a Java-based viewer, Jalview. To see Donn’s 2/25/2003 prediction for APOA4 (should be around for a few days only) try:
http://www.compbio.dundee.ac.uk/~www-jpred/results/jp_0803849/jp_0803849.results.html
PSIPred:
PSIPRED PREDICTION RESULTS
Key
Conf: Confidence (0=low, 9=high)
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
AA: Target sequence

Conf: 928999999999971369888718999999999999998769999998888999999999
Pred: CHHHHHHHHHHHHHHCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
  AA: MFLKAVVLTLALVAVAGARAEVSADQVATVMWDYFSQLSNNAKEAVEHLQKSELTQQLNA
              10        20        30        40        50        60

Conf: 998769999999999999853378998888999999999998688999999999999999
Pred: HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
  AA: LFQDKLGEVNTYAGDLQKKLVPFATELHERLAKDSEKLKEEIGKELEELRARLLPHANEV
              70        80        90       100       110       120

Conf: 999999999999999999999999999999999998899999999999999999999999
Pred: HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
  AA: SQKIGDNLRELQQRLEPYADQLRTQVNTQAEQLRRQLTPYAQRMERVLRENADSLQASLR
             130       140       150       160       170       180

Conf: 999999999988899998778999999999999999999999999999999999999999
Pred: HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
  AA: PHADELKAKIDQNVEELKGRLTPYADEFKVKIDQTVEELRRSLAPYAQDTQEKLNHQLEG
             190       200       210       220       230       240

Conf: 999999999999999999999999999999999999999888776578999999999999
Pred: HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
  AA: LTFQMKKNAEELKARISASAEELRQRLAPLAEDVRGNLKGNTEGLQKSLAELGGHLDQQV
             250       260       270       280       290       300

Conf: 999999999999999999999999999889999999999999999999888999999999
Pred: HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
  AA: EEFRRRVEPYGENFNKALVQQMEQLRQKLGPHAGDVEGHLSFLEKDLRDKVNSFFSTFKE
             310       320       330       340       350       360

Conf: 986646655689999999999999999997266889
Pred: HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHCCCCC
  AA: KESQDKTLSLPELEQQQEQQQEQQQEQVQMLAPLES
             370       380       390
Predator: [no longer available as a web service].