
Day 1 exercises

Step-by-step keys are at the end.

 

Exercises 1-4: The goal of these exercises is to use DNA sequence analysis tools to identify genes on an unknown BAC. You will have to retrieve a genomic DNA sequence, for which no additional information is available, and discover which genes are present in this sequence using database searches for genes that have homology to the genomic sequence.

#1. Retrieve the sequence of the BAC LB5-28F9 from GenBank

a) To which species does this sequence belong?

b) Is it more closely related to human or mouse?

Hint: Use the "Nucleotide" option in the Entrez database at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi. Species information is available in the "Organism" field of the GenBank entry.

Answer: This BAC belongs to Callicebus moloch, commonly known as the dusky titi. It is a monkey, and as a primate it is more closely related to human than to mouse.

 

#2. To simplify the analysis, we will first work with only a subset of the BAC sequence, from base 160,000 to base 221,500, in its reverse-complement form. Save the sequence on your desktop.

a) What is the best format for saving sequences for further analysis?

Hint: Use the "Range" option.

Answer: FASTA format.
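
For readers who want to see what the "Range" and reverse-complement options do, here is a minimal Python sketch. It assumes coordinates are 1-based and inclusive, as in GenBank, and the helper names (reverse_complement, subrange_revcomp, to_fasta) are illustrative only, not part of any NCBI tool.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def subrange_revcomp(seq: str, start: int, end: int) -> str:
    """Extract bases start..end (1-based, inclusive), then reverse-complement."""
    return reverse_complement(seq[start - 1:end])


def to_fasta(name: str, seq: str, width: int = 60) -> str:
    """Format a sequence as a FASTA record with fixed-width sequence lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Toy sequence standing in for the BAC; the exercise itself would use
# start=160000 and end=221500 on the full GenBank sequence.
toy = "AACCGGTTAC"
piece = subrange_revcomp(toy, 2, 5)  # bases 2..5 are "ACCG"
print(to_fasta("toy_2_5_revcomp", piece))

# Length of the sub-range used in the exercise (inclusive coordinates).
print(221500 - 160000 + 1)
```

The saved sub-range should therefore be 61,501 bases long, which is a quick way to check that the range was entered correctly.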

 

#3. Find out what coding sequences are present in this BAC. We will do this by finding which cDNA sequences have a high scoring match to our BAC sequence.

a) Which sequence analysis tool is best suited for this task?

b) To optimize search speed, to which organism should we limit the BLAST search?

Hint: Inspect the NCBI BLAST page at http://www.ncbi.nlm.nih.gov/BLAST/.

Answer: MegaBLAST. Homo sapiens.

 

#4. Inspect the BLAST output.

a) How many different sequences have similarity matches to your sequence?

b) Are these similarity matches coming from genomic or cDNA sequences?

c) How many RefSeq entries hit your sequence?

d) What genes are they?

Hint: Count how many different accession numbers appear in the BLAST output. Remember that RefSeq mRNA accession numbers begin with NM_. Use the Entrez database to retrieve information about the RefSeq entries.

Answer: There are a large number of hits. The similarity matches come from both genomic and cDNA sequences. Three RefSeq entries hit the sequence: APOA1, APOA4 and APOA5.

 

 

Exercises 5-7: The goal of these exercises is to reconstruct the predicted coding sequences of the genes identified in the previous exercise and to compare them to their orthologous human sequences.

 

#5. Reconstruct the expected coding sequence of the Callicebus moloch APOA5 gene. An easy way to do this is to inspect a pairwise sequence alignment between your Callicebus moloch sequence and the human APOA5 cDNA sequence.

a) How many exons are there?

b) What are the coordinates of all the intron-exon junctions?

Hint: Use Blast2seq on the NCBI BLAST page. Change "mismatch" to -1, "gap open" to 3, and "gap extension" to 1 to obtain sequence alignments surrounding exon junctions. Remember that an intron almost always begins with GT and ends with AG.

Answer: 4. Exon 1: 44265-44321, exon 2: 44454-44565, exon 3: 45070-46256, exon 4: 46603-47120.
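
The exon coordinates in the answer can be sanity-checked with a little arithmetic. The sketch below is only an illustration: it assumes the coordinates are 1-based, inclusive positions in the saved sub-sequence, and the function names are hypothetical. It derives the intron intervals from the exon list and applies the GT...AG rule from the hint to a toy sequence.

```python
# Exon coordinates from the answer above (1-based, inclusive; assumed to
# refer to the saved reverse-complement sub-sequence).
EXONS = [(44265, 44321), (44454, 44565), (45070, 46256), (46603, 47120)]


def interval_lengths(intervals):
    """Length of each (start, end) interval with inclusive ends."""
    return [end - start + 1 for start, end in intervals]


def introns_between(exons):
    """Introns are the gaps between consecutive exons."""
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]


def has_gt_ag(genomic, intron):
    """True if the intron starts with GT and ends with AG."""
    start, end = intron
    seq = genomic[start - 1:end].upper()
    return seq.startswith("GT") and seq.endswith("AG")


print(interval_lengths(EXONS))
print(introns_between(EXONS))

# Toy genomic sequence: exon "ATG", intron "GTCCAG", exon "CC".
toy = "ATGGTCCAGCC"
toy_exons = [(1, 3), (10, 11)]
print([has_gt_ag(toy, i) for i in introns_between(toy_exons)])
```

Running has_gt_ag on the real sub-sequence with the three derived introns is a quick check that the junctions were read correctly from the alignment.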

 

#6. Use the EXTRACTSEQ tool to assemble the reconstructed Callicebus moloch ApoA5 cDNA. Verify that you have done everything correctly by checking that you have one intact open reading frame in the extracted sequence. Save the extracted sequence on your desktop.

Hint: Use EXTRACTSEQ at http://bioweb.pasteur.fr/seqanal/interfaces/extractseq.html. Run GETORF from your EXTRACTSEQ output page.
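
What this step does conceptually can be sketched in a few lines of Python: join the exon ranges into one sequence, then look for a single long open reading frame. This is a simplified stand-in, not the EXTRACTSEQ or GETORF implementation. It only scans the three forward frames for ATG-to-stop ORFs, and the function names are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def splice_exons(genomic, exons):
    """Concatenate exon ranges (1-based, inclusive) into one sequence."""
    return "".join(genomic[start - 1:end] for start, end in exons)


def longest_orf(cdna):
    """Longest forward-strand ORF from ATG through the first in-frame stop.

    Returns the ORF including its stop codon, or "" if none is found.
    """
    cdna = cdna.upper()
    best = ""
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(cdna) - 2, 3):
            codon = cdna[i:i + 3]
            if orf_start is None and codon == "ATG":
                orf_start = i
            elif orf_start is not None and codon in STOP_CODONS:
                candidate = cdna[orf_start:i + 3]
                if len(candidate) > len(best):
                    best = candidate
                orf_start = None
    return best


# Toy gene: exon 1 = "CCATGGC", intron = "GTAAGTTTCAG", exon 2 = "CTTTTAAGG".
toy_genomic = "CCATGGC" + "GTAAGTTTCAG" + "CTTTTAAGG"
toy_cdna = splice_exons(toy_genomic, [(1, 7), (19, 27)])
print(toy_cdna)
print(longest_orf(toy_cdna))
```

If a junction coordinate is off by one or two bases, the frame shifts and the long ORF usually breaks into short fragments, which is why the ORF check is a useful verification.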

 

#7. Compare the human APOA5 protein to your extracted cDNA.

a) How many changes are between amino acids of the same chemical type?

b) How many changes are between amino acids of different chemical types?

c) How many gaps are there?

Hint: Use the blastx version of Blast2seq on the NCBI blast page.

Answer: 12, 11, 0.
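
One way to count these changes is from the middle line of the BLAST alignment, where an identical residue is shown as a letter, a positively scoring substitution as "+", and any other mismatch as a space. The sketch below tallies those symbols. It assumes that the "+" positions are what the exercise means by changes within the same chemical type, which is an interpretation rather than something the exercise states, and the function name is hypothetical.

```python
def tally_alignment(query, midline, subject):
    """Count identities, '+' substitutions, other mismatches and gap columns.

    All three strings must be the same length, one character per column.
    """
    if not (len(query) == len(midline) == len(subject)):
        raise ValueError("alignment rows must have equal length")
    counts = {"identical": 0, "similar": 0, "different": 0, "gaps": 0}
    for q, m, s in zip(query, midline, subject):
        if q == "-" or s == "-":
            counts["gaps"] += 1       # gap in either sequence
        elif m == "+":
            counts["similar"] += 1    # positively scoring substitution
        elif m == " ":
            counts["different"] += 1  # all other mismatches
        else:
            counts["identical"] += 1  # midline repeats the residue
    return counts


# Toy alignment block (not real APOA5 data).
query = "MKVLA-G"
midline = "MK+ A G"
subject = "MKIDAPG"
print(tally_alignment(query, midline, subject))
```

For a real BLAST report the alignment is split over several blocks, so the rows would need to be concatenated block by block before tallying.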

 

If you have time left after the next set of exercises, you can repeat exercises 5-7 with the two other genes found in this BAC, APOA1 and APOA4.

 


Exercises 8-11: The goal of these exercises is to design PCR primers for resequencing exons, as when looking for mutations in clinical samples. These exercises introduce the ENSEMBL genome browser and the Primer3 program.

 

#8. Retrieve the sequence of LXR-alpha exon 2 and of its 2 surrounding introns.

a) How many alternative transcripts are there for the LXR-alpha gene?

b) How many exons are in the LXR-alpha gene?

c) How long are exon 2, introns 1-2 and 2-3?

Hint: Use the ENSEMBL genome browser and the "exon information" function.

Answer: 2 alternative transcripts. 9 exons. 189, 531, and 429 bp.
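
The three lengths in the answer fix where exon 2 sits inside the retrieved sequence, which is useful later when a primer target has to be specified. The sketch below works this out under one assumption that the exercise does not state explicitly: that the retrieved sequence is exactly intron 1-2, then exon 2, then intron 2-3, with nothing else on either side. The function name and the 50 bp flank argument are illustrative.

```python
def exon_window(upstream_intron_len, exon_len, downstream_intron_len, flank=0):
    """Return (start, end) of the exon plus flanks, 1-based and inclusive.

    Coordinates refer to a sequence laid out as
    upstream intron + exon + downstream intron.
    """
    total = upstream_intron_len + exon_len + downstream_intron_len
    exon_start = upstream_intron_len + 1
    exon_end = upstream_intron_len + exon_len
    start = max(1, exon_start - flank)   # do not run off the left end
    end = min(total, exon_end + flank)   # do not run off the right end
    return start, end


# Lengths from the answer above: intron 1-2 = 531, exon 2 = 189, intron 2-3 = 429.
exon_only = exon_window(531, 189, 429)
with_flanks = exon_window(531, 189, 429, flank=50)
print(exon_only)
print(with_flanks)
print(with_flanks[1] - with_flanks[0] + 1)  # size of the region to amplify
```

Under these assumptions the region that primers must enclose starts at position 482 and is 289 bp long. Primer3 accepts such a region as a start,length pair in its "Targets" field.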

 

#9. Find and mask repeats in the sequence you just retrieved (Exon 2 w/ introns).

a) How many repeats are there?

b) Of what type?

Hint: Use Repeatmasker at http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker. Save the masked sequence on your desktop.

Answer: 1 repeat. SINE/MIR.

 

#10. Design primers to amplify LXR-alpha exon 2, including approximately 50 additional bp on either side of the two intron-exon junctions.

a) Do the primers fall in repeat regions?

b) What is the difference between the annealing temperatures of the two primers?

c) Is there any G or C stretch of 4 or more bases?

Hint: Submit your masked sequence to Primer3 at http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi

Answer: Depends on your specific Primer3 output. In general, you must avoid repeat regions and GC stretches. Annealing temperature differences of 20 °C or more may make the PCR more difficult.
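
The three checks in this exercise are easy to script for any primer pair. The following is a rough sketch under stated assumptions: masked bases are assumed to appear as N in the RepeatMasker output, a "G or C stretch" is read as four or more consecutive bases that are each G or C (which also catches runs of a single base), and melting temperature is estimated with the simple Wallace rule, 2 °C per A or T plus 4 °C per G or C. Primer3 uses a more detailed thermodynamic model, so its temperatures will differ. All function names are illustrative.

```python
import re


def overlaps_masked(primer):
    """True if the primer contains masked bases (assumed to be written as N)."""
    return "N" in primer.upper()


def has_gc_run(primer, min_len=4):
    """True if the primer has min_len or more consecutive G/C bases."""
    return re.search("[GC]{%d,}" % min_len, primer.upper()) is not None


def wallace_tm(primer):
    """Rough melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    p = primer.upper()
    return 2 * (p.count("A") + p.count("T")) + 4 * (p.count("G") + p.count("C"))


def tm_difference(forward, reverse):
    """Absolute difference between the two Wallace estimates."""
    return abs(wallace_tm(forward) - wallace_tm(reverse))


# Made-up primers for illustration only; substitute your Primer3 output.
forward = "AGCTTAGCTAGGATCCATGA"
reverse = "TTGACCTAGGCCATTAGCAT"
for name, primer in (("forward", forward), ("reverse", reverse)):
    print(name, overlaps_masked(primer), has_gc_run(primer), wallace_tm(primer))
print("Tm difference:", tm_difference(forward, reverse))
```

Because the Wallace rule ignores sequence context and salt conditions, treat the numbers as a quick comparison between the two primers rather than as annealing temperatures to use at the bench.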

 

#11. Verify that your primers are unique in the human genome and that they do not have the potential to amplify other regions of the human genome.

a) How many different database hits did you find?

b) Are the hits from the same or different chromosomes?

c) Does the whole primer sequence match or only part of it?

Hint: Run BLAST for short, nearly exact matches. Alternatively, you can use UCSC In-Silico PCR at http://genome.ucsc.edu/cgi-bin/hgPcr?command=start

Answer: Depends on your specific BLAST result. Different hits from the same chromosome probably reflect multiple genomic sequences from the same genomic locus. If only part of the primer sequence matches regions other than your specific target, the primer is likely to anneal only at your target region.

 
