Day 1 exercises
Step-by-step keys are at the end.
Exercises 1-4: The goal of these exercises is to use DNA sequence
analysis tools to identify genes on an unknown BAC. You will have to retrieve
a genomic DNA sequence, for which no additional information is available,
and discover which genes are present in this sequence using database searches
for genes that have homology to the genomic sequence.
#1. Retrieve the sequence of the BAC LB5-28F9 from GenBank
a) To which species does this sequence belong?
b) Is it more closely related to human or mouse?
Hint: Use the "Nucleotide" option in the Entrez database
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi. Species information is
available in the "Organism" field of the GenBank entry.
Answer: This BAC belongs to Callicebus moloch, commonly known as the dusky titi.
It is a monkey and is therefore more closely related to human than to mouse.
#2. To simplify the analysis, we will first work with only a subset of the BAC
sequence, from base 160,000 to base 221,500, in its reverse-complement form.
Save the sequence on your desktop.
a) What is the best format in which to save sequences for further analysis?
Hint: Use the "Range" option.
Answer: FASTA format.
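The "Range" and reverse-complement options on the GenBank page do this step for you. As a rough sketch of what they compute, the Python below extracts a 1-based inclusive range, reverse-complements it, and formats it as FASTA. The function names and the toy sequence are made up for illustration; they are not part of any NCBI tool.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N; any case)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def subrange(seq, start, end):
    """Return bases start..end using 1-based inclusive coordinates."""
    if not (1 <= start <= end <= len(seq)):
        raise ValueError("range lies outside the sequence")
    return seq[start - 1:end]


def to_fasta(name, seq, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Toy example (hypothetical sequence, not the BAC).
toy = "AAACCCGGGTTTACGT"
piece = subrange(toy, 1, 6)
print(to_fasta("toy_1-6_revcomp", reverse_complement(piece)))
```

For the real exercise, the same calls would use start 160000 and end 221500 on the full BAC sequence.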
#3. Find out what coding sequences are present in this BAC. We will do this by
finding which cDNA sequences have a high-scoring match to our BAC sequence.
a) Which sequence analysis
tool is best suited for this task?
b) To optimize search speed, to which
organism should we limit the BLAST search?
Hint: Inspect the NCBI BLAST page at http://www.ncbi.nlm.nih.gov/BLAST/.
Answer: a) MegaBLAST. b) Homo sapiens.
#4. Inspect the BLAST
output.
a) How many different
sequences have similarity matches to your sequence?
b) Are these similarity
matches coming from genomic or cDNA sequences?
c) How many RefSeq entries hit your sequence?
d) What genes are they?
Hint: Count how many different accession numbers you can find in the BLAST
output. Remember that RefSeq mRNA accession numbers begin with NM_. Use the
Entrez database to retrieve information about the RefSeq entries.
Answer: There are a large number of hits. The similarity matches come from both
genomic and cDNA sequences. There are 3 RefSeq entries: APOA1, APOA4 and APOA5.
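Counting distinct accession numbers by eye is error-prone when the hit list is long. A minimal sketch of the bookkeeping, assuming you have copied the accession column of the BLAST output into a Python list, is shown below. The accessions in the example are placeholders, not real hits from this search.

```python
def summarize_hits(accessions):
    """Return (number of distinct accessions, sorted list of NM_ accessions).

    Version suffixes such as ".2" are ignored, so NM_x.1 and NM_x.2
    count as the same entry.
    """
    unique = sorted({acc.split(".")[0] for acc in accessions})
    refseq_mrna = [acc for acc in unique if acc.startswith("NM_")]
    return len(unique), refseq_mrna


# Placeholder accessions for illustration only.
hits = ["NM_111111.1", "NM_111111.2", "AC000000.1", "NM_222222.1", "BC000000.1"]
total, refseq = summarize_hits(hits)
print(total, refseq)
```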
Exercises 5-7: The goal of these exercises is to reconstruct the predicted
coding sequences of the genes identified in the previous exercises and to
compare them to their orthologous human sequences.
#5. Reconstruct the expected coding sequence of the Callicebus moloch APOA5
gene. An easy way to do this is to inspect a pairwise sequence alignment between
your Callicebus moloch sequence and the human APOA5 cDNA sequence.
a) How many exons are
there?
b) What are the coordinates of all the
intron-exon junctions?
Hint: Use Blast2seq on the NCBI BLAST page. Change "mismatch" to -1, "gap open"
to 3, and "gap extension" to 1 to obtain sequence alignments surrounding the
exon junctions. Remember that an intron almost always begins with GT and ends
with AG.
Answer: 4. Exon 1: 44265-44321, exon 2: 44454-44565, exon 3: 45070-46256,
exon 4: 46603-47120.
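The GT...AG rule from the hint can be checked mechanically once you have candidate exon coordinates. The sketch below assumes 1-based inclusive exon coordinates on the same strand as the sequence you aligned; the function name and toy sequence are hypothetical.

```python
def intron_ends(seq, exons):
    """Return (first two bases, last two bases) of each intron.

    exons is a list of (start, end) pairs, 1-based inclusive, in order
    along the sequence. Each intron is the stretch between two
    consecutive exons.
    """
    seq = seq.upper()
    ends = []
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        intron = seq[end1:start2 - 1]  # bases end1+1 .. start2-1 (1-based)
        ends.append((intron[:2], intron[-2:]))
    return ends


# Toy example: exon (1-6), intron (7-12), exon (13-18).
toy = "ATGGCC" + "GTAAAG" + "TTTTAA"
for donor, acceptor in intron_ends(toy, [(1, 6), (13, 18)]):
    print(donor, acceptor, "OK" if (donor, acceptor) == ("GT", "AG") else "check")
```

If a junction does not show GT and AG, shifting the boundary by a base or two in the alignment is usually the first thing to try.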
#6. Use the EXTRACTSEQ tool
to assemble the reconstructed Callicebus moloch APOA5 cDNA. Verify that you have done everything
correctly by checking that you have one intact open reading frame in the
extracted sequence. Save the extracted
sequence on your desktop.
Hint: Use EXTRACTSEQ at http://bioweb.pasteur.fr/seqanal/interfaces/extractseq.html. Run GETORF from your EXTRACTSEQ output page.
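To make clear what the two tools compute, here is a simplified Python sketch: one function joins exon ranges into a single sequence, and another reports the longest ATG-to-stop open reading frame in the three forward frames. This is an illustration under those simplifying assumptions, not a reimplementation of EXTRACTSEQ or GETORF, and the names and toy sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def join_exons(seq, exons):
    """Concatenate exon ranges given as 1-based inclusive (start, end) pairs."""
    return "".join(seq[start - 1:end] for start, end in exons)


def longest_orf(seq):
    """Return the longest ATG..stop ORF found in the three forward frames."""
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


# Toy example: two exons separated by a 6-base intron.
toy = "ATGGCC" + "GTAAAG" + "TTTTAA"
cdna = join_exons(toy, [(1, 6), (13, 18)])
print(cdna, longest_orf(cdna))
```

If the exon boundaries are off by even one base, the reading frame shifts and the longest ORF becomes much shorter than expected, which is the signal to recheck the junctions.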
#7. Compare the human APOA5
protein to your extracted cDNA.
a) How many changes are between amino acids of the same chemical type?
b) How many changes are between amino acids of different chemical types?
c) How many gaps are there?
Hint: Use the blastx version of Blast2seq on the NCBI blast page.
Answer: 12, 11, 0.
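The tallying in this exercise can be sketched in Python as below. The grouping of amino acids into four chemical classes is one common textbook scheme and is an assumption here; BLAST marks similar residues with "+" based on its scoring matrix, so counts from this sketch may differ from those read off the BLAST alignment. The two aligned strings are toy data.

```python
# Assumed chemical classes (one common textbook grouping).
GROUPS = {
    "nonpolar": "AVLIMFWPG",
    "polar": "STCYNQ",
    "acidic": "DE",
    "basic": "KRH",
}
CLASS_OF = {aa: name for name, members in GROUPS.items() for aa in members}


def count_changes(a, b):
    """Return (same-class changes, different-class changes, gap positions)
    for two aligned protein strings of equal length ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    same = diff = gaps = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            gaps += 1
        elif x == y:
            continue  # identical residue, not a change
        else:
            cx = CLASS_OF.get(x)
            if cx is not None and cx == CLASS_OF.get(y):
                same += 1
            else:
                diff += 1
    return same, diff, gaps


print(count_changes("MKVLDSA", "MRVIE-D"))
```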
If you have time left after the next set of exercises, you can repeat
exercises 5-7 with the two other genes, APOA1 and APOA4, found in this BAC.
Exercises 8-11: The goal of these exercises is to design PCR
primers for resequencing exons, as when looking for mutations in clinical
samples. These exercises introduce the
ENSEMBL genome browser and the Primer3 program.
#8. Retrieve the sequence of
LXR-alpha exon 2 and of its 2 surrounding introns.
a) How many alternative
transcripts are there for the LXR-alpha gene?
b) How many exons are in
the LXR-alpha gene?
c) How long are exon 2, introns 1-2 and
2-3?
Hint: Use the ENSEMBL genome browser and the "exon information"
function.
Answer: 2 alternative transcripts. 9 exons. Exon 2, intron 1-2 and intron 2-3
are 189, 531 and 429 bp long, respectively.
#9. Find and mask repeats in
the sequence you just retrieved (exon 2 with its flanking introns).
a) How many repeats are there?
b) Of what type?
Hint: Use RepeatMasker at
http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker.
Save the masked sequence on your desktop.
Answer: 1 repeat. SINE/MIR.
#10. Design primers to amplify LXR-alpha exon 2, including approximately 50
additional bp beyond each intron-exon junction.
a) Do the primers fall in
repeat regions?
b) What is the difference between the annealing temperatures of the two primers?
c) Is there any G or C stretch of 4 or
more bases?
Hint: Submit your masked sequence to Primer3 at http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi
Answer: Depends on your specific Primer3 output. In general, you must avoid
repeat regions and GC stretches. Annealing temperature differences of 20 C or
more may make the PCR reaction more difficult.
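Questions b) and c) can be checked quickly for any primer pair. The sketch below uses the Wallace rule (2 degrees per A or T, 4 degrees per G or C) as a rough melting-temperature estimate; Primer3 itself uses a more detailed thermodynamic model, so its reported values will differ. A "G or C stretch" is interpreted here as 4 or more consecutive bases that are each G or C. The primer sequences are made up for illustration.

```python
import re


def wallace_tm(primer):
    """Rough melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc


def has_gc_stretch(primer, n=4):
    """True if the primer contains n or more consecutive G/C bases."""
    return re.search("[GC]{%d,}" % n, primer.upper()) is not None


def tm_difference(fwd, rev):
    """Absolute difference between the two rough Tm estimates."""
    return abs(wallace_tm(fwd) - wallace_tm(rev))


# Hypothetical primers, not designed against LXR-alpha.
fwd = "ATGCATGCATGCATGCATGC"
rev = "AATTGGCCAATTAATTAATT"
print(wallace_tm(fwd), wallace_tm(rev), tm_difference(fwd, rev))
print(has_gc_stretch(fwd), has_gc_stretch(rev))
```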
#11. Verify that your primers
are unique in the human genome and that they do not have the potential to
amplify other regions of the human genome.
a) How many different
database hits did you find?
b) Are the hits from the
same or different chromosomes?
c) Does the whole primer sequence match
or only part of it?
Hint: Run BLAST for short, nearly exact matches.
Alternatively, you can use UCSC In-Silico PCR at
http://genome.ucsc.edu/cgi-bin/hgPcr?command=start
Answer: Depends on your specific BLAST result. Different hits from the same chromosome
probably reflect multiple genomic sequences from the same genomic locus. If only part of the primer sequence matches
regions other than your specific target, the primer is likely to anneal only at
your target region.
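The idea behind the uniqueness check can be illustrated on a small scale: search a target sequence for exact copies of the primer on both strands and list where they occur. This is only a toy sketch with hypothetical names and sequences; BLAST and In-Silico PCR search the whole genome and also tolerate mismatches, which this code does not.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def exact_sites(primer, target):
    """Return (1-based position, strand) for every exact primer match.

    '+' means the primer sequence itself occurs in the target;
    '-' means its reverse complement occurs, so the primer would
    anneal to the opposite strand at that position.
    """
    primer = primer.upper()
    target = target.upper()
    hits = []
    for strand, query in (("+", primer), ("-", reverse_complement(primer))):
        pos = target.find(query)
        while pos != -1:
            hits.append((pos + 1, strand))
            pos = target.find(query, pos + 1)
    return hits


# Toy target containing one forward and one reverse-complement site.
target = "TTTACGGATTTTTTATCCGTAAA"
print(exact_sites("ACGGAT", target))
```

A primer with more than one full-length site in the regions you care about is a candidate for redesign.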