SKILL SHEET
SEQUENCE ANALYSIS UNIT
COMPUTATIONAL BIOLOGY COURSES 03-310, 03-311, 03-510 & 03-710
This is an organized listing of skills and concepts designed to help you prepare for exams. It is meant to focus your attention on the parts of the course material related to skills you may be required to perform. This is NOT an exhaustive list of concepts: not all material from the lectures is listed here, so you should still review your course notes and assigned reading. SKILLS AND CONCEPTS WHICH ARE NOT LISTED HERE MAY BE INCLUDED ON EXAMS. Unless noted below, all material is relevant to all of the courses listed above (but the depth of understanding is expected to be greater for 03-510 & 03-710).
Entrez and Related Topics
Understand the use of Entrez for accomplishing the following:
l Given the name or accession number of a gene or protein, do the following:
o Download its sequence or structure
o Locate related research materials
o Locate related nucleotide, protein, or structure entries
Understand and describe the theory behind the following:
l The difference between the various Entrez databases
l The difference between "MeSH terms" and "Keywords"
l The difference between related items within a database and links to other databases
l Classification of sequence file formats based on:
n Text vs. binary
n Minimal vs. annotated
Sequence Alignment
Understand and describe the following:
l Representing ambiguity in nucleotide sequences and protein sequences
l Relationship between sequence homology and sequence motifs
l Concept of a similarity function or similarity matrix
l Concept of global alignment vs. local alignment
l Application of dot matrices to sequence alignment, and their interpretation
l Concept of dynamic programming applied to sequence comparison, including gap penalties
l Principles of the two algorithms: FASTA and BLAST, and their major differences
l Similarity values and simple classification
l Pairwise and multiple sequence alignment, Carrillo-Lipman sum of pairs method, and two programs for multiple sequence alignment (MSA and ClustalW)
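The dynamic programming approach to global alignment listed above can be sketched as follows. This is a minimal illustration, not the course's reference implementation: the function name global_align and the scoring values (match +1, mismatch -1, linear gap penalty -2) are assumptions chosen for the example, and separate gap-opening and end-gap penalties are not modeled.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming with a linear gap penalty.

    The scoring values are illustrative assumptions, not course-specified.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,    # gap in b
                score[i][j - 1] + gap,    # gap in a
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row1, row2 = global_align("ACGT", "AGT")
print(best)
print(row1)
print(row2)
```

Under these assumed scores, aligning ACGT with AGT uses three matches and one gap. When several alignments tie for the best score, the traceback returns only one of them.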
Understand how to manually do the following:
l Build and interpret a dot matrix with a specific window size and stringency
l Find the optimal alignment between two short sequences using dynamic programming, both basic and with a similarity function
l Apply dynamic programming with gap penalties (opening, extension and end)
l Translate between the IUB and bit coding forms of sequence representation
l Estimate the probability that a pattern would occur randomly in a given sequence, using: equal frequency, mononucleotide frequency, or dinucleotide frequency.
l Estimate the probability that an alignment would occur randomly in any pair of sequences of the same lengths and composition
l Search a database for entries related to a given sequence
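The pattern probability estimates listed above can be illustrated with a short sketch. It covers only the equal frequency and mononucleotide frequency models, treats positions as independent, and uses hypothetical function names (p_equal, p_mono, expected_hits) that do not come from the course material. The expected count is an approximation that ignores overlap between occurrences.

```python
from collections import Counter


def p_equal(pattern):
    """Probability of the pattern at one position if all four bases are equally likely."""
    return 0.25 ** len(pattern)


def p_mono(pattern, sequence):
    """Probability of the pattern at one position using the observed
    base frequencies of the given sequence (mononucleotide model)."""
    counts = Counter(sequence)
    total = len(sequence)
    p = 1.0
    for base in pattern:
        p *= counts[base] / total
    return p


def expected_hits(p, seq_len, pat_len):
    """Approximate expected number of occurrences in a sequence of length
    seq_len: probability per position times the number of start positions."""
    return p * max(seq_len - pat_len + 1, 0)


# A 6-base pattern under the equal frequency model.
print(p_equal("GAATTC"))
# The same kind of estimate using the composition of a toy sequence.
print(p_mono("AT", "AAAT"))
print(expected_hits(p_equal("GAATTC"), 4101, 6))
```

The two models give different answers whenever the sequence composition is skewed, which is the reason for choosing between them.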
Representing and Finding Sequence Features
Understand and describe the theory behind the following:
l Concept of "feature" in a sequence
l Concept of a consensus sequence
l Concept of a frequency matrix
l Concept of a Profile or PSSM
o When to use frequency matrix vs. when to use consensus sequence
l Concept of Markov chains and hidden Markov models
l Genetic codes, reading frame and open reading frame
l Concept of base composition bias and codon bias
l Nucleotide subsequences as indicators of gene expression including:
Relationship between:
o DNA, mRNA & cDNA
o transcription
o splicing
o translation
l Gene model and Genscan HMM
l Measuring gene-finding performance at the nucleotide level and the exon level
l Concept of hydropathy, hydrophobicity and hydrophilicity
l Amphiphilicity/amphipathicity, helical wheel plot and hydrophobic moment
l Concept of a "window" and what the axes represent in the following:
o % nucleotide content plot
o Codon bias plot
o Hydrophilicity plot
l (03-510/03-710) HMM: Optimal path through a hidden Markov model by Viterbi algorithm, probability of a path by forward/backward algorithm, and parameter estimation by Baum-Welch algorithm
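The Viterbi algorithm named in the last item above can be sketched as follows. The two-state model (a hypothetical "H" state that favors G/C and an "L" state that favors A/T) and all of its probabilities are invented for illustration. The sketch works in log space and assumes every probability in the model is nonzero.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for obs and that path's probability."""
    # best[t][s] = log-probability of the best path that ends in state s at position t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # Choose the predecessor state that maximizes the path score.
            prev = max(states, key=lambda r: best[t - 1][r] + math.log(trans_p[r][s]))
            best[t][s] = (best[t - 1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][obs[t]]))
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, math.exp(best[-1][last])


# Toy model; all values are assumptions for this example.
states = ["H", "L"]
start_p = {"H": 0.5, "L": 0.5}
trans_p = {"H": {"H": 0.5, "L": 0.5}, "L": {"H": 0.4, "L": 0.6}}
emit_p = {
    "H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

path, prob = viterbi("GGCA", states, start_p, trans_p, emit_p)
print(path, prob)
```

For the observation GGCA this toy model yields the path H, H, H, L, which can be checked by filling in the same table by hand.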
Understand how to manually do the following:
l Build consensus sequences for a multiple sequence alignment
l Calculate a frequency matrix and build a PSSM
l Build a PSSM from a frequency matrix, with pseudo-counts
l Build simple Markov models / interpret complicated ones
l Find reading frames (on both plus and minus strands)
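The consensus, frequency matrix and PSSM steps listed above can be sketched together. This is one common log-odds formulation, not necessarily the one used in lecture: the function names, the pseudo-count of 1 per base, and the uniform background of 0.25 are all assumptions made for this example.

```python
import math

ALPHABET = "ACGT"


def count_matrix(seqs):
    """One dict of base counts per column of an ungapped alignment."""
    length = len(seqs[0])
    return [{b: sum(s[i] == b for s in seqs) for b in ALPHABET}
            for i in range(length)]


def consensus(counts):
    """Most frequent base in each column (ties go to the first base in ALPHABET)."""
    return "".join(max(ALPHABET, key=lambda b: col[b]) for col in counts)


def pssm(counts, pseudocount=1.0, background=None):
    """Log2-odds matrix: log2(frequency with pseudo-counts / background frequency)."""
    background = background or {b: 0.25 for b in ALPHABET}
    matrix = []
    for col in counts:
        total = sum(col.values()) + pseudocount * len(ALPHABET)
        matrix.append({b: math.log2(((col[b] + pseudocount) / total) / background[b])
                       for b in ALPHABET})
    return matrix


def score(matrix, site):
    """Sum of the column scores for a candidate site of the same length."""
    return sum(col[b] for col, b in zip(matrix, site))


aligned = ["ACGT", "ACGA", "ACTT", "AGGT"]  # toy alignment
counts = count_matrix(aligned)
m = pssm(counts)
print(consensus(counts))
print(score(m, "ACGT"), score(m, "TTTT"))
```

The pseudo-count keeps bases that never appear in a column from receiving a score of minus infinity, so unseen bases are penalized rather than forbidden.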
Protein Structures
Understand and describe the theory behind the following:
l Difference between primary, secondary and tertiary structures
l Secondary structure prediction
o Classical methods
? Chou-Fasman
? Garnier-Osguthorpe-Robson
o Adaptive methods
? Neural network methods
? Homology-based methods
l Confusion matrix, and relationship between false positive, true positive, false negative, true negative, recall and precision
l Structure homology and VAST Neighbors
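The confusion matrix quantities listed above can be made concrete with a small sketch. It scores a per-residue secondary structure prediction for a single class. The strings, the "H" (helix) label and the function names are hypothetical and chosen only for illustration.

```python
def confusion_counts(actual, predicted, positive="H"):
    """Count true/false positives and negatives for one class label."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, really positive
        elif p == positive:
            fp += 1  # predicted positive, really negative
        elif a == positive:
            fn += 1  # predicted negative, really positive
        else:
            tn += 1  # predicted negative, really negative
    return tp, fp, fn, tn


def precision(tp, fp):
    """Fraction of positive predictions that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Fraction of actual positives that were found."""
    return tp / (tp + fn) if tp + fn else 0.0


actual = "HHHHCCCCEE"     # toy "true" secondary structure
predicted = "HHHCCCCHEE"  # toy prediction
tp, fp, fn, tn = confusion_counts(actual, predicted)
print(tp, fp, fn, tn)
print(precision(tp, fp), recall(tp, fn))
```

Precision depends only on the predicted positives (TP and FP), while recall depends only on the actual positives (TP and FN), so the two can diverge sharply when a method over-predicts or under-predicts a class.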
Understand the use of RasMol for accomplishing the following:
l Displaying 3-dimensional structure files of biological macromolecules
l Controlling the parts of a structure that are displayed
l Controlling how parts of a structure are displayed (color, model)
Microarrays
Understand the following:
l Origin of the data: how the data was generated, and the two kinds of microarray (cDNA and oligonucleotide)
l How the log ratio is used to represent the data
l Choice of distance functions (Euclidean distance, Mahalanobis distance, etc.)
l Two clustering methods (K-means vs. hierarchical)
l Interpretation of tree
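The log ratio representation and the Euclidean distance listed above can be sketched as follows. The two-channel intensity values, the channel names (red, green) and the function names are assumptions made for this example. Base 2 is used so that a two-fold increase and a two-fold decrease map to +1 and -1.

```python
import math


def log_ratios(red, green):
    """log2(red / green) for each spot; intensities are assumed to be positive."""
    return [math.log2(r / g) for r, g in zip(red, green)]


def euclidean(x, y):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Toy intensities for two genes measured on three arrays.
gene1 = log_ratios([200, 100, 50], [100, 100, 100])
gene2 = log_ratios([100, 100, 100], [100, 100, 100])
print(gene1)
print(euclidean(gene1, gene2))
```

A distance function of this kind supplies the pairwise distances that a clustering method such as hierarchical clustering takes as input.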
Prepared by Steve Vanni