Data and Text Mining for Computational Biology

CS 6365

Fall 2009

 

Instructor

 

Vasileios Hatzivassiloglou

Office: ECSS 3.406

Phone: (972) 883-4342

E-mail: vh (at) hlt.utdallas.edu

Office Hours: Monday and Wednesday 6:00 - 7:00pm; additional times by appointment

Class Time and Location: Monday and Wednesday, 4:00 - 5:15pm at ECSS 2.203

 

Teaching Assistant: To Be Determined

Office: TBD

E-mail: TBD

Office Hours: TBD

 

Course Goals

Introduce the field of bioinformatics

Discuss primary techniques used for data mining and discovery

Introduce text mining and additional issues it brings to data mining methods

Use examples from computational biology

 

Course Topics

The course introduces students to data and text mining concepts as currently practiced in bioinformatics. Applications of these techniques to other fields are also noted. The major topics include:

1. A self-contained review of relevant background material from molecular biology;

2. Sequence alignment as a means of determining similarity between proteins and genes (including algorithms for finding global and local, pairwise and multiple string alignments and measuring similarity);

3. Properties of similarities and distances and their implications for data mining;

4. Genomic, proteomic, and text databases in the real world;

5. Finding patterns (motifs) in genes and proteins using a variety of general methods (maximum likelihood estimation, statistical sampling, Markov chain Monte Carlo simulation);

6. Differentiating between valid patterns and noise (likelihood ratios, relative entropy, expectation-maximization);

7. Classification algorithms (naive Bayes, nearest neighbor);

8. Hierarchical clustering and its application to phylogenetic trees;

9. Selected topics from text mining including term identification, disambiguation, and relationship extraction;

10. Examples of real-world experimental procedures, such as gene microarrays.
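As a small illustration of the sequence alignment topic listed above, the following sketch computes a global pairwise alignment score with the standard Needleman-Wunsch dynamic program. It is not taken from the course materials: the function name and the scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen only for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b.

    Needleman-Wunsch dynamic programming with a linear gap penalty.
    The default scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            # Either align a[i-1] with b[j-1], or put a gap in one sequence.
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = score[i - 1][j] + gap
            gap_in_a = score[i][j - 1] + gap
            score[i][j] = max(diagonal, gap_in_b, gap_in_a)

    return score[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

Under these assumed scores, aligning "ACGT" with "AGT" yields three matches and one gap. A local alignment variant (Smith-Waterman) would additionally clamp each cell at zero and take the maximum over the whole table.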

 

Materials

Recommended Textbooks

“An Introduction to Bioinformatics Algorithms (Computational Molecular Biology)”, by Neil C. Jones and Pavel A. Pevzner, MIT Press, 2004.

ISBN 0262101068

448 pages

Available on Amazon.com or Barnes and Noble for $49

 

“Data Mining: Concepts and Techniques”, by Jiawei Han and Micheline Kamber, Elsevier, 2nd edition, 2006.

ISBN 1558609016

800 pages

Available on Amazon.com or Barnes and Noble for $55

Supplementary Textbooks

“Bioinformatics: The Machine Learning Approach” by Pierre Baldi and Soren Brunak, 2nd edition, 2001.

“Data Mining: Multimedia, Soft Computing, and Bioinformatics”, by Sushmita Mitra and Tinku Acharya, 2003.

Both of the above are available as full-text eBooks via http://library.utdallas.edu.

 Background Reading

Biology: “Molecular Biology of the Cell”, by Bruce Alberts et al., 4th edition, 2002.

Machine learning: “Machine Learning”, by Tom Mitchell, 1997.

Statistics: “The Elements of Statistical Learning: Data Mining, Inference, and Prediction”, by Trevor Hastie, Robert Tibshirani and Jerome Friedman, 2001.

Data structures and algorithms: “Introduction to Algorithms”, by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein, 2nd edition, 2001.


Grading

Class participation: 20%

Homework assignments: 30% total

Midterm: 10%

Team project: 20%

Final exam: 20%

A total of 90 or above earns an “A”; 70 or above earns a “B”.
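The weighting above can be checked with a short calculation. This is only a sketch: it assumes every component is scored on a 0-100 scale and that the letter thresholds apply to the weighted total. The names `WEIGHTS`, `weighted_total`, and `letter_grade` are hypothetical, and no letter is returned below 70 because the page does not specify one.

```python
# Component weights in percent, as listed in the Grading section.
WEIGHTS = {
    "participation": 20,
    "homework": 30,
    "midterm": 10,
    "project": 20,
    "final": 20,
}


def weighted_total(scores):
    """Combine per-component scores (assumed 0-100 each) into a 0-100 total."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_grade(total):
    """Map a total to a letter; thresholds below 70 are not given in the source."""
    if total >= 90:
        return "A"
    if total >= 70:
        return "B"
    return None  # unspecified on the course page


if __name__ == "__main__":
    example = {"participation": 90, "homework": 80, "midterm": 70,
               "project": 100, "final": 85}
    total = weighted_total(example)
    print(total, letter_grade(total))
```

With the hypothetical scores in the example, the weighted sum is 18 + 24 + 7 + 20 + 17 = 86, which falls in the “B” range.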

 

Lectures

 

Topic | Date | Lecture Notes
Introduction | Monday 8/24 | Lecture #1
Biology, Part 1: Classification, Cells, and Proteins | Wednesday 8/26 | Lecture #2
Biology, Part 2: DNA, RNA, Replication, and Reproduction | Monday 8/31 | Lecture #3
Biology, Part 3: Mitosis, Meiosis, Transcription, and Translation | Wednesday 9/2 | Lecture #4
Biology, Part 4: Regulation, Gene Networks, and Systems Biology | Wednesday 9/9 | Lecture #5
Biology, Part 5: Text Mining and DNA Microarrays | Monday 9/14 | Lecture #6
Biology, Part 6: Evolution and Forensic Biology | Wednesday 9/16 | Lecture #7
Biology, Part 7: Gene Amplification and Recombinant DNA | Monday 9/21 | Lecture #8
Challenges in Bioinformatics | Wednesday 9/23 | Lecture #9
Databases, part 1 | Monday 9/28 | Lecture #10
Databases, part 2 | Wednesday 9/30 | Lecture #11
Alignment, part 1 | Monday 10/5 | Lecture #12
Alignment, part 2 | Wednesday 10/7 | Lecture #13
Alignment, part 3 | Monday 10/12 | Lecture #14
Local Alignment | Wednesday 10/14 | Lecture #15
Approximate Alignment | Monday 10/19 | Lecture #16
Midterm | Wednesday 10/21 | No slides
Classifier Evaluation | Monday 10/26 | Lecture #17
Motifs | Wednesday 10/28 | Lecture #18
Finding Motifs | Monday 11/2 | Lecture #19
Multiple Sequence Alignment | Wednesday 11/4 | Lecture #20
Statistical Estimation | Monday 11/9 | Lecture #21
Simulation and Classification | Wednesday 11/11 | Lecture #22
Classification | Monday 11/16 | Lecture #23
Classification Methods | Wednesday 11/18 | Lecture #24

 

Additional lecture information will be added to the table above as the course progresses.

Homework Assignments

 

Information about homework assignments (problems and due dates) will be added here as the course progresses.

Student Presentations

 

The schedule of student project presentations will be listed here.

Presentations will take place in early December.

Students must discuss their proposed project with the instructor by early November.

Supplemental Reading

 

Papers and additional slides that supplement the course material will be listed here.