Computing with Large Data Sets

Introduction

Enormous collections of data in multiple fields of science and engineering are being gathered and need to be analyzed. For example, the Sloan Digital Sky Survey will represent more than 200 million objects, each with 100 dimensions, and other activities in physics, biology, astronomy, and medicine will soon gather ever-larger sets of data. Biology, and more specifically the field of systems biology, has seen massive improvements in the technologies we use to sequence genomes and measure the levels of gene expression (or activity) in cells under different conditions. These large biology data sets have features in common with large data sets arising in other fields and illustrate the general need for tools for the manipulation and statistical analysis of large data sets. This course will discuss some of the associated unprecedented computational challenges, focusing on the very large data sets arising in computational biology.

High-level languages for mathematical modeling and statistical analysis are a double-edged sword. Use these languages correctly and you'll be able to prototype methods for data analysis and discovery that amaze your co-workers and can be translated into stand-alone code and Web services. Use these languages incorrectly and you will end up with inefficient code that is impossible for others to understand. The course is intended to address some of the needed general principles by using the R statistical programming language to analyze large genomic data sets. In the first few lectures I'll describe what these large biology data sets are, where they come from, and what we'll try to learn from them. We will then learn the fundamentals of statistics and how to program R in ways that are efficient in their use of compute time, memory, and programmer time. We will focus on four main data sets in this class that come from current genomics and systems-biology studies.
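As a rough illustration of the difference between efficient and inefficient R, the sketch below centers a numeric vector in two ways. The data and the function names `center_loop` and `center_vec` are hypothetical, made up for this example rather than taken from the course materials.

```r
# Hypothetical data: one million simulated values standing in for
# gene-expression measurements.
set.seed(1)
x <- rnorm(1e6)

# Inefficient pattern: grow the result inside a loop.
# Each c() call copies the whole vector, so the work grows
# roughly quadratically with the length of v.
center_loop <- function(v) {
  m <- mean(v)
  out <- c()
  for (i in seq_along(v)) {
    out <- c(out, v[i] - m)
  }
  out
}

# Efficient pattern: a single vectorized expression.
center_vec <- function(v) v - mean(v)

# Both versions give the same answer on a small subset ...
small <- x[1:10000]
stopifnot(isTRUE(all.equal(center_loop(small), center_vec(small))))

# ... but compare the loop on 10,000 values with the vectorized
# version on all 1,000,000 values.
print(system.time(center_loop(small)))
print(system.time(center_vec(x)))
```

The vectorized version is also the shorter and easier one to read, which is the sense in which efficient R tends to save programmer time as well as compute time.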

Prerequisites: Programming experience. Prior knowledge of biology and statistics is not required. Non-CS majors with programming experience are encouraged to take the class.

Books and Materials

The course will be taught primarily from Web materials and the primary literature.

Web Materials

The main reading assignments will be these and other Web documents:

http://www.r-project.org/ (click on Manuals on the left side-bar)

Translations of many of these documents into several languages are also provided on the R website.

Books (optional)

John Verzani. Using R for Introductory Statistics. Chapman & Hall/CRC, New York, 2005. ISBN 1-58488-450-9. An introduction to the language that is woven together with an introduction to statistics.

John M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4. A general introduction to the S language (which is very similar to R). A great book, but out of date.

Paul Murrell. R Graphics. Chapman & Hall/CRC, August 2005. A book on the core graphics facilities of the R language and environment for statistical computing and graphics.

More books on R and S are listed on the R website (http://www.r-project.org/).

Code, Packages, and Libraries Used

We will use only open-source code in the course. Each student will need access to a computer with R and all required packages installed, as there will be at least one exercise a week and several graded coding exercises. Most packages are available at http://cran.cnr.Berkeley.edu/. All packages used will have Windows and MacOS binaries available. Packages made as part of assignments will need to compile on Windows, Mac, and Linux platforms.

Evaluation

There will be three graded exercises (20 points each) and one final project (40 points). You will also be expected to help build the course Wiki, which will be used to design the next iteration of the course (10 points). Optional challenges will be given most weeks and will include visualization, speed, and memory challenges depending on the exercise (e.g. best graphic wins, lowest memory footprint wins, fastest run time on oldest computer wins, etc.). Extra-credit points will be awarded to people who win challenges.
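For the speed and memory challenges, one simple way to measure a solution is with base R's `system.time()` and `object.size()`. The sketch below is only an illustration: the matrix `expr` is made-up data, and how challenge entries will actually be scored is not specified here.

```r
# Hypothetical data: a 1000 x 200 matrix standing in for an
# expression data set (rows = genes, columns = conditions).
set.seed(42)
expr <- matrix(rnorm(1000 * 200), nrow = 1000)

# Run time: system.time() reports user, system, and elapsed seconds.
# The assignment inside the call still creates row_means.
timing <- system.time(row_means <- rowMeans(expr))
print(timing["elapsed"])

# Memory: object.size() estimates the bytes used by one object.
print(object.size(expr), units = "Mb")
```

Timings vary between runs and machines, so it is worth repeating a measurement several times before comparing two solutions.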

Syllabus

Each week will have one hour-long lecture discussing the main points for the week and one hour devoted to working on using those principles via the exercises for the week. Remember that R is a high-level language, and example code will be provided, so many of the items on the syllabus seem more difficult than they, in fact, are. The full tentative syllabus will be provided on the course Wiki.