BIG DATA
FALL 2012
Instructor: Andrew Cencini (acencini@bennington.edu)
Credits: 4
Meeting Time: M/Th 2:10pm-4:00pm
Location: The Pod (VAPA 2nd Floor)
Office Hours: W/F 10:00am-12:00pm, Dickinson 211
Course Web Site: http://cs.bennington.edu/courses/f2012/cs4130.01/

SUMMARY:
In this class, we will explore three areas in computing related to Big Data - programming, databases, and distributed computing. The first area, programming, will help us identify problems that can be solved by means of computer programs, and show us how ideas can be transferred into algorithms and, ultimately, code. The second area is databases. Here, we will be exposed to what a database is and does, the various types and styles of databases, and ways in which data may be organized, imported, manipulated, analyzed, exported and shared. Finally, we will explore Hadoop - a distributed processing framework used to work with massively large data sets. In this class, you will design, refine and implement a project that uses the theory, skills and tools from one or more of these areas to ask and answer a data-driven question of your own.

STUDENTS:

  • Daniel Adamec
  • Benjamin Broderick Phillips
  • Glennis Henderson
  • Pratham Joshi
  • Jonathan Kiritharan
  • Nicholas Sadnytzky
  • Anisha Sharma

PURPOSE OF THIS COURSE:

  • Be able to identify the correct tools and strategies that may be used to manage and solve problems related to big data.
  • Become comfortable with basic/intermediate programming skills.
  • Understand how and why a database management system works the way it does, how to use it, and how to write programs that interact with a database.
  • Gain exposure to and become conversant with Hadoop, a tool that is becoming widely adopted in a growing number of disciplines.
  • Roll back personal limits and beliefs about the tractability of questions, ideas and problems involving large or complex data sets.

FORMAT:
In this class, we will spend most of our time working hands-on with the tools covered in the course. Interspersed throughout this work will be a handful of lectures and discussions. Additionally, we will periodically present, review and critique each other's projects and project plans.

TEXTBOOK:
There are two required books for this course (though you only need to physically purchase one of them), as well as a handful of recommended (but by no means required) books that you may find useful.

Redmond, Eric and Wilson, Jim "Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement" (7D7W). Obtain via the Bennington College bookstore, or bookseller of your choice. Also on course reserve at Crossett Library.

Elkner, Downey & Meyers "How to Think Like a Computer Scientist: Learning with Python" 2nd Edition (EDM). Freely available online at:

Recommended (if you want to dig deeper):
Python:
Payne "Beginning Python Using Python 2.6 and Python 3.1".

Java:
Reges & Stepp "Building Java Programs", any edition (BJP). Available via Amazon.com - older editions are cheaper and are generally ok.

Eckel "Thinking in Java" 3rd edition. A great book. Available freely online at:

Databases & NoSQL:
Elmasri & Navathe "Fundamentals of Database Systems" (4th through 6th editions are fine). Expensive, but good if you want in-depth coverage. Available through Amazon.

Melton & Simon "SQL: 1999 - Understanding Relational Language Components". Understand why the SQL language is the way it is from one of the people who designed it. The "Advanced SQL: 1999" book is also worth a read if you are very serious about this subject.

O'Higgins, Niall "MongoDB and Python". A quick and easy overview of using MongoDB with the Python programming language.

Tahaghoghi, Seyed and Williams, Hugh "Learning MySQL". A classic for those new to MySQL and looking to build web applications with MySQL and PHP.

Hadoop:
Lam "Hadoop in Action". An indispensable introduction and resource for those interested in digging deeper with Hadoop.

White "Hadoop: The Definitive Guide". O'Reilly book on Hadoop. A little more technical than Lam's book, and a great resource for those who are very serious about this subject.

Assorted readings and papers will be provided throughout the course on the course web site. I also have an extensive library of computing books and literature. Feel free to ask to borrow a book from me if obtaining a book is problematic.

REQUIREMENTS:
We all bring our own perspectives, experiences and beliefs to this class. I ask you to keep an open mind and a respectful demeanor, maintain a lively sense of experimentation, and remain patient in the inevitable face of difficulty. This is a challenging course. We will be quickly climbing several difficult mountains - but we are climbing those mountains together. If something is not working, or is holding you back, do not remain silent.

Depending on your experience level and the nature of the work you are pursuing, this course may require significant time outside of class to work on logistics (e.g. data transfer), honing your skills, and project work.

You are expected to attend every scheduled meeting of this course. More than 2 absences, or a pattern of chronic lateness, will substantially jeopardize your standing in this course. Additionally, at a minimum, I expect you to thoroughly read and consider all assigned readings for this class, and to come to class ready to discuss, question, critique and relate those readings. All assignments and exercises must be completed and are due on the date and time agreed upon. Late work will not be accepted.

Finally, I expect you to be honest, independent and creative. Half of the purpose of the exercises and readings is to expose you to new tools and ideas, as well as develop skills and context, while the other half is for you to use those tools, ideas, skills and context in order to ask and attempt to answer good, tough questions. Free riding, plagiarism, and unoriginal/unmotivated work defeat the purpose of our work, and therefore will result in failure of that portion of the class.

EVALUATION:
You will be evaluated in this course on the following criteria:

  • Engagement, classroom/lab contribution (40%) - Are you an asset to your colleagues' and your own learning? Is it clear you have carefully read and thought about the reading? Do you ask insightful questions and bring in additional perspectives and/or materials? Are you respectful of your colleagues? Is the work you do in labs and experiments thoughtful, motivated and challenging?
  • Assignments and projects (60%) - Do exercises and your final project demonstrate fullness of thought on the assigned topic? Are the solutions technically sound? Does the work connect with the course topic in a meaningful way? Does the final project demonstrate reasonably comprehensive coverage of the chosen topic?

EVALUATIVE ELEMENTS:
  • 3 exercises
  • 1 final project

EXERCISES:
The exercises in Big Data will solidify your skills in each area of the course. More details on each exercise will be provided as the course progresses.

FINAL PROJECT:
The final project will be of your own design. A wide variety of questions, topics, ideas, methods and approaches are possible. More details will be provided as the class progresses.

GETTING HELP:
I am always happy to help if you are stuck with a particular concept, idea, piece of code, etc. I strongly prefer email (acencini@bennington.edu) and in-person interaction over the phone. Before you come for help, be sure you are ready to frame and articulate your question clearly and specifically, so that we make the most efficient use of our time.

ACKNOWLEDGMENTS:
Big Data is generously supported by a teaching grant from Amazon Web Services (AWS). AWS computing resources used in this class are made available through this grant.

SCHEDULE:
Week 0
Thursday 9/6

Introduction, Orientation, Problem Space
Logistics
What is Big Data?
Data Sources
Individual Interests & Questions

Reading:
Lohr, Steve "Amid the Flood, A Catchphrase is Born" (New York Times, August 12, 2012).

Science Workshops - First workshop is 9/7 for meet and greet in Dickinson 225 at 1pm (SNACKS).

Week 1
Monday 9/10

Introduction to Programming
Why Programming?

Lab:
Python (and/or Java)
Tools
Variables, loops

Reading:
EDM Chapters 1..3.

Bryant, et al. "Big-Data Computing: Creating Revolutionary Breakthroughs in Commerce, Science, and Society" (Computing Research Association White Paper, 2008).

Thursday 9/13

Programming: First Steps
Algorithms & Program Design
Data Organization (I)

Lab:
Variables, loops
Strings
Collections of data
Functions, program organization

Reading:
EDM Chapters 4..9 (Chapter 8 is optional)

Auerbach, David "The Stupidity of Computers" (n+1, July 2012).

HANDY REFERENCE: Unix Cheat Sheet (PDF)

Week 2
Monday 9/17

Programming: Data Acquisition
Files and the Web

Lab:
Strings & string manipulation
Files
Using libraries
urllib

Reading:
EDM Chapters 10..12

Thursday 9/20

Programming: Data Manipulation
File formats (CSV, binary)
Tactics and problems: sorting & searching
Data structures

Lab: WikiLeaks Unredacted Cables
Data for Class (623MB)

Other Data:
Crossett Library Catalog CSV (10MB)
Enron Mail Archive (423MB)

Week 3
Monday 9/24

Programming: Continue WikiLeaks Program

Assignment 1
WikiFind.py - code from class

Thursday 9/27

Open Lab

Code from Class Today - Directories

Unix Tricks for Data Management

Week 4
Monday 10/1

FIELD TRIP: Crossett Library * Bennington Data Center

Reading:
7D7W: Chapter 1, 2.1, 2.2

Thursday 10/4

Database: Introduction
What is a database?
Problem space
Data types and organization
Structured v unstructured data
Database: SQL Language
A brief intro to SQL
DDL v DML
Queries
How to write and execute SQL

Lab:
Logistics: Servers
Hands-on with SQL
Implement proposed schema (DDL)
Modify schema
Insert and modify test data (DML)
Verify data (Query)
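
The four lab steps above can be sketched in a few lines of Python. This is only an illustration, not the lab handout: it assumes Python's built-in sqlite3 module (an in-memory database standing in for the class server), and the "books" table and its rows are made-up test data.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the class server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: implement a proposed schema (hypothetical "books" table).
cur.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")

# DDL: modify the schema by adding a column.
cur.execute("ALTER TABLE books ADD COLUMN year INTEGER")

# DML: insert made-up test data, then modify one row.
cur.executemany(
    "INSERT INTO books (title, year) VALUES (?, ?)",
    [("Test Book A", 2001), ("Test Book B", 2009)],
)
cur.execute("UPDATE books SET year = ? WHERE title = ?", (2010, "Test Book B"))
conn.commit()

# Query: verify that the data looks the way we expect.
rows = cur.execute("SELECT title, year FROM books ORDER BY id").fetchall()
for title, year in rows:
    print(title, year)
```

The same CREATE / ALTER / INSERT / UPDATE / SELECT sequence should carry over to other SQL databases, though data types and some syntax details differ from system to system.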

Requirement:
Choose a data set you'd like to work with in the database space.

Reading:
7D7W: Chapter 2.3, 2.4, 2.5

Week 5
Monday 10/8

Assignment 1 Due (beginning of class)

Database: SQL Language
Joins and multi-table statements
Importing data
Continued work with queries

Requirement:
Choose a data source you plan to work with in the database space.

Reading:
7D7W: Chapter 3 (Riak) - begin

Thursday 10/11

Database: Unstructured data
Text, binary data
How search works

Lab:
Hands on with text data, text search

Reading:
7D7W Chapter 3 (Riak) - finish

Week 6
Monday 10/15

Finish PostgreSQL lab #2
Text Search

Lab:
Text Search

Reading:
7D7W: Chapter 4 (HBase) - begin

Thursday 10/18

Text Search
Database: How it Works
Components of a database
Basic theory, relational algebra

Regular Expressions Cheat Sheet

Assignment 2 Proposal Due

Reading:
7D7W: Chapter 4 (HBase) - finish

Aiyer, et al. "Storage Infrastructure Behind Facebook Messages Using HBase at Scale" (IEEE Computer, 2012).

Week 7
Monday 10/22

LONG WEEKEND:
NO CLASS

Thursday 10/25

NoSQL: Digging Deeper

Lab:
MongoDB (II)
NoSQL (I): HBase, CouchDB, Cassandra et al.

Today's Lab

Reading:
7D7W: Chapter 5 (MongoDB) - start

Week 8
Monday 10/29

NoSQL Continued

Today's Lab (PHP)

Lab:
MongoDB (III)
Work on Assignment 2 (if time)

Reading:
7D7W: Chapter 5 (MongoDB) - finish

Thursday 11/1

Open Lab: Work on Assignment 2

Lab:
Work on Assignment 2
Optional: Continue NoSQL experiments (MongoDB, CouchDB, HBase, Riak, Cassandra...)

Reading:
7D7W: Chapter 6 (CouchDB) - start

Week 9
Monday 11/5

Project Day
Project Presentations and Critique

Thursday 11/8

PLAN DAY:
NO CLASS

Week 10
Monday 11/12

Hadoop
A little bit about Hadoop
A little bit about Amazon Web Services

Lab:
Becoming a Linux/AWS Ninja
Set up Hadoop cluster (1 node, 20 nodes)
Run quick test job

Reading:
7D7W: Chapter 6 (CouchDB) - finish

Thursday 11/15

Hadoop: MapReduce
Discussion of MapReduce
Applicable problems
Jobs: Mappers/Reducers
Sample Code
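
To make the mapper/reducer idea concrete before the lab, here is a rough word-count sketch in plain Python. It is not Hadoop code and not the sample code used in class: the function names (mapper, shuffle, reducer, word_count) and the sample input are hypothetical, and the grouping step that Hadoop performs between map and reduce is simulated with a dictionary.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, imitating what the framework does between steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(word, counts):
    """Reduce step: combine all counts for one word into a single total."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(word, counts) for word, counts in shuffle(pairs).items())


# Made-up input: each string plays the role of one line of a large file.
sample = ["big data big jobs", "data sets"]
result = word_count(sample)
print(result)
```

In a real cluster the mappers and reducers run on different machines and each sees only part of the data, which is why every function here works on one line or one key at a time.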

Lab:
Examine and modify existing job
Run it, report results

Reading:
MapReduce Paper

Week 11
Monday 11/19

Hadoop: Big Jobs

Lab:
Query Data
Open lab - begin experimenting with crunching big data.
Suggestion: take this time to configure your cluster and get data where it needs to be.

Assignment 3 Proposal Due

Thursday 11/22

THANKSGIVING BREAK:
NO CLASS

Week 12
Monday 11/26

Continue Working on MongoDB/AWS

Thursday 11/29

test2.php code

MongoDB/AWS/PHP

Week 13
Monday 12/3

READING: MapReduce Paper

Hadoop Lab 1

Assignment 3

Thursday 12/6

Hadoop Lab / Do Assignment 3 In Class

Week 14
Monday 12/10

Final Project Presentations

Thursday 12/13

Wrap-up and tearful goodbye... Ramunto's?