FAILURE
SPRING 2020
Instructor: Andrew Cencini (acencini@bennington.edu)
Credits: 4
Meeting Time: Tu/F 4:10pm - 6:00pm
Location: CATLab (Dickinson 235)
Office Hours: Tu 3-4pm, Th 2-3pm - Dickinson 211
Course Web Site: http://cs.bennington.edu/courses/s2020/cs4129.01/

SUMMARY:
Why do systems fail? How do we determine what went wrong? How do we learn from failure to build better systems and prevent similar problems from occurring in the future? In this course we will examine a variety of ways that software and hardware systems can fail, along with their causes, impacts, and (where applicable) remediation. We will learn about tools and techniques that can be used to debug, analyze, and simulate failures, and will conduct a series of experiments in which we will observe various forms of failure. The content and direction of the course will be determined, to some extent, by participants' skills and interests.

SKILLS:
In this class, you will learn the following skills:

  • Identification and classification of system failures
  • Root cause analysis of failures
  • Debugging skills (system and network)
  • Logging practices
  • Monitoring techniques
  • Unit, regression, and integration testing
  • Continuous integration/deployment techniques
  • Fault injection
  • Failure response techniques (technical and social)
Additionally, you will learn to read technical documentation, primary research, and media sources related to failure of computer and other types of systems.

FORMAT:
This class is structured as an experimental workshop designed to approach the various skills and topics in two ways:

  • Construction/deconstruction - hands-on exercises to investigate specific skills and concepts related to failure.
  • Discussion and research - participation and contribution towards a common understanding of real-world failures and lessons learned.
Other modalities may be employed; however, the main goal is to survey the landscape of systems failures so that all students in the class are better trained to expect, identify, and handle failure as engineers and system builders.

WORKLOAD:
The workload for this class will vary based on the depth to which students wish to explore a particular topic. For example, if you choose to analyze a failure in a large, complicated system at one or several points in the class, then out-of-class time and cognitive load (and frustration) may be significant. Conversely, workload can be kept to a manageable level by working with provided example systems or choosing smaller or better-known/documented examples. In other words, workload is entirely within the control of the student.

Workload will also vary somewhat over the course of the term; this should be anticipated and planned for. Again, the variation will depend on your interests and existing/developing skills.

TEXTBOOK:
There is no required textbook for this class. Readings will be provided either in hard-copy form or via the course web site.

REQUIREMENTS AND ACADEMIC INTEGRITY:

  • You will attend every class. More than two absences (excused or unexcused) will jeopardize your standing in the course.
  • You will check in all required assignments prior to the start of the class in which they are due.
  • You will be a productive and positive collaborator with your colleagues on the group project.
  • You will be an attentive and positive contributor to class discussion and activities.
  • Your participation and presence in class and other activities will foster a safe and welcoming environment for all others in the class.
  • You will seek out help promptly if you are struggling or falling behind.
  • You will submit your own ideas and work. Academic dishonesty will not be tolerated, and will be passed along without exception to the appropriate administrative or judicial entity.

EVALUATION:

  • Class participation and attendance (30%).
  • Assignments and exercises (40%).
  • Group project & presentation (30%).

GETTING HELP:
If you are struggling in class, or would like to investigate a topic in greater depth, come see me. My office hours are listed at the top of this syllabus. I truly enjoy and look forward to meeting with you. Some general guidance on making sure we are able to meet:

  • I strongly prefer email (acencini@bennington.edu). Please allow 24 hours for a response, perhaps a bit longer on weekends - though in all cases, I will do my best to get back to you as soon as possible.
  • If you would like to meet with me, please consult my schedule (located at my page) and propose a date and time that is not generally booked.
  • I hang up a signup sheet each week outside of my office for office hours. Walk-ins may be possible but are not at all guaranteed. The sheet for the following week usually goes up on Friday after lunch.
  • If you need to meet me outside of my office hours, making an appointment 24 hours or more ahead of time is strongly suggested.
Additionally, I have a large selection of hardware, software and print materials that may be of interest for coursework or independent projects. Feel free to stop by and inquire about what is available and what may be borrowed or used!

SCHEDULE:
Subject to change. Readings and assignments will be disseminated in class.

Week	Date		Topic
Week 1	2/18/2020	Introduction
	2/21/2020	Failure Research I - Assignment 0 [Reading 0] [Reading 1] [Reading 2] [Reading 3] [Assignment 0]
Week 2	2/25/2020	Identifying and Classifying Failure [Reading 4] [Reading 5] [Reading 6] [Reading 7] [sample.c]
	2/28/2020	Andrew Sick - NO CLASS
Week 3	3/3/2020	Identifying Failure I: Monitoring & Instrumentation [Reading 8] [Reading 9] [Reading 10] [Assignment 1] [Monitoring Lab]
	3/6/2020	Monitoring Lab cont'd / Identifying Failure I: Monitoring & Instrumentation [Reading 11 (optional)] [Reading 12 (optional)]
Week 4	3/10/2020	Monitoring Lab cont'd [Reading 13] [Reading 14 (optional)] 
	3/13/2020	Monitoring Lab cont'd [Reading 15] [Reading 16]
Week 5	3/17/2020	Monitoring Lab cont'd
	3/20/2020	LONG WEEKEND
Week 6	3/24/2020	PREPARATION WEEK
	3/27/2020	PREPARATION WEEK
Week 7	3/31/2020*	Reboot:  Monitoring Lab Checkin/Next Steps
	4/3/2020*	Reboot:  Monitoring Lab Checkin/Logging [Reading 17]
Week 8	4/7/2020*	Logging:  To log or not to log?
	4/10/2020*	Monitoring System - group session
Week 9	4/14/2020*	Debugging:  gdb Lab [Reading 18] [Main.c | CheckPrime.c | Defs.h | Externs.h ]
	4/17/2020*	Debugging:  gdb Lab (cont'd)
Week 10	4/21/2020*	Testing I: Unit Testing (pytest) / Debugging (bomb lab cont'd)
	4/24/2020*	Testing II: Regression Testing / Debugging (bomb lab cont'd)
Week 11	4/28/2020*	Testing/Fancy Stuff / Debugging (bomb lab cont'd)
	5/1/2020*	Testing/Fancy Stuff / Debugging (bomb lab cont'd)
Week 12	5/5/2020	PLAN DAY - NO CLASS
	5/8/2020*	Bomb lab completion
Week 13	5/12/2020*	Failure Project / Testing [Reading 18]
	5/15/2020*	Failure Project / Testing [Reading 19] [Reading 20]
Week 14	5/19/2020*	Failure Project / Integration / CI/CD [Test Lab!]
	5/22/2020*	Failure Project / Test Lab!
Week 15	5/26/2020*	Tearful Goodbye