Concepts in Computing with Data.

UC Berkeley, Statistics 133, Summer 2014

An introduction to computationally intensive applied statistics. Topics may include organization and use of databases, visualization and graphics, statistical learning and data mining, model validation procedures, and the presentation of results.

Prerequisites: Familiarity with basic concepts in probability and statistics is important. Being comfortable with matrices, vectors, basic set theory, functions, and graphs will also help. There is no prerequisite for programming.

Topics

The goal of this course is to introduce you to a variety of programs and technologies that are useful for organizing, manipulating, and visualizing data with a focus on the R statistical computing environment. Topics may include:

  • Working in a reproducible manner
  • Working at the (Bash) command line
  • Version control using Git and GitHub
  • Basics of R programming (data structures, control flow, debugging, etc.)
  • Simulation and random number generation
  • Exploratory data analysis and dimension reduction (PCA)
  • Hypothesis testing (t-tests)
  • Clustering (k-means, hierarchical clustering) and classification (k-nn, CART, Random Forests)
  • Linear and logistic regression

Grading

Your final grade will be a weighted average of grades in the following areas:

  • 5% participation
  • 10% labs
  • 40% homework
  • 10% midterm
  • 20% group project, due at the end of the semester
  • 15% final exam
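
As a rough illustration of how the weighted average works, the sketch below combines component scores using the weights listed above. It assumes each component is scored on a 0-100 scale; the function name `final_grade`, the dictionary keys, and the example scores are hypothetical and are not part of the official grading procedure.

```python
# Weights from the grading list above, expressed as fractions of the final grade.
WEIGHTS = {
    "participation": 0.05,
    "labs": 0.10,
    "homework": 0.40,
    "midterm": 0.10,
    "group_project": 0.20,
    "final_exam": 0.15,
}


def final_grade(scores):
    """Return the weighted average of component scores (assumed to be on a 0-100 scale)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example = {
    "participation": 100,
    "labs": 90,
    "homework": 85,
    "midterm": 70,
    "group_project": 88,
    "final_exam": 80,
}
print(round(final_grade(example), 2))  # prints 84.6
```

Because homework carries 40% of the weight, a 10-point change in the homework score moves the final grade by 4 points, whereas the same change on the midterm moves it by only 1 point.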

Course Policies

Attendance and behavior in class: You are expected to attend all lectures and labs. Any known or potential extracurricular conflicts should be discussed in person with the instructor during the first two weeks of the semester, or as soon as they arise. Cellphones are to be turned off during class time. Laptop use during class is recommended, but it is expected that you will use your laptop to type along with the lecture.

Submission of assignments: Assignments will be accepted by electronic submission to GitHub only. There will be no makeup midterm or final exam. No late labs or homework will be accepted. Grades of Incomplete will be granted only for dire medical or personal emergencies that cause you to miss the final, and only if your work up to that point has been satisfactory.

Academic integrity: Any test, paper, or report submitted by you is presumed to be your own original work that has not previously been submitted for credit in another course. While you are encouraged to work together on homework assignments, the work and writeup must be your own. For example, suggesting a function to another student is acceptable, whereas simply giving him or her your own code is not. If you are not clear about the expectations for completing an assignment or taking an exam, be sure to seek clarification from the instructor or GSI beforehand. Any evidence of cheating or plagiarism will be subject to disciplinary action. Please read the Honor Code carefully.

Class discussion: We will be using Piazza for class discussion. The system is designed to get you help quickly and efficiently from classmates, the GSI, and me. Rather than emailing questions to the teaching staff, you should post your questions on Piazza. If you have any problems or feedback for the developers, email team@piazza.com.

People new to programming often have a hard time asking computing questions in a sensible manner. To help your fellow students, the GSI, and the instructor, you should review Eric Raymond's How To Ask Questions The Smart Way.

Find our class page at: https://piazza.com/berkeley/summer2014/statistics133/home

Students with disabilities: If you need accommodations for any physical, psychological, or learning disability, please speak to me after class or during office hours so that we can make the necessary arrangements.