Principles of Big Data Analytics

Author

Marissa Masden

Published

January 20, 2026

Preface

If you want to trust a prediction, you need to understand how all the computations work. For example, in health care, you need to know if the model even applies to your patient. And it’s really hard to troubleshoot models if you don’t know what’s in them.

Cynthia Rudin, Professor of Computer Science at Duke University, in an interview for Quanta Magazine.

What is this course about?

I am a firm believer that data can be used for good – when used wisely. However, to use data wisely, we need to understand what our algorithms do. This course is about understanding what is happening when we ask computers to classify, cluster, and predict. We will cover many different ways we use computers to perform these data tasks, and discuss their benefits and drawbacks.

This site is primarily aimed at students of my DATA 260 (Principles of Big Data Analytics) course. However, it may be useful to anyone interested in understanding the what, why and how of data algorithms.

What background knowledge should I have?

You should have some rudimentary experience in programming, ideally in Python (though there is a section to bring you back up to speed). You should also be prepared to take calculus I.

This course is designed so that it might accompany a standard course in calculus I; in that sense, I hope to provide motivation for why many topics in calculus are useful for data science. Even though calculus is listed as a prerequisite, I will usually approve a prerequisite override for students who request to take calculus at the same time (as a co-requisite), so long as they have taken the prerequisite course for calculus.

I will occasionally use calculus computations and/or concepts to explain ideas, and you might occasionally need to perform such calculations to write and evaluate your data algorithms.

How is this course different from a standard machine learning course?

This class focuses on concepts and principles, and incorporates practical skills.

You are not expected to be a computer science expert before entering this course; you are simply expected to be willing to brave programming. Likewise, you are not expected to be a master of mathematics, but you should be willing to do computations and do battle with new notation.

In other words, this course is intended for students who want not only to implement their data tools, but to actually understand them.

Learning Outcomes

The goals of this course are as follows:

  1. Students will be able to use contemporary Python libraries to access web data.

  2. Students will implement classification and regression algorithms that rely on intermediate mathematics skills, including naive Bayes, decision trees, ARIMA, and logistic and multilinear regression models. 

  3. Students will understand the benefits and drawbacks of algorithm choice when working with big data, including expressing the time and space complexity of simple algorithms using Big-O notation and comparing classification and regression algorithms through this lens.

  4. Students will articulate the role of randomness and optimization as tools for improving the time and space complexity of data tasks.

  5. Students will produce a portfolio of well-documented data analysis projects on GitHub.