CS 291 Assignment #6

Due Monday, March 18th
(Must be submitted by 1pm on March 18th!)
Not accepted late

The Assignment

For your final Haskell assignment you'll implement a program to do K-Means clustering of data points. Data scientists use clustering algorithms to automatically group data points into clusters of related points, and K-Means is a popular algorithm for forming these groups without any supervision or training by the user. This will make a good final assignment since it's of the right scope, it solves a real-world problem, and it involves some interaction with a user to get information such as the desired number of clusters.

K-Means Clustering

We'll talk about this in more detail in class, but here's the short version: The user specifies how many clusters they want the algorithm to find. (That's the "K" in K-Means.) Choosing K is more of an art than a science, but once K has been specified, K points are selected (randomly or otherwise) to serve as the initial locations of the cluster centers. The following two steps are then repeated until a stable solution is found:
  1. Assign points to clusters: For each data point, compute the distances to each of the cluster centers and assign the point to the nearest center.
  2. Adjust the centers: Each center's location is updated to be the mean (the average in each dimension) of the points assigned to it.
There are a variety of ways to determine when to stop, but we'll go with a simple rule: Repeat the two steps above until the cluster centers stop moving. In other words, keep going until the center locations at the end of an iteration are the same as they were at the start (or awfully close).
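The two steps and the stopping rule can be sketched on plain lists of coordinates. This is only a minimal illustration under stated assumptions, not the starter code or the required design: every name in it (Coords, sqDist, nearestIndex, centroid, adjust, kMeans) is made up for this example, it identifies a cluster by its position in the list of centers, and it assumes that a center with no assigned points simply stays where it is.

```haskell
-- All names below are hypothetical; none come from the starter code.
type Coords = [Double]

-- Squared Euclidean distance between two coordinate lists.
sqDist :: Coords -> Coords -> Double
sqDist xs ys = sum (zipWith (\x y -> let d = x - y in d * d) xs ys)

-- Step 1: index of the center closest to a point (ties go to the lower index).
-- Assumes the list of centers is non-empty.
nearestIndex :: [Coords] -> Coords -> Int
nearestIndex centers p = snd (minimum (zip (map (sqDist p) centers) [0 ..]))

-- Mean of a non-empty list of points, computed dimension by dimension.
centroid :: [Coords] -> Coords
centroid ps = map (/ fromIntegral (length ps)) (foldr1 (zipWith (+)) ps)

-- Step 2: move each center to the mean of the points assigned to it.
-- Assumption: a center with no assigned points is left unchanged.
adjust :: [Coords] -> [Coords] -> [Coords]
adjust points centers = zipWith move [0 ..] centers
  where
    move i c = case filter (\p -> nearestIndex centers p == i) points of
                 []      -> c
                 members -> centroid members

-- Repeat both steps until no center moves by 0.0001 or more in any dimension.
kMeans :: [Coords] -> [Coords] -> [Coords]
kMeans points centers
  | settled   = centers'
  | otherwise = kMeans points centers'
  where
    centers'  = adjust points centers
    settled   = and (zipWith close centers centers')
    close a b = and (zipWith (\x y -> abs (x - y) < 0.0001) a b)
```

For example, in GHCi, kMeans [[1],[2],[9],[10]] [[0],[5]] evaluates to [[1.5],[9.5]]: the first pass moves the centers to 1.5 and 9.5, and the second pass leaves them there, so the loop stops.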

The data points can be in as many dimensions as we want. For example, maybe we have measurements on a bunch of penguins. We could use just their heights as data values, and cluster them in one dimension to look for groupings. Or we could plot their heights vs. their beak lengths, and look for clusters in 2D. If we knew their wing lengths, we could plot points in 3D where each point represents a single penguin and its coordinates correspond to its height, beak length, and wing length.

Your solution should be flexible enough to work with any number of dimensions. We will consult the user for the desired number of clusters, and also prompt them for the initial positions of the cluster centers. (This simplifies the assignment since we won't need to figure out how to generate random coordinates, and it also gives us more control during testing.) In the interests of simplifying the assignment, we'll also hard-code the data values to be clustered rather than reading them from a file. Haskell can do file I/O like any respectable language, but we don't have the time to learn the details.

Setting it up in Haskell

I'll give you some code to get you started. The first step is to define a data type to represent a point. It needs to store information about the cluster to which the point has been assigned, as well as its location in N-dimensional space. The data type below could be used to represent both the data points and the cluster centers:
-- Data type to hold information about a labeled data point. The constructor
-- takes a label and a list of Doubles that is the point's location. It's 
-- general enough that the label can be of any type, and the list of Doubles
-- can accommodate any number of dimensions in the data.

data LabeledPoint a = Point a [Double]
  deriving (Ord, Show)
When checking to see if the centers have stopped moving, it would be nice to have a definition of == that checks whether two points are close to the same location without being identical. (Checking doubles with == is a bad idea in any language.) Therefore, instead of deriving the default implementation of ==, we'll define our own version for a change:
-- Here we define our own version of == rather than asking for its default
-- implementation. Points are == if their coordinates are within .0001 across 
-- all dimensions.
    
instance Eq (LabeledPoint a) where
  Point _ ns == Point _ ms  =  and (map (\(n,m)->abs(n-m)<0.0001) (zip ns ms))
Make sure you understand how that function works, since similar techniques might be useful elsewhere in your implementation: The zip call pairs up the two points' coordinates. For example, if we had a point at [5,10,7] and one at [5,9,7], zipping the two lists would give [(5,5),(10,9),(7,7)]. The map turns that list of tuples into a list of booleans, [True, False, True] in this case. The and call returns True if all of the boolean values in the list are True. In this case our two hypothetical points would not be equal because their second coordinates are too far apart, but if those coordinates were within 0.0001 of each other the points would be considered equal. (The labels are ignored in the comparison.) Note that because == is defined on LabeledPoint, lists of points can be compared via ==, and the result will be True if the corresponding points in the two lists satisfy our definition of equality.
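One way to check your understanding is to load the definitions with a few test values and compare them. The sketch below repeats the data type and the Eq instance from above so that it stands alone; the points p1, p2, p3 and the list checks are invented for this example.

```haskell
data LabeledPoint a = Point a [Double]
  deriving (Ord, Show)

instance Eq (LabeledPoint a) where
  Point _ ns == Point _ ms  =  and (map (\(n,m)->abs(n-m)<0.0001) (zip ns ms))

-- Hypothetical test points: p1 and p2 differ by 1 in the second coordinate,
-- while p1 and p3 differ by only 0.00005 there.
p1, p2, p3 :: LabeledPoint String
p1 = Point "a" [5, 10, 7]
p2 = Point "b" [5, 9, 7]
p3 = Point "c" [5, 10.00005, 7]

-- Comparing single points, then whole lists of points.
checks :: [Bool]
checks = [p1 == p2, p1 == p3, [p1, p1] == [p3, p3]]
```

Evaluating checks in GHCi gives [False,True,True]. The second result is True even though the labels "a" and "c" differ, because the instance looks only at the coordinates.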

The starter code also contains a function that takes a list of data points and a list of centers, and prints them to the screen in a format that can be directly pasted into a Google Sheet and charted (as a scatter plot) if you want to see your clusters visually. Even if you don't end up using the function, it would be worth studying to see how it works. You might be able to borrow some techniques from it for use in your own code.

Specifics

I strongly recommend that you break your implementation into the following functions. You're welcome to take other approaches to decomposing the task, but the steps below will give you a specific framework to follow, and break it into reasonably sized pieces. When defining the functions below, make an effort to use map, filter, and foldr1, since they will simplify your programming tasks and make the resulting definitions shorter and easier to read. If you choose a different organization for your implementation, keep as many of the functions on the "pure side" as possible. The only two functions below that require side effects are the readCenters function that reads the initial cluster center points from the user, and the main function that does some additional input and, after clustering, prints the results.
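As an illustration of keeping work on the pure side, input handling can be split so that only a thin wrapper touches IO. The sketch below is not the readCenters function itself: parseCoords and promptCoords are hypothetical names, and it assumes the user types one point per line as numbers separated by spaces.

```haskell
-- Hypothetical helpers, not part of the starter code.

-- Pure: turn a line such as "1.5 2 3" into a list of coordinates.
-- read fails at run time on malformed input; there is no error handling here.
parseCoords :: String -> [Double]
parseCoords line = map read (words line)

-- Impure: print a prompt, read one line, and hand it to the pure parser.
promptCoords :: String -> IO [Double]
promptCoords prompt = do
  putStrLn prompt
  line <- getLine
  return (parseCoords line)
```

Because parseCoords is pure, it can be tested directly in GHCi without typing any input: parseCoords "1.5 2 3" evaluates to [1.5,2.0,3.0].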

Submitting

To make it easier for me to test your submissions, I'm asking you to submit via Canvas this time around rather than pasting your code into Gradescope. Please submit only your .hs file, and make sure it's set up to use the list of 2D data points, points_2D.


Brad Richards, 2024