CS 291 Assignment #6

Due Monday, March 18th
(Must be submitted by 1pm on March 18th!)
Not accepted late

The Assignment

For your final Haskell assignment you'll implement a program to do K-Means clustering of data points. Data scientists use clustering algorithms to automatically group data points into clusters of related points, and K-Means is a popular algorithm for forming these groups without any supervision or training by the user. This will make a good final assignment since it's of the right scope, it solves a real-world problem, and it involves some interaction with a user to get information such as the desired number of clusters.

K-Means Clustering

We'll talk about this in more detail in class, but here's the short version: The user specifies how many clusters they want the algorithm to find. (That's the "K" in K-Means.) Choosing K is more of an art than a science, but once K has been specified, K points are selected (randomly or otherwise) to serve as the initial locations of the cluster centers. The following two steps are then repeated until a stable solution is found:

Assign points to clusters: For each data point, compute the distances to each of the cluster centers and assign the point to the nearest center.
Adjust the centers: Each center's location is updated to be at the center of the points assigned to it.

There are a variety of ways to determine when to stop, but we'll go with a simple rule: Repeat the two steps above until the cluster centers stop moving. In other words, keep going until the center locations at the end of an iteration are the same as they were at the start (or awfully close).

The data points can be in as many dimensions as we want. For example, maybe we have measurements on a bunch of penguins. We could use just their heights as data values, and cluster them in one dimension to look for groupings. Or we could plot their heights vs. their beak lengths, and look for clusters in 2D. If we knew their wing lengths, we could plot points in 3D where each point represents a single penguin and its coordinates correspond to its height, beak length, and wing length.

Your solution should be flexible enough to work with any number of dimensions. We will consult the user for the desired number of clusters, and also prompt them for the initial positions of the cluster centers. (This simplifies the assignment since we won't need to figure out how to generate random coordinates, and it also gives us more control during testing.) In the interests of simplifying the assignment, we'll also hard-code the data values to be clustered rather than reading them from a file. Haskell can do file I/O like any respectable language, but we don't have the time to learn the details.

Setting it up in Haskell

I'll give you some code to get you started. The first step is to define a data type to represent a point. It needs to store information about the cluster to which the point has been assigned, as well as its location in N-dimensional space. The data type below could be used to represent both the data points and the cluster centers:

-- Data type to hold information about a labeled data point. The constructor
-- takes a label and a list of Doubles that is the point's location. It's 
-- general enough that the label can be of any type, and the list of Doubles
-- can accommodate any number of dimensions in the data.

data LabeledPoint a = Point a [Double]
  deriving (Ord, Show)

When checking to see if the centers have stopped moving, it would be nice to have a definition of == that checks whether two points are close to the same location without being identical. (Checking doubles with == is a bad idea in any language.) Therefore, instead of deriving the default implementation of ==, we'll define our own version for a change:

-- Here we define our own version of == rather than asking for its default
-- implementation. Points are == if their coordinates are within .0001 across 
-- all dimensions.
    
instance Eq (LabeledPoint a) where
  Point _ ns == Point _ ms  =  and (map (\(n,m)->abs(n-m)<0.0001) (zip ns ms))

Make sure you understand how that function works, since similar techniques might be useful elsewhere in your implementation: The zip call pairs up the two points' coordinates. For example, if we had a point at [5,10,7] and one at [5,9,7], zipping the two lists would give [(5,5),(10,9),(7,7)]. The map turns that list of tuples into a list of booleans — [True, False, True] in this case. The and call returns true if all of the boolean values in the list are True. In this case our two hypothetical points would not be equal because their second dimensions are too far apart, but if they were within 0.0001 of each other they would be considered equal. Note that because == is defined on LabeledPoint, lists of points can be compared via == and it will return true if the points in the lists satisfy our definition of equality.
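
If you'd like to see those intermediate values for yourself, here is a small sketch that runs the same three steps on plain coordinate lists. The names closeEnough, pairsDemo, flagsDemo, and resultDemo are made up for illustration and are not part of the starter code.

```haskell
-- Illustrative only: these names are not part of the starter code.

-- The same test used in the Eq instance, applied to plain coordinate lists.
closeEnough :: [Double] -> [Double] -> Bool
closeEnough ns ms = and (map (\(n, m) -> abs (n - m) < 0.0001) (zip ns ms))

-- Step 1: zip pairs up the coordinates.
pairsDemo :: [(Double, Double)]
pairsDemo = zip [5, 10, 7] [5, 9, 7]

-- Step 2: map turns each pair into a Bool.
flagsDemo :: [Bool]
flagsDemo = map (\(n, m) -> abs (n - m) < 0.0001) pairsDemo

-- Step 3: and is True only when every flag is True.
resultDemo :: Bool
resultDemo = and flagsDemo
```

In ghci, pairsDemo shows [(5.0,5.0),(10.0,9.0),(7.0,7.0)], flagsDemo shows [True,False,True], and resultDemo is False. One thing to be aware of: zip stops at the end of the shorter list, so two points with different numbers of coordinates are compared only on the dimensions they share.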

The starter code also contains a function that takes a list of data points and a list of centers, and prints them to the screen in a format that can be directly pasted into a Google Sheet and charted (as a scatter plot) if you want to see your clusters visually. Even if you don't end up using the function, it would be worth studying to see how it works. You might be able to borrow some techniques from it for use in your own code.

Specifics

I am strongly recommending that you break your implementation into the following functions. You're welcome to take other approaches to decomposing the task, but the steps below will give you a specific framework to follow, and break it into reasonably sized pieces. When defining the functions below, make an effort to use map, filter, and foldr1. They will simplify your programming tasks and make the resulting definitions shorter and easier to read. If you choose a different organization for your implementation, keep as many of the functions on the "pure side" as possible. The only two functions below that require side effects are the readCenters function that reads the initial cluster center points from the user, and the main function that does some additional input and, after clustering, prints the results.
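
As a quick refresher on map, filter, and foldr1, here is a sketch on toy data that has nothing to do with clustering. All of the names below (temps, fahrenheit, warm, closestTo) are made up for the example.

```haskell
-- Illustrative only: toy data and made-up names, not part of the assignment.

temps :: [Double]
temps = [12.5, 30.0, 18.0, 25.5]

-- map: apply a function to every element (Celsius to Fahrenheit here).
fahrenheit :: [Double]
fahrenheit = map (\c -> c * 9 / 5 + 32) temps

-- filter: keep only the elements that pass a test.
warm :: [Double]
warm = filter (> 20) temps

-- foldr1: combine the elements of a non-empty list pairwise.
-- Here the combining function keeps whichever value is closer to the target.
closestTo :: Double -> [Double] -> Double
closestTo target = foldr1 pick
  where
    pick x y
      | abs (x - target) <= abs (y - target) = x
      | otherwise                            = y
```

In ghci, warm evaluates to [30.0,25.5] and closestTo 20 temps evaluates to 18.0. Keep in mind that foldr1 raises an error on an empty list, so it only suits cases where you know there is at least one element.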

distance: Takes two LabeledPoint values and returns the square of the distance between them. This will be used to determine the distance between a point and a cluster center, when picking the closest center. You could calculate the actual distance rather than the square of the distance, but all we really care about is which center is closest, and the squared distance will work just fine for that without requiring us to use sqrt, which is computationally expensive. This function should work no matter how many dimensions the points have, though you can assume they all have the same number of dimensions. Here are some sample interactions:
```
> distance (Point "foo" [5]) (Point "bar" [7])
4.0
> distance (Point "us" [0,0]) (Point "them" [10,-10])
200.0
> distance (Point 1 [2,5,3]) (Point 2 [4,8,7])
29.0
```
nearest: Takes a point and a list of points (cluster centers) and returns the label of the nearest point in the list. (The previous function will be helpful here.) You may assume the list contains at least one point. The interactions below only show 2D points, but the function should work with any number of dimensions. You may assume that all points have the same number of dimensions.
```
> nearest (Point "p" [10,10]) [Point "a" [0,2],Point "b" [12,15],Point "c" [-3,5]]
"b"
> nearest (Point "p" [10,10]) [Point "a" [0,2]]
"a"
```

relabel: Takes a list of data points and a list of cluster centers and reassigns the points to their nearest cluster centers. More specifically, it returns a list of points in the same order as the original data points but potentially with different labels — the label of the nearest cluster center. The interactions below only show 1D, but your function should work with any number of dimensions. Here again you may assume all of the points have the same number of dimensions. The interactions below show a list of points being relabeled against a variety of different centers.

```
> let points = [Point "x" [-2],Point "x" [1],Point "x" [15],Point "x" [5],Point "x" [17]]
> relabel points [Point "left" [0],Point "right" [12]]
[Point "left" [-2.0],Point "left" [1.0],Point "right" [15.0],Point "left" [5.0],Point "right" [17.0]]
> relabel points [Point "right" [12],Point "left" [0]]
[Point "left" [-2.0],Point "left" [1.0],Point "right" [15.0],Point "left" [5.0],Point "right" [17.0]]
> relabel points [Point "left" [0],Point "right" [8]]
[Point "left" [-2.0],Point "left" [1.0],Point "right" [15.0],Point "right" [5.0],Point "right" [17.0]]
> relabel points [Point "left" [0],Point "right" [0]]
[Point "right" [-2.0],Point "right" [1.0],Point "right" [15.0],Point "right" [5.0],Point "right" [17.0]]
```

center: Takes a list of points and returns the coordinates of the center of the group. In my implementation, this function returns a list of doubles rather than a point. You may assume that there's at least one point in the list, and that all points have the same number of dimensions, but it should work in any number of dimensions. In 2D, for example, it would return a list where the X coordinate (the head of the output list) is the average of the X coordinates across all points, and the Y coordinate (the second value in the list) is the average of all Y coordinates. It might be helpful to think about ways to build a list containing the nth coordinates from across all of the points. If you could extend that function to collect coordinate values and average them, you could then map that across the dimensions. (Hint: You'll probably need to use fromIntegral to make the types work out right in your average calculations.)
```
> center [Point 0 [10],Point 1 [12],Point 2 [14]]
[12.0]
> center [Point 0 [-5],Point 17 [5]]
[0.0]
> center [Point 4 [0,0],Point 5 [10,0],Point 6 [0,10],Point 7 [10,10]]
[5.0,5.0]
> center [Point "a" [5,5,5],Point "b" [10,0,15]]
[7.5,2.5,10.0]
```
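
To illustrate the fromIntegral hint, here is a sketch that averages a plain list of Doubles. The name average is made up for the example, and it assumes the list is non-empty.

```haskell
-- Illustrative only: average is a made-up helper name.
-- length returns an Int, and (/) needs both operands to be the same
-- Fractional type, so the count is converted with fromIntegral first.
-- Assumes a non-empty list.
average :: [Double] -> Double
average xs = sum xs / fromIntegral (length xs)
```

In ghci, average [10,12,14] evaluates to 12.0. Without the fromIntegral, the definition is rejected by the type checker because length produces an Int rather than a Double.
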
recenter: Takes a list of data items (assigned to various clusters) and a list of cluster centers and finds the new center points for each of the cluster centers. It returns a list of points (cluster centers) in the same order as the original but with potentially different coordinates. (Hint: You basically need to map the previous function over the cluster centers to implement this.) Note that it's possible that a center could end up with no points assigned to it. My code doesn't handle that case properly and yours doesn't need to either.
cluster: Takes a list of data points and a list of centers, and applies the K-Means algorithm to assign points to clusters. Given the helper functions above, this boils down to calling relabel and recenter over and over until the centers stop changing. The interactions below show how to do those steps manually. First we relabel the points (to get points'), then we recenter the cluster centers within their new groups (to get centers'). At this point, two of the points are associated with the first cluster and four are associated with the second. After another round of relabeling and recentering, we discover that the solution has converged: centers' is the same as centers''. (Remember that using == on two lists of centers uses LabeledPoint's definition of == for each of the points in the lists, so the two lists wouldn't have to be identical, just pretty darned close.)
```
> let points = [Point 0 [0], Point 0 [2], Point 0 [5], Point 0 [7], Point 0 [10], Point 0 [12]]
> let centers = [Point 0 [2.5], Point 1 [7.1]]
> let points' = relabel points centers
> points'
[Point 0 [0.0],Point 0 [2.0],Point 1 [5.0],Point 1 [7.0],Point 1 [10.0],Point 1 [12.0]]
> let centers' = recenter points' centers
> centers'
[Point 0 [1.0],Point 1 [8.5]]
> let points'' = relabel points' centers'
> let centers'' = recenter points'' centers'
> centers''
[Point 0 [1.0],Point 1 [8.5]]
> centers' == centers''
True
```
You'll want to write a single (probably recursive) function that repeats the relabel and recenter calls until the centers stop changing, then return a two-tuple containing both the labeled points and the final centers. Here's a call to cluster on the same inputs as were shown above:
```
> cluster points centers
([Point 0 [0.0],Point 0 [2.0],Point 1 [5.0],Point 1 [7.0],Point 1 [10.0],Point 1 [12.0]],[Point 0 [1.0],Point 1 [8.5]])
```
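
The "repeat until nothing changes" pattern is not specific to clustering. Here is a sketch of it on a toy numeric problem. The names untilStable, halfway, and settled are made up for illustration, and this is only one of several reasonable ways to structure such a recursion.

```haskell
-- Illustrative only: untilStable, halfway, and settled are made-up names.

-- Apply step repeatedly until the result is "the same" as the input,
-- where the caller decides what "the same" means.
untilStable :: (a -> a -> Bool) -> (a -> a) -> a -> a
untilStable same step x
  | same x x' = x'
  | otherwise = untilStable same step x'
  where
    x' = step x

-- A toy step: move halfway from the current value toward 10.
halfway :: Double -> Double
halfway x = x + (10 - x) / 2

-- Stops once successive values differ by less than 0.0001.
settled :: Double
settled = untilStable (\a b -> abs (a - b) < 0.0001) halfway 0
```

Here settled ends up within 0.0001 of 10. The comparison is passed in as an argument, so the same shape works with any notion of "close enough".
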
readCenters: Takes an integer (the number of desired center points to be entered) and prompts the user for the details of the points. It returns a list containing the LabeledPoint values specified by the user. In my code I don't put any restrictions on the number of dimensions in the coordinates, or even enforce that they're the same across points, but you should feel free to write more robust code here if you wish. Also, while the functions above are general enough that the point labels could be of any type, readCenters needs to make a choice. You can see from the type signature at the end of the interactions below that mine treats all labels as strings. (Hint: You can read entire lists of doubles rather than inputting the coordinates individually.) I've shown the inputs in blue below to make it clear which things are being typed by the user, but they won't really be blue in ghci.
```
> readCenters 2
Please enter a center label: 
foo
Please enter a list of coordinates: 
[3, 4,5]
Please enter a center label: 
bar
Please enter a list of coordinates: 
[-1.5,0,1.73]
[Point "foo" [3.0,4.0,5.0],Point "bar" [-1.5,0.0,1.73]]
> :type readCenters
readCenters :: Int -> IO [LabeledPoint String]
```
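
To illustrate the hint about reading entire lists, here is a sketch of turning a line of text into a list of Doubles. The helper names parseCoords and parseCoordsSafe are made up for the example. readMaybe comes from Text.Read in the base library, and using it is optional.

```haskell
import Text.Read (readMaybe)

-- Illustrative only: these helper names are not part of the starter code.

-- read can parse a whole list literal at once, given the result type.
-- It raises an error if the text is not a valid list of numbers.
parseCoords :: String -> [Double]
parseCoords s = read s

-- readMaybe returns Nothing instead of crashing on bad input.
parseCoordsSafe :: String -> Maybe [Double]
parseCoordsSafe = readMaybe
```

In ghci, parseCoords "[3, 4,5]" evaluates to [3.0,4.0,5.0], and parseCoordsSafe "oops" evaluates to Nothing. The string would typically come from getLine inside an IO action.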

main: Your solution is required to have a main function that starts a run of your K-Means code. I'm including the framework of a main function in the starter code that shows how it hard-codes in a set of data points to be clustered, prints information about the points, prompts the user for information, then runs the clustering function and reports the results. Here's a sample run of my implementation:

> main
Welcome to k-Means
There are 150 points to be labeled.
It looks like we're working with 2 dimension(s).
How many centers would you like?
3
Please enter a center label: 
one
Please enter a list of coordinates: 
[8,10]
Please enter a center label: 
two
Please enter a list of coordinates: 
[35, 5]
Please enter a center label: 
three
Please enter a list of coordinates: 
[17.5, 35]
Final centers are: 
[Point "one" [9.363265306122447,9.60204081632653],Point "two" [30.13199999999999,9.862],Point "three" [19.55686274509804,24.617647058823533]]
Labeled points: 
[Point "one" [12.1,7.2],Point "one" [8.7,11.0],Point "one" [6.1,3.9],Point "one" [11.7,8.2],Point "one" [7.6,8.5],Point "one" [4.3,11.0],Point "one" [7.5,11.7],Point "one" [6.8,6.3],Point "one" [8.6,14.4],Point "three" [17.0,17.6],Point "one" [12.8,16.3],Point "one" [4.2,10.9],Point "one" [2.8,6.6],Point "one" [2.6,15.0],Point "one" [10.1,10.0],Point "one" [2.0,15.6],Point "one" [7.5,8.4],Point "one" [10.3,11.7],Point "one" [14.2,7.8],Point "one" [9.8,3.7],Point "one" [8.6,9.1],Point "one" [9.2,8.2],Point "one" [9.8,7.3],Point "one" [9.6,9.6],Point "one" [10.2,14.5],Point "one" [9.1,9.0],Point "one" [13.0,9.7],Point "one" [6.3,9.4],Point "one" [10.2,10.1],Point "one" [11.1,13.8],Point "one" [9.4,12.0],Point "one" [13.1,5.2],Point "one" [5.9,12.9],Point "one" [12.1,9.3],Point "one" [3.7,13.3],Point "one" [10.8,10.0],Point "one" [9.1,3.2],Point "one" [12.2,-1.1],Point "one" [10.0,10.0],Point "one" [13.9,11.2],Point "one" [9.4,10.0],Point "one" [14.2,7.6],Point "one" [10.7,4.6],Point "one" [9.2,10.4],Point "one" [9.8,10.7],Point "one" [11.6,9.5],Point "one" [13.7,8.3],Point "one" [10.5,11.6],Point "one" [11.2,10.6],Point "one" [11.5,12.3],Point "three" [18.5,26.9],Point "three" [19.9,25.2],Point "three" [18.7,24.2],Point "three" [18.2,33.3],Point "three" [24.1,25.8],Point "three" [18.8,25.9],Point "three" [17.1,24.5],Point "three" [17.0,26.2],Point "three" [20.3,25.0],Point "three" [20.3,25.0],Point "three" [22.0,27.3],Point "three" [21.5,25.8],Point "three" [22.1,26.6],Point "three" [25.6,28.9],Point "three" [13.9,20.7],Point "three" [26.0,27.2],Point "three" [19.9,25.0],Point "three" [10.2,26.0],Point "three" [19.9,24.8],Point "three" [20.3,24.3],Point "three" [19.0,28.7],Point "three" [17.0,22.1],Point "three" [15.2,22.4],Point "three" [21.0,22.2],Point "three" [19.1,28.5],Point "three" [22.6,24.0],Point "three" [18.2,26.4],Point "three" [20.0,25.1],Point "three" [19.3,18.8],Point "three" [20.7,21.3],Point "three" [17.3,24.4],Point "three" [24.5,19.7],Point "three" [12.3,21.6],Point "three" [16.4,25.6],Point "three" [22.0,26.4],Point "three" [18.1,24.1],Point "three" [19.0,23.8],Point "three" [21.1,31.5],Point "three" [17.4,24.0],Point "three" [20.1,24.9],Point "three" [17.7,22.0],Point "three" [20.6,29.3],Point "three" [21.2,29.7],Point "three" [19.8,18.9],Point "three" [21.4,18.8],Point "three" [23.1,23.5],Point "three" [21.2,24.5],Point "three" [21.0,24.5],Point "three" [19.9,23.9],Point "three" [19.9,18.7],Point "two" [25.9,14.2],Point "two" [26.5,12.4],Point "two" [32.3,10.3],Point "two" [26.0,5.7],Point "two" [22.7,6.0],Point "two" [30.1,13.0],Point "two" [33.8,8.1],Point "two" [29.9,9.3],Point "two" [38.9,10.3],Point "two" [30.7,10.5],Point "two" [35.2,1.3],Point "two" [28.4,14.8],Point "two" [30.8,8.2],Point "two" [26.3,16.8],Point "two" [24.3,7.6],Point "two" [32.1,14.4],Point "two" [31.2,8.6],Point "two" [31.4,9.3],Point "two" [31.3,10.5],Point "two" [35.4,13.3],Point "two" [30.0,10.7],Point "two" [29.1,10.2],Point "two" [25.2,6.0],Point "two" [30.3,10.0],Point "two" [30.0,2.9],Point "two" [28.7,0.3],Point "two" [29.2,8.8],Point "two" [25.4,-1.5],Point "two" [24.0,4.5],Point "two" [30.2,14.3],Point "two" [33.5,11.2],Point "two" [31.9,10.7],Point "two" [28.1,10.4],Point "two" [29.9,8.7],Point "two" [31.6,10.4],Point "two" [29.3,8.5],Point "two" [29.3,13.7],Point "two" [27.2,11.3],Point "two" [33.0,7.7],Point "two" [30.0,10.0],Point "two" [28.4,15.1],Point "two" [37.1,13.3],Point "two" [37.6,13.1],Point "two" [29.6,10.7],Point "two" [31.9,14.2],Point "two" [29.5,9.7],Point "two" [30.1,9.5],Point "two" [29.6,9.6],Point "two" [34.1,14.7],Point "two" [29.6,9.8]]

I've put together snapshots of the clustering process on a separate page.

Submitting:

To make it easier for me to test your submissions, I'm asking you to submit via Canvas this time around rather than pasting your code into Gradescope. Please submit only your .hs file, and make sure it's set up to use the list of 2D data points, points_2D.

Brad Richards, 2024