CS 361: Algorithms and Data Structures
Homework 8
Due date: 11/28/18 by 11:59pm



Congratulations! You've made it to the final homework assignment for CS361. This assignment brings together many of the algorithms/ideas that we've studied throughout the semester.

In this assignment, we will be analyzing the MovieLens data set (ml-latest-small) which contains 100k movie ratings from 671 users collected over a 20-year period (from 1995 to 2016).

The goal of this assignment is to create a graph using the MovieLens data (where the nodes are movies) and a text-based interface that allows the user to explore the graph. Ideally, this graph could be used as a recommender system (i.e. to recommend movies) and you will find some extra credit options at the end if you wish to take the assignment a step further and provide movie recommendations.

This assignment is entirely a programming assignment and it is recommended (but not required) that you work with a partner. This is a substantial assignment so please do not wait until the last minute to start.

Starter Code

Click here for a zipped directory containing all the necessary files for this assignment. The files are divided into the following packages:

  1. analyzer - The analyzer package contains the main controller class for this assignment, MovieLensAnalyzer.java. This class contains a main() method that should build the graph and then allow the user to explore it.

  2. data - The data package contains two classes for storing the MovieLens data: Movie.java and Reviewer.java. You do not need to modify these classes but feel free to take a look at them.

  3. graph - The graph package contains all classes that relate to graphs. You should add your Graph.java and GraphIfc.java to this package. In addition, there is a GraphAlgorithms.java class in which you will implement Dijkstra's algorithm and the Floyd-Warshall algorithm.

  4. util - The util package contains all other classes that are necessary including a DataLoader.java file that reads in and parses the MovieLens data. Please review the public methods of this class since you will need to call them to get the data. You should add your PriorityQueue.java file to this package. Note that the Pair.java class is already provided for your priority queue.
To summarize, you will need to implement two files: MovieLensAnalyzer.java and GraphAlgorithms.java

MovieLens Analyzer

Inside the MovieLensAnalyzer.java class should be a main method that allows the user to explore the MovieLens data. The user should specify the filenames for loading the data at the command line. When the main method is run, here is what the user should see:


This welcome message asks the user how they want to build a graph from the MovieLens data. The nodes of the graph should always be movies, but there are many ways to define the edges. You should come up with at least two (2) different options for defining what it means for movies u and v to be adjacent in the graph. You are free to use the options I came up with above or to experiment with others. Note: The first two options produce an undirected graph, which I implement using a directed graph structure (by adding an edge from u to v and another from v to u).

Here is what should be printed after choosing an option:


If the user chooses option 1, you should print out the following information about the graph:

  • The number of nodes
  • The number of edges
  • The density of the graph defined as D = E / (V*(V-1)) for a directed graph
  • The maximum degree (i.e. the largest number of outgoing edges of any node)
  • The diameter of the graph (i.e. the longest shortest path)
  • The average length of the shortest paths in the graph
These last two (diameter and average) require you to compute the shortest path between all pairs of nodes in the graph using the Floyd-Warshall algorithm.
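
As a rough illustration of the statistics listed above, here is a minimal sketch. It does not use the course's Graph class; instead it assumes a plain adjacency-list representation (a List<List<Integer>> with nodes numbered 0 to V-1) and a distance matrix like the one Floyd-Warshall produces. The class name GraphStats, the INF sentinel, and the choice to skip unreachable pairs when computing the diameter and the average are all assumptions of this sketch, not requirements of the assignment.

```java
import java.util.Arrays;
import java.util.List;

final class GraphStats {
    /** Sentinel meaning "no path"; an assumption of this sketch. */
    static final int INF = Integer.MAX_VALUE / 2;

    /** Density of a directed graph: D = E / (V * (V - 1)). */
    static double density(int numNodes, int numEdges) {
        if (numNodes < 2) {
            return 0.0;
        }
        return (double) numEdges / ((double) numNodes * (numNodes - 1));
    }

    /** Largest number of outgoing edges of any node. */
    static int maxDegree(List<List<Integer>> adj) {
        int max = 0;
        for (List<Integer> neighbors : adj) {
            max = Math.max(max, neighbors.size());
        }
        return max;
    }

    /** Longest finite shortest path; unreachable pairs are skipped. */
    static int diameter(int[][] dist) {
        int longest = 0;
        for (int i = 0; i < dist.length; i++) {
            for (int j = 0; j < dist.length; j++) {
                if (i != j && dist[i][j] < INF) {
                    longest = Math.max(longest, dist[i][j]);
                }
            }
        }
        return longest;
    }

    /** Mean length of the finite shortest paths between distinct nodes. */
    static double averagePathLength(int[][] dist) {
        long sum = 0;
        long count = 0;
        for (int i = 0; i < dist.length; i++) {
            for (int j = 0; j < dist.length; j++) {
                if (i != j && dist[i][j] < INF) {
                    sum += dist[i][j];
                    count++;
                }
            }
        }
        return count == 0 ? 0.0 : (double) sum / count;
    }

    public static void main(String[] args) {
        // Path graph 0 - 1 - 2, stored as directed edges in both directions.
        List<List<Integer>> adj = Arrays.asList(
                Arrays.asList(1), Arrays.asList(0, 2), Arrays.asList(1));
        int[][] dist = {{0, 1, 2}, {1, 0, 1}, {2, 1, 0}};
        System.out.println("density  = " + density(3, 4));
        System.out.println("maxDeg   = " + maxDegree(adj));
        System.out.println("diameter = " + diameter(dist));
        System.out.println("average  = " + averagePathLength(dist));
    }
}
```

Whether disconnected pairs should be skipped or reported separately is a design choice; if you handle them differently, say so in your output.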

If the user chooses option 2, you should print out information about the specified node. You can use the toString() method in the Movie.java class but you should also add code to print out all of the neighbors of the specified node.

If the user chooses option 3, you should ask the user for a starting node and an ending node. Then use Dijkstra's algorithm to find the shortest path between the two nodes. You should print out the shortest path for the user.
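
To print the path for option 3, you walk the parent array backwards from the ending node to the starting node. Here is a minimal sketch of that step, assuming nodes are numbered 0 to V-1 and that -1 marks a node with no parent; both conventions, and the names PathPrinter and buildPath, are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class PathPrinter {
    /** Marker for "no parent"; an assumption of this sketch. */
    static final int NO_PARENT = -1;

    /**
     * Follows parent pointers from target back to source.
     * Returns the path from source to target, or an empty list if the
     * target cannot be reached from the source.
     */
    static List<Integer> buildPath(int[] prev, int source, int target) {
        List<Integer> path = new ArrayList<>();
        int node = target;
        // A valid path visits at most prev.length nodes, so bound the walk.
        for (int steps = 0; steps < prev.length && node != NO_PARENT; steps++) {
            path.add(node);
            if (node == source) {
                Collections.reverse(path);
                return path;
            }
            node = prev[node];
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        // Parents for the path 0 -> 1 -> 2; node 3 is unreachable.
        int[] prev = {NO_PARENT, 0, 1, NO_PARENT};
        System.out.println(buildPath(prev, 0, 2)); // prints [0, 1, 2]
        System.out.println(buildPath(prev, 0, 3)); // prints []
    }
}
```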

Continue printing the menu and letting the user choose options until they choose option 4 which should cause your program to quit. Click here to see a full session.

You should add other (probably static) methods to your MovieLensAnalyzer class in addition to the main method. For example, a method that uses DataLoader to read in the data, a method that constructs the graph according to the user's choice, a method to print out different messages to the console etc. In particular, it's better to break your code up into small methods instead of having a single giant main method with dozens of lines of code.


Graph Algorithms

In the class GraphAlgorithms.java you should implement both Dijkstra's algorithm and the Floyd-Warshall algorithm. Even though the graph is unweighted, I still want you to implement Dijkstra's algorithm. Just assume each edge has a weight of 1.

Dijkstra's algorithm should return the array of parent nodes because you'll need to print out the actual path for the user. The Floyd-Warshall algorithm, however, can simply return the path costs. Here is my recommendation for the definition of both methods:

  • public static int[][] floydWarshall(Graph<Integer> graph)
  • public static int[] dijkstrasAlgorithm(Graph<Integer> graph, int source)

The floydWarshall method takes in a graph and returns a two-dimensional array of integers. The entry in spot [i][j] should be the length of the shortest path from node i to node j. Dijkstra's algorithm takes in a graph and a source and returns an integer array. This array is the "prev" data structure we talked about in class. The i-th element in the array is the parent of node i on the shortest path from the source.
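
For orientation, here is a minimal sketch of both algorithms on a unit-weight graph. It is not written against the course's Graph or PriorityQueue classes: it assumes a plain adjacency list (List<List<Integer>>, nodes numbered 0 to V-1), uses java.util.PriorityQueue with stale-entry skipping instead of a decrease-key operation, and uses -1 for "no parent" and INF for "no path". All of those choices, and the class name ShortestPathSketch, are assumptions of the sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

final class ShortestPathSketch {
    /** Sentinel for "no path"; INF + INF still fits in an int. */
    static final int INF = Integer.MAX_VALUE / 2;

    /** Returns dist where dist[i][j] is the shortest path length from i to j. */
    static int[][] floydWarshall(List<List<Integer>> adj) {
        int n = adj.size();
        int[][] dist = new int[n][n];
        for (int i = 0; i < n; i++) {
            Arrays.fill(dist[i], INF);
            dist[i][i] = 0;
            for (int j : adj.get(i)) {
                if (j != i) {
                    dist[i][j] = 1; // every edge has weight 1
                }
            }
        }
        for (int k = 0; k < n; k++) {
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    if (dist[i][k] + dist[k][j] < dist[i][j]) {
                        dist[i][j] = dist[i][k] + dist[k][j];
                    }
                }
            }
        }
        return dist;
    }

    /** Returns prev where prev[i] is the parent of i on a shortest path from source. */
    static int[] dijkstra(List<List<Integer>> adj, int source) {
        int n = adj.size();
        int[] dist = new int[n];
        int[] prev = new int[n];
        Arrays.fill(dist, INF);
        Arrays.fill(prev, -1);
        dist[source] = 0;
        // Queue entries are {distance, node}, ordered by distance.
        PriorityQueue<int[]> queue =
                new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        queue.add(new int[] {0, source});
        while (!queue.isEmpty()) {
            int[] top = queue.poll();
            int d = top[0];
            int u = top[1];
            if (d > dist[u]) {
                continue; // stale entry: a shorter distance was already found
            }
            for (int v : adj.get(u)) {
                if (d + 1 < dist[v]) {
                    dist[v] = d + 1;
                    prev[v] = u;
                    queue.add(new int[] {dist[v], v});
                }
            }
        }
        return prev;
    }

    public static void main(String[] args) {
        // Edges 0 <-> 1 and 1 <-> 2; node 3 is isolated.
        List<List<Integer>> adj = Arrays.asList(
                Arrays.asList(1), Arrays.asList(0, 2),
                Arrays.asList(1), Arrays.<Integer>asList());
        System.out.println(Arrays.toString(dijkstra(adj, 0)));
        System.out.println(floydWarshall(adj)[0][2]);
    }
}
```

Your own version should work with your Graph class and your PriorityQueue, so treat this only as a reference for the overall shape of each method.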


Final Tips

The starter code for this assignment contains the directory ml-latest-small with the actual data files:
  • README.txt - Describes the data set
  • movies.csv - A list of each movie (includes title and genres)
  • ratings.csv - The movie ratings
  • tags.csv - User generated tags
I have modified these files to make them more manageable. If you want the original, untouched dataset (which has the Sharknado movies in it), let me know and I will send them to you.

It should take less than a minute to build the graph! My code takes around 25 seconds to build the graph choosing option 1. If your code is taking longer, then chances are you're inefficiently determining which node should be connected to which node in the graph.
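
One way to avoid comparing every pair of movies is to loop over reviewers instead and count, for each reviewer, the pairs of movies that reviewer rated. The sketch below illustrates the idea under a hypothetical edge rule (two movies are adjacent if at least minShared reviewers rated both). The input map, the method name, and the edge rule are assumptions made for illustration; they are not the DataLoader API or the options from my welcome message.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CoRatingGraphSketch {
    /**
     * moviesByReviewer maps a reviewer id to the ids of the movies that
     * reviewer rated (assumed non-negative, each movie listed once).
     * Returns adjacency sets containing an edge in both directions for
     * every pair of movies rated by at least minShared reviewers.
     * Movies that end up with no edges do not appear in the result.
     */
    static Map<Integer, Set<Integer>> build(
            Map<Integer, List<Integer>> moviesByReviewer, int minShared) {
        // Count shared reviewers per movie pair; the pair (a, b) with a < b
        // is packed into a single long key.
        Map<Long, Integer> sharedCounts = new HashMap<>();
        for (List<Integer> movies : moviesByReviewer.values()) {
            for (int i = 0; i < movies.size(); i++) {
                for (int j = i + 1; j < movies.size(); j++) {
                    int a = Math.min(movies.get(i), movies.get(j));
                    int b = Math.max(movies.get(i), movies.get(j));
                    if (a == b) {
                        continue;
                    }
                    long key = ((long) a << 32) | b;
                    sharedCounts.merge(key, 1, Integer::sum);
                }
            }
        }
        Map<Integer, Set<Integer>> adj = new HashMap<>();
        for (Map.Entry<Long, Integer> entry : sharedCounts.entrySet()) {
            if (entry.getValue() < minShared) {
                continue;
            }
            long key = entry.getKey();
            int a = (int) (key >>> 32);
            int b = (int) key;
            adj.computeIfAbsent(a, k -> new HashSet<>()).add(b);
            adj.computeIfAbsent(b, k -> new HashSet<>()).add(a);
        }
        return adj;
    }
}
```

The work here is proportional to the sum, over reviewers, of the square of the number of movies each one rated, which for this data set is typically far less than intersecting reviewer sets for every pair of movies.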

You should test your code as you write it! A good idea is to test each class using a main method. You can create a smaller movies file by using just the first 10 or 100 lines of movies.csv. Or you could draw a small graph by hand on paper and check that your implementations of Dijkstra's algorithm and the Floyd-Warshall algorithm return the correct answers.



Extra Credit Ideas:
  • [Recommender System] An interesting use of this data would be a system that could recommend movies to a user based upon the movies they have liked in the past. This is called a recommender system.

    One idea would be to add a new option to your menu labeled "Recommend movie". If the user chooses this option, you could ask them to enter the integer ids of 2 movies they have watched and liked. Given this information, you could find the shortest path between the movies and return the intermediate nodes as recommendations. Or a more sophisticated system might ask the user to enter as many movies as they want and then, given this information:

    • For each liked movie, look at its neighbors in the graph. A movie that is a neighbor to multiple "liked movies" might make a good recommendation.
    • For each pair of liked movies, find the shortest path between them. A movie that is an intermediate node on multiple shortest paths might make a good recommendation.

    Or feel free to come up with your own idea for recommending a movie given the user's preferences.

  • [Weighted Graph] Modify your code so that it produces and works on a weighted graph (instead of an unweighted graph). For example, two movies are connected by an edge whose weight is the number of users that watched both movies. This would require you to modify your Graph.java file along with your Dijkstra's algorithm and Floyd-Warshall algorithm.

  • [Exploring the Graph] The graph is quite large, and without any a priori knowledge it's hard to find movies and rare to guess two nodes that actually have a path between them. One nice idea would be to add some options that allow the user to navigate through the graph and find movies. For example,

    • An option that would allow the user to type in a string and then return all movies (along with their movie ids) whose titles contain that string.
    • An option that would choose and print an "interesting path" for the user. For example, you might choose a path that involves the node with the highest degree (e.g. the node/movie with the highest degree was watched the most often and thus should be interesting to a general audience). Or you could try to find paths that connect two nodes with low degree. Or you could try to find paths that do not go through the node with the highest degree.
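
As one possible starting point for the recommender idea above (a movie that is a neighbor of multiple liked movies might make a good recommendation), here is a minimal sketch that counts, for each candidate movie, how many liked movies it is adjacent to. The adjacency map, the class name NeighborVoteSketch, and the tie-breaking rule (smaller movie id first) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class NeighborVoteSketch {
    /**
     * Ranks movies by how many of the liked movies they are adjacent to.
     * Liked movies themselves are never recommended. Returns at most
     * limit movie ids, best candidates first.
     */
    static List<Integer> recommend(Map<Integer, Set<Integer>> adj,
                                   Collection<Integer> liked, int limit) {
        Set<Integer> likedSet = new HashSet<>(liked);
        Map<Integer, Integer> votes = new HashMap<>();
        for (int movie : likedSet) {
            Set<Integer> neighbors =
                    adj.getOrDefault(movie, Collections.<Integer>emptySet());
            for (int neighbor : neighbors) {
                if (!likedSet.contains(neighbor)) {
                    votes.merge(neighbor, 1, Integer::sum);
                }
            }
        }
        List<Integer> ranked = new ArrayList<>(votes.keySet());
        // Most votes first; ties broken by smaller movie id.
        ranked.sort((a, b) -> {
            int byVotes = Integer.compare(votes.get(b), votes.get(a));
            return byVotes != 0 ? byVotes : Integer.compare(a, b);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(limit, ranked.size())));
    }
}
```

For example, if movie 1 is adjacent to movies 2 and 3, and movie 4 is adjacent to movies 3 and 5, then liking movies 1 and 4 ranks movie 3 first because it neighbors both.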


Submission Instructions

Your Java code should be submitted in a zipped directory and uploaded to Moodle. The directory should contain all necessary Java files. You and your partner only need to submit one directory. Please make sure that you put both of your names in the Javadoc comments at the top of your MovieLensAnalyzer.java file so I know who you worked with.

Your code will be graded on the following:

  • The functionality of your code. Your code should compile with no errors and it should run without throwing any runtime exceptions. I will also run your code on some small test graphs that I have and check that your code returns the correct answers for both option 1 (printing out information about the graph) and option 3 (shortest paths).

  • Your adherence to the course style guide

  • The running time of your methods. This matters, for example, when you are building the graph: if you are not thoughtful about how you build it, it will take a very long time. Also, your implementations of Dijkstra's algorithm and the Floyd-Warshall algorithm should not be asymptotically slower than the standard running times of those algorithms.



  • Last modified: Fri Jan 24 10:58:47 PST 2014