A data set on Kaggle

kaggle.com is a website where you can find a rich collection of data sets for data science and machine learning. For this homework exercise we are going to use one of the data sets posted on Kaggle for a text classification task.

The data set we will use is a collection of comments scraped from GitHub. Many popular projects on GitHub attract comments, and this data set consists of a collection of those comments, where each comment is categorized as either a feature request, a bug report, or a question about the project. For this homework you will construct a simple language model to categorize comments into one of these three categories.

For your convenience I have placed an archive containing the data on the course website at

http://www.lawrence.edu/fast/greggj/CMSC490/github.zip

Finding useful code

When you visit the page for this data set on Kaggle you will see a Code tab. Under that tab you will find a list of notebooks that Kaggle users have posted containing code for this data set.

A common problem that everyone working with this data set has to deal with is cleaning the text. This data set is especially dirty, because it contains HTML markup, snippets of code in various programming languages, and other material that you don't typically find in ordinary text. For this exercise I suggest you take a look at some of the notebooks that users have posted for this problem and pay particular attention to the methods they use to clean up the input text.
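
To give a sense of what that cleaning might look like, here is a minimal sketch of the kind of cleaning function you might write. The regular expressions are illustrative assumptions on my part, not code taken from any particular notebook:

import re

def clean_text(text):
    # Strip fenced code blocks, HTML tags, and URLs (patterns are rough assumptions).
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"https?://\S+", " ", text)
    # Keep letters only, collapse runs of whitespace, and lowercase.
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()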

What you need to do

Your challenge in this assignment is to write code that loads the data set, cleans it, and then passes the cleaned data to a simple neural network model that solves the classification problem.

You are welcome to model your network on one of the examples the author develops in chapter 11 to do sentiment analysis on the movie reviews.
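
As a point of reference, here is a minimal sketch of what such a model might look like, in the spirit of the chapter 11 bag-of-words examples. The vocabulary size and layer sizes are assumptions you will want to tune, and train_texts is a placeholder for whatever array holds your training comments:

from tensorflow import keras
from tensorflow.keras import layers

# Multi-hot bag-of-words encoding; max_tokens=20000 is an assumption to tune.
text_vectorization = layers.TextVectorization(max_tokens=20000, output_mode="multi_hot")
text_vectorization.adapt(train_texts)  # train_texts: your array of training comments

inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorization(inputs)
x = layers.Dense(16, activation="relu")(x)
x = layers.Dropout(0.5)(x)
# Three output units, one per category; assumes integer-encoded labels 0, 1, 2.
outputs = layers.Dense(3, activation="softmax")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

Training is then a matter of calling model.fit with the training and validation datasets described in the hint below.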

You should aim for an accuracy of 75% or higher.

Helpful hint

If you use some of the cleaning code from Kaggle you will end up with the data for your model stored in a pandas dataframe. You should extract the text and labels from this dataframe, put them in a pair of NumPy arrays, and then slice those arrays into training, validation, and test sets. You will then have the problem of setting up a tf.data Dataset object that can feed data from those arrays to your model. Here is some example code to demonstrate how to do that:

import tensorflow as tf
train_ds = tf.data.Dataset.from_tensor_slices((X_train, Y_train)).batch(64)
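
To fill in the steps leading up to that line, here is a sketch of the extraction and slicing described above. The column names "text" and "class", and the 80/10/10 split, are assumptions on my part; check them against the actual dataframe:

import numpy as np
import tensorflow as tf

texts = df["text"].to_numpy()
# Integer-encode the three category labels as 0, 1, and 2.
_, labels = np.unique(df["class"].to_numpy(), return_inverse=True)

# Shuffle once so each slice draws from the whole data set.
order = np.random.permutation(len(texts))
texts, labels = texts[order], labels[order]

n = len(texts)
X_train, Y_train = texts[:int(0.8 * n)], labels[:int(0.8 * n)]
X_val, Y_val = texts[int(0.8 * n):int(0.9 * n)], labels[int(0.8 * n):int(0.9 * n)]
X_test, Y_test = texts[int(0.9 * n):], labels[int(0.9 * n):]

train_ds = tf.data.Dataset.from_tensor_slices((X_train, Y_train)).batch(64)
val_ds = tf.data.Dataset.from_tensor_slices((X_val, Y_val)).batch(64)
test_ds = tf.data.Dataset.from_tensor_slices((X_test, Y_test)).batch(64)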