Homework 1 – Classification




SI 630: Homework 1 – Classification
Due: Wednesday, January 31, 5:30pm
1 Introduction Homework1willintroduceyoutobuildingyourfirstclassifiersfromscratch. Ratherthanusemany of the wonderful, highly-customizable machine learning libraries out there (e.g., sklearn), you will build bare-bones (and much slower) implementations of two classic algorithms: Na¨ıve Bayes and Logistic Regression. Implementing these will provide you with deep understanding of how these algorithms work and how the libraries are actually calculating things. Further, you’ll be implementing two knob to tune in the data: smoothing for Na¨ıve Bayes and the learning rate for Logistic Regression. In implementing these, you will get to adjust them and not only understand their impact but understand why they’re impacting the result. The programmingassignmentwillbemoderately challenging, mostlystemmingfrom the conceptual level. The code for each algorithm is be 20-30 lines, with additional lines for simple I/O and boilerplate code. The big challenge will be from trying to ground the mathematical concepts we talk about in class in actual code, e.g., “what does it mean to calculate the likelihood???”. Overcoming this hurdle will take a bit of head scratching and possibly some debugging but it will be worth it, as you will be well on your way to understanding how newer algorithms work and even how to design your own! There are equations in the homework. You should not be afraid of then—even if you’ve not takenamathclasssincehighschool. Equationsaretheretoprovideprecisenotationandwherever possible,we’vetriedtoexplainthemmore. Therearealsomanydescriptionsinthelecturematerials and textbooks. Getting used to equations will help you express yourself in the future. If you’re not sure what something means, please ask; there are six office hours between now and when this assignment is due, plus an active Canvas discussion board. You’re not alone in wondering and someone will likely benefit from your courage in asking. Given that these are fundamental algorithms for NLP and many other fields, you are likely to run into implementations of them online. While you’re allowed to look at them (sometimes they can be informative!), all work you submit should be your own (see Section 7). Finally, remember this is a no-busywork class. If you think some part of this assignment is unnecessary or to much effort, let us know. We are happy to provide more detailed explanations for why each part of this assignment is useful. Tasks that take inordinate amounts of time could even be a bug in the homework and will be fixed.
2 ClassificationTask TheSchoolofInformation’smissionisliterally“Wecreateandshareknowledgesothatpeoplewill use information – with technology – to build a better world.” One major impediment to people’s
useoftechnologytodayisuncivilbehavioronline. Hatespeech,bullying,toxiclanguage,andother formsofharassmentallaffectpeople’sabilitytoaccessinformationandengageinsocialactivities. In this homework, you’ll be tackling this problem head-on by building a hate speech detector. Thedatasetcomestousefrom[Davidson et al.(2017)Davidson, Warmsley, Macy, and Weber]and consists of tweets that people have rated as hate speech or not.1 Your task will be to build two classifiers that label text as hate speech. The fate of free discourse rests on your shoulders. Given the toxic nature of people’s comments, please be aware that you may see messages that are deeply offensive and an affront to all that is right and good. Neither the instructor nor the community at SI agrees with or supports these viewpoints. Their inclusion in the homework is entirely for the purposes of development new NLP technologies that help administrators and moderators identify and remove these messages from otherwise civil discourse. As such, it would be difficult to develop a hate speech classifier without actually seeing what they look like.
Notation Notes : For the whole assignment description we’ll refer to classifier features as x1,…,xn ∈ X where xi is a single feature and X is the set of all features. When start out each feature will correspond to the presence of a word; however, you are free in later parts of the homework to experiment with different kinds of features like bigrams that denote to consecutive words,e.g.,“notgood”isasinglefeature. Wereferthetotheclasslabelsas y1,…,yn ∈ Y where yi is a single class and Y is the set of all classes. In our case, we have a binary classification tasks so there’s really only y1 and y2. When you see a phrase like P(X = xi|Y = yj) you can read this as“theprobabilitythatweobservethefeature(X) xi istrue,giventhatwehaveseentheclass(Y) is yj”. We’ll also use the notation exp(xi) to refer to exi at times. This notation lets us avoid superscript when the font might become too small or when it makes equations harder to follow.
Implementation Note : Unless otherwise specified, your implementations should not use any existing off-the-shelf machine learning libraries or methods. You’ll be using plain old Python (or R, if brave) along with numeric libraries like numpy to accomplish everything.
3 Data Homework 2 has three associated files: • train.tsv This file is what you’ll use you train your classifiers. • dev.tsv This file is contains the development data. Development is used to tune the parameters of your data but is not used to train directly. For example, you can train a system with a vocabulary of 100 words and then compare the performance if you had used 1000 words. • test.tsv This file is the test set for the classifiers. You will make predictions for this file and upload them. 1Thedatahasbeenslightlymodifiedtocollapsetheiroffensive(butnothatespeech)classintothenot-hatespeech class so that the dataset is a binary classifications suitable for Logistic Regression.
The data is formatted as tab-separated files2 with three columns: instance id, text, and label. The instance ID is just a number that’s associated with the instance and is used for reporting predictions. The text is the raw text of the tweet. The label is binary and 0 for non-hate speech or 1 for hate speech. As a part of this assignment, we’ll be using Kaggle in the classroom to report predictions on thetestdata. Thishomeworkisnotachallenge,perse. Instead,we’reusingKagglesoyoucanget a sense of how good your implementation’s predictions are relative to other students. Since we’re all using the same data and implementing the same algorithms, your scores should be relatively closetootherstudents. Ifyoudecidetotakeupthe optionalpartand dosome featureengineering, youmighthaveslightlyhigherscores,butnoonewillbepenalizedforthis. Instructionsonhowto submit to Kaggle will be listed on Canvas.
4 Task1: Na¨ıveBayesClassifier 4.1 Part1 Implement a Na¨ıve Bayes classifier. Recall that the classifier was defined as
ˆ y = argmax yi∈Y
P(Y = yi|X) whereY is the set of classes (i.e., hate-speech or not, in our case). This equation can seem pretty opaqueatfirst,butrememberthatP(Y = yi|X)isreallywhat’stheprobabilityoftheclassyi given thedataweseeinaninstance. Assuch,wecanuseBayesRuletolearnthisfromthetrainingdata. In Task 1, you’ll implement each step of the Na¨ıve Bayes classifier in separate functions. As a part of this, you’ll also write code to read in and tokenize the text data. Here, tokenization refers to separating words in a string. In the second half, you’ll revisit tokenization to ask if you can do a better job at deciding what are words. Task 1 will provide much of the boiler plate code you’ll need for Task 2. • Write a function called tokenize that takes in a string and tokenizes it by whitespace, returningalistoftokens. Youshouldusethisfunctionasyoureadinthetrainingdatasothat each whitespace separated word will be considered a different feature (i.e., a different xi). • Writeafunctioncalled train tocompute P(X = xi), P(Y = yj),and P(X = xi|Y = yi) fromthetrainingdata. Thefunctionshouldalsoincludeanargumentcalledsmoothing alpha thatbydefaultis0butcantakeonanynon-negativevaluetodoadditivesmoothing(seeSlide 143 from Lecture 1), which you might also see as Laplacian smoothing. • Write a function called classify that takes in a tokenized document (i.e., a list of words) and computes the Na¨ıve Bayes classification, returning the class with the highest posterior probability. Note that not you might not have all the new document’s words in the training data. Be sure to take this into account in your implementation depending on whether you used smoothing or not! 2In the instructor’s opinion, everyone should always use tab-separated files instead of comma-separated files.
• Traintheclassifieronthetrainingdatatheruntheclassifieronthedevelopmentdatawithno smoothing and report your performance in your submission in terms of F1.3 • What happens as you change the value of smoothing alpha? Include a plot of your classifier’s performance on the development data where (i) your model’s performance is on they-axisand(ii)thechoiceinsmoothing alphaisonthex-axis. Notethatmostpeople use α = 1; does this value give good performance for you? • Submit your best model’s predictions on the test data to Kaggle. 4.2 Part2 Noticethatweareprobablydoingabadjoboftokenizingduetopunctuation. Forexample,“good” and “good,” are treated as different features because the latter has a comma at the end and “Good” and “good” are also different features because of capitalization. Do we want these to be different? Furthermore, do we want to include every token as a feature? (Hint: could a regular expression help you filter possible features?) Note that you have to implement the tokenization yourself; no importing NLTK and wrapping their functions (though you might take some inspiration from them). • Write a better tokenizing function called better tokenize that fixes these issues. In your report, describe which kinds of errors you fixed, what kind of features you included, and why you made the choices you did. • Recomputeyourperformanceonthedevelopmentdatausingthebetter tokenizemethod. Describe in your report what impact this had on the performance.
4.3 Part3(optional) Parts 1 and 2 only used unigrams, which are single words. However, longer sequences of words, known as n-grams, can be very informative as features. For example “not offensive” and “very offensive”areveryinformativefeatureswhoseunigramsmightnotbeinformativeforclassification on their own. However, the downside to using longer n-grams is that we now have many more features. For example, if we have n words in our training data, we could have n2 bigrams in the worst case; just 1000 words could quickly turn into 1,000,000 features, which will slow things down quite a bit. As a result, many people threshold bigrams based on frequency or other metrics to reduce the number of features. In Part 3, you’ll experiment with adding bigrams (two words) and trigrams (three words) and measuring their impact on performance. Part 3 is entirely optional and included for people who want to go a bit further into the feature engineering side of things. • Count how many unique, unigram, bigrams, and trigrams there are in the training data and report each number. 3You can use sklearn’s implementation of F1 (sklearn.metrics.f1 score) to make your life easier:
• Arethebigramandtrigramcountsyouobserveclosetotheworstcaseintermsofhowmany we could observe? If not, why do you think this is the case? (Hint: are all words equally common? You might also check out Zipf’s Law). • What percent of the unique bigrams and trigrams in the development data were also seen in the training data? • Choose a minimum frequency threshold and try updating your solution to use these as features. Werecommendcreatinganewmethodthatwrapsyourtokenizemethodandreturnsa list of features.
5 Task2: LogisticRegression In the second task, you’ll implement logistic regression, which you might recall is
P(y = 1|x,β) =
1 1 + exp−Pi=1,…,N xiβiNote that when you implemented Na ¨ ıve Bayes, it didn’t care how many classes were present. In contrast, Logistic Regression is restricted to two classes, which we represent as binary so that y is either 0 or 1. Conveniently, your training data is already set up for this (though you’ll need to use int() to convert the string). Your implementation will be one of the simplest possible formulations of logistic regression where you use gradient descent to iteratively find better parameters β. Smarter solutions like those in sklearn will use numerical solvers, which are much (much) faster. The purpose of this problem is to give you a sense of how to compute a gradient. For Task 2, you can (and should) re-use all the boilerplate and tokenization code from the Na¨ıve Bayes classifier. It is assumed that your solution will already include something to read in the training data and construct a matrix where rows are instances and columns denote features (e.g., if X[0,7] = 3, it means that the first instance (index 0) saw the word whose feature index is 7 occur 3 times). • Implement a function called sigmoid that implements the sigmoid function, S
S(X) ==
1 1 + exp(−X) . Your function should be vectorized so that it computes the sigmoid of a whole vector of numbers at once. Conveniently, numpy will often do this for you, so that if you multiply a number to a numpy array, it will multiply each item in the array (the same applies to functions used on an array (hint)). If you’re not sure, please ask us! You’ll need to use the sigmoid function to make predictions later. • Implement a function called log likelihood that calculates the log likelihood of the training data given our parameters β. Note that we could calculate the likelihoood but since we’ll be using log-arithmetic version (hence the log-likelihood) to avoid numeric underflow
andbecauseit’sfastertoworkwith. Notethatwecancalculatetheloglikelihood ll overthe whole training data as ll = X i=1,…,n yiBTxi −log(1 + exp(BTxi)) where β is the vector of all of our coefficients. • Given some choice of β to make predictions, we want to use the difference in our prediction ˆ Y from the ground truth Y to update β. The gradient of the log likelihood tells us which direction(positiveornegative)tomaketheupdateandhowlargetheupdateshouldbe. Write a function compute gradient to compute the gradient. Note that we can compute the whole gradient using ∇ll = XT(Y − ˆ Y ) Note that Y is a binary vector with our ground truth (i.e., the training data labels) and ˆ Y is the binary vector with the predictions. To get a sense of why this works, think about what gradient will equal if our prediction for item i, ˆ Yi is the same as the ground truth Yi; if we use this gradient to update our weight for βi, what effect will it have? • Putting it all together, write a function logistic regression that takes in a – a matrix X where each row has the features for that instance – a vector Y containing the class of the row – learning rate which is a parameter to control how much you change the β values each step – num step how many steps to update β before stopping Your function should iteratively update the weight vector β at each step by making predictions, ˆ Y, for each row of X and then using those predictions to calculate the gradient. You should also include an intercept coefficient.4 Notethatyoucanmakeyourlifeeasierbyusingmatrixoperations. Forexample,tocompute ˆ Y, multiply the whole feature matrix X by the β vector If you’re not sure how to do this, don’t worry! Please come talk to us during office hours! • Write a function predict that given some new vector (i.e., something like a row from X), predict the class. • Train your model on the training data learning rate=5e-5 (i.e., a very small number) and num steps = 300000. Make a plot of the log-likelihood every 10,000 steps. Did the model converge at some point (i.e., does the log likelihood remain stable)? • Changethelearning ratetoamuchlargerandmuchsmallervalueandrepeatthetrainingprocedureforeach. Plotallthreecurvestogetheranddescribewhatyouobserve. You’re welcome (encouraged, even!) to try additional learning rates. If your model is very slow or if it converges quickly, you can also reduce the number of steps for this question. 4The easiest way to do this is to add an extra feature (i.e., column) with value 1 to X; the functions np.ones and np.stack with axis=1 will help.
• Aftertrainingonthetrainingdata,useyourlogisticregressionclassifiertomakepredictions on the validation dataset and report your performance using the F1. • Submit your best model’s predictions on the test data to Kaggle. 6 Submission Please upload the following to Canvas by the deadline: 1. a PDF (preferred) or .docx with your responses and plots for the questions above 2. your code for the Na¨ıve Bayes part 3. your code for the Logistic Regression part Codemaybesubmittedasastand-alonefile(e.g.,a .py file)orasaJupyternotebook. Wereserve therighttorunanycodeyousubmit;codethatdoesnotrunorproducessubstantiallydifferent outputswillreceiveazero.