{"id":379,"date":"2020-01-17T10:47:56","date_gmt":"2020-01-17T09:47:56","guid":{"rendered":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/?post_type=chapter&#038;p=379"},"modified":"2024-01-28T16:13:34","modified_gmt":"2024-01-28T15:13:34","slug":"document-classification","status":"publish","type":"chapter","link":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/chapter\/document-classification\/","title":{"raw":"Document Classification","rendered":"Document Classification"},"content":{"raw":"<p style=\"text-align: justify\">In this chapter we discuss document classification, a common application of logistic regressions, and present an example using the software LightSide.<\/p>\r\n&nbsp;\r\n<p style=\"text-align: justify\"><strong>Introduction to document classification<\/strong><\/p>\r\n<a href=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_vvgu8uz2\">https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_vvgu8uz2<\/a>\r\n<p style=\"text-align: justify\">[iframe width=\"640\" height=\"360\" src=\"https:\/\/tube.switch.ch\/embed\/b4fa4931\" frameborder=\"0\" webkitallowfullscreen mozallowfullscreen allowfullscreen]<\/p>\r\n<p style=\"text-align: justify\">Although it would be possible to perform document classifications in R, we decided to use a free, sophisticated and user-friendly tool specifically designed for document classification: LightSide. Since we do not have a dataset which we can make available for you directly, you will not be able to replicate the example we present here. Still, we can only encourage you to download LightSide <a href=\"http:\/\/ankara.lti.cs.cmu.edu\/side\/\">here<\/a>, install it using their excellent <a href=\"http:\/\/ankara.lti.cs.cmu.edu\/side\/LightSide_Researchers_Manual.pdf\">user manual<\/a> (Mayfield et al. 
2014) and play around with some data that interests you.[footnote]You might consider downloading novels by two different authors from Gutenberg, for instance.[\/footnote] In the unlikely event that you should find LightSide too daunting to dive into head first, it should still be possible to follow the discussion below without performing each step along the way yourself.<\/p>\r\n<p style=\"text-align: justify\">Before we get into the details, however, we should say a few words about document classification. Document classification is an important branch of computational linguistics since it provides a productive way to handle large quantities of text. Think for a moment about spam mails. According to an analysis by the anti-virus company Kaspersky, more than half of the global email traffic is spam mail (Vergelis et al. 2019). How do email providers manage to sift these quantities of data? They use document classification to separate the important email traffic from spam mail. And how does document classification work? Essentially by solving a logistic regression.<\/p>\r\n<p style=\"text-align: justify\">When we discussed the logistic regression in the preceding chapter, we chose a set of explanatory variables to predict the outcome of a categorical dependent variable. Since LightSide is a machine learning-based application, we don\u2019t have to define the explanatory variables beforehand. Instead we let LightSide select the features. In fact, LightSide uses all of the features in the documents we want to classify and it automatically identifies which words are most indicative of each class.<\/p>\r\n<p style=\"text-align: justify\">To begin with, we need to open LightSide. If you installed it following the manual\u2019s description, you should be able to launch it on a Windows computer by opening the \u2018LightSIDE.bat\u2019 file in the \u2018lightside\u2019 folder. 
On a Mac, you should be able to run LightSide by downloading the zip folder from the webpage, unzipping it, selecting \u2018lightside.command\u2019 with a right-click, choosing \u2018open with\u2019 and then selecting \u2018terminal\u2019. A window will pop open asking you whether you are sure you want to open this file, and there you can select yes. Now, LightSide should launch.<\/p>\r\n<p style=\"text-align: justify\">Once LightSide is open, the first thing we need to do is select the corpus that we want to classify. In our example, we work with Guerini et al.\u2019s (2013)\u00a0 CORPS 2 which contains more than 8 million words from 3618 speeches. Most of the speeches are by American politicians, but the corpus also includes speeches by Margaret Thatcher. The question we are asking is whether we can determine a politician\u2019s party only on the basis of their speeches. We are going to use speeches by those politicians who are either Democrat or Republican and who are represented by more than 10 speeches. Since the corpus is from 2013, the most recent developments in American politics are, of course, not captured here.<\/p>\r\n\r\n\r\n[caption id=\"attachment_556\" align=\"aligncenter\" width=\"306\"]<img src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/1.-speeches.jpg\" alt=\"\" class=\"size-full wp-image-556\" width=\"306\" height=\"385\" \/> Table 11.1: Speeches per politician[\/caption]\r\n<p style=\"text-align: justify\">We had to pre-process the speeches with some simple steps, and we want to show you the final form of the data so that you can get an idea of what the input for LightSide should look like. In the screenshot below, you see how the comma separated data looks in Excel. In the first column, we have the party affiliation which is either \u2018rep\u2019 or \u2018dem\u2019. The second column shows the file id. The third column contains the politician's name. 
As you might guess from only seeing Alan Keyes in the first rows, the table is sorted alphabetically according to the politician's name. Each cell in the fourth column contains an entire speech.<\/p>\r\n\r\n\r\n[caption id=\"attachment_506\" align=\"aligncenter\" width=\"1681\"]<img src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/2.-excel-speeches.jpg\" alt=\"\" class=\"size-full wp-image-506\" width=\"1681\" height=\"403\" \/> Table 11.2: The CORPS 2 in Excel[\/caption]\r\n\r\n&nbsp;\r\n\r\n<strong>Importing data into LightSide<\/strong>\r\n\r\n<a href=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_d0zin257\">https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_d0zin257<\/a>\r\n\r\n[iframe width=\"640\" height=\"360\" src=\"https:\/\/tube.switch.ch\/embed\/f2782227\" frameborder=\"0\" webkitallowfullscreen mozallowfullscreen allowfullscreen]\r\n<p style=\"text-align: justify\">When we load this table into LightSide, each of the words in the speeches is converted into an appropriate representation. For the documents, LightSide uses a vector space model, which we do not explain in detail since it is beyond the scope of this introduction. At this stage it suffices to know that each word is going to be turned into a feature, on the basis of which LightSide will make its predictions of the politicians' party. The fact that some words are more typical of one party can be leveraged to predict which party a given speech belongs to. Since we have 18,175 types, we have a very large number of features which LightSide can use to construct a model.<\/p>\r\n<p style=\"text-align: justify\">As figure 11.1 shows, the LightSide window is divided into an upper and a lower half. Each of these halves is divided into three columns. First, we work through the columns in the upper half from left to right. To import a .csv file into LightSide, click on the little folder icon in the upper left column. 
A window will open, and you can drag-and-drop the file containing your text collection into this window. Once the corpus is imported, select the \u2018Class\u2019 to tell LightSide what you want to predict. In our example, we use the class \u2018party\u2019 since we are interested in identifying party affiliations. The party is a binary, nominal type of class, just like the realization of the recipient was in the last chapter on the logistic regression. LightSide automatically identifies party as such a binary class. We want to predict the party from the information in the \u2018text\u2019 column in our data, so we select that in the \u2018Text Fields\u2019 tab.<\/p>\r\n\r\n\r\n[caption id=\"attachment_508\" align=\"aligncenter\" width=\"1920\"]<img src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/3.-Pre-extraction.jpg\" alt=\"\" class=\"size-full wp-image-508\" width=\"1920\" height=\"1020\" \/> Figure 11.1: Settings for the feature extraction[\/caption]\r\n<p style=\"text-align: justify\">After setting these parameters, we can choose in the middle and right column which types of features we want LightSide to use for the prediction. In our example, we only use \u2018Unigrams\u2019 from the \u2018Basic Features\u2019.[footnote]LightSide is, in fact, a Weka wrapper. Weka is a workbench for machine learning. So we have at our disposal a very powerful tool which can scale to large datasets and which would be capable of incorporating a very wide variety of features into its models. For simplicity's sake, we stick to unigrams here.[\/footnote] Accordingly, we want to know which individual words are indicative of party affiliation. This is also known as a bag-of-words approach, since we do not consider word combinations at all here. 
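The bag-of-words idea can be made concrete with a few lines of Python. This is only a toy sketch with invented example sentences, not LightSide's actual extraction code: each text is lowercased, split into word tokens, and reduced to a table of unigram counts.

```python
from collections import Counter
import re

def unigram_features(text):
    """Lowercase a text, split it into word tokens and count each unigram."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Two invented mini-speeches with their party labels
speeches = [
    ("rep", "God bless America and God bless our troops"),
    ("dem", "We must invest in health care and in education"),
]
features = [(party, unigram_features(text)) for party, text in speeches]
```

Each speech thus becomes a vector of word counts, so with 18,175 types in the corpus every document is a point in an 18,175-dimensional vector space.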
When using LightSide for research purposes, it makes sense to play around with and compare different models based on different features.<\/p>\r\n<p style=\"text-align: justify\">Once we have selected the features we want LightSide to use, we can extract them by clicking on the \u2018Extract\u2019 tile. Level with the \u2018Extract\u2019 button, on the right side of the LightSide window, there is a counter telling us how many features have already been extracted. Depending on the corpus size, this may take a while.<\/p>\r\n&nbsp;\r\n\r\n<strong>Building a model with 1-gram features<\/strong>\r\n\r\n<a href=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_p5l0a0sg\">https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_p5l0a0sg<\/a>\r\n\r\n[iframe width=\"640\" height=\"360\" src=\"https:\/\/tube.switch.ch\/embed\/7f88ab2f\" frameborder=\"0\" webkitallowfullscreen mozallowfullscreen allowfullscreen]\r\n<p style=\"text-align: justify\">Once the features are extracted, we turn to the lower half of the screen. In the lower left column, the most interesting thing to note is how many features are in our feature table. In our example, we have 18,175 features. The lower middle column gives us the option of looking at different metrics for each feature. While this can be useful at more advanced levels, we can ignore this at the moment. In the bottom right panel, we see our feature table in alphabetical order. In our example, this begins with a lot of year numbers, but if we scroll through the table, we see that the overwhelming majority of the features are words. 
Table 11.3 is a screenshot taken somewhere in the middle of our feature table.<\/p>\r\n\r\n\r\n[caption id=\"attachment_509\" align=\"aligncenter\" width=\"1027\"]<img src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/4.-Some-features.jpg\" alt=\"\" class=\"size-full wp-image-509\" width=\"1027\" height=\"511\" \/> Table 11.3: A tiny selection of features[\/caption]\r\n<p style=\"text-align: justify\">With the extracted features, we can jump directly to creating a model. To do so, we have to go to the \u2018Build Model\u2019 tab. Here we can select which algorithm we want our model to be based on. Of course, in our example we choose \u2018Logistic Regression\u2019, since this is the type of model which interests us here. What changes now in comparison to the example in the last chapter is that instead of a few dozen, we now use thousands of features to predict the dependent variable. The capability of handling very large numbers of features is one of the great affordances of machine learning.<\/p>\r\n\r\n\r\n[caption id=\"attachment_510\" align=\"aligncenter\" width=\"1920\"]<img src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/5.-Training-parameters.jpg\" alt=\"\" class=\"size-full wp-image-510\" width=\"1920\" height=\"1020\" \/> Figure 11.2: Parameters for training our logistic model[\/caption]\r\n<p style=\"text-align: justify\">Another important difference between this implementation of the logistic regression and the generalized linear model we saw in the last chapter is the \u2018Cross-Validation\u2019, which we can select in the \u2018Evaluation Options\u2019. This means instead of training the model on all of the available data, we train it on 90% of the data and then apply it to the remaining 10% of the data. This is done ten times, and each time a different 10% is excluded from the training data. 
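The splitting scheme just described can be sketched in a few lines of Python. This illustrates the principle only and is not LightSide's implementation:

```python
import random

def ten_fold_splits(n_documents, seed=1):
    """Shuffle the document indices and cut them into ten folds; each fold
    serves once as the held-out 10% while the other nine are training data."""
    indices = list(range(n_documents))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::10] for i in range(10)]
    splits = []
    for held_out in folds:
        test = set(held_out)
        train = [i for i in indices if i not in test]
        splits.append((train, held_out))
    return splits

splits = ten_fold_splits(3000)  # e.g. a collection of 3,000 speeches
```

Every document ends up in the held-out portion of exactly one fold, so each speech is classified once by a model that was not trained on it.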
This is known as 10-fold cross-validation, and it means that ten different models are actually trained on ten different selections of the data. The evaluation measures are then averaged over the ten folds, so that every speech is classified exactly once by a model which did not see it during training. We can create the model by clicking the \u2018Train\u2019 tile. Again, this may take a while, and we can follow the progress with the counter, where we see which of our ten folds is currently being tested.<\/p>\r\n&nbsp;\r\n\r\n<strong>Evaluating the model<\/strong>\r\n\r\n<a href=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_g278bvaq\">https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_g278bvaq<\/a>\r\n\r\n[iframe width=\"640\" height=\"360\" src=\"https:\/\/tube.switch.ch\/embed\/d104c292\" frameborder=\"0\" webkitallowfullscreen mozallowfullscreen allowfullscreen]\r\n<p style=\"text-align: justify\">The results are presented in the lower half of the window. In the lower middle column, we see the accuracy of the model. In our example, we get an accuracy of 97.9%, meaning that only about 2% of the speeches were classified incorrectly. The Kappa value expresses how much better our model is than the random choice baseline. The value of 0.957 we see here means that our model performs 95.7% better than if we classified the speeches at random. This is very similar to the null-model against which we compared our logistic model in the last chapter. In the lower right column, we see the actual confusion matrix, which tells us how many speeches were predicted correctly. 
All of these measures indicate that we can predict a politician\u2019s party quite reliably solely on the basis of the words in the speech.<\/p>\r\n\r\n\r\n[caption id=\"attachment_511\" align=\"aligncenter\" width=\"1430\"]<img src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/6.-Confusion-matrix.jpg\" alt=\"\" class=\"size-full wp-image-511\" width=\"1430\" height=\"565\" \/> Figure 11.3: Model evaluation measures[\/caption]\r\n<p style=\"text-align: justify\">For our linguistic purposes, it\u2019s very interesting to look at the feature weights, which are a measure of the importance of each feature for the model. To do this, we can go to the \u2018Explore Results\u2019 tab. Let\u2019s say we are interested in seeing the most typical Republican features. We can do this by ticking the box in the confusion matrix where the predicted and actual Republican outcomes intersect. Then we can choose the \u2018Frequency\u2019 and the \u2018Feature Influence\u2019 among the \u2018Evaluations to Display\u2019.<\/p>\r\n\r\n\r\n[caption id=\"attachment_512\" align=\"aligncenter\" width=\"1440\"]<img src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/7.-Exploring-results.jpg\" alt=\"\" class=\"size-full wp-image-512\" width=\"1440\" height=\"673\" \/> Figure 11.4: Exploring the results[\/caption]\r\n<p style=\"text-align: justify\">Now, we can sort the feature table in the upper right column according to the feature weight by clicking on the \u2018Feature Influence\u2019 tile. The negative features are strongly indicative of the Democratic party, while the positive features are strongly indicative of the Republican party.[footnote]As with NPs and PPs in the last chapter, it is arbitrary which category is represented by positive and which by negative numbers. 
There are no political implications here.[\/footnote] The closer the feature influence is to zero, the less explanatory power it holds. Let\u2019s take a look at the features most indicative of being Republican:<\/p>\r\n\r\n\r\n[caption id=\"attachment_513\" align=\"aligncenter\" width=\"1020\"]<img src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/8.-Republican-features.jpg\" alt=\"\" class=\"size-full wp-image-513\" width=\"1020\" height=\"583\" \/> Figure 11.5: Most informative Republican features[\/caption]\r\n<p style=\"text-align: justify\">Among the most Republican features, we see several names, such as Bush, Nancy (Reagan) and Laura (Bush), and several familiar buzzwords, such as freedom, god, nation and terror. Linguistically, it is interesting to note that contractions such as \u2018ve, \u2018m and \u2018ll are distinctly Republican features. A first interpretation of this could be that Republicans want to present themselves as being close to the public, using colloquial language rather than technical jargon.<\/p>\r\n<p style=\"text-align: justify\">Of course, if one were to pursue this line of investigation seriously, one would have to check whether these results are a consequence of different transcription conventions, and whether there is evidence of the \u2018one of the people\u2019 attitude in the speeches of Republicans (and how Democrat politicians present themselves). We are not going to expound on this at any greater length, but this example should demonstrate both how unigram features can provide a good basis for further analysis and how these features should be interpreted with some caution.<\/p>\r\n<p style=\"text-align: justify\">We see here how regression analysis provides a basis for understanding advanced machine learning tools such as LightSide. 
What we also see is how we strain the limits of interpretability when we perform regression analysis on this scale: in our example we have about 18,000 features which contribute to the model, which is in excess of what can be incorporated in a qualitative analysis. This problem is not limited to document classification. Whenever we want to analyze substantial quantities of text, human reading capacities are challenged. We are going to discuss topic modelling in the <a href=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/chapter\/topic-modelling\/\">next chapter<\/a> to address this issue of excess information.<\/p>\r\n&nbsp;\r\n\r\n<strong>References:<\/strong>\r\n<p style=\"text-align: justify\">Guerini, M., D. Giampiccolo, G. Moretti, R. Sprugnoli and C. Strapparava. 2013. The New Release of CORPS: A Corpus of Political Speeches Annotated with Audience Reactions. <em>Multimodal Communication in Political Speech: Shaping Minds and Social Action<\/em>, ed. by Isabella Poggi, Francesca D\u2019Errico, Laura Vincze and Alessandro Vinciarelli, 86-98. Berlin: Springer.<\/p>\r\n<p style=\"text-align: justify\">Mayfield, Elijah, David Adamson and Carolyn P. Ros\u00e9. 2014. LightSide: Researcher\u2019s Workbench User Manual. Online: <a href=\"http:\/\/ankara.lti.cs.cmu.edu\/side\/LightSide_Researchers_Manual.pdf\">http:\/\/ankara.lti.cs.cmu.edu\/side\/LightSide_Researchers_Manual.pdf<\/a>.<\/p>\r\n<p style=\"text-align: justify\">Vergelis, Maria, Tatyana Shcherbakova and Tatyana Sidorina. 2019. Spam and phishing in Q1 2019. <em>Kaspersky<\/em>. 
Online: <a href=\"https:\/\/securelist.com\/spam-and-phishing-in-q1-2019\/90795\/\">https:\/\/securelist.com\/spam-and-phishing-in-q1-2019\/90795\/<\/a>.<\/p>","rendered":"<p style=\"text-align: justify\">In this chapter we discuss document classification, a common application of logistic regressions, and present an example using the software LightSide.<\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: justify\"><strong>Introduction to document classification<br \/>\n<\/strong><\/p>\n<p><a href=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_vvgu8uz2\">https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_vvgu8uz2<\/a><\/p>\n<p style=\"text-align: justify\"><span><br \/>\n<!-- iframe plugin v.6.0 wordpress.org\/plugins\/iframe\/ --><br \/>\n<iframe loading=\"lazy\" width=\"640\" height=\"360\" src=\"https:\/\/tube.switch.ch\/embed\/b4fa4931\" frameborder=\"0\" 0=\"webkitallowfullscreen\" 1=\"mozallowfullscreen\" 2=\"allowfullscreen\" scrolling=\"yes\" class=\"iframe-class\"><\/iframe><br \/>\n<\/span><\/p>\n<p style=\"text-align: justify\">Although it would be possible to perform document classifications in R, we decided to use a free, sophisticated and user-friendly tool specifically designed for document classification: LightSide. Since we do not have a dataset which we can make available for you directly, you will not be able to replicate the example we present here. Still, we can only encourage you to download LightSide <a href=\"http:\/\/ankara.lti.cs.cmu.edu\/side\/\">here<\/a>, install it using their excellent <a href=\"http:\/\/ankara.lti.cs.cmu.edu\/side\/LightSide_Researchers_Manual.pdf\">user manual<\/a> (Mayfield et al. 
2014) and play around with some data that interests you.<a class=\"footnote\" title=\"You might consider downloading novels by two different authors from Gutenberg, for instance.\" id=\"return-footnote-379-1\" href=\"#footnote-379-1\" aria-label=\"Footnote 1\"><sup class=\"footnote\">[1]<\/sup><\/a> In the unlikely event that you should find LightSide too daunting to dive into head first, it should still be possible to follow the discussion below without performing each step along the way yourself.<\/p>\n<p style=\"text-align: justify\">Before we get into the details, however, we should say a few words about document classification. Document classification is an important branch of computational linguistics since it provides a productive way to handle large quantities of text. Think for a moment about spam mails. According to an analysis by the anti-virus company Kaspersky, more than half of the global email traffic is spam mail (Vergelis et al. 2019). How do email providers manage to sift these quantities of data? They use document classification to separate the important email traffic from spam mail. And how does document classification work? Essentially by solving a logistic regression.<\/p>\n<p style=\"text-align: justify\">When we discussed the logistic regression in the preceding chapter, we chose a set of explanatory variables to predict the outcome of a categorical dependent variable. Since LightSide is a machine learning-based application, we don\u2019t have to define the explanatory variables beforehand. Instead we let LightSide select the features. In fact, LightSide uses all of the features in the documents we want to classify and it automatically identifies which words are most indicative of each class.<\/p>\n<p style=\"text-align: justify\">To begin with, we need to open LightSide. 
If you installed it following the manual\u2019s description, you should be able to launch it on a Windows computer by opening the \u2018LightSIDE.bat\u2019 file in the \u2018lightside\u2019 folder. On a Mac, you should be able to run LightSide by downloading the zip folder from the webpage, unzipping it, selecting \u2018lightside.command\u2019 with a right-click, choosing \u2018open with\u2019 and then selecting \u2018terminal\u2019. A window will pop open asking you whether you are sure you want to open this file, and there you can select yes. Now, LightSide should launch.<\/p>\n<p style=\"text-align: justify\">Once LightSide is open, the first thing we need to do is select the corpus that we want to classify. In our example, we work with Guerini et al.\u2019s (2013)\u00a0 CORPS 2 which contains more than 8 million words from 3618 speeches. Most of the speeches are by American politicians, but the corpus also includes speeches by Margaret Thatcher. The question we are asking is whether we can determine a politician\u2019s party only on the basis of their speeches. We are going to use speeches by those politicians who are either Democrat or Republican and who are represented by more than 10 speeches. 
Since the corpus is from 2013, the most recent developments in American politics are, of course, not captured here.<\/p>\n<figure id=\"attachment_556\" aria-describedby=\"caption-attachment-556\" style=\"width: 306px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/1.-speeches.jpg\" alt=\"\" class=\"size-full wp-image-556\" width=\"306\" height=\"385\" srcset=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/1.-speeches.jpg 306w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/1.-speeches-238x300.jpg 238w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/1.-speeches-65x82.jpg 65w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/1.-speeches-225x283.jpg 225w\" sizes=\"(max-width: 306px) 100vw, 306px\" \/><figcaption id=\"caption-attachment-556\" class=\"wp-caption-text\">Table 11.1: Speeches per politician<\/figcaption><\/figure>\n<p style=\"text-align: justify\">We had to pre-process the speeches with some simple steps, and we want to show you the final form of the data so that you can get an idea of what the input for LightSide should look like. In the screenshot below, you see how the comma separated data looks in Excel. In the first column, we have the party affiliation which is either \u2018rep\u2019 or \u2018dem\u2019. The second column shows the file id. The third column contains the politician&#8217;s name. As you might guess from only seeing Alan Keyes in the first rows, the table is sorted alphabetically according to the politician&#8217;s name. 
Each cell in the fourth column contains an entire speech.<\/p>\n<figure id=\"attachment_506\" aria-describedby=\"caption-attachment-506\" style=\"width: 1681px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/2.-excel-speeches.jpg\" alt=\"\" class=\"size-full wp-image-506\" width=\"1681\" height=\"403\" srcset=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/2.-excel-speeches.jpg 1681w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/2.-excel-speeches-300x72.jpg 300w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/2.-excel-speeches-1024x245.jpg 1024w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/2.-excel-speeches-768x184.jpg 768w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/2.-excel-speeches-1536x368.jpg 1536w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/2.-excel-speeches-65x16.jpg 65w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/2.-excel-speeches-225x54.jpg 225w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/2.-excel-speeches-350x84.jpg 350w\" sizes=\"(max-width: 1681px) 100vw, 1681px\" \/><figcaption id=\"caption-attachment-506\" class=\"wp-caption-text\">Table 11.2: The CORPS 2 in Excel<\/figcaption><\/figure>\n<p>&nbsp;<\/p>\n<p><strong>Importing data into LightSide<\/strong><\/p>\n<p><a href=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_d0zin257\">https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_d0zin257<\/a><\/p>\n<p><span><br \/>\n<!-- iframe plugin v.6.0 wordpress.org\/plugins\/iframe\/ --><br 
\/>\n<iframe loading=\"lazy\" width=\"640\" height=\"360\" src=\"https:\/\/tube.switch.ch\/embed\/f2782227\" frameborder=\"0\" 0=\"webkitallowfullscreen\" 1=\"mozallowfullscreen\" 2=\"allowfullscreen\" scrolling=\"yes\" class=\"iframe-class\"><\/iframe><br \/>\n<\/span><\/p>\n<p style=\"text-align: justify\">When we load this table into LightSide, each of the words in the speeches is converted into an appropriate representation. For the documents, LightSide uses a vector space model, which we do not explain in detail since it is beyond the scope of this introduction. At this stage it suffices to know that each word is going to be turned into a feature, on the basis of which LightSide will make its predictions of the politicians&#8217; party. The fact that some words are more typical of one party can be leveraged to predict which party a given speech belongs to. Since we have 18,175 types, we have a very large number of features which LightSide can use to construct a model.<\/p>\n<p style=\"text-align: justify\">As figure 11.1 shows, the LightSide window is divided into an upper and a lower half. Each of these halves is divided into three columns. First, we work through the columns in the upper half from left to right. To import a .csv file into LightSide, click on the little folder icon in the upper left column. A window will open, and you can drag-and-drop the file containing your text collection into this window. Once the corpus is imported, select the \u2018Class\u2019 to tell LightSide what you want to predict. In our example, we use the class \u2018party\u2019 since we are interested in identifying party affiliations. The party is a binary, nominal type of class, just like the realization of the recipient was in the last chapter on the logistic regression. LightSide automatically identifies party as such a binary class. 
We want to predict the party from the information in the \u2018text\u2019 column in our data, so we select that in the \u2018Text Fields\u2019 tab.<\/p>\n<figure id=\"attachment_508\" aria-describedby=\"caption-attachment-508\" style=\"width: 1920px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/3.-Pre-extraction.jpg\" alt=\"\" class=\"size-full wp-image-508\" width=\"1920\" height=\"1020\" srcset=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/3.-Pre-extraction.jpg 1920w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/3.-Pre-extraction-300x159.jpg 300w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/3.-Pre-extraction-1024x544.jpg 1024w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/3.-Pre-extraction-768x408.jpg 768w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/3.-Pre-extraction-1536x816.jpg 1536w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/3.-Pre-extraction-65x35.jpg 65w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/3.-Pre-extraction-225x120.jpg 225w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/3.-Pre-extraction-350x186.jpg 350w\" sizes=\"(max-width: 1920px) 100vw, 1920px\" \/><figcaption id=\"caption-attachment-508\" class=\"wp-caption-text\">Figure 11.1: Settings for the feature extraction<\/figcaption><\/figure>\n<p style=\"text-align: justify\">After setting these parameters, we can choose in the middle and right column which types of features we want LightSide to use for the prediction. 
In our example, we only use \u2018Unigrams\u2019 from the \u2018Basic Features\u2019.<a class=\"footnote\" title=\"LightSide is, in fact, a Weka wrapper. Weka is a workbench for machine learning. So we have at our disposal a very powerful tool which can scale to large datasets and which would be capable of incorporating a very wide variety of features into its models. For simplicity's sake, we stick to unigrams here.\" id=\"return-footnote-379-2\" href=\"#footnote-379-2\" aria-label=\"Footnote 2\"><sup class=\"footnote\">[2]<\/sup><\/a> Accordingly, we want to know which individual words are indicative of party affiliation. This is also known as a bag-of-words approach, since we do not consider word combinations at all here. When using LightSide for research purposes, it makes sense to play around with and compare different models based on different features.<\/p>\n<p style=\"text-align: justify\">Once we have selected the features we want LightSide to use, we can extract them by clicking on the \u2018Extract\u2019 tile. Level with the \u2018Extract\u2019 button, on the right side of the LightSide window, a counter shows how many features have already been extracted. Depending on the corpus size, this may take a while.<\/p>\n<p>&nbsp;<\/p>\n<p><strong>Building a model with 1-gram features<\/strong><\/p>\n<p><a href=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_p5l0a0sg\">https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_p5l0a0sg<\/a><\/p>\n<p><span><br \/>\n<!-- iframe plugin v.6.0 wordpress.org\/plugins\/iframe\/ --><br \/>\n<iframe loading=\"lazy\" width=\"640\" height=\"360\" src=\"https:\/\/tube.switch.ch\/embed\/7f88ab2f\" frameborder=\"0\" webkitallowfullscreen mozallowfullscreen allowfullscreen scrolling=\"yes\" class=\"iframe-class\"><\/iframe><br \/>\n<\/span><\/p>\n<p style=\"text-align: justify\">Once the features are extracted, we turn to the lower half of the screen.
In the lower left column, the most interesting piece of information is how many features our feature table contains. In our example, we have 18,175 features. The lower middle column gives us the option of looking at different metrics for each feature. While this can be useful at more advanced levels, we can ignore it for the moment. In the bottom right panel, we see our feature table in alphabetical order. In our example, this begins with a lot of year numbers, but if we scroll through the table, we see that the overwhelming majority of the features are words. Table 11.3 is a screenshot taken somewhere in the middle of our feature table.<\/p>\n<figure id=\"attachment_509\" aria-describedby=\"caption-attachment-509\" style=\"width: 1027px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/4.-Some-features.jpg\" alt=\"\" class=\"size-full wp-image-509\" width=\"1027\" height=\"511\" srcset=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/4.-Some-features.jpg 1027w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/4.-Some-features-300x149.jpg 300w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/4.-Some-features-1024x510.jpg 1024w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/4.-Some-features-768x382.jpg 768w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/4.-Some-features-65x32.jpg 65w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/4.-Some-features-225x112.jpg 225w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/4.-Some-features-350x174.jpg 350w\" sizes=\"(max-width:
1027px) 100vw, 1027px\" \/><figcaption id=\"caption-attachment-509\" class=\"wp-caption-text\">Table 11.3: A tiny selection of features<\/figcaption><\/figure>\n<p style=\"text-align: justify\">With the extracted features, we can jump directly to creating a model. To do so, we have to go to the \u2018Build Model\u2019 tab. Here we can select which algorithm we want our model to be based on. Of course, in our example we choose \u2018Logistic Regression\u2019, since this is the type of model which interests us here. What changes now in comparison to the example in the last chapter is that instead of a few dozen, we now use thousands of features to predict the dependent variable. The capability of handling very large numbers of features is one of the great affordances of machine learning.<\/p>\n<figure id=\"attachment_510\" aria-describedby=\"caption-attachment-510\" style=\"width: 1920px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/5.-Training-parameters.jpg\" alt=\"\" class=\"size-full wp-image-510\" width=\"1920\" height=\"1020\" srcset=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/5.-Training-parameters.jpg 1920w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/5.-Training-parameters-300x159.jpg 300w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/5.-Training-parameters-1024x544.jpg 1024w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/5.-Training-parameters-768x408.jpg 768w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/5.-Training-parameters-1536x816.jpg 1536w, 
https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/5.-Training-parameters-65x35.jpg 65w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/5.-Training-parameters-225x120.jpg 225w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/5.-Training-parameters-350x186.jpg 350w\" sizes=\"(max-width: 1920px) 100vw, 1920px\" \/><figcaption id=\"caption-attachment-510\" class=\"wp-caption-text\">Figure 11.2: Parameters for training our logistic model<\/figcaption><\/figure>\n<p style=\"text-align: justify\">Another important difference between this implementation of the logistic regression and the generalized linear model we saw in the last chapter is the \u2018Cross-Validation\u2019, which we can select in the \u2018Evaluation Options\u2019. This means that instead of training the model on all of the available data, we train it on 90% of the data and then apply it to the remaining 10%. This is done ten times, and each time a different 10% is excluded from the training data. This is known as 10-fold cross-validation, and it means that ten different models are actually trained on ten different selections of the data. The performance of these ten models is then averaged, which gives us a more reliable estimate of how well the model generalizes to unseen data. We can create the model by clicking the \u2018Train\u2019 tile.
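The fold construction behind 10-fold cross-validation can be sketched in a few lines of Python (a simplified, unshuffled version of what LightSide does internally; the corpus size of 100 is arbitrary):

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists; each item is held out exactly once."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# With 100 documents, each of the ten splits trains on 90 and tests on 10.
splits = list(k_fold_indices(100))
print(len(splits))            # 10
train, test = splits[0]
print(len(train), len(test))  # 90 10
```

Because every document lands in the test set of exactly one fold, the averaged performance reflects predictions on data the respective model never saw.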
Again, this may take a while, and we can follow the progress with the counter, which shows which of the ten folds is currently being tested.<\/p>\n<p>&nbsp;<\/p>\n<p><strong>Evaluating the model<\/strong><\/p>\n<p><a href=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_g278bvaq\">https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_g278bvaq<\/a><\/p>\n<p><span><br \/>\n<!-- iframe plugin v.6.0 wordpress.org\/plugins\/iframe\/ --><br \/>\n<iframe loading=\"lazy\" width=\"640\" height=\"360\" src=\"https:\/\/tube.switch.ch\/embed\/d104c292\" frameborder=\"0\" webkitallowfullscreen mozallowfullscreen allowfullscreen scrolling=\"yes\" class=\"iframe-class\"><\/iframe><br \/>\n<\/span><\/p>\n<p style=\"text-align: justify\">The results are presented in the lower half of the window. In the lower middle column, we see the accuracy of the model. In our example, we get an accuracy of 97.9%, meaning that only about 2% of the speeches were classified incorrectly. The Kappa value expresses how much better our model is than the random choice baseline. The value of 0.957 we see here means that our model performs almost 96% better than if we classified the speeches at random. This is very similar to comparing our logistic model against the null model, as we did in the last chapter. In the lower right column, we see the confusion matrix, which tells us how many speeches in each class were predicted correctly.
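Both accuracy and Kappa can be computed directly from a confusion matrix. The sketch below uses an invented 2x2 matrix whose counts roughly mimic the values reported above; `accuracy_and_kappa` is our own helper, not a LightSide function:

```python
def accuracy_and_kappa(confusion):
    """confusion[i][j]: number of items of actual class i predicted as class j."""
    n = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / n
    # Chance agreement: summed products of the row and column marginal proportions.
    expected = sum(
        (sum(confusion[i]) / n) * (sum(row[i] for row in confusion) / n)
        for i in range(len(confusion))
    )
    return observed, (observed - expected) / (1 - expected)

# Rows: actual Democrat/Republican; columns: predicted Democrat/Republican.
conf = [[490, 10],
        [11, 489]]
acc, kappa = accuracy_and_kappa(conf)
print(round(acc, 3), round(kappa, 3))  # 0.979 0.958
```

Note that with balanced classes the chance baseline is 0.5, which is why the Kappa here is close to, but a little below, twice the distance between accuracy and 50%.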
All of these measures indicate that we can predict a politician\u2019s party quite reliably solely on the basis of the words in the speech.<\/p>\n<figure id=\"attachment_511\" aria-describedby=\"caption-attachment-511\" style=\"width: 1430px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/6.-Confusion-matrix.jpg\" alt=\"\" class=\"size-full wp-image-511\" width=\"1430\" height=\"565\" srcset=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/6.-Confusion-matrix.jpg 1430w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/6.-Confusion-matrix-300x119.jpg 300w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/6.-Confusion-matrix-1024x405.jpg 1024w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/6.-Confusion-matrix-768x303.jpg 768w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/6.-Confusion-matrix-65x26.jpg 65w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/6.-Confusion-matrix-225x89.jpg 225w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/6.-Confusion-matrix-350x138.jpg 350w\" sizes=\"(max-width: 1430px) 100vw, 1430px\" \/><figcaption id=\"caption-attachment-511\" class=\"wp-caption-text\">Figure 11.3: Model evaluation measures<\/figcaption><\/figure>\n<p style=\"text-align: justify\">For our linguistic purposes, it\u2019s very interesting to look at the feature weights, which measure the importance of each feature to the model. To do this, we can go to the \u2018Explore Results\u2019 tab. Let\u2019s say we are interested in seeing the most typical Republican features.
We can do this by ticking the box in the confusion matrix where the predicted and actual Republican outcomes intersect. Then we can choose the \u2018Frequency\u2019 and the \u2018Feature Influence\u2019 among the \u2018Evaluations to Display\u2019.<\/p>\n<figure id=\"attachment_512\" aria-describedby=\"caption-attachment-512\" style=\"width: 1440px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/7.-Exploring-results.jpg\" alt=\"\" class=\"size-full wp-image-512\" width=\"1440\" height=\"673\" srcset=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/7.-Exploring-results.jpg 1440w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/7.-Exploring-results-300x140.jpg 300w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/7.-Exploring-results-1024x479.jpg 1024w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/7.-Exploring-results-768x359.jpg 768w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/7.-Exploring-results-65x30.jpg 65w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/7.-Exploring-results-225x105.jpg 225w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/7.-Exploring-results-350x164.jpg 350w\" sizes=\"(max-width: 1440px) 100vw, 1440px\" \/><figcaption id=\"caption-attachment-512\" class=\"wp-caption-text\">Figure 11.4: Exploring the results<\/figcaption><\/figure>\n<p style=\"text-align: justify\">Now, we can sort the feature table in the upper right column according to the feature weight by clicking on the \u2018Feature Influence\u2019 tile. 
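Conceptually, sorting by feature influence just orders the model's coefficients by magnitude. A toy Python sketch (the words and weights are invented; in the real model each of the 18,175 unigrams carries such a weight):

```python
# Invented unigram weights from a fitted logistic regression: positive values
# push a speech towards one party, negative values towards the other.
weights = {"freedom": 1.8, "terror": 1.2, "healthcare": -1.5,
           "union": -0.9, "the": 0.01}

# Rank by absolute weight: the most party-indicative features come first.
ranked = sorted(weights.items(), key=lambda kv: abs(kv[1]), reverse=True)
print(ranked[0])   # ('freedom', 1.8)
print(ranked[-1])  # ('the', 0.01): close to zero, little explanatory power
```

Ranking by absolute value is what lets us read off the most Democratic and most Republican features from a single sorted table.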
The negative features are strongly indicative of the Democratic party, while the positive features are strongly indicative of the Republican party.<a class=\"footnote\" title=\"Like with NPs and PPs in the last chapter, it is arbitrary which category is represented by positive and which by negative numbers. There are no political implications here.\" id=\"return-footnote-379-3\" href=\"#footnote-379-3\" aria-label=\"Footnote 3\"><sup class=\"footnote\">[3]<\/sup><\/a> The closer the feature influence is to zero, the less explanatory power it holds. Let\u2019s take a look at the features most indicative of being Republican:<\/p>\n<figure id=\"attachment_513\" aria-describedby=\"caption-attachment-513\" style=\"width: 1020px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/8.-Republican-features.jpg\" alt=\"\" class=\"size-full wp-image-513\" width=\"1020\" height=\"583\" srcset=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/8.-Republican-features.jpg 1020w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/8.-Republican-features-300x171.jpg 300w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/8.-Republican-features-768x439.jpg 768w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/8.-Republican-features-65x37.jpg 65w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/8.-Republican-features-225x129.jpg 225w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2020\/01\/8.-Republican-features-350x200.jpg 350w\" sizes=\"(max-width: 1020px) 100vw, 1020px\" \/><figcaption id=\"caption-attachment-513\" class=\"wp-caption-text\">Figure 11.5: Most informative 
Republican features<\/figcaption><\/figure>\n<p style=\"text-align: justify\">We see among the most Republican features several names &#8211; Bush, Nancy (Reagan), Laura (Bush) &#8211; and several familiar buzzwords &#8211; freedom, god, nation, terror. Linguistically, it is interesting to note that contractions &#8211; \u2018ve, \u2018m, \u2018ll &#8211; are decidedly Republican features. A first interpretation of this could be that Republicans want to present themselves as being close to the public, using colloquial language rather than technical jargon.<\/p>\n<p style=\"text-align: justify\">Of course, if one were to pursue this line of investigation seriously, one would have to check whether these results are the consequence of different transcription conventions, and whether there is evidence of the \u2018one of the people\u2019 attitude in the speeches of Republicans (and of how Democratic politicians present themselves). We are not going to expound on this at any greater length, but this example should demonstrate both how unigram features can provide a good basis for further analysis and why these features should be interpreted with some caution.<\/p>\n<p style=\"text-align: justify\">We see here how regression analysis provides a basis for understanding advanced machine learning tools such as LightSide. What we also see is how we strain the limits of interpretability when we perform regression analysis on this scale: in our example, about 18,000 features contribute to the model, far more than can be incorporated into a qualitative analysis. This problem is not limited to document classification. Whenever we want to analyze substantial quantities of text, human reading capacities are challenged.
We are going to discuss topic modelling in the <a href=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/chapter\/topic-modelling\/\">next chapter<\/a> to address this issue of excess information.<\/p>\n<p>&nbsp;<\/p>\n<p><strong>References:<\/strong><\/p>\n<p style=\"text-align: justify\">Guerini, M., D. Giampiccolo, G. Moretti, R. Sprugnoli and C. Strapparava. 2013. The New Release of CORPS: A Corpus of Political Speeches Annotated with Audience Reactions. <em>Multimodal Communication in Political Speech: Shaping Minds and Social Action<\/em>, ed. by Isabella Poggi, Francesca D\u2019Errico, Laura Vincze and Alessandro Vinciarelli, 86-98. Berlin: Springer.<\/p>\n<p style=\"text-align: justify\">Mayfield, Elijah, David Adamson and Carolyn P. Ros\u00e9. 2014. LightSide: Researcher\u2019s Workbench User Manual. Online: <a href=\"http:\/\/ankara.lti.cs.cmu.edu\/side\/LightSide_Researchers_Manual.pdf\">http:\/\/ankara.lti.cs.cmu.edu\/side\/LightSide_Researchers_Manual.pdf<\/a>.<\/p>\n<p style=\"text-align: justify\">Vergelis, Maria, Tatyana Shcherbakova and Tatyana Sidorina. 2019. Spam and phishing in Q1 2019. <em>Kaspersky<\/em>. Online: <a href=\"https:\/\/securelist.com\/spam-and-phishing-in-q1-2019\/90795\/\">https:\/\/securelist.com\/spam-and-phishing-in-q1-2019\/90795\/<\/a>.<\/p>\n<p style=\"text-align: justify\">\n<hr class=\"before-footnotes clear\" \/><div class=\"footnotes\"><ol><li id=\"footnote-379-1\">You might consider downloading novels by two different authors from Gutenberg, for instance. <a href=\"#return-footnote-379-1\" class=\"return-footnote\" aria-label=\"Return to footnote 1\">&crarr;<\/a><\/li><li id=\"footnote-379-2\">LightSide is, in fact, a Weka wrapper. Weka is a workbench for machine learning. So we have at our disposal a very powerful tool which can scale to large datasets and which would be capable of incorporating a very wide variety of features into its models.
For simplicity's sake, we stick to unigrams here. <a href=\"#return-footnote-379-2\" class=\"return-footnote\" aria-label=\"Return to footnote 2\">&crarr;<\/a><\/li><li id=\"footnote-379-3\">Like with NPs and PPs in the last chapter, it is arbitrary which category is represented by positive and which by negative numbers. There are no political implications here. <a href=\"#return-footnote-379-3\" class=\"return-footnote\" aria-label=\"Return to footnote 3\">&crarr;<\/a><\/li><\/ol><\/div>","protected":false},"author":29,"menu_order":3,"template":"","meta":{"pb_show_title":"on","pb_short_title":"","pb_subtitle":"","pb_authors":[],"pb_section_license":""},"chapter-type":[],"contributor":[],"license":[],"class_list":["post-379","chapter","type-chapter","status-publish","hentry"],"part":56,"_links":{"self":[{"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/pressbooks\/v2\/chapters\/379"}],"collection":[{"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/pressbooks\/v2\/chapters"}],"about":[{"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/wp\/v2\/types\/chapter"}],"author":[{"embeddable":true,"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/wp\/v2\/users\/29"}],"version-history":[{"count":14,"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/pressbooks\/v2\/chapters\/379\/revisions"}],"predecessor-version":[{"id":639,"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/pressbooks\/v2\/chapters\/379\/revisions\/639"}],"part":[{"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/pressbooks\/v2\/parts\/56"}],"metadata":[{"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/pressbooks\/v2\/chapters\/379\/metadata\/"}],"wp:attachment":[{"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/wp\/v2\/media?parent=379"}],"wp:term":[{"taxonomy":"chapter-type","embeddable":true,"
href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/pressbooks\/v2\/chapter-type?post=379"},{"taxonomy":"contributor","embeddable":true,"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/wp\/v2\/contributor?post=379"},{"taxonomy":"license","embeddable":true,"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/wp\/v2\/license?post=379"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}