{"id":53,"date":"2019-08-14T10:59:14","date_gmt":"2019-08-14T08:59:14","guid":{"rendered":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/?post_type=chapter&#038;p=53"},"modified":"2020-02-04T23:10:21","modified_gmt":"2020-02-04T22:10:21","slug":"n-grams","status":"publish","type":"chapter","link":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/chapter\/n-grams\/","title":{"raw":"N-Grams","rendered":"N-Grams"},"content":{"raw":"<p style=\"text-align: justify\">In the last chapter, we discussed language at the level of words. In this chapter, we turn our focus to interactions between words and introduce so-called n-gram language models.<\/p>\r\n&nbsp;\r\n<p style=\"text-align: justify\"><strong>Expanding the simplest language model<\/strong><\/p>\r\n<p style=\"text-align: justify\">In the preceding chapter, we had the example of the urn which contains N word types, N being the size of the lexicon. We drew M word tokens from the urn, and replaced the word type we drew each time. M is the length of our text. The two values N and M were enough to characterize a simple 1-gram language model.<\/p>\r\n[h5p id=\"31\"]\r\n<p style=\"text-align: justify\">Obviously, these assumptions render the model very unrealistic. In the last chapter we saw in the discussion of Zipf's law that different words have extremely different probabilities of being used in language production. Now let us think about the other assumption. To improve our language model, we have to get away from the assumption that draws are independent. This is necessary because we know that word order matters. One might think of a far-fetched example, indeed one from a galaxy far, far away. Yoda's speech is instantly recognizable as his on account of its unusual subject-object-verb structure. English sentences are most frequently structured along subject-verb-object. 
This means that if we know the first word of a sentence, we can assign a higher probability to having a verb in the second place than to having an object. To account for this in our language model, we can use conditional probabilities, as they are known in statistics. Let\u2019s look at the simplest of the conditional probability models, the bigram language model.<\/p>\r\n<p style=\"text-align: justify\">The bigram language model assumes a dependency structure where the probability of a word occurring depends on the previous word. Formally, we can express this as:<\/p>\r\n<p style=\"text-align: center\">[latex]\\text{Conditional probability in a bigram model}=p(word | previous word)[\/latex]<\/p>\r\n[Exercise to come]\r\n\r\n&nbsp;\r\n<p style=\"text-align: justify\"><strong>The improved monkey at the typewriter<\/strong><\/p>\r\n<p style=\"text-align: justify\">(Or, Shakespeare in love with random text generation)<\/p>\r\n<p style=\"text-align: justify\">The difference between the 1-gram model and more advanced n-gram models becomes apparent when we look at some examples. Trained on a healthy diet of Shakespeare, the different models fare as follows.<\/p>\r\n<p style=\"text-align: justify\">The 1-gram model, with independent draws and equal word probabilities [latex]p(word)[\/latex], looks like this:<\/p>\r\n\r\n<ol style=\"text-align: justify\">\r\n \t<li><em> To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have.<\/em><\/li>\r\n \t<li><em> Every enter now severally so, let<\/em><\/li>\r\n \t<li><em> Hill he late speaks; or! a more to leg less first you enter<\/em><\/li>\r\n \t<li><em> Are where exeunt and sighs have rise excellency took of ... Sleep knave we. 
near; vile like.<\/em><\/li>\r\n<\/ol>\r\n<p style=\"text-align: justify\">The bigram model, with two-word sequences [latex]p(word | previous word)[\/latex], looks only vaguely better:<\/p>\r\n\r\n<ol style=\"text-align: justify\">\r\n \t<li><em> What means, sir. I confess she? then all sorts, he is trim, captain.<\/em><\/li>\r\n \t<li><em> Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow.<\/em><\/li>\r\n \t<li><em> What we, hath got so she that I rest and sent to scold and nature bankrupt, nor the first gentleman?<\/em><\/li>\r\n \t<li><em> Thou whoreson chops. Consumption catch your dearest friend, well, and I know where many mouths upon my undoing all but be, how soon, then; we\u2019ll execute upon my love\u2019s bonds and we do you will?<\/em><\/li>\r\n<\/ol>\r\n<p style=\"text-align: justify\">In contrast, the trigram model, with three-word sequences [latex]p(word | word-1, word-2)[\/latex], is no longer as bad:<\/p>\r\n\r\n<ol style=\"text-align: justify\">\r\n \t<li><em> Sweet prince, Falstaff shall die. Harry of Monmouth\u2019s grave.<\/em><\/li>\r\n \t<li><em> This shall forbid it should be branded, if renown made it empty.<\/em><\/li>\r\n \t<li><em> Indeed the duke; and had a very good friend.<\/em><\/li>\r\n \t<li><em> Fly, and will rid me these news of price. Therefore the sadness of parting, as they say, \u2018tis done.<\/em><\/li>\r\n<\/ol>\r\n<p style=\"text-align: justify\">But by no means does the trigram model hold a candle to the quadrigram model. Although we probably still wouldn't pay money to sit through a play generated like this, the four-word sequences of [latex]p(word | word-1, word-2, word-3)[\/latex] begin to approximate meaningful language:<\/p>\r\n\r\n<ol style=\"text-align: justify\">\r\n \t<li><em> King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. 
A great banquet serv\u2019d in;<\/em><\/li>\r\n \t<li><em> Will you not tell me who I am?<\/em><\/li>\r\n \t<li><em> Indeed the short and long. Marry, \u2018tis a noble Lepidus.<\/em><\/li>\r\n \t<li><em> Enter Leonato\u2019s brother Antonio, and the rest, but seek the weary beds of people sick.<\/em><\/li>\r\n<\/ol>\r\n<p style=\"text-align: justify\">These examples of n-gram models with varying degrees of complexity are often cited creations taken from Jurafsky and Martin (2009).<\/p>\r\n<p style=\"text-align: justify\">In order to understand the progression of these models in more detail, we need a bit of linguistic theory. Chomsky says:<\/p>\r\n<p style=\"text-align: justify\"><em>Conversely, a descriptively adequate grammar is not quite equivalent to nondistinctness in the sense of distinctive feature theory. It appears that the speaker-hearer's linguistic intuition raises serious doubts about an important distinction in language use. To characterize a linguistic level L, the notion of level of grammaticalness is, apparently, determined by a parasitic gap construction. A consequence of the approach just outlined is that the theory of syntactic features developed earlier is to be regarded as irrelevant intervening contexts in selectional rules. In the discussion of resumptive pronouns following (81), the earlier discussion of deviance is rather different from the requirement that branching is not tolerated within the dominance scope of a complex symbol.<\/em><\/p>\r\n<p style=\"text-align: justify\">In case this does not answer your questions fully, we recommend you read the original source: <a href=\"http:\/\/www.rubberducky.org\/cgi-bin\/chomsky.pl\">http:\/\/www.rubberducky.org\/cgi-bin\/chomsky.pl<\/a><\/p>\r\n<p style=\"text-align: justify\">Does this leave you quite speechless? Well, it certainly left Chomsky speechless. Or, at least, he didn't say these words. 
In a sense, though, they are of his creation: the Chomskybot is a phrase-based language model trained on (some of) the linguistic works of Noam Chomsky.<\/p>\r\n<p style=\"text-align: justify\">If you prefer your prose less dense, we can also recommend the automatically generated <em>Harry Potter and the Portrait of What Looked Like a Large Pile of Ash<\/em>, which you can find here: <a href=\"https:\/\/botnik.org\/content\/harry-potter.html\">https:\/\/botnik.org\/content\/harry-potter.html<\/a><\/p>\r\n<p style=\"text-align: justify\">You can see that different language models produce outputs with varying degrees of realism. While these language generators are good fun, there are many other areas of language processing which also make use of n-gram models:<\/p>\r\n\r\n<ul style=\"text-align: justify\">\r\n \t<li>Speech recognition\r\n<ul>\r\n \t<li>\u201cI ate a cherry\u201d is a more likely sentence than \u201cEye eight uh Jerry\u201d<\/li>\r\n<\/ul>\r\n<\/li>\r\n \t<li>OCR &amp; Handwriting recognition\r\n<ul>\r\n \t<li>More probable words\/sentences are more likely correct readings.<\/li>\r\n<\/ul>\r\n<\/li>\r\n \t<li>Machine translation\r\n<ul>\r\n \t<li>More likely phrases\/sentences are probably better translations.<\/li>\r\n<\/ul>\r\n<\/li>\r\n \t<li>Natural Language Generation\r\n<ul>\r\n \t<li>More likely phrases\/sentences are probably better NL generations.<\/li>\r\n<\/ul>\r\n<\/li>\r\n \t<li>Context-sensitive spelling correction\r\n<ul>\r\n \t<li>\u201cTheir are problems wit this sentence.\u201d<\/li>\r\n<\/ul>\r\n<\/li>\r\n \t<li>Part-of-speech tagging (word and tag n-grams)\r\n<ul>\r\n \t<li><em>I like to run \/ I like the run.<\/em><\/li>\r\n \t<li>Time flies like an arrow. 
Fruit flies like a banana.<\/li>\r\n<\/ul>\r\n<\/li>\r\n \t<li>Collocation detection<\/li>\r\n<\/ul>\r\nIn the next section, we want to spend some time on the last of these topics.\r\n\r\n&nbsp;\r\n<p style=\"text-align: justify\"><strong>Collocations<\/strong><\/p>\r\n<p style=\"text-align: justify\">Take words like <em>strong <\/em>and <em>powerful<\/em>. They are near synonyms, and yet <em>strong tea <\/em>is more likely than <em>powerful tea<\/em>, and <em>powerful car<\/em> is more likely than <em>strong car<\/em>. We will informally use the term <em>collocation<\/em> to refer to pairs of words which have a tendency to \u201cstick together\u201d and discuss how we can use statistical methods to answer these questions:<\/p>\r\n\r\n<ul style=\"text-align: justify\">\r\n \t<li>Can we quantify the tendency of two words to 'stick together'?<\/li>\r\n \t<li>Can we show that \u201cstrong tea\u201d is more likely than chance, and \u201cpowerful tea\u201d is less likely than chance?<\/li>\r\n \t<li>Can we detect collocations and idioms automatically in large corpora?<\/li>\r\n \t<li>How well does the obvious approach of using bigram conditional probabilities fare, calculating, for instance [latex]p(tea | strong) &gt; p(tea | powerful)[\/latex]?<\/li>\r\n<\/ul>\r\n<p style=\"text-align: justify\">But first, let\u2019s take a look at the linguistic background of collocations. Choueka (1988) defines collocations as \u201ca sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components\u201d. Some criteria that distinguish collocations from other bigrams are:<\/p>\r\n\r\n<ul style=\"text-align: justify\">\r\n \t<li>Non-compositionality: the meaning is not compositional (e.g. \u201ckick the bucket\u201d)<\/li>\r\n \t<li>Non-substitutability: near synonyms cannot be used (e.g. \u201cyellow wine\u201d?)<\/li>\r\n \t<li>Non-modifiability: \u201ckick the bucket\u201d, \u201c*kick the buckets\u201d, \u201c*kick my bucket\u201d, \u201c*kick the blue bucket\u201d<\/li>\r\n \t<li>Non-literal translations: red wine &lt;-&gt; vino tinto; take decisions &lt;-&gt; Entscheidungen treffen<\/li>\r\n<\/ul>\r\n<p style=\"text-align: justify\">However, these criteria should be taken with a grain of salt. For one, the criteria are all quite vague, and there is no clear agreement as to what can be termed a collocation. Is the defining feature the grammatical relation (adjective + noun or verb + subject) or is it the proximity of the two words? Moreover, there is no sharp line demarcating collocations from phrasal verbs, idioms, named entities and technical terminology composed of multiple words. And what are we to make of the co-occurrence of thematically related words (e.g. doctor and nurse, or plane and airport)? Or the co-occurrence of words in different languages across parallel corpora? Obviously, there is a cline going from free combinations through collocations to idioms. Between these different elements of language there is an area of gradience which makes collocations particularly interesting.<\/p>\r\n&nbsp;\r\n<p style=\"text-align: justify\"><strong>Statistical collocation measures<\/strong><\/p>\r\n<p style=\"text-align: justify\">With raw frequencies, z-scores, t-scores and chi-square tests we have at our disposal an arsenal of statistical methods we could use to measure collocation strength. In this section we will discuss how to use raw frequencies and chi-square tests to this end, and additionally we will introduce three new measures: Mutual Information (MI), Observed\/Expected (O\/E) and log-likelihood. In general, these measures are used to rank pair types as candidates for collocations. The different measures have different strengths and weaknesses, which means that they can yield a variety of results. 
Moreover, the association scores computed by different measures cannot be compared directly.<\/p>\r\n<p style=\"text-align: justify\"><strong>Frequencies<\/strong><\/p>\r\n<p style=\"text-align: justify\">Frequencies can be used to find collocations by counting the number of occurrences. For this, we typically use word bigrams. Usually, these searches result in a lot of function word pairs that need to be filtered out. Take a look at these collocations from an example corpus:<\/p>\r\n\r\n\r\n[caption id=\"attachment_446\" align=\"aligncenter\" width=\"225\"]<img src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/1.-frequencies.jpg\" alt=\"\" class=\"size-full wp-image-446\" width=\"225\" height=\"505\" \/> Figure 8.1: Raw collocation frequencies[\/caption]\r\n<p style=\"text-align: justify\">Except for <em>New York<\/em>, all the bigrams are pairs of function words. This is neither particularly surprising, nor particularly useful. If you have access to a POS-tagged corpus, or know how to apply POS-tags to a corpus, you can improve the results a bit by only searching for bigrams with certain POS-tag sequences (e.g. noun-noun, adj-noun).<\/p>\r\n<p style=\"text-align: justify\">Despite it being an imperfect measure, let\u2019s look at how we can identify raw frequencies for bigrams in R. For this, download the ICE GB written corpus in a bigram version:<\/p>\r\n<p style=\"text-align: justify\">[file: ice gb written bigram]<\/p>\r\n<p style=\"text-align: justify\">Import it into R using your preferred command. 
We assigned the corpus to a variable called <em>icegbbi<\/em>, and so we open the first ten entries of the vector:<\/p>\r\n\r\n<pre><code>icegbbi[1:10]\r\n[1] \"Tom\\tRodwell\"\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \"Rodwell\\tWith\"\r\n[3] \"With\\tthese\"\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \"these\\tprofound\"\r\n[5] \"profound\\twords\"\u00a0\u00a0\u00a0\u00a0\u00a0 \"words\\tJ.\"\r\n[7] \"J.\\tN.\"\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \"N.\\tL.\"\r\n[9] \"L.\\tMyres\"\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \"Myres\\tbegins\"\r\n<\/code><\/pre>\r\n<p style=\"text-align: justify\">We can see that each position contains two words separated by a tab, which is what the<em> \\t<\/em> stands for. Transform the vector into a table, sort it in decreasing order and take a look at the highest ranking entries:<\/p>\r\n\r\n<pre><code>&gt; ticegbbi = table(icegbbi)\r\n&gt; sortticegbbi = as.table(sort(ticegbbi,decreasing=T))\r\n&gt; sortticegbbi[1:16]\r\nicegbbi\r\n\u00a0 of\\tthe\u00a0\u00a0\u00a0 in\\tthe\u00a0\u00a0\u00a0 to\\tthe\u00a0\u00a0\u00a0\u00a0 to\\tbe\u00a0\u00a0 and\\tthe\u00a0\u00a0\u00a0 on\\tthe\u00a0\u00a0 for\\tthe\u00a0 from\\tthe\r\n\u00a0\u00a0\u00a0\u00a0 3623\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 2126\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 1341\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 908\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 881\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 872\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 696\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 627\r\n\u00a0 by\\tthe\u00a0\u00a0\u00a0 at\\tthe\u00a0 with\\tthe\u00a0\u00a0\u00a0\u00a0\u00a0 of\\ta\u00a0 that\\tthe\u00a0\u00a0\u00a0\u00a0 it\\tis\u00a0\u00a0 will\\tbe\u00a0\u00a0\u00a0\u00a0\u00a0 is\\ta\r\n\u00a0\u00a0\u00a0\u00a0\u00a0 602\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 597\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 576\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 
571\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 559\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 542\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 442\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 420\r\n<\/code><\/pre>\r\n<p style=\"text-align: justify\">The results in the written ICE GB corpus are very similar to the results in the table above. In fact, among the 16 most frequent bigrams, there are only pairs of function words.<\/p>\r\n<p style=\"text-align: justify\">[H5P: Search the frequency table some more. What is the most frequent bigram which you could classify as a collocation?]<\/p>\r\n&nbsp;\r\n<p style=\"text-align: justify\"><strong>Mutual Information and Observed\/Expected<\/strong><\/p>\r\n<p style=\"text-align: justify\">Fortunately, there are better collocation measures than raw frequencies. One of these is the Mutual Information (MI) score. We can easily calculate the probability that a collocation composed of the words x and y occurs due to chance. To do so, we simply need to assume that they are independent events. Under this assumption, we can calculate the individual probabilities from the raw frequencies f(x) and f(y) and the corpus size N:<\/p>\r\n<p style=\"text-align: center\">[latex]p(x)=\\frac{f(x)}{N}[\/latex] and [latex]p(y)=\\frac{f(y)}{N}[\/latex]<\/p>\r\n<p style=\"text-align: justify\">Of course, we can also calculate the observed probability of finding x and y together. 
We obtain this joint probability by dividing the number of times we find x and y in sequence, f(x,y), by the corpus size N:<\/p>\r\n<p style=\"text-align: center\">[latex]p(x,y)=\\frac{f(x,y)}{N}[\/latex]<\/p>\r\n<p style=\"text-align: justify\">If the collocation of x and y is due to chance, which is to say if x and y are independent, the observed probability will be roughly equal to the expected probability:<\/p>\r\n<p style=\"text-align: center\">[latex]p(x,y){\\approx}p(x)*p(y)[\/latex]<\/p>\r\n<p style=\"text-align: justify\">If the collocation is not due to chance, we have one of two cases. Either the observed probability is substantially higher than the expected probability, in which case we have a strong collocation:<\/p>\r\n<p style=\"text-align: center\">[latex]p(x,y)&gt;&gt;p(x)*p(y)[\/latex]<\/p>\r\n<p style=\"text-align: justify\">Or the inverse is true, and the observed probability is much lower than the expected one:<\/p>\r\n<p style=\"text-align: center\">[latex]p(x,y)&lt;&lt;p(x)*p(y)[\/latex]<\/p>\r\n<p style=\"text-align: justify\">The latter scenario can be referred to as \u2018negative\u2019 collocation, which simply means that the two words hardly occur together.<\/p>\r\n<p style=\"text-align: justify\">The comparison of joint and independent probabilities is known as Mutual Information (MI) and originates in information theory, where it is interpreted as surprise measured in bits. What we outlined above is a basic version of MI which we call O\/E, and which can be simply calculated by dividing the observed probability by the expected probability:<\/p>\r\n<p style=\"text-align: center\">[latex]O\/E=\\frac{p(x,y)}{p(x)*p(y)}[\/latex]<\/p>\r\n<p style=\"text-align: justify\">After this introduction to the concept, let\u2019s take a look at how to implement the O\/E in R.<\/p>\r\n<p style=\"text-align: justify\">[R exercise with available corpus?]<\/p>\r\n<p style=\"text-align: justify\">Okay, let's look at one more set of examples, this time with data from the British National Corpus (BNC). 
The BNC contains 97,626,093 words in total. The table below shows the frequencies for the words 'powerful', 'strong', 'car' and 'tea', as well as all of the bigrams composed of these four words:<\/p>\r\n\r\n<table class=\"grid\" style=\"border-collapse: collapse;width: 67.4874%;height: 131px\" border=\"0\"><caption>Table 8.1: Powerful, strong, car, tea<\/caption>\r\n<tbody>\r\n<tr>\r\n<td style=\"width: 25%\">Unigrams<\/td>\r\n<td style=\"width: 25%\">Frequency<\/td>\r\n<td style=\"width: 25%\">Bigrams<\/td>\r\n<td style=\"width: 25%\">Frequency<\/td>\r\n<\/tr>\r\n<tr>\r\n<td style=\"width: 25%\">powerful<\/td>\r\n<td style=\"width: 25%\">7,070<\/td>\r\n<td style=\"width: 25%\">powerful car<\/td>\r\n<td style=\"width: 25%\">15<\/td>\r\n<\/tr>\r\n<tr>\r\n<td style=\"width: 25%\">strong<\/td>\r\n<td style=\"width: 25%\">15,768<\/td>\r\n<td style=\"width: 25%\">powerful tea<\/td>\r\n<td style=\"width: 25%\">3<\/td>\r\n<\/tr>\r\n<tr>\r\n<td style=\"width: 25%\">car<\/td>\r\n<td style=\"width: 25%\">26,731<\/td>\r\n<td style=\"width: 25%\">strong car<\/td>\r\n<td style=\"width: 25%\">4<\/td>\r\n<\/tr>\r\n<tr>\r\n<td style=\"width: 25%\">tea<\/td>\r\n<td style=\"width: 25%\">8,030<\/td>\r\n<td style=\"width: 25%\">strong tea<\/td>\r\n<td style=\"width: 25%\">28<\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\n<p style=\"text-align: justify\">Use the formula above and the information in the table to calculate the O\/E for the four bigrams in the table.<\/p>\r\n<p style=\"text-align: justify\">[h5p id=\"33\"]<\/p>\r\n&nbsp;\r\n<p style=\"text-align: justify\"><strong>Low counts and other problems<\/strong><\/p>\r\n<p style=\"text-align: justify\">In the exercise above, we get one counterintuitive result. According to our calculations, <em>powerful tea<\/em> is still a collocation. What is going on here? Well, if you look at the raw frequencies, you can see that we have 28 observations of <em>strong tea<\/em>, but only three of <em>powerful tea<\/em>. 
Evidently, three occurrences make for a very rare phenomenon which is not statistically significant. Interestingly, for once we can actually account for these three occurrences of <em>powerful tea<\/em>. Take a look at the image below, which contains the BNC results for the query \u201cpowerful tea\u201d.<\/p>\r\n\r\n\r\n[caption id=\"attachment_550\" align=\"aligncenter\" width=\"712\"]<img src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/2.-Powerful-tea.jpg\" alt=\"\" class=\"size-full wp-image-550\" width=\"712\" height=\"285\" \/> Figure 8.2: Powerful tea in the British National Corpus[\/caption]\r\n<p style=\"text-align: justify\">If you read the context in which <em>powerful tea<\/em> occurs, you can see easily enough that we are dealing with the observer\u2019s paradox here: we only find <em>powerful tea<\/em> in this corpus as an example of a bad collocation. Although this is a rare situation, it makes clear why it is worth reading some examples whenever working with large corpora. This is good advice in any case, but it is especially true with low counts and surprising results.<\/p>\r\n<p style=\"text-align: justify\">Each of the statistical measures we have discussed so far has its limitations, and the same holds true for MI. For one, its assumptions are flawed: language data do not follow a normal distribution. Moreover, while the MI is a good measure of independence, the absolute values are not very meaningful. And, as we have seen, it does not perform well with low counts. With low counts, the MI tends to overestimate the strength of the collocation.<\/p>\r\n<p style=\"text-align: justify\">One response to these limitations is to use a significance test (chi-square, t-test or z-test) instead. 
This allows us to mitigate the low-count problem.<\/p>\r\n&nbsp;\r\n<p style=\"text-align: justify\"><strong>Chi-square test<\/strong><\/p>\r\n<p style=\"text-align: justify\">As we have seen <a href=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/chapter\/chi-square-test-significance-2\/\">earlier<\/a>, the chi-square statistic sums the squared differences between observed and expected values in all cells of the table and scales them by the magnitude of the expected values.<\/p>\r\n<p style=\"text-align: justify\">Let\u2019s take a look at how the chi-square test can be used to identify collocations. The table below contains the bigram counts for the words <em>new<\/em> and <em>companies<\/em>:<\/p>\r\n\r\n\r\n[caption id=\"attachment_551\" align=\"aligncenter\" width=\"537\"]<img src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/3.-New-companies.jpg\" alt=\"\" class=\"size-full wp-image-551\" width=\"537\" height=\"148\" \/> Figure 8.3: 'New' and 'companies' in all possible and sensible bigram combinations[\/caption]\r\n<p style=\"text-align: justify\">Specifically, the table juxtaposes <em>new companies<\/em> with all other bigrams in the corpus.<\/p>\r\n<p style=\"text-align: justify\">[H5P: What is N? In other words, how many bigrams are in the corpus?]<\/p>\r\n<p style=\"text-align: justify\">We use [latex]O_{i,j}[\/latex] to refer to the observed value for cell (i,j), and [latex]E_{i,j}[\/latex] to refer to the expected value. The observed values are given in the table, e.g. [latex]O_{1,1}=8[\/latex]. The corpus size is N = 14,229,981 bigrams. As we know, the expected values can be determined from the marginal probabilities, e.g. 
for <em>new companies<\/em> in cell (1,1):<\/p>\r\n<p style=\"text-align: center\">[latex]E_{1,1}=\\frac{((8+15820)*(8+4667))}{N}=5.2[\/latex]<\/p>\r\n<p style=\"text-align: justify\">If we do this for each cell and calculate the chi-square sum, we obtain a value of 1.55. In order to determine whether <em>new companies<\/em> is a statistically significant collocation, we have to look up the critical value for the 0.05 level in the chi-square table we saw in the chapter on the chi-square test (or in any of the many tables available elsewhere online).<\/p>\r\n[h5p id=\"34\"]\r\n<p style=\"text-align: justify\">Because our chi-square value of 1.55 is substantially smaller than the critical value at the 95% level, we cannot reject the null hypothesis. In other words, there is no evidence that <em>new companies<\/em> should be a collocation.<\/p>\r\n&nbsp;\r\n<p style=\"text-align: justify\"><strong>Problems with significance tests and what comes next\r\n<\/strong><\/p>\r\n<p style=\"text-align: justify\">The chi-square, t-test and the z-test are often used as collocation measures. These significance tests are popular as collocation tests, since they empirically seem to fit. However, strictly speaking, collocation strength and significance are quite different things.<\/p>\r\n<p style=\"text-align: justify\">Where does this leave us? We now know that there are two main classes of collocation tests: measures of surprise, which mainly report rare collocations and float a lot of garbage due to data sparseness, and significance tests, which mainly report frequent collocations and miss rare ones. One way of addressing the limitations we encounter in our collocation measures is to use the MI or O\/E in combination with a t-test significance filter. Alternatively, some tests exist which are slightly more complex and slightly more balanced. 
These include MI3 and log-likelihood, which we discuss in [Chapter Regression].<\/p>\r\n<p style=\"text-align: justify\">For further reading on collocation detection, we recommend:<\/p>\r\n\r\n<ul>\r\n \t<li style=\"text-align: justify\">Stephan Evert's http:\/\/www.collocations.de\/<\/li>\r\n \t<li style=\"text-align: justify\">Jurafsky and Martin (2009)<\/li>\r\n<\/ul>\r\n&nbsp;\r\n<p style=\"text-align: justify\"><strong>References:<\/strong><\/p>\r\n<p style=\"text-align: justify\">Botnik. 2018. Harry Potter and the portrait of what looked like a large pile of ash. Online: <a href=\"https:\/\/botnik.org\/content\/harry-potter.html\">https:\/\/botnik.org\/content\/harry-potter.html<\/a>.<\/p>\r\n<p style=\"text-align: justify\">Choueka, Yaacov. 1988. Looking for needles in a haystack, or: locating interesting collocational expressions in large textual databases. <em>Proceedings of the RIAO<\/em>. 609-623.<\/p>\r\n<p style=\"text-align: justify\">Jurafsky, Dan and James H. Martin. 2009. <em>Speech and language processing: an introduction to natural language processing, computational linguistics and speech recognition<\/em>. 2nd Edition. Upper Saddle River, NJ: Pearson Prentice Hall.<\/p>\r\n<p style=\"text-align: justify\">Lawler, John and Kevin McGowan. The Chomskybot. Online: <a href=\"http:\/\/rubberducky.org\/cgi-bin\/chomsky.pl\">http:\/\/rubberducky.org\/cgi-bin\/chomsky.pl<\/a>.<\/p>","rendered":"<p style=\"text-align: justify\">In the last chapter, we discussed language at the level of words. 
In this chapter, we turn our focus to interactions between words and introduce so-called n-gram language models.<\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: justify\"><strong>Expanding the simplest language model<\/strong><\/p>\n<p style=\"text-align: justify\">In the preceding chapter, we had the example of the urn which contains N word types, N being the size of the lexicon. We drew M word tokens from the urn, and replaced the word type we drew each time. M is the length of our text. The two values N and M were enough to characterize a simple 1-gram language model.<\/p>\n<div id=\"h5p-31\">\n<div class=\"h5p-iframe-wrapper\"><iframe id=\"h5p-iframe-31\" class=\"h5p-iframe\" data-content-id=\"31\" style=\"height:1px\" src=\"about:blank\" frameBorder=\"0\" scrolling=\"no\" title=\"8.1 1-gram assumptions\"><\/iframe><\/div>\n<\/div>\n<p style=\"text-align: justify\">Obviously, these assumptions render the model very unrealistic. In the last chapter we saw in the discussion of Zipf&#8217;s law that different words have extremely different probabilities of being used in language production. Now let us think about the other assumption. To improve our language model, we have to get away from the assumption that draws are independent. This is necessary because we know that word order matters. One might think of a far-fetched example, indeed one from a galaxy far, far away. Yoda&#8217;s speech is instantly recognizable as his on account of its unusual subject-object-verb structure. English sentences are most frequently structured along subject-verb-object. This means that if we know the first word of a sentence, we can assign a higher probability to having a verb in the second place than to having an object. To account for this in our language model, we can use conditional probabilities, as they are known in statistics. 
Let\u2019s look at the simplest of the conditional probability models, the bigram language model.<\/p>\n<p style=\"text-align: justify\">The bigram language model assumes a dependency structure where the probability of a word occurring depends on the previous word. Formally, we can express this as:<\/p>\n<p style=\"text-align: center\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/ql-cache\/quicklatex.com-b4054c19b53781c8a36fe9f2b5d006ac_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#92;&#116;&#101;&#120;&#116;&#123;&#67;&#111;&#110;&#100;&#105;&#116;&#105;&#111;&#110;&#97;&#108;&#32;&#112;&#114;&#111;&#98;&#97;&#98;&#105;&#108;&#105;&#116;&#121;&#32;&#105;&#110;&#32;&#97;&#32;&#98;&#105;&#103;&#114;&#97;&#109;&#32;&#109;&#111;&#100;&#101;&#108;&#125;&#61;&#112;&#40;&#119;&#111;&#114;&#100;&#32;&#124;&#32;&#112;&#114;&#101;&#118;&#105;&#111;&#117;&#115;&#32;&#119;&#111;&#114;&#100;&#41;\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"527\" style=\"vertical-align: -4px;\" \/><\/p>\n<p>[Exercise to come]<\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: justify\"><strong>The improved monkey at the typewriter<\/strong><\/p>\n<p style=\"text-align: justify\">(Or, Shakespeare in love with random text generation)<\/p>\n<p style=\"text-align: justify\">How much of a difference there is between the 1-gram and more advanced n-gram models becomes apparent when looking at some examples. 
Trained on a healthy diet of Shakespeare, the different models fare as follows.<\/p>\n<p style=\"text-align: justify\">The 1-gram model, with independent draws and equal word probabilities <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/ql-cache\/quicklatex.com-c0be6ce46073615b2bc7676a74714a55_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#112;&#40;&#119;&#111;&#114;&#100;&#41;\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"62\" style=\"vertical-align: -4px;\" \/>, looks like this:<\/p>\n<ol style=\"text-align: justify\">\n<li><em> To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have.<\/em><\/li>\n<li><em> Every enter now severally so, let<\/em><\/li>\n<li><em> Hill he late speaks; or! a more to leg less first you enter<\/em><\/li>\n<li><em> Are where exeunt and sighs have rise excellency took of &#8230; Sleep knave we. near; vile like.<\/em><\/li>\n<\/ol>\n<p style=\"text-align: justify\">The bigram model, with two-word sequences <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/ql-cache\/quicklatex.com-5f0bd7614f14ccf7b2ed98c58057df9f_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#112;&#40;&#119;&#111;&#114;&#100;&#32;&#124;&#32;&#112;&#114;&#101;&#118;&#105;&#111;&#117;&#115;&#32;&#119;&#111;&#114;&#100;&#41;\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"175\" style=\"vertical-align: -4px;\" \/>, looks only vaguely better:<\/p>\n<ol style=\"text-align: justify\">\n<li><em> What means, sir. I confess she? then all sorts, he is trim, captain.<\/em><\/li>\n<li><em> Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. 
Follow.<\/em><\/li>\n<li><em> What we, hath got so she that I rest and sent to scold and nature bankrupt, nor the first gentleman?<\/em><\/li>\n<li><em> Thou whoreson chops. Consumption catch your dearest friend, well, and I know where many mouths upon my undoing all but be, how soon, then; we\u2019ll execute upon my love\u2019s bonds and we do you will?<\/em><\/li>\n<\/ol>\n<p style=\"text-align: justify\">In contrast, the trigram model, with three-word sequences <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/ql-cache\/quicklatex.com-1c566b101b93ed0c183b258aa88b1968_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#112;&#40;&#119;&#111;&#114;&#100;&#32;&#124;&#32;&#119;&#111;&#114;&#100;&#45;&#49;&#44;&#32;&#119;&#111;&#114;&#100;&#45;&#50;&#41;\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"215\" style=\"vertical-align: -4px;\" \/>, is no longer as bad:<\/p>\n<ol style=\"text-align: justify\">\n<li><em> Sweet prince, Falstaff shall die. Harry of Monmouth\u2019s grave.<\/em><\/li>\n<li><em> This shall forbid it should be branded, if renown made it empty.<\/em><\/li>\n<li><em> Indeed the duke; and had a very good friend.<\/em><\/li>\n<li><em> Fly, and will rid me these news of price. Therefore the sadness of parting, as they say, \u2018tis done.<\/em><\/li>\n<\/ol>\n<p style=\"text-align: justify\">But even the trigram model cannot hold a candle to the quadrigram model. 
Although we probably still wouldn&#8217;t pay money to sit through a play generated like this, the four-word sequences of <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/ql-cache\/quicklatex.com-c1998bb3f736cd9f77b9f470273f2a9f_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#112;&#40;&#119;&#111;&#114;&#100;&#32;&#124;&#32;&#119;&#111;&#114;&#100;&#45;&#49;&#44;&#32;&#119;&#111;&#114;&#100;&#45;&#50;&#44;&#32;&#119;&#111;&#114;&#100;&#45;&#51;&#41;\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"293\" style=\"vertical-align: -4px;\" \/> begin to approximate meaningful language:<\/p>\n<ol style=\"text-align: justify\">\n<li><em> King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A great banquet serv\u2019d in;<\/em><\/li>\n<li><em> Will you not tell me who I am?<\/em><\/li>\n<li><em> Indeed the short and long. Marry, \u2018tis a noble Lepidus.<\/em><\/li>\n<li><em> Enter Leonato\u2019s brother Antonio, and the rest, but seek the weary beds of people sick.<\/em><\/li>\n<\/ol>\n<p style=\"text-align: justify\">These examples of n-gram models with varying degrees of complexity are often-cited creations taken from Jurafsky and Martin (2009).<\/p>\n<p style=\"text-align: justify\">In order to understand the progression of these models in more detail, we need a bit of linguistic theory. Chomsky says:<\/p>\n<p style=\"text-align: justify\"><em>Conversely, a descriptively adequate grammar is not quite equivalent to nondistinctness in the sense of distinctive feature theory. It appears that the speaker-hearer&#8217;s linguistic intuition raises serious doubts about an important distinction in language use. To characterize a linguistic level L, the notion of level of grammaticalness is, apparently, determined by a parasitic gap construction. 
A consequence of the approach just outlined is that the theory of syntactic features developed earlier is to be regarded as irrelevant intervening contexts in selectional rules. In the discussion of resumptive pronouns following (81), the earlier discussion of deviance is rather different from the requirement that branching is not tolerated within the dominance scope of a complex symbol.<\/em><\/p>\n<p style=\"text-align: justify\">In case this does not answer your questions fully, we recommend you read the original source: <a href=\"http:\/\/www.rubberducky.org\/cgi-bin\/chomsky.pl\">http:\/\/www.rubberducky.org\/cgi-bin\/chomsky.pl<\/a><\/p>\n<p style=\"text-align: justify\">Does this leave you quite speechless? Well, it certainly left Chomsky speechless. Or, at least, he didn&#8217;t say these words. In a sense, though, they are of his creation: the Chomskybot is a phrase-based language model trained on (some of the linguistic) works of Noam Chomsky.<\/p>\n<p style=\"text-align: justify\">If you prefer your prose less dense, we can also recommend the automatically generated <em>Harry Potter and the Portrait of What Looked Like a Large Pile of Ash<\/em>, which you can find here: <a href=\"https:\/\/botnik.org\/content\/harry-potter.html\">https:\/\/botnik.org\/content\/harry-potter.html<\/a><\/p>\n<p style=\"text-align: justify\">You can see that different language models produce outputs with varying degrees of realism. 
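<\/p>
<p style=\"text-align: justify\">The machinery behind such generators is easy to sketch. The following is a minimal bigram model (shown in Python purely for illustration, while the exercises in this book use R; the toy corpus and function names are our own). It estimates the conditional probability of a word given the previous word from bigram counts, and generates text by repeatedly sampling from that distribution:<\/p>

```python
import random
from collections import Counter

# Toy training corpus; the examples above were trained on Shakespeare instead.
corpus = "the king is dead long live the king the court speaks".split()

# Count bigrams and previous-word frequencies (maximum-likelihood training).
bigram_counts = Counter(zip(corpus, corpus[1:]))
prev_counts = Counter(corpus[:-1])

def p(word, prev):
    """Maximum-likelihood estimate of p(word | previous word)."""
    return bigram_counts[(prev, word)] / prev_counts[prev]

def generate(start, n=5, seed=0):
    """Generate up to n further words by sampling from the bigram model."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n):
        prev = out[-1]
        candidates = [w for (v, w) in bigram_counts if v == prev]
        if not candidates:              # no observed continuation: stop
            break
        weights = [bigram_counts[(prev, w)] for w in candidates]
        out.append(rng.choices(candidates, weights=weights)[0])
    return out

print(round(p("king", "the"), 2))   # "the" is followed by "king" in 2 of its 3 uses
```

<p style=\"text-align: justify\">Training on a large corpus instead of the toy sentence is only a matter of replacing the input; the conditional probability table and the sampling loop stay the same.<\/p>
<p style=\"text-align: justify\">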
While these language generators are good fun, there are many other areas of language processing which also make use of n-gram models:<\/p>\n<ul style=\"text-align: justify\">\n<li>Speech recognition\n<ul>\n<li>\u201cI ate a cherry\u201d is a more likely sentence than \u201cEye eight uh Jerry\u201d<\/li>\n<\/ul>\n<\/li>\n<li>OCR &amp; Handwriting recognition\n<ul>\n<li>More probable words\/sentences are more likely correct readings.<\/li>\n<\/ul>\n<\/li>\n<li>Machine translation\n<ul>\n<li>More likely phrases\/sentences are probably better translations.<\/li>\n<\/ul>\n<\/li>\n<li>Natural Language Generation\n<ul>\n<li>More likely phrases\/sentences are probably better NL generations.<\/li>\n<\/ul>\n<\/li>\n<li>Context sensitive spelling correction\n<ul>\n<li>\u201cTheir are problems wit this sentence.\u201d<\/li>\n<\/ul>\n<\/li>\n<li>Part-of-speech tagging (word and tag n-grams)\n<ul>\n<li><em>I like to run \/ I like the run.<\/em><\/li>\n<li>Time flies like an arrow. Fruit flies like a banana.<\/li>\n<\/ul>\n<\/li>\n<li>Collocation detection<\/li>\n<\/ul>\n<p>In the next section, we want to spend some time on the last of these topics.<\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: justify\"><strong>Collocations<\/strong><\/p>\n<p style=\"text-align: justify\">Take words like <em>strong <\/em>and <em>powerful<\/em>. They are near synonyms, and yet <em>strong tea <\/em>is more likely than <em>powerful tea<\/em>, and <em>powerful car<\/em> is more likely than <em>strong car<\/em>. 
We will informally use the term <em>collocation<\/em> to refer to pairs of words which have a tendency to \u201cstick together\u201d and discuss how we can use statistical methods to answer these questions:<\/p>\n<ul style=\"text-align: justify\">\n<li>Can we quantify the tendency of two words to &#8216;stick together&#8217;?<\/li>\n<li>Can we show that \u201cstrong tea\u201d is more likely than chance, and \u201cpowerful tea\u201d is less likely than chance?<\/li>\n<li>Can we detect collocations and idioms automatically in large corpora?<\/li>\n<li>How well does the obvious approach of using bigram conditional probabilities fare, calculating, for instance <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/ql-cache\/quicklatex.com-b202ebbc818239096752d7ea3760398e_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#112;&#40;&#116;&#101;&#97;&#32;&#124;&#32;&#115;&#116;&#114;&#111;&#110;&#103;&#41;&#32;&#62;&#32;&#112;&#40;&#116;&#101;&#97;&#32;&#124;&#32;&#112;&#111;&#119;&#101;&#114;&#102;&#117;&#108;&#41;\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"252\" style=\"vertical-align: -4px;\" \/>?<\/li>\n<\/ul>\n<p style=\"text-align: justify\">But first, let\u2019s take a look at the linguistic background of collocations. Choueka (1988) defines collocations as \u201ca sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components\u201d. Some criteria that distinguish collocations from other bigrams are:<\/p>\n<ul style=\"text-align: justify\">\n<li>Non-compositionality: the meaning is not compositional (e.g. \u201ckick the bucket\u201d)<\/li>\n<li>Non-substitutability: near synonyms cannot be used (e.g. \u201cyellow wine\u201d?)<\/li>\n<li>Non-modifiability: \u201ckick the bucket\u201d, \u201c*kick the buckets\u201d, \u201c*kick my bucket\u201d, \u201c*kick the blue bucket\u201d<\/li>\n<li>Non-literal translations: red wine &lt;-&gt; vino tinto; take decisions &lt;-&gt; Entscheidungen treffen<\/li>\n<\/ul>\n<p style=\"text-align: justify\">However, these criteria should be taken with a grain of salt. For one, the criteria are all quite vague, and there is no clear agreement as to what can be termed a collocation. Is the defining feature the grammatical relation (adjective + noun or verb + subject) or is it the proximity of the two words? Moreover, there is no sharp line demarcating collocations from phrasal verbs, idioms, named entities and technical terminology composed of multiple words. And what are we to make of the co-occurrence of thematically related words (e.g. doctor and nurse, or plane and airport)? Or the co-occurrence of words in different languages across parallel corpora? Obviously, there is a cline going from free combination through collocations to idioms. Between these different elements of language there is an area of gradience which makes collocations particularly interesting.<\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: justify\"><strong>Statistical collocation measures<\/strong><\/p>\n<p style=\"text-align: justify\">With raw frequencies, z-scores, t-scores and chi-square tests we have at our disposal an arsenal of statistical methods we could use to measure collocation strength. In this section we will discuss how to use raw frequencies and chi-square tests to this end, and additionally we will introduce three new measures: Mutual Information (MI), Observed\/Expected (O\/E) and log-likelihood. In general, these measures are used to rank pair types as candidates for collocations. The different measures have different strengths and weaknesses, which means that they can yield a variety of results. 
Moreover, the association scores computed by different measures cannot be compared directly.<\/p>\n<p style=\"text-align: justify\"><strong>Frequencies<\/strong><\/p>\n<p style=\"text-align: justify\">Frequencies can be used to find collocations by counting the number of occurrences. For this, we typically use word bigrams. Usually, these searches result in a lot of function word pairs that need to be filtered out. Take a look at these collocations from an example corpus:<\/p>\n<figure id=\"attachment_446\" aria-describedby=\"caption-attachment-446\" style=\"width: 225px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/1.-frequencies.jpg\" alt=\"\" class=\"size-full wp-image-446\" width=\"225\" height=\"505\" srcset=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/1.-frequencies.jpg 225w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/1.-frequencies-134x300.jpg 134w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/1.-frequencies-65x146.jpg 65w\" sizes=\"(max-width: 225px) 100vw, 225px\" \/><figcaption id=\"caption-attachment-446\" class=\"wp-caption-text\">Figure 8.1: Raw collocation frequencies<\/figcaption><\/figure>\n<p style=\"text-align: justify\">Except for <em>New York<\/em>, all the bigrams are pairs of function words. This is neither particularly surprising, nor particularly useful. If you have access to a POS-tagged corpus, or know how to apply POS-tags to a corpus, you can improve the results a bit by only searching for bigrams with certain POS-tag sequences (e.g. noun-noun, adj-noun).<\/p>\n<p style=\"text-align: justify\">Despite it being an imperfect measure, let\u2019s look at how we can identify raw frequencies for bigrams in R. 
For this, download the ICE GB written corpus in a bigram version:<\/p>\n<p style=\"text-align: justify\">[file: ice gb written bigram]<\/p>\n<p style=\"text-align: justify\">Import it into R using your preferred command. We assigned the corpus to a variable called <em>icegbbi<\/em>, and so we open the first ten entries of the vector:<\/p>\n<pre><code>icegbbi[1:10]\r\n[1] \"Tom\\tRodwell\"\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \"Rodwell\\tWith\"\r\n[3] \"With\\tthese\"\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \"these\\tprofound\"\r\n[5] \"profound\\twords\"\u00a0\u00a0\u00a0\u00a0\u00a0 \"words\\tJ.\"\r\n[7] \"J.\\tN.\"\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \"N.\\tL.\"\r\n[9] \"L.\\tMyres\"\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \"Myres\\tbegins\"\r\n<\/code><\/pre>\n<p style=\"text-align: justify\">We can see that each position contains two words separated by a tab, which is what the<em> \\t<\/em> stands for. 
Transform the vector into a table, sort it in decreasing order and take a look at the highest ranking entries:<\/p>\n<pre><code>&gt; ticegbbi = table(icegbbi)\r\n&gt; sortticegbbi = as.table(sort(ticegbbi,decreasing=T))\r\n&gt; sortticegbbi[1:16]\r\nicegbbi\r\n\u00a0 of\\tthe\u00a0\u00a0\u00a0 in\\tthe\u00a0\u00a0\u00a0 to\\tthe\u00a0\u00a0\u00a0\u00a0 to\\tbe\u00a0\u00a0 and\\tthe\u00a0\u00a0\u00a0 on\\tthe\u00a0\u00a0 for\\tthe\u00a0 from\\tthe\r\n\u00a0\u00a0\u00a0\u00a0 3623\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 2126\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 1341\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 908\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 881\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 872\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 696\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 627\r\n\u00a0 by\\tthe\u00a0\u00a0\u00a0 at\\tthe\u00a0 with\\tthe\u00a0\u00a0\u00a0\u00a0\u00a0 of\\ta\u00a0 that\\tthe\u00a0\u00a0\u00a0\u00a0 it\\tis\u00a0\u00a0 will\\tbe\u00a0\u00a0\u00a0\u00a0\u00a0 is\\ta\r\n\u00a0\u00a0\u00a0\u00a0\u00a0 602\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 597\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 576\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 571\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 559\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 542\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 442\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 420\r\n<\/code><\/pre>\n<p style=\"text-align: justify\">The results in the ICE GB written are very similar to the results in the table above. In fact, among the 16 most frequent bigrams, there are only pairs of function words.<\/p>\n<p style=\"text-align: justify\">[H5P: Search the frequency table some more. What is the most frequent bigram which you could classify as a collocation?]<\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: justify\"><strong>Mutual Information and Observed\/Expected<\/strong><\/p>\n<p style=\"text-align: justify\">Fortunately, there are better collocation measures than raw frequencies. 
One of these is the Mutual Information (MI) score. We can easily calculate the probability that a collocation composed of the words x and y occurs due to chance. To do so, we simply need to assume that they are independent events. Under this assumption, we can calculate the individual probabilities from the raw frequencies f(x) and f(y) and the corpus size N:<\/p>\n<p style=\"text-align: center\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/ql-cache\/quicklatex.com-9c1531e3b604e553abf2fc32382734b0_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#112;&#40;&#120;&#41;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#102;&#40;&#120;&#41;&#125;&#123;&#78;&#125;\" title=\"Rendered by QuickLaTeX.com\" height=\"26\" width=\"86\" style=\"vertical-align: -6px;\" \/> and <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/ql-cache\/quicklatex.com-09c05f370e7f7d195f1acde1cdecac56_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#112;&#40;&#121;&#41;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#102;&#40;&#121;&#41;&#125;&#123;&#78;&#125;\" title=\"Rendered by QuickLaTeX.com\" height=\"26\" width=\"86\" style=\"vertical-align: -6px;\" \/><\/p>\n<p style=\"text-align: justify\">Of course, we can also calculate the observed probability of finding x and y together. 
We obtain this joint probability by dividing the number of times we find x and y in sequence, f(x,y), by the corpus size N:<\/p>\n<p style=\"text-align: center\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/ql-cache\/quicklatex.com-9fbf675b8431412efa06e1782fe3d319_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#112;&#40;&#120;&#44;&#121;&#41;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#102;&#40;&#120;&#44;&#121;&#41;&#125;&#123;&#78;&#125;\" title=\"Rendered by QuickLaTeX.com\" height=\"26\" width=\"115\" style=\"vertical-align: -6px;\" \/><\/p>\n<p style=\"text-align: justify\">If the collocation of x and y is due to chance, which is to say if x and y are independent, the observed probability will be roughly equal to the expected probability:<\/p>\n<p style=\"text-align: center\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/ql-cache\/quicklatex.com-9583bff8f85e0a9286fd1076253f1bd5_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#112;&#40;&#120;&#44;&#121;&#41;&#123;&#92;&#97;&#112;&#112;&#114;&#111;&#120;&#125;&#112;&#40;&#120;&#41;&#42;&#112;&#40;&#121;&#41;\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"146\" style=\"vertical-align: -4px;\" \/><\/p>\n<p style=\"text-align: justify\">If the collocation is not due to chance, we have one of two cases. 
Either the observed probability is substantially higher than the expected probability, in which case we have a strong correlation:<\/p>\n<p style=\"text-align: center\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/ql-cache\/quicklatex.com-62b7e7e730339299b9aefdf75fd10432_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#112;&#40;&#120;&#44;&#121;&#41;&#62;&#62;&#112;&#40;&#120;&#41;&#42;&#112;&#40;&#121;&#41;\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"169\" style=\"vertical-align: -4px;\" \/><\/p>\n<p style=\"text-align: justify\">Or the inverse is true, and the observed probability is much lower than the expected one:<\/p>\n<p style=\"text-align: center\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/ql-cache\/quicklatex.com-374185a0833a409df8823312933b3f25_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#112;&#40;&#120;&#44;&#121;&#41;&#60;&#60;&#112;&#40;&#120;&#41;&#42;&#112;&#40;&#121;&#41;\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"169\" style=\"vertical-align: -4px;\" \/><\/p>\n<p style=\"text-align: justify\">The latter scenario can be referred to as \u2018negative\u2019 collocation, which simply means that the two words hardly occur together.<\/p>\n<p style=\"text-align: justify\">The comparison of joint and independent probabilities is known as Mutual Information (MI) and originates in Information Theory, where association is measured as surprise in bits. 
What we outlined above is a basic version of MI which we call O\/E, and which can be simply calculated by dividing Observation by Expectation:<\/p>\n<p style=\"text-align: center\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/ql-cache\/quicklatex.com-4e5e502e68b9b60788ad2aac976753ed_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#79;&#47;&#69;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#112;&#40;&#120;&#44;&#121;&#41;&#125;&#123;&#112;&#40;&#120;&#41;&#42;&#112;&#40;&#121;&#41;&#125;\" title=\"Rendered by QuickLaTeX.com\" height=\"29\" width=\"121\" style=\"vertical-align: -9px;\" \/><\/p>\n<p style=\"text-align: justify\">After this introduction to the concept, let\u2019s take a look at how to implement the O\/E in R.<\/p>\n<p style=\"text-align: justify\">[R exercise with available corpus?]<\/p>\n<p style=\"text-align: justify\">Okay, let&#8217;s look at one more set of examples, this time with data from the British National Corpus (BNC). The BNC contains 97,626,093 words in total. 
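<\/p>
<p style=\"text-align: justify\">The O\/E formula translates directly into code. Here is a minimal sketch (in Python rather than R, with invented counts for illustration): given the corpus size N and the raw frequencies f(x), f(y) and f(x,y), it computes O\/E, and taking the base-2 logarithm of O\/E gives the MI score in bits:<\/p>

```python
import math

def observed_expected(f_xy, f_x, f_y, n):
    """O/E = p(x,y) / (p(x) * p(y)), with the probabilities estimated
    from raw counts; the n's cancel down to f_xy * n / (f_x * f_y)."""
    return f_xy * n / (f_x * f_y)

def mutual_information(f_xy, f_x, f_y, n):
    """MI in bits: the base-2 logarithm of O/E ('surprise in bits')."""
    return math.log2(observed_expected(f_xy, f_x, f_y, n))

# Invented counts for illustration: a bigram seen 30 times, each of its
# parts seen 1,000 times, in a corpus of 1,000,000 tokens.
print(observed_expected(30, 1000, 1000, 1_000_000))          # prints 30.0
print(round(mutual_information(30, 1000, 1000, 1_000_000), 2))  # prints 4.91
```

<p style=\"text-align: justify\">An O\/E of 1 (MI of 0 bits) means the bigram occurs exactly as often as chance predicts; values well above 1 mark collocation candidates, values well below 1 mark \u2018negative\u2019 collocations.<\/p>
<p style=\"text-align: justify\">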
The table below shows the frequencies for the words &#8216;powerful&#8217;, &#8216;strong&#8217;, &#8216;car&#8217; and &#8216;tea&#8217;, as well as all of the bigrams composed of these four words:<\/p>\n<table class=\"grid\" style=\"border-collapse: collapse;width: 67.4874%;height: 131px\">\n<caption>Table 8.1: Powerful, strong, car, tea<\/caption>\n<tbody>\n<tr>\n<td style=\"width: 25%\">Unigrams<\/td>\n<td style=\"width: 25%\">Frequency<\/td>\n<td style=\"width: 25%\">Bigrams<\/td>\n<td style=\"width: 25%\">Frequency<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 25%\">powerful<\/td>\n<td style=\"width: 25%\">7,070<\/td>\n<td style=\"width: 25%\">powerful car<\/td>\n<td style=\"width: 25%\">15<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 25%\">strong<\/td>\n<td style=\"width: 25%\">15,768<\/td>\n<td style=\"width: 25%\">powerful tea<\/td>\n<td style=\"width: 25%\">3<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 25%\">car<\/td>\n<td style=\"width: 25%\">26,731<\/td>\n<td style=\"width: 25%\">strong car<\/td>\n<td style=\"width: 25%\">4<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 25%\">tea<\/td>\n<td style=\"width: 25%\">8,030<\/td>\n<td style=\"width: 25%\">strong tea<\/td>\n<td style=\"width: 25%\">28<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p style=\"text-align: justify\">Use the formula above and the information in the table to calculate the O\/E for the four bigrams in the table.<\/p>\n<div id=\"h5p-33\">\n<div class=\"h5p-iframe-wrapper\"><iframe id=\"h5p-iframe-33\" class=\"h5p-iframe\" data-content-id=\"33\" style=\"height:1px\" src=\"about:blank\" frameBorder=\"0\" scrolling=\"no\" title=\"8.4 MI with O\/E\"><\/iframe><\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<p style=\"text-align: justify\"><strong>Low counts and other problems<\/strong><\/p>\n<p style=\"text-align: justify\">In the exercise above, we get one counterintuitive result. According to our calculations, <em>powerful tea<\/em> is still a collocation. What is going on here? 
Well, if you look at the raw frequencies, you can see that we have 28 observations of <em>strong tea<\/em>, but only three of<em> powerful tea<\/em>. Evidently, three occurrences make for a very rare phenomenon which is not statistically significant. Interestingly, for once we can actually account for these three occurrences of <em>powerful tea<\/em>. Take a look at the image below, which contains the BNC results for the query \u201cpowerful tea\u201d.<\/p>\n<figure id=\"attachment_550\" aria-describedby=\"caption-attachment-550\" style=\"width: 712px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/2.-Powerful-tea.jpg\" alt=\"\" class=\"size-full wp-image-550\" width=\"712\" height=\"285\" srcset=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/2.-Powerful-tea.jpg 712w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/2.-Powerful-tea-300x120.jpg 300w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/2.-Powerful-tea-65x26.jpg 65w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/2.-Powerful-tea-225x90.jpg 225w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/2.-Powerful-tea-350x140.jpg 350w\" sizes=\"(max-width: 712px) 100vw, 712px\" \/><figcaption id=\"caption-attachment-550\" class=\"wp-caption-text\">Figure 8.2: Powerful tea in the British National Corpus<\/figcaption><\/figure>\n<p style=\"text-align: justify\">If you read the context in which <em>powerful tea<\/em> occurs, you can see easily enough that we are dealing with the observer\u2019s paradox here: we only find <em>powerful tea<\/em> in this corpus as an example of a bad collocation. 
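<\/p>
<p style=\"text-align: justify\">To make the numbers concrete, we can check the O\/E values for the four bigrams of Table 8.1 mechanically (a Python sketch for illustration; the frequencies are exactly those given in the table):<\/p>

```python
# Frequencies from Table 8.1; the BNC contains 97,626,093 words in total.
N = 97_626_093
unigram = {"powerful": 7070, "strong": 15768, "car": 26731, "tea": 8030}
bigram = {("powerful", "car"): 15, ("powerful", "tea"): 3,
          ("strong", "car"): 4, ("strong", "tea"): 28}

def oe(x, y):
    """Observed/Expected: p(x,y) / (p(x) * p(y)); the N's cancel down
    to f(x,y) * N / (f(x) * f(y))."""
    return bigram[(x, y)] * N / (unigram[x] * unigram[y])

for x, y in bigram:
    print(x, y, round(oe(x, y), 1))
```

<p style=\"text-align: justify\"><em>Strong tea<\/em> (about 21.6) and <em>powerful car<\/em> (about 7.7) score far above 1, <em>strong car<\/em> (about 0.9) comes out as a slightly \u2018negative\u2019 collocation, and <em>powerful tea<\/em>, with only three occurrences, still scores above 1 (about 5.2), which is precisely the counterintuitive low-count effect at issue here.<\/p>
<p style=\"text-align: justify\">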
Although this is a rare situation, it makes clear why it is worth reading some examples whenever working with large corpora. This is good advice in any case, but it is especially true with low counts and surprising results.<\/p>\n<p style=\"text-align: justify\">Each of the statistical measures we discussed so far has its limitations, and the same holds true for MI. For one, its assumptions are flawed: word frequencies do not follow a normal distribution. Moreover, while the MI is a good measure of independence, the absolute values are not very meaningful. And, as we have seen, it does not perform well with low counts. With low counts, the MI tends to overestimate the strength of the collocation.<\/p>\n<p style=\"text-align: justify\">One response to these limitations is to use a significance test (chi-square, t-test or z-test) instead. This allows us to address the low count problem.<\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: justify\"><strong>Chi-square test<\/strong><\/p>\n<p style=\"text-align: justify\">As we have seen <a href=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/chapter\/chi-square-test-significance-2\/\">earlier<\/a>, the chi-square statistic sums the squared differences between observed and expected values in all cells of the table and scales them by the magnitude of the expected values.<\/p>\n<p style=\"text-align: justify\">Let\u2019s take a look at how the chi-square test can be used to identify collocations. 
The table below contains a set of collocations using the words <em>new<\/em> and <em>companies<\/em>:<\/p>\n<figure id=\"attachment_551\" aria-describedby=\"caption-attachment-551\" style=\"width: 537px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/3.-New-companies.jpg\" alt=\"\" class=\"size-full wp-image-551\" width=\"537\" height=\"148\" srcset=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/3.-New-companies.jpg 537w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/3.-New-companies-300x83.jpg 300w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/3.-New-companies-65x18.jpg 65w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/3.-New-companies-225x62.jpg 225w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/3.-New-companies-350x96.jpg 350w\" sizes=\"(max-width: 537px) 100vw, 537px\" \/><figcaption id=\"caption-attachment-551\" class=\"wp-caption-text\">Figure 8.3: &#8216;New&#8217; and &#8216;companies&#8217; in all possible and sensible bigram combinations<\/figcaption><\/figure>\n<p style=\"text-align: justify\">Specifically, the table juxtaposes <em>new companies<\/em> with all other bigrams in the corpus.<\/p>\n<p style=\"text-align: justify\">[H5P: What is N? 
In other words, how many bigrams are in the corpus?]<\/p>\n<p style=\"text-align: justify\">We use <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/ql-cache\/quicklatex.com-2a46205676bf7f032baf11c56ae52f73_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#79;&#95;&#123;&#105;&#44;&#106;&#125;\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"29\" style=\"vertical-align: -6px;\" \/> to refer to the observed value for cell (i,j), and <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/ql-cache\/quicklatex.com-13a30f5699b5791cbade6c644cf9fc96_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#69;&#95;&#123;&#105;&#44;&#106;&#125;\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"28\" style=\"vertical-align: -6px;\" \/> to refer to the expected value. The observed values are given in the table, e.g. <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/ql-cache\/quicklatex.com-d098db742d03b0198e8b6f6d268afeb5_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#79;&#95;&#123;&#49;&#44;&#49;&#125;&#61;&#56;\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"65\" style=\"vertical-align: -6px;\" \/>. The corpus size is N = 14,229,981 bigrams. As we know, the expected values can be determined from the marginal probabilities, e.g. 
for <em>new companies<\/em> in cell (1,1):<\/p>\n<p style=\"text-align: center\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/ql-cache\/quicklatex.com-883bfff0d2ad9b260dc59d83732261f6_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#69;&#95;&#123;&#49;&#44;&#49;&#125;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#40;&#40;&#56;&#43;&#49;&#53;&#56;&#50;&#48;&#41;&#42;&#40;&#56;&#43;&#52;&#54;&#54;&#55;&#41;&#41;&#125;&#123;&#78;&#125;&#61;&#53;&#46;&#50;\" title=\"Rendered by QuickLaTeX.com\" height=\"26\" width=\"241\" style=\"vertical-align: -6px;\" \/><\/p>\n<p style=\"text-align: justify\">If we do this for each cell and calculate the chi-square sum, we obtain a value of 1.55. In order to determine whether <em>new companies<\/em> is a statistically significant collocation, we have to look up the critical value for the 0.05 level in the chi-square table we saw in that chapter (or in any of the many tables available elsewhere online).<\/p>\n<div id=\"h5p-34\">\n<div class=\"h5p-iframe-wrapper\"><iframe id=\"h5p-iframe-34\" class=\"h5p-iframe\" data-content-id=\"34\" style=\"height:1px\" src=\"about:blank\" frameBorder=\"0\" scrolling=\"no\" title=\"8.6 Chi-square\"><\/iframe><\/div>\n<\/div>\n<p style=\"text-align: justify\">Because our chi-square value of 1.55 is substantially smaller than the critical value at the 95% level, we cannot reject the null hypothesis. In other words, there is no evidence that <em>new companies<\/em> should be a collocation.<\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: justify\"><strong>Problems with significance tests and what comes next<br \/>\n<\/strong><\/p>\n<p style=\"text-align: justify\">The chi-square, t-test and the z-test are often used as collocation measures. These significance tests are popular as collocation tests, since they empirically seem to fit. 
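<\/p>
<p style=\"text-align: justify\">The whole chi-square computation for <em>new companies<\/em> can be reproduced in a few lines (a Python sketch using the counts given above; depending on how the underlying counts are rounded, the sum comes out at roughly 1.5, in the region of the value reported in the text and in any case far from significance):<\/p>

```python
# 2x2 contingency table from the example above:
#                  companies     other word
# new                   8          15,820
# other word         4,667      (the rest)
N = 14_229_981                       # total number of bigrams in the corpus
o11, o12, o21 = 8, 15_820, 4_667
o22 = N - o11 - o12 - o21            # all remaining bigrams

observed = [[o11, o12], [o21, o22]]
row = [o11 + o12, o21 + o22]         # marginal row totals
col = [o11 + o21, o12 + o22]         # marginal column totals

# Expected values from the marginals: E_ij = row_i * col_j / N.
expected = [[row[i] * col[j] / N for j in range(2)] for i in range(2)]

chi_square = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
                 for i in range(2) for j in range(2))

print(round(expected[0][0], 1))      # E_{1,1}: 5.2, as in the text
print(round(chi_square, 2))          # well below the 0.05 critical value of 3.84
```

<p style=\"text-align: justify\">Since the statistic stays below the critical value of 3.84 for one degree of freedom at the 0.05 level, the test gives no reason to treat <em>new companies<\/em> as a collocation.<\/p>
<p style=\"text-align: justify\">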
However, strictly speaking, collocation strength and statistical significance are quite different things.<\/p>\n<p style=\"text-align: justify\">Where does this leave us? We now know that there are two main classes of collocation tests: measures of surprise, which mainly report rare collocations and surface a lot of garbage due to data sparseness, and significance tests, which mainly report frequent collocations and miss rare ones. One way of addressing the limitations of our collocation measures is to use MI or O\/E in combination with a t-test significance filter. Alternatively, there are tests which are slightly more complex and slightly more balanced. These include MI3 and log-likelihood, which we discuss below in [Chapter Regression].<\/p>\n<p style=\"text-align: justify\">For further reading on collocation detection, we recommend:<\/p>\n<ul>\n<li style=\"text-align: justify\">Stephan Evert&#8217;s http:\/\/www.collocations.de\/<\/li>\n<li style=\"text-align: justify\">Jurafsky and Martin (2009)<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<p style=\"text-align: justify\"><strong>References:<\/strong><\/p>\n<p style=\"text-align: justify\">Botnik. 2018. Harry Potter and the portrait of what looked like a large pile of ash. Online: <a href=\"https:\/\/botnik.org\/content\/harry-potter.html\">https:\/\/botnik.org\/content\/harry-potter.html<\/a>.<\/p>\n<p style=\"text-align: justify\">Choueka, Yaacov. 1988. Looking for needles in a haystack, or: locating interesting collocational expressions in large textual databases. <em>Proceedings of the RIAO<\/em>. 609-623.<\/p>\n<p style=\"text-align: justify\">Jurafsky, Dan and James H. Martin. 2009. <em>Speech and language processing: an introduction to natural language processing, computational linguistics and speech recognition<\/em>. 2nd Edition. Upper Saddle River, NJ: Pearson Prentice Hall.<\/p>\n<p style=\"text-align: justify\">Lawler, John and Kevin McGowan. The Chomskybot. 
Online: <a href=\"http:\/\/rubberducky.org\/cgi-bin\/chomsky.pl\">http:\/\/rubberducky.org\/cgi-bin\/chomsky.pl<\/a>.<\/p>\n","protected":false},"author":29,"menu_order":2,"template":"","meta":{"pb_show_title":"on","pb_short_title":"","pb_subtitle":"","pb_authors":[],"pb_section_license":""},"chapter-type":[],"contributor":[],"license":[],"class_list":["post-53","chapter","type-chapter","status-publish","hentry"],"part":48,"_links":{"self":[{"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/pressbooks\/v2\/chapters\/53"}],"collection":[{"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/pressbooks\/v2\/chapters"}],"about":[{"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/wp\/v2\/types\/chapter"}],"author":[{"embeddable":true,"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/wp\/v2\/users\/29"}],"version-history":[{"count":17,"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/pressbooks\/v2\/chapters\/53\/revisions"}],"predecessor-version":[{"id":585,"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/pressbooks\/v2\/chapters\/53\/revisions\/585"}],"part":[{"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/pressbooks\/v2\/parts\/48"}],"metadata":[{"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/pressbooks\/v2\/chapters\/53\/metadata\/"}],"wp:attachment":[{"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/wp\/v2\/media?parent=53"}],"wp:term":[{"taxonomy":"chapter-type","embeddable":true,"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/pressbooks\/v2\/chapter-type?post=53"},{"taxonomy":"contributor","embeddable":true,"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/wp\/v2\/contributor?post=
53"},{"taxonomy":"license","embeddable":true,"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/wp\/v2\/license?post=53"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}