The saying that one cannot study anything today without studying statistics definitely holds true in linguistics. Some of you may roll your eyes and close the book, but we would advise against doing so. Not only because it would be a shame if we lost your attention at the word go, but because learning statistics means learning methodologies which enable you to produce better research, in linguistics and beyond.
Maybe it helps to consider an example. You might notice a conspicuous pattern in language usage. For instance, you may get the impression that some people often use a prepositional phrase – “John gave a book to the students” – while others use the noun phrase – “John gave the students a book”. From this you might intuitively draw the conclusion that some people just have a preference for one or the other version, which is a fair interpretation, but not a very scientific one. Now, imagine that you additionally observe that younger people use the noun phrase more often than the prepositional phrase. At this point, you might consider how you can test this, since you still have that linguistics paper you have to write over the summer. With this in mind, you begin thinking about how you can translate your hunch into a research question, and turn to a corpus. Luckily, the corpus is annotated and you can formulate the necessary search queries, perhaps using regular expressions. You get some reliable counts for both forms, and a set of demographic variables on each speaker. And this is where you run into problems: you can count how often each speaker uses the two forms, but the numbers look somewhat similar. If only there were a test to show you whether your hunch is correct, and whether there actually is a difference in how different speakers use noun phrases and prepositional phrases. You won’t be the least bit surprised when we tell you that there is. In fact, there is more than one, and we discuss them in this book. We don’t only discuss the tests you can use, but also how to formulate those search queries.
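To give a first taste of what such a test can look like, here is a minimal sketch in R using the chi-square test, one of the tests discussed later in the book. The counts are invented purely for illustration and do not come from any corpus, and the object names (dative_counts, test_result) are our own hypothetical choices.

```r
# Hypothetical counts, made up for illustration only:
# how often two age groups use each of the two forms.
dative_counts <- matrix(
  c(120, 90,    # noun phrase: younger, older
    80, 110),   # prepositional phrase: younger, older
  nrow = 2,
  dimnames = list(
    age  = c("younger", "older"),
    form = c("noun_phrase", "prepositional_phrase")
  )
)

# chisq.test() is part of base R (package stats). For a 2x2 table it
# applies Yates' continuity correction by default.
test_result <- chisq.test(dative_counts)

# Inspect the test statistic, the degrees of freedom and the p-value.
print(test_result)
```

A small p-value would suggest that the difference between the two groups in these invented counts is unlikely to be due to chance alone. How to interpret such a result properly is the subject of the chapters on inferential statistics.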
Linguists rely on statistical methods for a whole host of reasons. For one, linguistic differences are often subtle. It is hardly ever the case that some speaker group, some social class, some genre, etc. uses an expression that others never use. But there are preferences. Consider, for instance, written and spoken language. They are, you know, quite different, but errmm, what are, like, really the differences, like? Many of our written utterances would be acceptable in spoken language, and vice versa. With statistics, we can empirically investigate which of these differences are strongest. Or think about language acquisition: a new teaching method claims to be more successful at teaching English than established practices. How can proponents of the new method prove that? By using statistical methods which show that the reduction of errors in their classroom is not merely a coincidence. If you are more interested in the language than in the people who speak it, you can use linguistic statistical models to get detailed, data-driven insights into complex interactions, for example into how certain words attract each other in collocations. Alternatively, you may be pondering a philosophical question: how do we even process language? Why is it difficult for us to understand multiple embedded clauses? Why are some zero relatives easy and others difficult for us to parse? When is a sentence a garden path sentence? Well, we veered slightly from the philosophical trajectory, but it will not surprise you to read that statistics can give us a handle to start tackling these psycholinguistic questions. Although the examples here are far from exhaustive, we leave you with a final area where statistics plays an especially important role: computational linguistics.
Many applications in computational linguistics, from part-of-speech tagging and machine translation, to word-sense disambiguation and information retrieval, use complex statistical methods, and in order to understand how these applications work, you need some understanding of how statistics work.
What all of the issues raised above have in common is that one approach to understanding them is through the analysis of numerical data, for instance in terms of frequencies and counts of errors and mistakes when we are dealing with new teaching techniques, or in terms of reaction times in psycholinguistics. And therein lies the great affordance of statistics: it is a toolkit with different means of analysing numerical data. The statistical methods are all the more worth learning since they can be used in many different contexts. Once you know, for instance, how significance tests work, any kind of empirical research becomes more transparent. And, of course, you can apply them yourself to the great swathes of linguistic data which are available in corpora. Indeed, it is no coincidence that quantitative linguistics, which has existed in some form since the late 19th century, rose to its current prominence just as personal computers, these gateways to linguistic data, became an integral part of daily life. Although the importance of statistics is now widely recognised, it is not yet widely taught. This is a shame, since statistics is a demanding subject, and in the beginning it is good to have a helping hand: perhaps that is why you came here.
And perhaps we can answer some questions that may have arisen.
What does here mean? Here is, of course, the book Statistics for Linguists, written by Dr. Gerold Schneider and Max Lauber. The book is based on an e-learning module which Gerold created for the University of Zurich in the mid-2010s. Some of the instructional videos integrated in the book originate from this e-learning course and refer to the slides which accompanied that older course. We have since turned these slides into the text you can read today. If some of the references in the videos are not immediately clear, they should become so if you read the surrounding text.
Is this the book for me? We don’t know you well enough to presume to answer this, but we can tell you who we made it for. It is a book for someone with either some background or a healthy interest in linguistics, but little theoretical or practical knowledge of statistical methods. And it’s a book for someone ready to (metaphorically) get their hands dirty. We discuss fundamental concepts of statistics as well as some more advanced methods in this book, and show you how to implement them in the programming language R. In our examples, we mainly use linguistic data, which we also tend to discuss briefly, but the main focus is clearly not on linguistic theory or insights but on statistical methods.
Will I know everything I need about statistics after reading this? Unfortunately, no. You will get a solid introduction to several important and necessary concepts for understanding statistics, but the book is far from comprehensive. What we do try, however, is to refer you to further material which we find helpful.
How is this book structured? We divided the book into five parts. We begin “The Foundations” by introducing our toolkit R and by discussing the basics of descriptive statistics. In “Inferential Statistics” we introduce two significance tests, the t-test and the chi-square test. We proceed to “Language Models”, where we show how to use these statistical methods in the context of two basic language models. Next, we discuss four “Advanced Methods”, namely linear and logistic regression analysis, machine learning and topic modelling. Finally, we discuss “The Next Step” required for original research, namely how to phrase search queries to retrieve the phenomena you want to investigate from text corpora.
How should I read this book? There are four things to say about this.
First, we recommend reading on a computer which has R installed. This will give you the opportunity to follow our discussions of the different commands and functionalities in R while trying them out yourself.
Second, it is worth reading the book in a web browser to benefit from the exercises and video instructions.
Third, we recommend reading the book in the original sequence, especially the “Foundations” and “Inferential Statistics” sections, which are written so as to build from one chapter to the next. The “Language Models” and “Advanced Methods” sections probably work somewhat better on their own, but also build strongly on the preceding chapters. That said, we can only encourage you to use this as a reference work if that is how you benefit most from it.
Finally, we should say that while we decided to make the book available already, we are still putting on the finishing touches. Specifically, we are still waiting for the go-ahead to release a dataset for the chapters in the “Language Models” and “Next Step” sections. Until then, the chapters in these parts will remain a tad rough. Other adjustments may also still occur.
With that, let’s take the first step towards statistics.