{"id":21,"date":"2019-08-15T09:47:35","date_gmt":"2019-08-15T07:47:35","guid":{"rendered":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/?post_type=chapter&#038;p=21"},"modified":"2024-01-28T17:11:40","modified_gmt":"2024-01-28T16:11:40","slug":"first-steps-in-r-importing-and-retrieving-corpus-data","status":"publish","type":"chapter","link":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/chapter\/first-steps-in-r-importing-and-retrieving-corpus-data\/","title":{"raw":"Importing and Retrieving Corpus Data: First Steps in R","rendered":"Importing and Retrieving Corpus Data: First Steps in R"},"content":{"raw":"<p style=\"text-align: justify\">In this first part of the book, the Foundations, we want to make sure that everyone has a practical understanding of the programming language R, which we use throughout. In this chapter, you will make your first steps in R and learn how to import data into the program.<\/p>\r\n&nbsp;\r\n<p style=\"text-align: justify\"><strong>What is R?<\/strong><\/p>\r\n<a href=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_deh5auqn\">https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_deh5auqn<\/a>\r\n<p style=\"text-align: justify\">[iframe width=\"640\" height=\"360\" src=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_deh5auqn\" frameborder=\"0\" webkitallowfullscreen mozallowfullscreen allowfullscreen]<\/p>\r\n[iframe id=\"kmsembed-0_deh5auqn\" width=\"826\" height=\"465\" src=\"https:\/\/uzh.mediaspace.cast.switch.ch\/embed\/secure\/iframe\/entryId\/0_deh5auqn\/uiConfId\/23449004\/pbc\/12943\/st\/0\" class=\"kmsembed\" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow=\"autoplay *; fullscreen *; encrypted-media *\" referrerPolicy=\"no-referrer-when-downgrade\" sandbox=\"allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation 
allow-top-navigation-by-user-activation\" frameborder=\"0\" title=\"1.1 Preliminaries\"]\r\n<p style=\"text-align: justify\">The R project page is <a href=\"http:\/\/www.r-project.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">http:\/\/www.r-project.org\/<\/a>. Download the program for your OS from the ETH mirror <a href=\"http:\/\/stat.ethz.ch\/CRAN\/\" target=\"_blank\" rel=\"noopener noreferrer\">http:\/\/stat.ethz.ch\/CRAN\/<\/a>. Install as you usually install programs. Once you have installed the R base, you have all the tools you need for the next chapters at your disposal.<\/p>\r\n<p style=\"text-align: justify\">Nonetheless, you might want to consider installing RStudio on top of the R base. RStudio is an integrated development environment which allows you to keep track of the data and variables you are using. Moreover, when you are writing a piece of code composed of several commands, the script window in RStudio allows you to break it down into one command per line. If you think these functionalities sound helpful, download RStudio Desktop from <a href=\"https:\/\/rstudio.com\/products\/rstudio\/\">https:\/\/rstudio.com\/products\/rstudio\/<\/a> and install.<\/p>\r\n<p style=\"text-align: justify\">R base and RStudio have somewhat different graphic user interfaces, as you can see below. In R base, there is only one window: the console. The console is where you communicate with the program. Here, you enter the commands you want R to execute, and it is where you see the output. In most programming languages, including R, you can recognize the console by the prompt, which looks like this:<code> &gt;<\/code>.<\/p>\r\n<p style=\"text-align: justify\">[h5p id=\"2\"]<\/p>\r\n<p style=\"text-align: justify\">In RStudio, there are four windows. The top left window is the script. Here you can enter multiple lines of code without immediately executing them. The top right window shows the global environment. 
Here you find all the data and variables in use in your current session. The bottom right window is where plots will be displayed. Finally, and most importantly for now, you find the console in the bottom left window. This corresponds to the R base console in that the commands you enter here will be executed immediately.<\/p>\r\n<p style=\"text-align: justify\">In the video tutorials, you see Gerold working with R base. If you are using RStudio, you will want to try what Gerold does in the console.<\/p>\r\n<a href=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_quelm2x2\">https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_quelm2x2<\/a>\r\n\r\n[iframe width=\"640\" height=\"360\" src=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_quelm2x2\" frameborder=\"0\" webkitallowfullscreen mozallowfullscreen allowfullscreen]\r\n\r\n[iframe id=\"kmsembed-0_quelm2x2\" width=\"826\" height=\"465\" src=\"https:\/\/uzh.mediaspace.cast.switch.ch\/embed\/secure\/iframe\/entryId\/0_quelm2x2\/uiConfId\/23449004\/pbc\/12943\/st\/0\" class=\"kmsembed\" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow=\"autoplay *; fullscreen *; encrypted-media *\" referrerPolicy=\"no-referrer-when-downgrade\" sandbox=\"allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation\" frameborder=\"0\" title=\"1.2 Downloadng and Assigning Verbs\"]\r\n<p style=\"text-align: justify\">When you start R, the prompt <code>&gt;<\/code> appears in the console. Select the console window, and type your first command:<\/p>\r\n\r\n<pre><code>&gt; 2+2<\/code><\/pre>\r\n<p style=\"text-align: justify\">When you hit enter, R executes your command and gives you the output. The output here is:<\/p>\r\n\r\n<pre><code>[1] 4\r\n<\/code><\/pre>\r\n<p style=\"text-align: justify\">You are hopefully not surprised. 
You have just used R as a calculator, and this is a very legitimate way to use it. However, the program is capable of doing much more, and we begin exploring the possibilities in this chapter.<\/p>\r\n<p style=\"text-align: justify\">Let us begin by importing some data, such as the file <em>verbs.txt<\/em>.[footnote]The data in verbs.txt is a simplification of a dataset compiled by Joan Bresnan.[\/footnote] Take a look at this raw text file by opening it in a new browser tab:<\/p>\r\n<p style=\"text-align: justify\"><a href=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/verbs.txt\">verbs.txt<\/a><\/p>\r\n<p style=\"text-align: justify\">You may have the impression that the table looks skewed, that not all rows are aligned. This is no reason to worry: it is simply because the fields are separated by tabulator characters. Tabulator-separated fields are very flexible, a simple de facto standard which allows us to import data into spreadsheets like Excel or databases like FileMaker.<\/p>\r\nTo save the file on your computer, right-click on the link and select \"Save As...\". The image below shows this step in a German version of Firefox:\r\n\r\n[caption id=\"attachment_154\" align=\"aligncenter\" width=\"740\"]<img src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/3.-Save-as.jpg\" alt=\"\" class=\"wp-image-154 size-full\" width=\"740\" height=\"492\" \/> Figure 1.1: How to save data from a web browser[\/caption]\r\n<p style=\"text-align: justify\">Save the file as a raw text file from the browser. We suggest calling it <i>verbs.txt<\/i>. Remember where you save it, because you will need to know the file's location in order to import it into R in the next step.<\/p>\r\n<p style=\"text-align: justify\">But first, a brief comment on saving files: it is generally worth putting some effort into organizing your data and folder structure. 
As a practical step towards that, we would suggest creating a folder where you store all data files that you download over the course of the book. Of course, you can create subfolders according to chapters, or topics, or whatnot, depending on your personal preferences. But being somewhat inundated with files ourselves, we cannot overstate the value of having a system for data storage.<\/p>\r\n<p style=\"text-align: justify\">Returning to R, you can load the <i>verbs.txt<\/i> file via the console. When you enter the following command, a dialogue window will open:<\/p>\r\n\r\n<pre><code>&gt; verbs &lt;- read.table(file.choose(), header=TRUE, comment.char=\"\", row.names=1)<\/code><\/pre>\r\n<p style=\"text-align: justify\">In the dialogue window, navigate to the folder where you stored the file, and open the <em>verbs.txt<\/em> text file. When you open <em>verbs.txt<\/em>, you assign it to the variable <em>verbs<\/em>. This is what the assignment arrow, <code>&lt;-<\/code>, does. The variable <em>verbs <\/em>now contains the table that you opened in the browser earlier.<\/p>\r\n&nbsp;\r\n<p style=\"text-align: justify\"><b>What is a variable?<\/b><\/p>\r\nIn this section, we loosely follow chapter 1 in Baayen (2008).\r\n\r\n<a href=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_v7pi1adx\">https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_v7pi1adx<\/a>\r\n\r\n[iframe width=\"640\" height=\"360\" src=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_v7pi1adx\" frameborder=\"0\" webkitallowfullscreen mozallowfullscreen allowfullscreen]\r\n\r\n[iframe id=\"kmsembed-0_v7pi1adx\" width=\"826\" height=\"465\" src=\"https:\/\/uzh.mediaspace.cast.switch.ch\/embed\/secure\/iframe\/entryId\/0_v7pi1adx\/uiConfId\/23449004\/pbc\/12943\/st\/0\" class=\"kmsembed\" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow=\"autoplay *; fullscreen *; encrypted-media *\" referrerPolicy=\"no-referrer-when-downgrade\" sandbox=\"allow-downloads allow-forms allow-same-origin 
allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation\" frameborder=\"0\" title=\"1.3 Assigning Variables\"]\r\n<p style=\"text-align: justify\">What are we talking about when we say that you assigned the table <em>verbs.txt<\/em> to a variable <em>verbs<\/em>? We are not talking about linguistic variables (not just yet), but about computational variables. In R, variables are objects to which you can assign different values. At its simplest, a variable contains a single value. Consider the following example:<\/p>\r\n\r\n<pre id=\"rstudio_console_output\" class=\"GNKRCKGCGSB\"><code><span><span class=\"GNKRCKGCMSB ace_keyword\">&gt; <\/span><span class=\"GNKRCKGCMRB ace_keyword\">x &lt;- 1 + 2\r\n<\/span><\/span><span><span class=\"GNKRCKGCMSB ace_keyword\">&gt; <\/span><span class=\"GNKRCKGCMRB ace_keyword\">1 + 2 -&gt; x\r\n<\/span><\/span><span><span class=\"GNKRCKGCMSB ace_keyword\">&gt; <\/span><span class=\"GNKRCKGCMRB ace_keyword\">x\r\n<\/span><span class=\"GNKRCKGCGSB\">[1] 3<\/span><\/span><\/code><\/pre>\r\n<p style=\"text-align: justify\">In this example, we assign the outcome of a simple calculation to the variable <em>x<\/em>. And when we tell R to show <code>x<\/code>, the output is the outcome of the calculation.<\/p>\r\n<p style=\"text-align: justify\">With this in mind, consider again what we did with the <em>verbs <\/em>variable. We assigned a whole table to it, and we can look at it. However, since the whole table is too big to be represented usefully in the R console, let's just take a look at the first few rows. 
We can do this with the <code>head()<\/code> command:<\/p>\r\n\r\n<pre><code>&gt; head(verbs)\r\n<\/code><code>  RealizationOfRec  Verb AnimacyOfRec AnimacyOfTheme LengthOfTheme\r\n1               NP  feed      animate      inanimate      2.639057\r\n2               NP  give      animate      inanimate      1.098612\r\n3               NP  give      animate      inanimate      2.564949\r\n4               NP  give      animate      inanimate      1.609438\r\n5               NP offer      animate      inanimate      1.098612\r\n6               NP  give      animate      inanimate      1.386294\r\n<\/code><\/pre>\r\n<p style=\"text-align: justify\">By default, <code>head()<\/code> shows the first six rows. To view the first 10 rows, we can modify <code>head()<\/code> with the argument <code>n=10<\/code>:<\/p>\r\n\r\n<pre id=\"rstudio_console_output\" class=\"GNKRCKGCGSB\"><code><span><span class=\"GNKRCKGCMSB ace_keyword\">&gt; <\/span><span class=\"GNKRCKGCMRB ace_keyword\">head(verbs, n=10)\r\n<\/span><span class=\"GNKRCKGCGSB\">   RealizationOfRec  Verb AnimacyOfRec AnimacyOfTheme LengthOfTheme\r\n1                NP  feed      animate      inanimate     2.6390573\r\n2                NP  give      animate      inanimate     1.0986123\r\n3                NP  give      animate      inanimate     2.5649494\r\n4                NP  give      animate      inanimate     1.6094379\r\n5                NP offer      animate      inanimate     1.0986123\r\n6                NP  give      animate      inanimate     1.3862944\r\n7                NP   pay      animate      inanimate     1.3862944\r\n8                NP bring      animate      inanimate     0.0000000\r\n9                NP teach      animate      inanimate     2.3978953\r\n10               NP  give      animate      inanimate     0.6931472<\/span><\/span><\/code><\/pre>\r\n<p style=\"text-align: justify\">Now that we have imported the data into R and taken a first look at it, one question remains unanswered: what 
does it mean?<\/p>\r\n&nbsp;\r\n<p style=\"text-align: justify\"><b>How is the data to be interpreted?<\/b><\/p>\r\n<a href=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_9vhnuvjp\">https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_9vhnuvjp<\/a>\r\n\r\n[iframe width=\"640\" height=\"360\" src=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_9vhnuvjp\" frameborder=\"0\" webkitallowfullscreen mozallowfullscreen allowfullscreen]\r\n\r\n[iframe id=\"kmsembed-0_9vhnuvjp\" width=\"826\" height=\"465\" src=\"https:\/\/uzh.mediaspace.cast.switch.ch\/embed\/secure\/iframe\/entryId\/0_9vhnuvjp\/uiConfId\/23449004\/pbc\/12943\/st\/0\" class=\"kmsembed\" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow=\"autoplay *; fullscreen *; encrypted-media *\" referrerPolicy=\"no-referrer-when-downgrade\" sandbox=\"allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation\" frameborder=\"0\" title=\"1.4 Interpreting the Verb Table\"]\r\n<p style=\"text-align: justify\">The table <em>verbs <\/em>contains linguistic data on the realization of ditransitive verbs. Each row represents one instance of a ditransitive verb from a corpus. 
The columns contain information on several aspects of each realization:<\/p>\r\n\r\n<ul>\r\n \t<li style=\"text-align: justify\"><i>RealizationOfRec<\/i>: is the recipient an NP (as in <i>I gave <b>you<\/b> a book<\/i>) or a PP (as in <i>I gave a book <b>to you<\/b><\/i>)?<\/li>\r\n \t<li style=\"text-align: justify\"><i>Verb<\/i>: which ditransitive verb is used in this instance?<\/li>\r\n \t<li style=\"text-align: justify\"><i>AnimacyOfRec<\/i>: is the recipient animate?<\/li>\r\n \t<li style=\"text-align: justify\"><i>AnimacyOfTheme<\/i>: is the theme animate?<\/li>\r\n \t<li style=\"text-align: justify\"><i>LengthOfTheme<\/i>: how long is the theme?<\/li>\r\n<\/ul>\r\n<p style=\"text-align: justify\">We can see immediately that there are different types of scale in this data. Three of the columns, RealizationOfRec, AnimacyOfRec and AnimacyOfTheme, contain nominal variables. This means that they can take one of two mutually exclusive values. A recipient can be realized either as an NP or a PP, but it cannot be anything in between or any third option. The same is evidently true of the animacy of the recipient and theme. English does not have a third mode of animacy for the half-dead. The Verb column contains what is known as a categorical variable. Again, we have mutually exclusive categories, but this time we have more than two categories. What nominal and categorical variables have in common is that even if we translate the categories into numbers - which can be very useful - the number we assign to a value carries no mathematical weight. In other words, even if the categories are described numerically, we do not rank them the way we would rank numerical data. In the final column, LengthOfTheme, we actually do have numerical data. As we can see from the first few lines, the length of the theme is a continuous numeric variable. 
This table then contains the most important types of scales we use in linguistics.<\/p>\r\n<p style=\"text-align: justify\">Which brings us back to the question: how can we interpret a table like this linguistically? Well, this dataset can, for example, be used to find out if certain verbs have a preference for realizing the recipient as a noun phrase (NP) or a prepositional phrase (PP). But before we dig deeper into how we can evaluate data like this with statistical methods, which we will do in later chapters, let's look in some more detail at our variable <em>verbs <\/em>as an object in R.<\/p>\r\n<p style=\"text-align: justify\">As you know, <em>verbs <\/em>is a table. In R, tables are referred to as data frames. In this book, we talk about both tables and data frames, in both cases referring to structured data arranged by rows and columns. The difference is that we use tables generally, and data frames only when we talk about the objects in R.<\/p>\r\n<p style=\"text-align: justify\">When we are handling a data frame in R, it is possible to view specific values:<\/p>\r\n\r\n<pre><code>&gt; verbs[1,]   # displays row 1\r\n&gt; verbs[,1]   # displays column 1<\/code><\/pre>\r\n<p style=\"text-align: justify\">When displaying an entire row, the output shows the value in each column of that row. When displaying an entire column, the output shows the value in each row of that column.<\/p>\r\n<p style=\"text-align: justify\">If you use the square brackets to take a look at each of the five columns, you will see that the output of the first four columns contains an additional line after the contents of the 903 rows, provided the text columns were imported as factors (since R 4.0.0, this requires adding <code>stringsAsFactors=TRUE<\/code> to the <code>read.table()<\/code> call). This line is titled <code>Levels<\/code>, and it contains a list of the categories in these columns. In other words, the levels are the different values nominal and categorical data takes. 
In this line we see that the <em>RealizationOfRec<\/em> can be either an \"NP\" or a \"PP\", and we see that there are 65 verbs, from \"accord\" to \"wish\".<\/p>\r\n<p style=\"text-align: justify\">We can also view the levels with the command <code>levels()<\/code>:<\/p>\r\n\r\n\r\n[caption id=\"attachment_592\" align=\"aligncenter\" width=\"754\"]<img src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/5.-levels-verbs.jpg\" alt=\"\" class=\"size-full wp-image-592\" width=\"754\" height=\"231\" \/> Figure 1.5: The 65 levels of the verbs column[\/caption]\r\n<p style=\"text-align: justify\">Here you see, for instance, all of the 65 verbs. Looking at levels like this comes in especially handy if you are not certain what your dataset contains in detail, for instance when you download something in the context of a textbook, and want to see whether it features that one verb you are interested in.<\/p>\r\n<p style=\"text-align: justify\">If you look at the last column, you will see that R does not give you a <code>Levels<\/code> line in the output. This is because numerical data can theoretically take an infinite number of values and accordingly is not structured in levels the way categorical data is.<\/p>\r\n<p style=\"text-align: justify\">Continuing the work with the square bracket queries, we can look up the value of a specific cell. For instance, we can look at row 401 in the <em>AnimacyOfRec<\/em> column using the following command:<\/p>\r\n\r\n<pre id=\"rstudio_console_output\" class=\"GNKRCKGCGSB\"><code><span><span class=\"GNKRCKGCMSB ace_keyword\">&gt; <\/span><span class=\"GNKRCKGCMRB ace_keyword\">verbs[401,3]\r\n<\/span><span class=\"GNKRCKGCGSB\">[1] animate\r\nLevels: animate inanimate<\/span><\/span><\/code><\/pre>\r\n<p style=\"text-align: justify\">We can also formulate more complex queries and search, for instance, using ranges. In R, we express ranges with a colon, as in 2:5. 
Consider the following search query:<\/p>\r\n\r\n<pre id=\"rstudio_console_output\" class=\"GNKRCKGCGSB\"><code><span><span class=\"GNKRCKGCMSB ace_keyword\">&gt; <\/span><span class=\"GNKRCKGCMRB ace_keyword\">verbs[2:5,1:3]\r\n<\/span><span class=\"GNKRCKGCGSB\">  RealizationOfRec  Verb AnimacyOfRec\r\n2               NP  give      animate\r\n3               NP  give      animate\r\n4               NP  give      animate\r\n5               NP offer      animate<\/span><\/span><\/code><code><\/code><\/pre>\r\nYou see that with these range searches, R displays rows 2 through 5 and columns 1 through 3.\r\n<p style=\"text-align: justify\">Alternatively, we can restrict our searches to small selections. We do this by using the command <code>c()<\/code>. This command combines values into a vector or list. You could also think of <code>c()<\/code> as a command to concatenate.<\/p>\r\n<p style=\"text-align: justify\">Say we want to look at rows 1 to 6, but for some reason we really don't want the third row to be displayed. We can do this using <code>c()<\/code>:<\/p>\r\n\r\n<pre><code>&gt; verbs[c(1,2,4,5,6),]\r\n<\/code><\/pre>\r\n<p style=\"text-align: justify\">We can, of course, do the same thing for columns:<\/p>\r\n\r\n<pre><code>&gt; verbs[,c(1,3)]\r\n<\/code><\/pre>\r\n<p style=\"text-align: justify\">All of the search queries we looked at so far are concerned with how to access certain positions in a data frame. 
But usually we are less interested in seeing a specific cell than looking at certain values.<\/p>\r\n&nbsp;\r\n<p style=\"text-align: justify\"><b>How can we formulate queries for specific values?<\/b><\/p>\r\n<a href=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_7qvq0p4y\">https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_7qvq0p4y<\/a>\r\n\r\n[iframe width=\"640\" height=\"360\" src=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_7qvq0p4y\" frameborder=\"0\" webkitallowfullscreen mozallowfullscreen allowfullscreen]\r\n\r\n[iframe id=\"kmsembed-0_7qvq0p4y\" width=\"826\" height=\"465\" src=\"https:\/\/uzh.mediaspace.cast.switch.ch\/embed\/secure\/iframe\/entryId\/0_7qvq0p4y\/uiConfId\/23449004\/pbc\/12943\/st\/0\" class=\"kmsembed\" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow=\"autoplay *; fullscreen *; encrypted-media *\" referrerPolicy=\"no-referrer-when-downgrade\" sandbox=\"allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation\" frameborder=\"0\" title=\"1.4 Interpreting the Verb Table\"]\r\n<p style=\"text-align: justify\">When the contents of a cell (or several cells) interest us more than its position, we can formulate something like a database query. For instance, we may be interested in seeing all rows in <em>verbs <\/em>which have an animate theme. 
In this case we would type the following command:<\/p>\r\n\r\n<pre id=\"rstudio_console_output\" class=\"GNKRCKGCGSB\"><code><span><span class=\"GNKRCKGCMSB ace_keyword\">&gt; <\/span><span class=\"GNKRCKGCMRB ace_keyword\">verbs[verbs$AnimacyOfTheme == \"animate\",]\r\n<\/span><span class=\"GNKRCKGCGSB\">    RealizationOfRec  Verb AnimacyOfRec AnimacyOfTheme LengthOfTheme\r\n58                NP  give      animate        animate     1.0986123\r\n100               NP  give      animate        animate     2.8903718\r\n143               NP  give    inanimate        animate     2.6390573\r\n390               NP  lend      animate        animate     0.6931472\r\n506               NP  give      animate        animate     1.9459101\r\n736               PP trade      animate        animate     1.6094379<\/span><\/span><\/code><\/pre>\r\n<p style=\"text-align: justify\">This query looks more complex than anything we have seen so far, so let's break it down. As before, we have <em>verbs <\/em>followed by square brackets. This means that we want to access the content of the variable <em>verbs <\/em>as defined by the criterion in the square brackets. In the square brackets we have <code><span><span class=\"GNKRCKGCMRB ace_keyword\">verbs$AnimacyOfTheme<\/span><\/span><\/code> which means that within <em>verbs<\/em>, we want only those cases in which the animacy of theme is animate. 
The two equal signs are used to test for exact equality, and this is what gives us the desired output.[footnote]For a definition of all operators, see \"Chapter 3: Evaluation of Expressions\" in the <em>R Language Definition<\/em> manual, which is available at <a href=\"https:\/\/cran.r-project.org\/doc\/manuals\/r-release\/R-lang.pdf\">https:\/\/cran.r-project.org\/doc\/manuals\/r-release\/R-lang.pdf<\/a>[\/footnote] This type of query takes some getting used to, but it is how R allows us to do something like a database query.<\/p>\r\n<p style=\"text-align: justify\">With this understanding, you can easily formulate analogous queries for different criteria:<\/p>\r\n\r\n<pre><code>&gt; verbs[verbs$Verb == \"lend\",]<\/code><\/pre>\r\n<p style=\"text-align: justify\">Of course, these can be expanded to include multiple elements:<\/p>\r\n\r\n<pre id=\"rstudio_console_output\" class=\"GNKRCKGCGSB\"><code><span><span class=\"GNKRCKGCMSB ace_keyword\">&gt; <\/span><span class=\"GNKRCKGCMRB ace_keyword\">verbs[verbs$Verb==\"lend\" | verbs$Verb==\"sell\",]<\/span><\/span><\/code><\/pre>\r\n<p style=\"text-align: justify\">The <code>|<\/code> means \"or\", so R displays all the rows containing either the verb \"sell\" or \"lend\".<\/p>\r\n<p style=\"text-align: justify\">Now, our data frame has multiple columns of categorical data, so we can use this syntax to search for rows which are consistent with two conditions:<\/p>\r\n\r\n<pre><code>&gt; verbs[verbs$Verb==\"lend\" &amp; verbs$RealizationOfRec==\"PP\",]<\/code><\/pre>\r\nAs you would expect, the <code>&amp;<\/code> means \"and\", so the output contains those rows in which the recipient of the verb \"lend\" is realized as a PP.\r\n<p style=\"text-align: justify\">That's it! 
You have seen how to import data and phrase search queries, and you will learn more about R in the next chapter, <a href=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/chapter\/introduction-to-r\/\">Introduction to R<\/a>.<\/p>\r\n&nbsp;\r\n<p style=\"text-align: justify\"><strong>Reference:<\/strong><\/p>\r\n<p style=\"text-align: justify\">Baayen, R. H. (2008) <i>Analyzing Linguistic Data. A Practical Introduction to Statistics Using R<\/i>. Cambridge University Press, Cambridge.<\/p>","rendered":"<p style=\"text-align: justify\">In this first part of the book, the Foundations, we want to make sure that everyone has a practical understanding of the programming language R, which we use throughout. In this chapter, you will make your first steps in R and learn how to import data into the program.<\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: justify\"><strong>What is R?<\/strong><\/p>\n<p><a href=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_deh5auqn\">https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_deh5auqn<\/a><\/p>\n<p style=\"text-align: justify\">\n<!-- iframe plugin v.6.0 wordpress.org\/plugins\/iframe\/ --><br \/>\n<iframe loading=\"lazy\" width=\"640\" height=\"360\" src=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_deh5auqn\" frameborder=\"0\" 0=\"webkitallowfullscreen\" 1=\"mozallowfullscreen\" 2=\"allowfullscreen\" scrolling=\"yes\" class=\"iframe-class\"><\/iframe>\n<\/p>\n<p><!-- iframe plugin v.6.0 wordpress.org\/plugins\/iframe\/ --><br \/>\n<iframe loading=\"lazy\" id=\"kmsembed-0_deh5auqn\" width=\"826\" height=\"465\" src=\"https:\/\/uzh.mediaspace.cast.switch.ch\/embed\/secure\/iframe\/entryId\/0_deh5auqn\/uiConfId\/23449004\/pbc\/12943\/st\/0\" class=\"kmsembed\" 0=\"allowfullscreen\" 1=\"webkitallowfullscreen\" 2=\"mozAllowFullScreen\" allow=\"autoplay *; fullscreen *; encrypted-media *\" referrerpolicy=\"no-referrer-when-downgrade\" sandbox=\"allow-downloads allow-forms allow-same-origin allow-scripts 
allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation\" frameborder=\"0\" title=\"1.1 Preliminaries\" scrolling=\"yes\"><\/iframe><\/p>\n<p style=\"text-align: justify\">The R project page is <a href=\"http:\/\/www.r-project.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">http:\/\/www.r-project.org\/<\/a>. Download the program for your OS from the ETH mirror <a href=\"http:\/\/stat.ethz.ch\/CRAN\/\" target=\"_blank\" rel=\"noopener noreferrer\">http:\/\/stat.ethz.ch\/CRAN\/<\/a>. Install as you usually install programs. Once you have installed the R base, you have all the tools you need for the next chapters at your disposal.<\/p>\n<p style=\"text-align: justify\">Nonetheless, you might want to consider installing RStudio on top of the R base. RStudio is an integrated development environment which allows you to keep track of the data and variables you are using. Moreover, when you are writing a piece of code composed of several commands, the script window in RStudio allows you to break it down into one command per line. If you think these functionalities sound helpful, download RStudio Desktop from <a href=\"https:\/\/rstudio.com\/products\/rstudio\/\">https:\/\/rstudio.com\/products\/rstudio\/<\/a> and install.<\/p>\n<p style=\"text-align: justify\">R base and RStudio have somewhat different graphic user interfaces, as you can see below. In R base, there is only one window: the console. The console is where you communicate with the program. Here, you enter the commands you want R to execute, and it is where you see the output. 
In most programming languages, including R, you can recognize the console by the prompt, which looks like this:<code> &gt;<\/code>.<\/p>\n<p style=\"text-align: justify\">\n<div id=\"h5p-2\">\n<div class=\"h5p-iframe-wrapper\"><iframe id=\"h5p-iframe-2\" class=\"h5p-iframe\" data-content-id=\"2\" style=\"height:1px\" src=\"about:blank\" frameBorder=\"0\" scrolling=\"no\" title=\"1.1 R and RStudio\"><\/iframe><\/div>\n<\/div>\n<p style=\"text-align: justify\">In RStudio, there are four windows. The top left window is the script. Here you can enter multiple lines of code without immediately executing them. The top right window shows the global environment. Here you find all the data and variables in use in your current session. The bottom right window is where plots will be displayed. Finally, and most importantly for now, you find the console in the bottom left window. This corresponds to the R base console in that the commands you enter here will be executed immediately.<\/p>\n<p style=\"text-align: justify\">In the video tutorials, you see Gerold working with R base. 
If you are using RStudio, you will want to try what Gerold does in the console.<\/p>\n<p><a href=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_quelm2x2\">https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_quelm2x2<\/a><\/p>\n<p><!-- iframe plugin v.6.0 wordpress.org\/plugins\/iframe\/ --><br \/>\n<iframe loading=\"lazy\" width=\"640\" height=\"360\" src=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_quelm2x2\" frameborder=\"0\" 0=\"webkitallowfullscreen\" 1=\"mozallowfullscreen\" 2=\"allowfullscreen\" scrolling=\"yes\" class=\"iframe-class\"><\/iframe><\/p>\n<p><!-- iframe plugin v.6.0 wordpress.org\/plugins\/iframe\/ --><br \/>\n<iframe loading=\"lazy\" id=\"kmsembed-0_quelm2x2\" width=\"826\" height=\"465\" src=\"https:\/\/uzh.mediaspace.cast.switch.ch\/embed\/secure\/iframe\/entryId\/0_quelm2x2\/uiConfId\/23449004\/pbc\/12943\/st\/0\" class=\"kmsembed\" 0=\"allowfullscreen\" 1=\"webkitallowfullscreen\" 2=\"mozAllowFullScreen\" allow=\"autoplay *; fullscreen *; encrypted-media *\" referrerpolicy=\"no-referrer-when-downgrade\" sandbox=\"allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation\" frameborder=\"0\" title=\"1.2 Downloadng and Assigning Verbs\" scrolling=\"yes\"><\/iframe><\/p>\n<p style=\"text-align: justify\">When you start R, the prompt <code>&gt;<\/code> appears in the console. Select the console window, and type your first command:<\/p>\n<pre><code>&gt; 2+2<\/code><\/pre>\n<p style=\"text-align: justify\">When you hit enter, R executes your command and gives you the output. The output here is:<\/p>\n<pre><code>[1] 4\r\n<\/code><\/pre>\n<p style=\"text-align: justify\">You are hopefully not surprised. You have just used R as a calculator, and this is a very legitimate way to use it. 
However, the program is capable of doing much more, and we begin exploring the possibilities in this chapter.<\/p>\n<p style=\"text-align: justify\">Let us begin by importing some data, such as the file <em>verbs.txt<\/em>.<a class=\"footnote\" title=\"The data in verbs.txt is a simplification of a dataset compiled by Joan Bresnan.\" id=\"return-footnote-21-1\" href=\"#footnote-21-1\" aria-label=\"Footnote 1\"><sup class=\"footnote\">[1]<\/sup><\/a> Take a look at this raw text file by opening it in a new browser tab:<\/p>\n<p style=\"text-align: justify\"><a href=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/verbs.txt\">verbs.txt<\/a><\/p>\n<p style=\"text-align: justify\">You may have the impression that the table looks skewed and that the columns do not line up. This is no reason to worry: it is simply because the fields are separated by tabulator characters. Tab-separated text is a simple and flexible de facto standard, which also allows us to import data into spreadsheets like Excel or databases like FileMaker.<\/p>\n<p>To save the file on your computer, right-click on the link and select &#8220;Save As&#8230;&#8221;. 
The image below shows this step in a German version of Firefox:<\/p>\n<figure id=\"attachment_154\" aria-describedby=\"caption-attachment-154\" style=\"width: 740px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/3.-Save-as.jpg\" alt=\"\" class=\"wp-image-154 size-full\" width=\"740\" height=\"492\" srcset=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/3.-Save-as.jpg 740w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/3.-Save-as-300x199.jpg 300w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/3.-Save-as-65x43.jpg 65w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/3.-Save-as-225x150.jpg 225w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/3.-Save-as-350x233.jpg 350w\" sizes=\"(max-width: 740px) 100vw, 740px\" \/><figcaption id=\"caption-attachment-154\" class=\"wp-caption-text\">Figure 1.1: How to save data from a web browser<\/figcaption><\/figure>\n<p style=\"text-align: justify\">Save the file as a raw text file from the browser. We suggest calling it <i>verbs.txt<\/i>. Remember where you save it, because you will need to know the file&#8217;s location in order to import it into R in the next step.<\/p>\n<p style=\"text-align: justify\">But first, a brief comment on saving files: it is generally worth putting some effort into organizing your data and folder structure. As a practical step towards that, we would suggest creating a folder where you store all data files that you download over the course of the book. Of course, you can create subfolders according to chapters, or topics, or whatnot, depending on your personal preferences. 
But being somewhat inundated with files ourselves, we cannot overstate the value of having a system for data storage.<\/p>\n<p style=\"text-align: justify\">Returning to R, you can load the <i>verbs.txt<\/i> file via the console. When you enter the following command, a dialogue window will open:<\/p>\n<pre><code>&gt; verbs &lt;- read.table(file.choose(), header=TRUE, comment.char=\"\", row.names=1)<\/code><\/pre>\n<p style=\"text-align: justify\">In the dialogue window, navigate to the folder where you stored the file, and open the <em>verbs.txt<\/em> text file. When you open <em>verbs.txt<\/em>, R reads the table and assigns it to the variable <em>verbs<\/em>. This is what the arrow construction, <code>&lt;-<\/code>, does. The variable <em>verbs <\/em>now contains the table that you opened in the browser earlier.<\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: justify\"><b>What is a variable?<\/b><\/p>\n<p>In this section, we loosely follow chapter 1 in Baayen (2008).<\/p>\n<p><a href=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_v7pi1adx\">https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_v7pi1adx<\/a><\/p>\n<p><!-- iframe plugin v.6.0 wordpress.org\/plugins\/iframe\/ --><br \/>\n<iframe loading=\"lazy\" width=\"640\" height=\"360\" src=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_v7pi1adx\" frameborder=\"0\" 0=\"webkitallowfullscreen\" 1=\"mozallowfullscreen\" 2=\"allowfullscreen\" scrolling=\"yes\" class=\"iframe-class\"><\/iframe><\/p>\n<p><!-- iframe plugin v.6.0 wordpress.org\/plugins\/iframe\/ --><br \/>\n<iframe loading=\"lazy\" id=\"kmsembed-0_v7pi1adx\" width=\"826\" height=\"465\" src=\"https:\/\/uzh.mediaspace.cast.switch.ch\/embed\/secure\/iframe\/entryId\/0_v7pi1adx\/uiConfId\/23449004\/pbc\/12943\/st\/0\" class=\"kmsembed\" 0=\"allowfullscreen\" 1=\"webkitallowfullscreen\" 2=\"mozAllowFullScreen\" allow=\"autoplay *; fullscreen *; encrypted-media *\" referrerpolicy=\"no-referrer-when-downgrade\" sandbox=\"allow-downloads allow-forms allow-same-origin 
allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation\" frameborder=\"0\" title=\"1.3 Assigning Variables\" scrolling=\"yes\"><\/iframe><\/p>\n<p style=\"text-align: justify\">What are we talking about when we say that you assigned the table <em>verbs.txt<\/em> to a variable <em>verbs<\/em>? We are not talking about linguistic variables (not just yet), but about computational variables. In R, variables are objects to which you can assign different values. At its simplest, a variable contains a single value. Consider the following example:<\/p>\n<pre id=\"rstudio_console_output\" class=\"GNKRCKGCGSB\"><code><span><span class=\"GNKRCKGCMSB ace_keyword\">&gt; <\/span><span class=\"GNKRCKGCMRB ace_keyword\">x &lt;- 1 + 2\r\n<\/span><\/span><span><span class=\"GNKRCKGCMSB ace_keyword\">&gt; <\/span><span class=\"GNKRCKGCMRB ace_keyword\">1 + 2 -&gt; x\r\n<\/span><\/span><span><span class=\"GNKRCKGCMSB ace_keyword\">&gt; <\/span><span class=\"GNKRCKGCMRB ace_keyword\">x\r\n<\/span><span class=\"GNKRCKGCGSB\">[1] 3<\/span><\/span><\/code><\/pre>\n<p style=\"text-align: justify\">In this example, we assign the outcome of a simple calculation to the variable <em>x<\/em>. And when we tell R to show <code>x<\/code>, the output is the outcome of the calculation.<\/p>\n<p style=\"text-align: justify\">With this in mind, consider again what we did with the <em>verbs <\/em>variable. We assigned a whole table to it, and we can look at it. However, since the whole table is too big to be represented usefully in the R console, let&#8217;s just take a look at the first few rows. 
We can do this with the <code>head()<\/code> command:<\/p>\n<pre><code>&gt; head(verbs)\r\n<\/code><code>  RealizationOfRec  Verb AnimacyOfRec AnimacyOfTheme LengthOfTheme\r\n1               NP  feed      animate      inanimate      2.639057\r\n2               NP  give      animate      inanimate      1.098612\r\n3               NP  give      animate      inanimate      2.564949\r\n4               NP  give      animate      inanimate      1.609438\r\n5               NP offer      animate      inanimate      1.098612\r\n6               NP  give      animate      inanimate      1.386294\r\n<\/code><\/pre>\n<p style=\"text-align: justify\">By default, <code>head()<\/code> shows the first six rows. To view the first 10 rows, we can modify <code>head()<\/code> with the argument <code>n=10<\/code>:<\/p>\n<pre class=\"GNKRCKGCGSB\"><code><span><span class=\"GNKRCKGCMSB ace_keyword\">&gt; <\/span><span class=\"GNKRCKGCMRB ace_keyword\">head(verbs, n=10)\r\n<\/span><span class=\"GNKRCKGCGSB\">   RealizationOfRec  Verb AnimacyOfRec AnimacyOfTheme LengthOfTheme\r\n1                NP  feed      animate      inanimate     2.6390573\r\n2                NP  give      animate      inanimate     1.0986123\r\n3                NP  give      animate      inanimate     2.5649494\r\n4                NP  give      animate      inanimate     1.6094379\r\n5                NP offer      animate      inanimate     1.0986123\r\n6                NP  give      animate      inanimate     1.3862944\r\n7                NP   pay      animate      inanimate     1.3862944\r\n8                NP bring      animate      inanimate     0.0000000\r\n9                NP teach      animate      inanimate     2.3978953\r\n10               NP  give      animate      inanimate     0.6931472<\/span><\/span><\/code><\/pre>\n<p style=\"text-align: justify\">Now that we have imported the data into R and taken a first look at it, one question remains unanswered: what does it mean?<\/p>\n<p>&nbsp;<\/p>\n<p 
style=\"text-align: justify\"><b>How is the data to be interpreted?<\/b><\/p>\n<p><a href=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_9vhnuvjp\">https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_9vhnuvjp<\/a><\/p>\n<p><!-- iframe plugin v.6.0 wordpress.org\/plugins\/iframe\/ --><br \/>\n<iframe loading=\"lazy\" width=\"640\" height=\"360\" src=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_9vhnuvjp\" frameborder=\"0\" 0=\"webkitallowfullscreen\" 1=\"mozallowfullscreen\" 2=\"allowfullscreen\" scrolling=\"yes\" class=\"iframe-class\"><\/iframe><\/p>\n<p><!-- iframe plugin v.6.0 wordpress.org\/plugins\/iframe\/ --><br \/>\n<iframe loading=\"lazy\" id=\"kmsembed-0_9vhnuvjp\" width=\"826\" height=\"465\" src=\"https:\/\/uzh.mediaspace.cast.switch.ch\/embed\/secure\/iframe\/entryId\/0_9vhnuvjp\/uiConfId\/23449004\/pbc\/12943\/st\/0\" class=\"kmsembed\" 0=\"allowfullscreen\" 1=\"webkitallowfullscreen\" 2=\"mozAllowFullScreen\" allow=\"autoplay *; fullscreen *; encrypted-media *\" referrerpolicy=\"no-referrer-when-downgrade\" sandbox=\"allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation\" frameborder=\"0\" title=\"1.4 Interpreting the Verb Table\" scrolling=\"yes\"><\/iframe><\/p>\n<p style=\"text-align: justify\">The table <em>verbs <\/em>contains linguistic data on the realization of ditransitive verbs. Each row represents one instance of a ditransitive verb from a corpus. 
The columns contain information on several aspects of each realization:<\/p>\n<ul>\n<li style=\"text-align: justify\"><i>RealizationOfRec<\/i>: is the recipient an NP (as in <i>I gave <b>you<\/b> a book<\/i>) or a PP (as in <i>I gave a book <b>to you<\/b><\/i>)?<\/li>\n<li style=\"text-align: justify\"><i>Verb<\/i>: which ditransitive verb is used in this instance?<\/li>\n<li style=\"text-align: justify\"><i>AnimacyOfRec<\/i>: is the recipient animate?<\/li>\n<li style=\"text-align: justify\"><i>AnimacyOfTheme<\/i>: is the theme animate?<\/li>\n<li style=\"text-align: justify\"><i>LengthOfTheme<\/i>: how long is the theme?<\/li>\n<\/ul>\n<p style=\"text-align: justify\">We can see immediately that there are different types of scale in this data. Three of the columns, RealizationOfRec, AnimacyOfRec and AnimacyOfTheme, contain nominal variables. This means that they can take one of two mutually exclusive values. A recipient can be realized either as an NP or a PP, but it cannot be anything in between or any third option. The same is evidently true of the animacy of the recipient and theme. English does not have a third mode of animacy for the half-dead. The Verb column contains what is known as a categorical variable. Again, we have mutually exclusive categories, but this time we have more than two categories. What nominal and categorical variables have in common is that even if we translate the categories into numbers &#8211; which can be very useful &#8211; the number we assign to a value carries no mathematical weight. In other words, even if the categories are described numerically, we do not rank them the way we would rank numerical data. In the final column, LengthOfTheme, we actually do have numerical data. As we can see from the first few lines, the length of the theme is a continuous numeric variable. 
This table then contains the most important types of scales we use in linguistics.<\/p>\n<p style=\"text-align: justify\">Which brings us back to the question: how can we interpret a table like this linguistically? Well, this dataset can, for example, be used to find out if certain verbs have a preference for realizing the recipient as a noun phrase (NP) or a prepositional phrase (PP). But before we dig deeper into how we can evaluate data like this with statistical methods, which we will do in later chapters, let&#8217;s look in some more detail at our variable <em>verbs <\/em>as an object in R.<\/p>\n<p style=\"text-align: justify\">As you know, <em>verbs <\/em>is a table. In R, tables are referred to as data frames. In this book, we talk about both tables and data frames, in both cases referring to structured data arranged by rows and columns. The difference is that we use <em>table<\/em> as the general term, and <em>data frame<\/em> only when we talk about the objects in R.<\/p>\n<p style=\"text-align: justify\">When we are handling a data frame in R, it is possible to view specific values:<\/p>\n<pre><code>&gt; verbs[1,]   # displays row 1\r\n&gt; verbs[,1]   # displays column 1<\/code><\/pre>\n<p style=\"text-align: justify\">When displaying an entire row, the output shows the value in each column of that row. When displaying an entire column, the output shows the value in each row of that column.<\/p>\n<p style=\"text-align: justify\">If you use the square brackets to take a look at each of the five columns, you will see that the output of each of the first four columns contains an additional line after the contents of the 903 rows. This line is titled <code>Levels<\/code>, and it contains a list of the categories in these columns. In other words, the levels are the different values that nominal and categorical data can take. 
In this line we see that <em>RealizationOfRec<\/em> can be either &#8220;NP&#8221; or &#8220;PP&#8221;, and we see that there are 65 verbs, from &#8220;accord&#8221; to &#8220;wish&#8221;.<\/p>\n<p style=\"text-align: justify\">We can also view the levels with the command <code>levels()<\/code>:<\/p>\n<figure id=\"attachment_592\" aria-describedby=\"caption-attachment-592\" style=\"width: 754px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/5.-levels-verbs.jpg\" alt=\"\" class=\"size-full wp-image-592\" width=\"754\" height=\"231\" srcset=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/5.-levels-verbs.jpg 754w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/5.-levels-verbs-300x92.jpg 300w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/5.-levels-verbs-65x20.jpg 65w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/5.-levels-verbs-225x69.jpg 225w, https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-content\/uploads\/sites\/23\/2019\/08\/5.-levels-verbs-350x107.jpg 350w\" sizes=\"(max-width: 754px) 100vw, 754px\" \/><figcaption id=\"caption-attachment-592\" class=\"wp-caption-text\">Figure 1.5: The 65 levels of the verbs column<\/figcaption><\/figure>\n<p style=\"text-align: justify\">Here you see, for instance, all 65 verbs. 
Looking at levels like this comes in especially handy if you are not certain what your dataset contains in detail, for instance when you download something in the context of a textbook, and want to see whether it features that one verb you are interested in.<\/p>\n<p style=\"text-align: justify\">If you look at the last column, you will see that R does not give you a <code>Levels<\/code> line in the output. This is because numerical data can theoretically take an infinite number of values and accordingly is not structured in levels the way categorical data is.<\/p>\n<p style=\"text-align: justify\">Continuing the work with the square bracket queries, we can look up the value of a specific cell. For instance, we can look at row 401 in the <em>AnimacyOfRec<\/em> column using the following command:<\/p>\n<pre class=\"GNKRCKGCGSB\"><code><span><span class=\"GNKRCKGCMSB ace_keyword\">&gt; <\/span><span class=\"GNKRCKGCMRB ace_keyword\">verbs[401,3]\r\n<\/span><span class=\"GNKRCKGCGSB\">[1] animate\r\nLevels: animate inanimate<\/span><\/span><\/code><\/pre>\n<p style=\"text-align: justify\">We can also formulate more complex queries and search, for instance, using ranges. In R, we express ranges with a colon, as in <code>2:5<\/code>. Consider the following search query:<\/p>\n<pre class=\"GNKRCKGCGSB\"><code><span><span class=\"GNKRCKGCMSB ace_keyword\">&gt; <\/span><span class=\"GNKRCKGCMRB ace_keyword\">verbs[2:5,1:3]\r\n<\/span><span class=\"GNKRCKGCGSB\">  RealizationOfRec  Verb AnimacyOfRec\r\n2               NP  give      animate\r\n3               NP  give      animate\r\n4               NP  give      animate\r\n5               NP offer      animate<\/span><\/span><\/code><\/pre>\n<p>You see that with these range searches, R displays rows 2 through 5 and columns 1 through 3.<\/p>\n<p style=\"text-align: justify\">Alternatively, we can restrict our searches to small selections. We do this by using the command <code>c()<\/code>. 
This command combines values into a vector or list. You could also think of <code>c()<\/code> as a command to concatenate.<\/p>\n<p style=\"text-align: justify\">Say we want to look at rows 1 to 6, but for some reason we really don&#8217;t want the third row to be displayed. We can do this using <code>c()<\/code>:<\/p>\n<pre><code>&gt; verbs[c(1,2,4,5,6),]\r\n<\/code><\/pre>\n<p style=\"text-align: justify\">We can, of course, do the same thing for columns:<\/p>\n<pre><code>&gt; verbs[,c(1,3)]\r\n<\/code><\/pre>\n<p style=\"text-align: justify\">All of the search queries we looked at so far are concerned with how to access certain positions in a data frame. But usually we are less interested in seeing a specific cell than looking at certain values.<\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: justify\"><b>How can we formulate queries for specific values?<\/b><\/p>\n<p><a href=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_7qvq0p4y\">https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_7qvq0p4y<\/a><\/p>\n<p><!-- iframe plugin v.6.0 wordpress.org\/plugins\/iframe\/ --><br \/>\n<iframe loading=\"lazy\" width=\"640\" height=\"360\" src=\"https:\/\/uzh.mediaspace.cast.switch.ch\/media\/0_7qvq0p4y\" frameborder=\"0\" 0=\"webkitallowfullscreen\" 1=\"mozallowfullscreen\" 2=\"allowfullscreen\" scrolling=\"yes\" class=\"iframe-class\"><\/iframe><\/p>\n<p><!-- iframe plugin v.6.0 wordpress.org\/plugins\/iframe\/ --><br \/>\n<iframe loading=\"lazy\" id=\"kmsembed-0_7qvq0p4y\" width=\"826\" height=\"465\" src=\"https:\/\/uzh.mediaspace.cast.switch.ch\/embed\/secure\/iframe\/entryId\/0_7qvq0p4y\/uiConfId\/23449004\/pbc\/12943\/st\/0\" class=\"kmsembed\" 0=\"allowfullscreen\" 1=\"webkitallowfullscreen\" 2=\"mozAllowFullScreen\" allow=\"autoplay *; fullscreen *; encrypted-media *\" referrerpolicy=\"no-referrer-when-downgrade\" sandbox=\"allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals 
allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation\" frameborder=\"0\" title=\"1.4 Interpreting the Verb Table\" scrolling=\"yes\"><\/iframe><\/p>\n<p style=\"text-align: justify\">When the contents of a cell (or several cells) interest us more than its position, we can formulate something like a database query. For instance, we may be interested in seeing all rows in <em>verbs <\/em>which have an animate theme. In this case we would type the following command:<\/p>\n<pre class=\"GNKRCKGCGSB\"><code><span><span class=\"GNKRCKGCMSB ace_keyword\">&gt; <\/span><span class=\"GNKRCKGCMRB ace_keyword\">verbs[verbs$AnimacyOfTheme == \"animate\",]\r\n<\/span><span class=\"GNKRCKGCGSB\">    RealizationOfRec  Verb AnimacyOfRec AnimacyOfTheme LengthOfTheme\r\n58                NP  give      animate        animate     1.0986123\r\n100               NP  give      animate        animate     2.8903718\r\n143               NP  give    inanimate        animate     2.6390573\r\n390               NP  lend      animate        animate     0.6931472\r\n506               NP  give      animate        animate     1.9459101\r\n736               PP trade      animate        animate     1.6094379<\/span><\/span><\/code><\/pre>\n<p style=\"text-align: justify\">This query looks more complex than anything we have seen so far, so let&#8217;s break it down. As before, we have <em>verbs <\/em>followed by square brackets. This means that we want to access the content of the variable <em>verbs <\/em>as defined by the criterion in the square brackets. In the square brackets we have <code><span><span class=\"GNKRCKGCMRB ace_keyword\">verbs$AnimacyOfTheme == \"animate\"<\/span><\/span><\/code> followed by a comma, which means that within <em>verbs<\/em>, we want only those rows in which the theme is animate; because nothing follows the comma, all columns are displayed. 
The two equal signs are used to test for exact equality, and this is what gives us the desired output.<a class=\"footnote\" title=\"For a definition of all operators, see &quot;Chapter 3: Evaluation of Expressions&quot; in the R Language Definition manual, which is available at https:\/\/cran.r-project.org\/doc\/manuals\/r-release\/R-lang.pdf\" id=\"return-footnote-21-2\" href=\"#footnote-21-2\" aria-label=\"Footnote 2\"><sup class=\"footnote\">[2]<\/sup><\/a> This type of query takes some getting used to, but it is how R allows us to do something like a database query.<\/p>\n<p style=\"text-align: justify\">With this understanding, you can easily formulate analogous queries for different criteria:<\/p>\n<pre><code>&gt; verbs[verbs$Verb == \"lend\",]<\/code><\/pre>\n<p style=\"text-align: justify\">Of course, these can be expanded to include multiple elements:<\/p>\n<pre class=\"GNKRCKGCGSB\"><code><span><span class=\"GNKRCKGCMSB ace_keyword\">&gt; <\/span><span class=\"GNKRCKGCMRB ace_keyword\">verbs[verbs$Verb==\"lend\" | verbs$Verb==\"sell\",]<\/span><\/span><\/code><\/pre>\n<p style=\"text-align: justify\">The <code>|<\/code> means &#8220;or&#8221;, so R displays all the rows containing either the verb &#8220;sell&#8221; or &#8220;lend&#8221;.<\/p>\n<p style=\"text-align: justify\">Now, our data frame has multiple columns of categorical data, so we can use this syntax to search for rows that satisfy two conditions at once:<\/p>\n<pre><code>&gt; verbs[verbs$Verb==\"lend\" &amp; verbs$RealizationOfRec==\"PP\",]<\/code><\/pre>\n<p>As you would expect, the <code>&amp;<\/code> means &#8220;and&#8221;, so the output contains those rows in which the recipient of the verb <em>lend<\/em> is realized as a PP.<\/p>\n<p style=\"text-align: justify\">That&#8217;s it! 
You have seen how to import data and phrase search queries, and you will learn more about R in the next chapter, <a href=\"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/chapter\/introduction-to-r\/\">Introduction to R<\/a>.<\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: justify\"><strong>Reference:<\/strong><\/p>\n<p style=\"text-align: justify\">Baayen, R. H. (2008) <i>Analyzing Linguistic Data. A Practical Introduction to Statistics Using R<\/i>. Cambridge University Press, Cambridge.<\/p>\n<hr class=\"before-footnotes clear\" \/><div class=\"footnotes\"><ol><li id=\"footnote-21-1\">The data in verbs.txt is a simplification of a dataset compiled by Joan Bresnan. <a href=\"#return-footnote-21-1\" class=\"return-footnote\" aria-label=\"Return to footnote 1\">&crarr;<\/a><\/li><li id=\"footnote-21-2\">For a definition of all operators, see \"Chapter 3: Evaluation of Expressions\" in the <em>R Language Definition<\/em> manual, which is available at <a href=\"https:\/\/cran.r-project.org\/doc\/manuals\/r-release\/R-lang.pdf\">https:\/\/cran.r-project.org\/doc\/manuals\/r-release\/R-lang.pdf<\/a> <a href=\"#return-footnote-21-2\" class=\"return-footnote\" aria-label=\"Return to footnote 
2\">&crarr;<\/a><\/li><\/ol><\/div>","protected":false},"author":29,"menu_order":1,"template":"","meta":{"pb_show_title":"on","pb_short_title":"","pb_subtitle":"","pb_authors":[],"pb_section_license":""},"chapter-type":[47],"contributor":[],"license":[],"class_list":["post-21","chapter","type-chapter","status-publish","hentry","chapter-type-standard"],"part":3,"_links":{"self":[{"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/pressbooks\/v2\/chapters\/21"}],"collection":[{"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/pressbooks\/v2\/chapters"}],"about":[{"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/wp\/v2\/types\/chapter"}],"author":[{"embeddable":true,"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/wp\/v2\/users\/29"}],"version-history":[{"count":55,"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/pressbooks\/v2\/chapters\/21\/revisions"}],"predecessor-version":[{"id":651,"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/pressbooks\/v2\/chapters\/21\/revisions\/651"}],"part":[{"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/pressbooks\/v2\/parts\/3"}],"metadata":[{"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/pressbooks\/v2\/chapters\/21\/metadata\/"}],"wp:attachment":[{"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/wp\/v2\/media?parent=21"}],"wp:term":[{"taxonomy":"chapter-type","embeddable":true,"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/pressbooks\/v2\/chapter-type?post=21"},{"taxonomy":"contributor","embeddable":true,"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/wp\/v2\/contributor?post=21"},{"taxonomy":"license","embeddable":true,"href":"https:\/\/dlf.uzh.ch\/openbooks\/statisticsforlinguists\/wp-json\/wp\/v2\/license?post=21"}],"curies":[{"name":"wp","href":"https:\/
\/api.w.org\/{rel}","templated":true}]}}