
Introduction to data analysis with R package 'dplyr' - R User Group Nuremberg

Introduction

Within the R landscape, the dplyr package has become one of the most popular packages within a short period of time; it offers an innovative approach to data analysis. dplyr is characterized by two ideas. The first idea is that only tables (“dataframes” or “tibbles”) are processed, not other data structures; these tables are passed on from function to function. The focus on tables simplifies the analysis, as columns do not have to be created individually or in loops. The second idea is to “grammaticalize” typical data analysis activities using a taxonomy: a few building blocks can be identified with which the typical tasks of data analysis can be carried out. The workshop presents both ideas from dplyr. First, the logic of dplyr is explained without resorting to R syntax; then the functionality of dplyr is practiced. The workshop lasts about 90-120 minutes. Basic knowledge of R is required.

Organizational matters

Please bring a computer to the event. We use the following software in the workshop; please install in advance:

  • R and RStudio
  • R packages: tidyverse (which includes dplyr)

We will also download a dataset; please make sure you have an internet connection.

The tidyverse package loads dplyr and a number of other packages (for a list, see the tidyverse documentation). It is therefore more convenient to load tidyverse, which saves typing. The actual functionality that we use in this chapter comes from the dplyr package.
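A minimal sketch of this setup:

```r
# Once, if not yet installed:
# install.packages("tidyverse")

library(tidyverse)  # loads dplyr along with ggplot2, tidyr and others
```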

“Data judo” means “prepping” the data for the actual analysis. By processing we mean here the reshaping, checking, cleaning, grouping and summarizing of data. Descriptive statistics also fall under the heading of processing. In short: everything you do after the data are “there” and before you start the (more) demanding modeling.

Even if the processing of data is not statistically demanding, it is still of great importance and often quite time-consuming. An anecdote on the relevance of data preparation, which (as the story would have it) was told to me at a bar after a relevant conference (so no sources, you understand ...). A computer scientist from the USA (of German origin) had an impressive track record of victories in data analysis competitions. In fact, she had not used any special, sophisticated modeling techniques; classical regression was her method of choice. In a competition in which the aim was to predict cancer cases from medical data (e.g. from X-rays), she found out after a lot of data judo that information had seeped into the “ID variables” that did not belong there and that she could use for surprisingly good (from the point of view of her fellow contestants) predictions of cancer cases. How was that possible? The data came from several clinics, and each clinic used a different system to generate IDs for patients. Everywhere the IDs were strong enough to ensure the anonymity of the patients, but at the same time it was possible (after some judo) to tell which ID came from which clinic. What does that gain you? Some clinics were pure screening centers catering to the general population; few cancer cases are to be expected there. Other clinics, however, were oncology centers for known patients or for patients at particular risk. Not surprisingly, higher cancer rates can be predicted there. Quite simple, actually; no special math behind it (at least in this story). And once you know the trick, it is easy. But as so often, it is not easy to find the trick. Careful data judo was the key to success here.

Typical problems

Before you can really open your statistics bag of tricks, you often have to get the data into shape first. That is not difficult in the sense that complicated math is involved. However, it can sometimes take a long time, and a few (or many) manual tricks are helpful. This chapter should help here.

Some typical problems that keep coming up are:

  • Missing values: somebody didn't answer one of my nice survey questions!
  • Unexpected data: when asked how many Facebook friends he or she had, a person wrote “I like you a lot”. What to do?
  • Data needs to be reshaped: Joachim set up a Google Forms questionnaire for each of the two groups in his study. Now he has two tables that he wants to “marry” into one.
  • Calculating new variables (columns): a student asks for the number of correct tasks in the statistics mock exam. We want to help and create a column in the corresponding data set that lists the number of correctly answered questions per person.

Preparing data with dplyr

There are many ways to prepare data with R; dplyr is a popular package for this. Its approach rests on two principles, which the next sections explain in detail.

Welcome to the world of dplyr! dplyr got its name because it is devoted exclusively to Dataframes; it expects a data frame as input and returns a data frame (at least for most commands).

The two principles of dplyr

dplyr^[https://cran.r-project.org/web/packages/dplyr/index.html] is based on two ideas:

  1. Lego principle: break complex data analyses down into a few building blocks.
  2. Piping: all operations are applied only to data frames; each operation expects a data frame as input and outputs a data frame again.

The first principle of dplyr is that there are only a few basic building blocks, which can be combined well. In other words: few basic functions, each with narrowly defined functionality. The author, Hadley Wickham, once said in a forum (citation needed) that these commands can do little, but do that little well. A disadvantage of this concept can be that you have to combine quite a few of these building blocks to achieve the desired result. You also have to understand the logic of the kit well, so the learning curve is steeper at first. On the other hand, you do not have to hope that somewhere there is a “Mrs. Right” function that does exactly what you want, and you do not need to remember a lot of functions. It is enough to know a small set of functions (which are consistent in syntax and approach). These building blocks are typical activities in dealing with data; nothing surprising. We will take a closer look at these building blocks in a moment.

The second principle of dplyr is to pipe a dataframe through, from operation to operation. dplyr works only with data frames: each dplyr work step expects a data frame as input and in turn outputs a data frame.

Let's take a look at a few typical building blocks of dplyr.

Central building blocks of dplyr

Filter rows with filter

Often you want to filter certain rows from a table. For example: you work for the cigarette industry and are only interested in smokers, not non-smokers; only the sales figures of the last quarter are to be examined, not the previous quarters; only the data from laboratory X (not laboratory Y) are to be evaluated, etc.

(Figure: pictogram of the filter command)

Note:

The filter function filters rows from a data frame.

Let's look at some examples; do not forget to load the data first. Attention: if the data “live” in a package, that package must be installed in order to access the data.
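Since the workshop data set is not reproduced here, a minimal sketch using the built-in mtcars data as a stand-in:

```r
library(dplyr)

# Keep only rows that meet a condition
mtcars %>% filter(hp > 100)

# Several conditions, separated by commas, are combined with logical AND
mtcars %>% filter(hp > 100, cyl == 4)

# Logical OR uses |
mtcars %>% filter(cyl == 4 | cyl == 6)
```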

Not that difficult, is it? More generally speaking, those rows are filtered (i.e. retained and returned) for which the filter criterion is TRUE.

Exercises

Solution: F, T, F, F, T

In-depth study: advanced examples for filter

Some advanced examples of filter:

With the %in% operator you can filter all elements (rows) that belong to a given set:
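A sketch, again with mtcars as stand-in data:

```r
# Keep rows whose cyl value is contained in the given set
mtcars %>% filter(cyl %in% c(4, 6))
```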

Text data in particular invites some extra considerations. Let's say we want to filter all people who mention cats among their pets. It should be enough if “cat” occurs somewhere in the text; so “cats and dogs” would also be OK (should be kept). To do this, we use a package for processing strings (text data), such as stringr:
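A sketch with a made-up toy table (the column names person and pets are assumptions here):

```r
library(dplyr)
library(stringr)

# Toy data; 'pets' is a free-text column
pets_df <- tibble(person = c("A", "B", "C"),
                  pets   = c("a cat", "dog", "cats and a hamster"))

# str_detect is TRUE if the pattern occurs anywhere in the string
pets_df %>% filter(str_detect(pets, "cat"))
```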

A common case is to filter rows without missing values (NAs). That is easy:
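A sketch with a small toy table containing NAs; drop_na comes from the tidyr package:

```r
library(dplyr)
library(tidyr)

df <- tibble(x = c(1, NA, 3), y = c("a", "b", NA))

df %>% drop_na()  # keeps only complete rows
# base R equivalent: df[complete.cases(df), ]
```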

But what if we only care about missing values in certain columns? Say, in two specific columns:
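Continuing the toy table from above, with x and y standing in for whichever columns matter in your data:

```r
# Only rows where x is not missing
df %>% filter(!is.na(x))

# Or with drop_na, naming the columns of interest
df %>% drop_na(x, y)
```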

Select columns with select

The counterpart to filter is select; this command returns the selected columns. This is often useful when the data set is very “wide”, that is, contains many columns. Then it can be clearer to select only the relevant ones. (Figure: pictogram of the select command)

Note:

The select function selects columns from a data frame.

First, let's load a data set.

This dataset contains data on a statistics exam.
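Since the exam data set itself is not shown here, a small stand-in table (all names assumed) to illustrate select:

```r
library(dplyr)

# Stand-in for the workshop's exam data
exam <- tibble(
  name     = c("Anna", "Ben", "Cem", "Dora"),
  score    = c(28, 35, 22, 40),
  interest = c(3, 5, 2, 4)
)

exam %>% select(name, score)  # select columns by name
exam %>% select(1:2)          # ... or by position
exam %>% select(-interest)    # ... or drop a column
```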

In fact, the select command is very flexible; there are many ways to choose columns. A good overview can be found in the dplyr cheat sheet.

Tasks

Solution: F, F, T, T, F

Sort rows with arrange

One can distinguish two ways of working with R: “interactive use” on the one hand and “proper programming” on the other. In interactive use, we want to (quickly) answer questions about the data set currently at hand. It is not about developing a general solution that we can send out into the world and that solves a specific problem there without the developer (us) having to assist. “Real” software, like an R package or Microsoft Powerpoint, has to meet that expectation; “proper programming” is required for it. In that case the demands on the syntax (“code” sounds cooler) are of course much higher: you have to foresee all eventualities and ensure that the program does its job well even with the strangest user. Here, with interactive use, we have lower standards, or rather different goals.

When using R interactively (or any analysis program), sorting rows is a fairly common activity. A typical example would be the teacher who has a table of grades and wants to know which students are the worst or the best in a particular subject. Or when checking sales by branch, we would like to know the branches with the highest and the lowest sales.

An R command to do this is arrange; some examples show best how it works:
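A sketch with mtcars as stand-in data:

```r
mtcars %>% arrange(hp)        # ascending by hp
mtcars %>% arrange(-hp)       # descending (minus sign)
mtcars %>% arrange(cyl, hp)   # hp sorted within each value of cyl
```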

Some comments: the general syntax is arrange(df, col, ...), where df denotes the data frame and col the first column to sort by; the dots indicate that further columns can be passed. You can sort numeric columns as well as text columns. Most importantly, you can pass several columns; more on that in a moment.

By default, arrange sorts in ascending order (because small numbers come before large numbers on the number line). If you want to reverse this order (largest values first, i.e. descending), you can put a minus sign in front of the name of the column.

If you pass two or more columns, the values of the second column (and so on) are sorted within each value of the first column; see the output of the example above.

Note:

The arrange function sorts the rows of a data frame.

(Figure: pictogram of the arrange command)

A similar result is obtained with top_n, which returns the largest ranks:
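A sketch, again with mtcars as stand-in data:

```r
# The 3 rows with the largest hp values; ties are all returned
mtcars %>% top_n(3, hp)
```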

If you pass no column, top_n refers to the last column of the data set.

In the workshop data, several people share the highest rank (value 40), so we do not get 3 rows back, but correspondingly more.

Tasks

Solution: F, F, F, F, T

Group a data set with group_by

Grouping a data set is a common thing: what is the median sales figure in region X compared to region Y? Is the reaction time in the experimental group shorter than in the control group? Can men pull out of a parking space faster than women? You can see that grouping is particularly useful in connection with means or other summary statistics; more on this in the next section.

Grouping means dividing a data set using a discrete variable (e.g. gender) in such a way that partial data sets are created - one sub-data set per group (e.g. man vs. woman).

In the illustration, the data set has been divided into several groups based on one column. Next, we could, for example, output means per group (per value of that column); in the example, four groups (subjects A to D).
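A sketch with mtcars as stand-in data, grouping by the cyl column:

```r
library(dplyr)

mtcars_grouped <- mtcars %>% group_by(cyl)
mtcars_grouped  # the print-out notes "Groups: cyl [3]" above the data
```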

If you look at the grouped data set, you will at first see little effect of the grouping: R only tells us that there are 7 groups, but there is no extra column or other indication of the grouping. But don't worry: as soon as we calculate a mean, we will get the mean per group!

A few hints on the output: the first line tells us that the output is a tibble, i.e. a certain type of data frame (see the tibble documentation for details). The “Groups” line shows that the tibble is divided into 7 groups, according to the values of the grouping column.

group_by in itself is not really useful. It only becomes useful when you apply further functions to the grouped data set, e.g. calculating means (with summarise, see below). The functions that follow take the grouping into account, so you can easily calculate means per group. summarise then combines the summaries (e.g. means) of the individual groups in a data frame and outputs it.

The idea of “grouping - summarizing - combining” (split-apply-combine) is flexible; you can use it often. It is worth learning this idea.

Tasks

Solution: T, F, T, T

Note:

With group_by you divide a data set into groups, according to the values of one or several columns.

Collapse a column with summarise

Perhaps the most important or most common activity in data analysis is collapsing a column into one value; summarise does this. In other words: calculating a mean, finding the largest (smallest) value, computing a correlation, or any other statistic. What these operations have in common is that they combine a column into one value, “turn a column into a number”, so to speak. Hence the name of the command is quite apt. More precisely, this command collapses a column into a number by means of a function such as mean or sd. Any function that takes a column as input and returns a single number is allowed; other functions are not allowed in summarise.

One could translate this command into plain language like this: “summarise the column with this function”. Do not forget: if the column contains missing values, the command will by default acknowledge this with NA. If you add na.rm = TRUE as a parameter, R ignores missing values and the command returns a result:
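A sketch with mtcars as stand-in data:

```r
# Mean of one column; na.rm = TRUE ignores missing values
mtcars %>% summarise(mean(hp, na.rm = TRUE))
```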

Now we can also use the grouping:
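```r
mtcars %>%
  group_by(cyl) %>%
  summarise(mean(hp, na.rm = TRUE))  # one mean per value of cyl
```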

The summarise command recognizes when a table has been grouped (with group_by). Any summary that we request is then broken down by the grouping information; in the example we get a mean for each value of the grouping variable. Interestingly, we see that the mean tends to increase as the grouping variable increases.

All of these commands return a data frame, which is convenient for further processing. Without an explicit name, the new column gets an unwieldy default name; that is not so nice, so let's assign a better one:
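```r
mtcars %>% summarise(hp_mean = mean(hp, na.rm = TRUE))
```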

Now we combine this with the grouping:
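```r
mtcars %>%
  group_by(cyl) %>%
  summarise(hp_mean = mean(hp, na.rm = TRUE))
```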

Now the second column carries the name we assigned. na.rm = TRUE causes a mean to be returned even when there are missing values (the rows with missing values are then ignored).

Basically, the philosophy of the dplyr commands is: “do only one thing, but do it well”. Accordingly, summarise can only collapse columns, not rows.

Note:

With summarise you can collapse a column of a data frame into one value.

Descriptive statistics with summarise

Descriptive statistics have two main areas: measures of location and measures of dispersion.

Measures of location indicate the “typical”, “middle” or “representative” member of the distribution. The arithmetic mean (synonym: average; often abbreviated \(\bar{X}\); R command mean) comes to mind immediately. A disadvantage of means is that they are not robust to extreme values: even a single, comparatively large value can change the mean considerably and thus call into question how representative the mean is for the whole of the data. A robust alternative is the median (Md; R command median). If the number of (distinct) values is not too large relative to the number of cases, the mode is a meaningful statistic; it indicates the most frequent value. (The mode is not included in base R as a command of its own, but you can easily determine it by hand, see below; there are also some packages that offer this function.)

Measures of dispersion reflect how different the data are; in other words: are the data similar, or do the values differ considerably? Central statistics are the mean absolute deviation (MAD; not included in base R as a command of its own, though some packages offer it), the standard deviation (R command sd), the variance (R command var) and the interquartile range (R command IQR). Since only the IQR is not based on the mean, it is the most robust of these. Any quantile can be obtained with the R command quantile.

The summarise command is well suited for calculating descriptive statistics:
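A sketch with mtcars as stand-in data:

```r
mtcars %>%
  summarise(
    hp_mean   = mean(hp, na.rm = TRUE),
    hp_sd     = sd(hp, na.rm = TRUE),
    hp_median = median(hp, na.rm = TRUE),
    hp_iqr    = IQR(hp, na.rm = TRUE)
  )
```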

Of course, you could also write more simply:
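```r
mean(mtcars$hp, na.rm = TRUE)
```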

but in contrast to mean etc., summarise always returns a data frame. Since the data frame is the typical data structure, it is often useful to get a data frame back that you can keep working with. In addition, mean etc. do not allow grouping operations, whereas with summarise you can achieve this.

Tasks

Solution: T, T, T, T, T

  1. (Advanced) Build your own way to calculate the mean absolute deviation! For the sake of simplicity, assume (at first) a vector with the values (1, 2, 3)!

Solution:
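One possible way (a sketch):

```r
x <- c(1, 2, 3)
mean(abs(x - mean(x)))  # mean absolute deviation from the mean: 2/3
```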

Counting rows with n and count

Counting rows is useful, too. In contrast to the base command nrow (“base command” means that the function belongs to R's standard repertoire, i.e. does not have to be loaded via a package), the n command also understands groupings. n may only be used within summarise or similar commands.

Outside of grouped data sets, nrow is usually more practical.

More practical still is the count command, which is nothing more than group_by and n chained in series. With count we count frequencies by group; the groups are usually the values of a variable to be counted (or of several such variables). This makes count an important helper when analyzing frequency data.

In more general terms, the syntax is count(df, col, ...), where df is the data frame and col the first (there can be several) column to be counted. If you specify two columns, for example, the frequencies of the 2nd column are output for each value of the 1st column.
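A sketch with mtcars as stand-in data:

```r
mtcars %>% count(cyl)        # frequency of each cyl value
mtcars %>% count(cyl, gear)  # gear frequencies within each cyl value
```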

Note:

n and count count the number of rows, i.e. the number of cases.

Tasks

Solution: T, T, F, F

  1. Build your own way to get the mode, using count and arrange!
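A possible solution (a sketch, with mtcars as stand-in data):

```r
# Count each value, sort descending, keep the top row
mtcars %>%
  count(cyl) %>%
  arrange(-n) %>%
  head(1)
```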

Ah! That value is the most common!

The pipe

The second idea can casually be called “piping through”; it is represented iconographically by a pipe-like symbol, `%>%`. The term is borrowed from the English “to pipe”. The famous picture by René Magritte (“Ceci n'est pas une pipe”) was the inspiration.

What is meant is to put a data set on an assembly line, so to speak, and to carry out one work step at each station. The key point is that a data frame is fed in as “raw material” and each work step in turn outputs a data frame. This is a very nice way to achieve a “flow” of processing; it saves typing and makes the syntax more readable. For the piping to work, commands are required that expect a data frame as input and return a data frame. The diagram shows an example of a pipe sequence.

The so-called “pipe”, alluding to the famous picture by René Magritte, concatenates commands. This is useful because it simplifies the syntax. Compare this syntax
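A sketch of nested syntax, with mtcars as stand-in data:

```r
# Nested: must be read from the inside out
summarise(
  group_by(
    filter(mtcars, hp > 100),
    cyl),
  hp_mean = mean(hp))
```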

with this
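```r
# Piped: reads top to bottom, like a recipe
mtcars %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(hp_mean = mean(hp))
```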

It is helpful to translate this “pipe syntax” into plain pseudo-code: take the data, THEN filter the rows, THEN group, THEN summarise.

The second syntax, in “pipe form”, is much easier to understand than the first! The first syntax is nested; you have to read it from the inside out. That is complicated. The pipe in the 2nd syntax makes the syntax much easier to understand, as the commands are placed “one behind the other” (organized sequentially).

The pipe breaks the “Russian doll”, that is, nested code, down into sequential steps in the correct order (the order of processing). We no longer have to read the code from the inside out (as with a mathematical formula), but can read “first ..., second ..., third ...” like a cooking recipe. The pipe makes the syntax easier. Of course, we could also have broken the nested syntax down into many individual commands, saved each intermediate result with the assignment arrow, and explicitly passed the intermediate result on to the next command. In fact, the pipe does just that, only with less typing. And easier to read, too. Flow!

Calculate columns with mutate

If you use the pipe, the mutate command is very practical: it computes a column. Normally you can simply compute a column with the assignment operator:

For example like this:
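A sketch, reusing the toy “exam” table from the select section:

```r
# Base R: create a column via the assignment operator
exam$passed <- exam$score > 25
```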

However, this does not work (so well) within a pipe syntax. You are better advised to use the mutate function; it does just the same as the assignment above:
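```r
exam %>% mutate(passed = score > 25)
```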

In words: take the data frame, THEN compute the new column from the existing one.

Conveniently, mutate also takes groupings into account. The main advantage, however, is better readability by resolving the nesting.

A concrete example:
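A sketch, again with the toy “exam” table (the column names are assumptions):

```r
exam <- exam %>%
  mutate(passed = score > 25)

head(exam)  # shows the first (up to 6) rows
```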

This syntax creates a new column within the data frame; for each person, this column checks whether score is > 25. If yes, it is TRUE, otherwise FALSE (bad luck). head shows the first 6 rows of the resulting data frame.

(Figure: pictogram of the mutate command)

Tasks

  1. Decipher this monster! Translate this syntax into plain language:
  1. Now decipher this syntax, i.e. translate it into plain language:
  1. (difficult) The pipe with mutate:
  • Translate the following pseudo-syntax into R!

Solution:

  • Calculate the sd of the column in the data set! Compare it with the result of the previous exercise!

Solution:

  • What did the pipe syntax compute above?

Solution: the sd of that column.

Command overview

| Package::function | Description |
|---|---|
| dplyr::arrange | sorts rows |
| dplyr::filter | filters rows |
| dplyr::select | selects columns |
| dplyr::group_by | groups a data frame |
| dplyr::n | counts rows |
| dplyr::count | counts rows by subgroups |
| %>% (dplyr) | concatenates commands |
| dplyr::mutate | creates/computes columns |


Case study on dplyr

Read this more detailed case study at: https://sebastiansauer.github.io/Fallstudie_Flights/.

Tags: rstats, talks, stats, tidyverse