Brief Introduction to dplyr

R syntax takes time to learn. One of the more difficult aspects to get used to is subsetting data tables. The dplyr package brings these tasks closer to English, so we are going to introduce two simple functions: one is used to subset rows and the other to select columns.

Take a look at the dataset we read in:

filename <- "femaleMiceWeights.csv"
dat <- read.csv(filename)
head(dat) #In RStudio use View(dat)
##   Diet Bodyweight
## 1 chow      21.51
## 2 chow      28.14
## 3 chow      24.04
## 4 chow      23.45
## 5 chow      23.68
## 6 chow      19.79

There are two types of diets, which are denoted in the first column. If we want just the weights, we only need the second column. So if we want the weights for mice on the chow diet, we first filter the rows like this:

library(dplyr) 
chow <- filter(dat, Diet=="chow") #keep only the ones with chow diet
head(chow)
##   Diet Bodyweight
## 1 chow      21.51
## 2 chow      28.14
## 3 chow      24.04
## 4 chow      23.45
## 5 chow      23.68
## 6 chow      19.79

And now we can select only the column with the values:

chowVals <- select(chow,Bodyweight)
head(chowVals)
##   Bodyweight
## 1      21.51
## 2      28.14
## 3      24.04
## 4      23.45
## 5      23.68
## 6      19.79

A nice feature of the dplyr package is that you can perform consecutive tasks by using what is called a “pipe”. In dplyr we use %>% to denote a pipe. This symbol tells the program to first do one thing and then do something else to the result of the first. Hence, we can perform several data manipulations in one line. For example:

chowVals <- filter(dat, Diet=="chow") %>% select(Bodyweight)

In the second task, we no longer have to specify the object we are editing since it is whatever comes from the previous call.
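
To make the role of the pipe concrete, the following sketch compares the piped call with the equivalent nested call. It uses a small made-up data frame, here called toy (a hypothetical stand-in, not the mice data), so that it runs without the CSV file:

```r
library(dplyr)

# Hypothetical stand-in for the mice data (values are made up for illustration)
toy <- data.frame(Diet = c("chow", "chow", "hf", "hf"),
                  Bodyweight = c(21.5, 28.1, 25.7, 26.4))

# Piped form: the result of filter() becomes the first argument of select()
piped <- filter(toy, Diet == "chow") %>% select(Bodyweight)

# Equivalent nested form, which has to be read from the inside out
nested <- select(filter(toy, Diet == "chow"), Bodyweight)

# Check that both forms produce the same data.frame
identical(piped, nested)
```

The piped version lists the steps in the order they are carried out, which is usually easier to follow once there are more than two of them.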

Also, note that if dplyr receives a data.frame it will return a data.frame.

class(dat)
## [1] "data.frame"
class(chowVals)
## [1] "data.frame"

For pedagogical reasons, we will often want the final result to be a simple numeric vector. To obtain such a vector with dplyr, we can apply the unlist function, which flattens a list, such as a data.frame, into a vector. Because the Bodyweight column is numeric, the result is a numeric vector:

chowVals <- filter(dat, Diet=="chow") %>% select(Bodyweight) %>% unlist
class( chowVals )
## [1] "numeric"
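
As a side note, dplyr also provides the pull function, which extracts a single column as a vector. The sketch below, again on a made-up data frame called toy (a hypothetical stand-in for the mice data), compares the two approaches. The text above uses unlist, so treat pull as an alternative rather than the approach followed here:

```r
library(dplyr)

# Hypothetical stand-in for the mice data (values are made up for illustration)
toy <- data.frame(Diet = c("chow", "chow", "hf", "hf"),
                  Bodyweight = c(21.5, 28.1, 25.7, 26.4))

# unlist flattens the one-column data.frame into a vector
viaUnlist <- filter(toy, Diet == "chow") %>% select(Bodyweight) %>% unlist

# pull extracts the named column directly as a vector
viaPull <- filter(toy, Diet == "chow") %>% pull(Bodyweight)

names(viaUnlist)  # unlist attaches names built from the column name
names(viaPull)    # pull returns the bare column, so there are no names

# The values themselves agree
all(unname(viaUnlist) == viaPull)
```

The only difference in the result is that unlist keeps element names derived from the column name, while pull does not.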

To do this in R without dplyr the code is the following:

chowVals <- dat[ dat$Diet=="chow", colnames(dat)=="Bodyweight"]
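
The base R line above combines a logical row index with a logical column index. As a sketch on a made-up data frame called toy (a hypothetical stand-in for the mice data), here are two shorter base R spellings that should give the same vector:

```r
# Hypothetical stand-in for the mice data (values are made up for illustration)
toy <- data.frame(Diet = c("chow", "chow", "hf", "hf"),
                  Bodyweight = c(21.5, 28.1, 25.7, 26.4))

# Form used in the text: logical index for rows, logical index for columns
a <- toy[toy$Diet == "chow", colnames(toy) == "Bodyweight"]

# Same rows, with the column selected by name
b <- toy[toy$Diet == "chow", "Bodyweight"]

# Extract the column first, then subset it with the logical index
d <- toy$Bodyweight[toy$Diet == "chow"]

# All three are plain numeric vectors with the same values
identical(a, b)
identical(b, d)
```

Note that when a single column is selected from a data.frame with square brackets, base R drops the result to a vector by default, which is why no unlist step is needed here.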