Data summaries: summary, str

First we load an example data frame:

rats <- data.frame(id = paste0("rat", 1:10), sex = factor(rep(c("female", "male"), 
    each = 5)), weight = c(2, 4, 1, 11, 18, 12, 7, 12, 19, 20), length = c(100, 
    105, 115, 130, 95, 150, 165, 180, 190, 175))
rats
##       id    sex weight length
## 1   rat1 female      2    100
## 2   rat2 female      4    105
## 3   rat3 female      1    115
## 4   rat4 female     11    130
## 5   rat5 female     18     95
## 6   rat6   male     12    150
## 7   rat7   male      7    165
## 8   rat8   male     12    180
## 9   rat9   male     19    190
## 10 rat10   male     20    175

The summary and str functions are two helpful functions for getting a sense of data. summary works on vectors or matrix-like objects (including data.frames). str works on an arbitrary R object and will compactly display the structure.

summary(rats)
##        id        sex        weight          length   
##  rat1   :1   female:5   Min.   : 1.00   Min.   : 95  
##  rat10  :1   male  :5   1st Qu.: 4.75   1st Qu.:108  
##  rat2   :1              Median :11.50   Median :140  
##  rat3   :1              Mean   :10.60   Mean   :140  
##  rat4   :1              3rd Qu.:16.50   3rd Qu.:172  
##  rat5   :1              Max.   :20.00   Max.   :190  
##  (Other):4
summary(rats$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    4.75   11.50   10.60   16.50   20.00
str(rats)
## 'data.frame':	10 obs. of  4 variables:
##  $ id    : Factor w/ 10 levels "rat1","rat10",..: 1 3 4 5 6 7 8 9 10 2
##  $ sex   : Factor w/ 2 levels "female","male": 1 1 1 1 1 2 2 2 2 2
##  $ weight: num  2 4 1 11 18 12 7 12 19 20
##  $ length: num  100 105 115 130 95 150 165 180 190 175

Aligning two objects: match, merge

We load another example data frame, with the original ID and another secretID. Suppose we want to sort the original data frame by the secretID.

ratsTable <- data.frame(id = paste0("rat", c(6, 9, 7, 3, 5, 1, 10, 4, 8, 2)), 
    secretID = 1:10)
ratsTable
##       id secretID
## 1   rat6        1
## 2   rat9        2
## 3   rat7        3
## 4   rat3        4
## 5   rat5        5
## 6   rat1        6
## 7  rat10        7
## 8   rat4        8
## 9   rat8        9
## 10  rat2       10
# wrong!
cbind(rats, ratsTable)
##       id    sex weight length    id secretID
## 1   rat1 female      2    100  rat6        1
## 2   rat2 female      4    105  rat9        2
## 3   rat3 female      1    115  rat7        3
## 4   rat4 female     11    130  rat3        4
## 5   rat5 female     18     95  rat5        5
## 6   rat6   male     12    150  rat1        6
## 7   rat7   male      7    165 rat10        7
## 8   rat8   male     12    180  rat4        8
## 9   rat9   male     19    190  rat8        9
## 10 rat10   male     20    175  rat2       10

match is a very useful function in R, which can give us this order, but it’s easy to get its arguments mixed up. Remember that match gives you, for each element in the first vector, the index of the first match in the second vector. So typically the data.frame or vector you are reordering would appear as the second argument to match. It’s always a good idea to check that you got it right, which you can do by using cbind to line up both data frames.

match(ratsTable$id, rats$id)
##  [1]  6  9  7  3  5  1 10  4  8  2
rats[match(ratsTable$id, rats$id), ]
##       id    sex weight length
## 6   rat6   male     12    150
## 9   rat9   male     19    190
## 7   rat7   male      7    165
## 3   rat3 female      1    115
## 5   rat5 female     18     95
## 1   rat1 female      2    100
## 10 rat10   male     20    175
## 4   rat4 female     11    130
## 8   rat8   male     12    180
## 2   rat2 female      4    105
cbind(rats[match(ratsTable$id, rats$id), ], ratsTable)
##       id    sex weight length    id secretID
## 6   rat6   male     12    150  rat6        1
## 9   rat9   male     19    190  rat9        2
## 7   rat7   male      7    165  rat7        3
## 3   rat3 female      1    115  rat3        4
## 5   rat5 female     18     95  rat5        5
## 1   rat1 female      2    100  rat1        6
## 10 rat10   male     20    175 rat10        7
## 4   rat4 female     11    130  rat4        8
## 8   rat8   male     12    180  rat8        9
## 2   rat2 female      4    105  rat2       10

Or you can use the merge function which will handle everything for you. You can tell it the names of the columns to merge on, or it will look for columns with the same name.

ratsMerged <- merge(rats, ratsTable, by.x = "id", by.y = "id")
ratsMerged[order(ratsMerged$secretID), ]
##       id    sex weight length secretID
## 7   rat6   male     12    150        1
## 10  rat9   male     19    190        2
## 8   rat7   male      7    165        3
## 4   rat3 female      1    115        4
## 6   rat5 female     18     95        5
## 1   rat1 female      2    100        6
## 2  rat10   male     20    175        7
## 5   rat4 female     11    130        8
## 9   rat8   male     12    180        9
## 3   rat2 female      4    105       10

Analysis over groups: split, tapply, and dplyr libary

Suppose we need to calculate the average rat weight for each sex. We could start by splitting the weight vector into a list of weight vectors divided by sex. split is a useful function for breaking up a vector into groups defined by a second vector, typically a factor. We can then use the lapply function to calculate the average of each element of the list, which are vectors of weights.

sp <- split(rats$weight, rats$sex)
sp
## $female
## [1]  2  4  1 11 18
## 
## $male
## [1] 12  7 12 19 20
lapply(sp, mean)
## $female
## [1] 7.2
## 
## $male
## [1] 14

A shortcut for this is to use tapply and give the function which should run on each element of the list as a third argument:

tapply(rats$weight, rats$sex, mean)
## female   male 
##    7.2   14.0

R is constantly being developed in the form of add-on packages, which can sometimes greatly simplify basic analysis tasks. A new library “dplyr” can accomplish the same task as above, and can be extended to many other more complicated operations. The “d” in the name is for data.frame, and the “ply” is because the library attempts to simplify tasks typically used by the set of functions: sapply, lapply, tapply, etc. Here is the same task as before done with the dplyr functions group_by and summarise:

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
sexes <- group_by(rats, sex)
summarise(sexes, ave = mean(weight))
## Source: local data frame [2 x 2]
## 
##      sex  ave
## 1 female  7.2
## 2   male 14.0

With dplyr, you can chain operations using the %.% operator:

rats %.% group_by(sex) %.% summarise(ave = mean(weight))
## Source: local data frame [2 x 2]
## 
##      sex  ave
## 1 female  7.2
## 2   male 14.0