Mapping features to genes

Using Bioconductor annotation packages

This unit will focus on mapping features to genes, i.e., getting annotation information from one format to another. We start by loading in the maPooling dataset from previous lectures.

# library(devtools) install_github('dagdata','genomicsclass')
library(dagdata)
library(Biobase)

## Loading required package: BiocGenerics
## Loading required package: methods
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following object is masked from 'package:stats':
## 
##     xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, as.vector, cbind,
##     colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
##     intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unlist
## 
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

data(maPooling)
e <- maPooling
head(rownames(e))

## [1] "1367452_at" "1367453_at" "1367454_at" "1367455_at" "1367456_at"
## [6] "1367457_at"

annotation(e)

## [1] "rae230a"

The annotation for this ExpressionSet is rae230a. Many platforms will have database annotation packages already existing on Bioconductor. We can access these, first by installing, and then loading the library. We will use the AnnotationDbi package to query the information in the library.

While in this unit we will use a microarray annotation package as an example, the same commands can be used for an organism package, such as the homo sapiens annotation package org.Hs.eg.db, which let’s one query from one kind of gene annotation to another.

# biocLite(paste0(annotation(e),'.db'))
library(rae230a.db)

## Loading required package: AnnotationDbi
## Loading required package: GenomeInfoDb
## Loading required package: org.Rn.eg.db
## Loading required package: DBI

# biocLite('AnnotationDbi')
library(AnnotationDbi)

Annotation packages have columns, some of which may be keys. You can query the database using a key, and ask for one or more columns in return. We will use the rownames of the ExpressionSet as keys.

columns(rae230a.db)

##  [1] "PROBEID"      "ENTREZID"     "PFAM"         "IPI"         
##  [5] "PROSITE"      "ACCNUM"       "ALIAS"        "CHR"         
##  [9] "CHRLOC"       "CHRLOCEND"    "ENZYME"       "PATH"        
## [13] "PMID"         "REFSEQ"       "SYMBOL"       "UNIGENE"     
## [17] "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "GENENAME"    
## [21] "UNIPROT"      "GO"           "EVIDENCE"     "ONTOLOGY"    
## [25] "GOALL"        "EVIDENCEALL"  "ONTOLOGYALL"

keytypes(rae230a.db)

##  [1] "ENTREZID"     "PFAM"         "IPI"          "PROSITE"     
##  [5] "ACCNUM"       "ALIAS"        "CHR"          "CHRLOC"      
##  [9] "CHRLOCEND"    "ENZYME"       "PATH"         "PMID"        
## [13] "REFSEQ"       "SYMBOL"       "UNIGENE"      "ENSEMBL"     
## [17] "ENSEMBLPROT"  "ENSEMBLTRANS" "GENENAME"     "UNIPROT"     
## [21] "GO"           "EVIDENCE"     "ONTOLOGY"     "GOALL"       
## [25] "EVIDENCEALL"  "ONTOLOGYALL"  "PROBEID"

head(keys(rae230a.db, keytype = "PROBEID"))

## [1] "1367452_at" "1367453_at" "1367454_at" "1367455_at" "1367456_at"
## [6] "1367457_at"

head(rownames(e))

## [1] "1367452_at" "1367453_at" "1367454_at" "1367455_at" "1367456_at"
## [6] "1367457_at"

The following select call will return the Entrez ID, ENSEMBL ID, and gene symbol for each Probe ID, which are the rownames of the ExpressionSet.

res <- select(rae230a.db, keys = rownames(e), columns = c("ENTREZID", "ENSEMBL", 
    "SYMBOL"), keytype = "PROBEID")

## Warning: 'select' resulted in 1:many mapping between keys and return rows

head(res)

##      PROBEID ENTREZID            ENSEMBL SYMBOL
## 1 1367452_at   690244 ENSRNOG00000032840  Sumo2
## 2 1367453_at   114562 ENSRNOG00000033426  Cdc37
## 3 1367454_at    60384 ENSRNOG00000045871  Copb2
## 4 1367455_at   116643 ENSRNOG00000034242    Vcp
## 5 1367456_at    81920 ENSRNOG00000013741 Ube2d3
## 6 1367456_at   641452 ENSRNOG00000013741 Ube2d2

idx <- match(rownames(e), res$PROBEID)

We need to align the res object so that we pull out, in order, one row for each row of the ExpressionSet.

head(rownames(e))

## [1] "1367452_at" "1367453_at" "1367454_at" "1367455_at" "1367456_at"
## [6] "1367457_at"

head(res$PROBEID, 7)

## [1] "1367452_at" "1367453_at" "1367454_at" "1367455_at" "1367456_at"
## [6] "1367456_at" "1367457_at"

head(idx)

## [1] 1 2 3 4 5 7

Here we add the new information to the fData of e. If there were already information in fData, we would have used cbind to add the new columns. Note here that, since we have a one-to-many mapping, the match function gave us the first match that it found. You could also collapse all possible matches of the Probe ID to the Genes using split and paste with the collapse argument. However, here we keep it simple and just take the first match in the res object.

fData(e) <- res[idx, ]
head(fData(e), 10)

##       PROBEID ENTREZID            ENSEMBL SYMBOL
## 1  1367452_at   690244 ENSRNOG00000032840  Sumo2
## 2  1367453_at   114562 ENSRNOG00000033426  Cdc37
## 3  1367454_at    60384 ENSRNOG00000045871  Copb2
## 4  1367455_at   116643 ENSRNOG00000034242    Vcp
## 5  1367456_at    81920 ENSRNOG00000013741 Ube2d3
## 7  1367457_at   114558 ENSRNOG00000020513  Becn1
## 8  1367458_at    83510 ENSRNOG00000010067 Lypla2
## 9  1367459_at    64310 ENSRNOG00000030591   Arf1
## 10 1367460_at    29662 ENSRNOG00000018091   Gdi2
## 11 1367461_at   114023 ENSRNOG00000012000  Copb1

all.equal(fData(e)$PROBEID, rownames(e))

## [1] TRUE

Using Biomart

An alternate way to map from one annotation to another is using the biomaRt package. For more information on which Biomarts are available and how to access them, see the biomaRt vignette.

# biocLite('biomaRt')
library(biomaRt)
# vignette('biomaRt')
m <- useMart("ensembl", dataset = "rnorvegicus_gene_ensembl")
map <- getBM(mart = m, attributes = c("ensembl_gene_id", "entrezgene"), filters = "ensembl_gene_id", 
    values = fData(e)$ENSEMBL)
head(map)

##      ensembl_gene_id entrezgene
## 1 ENSRNOG00000000007      24379
## 2 ENSRNOG00000000010     498922
## 3 ENSRNOG00000000024     362454
## 4 ENSRNOG00000000028      83836
## 5 ENSRNOG00000000029      24582
## 6 ENSRNOG00000000033     305095

Finally, we need to align the new information with the old information using the match function as before, again picking the first match from a one-to-many mapping. We see that for the most part the new and the old Entrez IDs are the same, though some differences occur when we pick one from the one-to-many mappings that exist.

idx <- match(fData(e)$ENSEMBL, map$ensembl_gene_id)
fData(e)$NEW_ENTREZID <- map$entrezgene[idx]
head(fData(e))

##      PROBEID ENTREZID            ENSEMBL SYMBOL NEW_ENTREZID
## 1 1367452_at   690244 ENSRNOG00000032840  Sumo2       682787
## 2 1367453_at   114562 ENSRNOG00000033426  Cdc37       114562
## 3 1367454_at    60384 ENSRNOG00000045871  Copb2        60384
## 4 1367455_at   116643 ENSRNOG00000034242    Vcp       116643
## 5 1367456_at    81920 ENSRNOG00000013741 Ube2d3        81920
## 7 1367457_at   114558 ENSRNOG00000020513  Becn1       114558

mean(fData(e)$ENTREZID == fData(e)$NEW_ENTREZID, na.rm = TRUE)

## [1] 0.993