Genomic annotation in Bioconductor: Cheat sheet
Summarizing the key genome annotation resources in Bioconductor
Executive summary
Organism-oriented annotation
For biological annotation, generally sequence or gene based, there are three key types of package
- Reference sequence packages: BSgenome.[Organism].[Curator].[BuildID]
- Gene model database packages: TxDb.[Organism].[Curator].[BuildID].[Catalog], and, EnsDb.[Organism].[version], for Ensembl-derived annotation
- Annotation map package: org.[Organism2let].[Institution].db
wherever brackets are used, you must substitute an appropriate token. You can survey all annotation packages at the annotation page.
Packages Homo.sapiens, Mus.musculus and Rattus.norvegicus are specialized integrative annotation resources with an evolving interface.
Systems biology oriented annotation
Packages GO.db, KEGG.db, KEGGREST, and reactome.db are primarily intended as organism-independent resources organizing genes into groups. However, there are organism-specific mappings between gene-oriented annotation and these resources, that involve specific abbreviations and symbol conventions. These are described when these packages are used.
Names for organisms and their abbreviations
The standard Linnaean taxonomy is used very generally. So you need to know that
- Human = Homo sapiens
- Mouse = Mus musculus
- Rat = Rattus norvegicus
- Yeast = Saccharomyces cerevisiae
- Zebrafish = Danio rerio
- Cow = Bos taurus
and so on. We use two sorts of abbreviations. For Biostrings-based packages, the contraction of first and second names is used
- Human = Hsapiens
- Mouse = Mmusculus
- Rat = Rnorvegicus
- Yeast = Scerevisiae …
For NCBI-based annotation maps, we contract further
- Human = Hs
- Mouse = Mm
- Rat = Rn
- Yeast = Sc …
Genomic sequence
These packages have four-component names that specify the reference build used
- Human = BSgenome.Hsapiens.UCSC.hg19
- Mouse = BSgenome.Mmusculus.UCSC.mm10
- Rat = BSgenome.Rnorvegicus.UCSC.rn5
- Yeast = BSgenome.Scerevisiae.UCSC.sacCer3
Gene models
These packages have five-component names that specify the reference build used and the gene catalog
- Human = TxDb.Hsapiens.UCSC.hg19.knownGene
- Mouse = TxDb.Mmusculus.UCSC.mm10.knownGene
- Rat = TxDb.Rnorvegicus.UCSC.rn5.knownGene
- Yeast = TxDb.Scerevisiae.UCSC.sacCer3.sgdGene
Additional packages that are relevant are
- Human = TxDb.Hsapiens.UCSC.hg38.knownGene
- Human = EnsDb.Hsapiens.v75 – related to hg19/GRCh37
Annotation maps
These packages have four component names, with two components fixed. The variable components indicate organism and curating institution.
- Human = org.Hs.eg.db
- Mouse = org.Mm.eg.db
- Rat = org.Rn.eg.db
- Yeast = org.Sc.sgd.db
Additional options
There are often alternative curating institutions available such as Ensembl.