Importing NGS data into Bioconductor
There are two main packages for working with NGS data in R: the Rsamtools package and the GenomicAlignments package. You can think of the difference as:
- Rsamtools provides raw access to the information in NGS data files
- GenomicAlignments uses the Rsamtools functions to provide NGS data in R as high-level Bioconductor objects (based on GRanges for example). We will see more examples below.
Rsamtools description
The Rsamtools package has the description:
… provides an interface to the ‘samtools’, ‘bcftools’, and ‘tabix’ utilities for manipulating SAM (Sequence Alignment / Map), FASTA, binary variant call (BCF) and compressed indexed tab-delimited (tabix) files.
What are BAM files?
You might not be familiar with all these formats, but the ones we are interested in for now are SAM and it’s compressed form BAM. We will refer here to BAM files, because these are the kind of files which are kept most often because they are smaller (there is a SAMtools utility for converting SAM to BAM and BAM to SAM).
SAM and BAM files contain information about the alignment of NGS reads to a reference genome. These files are produced by alignment software, which take as input:
- the FASTQ files from the sequencing machine (either 1 file for a single-end sequencing sample, or 2 files for a paired-end sequencing sample).
- an genomic index, which is typically produced by special software packaged with the alignment software. The genomic index is created from the reference genome. Sometimes the genomic index files for popular reference genomes can be downloaded.
Note: alignment software is typically application specific. In particular, alignment programs for RNA-seq are different than those for genomic DNA sequencing, because in the case of the former, it is expected that reads might fall on exon-exon boundaries. The read will not contain intron sequence, because it is typically the mature, spliced mRNA which is converted to cDNA and sequenced, and the introns are already spliced out of this molecule. This is not a concern for genomic DNA sequencing.
How to import NGS data using Rsamtools
We will use example BAM files from the pasillaBamSubset package to examine the Rsamtools functions:
library(pasillaBamSubset)
library(Rsamtools)
filename <- untreated1_chr4()
We can create a BamFile
object using the function BamFile
, which allows other functions to know how to process the file.
(bf <- BamFile(filename))
## class: BamFile
## path: /usr/local/lib/R/site-library/pasillaBamSubset.../untreated1_chr4.bam
## index: /usr/local/lib/R/site-library/pasillaBamS.../untreated1_chr4.bam.bai
## isOpen: FALSE
## yieldSize: NA
## obeyQname: FALSE
## asMates: FALSE
## qnamePrefixEnd: NA
## qnameSuffixStart: NA
We can ask about information on the chromosomes which are declared in the header of the BAM file:
seqinfo(bf)
## Seqinfo object with 8 sequences from an unspecified genome:
## seqnames seqlengths isCircular genome
## chr2L 23011544 NA <NA>
## chr2R 21146708 NA <NA>
## chr3L 24543557 NA <NA>
## chr3R 27905053 NA <NA>
## chr4 1351857 NA <NA>
## chrM 19517 NA <NA>
## chrX 22422827 NA <NA>
## chrYHet 347038 NA <NA>
(sl <- seqlengths(bf))
## chr2L chr2R chr3L chr3R chr4 chrM chrX chrYHet
## 23011544 21146708 24543557 27905053 1351857 19517 22422827 347038
A summary of the kind of alignments in the file can be generated:
quickBamFlagSummary(bf)
## group | nb of | nb of | mean / max
## of | records | unique | records per
## records | in group | QNAMEs | unique QNAME
## All records........................ A | 204355 | 190770 | 1.07 / 10
## o template has single segment.... S | 204355 | 190770 | 1.07 / 10
## o template has multiple segments. M | 0 | 0 | NA / NA
## - first segment.............. F | 0 | 0 | NA / NA
## - last segment............... L | 0 | 0 | NA / NA
## - other segment.............. O | 0 | 0 | NA / NA
##
## Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of M.
## Indentation reflects this.
##
## Details for group S:
## o record is mapped.............. S1 | 204355 | 190770 | 1.07 / 10
## - primary alignment......... S2 | 204355 | 190770 | 1.07 / 10
## - secondary alignment....... S3 | 0 | 0 | NA / NA
## o record is unmapped............ S4 | 0 | 0 | NA / NA
Specifying: what and which
A number of functions in Rsamtools take an argument param
, which expects a ScanBamParam
specification. There are full details available by looking up ?scanBamParam
, but two important options are:
- what - what kind of information to extract?
- which - which ranges of alignments to extract?
BAM files are often paired with an index file (if not they can be indexed from R with indexBam
), and so we can quickly pull out information about reads from a particular genomic range. Here we count the number of records (reads) on chromosome 4:
(gr <- GRanges("chr4",IRanges(1, sl["chr4"])))
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr4 [1, 1351857] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
countBam(bf, param=ScanBamParam(which = gr))
## space start end width file records nucleotides
## 1 chr4 1 1351857 1351857 untreated1_chr4.bam 204355 15326625
We can pull out all the information with scanBam
. Here, we specify a new BamFile
, and use the yieldSize
argument. This limits the number of reads which will be extracted to 5 at a time. Each time we call scanBam
we will get 5 more reads, until there are no reads left. If we do not specify yieldSize
we get all the reads at once. yieldSize
is useful mainly for two reasons: (1) for limiting the number of reads at a time, for example, 1 or 2 million reads at a time, to keep within the memory limits of a given machine, say in the 5 GB range (2) or, for debugging, working through small examples while writing software.
reads <- scanBam(BamFile(filename, yieldSize=5))
Examining the output of scanBam
reads
is a list of lists. The outer list indexes over the ranges in the which
command. Since we didn’t specify which
, here it is a list of length 1. The inner list contains different pieces of information from the BAM file. Since we didn’t specify what
we get everything. See ?scanBam
for the possible kinds of information to specify for what
.
class(reads)
## [1] "list"
names(reads[[1]])
## [1] "qname" "flag" "rname" "strand" "pos" "qwidth" "mapq"
## [8] "cigar" "mrnm" "mpos" "isize" "seq" "qual"
reads[[1]]$pos # the aligned start position
## [1] 892 919 924 936 949
reads[[1]]$rname # the chromosome
## [1] chr4 chr4 chr4 chr4 chr4
## Levels: chr2L chr2R chr3L chr3R chr4 chrM chrX chrYHet
reads[[1]]$strand # the strand
## [1] - - + + +
## Levels: + - *
reads[[1]]$qwidth # the width of the read
## [1] 75 75 75 75 75
reads[[1]]$seq # the sequence of the read
## A DNAStringSet instance of length 5
## width seq
## [1] 75 CTGTGGTGACCAACACCACAGAATGGTTCGG...GTTCCCTGCCCCTTTCCTGGCTAGGTTGTCC
## [2] 75 TCGGGCCCAATTAGAGGGTTCCCTGCCCCTT...GTCCGCTAGCTCATTTCCTGGGCTGTTGTTG
## [3] 75 CCCAATTAGAGGATTCTCTGCCCCTTTCCTG...GTAGCTCATTTCCCGGGATGTTGTTGTGTCC
## [4] 75 GTTCTCTGCCCCTTTCCTGGCTAGGTTGTCC...CCGAGATGTTGTTGTGTCCCGGGACCCACCT
## [5] 75 TTCCTGGCTAGGTTGTCCGCTAGCTCATTTC...GTGTCCCGGGACACACCTTATTGTGAGTTTG
Here we give an example of specifying what
and which
:
gr <- GRanges("chr4",IRanges(500000, 700000))
reads <- scanBam(bf, param=ScanBamParam(what=c("pos","strand"), which=gr))
How are the start positions distributed:
hist(reads[[1]]$pos)
A slightly more complicated picture: split positions by strand, tabulate in bins and make a stacked barplot:
readsByStrand <- split(reads[[1]]$pos, reads[[1]]$strand)
myHist <- function(x) table(cut(x, 50:70 * 10000 ))
tab <- sapply(readsByStrand, myHist)
barplot( t(tab) )
GenomicAlignments description
The GenomicAlignments package is described with:
Provides efficient containers for storing and manipulating short genomic alignments (typically obtained by aligning short reads to a reference genome). This includes read counting, computing the coverage, junction detection, and working with the nucleotide content of the alignments.
This package defines the classes and functions which are used to represent genomic alignments in Bioconductor. Two of the most important functions in GenomicAlignments are:
- readGAlignments - this and other similarly named functions read data from BAM files
- summarizeOverlaps - this function simplifies counting reads in genomic ranges across one or more files
The summarizeOverlaps
function is covered in more depth in the read counting page. summarizeOverlaps
is a function which wraps up other functions in GenomicAlignments function for counting reads.
Here we will examine the output of the readGAlignments
function, continuing with the BAM file from the pasilla dataset.
library(GenomicAlignments)
(ga <- readGAlignments(bf))
## GAlignments object with 204355 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end
## <Rle> <Rle> <character> <integer> <integer> <integer>
## [1] chr4 - 75M 75 892 966
## [2] chr4 - 75M 75 919 993
## [3] chr4 + 75M 75 924 998
## [4] chr4 + 75M 75 936 1010
## [5] chr4 + 75M 75 949 1023
## ... ... ... ... ... ... ...
## [204351] chr4 + 75M 75 1348268 1348342
## [204352] chr4 + 75M 75 1348268 1348342
## [204353] chr4 + 75M 75 1348268 1348342
## [204354] chr4 - 75M 75 1348449 1348523
## [204355] chr4 - 75M 75 1350124 1350198
## width njunc
## <integer> <integer>
## [1] 75 0
## [2] 75 0
## [3] 75 0
## [4] 75 0
## [5] 75 0
## ... ... ...
## [204351] 75 0
## [204352] 75 0
## [204353] 75 0
## [204354] 75 0
## [204355] 75 0
## -------
## seqinfo: 8 sequences from an unspecified genome
length(ga)
## [1] 204355
Note that we can extract the GRanges object within the GAlignments object, although we will see below that we can often work directly with the GAlignments object.
granges(ga[1])
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr4 [892, 966] -
## -------
## seqinfo: 8 sequences from an unspecified genome
Some of our familiar GenomicRanges functions work on GAlignments: we can use findOverlaps
, countOverlaps
and %over%
directly on the GAlignments object. Note that location of ga
and gr
in the calls below:
gr <- GRanges("chr4", IRanges(700000, 800000))
(fo <- findOverlaps(ga, gr)) # which reads over this range
## Hits of length 6465
## queryLength: 204355
## subjectLength: 1
## queryHits subjectHits
## <integer> <integer>
## 1 127157 1
## 2 127238 1
## 3 127240 1
## 4 127242 1
## 5 127243 1
## ... ... ...
## 6461 144660 1
## 6462 144661 1
## 6463 144662 1
## 6464 144663 1
## 6465 144664 1
countOverlaps(gr, ga) # count overlaps of range with the reads
## [1] 6465
table(ga %over% gr) # logical vector of read overlaps with the range
##
## FALSE TRUE
## 197890 6465
If we had run countOverlaps(ga, gr)
it would return an integer vector with the number of overlaps for each read with the range in gr
.