Example of how to download CEL files from GEO

contributed by Stephanie Hicks

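If the GEOquery R/Bioconductor package is not installed, install it with BiocManager::install(), which replaces the older biocLite() installer that current Bioconductor releases no longer support:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GEOquery")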

Load the GEOquery R/Bioconductor package:

library(GEOquery)

Access the GEO Series Data

To access a GEO Sample (GSM), a GEO Series (GSE; a list of GSM files that together form a single experiment) or a GEO Dataset (GDS), use the function getGEO(). For a GSE accession with GSEMatrix = TRUE, it returns a list of ExpressionSets:

### This will download a file of about 20 MB
gse <- getGEO("GSE21653", GSEMatrix = TRUE)
show(gse)

Accessing raw data from GEO

If raw data such as .CEL files exist on GEO, you can easily access this data using the getGEOSuppFiles() function. The function takes a GEO accession as its argument and downloads all the raw data associated with that accession. By default, getGEOSuppFiles() creates a directory named after the accession within the current working directory to store the raw data. Here, the file paths of the downloaded files (often with a .tar extension) are stored in a data frame called filePaths.

filePaths <- getGEOSuppFiles("GSE21653")
filePaths

From here you can extract the downloaded archive and use, for example, ReadAffy() from the affy package to read in the CEL files.
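As a rough sketch of that step, the code below unpacks the archive and hands the CEL files to ReadAffy(). The helper extractCelFiles() and the archive name GSE21653_RAW.tar are assumptions for illustration; check rownames(filePaths) for the actual path on your machine. It also assumes the affy package is installed and that ReadAffy() can read the gzip-compressed CEL files GEO archives usually contain.

```r
# Hypothetical helper: unpack a GEO raw-data archive and return the CEL file paths.
extractCelFiles <- function(tarPath, exdir) {
  dir.create(exdir, showWarnings = FALSE, recursive = TRUE)
  untar(tarPath, exdir = exdir)
  # Match plain and gzip-compressed CEL files, ignoring case.
  list.files(exdir, pattern = "\\.cel(\\.gz)?$", ignore.case = TRUE,
             full.names = TRUE)
}

# Assumed location of the archive written by getGEOSuppFiles("GSE21653").
tarPath <- file.path("GSE21653", "GSE21653_RAW.tar")

# Only run when the archive has been downloaded and affy is available.
if (file.exists(tarPath) && requireNamespace("affy", quietly = TRUE)) {
  celFiles <- extractCelFiles(tarPath, exdir = file.path("GSE21653", "CEL"))
  rawData <- affy::ReadAffy(filenames = celFiles)
}
```

Passing the file names explicitly keeps the call independent of the working directory; ReadAffy() also accepts a celfile.path argument if you prefer to point it at a folder.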

Access GSE Data Tables from GEO

To access the phenotypic information about the samples, the best way is to use the getGEO() function to obtain the GSE object and then extract the phenoData object from it. Unfortunately, this means downloading the entire GSE Matrix file.

dim(pData(gse[[1]]))
head(pData(gse[[1]])[, 1:3])
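The table returned by pData() is an ordinary data frame, so base R tools apply to it. As a minimal sketch, the hypothetical helper below keeps the sample accession plus any columns whose names start with "characteristics", which is where GEO series matrix files usually carry per-sample annotations. The toy data frame and its values are made up for illustration and are not taken from GSE21653; real column names may differ.

```r
# Hypothetical helper: keep the accession column and the "characteristics" columns.
selectCharacteristics <- function(pheno) {
  keep <- grepl("^(geo_accession|characteristics)", colnames(pheno))
  pheno[, keep, drop = FALSE]
}

# Made-up stand-in for pData(gse[[1]]); values are illustrative only.
toyPheno <- data.frame(
  title = c("sample A", "sample B"),
  geo_accession = c("GSM0000001", "GSM0000002"),
  status = c("Public", "Public"),
  characteristics_ch1 = c("grade: 2", "grade: 3"),
  stringsAsFactors = FALSE
)

selectCharacteristics(toyPheno)

# With the real object (assumes gse was created by getGEO() as above):
# selectCharacteristics(pData(gse[[1]]))
```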

Sometimes GSEs include separate data tables with the sample information. If these exist, you can use the getGSEDataTables() function. For example, here are the sample information tables from a different GSE accession, GSE3494, which has Data Tables.

df1 <- getGSEDataTables("GSE3494")
lapply(df1, head)
## [[1]]
##   INDEX (ID) p53 seq mut status (p53+=mutant; p53-=wt)
## 1    X101B88                                      p53+
## 2    X102B06                                      p53+
## 3    X104B91                                      p53+
## 4    X110B34                                      p53+
## 5    X111B51                                      p53+
## 6    X127B00                                      p53+
##   p53 DLDA classifier result (0=wt-like, 1=mt-like)
## 1                                                 1
## 2                                                 1
## 3                                                 0
## 4                                                 1
## 5                                                 1
## 6                                                 1
##   DLDA error (1=yes, 0=no) Elston histologic grade ER status PgR status
## 1                        0                      G3       ER-       PgR-
## 2                        0                      G3       ER+       PgR+
## 3                        1                      G3       ER+       PgR+
## 4                        0                      G2       ER+       PgR+
## 5                        0                      G3       ER+       PgR+
## 6                        0                      G3       ER+       PgR+
##   age at diagnosis tumor size (mm) Lymph node status
## 1               40              12               LN-
## 2               51              26               LN-
## 3               80              24               LN?
## 4               74              20               LN-
## 5               41              33               LN-
## 6               57              22               LN-
##   DSS TIME (Disease-Specific Survival Time in years)
## 1                                             11.833
## 2                                             11.833
## 3                                              3.583
## 4                                             11.667
## 5                                              7.167
## 6                                              4.667
##   DSS EVENT (Disease-Specific Survival EVENT; 1=death from breast cancer, 0=alive or censored )
## 1                                                                                             0
## 2                                                                                             0
## 3                                                                                             0
## 4                                                                                             0
## 5                                                                                             1
## 6                                                                                             1
## 
## [[2]]
##   GEO Sample Accession # Patient ID Affy platform
## 1               GSM79114    X100B08      HG-U133A
## 2               GSM79115    X101B88      HG-U133A
## 3               GSM79116    X102B06      HG-U133A
## 4               GSM79117    X103B41      HG-U133A
## 5               GSM79118    X104B91      HG-U133A
## 6               GSM79119    X105B13      HG-U133A
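The two tables share a patient identifier, so they can be joined to link each GEO sample accession to its clinical record. The sketch below uses small toy data frames that copy only the rows printed above (with shortened, hypothetical column names id, gsm and grade) so that it runs without a download.

```r
# Toy copy of the first table: patient ID and Elston histologic grade only.
clinical <- data.frame(
  id = c("X101B88", "X102B06", "X104B91", "X110B34", "X111B51", "X127B00"),
  grade = c("G3", "G3", "G3", "G2", "G3", "G3"),
  stringsAsFactors = FALSE
)

# Toy copy of the second table: GEO sample accession and patient ID.
samples <- data.frame(
  gsm = c("GSM79114", "GSM79115", "GSM79116", "GSM79117", "GSM79118", "GSM79119"),
  id = c("X100B08", "X101B88", "X102B06", "X103B41", "X104B91", "X105B13"),
  stringsAsFactors = FALSE
)

# Inner join: keeps only patients that appear in both tables.
linked <- merge(samples, clinical, by = "id")
print(linked)

# With the real tables, assuming the column names are exactly as printed above:
# merge(df1[[2]], df1[[1]], by.x = "Patient ID", by.y = "INDEX (ID)")
```

Only three of the six patients shown in each head() appear in both toy tables, so the joined result has three rows; the full tables would need to be merged to see the complete overlap.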