R can import data from local storage or Internet, and export it locally. Let us first setup the current folder, where our results will be stored.
rm(list=ls()) ## clean memory
getwd() # shows current folder
dir() # shows files in the current folder
setwd("D:/Data/R") # sets the current folder
Here is a file with some sample annotation Samples.
Theoretically, we can read values from unformatted text file using scan()
.
Samples = scan("http://edu.sablab.net/rt2018/data/samples.txt",what = "",sep="\n") # what - defines the value class
head(Samples)
## [1] "ID\tType\tState\tPatient\tGroup\tHTA.file\tHTA.size\tHTA.date\tCuffNum\tFolders\tSample\tComment\tFolder Tgene\tLane"
## [2] "AN387\tAC\tNormal\t387\tAN\t387N_(HTA-2_0).CEL\t69392593\t07/02/2014\t0\tA1\tAN1\t\t110302_SN386_0088_AB0198ABXX\t2"
## [3] "AT387\tAC\tTumor\t387\tAT\t387_(HTA-2_0).CEL\t69432921\t07/02/2014\t0\tA1\tAC1\tchanged\t110302_SN386_0088_AB0198ABXX\t1"
## [4] "AN414\tAC\tNormal\t414\tAN\t414N_(HTA-2_0).CEL\t69310345\t20/02/2014\t1\tA2\tAN2\t\t110330_SN388_0226_AB0095ABXX\t2"
## [5] "AT414\tAC\tTumor\t414\tAT\t414_(HTA-2_0).CEL\t69349997\t20/02/2014\t1\tA2\tAC2\t\t110330_SN388_0226_AB0095ABXX\t1"
## [6] "AN451\tAC\tNormal\t451\tAN\t451N_(HTA-2_0).CEL\t69368085\t13/02/2014\t2\tA3\tAN3\tchanged\t110302_SN386_0088_AB0198ABXX\t4"
Not very informative. But you can download an entire webpage by scan
to parse it afterwards. It’s funny, but we need to get readable data.
One example where scan()
fits perfectly - list of genes. Let’s load oncogenes.txt
oncogenes = scan("http://edu.sablab.net/rt2018/data/oncogenes.txt",what = "",sep="\n") # what - defines the value class
print(oncogenes)
## [1] "SOX2" "PDGFRA" "FGFR1" "WHSC1L1" "CDKN2A" "TP53" "NFE2L2"
## [8] "KEAP1" "ADGRB3" "FBXW7" "GRM8" "MUC16" "RUNX1T1" "STK11"
## [15] "ERBB4" "DDR2" "FGFR1"
We will use read.table()
to import the data as a data frame.
ID | Type | State | Patient | Group | HTA.file | HTA.size | HTA.date | CuffNum | Folders | Sample | Comment | Folder.Tgene | Lane |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AN387 | AC | Normal | 387 | AN | 387N_(HTA-2_0).CEL | 69392593 | 07/02/2014 | 0 | A1 | AN1 | 110302_SN386_0088_AB0198ABXX | 2 | |
AT387 | AC | Tumor | 387 | AT | 387_(HTA-2_0).CEL | 69432921 | 07/02/2014 | 0 | A1 | AC1 | changed | 110302_SN386_0088_AB0198ABXX | 1 |
AN414 | AC | Normal | 414 | AN | 414N_(HTA-2_0).CEL | 69310345 | 20/02/2014 | 1 | A2 | AN2 | 110330_SN388_0226_AB0095ABXX | 2 | |
AT414 | AC | Tumor | 414 | AT | 414_(HTA-2_0).CEL | 69349997 | 20/02/2014 | 1 | A2 | AC2 | 110330_SN388_0226_AB0095ABXX | 1 | |
AN451 | AC | Normal | 451 | AN | 451N_(HTA-2_0).CEL | 69368085 | 13/02/2014 | 2 | A3 | AN3 | changed | 110302_SN386_0088_AB0198ABXX | 4 |
Some parameters are important in read.table()
:
header
- set it TRUE
if there is a header linesep
- separator character. "\t"
stands for tabulationas.is
- prevents transforming character columns to factors.row.names
- should the first column be considered as rownames?Samples = read.table("http://edu.sablab.net/rt2018/data/samples.txt", header=T, sep="\t", as.is=T)
str(Samples)
## 'data.frame': 32 obs. of 14 variables:
## $ ID : chr "AN387" "AT387" "AN414" "AT414" ...
## $ Type : chr "AC" "AC" "AC" "AC" ...
## $ State : chr "Normal" "Tumor" "Normal" "Tumor" ...
## $ Patient : int 387 387 414 414 451 451 562 562 596 596 ...
## $ Group : chr "AN" "AT" "AN" "AT" ...
## $ HTA.file : chr "387N_(HTA-2_0).CEL" "387_(HTA-2_0).CEL" "414N_(HTA-2_0).CEL" "414_(HTA-2_0).CEL" ...
## $ HTA.size : int 69392593 69432921 69310345 69349997 69368085 69369445 69342145 69343933 69324581 69304733 ...
## $ HTA.date : chr "07/02/2014" "07/02/2014" "20/02/2014" "20/02/2014" ...
## $ CuffNum : int 0 0 1 1 2 2 3 3 7 7 ...
## $ Folders : chr "A1" "A1" "A2" "A2" ...
## $ Sample : chr "AN1" "AC1" "AN2" "AC2" ...
## $ Comment : chr "" "changed" "" "" ...
## $ Folder.Tgene: chr "110302_SN386_0088_AB0198ABXX" "110302_SN386_0088_AB0198ABXX" "110330_SN388_0226_AB0095ABXX" "110330_SN388_0226_AB0095ABXX" ...
## $ Lane : int 2 1 2 1 4 3 2 1 7 6 ...
Do not forget functions that allow you seeing, what is inside your data:
head(Samples)
## ID Type State Patient Group HTA.file HTA.size HTA.date
## 1 AN387 AC Normal 387 AN 387N_(HTA-2_0).CEL 69392593 07/02/2014
## 2 AT387 AC Tumor 387 AT 387_(HTA-2_0).CEL 69432921 07/02/2014
## 3 AN414 AC Normal 414 AN 414N_(HTA-2_0).CEL 69310345 20/02/2014
## 4 AT414 AC Tumor 414 AT 414_(HTA-2_0).CEL 69349997 20/02/2014
## 5 AN451 AC Normal 451 AN 451N_(HTA-2_0).CEL 69368085 13/02/2014
## 6 AT451 AC Tumor 451 AT 451_(HTA-2_0).CEL 69369445 13/02/2014
## CuffNum Folders Sample Comment Folder.Tgene Lane
## 1 0 A1 AN1 110302_SN386_0088_AB0198ABXX 2
## 2 0 A1 AC1 changed 110302_SN386_0088_AB0198ABXX 1
## 3 1 A2 AN2 110330_SN388_0226_AB0095ABXX 2
## 4 1 A2 AC2 110330_SN388_0226_AB0095ABXX 1
## 5 2 A3 AN3 changed 110302_SN386_0088_AB0198ABXX 4
## 6 2 A3 AC3 110302_SN386_0088_AB0198ABXX 3
summary(Samples)
## ID Type State Patient
## Length:32 Length:32 Length:32 Min. :327.0
## Class :character Class :character Class :character 1st Qu.:378.8
## Mode :character Mode :character Mode :character Median :449.5
## Mean :478.4
## 3rd Qu.:600.2
## Max. :679.0
## Group HTA.file HTA.size
## Length:32 Length:32 Min. :69274325
## Class :character Class :character 1st Qu.:69337531
## Mode :character Mode :character Median :69368697
## Mean :69379888
## 3rd Qu.:69392923
## Max. :69584273
## HTA.date CuffNum Folders Sample
## Length:32 Min. :0.000 Length:32 Length:32
## Class :character 1st Qu.:2.000 Class :character Class :character
## Mode :character Median :4.000 Mode :character Mode :character
## Mean :4.125
## 3rd Qu.:6.250
## Max. :9.000
## Comment Folder.Tgene Lane
## Length:32 Length:32 Min. :1.000
## Class :character Class :character 1st Qu.:2.000
## Mode :character Mode :character Median :4.500
## Mean :4.406
## 3rd Qu.:6.250
## Max. :8.000
View(Samples)
R can keep data in GZip-ed form, automatically loading the variables into memory. Such files have .RData extension. This is a fast & easy way to store your data. Let us first download the data in RData format into you working directory using download.file()
and then load it by load()
. Parameters of downloading:
destfile
- the file name, under which you would like to store the downloaded file.mode
- the way you would like to treat the data (as text or binary). To keep binary data unchanged, use wb
!Example: Data from TCGA for 40 lung squamous cell carcinoma samples and 20 controls.
download.file("http://edu.sablab.net/rt2018/data/LUSC60.RData", destfile="LUSC60.RData",mode = "wb")
getwd() # show current folder
## [1] "B:/DOCUMENTs/Pedagogics/RTrans2018/www"
dir(pattern=".RData") # show files in the current folder
## [1] "all.RData" "LUSC60.RData" "SamplesSCC.RData"
## [4] "workspace.RData"
load("LUSC60.RData") # load the data
ls() # you should see 'LUSC' object among variables
## [1] "LUSC" "oncogenes" "Samples"
str(LUSC)
## List of 5
## $ nsamples : num 60
## $ nfeatures: int 20531
## $ anno :'data.frame': 20531 obs. of 2 variables:
## ..$ gene : chr [1:20531] "?|100130426" "?|100133144" "?|100134869" "?|10357" ...
## ..$ location: chr [1:20531] "chr9:79791679-79792806:+" "chr15:82647286-82708202:+;chr15:83023773-83084727:+" "chr15:82647286-82707815:+;chr15:83023773-83084340:+" "chr20:56064083-56063450:-" ...
## $ meta :'data.frame': 60 obs. of 5 variables:
## ..$ id : chr [1:60] "NT.1" "NT.2" "NT.3" "NT.4" ...
## ..$ disease : chr [1:60] "LUSC" "LUSC" "LUSC" "LUSC" ...
## ..$ sample_type: chr [1:60] "NT" "NT" "NT" "NT" ...
## ..$ platform : chr [1:60] "ILLUMINA" "ILLUMINA" "ILLUMINA" "ILLUMINA" ...
## ..$ aliquot_id : chr [1:60] "a7309f60-5fb7-4cd9-90e8-54d850b4b241" "cabf7b00-4cb0-4873-a632-a5ffc72df2c8" "ee465ac7-c842-4337-8bd4-58fb49fce9bb" "6178dbb9-01d2-4ba3-8315-f2add186879a" ...
## $ counts : num [1:20531, 1:60] 0 3 13 272 1516 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:20531] "?|100130426" "?|100133144" "?|100134869" "?|10357" ...
## .. ..$ : chr [1:60] "NT.1" "NT.2" "NT.3" "NT.4" ...
You can see row and column names of the loaded data.frame object:
attr(LUSC$counts,"dimnames") # annotation of the dimensions
rownames(LUSC$counts)[1:10]
colnames(LUSC$counts)[1:10]
There are several ways to export your data. Let’s consider the most simple.
write()
- writes a column of numbers / characterswrite.table()
- writes a data tablesave()
- saves one or several variables into a binary RData file.Parameters of write.table
are:
eol
- cheracter for the end of line (can be differ with OS). The standard one is “”dec
- decimal separatorquote
- do we put “” around character values or notrow.names
- do we put row names as a column or notLet’s select only SCC (squamous cell carcinoma) patients from Samples
and save the file locally
SamplesSCC = Samples[Samples$Type == "SCC", ]
write.table(SamplesSCC,file = "SamplesSCC.txt",sep = "\t",row.names = FALSE, quote=FALSE)
save(SamplesSCC,file="SamplesSCC.RData") # save as binary file
save(list=ls(),file="workspace.RData") # save all variables as binary file
getwd()
## [1] "B:/DOCUMENTs/Pedagogics/RTrans2018/www"
dir() # see the results
## [1] "_site" "_site.yml" "all.RData"
## [4] "DEA_HTA.txt" "images" "index.html"
## [7] "index.Rmd" "LUSC60.RData" "SamplesSCC.RData"
## [10] "SamplesSCC.txt" "scripts1.Rmd" "scripts2.Rmd"
## [13] "scripts3.Rmd" "scripts4.Rmd" "site_libs"
## [16] "website.Rproj" "workspace.RData"
Now, let’s do some exercises. You will need knowledge from the Introduction to R course. Do not hesitate to ask if some points are not clear or you do not know how to proceed! After each task you can see the functions or structures that can help solving it. You always can see the help for a function by typing ?function_name
- Annotation file http://edu.sablab.net/rt2018/data/counts_anno.txt contains over 50k of genes described in hg19 genome (canonical chromosomes only). Load it, check its structure and view the summary.
read.table
,View
,str
,summary
,head
- Build summary of gene biotypes (protein coding, lincRNA, etc.). Sort biotypes by the number of genes. Hint: you may need to transform column
gene_biotype
to factors
Data$Column
,factor
,summary
,sort
- Calculate statistics on gene length and GC-content for protein_coding and lincRNA.
X$CA[X$CB == "bbb"]
,summary
- Save annotation for protein coding genes and lincRNAs into 2 separate files.
X[X$CB == "bbb",]
,write.table