1.1. Current folder

R can import data from local storage or Internet, and export it locally. Let us first setup the current folder, where our results will be stored.

rm(list=ls()) ## clean memory

getwd()   # shows current folder
dir()     # shows files in the current folder
setwd("D:/Data/R") # sets the current folder

1.2. Read values from a text file

Here is a file with some sample annotation Samples.

Theoretically, we can read values from unformatted text file using scan().

Samples = scan("http://edu.sablab.net/rt2018/data/samples.txt",what = "",sep="\n") # what - defines the value class
head(Samples)

## [1] "ID\tType\tState\tPatient\tGroup\tHTA.file\tHTA.size\tHTA.date\tCuffNum\tFolders\tSample\tComment\tFolder Tgene\tLane"      
## [2] "AN387\tAC\tNormal\t387\tAN\t387N_(HTA-2_0).CEL\t69392593\t07/02/2014\t0\tA1\tAN1\t\t110302_SN386_0088_AB0198ABXX\t2"       
## [3] "AT387\tAC\tTumor\t387\tAT\t387_(HTA-2_0).CEL\t69432921\t07/02/2014\t0\tA1\tAC1\tchanged\t110302_SN386_0088_AB0198ABXX\t1"  
## [4] "AN414\tAC\tNormal\t414\tAN\t414N_(HTA-2_0).CEL\t69310345\t20/02/2014\t1\tA2\tAN2\t\t110330_SN388_0226_AB0095ABXX\t2"       
## [5] "AT414\tAC\tTumor\t414\tAT\t414_(HTA-2_0).CEL\t69349997\t20/02/2014\t1\tA2\tAC2\t\t110330_SN388_0226_AB0095ABXX\t1"         
## [6] "AN451\tAC\tNormal\t451\tAN\t451N_(HTA-2_0).CEL\t69368085\t13/02/2014\t2\tA3\tAN3\tchanged\t110302_SN386_0088_AB0198ABXX\t4"

Not very informative. But you can download an entire webpage by scan to parse it afterwards. It’s funny, but we need to get readable data.

One example where scan() fits perfectly - list of genes. Let’s load oncogenes.txt

oncogenes = scan("http://edu.sablab.net/rt2018/data/oncogenes.txt",what = "",sep="\n") # what - defines the value class
print(oncogenes)

##  [1] "SOX2"    "PDGFRA"  "FGFR1"   "WHSC1L1" "CDKN2A"  "TP53"    "NFE2L2" 
##  [8] "KEAP1"   "ADGRB3"  "FBXW7"   "GRM8"    "MUC16"   "RUNX1T1" "STK11"  
## [15] "ERBB4"   "DDR2"    "FGFR1"

1.3. Read tables

We will use read.table() to import the data as a data frame.

ID	Type	State	Patient	Group	HTA.file	HTA.size	HTA.date	CuffNum	Folders	Sample	Comment	Folder.Tgene	Lane
AN387	AC	Normal	387	AN	387N_(HTA-2_0).CEL	69392593	07/02/2014	0	A1	AN1		110302_SN386_0088_AB0198ABXX	2
AT387	AC	Tumor	387	AT	387_(HTA-2_0).CEL	69432921	07/02/2014	0	A1	AC1	changed	110302_SN386_0088_AB0198ABXX	1
AN414	AC	Normal	414	AN	414N_(HTA-2_0).CEL	69310345	20/02/2014	1	A2	AN2		110330_SN388_0226_AB0095ABXX	2
AT414	AC	Tumor	414	AT	414_(HTA-2_0).CEL	69349997	20/02/2014	1	A2	AC2		110330_SN388_0226_AB0095ABXX	1
AN451	AC	Normal	451	AN	451N_(HTA-2_0).CEL	69368085	13/02/2014	2	A3	AN3	changed	110302_SN386_0088_AB0198ABXX	4

Some parameters are important in read.table():

header - set it TRUE if there is a header line
sep - separator character. "\t" stands for tabulation
as.is - prevents transforming character columns to factors.
row.names - should the first column be considered as rownames?

Samples = read.table("http://edu.sablab.net/rt2018/data/samples.txt", header=T, sep="\t", as.is=T) 
str(Samples)

## 'data.frame':    32 obs. of  14 variables:
##  $ ID          : chr  "AN387" "AT387" "AN414" "AT414" ...
##  $ Type        : chr  "AC" "AC" "AC" "AC" ...
##  $ State       : chr  "Normal" "Tumor" "Normal" "Tumor" ...
##  $ Patient     : int  387 387 414 414 451 451 562 562 596 596 ...
##  $ Group       : chr  "AN" "AT" "AN" "AT" ...
##  $ HTA.file    : chr  "387N_(HTA-2_0).CEL" "387_(HTA-2_0).CEL" "414N_(HTA-2_0).CEL" "414_(HTA-2_0).CEL" ...
##  $ HTA.size    : int  69392593 69432921 69310345 69349997 69368085 69369445 69342145 69343933 69324581 69304733 ...
##  $ HTA.date    : chr  "07/02/2014" "07/02/2014" "20/02/2014" "20/02/2014" ...
##  $ CuffNum     : int  0 0 1 1 2 2 3 3 7 7 ...
##  $ Folders     : chr  "A1" "A1" "A2" "A2" ...
##  $ Sample      : chr  "AN1" "AC1" "AN2" "AC2" ...
##  $ Comment     : chr  "" "changed" "" "" ...
##  $ Folder.Tgene: chr  "110302_SN386_0088_AB0198ABXX" "110302_SN386_0088_AB0198ABXX" "110330_SN388_0226_AB0095ABXX" "110330_SN388_0226_AB0095ABXX" ...
##  $ Lane        : int  2 1 2 1 4 3 2 1 7 6 ...

Do not forget functions that allow you seeing, what is inside your data:

head(Samples)

##      ID Type  State Patient Group           HTA.file HTA.size   HTA.date
## 1 AN387   AC Normal     387    AN 387N_(HTA-2_0).CEL 69392593 07/02/2014
## 2 AT387   AC  Tumor     387    AT  387_(HTA-2_0).CEL 69432921 07/02/2014
## 3 AN414   AC Normal     414    AN 414N_(HTA-2_0).CEL 69310345 20/02/2014
## 4 AT414   AC  Tumor     414    AT  414_(HTA-2_0).CEL 69349997 20/02/2014
## 5 AN451   AC Normal     451    AN 451N_(HTA-2_0).CEL 69368085 13/02/2014
## 6 AT451   AC  Tumor     451    AT  451_(HTA-2_0).CEL 69369445 13/02/2014
##   CuffNum Folders Sample Comment                 Folder.Tgene Lane
## 1       0      A1    AN1         110302_SN386_0088_AB0198ABXX    2
## 2       0      A1    AC1 changed 110302_SN386_0088_AB0198ABXX    1
## 3       1      A2    AN2         110330_SN388_0226_AB0095ABXX    2
## 4       1      A2    AC2         110330_SN388_0226_AB0095ABXX    1
## 5       2      A3    AN3 changed 110302_SN386_0088_AB0198ABXX    4
## 6       2      A3    AC3         110302_SN386_0088_AB0198ABXX    3

summary(Samples)

##       ID                Type              State              Patient     
##  Length:32          Length:32          Length:32          Min.   :327.0  
##  Class :character   Class :character   Class :character   1st Qu.:378.8  
##  Mode  :character   Mode  :character   Mode  :character   Median :449.5  
##                                                           Mean   :478.4  
##                                                           3rd Qu.:600.2  
##                                                           Max.   :679.0  
##     Group             HTA.file            HTA.size       
##  Length:32          Length:32          Min.   :69274325  
##  Class :character   Class :character   1st Qu.:69337531  
##  Mode  :character   Mode  :character   Median :69368697  
##                                        Mean   :69379888  
##                                        3rd Qu.:69392923  
##                                        Max.   :69584273  
##    HTA.date            CuffNum        Folders             Sample         
##  Length:32          Min.   :0.000   Length:32          Length:32         
##  Class :character   1st Qu.:2.000   Class :character   Class :character  
##  Mode  :character   Median :4.000   Mode  :character   Mode  :character  
##                     Mean   :4.125                                        
##                     3rd Qu.:6.250                                        
##                     Max.   :9.000                                        
##    Comment          Folder.Tgene            Lane      
##  Length:32          Length:32          Min.   :1.000  
##  Class :character   Class :character   1st Qu.:2.000  
##  Mode  :character   Mode  :character   Median :4.500  
##                                        Mean   :4.406  
##                                        3rd Qu.:6.250  
##                                        Max.   :8.000

View(Samples)

1.4. Read values from binary file

R can keep data in GZip-ed form, automatically loading the variables into memory. Such files have .RData extension. This is a fast & easy way to store your data. Let us first download the data in RData format into you working directory using download.file() and then load it by load(). Parameters of downloading:

destfile - the file name, under which you would like to store the downloaded file.
mode - the way you would like to treat the data (as text or binary). To keep binary data unchanged, use wb!

Example: Data from TCGA for 40 lung squamous cell carcinoma samples and 20 controls.

download.file("http://edu.sablab.net/rt2018/data/LUSC60.RData", destfile="LUSC60.RData",mode = "wb")

getwd()                # show current folder

## [1] "B:/DOCUMENTs/Pedagogics/RTrans2018/www"

dir(pattern=".RData")  # show files in the current folder

## [1] "all.RData"        "LUSC60.RData"     "SamplesSCC.RData"
## [4] "workspace.RData"

load("LUSC60.RData")   # load the data

ls()                   # you should see 'LUSC' object among variables

## [1] "LUSC"      "oncogenes" "Samples"

str(LUSC)

## List of 5
##  $ nsamples : num 60
##  $ nfeatures: int 20531
##  $ anno     :'data.frame':   20531 obs. of  2 variables:
##   ..$ gene    : chr [1:20531] "?|100130426" "?|100133144" "?|100134869" "?|10357" ...
##   ..$ location: chr [1:20531] "chr9:79791679-79792806:+" "chr15:82647286-82708202:+;chr15:83023773-83084727:+" "chr15:82647286-82707815:+;chr15:83023773-83084340:+" "chr20:56064083-56063450:-" ...
##  $ meta     :'data.frame':   60 obs. of  5 variables:
##   ..$ id         : chr [1:60] "NT.1" "NT.2" "NT.3" "NT.4" ...
##   ..$ disease    : chr [1:60] "LUSC" "LUSC" "LUSC" "LUSC" ...
##   ..$ sample_type: chr [1:60] "NT" "NT" "NT" "NT" ...
##   ..$ platform   : chr [1:60] "ILLUMINA" "ILLUMINA" "ILLUMINA" "ILLUMINA" ...
##   ..$ aliquot_id : chr [1:60] "a7309f60-5fb7-4cd9-90e8-54d850b4b241" "cabf7b00-4cb0-4873-a632-a5ffc72df2c8" "ee465ac7-c842-4337-8bd4-58fb49fce9bb" "6178dbb9-01d2-4ba3-8315-f2add186879a" ...
##  $ counts   : num [1:20531, 1:60] 0 3 13 272 1516 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:20531] "?|100130426" "?|100133144" "?|100134869" "?|10357" ...
##   .. ..$ : chr [1:60] "NT.1" "NT.2" "NT.3" "NT.4" ...

You can see row and column names of the loaded data.frame object:

attr(LUSC$counts,"dimnames")    # annotation of the dimensions
rownames(LUSC$counts)[1:10]
colnames(LUSC$counts)[1:10]

1.5. Data export

There are several ways to export your data. Let’s consider the most simple.

write() - writes a column of numbers / characters
write.table() - writes a data table
save() - saves one or several variables into a binary RData file.

Parameters of write.table are:

eol - cheracter for the end of line (can be differ with OS). The standard one is “”
dec - decimal separator
quote - do we put “” around character values or not
row.names - do we put row names as a column or not

Let’s select only SCC (squamous cell carcinoma) patients from Samples and save the file locally

SamplesSCC = Samples[Samples$Type == "SCC", ]

write.table(SamplesSCC,file = "SamplesSCC.txt",sep = "\t",row.names = FALSE, quote=FALSE)

save(SamplesSCC,file="SamplesSCC.RData") # save as binary file

save(list=ls(),file="workspace.RData") # save all variables as binary file

getwd()

## [1] "B:/DOCUMENTs/Pedagogics/RTrans2018/www"

dir()      # see the results

##  [1] "_site"            "_site.yml"        "all.RData"       
##  [4] "DEA_HTA.txt"      "images"           "index.html"      
##  [7] "index.Rmd"        "LUSC60.RData"     "SamplesSCC.RData"
## [10] "SamplesSCC.txt"   "scripts1.Rmd"     "scripts2.Rmd"    
## [13] "scripts3.Rmd"     "scripts4.Rmd"     "site_libs"       
## [16] "website.Rproj"    "workspace.RData"

Now, let’s do some exercises. You will need knowledge from the Introduction to R course. Do not hesitate to ask if some points are not clear or you do not know how to proceed! After each task you can see the functions or structures that can help solving it. You always can see the help for a function by typing ?function_name

Exercises

Annotation file http://edu.sablab.net/rt2018/data/counts_anno.txt contains over 50k of genes described in hg19 genome (canonical chromosomes only). Load it, check its structure and view the summary.

read.table, View, str, summary, head

Build summary of gene biotypes (protein coding, lincRNA, etc.). Sort biotypes by the number of genes. Hint: you may need to transform column gene_biotype to factors

Data$Column, factor, summary, sort

Calculate statistics on gene length and GC-content for protein_coding and lincRNA.

X$CA[X$CB == "bbb"], summary

Save annotation for protein coding genes and lincRNAs into 2 separate files.

X[X$CB == "bbb",], write.table

Import and Export

Petr V. Nazarov, LIH

2018-03-05