1.1. Current folder

R can import data from local storage or Internet, and export it locally. Let us first setup the current folder, where our results will be stored.

rm(list=ls()) ## clean memory

getwd()   # shows current folder
dir()     # shows files in the current folder
setwd("D:/Data/R") # sets the current folder

 

1.2. Read values from a text file

Here is a file with some sample annotation Samples.

Theoretically, we can read values from unformatted text file using scan().

Samples = scan("http://edu.sablab.net/rt2018/data/samples.txt",what = "",sep="\n") # what - defines the value class
head(Samples)
## [1] "ID\tType\tState\tPatient\tGroup\tHTA.file\tHTA.size\tHTA.date\tCuffNum\tFolders\tSample\tComment\tFolder Tgene\tLane"      
## [2] "AN387\tAC\tNormal\t387\tAN\t387N_(HTA-2_0).CEL\t69392593\t07/02/2014\t0\tA1\tAN1\t\t110302_SN386_0088_AB0198ABXX\t2"       
## [3] "AT387\tAC\tTumor\t387\tAT\t387_(HTA-2_0).CEL\t69432921\t07/02/2014\t0\tA1\tAC1\tchanged\t110302_SN386_0088_AB0198ABXX\t1"  
## [4] "AN414\tAC\tNormal\t414\tAN\t414N_(HTA-2_0).CEL\t69310345\t20/02/2014\t1\tA2\tAN2\t\t110330_SN388_0226_AB0095ABXX\t2"       
## [5] "AT414\tAC\tTumor\t414\tAT\t414_(HTA-2_0).CEL\t69349997\t20/02/2014\t1\tA2\tAC2\t\t110330_SN388_0226_AB0095ABXX\t1"         
## [6] "AN451\tAC\tNormal\t451\tAN\t451N_(HTA-2_0).CEL\t69368085\t13/02/2014\t2\tA3\tAN3\tchanged\t110302_SN386_0088_AB0198ABXX\t4"

Not very informative. But you can download an entire webpage by scan to parse it afterwards. It’s funny, but we need to get readable data.

One example where scan() fits perfectly - list of genes. Let’s load oncogenes.txt

oncogenes = scan("http://edu.sablab.net/rt2018/data/oncogenes.txt",what = "",sep="\n") # what - defines the value class
print(oncogenes)
##  [1] "SOX2"    "PDGFRA"  "FGFR1"   "WHSC1L1" "CDKN2A"  "TP53"    "NFE2L2" 
##  [8] "KEAP1"   "ADGRB3"  "FBXW7"   "GRM8"    "MUC16"   "RUNX1T1" "STK11"  
## [15] "ERBB4"   "DDR2"    "FGFR1"

 

1.3. Read tables

We will use read.table() to import the data as a data frame.

ID Type State Patient Group HTA.file HTA.size HTA.date CuffNum Folders Sample Comment Folder.Tgene Lane
AN387 AC Normal 387 AN 387N_(HTA-2_0).CEL 69392593 07/02/2014 0 A1 AN1 110302_SN386_0088_AB0198ABXX 2
AT387 AC Tumor 387 AT 387_(HTA-2_0).CEL 69432921 07/02/2014 0 A1 AC1 changed 110302_SN386_0088_AB0198ABXX 1
AN414 AC Normal 414 AN 414N_(HTA-2_0).CEL 69310345 20/02/2014 1 A2 AN2 110330_SN388_0226_AB0095ABXX 2
AT414 AC Tumor 414 AT 414_(HTA-2_0).CEL 69349997 20/02/2014 1 A2 AC2 110330_SN388_0226_AB0095ABXX 1
AN451 AC Normal 451 AN 451N_(HTA-2_0).CEL 69368085 13/02/2014 2 A3 AN3 changed 110302_SN386_0088_AB0198ABXX 4

Some parameters are important in read.table():

Samples = read.table("http://edu.sablab.net/rt2018/data/samples.txt", header=T, sep="\t", as.is=T) 
str(Samples)
## 'data.frame':    32 obs. of  14 variables:
##  $ ID          : chr  "AN387" "AT387" "AN414" "AT414" ...
##  $ Type        : chr  "AC" "AC" "AC" "AC" ...
##  $ State       : chr  "Normal" "Tumor" "Normal" "Tumor" ...
##  $ Patient     : int  387 387 414 414 451 451 562 562 596 596 ...
##  $ Group       : chr  "AN" "AT" "AN" "AT" ...
##  $ HTA.file    : chr  "387N_(HTA-2_0).CEL" "387_(HTA-2_0).CEL" "414N_(HTA-2_0).CEL" "414_(HTA-2_0).CEL" ...
##  $ HTA.size    : int  69392593 69432921 69310345 69349997 69368085 69369445 69342145 69343933 69324581 69304733 ...
##  $ HTA.date    : chr  "07/02/2014" "07/02/2014" "20/02/2014" "20/02/2014" ...
##  $ CuffNum     : int  0 0 1 1 2 2 3 3 7 7 ...
##  $ Folders     : chr  "A1" "A1" "A2" "A2" ...
##  $ Sample      : chr  "AN1" "AC1" "AN2" "AC2" ...
##  $ Comment     : chr  "" "changed" "" "" ...
##  $ Folder.Tgene: chr  "110302_SN386_0088_AB0198ABXX" "110302_SN386_0088_AB0198ABXX" "110330_SN388_0226_AB0095ABXX" "110330_SN388_0226_AB0095ABXX" ...
##  $ Lane        : int  2 1 2 1 4 3 2 1 7 6 ...

Do not forget functions that allow you seeing, what is inside your data:

head(Samples)
##      ID Type  State Patient Group           HTA.file HTA.size   HTA.date
## 1 AN387   AC Normal     387    AN 387N_(HTA-2_0).CEL 69392593 07/02/2014
## 2 AT387   AC  Tumor     387    AT  387_(HTA-2_0).CEL 69432921 07/02/2014
## 3 AN414   AC Normal     414    AN 414N_(HTA-2_0).CEL 69310345 20/02/2014
## 4 AT414   AC  Tumor     414    AT  414_(HTA-2_0).CEL 69349997 20/02/2014
## 5 AN451   AC Normal     451    AN 451N_(HTA-2_0).CEL 69368085 13/02/2014
## 6 AT451   AC  Tumor     451    AT  451_(HTA-2_0).CEL 69369445 13/02/2014
##   CuffNum Folders Sample Comment                 Folder.Tgene Lane
## 1       0      A1    AN1         110302_SN386_0088_AB0198ABXX    2
## 2       0      A1    AC1 changed 110302_SN386_0088_AB0198ABXX    1
## 3       1      A2    AN2         110330_SN388_0226_AB0095ABXX    2
## 4       1      A2    AC2         110330_SN388_0226_AB0095ABXX    1
## 5       2      A3    AN3 changed 110302_SN386_0088_AB0198ABXX    4
## 6       2      A3    AC3         110302_SN386_0088_AB0198ABXX    3
summary(Samples)
##       ID                Type              State              Patient     
##  Length:32          Length:32          Length:32          Min.   :327.0  
##  Class :character   Class :character   Class :character   1st Qu.:378.8  
##  Mode  :character   Mode  :character   Mode  :character   Median :449.5  
##                                                           Mean   :478.4  
##                                                           3rd Qu.:600.2  
##                                                           Max.   :679.0  
##     Group             HTA.file            HTA.size       
##  Length:32          Length:32          Min.   :69274325  
##  Class :character   Class :character   1st Qu.:69337531  
##  Mode  :character   Mode  :character   Median :69368697  
##                                        Mean   :69379888  
##                                        3rd Qu.:69392923  
##                                        Max.   :69584273  
##    HTA.date            CuffNum        Folders             Sample         
##  Length:32          Min.   :0.000   Length:32          Length:32         
##  Class :character   1st Qu.:2.000   Class :character   Class :character  
##  Mode  :character   Median :4.000   Mode  :character   Mode  :character  
##                     Mean   :4.125                                        
##                     3rd Qu.:6.250                                        
##                     Max.   :9.000                                        
##    Comment          Folder.Tgene            Lane      
##  Length:32          Length:32          Min.   :1.000  
##  Class :character   Class :character   1st Qu.:2.000  
##  Mode  :character   Mode  :character   Median :4.500  
##                                        Mean   :4.406  
##                                        3rd Qu.:6.250  
##                                        Max.   :8.000
View(Samples)

 

1.4. Read values from binary file

R can keep data in GZip-ed form, automatically loading the variables into memory. Such files have .RData extension. This is a fast & easy way to store your data. Let us first download the data in RData format into you working directory using download.file() and then load it by load(). Parameters of downloading:

Example: Data from TCGA for 40 lung squamous cell carcinoma samples and 20 controls.

download.file("http://edu.sablab.net/rt2018/data/LUSC60.RData", destfile="LUSC60.RData",mode = "wb")

getwd()                # show current folder
## [1] "B:/DOCUMENTs/Pedagogics/RTrans2018/www"
dir(pattern=".RData")  # show files in the current folder
## [1] "all.RData"        "LUSC60.RData"     "SamplesSCC.RData"
## [4] "workspace.RData"
load("LUSC60.RData")   # load the data

ls()                   # you should see 'LUSC' object among variables
## [1] "LUSC"      "oncogenes" "Samples"
str(LUSC)
## List of 5
##  $ nsamples : num 60
##  $ nfeatures: int 20531
##  $ anno     :'data.frame':   20531 obs. of  2 variables:
##   ..$ gene    : chr [1:20531] "?|100130426" "?|100133144" "?|100134869" "?|10357" ...
##   ..$ location: chr [1:20531] "chr9:79791679-79792806:+" "chr15:82647286-82708202:+;chr15:83023773-83084727:+" "chr15:82647286-82707815:+;chr15:83023773-83084340:+" "chr20:56064083-56063450:-" ...
##  $ meta     :'data.frame':   60 obs. of  5 variables:
##   ..$ id         : chr [1:60] "NT.1" "NT.2" "NT.3" "NT.4" ...
##   ..$ disease    : chr [1:60] "LUSC" "LUSC" "LUSC" "LUSC" ...
##   ..$ sample_type: chr [1:60] "NT" "NT" "NT" "NT" ...
##   ..$ platform   : chr [1:60] "ILLUMINA" "ILLUMINA" "ILLUMINA" "ILLUMINA" ...
##   ..$ aliquot_id : chr [1:60] "a7309f60-5fb7-4cd9-90e8-54d850b4b241" "cabf7b00-4cb0-4873-a632-a5ffc72df2c8" "ee465ac7-c842-4337-8bd4-58fb49fce9bb" "6178dbb9-01d2-4ba3-8315-f2add186879a" ...
##  $ counts   : num [1:20531, 1:60] 0 3 13 272 1516 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:20531] "?|100130426" "?|100133144" "?|100134869" "?|10357" ...
##   .. ..$ : chr [1:60] "NT.1" "NT.2" "NT.3" "NT.4" ...

You can see row and column names of the loaded data.frame object:

attr(LUSC$counts,"dimnames")    # annotation of the dimensions
rownames(LUSC$counts)[1:10]
colnames(LUSC$counts)[1:10]

 

1.5. Data export

There are several ways to export your data. Let’s consider the most simple.

Parameters of write.table are:

Let’s select only SCC (squamous cell carcinoma) patients from Samples and save the file locally

SamplesSCC = Samples[Samples$Type == "SCC", ]

write.table(SamplesSCC,file = "SamplesSCC.txt",sep = "\t",row.names = FALSE, quote=FALSE)

save(SamplesSCC,file="SamplesSCC.RData") # save as binary file

save(list=ls(),file="workspace.RData") # save all variables as binary file

getwd()
## [1] "B:/DOCUMENTs/Pedagogics/RTrans2018/www"
dir()      # see the results
##  [1] "_site"            "_site.yml"        "all.RData"       
##  [4] "DEA_HTA.txt"      "images"           "index.html"      
##  [7] "index.Rmd"        "LUSC60.RData"     "SamplesSCC.RData"
## [10] "SamplesSCC.txt"   "scripts1.Rmd"     "scripts2.Rmd"    
## [13] "scripts3.Rmd"     "scripts4.Rmd"     "site_libs"       
## [16] "website.Rproj"    "workspace.RData"

Now, let’s do some exercises. You will need knowledge from the Introduction to R course. Do not hesitate to ask if some points are not clear or you do not know how to proceed! After each task you can see the functions or structures that can help solving it. You always can see the help for a function by typing ?function_name


Exercises

  1. Annotation file http://edu.sablab.net/rt2018/data/counts_anno.txt contains over 50k of genes described in hg19 genome (canonical chromosomes only). Load it, check its structure and view the summary.

read.table, View, str, summary, head

  1. Build summary of gene biotypes (protein coding, lincRNA, etc.). Sort biotypes by the number of genes. Hint: you may need to transform column gene_biotype to factors

Data$Column, factor, summary, sort

  1. Calculate statistics on gene length and GC-content for protein_coding and lincRNA.

X$CA[X$CB == "bbb"], summary

  1. Save annotation for protein coding genes and lincRNAs into 2 separate files.

X[X$CB == "bbb",], write.table


LIH