1.1. Current folder

R can import data from local storage or Internet, and export it locally. Let us first setup the current folder, where our results will be stored.

rm(list=ls()) ## clean memory

getwd()   # shows current folder
dir()     # shows files in the current folder
setwd("D:/Data/R") # sets the current folder

1.2. Read values from a text file

Here is a file with some marker genes Markers.

We can read values from an unformatted text file using scan().

markers = scan("http://edu.sablab.net/rp2019/data/markers.txt",what = "",sep="\n") # what - defines the value class
print(markers)

##  [1] "Gfap"    "Grid1"   "Grin2b"  "Mag"     "Mki67"   "Msi1"    "Nes"    
##  [8] "Plp1"    "Slc6a1"  "Slc6a11" "Sox2"    "Syp"

You can download an entire webpage by scan to parse it afterwards.

1.3. Read tables

We will use read.table() to import the data as a data frame. Let’s try on sample annotation file: sampleAnnotation.txt.

Some parameters are important in read.table():

header - set it TRUE if there is a header line
sep - separator character. "\t" stands for tabulation
as.is - prevents transforming character columns to factors.
row.names - should the first column be considered as rownames?

Samples = read.table("http://edu.sablab.net/rp2019/data/sampleAnnotation.txt", header=T, sep="\t", as.is=T, row.names = 1)

	Time	Replicate
BO2day4rep1	day04	rep1
BO2day4rep2	day04	rep2
BO2day4rep3	day04	rep3
BO2day4rep4	day04	rep4
BO2day4rep5	day04	rep5
BO2day10rep1	day10	rep1
BO2day10rep2	day10	rep2
BO2day10rep3	day10	rep3
BO2day10rep4	day10	rep4
BO2day10rep5	day10	rep5
BO2day14rep1	day14	rep1
BO2day14rep2	day14	rep2
BO2day14rep3	day14	rep3
BO2day14rep4	day14	rep4
BO2day14rep5	day14	rep5
BO2day21rep1	day21	rep1
BO2day21rep2	day21	rep2
BO2day21rep3	day21	rep3
BO2day21rep4	day21	rep4

Do not forget functions that allow you seeing, what is inside your data:

str(Samples)

## 'data.frame':    19 obs. of  2 variables:
##  $ Time     : chr  "day04" "day04" "day04" "day04" ...
##  $ Replicate: chr  "rep1" "rep2" "rep3" "rep4" ...

head(Samples)

##               Time Replicate
## BO2day4rep1  day04      rep1
## BO2day4rep2  day04      rep2
## BO2day4rep3  day04      rep3
## BO2day4rep4  day04      rep4
## BO2day4rep5  day04      rep5
## BO2day10rep1 day10      rep1

summary(Samples)

##      Time            Replicate        
##  Length:19          Length:19         
##  Class :character   Class :character  
##  Mode  :character   Mode  :character

View(Samples)

You can see row and column names of the loaded data.frame object:

rownames(Samples)

##  [1] "BO2day4rep1"  "BO2day4rep2"  "BO2day4rep3"  "BO2day4rep4" 
##  [5] "BO2day4rep5"  "BO2day10rep1" "BO2day10rep2" "BO2day10rep3"
##  [9] "BO2day10rep4" "BO2day10rep5" "BO2day14rep1" "BO2day14rep2"
## [13] "BO2day14rep3" "BO2day14rep4" "BO2day14rep5" "BO2day21rep1"
## [17] "BO2day21rep2" "BO2day21rep3" "BO2day21rep4"

colnames(Samples)

## [1] "Time"      "Replicate"

1.4. Read and extract values from a MaxQuant file

1.4.1. Reading data table

Here we will try to read protein group file generated by MaxQuant. Let us first do it manually first. We will work with data from rat brain organoid experiment. The data originated from Elena Martinez’ work in collaboration with Rolf Bjerkvig. You can download the proteinGroups.txt file to use it locally, or import it directly from Internet.

file = "D:/Data/RProteo2019/proteinGroups.txt"
## uncomment the following line to import from Internet
# file = "http://edu.sablab.net/rp2019/data/proteinGroups.txt"
Tab = read.table(file, header=T, sep="\t", as.is=T, row.names = 1) 

## see number of rows and columns
dim(Tab)

## [1] 5460  185

The table contains too many columns to be properly seen after str(). Let’s show all column names:

colnames(Tab)

##   [1] "Majority.protein.IDs"                
##   [2] "Peptide.counts..all."                
##   [3] "Peptide.counts..razor.unique."       
##   [4] "Peptide.counts..unique."             
##   [5] "Protein.names"                       
##   [6] "Gene.names"                          
##   [7] "Fasta.headers"                       
##   [8] "Number.of.proteins"                  
##   [9] "Peptides"                            
##  [10] "Razor...unique.peptides"             
##  [11] "Unique.peptides"                     
##  [12] "Peptides.BO2day10rep1"               
##  [13] "Peptides.BO2day10rep2"               
##  [14] "Peptides.BO2day10rep3"               
##  [15] "Peptides.BO2day10rep4"               
##  [16] "Peptides.BO2day10rep5"               
##  [17] "Peptides.BO2day14rep1"               
##  [18] "Peptides.BO2day14rep2"               
##  [19] "Peptides.BO2day14rep3"               
##  [20] "Peptides.BO2day14rep4"               
##  [21] "Peptides.BO2day14rep5"               
##  [22] "Peptides.BO2day21rep1"               
##  [23] "Peptides.BO2day21rep2"               
##  [24] "Peptides.BO2day21rep3"               
##  [25] "Peptides.BO2day21rep4"               
##  [26] "Peptides.BO2day4rep1"                
##  [27] "Peptides.BO2day4rep2"                
##  [28] "Peptides.BO2day4rep3"                
##  [29] "Peptides.BO2day4rep4"                
##  [30] "Peptides.BO2day4rep5"                
##  [31] "Razor...unique.peptides.BO2day10rep1"
##  [32] "Razor...unique.peptides.BO2day10rep2"
##  [33] "Razor...unique.peptides.BO2day10rep3"
##  [34] "Razor...unique.peptides.BO2day10rep4"
##  [35] "Razor...unique.peptides.BO2day10rep5"
##  [36] "Razor...unique.peptides.BO2day14rep1"
##  [37] "Razor...unique.peptides.BO2day14rep2"
##  [38] "Razor...unique.peptides.BO2day14rep3"
##  [39] "Razor...unique.peptides.BO2day14rep4"
##  [40] "Razor...unique.peptides.BO2day14rep5"
##  [41] "Razor...unique.peptides.BO2day21rep1"
##  [42] "Razor...unique.peptides.BO2day21rep2"
##  [43] "Razor...unique.peptides.BO2day21rep3"
##  [44] "Razor...unique.peptides.BO2day21rep4"
##  [45] "Razor...unique.peptides.BO2day4rep1" 
##  [46] "Razor...unique.peptides.BO2day4rep2" 
##  [47] "Razor...unique.peptides.BO2day4rep3" 
##  [48] "Razor...unique.peptides.BO2day4rep4" 
##  [49] "Razor...unique.peptides.BO2day4rep5" 
##  [50] "Unique.peptides.BO2day10rep1"        
##  [51] "Unique.peptides.BO2day10rep2"        
##  [52] "Unique.peptides.BO2day10rep3"        
##  [53] "Unique.peptides.BO2day10rep4"        
##  [54] "Unique.peptides.BO2day10rep5"        
##  [55] "Unique.peptides.BO2day14rep1"        
##  [56] "Unique.peptides.BO2day14rep2"        
##  [57] "Unique.peptides.BO2day14rep3"        
##  [58] "Unique.peptides.BO2day14rep4"        
##  [59] "Unique.peptides.BO2day14rep5"        
##  [60] "Unique.peptides.BO2day21rep1"        
##  [61] "Unique.peptides.BO2day21rep2"        
##  [62] "Unique.peptides.BO2day21rep3"        
##  [63] "Unique.peptides.BO2day21rep4"        
##  [64] "Unique.peptides.BO2day4rep1"         
##  [65] "Unique.peptides.BO2day4rep2"         
##  [66] "Unique.peptides.BO2day4rep3"         
##  [67] "Unique.peptides.BO2day4rep4"         
##  [68] "Unique.peptides.BO2day4rep5"         
##  [69] "Sequence.coverage...."               
##  [70] "Unique...razor.sequence.coverage...."
##  [71] "Unique.sequence.coverage...."        
##  [72] "Mol..weight..kDa."                   
##  [73] "Sequence.length"                     
##  [74] "Sequence.lengths"                    
##  [75] "Q.value"                             
##  [76] "Score"                               
##  [77] "Identification.type.BO2day10rep1"    
##  [78] "Identification.type.BO2day10rep2"    
##  [79] "Identification.type.BO2day10rep3"    
##  [80] "Identification.type.BO2day10rep4"    
##  [81] "Identification.type.BO2day10rep5"    
##  [82] "Identification.type.BO2day14rep1"    
##  [83] "Identification.type.BO2day14rep2"    
##  [84] "Identification.type.BO2day14rep3"    
##  [85] "Identification.type.BO2day14rep4"    
##  [86] "Identification.type.BO2day14rep5"    
##  [87] "Identification.type.BO2day21rep1"    
##  [88] "Identification.type.BO2day21rep2"    
##  [89] "Identification.type.BO2day21rep3"    
##  [90] "Identification.type.BO2day21rep4"    
##  [91] "Identification.type.BO2day4rep1"     
##  [92] "Identification.type.BO2day4rep2"     
##  [93] "Identification.type.BO2day4rep3"     
##  [94] "Identification.type.BO2day4rep4"     
##  [95] "Identification.type.BO2day4rep5"     
##  [96] "Sequence.coverage.BO2day10rep1...."  
##  [97] "Sequence.coverage.BO2day10rep2...."  
##  [98] "Sequence.coverage.BO2day10rep3...."  
##  [99] "Sequence.coverage.BO2day10rep4...."  
## [100] "Sequence.coverage.BO2day10rep5...."  
## [101] "Sequence.coverage.BO2day14rep1...."  
## [102] "Sequence.coverage.BO2day14rep2...."  
## [103] "Sequence.coverage.BO2day14rep3...."  
## [104] "Sequence.coverage.BO2day14rep4...."  
## [105] "Sequence.coverage.BO2day14rep5...."  
## [106] "Sequence.coverage.BO2day21rep1...."  
## [107] "Sequence.coverage.BO2day21rep2...."  
## [108] "Sequence.coverage.BO2day21rep3...."  
## [109] "Sequence.coverage.BO2day21rep4...."  
## [110] "Sequence.coverage.BO2day4rep1...."   
## [111] "Sequence.coverage.BO2day4rep2...."   
## [112] "Sequence.coverage.BO2day4rep3...."   
## [113] "Sequence.coverage.BO2day4rep4...."   
## [114] "Sequence.coverage.BO2day4rep5...."   
## [115] "Intensity"                           
## [116] "Intensity.BO2day10rep1"              
## [117] "Intensity.BO2day10rep2"              
## [118] "Intensity.BO2day10rep3"              
## [119] "Intensity.BO2day10rep4"              
## [120] "Intensity.BO2day10rep5"              
## [121] "Intensity.BO2day14rep1"              
## [122] "Intensity.BO2day14rep2"              
## [123] "Intensity.BO2day14rep3"              
## [124] "Intensity.BO2day14rep4"              
## [125] "Intensity.BO2day14rep5"              
## [126] "Intensity.BO2day21rep1"              
## [127] "Intensity.BO2day21rep2"              
## [128] "Intensity.BO2day21rep3"              
## [129] "Intensity.BO2day21rep4"              
## [130] "Intensity.BO2day4rep1"               
## [131] "Intensity.BO2day4rep2"               
## [132] "Intensity.BO2day4rep3"               
## [133] "Intensity.BO2day4rep4"               
## [134] "Intensity.BO2day4rep5"               
## [135] "LFQ.intensity.BO2day10rep1"          
## [136] "LFQ.intensity.BO2day10rep2"          
## [137] "LFQ.intensity.BO2day10rep3"          
## [138] "LFQ.intensity.BO2day10rep4"          
## [139] "LFQ.intensity.BO2day10rep5"          
## [140] "LFQ.intensity.BO2day14rep1"          
## [141] "LFQ.intensity.BO2day14rep2"          
## [142] "LFQ.intensity.BO2day14rep3"          
## [143] "LFQ.intensity.BO2day14rep4"          
## [144] "LFQ.intensity.BO2day14rep5"          
## [145] "LFQ.intensity.BO2day21rep1"          
## [146] "LFQ.intensity.BO2day21rep2"          
## [147] "LFQ.intensity.BO2day21rep3"          
## [148] "LFQ.intensity.BO2day21rep4"          
## [149] "LFQ.intensity.BO2day4rep1"           
## [150] "LFQ.intensity.BO2day4rep2"           
## [151] "LFQ.intensity.BO2day4rep3"           
## [152] "LFQ.intensity.BO2day4rep4"           
## [153] "LFQ.intensity.BO2day4rep5"           
## [154] "MS.MS.count.BO2day10rep1"            
## [155] "MS.MS.count.BO2day10rep2"            
## [156] "MS.MS.count.BO2day10rep3"            
## [157] "MS.MS.count.BO2day10rep4"            
## [158] "MS.MS.count.BO2day10rep5"            
## [159] "MS.MS.count.BO2day14rep1"            
## [160] "MS.MS.count.BO2day14rep2"            
## [161] "MS.MS.count.BO2day14rep3"            
## [162] "MS.MS.count.BO2day14rep4"            
## [163] "MS.MS.count.BO2day14rep5"            
## [164] "MS.MS.count.BO2day21rep1"            
## [165] "MS.MS.count.BO2day21rep2"            
## [166] "MS.MS.count.BO2day21rep3"            
## [167] "MS.MS.count.BO2day21rep4"            
## [168] "MS.MS.count.BO2day4rep1"             
## [169] "MS.MS.count.BO2day4rep2"             
## [170] "MS.MS.count.BO2day4rep3"             
## [171] "MS.MS.count.BO2day4rep4"             
## [172] "MS.MS.count.BO2day4rep5"             
## [173] "MS.MS.count"                         
## [174] "Only.identified.by.site"             
## [175] "Reverse"                             
## [176] "Potential.contaminant"               
## [177] "id"                                  
## [178] "Peptide.IDs"                         
## [179] "Peptide.is.razor"                    
## [180] "Mod..peptide.IDs"                    
## [181] "Evidence.IDs"                        
## [182] "MS.MS.IDs"                           
## [183] "Best.MS.MS"                          
## [184] "Oxidation..M..site.IDs"              
## [185] "Oxidation..M..site.positions"

First, we should remove bad features using information in Reverse and Potential.contaminant columns.

print("Original size:")

## [1] "Original size:"

dim(Tab)

## [1] 5460  185

ikeep = Tab$Reverse=="" & Tab$Potential.contaminant==""
Tab = Tab[ikeep,]
print("Filtered:")

## [1] "Filtered:"

dim(Tab)

## [1] 5306  185

Now, let us define a data container - it will be easier to manipulate later. We are interested in several columns to annotate the features (in addition to rownames, which is protein group IDs). Let’s keep: Protein.names, Gene.names, Fasta.headers. Sample information we shall take from

Data = list()
## get feature annotation
Data$Anno = Tab[,c("Protein.names","Gene.names","Fasta.headers")]
## get sample annotation
Data$Meta = read.table("http://edu.sablab.net/rp2019/data/sampleAnnotation.txt", header=T, sep="\t", as.is=T, row.names = 1) 

str(Data)

## List of 2
##  $ Anno:'data.frame':    5306 obs. of  3 variables:
##   ..$ Protein.names: chr [1:5306] "Synapsin-3" "" "" "" ...
##   ..$ Gene.names   : chr [1:5306] "Syn3" "Abcf2" "Morc3" "Ddx17" ...
##   ..$ Fasta.headers: chr [1:5306] "tr|A0A096MIT7|A0A096MIT7_RAT Synapsin-3 OS=Rattus norvegicus OX=10116 GN=Syn3 PE=1 SV=2;sp|O70441|SYN3_RAT Syna"| __truncated__ "tr|A0A096MIV5|A0A096MIV5_RAT ATP-binding cassette subfamily F member 2 OS=Rattus norvegicus OX=10116 GN=Abcf2 P"| __truncated__ "tr|D4A552|D4A552_RAT MORC family CW-type zinc finger 3 OS=Rattus norvegicus OX=10116 GN=Morc3 PE=1 SV=1;tr|A0A0"| __truncated__ "tr|E9PT29|E9PT29_RAT DEAD-box helicase 17 OS=Rattus norvegicus OX=10116 GN=Ddx17 PE=1 SV=3;tr|A0A096MIX2|A0A096"| __truncated__ ...
##  $ Meta:'data.frame':    19 obs. of  2 variables:
##   ..$ Time     : chr [1:19] "day04" "day04" "day04" "day04" ...
##   ..$ Replicate: chr [1:19] "rep1" "rep2" "rep3" "rep4" ...

Now, let us extract intensity data. We will use normalized LFQ intensities

## column indexes, containing LFQ intensities
icol = grep("LFQ.intensity",colnames(Tab))
## extract columns
Data$X0 = Tab[,icol]
## remove uninformative part of column names
colnames(Data$X0) = sub("LFQ.intensity.","",colnames(Data$X0))  
## matrices are faster than data.frames => transform to a matrix
Data$X0 = as.matrix(Data$X0)
str(Data)

## List of 3
##  $ Anno:'data.frame':    5306 obs. of  3 variables:
##   ..$ Protein.names: chr [1:5306] "Synapsin-3" "" "" "" ...
##   ..$ Gene.names   : chr [1:5306] "Syn3" "Abcf2" "Morc3" "Ddx17" ...
##   ..$ Fasta.headers: chr [1:5306] "tr|A0A096MIT7|A0A096MIT7_RAT Synapsin-3 OS=Rattus norvegicus OX=10116 GN=Syn3 PE=1 SV=2;sp|O70441|SYN3_RAT Syna"| __truncated__ "tr|A0A096MIV5|A0A096MIV5_RAT ATP-binding cassette subfamily F member 2 OS=Rattus norvegicus OX=10116 GN=Abcf2 P"| __truncated__ "tr|D4A552|D4A552_RAT MORC family CW-type zinc finger 3 OS=Rattus norvegicus OX=10116 GN=Morc3 PE=1 SV=1;tr|A0A0"| __truncated__ "tr|E9PT29|E9PT29_RAT DEAD-box helicase 17 OS=Rattus norvegicus OX=10116 GN=Ddx17 PE=1 SV=3;tr|A0A096MIX2|A0A096"| __truncated__ ...
##  $ Meta:'data.frame':    19 obs. of  2 variables:
##   ..$ Time     : chr [1:19] "day04" "day04" "day04" "day04" ...
##   ..$ Replicate: chr [1:19] "rep1" "rep2" "rep3" "rep4" ...
##  $ X0  : num [1:5306, 1:19] 2.46e+07 1.46e+07 1.44e+07 4.63e+08 4.78e+06 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5306] "A0A096MIT7;O70441" "A0A096MIV5;A2VD14;F1M8H5;A0A096MJ15" "D4A552;A0A096MIX0" "E9PT29;A0A096MIX2;A0A096MJW9" ...
##   .. ..$ : chr [1:19] "BO2day10rep1" "BO2day10rep2" "BO2day10rep3" "BO2day10rep4" ...

1.4.2. Preprocessing

Now we have data in the matrix Data$X0. The data are extremely right-skewed, so log-transformation is needed before using any exploratory data analysis. Let us assume, that all 0 intensities come from proteins with low detection limit (another approach is to consider them as missing values - NA, but we will not use this here).

## define ofset to avoid -Inf as 1st percentile of non-zero data
os = quantile(Data$X0[Data$X0>0],probs = 0.01, na.rm=TRUE)
## log-transorm and shift left-most values to 0
Data$LX = log2(os + Data$X0) - log2(os)
Data$LX[is.na(Data$LX)] = 0
str(Data)

## List of 4
##  $ Anno:'data.frame':    5306 obs. of  3 variables:
##   ..$ Protein.names: chr [1:5306] "Synapsin-3" "" "" "" ...
##   ..$ Gene.names   : chr [1:5306] "Syn3" "Abcf2" "Morc3" "Ddx17" ...
##   ..$ Fasta.headers: chr [1:5306] "tr|A0A096MIT7|A0A096MIT7_RAT Synapsin-3 OS=Rattus norvegicus OX=10116 GN=Syn3 PE=1 SV=2;sp|O70441|SYN3_RAT Syna"| __truncated__ "tr|A0A096MIV5|A0A096MIV5_RAT ATP-binding cassette subfamily F member 2 OS=Rattus norvegicus OX=10116 GN=Abcf2 P"| __truncated__ "tr|D4A552|D4A552_RAT MORC family CW-type zinc finger 3 OS=Rattus norvegicus OX=10116 GN=Morc3 PE=1 SV=1;tr|A0A0"| __truncated__ "tr|E9PT29|E9PT29_RAT DEAD-box helicase 17 OS=Rattus norvegicus OX=10116 GN=Ddx17 PE=1 SV=3;tr|A0A096MIX2|A0A096"| __truncated__ ...
##  $ Meta:'data.frame':    19 obs. of  2 variables:
##   ..$ Time     : chr [1:19] "day04" "day04" "day04" "day04" ...
##   ..$ Replicate: chr [1:19] "rep1" "rep2" "rep3" "rep4" ...
##  $ X0  : num [1:5306, 1:19] 2.46e+07 1.46e+07 1.44e+07 4.63e+08 4.78e+06 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5306] "A0A096MIT7;O70441" "A0A096MIV5;A2VD14;F1M8H5;A0A096MJ15" "D4A552;A0A096MIX0" "E9PT29;A0A096MIX2;A0A096MJW9" ...
##   .. ..$ : chr [1:19] "BO2day10rep1" "BO2day10rep2" "BO2day10rep3" "BO2day10rep4" ...
##  $ LX  : num [1:5306, 1:19] 4.87 4.15 4.14 9.06 2.7 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5306] "A0A096MIT7;O70441" "A0A096MIV5;A2VD14;F1M8H5;A0A096MJ15" "D4A552;A0A096MIX0" "E9PT29;A0A096MIX2;A0A096MJW9" ...
##   .. ..$ : chr [1:19] "BO2day10rep1" "BO2day10rep2" "BO2day10rep3" "BO2day10rep4" ...

Finaly, let’s order the samples in the way, defined in Data$Meta.

## check consistancy
if (!all(colnames(Data$X0) %in% rownames(Data$Meta)))
  stop("Error! Sample annotation in Meta does not fit such in the data!")
Data$X0 = Data$X[,rownames(Data$Meta)]
Data$LX = Data$X[,rownames(Data$Meta)]
str(Data)

## List of 4
##  $ Anno:'data.frame':    5306 obs. of  3 variables:
##   ..$ Protein.names: chr [1:5306] "Synapsin-3" "" "" "" ...
##   ..$ Gene.names   : chr [1:5306] "Syn3" "Abcf2" "Morc3" "Ddx17" ...
##   ..$ Fasta.headers: chr [1:5306] "tr|A0A096MIT7|A0A096MIT7_RAT Synapsin-3 OS=Rattus norvegicus OX=10116 GN=Syn3 PE=1 SV=2;sp|O70441|SYN3_RAT Syna"| __truncated__ "tr|A0A096MIV5|A0A096MIV5_RAT ATP-binding cassette subfamily F member 2 OS=Rattus norvegicus OX=10116 GN=Abcf2 P"| __truncated__ "tr|D4A552|D4A552_RAT MORC family CW-type zinc finger 3 OS=Rattus norvegicus OX=10116 GN=Morc3 PE=1 SV=1;tr|A0A0"| __truncated__ "tr|E9PT29|E9PT29_RAT DEAD-box helicase 17 OS=Rattus norvegicus OX=10116 GN=Ddx17 PE=1 SV=3;tr|A0A096MIX2|A0A096"| __truncated__ ...
##  $ Meta:'data.frame':    19 obs. of  2 variables:
##   ..$ Time     : chr [1:19] "day04" "day04" "day04" "day04" ...
##   ..$ Replicate: chr [1:19] "rep1" "rep2" "rep3" "rep4" ...
##  $ X0  : num [1:5306, 1:19] 2.15e+07 1.92e+07 1.37e+07 5.36e+08 1.20e+07 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5306] "A0A096MIT7;O70441" "A0A096MIV5;A2VD14;F1M8H5;A0A096MJ15" "D4A552;A0A096MIX0" "E9PT29;A0A096MIX2;A0A096MJW9" ...
##   .. ..$ : chr [1:19] "BO2day4rep1" "BO2day4rep2" "BO2day4rep3" "BO2day4rep4" ...
##  $ LX  : num [1:5306, 1:19] 2.15e+07 1.92e+07 1.37e+07 5.36e+08 1.20e+07 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5306] "A0A096MIT7;O70441" "A0A096MIV5;A2VD14;F1M8H5;A0A096MJ15" "D4A552;A0A096MIX0" "E9PT29;A0A096MIX2;A0A096MJW9" ...
##   .. ..$ : chr [1:19] "BO2day4rep1" "BO2day4rep2" "BO2day4rep3" "BO2day4rep4" ...

1.4.3. Alternative: `readMaxQuant(..)`

Alternatively, I prepared a script which can do this automaticaly

## load the script from Internet
source("http://sablab.net/scripts/readMaxQuant.r")

Data = readMaxQuant(file, 
             key.data ="LFQ.intensity.", 
             names.anno = c("Protein.names","Gene.names","Fasta.headers"), 
             samples = "http://edu.sablab.net/rp2019/data/sampleAnnotation.txt", 
             keep.na = FALSE, 
             row.names=1, 
             os.quantile=0.01)

## 5460 features were read.
## Reverse and contaminants are removed, resulting in 5306 features 
## Remove 12 uninformative features
## [1] "Final data container:"
## List of 6
##  $ nf  : int 5294
##  $ ns  : int 19
##  $ Anno:'data.frame':    5294 obs. of  3 variables:
##   ..$ Protein.names: chr [1:5294] "Synapsin-3" "" "" "" ...
##   ..$ Gene.names   : chr [1:5294] "Syn3" "Abcf2" "Morc3" "Ddx17" ...
##   ..$ Fasta.headers: chr [1:5294] "tr|A0A096MIT7|A0A096MIT7_RAT Synapsin-3 OS=Rattus norvegicus OX=10116 GN=Syn3 PE=1 SV=2;sp|O70441|SYN3_RAT Syna"| __truncated__ "tr|A0A096MIV5|A0A096MIV5_RAT ATP-binding cassette subfamily F member 2 OS=Rattus norvegicus OX=10116 GN=Abcf2 P"| __truncated__ "tr|D4A552|D4A552_RAT MORC family CW-type zinc finger 3 OS=Rattus norvegicus OX=10116 GN=Morc3 PE=1 SV=1;tr|A0A0"| __truncated__ "tr|E9PT29|E9PT29_RAT DEAD-box helicase 17 OS=Rattus norvegicus OX=10116 GN=Ddx17 PE=1 SV=3;tr|A0A096MIX2|A0A096"| __truncated__ ...
##  $ Meta:'data.frame':    19 obs. of  2 variables:
##   ..$ Time     : chr [1:19] "day04" "day04" "day04" "day04" ...
##   ..$ Replicate: chr [1:19] "rep1" "rep2" "rep3" "rep4" ...
##  $ X0  : num [1:5294, 1:19] 2.15e+07 1.92e+07 1.37e+07 5.36e+08 1.20e+07 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5294] "A0A096MIT7;O70441" "A0A096MIV5;A2VD14;F1M8H5;A0A096MJ15" "D4A552;A0A096MIX0" "E9PT29;A0A096MIX2;A0A096MJW9" ...
##   .. ..$ : chr [1:19] "BO2day4rep1" "BO2day4rep2" "BO2day4rep3" "BO2day4rep4" ...
##  $ LX  : num [1:5294, 1:19] 4.69 4.53 4.07 9.27 3.89 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5294] "A0A096MIT7;O70441" "A0A096MIV5;A2VD14;F1M8H5;A0A096MJ15" "D4A552;A0A096MIX0" "E9PT29;A0A096MIX2;A0A096MJW9" ...
##   .. ..$ : chr [1:19] "BO2day4rep1" "BO2day4rep2" "BO2day4rep3" "BO2day4rep4" ...

1.5. Read values from binary file (optional)

R can keep data in GZip-ed form, automatically loading the variables into memory. Such files have .RData extension. This is a fast & easy way to store your data. Let us first download the data in RData format into you working directory using download.file() and then load it by load(). Parameters of downloading:

destfile - the file name, under which you would like to store the downloaded file.
mode - the way you would like to treat the data (as text or binary). To keep binary data unchanged, use wb!

Example: Data from TCGA for 40 lung squamous cell carcinoma samples and 20 controls.

download.file("http://edu.sablab.net/rt2018/data/LUSC60.RData", destfile="LUSC60.RData",mode = "wb")

getwd()                # show current folder
dir(pattern=".RData")  # show files in the current folder
load("LUSC60.RData")   # load the data

ls()                   # you should see 'LUSC' object among variables

str(LUSC)

1.6. Data export (optional)

There are several ways to export your data. Let’s consider the most simple.

write() - writes a column of numbers / characters
write.table() - writes a data table
save() - saves one or several variables into a binary RData file.

Parameters of write.table are:

eol - cheracter for the end of line (can be differ with OS). The standard one is “”
dec - decimal separator
quote - do we put “” around character values or not
row.names - do we put row names as a column or not

Let’s select only SCC (squamous cell carcinoma) patients from Samples and save the file locally

## get indexes of differentiation markers
imarker = Data$Anno$Gene.names %in% markers

## extract markers and annotate them by gene names
X = Data$LX[imarker, ]
row.names(X) = Data$Anno$Gene.names[imarker]

write.table(X,file = "MarkerBO.txt",sep = "\t",row.names = TRUE, col.names=NA, quote=FALSE)

# save Data as a binary file
save(Data,file="Data.RData") 

# save all variables as binary file
save(list=ls(),file="workspace.RData") 

getwd()
dir()      # see the results

Data import

Petr V. Nazarov, LIH

2019-10-10

1.1. Current folder

1.2. Read values from a text file

1.3. Read tables

1.4. Read and extract values from a MaxQuant file

1.4.1. Reading data table

1.4.2. Preprocessing

1.4.3. Alternative: `readMaxQuant(..)`

1.5. Read values from binary file (optional)

1.6. Data export (optional)

Data import

Petr V. Nazarov, LIH

2019-10-10

1.1. Current folder

1.2. Read values from a text file

1.3. Read tables

1.4. Read and extract values from a MaxQuant file

1.4.1. Reading data table

1.4.2. Preprocessing

1.4.3. Alternative: readMaxQuant(..)

1.5. Read values from binary file (optional)

1.6. Data export (optional)

1.4.3. Alternative: `readMaxQuant(..)`