R can import data from local storage or Internet, and export it locally. Let us first setup the current folder, where our results will be stored.
rm(list=ls()) ## clean memory
getwd() # shows current folder
dir() # shows files in the current folder
setwd("D:/Data/R") # sets the current folder
Here is a file with some marker genes Markers.
We can read values from an unformatted text file using scan()
.
markers = scan("http://edu.sablab.net/rp2019/data/markers.txt",what = "",sep="\n") # what - defines the value class
print(markers)
## [1] "Gfap" "Grid1" "Grin2b" "Mag" "Mki67" "Msi1" "Nes"
## [8] "Plp1" "Slc6a1" "Slc6a11" "Sox2" "Syp"
You can download an entire webpage by scan
to parse it afterwards.
We will use read.table()
to import the data as a data frame. Let’s try on sample annotation file: sampleAnnotation.txt.
Some parameters are important in read.table()
:
header
- set it TRUE
if there is a header linesep
- separator character. "\t"
stands for tabulationas.is
- prevents transforming character columns to factors.row.names
- should the first column be considered as rownames?Samples = read.table("http://edu.sablab.net/rp2019/data/sampleAnnotation.txt", header=T, sep="\t", as.is=T, row.names = 1)
Time | Replicate | |
---|---|---|
BO2day4rep1 | day04 | rep1 |
BO2day4rep2 | day04 | rep2 |
BO2day4rep3 | day04 | rep3 |
BO2day4rep4 | day04 | rep4 |
BO2day4rep5 | day04 | rep5 |
BO2day10rep1 | day10 | rep1 |
BO2day10rep2 | day10 | rep2 |
BO2day10rep3 | day10 | rep3 |
BO2day10rep4 | day10 | rep4 |
BO2day10rep5 | day10 | rep5 |
BO2day14rep1 | day14 | rep1 |
BO2day14rep2 | day14 | rep2 |
BO2day14rep3 | day14 | rep3 |
BO2day14rep4 | day14 | rep4 |
BO2day14rep5 | day14 | rep5 |
BO2day21rep1 | day21 | rep1 |
BO2day21rep2 | day21 | rep2 |
BO2day21rep3 | day21 | rep3 |
BO2day21rep4 | day21 | rep4 |
Do not forget functions that allow you seeing, what is inside your data:
str(Samples)
## 'data.frame': 19 obs. of 2 variables:
## $ Time : chr "day04" "day04" "day04" "day04" ...
## $ Replicate: chr "rep1" "rep2" "rep3" "rep4" ...
head(Samples)
## Time Replicate
## BO2day4rep1 day04 rep1
## BO2day4rep2 day04 rep2
## BO2day4rep3 day04 rep3
## BO2day4rep4 day04 rep4
## BO2day4rep5 day04 rep5
## BO2day10rep1 day10 rep1
summary(Samples)
## Time Replicate
## Length:19 Length:19
## Class :character Class :character
## Mode :character Mode :character
View(Samples)
You can see row and column names of the loaded data.frame object:
rownames(Samples)
## [1] "BO2day4rep1" "BO2day4rep2" "BO2day4rep3" "BO2day4rep4"
## [5] "BO2day4rep5" "BO2day10rep1" "BO2day10rep2" "BO2day10rep3"
## [9] "BO2day10rep4" "BO2day10rep5" "BO2day14rep1" "BO2day14rep2"
## [13] "BO2day14rep3" "BO2day14rep4" "BO2day14rep5" "BO2day21rep1"
## [17] "BO2day21rep2" "BO2day21rep3" "BO2day21rep4"
colnames(Samples)
## [1] "Time" "Replicate"
Here we will try to read protein group file generated by MaxQuant. Let us first do it manually first. We will work with data from rat brain organoid experiment. The data originated from Elena Martinez’ work in collaboration with Rolf Bjerkvig. You can download the proteinGroups.txt file to use it locally, or import it directly from Internet.
file = "D:/Data/RProteo2019/proteinGroups.txt"
## uncomment the following line to import from Internet
# file = "http://edu.sablab.net/rp2019/data/proteinGroups.txt"
Tab = read.table(file, header=T, sep="\t", as.is=T, row.names = 1)
## see number of rows and columns
dim(Tab)
## [1] 5460 185
The table contains too many columns to be properly seen after str()
. Let’s show all column names:
colnames(Tab)
## [1] "Majority.protein.IDs"
## [2] "Peptide.counts..all."
## [3] "Peptide.counts..razor.unique."
## [4] "Peptide.counts..unique."
## [5] "Protein.names"
## [6] "Gene.names"
## [7] "Fasta.headers"
## [8] "Number.of.proteins"
## [9] "Peptides"
## [10] "Razor...unique.peptides"
## [11] "Unique.peptides"
## [12] "Peptides.BO2day10rep1"
## [13] "Peptides.BO2day10rep2"
## [14] "Peptides.BO2day10rep3"
## [15] "Peptides.BO2day10rep4"
## [16] "Peptides.BO2day10rep5"
## [17] "Peptides.BO2day14rep1"
## [18] "Peptides.BO2day14rep2"
## [19] "Peptides.BO2day14rep3"
## [20] "Peptides.BO2day14rep4"
## [21] "Peptides.BO2day14rep5"
## [22] "Peptides.BO2day21rep1"
## [23] "Peptides.BO2day21rep2"
## [24] "Peptides.BO2day21rep3"
## [25] "Peptides.BO2day21rep4"
## [26] "Peptides.BO2day4rep1"
## [27] "Peptides.BO2day4rep2"
## [28] "Peptides.BO2day4rep3"
## [29] "Peptides.BO2day4rep4"
## [30] "Peptides.BO2day4rep5"
## [31] "Razor...unique.peptides.BO2day10rep1"
## [32] "Razor...unique.peptides.BO2day10rep2"
## [33] "Razor...unique.peptides.BO2day10rep3"
## [34] "Razor...unique.peptides.BO2day10rep4"
## [35] "Razor...unique.peptides.BO2day10rep5"
## [36] "Razor...unique.peptides.BO2day14rep1"
## [37] "Razor...unique.peptides.BO2day14rep2"
## [38] "Razor...unique.peptides.BO2day14rep3"
## [39] "Razor...unique.peptides.BO2day14rep4"
## [40] "Razor...unique.peptides.BO2day14rep5"
## [41] "Razor...unique.peptides.BO2day21rep1"
## [42] "Razor...unique.peptides.BO2day21rep2"
## [43] "Razor...unique.peptides.BO2day21rep3"
## [44] "Razor...unique.peptides.BO2day21rep4"
## [45] "Razor...unique.peptides.BO2day4rep1"
## [46] "Razor...unique.peptides.BO2day4rep2"
## [47] "Razor...unique.peptides.BO2day4rep3"
## [48] "Razor...unique.peptides.BO2day4rep4"
## [49] "Razor...unique.peptides.BO2day4rep5"
## [50] "Unique.peptides.BO2day10rep1"
## [51] "Unique.peptides.BO2day10rep2"
## [52] "Unique.peptides.BO2day10rep3"
## [53] "Unique.peptides.BO2day10rep4"
## [54] "Unique.peptides.BO2day10rep5"
## [55] "Unique.peptides.BO2day14rep1"
## [56] "Unique.peptides.BO2day14rep2"
## [57] "Unique.peptides.BO2day14rep3"
## [58] "Unique.peptides.BO2day14rep4"
## [59] "Unique.peptides.BO2day14rep5"
## [60] "Unique.peptides.BO2day21rep1"
## [61] "Unique.peptides.BO2day21rep2"
## [62] "Unique.peptides.BO2day21rep3"
## [63] "Unique.peptides.BO2day21rep4"
## [64] "Unique.peptides.BO2day4rep1"
## [65] "Unique.peptides.BO2day4rep2"
## [66] "Unique.peptides.BO2day4rep3"
## [67] "Unique.peptides.BO2day4rep4"
## [68] "Unique.peptides.BO2day4rep5"
## [69] "Sequence.coverage...."
## [70] "Unique...razor.sequence.coverage...."
## [71] "Unique.sequence.coverage...."
## [72] "Mol..weight..kDa."
## [73] "Sequence.length"
## [74] "Sequence.lengths"
## [75] "Q.value"
## [76] "Score"
## [77] "Identification.type.BO2day10rep1"
## [78] "Identification.type.BO2day10rep2"
## [79] "Identification.type.BO2day10rep3"
## [80] "Identification.type.BO2day10rep4"
## [81] "Identification.type.BO2day10rep5"
## [82] "Identification.type.BO2day14rep1"
## [83] "Identification.type.BO2day14rep2"
## [84] "Identification.type.BO2day14rep3"
## [85] "Identification.type.BO2day14rep4"
## [86] "Identification.type.BO2day14rep5"
## [87] "Identification.type.BO2day21rep1"
## [88] "Identification.type.BO2day21rep2"
## [89] "Identification.type.BO2day21rep3"
## [90] "Identification.type.BO2day21rep4"
## [91] "Identification.type.BO2day4rep1"
## [92] "Identification.type.BO2day4rep2"
## [93] "Identification.type.BO2day4rep3"
## [94] "Identification.type.BO2day4rep4"
## [95] "Identification.type.BO2day4rep5"
## [96] "Sequence.coverage.BO2day10rep1...."
## [97] "Sequence.coverage.BO2day10rep2...."
## [98] "Sequence.coverage.BO2day10rep3...."
## [99] "Sequence.coverage.BO2day10rep4...."
## [100] "Sequence.coverage.BO2day10rep5...."
## [101] "Sequence.coverage.BO2day14rep1...."
## [102] "Sequence.coverage.BO2day14rep2...."
## [103] "Sequence.coverage.BO2day14rep3...."
## [104] "Sequence.coverage.BO2day14rep4...."
## [105] "Sequence.coverage.BO2day14rep5...."
## [106] "Sequence.coverage.BO2day21rep1...."
## [107] "Sequence.coverage.BO2day21rep2...."
## [108] "Sequence.coverage.BO2day21rep3...."
## [109] "Sequence.coverage.BO2day21rep4...."
## [110] "Sequence.coverage.BO2day4rep1...."
## [111] "Sequence.coverage.BO2day4rep2...."
## [112] "Sequence.coverage.BO2day4rep3...."
## [113] "Sequence.coverage.BO2day4rep4...."
## [114] "Sequence.coverage.BO2day4rep5...."
## [115] "Intensity"
## [116] "Intensity.BO2day10rep1"
## [117] "Intensity.BO2day10rep2"
## [118] "Intensity.BO2day10rep3"
## [119] "Intensity.BO2day10rep4"
## [120] "Intensity.BO2day10rep5"
## [121] "Intensity.BO2day14rep1"
## [122] "Intensity.BO2day14rep2"
## [123] "Intensity.BO2day14rep3"
## [124] "Intensity.BO2day14rep4"
## [125] "Intensity.BO2day14rep5"
## [126] "Intensity.BO2day21rep1"
## [127] "Intensity.BO2day21rep2"
## [128] "Intensity.BO2day21rep3"
## [129] "Intensity.BO2day21rep4"
## [130] "Intensity.BO2day4rep1"
## [131] "Intensity.BO2day4rep2"
## [132] "Intensity.BO2day4rep3"
## [133] "Intensity.BO2day4rep4"
## [134] "Intensity.BO2day4rep5"
## [135] "LFQ.intensity.BO2day10rep1"
## [136] "LFQ.intensity.BO2day10rep2"
## [137] "LFQ.intensity.BO2day10rep3"
## [138] "LFQ.intensity.BO2day10rep4"
## [139] "LFQ.intensity.BO2day10rep5"
## [140] "LFQ.intensity.BO2day14rep1"
## [141] "LFQ.intensity.BO2day14rep2"
## [142] "LFQ.intensity.BO2day14rep3"
## [143] "LFQ.intensity.BO2day14rep4"
## [144] "LFQ.intensity.BO2day14rep5"
## [145] "LFQ.intensity.BO2day21rep1"
## [146] "LFQ.intensity.BO2day21rep2"
## [147] "LFQ.intensity.BO2day21rep3"
## [148] "LFQ.intensity.BO2day21rep4"
## [149] "LFQ.intensity.BO2day4rep1"
## [150] "LFQ.intensity.BO2day4rep2"
## [151] "LFQ.intensity.BO2day4rep3"
## [152] "LFQ.intensity.BO2day4rep4"
## [153] "LFQ.intensity.BO2day4rep5"
## [154] "MS.MS.count.BO2day10rep1"
## [155] "MS.MS.count.BO2day10rep2"
## [156] "MS.MS.count.BO2day10rep3"
## [157] "MS.MS.count.BO2day10rep4"
## [158] "MS.MS.count.BO2day10rep5"
## [159] "MS.MS.count.BO2day14rep1"
## [160] "MS.MS.count.BO2day14rep2"
## [161] "MS.MS.count.BO2day14rep3"
## [162] "MS.MS.count.BO2day14rep4"
## [163] "MS.MS.count.BO2day14rep5"
## [164] "MS.MS.count.BO2day21rep1"
## [165] "MS.MS.count.BO2day21rep2"
## [166] "MS.MS.count.BO2day21rep3"
## [167] "MS.MS.count.BO2day21rep4"
## [168] "MS.MS.count.BO2day4rep1"
## [169] "MS.MS.count.BO2day4rep2"
## [170] "MS.MS.count.BO2day4rep3"
## [171] "MS.MS.count.BO2day4rep4"
## [172] "MS.MS.count.BO2day4rep5"
## [173] "MS.MS.count"
## [174] "Only.identified.by.site"
## [175] "Reverse"
## [176] "Potential.contaminant"
## [177] "id"
## [178] "Peptide.IDs"
## [179] "Peptide.is.razor"
## [180] "Mod..peptide.IDs"
## [181] "Evidence.IDs"
## [182] "MS.MS.IDs"
## [183] "Best.MS.MS"
## [184] "Oxidation..M..site.IDs"
## [185] "Oxidation..M..site.positions"
First, we should remove bad features using information in Reverse
and Potential.contaminant
columns.
print("Original size:")
## [1] "Original size:"
dim(Tab)
## [1] 5460 185
ikeep = Tab$Reverse=="" & Tab$Potential.contaminant==""
Tab = Tab[ikeep,]
print("Filtered:")
## [1] "Filtered:"
dim(Tab)
## [1] 5306 185
Now, let us define a data container - it will be easier to manipulate later. We are interested in several columns to annotate the features (in addition to rownames, which is protein group IDs). Let’s keep: Protein.names
, Gene.names
, Fasta.headers
. Sample information we shall take from
Data = list()
## get feature annotation
Data$Anno = Tab[,c("Protein.names","Gene.names","Fasta.headers")]
## get sample annotation
Data$Meta = read.table("http://edu.sablab.net/rp2019/data/sampleAnnotation.txt", header=T, sep="\t", as.is=T, row.names = 1)
str(Data)
## List of 2
## $ Anno:'data.frame': 5306 obs. of 3 variables:
## ..$ Protein.names: chr [1:5306] "Synapsin-3" "" "" "" ...
## ..$ Gene.names : chr [1:5306] "Syn3" "Abcf2" "Morc3" "Ddx17" ...
## ..$ Fasta.headers: chr [1:5306] "tr|A0A096MIT7|A0A096MIT7_RAT Synapsin-3 OS=Rattus norvegicus OX=10116 GN=Syn3 PE=1 SV=2;sp|O70441|SYN3_RAT Syna"| __truncated__ "tr|A0A096MIV5|A0A096MIV5_RAT ATP-binding cassette subfamily F member 2 OS=Rattus norvegicus OX=10116 GN=Abcf2 P"| __truncated__ "tr|D4A552|D4A552_RAT MORC family CW-type zinc finger 3 OS=Rattus norvegicus OX=10116 GN=Morc3 PE=1 SV=1;tr|A0A0"| __truncated__ "tr|E9PT29|E9PT29_RAT DEAD-box helicase 17 OS=Rattus norvegicus OX=10116 GN=Ddx17 PE=1 SV=3;tr|A0A096MIX2|A0A096"| __truncated__ ...
## $ Meta:'data.frame': 19 obs. of 2 variables:
## ..$ Time : chr [1:19] "day04" "day04" "day04" "day04" ...
## ..$ Replicate: chr [1:19] "rep1" "rep2" "rep3" "rep4" ...
Now, let us extract intensity data. We will use normalized LFQ intensities
## column indexes, containing LFQ intensities
icol = grep("LFQ.intensity",colnames(Tab))
## extract columns
Data$X0 = Tab[,icol]
## remove uninformative part of column names
colnames(Data$X0) = sub("LFQ.intensity.","",colnames(Data$X0))
## matrices are faster than data.frames => transform to a matrix
Data$X0 = as.matrix(Data$X0)
str(Data)
## List of 3
## $ Anno:'data.frame': 5306 obs. of 3 variables:
## ..$ Protein.names: chr [1:5306] "Synapsin-3" "" "" "" ...
## ..$ Gene.names : chr [1:5306] "Syn3" "Abcf2" "Morc3" "Ddx17" ...
## ..$ Fasta.headers: chr [1:5306] "tr|A0A096MIT7|A0A096MIT7_RAT Synapsin-3 OS=Rattus norvegicus OX=10116 GN=Syn3 PE=1 SV=2;sp|O70441|SYN3_RAT Syna"| __truncated__ "tr|A0A096MIV5|A0A096MIV5_RAT ATP-binding cassette subfamily F member 2 OS=Rattus norvegicus OX=10116 GN=Abcf2 P"| __truncated__ "tr|D4A552|D4A552_RAT MORC family CW-type zinc finger 3 OS=Rattus norvegicus OX=10116 GN=Morc3 PE=1 SV=1;tr|A0A0"| __truncated__ "tr|E9PT29|E9PT29_RAT DEAD-box helicase 17 OS=Rattus norvegicus OX=10116 GN=Ddx17 PE=1 SV=3;tr|A0A096MIX2|A0A096"| __truncated__ ...
## $ Meta:'data.frame': 19 obs. of 2 variables:
## ..$ Time : chr [1:19] "day04" "day04" "day04" "day04" ...
## ..$ Replicate: chr [1:19] "rep1" "rep2" "rep3" "rep4" ...
## $ X0 : num [1:5306, 1:19] 2.46e+07 1.46e+07 1.44e+07 4.63e+08 4.78e+06 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:5306] "A0A096MIT7;O70441" "A0A096MIV5;A2VD14;F1M8H5;A0A096MJ15" "D4A552;A0A096MIX0" "E9PT29;A0A096MIX2;A0A096MJW9" ...
## .. ..$ : chr [1:19] "BO2day10rep1" "BO2day10rep2" "BO2day10rep3" "BO2day10rep4" ...
Now we have data in the matrix Data$X0
. The data are extremely right-skewed, so log-transformation is needed before using any exploratory data analysis. Let us assume, that all 0 intensities come from proteins with low detection limit (another approach is to consider them as missing values - NA
, but we will not use this here).
## define ofset to avoid -Inf as 1st percentile of non-zero data
os = quantile(Data$X0[Data$X0>0],probs = 0.01, na.rm=TRUE)
## log-transorm and shift left-most values to 0
Data$LX = log2(os + Data$X0) - log2(os)
Data$LX[is.na(Data$LX)] = 0
str(Data)
## List of 4
## $ Anno:'data.frame': 5306 obs. of 3 variables:
## ..$ Protein.names: chr [1:5306] "Synapsin-3" "" "" "" ...
## ..$ Gene.names : chr [1:5306] "Syn3" "Abcf2" "Morc3" "Ddx17" ...
## ..$ Fasta.headers: chr [1:5306] "tr|A0A096MIT7|A0A096MIT7_RAT Synapsin-3 OS=Rattus norvegicus OX=10116 GN=Syn3 PE=1 SV=2;sp|O70441|SYN3_RAT Syna"| __truncated__ "tr|A0A096MIV5|A0A096MIV5_RAT ATP-binding cassette subfamily F member 2 OS=Rattus norvegicus OX=10116 GN=Abcf2 P"| __truncated__ "tr|D4A552|D4A552_RAT MORC family CW-type zinc finger 3 OS=Rattus norvegicus OX=10116 GN=Morc3 PE=1 SV=1;tr|A0A0"| __truncated__ "tr|E9PT29|E9PT29_RAT DEAD-box helicase 17 OS=Rattus norvegicus OX=10116 GN=Ddx17 PE=1 SV=3;tr|A0A096MIX2|A0A096"| __truncated__ ...
## $ Meta:'data.frame': 19 obs. of 2 variables:
## ..$ Time : chr [1:19] "day04" "day04" "day04" "day04" ...
## ..$ Replicate: chr [1:19] "rep1" "rep2" "rep3" "rep4" ...
## $ X0 : num [1:5306, 1:19] 2.46e+07 1.46e+07 1.44e+07 4.63e+08 4.78e+06 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:5306] "A0A096MIT7;O70441" "A0A096MIV5;A2VD14;F1M8H5;A0A096MJ15" "D4A552;A0A096MIX0" "E9PT29;A0A096MIX2;A0A096MJW9" ...
## .. ..$ : chr [1:19] "BO2day10rep1" "BO2day10rep2" "BO2day10rep3" "BO2day10rep4" ...
## $ LX : num [1:5306, 1:19] 4.87 4.15 4.14 9.06 2.7 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:5306] "A0A096MIT7;O70441" "A0A096MIV5;A2VD14;F1M8H5;A0A096MJ15" "D4A552;A0A096MIX0" "E9PT29;A0A096MIX2;A0A096MJW9" ...
## .. ..$ : chr [1:19] "BO2day10rep1" "BO2day10rep2" "BO2day10rep3" "BO2day10rep4" ...
Finaly, let’s order the samples in the way, defined in Data$Meta
.
## check consistancy
if (!all(colnames(Data$X0) %in% rownames(Data$Meta)))
stop("Error! Sample annotation in Meta does not fit such in the data!")
Data$X0 = Data$X[,rownames(Data$Meta)]
Data$LX = Data$X[,rownames(Data$Meta)]
str(Data)
## List of 4
## $ Anno:'data.frame': 5306 obs. of 3 variables:
## ..$ Protein.names: chr [1:5306] "Synapsin-3" "" "" "" ...
## ..$ Gene.names : chr [1:5306] "Syn3" "Abcf2" "Morc3" "Ddx17" ...
## ..$ Fasta.headers: chr [1:5306] "tr|A0A096MIT7|A0A096MIT7_RAT Synapsin-3 OS=Rattus norvegicus OX=10116 GN=Syn3 PE=1 SV=2;sp|O70441|SYN3_RAT Syna"| __truncated__ "tr|A0A096MIV5|A0A096MIV5_RAT ATP-binding cassette subfamily F member 2 OS=Rattus norvegicus OX=10116 GN=Abcf2 P"| __truncated__ "tr|D4A552|D4A552_RAT MORC family CW-type zinc finger 3 OS=Rattus norvegicus OX=10116 GN=Morc3 PE=1 SV=1;tr|A0A0"| __truncated__ "tr|E9PT29|E9PT29_RAT DEAD-box helicase 17 OS=Rattus norvegicus OX=10116 GN=Ddx17 PE=1 SV=3;tr|A0A096MIX2|A0A096"| __truncated__ ...
## $ Meta:'data.frame': 19 obs. of 2 variables:
## ..$ Time : chr [1:19] "day04" "day04" "day04" "day04" ...
## ..$ Replicate: chr [1:19] "rep1" "rep2" "rep3" "rep4" ...
## $ X0 : num [1:5306, 1:19] 2.15e+07 1.92e+07 1.37e+07 5.36e+08 1.20e+07 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:5306] "A0A096MIT7;O70441" "A0A096MIV5;A2VD14;F1M8H5;A0A096MJ15" "D4A552;A0A096MIX0" "E9PT29;A0A096MIX2;A0A096MJW9" ...
## .. ..$ : chr [1:19] "BO2day4rep1" "BO2day4rep2" "BO2day4rep3" "BO2day4rep4" ...
## $ LX : num [1:5306, 1:19] 2.15e+07 1.92e+07 1.37e+07 5.36e+08 1.20e+07 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:5306] "A0A096MIT7;O70441" "A0A096MIV5;A2VD14;F1M8H5;A0A096MJ15" "D4A552;A0A096MIX0" "E9PT29;A0A096MIX2;A0A096MJW9" ...
## .. ..$ : chr [1:19] "BO2day4rep1" "BO2day4rep2" "BO2day4rep3" "BO2day4rep4" ...
readMaxQuant(..)
Alternatively, I prepared a script which can do this automaticaly
## load the script from Internet
source("http://sablab.net/scripts/readMaxQuant.r")
Data = readMaxQuant(file,
key.data ="LFQ.intensity.",
names.anno = c("Protein.names","Gene.names","Fasta.headers"),
samples = "http://edu.sablab.net/rp2019/data/sampleAnnotation.txt",
keep.na = FALSE,
row.names=1,
os.quantile=0.01)
## 5460 features were read.
## Reverse and contaminants are removed, resulting in 5306 features
## Remove 12 uninformative features
## [1] "Final data container:"
## List of 6
## $ nf : int 5294
## $ ns : int 19
## $ Anno:'data.frame': 5294 obs. of 3 variables:
## ..$ Protein.names: chr [1:5294] "Synapsin-3" "" "" "" ...
## ..$ Gene.names : chr [1:5294] "Syn3" "Abcf2" "Morc3" "Ddx17" ...
## ..$ Fasta.headers: chr [1:5294] "tr|A0A096MIT7|A0A096MIT7_RAT Synapsin-3 OS=Rattus norvegicus OX=10116 GN=Syn3 PE=1 SV=2;sp|O70441|SYN3_RAT Syna"| __truncated__ "tr|A0A096MIV5|A0A096MIV5_RAT ATP-binding cassette subfamily F member 2 OS=Rattus norvegicus OX=10116 GN=Abcf2 P"| __truncated__ "tr|D4A552|D4A552_RAT MORC family CW-type zinc finger 3 OS=Rattus norvegicus OX=10116 GN=Morc3 PE=1 SV=1;tr|A0A0"| __truncated__ "tr|E9PT29|E9PT29_RAT DEAD-box helicase 17 OS=Rattus norvegicus OX=10116 GN=Ddx17 PE=1 SV=3;tr|A0A096MIX2|A0A096"| __truncated__ ...
## $ Meta:'data.frame': 19 obs. of 2 variables:
## ..$ Time : chr [1:19] "day04" "day04" "day04" "day04" ...
## ..$ Replicate: chr [1:19] "rep1" "rep2" "rep3" "rep4" ...
## $ X0 : num [1:5294, 1:19] 2.15e+07 1.92e+07 1.37e+07 5.36e+08 1.20e+07 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:5294] "A0A096MIT7;O70441" "A0A096MIV5;A2VD14;F1M8H5;A0A096MJ15" "D4A552;A0A096MIX0" "E9PT29;A0A096MIX2;A0A096MJW9" ...
## .. ..$ : chr [1:19] "BO2day4rep1" "BO2day4rep2" "BO2day4rep3" "BO2day4rep4" ...
## $ LX : num [1:5294, 1:19] 4.69 4.53 4.07 9.27 3.89 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:5294] "A0A096MIT7;O70441" "A0A096MIV5;A2VD14;F1M8H5;A0A096MJ15" "D4A552;A0A096MIX0" "E9PT29;A0A096MIX2;A0A096MJW9" ...
## .. ..$ : chr [1:19] "BO2day4rep1" "BO2day4rep2" "BO2day4rep3" "BO2day4rep4" ...
R can keep data in GZip-ed form, automatically loading the variables into memory. Such files have .RData extension. This is a fast & easy way to store your data. Let us first download the data in RData format into you working directory using download.file()
and then load it by load()
. Parameters of downloading:
destfile
- the file name, under which you would like to store the downloaded file.mode
- the way you would like to treat the data (as text or binary). To keep binary data unchanged, use wb
!Example: Data from TCGA for 40 lung squamous cell carcinoma samples and 20 controls.
download.file("http://edu.sablab.net/rt2018/data/LUSC60.RData", destfile="LUSC60.RData",mode = "wb")
getwd() # show current folder
dir(pattern=".RData") # show files in the current folder
load("LUSC60.RData") # load the data
ls() # you should see 'LUSC' object among variables
str(LUSC)
There are several ways to export your data. Let’s consider the most simple.
write()
- writes a column of numbers / characterswrite.table()
- writes a data tablesave()
- saves one or several variables into a binary RData file.Parameters of write.table
are:
eol
- cheracter for the end of line (can be differ with OS). The standard one is “”dec
- decimal separatorquote
- do we put “” around character values or notrow.names
- do we put row names as a column or notLet’s select only SCC (squamous cell carcinoma) patients from Samples
and save the file locally
## get indexes of differentiation markers
imarker = Data$Anno$Gene.names %in% markers
## extract markers and annotate them by gene names
X = Data$LX[imarker, ]
row.names(X) = Data$Anno$Gene.names[imarker]
write.table(X,file = "MarkerBO.txt",sep = "\t",row.names = TRUE, col.names=NA, quote=FALSE)
# save Data as a binary file
save(Data,file="Data.RData")
# save all variables as binary file
save(list=ls(),file="workspace.RData")
getwd()
dir() # see the results