How to run kallisto on NCBI SRA RNA-Seq data for differential expression using the Mac terminal
Attention Conservation Notice: This post explains how to run the exceptionally fast RNA-seq k-mer aligner kallisto from the Pachter lab on data you download from NCBI's Sequence Read Archive, and then analyze it for differential expression using voom/limma. As with everything in bioinformatics, this will likely be obsolete in months, if not weeks.
Kallisto is really fast and light on memory without sacrificing accuracy (at least, according to their paper), and therefore has the potential to make your life a lot easier when it comes to analyzing RNA-seq data.
As a test data set, I used the very useful DNAnexus SRA search engine to search the SRA database for a human transcriptome study with 2-3 biological replicates in each of 2 groups, so that I could make reasonably robust differential expression calls without having to download too much data.
I ended up using SRP006900, which is titled "RNA-seq of two paired normal and cancer tissues in two stage III colorectal cancer patients." I wasn't able to find a published differential expression analysis of it in a 3-5 minute search, but since it has been public for almost four years, I'm sure it has been analyzed in that manner. The data are single-end reads averaging 65 base pairs per read.
In order to download this data set (which totals ~10 GB over its four files, in .fastq format) from the SRA via the command line, I used the SRA toolkit. You'll want to make sure it is installed correctly by running a test command such as the following on the command line:
[code language="bash"] $PATH/sratoolkit.2.5.0-mac64/bin/fastq-dump -X 5 -Z SRR390728 [/code]
You'll also want to download kallisto. In order to install it, I copied it to my /usr/local/bin/ folder to make it globally executable, as described on the kallisto website.
You also need to download the fasta file for your transcriptome of interest and create an index file that kallisto can work with. You can download the hg38 reference transcriptome from the kallisto website or from UCSC.
For pre-reqs, it is helpful to have homebrew installed as a package manager. If you do (or once you do), these three commands did the trick for me:
[code language="bash"]
brew update
brew install cmake
brew install homebrew/science/hdf5
[/code]
Next you'll need to actually download the data from each run, convert them to .fastq, and then run kallisto on them. For extensibility, I did this in the following shell script.
First, a config file with the sample names and average read length (needed by kallisto when the data are single-end, in my experience):
[code language="bash"]
declare -a arr=("SRR222175" "SRR222176" "SRR222177" "SRR222178")
AVG_READ_LENGTH=65
[/code]
(Update 5/13/15: The average read length should be average fragment length, which I'm working on figuring out the best way of estimating from single-end reads (if one is possible). In the meantime, it may be easier to choose a paired-end read sample from SRA instead, from which the average fragment length can be estimated by kallisto. Thanks to Hani Goodarzi for the helpful pointer that I was using the wrong parameter.)
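For reference, the average read length (which, per the update above, is not the fragment length kallisto actually wants) is easy to compute directly from a downloaded .fastq file with awk; the file name below is just an example:

```shell
# In a fastq file, every 4th line starting at line 2 is a sequence line,
# so average their lengths to get the mean read length.
awk 'NR % 4 == 2 { total += length($0); reads++ } END { printf "%.1f\n", total / reads }' SRR222175.fastq
```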
Next, I use a shell script to execute the relevant commands on these sample names and build your abundance.txt files. Note that this data set has single-end reads, so you only pass one .fastq file, rather than the two in their example. I put each output in a directory called output_$RUNNAME that I can call from R for downstream analyses. Here's the full shell script (also on GitHub):
[code language="bash"]
#path to where the SRA bin folder is located; could eliminate by making the SRA toolkit globally executable
PATH_SRA_DIR="your_path_here/"

#go to where your kallisto test directory is found, if necessary
cd your_path_here

#load the config file
source test_kallisto/config_SRP006900.file

#download the data
for i in "${arr[@]}"
do
    "$PATH_SRA_DIR"sratoolkit.2.5.0-mac64/bin/fastq-dump --accession $i --outdir test_kallisto
done

#create the transcripts index file
kallisto index -i transcripts.idx transcripts.fasta.gz

#create output directories, and then run kallisto for each of the runs
for i in "${arr[@]}"
do
    mkdir test_kallisto/output_"$i"
    kallisto quant -i transcripts.idx -o test_kallisto/output_"$i" -l "$AVG_READ_LENGTH" test_kallisto/"$i".fastq
done
[/code]
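Once the script finishes, it's worth a quick sanity check before moving to R. Assuming the abundance.txt layout has a header line with tpm as the last column (the file path here is illustrative), you can count how many transcripts were quantified at nonzero abundance:

```shell
# Count transcripts with nonzero tpm in a kallisto abundance file.
# Assumes one header line and that tpm is the last column.
awk 'NR > 1 && $NF > 0 { n++ } END { print n }' test_kallisto/output_SRR222175/abundance.txt
```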
Once you have these files, you can analyze them in R using limma. Here's how I went about it. First, read in the files and merge the transcripts per million (tpm) calls into one data frame:
[code language="R"]
library(limma)

#sample names of the runs downloaded from SRA
sample_list_n = c("SRR222175", "SRR222176", "SRR222177", "SRR222178")

#read each abundance file into its own data frame
for(i in 1:length(sample_list_n)){
    tmp = read.table(file = paste0("test_kallisto/output_", sample_list_n[i], "/abundance.txt"), header = TRUE)
    assign(sample_list_n[i], tmp)
}

sample_list = mget(sample_list_n)

#give the columns of each list element unique names
sample_list_uni = Map(function(x, i) setNames(x, ifelse(names(x) %in% "target_id", names(x), sprintf('%s.%d', names(x), i))), sample_list, seq_along(sample_list))

#merge into one data frame by transcript id
full_kalli = Reduce(function(...) merge(..., by = "target_id", all = T), sample_list_uni)

#pull out the tpm columns
tpm_vals = full_kalli[, grep("tpm", names(full_kalli))]
rownames(tpm_vals) = full_kalli$target_id
[/code]
Then make a contrasts matrix comparing the normal and cancer samples:
[code language="R"]
groups = c("normal", "colon_cancer", "normal", "colon_cancer")
condition <- model.matrix(~0 + groups)
#model.matrix orders its columns by factor level (alphabetical), so label them accordingly
colnames(condition) = levels(factor(groups))
cont_matrix = makeContrasts(norm_canc = normal - colon_cancer, levels = condition)
[/code]
Then compute differential expression using that contrast matrix:
[code language="R"]
#this spits out an EList object
v = voom(counts = tpm_vals, design = condition)
fit = lmFit(v, condition)
fit = contrasts.fit(fit, cont_matrix)
fit = eBayes(fit)
top_table = topTable(fit, n = 10000, sort.by = "p")
[/code]
And that's it. Interestingly, the transcript called as the #1 most differentially expressed using this method, ATP5A1, has been previously implicated as a marker for colon cancer.
Update 5/13/15: Fixed two typos, switched one link, and switched "genome" to "transcriptome" (I was sloppy previously) per @pmelsted's useful comment on twitter.
Update 8/10/15: As Seb Battaglia points out on twitter, "voom assumes you cut out out 0s and low intensity tpm. I'd add that step or data hist() will be skewed." I don't have the opportunity to re-run the analyses right now (!), but please take this into consideration.
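As a rough sketch of that filtering step, you could drop near-zero rows from each abundance.txt before reading it into R; this assumes tpm is the last column and uses an arbitrary cutoff of 1 tpm (filtering jointly across samples inside R, e.g. dropping rows of tpm_vals that are zero everywhere, would be the more principled approach):

```shell
# Keep the header plus rows whose tpm (last column) is at least 1.
# The cutoff of 1 tpm is arbitrary; adjust to your data.
awk 'NR == 1 || $NF >= 1' test_kallisto/output_SRR222175/abundance.txt > test_kallisto/output_SRR222175/abundance_filtered.txt
```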