Zhang Lab@Columbia by Hanrui Zhang [2019-06-10] and updated by Philip Ha [2020-07-16]
1. Introduction to RNA-seq
This video provides a thorough introduction to RNA-seq.
2. Install R and RStudio
We will use data sets in R. Please install the latest versions of R and RStudio.
If you need a refresher, or have never used R before, please step through these tutorials.
- Download R: R is the free programming language we will use. Choose the correct version for your laptop: Mac/Windows
- Download RStudio: RStudio is free software that will help us develop programs in R. Choose the correct version for your laptop: Mac/Windows
- How to Download R and RStudio: A tutorial on how to download R and RStudio
- How to Install a Package in RStudio: Steps to install a package in RStudio
- Additional useful learning resources (for now and the future):
- Unix cheat sheet
- Unix tutorial
- Introduction to R: A free edX class on R fundamentals using Datacamp platform
- HarvardX Biomedical Data Science Open Online Training
- R for data science
3. Install Miniconda
4. Salmon: Transcript quantification from RNA-seq data
4.1 Install Salmon
- Follow the instructions to install Salmon via bioconda.
$ conda config --add channels conda-forge
$ conda config --add channels bioconda
$ conda create -n salmon salmon
- The environment can then be activated via:
$ conda activate salmon
- To deactivate:
$ conda deactivate
- Use the following command line to get the help file for all the arguments.
$ salmon quant -h
4.2 Download Gencode annotation
- Go here to download the necessary files
- Download the release you need (we use human v30 in this workflow)
- Download both the GTF and the Fasta file
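For readers who prefer the command line, the download can be sketched as below. This is only a sketch: the base URL and the GTF file name are assumptions based on the usual layout of the GENCODE FTP mirror at EBI and should be checked against the download page. The script prints the `curl` commands rather than running them, so they can be reviewed first.

```shell
#!/bin/sh
# Assumption: GENCODE human releases are served from the EBI mirror with this
# directory layout. Verify the URLs on the GENCODE download page before use.
RELEASE=30
BASE_URL="https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_${RELEASE}"

# Print the full URL for one file of the chosen release.
gencode_url() {
    printf '%s/%s\n' "$BASE_URL" "$1"
}

# Transcript sequences (used to build the Salmon index) and the GTF annotation.
TRANSCRIPTS_URL=$(gencode_url "gencode.v${RELEASE}.transcripts.fa.gz")
ANNOTATION_URL=$(gencode_url "gencode.v${RELEASE}.annotation.gtf.gz")

# Print the download commands instead of executing them.
echo "curl -L -O $TRANSCRIPTS_URL"
echo "curl -L -O $ANNOTATION_URL"
```

Copy the printed commands into the terminal once the URLs have been confirmed.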
4.3 Obtain fastq file from SRA
- The first step is to install SRA Toolkit
- To test whether the installation was successful, open a terminal or command prompt and “cd” into the directory containing the toolkit executables (e.g., [download_location]/sratoolkit[version]/bin/).
- For Linux/Mac OSX: ./fastq-dump -X 5 -Z SRR390728
- For Windows: fastq-dump.exe -X 5 -Z SRR390728
If successful, the test should connect to NCBI, download a small amount of data from SRR390728 and the reference sequence needed to extract the data, and stream the first 5 spots of the file (“-X 5” option) to the screen (“-Z” option).
- If the installation was unsuccessful, the toolkit may need to be reconfigured and adjusted to the default settings. The guide can also be found above.
- Now we are ready to download the fastq files we will analyze
- Here are our RNA-seq data in GEO with accession number GSE55536.
- There are 33 samples in total. Let’s try 6 of them.
Sample ID       SRR          Description
HMDM MAC Rep1   SRR1182374   M0-HMDM, Replicate 1
HMDM MAC Rep2   SRR1182375   M0-HMDM, Replicate 2
HMDM MAC Rep3   SRR2910663   M0-HMDM, Replicate 3
HMDM M1 Rep1    SRR1182376   M1-HMDM, Replicate 1
HMDM M1 Rep2    SRR1182377   M1-HMDM, Replicate 2
HMDM M1 Rep3    SRR2910664   M1-HMDM, Replicate 3
HMDM: human monocyte-derived macrophages;
M0: Resting HMDM without stimulation;
M1: HMDM treated with endotoxin and interferon-gamma for 18-20 hours to induce an inflammatory response.
- Use the command line below in a terminal to download the fastq files (for now let’s do it one by one). The command means “download only the first 1M reads of the SRR run, and split the paired-end reads into separate files”.
Make sure you “cd” into the SRA Toolkit bin/ directory first.
$ ./fastq-dump -X 1000000 --split-files SRR1182374
$ ./fastq-dump -X 1000000 --split-files SRR1182375
$ ./fastq-dump -X 1000000 --split-files SRR2910663
$ ./fastq-dump -X 1000000 --split-files SRR1182376
$ ./fastq-dump -X 1000000 --split-files SRR1182377
$ ./fastq-dump -X 1000000 --split-files SRR2910664
- Alternatively, use “prefetch” to download the run as an SRA file, which can then be converted to fastq with fastq-dump.
$ ./sratoolkit.2.9.1-1-mac64/bin/prefetch SRR1182374
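Once the one-by-one commands are familiar, the six downloads can be written as a loop. The following is a minimal sketch that reuses the same fastq-dump options; the `DRY_RUN` switch and the `run` and `download_run` helpers are illustrative names, not part of the SRA Toolkit. By default the script only prints the commands.

```shell
#!/bin/sh
# The six SRR accessions from the sample table in this section.
ACCESSIONS="SRR1182374 SRR1182375 SRR2910663 SRR1182376 SRR1182377 SRR2910664"

# DRY_RUN=1 (the default here) prints each command; set DRY_RUN=0 to execute it.
DRY_RUN=${DRY_RUN:-1}

# Print or execute a command, depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Fetch the first 1M reads of one run and split the paired-end reads.
# Assumes the current directory is the toolkit's bin/ directory.
download_run() {
    run ./fastq-dump -X 1000000 --split-files "$1"
}

for srr in $ACCESSIONS; do
    download_run "$srr"
done
```

Run it as is to check the printed commands, then rerun with `DRY_RUN=0` from the toolkit's bin/ directory to start the downloads.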
4.4 Build salmon index
- First “cd” into the directory with the GENCODE GTF and Fasta files.
- The “--gencode” flag trims the extra symbols in the GENCODE transcript names, which makes the data easier to handle later.
- Make sure to use
$ salmon --version
to check the Salmon version and change the index name in the code accordingly.
- This step will take a few minutes.
$ conda activate salmon
$ salmon index -t gencode.v30.transcripts.fa.gz -i gencode.v30_salmon_1.2.1 --gencode
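The index name can also be derived from the installed Salmon version instead of being edited by hand. The sketch below assumes that `salmon --version` prints a single line of the form "salmon 1.2.1"; `index_name` is a hypothetical helper, and the script only prints the resulting `salmon index` command.

```shell
#!/bin/sh
# Build an index name from a version string such as "salmon 1.2.1".
index_name() {
    version=${1##* }    # keep only the text after the last space
    printf 'gencode.v30_salmon_%s\n' "$version"
}

# Assumption: `salmon --version` prints one line like "salmon 1.2.1".
if command -v salmon >/dev/null 2>&1; then
    INDEX=$(index_name "$(salmon --version 2>&1)")
else
    # Salmon is not on the PATH (environment not activated?); fall back to
    # the version used in this workflow so the command can still be shown.
    INDEX=$(index_name "salmon 1.2.1")
fi

# Print the indexing command for review rather than running it.
echo "salmon index -t gencode.v30.transcripts.fa.gz -i $INDEX --gencode"
```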
4.5 Perform quantification using Salmon
- Make sure the Salmon index folder and all of your fastq files are in the same directory, and that you are in that directory.
$ salmon quant -i gencode.v30_salmon_1.2.1 -p 8 --libType A --validateMappings --gcBias --biasSpeedSamp 5 -1 SRR1182374_1.fastq -2 SRR1182374_2.fastq -o M0_HMDM_1
$ salmon quant -i gencode.v30_salmon_1.2.1 -p 8 --libType A --validateMappings --gcBias --biasSpeedSamp 5 -1 SRR1182375_1.fastq -2 SRR1182375_2.fastq -o M0_HMDM_2
$ salmon quant -i gencode.v30_salmon_1.2.1 -p 8 --libType A --validateMappings --gcBias --biasSpeedSamp 5 -1 SRR2910663_1.fastq -2 SRR2910663_2.fastq -o M0_HMDM_3
$ salmon quant -i gencode.v30_salmon_1.2.1 -p 8 --libType A --validateMappings --gcBias --biasSpeedSamp 5 -1 SRR1182376_1.fastq -2 SRR1182376_2.fastq -o M1_HMDM_1
$ salmon quant -i gencode.v30_salmon_1.2.1 -p 8 --libType A --validateMappings --gcBias --biasSpeedSamp 5 -1 SRR1182377_1.fastq -2 SRR1182377_2.fastq -o M1_HMDM_2
$ salmon quant -i gencode.v30_salmon_1.2.1 -p 8 --libType A --validateMappings --gcBias --biasSpeedSamp 5 -1 SRR2910664_1.fastq -2 SRR2910664_2.fastq -o M1_HMDM_3
Running the commands one by one works, but with more samples we will need to use loops.
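One possible loop over the six samples is sketched here. It pairs each SRR accession with its output folder name and reuses the exact options from the commands above; the `SAMPLES` list format, the `DRY_RUN` switch, and the `run` and `quant_sample` helpers are illustrative choices. By default the commands are only printed.

```shell
#!/bin/sh
INDEX=gencode.v30_salmon_1.2.1

# Each entry is "SRR accession:output folder".
SAMPLES="SRR1182374:M0_HMDM_1 SRR1182375:M0_HMDM_2 SRR2910663:M0_HMDM_3
SRR1182376:M1_HMDM_1 SRR1182377:M1_HMDM_2 SRR2910664:M1_HMDM_3"

# DRY_RUN=1 (the default here) prints each command; set DRY_RUN=0 to execute it.
DRY_RUN=${DRY_RUN:-1}

# Print or execute a command, depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Quantify one paired-end sample: $1 = SRR accession, $2 = output folder.
quant_sample() {
    run salmon quant -i "$INDEX" -p 8 --libType A --validateMappings --gcBias \
        --biasSpeedSamp 5 -1 "${1}_1.fastq" -2 "${1}_2.fastq" -o "$2"
}

for pair in $SAMPLES; do
    srr=${pair%%:*}    # text before the colon
    out=${pair#*:}     # text after the colon
    quant_sample "$srr" "$out"
done
```

Keeping the accession-to-folder mapping in one list means that adding a sample is a one-line change.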
5. QC of the RNA-seq data using MultiQC
- Install MultiQC
$ conda install -c bioconda -c conda-forge multiqc
- Run MultiQC
$ multiqc .
MultiQC will search the current directory, so make sure you are in the directory that contains the Salmon quant output folders.
- You may read the documentation to understand how to interpret the QC data.
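Because MultiQC only reports on what it finds in the current directory, it can help to confirm first that every Salmon output folder is present and complete. The sketch below checks each folder for the `quant.sf` file that Salmon writes; `check_quant_dirs` is a hypothetical helper and the folder names are the ones used in section 4.5.

```shell
#!/bin/sh
# Report, for each folder given, whether it contains Salmon's quant.sf file.
# Succeeds only if no folder is missing the file.
check_quant_dirs() {
    missing=0
    for dir in "$@"; do
        if [ -f "$dir/quant.sf" ]; then
            echo "ok       $dir"
        else
            echo "missing  $dir"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

if check_quant_dirs M0_HMDM_1 M0_HMDM_2 M0_HMDM_3 M1_HMDM_1 M1_HMDM_2 M1_HMDM_3; then
    echo "All Salmon outputs found; ready to run: multiqc ."
else
    echo "Some Salmon outputs are missing; check the working directory."
fi
```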
6. tximport: Import Salmon’s transcript-level quantifications and aggregate them to the gene level for gene-level differential expression analysis
- From now on, everything is done in RStudio. Here are the link and the Rmd file.
- The Rmd file is modified from the workflows below, which have more detailed explanations.
https://f1000research.com/articles/7-952/v1
https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html
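Aggregating to the gene level with tximport requires a table that maps each transcript ID to its gene ID. The Rmd file may build this table inside R; as an alternative, the sketch below extracts it from the GENCODE GTF on the command line. The GTF file name, the `tx2gene.csv` output name, and the `gtf_to_tx2gene` helper are assumptions made for illustration.

```shell
#!/bin/sh
# Read GTF lines on stdin and print "transcript_id,gene_id" for every
# record whose feature type (column 3) is "transcript".
gtf_to_tx2gene() {
    awk -F '\t' '
        $3 == "transcript" {
            tx = ""; gene = ""
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                a = attrs[i]
                sub(/^ +/, "", a)    # drop the leading space after each ";"
                if (a ~ /^transcript_id /) { gsub(/^transcript_id |"/, "", a); tx = a }
                else if (a ~ /^gene_id /)  { gsub(/^gene_id |"/, "", a); gene = a }
            }
            if (tx != "" && gene != "") print tx "," gene
        }'
}

# Assumed file name for the GENCODE v30 annotation downloaded in section 4.2.
GTF=gencode.v30.annotation.gtf.gz

if [ -f "$GTF" ]; then
    # The output has no header row: column 1 = transcript ID, column 2 = gene ID.
    gzip -dc "$GTF" | gtf_to_tx2gene > tx2gene.csv
    echo "Wrote tx2gene.csv"
else
    echo "GTF not found: $GTF"
fi
```

The IDs keep their version suffixes, which should match the transcript names in the Salmon output when the index was built with the --gencode flag.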
7. Exploratory analysis and visualization, and DESeq2 analysis
- Continue using the Rmd file above, which is modified from this workflow.
- Analyzing RNA-seq data with DESeq2 vignettes.
8. Additional resources
- https://seandavi.github.io/ITR/
- Making Venn Diagram:
- Submitting data to GEO
Materials here are licensed as CC BY-NC-SA 4.0 Creative Commons License.