Reference-based RNA-seq Analysis in Galaxy, and Data Visualization in IGV and UCSC Genome Browser

By Hanrui Zhang on 12/01/2021, based on the CSHL Tutorials in Genomics and Bioinformatics: RNA-Seq and the bioinformatics.ca workflow.

The workflow in Galaxy was inspired by this material.

1. Tools Needed for this Workflow

2. Background

2.1 RNA-seq Experimental Design

2.2 Galaxy Logistics

2.3 Align to the Genome or the Transcriptome?

3. Reference-based RNA-seq Analysis in Galaxy

3.1 ANALYZING A TOY DATASET (REFERENCE):

3.1.1. Create a new history and give it a name, for example "Exercise".

3.1.2. Upload the dataset:

3.1.3. Run the FastQC tool for quality control (QC) on the FASTQ files (reads off the sequencer):

3.1.4. Trim the reads with the Trimmomatic tool to improve the quality of the dataset, for example by removing low-quality bases and clipping adapters:

3.1.5. Run FastQC again on both paired files, and compare the results with the pre-trimming FastQC output.

3.1.6. Use HISAT2 to align the reads to the human genome; the output is a BAM file.

3.1.7. Use the bamCoverage tool to create a bigWig coverage track:

Next, we move to a real dataset. We repeat the steps above and add one more: using HTSeq to count the reads for each gene.

3.2 ANALYZING A REAL DATASET:

3.2.1. Create a new history.

We will analyze a real dataset from Ldlr-/- mice, sequenced at a depth of 20M with 100 bp paired-end (PE) reads. We will use the Ensembl 102 annotation file (Nov 2020, GRCm38.p6).

3.2.2. Find, upload, and edit the GTF file (Ensembl annotation) for analysis in Galaxy

Find the Ensembl annotation

Use Awk to replace text in a specific column:
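The same kind of column edit can be sketched in Python for readers who do not have Awk at hand. This is a minimal sketch, not the exact command used in this workflow: it assumes the goal is to rewrite column 1 (the chromosome name) of the Ensembl GTF from Ensembl style ("1", "MT") to UCSC style ("chr1", "chrM"). The function name to_ucsc_style and the example lines, including the gene IDs, are made up for illustration.

```python
def to_ucsc_style(line: str) -> str:
    """Rewrite column 1 of a tab-separated GTF line (assumed Ensembl -> UCSC naming).

    Header lines (starting with '#') and blank lines are returned unchanged.
    Note: this naive rule does not produce valid UCSC names for unplaced scaffolds.
    """
    if line.startswith("#") or not line.strip():
        return line
    fields = line.rstrip("\n").split("\t")
    chrom = fields[0]
    # Assumption: the mitochondrial genome is "MT" in Ensembl and "chrM" in UCSC.
    fields[0] = "chrM" if chrom == "MT" else "chr" + chrom
    return "\t".join(fields) + "\n"


# Hypothetical GTF lines; coordinates and gene IDs are invented for this example.
example_lines = [
    "#!genome-build GRCm38.p6\n",
    '1\thavana\tgene\t1000\t2000\t.\t+\t.\tgene_id "GENE_A";\n',
    'MT\tinsdc\tgene\t1\t68\t.\t+\t.\tgene_id "GENE_B";\n',
]
converted = [to_ucsc_style(line) for line in example_lines]
for line in converted:
    print(line, end="")
```

Only the first column changes; every other field, including the attribute column, is written back exactly as it was read.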

3.2.3. Upload your own data to Galaxy

Upload your own FASTQ files via FTP

3.2.4. QC, trimming, QC again, and then alignment to generate BAM file

Group the FASTQ files that belong to the same sample
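Outside Galaxy, the grouping logic can be illustrated with a short Python sketch. It assumes a file naming scheme of the form sample_L001_R1.fastq.gz (sample name, lane, read direction); that scheme, the function name group_by_sample, and the file names below are assumptions for illustration and may not match your own files.

```python
import re
from collections import defaultdict

# Assumed naming scheme: <sample>_L<3-digit lane>_<R1|R2>[_<3 digits>].fastq[.gz]
PATTERN = re.compile(
    r"^(?P<sample>.+?)_L\d{3}_(?P<read>R[12])(?:_\d{3})?\.fastq(?:\.gz)?$"
)


def group_by_sample(filenames):
    """Return {sample: {"R1": [...], "R2": [...]}} with files sorted by name."""
    groups = defaultdict(lambda: {"R1": [], "R2": []})
    for name in sorted(filenames):
        match = PATTERN.match(name)
        if match is None:
            raise ValueError(f"Unexpected FASTQ file name: {name}")
        groups[match["sample"]][match["read"]].append(name)
    return dict(groups)


# Hypothetical file names: one sample split over two lanes, one over a single lane.
files = [
    "ctrl1_L001_R1.fastq.gz", "ctrl1_L001_R2.fastq.gz",
    "ctrl1_L002_R1.fastq.gz", "ctrl1_L002_R2.fastq.gz",
    "ko1_L001_R1.fastq.gz", "ko1_L001_R2.fastq.gz",
]
for sample, reads in group_by_sample(files).items():
    print(sample, reads)
```

Sorting the names first keeps the R1 and R2 lists in the same lane order, which matters when the paired files are later passed to a trimmer or aligner.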

Use FastQC to QC the FASTQ files

Use MultiQC to generate an aggregated report

Use Trimmomatic to trim adapter sequences

Run FastQC and MultiQC again

Use HISAT2 to perform the alignment and generate a BAM file

Visualize the BAM file in the UCSC Genome Browser or IGV

Use QualiMap BamQC to perform QC on the alignments (BAM files) and get summary statistics

3.2.5. Generate a bigWig file in Galaxy for wiggle display in the UCSC Genome Browser or IGV

3.2.6. Use htseq-count to count reads overlapping each annotated gene feature
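To make the idea of "reads overlapping each gene" concrete, here is a toy Python sketch of the counting rule. It is a deliberate simplification and not how htseq-count is implemented: each gene is treated as a single interval (no exon structure), and strand, read pairing and mapping quality are ignored. A read is assigned to a gene only if it overlaps exactly one gene; otherwise it goes to a "no feature" or "ambiguous" bin, loosely following the idea of the union mode. The function name count_reads and all coordinates are invented for the example.

```python
def count_reads(genes, reads):
    """Count reads per gene.

    genes: {gene_id: (chrom, start, end)} with 1-based inclusive coordinates.
    reads: iterable of (chrom, start, end) aligned read intervals.
    """
    counts = {gene_id: 0 for gene_id in genes}
    counts["__no_feature"] = 0
    counts["__ambiguous"] = 0
    for chrom, start, end in reads:
        # Two closed intervals overlap when each starts before the other ends.
        hits = [
            gene_id
            for gene_id, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start <= g_end and g_start <= end
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1
    return counts


# Toy annotation: two genes that overlap between positions 180 and 200.
genes = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 180, 300)}
reads = [
    ("chr1", 110, 150),  # inside geneA only
    ("chr1", 190, 195),  # inside the geneA/geneB overlap
    ("chr1", 250, 290),  # inside geneB only
    ("chr1", 400, 450),  # intergenic
    ("chr2", 110, 150),  # different chromosome
]
print(count_reads(genes, reads))
```

The per-gene counts from this step are what the differential expression analysis takes as input, which is why ambiguous and unassigned reads are tallied separately rather than split between genes.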

3.2.7. Differential expression analysis using DESeq2

Running DESeq2 in Galaxy is convenient and produces the diagnostic plots. However, our RNA-seq workflow in R allows more flexible and powerful plotting and subsetting, and the newest DESeq2 features, such as shrinkage of fold changes, have not yet been incorporated into Galaxy. We therefore use our R workflow for the DE analysis.

3.2.8. Sharing your History

4. Data Visualization in IGV and UCSC Genome Browser

Data Visualization in IGV

Share visualizations of BAM or bigWig files with others in the UCSC Genome Browser

5. Other Useful Tools and Resources for Data Interpretation after Completing the Initial Analyses

Using Ensembl and BioMart to pull out information for a gene of interest

Using Venny to draw Venn Diagrams from lists of genes

Pathway Analysis

6. Additional Notes