Reference-based RNA-seq Analysis in Galaxy, and Data Visualization in IGV and UCSC Genome Browser
By Hanrui Zhang on [12/01/2021] based on the CSHL Tutorials in Genomics and Bioinformatics: RNA-Seq and bioinformatics.ca workflow.
The workflow in Galaxy was inspired by this material.
1. Tools Needed for this Workflow
2. Background
2.1 RNA-seq Experimental Design
- Heterogenous samples (e.g. patient-derived xenograft tumor model) will need more reads. Cell lines will typically need fewer reads.
- 30-50M paired-end (PE) reads for analyzing novel transcripts, alternative splicing, heterogeneous samples; 25M PE for differential expression (DE) analysis.
- n=4 or more.
2.2 Galaxy Logistics
- Register and access: https://usegalaxy.org/
- If registered, history will stay until you delete it.
- 250GB space.
- If need to save space, can delete fastq files once obtaining the BAM file.
- People may also upload their data to Sequence Read Archive (SRA, the largest publicly available repository of high throughput sequencing data) so that data can be loaded from SRA to the galaxy.
2.3 Align to the Genome or the Transcriptome?
Genome
- Advantages: can align novel isoforms.
- Disadvantages: difficult, spurious alignments, spliced alignment, gene families, pseudogenes.
- Now typical workflow would be to align to the genome.
Transcriptome
- Advantages: easy, focused on the part of the genome that is known to be transcribed
- Disadvantages: Reads that come from novel isoforms may not align at all or may be misattributed to a known isoform.
3. Reference-based RNA-seq Analysis in Galaxy
3.1 ANALYZING A TOY DATASET (REFERENCE):
3.1.1. Create a new history, maybe name it as "Exercise".
3.1.2. Upload the dataset:
- For this exercise, we’ll use subsets of data from the Illumina BodyMap 2.0 project, from human adrenal gland tissues. The sampled reads are paired-end 50bp that map mostly to a 500Kb region of chromosome 19, positions 3-3.5 million (chr19:3000000:3500000). (source: https://usegalaxy.org/u/jeremy/p/galaxy-rna-seq-analysis-exercise)
- Import the following two FASTQ files in your user space. To do so:
- Click the Upload Data button on the upper left of the screen
- In the popup, click on the
Paste/Fetch data
button
- You can provide both URLs in the same text box
https://raw.githubusercontent.com/bioinformatics-ca/EPI_2021/master/files/adrenal_1.fastq
https://raw.githubusercontent.com/bioinformatics-ca/EPI_2021/master/files/adrenal_2.fastq
- Under Genome (set all), specify
Human Feb. 2009 (GRCh37/hg19) (hg19)
.
- Click on
Start
.
- After it finished uploading (green state), you can rename the two imported files (if desired), for better organization.
3.1.3. Running the FastQC tool for quality control (QC) on the fastq files (reads off the sequencer):
- To find it, use the search window at the top of the Tools column (left panel).
- From the
FastQC
tool interface, for the field Short read data from your current history
, choose adrenal_1.
- Click on
Execute
.
- Pay attention to the green notice, which provides details about the input and output of the job you just launched.
- Once the job is completed, examine the Webpage results from the history bar using the
Eye
icon.
- Raw output statistics are also available, and can also be seen with the
Eye
icon.
- Repeat the same operations for adrenal_2. As a shortcut, you can click on the FastQC history item, then click on the
Run this job again
icon and simply change the input file to automatically reuse the same parameters.
3.1.4. Trim the reads with Trimmomatic tool to improve the quality of the dataset by removing bad quality bases, clipping adapters and so on:
- Set the input as Paired-end (two separate input files)
- Give the adrenal1 file for direction 1, and adrenal2 for direction 2.
3.1.5. Run FastQC again on both paired files, and compare results with pre-trimming FastQC output.
- Can enable "scratchbook" to visualize FastQC reports pre- and post-trimming.
3.1.6. Use HISAT2 to align the reads to the human genome, output is a BAM file
- Search and select the HISAT2 tool
- For the
Select a reference genome
, search hg19
and select Human (Homo sapiens) (b37): hg19
.
- For
Is this a single or paired library
, select `paired-end.
- For
FASTA/G file #1
, select "(R1 paired); for FASTA/Q file #2
, select "(R2 paired).
- Click
execute
.
- The output is the BAM file (*.bam), which is the compressed binary version of a SAM file that is used to represent aligned sequences up to 128 Mb.
- Visualize the BAM file in UCSC genome browser:
- Click the HISAT2 output file name, then click
display at UCSC main
. This will open another browser window for the UCSC genome browser.
- What do you see now? Do you see the BAM file track?
- Enter
chr19:3,478,706-3,508,735
.
- Under "Custom Tracks", choose between "full" and other options to visualize the alignment.
- If you open IGV, you may also click
display with IGV local Human hg19
to visualize the data in IGV locally.
3.1.7. Use the bamCoverage tool to create a bigWig coverage track:
- Search the
bamCoverage
tool
- Select the "BAM" file.
- The output will be bigwig file
- Choose the bigwig output file, click
display at UCSC main
. Now visualize the bigwig coverage file.
Next we will move to the real datasets. We will do the above steps, and we will also add using HTSeq to count reads for each gene.
3.2 ANALYZING A REAL DATASET:
3.1.1. Create a new history.
We will analyze a real dataset from Ldlr-/- mice sequenced at 20M 100bp PE. We will use annotation file (Ensembl 102: Nov 2020 (GRCm38.p6)).
3.2.2. Find, upload, and edit the gtf file (ENSEMBL annotation) for analysis in the Galaxy
Find the ENSEMBL annotation
- Explanation about gtf file
- Go to the ENSEMBL site https://useast.ensembl.org/index.html.
- Find
Download
, select FTP download
, find the current gtf or previous release in the archived sites.
- Find the annotation needed. e.g.
for human homo_sapiens/Homo_sapiens.GRCh38.104.chr.gtf.gz
;
for mice mus_musculus/Mus_musculus.GRCm39.104.chr.gtf.gz
.
- Right-click and copy the FTP link. For this exercise, use this ftp link
http://ftp.ensembl.org/pub/release-102/gtf/mus_musculus/Mus_musculus.GRCm38.102.chr.gtf.gz
.
- Go back to Galaxy and click "upload data" and "Paste/Fetch Data".
- Enter the FTP link, make sure to change the file name that includes the annotation (GRCm38.102), the file type as "gtf", and the annotation e.g. "GRCm38B" (or other species matching with the ensemble annotation).
- Click "Start" to upload the annotation.
Use Awk to replace text in a specific column:
- The chromosome number column in the gtf file needs to show "chr1" instead of "1" in order to match the annotation in the galaxy. Check the gtf file uploaded and change with awk as stated below if needed.
- Search in the toolbox for "awk" and find "text formatting with awk"
- Select the file to process
- In "AWK Program", put
{if ($0 !~ /^#/) { print "chr"$0} else {print $0}}
- Text formatting syntax is difficult, can either find someone to help or try to search in google to see if someone has done the same thing.
3.2.3. Upload own data to the galaxy
Uploading own fastq files using ftp upload
- Make sure you download FileZilla from the correct webpage link https://filezilla-project.org/.
- Follow this guideline for using client, e.g. FileZilla to connect to the galaxy FTP server https://galaxyproject.org/ftp-upload/.
- File - my site
- New site
- provide a name, e.g. "galaxy"
- Protocol: File Transfer Protocol
- Host: usegalaxy.org
- Encryption: Require explicit FTP over TLS
- Logon Type: Ask for a password
- User: same as for log in the galaxy
- Click "Connect"
- "Accept" the agreement
- Once connected, can drag files from the local folder to the server (left to right). The bottom panel in FileZilla will show the progress of uploading.
- Once the uploading is completed, go back to the galaxy.
- In the galaxy, select "upload data" and "choose remote file" - "FTP directory". The files uploaded to the galaxy FTP site via client should appear there.
- Can also use Cyberduck (available through CUIT).
- The dataset type should be "fastqsager.gz" in order for HISAT2 in the later steps of the workflow to recognize the file. This can be changed by clicking the pencil icon either here or later step for the Trimmomatic output files. (Delphine's note: fastqsanger and fastqillumina formats follow the same convention (post ~2010 sequencing will generally))
3.2.4. QC, trimming, QC again, and then alignment to generate BAM file
Grouping fastq file for the same sample
- In history, click the checkbox for "Operations on multiple datasets".
- Check the pair of R1 and R2 file for the same sample, click "for the selected datasets" drop menu and select "build dataset pair", remember to click "swap" to list R1 first. Rename (e.g. to the sample name) and click "create list".
Use fastqc to QC fastq files
- Search "FastQC" tool
- Find “Short read data from your current history”: Click "Dataset Collection" as the input dataset. Select the dataset pair prepared in the previous step.
- No need to change anything, click "execute".
Use Multiqc to generate an aggregated report
- Multiqc is to aggregate reports together.
- Search the "MultiQC" tool and select
- For “Which tool was used to generate logs?”, select "FastQC"
- In “FastQC output”:
- For “Type of FastQC output?”, select "Raw data"
- For “FastQC output”, select "Raw data files (output of FastQC tool) – Use
Browse Datasets
to find the raw data for FastQC on fastQC read1 et select it (will turn green). Press OK.
- Execute
- Examine the MultiQC output.
Use Trimmomatics to trim adaptor reads
- Search tool "Trimmomatic"
- “Single-end or paired-end reads?”: Select "Paired-end (as collection)"
- Input FASTQ dataset collection with R1/R2 pair: select dataset e.g.
SRR5448889 (will select R1 and R2).
- Note: Options to remove adapters
- Click on Perform initial Illumina-CLIP step? Yes
- “Select standard adapter sequences or provide custom?” Standard
- “Adapter sequences to use” TruSeq3
- “Output trimmomatic log messages?” yes
- Run the tool (Execute).
Run fastqc and multiqc again
Use HISAT2 to perform alignment and generate BAM file
- Search tool "HISAT2".
- "Use a built-in genome"
- "Select a reference genome: Mouse: mm10 or Human hg38 as appropriate
- Select "Paired-end Dataset Collection"
- Choose Paired Collection "Trimmomatic on data * and data *: Paired".
- Specify strand information: make sure to select "Reverse (RF)"
- Notes: can also choose "Paired-end" then select each individual file (the trimmomatic output). The results appear to be the same.
Visualize BAM file in UCSC genome browser or IGV
- Choose the HISAT2 output
- Click "display in UCSC main" to visualize the BAM file in the UCSC genome browser
- Click "display with IGV local" to visualize in IGV, but make sure to install and open IGV first.
Use QualiMap BamQC to perform QC on alignments (bam files) and get summary statistics
- Open the "QualiMap BamQC" tool. Can select multiple HISAT2 output BAM files. Keep all other options as default and execute.
- Open the QualiMap Multi-Sample BamQC tool. Select the raw QualiMap BamQC output. For this step, need to select each output file individually and rename - can rename with the sample name.
3.2.5. Generate bigWig file in the galaxy for wiggle display in UCSC or IGV
- Search tool "bamCoverage"
- Select the BAM file used to generate bigWig: "HISAT2 on data * and data *: aligned reads (BAM)
- Select the genome: e.g. mm10
- Bin size: if 1 the bigWig file size will be big; so using default bin size at 50 should usually be good enough.
- Everything else can be default as long as the final visualization looks good.
- Execute and wait for the job to complete
- In galaxy history, find the generated output, click display in "UCSC main" to visualize in UCSC
- To overlay multiple tracks, see step 4 http://genome.ucsc.edu/inc/hgCollectionHelpInclude.html
Custome track: Can click custom tracks to find the "http" link, which is where the bigWig files are hosted on the galaxy server. The link can be shared with others to visualize the track.
Save session: Click My data - my session - set up a user account and then create a link to share the view of the same session with others.
Notes: Files can be sent from galaxy to UCSC, but uploading a bigWig file from a local drive to UCSC is not possible - the file size is too large and UCSC will not allow it. Therefore, can either use other servers or use IGV to visualize any BAM or bigWig files on the local hard drive.
3.2.6. Use htseq-count to count reads overlapping each annotated gene feature
- Select the htseq-count tool.
- Run the tool in batch mode by selecting “Multiple datasets” (middle icon) for “Aligned SAM/BAM File” (HISAT2 output)
- Select the annotation “GFF File” from the history
- Other options are by "default" but make sure "stranded" was selected.
- Execute
3.2.7. Differential expression analysis using DESeq2
It is fairly convenient to do DESeq2 analysis in the galaxy and obtain the diagnostic plotting. But since we have the RNA-seq workflow in R for more flexible and powerful plotting and subsetting, and the newest features in DESeq2, e.g. shrink of the fold change, have not been incorporated in galaxy yet, we should just use our workflow in R for DE analysis.
3.2.8. Sharing your History
- Open the History Options menu (gear icon) at the top of your history panel. Make History accessible. A Share Link will appear that you give to others.
- Anybody who has this link can view and copy your history.
Data Visualization in IGV
- Open IGV first, make sure the annotation match with the one used in the Galaxy for analysis, i.e. hg38 (the selection menu can be found at the left top corner).
- Go back to the galaxy, find the BAM file or bigWig file, choose display in IGV local. Do this one by one for all the BAM or bigWig files.
- Rename to "sample name _ coverage" and "sample name _ alignment"
- Right-click the track, select color alignment by, and select what's needed, for example, "first of pair strand".
- Can click a read to see the information. Can also right-click a read, select "blat read sequence", a pop-up window comes up, giving information where the sequence aligns, reassuring the alignment is precise.
- The color bar in the middle of the reads means mismatches, we can click the bar to see it. The mismatch could be sequencing error, alignment error, or SNP. If all reads in the position have the same mismatch, which could be PCR duplicate or sequencing error.
- Sashimi plot: right-click track and select sashimi plot, select "gene", select sample to show sashimi plot
- the number on the top of the line is the number of reads spanning splice junction. The plot helps to identify novel exon (in this case needs 30-50 M PE reads) and then a few hundred reads would be reliable to draw a conclusion.
- Can right-click to remove a track.
- Can save the plot
- Can "save session" to save the format then next time just open session and load the data.
- One can obtain anyone's bigwig file for visualization in IGV. But for UCSC, it has to be a server because the file size is too big for UCSC to allow upload locally. When the workflow used for generating the bigwig files is different the results won't be quantitatively comparable, though still good for visualization.
Share visualization of BAM or bigWig files with others in UCSC Genome Browser
5. Other Useful Tools and Resources for Data Interpretation after Completing the Initial Analyses
Using ENSEMBL and BioMart to pull out information for the gene-of-interest
ENSEMBL:
- ENSEMBL is more gene-focused. UCSC is more genome-focused with good visualization.
- ENSEMBL IDs that go beyond 20000000 will most likely be the newly annotated genes, particularly lncRNA and pseudogenes. Well-known genes have smaller numbers. Always include the ENSEMBL version and ensemble ID in the publication.
- Most people use ENSEMBL while working with human and mouse data. For other species, some have better annotation than others.
- Most people come to ENSEMBL with query needs of a specific gene(s) to learn about transcripts, protein, homology, and orthology search.
e.g. can check any genetic variants identified in the gene with scores predicting damaging variants or disease-causing variants, and the available data from gnom, etc. for disease-causing variants.
- ENSEMBL has an Archive Site so that there are stable links to data from a particular release.
- There is also a mobile site https://m.ensembl.org/index.html.
BioMart:
- Example of using BioMart -
- Query "Gene Names" to obtain IDs or vice versa
- Find all the genes within a certain genomic region
- Steps to use BioMart:
- Can use the newest version of Ensembl annotation or if needed, can scroll down to the bottom of ensemble home page and click "View in archive site"
- Choose the annotation
- Click "Filters" and enter the query items and indicate the identity, e.g. stable gene ID or gene name
- Click "Attributes" to select the information you want to know.
- Click "Results" to see the report and export as needed.
Using Venny to draw Venn Diagrams from lists of genes
- A useful tool to identify overlapped genes from multiple gene lists.
- Can use ensemble ID, the overlapped list in the Venn diagram can be selected and the list can be copied/pasted to Biomart to find the gene name.
Pathway Analysis
- The purpose is to guide future experiments and inform hypotheses, therefore the filtering strategy, ranking strategy and the number of genes to include may not matter as much.
For hypergeometric analysis, smaller gene sets and smaller input gene lists may lead to only one or two genes in the list being in the pathway still showing statistical significance. This needs to be taken into consideration and it is generally recommended to prioritize those enriched pathways/gene sets that have more genes from the foreground list.
- Gene list can be small (one or two genes) therefore looks significant but is not.
Using ENSEMBL ID for queries can be helpful to improve mapping.
String is the poor person's version of IPA.
- Should use ENSEMBL ID to query so that all the top DE genes can be mapped
It could be informative to visualize the genes in the enriched pathways because it is possible that those genes are all playing roles in a certain part of the large pathway.
DAVID is old but is getting better at being updated. It is easy to use and can visualize genes on pathway maps.
Check genes enriched in OMIM: genes associated with diseases.
6. Additional Notes
- A useful resource for track configuration: http://genome.ucsc.edu/inc/hgCollectionHelpInclude.html
- Datacamp: Many courses for learning R/Python; can purchase a subscription.
- Codeacademy has good tutorials for command-line tools.