SeqNjoy is a standalone application designed to generate and analyze differential gene expression results from RNA-Seq experiments. The analysis can start from the very initial step using raw short-read files in FASTQ format or from an intermediate alignments files in BAM/SAM format. SeqNjoy results include annotated of the genes (or other genetic features) that are differentially expressed between groups of samples under different conditions, as well as interactive HTML pages. These pages enable users to visually explore, analyze, filter and download selected genesets of interest.
This tutorial provides step-by-step guidance for analyzing RNA-Seq data with SeqNjoy. Each step explains the required input files, the configurable parameters, and the outputs generated. To enhance your understanding of the process, the tutorial also describes the structure of the input files, the significance and implications of each parameter, and key concepts critical to RNA-Seq data analysis. The ultimate goal is to empower you to perform a comprehensive RNA-Seq differential expression analysis on your own.
Before installing SeqNjoy, ensure that your computer meets the application’s minimum system requirements.
Minimum | Recommended | |
---|---|---|
Architecture | 64-bit | |
Operative System | Windows, macOS, Linux (Debian, Fedora or SUSE-based) | |
Web Browser | Chrome, Firefox, Safari, Edge | |
RAM Memory | 8 GB | 16 GB |
Disk Space | 500 GB | 1 TB |
NOTE: minimum requirements are adequate for carrying out analyses involving bacterial genomes, while recommended requirements are adequate also for eukaryotic genomes.
SeqNjoy is optimized to be used with 1920 x 1080 screen resolution (or higher).
Hard drive requirements specify the necessary storage space for project files in typical RNA-Seq analysis. SeqNjoy is installed as a standard R package, and users can store project folders on local or USB-connected hard drives of their choice.
Please disable Sleep Mode or Hibernation in your computer to avoid interruptions of long lasting processes.
A connection to the Internet is only required for downloading the necessary packages and libraries during the installation and for retrieving genome files from Ensembl.
SeqNjoy is an R package; therefore, installing R is required before using the software.
RTools is necessary to install SeqNjoy on Windows because some neccesary R packages are not available as precompiled Windows binaries and must be built from source.
Download the appropriate version of RTools for your R installation from the official website: https://cloud.r-project.org/bin/windows/Rtools/.
To use the feature “Align reads with Bowtie2” on Windows, ensure the following software is installed:
python3 --version
perl --version
Pandoc is required to build the vignettes used within the app. If you do not already have Pandoc installed:
pandoc --version
Administrator privileges are required to install R binaries.
Pandoc is required to build the vignettes used within the app. If you do not already have Pandoc installed:
Depending on the macOS distribution, developer tools and compilers such as xcode might be required. Check R macOS tools website for installation instructions.
These tools are also required to install R from source.
Installing SeqNjoy dependencies from source requires several system libraries. The package names vary depending on the Linux distribution. Ensure the appropriate libraries are installed by running the commands below.
sudo apt-get install -y \
build-essential gfortran libreadline-dev libx11-dev libxt-dev \
libbz2-dev liblzma-dev libpcre2-dev libcurl4 libcurl4-openssl-dev \
libpng-dev libjpeg-dev libxml2-dev libssl-dev librsvg2-dev
sudo yum install -y \
make gcc-c++ gcc-gfortran readline-devel libX11 libX11-devel \
libXt libXt-devel bzip2-devel pcre2-devel libcurl-devel \
libxml2-devel openssl-devel libpng-devel libjpeg-turbo-devel \
librsvg2-devel
sudo zypper install -y \
make gcc-c++ gcc-fortran readline-devel libX11-devel libXt-devel \
libbz2-devel lzma-sdk-devel pcre2-devel libcurl-devel \
openjpeg-devel libxml2-devel libopenssl-devel libjpeg8-devel \
libpng16-devel librsvg-devel
The most recent release, avalaible on GitHub, can be installed using the remotes
package.
Pandoc is required to build vignettes (see Software requirements).
You can disable this option, but some help pages will not be available.
## Install Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE)){
install.packages("BiocManager")
}
## Install remotes
if (!requireNamespace("remotes", quietly = TRUE)){
install.packages("remotes")
}
## Install SeqNjoy without vignettes
remotes::install_github("BioinfoGP/SeqNjoy", dependencies=TRUE, bioc_version = BiocManager::version())
## Install SeqNjoy with vignettes
remotes::install_github("BioinfoGP/SeqNjoy", dependencies=TRUE, build_vignettes = TRUE, bioc_version = BiocManager::version())
Installation from source can also be done using devtools. Download the latest gzipped package source code SeqNjoy_0.5.X.tar.gz available at BioinfoGP’s GitHub repository and install it from R console.
## Set your working directory to the location where the package was downloaded.
setwd("path/to/downloaded/package")
## Install remotes
if (!requireNamespace("remotes", quietly = TRUE)){
install.packages("remotes")
}
## Install Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE)){
install.packages("BiocManager")
}
# Create a temporary directory and extract the package contents
d <- tempdir()
untar("SeqNjoy_0.5.X.tar.gz", exdir=d)
# Install SeqNjoy
remotes::install(file.path(d, "SeqNjoy"), dependencies=TRUE, repos=BiocManager::repositories())
Open R, then load SeqNjoy library and launch the application. The browser will automatically start.
library(SeqNjoy)
SeqNjoy()
If the browser has not already started, or you have closed it. Open your browser and enter the URL http://127.0.0.1:7628/.
To stop SeqNjoy, click the “Exit SeqNjoy” button or press Ctrl+C in the R console.
RNA-Seq is a high-throughput sequencing technique used to detect and quantify RNA in biological samples. This provides a proxy for detecting and quantifying gene expression levels, among other applications.
The “wet lab” part of the RNA-Seq process (Figure 3.1, top) begins with the library preparation, a process involving several key steps: extraction of RNA from the samples, filtering to isolate mRNA molecules while depleting other RNA species (e.g., ribosomal RNA), reverse transcription of mRNA into more stable cDNA molecules, fragmentation and size selection to produce smaller nucleic acid fragments (easier to handle), and sequencing these fragments to generate short reads.
The sequencing machine (“sequencer”) reads the nucleotide sequences corresponding to the mRNA transcripts, producing millions of short reads. These reads are stored in FASTQ files, typically one file per sample or biological replicate in your experiment. These FASTQ files are the starting point for RNA-Seq data analysis—and you can carry out this analysis using SeqNjoy!
Figure 3.1 provides a schematic overview of the standard RNA-Seq pipeline for analyzing mRNA expression in biological samples. The upper section outlines the experimental protocol, while the yellow-highlighted area below illustrates the main steps for processing and analyzing the RNA-Seq data from the sequenced samples.
Figure 3.1: Schematic overview of the standard RNA-Seq pipeline for mRNA expression analysis
In brief, the bioinformatic analysis of a RNA-Seq experiment involves three consecutive steps:
SeqNjoy utilizes user-friendly interfaces that allow you to perform a complete RNA-Seq expression gene data analysis in a short number of steps. If you have never performed this kind of analysis, we recommend reading RNA-Seq in short section as a starting point to get familiarized with the overall process.
Flowchart below (Figure 4.1) summarizes how SeqNjoy addresses the different parts of the analysis:
Figure 4.1: Flowchart of SeqNjoy
The Homepage serves as the entry point to the application (located in the upper-left part of the flowchart). This initial page is where you create your projects. A project acts as a container for a single study and includes all the files required for an RNA-Seq data analysis.
The files associated with a project include:
- Input files, such as short reads files, gene feature files, genome reference files, and functional annotations.
- Output files generated during the analysis steps, including alignment files, counts files, and lists of differentially expressed genes resulting from sample comparisons.
From the list of created projects displayed on the Homepage, you can select any project to access its Project’s Main Page, which serves as the central hub of the application. All functionalities are initiated from this page.
From the Project’s Main Page, you can:
Just as a recipe requires initial ingredients to prepare a dish, SeqNjoy requires specific input files to begin processing your project. Depending on the available data, you can follow one of the two workflows described below:
This is the standard input for a complete RNA-Seq data analysis.
If you start from short reads files (FASTQ format), the analysis process comprises all the steps mentioned before: Quality Control, Preprocess Reads, Align Reads, Quantify Genes and Compare Results, as described in the Figure 4.2, that illustrates the SeqNjoy workflow, showing the dependencies between input and output files at each step. All of them are launched from the Project’s Main Page.
Figure 4.2: Overview of the SeqNjoy workflow, showing the analysis steps (right), the types of files used (center), and the utilities available for file inspection (left).
In the case of starting with alignment files SAM/BAM, the workflow is similar, except the Align Reads step is skipped.
The RNA-Seq data analysis proceeds from top to bottom, with the output files from one step becoming the input for the next.
SeqNjoy also provides Utilities to inspect or analyze specific types of files or the quality report (Figure 4.3). These tools can help you better understand the data generated at different stages of the analysis.
Figure 4.3: Utilities in SeqNjoy for file inspection/analysis
In the following sections, we provide destailed description of: - Each step of the analysis - The utilities available in SeqNjoy - The characteristics and use of the files involved For the FIESTA 2.0 graphical viewer, a specific tutorial is available here.
Once you launch SeqNjoy, the homepage welcomes you with a brief summary and key information about the application.
Figure 5.1: Homepage
Before creating a project, you must select a Project Folder, a directory on your computer or external drive where all new projects will be stored. Click the Projects Folder button to select or create this directory.
A project in SeqNjoy acts as a container for a single RNA-Seq study, including both the input and the generated files. Input files include FASTQ files, genome reference files, gene feature files, and functional annotations. As the analysis progresses, the project also stores generated files such as quality reports, alignment files, count tables, and lists of differentially expressed genes.
Each project is uniquely identified by a project name, which you provide in the text box at the bottom of the homepage (Figure 5.2).
Figure 5.2: Select Projects Folder and then Create a New Project by writing a project’s name. The project will be created in the selected Projects Folder
After clicking the Create New Project button, you will be directed to the Project’s Main Page, where you can begin your analysis(see Main Page section).
Once you have one or more projects created, the Available Projects section of the Homepage displays a list of your projects along their creation dates (Figure 5.3).
Figure 5.3: List of started projects
For each project, you can either open the project it by clicking the “play” icon , which takes you to the Main Page to start or continue your analysis, or delete the project it by clicking the “basket” icon
, which permanently removes the project along with all its associated files from your system.
⚠ IMPORTANT: Deleting a project permanently removes all input and result files, making the project impossible to recover. This action CANNOT be undone.
Once the project has been created, you can start uploading your input files. First, open the project by clicking the “play” icon in the Homepage and access the Project’s Main Page of the application (Figure 6.1).
Figure 6.1: Project’s Main page
Now, click the Load Files button and select in your computer one or several files of the allowed file types. SeqNjoy supports the following file types and extensions:
File type | File Formats | Allowed File Extensions |
---|---|---|
Genome sequences | FASTA | fa, fa.gz, fasta, fasta.gz, fna, fna.gz |
Genes features | GFF3, GTF | gff, gff.gz, gtf, gtf.gz, gff3, gff3.gz |
Short reads sequences | FASTQ | fastq, fastq.gz, fq, fq.gz |
Alignments | BAM, SAM | bam, sam |
Gene annotations | ANN (text table) | ann, ann.txt, ann.tsv, ann.tab, ann.xls, ann.xlsx |
If you are not familiarized with any of these file types, go to the section File Types to know more about the structure of the format and the function and details of each file type.
Figure 6.2 shows the application loading a number of FASTQ files:
Figure 6.2: Uploading files in SeqNjoy
The pop-up window reports the status of file uploading and post-processing by means of a progress bar and informative messages describing the step being performed. When all is OK, you should see a window with an “all in green” aspect like this (Figure 6.3):
Figure 6.3: Uploading process successful
Take into account that in the case of some large and/or complex files the upload may take a considerable amount of time.
Genome files include two types: (1) a genome reference in FASTA format and (2) a genomic features file (GFF3 or GTF) containing information about chromosomes, genes, CDS, exons, etc.
If you already have a FASTA and a GTF/GFF file, you can upload them following the steps described in the previous section. Otherwise, you can download genome files directly from the Ensembl repositories. Ensembl provides access to a large number of genomes, focusing on the most biologically and medically relevant species.
To download genomes from Ensembl, click the Download Genomes button of the Main Page and then select Download Genomes from Ensembl (Figure 7.1):
Figure 7.1: Selecting the Download Genomes option from the Main Page.
A new window will open where you need to select the taxonomic group of your organism (Figure 7.2).
Important: To access the repository, you must be connected to the Internet, and Ensembl (for vertebrates) or Ensembl Genomes (for bacteria, fungi, metazoa, plants and protists) must be accessible and functional. Ensure that the window displays the message “Online - release <number>” next to the repository name; otherwise, you will not be able to proceed with the download using this option.
Figure 7.2: Taxonomic group selection
After selecting the taxonomic group, choose the specific genome from the dropdown menu (Figure 7.3).
Figure 7.3: Choosing the genome of interest from the dropdown menu.
Once you have selected the species/strain genome, click Check Availability Genome (Figure 7.4) to view the files of that organism available for download. Tipically, you will find the reference genome file (FASTA) and the genomic features file in both common formats, GTF and GFF.
Figure 7.4: Checking the availability of genome files before downloading. The application will list available FASTA and GTF/GFF files
To proceed with the download, select the FASTA file and at least one of genomic features files (GTF or GFF), then click the Download button (Figure 7.5).
Figure 7.5: Selecting genome files: reference genome (FASTA) and genomic features file (GTF/GFF). In this case, a GTF file is selected
Once the download is complete, the genome files will appear in File List on the “Project’s Main Page” (Figure 7.6).
Figure 7.6: Once downloaded, the genome files appear in the File List on the Project’s Main Page along with the rest of loaded files
After you have uploaded your input files, your Project’s Main Page shows the list of the loaded files. The list offers different information and a number of options to operate on them. From the Project’s Main Page you can also perform the successive steps of your RNA-Seq gene expression data analysis project.
This is the aspect of the Project’s Main Page after a standard RNA-Seq data analysis project has been performed (Figure 8.1):
Figure 8.1: Project’s Main Page
The Project’s Main Page is structured as follows.
In the top area the following functionalities are available:
The rest of the Project’s Main Page is dedicated to the File List. The File List area contains information about all the files loaded by the user or generated by the processing with SeqNjoy, and provides a number of operations that the user can run on one or more selected files. The File List comprises the following buttons and columns, described from the left to the right:
Select/Unselect: A column of check boxes. When you check one or more boxes, the corresponding file(s) is/are marked as selected. If you click a checked box, the file(s) returns to unselected status. In case you want to select all the files of the list, you can do in a more convenient way by clicking the uppermost check box (it also serves to “unselect all”).
There are two possible operations you can execute with the selected files:
Both operations, Download and Delete, are also available on a file-by-file basis in the two rightmost columns of the File List.
Show buttons: You can filter the File List to hide or show the files having a particular file type(s). For example, if you click a colored “bam” button , all bam files are hidden from the File List and the bam button becomes grayish . Click again the grayish icon to return your bam files to the list. You can click multiple Show buttons to show/hide several file types at one time. Note that only file types with no members selected can be hidden.
The numbers accompanying the file type on the button indicate the total number of existing files of that type and how many of them have been selected via the checkboxes (see Select/Unselect above).
Type: It informs you of the type of the file. Refer the File Types in SeqNjoy section to know more about the file types and their characteristics.
Details: This column contains diverse information about the file and provides access to additional utilities for some file types.
File Types | Operation | Tool | Icon |
---|---|---|---|
html | Quality Control Report Visualization | fastp | ![]() |
fasta, bam | Genomic Browser | IGV | ![]() |
gff | Genome features structure | Rgff | ![]() |
diffexp | Visualize, plot and filter gene expression results | FIESTA 2.0 | ![]() |
Quality control (QC) of sequencing reads is a crucial step in NGS data analysis. It allows users to assess the quality of their sequencing data before proceeding with downstream analyses. QC reports provide insights into potential issues such as low-quality bases, adapter contamination, and sequence biases. Based on these reports, users can decide whether their reads are suitable for analysis as they are or if preprocessing steps, such as trimming or filtering, are necessary. While QC is not mandatory, it is highly recommended to ensure data reliability.
The QC process is conducted using the software tools Biostrings (H Pagès, 204C.E.) and fastp (Chen et al., 2018).
To perform quality control, navigate to the Project’s Main Page and click the Quality Control button. This opens a new window with a menu containing a single option: Quality Control with fastp (Figure 9.1).
Figure 9.1: Option for Quality Control of the read files (FASTQ)
Clicking this option displays the QC interface, where users can select the FASTQ files to analyze and specify the number of reads to include in the QC process (Figure 9.2).
Figure 9.2: Quality Control window
The QC interface provides fields for selecting one or multiple FASTQ files:
Additionally, users must set the Sample Reads for QC parameter. By default, this is set to 1,000,000, meaning the first million reads (or read pairs) will be analyzed. Setting this value to 0 will process all reads in the selected FASTQ file(s).
Once the desired files and parameters are selected, click the Launch button to start the QC process.
When the QC process completes, the corresponding QC icon to the left of each analyzed FASTQ file (or paired-end set) will be enabled (it was previously grayed out). Clicking this icon opens the QC report, where users can review metrics and visualizations related to read quality. For a detailed breakdown of the Quality Control Report’s content, refer to section Quality Control Report.
After reviewing the QC results, users can decide whether preprocessing steps, such as trimming or filtering, are necessary to improve read quality before proceeding with further analyses.
Typically, after performing Quality Control (QC) and inspect the QC report, users may identify issues in their reads that can be corrected through preprocessing. Common problems include the presence of adapter sequences, low-quality regions, and an excessive number of undetermined bases (N). To mitigate these, a preprocessing step can be applied to remove adapter fragments contamination, trim low-quality bases, and discard entire reads that fail to meet quality or length thresholds.
The application uses fastp (Chen et al., 2018) to preprocess FASTQ sequencing files, applying various corrections based on user-defined parameters.
To access this operation, navigate to the Project’s Main Page and click the Preprocess Reads button. This opens a new window containing a single menu option: Preprocess Reads with fastp (Figure 10.1).
Figure 10.1: Option for preprocessing reads
Selecting this option launches the preprocessing setup interface (Figure 10.2).
Figure 10.2: Preprocess Reads window
In this interface, users must select the FASTQ files they wish to preprocess and configure the correction parameters. As with Quality Control, users can select one or more FASTQ files. If the input files are paired-end, selecting one file will automatically populate its mate file.
On the right side of the preprocessing interface, users can configure several parameters. The default parameters are set to provide reasonable results for most datasets, effectively addressing common issues. If you are unsure about which values to use, it is recommended to keep the defaults, as they will correct most typical sequencing errors without over-filtering your data.
Adapter trimming: When enabled, removes adapter sequences from reads. These are unwanted fragments introduced during sequencing library preparation.
Keep reads of ‘n’ nucleotides or longer: Retains only reads with a minimun length of ‘n’ nucleotides after trimming (default: 15 nucleotides).
Allow up to ‘n’ N per read: Reads with more than this number of ambiguous bases (‘N’) are discarded (default: 5 N’s).
Discard reads with low complexity: When enabled, removes reads with low complexity. A low-complexity region in the genome is an area where the nucleotide sequence follows a repetitive or predictable pattern with little variability. These regions contain limited combinations of bases and lack the diversity that characterizes most functional DNA sequences. Due to their repetitive nature, they can cause issues in bioinformatics analyses, affecting sequence alignment and the accurate detection of variants.
Trim poly-X tails: When enabled, this option removes homopolymeric tails (stretches of the same nucleotide, such as poly-A or poly-T) that are longer than 10 nucleotides. These artifacts can arise during library preparation or sequencing and may introduce biases in downstream analyses.
Trim ‘n’ nucleotides from the 3’ end of read 1: Removes ‘n’ nucleotides from the 3’ end of all reads in the “reads_1.fastq” file (default: 0, no trimming). The same option is available for trimming the 3’ end of reads in the “reads_2.fastq” file.
Trim ‘n’ nucleotides from the 5’ end of read 1: Removes ‘n’ nucleotides from the 5’ end of all reads in the “reads_1.fastq” file (default: 0, no trimming). The same option is available for trimming the 5’ end of reads in the “reads_2.fastq” file.
Additionally, users must define an Output Prefix, which will be incorporated into the filename of the processed FASTQ files to distinguish them from the originals.
Once the FASTQ files, parameters, and output prefix are set, clicking the Launch button will process the files according to the specified adjustments.
The output consists of cleaned FASTQ files, which will appear in the File List on the Project’s Main Page. These quality-improved FASTQ files can be reassessed using a new Quality Control step if needed. Once assessed and, if necessary, preprocessed, the resulting FASTQ files serve as input for the Align Reads step.
Click the “Align Reads” button of Main Page to access this window. For performing this task, you inform a genome sequence (FASTA) file and the corresponding short reads files (FASTQ files) and obtain files with the alignments (BAM formatted) of the short reads contained in the FASTQ files to the reference genome (FASTA file).
There are two different aligners available to perform this task, and you should choose one of them in the submenu that is displayed after clicking the Align Reads button (Figure 11.1): bowtie2 (Langmead & Salzberg, 2012) or hisat2 (Kim et al., 2015). Both of them are well-documented and widely used programs.Figure 11.1: Align Reads: selecting alignment method (aligner)
After choosing any of them you access a window where you have to fill the input FASTQ files and alignment parameters that are necessary to complete the task.
Figure 11.2: Window for the reads alignment. Input files are the genome reference (FASTA file) and the short reads files (FASTQ files). The shaded areas contains bowtie-specific or hisat2-specific parameters and therefore are differents according to the selected aligner.
The Align Reads to Genomes window has the following parts (see Figure 11.2):
Select Genome sequence: select here one FASTA file among the listed in the drop-down menu. Usually, you select the genome reference of the species/variety the sequenced samples come from (or a close one), but you can select any uploaded genome reference that could be more convenient in your particular study.
Select FASTQ files: Below the Genome Sequence you have a grid of drop-down menus arranged in rows and columns. To fill this part in, you must know that there are two modalities of RNA-Seq named single-end and paired-ends sequencing. Refer to the corresponding section to know about them if you need it.
In “Fastq 1” column you select: a) all your single-end FASTQ files, if your FASTQ file are of the single-end type or b) the files corresponding to the first member of the pair in the case of paired-ends FASTQ files (normally, the filename of these “read 1” files contains the suffix “_1” or similar).
“Fastq 2” column only applies in the case of paired-end sequencing. In that case, you have to put here the second member of the pair (normally, files with the suffix “_2” or similar in their filesnames). This second member of the pair must be selected side-by-side with the first ("_1“) member that was selected in”Fastq 1" column (as in Figure 11.2). Do not select any file in this column if all the FASTQ files in your experiment are of “single-end” type.
In addition, you have to fill in a Sample label for each Fastq 1 (single-end) or Fastq 1 plus Fastq 2 (paired-end) item you fill in. This Sample label will name the resulting alignment (BAM) file.
At the end, the aspect of this area will be similar to one of the displayed in the Figure 11.2.
Output files suffix (optional): The text you type in this box will be added to the name of all the output files (BAM files) of this step, immediately after the sample label and before the .bam extension, that is, “Sample_name.suffix.bam” (Example: “Sample1.aligned_with_bowtie2.bam”, “Sample2.aligned_with_bowtie2.bam”, etc.). If no “output files suffix” is typed, the BAM output files will be named simply with the sample name followed by the “.bam” extension (i.e. [Sample_name].bam).
Alignment parameters: The shaded areas of the window contain the parameters that are used by the aligner of your choice, bowtie2 or hisat2. Some options are the present irrespective of the aligner you choose:
Figure 11.3: FASTQC Plot representing the quality per base sequence on a FASTQ file with quality drop in the 3’ extreme.
With the Trim reads from 5’ end parameter you can set the number of nucleotides to be removed from the reads of the file at the 5’ end (i.e. “on the left”). With Trim reads from 5’ end you can set the number of nucleotides to be removed from the reads of the file at the 3’ end (i.e. “on the right”).
Max CPU Cores: bowtie2 and hisat2 support parallel processing, that is, it can use multiple CPU cores. Multi-core is enabled by default to use (up to) 4 cores. You can modify this value in the text box with a numeric value greater or equal to 1. If your computer has less CPU cores than the number typed here, the aligner program will use the maximum your computer has.
Remove duplicated alignments: In RNA-Seq protocols (especially if a PCR amplification step is included), there is a certain probability of introducing artificial generation of duplicated reads, that is, reads of the same size aligning exactly in the same chunk of the genome (same start and end positions in the genome, Figure 11.4). The presence of duplicated reads above a certain number may be an indication of the existence of such unwanted technical artifacts during the sequencing process. The application provides this parameter to remove reads that are duplicated a number of times equal of greater to a value (50 by default).
Figure 11.4: Duplicated reads are represented in red color
Specific options for bowtie2.
We use the R package Rbowtie2 (Wei et al., 2018) to align with bowtie2 (Langmead & Salzberg, 2012).Performance: Alignment methods search a reasonable trade-off between speed and sensitivity/accuracy in their results in order to reduce false detected alignments without missing too many true alignments. In bowtie2 you can adjust this balance by setting one of the following values the “performance” option: very-sensitive, sensitive (default value), fast or very-fast. The default value, sensitive, keeps a good balance between sensitivity and precision. In the case you need more precision (at the cost of a slower performance) choose very-sensitive. Conversely, choose fast or very-fast to prioritize speed performance (at the cost of a lesser precision).
Mode: By default, bowtie2 aligns “end-to-end” mode, meaning that it attempts to align the entire read without omitting characters at either extreme. The alternative, local mode, does not require that reads align from one extreme to the other, that is, it can yield alignments that do not include one (or both) of their extremes. Observe in the example of Figure 11.5 that in "local" mode the extremes are not aligned while they are both aligned in "end-to-end" mode.
Figure 11.5: ‘end-to-end’ and ‘local’ modes.
Specific options for hisat2.
We use the R package Rhisat2 (Soneson, 2021) to align with hisat2 (Kim et al., 2015).
All the functionalities of these hisat2-specific parameters are linked to the fact that hisat2 is a “splice-aware” aligner. In practical terms, that means that it uses the information deduced and/or provided about the genomic locations of the splice sites and introns in order to improve the precision of the predicted alignments. Of course, this not applies for intronless genomes. By informing these parameters the alignment may be slightly more accurate (if you do not inform them, hisat2 will do his best anyway.)
Once you have filled the form in with the required files and set your parameter values, the “Launch!” button turns from disabled (grey) to enabled (red) so you can click it and launch the alignment(s). A pop-up window appears informing you of the progress of the alignment task. When the process ends OK a message “done” is displayed in the pop-up window and a “Close” button is enabled. By clicking “Close” you return to the “Align Reads to Genome” pane. Now, you could make a new Align Reads to Genome round or click the “X” icon for closing the “Align Reads to Genome” window. You will then return to the “Project’s Main Page”, which will have been updated to contain the new alignment (.bam) files in its File List.
Figure 11.6: Align Reads to Genome dialog ready to launch the alignments processes.
Click the “Quantify Genes” button in the “Project’s Main Page” to access this window (Figure 12.1).
Figure 12.1: Quantification window
In summary, you inform here a genome features file (GFF) file and a number of alignment files in BAM. Normally, these alignments have been generated previously in the “Align Reads” step. Alternatively, the BAM files were generated elsewhere and the user uploaded to SeqNjoy.
The output of the quantification is one “counts” file per BAM alignment file. Each “counts” file contains a list with the “counts” , i.e. the number of “reads” or “fragments” that have been mapped to each gene (or other feature type that you select) of the genome.
Genome Features: Select one GFF3 or GTF file from the list of loaded GFF files displayed in the drop-down menu.
Output file suffix (optional): you can type in this box a text chunk that will be added to the name of all the output files of this step (“counts” files), immediately before the .counts extension, that is, “bam_name.suffix.counts” (Example: “Sample1.exons_by_genes.bam”, “Sample2.exons_by_genes.bam”, etc.). If no output file suffix is typed, the resulting files will be named similarly to the corresponding BAM (“sample_name.counts”).
Alignments: In this column of selectors you select the samples (their corresponding BAM alignment files) from which you want to obtain the quantification of counts for one (or several) genomic feature(s). At least one BAM file must be selected.
The following parameters are used by the tool that performs the quantification task, the function featureCounts of the R package Rsubread (Liao et al., 2019):
If you have any doubt, there is a specific section on strand specificity in RNA-Seq in this Tutorial. No default value is set, as it depends on how the sequencing of your reads has been made. If you finally don’t know which strandedness protocol was followed in the sequencing, select unstranded as it is the safest option.
Primary only: A read may map ambiguously to multiple locations, e.g. repeated or closely similar regions in the genome. In that cases, only one of the multiple read alignments is considered “primary”, that is, “the best alignment” based on diverse criteria. All other alignments are marked as “secondary”. By default (box checked) the quantification will count only the alignments marked as “primary alignments” and discard the “secondary”. If not checked, all alignments will be counted. In Figure 12.2, if "primary only" is checked, only the alignment tagged as primary will be counted in feature A for read R001. If not, all the three alignments will be counted: one count for feature A, one for feature B and one for feature C.
Figure 12.2: Primary and secondary alignments.
Genomic features and blocks: The GFF file selected has a number of feature types that you can quantify by. For example, you may wish to get the list of reads that align (counts) per each gene. For that, in the drop-down menu “Select Feature” you should select the “feature type” gene among the existing in your GFF file. Some features, like gene, are constructs that have constituent sub-features underneath. One key sub-feature related to gene expression is exon. In principle, it is more accurately (specially if your GFF is well annotated) to take into account for quantification not all the reads aligning the gene span but only those reads aligning within the (block of) exons of the gene (Figure 12.3).
Figure 12.3: Genomic features and blocks for computing counts in ‘GeneX’. When only the feature ‘gene’ is selected, all reads aligning in the GeneX span are counted (13 reads). When you add the block ‘exon’ to the feature ‘gene’ you are imposing the restriction that only the reads inside exons are counted, so the counts for GeneX are 10.
Add more button: Click to set additional feature types to quantify. For example, if the GFF file contains “genes” and also “pseudogenes” you may want quantify both.
Once you have filled the form with the required files and set your parameter values, the “Launch!” button turns from disabled (grey) to enabled (red) so you can click it and launch the quantification. In Figure 12.4, the quantification process will be launched to count genes for nine samples, with reverse strandedness mode. Only primary alignments will be counted. The output of the quantification is a number of counts files equal to the number of input BAM files, that is, nine in this case.
Figure 12.4: Quantification dialog ready to be launched
A pop-up window appears informing you of the progress of the quantification task. When the process ends OK a message “done” is displayed in the pop-up window and a “Close” button is enabled. By clicking “Close” you return to the Quantify Genomic Features pane. Now, you could make a new Quantify Genomic Features round or click the “X” icon for closing the window. You will then return to the “Project’s Main Page”, that it will have been updated to contain the new “counts” files.
In this step you can perform an statistical analysis of the differences in the expression for a given set of features between two groups of samples. You can use it repeatedly to make more that one binary comparison (e.g. “SampleGroupDrugA versus SampleGroupDrugB”, and then “SampleGroupDrugA versus SampleGroupPlacebo” and finally “SampleGroupDrugB versus SampleGroupPlacebo”).
SeqNjoy offers you two possible methods to perform this task, both of them widely used and well documented: DESeq2 (Love et al., 2014) and edgeR (Robinson et al., 2009). If you click the Compare Samples button a submenu listing both options is opened and you can select any of them (Figure 13.1) .Figure 13.1: Compare Samples: selecting the method for calculating differential expression
After selecting the desired option, the Calculate Differential Expression dialog is displayed (Figure 13.2). Most of this window is the same regardless of the selected software. Only some specific parameters located in the shaded area of the window are different and specific of the selected method (DESeq2 or edgeR) to perform the differential expression calculations.
Figure 13.2: Calculate Differential Expression (Compare Samples) dialog as displayed initially.
The following fields are present in the Calculate Differential Expression window irrespective of the selected method:
Specific options for DESeq2 (Love et al., 2014):
Independent Filtering: The default (checked box) is to perform independent filtering. The purpose of independent filtering is to run a process, subsequent to and independent of the particular statistical tests performed by DESeq2, in order to remove from the results those that have no, or little chance of showing significant evidence, what typically increases the statistical/detection power. The simplest criterion to do the filtering is using the mean of all the normalized counts for the evaluated gene. In the box is unchecked, this additional filtering is disabled. More information in this link
Use beta prior The default (checked box) is to use beta prior, which is equivalent to perform a “Log fold change (LFC) shrinkage”, that is, to shrink the LFC estimates of genes with little information or high dispersion toward more likely (lower) LFC estimates. Figure 13.3 illustrates the effect of using (right) or not (left) beta prior (PC) on the shrinkage of LFC. Observe how using PC a shrinkage on the LFC (“Log2Ratio”, in the Y-axis) is performed. This shrinkage affects those genes that have low expression (low “Log2BaseMean”, X-axis). Red and green dots represent genes upregulated and downregulated, respectively, with p-adjusted (FDR) < 0.05.
Figure 13.3: Effect of beta prior on the shrinkage of LFC.
Use beta prior procedure when your data have a high number of low-expressed genes (low number of counts) and/or a high number of genes showing high difference of expression within a condition (high dispersion in the counts values). You can find additional explanations in this link.
Specific options for edgeR (Robinson et al., 2009):
Prior Count (optional): This argument is the counterpart in edgeR to the Use beta prior argument in DESeq2. When checked, you can type in the text box a positive number that will be used in the calculations of differential expression as a factor to shrink Log Fold-Changes (LFC). The default is 0.125, a value that produce a minimal LFC shrinkage. Larger values of prior count will produce more shrinkage. Figure 13.4 shows the effect of different values of Prior Count (PC) on the shrinkage of LFC. Observe how the higher the value of PC the larger the shrinkage of LFC (“Log2Ratio”, in the Y-axis). Observe also how the shrinkage affects only the those genes with low expression (low “Log2BaseMean”, X-axis). Red and green dots represent genes upregulated and downregulated, respectively, with p-adjusted (FDR) < 0.05.
Figure 13.4: Effect of different values of Prior Count on the shrinkage of LFC.
Once you have filled the form with the required files and set your parameters values, the Launch! button turns from disabled (grey) to enabled (red) so you can press it and launch the differential expression calculation. In the figure 13.5 differential expression calculation between the Case ("pH5_CO2") group of samples and the Control ("pH5") group of samples is displayed ready to be launched (Launch! button enabled). In the example, an annotations file has been provided so the information contained in this file will be incorporated to the diffexp file with the differential expression calculations.
Figure 13.5: Calculate Differential Expression dialog ready to be launched.
After launching the process, a pop-up window informs you the progress of the “differential expression” task. When the process ends OK the message “done” is displayed in the pop-up window and a close button is enabled. By clicking the Close button you return to the Calculate Differential Expression pane. Now, you can fill in again the form with a new comparison between groups of samples to calculate the differential expression between them or click the “X” icon for closing the window. You will then return to the “Project’s Main Page”, that it will have been updated to contain the newly created diffexp files in Excel format (extension .xlsx) containing the results of the calculated differential expressions of the comparisons performed.
The “diffexp” output file (syntax: [CaseLabel]vs[ControlLabel].[OutputSuffix_if_typed].xlsx) contains the list of differentially expressed genes (DEG) detected between the “Case” and the “Control” groups of samples. Check in the File Types section the information content of the diffexp files. You can graphically inspect and filter the differentially expressed genes results using the FIESTA 2.0 analytic tool.
In addition to the pipeline to perform the RNA-Seq gene expression data analysis, SeqNjoy provides a number utilities for evaluating and inspecting the content of some of the types of files involved in the analysis. These utilities are launched from buttons located in the File List.
After running Quality Control (QC) on paired-end FASTQ files, a detailed quality report is generated. This report provides key insights into the sequencing data, allowing users to assess whether further preprocessing steps, such as trimming or filtering, are necessary.
This report is generated using fastp (Chen et al., 2018), a widely used tool for FASTQ quality assessment and preprocessing.
To see an example of the Quality Control (QC) report generated for paired-end reads, click here. This example illustrates the types of quality metrics available.
This section summarizes fundamental metrics for each read file (R1 and R2), including:
A boxplot displays the quality scores across read positions. This allows users to identify potential quality drops at the ends of reads, which might require trimming.
A graphical representation of base frequency (A, T, C, G) at each position. Unexpected deviations may indicate sequencing artifacts.
A histogram showing the distribution of overall read quality. A large proportion of low-quality reads may suggest sequencing issues.
This section provides a graphical representation of detected adapter contamination, which may require trimming in the preprocessing step.
Displays the proportion of duplicate reads, which can indicate over-sequencing or PCR amplification bias.
Once the QC process is completed, a gray icon next to the analyzed FASTQ files in the File List on the Project’s Main Page will become highlighted. Clicking this icon opens the Quality Control Report.
Based on the report, users can determine whether preprocessing steps, such as adapter trimming or low-quality read filtering, should be applied before proceeding with alignment.
To access this utility from a GFF (GFF3 or GTF formatted) of the File List click the icon located rightmost in the Details column.
When launched, SeqNjoy will run the plot_features function of the Rgff R package (Garcia-Martin et al., 2022) to generate a picture containing a tree representing the hierarchical structure (dependencies) of the feature types included in the selected GFF file (Figure 14.1). This tree representation is useful for a global look of what is annotated in the genome and it can help you decide which feature(s) and blocks to use in the Quantify Genes step of the analysis.
Figure 14.1: A hierarchical tree of a genome feature file (GFF) produced by Rgff
As it is shown in the example tree above, the highest feature type in this GFF is gene and there are 7127 items annotated as genes. gene has a child node, the transcript feature. Included into transcript there are five child features, that are siblings from each other: CDS, exon, five_prime_UTR, start_codon and stop_codon.
With this information, we know that we could count genes, exons-by-gene, transcript, exons-by-transcript, exons, etc. in our analysis but not “exon-by-CDS” for example, as exon is not a descendant of the CDS feature.
You can access this utility from a FASTA file containing the genomic sequence or from an alignment file (BAM), by clicking the icon located on the right of the Details column of the corresponding file. The option is not available for user-provided BAMs.
After clicking the button, the widely used “Integrative Genomic Viewer” (IGV) (Robinson et al., 2011, 2022) tool will be launched and will load the corresponding FASTA or BAM file. If you have loaded a FASTA, a screen similar to the one in Figure 14.2 will be displayed.
Figure 14.2: IGV launched with a FASTA file as input
If you use a BAM file with IGV a similar window is displayed. In addition to the BAM file, the associated genomic sequence (FASTA file) is automatically incorporated to IGV in the same session. See in Figure 14.3 the initial aspect displayed in this case and a minimal “use case” for IGV to visually compare the alignment files of two samples (i.e. two BAM files) for a particular gene.
Figure 14.3: A simple usage example of IGV to compare two BAM files in a certain genomic region.
IGV has a plethora of functionalities and possibilities to interact with your alignment files. Please refer the IGV user general guide, to take full advantage of this versatile tool. In addition, you can inspect your files with the web/on-line version that IGV provides at https://igv.org/app/, documented at https://igvteam.github.io/igv-webapp/.
FIESTA 2.0 (Oliveros, 2022) is an interactive viewer for differential expression results files that is integrated in SeqNjoy. It provides multiple ways of representing your data as tunable interactive plots that can be downloaded as high resolution images. In addition, FIESTA 2.0 allows you sorting and filtering genes by one or several columns and saving the relevant ones into files.
Figure 14.4 illustrates a usage example for FIESTA 2.0. When a diffexp file is loaded into FIESTA 2.0, a default representation of the expression data as a MA-plot is displayed. From this point, you can combine filters (top) to select gene subsets. A very common (and useful) filter combination is to select and highlight with user-selected colours genes (center) that show to be significantly (P-adjusted < 0.05) under- (negative Log2Ratio) or overexpressed (positive Log2Ratio). Finally, user can select any pair of numerical columns present in the diffexp file to be represented (bottom), so other popular representations like Volcano plots are also available. User-generated plots and subsets of relevant genes are both downloadable (in the case of the plots, with high resolution).
Figure 14.4: FIESTA 2.0 usage example.
To access FIESTA 2.0 viewer from a file containing the differential expression results (“diffexp” files in .xlsx format) click the icon located rightmost in the Details column.
More details about FIESTA 2.0 here.
SeqNjoy uses up to seven types of files in the analyses. The files that you provide have these types: FASTA, FASTQ, GFF3/GTF and ANN. Other types are only held by files generated through the SeqNjoy analysis: COUNT, DIFFEXP. Finally, the format BAM/SAM is that of the alignment files, which can be either internally generated by SeqNjoy or provided (loaded) by the user.
To follow, we describe the aspect, source, uses, format requirements and content of each file type. You can find a good complement to know more about them in this link by Andrew Severin and also in this page.
Figure 15.1: FASTA file format
Content: Each FASTQ file contains information of millions of so-called short reads, or small genomic sequences that have been “read” by a sequencer machine in a sample. FASTQ is also the name of the format of these files, and is kind of a extension of the FASTA format (it has also a header and lines with genomic sequences).
A FASTQ file uses four lines per short read (Figure 15.2):
Figure 15.2: FASTQ file format
Content: A GFF3 or GTF file contains information for the so-called genomic features. A feature is a genomic region with some involvement in a biological process. For example, the mRNA “EDEN.1” (mRNA00001) of the gene “EDEN” (gene00001) in the Figure 15.3 is a feature. In the same figure, that specific feature is described in Line 3 of the exemplified GFF file:
Figure 15.3: Canonical representation of the representation of a single gene in GFF3. Adapted from the GFF3 specifications
The Figure 15.3 depicts how a gene and its features composition (mRNAs, exons, CDSs) are represented in a GFF3. After some header lines (starting with “#”), each feature is written in a line. You can see that each feature has information about its location (start, end, strand) and feature type (gene, mRNA, exon, etc.). A key point is the last column, where relations of dependency between the features are coded. For example, the feature “mRNA00001” (with feature type “mRNA” ) in Line 3 depends on (note its “Parent” tag) another feature named “gene00001” (aka EDEN) in Line 2, which has the feature type “gene”. SeqNjoy has a specific utility for displaying the feature structure of your GFF3/GTF file in a graphical way.
More formally, each line of a GFF3 (or GTF) file has 9 tab-separated columns:
Column 1: seqID (e.g. chromosome-id: “chr1” or “1”, “chr2”, etc..). IMPORTANT: These ids must be the same as the used in the FASTA file. See in the example below (Figure 15.4) how only the GFF3 in green has matching chromosomes ids (chr1, chr2…) with those in the FASTA file displayed.
Figure 15.4: GFF and FASTA chromosome ids must match.
GFF3 is currently the preferred format to represent Gene Features, but it is not uncommon to find gene feature files in GTF format. GFF3 provides more flexibility to represent the hierarchy of the features in a genome but GTF has some advantages in certain cases. SeqNjoy can process files in GFF3 or GTF formats. In both cases, columns 1 to 8 are the same. The 9th column has a different syntax in GFF3 and GTF (Figure 15.5 )
Figure 15.5: Comparison of GTF and GFF3 files for the representation of the features of the same gene
Important: For SeqNjoy, GFF files must contain at least one “gene” in the feature type column.
For more detailed information about GFF formats refer to this, this or this link.
Content: BAM/SAM files follows a format named “SAM format” (Sequence Alignment/Map). This format consists on tab-limited columns that store alignments of the short reads from the FASTQ file that have been mapped to the reference genome (your FASTA file).
A BAM is a binary file and therefore it is not directly readable with an editor (you can use IGV to inspect its content). Getting the uncompressed/text version of a BAM, a SAM file, does not solve normally the problem, as SAM files are very large and any try to browse it with a common editor will most likely end in a crash of the editor after a while.
In brief, a BAM/SAM file contains a complete information per line of each alignment: read identification, chromosome location, alignment type (single-end or paired-ends), mapping quality, presence of insertions and deletions in the alignment, insert size, etc.
You can find detailed information about the SAM format in the SAM Format Specification.
Figure 15.6: Counts file example
Content: The “ANN” format of an annotations file is very simple (Figure 15.7). The first line is the header line and has the column names. The first column of the file must contain gene identifiers. These identifiers must match (in type, not necessarily in number) those present in the counts files. The rest of columns may contain different biological information related to the genes, e.g. gene description, gene synonyms, biological processes it is involved in, molecular functions, cell compartments, expressed tissues, biological pathways, interactions, orthologous genes, literature links, identifiers to other databases, etc.
Figure 15.7: Annotations file format
It is not necessary that all genes present in the counts files are also present in the annotations file, nor is it necessary that all genes present in your annotations file are present in the counts files.
Figure 15.8: diffexp file format.
This is the meaning of the columns present in the diffexp files (Figure 15.8):
Andrews, S., Krueger, F., Segonds-Pichon, A., Biggins, L., Krueger, C., & Wingett, S. (2012). FastQC. Babraham Institute.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x
Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). Fastp: An ultra-fast all-in-one fastq preprocessor. Bioinformatics, 34(17), i884–i890. https://doi.org/10.1093/bioinformatics/bty560
Garcia-Martin, J. A., Oliveros, J. C., & Torres-Perez, R. (2022). Rgff: R utilities for gff and gtf files. National Center for Biotechnology, Spanish National Research Council (CNB-CSIC). https://cran.r-project.org/web/packages/Rgff
H Pagès, R. G., P Aboyoun. (204C.E.). Biostrings: Efficient manipulation of biological strings. https://bioconductor.org/packages/Biostrings.
Kim, D., Langmead, B., & Salzberg, S. L. (2015). HISAT: A fast spliced aligner with low memory requirements. Nature Methods, 12(4), 357–360. https://doi.org/10.1038/nmeth.3317
Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with bowtie 2. Nature Methods, 9(4), 357–359. https://doi.org/10.1038/nmeth.1923
Liao, Y., Smyth, G. K., & Shi, W. (2019). The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Research, 47(8), e47. https://doi.org/10.1093/nar/gkz114
Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12). https://doi.org/10.1186/s13059-014-0550-8
Oliveros, J. C. (2022). FIESTA 2.0: Interactive scatter plots in your browser. National Center for Biotechnology, Spanish National Research Council (CNB-CSIC). https://bioinfogp.cnb.csic.es/tools/fiesta2
Pomaznoy, M., Sethi, A., Greenbaum, J., & Peters, B. (2019). Identifying inaccuracies in gene expression estimates from unstranded RNA-seq data. Scientific Reports, 9(1). https://doi.org/10.1038/s41598-019-52584-w
Robinson, J. T., Thorvaldsdóttir, H., Turner, D., & Mesirov, J. P. (2022). Igv.js: An embeddable javascript implementation of the integrative genomics viewer (igv). bioRxiv. https://doi.org/10.1101/2020.05.03.075499
Robinson, J. T., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E. S., Getz, G., & Mesirov, J. P. (2011). Integrative genomics viewer. Nature Biotechnology, 29(1), 24–26. https://doi.org/10.1038/nbt.1754
Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2009). edgeR: A bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139–140. https://doi.org/10.1093/bioinformatics/btp616
Soneson, C. (2021). Rhisat2: R wrapper for hisat2 aligner. https://github.com/fmicompbio/Rhisat2
Wei, Z., Zhang, W., Fang, H., Li, Y., & Wang, X. (2018). EsATAC: An easy-to-use systematic pipeline for atac-seq data analysis. Bioinformatics. https://doi.org/10.1093/bioinformatics/bty141
After the “fragmentation” step, in RNA-Seq protocols fragments ~300 to 500 bp long are produced. Even this size range is too long to be sequenced entirely by the type of sequencers that generate short reads. What is done then is to read only the end part of the fragment, which produce short reads of 100-200 bases (this interval may vary according to the experimental design, the machine used, etc.) The rest of the fragment is not sequenced.
Now, there are two modalities in RNA-Seq, single-end and paired-ends (Figure 15.9):
Figure 15.9: Single-end and Paired-ends reads: a global view
Single-end RNA-Seq (SE, aka single-read RNA-Seq): Is the simplest mode. Only one end of the fragment is sequenced. All the short reads sequences generated for a sample with single-end (SE) technique are collected in a single FASTQ file.
Paired-end RNA-Seq mode (PE): Two reads per fragment are produced. In this mode, the reading starts at one end, finishes this direction at a pre-specified read length (e.g. 150 bases) and a “read 1” is produced as a result. Then, it starts another round of reading from the opposite end of the fragment, finishes this new (opposite) direction at the same read length and finally a second read, “read 2”, is produced. At the end, both ends of the fragment are sequenced leaving the inner portion of the fragment unsequenced.
In PE sequencing, read 1 reads (those generated first-in-the-pair) and read 2 reads (generated secondly) are stored in two separated FASTQ files. One file contains all the read 1 reads the other file contains all the read 2 reads. Each of these files has a label in its name that identifies whether it contains read 1 or read 2 reads, something like Sample1_blah_blah_1.fastq and Sample1_blah_blah_2.fastq (or something similar like Sample1_blah_blah_R1.fastq and Sample1_blah_blah_R2.fastq, etc.) Note that usually the only difference in the filenames between two FASTQ paired-ends files of the same sample is just this "_1" or "_2"-like particles.
In addition, "_1" and "_2" FASTQ files of a particular sample have both the same number of reads. Moreover, they kept the same order of reads (see Figure 15.9): at a given position X in both files, the reads in “_1” and “_2” correspond to the same fragment. In fact, as with the files, the read identifier that is given to the “read 1” and “read 2” reads of the same fragment is the same… except again for the particle indicating whether it is for reads 1 or reads 2.
To know how is the orientation of the reads in SE and PE is useful to interpret correctly the alignments of the reads and to understand important topics like the unstranded or stranded protocols in RNA-Seq.
The sequencing always synthesizes the reads in a 5’ -> 3 direction. (It is the way the DNA polymerase proceeds) so the template strand is traversed 3’ -> 5’.
In single-end, the reads extend in the 5’ -> 3’ direction along the forward (plus) DNA strand (Figure 15.10).
Figure 15.10: Single-end read orientation
In paired-ends, the read 1 read extends in the 5’ –> 3’ direction towards read 2 along the forward (plus) DNA strand. As for the read 2 read, it extends in the 5’ –> 3’ direction towards read 1 along the reverse (minus) DNA strand. Read 1 is the reverse complement of the first 3’-most bases of the forward strand (therefore identical to the first 5’-most bases of the reverse strand) and read 2 is the reverse complement of the first 3’-most bases of the reverse strand (therefore identical to the first 5’-most bases of the forward strand). See the read 1 and read 2 orientations represented in Figure 15.11:
Figure 15.11: Paired-ends FR reads orientation
Therefore the sequences of the two reads of a pair are pointing towards each other on opposite strands. When you align them to a genome, one read should align to the forward strand, and the other should align to the reverse strand, at a higher base pair position than the first one so that they are pointed towards one another. This is known as an “FR” read pair or “forward/reverse”, as this is the order in which they are placed. A read pair having this orientation is called a “proper pair”. There are other possible orientations, but we don’t need to go into them here.
When you use IGV genomic browser you can see this orientation in the case of a read pair of a PE sequencing. Take into account that in IGV only the forward (plus) strand of the genome is displayed. With the “Color by” -> “read strand” option the read in the forward strand is colored red and the read of the reverse strand is colored blue. Observe in the figure above a read pair meeting the FR orientation. They are even more visible if you select “View in Pairs” option (Figure 15.12).
Figure 15.12: Paired-ends FR reads orientation as seen in IGV.
Both single-end (SE) and paired-ends (PE) can be used for measuring gene expression. PE offers the advantage over the SE that the distance covered by the R1 and R2 and the unsequenced sequence in-between is controlled on average (it is named “insert size” and is usually in an approximate range 200-600 bp) so the aligners can take advantage of this information to be more accurate in the correct alignment of the reads to the reference genome. The point is that a read pair properly (FR) oriented, covers a longer span in the genome (read_1 + inner_distance + read_2, Figure 15.13) and therefore a stronger evidence of being well aligned than a single read.
Figure 15.13: Insert size in Paired-ends reads
The main disadvantage of the PE protocol is that it is normally more expensive and time-consuming to generate than single-end reads. On the other hand, just for counting reads or calculate differential expression may be unnecessary to have such a high level of accuracy, especially if your species is well-annotated. But take into account that in the future you might want to reuse the reads intended to study differential expression to, for example, extend the study to look for alternative splicing. In that case, you will need to have done PE sequencing as SE is insufficient for alternative splicing analysis.
During transcription, one of the two DNA strands serves as a template for synthesizing the RNA molecule. Since the template strand is reverse complementary to the transcribed RNA, it is referred to as the antisense strand, negative-sense strand, non-coding strand, or minus strand.
Conversely, the non-template strand has the same sequence as the transcript (except for T instead of U) and is known as the sense strand, positive-sense strand, coding strand, or plus strand (Figure 15.14).
Figure 15.14: Representation of sense and antisense strands in transcription
It is important to note that either DNA strand can contain genes, meaning that the designation of sense and antisense is gene-specific (Figure 15.15):
Figure 15.15: Sense and antisense strand assignment depends on the gene
RNA-Seq protocols generate single-stranded mRNA molecules that are then converted into double-stranded complementary DNA (cDNA). These cDNA molecules are subsequently amplified, producing millions of short sequencing reads. As a result, for any given transcript, the sequencing reads fall into two categories:
Initially, most RNA-Seq library preparation protocols were non-strand-specific (also called unstranded). In these protocols, sequences from both the sense strand (the cDNA strand encoding the transcript) and the antisense strand (the reverse complement of the original mRNA) are sequenced without preserving strand information.
This means that in the resulting FASTQ files, _1.fastq
and _2.fastq
contain a mixture of sense and antisense reads, making it impossible to determine whether a given read originated from the sense or antisense strand of the original mRNA.
Figure 15.16: Comparison of unstranded and stranded RNA-Seq protocols
When assigning reads to genes, it is ideal to match all reads derived from the sense strand to a sense gene in the reference genome and all reads from the antisense strand to an antisense gene. However, in an unstranded protocol, this is not possible—some antisense reads may be incorrectly assigned to sense genes, and vice versa. This can introduce biases in gene expression quantification.
Now, consider the stranded protocol illustrated on the right in Figure 15.16. Here, the second-strand cDNA fragment (the sense strand) is prevented from amplification. This results in only the first-strand cDNA (antisense strand) being amplified. Consequently, strand information is retained in the sequencing data:
_1.fastq
file contains antisense (reverse) reads, as they are sequenced first. The _2.fastq
file contains sense (direct) reads, sequenced second._1.fastq
, while the antisense (reverse) reads are stored in _2.fastq
.In stranded RNA-Seq protocols, strand information is preserved by selectively amplifying only one strand of cDNA. This distinction directly impacts how reads are stored in the FASTQ files and determines whether a given read corresponds to the sense or antisense strand of the original RNA transcript. Figure 15.17 illustrates how sequencing reads are distributed across FASTQ files depending on the protocol used.
Figure 15.17: Unstranded and stranded protocols and FASTQ files.
_1.fastq
or _2.fastq
, making strand determination impossible._1.fastq
._2.fastq
.By ensuring that all sequencing reads carry strand information, stranded protocols prevent incorrect assignment of antisense reads and eliminate potential biases in gene quantification.
Several techniques exist for generating stranded RNA-Seq libraries. One of the most widely used involves chemical modification with dUTP. This method incorporates dUTP specifically into the non-sense strand of the cDNA, preventing its amplification during PCR (15.16, right). This approach is known as reverse-stranded RNA-Seq, since the reverse (antisense) strand is amplified and sequenced first.
Alternatively, some stranded protocols selectively deplete the sense strand instead of the antisense strand, resulting in a direct-stranded RNA-Seq protocol. In this case, the sense (direct) strand is amplified and sequenced first, while the antisense strand is discarded.
When your provider delivers the short-read files (FASTQ files) from your RNA-Seq experiment, it is recommended to check the sequencing report to determine which library preparation method was used: unstranded (non-stranded), direct-stranded, or reverse-stranded. If this information is not explicitly mentioned, request it from the provider, as it is crucial for accurate transcript quantification (counts). Once you have identified the protocol type, select the appropriate option in the Strand Specificity selector of the Quantify Genomic Features window.
However, the protocol may not always be listed under these exact names. For example, some reports may refer to the protocol as “directional RNA-Seq,” which is equivalent to “stranded RNA-Seq.” In other cases, only the name of the commercial kit used for library preparation might be provided. The following table serves as a reference to help interpret these variations:
Protocol | Synonym | Examples/Kits1 | Description |
---|---|---|---|
Unstranded | Non-stranded | Standard Illumina, TruSeq RNA Sample Prep Kit, NuGEN OvationV2, SMARTer Universal Low Input (TaKara), GDC normalized TCGA data | Strand information is not retained (lost). |
Reverse-stranded | First-strand | dUTP, NSR, NNSR; TruSeq Stranded (Total RNA/mRNA), NEB Ultra Directional, Agilent SureSelect Strand-Specific | The second read (read 2) corresponds to the original RNA strand, while the first read (read 1) aligns to the opposite (reverse) strand. |
Direct-stranded | Second-strand | Directional Illumina (Ligation), Standard SOLiD, ScriptSeq v2, SMARTer Stranded Total RNA, NuGEN Encore Complete, NuGEN SoLo, Illumina ScriptSeq | The first read (read 1) corresponds to the original (sense) RNA strand, while the second read (read 2) aligns to the opposite strand. |
1 The information in this column was compiled from this, this, and this source.
Regardless of the reported protocol, we recommend verifying the strandedness yourself by following an approach like the one described in the next section.
In some cases, you may not have access to information about the strandedness of your RNA-Seq data. This can happen if you obtained your FASTQ or BAM files from an online repository that lacks metadata or if the dataset is poorly documented.
Fortunately, there are strategies to infer strandedness from the data itself. One such strategy involves using the IGV (Robinson et al., 2011, 2022) genomic browser.
To analyze strandedness, you need to load the following files into IGV:
- Reference genome (FASTA format)
- Gene annotation file (GFF3/GTF format)
- BAM file(s) containing the RNA-Seq alignments
Once loaded, you can use IGV’s Color by Read Strand option to visualize the alignment patterns.
As described in the Reads Orientation section, paired-end reads typically align in an FR conformation (Forward-Reverse), where:
- Read 1 aligns to the forward (+) DNA strand, and
- Read 2 aligns downstream on the reverse (-) DNA strand (Figure 15.12).
IGV allows you to visualize read orientation using the Color by Read Strand option:
- Forward strand reads are colored red.
- Reverse strand reads are colored blue.
Additionally, paired-end sequencing assigns each read to one of two FASTQ files:
- “reads_1.fastq” contains all Read 1 sequences.
- “reads_2.fastq” contains all Read 2 sequences.
Now, let’s examine how this information can be used to infer strandedness.
By inspecting only Read 1 (reads_1.fastq
), you can infer the strandedness of your data:
reads_1.fastq
contains a mix of red and blue reads, meaning some align to the forward strand and others to the reverse strand.reads_1.fastq
contains only forward-aligned reads (red) when mapping to genes on the forward strand of the genome.reads_1.fastq
contains only reverse-aligned reads (blue) for genes on the forward strand.IGV provides another useful visualization tool clicking the track’s gear menu of IGV and selecting Color by → First-of-Pair Strand in the BAM track (Figure 15.18).
Figure 15.18: option ‘first in pair’
When enabled, both reads of a pair are colored based on the strand to which Read 1 aligns:
- If Read 1 aligns to the forward strand, the entire pair is red.
- If Read 1 aligns to the reverse strand, the entire pair is blue.
This visualization allows for a quick assessment of whether Read 1 is consistently aligning with a particular strand, which is crucial for determining the strandedness protocol.
By left-clicking a read in IGV, you can access additional alignment details (Figure 15.19). One key piece of information is the Pair Orientation, which can be either F2R1 or F1R2:
Figure 15.19: Pair orientation in IGV
reads_1.fastq
, while Read 2 aligns to the reverse strand (blue) and comes from reads_2.fastq
.reads_1.fastq
, while Read 2 aligns to the forward strand (red) and comes from reads_2.fastq
.These orientations help confirm whether the sequencing protocol follows a direct stranded or reverse stranded approach.
Using the tools above, you can now determine the library preparation method:
reads_1.fastq
.Figure 15.20: Direct stranded: read pairs in red for genes in forward (+) strand and in blue for genes in reverse (-) strand.
Figure 15.21: Reverse stranded: read pairs in blue for genes in forward (+) strand and in red for genes in reverse (-) strand.
By comparing your observations with these expected patterns, you can confidently determine whether your RNA-Seq library was prepared using an unstranded, direct stranded, or reverse stranded protocol.
To verify that you have grasped the concept, try to determine which strandedness protocols were used for BAM-1 and BAM-2 in Figure 15.22. Remember that the reads have been colored using the “First-of-Pair” option in IGV.
Figure 15.22: What are the strandedness protocols in BAM-1 and in BAM-2?
Final remark: If you are still unable to determine strandedness, the safest option is to select “unstranded” during analysis. This ensures that no reads are mistakenly discarded.
A study by (Pomaznoy et al., 2019) found that ignoring strand information can significantly affect gene expression estimates that “10% of all genes and 2.5% of protein-coding genes show a two-fold or higher difference when strandedness is ignored.”
The answer for the question about the strandedness protocols in Figure 15.22 is that:
This confirms that BAM-1 was sequenced using a direct stranded RNA-Seq protocol, while BAM-2 lacks strand specificity.
SeqNjoy: Complete RNA-Seq workflow in your Desktop. Developed by BioinfoGP CNB-CSIC (2022)