SeqNjoy: Complete RNA-Seq workflows in your Desktop

Tutorial

Juan Antonio García-Martín, Rafael Torres-Pérez, Juan Carlos Oliveros

SeqNjoy | BioinfoGP | CNB | CSIC

1 Introduction

SeqNjoy is a standalone application designed to generate and analyze differential gene expression results from RNA-Seq experiments. The analysis can start from the very initial step using raw short-read files in FASTQ format or from an intermediate alignments files in BAM/SAM format. SeqNjoy results include annotated of the genes (or other genetic features) that are differentially expressed between groups of samples under different conditions, as well as interactive HTML pages. These pages enable users to visually explore, analyze, filter and download selected genesets of interest.

This tutorial provides step-by-step guidance for analyzing RNA-Seq data with SeqNjoy. Each step explains the required input files, the configurable parameters, and the outputs generated. To enhance your understanding of the process, the tutorial also describes the structure of the input files, the significance and implications of each parameter, and key concepts critical to RNA-Seq data analysis. The ultimate goal is to empower you to perform a comprehensive RNA-Seq differential expression analysis on your own.

Before installing SeqNjoy, ensure that your computer meets the application’s minimum system requirements.

2 Computer requirements and installation

2.1 Computer requirements

Minimum Recommended
Architecture 64-bit
Operative System Windows, macOS, Linux (Debian, Fedora or SUSE-based)
Web Browser Chrome, Firefox, Safari, Edge
RAM Memory 8 GB 16 GB
Disk Space 500 GB 1 TB

NOTE: minimum requirements are adequate for carrying out analyses involving bacterial genomes, while recommended requirements are adequate also for eukaryotic genomes.

SeqNjoy is optimized to be used with 1920 x 1080 screen resolution (or higher).

Hard drive requirements specify the necessary storage space for project files in typical RNA-Seq analysis. SeqNjoy is installed as a standard R package, and users can store project folders on local or USB-connected hard drives of their choice.

Please disable Sleep Mode or Hibernation in your computer to avoid interruptions of long lasting processes.

A connection to the Internet is only required for downloading the necessary packages and libraries during the installation and for retrieving genome files from Ensembl.

2.2 Software requirements

SeqNjoy is an R package; therefore, installing R is required before using the software.

2.2.1 Windows

2.2.1.1 Rtools on Windows

RTools is necessary to install SeqNjoy on Windows because some neccesary R packages are not available as precompiled Windows binaries and must be built from source.

Download the appropriate version of RTools for your R installation from the official website: https://cloud.r-project.org/bin/windows/Rtools/.

2.2.1.2 Bowtie2 aligner

To use the feature “Align reads with Bowtie2” on Windows, ensure the following software is installed:

  1. Python 3
    • Python 3 is required for genome indexing.
    • If Python 3 is not already installed:
      • Recommended Option: Download and install Python 3 from the official website: https://www.python.org/downloads/.
      • Ensure that you check the option to Add Python to PATH during installation.
    • Alternative Option: Install Python 3 from the Microsoft Store.
    • After installing Python, verify its installation by running the following command in a terminal:
        python3 --version
    • If python3 does not work and instead opens the Microsoft Store, you may need to adjust your system configuration:
      • Check App Execution Aliases settings: Open Manage App Execution Aliases in the Control panel and disable the python3 alias for the Microsoft Store.
      • Ensure that the installation path for Python is correctly added to your PATH environment variable.
  2. Perl
    • Perl is required for sequence aligning.
    • You can download and install Perl from Strawberry Perl or another Perl distribution for Windows.
    • After installing Perl, verify its installation by running the following command in a terminal:
        perl --version

2.2.1.3 Pandoc

Pandoc is required to build the vignettes used within the app. If you do not already have Pandoc installed:

  • Download Pandoc from the official website: https://pandoc.org/installing.html.
  • Follow the installation instructions for Windows.
  • After installation, verify it by running the following command in a terminal:
    pandoc --version

2.2.2 macOS

Administrator privileges are required to install R binaries.

2.2.2.1 Pandoc

Pandoc is required to build the vignettes used within the app. If you do not already have Pandoc installed:

2.2.2.2 Source compilers

Depending on the macOS distribution, developer tools and compilers such as xcode might be required. Check R macOS tools website for installation instructions.

These tools are also required to install R from source.

2.2.3 Linux

Installing SeqNjoy dependencies from source requires several system libraries. The package names vary depending on the Linux distribution. Ensure the appropriate libraries are installed by running the commands below.

2.2.3.1 Debian-based Distributions (e.g., Debian, Ubuntu)

    sudo apt-get install -y \
        build-essential gfortran libreadline-dev libx11-dev libxt-dev \
        libbz2-dev liblzma-dev libpcre2-dev libcurl4 libcurl4-openssl-dev \
        libpng-dev libjpeg-dev libxml2-dev libssl-dev librsvg2-dev

2.2.3.2 Fedora-based (e.g., Fedora, Red Hat, Rocky Linux)

    sudo yum install -y \
        make gcc-c++ gcc-gfortran readline-devel libX11 libX11-devel \
        libXt libXt-devel bzip2-devel pcre2-devel libcurl-devel \
        libxml2-devel openssl-devel libpng-devel libjpeg-turbo-devel \
        librsvg2-devel

2.2.3.3 SUSE -based (e.g., openSUSE, SUSE Linux Enterprise)

    sudo zypper install -y  \
        make gcc-c++ gcc-fortran readline-devel libX11-devel libXt-devel \
        libbz2-devel lzma-sdk-devel pcre2-devel libcurl-devel \
        openjpeg-devel libxml2-devel libopenssl-devel libjpeg8-devel \
        libpng16-devel librsvg-devel

2.3 Installation

2.3.1 From GitHub

The most recent release, avalaible on GitHub, can be installed using the remotes package. Pandoc is required to build vignettes (see Software requirements). You can disable this option, but some help pages will not be available.


    ## Install Bioconductor
    if (!requireNamespace("BiocManager", quietly = TRUE)){
        install.packages("BiocManager")
    }

    ## Install remotes
    if (!requireNamespace("remotes", quietly = TRUE)){
        install.packages("remotes")
    }

    ## Install SeqNjoy without vignettes
    remotes::install_github("BioinfoGP/SeqNjoy", dependencies=TRUE, bioc_version = BiocManager::version())

    ## Install SeqNjoy with vignettes
    remotes::install_github("BioinfoGP/SeqNjoy", dependencies=TRUE, build_vignettes = TRUE, bioc_version = BiocManager::version())

2.3.2 From Source

Installation from source can also be done using devtools. Download the latest gzipped package source code SeqNjoy_0.5.X.tar.gz available at BioinfoGP’s GitHub repository and install it from R console.

    ## Set your working directory to the location where the package was downloaded. 
    setwd("path/to/downloaded/package")

    ## Install remotes
    if (!requireNamespace("remotes", quietly = TRUE)){
        install.packages("remotes")
    }

    ## Install Bioconductor
    if (!requireNamespace("BiocManager", quietly = TRUE)){
        install.packages("BiocManager")
    }

    # Create a temporary directory and extract the package contents
    d <- tempdir()
    untar("SeqNjoy_0.5.X.tar.gz", exdir=d)

    # Install SeqNjoy
    remotes::install(file.path(d, "SeqNjoy"), dependencies=TRUE, repos=BiocManager::repositories())

2.4 Starting SeqNjoy

Open R, then load SeqNjoy library and launch the application. The browser will automatically start.

    library(SeqNjoy)

    SeqNjoy()

If the browser has not already started, or you have closed it. Open your browser and enter the URL http://127.0.0.1:7628/.

To stop SeqNjoy, click the “Exit SeqNjoy” button or press Ctrl+C in the R console.

3 RNA-Seq in short

RNA-Seq is a high-throughput sequencing technique used to detect and quantify RNA in biological samples. This provides a proxy for detecting and quantifying gene expression levels, among other applications.

The “wet lab” part of the RNA-Seq process (Figure 3.1, top) begins with the library preparation, a process involving several key steps: extraction of RNA from the samples, filtering to isolate mRNA molecules while depleting other RNA species (e.g., ribosomal RNA), reverse transcription of mRNA into more stable cDNA molecules, fragmentation and size selection to produce smaller nucleic acid fragments (easier to handle), and sequencing these fragments to generate short reads.

The sequencing machine (“sequencer”) reads the nucleotide sequences corresponding to the mRNA transcripts, producing millions of short reads. These reads are stored in FASTQ files, typically one file per sample or biological replicate in your experiment. These FASTQ files are the starting point for RNA-Seq data analysis—and you can carry out this analysis using SeqNjoy!

Figure 3.1 provides a schematic overview of the standard RNA-Seq pipeline for analyzing mRNA expression in biological samples. The upper section outlines the experimental protocol, while the yellow-highlighted area below illustrates the main steps for processing and analyzing the RNA-Seq data from the sequenced samples.

Schematic overview of the standard RNA-Seq pipeline for mRNA expression analysis

Figure 3.1: Schematic overview of the standard RNA-Seq pipeline for mRNA expression analysis

In brief, the bioinformatic analysis of a RNA-Seq experiment involves three consecutive steps:

  1. Quality control and preprocessing of short reads: Before alignment, it is essential to assess the quality of the raw sequencing data in FASTQ format. A quality control (QC) analysis is performed to detect potential issues such as adapter contamination, low-quality bases, or short reads. If necessary, a preprocessing step is carried out to clean the data, which includes trimming adapters, filtering out low-quality reads, and discarding excessively short sequences. This step ensures that only high-quality reads are retained for downstream analysis.
  2. Alignment to the reference genome: Once the reads have passed quality control, they are aligned to a FASTA file containing the reference genome. This process generates BAM files containing the aligned reads, with one BAM file produced for each FASTQ file or pair of FASTQ files, depending on the sequencing type (single-end or paired-end RNA-Seq).
  3. Gene quantification: The aligned reads in each BAM file are mapped to a GTF or GFF file, which contains the start and end positions of all annotated genes. For each gene, the total number of reads mapped to it is counted. The result is a “counts file” for each BAM file. A counts file includes a list of all genes in the genome and the corresponding number of reads (“counts”) mapped to each gene.
  4. Samples comparison / Differential expression calculation: The counts files are grouped into two sets: one for the “Control” samples and another for the “Case” samples. This step statistically compares the expression of each gene between the two groups to identify genes with significant differential expression and quantify the magnitude of these differences. The output is a table containing the results of the differential expression analysis for each gene.

4 SeqNjoy framework in short

SeqNjoy utilizes user-friendly interfaces that allow you to perform a complete RNA-Seq expression gene data analysis in a short number of steps. If you have never performed this kind of analysis, we recommend reading RNA-Seq in short section as a starting point to get familiarized with the overall process.

Flowchart below (Figure 4.1) summarizes how SeqNjoy addresses the different parts of the analysis:

Flowchart of SeqNjoy

Figure 4.1: Flowchart of SeqNjoy

The Homepage serves as the entry point to the application (located in the upper-left part of the flowchart). This initial page is where you create your projects. A project acts as a container for a single study and includes all the files required for an RNA-Seq data analysis.

The files associated with a project include:
- Input files, such as short reads files, gene feature files, genome reference files, and functional annotations.
- Output files generated during the analysis steps, including alignment files, counts files, and lists of differentially expressed genes resulting from sample comparisons.

From the list of created projects displayed on the Homepage, you can select any project to access its Project’s Main Page, which serves as the central hub of the application. All functionalities are initiated from this page.

From the Project’s Main Page, you can:


Just as a recipe requires initial ingredients to prepare a dish, SeqNjoy requires specific input files to begin processing your project. Depending on the available data, you can follow one of the two workflows described below:

  1. From Short Reads (FASTQ format):

This is the standard input for a complete RNA-Seq data analysis.

  1. From Alignment Files (SAM or BAM format):

If you start from short reads files (FASTQ format), the analysis process comprises all the steps mentioned before: Quality Control, Preprocess Reads, Align Reads, Quantify Genes and Compare Results, as described in the Figure 4.2, that illustrates the SeqNjoy workflow, showing the dependencies between input and output files at each step. All of them are launched from the Project’s Main Page.

Overview of the SeqNjoy workflow, showing the analysis steps (right), the types of files used (center), and the utilities available for file inspection (left).

Figure 4.2: Overview of the SeqNjoy workflow, showing the analysis steps (right), the types of files used (center), and the utilities available for file inspection (left).

In the case of starting with alignment files SAM/BAM, the workflow is similar, except the Align Reads step is skipped.

The RNA-Seq data analysis proceeds from top to bottom, with the output files from one step becoming the input for the next.


SeqNjoy also provides Utilities to inspect or analyze specific types of files or the quality report (Figure 4.3). These tools can help you better understand the data generated at different stages of the analysis.

Utilities in SeqNjoy for file inspection/analysis

Figure 4.3: Utilities in SeqNjoy for file inspection/analysis


In the following sections, we provide destailed description of: - Each step of the analysis - The utilities available in SeqNjoy - The characteristics and use of the files involved For the FIESTA 2.0 graphical viewer, a specific tutorial is available here.

5 First steps in SeqNjoy: creating and managing projects

Once you launch SeqNjoy, the homepage welcomes you with a brief summary and key information about the application.

Homepage

Figure 5.1: Homepage

Before creating a project, you must select a Project Folder, a directory on your computer or external drive where all new projects will be stored. Click the Projects Folder button to select or create this directory.

A project in SeqNjoy acts as a container for a single RNA-Seq study, including both the input and the generated files. Input files include FASTQ files, genome reference files, gene feature files, and functional annotations. As the analysis progresses, the project also stores generated files such as quality reports, alignment files, count tables, and lists of differentially expressed genes.

Each project is uniquely identified by a project name, which you provide in the text box at the bottom of the homepage (Figure 5.2).

Select Projects Folder and then Create a New Project by writing a project's name. The project will be created in the selected Projects Folder

Figure 5.2: Select Projects Folder and then Create a New Project by writing a project’s name. The project will be created in the selected Projects Folder

After clicking the Create New Project button, you will be directed to the Project’s Main Page, where you can begin your analysis(see Main Page section).

Once you have one or more projects created, the Available Projects section of the Homepage displays a list of your projects along their creation dates (Figure 5.3).

List of started projects

Figure 5.3: List of started projects

For each project, you can either open the project it by clicking the “play” icon , which takes you to the Main Page to start or continue your analysis, or delete the project it by clicking the “basket” icon , which permanently removes the project along with all its associated files from your system.

IMPORTANT: Deleting a project permanently removes all input and result files, making the project impossible to recover. This action CANNOT be undone.

6 Project’s Main Page: Loading files

Once the project has been created, you can start uploading your input files. First, open the project by clicking the “play” icon in the Homepage and access the Project’s Main Page of the application (Figure 6.1).

Project's Main page

Figure 6.1: Project’s Main page

Now, click the Load Files button and select in your computer one or several files of the allowed file types. SeqNjoy supports the following file types and extensions:

File type File Formats Allowed File Extensions
Genome sequences FASTA fa, fa.gz, fasta, fasta.gz, fna, fna.gz
Genes features GFF3, GTF gff, gff.gz, gtf, gtf.gz, gff3, gff3.gz
Short reads sequences FASTQ fastq, fastq.gz, fq, fq.gz
Alignments BAM, SAM bam, sam
Gene annotations ANN (text table) ann, ann.txt, ann.tsv, ann.tab, ann.xls, ann.xlsx

If you are not familiarized with any of these file types, go to the section File Types to know more about the structure of the format and the function and details of each file type.

Figure 6.2 shows the application loading a number of FASTQ files:

Uploading files in **SeqNjoy**

Figure 6.2: Uploading files in SeqNjoy

The pop-up window reports the status of file uploading and post-processing by means of a progress bar and informative messages describing the step being performed. When all is OK, you should see a window with an “all in green” aspect like this (Figure 6.3):

Uploading process successful

Figure 6.3: Uploading process successful

Take into account that in the case of some large and/or complex files the upload may take a considerable amount of time.

7 Downloading genome files

Genome files include two types: (1) a genome reference in FASTA format and (2) a genomic features file (GFF3 or GTF) containing information about chromosomes, genes, CDS, exons, etc.

If you already have a FASTA and a GTF/GFF file, you can upload them following the steps described in the previous section. Otherwise, you can download genome files directly from the Ensembl repositories. Ensembl provides access to a large number of genomes, focusing on the most biologically and medically relevant species.

To download genomes from Ensembl, click the Download Genomes button of the Main Page and then select Download Genomes from Ensembl (Figure 7.1):

Selecting the *Download Genomes* option from the Main Page.

Figure 7.1: Selecting the Download Genomes option from the Main Page.

A new window will open where you need to select the taxonomic group of your organism (Figure 7.2).

Important: To access the repository, you must be connected to the Internet, and Ensembl (for vertebrates) or Ensembl Genomes (for bacteria, fungi, metazoa, plants and protists) must be accessible and functional. Ensure that the window displays the message “Online - release <number>” next to the repository name; otherwise, you will not be able to proceed with the download using this option.

Taxonomic group selection

Figure 7.2: Taxonomic group selection

After selecting the taxonomic group, choose the specific genome from the dropdown menu (Figure 7.3).

Choosing the genome of interest from the dropdown menu.

Figure 7.3: Choosing the genome of interest from the dropdown menu.

Once you have selected the species/strain genome, click Check Availability Genome (Figure 7.4) to view the files of that organism available for download. Tipically, you will find the reference genome file (FASTA) and the genomic features file in both common formats, GTF and GFF.

Checking the availability of genome files before downloading. The application will list available FASTA and GTF/GFF files

Figure 7.4: Checking the availability of genome files before downloading. The application will list available FASTA and GTF/GFF files

To proceed with the download, select the FASTA file and at least one of genomic features files (GTF or GFF), then click the Download button (Figure 7.5).

Selecting genome files: reference genome (FASTA) and genomic features file (GTF/GFF). In this case, a GTF file is selected

Figure 7.5: Selecting genome files: reference genome (FASTA) and genomic features file (GTF/GFF). In this case, a GTF file is selected

Once the download is complete, the genome files will appear in File List on the “Project’s Main Page” (Figure 7.6).

Once downloaded, the genome files appear in the *File List* on the [*Project's Main Page*](#projects-main-page-file-list) along with the rest of loaded files

Figure 7.6: Once downloaded, the genome files appear in the File List on the Project’s Main Page along with the rest of loaded files

8 Project’s Main Page: File List

After you have uploaded your input files, your Project’s Main Page shows the list of the loaded files. The list offers different information and a number of options to operate on them. From the Project’s Main Page you can also perform the successive steps of your RNA-Seq gene expression data analysis project.

This is the aspect of the Project’s Main Page after a standard RNA-Seq data analysis project has been performed (Figure 8.1):

Project's Main Page

Figure 8.1: Project’s Main Page

The Project’s Main Page is structured as follows.

In the top area the following functionalities are available:

The rest of the Project’s Main Page is dedicated to the File List. The File List area contains information about all the files loaded by the user or generated by the processing with SeqNjoy, and provides a number of operations that the user can run on one or more selected files. The File List comprises the following buttons and columns, described from the left to the right:

9 Quality Control

Quality control (QC) of sequencing reads is a crucial step in NGS data analysis. It allows users to assess the quality of their sequencing data before proceeding with downstream analyses. QC reports provide insights into potential issues such as low-quality bases, adapter contamination, and sequence biases. Based on these reports, users can decide whether their reads are suitable for analysis as they are or if preprocessing steps, such as trimming or filtering, are necessary. While QC is not mandatory, it is highly recommended to ensure data reliability.

The QC process is conducted using the software tools Biostrings (H Pagès, 204C.E.) and fastp (Chen et al., 2018).

To perform quality control, navigate to the Project’s Main Page and click the Quality Control button. This opens a new window with a menu containing a single option: Quality Control with fastp (Figure 9.1).

Option for Quality Control of the read files (FASTQ)

Figure 9.1: Option for Quality Control of the read files (FASTQ)

Clicking this option displays the QC interface, where users can select the FASTQ files to analyze and specify the number of reads to include in the QC process (Figure 9.2).

Quality Control window

Figure 9.2: Quality Control window

The QC interface provides fields for selecting one or multiple FASTQ files:

Additionally, users must set the Sample Reads for QC parameter. By default, this is set to 1,000,000, meaning the first million reads (or read pairs) will be analyzed. Setting this value to 0 will process all reads in the selected FASTQ file(s).

Once the desired files and parameters are selected, click the Launch button to start the QC process.

When the QC process completes, the corresponding QC icon to the left of each analyzed FASTQ file (or paired-end set) will be enabled (it was previously grayed out). Clicking this icon opens the QC report, where users can review metrics and visualizations related to read quality. For a detailed breakdown of the Quality Control Report’s content, refer to section Quality Control Report.

After reviewing the QC results, users can decide whether preprocessing steps, such as trimming or filtering, are necessary to improve read quality before proceeding with further analyses.

10 Preprocess Reads

Typically, after performing Quality Control (QC) and inspect the QC report, users may identify issues in their reads that can be corrected through preprocessing. Common problems include the presence of adapter sequences, low-quality regions, and an excessive number of undetermined bases (N). To mitigate these, a preprocessing step can be applied to remove adapter fragments contamination, trim low-quality bases, and discard entire reads that fail to meet quality or length thresholds.

The application uses fastp (Chen et al., 2018) to preprocess FASTQ sequencing files, applying various corrections based on user-defined parameters.

To access this operation, navigate to the Project’s Main Page and click the Preprocess Reads button. This opens a new window containing a single menu option: Preprocess Reads with fastp (Figure 10.1).

Option for preprocessing reads

Figure 10.1: Option for preprocessing reads

Selecting this option launches the preprocessing setup interface (Figure 10.2).

Preprocess Reads window

Figure 10.2: Preprocess Reads window

In this interface, users must select the FASTQ files they wish to preprocess and configure the correction parameters. As with Quality Control, users can select one or more FASTQ files. If the input files are paired-end, selecting one file will automatically populate its mate file.

On the right side of the preprocessing interface, users can configure several parameters. The default parameters are set to provide reasonable results for most datasets, effectively addressing common issues. If you are unsure about which values to use, it is recommended to keep the defaults, as they will correct most typical sequencing errors without over-filtering your data.

Additionally, users must define an Output Prefix, which will be incorporated into the filename of the processed FASTQ files to distinguish them from the originals.

Once the FASTQ files, parameters, and output prefix are set, clicking the Launch button will process the files according to the specified adjustments.

The output consists of cleaned FASTQ files, which will appear in the File List on the Project’s Main Page. These quality-improved FASTQ files can be reassessed using a new Quality Control step if needed. Once assessed and, if necessary, preprocessed, the resulting FASTQ files serve as input for the Align Reads step.

11 Align Reads

Click the “Align Reads” button of Main Page to access this window. For performing this task, you inform a genome sequence (FASTA) file and the corresponding short reads files (FASTQ files) and obtain files with the alignments (BAM formatted) of the short reads contained in the FASTQ files to the reference genome (FASTA file).

There are two different aligners available to perform this task, and you should choose one of them in the submenu that is displayed after clicking the Align Reads button (Figure 11.1): bowtie2 (Langmead & Salzberg, 2012) or hisat2 (Kim et al., 2015). Both of them are well-documented and widely used programs.

Align Reads: selecting alignment method (aligner)

Figure 11.1: Align Reads: selecting alignment method (aligner)


After choosing any of them you access a window where you have to fill the input FASTQ files and alignment parameters that are necessary to complete the task.

Window for the reads alignment. Input files are the genome reference (FASTA file) and the short reads files (FASTQ files). The shaded areas contains *bowtie*-specific or *hisat2*-specific parameters and therefore are differents according to the selected aligner.

Figure 11.2: Window for the reads alignment. Input files are the genome reference (FASTA file) and the short reads files (FASTQ files). The shaded areas contains bowtie-specific or hisat2-specific parameters and therefore are differents according to the selected aligner.


The Align Reads to Genomes window has the following parts (see Figure 11.2):

Once you have filled the form in with the required files and set your parameter values, the “Launch!” button turns from disabled (grey) to enabled (red) so you can click it and launch the alignment(s). A pop-up window appears informing you of the progress of the alignment task. When the process ends OK a message “done” is displayed in the pop-up window and a “Close” button is enabled. By clicking “Close” you return to the “Align Reads to Genome” pane. Now, you could make a new Align Reads to Genome round or click the “X” icon for closing the “Align Reads to Genome” window. You will then return to the “Project’s Main Page”, which will have been updated to contain the new alignment (.bam) files in its File List.

*Align Reads to Genome* dialog ready to launch the alignments processes.

Figure 11.6: Align Reads to Genome dialog ready to launch the alignments processes.


12 Quantify Genes


Click the “Quantify Genes” button in the “Project’s Main Page” to access this window (Figure 12.1).

Quantification window

Figure 12.1: Quantification window


In summary, you inform here a genome features file (GFF) file and a number of alignment files in BAM. Normally, these alignments have been generated previously in the “Align Reads” step. Alternatively, the BAM files were generated elsewhere and the user uploaded to SeqNjoy.

The output of the quantification is one counts file per BAM alignment file. Each “counts” file contains a list with the “counts” , i.e. the number of “reads” or “fragments” that have been mapped to each gene (or other feature type that you select) of the genome.

The following parameters are used by the tool that performs the quantification task, the function featureCounts of the R package Rsubread (Liao et al., 2019):

Once you have filled the form with the required files and set your parameter values, the “Launch!” button turns from disabled (grey) to enabled (red) so you can click it and launch the quantification. In Figure 12.4, the quantification process will be launched to count genes for nine samples, with reverse strandedness mode. Only primary alignments will be counted. The output of the quantification is a number of counts files equal to the number of input BAM files, that is, nine in this case.

Quantification dialog ready to be launched

Figure 12.4: Quantification dialog ready to be launched

A pop-up window appears informing you of the progress of the quantification task. When the process ends OK a message “done” is displayed in the pop-up window and a “Close” button is enabled. By clicking “Close” you return to the Quantify Genomic Features pane. Now, you could make a new Quantify Genomic Features round or click the “X” icon for closing the window. You will then return to the “Project’s Main Page”, that it will have been updated to contain the new “counts” files.

13 Compare Samples (Differential Expression)

In this step you can perform an statistical analysis of the differences in the expression for a given set of features between two groups of samples. You can use it repeatedly to make more that one binary comparison (e.g. “SampleGroupDrugA versus SampleGroupDrugB”, and then “SampleGroupDrugA versus SampleGroupPlacebo” and finally “SampleGroupDrugB versus SampleGroupPlacebo”).

SeqNjoy offers you two possible methods to perform this task, both of them widely used and well documented: DESeq2 (Love et al., 2014) and edgeR (Robinson et al., 2009). If you click the Compare Samples button a submenu listing both options is opened and you can select any of them (Figure 13.1) .

Compare Samples: selecting the method for calculating differential expression

Figure 13.1: Compare Samples: selecting the method for calculating differential expression


After selecting the desired option, the Calculate Differential Expression dialog is displayed (Figure 13.2). Most of this window is the same regardless of the selected software. Only some specific parameters located in the shaded area of the window are different and specific of the selected method (DESeq2 or edgeR) to perform the differential expression calculations.

Calculate Differential Expression (Compare Samples) dialog as displayed initially.

Figure 13.2: Calculate Differential Expression (Compare Samples) dialog as displayed initially.

The following fields are present in the Calculate Differential Expression window irrespective of the selected method:


Once you have filled the form with the required files and set your parameters values, the Launch! button turns from disabled (grey) to enabled (red) so you can press it and launch the differential expression calculation. In the figure 13.5 differential expression calculation between the Case ("pH5_CO2") group of samples and the Control ("pH5") group of samples is displayed ready to be launched (Launch! button enabled). In the example, an annotations file has been provided so the information contained in this file will be incorporated to the diffexp file with the differential expression calculations.

*Calculate Differential Expression* dialog ready to be launched.

Figure 13.5: Calculate Differential Expression dialog ready to be launched.


After launching the process, a pop-up window informs you the progress of the “differential expression” task. When the process ends OK the message “done” is displayed in the pop-up window and a close button is enabled. By clicking the Close button you return to the Calculate Differential Expression pane. Now, you can fill in again the form with a new comparison between groups of samples to calculate the differential expression between them or click the “X” icon for closing the window. You will then return to the “Project’s Main Page”, that it will have been updated to contain the newly created diffexp files in Excel format (extension .xlsx) containing the results of the calculated differential expressions of the comparisons performed.

The “diffexp” output file (syntax: [CaseLabel]vs[ControlLabel].[OutputSuffix_if_typed].xlsx) contains the list of differentially expressed genes (DEG) detected between the “Case” and the “Control” groups of samples. Check in the File Types section the information content of the diffexp files. You can graphically inspect and filter the differentially expressed genes results using the FIESTA 2.0 analytic tool.

14 Utilities

In addition to the pipeline to perform the RNA-Seq gene expression data analysis, SeqNjoy provides a number utilities for evaluating and inspecting the content of some of the types of files involved in the analysis. These utilities are launched from buttons located in the File List.

14.1 Quality Control Report

After running Quality Control (QC) on paired-end FASTQ files, a detailed quality report is generated. This report provides key insights into the sequencing data, allowing users to assess whether further preprocessing steps, such as trimming or filtering, are necessary.

This report is generated using fastp (Chen et al., 2018), a widely used tool for FASTQ quality assessment and preprocessing.

To see an example of the Quality Control (QC) report generated for paired-end reads, click here. This example illustrates the types of quality metrics available.

Report Structure

Basic Statistics

This section summarizes fundamental metrics for each read file (R1 and R2), including:

  • Number of Reads: Total reads analyzed.
  • Sample Size: The number of reads sampled for QC.
  • Average Read Length: Mean length of the reads in nucleotides.
  • GC Content: Proportion of guanine and cytosine bases, which can indicate biases.
  • Adapter Sequences: Identified adapter sequences, if present.
  • Adapter Name: The corresponding adapter reference used during sequencing.

Per-Base Sequence Quality

A boxplot displays the quality scores across read positions. This allows users to identify potential quality drops at the ends of reads, which might require trimming.

Per-Base Nucleotide Composition

A graphical representation of base frequency (A, T, C, G) at each position. Unexpected deviations may indicate sequencing artifacts.

Per-Sequence Quality Scores

A histogram showing the distribution of overall read quality. A large proportion of low-quality reads may suggest sequencing issues.

Adapter Content

This section provides a graphical representation of detected adapter contamination, which may require trimming in the preprocessing step.

Duplicate Read Analysis

Displays the proportion of duplicate reads, which can indicate over-sequencing or PCR amplification bias.

Once the QC process is completed, a gray icon next to the analyzed FASTQ files in the File List on the Project’s Main Page will become highlighted. Clicking this icon opens the Quality Control Report.

Based on the report, users can determine whether preprocessing steps, such as adapter trimming or low-quality read filtering, should be applied before proceeding with alignment.

14.2 GFF feature structure hierarchical tree

To access this utility from a GFF (GFF3 or GTF formatted) of the File List click the icon located rightmost in the Details column.

When launched, SeqNjoy will run the plot_features function of the Rgff R package (Garcia-Martin et al., 2022) to generate a picture containing a tree representing the hierarchical structure (dependencies) of the feature types included in the selected GFF file (Figure 14.1). This tree representation is useful for a global look of what is annotated in the genome and it can help you decide which feature(s) and blocks to use in the Quantify Genes step of the analysis.

A hierarchical tree of a genome feature file (GFF) produced by *Rgff*

Figure 14.1: A hierarchical tree of a genome feature file (GFF) produced by Rgff

As it is shown in the example tree above, the highest feature type in this GFF is gene and there are 7127 items annotated as genes. gene has a child node, the transcript feature. Included into transcript there are five child features, that are siblings from each other: CDS, exon, five_prime_UTR, start_codon and stop_codon.

With this information, we know that we could count genes, exons-by-gene, transcript, exons-by-transcript, exons, etc. in our analysis but not “exon-by-CDS” for example, as exon is not a descendant of the CDS feature.

14.3 Visual exploration of genomic files with IGV

You can access this utility from a FASTA file containing the genomic sequence or from an alignment file (BAM), by clicking the icon located on the right of the Details column of the corresponding file. The option is not available for user-provided BAMs.

After clicking the button, the widely used “Integrative Genomic Viewer” (IGV) (Robinson et al., 2011, 2022) tool will be launched and will load the corresponding FASTA or BAM file. If you have loaded a FASTA, a screen similar to the one in Figure 14.2 will be displayed.

IGV launched with a FASTA file as input

Figure 14.2: IGV launched with a FASTA file as input

If you use a BAM file with IGV a similar window is displayed. In addition to the BAM file, the associated genomic sequence (FASTA file) is automatically incorporated to IGV in the same session. See in Figure 14.3 the initial aspect displayed in this case and a minimal “use case” for IGV to visually compare the alignment files of two samples (i.e. two BAM files) for a particular gene.

A simple usage example of IGV to compare two BAM files in a certain genomic region.

Figure 14.3: A simple usage example of IGV to compare two BAM files in a certain genomic region.

IGV has a plethora of functionalities and possibilities to interact with your alignment files. Please refer the IGV user general guide, to take full advantage of this versatile tool. In addition, you can inspect your files with the web/on-line version that IGV provides at https://igv.org/app/, documented at https://igvteam.github.io/igv-webapp/.

14.4 Graphical analysis of differential expression results with FIESTA 2.0

FIESTA 2.0 (Oliveros, 2022) is an interactive viewer for differential expression results files that is integrated in SeqNjoy. It provides multiple ways of representing your data as tunable interactive plots that can be downloaded as high resolution images. In addition, FIESTA 2.0 allows you sorting and filtering genes by one or several columns and saving the relevant ones into files.

Figure 14.4 illustrates a usage example for FIESTA 2.0. When a diffexp file is loaded into FIESTA 2.0, a default representation of the expression data as a MA-plot is displayed. From this point, you can combine filters (top) to select gene subsets. A very common (and useful) filter combination is to select and highlight with user-selected colours genes (center) that show to be significantly (P-adjusted < 0.05) under- (negative Log2Ratio) or overexpressed (positive Log2Ratio). Finally, user can select any pair of numerical columns present in the diffexp file to be represented (bottom), so other popular representations like Volcano plots are also available. User-generated plots and subsets of relevant genes are both downloadable (in the case of the plots, with high resolution).

*FIESTA 2.0* usage example.

Figure 14.4: FIESTA 2.0 usage example.

To access FIESTA 2.0 viewer from a file containing the differential expression results (“diffexp” files in .xlsx format) click the icon located rightmost in the Details column.

More details about FIESTA 2.0 here.

15 File Types in SeqNjoy

SeqNjoy uses up to seven types of files in the analyses. The files that you provide have these types: FASTA, FASTQ, GFF3/GTF and ANN. Other types are only held by files generated through the SeqNjoy analysis: COUNT, DIFFEXP. Finally, the format BAM/SAM is that of the alignment files, which can be either internally generated by SeqNjoy or provided (loaded) by the user.

To follow, we describe the aspect, source, uses, format requirements and content of each file type. You can find a good complement to know more about them in this link by Andrew Severin and also in this page.

15.1 FASTA file

FASTA file format

Figure 15.1: FASTA file format

15.2 FASTQ files

15.3 Gene Feature File (GFF): GFF3 and GTF formats

More formally, each line of a GFF3 (or GTF) file has 9 tab-separated columns:

GFF3 is currently the preferred format to represent Gene Features, but it is not uncommon to find gene feature files in GTF format. GFF3 provides more flexibility to represent the hierarchy of the features in a genome but GTF has some advantages in certain cases. SeqNjoy can process files in GFF3 or GTF formats. In both cases, columns 1 to 8 are the same. The 9th column has a different syntax in GFF3 and GTF (Figure 15.5 )

Comparison of GTF and GFF3 files for the representation of the features of the same gene

Figure 15.5: Comparison of GTF and GFF3 files for the representation of the features of the same gene

Important: For SeqNjoy, GFF files must contain at least one “gene” in the feature type column.

For more detailed information about GFF formats refer to this, this or this link.

15.4 Alignment files (BAM / SAM files)

15.5 Counts file

Counts file example

Figure 15.6: Counts file example

15.6 Annotations file

15.7 Differential Expression file

*diffexp* file format.

Figure 15.8: diffexp file format.

This is the meaning of the columns present in the diffexp files (Figure 15.8):

References

Andrews, S., Krueger, F., Segonds-Pichon, A., Biggins, L., Krueger, C., & Wingett, S. (2012). FastQC. Babraham Institute.

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x

Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). Fastp: An ultra-fast all-in-one fastq preprocessor. Bioinformatics, 34(17), i884–i890. https://doi.org/10.1093/bioinformatics/bty560

Garcia-Martin, J. A., Oliveros, J. C., & Torres-Perez, R. (2022). Rgff: R utilities for gff and gtf files. National Center for Biotechnology, Spanish National Research Council (CNB-CSIC). https://cran.r-project.org/web/packages/Rgff

H Pagès, R. G., P Aboyoun. (204C.E.). Biostrings: Efficient manipulation of biological strings. https://bioconductor.org/packages/Biostrings.

Kim, D., Langmead, B., & Salzberg, S. L. (2015). HISAT: A fast spliced aligner with low memory requirements. Nature Methods, 12(4), 357–360. https://doi.org/10.1038/nmeth.3317

Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with bowtie 2. Nature Methods, 9(4), 357–359. https://doi.org/10.1038/nmeth.1923

Liao, Y., Smyth, G. K., & Shi, W. (2019). The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Research, 47(8), e47. https://doi.org/10.1093/nar/gkz114

Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12). https://doi.org/10.1186/s13059-014-0550-8

Oliveros, J. C. (2022). FIESTA 2.0: Interactive scatter plots in your browser. National Center for Biotechnology, Spanish National Research Council (CNB-CSIC). https://bioinfogp.cnb.csic.es/tools/fiesta2

Pomaznoy, M., Sethi, A., Greenbaum, J., & Peters, B. (2019). Identifying inaccuracies in gene expression estimates from unstranded RNA-seq data. Scientific Reports, 9(1). https://doi.org/10.1038/s41598-019-52584-w

Robinson, J. T., Thorvaldsdóttir, H., Turner, D., & Mesirov, J. P. (2022). Igv.js: An embeddable javascript implementation of the integrative genomics viewer (igv). bioRxiv. https://doi.org/10.1101/2020.05.03.075499

Robinson, J. T., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E. S., Getz, G., & Mesirov, J. P. (2011). Integrative genomics viewer. Nature Biotechnology, 29(1), 24–26. https://doi.org/10.1038/nbt.1754

Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2009). edgeR: A bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139–140. https://doi.org/10.1093/bioinformatics/btp616

Soneson, C. (2021). Rhisat2: R wrapper for hisat2 aligner. https://github.com/fmicompbio/Rhisat2

Wei, Z., Zhang, W., Fang, H., Li, Y., & Wang, X. (2018). EsATAC: An easy-to-use systematic pipeline for atac-seq data analysis. Bioinformatics. https://doi.org/10.1093/bioinformatics/bty141

Appendix 1: Single-end and paired-ends sequencing

Introduction

After the “fragmentation” step, in RNA-Seq protocols fragments ~300 to 500 bp long are produced. Even this size range is too long to be sequenced entirely by the type of sequencers that generate short reads. What is done then is to read only the end part of the fragment, which produce short reads of 100-200 bases (this interval may vary according to the experimental design, the machine used, etc.) The rest of the fragment is not sequenced.

Now, there are two modalities in RNA-Seq, single-end and paired-ends (Figure 15.9):

**Single-end** and **Paired-ends** reads: a global view

Figure 15.9: Single-end and Paired-ends reads: a global view

Single-end RNA-Seq (SE, aka single-read RNA-Seq): Is the simplest mode. Only one end of the fragment is sequenced. All the short reads sequences generated for a sample with single-end (SE) technique are collected in a single FASTQ file.

Paired-end RNA-Seq mode (PE): Two reads per fragment are produced. In this mode, the reading starts at one end, finishes this direction at a pre-specified read length (e.g. 150 bases) and a “read 1” is produced as a result. Then, it starts another round of reading from the opposite end of the fragment, finishes this new (opposite) direction at the same read length and finally a second read, “read 2”, is produced. At the end, both ends of the fragment are sequenced leaving the inner portion of the fragment unsequenced.

In PE sequencing, read 1 reads (those generated first-in-the-pair) and read 2 reads (generated secondly) are stored in two separated FASTQ files. One file contains all the read 1 reads the other file contains all the read 2 reads. Each of these files has a label in its name that identifies whether it contains read 1 or read 2 reads, something like Sample1_blah_blah_1.fastq and Sample1_blah_blah_2.fastq (or something similar like Sample1_blah_blah_R1.fastq and Sample1_blah_blah_R2.fastq, etc.) Note that usually the only difference in the filenames between two FASTQ paired-ends files of the same sample is just this "_1" or "_2"-like particles.

In addition, "_1" and "_2" FASTQ files of a particular sample have both the same number of reads. Moreover, they kept the same order of reads (see Figure 15.9): at a given position X in both files, the reads in “_1” and “_2” correspond to the same fragment. In fact, as with the files, the read identifier that is given to the “read 1” and “read 2” reads of the same fragment is the same… except again for the particle indicating whether it is for reads 1 or reads 2.

Reads orientation

To know how is the orientation of the reads in SE and PE is useful to interpret correctly the alignments of the reads and to understand important topics like the unstranded or stranded protocols in RNA-Seq.

The sequencing always synthesizes the reads in a 5’ -> 3 direction. (It is the way the DNA polymerase proceeds) so the template strand is traversed 3’ -> 5’.

In single-end, the reads extend in the 5’ -> 3’ direction along the forward (plus) DNA strand (Figure 15.10).

Single-end read orientation

Figure 15.10: Single-end read orientation

In paired-ends, the read 1 read extends in the 5’ –> 3’ direction towards read 2 along the forward (plus) DNA strand. As for the read 2 read, it extends in the 5’ –> 3’ direction towards read 1 along the reverse (minus) DNA strand. Read 1 is the reverse complement of the first 3’-most bases of the forward strand (therefore identical to the first 5’-most bases of the reverse strand) and read 2 is the reverse complement of the first 3’-most bases of the reverse strand (therefore identical to the first 5’-most bases of the forward strand). See the read 1 and read 2 orientations represented in Figure 15.11:

Paired-ends *FR* reads orientation

Figure 15.11: Paired-ends FR reads orientation

Therefore the sequences of the two reads of a pair are pointing towards each other on opposite strands. When you align them to a genome, one read should align to the forward strand, and the other should align to the reverse strand, at a higher base pair position than the first one so that they are pointed towards one another. This is known as an “FR” read pair or “forward/reverse”, as this is the order in which they are placed. A read pair having this orientation is called a “proper pair”. There are other possible orientations, but we don’t need to go into them here.

When you use IGV genomic browser you can see this orientation in the case of a read pair of a PE sequencing. Take into account that in IGV only the forward (plus) strand of the genome is displayed. With the “Color by” -> “read strand” option the read in the forward strand is colored red and the read of the reverse strand is colored blue. Observe in the figure above a read pair meeting the FR orientation. They are even more visible if you select “View in Pairs” option (Figure 15.12).

Paired-ends *FR* reads orientation as seen in IGV.

Figure 15.12: Paired-ends FR reads orientation as seen in IGV.

Single-end or paired-ends, which is better?

Both single-end (SE) and paired-ends (PE) can be used for measuring gene expression. PE offers the advantage over the SE that the distance covered by the R1 and R2 and the unsequenced sequence in-between is controlled on average (it is named “insert size” and is usually in an approximate range 200-600 bp) so the aligners can take advantage of this information to be more accurate in the correct alignment of the reads to the reference genome. The point is that a read pair properly (FR) oriented, covers a longer span in the genome (read_1 + inner_distance + read_2, Figure 15.13) and therefore a stronger evidence of being well aligned than a single read.

Insert size in Paired-ends reads

Figure 15.13: Insert size in Paired-ends reads

The main disadvantage of the PE protocol is that it is normally more expensive and time-consuming to generate than single-end reads. On the other hand, just for counting reads or calculate differential expression may be unnecessary to have such a high level of accuracy, especially if your species is well-annotated. But take into account that in the future you might want to reuse the reads intended to study differential expression to, for example, extend the study to look for alternative splicing. In that case, you will need to have done PE sequencing as SE is insufficient for alternative splicing analysis.

Appendix 2: Strandedness in RNA-Seq

Understanding Strandedness in Transcription

During transcription, one of the two DNA strands serves as a template for synthesizing the RNA molecule. Since the template strand is reverse complementary to the transcribed RNA, it is referred to as the antisense strand, negative-sense strand, non-coding strand, or minus strand.

Conversely, the non-template strand has the same sequence as the transcript (except for T instead of U) and is known as the sense strand, positive-sense strand, coding strand, or plus strand (Figure 15.14).

Representation of sense and antisense strands in transcription

Figure 15.14: Representation of sense and antisense strands in transcription

It is important to note that either DNA strand can contain genes, meaning that the designation of sense and antisense is gene-specific (Figure 15.15):

Sense and antisense strand assignment depends on the gene

Figure 15.15: Sense and antisense strand assignment depends on the gene

RNA-Seq protocols: unstranded vs. stranded

RNA-Seq protocols generate single-stranded mRNA molecules that are then converted into double-stranded complementary DNA (cDNA). These cDNA molecules are subsequently amplified, producing millions of short sequencing reads. As a result, for any given transcript, the sequencing reads fall into two categories:

  1. Sense reads: These have a sequence identical to the original RNA transcript (except for T instead of U).
  2. Antisense reads: These are the reverse complement of the original RNA transcript sequence.

Initially, most RNA-Seq library preparation protocols were non-strand-specific (also called unstranded). In these protocols, sequences from both the sense strand (the cDNA strand encoding the transcript) and the antisense strand (the reverse complement of the original mRNA) are sequenced without preserving strand information.

This means that in the resulting FASTQ files, _1.fastq and _2.fastq contain a mixture of sense and antisense reads, making it impossible to determine whether a given read originated from the sense or antisense strand of the original mRNA.

Comparison of unstranded and stranded RNA-Seq protocols

Figure 15.16: Comparison of unstranded and stranded RNA-Seq protocols

When assigning reads to genes, it is ideal to match all reads derived from the sense strand to a sense gene in the reference genome and all reads from the antisense strand to an antisense gene. However, in an unstranded protocol, this is not possible—some antisense reads may be incorrectly assigned to sense genes, and vice versa. This can introduce biases in gene expression quantification.

Now, consider the stranded protocol illustrated on the right in Figure 15.16. Here, the second-strand cDNA fragment (the sense strand) is prevented from amplification. This results in only the first-strand cDNA (antisense strand) being amplified. Consequently, strand information is retained in the sequencing data:

FASTQ File Structure in Different Protocols

In stranded RNA-Seq protocols, strand information is preserved by selectively amplifying only one strand of cDNA. This distinction directly impacts how reads are stored in the FASTQ files and determines whether a given read corresponds to the sense or antisense strand of the original RNA transcript. Figure 15.17 illustrates how sequencing reads are distributed across FASTQ files depending on the protocol used.

Unstranded and stranded protocols and FASTQ files.

Figure 15.17: Unstranded and stranded protocols and FASTQ files.

By ensuring that all sequencing reads carry strand information, stranded protocols prevent incorrect assignment of antisense reads and eliminate potential biases in gene quantification.

Several techniques exist for generating stranded RNA-Seq libraries. One of the most widely used involves chemical modification with dUTP. This method incorporates dUTP specifically into the non-sense strand of the cDNA, preventing its amplification during PCR (15.16, right). This approach is known as reverse-stranded RNA-Seq, since the reverse (antisense) strand is amplified and sequenced first.

Alternatively, some stranded protocols selectively deplete the sense strand instead of the antisense strand, resulting in a direct-stranded RNA-Seq protocol. In this case, the sense (direct) strand is amplified and sequenced first, while the antisense strand is discarded.

Getting Down to Practicalities

When your provider delivers the short-read files (FASTQ files) from your RNA-Seq experiment, it is recommended to check the sequencing report to determine which library preparation method was used: unstranded (non-stranded), direct-stranded, or reverse-stranded. If this information is not explicitly mentioned, request it from the provider, as it is crucial for accurate transcript quantification (counts). Once you have identified the protocol type, select the appropriate option in the Strand Specificity selector of the Quantify Genomic Features window.

However, the protocol may not always be listed under these exact names. For example, some reports may refer to the protocol as “directional RNA-Seq,” which is equivalent to “stranded RNA-Seq.” In other cases, only the name of the commercial kit used for library preparation might be provided. The following table serves as a reference to help interpret these variations:

Protocol Synonym Examples/Kits1 Description
Unstranded Non-stranded Standard Illumina, TruSeq RNA Sample Prep Kit, NuGEN OvationV2, SMARTer Universal Low Input (TaKara), GDC normalized TCGA data Strand information is not retained (lost).
Reverse-stranded First-strand dUTP, NSR, NNSR; TruSeq Stranded (Total RNA/mRNA), NEB Ultra Directional, Agilent SureSelect Strand-Specific The second read (read 2) corresponds to the original RNA strand, while the first read (read 1) aligns to the opposite (reverse) strand.
Direct-stranded Second-strand Directional Illumina (Ligation), Standard SOLiD, ScriptSeq v2, SMARTer Stranded Total RNA, NuGEN Encore Complete, NuGEN SoLo, Illumina ScriptSeq The first read (read 1) corresponds to the original (sense) RNA strand, while the second read (read 2) aligns to the opposite strand.

1 The information in this column was compiled from this, this, and this source.

Regardless of the reported protocol, we recommend verifying the strandedness yourself by following an approach like the one described in the next section.

How to Determine the Strandedness with IGV

In some cases, you may not have access to information about the strandedness of your RNA-Seq data. This can happen if you obtained your FASTQ or BAM files from an online repository that lacks metadata or if the dataset is poorly documented.

Fortunately, there are strategies to infer strandedness from the data itself. One such strategy involves using the IGV (Robinson et al., 2011, 2022) genomic browser.

Step 1: Load the Required Files in IGV

To analyze strandedness, you need to load the following files into IGV:
- Reference genome (FASTA format)
- Gene annotation file (GFF3/GTF format)
- BAM file(s) containing the RNA-Seq alignments

Once loaded, you can use IGV’s Color by Read Strand option to visualize the alignment patterns.

Step 2: Understanding Read Orientation in IGV

As described in the Reads Orientation section, paired-end reads typically align in an FR conformation (Forward-Reverse), where:
- Read 1 aligns to the forward (+) DNA strand, and
- Read 2 aligns downstream on the reverse (-) DNA strand (Figure 15.12).

IGV allows you to visualize read orientation using the Color by Read Strand option:
- Forward strand reads are colored red.
- Reverse strand reads are colored blue.

Additionally, paired-end sequencing assigns each read to one of two FASTQ files:
- “reads_1.fastq” contains all Read 1 sequences.
- “reads_2.fastq” contains all Read 2 sequences.

Now, let’s examine how this information can be used to infer strandedness.

Step 3: Detecting Strandedness Using Read 1 Orientation

By inspecting only Read 1 (reads_1.fastq), you can infer the strandedness of your data:

  • Unstranded Protocol:
    • reads_1.fastq contains a mix of red and blue reads, meaning some align to the forward strand and others to the reverse strand.
    • This happens because strand information is lost during library preparation.
  • Direct-Stranded Protocol:
    • reads_1.fastq contains only forward-aligned reads (red) when mapping to genes on the forward strand of the genome.
    • Likewise, it contains only reverse-aligned reads (blue) when mapping to genes on the reverse strand.
  • Reverse-Stranded Protocol:
    • reads_1.fastq contains only reverse-aligned reads (blue) for genes on the forward strand.
    • It contains only forward-aligned reads (red) for genes on the reverse strand.

Step 4: Using “Color by First-of-Pair Strand” in IGV

IGV provides another useful visualization tool clicking the track’s gear menu of IGV and selecting Color by → First-of-Pair Strand in the BAM track (Figure 15.18).

option 'first in pair'

Figure 15.18: option ‘first in pair’

When enabled, both reads of a pair are colored based on the strand to which Read 1 aligns:
- If Read 1 aligns to the forward strand, the entire pair is red.
- If Read 1 aligns to the reverse strand, the entire pair is blue.

This visualization allows for a quick assessment of whether Read 1 is consistently aligning with a particular strand, which is crucial for determining the strandedness protocol.

Step 5: Verifying the “Pair Orientation” in IGV

By left-clicking a read in IGV, you can access additional alignment details (Figure 15.19). One key piece of information is the Pair Orientation, which can be either F2R1 or F1R2:

Pair orientation in IGV

Figure 15.19: Pair orientation in IGV

  • F1R2Read 1 aligns to the forward strand (red) and comes from reads_1.fastq, while Read 2 aligns to the reverse strand (blue) and comes from reads_2.fastq.
  • F2R1Read 1 aligns to the reverse strand (blue) and comes from reads_1.fastq, while Read 2 aligns to the forward strand (red) and comes from reads_2.fastq.

These orientations help confirm whether the sequencing protocol follows a direct stranded or reverse stranded approach.

Final Interpretation: Identifying the Library Protocol

Using the tools above, you can now determine the library preparation method:

  • Unstranded: Mixed ‘red and blue’ reads in reads_1.fastq.
  • Direct stranded: Red reads for genes on the forward strand, blue reads for genes on the reverse strand (Figure 15.20).
    Direct stranded: read pairs in red for genes in forward (+) strand and in blue for genes in reverse (-) strand.

    Figure 15.20: Direct stranded: read pairs in red for genes in forward (+) strand and in blue for genes in reverse (-) strand.

  • Reverse stranded: Blue reads for genes on the forward strand, red reads for genes on the reverse strand (Figure 15.21).
    Reverse stranded: read pairs in blue for genes in forward (+) strand and in red for genes in reverse (-) strand.

    Figure 15.21: Reverse stranded: read pairs in blue for genes in forward (+) strand and in red for genes in reverse (-) strand.

By comparing your observations with these expected patterns, you can confidently determine whether your RNA-Seq library was prepared using an unstranded, direct stranded, or reverse stranded protocol.


To verify that you have grasped the concept, try to determine which strandedness protocols were used for BAM-1 and BAM-2 in Figure 15.22. Remember that the reads have been colored using the “First-of-Pair” option in IGV.

What are the strandedness protocols in BAM-1 and in BAM-2?

Figure 15.22: What are the strandedness protocols in BAM-1 and in BAM-2?

Final remark: If you are still unable to determine strandedness, the safest option is to select “unstranded” during analysis. This ensures that no reads are mistakenly discarded.

A study by (Pomaznoy et al., 2019) found that ignoring strand information can significantly affect gene expression estimates that “10% of all genes and 2.5% of protein-coding genes show a two-fold or higher difference when strandedness is ignored.”


The answer for the question about the strandedness protocols in Figure 15.22 is that:

  • BAM-2 follows an unstranded protocol, as it contains a mix of red and blue reads. This indicates that Read 1 aligns to both DNA strands, meaning strand information was not preserved during sequencing.
  • BAM-1 follows a direct stranded protocol, as:
    • Reads mapping to a gene on the forward (plus) DNA strand are red, corresponding to an F1R2 orientation.
    • Reads mapping to a gene on the reverse (minus) DNA strand are blue, corresponding to an F2R1 orientation.

This confirms that BAM-1 was sequenced using a direct stranded RNA-Seq protocol, while BAM-2 lacks strand specificity.

Appendix 3: FAQ

How to cite SeqNjoy?
Juan Antonio Garcia-Martin, Rafael Torres-Perez and Juan Carlos Oliveros (2022). SeqNjoy: Complete RNA-Seq projects in your Desktop. (manuscript in preparation). https://bioinfogp.cnb.csic.es/tools/seqnjoy
(Please consult this tutorial for reference updates).
My computer has an internal disk low on space. Can I install and use SeqNjoy?
Yes, although RNA-Seq files require considerable amounts of hard disk space, users can store project on local or USB-connected hard drives of their choice. Install SeqNjoy and select a Projects Folder on an dedicated external hard disk with enough space.
Can I recover my files from SeqNjoy after deleting a project?
No, the files are deleted without the possibility of being recovered.
Why cannot I type some characters when naming a project, a sample label or an output suffix?
Some non-alphabetic characters cannot be part of an output file name. That is the reason we limit the use of special characters in labels or suffixes.
Which annotation format should I use, GTF or GFF3?
Both formats contain similar information about genes. GFF3 allows any number of parent-child levels and types of genomic elements whilst GTF is more strict. Many genomic files providers tend to favour GFF3 over GTF, however many bioinformatic tools still use GTF as input as it is best defined. For example, the tool that detects the number and size of introns in SeqNjoy works well with any properly formatted GTF files but not so well with most GFF3.
Can I use SeqNjoy just to align short-reads?
Yes, you can align with SeqNjoy and export the alignment files (in standard BAM format) to be visualized or quantified with another tool of your preference.
Why aren’t my paired-end FASTQ files recognized as a pair?
For two paired-end FASTQ files to be correctly recognized, they must meet the following criteria:

- The filenames must differ by only one character.
- Both files must contain the same number of reads.
- The first read in each file must differ by at most one character, ignoring any possible “(length=XX)” tag present in the FASTQ files.

If your files are not detected as paired, please verify that they meet these requirements.
Can I compare samples without using biological replicates?
No. At least one of the sample groups to be compared must contain two or more biological replicates.
Why do some boxes remain red and the “Launch!” button remains disabled after completing a dialog?
That occurs because there are already result files with matching names to those that will be generated in the dialog. The solution is to change one or several sample labels or the common output suffix.
What are the .R files downloaded together with my project’s files?
All output files in SeqNjoy are generated from other files by applying R instructions on them. The “.R” files contain these instructions to help advanced users to create their own pipelines using R language outside SeqNjoy. Please note that these R codes are generated automatically by SeqNjoy and are not optimized for analyzing large number of samples (loops, lists, variables etc. are not used).
Can I analyze my results with FIESTA Viewer outside SeqNjoy application?
Yes. When a differential expression file is saved (in .xlsx format) the corresponding “.fiesta2.html” is also saved. This .html file is a standard web page that can be opened in another computer to filter and exporting biologically relevant results. You can even publish your results on-line by copying any .fiesta2.html file on a public URL.
When I perform the first alignment with bowtie2 for a given genome, I got the following error. “Error in bowtie2_samtools…”.
If using Windows, Python3 command-line tool is not available, so bowtie2 fails to index the genome. The solution is to install Python3 command line tool from the Microsoft Store.
When I launch SeqNjoy on Windows I am prompted with the message “SeqNjoy finished with errors. Please check that another instance of SeqNjoy is not already running.”
If you are not already running SeqNjoy, open the Task Manager and look for running instances of R (named “R for Windows terminal front-end”) and terminate them.
By using SeqNjoy, are my result files exposed, in some way, on the internet?
No. Input and results files generated with SeqNjoy remain stored in a specific folder on your local hard disk. SeqNjoy can even be used with no internet connection.

SeqNjoy: Complete RNA-Seq workflow in your Desktop. Developed by BioinfoGP CNB-CSIC (2022)