2 Computer requirements and installation

2.1 Computer requirements

	Minimum	Recommended
Architecture	64-bit
Operative System	Windows, macOS, Linux (Debian, Fedora or SUSE-based)
Web Browser	Chrome, Firefox, Safari, Edge
RAM Memory	8 GB	16 GB
Disk Space	500 GB	1 TB

NOTE: minimum requirements are adequate for carrying out analyses involving bacterial genomes, while recommended requirements are adequate also for eukaryotic genomes.

SeqNjoy is optimized to be used with 1920 x 1080 screen resolution (or higher).

Hard drive requirements specify the necessary storage space for project files in typical RNA-Seq analysis. SeqNjoy is installed as a standard R package, and users can store project folders on local or USB-connected hard drives of their choice.

Please disable Sleep Mode or Hibernation in your computer to avoid interruptions of long lasting processes.

A connection to the Internet is only required for downloading the necessary packages and libraries during the installation and for retrieving genome files from Ensembl.

2.2 Software requirements

SeqNjoy is an R package; therefore, installing R is required before using the software.

2.2.1 Windows

2.2.1.1 Rtools on Windows

RTools is necessary to install SeqNjoy on Windows because some neccesary R packages are not available as precompiled Windows binaries and must be built from source.

Download the appropriate version of RTools for your R installation from the official website: https://cloud.r-project.org/bin/windows/Rtools/.

2.2.1.2 Bowtie2 aligner

To use the feature “Align reads with Bowtie2” on Windows, ensure the following software is installed:

Python 3
- Python 3 is required for genome indexing.
- If Python 3 is not already installed:
  - Recommended Option: Download and install Python 3 from the official website: https://www.python.org/downloads/.
  - Ensure that you check the option to Add Python to PATH during installation.
- Alternative Option: Install Python 3 from the Microsoft Store.
- After installing Python, verify its installation by running the following command in a terminal:
```
    python3 --version
```
- If python3 does not work and instead opens the Microsoft Store, you may need to adjust your system configuration:
  - Check App Execution Aliases settings: Open Manage App Execution Aliases in the Control panel and disable the python3 alias for the Microsoft Store.
  - Ensure that the installation path for Python is correctly added to your PATH environment variable.
Perl
- Perl is required for sequence aligning.
- You can download and install Perl from Strawberry Perl or another Perl distribution for Windows.
- After installing Perl, verify its installation by running the following command in a terminal:
```
    perl --version
```

2.2.1.3 Pandoc

Pandoc is required to build the vignettes used within the app. If you do not already have Pandoc installed:

Download Pandoc from the official website: https://pandoc.org/installing.html.
Follow the installation instructions for Windows.
After installation, verify it by running the following command in a terminal:

    pandoc --version

2.2.2 macOS

Administrator privileges are required to install R binaries.

2.2.2.1 Pandoc

Pandoc is required to build the vignettes used within the app. If you do not already have Pandoc installed:

Download Pandoc from the official website: https://pandoc.org/installing.html.
Follow the installation instructions for macOS.

2.2.2.2 Source compilers

Depending on the macOS distribution, developer tools and compilers such as xcode might be required. Check R macOS tools website for installation instructions.

These tools are also required to install R from source.

2.2.3 Linux

Installing SeqNjoy dependencies from source requires several system libraries. The package names vary depending on the Linux distribution. Ensure the appropriate libraries are installed by running the commands below.

2.2.3.1 Debian-based Distributions (e.g., Debian, Ubuntu)

    sudo apt-get install -y \
        build-essential gfortran libreadline-dev libx11-dev libxt-dev \
        libbz2-dev liblzma-dev libpcre2-dev libcurl4 libcurl4-openssl-dev \
        libpng-dev libjpeg-dev libxml2-dev libssl-dev librsvg2-dev

2.2.3.2 Fedora-based (e.g., Fedora, Red Hat, Rocky Linux)

    sudo yum install -y \
        make gcc-c++ gcc-gfortran readline-devel libX11 libX11-devel \
        libXt libXt-devel bzip2-devel pcre2-devel libcurl-devel \
        libxml2-devel openssl-devel libpng-devel libjpeg-turbo-devel \
        librsvg2-devel

2.2.3.3 SUSE -based (e.g., openSUSE, SUSE Linux Enterprise)

    sudo zypper install -y  \
        make gcc-c++ gcc-fortran readline-devel libX11-devel libXt-devel \
        libbz2-devel lzma-sdk-devel pcre2-devel libcurl-devel \
        openjpeg-devel libxml2-devel libopenssl-devel libjpeg8-devel \
        libpng16-devel librsvg-devel

2.3 Installation

2.3.1 From GitHub

The most recent release, avalaible on GitHub, can be installed using the remotes package. Pandoc is required to build vignettes (see Software requirements). You can disable this option, but some help pages will not be available.


    ## Install Bioconductor
    if (!requireNamespace("BiocManager", quietly = TRUE)){
        install.packages("BiocManager")
    }

    ## Install remotes
    if (!requireNamespace("remotes", quietly = TRUE)){
        install.packages("remotes")
    }

    ## Install SeqNjoy without vignettes
    remotes::install_github("BioinfoGP/SeqNjoy", dependencies=TRUE, bioc_version = BiocManager::version())

    ## Install SeqNjoy with vignettes
    remotes::install_github("BioinfoGP/SeqNjoy", dependencies=TRUE, build_vignettes = TRUE, bioc_version = BiocManager::version())

2.3.2 From Source

Installation from source can also be done using devtools. Download the latest gzipped package source code SeqNjoy_0.5.X.tar.gz available at BioinfoGP’s GitHub repository and install it from R console.

    ## Set your working directory to the location where the package was downloaded. 
    setwd("path/to/downloaded/package")

    ## Install remotes
    if (!requireNamespace("remotes", quietly = TRUE)){
        install.packages("remotes")
    }

    ## Install Bioconductor
    if (!requireNamespace("BiocManager", quietly = TRUE)){
        install.packages("BiocManager")
    }

    # Create a temporary directory and extract the package contents
    d <- tempdir()
    untar("SeqNjoy_0.5.X.tar.gz", exdir=d)

    # Install SeqNjoy
    remotes::install(file.path(d, "SeqNjoy"), dependencies=TRUE, repos=BiocManager::repositories())

2.4 Starting SeqNjoy

Open R, then load SeqNjoy library and launch the application. The browser will automatically start.

    library(SeqNjoy)

    SeqNjoy()

If the browser has not already started, or you have closed it. Open your browser and enter the URL http://127.0.0.1:7628/.

To stop SeqNjoy, click the “Exit SeqNjoy” button or press Ctrl+C in the R console.

3 RNA-Seq in short

RNA-Seq is a high-throughput sequencing technique used to detect and quantify RNA in biological samples. This provides a proxy for detecting and quantifying gene expression levels, among other applications.

The “wet lab” part of the RNA-Seq process (Figure 3.1, top) begins with the library preparation, a process involving several key steps: extraction of RNA from the samples, filtering to isolate mRNA molecules while depleting other RNA species (e.g., ribosomal RNA), reverse transcription of mRNA into more stable cDNA molecules, fragmentation and size selection to produce smaller nucleic acid fragments (easier to handle), and sequencing these fragments to generate short reads.

The sequencing machine (“sequencer”) reads the nucleotide sequences corresponding to the mRNA transcripts, producing millions of short reads. These reads are stored in FASTQ files, typically one file per sample or biological replicate in your experiment. These FASTQ files are the starting point for RNA-Seq data analysis—and you can carry out this analysis using SeqNjoy!

Figure 3.1 provides a schematic overview of the standard RNA-Seq pipeline for analyzing mRNA expression in biological samples. The upper section outlines the experimental protocol, while the yellow-highlighted area below illustrates the main steps for processing and analyzing the RNA-Seq data from the sequenced samples.

Figure 3.1: Schematic overview of the standard RNA-Seq pipeline for mRNA expression analysis

In brief, the bioinformatic analysis of a RNA-Seq experiment involves three consecutive steps:

Quality control and preprocessing of short reads: Before alignment, it is essential to assess the quality of the raw sequencing data in FASTQ format. A quality control (QC) analysis is performed to detect potential issues such as adapter contamination, low-quality bases, or short reads. If necessary, a preprocessing step is carried out to clean the data, which includes trimming adapters, filtering out low-quality reads, and discarding excessively short sequences. This step ensures that only high-quality reads are retained for downstream analysis.
Alignment to the reference genome: Once the reads have passed quality control, they are aligned to a FASTA file containing the reference genome. This process generates BAM files containing the aligned reads, with one BAM file produced for each FASTQ file or pair of FASTQ files, depending on the sequencing type (single-end or paired-end RNA-Seq).
Gene quantification: The aligned reads in each BAM file are mapped to a GTF or GFF file, which contains the start and end positions of all annotated genes. For each gene, the total number of reads mapped to it is counted. The result is a “counts file” for each BAM file. A counts file includes a list of all genes in the genome and the corresponding number of reads (“counts”) mapped to each gene.
Samples comparison / Differential expression calculation: The counts files are grouped into two sets: one for the “Control” samples and another for the “Case” samples. This step statistically compares the expression of each gene between the two groups to identify genes with significant differential expression and quantify the magnitude of these differences. The output is a table containing the results of the differential expression analysis for each gene.

4 SeqNjoy framework in short

SeqNjoy utilizes user-friendly interfaces that allow you to perform a complete RNA-Seq expression gene data analysis in a short number of steps. If you have never performed this kind of analysis, we recommend reading RNA-Seq in short section as a starting point to get familiarized with the overall process.

Flowchart below (Figure 4.1) summarizes how SeqNjoy addresses the different parts of the analysis:

Figure 4.1: Flowchart of SeqNjoy

The Homepage serves as the entry point to the application (located in the upper-left part of the flowchart). This initial page is where you create your projects. A project acts as a container for a single study and includes all the files required for an RNA-Seq data analysis.

The files associated with a project include:
- Input files, such as short reads files, gene feature files, genome reference files, and functional annotations.
- Output files generated during the analysis steps, including alignment files, counts files, and lists of differentially expressed genes resulting from sample comparisons.

From the list of created projects displayed on the Homepage, you can select any project to access its Project’s Main Page, which serves as the central hub of the application. All functionalities are initiated from this page.

From the Project’s Main Page, you can:

Load the input files needed for the analysis.
Access menus to perform each successive step of the analysis.
Browse and manage the files generated at different stages of the analysis.
Access some specific utilities to inspect and analyze the content of the files and the quality report.

Just as a recipe requires initial ingredients to prepare a dish, SeqNjoy requires specific input files to begin processing your project. Depending on the available data, you can follow one of the two workflows described below:

From Short Reads (FASTQ format):
- Multiple short reads files (FASTQ format)
- One genome sequence file (FASTA format)
- One genome feature file (GFF3 or GTF format)

This is the standard input for a complete RNA-Seq data analysis.

From Alignment Files (SAM or BAM format):
- Multiple alignment files (SAM or BAM format)
- One genome feature file (GFF3 or GTF format)

If you start from short reads files (FASTQ format), the analysis process comprises all the steps mentioned before: Quality Control, Preprocess Reads, Align Reads, Quantify Genes and Compare Results, as described in the Figure 4.2, that illustrates the SeqNjoy workflow, showing the dependencies between input and output files at each step. All of them are launched from the Project’s Main Page.

Overview of the SeqNjoy workflow, showing the analysis steps (right), the types of files used (center), and the utilities available for file inspection (left).

Figure 4.2: Overview of the SeqNjoy workflow, showing the analysis steps (right), the types of files used (center), and the utilities available for file inspection (left).

In the case of starting with alignment files SAM/BAM, the workflow is similar, except the Align Reads step is skipped.

The RNA-Seq data analysis proceeds from top to bottom, with the output files from one step becoming the input for the next.

SeqNjoy also provides Utilities to inspect or analyze specific types of files or the quality report (Figure 4.3). These tools can help you better understand the data generated at different stages of the analysis.

Figure 4.3: Utilities in SeqNjoy for file inspection/analysis

In the following sections, we provide destailed description of: - Each step of the analysis - The utilities available in SeqNjoy - The characteristics and use of the files involved For the FIESTA 2.0 graphical viewer, a specific tutorial is available here.

5 First steps in SeqNjoy: creating and managing projects

Once you launch SeqNjoy, the homepage welcomes you with a brief summary and key information about the application.

Figure 5.1: Homepage

Before creating a project, you must select a Project Folder, a directory on your computer or external drive where all new projects will be stored. Click the Projects Folder button to select or create this directory.

A project in SeqNjoy acts as a container for a single RNA-Seq study, including both the input and the generated files. Input files include FASTQ files, genome reference files, gene feature files, and functional annotations. As the analysis progresses, the project also stores generated files such as quality reports, alignment files, count tables, and lists of differentially expressed genes.

Each project is uniquely identified by a project name, which you provide in the text box at the bottom of the homepage (Figure 5.2).

Select Projects Folder and then Create a New Project by writing a project's name. The project will be created in the selected Projects Folder

Figure 5.2: Select Projects Folder and then Create a New Project by writing a project’s name. The project will be created in the selected Projects Folder

After clicking the Create New Project button, you will be directed to the Project’s Main Page, where you can begin your analysis(see Main Page section).

Once you have one or more projects created, the Available Projects section of the Homepage displays a list of your projects along their creation dates (Figure 5.3).

Figure 5.3: List of started projects

For each project, you can either open the project it by clicking the “play” icon , which takes you to the Main Page to start or continue your analysis, or delete the project it by clicking the “basket” icon , which permanently removes the project along with all its associated files from your system.

⚠ IMPORTANT: Deleting a project permanently removes all input and result files, making the project impossible to recover. This action CANNOT be undone.

6 Project’s Main Page: Loading files

Once the project has been created, you can start uploading your input files. First, open the project by clicking the “play” icon in the Homepage and access the Project’s Main Page of the application (Figure 6.1).

Figure 6.1: Project’s Main page

Now, click the Load Files button and select in your computer one or several files of the allowed file types. SeqNjoy supports the following file types and extensions:

File type	File Formats	Allowed File Extensions
Genome sequences	FASTA	fa, fa.gz, fasta, fasta.gz, fna, fna.gz
Genes features	GFF3, GTF	gff, gff.gz, gtf, gtf.gz, gff3, gff3.gz
Short reads sequences	FASTQ	fastq, fastq.gz, fq, fq.gz
Alignments	BAM, SAM	bam, sam
Gene annotations	ANN (text table)	ann, ann.txt, ann.tsv, ann.tab, ann.xls, ann.xlsx

If you are not familiarized with any of these file types, go to the section File Types to know more about the structure of the format and the function and details of each file type.

Figure 6.2 shows the application loading a number of FASTQ files:

Figure 6.2: Uploading files in SeqNjoy

The pop-up window reports the status of file uploading and post-processing by means of a progress bar and informative messages describing the step being performed. When all is OK, you should see a window with an “all in green” aspect like this (Figure 6.3):

Figure 6.3: Uploading process successful

Take into account that in the case of some large and/or complex files the upload may take a considerable amount of time.

7 Downloading genome files

Genome files include two types: (1) a genome reference in FASTA format and (2) a genomic features file (GFF3 or GTF) containing information about chromosomes, genes, CDS, exons, etc.

If you already have a FASTA and a GTF/GFF file, you can upload them following the steps described in the previous section. Otherwise, you can download genome files directly from the Ensembl repositories. Ensembl provides access to a large number of genomes, focusing on the most biologically and medically relevant species.

To download genomes from Ensembl, click the Download Genomes button of the Main Page and then select Download Genomes from Ensembl (Figure 7.1):

Figure 7.1: Selecting the Download Genomes option from the Main Page.

A new window will open where you need to select the taxonomic group of your organism (Figure 7.2).

Important: To access the repository, you must be connected to the Internet, and Ensembl (for vertebrates) or Ensembl Genomes (for bacteria, fungi, metazoa, plants and protists) must be accessible and functional. Ensure that the window displays the message “Online - release <number>” next to the repository name; otherwise, you will not be able to proceed with the download using this option.

Figure 7.2: Taxonomic group selection

After selecting the taxonomic group, choose the specific genome from the dropdown menu (Figure 7.3).

Figure 7.3: Choosing the genome of interest from the dropdown menu.

Once you have selected the species/strain genome, click Check Availability Genome (Figure 7.4) to view the files of that organism available for download. Tipically, you will find the reference genome file (FASTA) and the genomic features file in both common formats, GTF and GFF.

Figure 7.4: Checking the availability of genome files before downloading. The application will list available FASTA and GTF/GFF files

To proceed with the download, select the FASTA file and at least one of genomic features files (GTF or GFF), then click the Download button (Figure 7.5).

Figure 7.5: Selecting genome files: reference genome (FASTA) and genomic features file (GTF/GFF). In this case, a GTF file is selected

Once the download is complete, the genome files will appear in File List on the “Project’s Main Page” (Figure 7.6).

Once downloaded, the genome files appear in the *File List* on the [*Project's Main Page*](#projects-main-page-file-list) along with the rest of loaded files

Figure 7.6: Once downloaded, the genome files appear in the File List on the Project’s Main Page along with the rest of loaded files

8 Project’s Main Page: File List

After you have uploaded your input files, your Project’s Main Page shows the list of the loaded files. The list offers different information and a number of options to operate on them. From the Project’s Main Page you can also perform the successive steps of your RNA-Seq gene expression data analysis project.

This is the aspect of the Project’s Main Page after a standard RNA-Seq data analysis project has been performed (Figure 8.1):

Figure 8.1: Project’s Main Page

The Project’s Main Page is structured as follows.

In the top area the following functionalities are available:

Load Files: Click here to load your input files for the analysis. See Loading Files section for a description of the process of loading and File Types in SeqNjoy for a full description of the allowed file types.
Project Name: This text box informs you of the current project you are working with. You can change the project name at any moment by typing the new name in the text box.
Align Reads: Click here to access the first step of the analysis (short reads alignment). See the corresponding section for more information.
Quantify Genes: Click here to access the second step of the analysis (quantification of genes). See the corresponding section for more information.
Compare Samples: Click here to access the third step of the analysis (differential expression between sample groups). See the corresponding section for more information.
About: Click here to see Credits or obtain Help.

The rest of the Project’s Main Page is dedicated to the File List. The File List area contains information about all the files loaded by the user or generated by the processing with SeqNjoy, and provides a number of operations that the user can run on one or more selected files. The File List comprises the following buttons and columns, described from the left to the right:

Select/Unselect: A column of check boxes. When you check one or more boxes, the corresponding file(s) is/are marked as selected. If you click a checked box, the file(s) returns to unselected status. In case you want to select all the files of the list, you can do in a more convenient way by clicking the uppermost check box (it also serves to “unselect all”).

There are two possible operations you can execute with the selected files:
- Download the selected file(s): you are asked the destination path of the selected files in your system. In the case of results files, a file with the script in R that generated this result is also provided. In the case of diffexp results, an additional third file is provided: an interactive FIESTA 2.0 html page that you can open with a web browser.
- Delete the selected file(s): The files are deleted from the project. Note that files that are being used as input of other files can not be deleted. Be aware that deleted files can not be recovered.
Both operations, Download and Delete, are also available on a file-by-file basis in the two rightmost columns of the File List.
Show buttons: You can filter the File List to hide or show the files having a particular file type(s). For example, if you click a colored “bam” button , all bam files are hidden from the File List and the bam button becomes grayish . Click again the grayish icon to return your bam files to the list. You can click multiple Show buttons to show/hide several file types at one time. Note that only file types with no members selected can be hidden.

The numbers accompanying the file type on the button indicate the total number of existing files of that type and how many of them have been selected via the checkboxes (see Select/Unselect above).
Type: It informs you of the type of the file. Refer the File Types in SeqNjoy section to know more about the file types and their characteristics.
Details: This column contains diverse information about the file and provides access to additional utilities for some file types.

File name: the background color of the file name indicates the file type (e.g red for FASTQ, green for FASTA, etc.)
File content: e.g. “short reads”, “short read pairs”, chromosomes“,”genes“,”features“,”alignments“… In addition, it displays also the number of items (e.g. ”1000000 reads"). For the alignment files, it displays the result of the duplication removal, if applied.
File origin: Shows the source files used to generate the current file (e.g., a BAM file created from specific FASTQ files).
File glimpse: Displays a list of the first identifiers contained in the file (e.g. chromosome identifiers, gene ids)
Introns: Displays information calculated from the GFF file about the introns in the genome. Some of this information (minimum and maximum intron length) is useful to take full advantage of the splice-aware feature of hisat2 aligner.

For some file types, the Details column displays a specific button that launches an specific utility. Each button is colored according to the file type of the selected file. These are the supported utilities:

File Types	Operation	Tool
html	Quality Control Report Visualization	fastp
fasta, bam	Genomic Browser	IGV
gff	Genome features structure	Rgff
diffexp	Visualize, plot and filter gene expression results	FIESTA 2.0

Bytes: Size of the file in bytes.
Date: Day and time of creation of the file.
(Download file icon): Downloads the file in your computer (you are asked where).
(Delete file icon): Deletes the file (Note: this action cannot be reverted).

9 Quality Control

Quality control (QC) of sequencing reads is a crucial step in NGS data analysis. It allows users to assess the quality of their sequencing data before proceeding with downstream analyses. QC reports provide insights into potential issues such as low-quality bases, adapter contamination, and sequence biases. Based on these reports, users can decide whether their reads are suitable for analysis as they are or if preprocessing steps, such as trimming or filtering, are necessary. While QC is not mandatory, it is highly recommended to ensure data reliability.

The QC process is conducted using the software tools Biostrings (H Pagès, 204C.E.) and fastp (Chen et al., 2018).

To perform quality control, navigate to the Project’s Main Page and click the Quality Control button. This opens a new window with a menu containing a single option: Quality Control with fastp (Figure 9.1).

Figure 9.1: Option for Quality Control of the read files (FASTQ)

Clicking this option displays the QC interface, where users can select the FASTQ files to analyze and specify the number of reads to include in the QC process (Figure 9.2).

Figure 9.2: Quality Control window

The QC interface provides fields for selecting one or multiple FASTQ files:

Single-end reads: Select one FASTQ file.
Paired-end reads: Select the first FASTQ file; the application will automatically identify and pair it with the corresponding mate file.

Additionally, users must set the Sample Reads for QC parameter. By default, this is set to 1,000,000, meaning the first million reads (or read pairs) will be analyzed. Setting this value to 0 will process all reads in the selected FASTQ file(s).

Once the desired files and parameters are selected, click the Launch button to start the QC process.

When the QC process completes, the corresponding QC icon to the left of each analyzed FASTQ file (or paired-end set) will be enabled (it was previously grayed out). Clicking this icon opens the QC report, where users can review metrics and visualizations related to read quality. For a detailed breakdown of the Quality Control Report’s content, refer to section Quality Control Report.

After reviewing the QC results, users can decide whether preprocessing steps, such as trimming or filtering, are necessary to improve read quality before proceeding with further analyses.

10 Preprocess Reads

Typically, after performing Quality Control (QC) and inspect the QC report, users may identify issues in their reads that can be corrected through preprocessing. Common problems include the presence of adapter sequences, low-quality regions, and an excessive number of undetermined bases (N). To mitigate these, a preprocessing step can be applied to remove adapter fragments contamination, trim low-quality bases, and discard entire reads that fail to meet quality or length thresholds.

The application uses fastp (Chen et al., 2018) to preprocess FASTQ sequencing files, applying various corrections based on user-defined parameters.

To access this operation, navigate to the Project’s Main Page and click the Preprocess Reads button. This opens a new window containing a single menu option: Preprocess Reads with fastp (Figure 10.1).

Figure 10.1: Option for preprocessing reads

Selecting this option launches the preprocessing setup interface (Figure 10.2).

Figure 10.2: Preprocess Reads window

In this interface, users must select the FASTQ files they wish to preprocess and configure the correction parameters. As with Quality Control, users can select one or more FASTQ files. If the input files are paired-end, selecting one file will automatically populate its mate file.

On the right side of the preprocessing interface, users can configure several parameters. The default parameters are set to provide reasonable results for most datasets, effectively addressing common issues. If you are unsure about which values to use, it is recommended to keep the defaults, as they will correct most typical sequencing errors without over-filtering your data.

Adapter trimming: When enabled, removes adapter sequences from reads. These are unwanted fragments introduced during sequencing library preparation.
Keep reads of ‘n’ nucleotides or longer: Retains only reads with a minimun length of ‘n’ nucleotides after trimming (default: 15 nucleotides).
Allow up to ‘n’ N per read: Reads with more than this number of ambiguous bases (‘N’) are discarded (default: 5 N’s).
Discard reads with low-quality bases: If enabled, removes reads based on two adjustable criteria:
- At most ‘p’ % of bases: Reads in which more than a percentage “p” of the bases have low quality will be discarded. (default: 40% bases).
- With less than ‘n’ phred quality: A base is considered low quality if its Phred score is below a value “q”. (default: phred score 15).
Discard reads with low complexity: When enabled, removes reads with low complexity. A low-complexity region in the genome is an area where the nucleotide sequence follows a repetitive or predictable pattern with little variability. These regions contain limited combinations of bases and lack the diversity that characterizes most functional DNA sequences. Due to their repetitive nature, they can cause issues in bioinformatics analyses, affecting sequence alignment and the accurate detection of variants.
Trim poly-X tails: When enabled, this option removes homopolymeric tails (stretches of the same nucleotide, such as poly-A or poly-T) that are longer than 10 nucleotides. These artifacts can arise during library preparation or sequencing and may introduce biases in downstream analyses.
Trim ‘n’ nucleotides from the 3’ end of read 1: Removes ‘n’ nucleotides from the 3’ end of all reads in the “reads_1.fastq” file (default: 0, no trimming). The same option is available for trimming the 3’ end of reads in the “reads_2.fastq” file.
Trim ‘n’ nucleotides from the 5’ end of read 1: Removes ‘n’ nucleotides from the 5’ end of all reads in the “reads_1.fastq” file (default: 0, no trimming). The same option is available for trimming the 5’ end of reads in the “reads_2.fastq” file.

Additionally, users must define an Output Prefix, which will be incorporated into the filename of the processed FASTQ files to distinguish them from the originals.

Once the FASTQ files, parameters, and output prefix are set, clicking the Launch button will process the files according to the specified adjustments.

The output consists of cleaned FASTQ files, which will appear in the File List on the Project’s Main Page. These quality-improved FASTQ files can be reassessed using a new Quality Control step if needed. Once assessed and, if necessary, preprocessed, the resulting FASTQ files serve as input for the Align Reads step.

11 Align Reads

Click the “Align Reads” button of Main Page to access this window. For performing this task, you inform a genome sequence (FASTA) file and the corresponding short reads files (FASTQ files) and obtain files with the alignments (BAM formatted) of the short reads contained in the FASTQ files to the reference genome (FASTA file).

There are two different aligners available to perform this task, and you should choose one of them in the submenu that is displayed after clicking the Align Reads button (Figure 11.1): bowtie2 (Langmead & Salzberg, 2012) or hisat2 (Kim et al., 2015). Both of them are well-documented and widely used programs.

Figure 11.1: Align Reads: selecting alignment method (aligner)

After choosing any of them you access a window where you have to fill the input FASTQ files and alignment parameters that are necessary to complete the task.

Window for the reads alignment. Input files are the genome reference (FASTA file) and the short reads files (FASTQ files). The shaded areas contains *bowtie*-specific or *hisat2*-specific parameters and therefore are differents according to the selected aligner.

Figure 11.2: Window for the reads alignment. Input files are the genome reference (FASTA file) and the short reads files (FASTQ files). The shaded areas contains bowtie-specific or hisat2-specific parameters and therefore are differents according to the selected aligner.

The Align Reads to Genomes window has the following parts (see Figure 11.2):

Select Genome sequence: select here one FASTA file among the listed in the drop-down menu. Usually, you select the genome reference of the species/variety the sequenced samples come from (or a close one), but you can select any uploaded genome reference that could be more convenient in your particular study.
Select FASTQ files: Below the Genome Sequence you have a grid of drop-down menus arranged in rows and columns. To fill this part in, you must know that there are two modalities of RNA-Seq named single-end and paired-ends sequencing. Refer to the corresponding section to know about them if you need it.
- In “Fastq 1” column you select: a) all your single-end FASTQ files, if your FASTQ file are of the single-end type or b) the files corresponding to the first member of the pair in the case of paired-ends FASTQ files (normally, the filename of these “read 1” files contains the suffix “_1” or similar).
- “Fastq 2” column only applies in the case of paired-end sequencing. In that case, you have to put here the second member of the pair (normally, files with the suffix “_2” or similar in their filesnames). This second member of the pair must be selected side-by-side with the first ("_1“) member that was selected in”Fastq 1" column (as in Figure 11.2). Do not select any file in this column if all the FASTQ files in your experiment are of “single-end” type.
- In addition, you have to fill in a Sample label for each Fastq 1 (single-end) or Fastq 1 plus Fastq 2 (paired-end) item you fill in. This Sample label will name the resulting alignment (BAM) file.
At the end, the aspect of this area will be similar to one of the displayed in the Figure 11.2.
Output files suffix (optional): The text you type in this box will be added to the name of all the output files (BAM files) of this step, immediately after the sample label and before the .bam extension, that is, “Sample_name.suffix.bam” (Example: “Sample1.aligned_with_bowtie2.bam”, “Sample2.aligned_with_bowtie2.bam”, etc.). If no “output files suffix” is typed, the BAM output files will be named simply with the sample name followed by the “.bam” extension (i.e. [Sample_name].bam).
Alignment parameters: The shaded areas of the window contain the parameters that are used by the aligner of your choice, bowtie2 or hisat2. Some options are the present irrespective of the aligner you choose:
- Trim reads from 5’ end / Trim reads from 3’ end: Previously to align reads, usually the FASTQ files are submitted to programs to assess the quality (QA) of the short reads. FASTQC (Andrews et al., 2012) is probably one of the most popular. In Figure 11.3 we can see an example of one of the plots that FASTQC provides and that represents the “quality per base sequence” (see here how to interpret this plot), i.e. an overview of the range of quality values across all bases at each position in the FASTQ file. It is not unusual that sequencing decreases the quality of the positions at the 3’ end (“right” end). In Figure 11.3 the quality drop is so marked from position 120 (approx.) onward that a trimming of this chunk on the 3’ extreme of the reads might be advisable.
  
  Figure 11.3: FASTQC Plot representing the quality per base sequence on a FASTQ file with quality drop in the 3’ extreme.
  
  Other reason to trim nucleotides from the reads ends may be the presence of adapters that for some reason have not been removed after the sequencing. The aim of the trimming is to remove these read chunks of very low quality or that do not correspond to the sequenced sample.
  With the Trim reads from 5’ end parameter you can set the number of nucleotides to be removed from the reads of the file at the 5’ end (i.e. “on the left”). With Trim reads from 5’ end you can set the number of nucleotides to be removed from the reads of the file at the 3’ end (i.e. “on the right”).
- Max CPU Cores: bowtie2 and hisat2 support parallel processing, that is, it can use multiple CPU cores. Multi-core is enabled by default to use (up to) 4 cores. You can modify this value in the text box with a numeric value greater or equal to 1. If your computer has less CPU cores than the number typed here, the aligner program will use the maximum your computer has.
- Remove duplicated alignments: In RNA-Seq protocols (especially if a PCR amplification step is included), there is a certain probability of introducing artificial generation of duplicated reads, that is, reads of the same size aligning exactly in the same chunk of the genome (same start and end positions in the genome, Figure 11.4). The presence of duplicated reads above a certain number may be an indication of the existence of such unwanted technical artifacts during the sequencing process. The application provides this parameter to remove reads that are duplicated a number of times equal of greater to a value (50 by default).
  
  Figure 11.4: Duplicated reads are represented in red color
Specific options for bowtie2.
We use the R package Rbowtie2 (Wei et al., 2018) to align with bowtie2 (Langmead & Salzberg, 2012).
- Performance: Alignment methods search a reasonable trade-off between speed and sensitivity/accuracy in their results in order to reduce false detected alignments without missing too many true alignments. In bowtie2 you can adjust this balance by setting one of the following values the “performance” option: very-sensitive, sensitive (default value), fast or very-fast. The default value, sensitive, keeps a good balance between sensitivity and precision. In the case you need more precision (at the cost of a slower performance) choose very-sensitive. Conversely, choose fast or very-fast to prioritize speed performance (at the cost of a lesser precision).
- Mode: By default, bowtie2 aligns “end-to-end” mode, meaning that it attempts to align the entire read without omitting characters at either extreme. The alternative, local mode, does not require that reads align from one extreme to the other, that is, it can yield alignments that do not include one (or both) of their extremes. Observe in the example of Figure 11.5 that in "local" mode the extremes are not aligned while they are both aligned in "end-to-end" mode.
  
  Figure 11.5: ‘end-to-end’ and ‘local’ modes.
Specific options for hisat2.

We use the R package Rhisat2 (Soneson, 2021) to align with hisat2 (Kim et al., 2015).

All the functionalities of these hisat2-specific parameters are linked to the fact that hisat2 is a “splice-aware” aligner. In practical terms, that means that it uses the information deduced and/or provided about the genomic locations of the splice sites and introns in order to improve the precision of the predicted alignments. Of course, this not applies for intronless genomes. By informing these parameters the alignment may be slightly more accurate (if you do not inform them, hisat2 will do his best anyway.)
- Min(inimum)/Max(imum) intron length: By default, 20 nucleotides and 100000 nucleotides, respectively. Set here the (estimated) length of the shortest/longest intron in your genome. If possible, this information is provided for you by SeqNjoy in the “Details” column (File List) of the corresponding GFF file.
- GFF file (optional). You can select here one of the GFF files you loaded corresponding to this sequencing. From this GFF file SeqNjoy will try to extract the known splice sites of the genome, that will be used by hisat2 to improve the alignments involving splice sites.

Once you have filled the form in with the required files and set your parameter values, the “Launch!” button turns from disabled (grey) to enabled (red) so you can click it and launch the alignment(s). A pop-up window appears informing you of the progress of the alignment task. When the process ends OK a message “done” is displayed in the pop-up window and a “Close” button is enabled. By clicking “Close” you return to the “Align Reads to Genome” pane. Now, you could make a new Align Reads to Genome round or click the “X” icon for closing the “Align Reads to Genome” window. You will then return to the “Project’s Main Page”, which will have been updated to contain the new alignment (.bam) files in its File List.

Figure 11.6: Align Reads to Genome dialog ready to launch the alignments processes.

12 Quantify Genes

Click the “Quantify Genes” button in the “Project’s Main Page” to access this window (Figure 12.1).

Figure 12.1: Quantification window

In summary, you inform here a genome features file (GFF) file and a number of alignment files in BAM. Normally, these alignments have been generated previously in the “Align Reads” step. Alternatively, the BAM files were generated elsewhere and the user uploaded to SeqNjoy.

The output of the quantification is one “counts” file per BAM alignment file. Each “counts” file contains a list with the “counts” , i.e. the number of “reads” or “fragments” that have been mapped to each gene (or other feature type that you select) of the genome.

Genome Features: Select one GFF3 or GTF file from the list of loaded GFF files displayed in the drop-down menu.
Output file suffix (optional): you can type in this box a text chunk that will be added to the name of all the output files of this step (“counts” files), immediately before the .counts extension, that is, “bam_name.suffix.counts” (Example: “Sample1.exons_by_genes.bam”, “Sample2.exons_by_genes.bam”, etc.). If no output file suffix is typed, the resulting files will be named similarly to the corresponding BAM (“sample_name.counts”).
Alignments: In this column of selectors you select the samples (their corresponding BAM alignment files) from which you want to obtain the quantification of counts for one (or several) genomic feature(s). At least one BAM file must be selected.

The following parameters are used by the tool that performs the quantification task, the function featureCounts of the R package Rsubread (Liao et al., 2019):

Strand specificity: Select the type of protocol that was used in the library preparation: “direct”, “reverse” or “unstranded”.
- With unstranded, a read will be counted for a gene irrespective to which strand it maps to.
- With direct and having single end data, the read has to map to the same strand as the gene. For paired end data, the first read of a pair has to map to the same strand as the gene, and the second read has to map to the opposite strand.
- With reverse and having paired end data, the second read has to map to the same strand as the gene, and the first read has to map to the opposite strand.
If you have any doubt, there is a specific section on strand specificity in RNA-Seq in this Tutorial. No default value is set, as it depends on how the sequencing of your reads has been made. If you finally don’t know which strandedness protocol was followed in the sequencing, select unstranded as it is the safest option.
Primary only: A read may map ambiguously to multiple locations, e.g. repeated or closely similar regions in the genome. In that cases, only one of the multiple read alignments is considered “primary”, that is, “the best alignment” based on diverse criteria. All other alignments are marked as “secondary”. By default (box checked) the quantification will count only the alignments marked as “primary alignments” and discard the “secondary”. If not checked, all alignments will be counted. In Figure 12.2, if "primary only" is checked, only the alignment tagged as primary will be counted in feature A for read R001. If not, all the three alignments will be counted: one count for feature A, one for feature B and one for feature C.

Figure 12.2: Primary and secondary alignments.
Ignore Marked Dup[licate]s: This option applies only to alignment (BAM) files uploaded by the user and which have duplicate marking (with Picard Tools, for example). As said in a similar parameter of the “Align Reads to Genomes” dialog, duplicated reads are sometimes generated due to technical issues (Figure 11.4). By default, duplicated reads are not considered in the quantification (checked box). If you uncheck the box, the duplicated reads will be counted.
Max CPU Cores: featureCounts supports parallel processing, that is, it can use multiple CPU cores. Multi-core is enabled by default to use (up to) 4 cores. You can modify this value in the text box with a numeric value greater or equal to 1. If your computer has less cores than the number typed here, the quantification program will use the maximum your computer has.
Genomic features and blocks: The GFF file selected has a number of feature types that you can quantify by. For example, you may wish to get the list of reads that align (counts) per each gene. For that, in the drop-down menu “Select Feature” you should select the “feature type” gene among the existing in your GFF file. Some features, like gene, are constructs that have constituent sub-features underneath. One key sub-feature related to gene expression is exon. In principle, it is more accurately (specially if your GFF is well annotated) to take into account for quantification not all the reads aligning the gene span but only those reads aligning within the (block of) exons of the gene (Figure 12.3).

Figure 12.3: Genomic features and blocks for computing counts in ‘GeneX’. When only the feature ‘gene’ is selected, all reads aligning in the GeneX span are counted (13 reads). When you add the block ‘exon’ to the feature ‘gene’ you are imposing the restriction that only the reads inside exons are counted, so the counts for GeneX are 10.
Add more button: Click to set additional feature types to quantify. For example, if the GFF file contains “genes” and also “pseudogenes” you may want quantify both.

Once you have filled the form with the required files and set your parameter values, the “Launch!” button turns from disabled (grey) to enabled (red) so you can click it and launch the quantification. In Figure 12.4, the quantification process will be launched to count genes for nine samples, with reverse strandedness mode. Only primary alignments will be counted. The output of the quantification is a number of counts files equal to the number of input BAM files, that is, nine in this case.

Figure 12.4: Quantification dialog ready to be launched

A pop-up window appears informing you of the progress of the quantification task. When the process ends OK a message “done” is displayed in the pop-up window and a “Close” button is enabled. By clicking “Close” you return to the Quantify Genomic Features pane. Now, you could make a new Quantify Genomic Features round or click the “X” icon for closing the window. You will then return to the “Project’s Main Page”, that it will have been updated to contain the new “counts” files.

13 Compare Samples (Differential Expression)

In this step you can perform an statistical analysis of the differences in the expression for a given set of features between two groups of samples. You can use it repeatedly to make more that one binary comparison (e.g. “SampleGroupDrugA versus SampleGroupDrugB”, and then “SampleGroupDrugA versus SampleGroupPlacebo” and finally “SampleGroupDrugB versus SampleGroupPlacebo”).

SeqNjoy offers you two possible methods to perform this task, both of them widely used and well documented: DESeq2 (Love et al., 2014) and edgeR (Robinson et al., 2009). If you click the Compare Samples button a submenu listing both options is opened and you can select any of them (Figure 13.1) .

Figure 13.1: Compare Samples: selecting the method for calculating differential expression

After selecting the desired option, the Calculate Differential Expression dialog is displayed (Figure 13.2). Most of this window is the same regardless of the selected software. Only some specific parameters located in the shaded area of the window are different and specific of the selected method (DESeq2 or edgeR) to perform the differential expression calculations.

Figure 13.2: Calculate Differential Expression (Compare Samples) dialog as displayed initially.

The following fields are present in the Calculate Differential Expression window irrespective of the selected method:

Counts Group: Select one “counts group” from the listed in the drop-down menu. A “counts group” is a set of counts files linked a) to a particular GFF AND b) to a particular “feature” or “block-by-feature” counting decided for that GFF in the “Quantify Genes” step. This is expressed in the options displayed in the “Counts Group” drop-down menu with the syntax “gff_file.gff feature (block)” (or “gff_file.gff feature” if no block was selected in “Quantify Genes” step).
Case Label and Control Label: Put here a denomination for the two groups of samples (or biological replicas) to compare, one for the “case”(or “test”, “mutated”, etc.) group and one for the “control” group. These labels are parts of the name of the results file, that has the following syntax: “[CaseLabel]vs[ControlLabel].[Output_suffix_if_typed].xlsx”.
Select Replicate files: Here you have two columns with drop-down menus. Select from the counts group the counts files corresponding to the samples (or biological replicas) you wish to include in the “case” condition (left column) and do the same operation with the samples included in the “control” condition (put them in the right column). To perform the “differential expression calculation”, it is required that you select at least two samples (or biological replicas) in one of the conditions (but take into account that this minimum valid number of samples is extremely poor in statistical terms as input for the differential expression calculation).
Output files suffix (optional): You can type in this box a text label that will be added to the name of the results file, immediately after the “[CaseLabel]vs[ControlLabel]” tag and before the “.xlsx” extension specific for differential expression files, that is, “[CaseLabel]vs[ControlLabel].[output_suffix].xlsx”. If no output files suffix is typed, the resulting file will be named simply “[CaseLabel]vs[ControlLabel].xlsx”.
Annotations (optional): You can add here a file with annotations of your genes (or the specific feature type of the counts files being analyzed). As you can see in the corresponding section, it must match these requirements:
- Be a tab-separated text file, with extension .ann, .ann.txt, .ann.tsv, or an Excel file, with extension .ann.xls or .ann.xlsx.
- Have a first column containing (gene) identifiers, that must be of the same class as the identifiers present in the counts file (for example, if the gene IDs are ENSEMBL IDs in the counts files, the gene IDs must be ENSEMBL IDs in the first column of the user-provided annotations file).
- Have a first column containing (gene) unique identifiers, that must be the same as the identifiers present in the counts file. Regarding your annotations file, take into account that different databases can assign different identifiers to the same feature (for example, ENSEMBL gene IDs are the format ENSG…. and Genbank gene accessions with the format NM…..).
- The rest of columns contain a number of data relative to the respective (gene) identifier (e.g. name, description, genomic location, functional annotations, other IDs, etc), one data per column.
Specific options for DESeq2 (Love et al., 2014):
- Independent Filtering: The default (checked box) is to perform independent filtering. The purpose of independent filtering is to run a process, subsequent to and independent of the particular statistical tests performed by DESeq2, in order to remove from the results those that have no, or little chance of showing significant evidence, what typically increases the statistical/detection power. The simplest criterion to do the filtering is using the mean of all the normalized counts for the evaluated gene. In the box is unchecked, this additional filtering is disabled. More information in this link
- Use beta prior The default (checked box) is to use beta prior, which is equivalent to perform a “Log fold change (LFC) shrinkage”, that is, to shrink the LFC estimates of genes with little information or high dispersion toward more likely (lower) LFC estimates. Figure 13.3 illustrates the effect of using (right) or not (left) beta prior (PC) on the shrinkage of LFC. Observe how using PC a shrinkage on the LFC (“Log2Ratio”, in the Y-axis) is performed. This shrinkage affects those genes that have low expression (low “Log2BaseMean”, X-axis). Red and green dots represent genes upregulated and downregulated, respectively, with p-adjusted (FDR) < 0.05.
  
  Figure 13.3: Effect of beta prior on the shrinkage of LFC.
  
  Use beta prior procedure when your data have a high number of low-expressed genes (low number of counts) and/or a high number of genes showing high difference of expression within a condition (high dispersion in the counts values). You can find additional explanations in this link.
Specific options for edgeR (Robinson et al., 2009):
- Prior Count (optional): This argument is the counterpart in edgeR to the Use beta prior argument in DESeq2. When checked, you can type in the text box a positive number that will be used in the calculations of differential expression as a factor to shrink Log Fold-Changes (LFC). The default is 0.125, a value that produce a minimal LFC shrinkage. Larger values of prior count will produce more shrinkage. Figure 13.4 shows the effect of different values of Prior Count (PC) on the shrinkage of LFC. Observe how the higher the value of PC the larger the shrinkage of LFC (“Log2Ratio”, in the Y-axis). Observe also how the shrinkage affects only the those genes with low expression (low “Log2BaseMean”, X-axis). Red and green dots represent genes upregulated and downregulated, respectively, with p-adjusted (FDR) < 0.05.
  
  Figure 13.4: Effect of different values of Prior Count on the shrinkage of LFC.

Once you have filled the form with the required files and set your parameters values, the Launch! button turns from disabled (grey) to enabled (red) so you can press it and launch the differential expression calculation. In the figure 13.5 differential expression calculation between the Case ("pH5_CO2") group of samples and the Control ("pH5") group of samples is displayed ready to be launched (Launch! button enabled). In the example, an annotations file has been provided so the information contained in this file will be incorporated to the diffexp file with the differential expression calculations.

Figure 13.5: Calculate Differential Expression dialog ready to be launched.

After launching the process, a pop-up window informs you the progress of the “differential expression” task. When the process ends OK the message “done” is displayed in the pop-up window and a close button is enabled. By clicking the Close button you return to the Calculate Differential Expression pane. Now, you can fill in again the form with a new comparison between groups of samples to calculate the differential expression between them or click the “X” icon for closing the window. You will then return to the “Project’s Main Page”, that it will have been updated to contain the newly created diffexp files in Excel format (extension .xlsx) containing the results of the calculated differential expressions of the comparisons performed.

The “diffexp” output file (syntax: [CaseLabel]vs[ControlLabel].[OutputSuffix_if_typed].xlsx) contains the list of differentially expressed genes (DEG) detected between the “Case” and the “Control” groups of samples. Check in the File Types section the information content of the diffexp files. You can graphically inspect and filter the differentially expressed genes results using the FIESTA 2.0 analytic tool.

14 Utilities

In addition to the pipeline to perform the RNA-Seq gene expression data analysis, SeqNjoy provides a number utilities for evaluating and inspecting the content of some of the types of files involved in the analysis. These utilities are launched from buttons located in the File List.

14.1 Quality Control Report

After running Quality Control (QC) on paired-end FASTQ files, a detailed quality report is generated. This report provides key insights into the sequencing data, allowing users to assess whether further preprocessing steps, such as trimming or filtering, are necessary.

This report is generated using fastp (Chen et al., 2018), a widely used tool for FASTQ quality assessment and preprocessing.

To see an example of the Quality Control (QC) report generated for paired-end reads, click here. This example illustrates the types of quality metrics available.

Report Structure

Basic Statistics

This section summarizes fundamental metrics for each read file (R1 and R2), including:

Number of Reads: Total reads analyzed.
Sample Size: The number of reads sampled for QC.
Average Read Length: Mean length of the reads in nucleotides.
GC Content: Proportion of guanine and cytosine bases, which can indicate biases.
Adapter Sequences: Identified adapter sequences, if present.
Adapter Name: The corresponding adapter reference used during sequencing.

Per-Base Sequence Quality

A boxplot displays the quality scores across read positions. This allows users to identify potential quality drops at the ends of reads, which might require trimming.

Per-Base Nucleotide Composition

A graphical representation of base frequency (A, T, C, G) at each position. Unexpected deviations may indicate sequencing artifacts.

Per-Sequence Quality Scores

A histogram showing the distribution of overall read quality. A large proportion of low-quality reads may suggest sequencing issues.

Adapter Content

This section provides a graphical representation of detected adapter contamination, which may require trimming in the preprocessing step.

Duplicate Read Analysis

Displays the proportion of duplicate reads, which can indicate over-sequencing or PCR amplification bias.

Once the QC process is completed, a gray icon next to the analyzed FASTQ files in the File List on the Project’s Main Page will become highlighted. Clicking this icon opens the Quality Control Report.

Based on the report, users can determine whether preprocessing steps, such as adapter trimming or low-quality read filtering, should be applied before proceeding with alignment.

14.2 GFF feature structure hierarchical tree

To access this utility from a GFF (GFF3 or GTF formatted) of the File List click the icon located rightmost in the Details column.

When launched, SeqNjoy will run the plot_features function of the Rgff R package (Garcia-Martin et al., 2022) to generate a picture containing a tree representing the hierarchical structure (dependencies) of the feature types included in the selected GFF file (Figure 14.1). This tree representation is useful for a global look of what is annotated in the genome and it can help you decide which feature(s) and blocks to use in the Quantify Genes step of the analysis.

Figure 14.1: A hierarchical tree of a genome feature file (GFF) produced by Rgff

As it is shown in the example tree above, the highest feature type in this GFF is gene and there are 7127 items annotated as genes. gene has a child node, the transcript feature. Included into transcript there are five child features, that are siblings from each other: CDS, exon, five_prime_UTR, start_codon and stop_codon.

With this information, we know that we could count genes, exons-by-gene, transcript, exons-by-transcript, exons, etc. in our analysis but not “exon-by-CDS” for example, as exon is not a descendant of the CDS feature.

14.3 Visual exploration of genomic files with IGV

You can access this utility from a FASTA file containing the genomic sequence or from an alignment file (BAM), by clicking the icon located on the right of the Details column of the corresponding file. The option is not available for user-provided BAMs.

After clicking the button, the widely used “Integrative Genomic Viewer” (IGV) (Robinson et al., 2011, 2022) tool will be launched and will load the corresponding FASTA or BAM file. If you have loaded a FASTA, a screen similar to the one in Figure 14.2 will be displayed.

Figure 14.2: IGV launched with a FASTA file as input

If you use a BAM file with IGV a similar window is displayed. In addition to the BAM file, the associated genomic sequence (FASTA file) is automatically incorporated to IGV in the same session. See in Figure 14.3 the initial aspect displayed in this case and a minimal “use case” for IGV to visually compare the alignment files of two samples (i.e. two BAM files) for a particular gene.

Figure 14.3: A simple usage example of IGV to compare two BAM files in a certain genomic region.

IGV has a plethora of functionalities and possibilities to interact with your alignment files. Please refer the IGV user general guide, to take full advantage of this versatile tool. In addition, you can inspect your files with the web/on-line version that IGV provides at https://igv.org/app/, documented at https://igvteam.github.io/igv-webapp/.

14.4 Graphical analysis of differential expression results with FIESTA 2.0

FIESTA 2.0 (Oliveros, 2022) is an interactive viewer for differential expression results files that is integrated in SeqNjoy. It provides multiple ways of representing your data as tunable interactive plots that can be downloaded as high resolution images. In addition, FIESTA 2.0 allows you sorting and filtering genes by one or several columns and saving the relevant ones into files.

Figure 14.4 illustrates a usage example for FIESTA 2.0. When a diffexp file is loaded into FIESTA 2.0, a default representation of the expression data as a MA-plot is displayed. From this point, you can combine filters (top) to select gene subsets. A very common (and useful) filter combination is to select and highlight with user-selected colours genes (center) that show to be significantly (P-adjusted < 0.05) under- (negative Log2Ratio) or overexpressed (positive Log2Ratio). Finally, user can select any pair of numerical columns present in the diffexp file to be represented (bottom), so other popular representations like Volcano plots are also available. User-generated plots and subsets of relevant genes are both downloadable (in the case of the plots, with high resolution).

Figure 14.4: FIESTA 2.0 usage example.

To access FIESTA 2.0 viewer from a file containing the differential expression results (“diffexp” files in .xlsx format) click the icon located rightmost in the Details column.

More details about FIESTA 2.0 here.

15 File Types in SeqNjoy

SeqNjoy uses up to seven types of files in the analyses. The files that you provide have these types: FASTA, FASTQ, GFF3/GTF and ANN. Other types are only held by files generated through the SeqNjoy analysis: COUNT, DIFFEXP. Finally, the format BAM/SAM is that of the alignment files, which can be either internally generated by SeqNjoy or provided (loaded) by the user.

To follow, we describe the aspect, source, uses, format requirements and content of each file type. You can find a good complement to know more about them in this link by Andrew Severin and also in this page.

15.1 FASTA file

File type and extensions: plain text file, uncompressed (fa, fasta, fna) or gzip-compressed (fa.gz, fasta.gz, fna.gz), e.g. my_fasta_file.fa, my_fasta_file.fasta.gz
Source: User-provided. In case you do not have a FASTA file of the sequenced species/variety, you have good repositories of FASTA files in Ensembl, EnsemblGenomes, USCS and NCBI. Very often, you will find or have “.gz” FASTA files. “gz” files are compressed to save space. SeqNjoy will process them normally, you don’t need to uncompress them before loading.
Use: As input file In the “Align Reads” task. It is used as genomic reference sequence where the short reads (FASTQ files) are aligned to.
Content: FASTA files have a very simple format to contain the sequence of nucleotides of a genome, composed by one or several genomic sequences. Each genomic sequence item (a chromosome, for example) is preceded by a header line that starts by a “>” character and contains the identifier of that genomic sequence (Figure 15.1)

Figure 15.1: FASTA file format

15.2 FASTQ files

File type and extensions: plain text files, uncompressed (.fq) or gzip-compressed (.fq.gz). e.g.: SampleName_S1_L001_R1.fastq.gz. Normally, the presence of a "_1“-like or”_2"-like label indicates that the FASTQ file is of paired-ends type.
Source: User-provided. FASTQ files are generated by a sequencer machine, so it is the format of the files that the user receive from a sequencing provider. In other cases, users get the FASTQ files from other experiments they want to reuse and that have been deposited (uploaded) in public databases or repositories like SRA or ENA.
Use: FASTQ files are one of the inputs required at the beginning of the RNA-Seq analysis, in the “Align Reads” step. In fact, FASTQ files are the main input file type, as their content, millions of short reads, are the “RNA-Seq data” to be analyzed in your experiment. You will have one file (in single-end sequencing mode) or two files (in paired-ends sequencing mode) per sequenced sample.
Content: Each FASTQ file contains information of millions of so-called short reads, or small genomic sequences that have been “read” by a sequencer machine in a sample. FASTQ is also the name of the format of these files, and is kind of a extension of the FASTA format (it has also a header and lines with genomic sequences).

A FASTQ file uses four lines per short read (Figure 15.2):
- Line 1 begins with a ‘@’ character and is followed by a sequence identifier and an optional description.
- Line 2 contains the raw sequence letters
- Line 3 begins with a ‘+’ character. This line is kind of separator and has not use for the analysis.
- Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence. Each quality symbol in the Line 4 represents the quality value of the sequence letter located in the same position in Line 2. Quality scores are codified as “Phred Scores” that can be translated to numerical values. Quality scores are important for the aligners to make alignment decisions. For more information about this codification and other features of the FASTQ format, click here.
Figure 15.2: FASTQ file format
Final remarks
- Important to note, the order of short reads in FASTQ files has nothing to do with the position of the read in the genome or any other criterion.
- FASTQ files usually contains tens or hundreds of millions of lines, that is, they are very large files which is often a trouble for viewing them in a normal text editor. Most of the times, you do not need to view FASTQ files (they are only intermediate files). If you need to view a FASTQ file for troubleshooting purposes or out of curiosity, you will need either a text editor that can handle very large files, or access to the console of your computer and write commands in the command line.

15.3 Gene Feature File (GFF): GFF3 and GTF formats

File type and extensions: plain text files, uncompressed (.gff3, .gtf) or gzip-compressed (.gff3.gz, .gtf.gz). e.g.: my_gene_features.gff3, my_gene_features.gff3.gz.
Source: User-provided. GFF3 or GFF files are obtained very often in the same place that the file with the genome (FASTA file). This makes sense as a particular “Gene Features File” (GFF) is linked to a specific “FASTA reference file”. If you don’t have a GFF file for your sequenced species/variety you can try to find one in repositories like Ensembl, EnsemblGenomes, USCS or NCBI. Very often, the GFF3/GTF files are compressed (“.gz” extension) to save space. SeqNjoy will process them normally, you don’t need to uncompress them before loading.
Use: As input file for the “Quantify Genes” task. In a simplistic explanation, in Quantify Genes step the aligned reads contained in each alignment (BAM) file are mapped to the GFF3/GTF file to count the number of reads that map within the limits of a each gene (or other selected feature type) contained in the GFF3/GFT file.
Content: A GFF3 or GTF file contains information for the so-called genomic features. A feature is a genomic region with some involvement in a biological process. For example, the mRNA “EDEN.1” (mRNA00001) of the gene “EDEN” (gene00001) in the Figure 15.3 is a feature. In the same figure, that specific feature is described in Line 3 of the exemplified GFF file:

Figure 15.3: Canonical representation of the representation of a single gene in GFF3. Adapted from the GFF3 specifications

The Figure 15.3 depicts how a gene and its features composition (mRNAs, exons, CDSs) are represented in a GFF3. After some header lines (starting with “#”), each feature is written in a line. You can see that each feature has information about its location (start, end, strand) and feature type (gene, mRNA, exon, etc.). A key point is the last column, where relations of dependency between the features are coded. For example, the feature “mRNA00001” (with feature type “mRNA” ) in Line 3 depends on (note its “Parent” tag) another feature named “gene00001” (aka EDEN) in Line 2, which has the feature type “gene”. SeqNjoy has a specific utility for displaying the feature structure of your GFF3/GTF file in a graphical way.

More formally, each line of a GFF3 (or GTF) file has 9 tab-separated columns:

Column 1: seqID (e.g. chromosome-id: “chr1” or “1”, “chr2”, etc..). IMPORTANT: These ids must be the same as the used in the FASTA file. See in the example below (Figure 15.4) how only the GFF3 in green has matching chromosomes ids (chr1, chr2…) with those in the FASTA file displayed.

Figure 15.4: GFF and FASTA chromosome ids must match.
- Column 2: Source (where the feature comes from: a program, a database…) Not important for us…
- Column 3: Feature type (gene, mRNA, CDS, exon, etc.)
- Column 4: Start position of feature in the genome (i.e. in the FASTA file)
- Column 5: End position of feature in the genome (i.e. in the FASTA file)
- Column 6: Score (confidence value). Often this field contains only a dot ‘.’ (uninformed). Anyway, not important for us.
- Column 7: Strand (+,-,.). A dot means uninformed.
- Column 8: Phase (0,1,2 or not informed (.) ).
- Column 9: List of attributes, with a format tag=value per attribute. Multiple attributes are separated by “;”. It holds (among other data) the dependency relations of the feature.

GFF3 is currently the preferred format to represent Gene Features, but it is not uncommon to find gene feature files in GTF format. GFF3 provides more flexibility to represent the hierarchy of the features in a genome but GTF has some advantages in certain cases. SeqNjoy can process files in GFF3 or GTF formats. In both cases, columns 1 to 8 are the same. The 9th column has a different syntax in GFF3 and GTF (Figure 15.5 )

Figure 15.5: Comparison of GTF and GFF3 files for the representation of the features of the same gene

Important: For SeqNjoy, GFF files must contain at least one “gene” in the feature type column.

For more detailed information about GFF formats refer to this, this or this link.

15.4 Alignment files (BAM / SAM files)

File type and extensions: BAM are binary files with .bam extension. You can also load to SeqNjoy the uncompressed/text version of the BAM files (files with .sam extension) but do it ONLY if you do not have the BAM counterparts. The reasons why we advise against using SAM files are:
- They take up lots and lots of memory space from your disk. A BAM file contains exactly the same information than the SAM file in less space.
- Alignment files processed downstream are expected to be BAM files (.sam files do not contribute anything in comparison).
- We do not preserve your loaded .sam files in the original format. In you provide .sam files, we will convert them immediately to BAM and then delete the loaded .sam(s) as they become redundant.
Source: Generated by SeqNjoy in the upstream Align Reads step of the analysis or, alternatively, user-provided.
Use: Alignment (BAM) files are used as input files for the Quantify Samples step. Additionally, you can graphically inspect the alignments of one or more BAM file(s) by using the IGV viewer that can be launched from the File List.
Content: BAM/SAM files follows a format named “SAM format” (Sequence Alignment/Map). This format consists on tab-limited columns that store alignments of the short reads from the FASTQ file that have been mapped to the reference genome (your FASTA file).

A BAM is a binary file and therefore it is not directly readable with an editor (you can use IGV to inspect its content). Getting the uncompressed/text version of a BAM, a SAM file, does not solve normally the problem, as SAM files are very large and any try to browse it with a common editor will most likely end in a crash of the editor after a while.

In brief, a BAM/SAM file contains a complete information per line of each alignment: read identification, chromosome location, alignment type (single-end or paired-ends), mapping quality, presence of insertions and deletions in the alignment, insert size, etc.

You can find detailed information about the SAM format in the SAM Format Specification.

15.5 Counts file

File type and extension: tab-separated plain text file (.counts extension).
Source: Generated by SeqNjoy in the “Quantify Reads” step of the analysis.
Use: count files are used as input files in the “Compare samples” step of the analysis. When comparing two groups of samples in two different conditions, the counts files of the samples in “Case condition” are compared with the counts files of the samples in “Control condition”.
Content: The format of a counts file is very simple. It has only two columns: the Feature_id and the number of raw counts for that feature id. The feature type of the ids in the first column corresponds to that selected for quantifying in the “Quantify Genomic Features” dialog, for example, gene ids if “gene” was selected as “feature”. The second column, raw counts, contains the number of reads that map in the reference genome within the feature span.

Figure 15.6: Counts file example

15.6 Annotations file

File type and extensions: tab-separated text files (ann, ann.txt, ann.tsv, ann.tab) or Excel files (ann.xls, ann.xlsx). e.g.: my_annotations.ann, my_annotations.ann.xlsx.
Source: User-provided file. You can generate your own annotations files using the easy-to-use web-based tool Biomart, available for vertebrates, plants, non-vertebrate metazoa, protists and fungi.
Use: Optional input file in the “Calculate Differential Expression” window. When provided, its information is aggregated to the rest of columns of the “diffexp” file, which is also generated in the “Compare Samples” task. See “Differential Expression File” for more info.
Content: The “ANN” format of an annotations file is very simple (Figure 15.7). The first line is the header line and has the column names. The first column of the file must contain gene identifiers. These identifiers must match (in type, not necessarily in number) those present in the counts files. The rest of columns may contain different biological information related to the genes, e.g. gene description, gene synonyms, biological processes it is involved in, molecular functions, cell compartments, expressed tissues, biological pathways, interactions, orthologous genes, literature links, identifiers to other databases, etc.

Figure 15.7: Annotations file format

It is not necessary that all genes present in the counts files are also present in the annotations file, nor is it necessary that all genes present in your annotations file are present in the counts files.

15.7 Differential Expression file

File type and extension: Differential expression files (diffexp) are Excel files with extension .xlsx.
Source: Generated by SeqNjoy in the Compare samples step of the analysis.
Use: Differential expression (diffexp) file is the final result of the analysis in SeqNjoy. Additionally, you can carry out further graphical analysis of the results contained in diffexp files by accessing the FIESTA 2.0 utility from the Project’s Main Page.
Content: diffexp files contain information generated by the program selected for the detection of Differential Expressed Genes (DEGs) between two groups of samples that you determine to be in different conditions (Figure 15.8). In addition, diffexp files contain the raw counts and the normalized counts of all the samples involved in the comparison and the location of the genes in the genome (information taken from the GFF file). Finally, if an annotations file (ANN) is provided in the Calculate Differential Expression dialog, the diffexp file incorporates the biological information contained in the ANN file for each matching gene of the diffexp file.

Figure 15.8: diffexp file format.

This is the meaning of the columns present in the diffexp files (Figure 15.8):

ID: Feature ID. Identifier of the feature type that was counted. In the case of “gene” this column contains gene IDs.
- Type: Feature Type. e.g. gene.
- Location: Chromosome, initial position and end position of the feature in the genome as annotated in the GFF file, in the format Chromosome:IniPos-EndPos.
- Strand: Strand of the feature (as stated in the GFF file).
- Log2BaseMean: Average of the normalized counts taken over all samples in logarithm to base 2. This value is calculated by the program that performs the differential expression (DESeq2 or edgeR).
- FoldChange(CaseCondition/ControlCondition): The fold change or “ratio” is a measure that describes how much the expression of the feature changes from the Control to the Case condition. For example, given a particular gene with a expression value (averaged between the biological replica of this condition and normalized) for Control equal to 30.5 and for Case is equal to 61.0, the ratio would be 2. That is, “Case” expression is two-fold the “Control” expression for that gene.
- Log2Ratio(CaseCondition/ControlCondition): Logarithm to base 2 of the ratio (fold change) between the counts for Case condition and the counts for the Control condition. Also known as “log Fold Change”. This value is calculated by the program that performs the differential expression (DESeq2 or edgeR). The value in this column is the logarithm to base 2 of the ratio, which is used commonly for analysis and visualization purposes as it is easy to interpret, e.g. a doubling value of the ratio (ratio=2) is a logRatio of 1; if it is a quadrupling value (ratio=4) then the logRatio is 2 and so on. Moreover, the logRatio is symmetric: when a change decreases by an equivalent amount e.g. a reduction by half (ratio=0.5) is equal to a logRatio=-1, a reduction by quarter is equal to a logRatio=-2, etc.
- StdErr(Log2Ratio): The standard error of the Log2Ratio estimate. Only in case you have selected DESeq2 for differential expression detection.
- P-value: Statistical significance (p-value). The p-value is a measure of how likely you are to get the particular LogRatio of that gene (or a more extreme value) if no real difference existed. Therefore, a small p-value indicates that there is a small chance of getting this LogRatio if no real difference existed, so logically you decide that the difference of expression levels between both conditions is significant. By “small” often a threshold of p-value=0.05 is taken, but note that it is a convention.
- P-adjusted: This field contains an “adjusted p-value” obtained with the “False Discovery Rate” (FDR) method by Benjamini and Hochberg (Benjamini & Hochberg, 1995). FDR is a method to cope with the “multiple comparisons problem”. In this link you will find a very good explanation on this problem and why it is a good idea to take the FDR as “the” measure for evaluating statistical significance of your results.
- Annotations columns: Additional columns will be displayed only if you provided an annotations file in the “Calculate Differential Expression” dialog. The annotations contained for each gene/feature in the annotations file will be added to the differential expression information of that gene/feature.
- Raw counts for the CONTROL group of samples (or biological replicas): Several columns. Each column contains the raw counts detected in the Quantify Genomic Features menu task for each gene/feature in a “Control” sample (or biological replica), i.e. the counts contained in the corresponding counts files “as-is”.
- Raw counts for the CASE group of samples (or biological replicas): Idem as the previous “raw counts” but for “Case” samples (or biological replicas).
- Norm counts columns for CONTROL group of samples (or biological replicas): Several columns. Each column contains the normalized counts for each gene/feature in each sample (or biological replica). The normalization method depends on the program selected to perform the Compare samples step. In the normalization step of the differential expression analysis, raw counts are adjusted to account for factors that prevent direct comparison of expression measures. Considered factors in the normalization calculations are the sequencing depth and the RNA composition. See a very intuitive explanation at this link.
- Norm counts columns for CASE group of samples (or biological replicas): Idem as the previous “norm counts” but for “Case” samples (or biological replicas).

References

Andrews, S., Krueger, F., Segonds-Pichon, A., Biggins, L., Krueger, C., & Wingett, S. (2012). FastQC. Babraham Institute.

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x

Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). Fastp: An ultra-fast all-in-one fastq preprocessor. Bioinformatics, 34(17), i884–i890. https://doi.org/10.1093/bioinformatics/bty560

Garcia-Martin, J. A., Oliveros, J. C., & Torres-Perez, R. (2022). Rgff: R utilities for gff and gtf files. National Center for Biotechnology, Spanish National Research Council (CNB-CSIC). https://cran.r-project.org/web/packages/Rgff

H Pagès, R. G., P Aboyoun. (204C.E.). Biostrings: Efficient manipulation of biological strings. https://bioconductor.org/packages/Biostrings.

Kim, D., Langmead, B., & Salzberg, S. L. (2015). HISAT: A fast spliced aligner with low memory requirements. Nature Methods, 12(4), 357–360. https://doi.org/10.1038/nmeth.3317

Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with bowtie 2. Nature Methods, 9(4), 357–359. https://doi.org/10.1038/nmeth.1923

Liao, Y., Smyth, G. K., & Shi, W. (2019). The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Research, 47(8), e47. https://doi.org/10.1093/nar/gkz114

Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12). https://doi.org/10.1186/s13059-014-0550-8

Oliveros, J. C. (2022). FIESTA 2.0: Interactive scatter plots in your browser. National Center for Biotechnology, Spanish National Research Council (CNB-CSIC). https://bioinfogp.cnb.csic.es/tools/fiesta2

Pomaznoy, M., Sethi, A., Greenbaum, J., & Peters, B. (2019). Identifying inaccuracies in gene expression estimates from unstranded RNA-seq data. Scientific Reports, 9(1). https://doi.org/10.1038/s41598-019-52584-w

Robinson, J. T., Thorvaldsdóttir, H., Turner, D., & Mesirov, J. P. (2022). Igv.js: An embeddable javascript implementation of the integrative genomics viewer (igv). bioRxiv. https://doi.org/10.1101/2020.05.03.075499

Robinson, J. T., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E. S., Getz, G., & Mesirov, J. P. (2011). Integrative genomics viewer. Nature Biotechnology, 29(1), 24–26. https://doi.org/10.1038/nbt.1754

Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2009). edgeR: A bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139–140. https://doi.org/10.1093/bioinformatics/btp616

Soneson, C. (2021). Rhisat2: R wrapper for hisat2 aligner. https://github.com/fmicompbio/Rhisat2

Wei, Z., Zhang, W., Fang, H., Li, Y., & Wang, X. (2018). EsATAC: An easy-to-use systematic pipeline for atac-seq data analysis. Bioinformatics. https://doi.org/10.1093/bioinformatics/bty141

Appendix 1: Single-end and paired-ends sequencing

Introduction

After the “fragmentation” step, in RNA-Seq protocols fragments ~300 to 500 bp long are produced. Even this size range is too long to be sequenced entirely by the type of sequencers that generate short reads. What is done then is to read only the end part of the fragment, which produce short reads of 100-200 bases (this interval may vary according to the experimental design, the machine used, etc.) The rest of the fragment is not sequenced.

Now, there are two modalities in RNA-Seq, single-end and paired-ends (Figure 15.9):

Figure 15.9: Single-end and Paired-ends reads: a global view

Single-end RNA-Seq (SE, aka single-read RNA-Seq): Is the simplest mode. Only one end of the fragment is sequenced. All the short reads sequences generated for a sample with single-end (SE) technique are collected in a single FASTQ file.

Paired-end RNA-Seq mode (PE): Two reads per fragment are produced. In this mode, the reading starts at one end, finishes this direction at a pre-specified read length (e.g. 150 bases) and a “read 1” is produced as a result. Then, it starts another round of reading from the opposite end of the fragment, finishes this new (opposite) direction at the same read length and finally a second read, “read 2”, is produced. At the end, both ends of the fragment are sequenced leaving the inner portion of the fragment unsequenced.

In PE sequencing, read 1 reads (those generated first-in-the-pair) and read 2 reads (generated secondly) are stored in two separated FASTQ files. One file contains all the read 1 reads the other file contains all the read 2 reads. Each of these files has a label in its name that identifies whether it contains read 1 or read 2 reads, something like Sample1_blah_blah_1.fastq and Sample1_blah_blah_2.fastq (or something similar like Sample1_blah_blah_R1.fastq and Sample1_blah_blah_R2.fastq, etc.) Note that usually the only difference in the filenames between two FASTQ paired-ends files of the same sample is just this "_1" or "_2"-like particles.

In addition, "_1" and "_2" FASTQ files of a particular sample have both the same number of reads. Moreover, they kept the same order of reads (see Figure 15.9): at a given position X in both files, the reads in “_1” and “_2” correspond to the same fragment. In fact, as with the files, the read identifier that is given to the “read 1” and “read 2” reads of the same fragment is the same… except again for the particle indicating whether it is for reads 1 or reads 2.

Reads orientation

To know how is the orientation of the reads in SE and PE is useful to interpret correctly the alignments of the reads and to understand important topics like the unstranded or stranded protocols in RNA-Seq.

The sequencing always synthesizes the reads in a 5’ -> 3 direction. (It is the way the DNA polymerase proceeds) so the template strand is traversed 3’ -> 5’.

In single-end, the reads extend in the 5’ -> 3’ direction along the forward (plus) DNA strand (Figure 15.10).

Figure 15.10: Single-end read orientation

In paired-ends, the read 1 read extends in the 5’ –> 3’ direction towards read 2 along the forward (plus) DNA strand. As for the read 2 read, it extends in the 5’ –> 3’ direction towards read 1 along the reverse (minus) DNA strand. Read 1 is the reverse complement of the first 3’-most bases of the forward strand (therefore identical to the first 5’-most bases of the reverse strand) and read 2 is the reverse complement of the first 3’-most bases of the reverse strand (therefore identical to the first 5’-most bases of the forward strand). See the read 1 and read 2 orientations represented in Figure 15.11:

Figure 15.11: Paired-ends FR reads orientation

Therefore the sequences of the two reads of a pair are pointing towards each other on opposite strands. When you align them to a genome, one read should align to the forward strand, and the other should align to the reverse strand, at a higher base pair position than the first one so that they are pointed towards one another. This is known as an “FR” read pair or “forward/reverse”, as this is the order in which they are placed. A read pair having this orientation is called a “proper pair”. There are other possible orientations, but we don’t need to go into them here.

When you use IGV genomic browser you can see this orientation in the case of a read pair of a PE sequencing. Take into account that in IGV only the forward (plus) strand of the genome is displayed. With the “Color by” -> “read strand” option the read in the forward strand is colored red and the read of the reverse strand is colored blue. Observe in the figure above a read pair meeting the FR orientation. They are even more visible if you select “View in Pairs” option (Figure 15.12).

Figure 15.12: Paired-ends FR reads orientation as seen in IGV.

Single-end or paired-ends, which is better?

Both single-end (SE) and paired-ends (PE) can be used for measuring gene expression. PE offers the advantage over the SE that the distance covered by the R1 and R2 and the unsequenced sequence in-between is controlled on average (it is named “insert size” and is usually in an approximate range 200-600 bp) so the aligners can take advantage of this information to be more accurate in the correct alignment of the reads to the reference genome. The point is that a read pair properly (FR) oriented, covers a longer span in the genome (read_1 + inner_distance + read_2, Figure 15.13) and therefore a stronger evidence of being well aligned than a single read.

Figure 15.13: Insert size in Paired-ends reads

The main disadvantage of the PE protocol is that it is normally more expensive and time-consuming to generate than single-end reads. On the other hand, just for counting reads or calculate differential expression may be unnecessary to have such a high level of accuracy, especially if your species is well-annotated. But take into account that in the future you might want to reuse the reads intended to study differential expression to, for example, extend the study to look for alternative splicing. In that case, you will need to have done PE sequencing as SE is insufficient for alternative splicing analysis.

Appendix 2: Strandedness in RNA-Seq

Understanding Strandedness in Transcription

During transcription, one of the two DNA strands serves as a template for synthesizing the RNA molecule. Since the template strand is reverse complementary to the transcribed RNA, it is referred to as the antisense strand, negative-sense strand, non-coding strand, or minus strand.

Conversely, the non-template strand has the same sequence as the transcript (except for T instead of U) and is known as the sense strand, positive-sense strand, coding strand, or plus strand (Figure 15.14).

Figure 15.14: Representation of sense and antisense strands in transcription

It is important to note that either DNA strand can contain genes, meaning that the designation of sense and antisense is gene-specific (Figure 15.15):

Figure 15.15: Sense and antisense strand assignment depends on the gene

RNA-Seq protocols: unstranded vs. stranded

RNA-Seq protocols generate single-stranded mRNA molecules that are then converted into double-stranded complementary DNA (cDNA). These cDNA molecules are subsequently amplified, producing millions of short sequencing reads. As a result, for any given transcript, the sequencing reads fall into two categories:

Sense reads: These have a sequence identical to the original RNA transcript (except for T instead of U).
Antisense reads: These are the reverse complement of the original RNA transcript sequence.

Initially, most RNA-Seq library preparation protocols were non-strand-specific (also called unstranded). In these protocols, sequences from both the sense strand (the cDNA strand encoding the transcript) and the antisense strand (the reverse complement of the original mRNA) are sequenced without preserving strand information.

This means that in the resulting FASTQ files, _1.fastq and _2.fastq contain a mixture of sense and antisense reads, making it impossible to determine whether a given read originated from the sense or antisense strand of the original mRNA.

Figure 15.16: Comparison of unstranded and stranded RNA-Seq protocols

When assigning reads to genes, it is ideal to match all reads derived from the sense strand to a sense gene in the reference genome and all reads from the antisense strand to an antisense gene. However, in an unstranded protocol, this is not possible—some antisense reads may be incorrectly assigned to sense genes, and vice versa. This can introduce biases in gene expression quantification.

Now, consider the stranded protocol illustrated on the right in Figure 15.16. Here, the second-strand cDNA fragment (the sense strand) is prevented from amplification. This results in only the first-strand cDNA (antisense strand) being amplified. Consequently, strand information is retained in the sequencing data:

In reverse-stranded protocols, the _1.fastq file contains antisense (reverse) reads, as they are sequenced first. The _2.fastq file contains sense (direct) reads, sequenced second.
In direct-stranded protocols, the sense (direct) reads are amplified and stored in _1.fastq, while the antisense (reverse) reads are stored in _2.fastq.

FASTQ File Structure in Different Protocols

In stranded RNA-Seq protocols, strand information is preserved by selectively amplifying only one strand of cDNA. This distinction directly impacts how reads are stored in the FASTQ files and determines whether a given read corresponds to the sense or antisense strand of the original RNA transcript. Figure 15.17 illustrates how sequencing reads are distributed across FASTQ files depending on the protocol used.

Figure 15.17: Unstranded and stranded protocols and FASTQ files.

Unstranded RNA-Seq: The sense read can be stored in either _1.fastq or _2.fastq, making strand determination impossible.
Direct-stranded RNA-Seq: The sense read is stored in _1.fastq.
Reverse-stranded RNA-Seq: The sense read is stored in _2.fastq.

By ensuring that all sequencing reads carry strand information, stranded protocols prevent incorrect assignment of antisense reads and eliminate potential biases in gene quantification.

Several techniques exist for generating stranded RNA-Seq libraries. One of the most widely used involves chemical modification with dUTP. This method incorporates dUTP specifically into the non-sense strand of the cDNA, preventing its amplification during PCR (15.16, right). This approach is known as reverse-stranded RNA-Seq, since the reverse (antisense) strand is amplified and sequenced first.

Alternatively, some stranded protocols selectively deplete the sense strand instead of the antisense strand, resulting in a direct-stranded RNA-Seq protocol. In this case, the sense (direct) strand is amplified and sequenced first, while the antisense strand is discarded.

Getting Down to Practicalities

When your provider delivers the short-read files (FASTQ files) from your RNA-Seq experiment, it is recommended to check the sequencing report to determine which library preparation method was used: unstranded (non-stranded), direct-stranded, or reverse-stranded. If this information is not explicitly mentioned, request it from the provider, as it is crucial for accurate transcript quantification (counts). Once you have identified the protocol type, select the appropriate option in the Strand Specificity selector of the Quantify Genomic Features window.

However, the protocol may not always be listed under these exact names. For example, some reports may refer to the protocol as “directional RNA-Seq,” which is equivalent to “stranded RNA-Seq.” In other cases, only the name of the commercial kit used for library preparation might be provided. The following table serves as a reference to help interpret these variations:

Protocol	Synonym	Examples/Kits¹	Description
Unstranded	Non-stranded	Standard Illumina, TruSeq RNA Sample Prep Kit, NuGEN OvationV2, SMARTer Universal Low Input (TaKara), GDC normalized TCGA data	Strand information is not retained (lost).
Reverse-stranded	First-strand	dUTP, NSR, NNSR; TruSeq Stranded (Total RNA/mRNA), NEB Ultra Directional, Agilent SureSelect Strand-Specific	The second read (read 2) corresponds to the original RNA strand, while the first read (read 1) aligns to the opposite (reverse) strand.
Direct-stranded	Second-strand	Directional Illumina (Ligation), Standard SOLiD, ScriptSeq v2, SMARTer Stranded Total RNA, NuGEN Encore Complete, NuGEN SoLo, Illumina ScriptSeq	The first read (read 1) corresponds to the original (sense) RNA strand, while the second read (read 2) aligns to the opposite strand.

¹ The information in this column was compiled from this, this, and this source.

Regardless of the reported protocol, we recommend verifying the strandedness yourself by following an approach like the one described in the next section.

How to Determine the Strandedness with IGV

In some cases, you may not have access to information about the strandedness of your RNA-Seq data. This can happen if you obtained your FASTQ or BAM files from an online repository that lacks metadata or if the dataset is poorly documented.

Fortunately, there are strategies to infer strandedness from the data itself. One such strategy involves using the IGV (Robinson et al., 2011, 2022) genomic browser.

Step 1: Load the Required Files in IGV

To analyze strandedness, you need to load the following files into IGV:
- Reference genome (FASTA format)
- Gene annotation file (GFF3/GTF format)
- BAM file(s) containing the RNA-Seq alignments

Once loaded, you can use IGV’s Color by Read Strand option to visualize the alignment patterns.

Step 2: Understanding Read Orientation in IGV

As described in the Reads Orientation section, paired-end reads typically align in an FR conformation (Forward-Reverse), where:
- Read 1 aligns to the forward (+) DNA strand, and
- Read 2 aligns downstream on the reverse (-) DNA strand (Figure 15.12).

IGV allows you to visualize read orientation using the Color by Read Strand option:
- Forward strand reads are colored red.
- Reverse strand reads are colored blue.

Additionally, paired-end sequencing assigns each read to one of two FASTQ files:
- “reads_1.fastq” contains all Read 1 sequences.
- “reads_2.fastq” contains all Read 2 sequences.

Now, let’s examine how this information can be used to infer strandedness.

Step 3: Detecting Strandedness Using Read 1 Orientation

By inspecting only Read 1 (reads_1.fastq), you can infer the strandedness of your data:

Unstranded Protocol:
- reads_1.fastq contains a mix of red and blue reads, meaning some align to the forward strand and others to the reverse strand.
- This happens because strand information is lost during library preparation.
Direct-Stranded Protocol:
- reads_1.fastq contains only forward-aligned reads (red) when mapping to genes on the forward strand of the genome.
- Likewise, it contains only reverse-aligned reads (blue) when mapping to genes on the reverse strand.
Reverse-Stranded Protocol:
- reads_1.fastq contains only reverse-aligned reads (blue) for genes on the forward strand.
- It contains only forward-aligned reads (red) for genes on the reverse strand.

Step 4: Using “Color by First-of-Pair Strand” in IGV

IGV provides another useful visualization tool clicking the track’s gear menu of IGV and selecting Color by → First-of-Pair Strand in the BAM track (Figure 15.18).

Figure 15.18: option ‘first in pair’

When enabled, both reads of a pair are colored based on the strand to which Read 1 aligns:
- If Read 1 aligns to the forward strand, the entire pair is red.
- If Read 1 aligns to the reverse strand, the entire pair is blue.

This visualization allows for a quick assessment of whether Read 1 is consistently aligning with a particular strand, which is crucial for determining the strandedness protocol.

Step 5: Verifying the “Pair Orientation” in IGV

By left-clicking a read in IGV, you can access additional alignment details (Figure 15.19). One key piece of information is the Pair Orientation, which can be either F2R1 or F1R2:

Figure 15.19: Pair orientation in IGV

F1R2 → Read 1 aligns to the forward strand (red) and comes from reads_1.fastq, while Read 2 aligns to the reverse strand (blue) and comes from reads_2.fastq.
F2R1 → Read 1 aligns to the reverse strand (blue) and comes from reads_1.fastq, while Read 2 aligns to the forward strand (red) and comes from reads_2.fastq.

These orientations help confirm whether the sequencing protocol follows a direct stranded or reverse stranded approach.

Final Interpretation: Identifying the Library Protocol

Using the tools above, you can now determine the library preparation method:

Unstranded: Mixed ‘red and blue’ reads in reads_1.fastq.
Direct stranded: Red reads for genes on the forward strand, blue reads for genes on the reverse strand (Figure 15.20).

Figure 15.20: Direct stranded: read pairs in red for genes in forward (+) strand and in blue for genes in reverse (-) strand.
Reverse stranded: Blue reads for genes on the forward strand, red reads for genes on the reverse strand (Figure 15.21).

Figure 15.21: Reverse stranded: read pairs in blue for genes in forward (+) strand and in red for genes in reverse (-) strand.

By comparing your observations with these expected patterns, you can confidently determine whether your RNA-Seq library was prepared using an unstranded, direct stranded, or reverse stranded protocol.

To verify that you have grasped the concept, try to determine which strandedness protocols were used for BAM-1 and BAM-2 in Figure 15.22. Remember that the reads have been colored using the “First-of-Pair” option in IGV.

Figure 15.22: What are the strandedness protocols in BAM-1 and in BAM-2?

Final remark: If you are still unable to determine strandedness, the safest option is to select “unstranded” during analysis. This ensures that no reads are mistakenly discarded.

A study by (Pomaznoy et al., 2019) found that ignoring strand information can significantly affect gene expression estimates that “10% of all genes and 2.5% of protein-coding genes show a two-fold or higher difference when strandedness is ignored.”

The answer for the question about the strandedness protocols in Figure 15.22 is that:

BAM-2 follows an unstranded protocol, as it contains a mix of red and blue reads. This indicates that Read 1 aligns to both DNA strands, meaning strand information was not preserved during sequencing.
BAM-1 follows a direct stranded protocol, as:
- Reads mapping to a gene on the forward (plus) DNA strand are red, corresponding to an F1R2 orientation.
- Reads mapping to a gene on the reverse (minus) DNA strand are blue, corresponding to an F2R1 orientation.

This confirms that BAM-1 was sequenced using a direct stranded RNA-Seq protocol, while BAM-2 lacks strand specificity.

Appendix 3: FAQ

How to cite SeqNjoy?: Juan Antonio Garcia-Martin, Rafael Torres-Perez and Juan Carlos Oliveros (2022). SeqNjoy: Complete RNA-Seq projects in your Desktop. (manuscript in preparation). https://bioinfogp.cnb.csic.es/tools/seqnjoy
(Please consult this tutorial for reference updates).
My computer has an internal disk low on space. Can I install and use SeqNjoy?: Yes, although RNA-Seq files require considerable amounts of hard disk space, users can store project on local or USB-connected hard drives of their choice. Install SeqNjoy and select a Projects Folder on an dedicated external hard disk with enough space.
Can I recover my files from SeqNjoy after deleting a project?: No, the files are deleted without the possibility of being recovered.
Why cannot I type some characters when naming a project, a sample label or an output suffix?: Some non-alphabetic characters cannot be part of an output file name. That is the reason we limit the use of special characters in labels or suffixes.
Which annotation format should I use, GTF or GFF3?: Both formats contain similar information about genes. GFF3 allows any number of parent-child levels and types of genomic elements whilst GTF is more strict. Many genomic files providers tend to favour GFF3 over GTF, however many bioinformatic tools still use GTF as input as it is best defined. For example, the tool that detects the number and size of introns in SeqNjoy works well with any properly formatted GTF files but not so well with most GFF3.
Can I use SeqNjoy just to align short-reads?: Yes, you can align with SeqNjoy and export the alignment files (in standard BAM format) to be visualized or quantified with another tool of your preference.
Why aren’t my paired-end FASTQ files recognized as a pair?: For two paired-end FASTQ files to be correctly recognized, they must meet the following criteria:

- The filenames must differ by only one character.
- Both files must contain the same number of reads.
- The first read in each file must differ by at most one character, ignoring any possible “(length=XX)” tag present in the FASTQ files.

If your files are not detected as paired, please verify that they meet these requirements.
Can I compare samples without using biological replicates?: No. At least one of the sample groups to be compared must contain two or more biological replicates.
Why do some boxes remain red and the “Launch!” button remains disabled after completing a dialog?: That occurs because there are already result files with matching names to those that will be generated in the dialog. The solution is to change one or several sample labels or the common output suffix.
What are the .R files downloaded together with my project’s files?: All output files in SeqNjoy are generated from other files by applying R instructions on them. The “.R” files contain these instructions to help advanced users to create their own pipelines using R language outside SeqNjoy. Please note that these R codes are generated automatically by SeqNjoy and are not optimized for analyzing large number of samples (loops, lists, variables etc. are not used).
Can I analyze my results with FIESTA Viewer outside SeqNjoy application?: Yes. When a differential expression file is saved (in .xlsx format) the corresponding “.fiesta2.html” is also saved. This .html file is a standard web page that can be opened in another computer to filter and exporting biologically relevant results. You can even publish your results on-line by copying any .fiesta2.html file on a public URL.
When I perform the first alignment with bowtie2 for a given genome, I got the following error. “Error in bowtie2_samtools…”.: If using Windows, Python3 command-line tool is not available, so bowtie2 fails to index the genome. The solution is to install Python3 command line tool from the Microsoft Store.
When I launch SeqNjoy on Windows I am prompted with the message “SeqNjoy finished with errors. Please check that another instance of SeqNjoy is not already running.”: If you are not already running SeqNjoy, open the Task Manager and look for running instances of R (named “R for Windows terminal front-end”) and terminate them.
By using SeqNjoy, are my result files exposed, in some way, on the internet?: No. Input and results files generated with SeqNjoy remain stored in a specific folder on your local hard disk. SeqNjoy can even be used with no internet connection.

SeqNjoy: Complete RNA-Seq workflow in your Desktop. Developed by BioinfoGP CNB-CSIC (2022)

SeqNjoy: Complete RNA-Seq workflows in your Desktop

Tutorial

Juan Antonio García-Martín, Rafael Torres-Pérez, Juan Carlos Oliveros

1 Introduction