SeqNjoy: Complete RNA-Seq workflows in your Desktop

Tutorial

Juan Antonio García-Martín, Rafael Torres-Pérez, Juan Carlos Oliveros

SeqNjoy | BioinfoGP | CNB | CSIC

1 Introduction

SeqNjoy is a stand-alone application to generate and analyze differential gene expression results from RNA-Seq experiments. Analyses can be performed from the very initial step using the files containing short reads (also known as FASTQ files), or from an intermediate stage using alignments files (BAM/SAM files). The result of the SeqNjoy analyses are files containing annotated list of the genes (or other genetic features) differentially expressed between groups of samples in different conditions, and interactive HTML pages that allow the user to graphically analyze, filter and download selected genesets of interest.

This tutorial is intended to guide you through the process of analyzing your RNA-Seq data with SeqNjoy. For each step, we describe the input type of files required, the available parameters and the outputs the step yields. In addition, to make you more aware of what you are doing, we explain how are these files constituted, the meaning and consequences of each parameter and some key concepts you should be clear about when analyzing RNA-Seq data. The goal is that you become able to perform a complete RNA-Seq gene expression analysis on your own.

SeqNjoy has a series of minimal requirements. Before trying to install it, check if your computer meets these requirements.

2 Computer requirements and installation

2.1 Computer requirements

Minimum Recommended
Architecture 64-bit
Operative System Windows, macOS, Linux (Debian, Fedora or SUSE-based)
Web Browser Chrome, Firefox, Safari, Edge
RAM Memory 8 GB 16 GB
Disk Space 500 GB 1 TB

NOTE: minimum requirements are adequate for carrying out analyses involving bacterial genomes, while recommended requirements are adequate also for eukaryotic genomes.

SeqNjoy is optimized to be used with 1920 x 1080 screen resolution (or higher).

It is recommended to install SeqNjoy on a dedicated hard drive, not shared with other documents or applications. Installing SeqNjoy on an external, USB-connected, 4-5 TB hard drive is a convenient option that helps to keep all files from multiple projects organized.

Some of the steps performed by SeqNjoy may take many hours to complete. It is not recommended to run the application along other CPU or RAM intensive processes.

Please disable Sleep Mode or Hibernation in your computer to avoid interruptions of long lasting processes

Connection to the internet is NOT required (it is a stand-alone application), as long as all the libraries and packages needed for the analysis reside on the file system (your computer or a external hard disk, for example) where the installation is done.

Specific SeqNjoy installers are available for Windows, macOS or Linux operative systems. Go to our Download site and choose the one that corresponds to your operative system. Installers are named SeqNjoy_X.X.X-OS, where X.X.X stands for the version number and OS indicates the operative system.

2.2 Windows (Graphical installation)

Download on your PC the SeqNjoy setup executable (SeqNjoy_X.X.X-Windows-setup.exe file) from our Download site, run it and follow the instructions.

Once successfully installed, double-click SeqNjoy desktop icon or search SeqNjoy into the programs menu to start the application.

2.3 macOS (Command-line installation)

This installer has been tested on macOS 10.14.06 (Mojave).

Prerequisites: R (version 4.1.2 is highly recommended) must be installed before executing the installation script. You can download the binary installer from CRAN page: https://cran.r-project.org/bin/macosx/base/R-4.1.2.pkg.

Installation steps:

The first execution will install the required binary and source R packages included in the installer. Once the required packages are installed, the program will start.

2.4 Linux (Command-line installation)

This installer has been tested on the following Linux platforms, all of them in the 64-bit architecture version: Debian 10.10, Ubuntu 20.04, Linux Mint 20.2, Fedora 34, Rocky Linux 8.6 and openSuse Leap 15.3.

Installation steps:

      tar xzvf SeqNjoy_X.X.X-Linux.tar.gz
      cd SeqNjoy_X.X.X-Linux
      ./runSeqNjoy.sh

During the installation:

After the installation:

2.5 Source code availability

Source code is available for download at BioinfoGP’s GitHub repository.

3 RNA-Seq in short

RNA-Seq is a high-throughput technique of sequencing that allows to detect and quantify the presence of RNA in biological samples, which is a proxy for detecting and quantifying gene expression, among other applications.

The “wet lab” part of the RNA-Seq process (Figure 3.1, top) starts with the “library preparation”, a process that comprises some steps: RNA extraction from the samples, filtering to keep only mRNA molecules and deplete other RNA species (ribosomal RNA, for example), reverse transcription from mRNA molecules to more stable cDNA molecules, fragmentation and size-selection of the fragments to get smaller chunks of nucleic acids (more convenient to deal with) and the sequencing of short reads obtained from the fragments.

The sequencer is the machine in charge of “reading” the nucleotide sequences corresponding to the mRNA transcripts, generating millions of “short reads” that are stored in a number of “FASTQ files”, one per sample (or biological replica) of your experiment. These FASTQ files are the input files of the RNA-Seq data analysis process. And you can perform this analysis by yourself using SeqNjoy!

Figure 3.1 displays an schematic view of the standard RNA-Seq pipeline to analyze mRNA expression in biological samples. The upper part describes the experimental part of the protocol. The yellowish area (bottom) contains the three main steps for the analysis of the RNA-Seq data of the sequenced samples.

Schematic view of the standard RNA-Seq pipeline to analyze mRNA expression

Figure 3.1: Schematic view of the standard RNA-Seq pipeline to analyze mRNA expression

In brief, the bioinformatic analysis of a RNA-Seq experiment comprises three consecutive steps:

  1. Alignment to reference genome: The short reads of the input FASTQ files are aligned to a FASTA file containing the genome sequence selected to align to. As a result, files with the aligned reads in BAM format are generated (one BAM file per each FASTQ file or pair of FASTQ, depending of the RNA-Seq type of sequencing: single-end or paired-ends sequencing).

  2. Gene quantification: The aligned reads of each BAM file are mapped to a file containing the start-to-end positions of each gene (GFF file). For each gene, all reads mapped to it are counted. As a results, a “counts file” is generated per BAM file. A counts file contains a list of all the genes of the genome and the number of reads (“counts”) mapped to each gene.

  3. Samples comparison / Differential expression calculation: The counts files of the samples are grouped into two sets: one set of counts files is assigned to the “Control” samples group and other set of counts files are assigned to the “Case” samples group. The aim of this step is to statistically compare gene by gene the expression of that gene in each group and determine if there is a significant differential expression between both groups and the magnitude of this difference. The final result of this step is a file containing a table with the results of the differential expression test per gene.

4 SeqNjoy framework in short

SeqNjoy utilizes user-friendly interfaces that allow you to perform a complete RNA-Seq expression gene data analysis in a short number of steps. If you have never performed this kind of analysis, we recommend reading RNA-Seq in short section as a starting point to get familiarized with the overall process.

Flowchart below (Figure 4.1) summarizes how SeqNjoy addresses the different parts of the analysis:

Flowchart of SeqNjoy

Figure 4.1: Flowchart of SeqNjoy

The Homepage is the entry point of the application (upper-left part of the flowchart). In this initial page is where you create your projects. A project is a container for one single study and comprises all the files used for a single RNA-Seq data analysis. The files of a project are a) the ones loaded as input (short reads files, gene feature files, genome reference file, functional annotations) plus b) all the files generated in the different steps of the analysis (alignments files, counts files and lists of differentially expressed genes as a result of comparing samples).

From the list of created projects registered in the Homepage, you can select any of them and access the Project’s Main Page, which is the central hub of the application. All the functionalities are launched from this page. From the Project’s Main Page you can:

As in a recipe you need to add some initial ingredients to start your dish, you need to load certain input files to start the analysis of your project. You should load one of the following two combinations of files for starting the processing in SeqNjoy:

In the first case, when you load FASTQ plus GFF plus FASTA files, the analysis process (“the recipe”) comprises three successive steps: Align Reads, Quantify Genes and Compare Results, as described in the Figure 4.2:

Analysis steps in SeqNjoy starting from Short Reads (FASTQ) files

Figure 4.2: Analysis steps in SeqNjoy starting from Short Reads (FASTQ) files

In the second case, when you load alignment files (BAM/SAM files) the workflow is similar. The only difference is that the initial “Align Reads” step is skipped (as you have already aligned reads in your BAM/SAM files…), see Figure 4.3:

Analysis steps in SeqNjoy starting from Alignments (BAM/SAM) files

Figure 4.3: Analysis steps in SeqNjoy starting from Alignments (BAM/SAM) files

As seen in Figure 4.1, there are some Utilities that you can use to inspect or analyze the content of files of certain types (Figure 4.4).

Utilities in SeqNjoy for file inspection/analysis

Figure 4.4: Utilities in SeqNjoy for file inspection/analysis

Figure 4.5 displays the whole picture of the functionalities of SeqNjoy to perform a RNA-Seq data analysis. All of them are launched from the “Project’s Main Page”. The RNA-Seq data analysis proceeds from top to bottom. Observe how the files being the output of some steps are the input of the following:

Functionalities in SeqNjoy.

Figure 4.5: Functionalities in SeqNjoy.

In the next sections, we describe each step of the Analysis, each Utility, as well as the characteristics and use of the involved files. For FIESTA 2.0 graphical viewer you have available an specific tutorial accessible here.

5 First steps in SeqNjoy: creating and managing projects

Once you have launched SeqNjoy (clicking the SeqNjoy icon (Windows) or typing ./runSeqNjoy.sh in the console (Linux/MacOS)) you land in the Homepage (Figure 5.1). In the upper part, the page welcomes you and gives you a summary and important information relative to the application.

Homepage

Figure 5.1: Homepage

The lower panel contains the list of Available Projects. As no project has been created yet, this list is displayed empty at first.

First of all, let’s define what a project is. A project is a container for one single study and comprises all the files used for a single RNA-Seq data analysis. These files are the ones uploaded as input plus all the files resulting from the analysis: alignments, counts, lists of differentially expressed genes.

Each project is identified by a project name that you provide in the text box located at the bottom of the page, as seen in Figure 5.2.

Creating your first project

Figure 5.2: Creating your first project

After clicking the Start New Project button, you are directed to the Project’s Main Page to start working with the project (see the Main Page section).

Once you have one or more projects created, the Available Projects area of the Homepage contains a list with the name and date of creation of your projects (Figure 5.3).

List of started projects

Figure 5.3: List of started projects

At this point, you can perform two simple operations for a particular project by clicking the icons located at both extremes of the project name. If you click the “play” icon the project is opened and you are directed to the Main Page to start or continue your data analysis. On the other side, by clicking the “basket” icon the project is deleted from your system. IMPORTANT: if you delete a project, all its associated files (the input and the results) are deleted so the project can NOT be recovered.

6 Project’s Main Page: Loading files

Once the project has been created, you can start uploading your input files. First, open the project by clicking the “play” icon in the Homepage and access the Project’s Main Page of the application (Figure 6.1).

Project's Main page

Figure 6.1: Project’s Main page

Now, click the Load Files button and select in your computer one or several files of the allowed file types. SeqNjoy supports the following file types and extensions:

File type File Formats Allowed File Extensions
Genome sequences FASTA fa, fa.gz, fasta, fasta.gz, fna, fna.gz
Genes features GFF3, GTF gff, gff.gz, gtf, gtf.gz, gff3, gff3.gz
Short reads sequences FASTQ fastq, fastq.gz, fq, fq.gz
Alignments BAM, SAM bam, sam
Gene annotations ANN (text table) ann, ann.txt, ann.tsv, ann.tab, ann.xls, ann.xlsx

If you are not familiarized with any of these file types, go to the section File Types to know more about the structure of the format and the function and details of each file type.

Figure 6.2 shows the application loading three files that contain a genome reference (FASTA) file, a gene features (GFF) file and an annotations (ANN) file:

Uploading files in **SeqNjoy**

Figure 6.2: Uploading files in SeqNjoy

The pop-up window reports the status of file uploading and post-processing by means of a progress bar and informative messages describing the step being performed. When all is OK, you should see a window with an “all in green” aspect like this (Figure 6.3):

Uploading process successful

Figure 6.3: Uploading process successful

Take into account that in the case of some large and/or complex files the upload may take a considerable amount of time.

7 Project’s Main Page: File List

After you have uploaded your input files, your Project’s Main Page shows the list of the loaded files. The list offers different information and a number of options to operate on them. From the Project’s Main Page you can also perform the successive steps of your RNA-Seq gene expression data analysis project.

This is the aspect of the Project’s Main Page after a standard RNA-Seq data analysis project has been performed (Figure 7.1):

Project's Main Page

Figure 7.1: Project’s Main Page

The Project’s Main Page is structured as follows.

In the top area the following functionalities are available:

The rest of the Project’s Main Page is dedicated to the File List. The File List area contains information about all the files loaded by the user or generated by the processing with SeqNjoy, and provides a number of operations that the user can run on one or more selected files. The File List comprises the following buttons and columns, described from the left to the right:

8 Align Reads

Click the “Align Reads” button of Main Page to access this window. For performing this task, you inform a genome sequence (FASTA) file and the corresponding short reads files (FASTQ files) and obtain files with the alignments (BAM formatted) of the short reads contained in the FASTQ files to the reference genome (FASTA file).

There are two different aligners available to perform this task, and you should choose one of them in the submenu that is displayed after clicking the Align Reads button (Figure 8.1): bowtie2 (Langmead & Salzberg, 2012) or hisat2 (Kim et al., 2015). Both of them are well-documented and widely used programs.

Align Reads: selecting alignment method (aligner)

Figure 8.1: Align Reads: selecting alignment method (aligner)


After choosing any of them you access a window where you have to fill the input FASTQ files and alignment parameters that are necessary to complete the task.

Window for the reads alignment. Input files are the genome reference (FASTA file) and the short reads files (FASTQ files). The shaded areas contains *bowtie*-specific or *hisat2*-specific parameters and therefore are differents according to the selected aligner.

Figure 8.2: Window for the reads alignment. Input files are the genome reference (FASTA file) and the short reads files (FASTQ files). The shaded areas contains bowtie-specific or hisat2-specific parameters and therefore are differents according to the selected aligner.


The Align Reads to Genomes window has the following parts (see Figure 8.2):

Once you have filled the form in with the required files and set your parameter values, the “Launch!” button turns from disabled (grey) to enabled (red) so you can click it and launch the alignment(s). A pop-up window appears informing you of the progress of the alignment task. When the process ends OK a message “done” is displayed in the pop-up window and a “Close” button is enabled. By clicking “Close” you return to the “Align Reads to Genome” pane. Now, you could make a new Align Reads to Genome round or click the “X” icon for closing the “Align Reads to Genome” window. You will then return to the “Project’s Main Page”, which will have been updated to contain the new alignment (.bam) files in its File List.

*Align Reads to Genome* dialog ready to launch the alignments processes.

Figure 8.6: Align Reads to Genome dialog ready to launch the alignments processes.


9 Quantify Genes


Click the “Quantify Genes” button in the “Project’s Main Page” to access this window (Figure 9.1).

Quantification window

Figure 9.1: Quantification window


In summary, you inform here a genome features file (GFF) file and a number of alignment files in BAM. Normally, these alignments have been generated previously in the “Align Reads” step. Alternatively, the BAM files were generated elsewhere and the user uploaded to SeqNjoy.

The output of the quantification is one counts file per BAM alignment file. Each “counts” file contains a list with the “counts” , i.e. the number of “reads” or “fragments” that have been mapped to each gene (or other feature type that you select) of the genome.

The following parameters are used by the tool that performs the quantification task, the function featureCounts of the R package Rsubread (Liao et al., 2019):

Once you have filled the form with the required files and set your parameter values, the “Launch!” button turns from disabled (grey) to enabled (red) so you can click it and launch the quantification. In Figure 9.4, the quantification process will be launched to count genes for nine samples, with reverse strandness mode. Only primary alignments will be counted. The output of the quantification is a number of counts files equal to the number of input BAM files, that is, nine in this case.

Quantification dialog ready to be launched

Figure 9.4: Quantification dialog ready to be launched

A pop-up window appears informing you of the progress of the quantification task. When the process ends OK a message “done” is displayed in the pop-up window and a “Close” button is enabled. By clicking “Close” you return to the Quantify Genomic Features pane. Now, you could make a new Quantify Genomic Features round or click the “X” icon for closing the window. You will then return to the “Project’s Main Page”, that it will have been updated to contain the new “counts” files.

10 Compare Samples (Differential Expression)

In this step you can perform an statistical analysis of the differences in the expression for a given set of features between two groups of samples. You can use it repeatedly to make more that one binary comparison (e.g. “SampleGroupDrugA versus SampleGroupDrugB”, and then “SampleGroupDrugA versus SampleGroupPlacebo” and finally “SampleGroupDrugB versus SampleGroupPlacebo”).

SeqNjoy offers you two possible methods to perform this task, both of them widely used and well documented: DESeq2 (Love et al., 2014) and edgeR (Robinson et al., 2009). If you click the Compare Samples button a submenu listing both options is opened and you can select any of them (Figure 10.1) .

Compare Samples: selecting the method for calculating differential expression

Figure 10.1: Compare Samples: selecting the method for calculating differential expression


After selecting the desired option, the Calculate Differential Expression dialog is displayed (Figure 10.2). Most of this window is the same regardless of the selected software. Only some specific parameters located in the shaded area of the window are different and specific of the selected method (DESeq2 or edgeR) to perform the differential expression calculations.

Calculate Differential Expression (Compare Samples) dialog as displayed initially.

Figure 10.2: Calculate Differential Expression (Compare Samples) dialog as displayed initially.

The following fields are present in the Calculate Differential Expression window irrespective of the selected method:


Once you have filled the form with the required files and set your parameters values, the Launch! button turns from disabled (grey) to enabled (red) so you can press it and launch the differential expression calculation. In the figure 10.5 differential expression calculation between the Case ("pH5_CO2") group of samples and the Control ("pH5") group of samples is displayed ready to be launched (Launch! button enabled). In the example, an annotations file has been provided so the information contained in this file will be incorporated to the diffexp file with the differential expression calculations.

*Calculate Differential Expression* dialog ready to be launched.

Figure 10.5: Calculate Differential Expression dialog ready to be launched.


After launching the process, a pop-up window informs you the progress of the “differential expression” task. When the process ends OK the message “done” is displayed in the pop-up window and a close button is enabled. By clicking the Close button you return to the Calculate Differential Expression pane. Now, you can fill in again the form with a new comparison between groups of samples to calculate the differential expression between them or click the “X” icon for closing the window. You will then return to the “Project’s Main Page”, that it will have been updated to contain the newly created diffexp files in Excel format (extension .xlsx) containing the results of the calculated differential expressions of the comparisons performed.

The “diffexp” output file (syntax: [CaseLabel]vs[ControlLabel].[OutputSuffix_if_typed].xlsx) contains the list of differentially expressed genes (DEG) detected between the “Case” and the “Control” groups of samples. Check in the File Types section the information content of the diffexp files. You can graphically inspect and filter the differentially expressed genes results using the FIESTA 2.0 analytic tool.

11 Utilities

In addition to the pipeline to perform the RNA-Seq gene expression data analysis, SeqNjoy provides three utilities for inspecting the content of some of the types of files involved in the analysis. These utilities are launched from buttons located in the File List.

11.1 GFF feature structure hierarchical tree

To access this utility from a GFF (GFF3 or GTF formatted) of the File List click the icon located rightmost in the Details column.

When launched, SeqNjoy will run the plot_features function of the Rgff R package (Garcia-Martin et al., 2022) to generate a picture containing a tree representing the hierarchical structure (dependencies) of the feature types included in the selected GFF file (Figure 11.1). This tree representation is useful for a global look of what is annotated in the genome and it can help you decide which feature(s) and blocks to use in the Quantify Genes step of the analysis.

A hierarchical tree of a genome feature file (GFF) produced by *Rgff*

Figure 11.1: A hierarchical tree of a genome feature file (GFF) produced by Rgff

As it is shown in the example tree above, the highest feature type in this GFF is gene and there are 7127 items annotated as genes. gene has a child node, the transcript feature. Included into transcript there are five child features, that are siblings from each other: CDS, exon, five_prime_UTR, start_codon and stop_codon.

With this information, we know that we could count genes, exons-by-gene, transcript, exons-by-transcript, exons, etc. in our analysis but not “exon-by-CDS” for example, as exon is not a descendant of the CDS feature.

11.2 Visual exploration of genomic files with IGV

You can access this utility from a FASTA file containing the genomic sequence or from an alignment file (BAM), by clicking the icon located on the right of the Details column of the corresponding file. The option is not available for user-provided BAMs.

After clicking the button, the widely used “Integrative Genomic Viewer” (IGV) (Robinson et al., 2011, 2022) tool will be launched and will load the corresponding FASTA or BAM file. If you have loaded a FASTA, a screen similar to the one in Figure 11.2 will be displayed.

IGV launched with a FASTA file as input

Figure 11.2: IGV launched with a FASTA file as input

If you use a BAM file with IGV a similar window is displayed. In addition to the BAM file, the associated genomic sequence (FASTA file) is automatically incorporated to IGV in the same session. See in Figure 11.3 the initial aspect displayed in this case and a minimal “use case” for IGV to visually compare the alignment files of two samples (i.e. two BAM files) for a particular gene.

A simple usage example of IGV to compare two BAM files in a certain genomic region.

Figure 11.3: A simple usage example of IGV to compare two BAM files in a certain genomic region.

IGV has a plethora of functionalities and possibilities to interact with your alignment files. Please refer the IGV user general guide, to take full advantage of this versatile tool. In addition, you can inspect your files with the web/on-line version that IGV provides at https://igv.org/app/, documented at https://igvteam.github.io/igv-webapp/.

11.3 Graphical analysis of differential expression results with FIESTA 2.0

FIESTA 2.0 (Oliveros, 2022) is an interactive viewer for differential expression results files that is integrated in SeqNjoy. It provides multiple ways of representing your data as tunable interactive plots that can be downloaded as high resolution images. In addition, FIESTA 2.0 allows you sorting and filtering genes by one or several columns and saving the relevant ones into files.

Figure 11.4 illustrates a usage example for FIESTA 2.0. When a diffexp file is loaded into FIESTA 2.0, a default representation of the expression data as a MA-plot is displayed. From this point, you can combine filters (top) to select gene subsets. A very common (and useful) filter combination is to select and highlight with user-selected colours genes (center) that show to be significantly (P-adjusted < 0.05) under- (negative Log2Ratio) or overexpressed (positive Log2Ratio). Finally, user can select any pair of numerical columns present in the diffexp file to be represented (bottom), so other popular representations like Volcano plots are also available. User-generated plots and subsets of relevant genes are both downloadable (in the case of the plots, with high resolution).

*FIESTA 2.0* usage example.

Figure 11.4: FIESTA 2.0 usage example.

To access FIESTA 2.0 viewer from a file containing the differential expression results (“diffexp” files in .xlsx format) click the icon located rightmost in the Details column.

More details about FIESTA 2.0 here.

12 File Types in SeqNjoy

SeqNjoy uses up to seven types of files in the analyses. The files that you provide have these types: FASTA, FASTQ, GFF3/GTF and ANN. Other types are only held by files generated through the SeqNjoy analysis: COUNT, DIFFEXP. Finally, the format BAM/SAM is that of the alignment files, which can be either internally generated by SeqNjoy or provided (loaded) by the user.

To follow, we describe the aspect, source, uses, format requirements and content of each file type. You can find a good complement to know more about them in this link by Andrew Severin and also in this page.

12.1 FASTA file

FASTA file format

Figure 12.1: FASTA file format

12.2 FASTQ files

12.3 Gene Feature File (GFF): GFF3 and GTF formats

More formally, each line of a GFF3 (or GTF) file has 9 tab-separated columns:

GFF3 is currently the preferred format to represent Gene Features, but it is not uncommon to find gene feature files in GTF format. GFF3 provides more flexibility to represent the hierarchy of the features in a genome but GTF has some advantages in certain cases. SeqNjoy can process files in GFF3 or GTF formats. In both cases, columns 1 to 8 are the same. The 9th column has a different syntax in GFF3 and GTF (Figure 12.5 )

Comparison of GTF and GFF3 files for the representation of the features of the same gene

Figure 12.5: Comparison of GTF and GFF3 files for the representation of the features of the same gene

Important: For SeqNjoy, GFF files must contain at least one “gene” in the feature type column.

For more detailed information about GFF formats refer to this, this or this link.

12.4 Alignment files (BAM / SAM files)

12.5 Counts file

Counts file example

Figure 12.6: Counts file example

12.6 Annotations file

12.7 Differential Expression file

*diffexp* file format.

Figure 12.8: diffexp file format.

This is the meaning of the columns present in the diffexp files (Figure 12.8):

References

Andrews, S., Krueger, F., Segonds-Pichon, A., Biggins, L., Krueger, C., & Wingett, S. (2012). FastQC. Babraham Institute.

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x

Garcia-Martin, J. A., Oliveros, J. C., & Torres-Perez, R. (2022). Rgff: R utilities for gff and gtf files. National Center for Biotechnology, Spanish National Research Council (CNB-CSIC). https://cran.r-project.org/web/packages/Rgff

Kim, D., Langmead, B., & Salzberg, S. L. (2015). HISAT: A fast spliced aligner with low memory requirements. Nature Methods, 12(4), 357–360. https://doi.org/10.1038/nmeth.3317

Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with bowtie 2. Nature Methods, 9(4), 357–359. https://doi.org/10.1038/nmeth.1923

Liao, Y., Smyth, G. K., & Shi, W. (2019). The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Research, 47(8), e47. https://doi.org/10.1093/nar/gkz114

Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12). https://doi.org/10.1186/s13059-014-0550-8

Oliveros, J. C. (2022). FIESTA 2.0: Interactive scatter plots in your browser. National Center for Biotechnology, Spanish National Research Council (CNB-CSIC). https://bioinfogp.cnb.csic.es/tools/fiesta2

Pomaznoy, M., Sethi, A., Greenbaum, J., & Peters, B. (2019). Identifying inaccuracies in gene expression estimates from unstranded RNA-seq data. Scientific Reports, 9(1). https://doi.org/10.1038/s41598-019-52584-w

Robinson, J. T., Thorvaldsdóttir, H., Turner, D., & Mesirov, J. P. (2022). Igv.js: An embeddable javascript implementation of the integrative genomics viewer (igv). bioRxiv. https://doi.org/10.1101/2020.05.03.075499

Robinson, J. T., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E. S., Getz, G., & Mesirov, J. P. (2011). Integrative genomics viewer. Nature Biotechnology, 29(1), 24–26. https://doi.org/10.1038/nbt.1754

Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2009). edgeR: A bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139–140. https://doi.org/10.1093/bioinformatics/btp616

Soneson, C. (2021). Rhisat2: R wrapper for hisat2 aligner. https://github.com/fmicompbio/Rhisat2

Wei, Z., Zhang, W., Fang, H., Li, Y., & Wang, X. (2018). EsATAC: An easy-to-use systematic pipeline for atac-seq data analysis. Bioinformatics. https://doi.org/10.1093/bioinformatics/bty141

Appendix 1: Single-end and paired-ends sequencing

Introduction

After the “fragmentation” step, in RNA-Seq protocols fragments ~300 to 500 bp long are produced. Even this size range is too long to be sequenced entirely by the type of sequencers that generate short reads. What is done then is to read only the end part of the fragment, which produce short reads of 100-200 bases (this interval may vary according to the experimental design, the machine used, etc.) The rest of the fragment is not sequenced.

Now, there are two modalities in RNA-Seq, single-end and paired-ends (Figure 12.9):

**Single-end** and **Paired-ends** reads: a global view

Figure 12.9: Single-end and Paired-ends reads: a global view

Single-end RNA-Seq (SE, aka single-read RNA-Seq): Is the simplest mode. Only one end of the fragment is sequenced. All the short reads sequences generated for a sample with single-end (SE) technique are collected in a single FASTQ file.

Paired-end RNA-Seq mode (PE): Two reads per fragment are produced. In this mode, the reading starts at one end, finishes this direction at a pre-specified read length (e.g. 150 bases) and a “read 1” is produced as a result. Then, it starts another round of reading from the opposite end of the fragment, finishes this new (opposite) direction at the same read length and finally a second read, “read 2”, is produced. At the end, both ends of the fragment are sequenced leaving the inner portion of the fragment unsequenced.

In PE sequencing, read 1 reads (those generated first-in-the-pair) and read 2 reads (generated secondly) are stored in two separated FASTQ files. One file contains all the read 1 reads the other file contains all the read 2 reads. Each of these files has a label in its name that identifies whether it contains read 1 or read 2 reads, something like Sample1_blah_blah_1.fastq and Sample1_blah_blah_2.fastq (or something similar like Sample1_blah_blah_R1.fastq and Sample1_blah_blah_R2.fastq, etc.) Note that usually the only difference in the filenames between two FASTQ paired-ends files of the same sample is just this "_1" or "_2"-like particles.

In addition, "_1" and "_2" FASTQ files of a particular sample have both the same number of reads. Moreover, they kept the same order of reads (see Figure 12.9): at a given position X in both files, the reads in “_1” and “_2” correspond to the same fragment. In fact, as with the files, the read identifier that is given to the “read 1” and “read 2” reads of the same fragment is the same… except again for the particle indicating whether it is for reads 1 or reads 2.

Reads orientation

To know how is the orientation of the reads in SE and PE is useful to interpret correctly the alignments of the reads and to understand important topics like the unstranded or stranded protocols in RNA-Seq.

The sequencing always synthesizes the reads in a 5’ -> 3 direction. (It is the way the DNA polymerase proceeds) so the template strand is traversed 3’ -> 5’.

In single-end, the reads extend in the 5’ -> 3’ direction along the forward (plus) DNA strand (Figure 12.10).

Single-end read orientation

Figure 12.10: Single-end read orientation

In paired-ends, the read 1 read extends in the 5’ –> 3’ direction towards read 2 along the forward (plus) DNA strand. As for the read 2 read, it extends in the 5’ –> 3’ direction towards read 1 along the reverse (minus) DNA strand. Read 1 is the reverse complement of the first 3’-most bases of the forward strand (therefore identical to the first 5’-most bases of the reverse strand) and read 2 is the reverse complement of the first 3’-most bases of the reverse strand (therefore identical to the first 5’-most bases of the forward strand). See the read 1 and read 2 orientations represented in Figure 12.11:

Paired-ends *FR* reads orientation

Figure 12.11: Paired-ends FR reads orientation

Therefore the sequences of the two reads of a pair are pointing towards each other on opposite strands. When you align them to a genome, one read should align to the forward strand, and the other should align to the reverse strand, at a higher base pair position than the first one so that they are pointed towards one another. This is known as an “FR” read pair or “forward/reverse”, as this is the order in which they are placed. A read pair having this orientation is called a “proper pair”. There are other possible orientations, but we don’t need to go into them here.

When you use IGV genomic browser you can see this orientation in the case of a read pair of a PE sequencing. Take into account that in IGV only the forward (plus) strand of the genome is displayed. With the “Color by” -> “read strand” option the read in the forward strand is colored red and the read of the reverse strand is colored blue. Observe in the figure above a read pair meeting the FR orientation. They are even more visible if you select “View in Pairs” option (Figure 12.12).

Paired-ends *FR* reads orientation as seen in IGV.

Figure 12.12: Paired-ends FR reads orientation as seen in IGV.

Single-end or paired-ends, which is better?

Both single-end (SE) and paired-ends (PE) can be used for measuring gene expression. PE offers the advantage over the SE that the distance covered by the R1 and R2 and the unsequenced sequence in-between is controlled on average (it is named “insert size” and is usually in an approximate range 200-600 bp) so the aligners can take advantage of this information to be more accurate in the correct alignment of the reads to the reference genome. The point is that a read pair properly (FR) oriented, covers a longer span in the genome (read_1 + inner_distance + read_2, Figure 12.13) and therefore a stronger evidence of being well aligned than a single read.

Insert size in Paired-ends reads

Figure 12.13: Insert size in Paired-ends reads

The main disadvantage of the PE protocol is that it is normally more expensive and time-consuming to generate than single-end reads. On the other hand, just for counting reads or calculate differential expression may be unnecessary to have such a high level of accuracy, especially if your species is well-annotated. But take into account that in the future you might want to reuse the reads intended to study differential expression to, for example, extend the study to look for alternative splicing. In that case, you will need to have done PE sequencing as SE is insufficient for alternative splicing analysis.

Appendix 2: Strandness and RNA-Seq

By way of reminder: Strandness in transcription

For the transcription of a gene, a strand of the DNA is used as template of the resulting transcript. As the sequence of this “template strand” is the reverse complement of the sequence of the transcript, we say that this strand is the “negative-sense strand”, “antisense strand”, “non-coding strand” or “minus strand”. Conversely, as the “non-template strand” has a sequence equivalent (except T instead of U nucleotide) to the transcript sequence it is named “positive-sense strand”, “sense strand”, “coding strand” or “plus strand” (Figure 12.14).

Sense and antisense strands in the transcription

Figure 12.14: Sense and antisense strands in the transcription

Important to remember, either of the two polynucleotide strands may contain genes, and hence the determination of sense and antisense is gene specific (Figure 12.15):

Sense and antisense strandness depends on the gene

Figure 12.15: Sense and antisense strandness depends on the gene

RNA-Seq protocols: unstranded vs. stranded

The RNA-Seq protocol produces single-stranded mRNA molecules that are converted to double-stranded complementary DNAs (cDNAs). Then the two strands of the cDNA are amplified to yield millions of short reads. Consequently, for a transcript, you would have short reads of two types: some reads have a sequence that is “the same” (except that “T”s instead of “U”s) as the original RNA transcript sequence (“sense” reads) and others with a reverse complemented sequence respect to the original RNA transcript (“antisense” reads).

At the beginning, most sample preparation protocols were as described in the previous paragraph. This protocol is named non-strand-specific (aka unstranded). Sequences of both sense strand, the strand of the cDNA codifying the transcript and the antisense strand of the original mRNA, are obtained in the sequencing process, without knowing which strand of the cDNA corresponded to the original mRNA. Consequently, it is not clear if a read originated from the sense strand or the antisense strand of the reverse transcripted mRNA. For the FASTQ files where the short reads are stored, that means that the "_1.fastq" file and "_2.fastq" files contains mixed both types of reads.

Unstranded and stranded protocols

Figure 12.16: Unstranded and stranded protocols

When assigning counts to genes you wish to assign all reads coming from molecules of the sense strand to a sense strand gene in the reference and all reads coming from molecules of the antisense strand to an antisense gene in the reference. This is not possible with a non-stranded protocol and some antisense strand reads will be assigned to a sense gene and some sense strand reads will be assigned to an antisense gene. This may lead to some incorrect assignments that can produce biased estimates when counting reads that maps on genes.

Now, observe the protocol on the right in the Figure 12.16. Observe that in this protocol the second-strand fragment (the sense strand) is “stopped” to amplify. With this stranded protocol only the first strand (antisense stranded in this case) survives the subsequent PCR amplification step and hence the strand information of the libraries. As a result, the "_1.fastq" file will store only antisense (reverse) reads, as they are the first of the pair to be sequenced. In a second step, the "_2.fastq" file will store the second of the pair to be sequenced, which is the sense (direct) reads. The described protocol is named reverse stranded, since it is the reverse reads (produced in first place) that are amplified and stored in "_1.fastq" file.

There is an alternative stranded protocol that amplifies only the sense (direct) reads and avoid amplification of antisense (reverse) reads. In this case, the direct reads are the first of the pair to be sequenced and are stored in the "_1.fastq" file. Antisense (reverse) reads are produced in second place and are stored in the "_2.fastq" file. This protocol is direct stranded, since it is the direct reads (produced in first place) that are amplified and stored in "_1.fastq" file.

Figure 12.17 illustrates how in unstranded protocol, you don’t know the sense sequence is stored in ‘reads_1.fastq’ file or in ‘reads_2.fastq’ file, whereas in direct stranded the sense read of the pair is stored in ‘reads_1.fastq’ and in reverse stranded the sense read of the pair is stored in ‘reads_2.fastq’.

Unstranded and stranded protocols and FASTQ files.

Figure 12.17: Unstranded and stranded protocols and FASTQ files.

With the stranded protocols, the short reads carry always “sense sequences” so there are no biased counting due to antisense reads. There are several techniques to produce a “stranded RNA-Seq”. One of the most used is to produce a chemical modification with dUTP specifically in the non-sense strand of the cDNA that blocks its undesirable amplification (Figure 12.16, right). This protocol is named reverse stranded RNA-Seq because the sequence that is amplified is the “reverse” one.

Other types of stranded protocols depletes not the antisense but the sense sequences (reverse complemented of the transcript). This protocol is named direct stranded RNA-Seq.

Getting down to practicalities

When your provider delivers the short reads files (FASTQ files) of your RNA-Seq experiment to you, it is recommended to check the provider’s report of the sequencing to know which library prep method was used: unstranded (non-stranded), direct-stranded or reverse-stranded. If it is not mentioned, request this information from the provider as it is very important in order to obtain a fair transcription (counts) estimation. After knowing the protocol type, select the proper option in the Strand specificity selector of the Quantify Genomic Features window.

Maybe that you don’t be able to find the used protocol by the above designations (for example, you can find that the protocol was a “directional RNA-Seq” which is the same as “stranded RNA-Seq”) or you find only the commercial kit name used for the protocol. In that case, the following conversion table may be useful:

Protocol Synonym Examples/Kits1 Description
Unstranded non-stranded Standard Illumina, TruSeq RNA Sample Prep kit, NuGEN OvationV2, SMARTer universal low input (TaKara), GDC normalized TCGA data Information regarding the strand is not conserved (lost)
Reverse first_strand dUTP, NSR, NNSR; TruSeq Stranded (Total RNA/mRNA), NEB Ultra Directional, Agilent SureSelect Strand-Specific The second read (read 2) is from the original RNA strand/template, first read (read 1) is from the opposite (reverse) strand.
Direct second_strand Directional Illumina (Ligation), Standard SOLiD, ScriptSeq v2, SMARTer Stranded Total RNA, NuGEN Encore Complete, NuGEN SoLo, Illumina ScriptSeq The first read (read 1) is from the original (direct sense) RNA strand/template, second read (read 2) is from the opposite strand.

1 The information of this column was compiled from this, this, and this link.

In any case, we recommend to check the strandness by yourself by following a strategy like the described in the following section.

How to determine the strandness with IGV

It may be that even after searching everywhere you cannot find the information or you are not sure about the strandness of your RNA-Seq files. Maybe you have obtained your FASTQ or BAM files not directly from a provider but from a repository lacking information or maybe they are old ill-documented data, etc.

In that case, there are still some strategies to guess the strandness of your BAM files. Here we tell you a strategy using IGV (Robinson et al., 2011, 2022) browser.

You will need to load in IGV the FASTA and GFF3/GTF files, and the BAM file(s) with the unknown strandness.

As said in Reads orientation section, when aligning to the genome, a read aligns to the forward/plus DNA strand and the other aligns at a higher base pair position at the reverse (or minus) DNA strand so that they are pointed towards one another (“FR” read conformation or “forward/reverse”, in that order, Figure 12.12). We also saw that using the “color by read strand” option, IGV colors forward reads in red and reverse reads in blue.

As we have also seen the “reads 1.fastq” contains the first-read-of-the-pair and “reads 2.fastq” contains the second-read-of-the-pair. Let’s focus on “reads 1.fastq”:

Related to the above, there is another interesting coloring that IGV provides.

You can access it clicking the track’s gear menu of IGV and selecting Colour by -> First-of-Pair Strand in your BAM track. First-of-pair colours both reads of a read pair with the color of the direction of mapping (forward of reverse) of the “Read 1”: if “Read 1” is forward, the pair “Read 1”+“Read 2” is colored red and, conversely, if the “Read 1” is reverse, both members are coloured blue.

Finally, we need a final feature of IGV to fully determine the strandness type protocol. If we left-click a member of a paired-end read, a pop-up window give us diverse information about the alignment. One piece of information is the “Pair Orientation”.

There are two possible values for Pair orientation in a “proper paired” read pair: F2R1 or F1R2. The letters “F” and “R” refer to the above-mentioned “FR” normal orientation. The numbers “1” and “2” tell us the file from which each read of the pair comes (‘reads 1.fastq’ or ‘reads 2.fastq’). Thus, “F2R1” means “the first read is a forward one and comes from ‘reads 2.fq’ file and the second read is a reverse one and comes from the ‘reads 1_fq’ file”. “F1R2” means “the first read is a forward one and comes from the ‘reads 1.fq’ file and the second read is a reverse one and comes from the ‘reads 2.fq’ file”.


We have now all the tools to determine the strandness protocol in IGV:

You can check if you have got the concept by trying to find out which protocol modes have been used in “BAM-1” and in “BAM-2” in Figure 12.20, knowing that the reads have been colored “first-of-pair”. You can find the answer below.

What are the strandness protocols in BAM-1 and in BAM-2?

Figure 12.20: What are the strandness protocols in BAM-1 and in BAM-2?

Final remark: in case you still cannot determine the strandness of your files, the most harmless thing to do is to select “unstranded”. With this option, at least you do not miss the correct stranded reads. Some studies have tried to estimate the impact of proceeding with “unstranded” counting in a “stranded” protocol. According to (Pomaznoy et al., 2019): “10% of all genes and 2.5% of protein coding genes have a two-fold or higher difference in estimated expression when strand information of the reads was ignored.

The answer for the question about the strandness protocols in Figure 12.20 is that BAM-2 is unstranded (mixed colors = mixed orientations of “read 1” reads) and BAM-1 was sequenced with a Direct Stranded protocol (the reads in the “DNA-plus-located” gene are colored red = F1R2 and the reads in the “DNA-minus-located” gene are colored blue = F2R1).

Appendix 3: FAQ

How to cite SeqNjoy?
Juan Antonio Garcia-Martin, Rafael Torres-Perez and Juan Carlos Oliveros (2022). SeqNjoy: Complete RNA-Seq projects in your Desktop. (manuscript in preparation). https://bioinfogp.cnb.csic.es/tools/seqnjoy
(Please consult this tutorial for reference updates).
My computer has an internal disk low on space. Can I install and use SeqNjoy?
RNA-Seq files require considerable amounts of hard disk space. If the computer’s internal disk is full, we recommend installing SeqNjoy on an dedicated external hard disk with enough space. SeqNjoy program and all the files loaded to or created by the application will reside in that hard disk without occupying space in the internal disk of your computer.
Can I recover my files from SeqNjoy after deleting a project?
No, the files are deleted without the possibility of being recovered.
Why cannot I type some characters when naming a project, a sample label or an output suffix?
Some non-alphabetic characters cannot be part of an output file name. That is the reason we limit the use of special characters in labels or suffixes.
Which annotation format should I use, GTF or GFF3?
Both formats contain similar information about genes. GFF3 allows any number of parent-child levels and types of genomic elements whilst GTF is more strict. Many genomic files providers tend to favour GFF3 over GTF, however many bioinformatic tools still use GTF as input as it is best defined. For example, the tool that detects the number and size of introns in SeqNjoy works well with any properly formatted GTF files but not so well with most GFF3.
Can I use SeqNjoy just to align short-reads?
Yes, you can align with SeqNjoy and export the alignment files (in standard BAM format) to be visualized or quantified with another tool of your preference.
Can I compare samples without using biological replicates?
No. At least one of the sample groups to be compared must contain two or more biological replicates.
Why do some boxes remain red and the “Launch!” button remains disabled after completing a dialog?
That occurs because there are already result files with matching names to those that will be generated in the dialog. The solution is to change one or several sample labels or the common output suffix.
What are the .R files downloaded together with my project’s files?
All output files in SeqNjoy are generated from other files by applying R instructions on them. The “.R” files contain these instructions to help advanced users to create their own pipelines using R language outside SeqNjoy. Please note that these R codes are generated automatically by SeqNjoy and are not optimized for analyzing large number of samples (loops, lists, variables etc. are not used).
Can I analyze my results with FIESTA Viewer outside SeqNjoy application?
Yes. When a differential expression file is saved (in .xlsx format) the corresponding “.fiesta2.html” is also saved. This .html file is a standard web page that can be opened in another computer to filter and exporting biologically relevant results. You can even publish your results on-line by copying any .fiesta2.html file on a public URL.
When I perform the first alignment with bowtie2 for a given genome, I got the following error. “Error in bowtie2_samtools…”.
If using Windows, Python3 command-line tool is not available, so bowtie2 fails to index the genome. The solution is to install Python3 command line tool from the Microsoft Store.
When I launch SeqNjoy on Windows I am prompted with the message “SeqNjoy finished with errors. Please check that another instance of SeqNjoy is not already running.”
If you are not already running SeqNjoy, open the Task Manager and look for running instances of R (named “R for Windows terminal front-end”) and terminate them.
When I try to install SeqNjoy on Windows, I see a message entitled “Windows protected your PC” indicating that this application can put my PC on risk.
Currently SeqNjoy executable is not signed, so Windows SmartScreen Defender preemptively blocks the execution of applications without a trusted and known origin. You can continue the installation by clicking on “more info” and then “Run anyway”.
By using SeqNjoy, are my result files exposed, in some way, on the internet?
No. Input and results files generated with SeqNjoy remain stored in a specific folder on your local hard disk. SeqNjoy can even be used with no internet connection.

SeqNjoy: Complete RNA-Seq workflow in your Desktop. Developed by BioinfoGP CNB-CSIC (2022)