SeqNjoy is a stand-alone application to generate and analyze differential gene expression results from RNA-Seq experiments. Analyses can be performed from the very initial step using the files containing short reads (also known as FASTQ files), or from an intermediate stage using alignments files (BAM/SAM files). The result of the SeqNjoy analyses are files containing annotated list of the genes (or other genetic features) differentially expressed between groups of samples in different conditions, and interactive HTML pages that allow the user to graphically analyze, filter and download selected genesets of interest.
This tutorial is intended to guide you through the process of analyzing your RNA-Seq data with SeqNjoy. For each step, we describe the input type of files required, the available parameters and the outputs the step yields. In addition, to make you more aware of what you are doing, we explain how are these files constituted, the meaning and consequences of each parameter and some key concepts you should be clear about when analyzing RNA-Seq data. The goal is that you become able to perform a complete RNA-Seq gene expression analysis on your own.
SeqNjoy has a series of minimal requirements. Before trying to install it, check if your computer meets these requirements.
Minimum | Recommended | |
---|---|---|
Architecture | 64-bit | |
Operative System | Windows, macOS, Linux (Debian, Fedora or SUSE-based) | |
Web Browser | Chrome, Firefox, Safari, Edge | |
RAM Memory | 8 GB | 16 GB |
Disk Space | 500 GB | 1 TB |
NOTE: minimum requirements are adequate for carrying out analyses involving bacterial genomes, while recommended requirements are adequate also for eukaryotic genomes.
SeqNjoy is optimized to be used with 1920 x 1080 screen resolution (or higher).
It is recommended to install SeqNjoy on a dedicated hard drive, not shared with other documents or applications. Installing SeqNjoy on an external, USB-connected, 4-5 TB hard drive is a convenient option that helps to keep all files from multiple projects organized.
Some of the steps performed by SeqNjoy may take many hours to complete. It is not recommended to run the application along other CPU or RAM intensive processes.
Please disable Sleep Mode or Hibernation in your computer to avoid interruptions of long lasting processes
Connection to the internet is NOT required (it is a stand-alone application), as long as all the libraries and packages needed for the analysis reside on the file system (your computer or a external hard disk, for example) where the installation is done.
Specific SeqNjoy installers are available for Windows, macOS or Linux operative systems. Go to our Download site and choose the one that corresponds to your operative system. Installers are named SeqNjoy_X.X.X-OS, where X.X.X stands for the version number and OS indicates the operative system.
Download on your PC the SeqNjoy setup executable (SeqNjoy_X.X.X-Windows-setup.exe file) from our Download site, run it and follow the instructions.
The default installation location is the folder C:\SeqNjoy. Please take into account the space requirements and choose an appropriate alternative location if needed.
Python 3 command line executable is required, so if it is not present you will be prompted for the installation from the Microsoft Store during the setup process.
Once successfully installed, double-click SeqNjoy desktop icon or search SeqNjoy into the programs menu to start the application.
This installer has been tested on macOS 10.14.06 (Mojave).
Prerequisites: R (version 4.1.2 is highly recommended) must be installed before executing the installation script. You can download the binary installer from CRAN page: https://cran.r-project.org/bin/macosx/base/R-4.1.2.pkg.
Installation steps:
Open SeqNjoy_X.X.X-macOS.dmg image file.
Drag the application’s icon to the destination installation folder. Take into account the space requirements and choose an appropriate location.
Double-click the SeqNjoy icon at the installation folder to start the program.
The first execution will install the required binary and source R packages included in the installer. Once the required packages are installed, the program will start.
This installer has been tested on the following Linux platforms, all of them in the 64-bit architecture version: Debian 10.10, Ubuntu 20.04, Linux Mint 20.2, Fedora 34, Rocky Linux 8.6 and openSuse Leap 15.3.
Installation steps:
Download the SeqNjoy_FOR_LINUX_version-name.tar.gz from the SeqNjoy download page.
tar xzvf SeqNjoy_X.X.X-Linux.tar.gz
cd SeqNjoy_X.X.X-Linux
./runSeqNjoy.sh
During the installation:
After the installation:
./runSeqNjoy.sh
script.Source code is available for download at BioinfoGP’s GitHub repository.
RNA-Seq is a high-throughput technique of sequencing that allows to detect and quantify the presence of RNA in biological samples, which is a proxy for detecting and quantifying gene expression, among other applications.
The “wet lab” part of the RNA-Seq process (Figure 3.1, top) starts with the “library preparation”, a process that comprises some steps: RNA extraction from the samples, filtering to keep only mRNA molecules and deplete other RNA species (ribosomal RNA, for example), reverse transcription from mRNA molecules to more stable cDNA molecules, fragmentation and size-selection of the fragments to get smaller chunks of nucleic acids (more convenient to deal with) and the sequencing of short reads obtained from the fragments.
The sequencer is the machine in charge of “reading” the nucleotide sequences corresponding to the mRNA transcripts, generating millions of “short reads” that are stored in a number of “FASTQ files”, one per sample (or biological replica) of your experiment. These FASTQ files are the input files of the RNA-Seq data analysis process. And you can perform this analysis by yourself using SeqNjoy!
Figure 3.1 displays an schematic view of the standard RNA-Seq pipeline to analyze mRNA expression in biological samples. The upper part describes the experimental part of the protocol. The yellowish area (bottom) contains the three main steps for the analysis of the RNA-Seq data of the sequenced samples.
In brief, the bioinformatic analysis of a RNA-Seq experiment comprises three consecutive steps:
Alignment to reference genome: The short reads of the input FASTQ files are aligned to a FASTA file containing the genome sequence selected to align to. As a result, files with the aligned reads in BAM format are generated (one BAM file per each FASTQ file or pair of FASTQ, depending of the RNA-Seq type of sequencing: single-end or paired-ends sequencing).
Gene quantification: The aligned reads of each BAM file are mapped to a file containing the start-to-end positions of each gene (GFF file). For each gene, all reads mapped to it are counted. As a results, a “counts file” is generated per BAM file. A counts file contains a list of all the genes of the genome and the number of reads (“counts”) mapped to each gene.
Samples comparison / Differential expression calculation: The counts files of the samples are grouped into two sets: one set of counts files is assigned to the “Control” samples group and other set of counts files are assigned to the “Case” samples group. The aim of this step is to statistically compare gene by gene the expression of that gene in each group and determine if there is a significant differential expression between both groups and the magnitude of this difference. The final result of this step is a file containing a table with the results of the differential expression test per gene.
SeqNjoy utilizes user-friendly interfaces that allow you to perform a complete RNA-Seq expression gene data analysis in a short number of steps. If you have never performed this kind of analysis, we recommend reading RNA-Seq in short section as a starting point to get familiarized with the overall process.
Flowchart below (Figure 4.1) summarizes how SeqNjoy addresses the different parts of the analysis:
The Homepage is the entry point of the application (upper-left part of the flowchart). In this initial page is where you create your projects. A project is a container for one single study and comprises all the files used for a single RNA-Seq data analysis. The files of a project are a) the ones loaded as input (short reads files, gene feature files, genome reference file, functional annotations) plus b) all the files generated in the different steps of the analysis (alignments files, counts files and lists of differentially expressed genes as a result of comparing samples).
From the list of created projects registered in the Homepage, you can select any of them and access the Project’s Main Page, which is the central hub of the application. All the functionalities are launched from this page. From the Project’s Main Page you can:
As in a recipe you need to add some initial ingredients to start your dish, you need to load certain input files to start the analysis of your project. You should load one of the following two combinations of files for starting the processing in SeqNjoy:
A number of short reads files (FASTQ format) + 1 Genome Sequence file (FASTA format) + 1 Genome Feature file (GFF3 or GTF format). This combination is the common input for a complete RNA-Seq data analysis.
or,
A number of alignment files (in SAM or BAM format) + 1 Genome Feature file (GFF3 or GTF format).
In the first case, when you load FASTQ plus GFF plus FASTA files, the analysis process (“the recipe”) comprises three successive steps: Align Reads, Quantify Genes and Compare Results, as described in the Figure 4.2:
In the second case, when you load alignment files (BAM/SAM files) the workflow is similar. The only difference is that the initial “Align Reads” step is skipped (as you have already aligned reads in your BAM/SAM files…), see Figure 4.3:
As seen in Figure 4.1, there are some Utilities that you can use to inspect or analyze the content of files of certain types (Figure 4.4).
Figure 4.5 displays the whole picture of the functionalities of SeqNjoy to perform a RNA-Seq data analysis. All of them are launched from the “Project’s Main Page”. The RNA-Seq data analysis proceeds from top to bottom. Observe how the files being the output of some steps are the input of the following:
In the next sections, we describe each step of the Analysis, each Utility, as well as the characteristics and use of the involved files. For FIESTA 2.0 graphical viewer you have available an specific tutorial accessible here.
Once you have launched SeqNjoy (clicking the SeqNjoy icon (Windows) or typing ./runSeqNjoy.sh
in the console (Linux/MacOS)) you land in the Homepage (Figure 5.1). In the upper part, the page welcomes you and gives you a summary and important information relative to the application.
The lower panel contains the list of Available Projects. As no project has been created yet, this list is displayed empty at first.
First of all, let’s define what a project is. A project is a container for one single study and comprises all the files used for a single RNA-Seq data analysis. These files are the ones uploaded as input plus all the files resulting from the analysis: alignments, counts, lists of differentially expressed genes.
Each project is identified by a project name that you provide in the text box located at the bottom of the page, as seen in Figure 5.2.
After clicking the Start New Project button, you are directed to the Project’s Main Page to start working with the project (see the Main Page section).
Once you have one or more projects created, the Available Projects area of the Homepage contains a list with the name and date of creation of your projects (Figure 5.3).
At this point, you can perform two simple operations for a particular project by clicking the icons located at both extremes of the project name. If you click the “play” icon the project is opened and you are directed to the Main Page to start or continue your data analysis. On the other side, by clicking the “basket” icon the project is deleted from your system. IMPORTANT: if you delete a project, all its associated files (the input and the results) are deleted so the project can NOT be recovered.
Once the project has been created, you can start uploading your input files. First, open the project by clicking the “play” icon in the Homepage and access the Project’s Main Page of the application (Figure 6.1).
Now, click the Load Files button and select in your computer one or several files of the allowed file types. SeqNjoy supports the following file types and extensions:
File type | File Formats | Allowed File Extensions |
---|---|---|
Genome sequences | FASTA | fa, fa.gz, fasta, fasta.gz, fna, fna.gz |
Genes features | GFF3, GTF | gff, gff.gz, gtf, gtf.gz, gff3, gff3.gz |
Short reads sequences | FASTQ | fastq, fastq.gz, fq, fq.gz |
Alignments | BAM, SAM | bam, sam |
Gene annotations | ANN (text table) | ann, ann.txt, ann.tsv, ann.tab, ann.xls, ann.xlsx |
If you are not familiarized with any of these file types, go to the section File Types to know more about the structure of the format and the function and details of each file type.
Figure 6.2 shows the application loading three files that contain a genome reference (FASTA) file, a gene features (GFF) file and an annotations (ANN) file:
The pop-up window reports the status of file uploading and post-processing by means of a progress bar and informative messages describing the step being performed. When all is OK, you should see a window with an “all in green” aspect like this (Figure 6.3):
Take into account that in the case of some large and/or complex files the upload may take a considerable amount of time.
After you have uploaded your input files, your Project’s Main Page shows the list of the loaded files. The list offers different information and a number of options to operate on them. From the Project’s Main Page you can also perform the successive steps of your RNA-Seq gene expression data analysis project.
This is the aspect of the Project’s Main Page after a standard RNA-Seq data analysis project has been performed (Figure 7.1):
The Project’s Main Page is structured as follows.
In the top area the following functionalities are available:
The rest of the Project’s Main Page is dedicated to the File List. The File List area contains information about all the files loaded by the user or generated by the processing with SeqNjoy, and provides a number of operations that the user can run on one or more selected files. The File List comprises the following buttons and columns, described from the left to the right:
Select/Unselect: A column of check boxes. When you check one or more boxes, the corresponding file(s) is/are marked as selected. If you click a checked box, the file(s) returns to unselected status. In case you want to select all the files of the list, you can do in a more convenient way by clicking the uppermost check box (it also serves to “unselect all”).
There are two possible operations you can execute with the selected files:
Both operations, Download and Delete, are also available on a file-by-file basis in the two rightmost columns of the File List.
Show buttons: You can filter the File List to hide or show the files having a particular file type(s). For example, if you click a colored “bam” button , all bam files are hidden from the File List and the bam button becomes grayish . Click again the grayish icon to return your bam files to the list. You can click multiple Show buttons to show/hide several file types at one time. Note that only file types with no members selected can be hidden.
The numbers accompanying the file type on the button indicate the total number of existing files of that type and how many of them have been selected via the checkboxes (see Select/Unselect above).
Type: It informs you of the type of the file. Refer the File Types in SeqNjoy section to know more about the file types and their characteristics.
Details: This column contains diverse information about the file and provides access to additional utilities for some file types.
pH5-Rep1.bam
is a product of processing the FASTQs 1M-SRR9336468_1.fq
and 1M-SRR9336468_2.fq
.
File Types | Operation | Tool | Icon |
---|---|---|---|
fasta, bam | Genomic Browser | IGV | |
gff | Genome features structure | Rgff | |
diffexp | Visualize, plot and filter gene expression results | FIESTA 2.0 |
Click the “Align Reads” button of Main Page to access this window. For performing this task, you inform a genome sequence (FASTA) file and the corresponding short reads files (FASTQ files) and obtain files with the alignments (BAM formatted) of the short reads contained in the FASTQ files to the reference genome (FASTA file).
There are two different aligners available to perform this task, and you should choose one of them in the submenu that is displayed after clicking the Align Reads button (Figure 8.1): bowtie2 (Langmead & Salzberg, 2012) or hisat2 (Kim et al., 2015). Both of them are well-documented and widely used programs.
After choosing any of them you access a window where you have to fill the input FASTQ files and alignment parameters that are necessary to complete the task.
The Align Reads to Genomes window has the following parts (see Figure 8.2):
Select Genome sequence: select here one FASTA file among the listed in the drop-down menu. Usually, you select the genome reference of the species/variety the sequenced samples come from (or a close one), but you can select any uploaded genome reference that could be more convenient in your particular study.
Select FASTQ files: Below the Genome Sequence you have a grid of drop-down menus arranged in rows and columns. To fill this part in, you must know that there are two modalities of RNA-Seq named single-end and paired-ends sequencing. Refer to the corresponding section to know about them if you need it.
In “Fastq 1” column you select: a) all your single-end FASTQ files, if your FASTQ file are of the single-end type or b) the files corresponding to the first member of the pair in the case of paired-ends FASTQ files (normally, the filename of these “read 1” files contains the suffix “_1” or similar).
“Fastq 2” column only applies in the case of paired-end sequencing. In that case, you have to put here the second member of the pair (normally, files with the suffix “_2” or similar in their filesnames). This second member of the pair must be selected side-by-side with the first ("_1“) member that was selected in”Fastq 1" column (as in Figure 8.2). Do not select any file in this column if all the FASTQ files in your experiment are of “single-end” type.
In addition, you have to fill in a Sample label for each Fastq 1 (single-end) or Fastq 1 plus Fastq 2 (paired-end) item you fill in. This Sample label will name the resulting alignment (BAM) file.
At the end, the aspect of this area will be similar to one of the displayed in the Figure 8.2.
Output files suffix (optional): The text you type in this box will be added to the name of all the output files (BAM files) of this step, immediately after the sample label and before the .bam extension, that is, “Sample_name.suffix.bam” (Example: “Sample1.aligned_with_bowtie2.bam”, “Sample2.aligned_with_bowtie2.bam”, etc.). If no “output files suffix” is typed, the BAM output files will be named simply with the sample name followed by the “.bam” extension (i.e. [Sample_name].bam).
Alignment parameters: The shaded areas of the window contain the parameters that are used by the aligner of your choice, bowtie2 or hisat2. Some options are the present irrespective of the aligner you choose:
With the Trim reads from 5’ end parameter you can set the number of nucleotides to be removed from the reads of the file at the 5’ end (i.e. “on the left”). With Trim reads from 5’ end you can set the number of nucleotides to be removed from the reads of the file at the 3’ end (i.e. “on the right”).
Max CPU Cores: bowtie2 and hisat2 support parallel processing, that is, it can use multiple CPU cores. Multi-core is enabled by default to use (up to) 4 cores. You can modify this value in the text box with a numeric value greater or equal to 1. If your computer has less CPU cores than the number typed here, the aligner program will use the maximum your computer has.
Remove duplicated alignments: In RNA-Seq protocols (especially if a PCR amplification step is included), there is a certain probability of introducing artificial generation of duplicated reads, that is, reads of the same size aligning exactly in the same chunk of the genome (same start and end positions in the genome, Figure 8.4). The presence of duplicated reads above a certain number may be an indication of the existence of such unwanted technical artifacts during the sequencing process. The application provides this parameter to remove reads that are duplicated a number of times equal of greater to a value (50 by default).
Specific options for bowtie2.
We use the R package Rbowtie2 (Wei et al., 2018) to align with bowtie2 (Langmead & Salzberg, 2012).Performance: Alignment methods search a reasonable trade-off between speed and sensitivity/accuracy in their results in order to reduce false detected alignments without missing too many true alignments. In bowtie2 you can adjust this balance by setting one of the following values the “performance” option: very-sensitive, sensitive (default value), fast or very-fast. The default value, sensitive, keeps a good balance between sensitivity and precision. In the case you need more precision (at the cost of a slower performance) choose very-sensitive. Conversely, choose fast or very-fast to prioritize speed performance (at the cost of a lesser precision).
Mode: By default, bowtie2 aligns “end-to-end” mode, meaning that it attempts to align the entire read without omitting characters at either extreme. The alternative, local mode, does not require that reads align from one extreme to the other, that is, it can yield alignments that do not include one (or both) of their extremes. Observe in the example of Figure 8.5 that in "local" mode the extremes are not aligned while they are both aligned in "end-to-end" mode.
Specific options for hisat2.
We use the R package Rhisat2 (Soneson, 2021) to align with hisat2 (Kim et al., 2015).
All the functionalities of these hisat2-specific parameters are linked to the fact that hisat2 is a “splice-aware” aligner. In practical terms, that means that it uses the information deduced and/or provided about the genomic locations of the splice sites and introns in order to improve the precision of the predicted alignments. Of course, this not applies for intronless genomes. By informing these parameters the alignment may be slightly more accurate (if you do not inform them, hisat2 will do his best anyway.)
Once you have filled the form in with the required files and set your parameter values, the “Launch!” button turns from disabled (grey) to enabled (red) so you can click it and launch the alignment(s). A pop-up window appears informing you of the progress of the alignment task. When the process ends OK a message “done” is displayed in the pop-up window and a “Close” button is enabled. By clicking “Close” you return to the “Align Reads to Genome” pane. Now, you could make a new Align Reads to Genome round or click the “X” icon for closing the “Align Reads to Genome” window. You will then return to the “Project’s Main Page”, which will have been updated to contain the new alignment (.bam) files in its File List.
Click the “Quantify Genes” button in the “Project’s Main Page” to access this window (Figure 9.1).
In summary, you inform here a genome features file (GFF) file and a number of alignment files in BAM. Normally, these alignments have been generated previously in the “Align Reads” step. Alternatively, the BAM files were generated elsewhere and the user uploaded to SeqNjoy.
The output of the quantification is one “counts” file per BAM alignment file. Each “counts” file contains a list with the “counts” , i.e. the number of “reads” or “fragments” that have been mapped to each gene (or other feature type that you select) of the genome.
Genome Features: Select one GFF3 or GTF file from the list of loaded GFF files displayed in the drop-down menu.
Output file suffix (optional): you can type in this box a text chunk that will be added to the name of all the output files of this step (“counts” files), immediately before the .counts extension, that is, “bam_name.suffix.counts” (Example: “Sample1.exons_by_genes.bam”, “Sample2.exons_by_genes.bam”, etc.). If no output file suffix is typed, the resulting files will be named similarly to the corresponding BAM (“sample_name.counts”).
Alignments: In this column of selectors you select the samples (their corresponding BAM alignment files) from which you want to obtain the quantification of counts for one (or several) genomic feature(s). At least one BAM file must be selected.
The following parameters are used by the tool that performs the quantification task, the function featureCounts of the R package Rsubread (Liao et al., 2019):
If you have any doubt, there is a specific section on strand specificity in RNA-Seq in this Tutorial. No default value is set, as it depends on how the sequencing of your reads has been made. If you finally don’t know which strandness protocol was followed in the sequencing, select unstranded as it is the safest option.
Primary only: A read may map ambiguously to multiple locations, e.g. repeated or closely similar regions in the genome. In that cases, only one of the multiple read alignments is considered “primary”, that is, “the best alignment” based on diverse criteria. All other alignments are marked as “secondary”. By default (box checked) the quantification will count only the alignments marked as “primary alignments” and discard the “secondary”. If not checked, all alignments will be counted. In Figure 9.2, if "primary only" is checked, only the alignment tagged as primary will be counted in feature A for read R001. If not, all the three alignments will be counted: one count for feature A, one for feature B and one for feature C.
Genomic features and blocks: The GFF file selected has a number of feature types that you can quantify by. For example, you may wish to get the list of reads that align (counts) per each gene. For that, in the drop-down menu “Select Feature” you should select the “feature type” gene among the existing in your GFF file. Some features, like gene, are constructs that have constituent sub-features underneath. One key sub-feature related to gene expression is exon. In principle, it is more accurately (specially if your GFF is well annotated) to take into account for quantification not all the reads aligning the gene span but only those reads aligning within the (block of) exons of the gene (Figure 9.3).
Add more button: Click to set additional feature types to quantify. For example, if the GFF file contains “genes” and also “pseudogenes” you may want quantify both.
Once you have filled the form with the required files and set your parameter values, the “Launch!” button turns from disabled (grey) to enabled (red) so you can click it and launch the quantification. In Figure 9.4, the quantification process will be launched to count genes for nine samples, with reverse strandness mode. Only primary alignments will be counted. The output of the quantification is a number of counts files equal to the number of input BAM files, that is, nine in this case.
A pop-up window appears informing you of the progress of the quantification task. When the process ends OK a message “done” is displayed in the pop-up window and a “Close” button is enabled. By clicking “Close” you return to the Quantify Genomic Features pane. Now, you could make a new Quantify Genomic Features round or click the “X” icon for closing the window. You will then return to the “Project’s Main Page”, that it will have been updated to contain the new “counts” files.
In this step you can perform an statistical analysis of the differences in the expression for a given set of features between two groups of samples. You can use it repeatedly to make more that one binary comparison (e.g. “SampleGroupDrugA versus SampleGroupDrugB”, and then “SampleGroupDrugA versus SampleGroupPlacebo” and finally “SampleGroupDrugB versus SampleGroupPlacebo”).
SeqNjoy offers you two possible methods to perform this task, both of them widely used and well documented: DESeq2 (Love et al., 2014) and edgeR (Robinson et al., 2009). If you click the Compare Samples button a submenu listing both options is opened and you can select any of them (Figure 10.1) .
After selecting the desired option, the Calculate Differential Expression dialog is displayed (Figure 10.2). Most of this window is the same regardless of the selected software. Only some specific parameters located in the shaded area of the window are different and specific of the selected method (DESeq2 or edgeR) to perform the differential expression calculations.
The following fields are present in the Calculate Differential Expression window irrespective of the selected method:
Specific options for DESeq (Love et al., 2014):
Independent Filtering: The default (checked box) is to perform independent filtering. The purpose of independent filtering is to run a process, subsequent to and independent of the particular statistical tests performed by DESeq2, in order to remove from the results those that have no, or little chance of showing significant evidence, what typically increases the statistical/detection power. The simplest criterion to do the filtering is using the mean of all the normalized counts for the evaluated gene. In the box is unchecked, this additional filtering is disabled. More information in this link
Use beta prior The default (checked box) is to use beta prior, which is equivalent to perform a “Log fold change (LFC) shrinkage”, that is, to shrink the LFC estimates of genes with little information or high dispersion toward more likely (lower) LFC estimates. Figure 10.3 illustrates the effect of using (right) or not (left) beta prior (PC) on the shrinkage of LFC. Observe how using PC a shrinkage on the LFC (“Log2Ratio”, in the Y-axis) is performed. This shrinkage affects those genes that have low expression (low “Log2BaseMean”, X-axis). Red and green dots represent genes upregulated and downregulated, respectively, with p-adjusted (FDR) < 0.05.
Use beta prior procedure when your data have a high number of low-expressed genes (low number of counts) and/or a high number of genes showing high difference of expression within a condition (high dispersion in the counts values). You can find additional explanations in this link.
Specific options for edgeR (Robinson et al., 2009):
Prior Count (optional): This argument is the counterpart in edgeR to the Use beta prior argument in DESeq2. When checked, you can type in the text box a positive number that will be used in the calculations of differential expression as a factor to shrink Log Fold-Changes (LFC). The default is 0.125, a value that produce a minimal LFC shrinkage. Larger values of prior count will produce more shrinkage. Figure 10.4 shows the effect of different values of Prior Count (PC) on the shrinkage of LFC. Observe how the higher the value of PC the larger the shrinkage of LFC (“Log2Ratio”, in the Y-axis). Observe also how the shrinkage affects only the those genes with low expression (low “Log2BaseMean”, X-axis). Red and green dots represent genes upregulated and downregulated, respectively, with p-adjusted (FDR) < 0.05.
Once you have filled the form with the required files and set your parameters values, the Launch! button turns from disabled (grey) to enabled (red) so you can press it and launch the differential expression calculation. In the figure 10.5 differential expression calculation between the Case ("pH5_CO2") group of samples and the Control ("pH5") group of samples is displayed ready to be launched (Launch! button enabled). In the example, an annotations file has been provided so the information contained in this file will be incorporated to the diffexp file with the differential expression calculations.
After launching the process, a pop-up window informs you the progress of the “differential expression” task. When the process ends OK the message “done” is displayed in the pop-up window and a close button is enabled. By clicking the Close button you return to the Calculate Differential Expression pane. Now, you can fill in again the form with a new comparison between groups of samples to calculate the differential expression between them or click the “X” icon for closing the window. You will then return to the “Project’s Main Page”, that it will have been updated to contain the newly created diffexp files in Excel format (extension .xlsx) containing the results of the calculated differential expressions of the comparisons performed.
The “diffexp” output file (syntax: [CaseLabel]vs[ControlLabel].[OutputSuffix_if_typed].xlsx) contains the list of differentially expressed genes (DEG) detected between the “Case” and the “Control” groups of samples. Check in the File Types section the information content of the diffexp files. You can graphically inspect and filter the differentially expressed genes results using the FIESTA 2.0 analytic tool.
In addition to the pipeline to perform the RNA-Seq gene expression data analysis, SeqNjoy provides three utilities for inspecting the content of some of the types of files involved in the analysis. These utilities are launched from buttons located in the File List.
To access this utility from a GFF (GFF3 or GTF formatted) of the File List click the icon located rightmost in the Details column.
When launched, SeqNjoy will run the plot_features function of the Rgff R package (Garcia-Martin et al., 2022) to generate a picture containing a tree representing the hierarchical structure (dependencies) of the feature types included in the selected GFF file (Figure 11.1). This tree representation is useful for a global look of what is annotated in the genome and it can help you decide which feature(s) and blocks to use in the Quantify Genes step of the analysis.
As it is shown in the example tree above, the highest feature type in this GFF is gene and there are 7127 items annotated as genes. gene has a child node, the transcript feature. Included into transcript there are five child features, that are siblings from each other: CDS, exon, five_prime_UTR, start_codon and stop_codon.
With this information, we know that we could count genes, exons-by-gene, transcript, exons-by-transcript, exons, etc. in our analysis but not “exon-by-CDS” for example, as exon is not a descendant of the CDS feature.
You can access this utility from a FASTA file containing the genomic sequence or from an alignment file (BAM), by clicking the icon located on the right of the Details column of the corresponding file. The option is not available for user-provided BAMs.
After clicking the button, the widely used “Integrative Genomic Viewer” (IGV) (Robinson et al., 2011, 2022) tool will be launched and will load the corresponding FASTA or BAM file. If you have loaded a FASTA, a screen similar to the one in Figure 11.2 will be displayed.
If you use a BAM file with IGV a similar window is displayed. In addition to the BAM file, the associated genomic sequence (FASTA file) is automatically incorporated to IGV in the same session. See in Figure 11.3 the initial aspect displayed in this case and a minimal “use case” for IGV to visually compare the alignment files of two samples (i.e. two BAM files) for a particular gene.
IGV has a plethora of functionalities and possibilities to interact with your alignment files. Please refer the IGV user general guide, to take full advantage of this versatile tool. In addition, you can inspect your files with the web/on-line version that IGV provides at https://igv.org/app/, documented at https://igvteam.github.io/igv-webapp/.
FIESTA 2.0 (Oliveros, 2022) is an interactive viewer for differential expression results files that is integrated in SeqNjoy. It provides multiple ways of representing your data as tunable interactive plots that can be downloaded as high resolution images. In addition, FIESTA 2.0 allows you sorting and filtering genes by one or several columns and saving the relevant ones into files.
Figure 11.4 illustrates a usage example for FIESTA 2.0. When a diffexp file is loaded into FIESTA 2.0, a default representation of the expression data as a MA-plot is displayed. From this point, you can combine filters (top) to select gene subsets. A very common (and useful) filter combination is to select and highlight with user-selected colours genes (center) that show to be significantly (P-adjusted < 0.05) under- (negative Log2Ratio) or overexpressed (positive Log2Ratio). Finally, user can select any pair of numerical columns present in the diffexp file to be represented (bottom), so other popular representations like Volcano plots are also available. User-generated plots and subsets of relevant genes are both downloadable (in the case of the plots, with high resolution).
To access FIESTA 2.0 viewer from a file containing the differential expression results (“diffexp” files in .xlsx format) click the icon located rightmost in the Details column.
More details about FIESTA 2.0 here.
SeqNjoy uses up to seven types of files in the analyses. The files that you provide have these types: FASTA, FASTQ, GFF3/GTF and ANN. Other types are only held by files generated through the SeqNjoy analysis: COUNT, DIFFEXP. Finally, the format BAM/SAM is that of the alignment files, which can be either internally generated by SeqNjoy or provided (loaded) by the user.
To follow, we describe the aspect, source, uses, format requirements and content of each file type. You can find a good complement to know more about them in this link by Andrew Severin and also in this page.
Content: Each FASTQ file contains information of millions of so-called short reads, or small genomic sequences that have been “read” by a sequencer machine in a sample. FASTQ is also the name of the format of these files, and is kind of a extension of the FASTA format (it has also a header and lines with genomic sequences).
A FASTQ file uses four lines per short read (Figure 12.2):
Content: A GFF3 or GTF file contains information for the so-called genomic features. A feature is a genomic region with some involvement in a biological process. For example, the mRNA “EDEN.1” (mRNA00001) of the gene “EDEN” (gene00001) in the Figure 12.3 is a feature. In the same figure, that specific feature is described in Line 3 of the exemplified GFF file:
The Figure 12.3 depicts how a gene and its features composition (mRNAs, exons, CDSs) are represented in a GFF3. After some header lines (starting with “#”), each feature is written in a line. You can see that each feature has information about its location (start, end, strand) and feature type (gene, mRNA, exon, etc.). A key point is the last column, where relations of dependency between the features are coded. For example, the feature “mRNA00001” (with feature type “mRNA” ) in Line 3 depends on (note its “Parent” tag) another feature named “gene00001” (aka EDEN) in Line 2, which has the feature type “gene”. SeqNjoy has a specific utility for displaying the feature structure of your GFF3/GTF file in a graphical way.
More formally, each line of a GFF3 (or GTF) file has 9 tab-separated columns:
Column 1: seqID (e.g. chromosome-id: “chr1” or “1”, “chr2”, etc..). IMPORTANT: These ids must be the same as the used in the FASTA file. See in the example below (Figure 12.4) how only the GFF3 in green has matching chromosomes ids (chr1, chr2…) with those in the FASTA file displayed.
GFF3 is currently the preferred format to represent Gene Features, but it is not uncommon to find gene feature files in GTF format. GFF3 provides more flexibility to represent the hierarchy of the features in a genome but GTF has some advantages in certain cases. SeqNjoy can process files in GFF3 or GTF formats. In both cases, columns 1 to 8 are the same. The 9th column has a different syntax in GFF3 and GTF (Figure 12.5 )
Important: For SeqNjoy, GFF files must contain at least one “gene” in the feature type column.
For more detailed information about GFF formats refer to this, this or this link.
Content: BAM/SAM files follows a format named “SAM format” (Sequence Alignment/Map). This format consists on tab-limited columns that store alignments of the short reads from the FASTQ file that have been mapped to the reference genome (your FASTA file).
A BAM is a binary file and therefore it is not directly readable with an editor (you can use IGV to inspect its content). Getting the uncompressed/text version of a BAM, a SAM file, does not solve normally the problem, as SAM files are very large and any try to browse it with a common editor will most likely end in a crash of the editor after a while.
In brief, a BAM/SAM file contains a complete information per line of each alignment: read identification, chromosome location, alignment type (single-end or paired-ends), mapping quality, presence of insertions and deletions in the alignment, insert size, etc.
You can find detailed information about the SAM format in the SAM Format Specification.
Content: The “ANN” format of an annotations file is very simple (Figure 12.7). The first line is the header line and has the column names. The first column of the file must contain gene identifiers. These identifiers must match (in type, not necessarily in number) those present in the counts files. The rest of columns may contain different biological information related to the genes, e.g. gene description, gene synonyms, biological processes it is involved in, molecular functions, cell compartments, expressed tissues, biological pathways, interactions, orthologous genes, literature links, identifiers to other databases, etc.
It is not necessary that all genes present in the counts files are also present in the annotations file, nor is it necessary that all genes present in your annotations file are present in the counts files.
This is the meaning of the columns present in the diffexp files (Figure 12.8):
Andrews, S., Krueger, F., Segonds-Pichon, A., Biggins, L., Krueger, C., & Wingett, S. (2012). FastQC. Babraham Institute.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x
Garcia-Martin, J. A., Oliveros, J. C., & Torres-Perez, R. (2022). Rgff: R utilities for gff and gtf files. National Center for Biotechnology, Spanish National Research Council (CNB-CSIC). https://cran.r-project.org/web/packages/Rgff
Kim, D., Langmead, B., & Salzberg, S. L. (2015). HISAT: A fast spliced aligner with low memory requirements. Nature Methods, 12(4), 357–360. https://doi.org/10.1038/nmeth.3317
Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with bowtie 2. Nature Methods, 9(4), 357–359. https://doi.org/10.1038/nmeth.1923
Liao, Y., Smyth, G. K., & Shi, W. (2019). The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Research, 47(8), e47. https://doi.org/10.1093/nar/gkz114
Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12). https://doi.org/10.1186/s13059-014-0550-8
Oliveros, J. C. (2022). FIESTA 2.0: Interactive scatter plots in your browser. National Center for Biotechnology, Spanish National Research Council (CNB-CSIC). https://bioinfogp.cnb.csic.es/tools/fiesta2
Pomaznoy, M., Sethi, A., Greenbaum, J., & Peters, B. (2019). Identifying inaccuracies in gene expression estimates from unstranded RNA-seq data. Scientific Reports, 9(1). https://doi.org/10.1038/s41598-019-52584-w
Robinson, J. T., Thorvaldsdóttir, H., Turner, D., & Mesirov, J. P. (2022). Igv.js: An embeddable javascript implementation of the integrative genomics viewer (igv). bioRxiv. https://doi.org/10.1101/2020.05.03.075499
Robinson, J. T., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E. S., Getz, G., & Mesirov, J. P. (2011). Integrative genomics viewer. Nature Biotechnology, 29(1), 24–26. https://doi.org/10.1038/nbt.1754
Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2009). edgeR: A bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139–140. https://doi.org/10.1093/bioinformatics/btp616
Soneson, C. (2021). Rhisat2: R wrapper for hisat2 aligner. https://github.com/fmicompbio/Rhisat2
Wei, Z., Zhang, W., Fang, H., Li, Y., & Wang, X. (2018). EsATAC: An easy-to-use systematic pipeline for atac-seq data analysis. Bioinformatics. https://doi.org/10.1093/bioinformatics/bty141
After the “fragmentation” step, in RNA-Seq protocols fragments ~300 to 500 bp long are produced. Even this size range is too long to be sequenced entirely by the type of sequencers that generate short reads. What is done then is to read only the end part of the fragment, which produce short reads of 100-200 bases (this interval may vary according to the experimental design, the machine used, etc.) The rest of the fragment is not sequenced.
Now, there are two modalities in RNA-Seq, single-end and paired-ends (Figure 12.9):
Single-end RNA-Seq (SE, aka single-read RNA-Seq): Is the simplest mode. Only one end of the fragment is sequenced. All the short reads sequences generated for a sample with single-end (SE) technique are collected in a single FASTQ file.
Paired-end RNA-Seq mode (PE): Two reads per fragment are produced. In this mode, the reading starts at one end, finishes this direction at a pre-specified read length (e.g. 150 bases) and a “read 1” is produced as a result. Then, it starts another round of reading from the opposite end of the fragment, finishes this new (opposite) direction at the same read length and finally a second read, “read 2”, is produced. At the end, both ends of the fragment are sequenced leaving the inner portion of the fragment unsequenced.
In PE sequencing, read 1 reads (those generated first-in-the-pair) and read 2 reads (generated secondly) are stored in two separated FASTQ files. One file contains all the read 1 reads the other file contains all the read 2 reads. Each of these files has a label in its name that identifies whether it contains read 1 or read 2 reads, something like Sample1_blah_blah_1.fastq and Sample1_blah_blah_2.fastq (or something similar like Sample1_blah_blah_R1.fastq and Sample1_blah_blah_R2.fastq, etc.) Note that usually the only difference in the filenames between two FASTQ paired-ends files of the same sample is just this "_1" or "_2"-like particles.
In addition, "_1" and "_2" FASTQ files of a particular sample have both the same number of reads. Moreover, they kept the same order of reads (see Figure 12.9): at a given position X in both files, the reads in “_1” and “_2” correspond to the same fragment. In fact, as with the files, the read identifier that is given to the “read 1” and “read 2” reads of the same fragment is the same… except again for the particle indicating whether it is for reads 1 or reads 2.
To know how is the orientation of the reads in SE and PE is useful to interpret correctly the alignments of the reads and to understand important topics like the unstranded or stranded protocols in RNA-Seq.
The sequencing always synthesizes the reads in a 5’ -> 3 direction. (It is the way the DNA polymerase proceeds) so the template strand is traversed 3’ -> 5’.
In single-end, the reads extend in the 5’ -> 3’ direction along the forward (plus) DNA strand (Figure 12.10).
In paired-ends, the read 1 read extends in the 5’ –> 3’ direction towards read 2 along the forward (plus) DNA strand. As for the read 2 read, it extends in the 5’ –> 3’ direction towards read 1 along the reverse (minus) DNA strand. Read 1 is the reverse complement of the first 3’-most bases of the forward strand (therefore identical to the first 5’-most bases of the reverse strand) and read 2 is the reverse complement of the first 3’-most bases of the reverse strand (therefore identical to the first 5’-most bases of the forward strand). See the read 1 and read 2 orientations represented in Figure 12.11:
Therefore the sequences of the two reads of a pair are pointing towards each other on opposite strands. When you align them to a genome, one read should align to the forward strand, and the other should align to the reverse strand, at a higher base pair position than the first one so that they are pointed towards one another. This is known as an “FR” read pair or “forward/reverse”, as this is the order in which they are placed. A read pair having this orientation is called a “proper pair”. There are other possible orientations, but we don’t need to go into them here.
When you use IGV genomic browser you can see this orientation in the case of a read pair of a PE sequencing. Take into account that in IGV only the forward (plus) strand of the genome is displayed. With the “Color by” -> “read strand” option the read in the forward strand is colored red and the read of the reverse strand is colored blue. Observe in the figure above a read pair meeting the FR orientation. They are even more visible if you select “View in Pairs” option (Figure 12.12).
Both single-end (SE) and paired-ends (PE) can be used for measuring gene expression. PE offers the advantage over the SE that the distance covered by the R1 and R2 and the unsequenced sequence in-between is controlled on average (it is named “insert size” and is usually in an approximate range 200-600 bp) so the aligners can take advantage of this information to be more accurate in the correct alignment of the reads to the reference genome. The point is that a read pair properly (FR) oriented, covers a longer span in the genome (read_1 + inner_distance + read_2, Figure 12.13) and therefore a stronger evidence of being well aligned than a single read.
The main disadvantage of the PE protocol is that it is normally more expensive and time-consuming to generate than single-end reads. On the other hand, just for counting reads or calculate differential expression may be unnecessary to have such a high level of accuracy, especially if your species is well-annotated. But take into account that in the future you might want to reuse the reads intended to study differential expression to, for example, extend the study to look for alternative splicing. In that case, you will need to have done PE sequencing as SE is insufficient for alternative splicing analysis.
For the transcription of a gene, a strand of the DNA is used as template of the resulting transcript. As the sequence of this “template strand” is the reverse complement of the sequence of the transcript, we say that this strand is the “negative-sense strand”, “antisense strand”, “non-coding strand” or “minus strand”. Conversely, as the “non-template strand” has a sequence equivalent (except T instead of U nucleotide) to the transcript sequence it is named “positive-sense strand”, “sense strand”, “coding strand” or “plus strand” (Figure 12.14).
Important to remember, either of the two polynucleotide strands may contain genes, and hence the determination of sense and antisense is gene specific (Figure 12.15):
The RNA-Seq protocol produces single-stranded mRNA molecules that are converted to double-stranded complementary DNAs (cDNAs). Then the two strands of the cDNA are amplified to yield millions of short reads. Consequently, for a transcript, you would have short reads of two types: some reads have a sequence that is “the same” (except that “T”s instead of “U”s) as the original RNA transcript sequence (“sense” reads) and others with a reverse complemented sequence respect to the original RNA transcript (“antisense” reads).
At the beginning, most sample preparation protocols were as described in the previous paragraph. This protocol is named non-strand-specific (aka unstranded). Sequences of both sense strand, the strand of the cDNA codifying the transcript and the antisense strand of the original mRNA, are obtained in the sequencing process, without knowing which strand of the cDNA corresponded to the original mRNA. Consequently, it is not clear if a read originated from the sense strand or the antisense strand of the reverse transcripted mRNA. For the FASTQ files where the short reads are stored, that means that the "_1.fastq" file and "_2.fastq" files contains mixed both types of reads.
When assigning counts to genes you wish to assign all reads coming from molecules of the sense strand to a sense strand gene in the reference and all reads coming from molecules of the antisense strand to an antisense gene in the reference. This is not possible with a non-stranded protocol and some antisense strand reads will be assigned to a sense gene and some sense strand reads will be assigned to an antisense gene. This may lead to some incorrect assignments that can produce biased estimates when counting reads that maps on genes.
Now, observe the protocol on the right in the Figure 12.16. Observe that in this protocol the second-strand fragment (the sense strand) is “stopped” to amplify. With this stranded protocol only the first strand (antisense stranded in this case) survives the subsequent PCR amplification step and hence the strand information of the libraries. As a result, the "_1.fastq" file will store only antisense (reverse) reads, as they are the first of the pair to be sequenced. In a second step, the "_2.fastq" file will store the second of the pair to be sequenced, which is the sense (direct) reads. The described protocol is named reverse stranded, since it is the reverse reads (produced in first place) that are amplified and stored in "_1.fastq" file.
There is an alternative stranded protocol that amplifies only the sense (direct) reads and avoid amplification of antisense (reverse) reads. In this case, the direct reads are the first of the pair to be sequenced and are stored in the "_1.fastq" file. Antisense (reverse) reads are produced in second place and are stored in the "_2.fastq" file. This protocol is direct stranded, since it is the direct reads (produced in first place) that are amplified and stored in "_1.fastq" file.
Figure 12.17 illustrates how in unstranded protocol, you don’t know the sense sequence is stored in ‘reads_1.fastq’ file or in ‘reads_2.fastq’ file, whereas in direct stranded the sense read of the pair is stored in ‘reads_1.fastq’ and in reverse stranded the sense read of the pair is stored in ‘reads_2.fastq’.
With the stranded protocols, the short reads carry always “sense sequences” so there are no biased counting due to antisense reads. There are several techniques to produce a “stranded RNA-Seq”. One of the most used is to produce a chemical modification with dUTP specifically in the non-sense strand of the cDNA that blocks its undesirable amplification (Figure 12.16, right). This protocol is named reverse stranded RNA-Seq because the sequence that is amplified is the “reverse” one.
Other types of stranded protocols depletes not the antisense but the sense sequences (reverse complemented of the transcript). This protocol is named direct stranded RNA-Seq.
When your provider delivers the short reads files (FASTQ files) of your RNA-Seq experiment to you, it is recommended to check the provider’s report of the sequencing to know which library prep method was used: unstranded (non-stranded), direct-stranded or reverse-stranded. If it is not mentioned, request this information from the provider as it is very important in order to obtain a fair transcription (counts) estimation. After knowing the protocol type, select the proper option in the Strand specificity selector of the Quantify Genomic Features window.
Maybe that you don’t be able to find the used protocol by the above designations (for example, you can find that the protocol was a “directional RNA-Seq” which is the same as “stranded RNA-Seq”) or you find only the commercial kit name used for the protocol. In that case, the following conversion table may be useful:
Protocol | Synonym | Examples/Kits1 | Description |
---|---|---|---|
Unstranded | non-stranded | Standard Illumina, TruSeq RNA Sample Prep kit, NuGEN OvationV2, SMARTer universal low input (TaKara), GDC normalized TCGA data | Information regarding the strand is not conserved (lost) |
Reverse | first_strand | dUTP, NSR, NNSR; TruSeq Stranded (Total RNA/mRNA), NEB Ultra Directional, Agilent SureSelect Strand-Specific | The second read (read 2) is from the original RNA strand/template, first read (read 1) is from the opposite (reverse) strand. |
Direct | second_strand | Directional Illumina (Ligation), Standard SOLiD, ScriptSeq v2, SMARTer Stranded Total RNA, NuGEN Encore Complete, NuGEN SoLo, Illumina ScriptSeq | The first read (read 1) is from the original (direct sense) RNA strand/template, second read (read 2) is from the opposite strand. |
1 The information of this column was compiled from this, this, and this link.
In any case, we recommend to check the strandness by yourself by following a strategy like the described in the following section.
It may be that even after searching everywhere you cannot find the information or you are not sure about the strandness of your RNA-Seq files. Maybe you have obtained your FASTQ or BAM files not directly from a provider but from a repository lacking information or maybe they are old ill-documented data, etc.
In that case, there are still some strategies to guess the strandness of your BAM files. Here we tell you a strategy using IGV (Robinson et al., 2011, 2022) browser.
You will need to load in IGV the FASTA and GFF3/GTF files, and the BAM file(s) with the unknown strandness.
As said in Reads orientation section, when aligning to the genome, a read aligns to the forward/plus DNA strand and the other aligns at a higher base pair position at the reverse (or minus) DNA strand so that they are pointed towards one another (“FR” read conformation or “forward/reverse”, in that order, Figure 12.12). We also saw that using the “color by read strand” option, IGV colors forward reads in red and reverse reads in blue.
As we have also seen the “reads 1.fastq” contains the first-read-of-the-pair and “reads 2.fastq” contains the second-read-of-the-pair. Let’s focus on “reads 1.fastq”:
In the case of “unstranded” protocols, “read 1.fastq” contains a mix of forward and reverse reads. When mapping to any gene in particular, there is a mix of forward and reverse reads in “read 1.fastq” file. Some “read 1” reads will go forward and other “read 1” reads will go reverse.
In the case of “direct stranded”, “reads 1.fastq” contains only “direct reads”, that is, reads that align in the same sense that the gene they map to. In a gene that lies in the forward DNA strand (aka plus strand), mapped “read 1” will be also “forward” (red) and in a gene that lies in the reverse DNA strand (aka minus strand) a mapped “read 1” will be “reverse” (blue).
In the case of “reverse stranded”, “reads 1.fastq” contains only “reverse reads”, that is, reads that align in the opposite sense (antisense) that the gene they map to. Therefore, on a gene that lies in the forward DNA strand (plus strand), mapped “read 1” will be “reverse” (blue) and in a gene that lies in the reverse DNA strand (minus strand) a mapped “read 1” will be “forward” (red).
Related to the above, there is another interesting coloring that IGV provides.
You can access it clicking the track’s gear menu of IGV and selecting Colour by -> First-of-Pair Strand in your BAM track. First-of-pair colours both reads of a read pair with the color of the direction of mapping (forward of reverse) of the “Read 1”: if “Read 1” is forward, the pair “Read 1”+“Read 2” is colored red and, conversely, if the “Read 1” is reverse, both members are coloured blue.
Finally, we need a final feature of IGV to fully determine the strandness type protocol. If we left-click a member of a paired-end read, a pop-up window give us diverse information about the alignment. One piece of information is the “Pair Orientation”.
There are two possible values for Pair orientation in a “proper paired” read pair: F2R1 or F1R2. The letters “F” and “R” refer to the above-mentioned “FR” normal orientation. The numbers “1” and “2” tell us the file from which each read of the pair comes (‘reads 1.fastq’ or ‘reads 2.fastq’). Thus, “F2R1” means “the first read is a forward one and comes from ‘reads 2.fq’ file and the second read is a reverse one and comes from the ‘reads 1_fq’ file”. “F1R2” means “the first read is a forward one and comes from the ‘reads 1.fq’ file and the second read is a reverse one and comes from the ‘reads 2.fq’ file”.
We have now all the tools to determine the strandness protocol in IGV:
For a given transcript, non-directional (unstranded) libraries will show a mix of red and blue reads aligning to the locus (that is the easy case), as there is a mix of forward and reverse reads in the “reads 1.fastq” file.
Stranded libraries will show reads of (predominantly) one color in the direction matching the gene orientation (in the “plus” or “minus” strand of the genome). You will see only one color in reads mapping genes in DNA plus strand and only reads in the other color in reads mapping genes in DNA minus strand.
You can check if you have got the concept by trying to find out which protocol modes have been used in “BAM-1” and in “BAM-2” in Figure 12.20, knowing that the reads have been colored “first-of-pair”. You can find the answer below.
Final remark: in case you still cannot determine the strandness of your files, the most harmless thing to do is to select “unstranded”. With this option, at least you do not miss the correct stranded reads. Some studies have tried to estimate the impact of proceeding with “unstranded” counting in a “stranded” protocol. According to (Pomaznoy et al., 2019): “10% of all genes and 2.5% of protein coding genes have a two-fold or higher difference in estimated expression when strand information of the reads was ignored.”
The answer for the question about the strandness protocols in Figure 12.20 is that BAM-2 is unstranded (mixed colors = mixed orientations of “read 1” reads) and BAM-1 was sequenced with a Direct Stranded protocol (the reads in the “DNA-plus-located” gene are colored red = F1R2 and the reads in the “DNA-minus-located” gene are colored blue = F2R1).
SeqNjoy: Complete RNA-Seq workflow in your Desktop. Developed by BioinfoGP CNB-CSIC (2022)