Developed by UTAP2 team
Bioinformatics unit at Life Sciences Core Facilities (LSCF)
Weizmann Institute of Science

Sequencing and mapping quality control (QC)

Figure 1: Average quality at each base position across all reads. A Phred quality score of 30 or higher is considered good (predicted error rate of 1 in 1,000).
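The error rate quoted for a quality of 30 follows from the standard Phred definition, p = 10^(-Q/10). A minimal sketch of that conversion is shown below; the function name is illustrative and not part of the pipeline.

```python
import math


def phred_to_error_probability(quality):
    """Convert a Phred quality score Q to an error probability p = 10^(-Q/10)."""
    return 10 ** (-quality / 10)


# Print the expected error rate for a few common quality thresholds.
for q in (10, 20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: about 1 error in {round(1 / p):,} bases")
```

Each increase of 10 in the quality score corresponds to a tenfold drop in the predicted error rate.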


Figure 2: Histogram showing the number of reads for each sample in the raw data. Across all samples, the median sequencing depth was 53,583,236 reads, the mean depth was 55,238,538 reads, and the standard deviation was 4,703,440 reads.
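As a rough illustration of how such depth statistics can be derived from per-sample read counts, here is a sketch using Python's standard library. The sample names and counts are hypothetical and are not the values behind this report, and the use of the sample (n - 1) standard deviation is an assumption, since the report does not say which estimator was used.

```python
import statistics

# Hypothetical per-sample raw read counts (NOT the data from this report).
raw_read_counts = {
    "sample_A": 50_000_000,
    "sample_B": 52_000_000,
    "sample_C": 56_000_000,
    "sample_D": 62_000_000,
}

depths = list(raw_read_counts.values())
median_depth = statistics.median(depths)
mean_depth = statistics.mean(depths)
# Assumption: sample standard deviation (n - 1 in the denominator).
stdev_depth = statistics.stdev(depths)

print(f"median depth: {median_depth:,.0f} reads")
print(f"mean depth:   {mean_depth:,.0f} reads")
print(f"std. dev.:    {stdev_depth:,.0f} reads")
```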


Figure 3: Percentage of reads discarded after trimming.

No figure is presented because fewer than 1% of reads were discarded after trimming in every sample.


Figure 4: Histogram with the number of reads for each sample in each step of the pipeline.


MACS peak calling

MACS results for each sample

Sample Type  Sample  Total Tags (Reads)  Filtered Tags (Reads)  Maximum Duplicate Tags (Reads) at the Same Position  Redundant Rate
Treatment    RNC1    3513170             0                      0                                                    0
Treatment    RNC2    3300693             0                      0                                                    0
Treatment    RKD1    3924299             0                      0                                                    0
Treatment    RKD2    3433104             0                      0                                                    0

MACS results for each comparison

Comparison  Tag (read) size (bp)  MACS model d length  Total number of peaks
RNC1        28                    1                    178147
RNC2        28                    1                    172091
RKD1        28                    1                    39898
RKD2        27                    1                    29716

The final number of peaks for all comparisons is: 419,852
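The reported total can be checked against the per-comparison peak counts in the table above. The short sketch below simply sums those four values; the variable names are illustrative.

```python
# Peak counts per comparison, copied from the "MACS results for each comparison" table.
peaks_per_comparison = {
    "RNC1": 178147,
    "RNC2": 172091,
    "RKD1": 39898,
    "RKD2": 29716,
}

total_peaks = sum(peaks_per_comparison.values())
print(f"Total peaks across comparisons: {total_peaks:,}")
```

The sum equals the reported figure, so the total counts each comparison's peaks separately rather than merging peaks shared between samples.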

Figure 5: Number of peaks for all samples

Figure 6: Peaks distribution in genomic regions

Figure 7: Peaks distribution around TSS

Figure 8: Overlap of peaks among the first 4 samples

Venn plot legend

Sample  Mark
RNC1    1
RNC2    2
RKD1    3
RKD2    4

Bioinformatics pipeline methods

Reads were trimmed using Cutadapt (DOI: 10.14806/ej.17.1.200) with the parameters: -a CTGTAGGCACCATCAATAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC --times TIMES -q 20 -m 25.

Reads that aligned to rRNA were removed using Bowtie1 (DOI: 10.1186/gb-2009-10-3-r25); only reads that did not align to rRNA were kept.

Only reads with a minimum length of 25 bases and a maximum length of 32 bases were retained, using Cutadapt.
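The length filter amounts to keeping reads whose length falls in the inclusive range 25 to 32. The pipeline performs this step with Cutadapt; the sketch below only illustrates the selection logic in plain Python, with hypothetical function and variable names.

```python
MIN_LENGTH = 25  # shortest read length kept
MAX_LENGTH = 32  # longest read length kept


def keep_by_length(reads, min_length=MIN_LENGTH, max_length=MAX_LENGTH):
    """Return only the reads whose length lies in [min_length, max_length]."""
    return [read for read in reads if min_length <= len(read) <= max_length]


# Toy reads of length 24, 25, 32 and 33 (illustrative sequences only).
toy_reads = ["A" * 24, "C" * 25, "G" * 32, "T" * 33]
kept_reads = keep_by_length(toy_reads)
print([len(read) for read in kept_reads])
```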

Reads were mapped to the mm10 genome using TopHat (DOI: 10.1093/bioinformatics/btp120) with the parameters: -N 1, --no-novel-juncs, --library-type fr-firststrand, and -p 20.

Uniquely mapped reads were extracted using samtools with the -q 10 parameter.
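The -q 10 option of samtools view skips alignments whose mapping quality (MAPQ, the fifth column of a SAM record) is below 10. The sketch below mimics that threshold on toy SAM text lines in plain Python; it is not how the pipeline runs the step, and the helper names and toy records are made up for illustration.

```python
def filter_by_mapq(sam_lines, min_mapq=10):
    """Keep header lines and alignments whose MAPQ (5th SAM column) is >= min_mapq."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):
            kept.append(line)  # header lines are passed through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) >= min_mapq:
            kept.append(line)
    return kept


def make_record(name, mapq):
    """Build a toy 11-column SAM record for a 28-base read (illustrative only)."""
    seq = "A" * 28
    qual = "I" * 28
    return "\t".join([name, "0", "chr1", "100", str(mapq), "28M", "*", "0", "0", seq, qual])


toy_sam = [
    "@HD\tVN:1.0\tSO:coordinate",
    make_record("read_high", 50),
    make_record("read_low", 3),
]
kept = filter_by_mapq(toy_sam)
print([line.split("\t")[0] for line in kept])
```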

Only 5’ UTR and CDS fragments were counted using HTSeq-count v2.0.2 (DOI: 10.1093/bioinformatics/btu638) in intersection-nonempty mode.

Significant regions (peaks) were identified using MACS2 callpeak (DOI: 10.1186/gb-2008-9-9-r137) with the parameters: --keep-dup all, --nomodel, and --extsize=1.

The summits output from peak calling were shifted by 13 bases and extended by 3 bases according to gene orientation.
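One possible reading of the summit adjustment is sketched below. It assumes 1-based coordinates, that "shifted" means moved 13 bases downstream in the direction of the gene, and that "extended by 3 bases" means the result is a window 3 bases long. None of these details are stated in the report, so the function is a hypothetical illustration rather than the pipeline's implementation.

```python
def shift_and_extend(summit, strand, shift=13, length=3):
    """Return the inclusive (first, last) positions of the adjusted window.

    Assumptions (not confirmed by the report): 1-based positions, a shift
    downstream along the gene, and a final window of `length` bases.
    """
    if strand == "+":
        first = summit + shift            # downstream means higher coordinates
        return first, first + length - 1
    if strand == "-":
        last = summit - shift             # downstream means lower coordinates
        return last - length + 1, last
    raise ValueError(f"unknown strand: {strand!r}")


print(shift_and_extend(100, "+"))
print(shift_and_extend(100, "-"))
```

Under these assumptions the windows on the two strands are mirror images of each other around the summit.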

The distribution of peaks in genomic regions and their proximity to transcription start sites (TSS) were examined using ChIPseeker (DOI: 10.18129/B9.bioc.ChIPseeker). The overlap of peaks among the first 4 samples, shown in the Venn diagram, was also analyzed using ChIPseeker.

The pipeline was constructed using Snakemake (DOI: 10.1093/bioinformatics/bts480).

Acknowledgments

Citing UTAP:

Kohen R, Barlev J, Hornung G, Stelzer G, Feldmesser E, Kogan K, Safran M, Leshkowitz D: UTAP: User-friendly Transcriptome Analysis Pipeline. BMC Bioinformatics 2019, 20(1):154.