Developed by UTAP2 team
Bioinformatics unit at Life Sciences Core Facilities (LSCF)
Weizmann Institute of Science    

Sequencing and mapping quality control (QC)

Figure 1: Average quality of each base position across all reads. A quality of 30 or higher is good (predicted error rate of 1 in 1,000).
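
The link between a quality score and its predicted error rate follows the standard Phred definition, P = 10^(-Q/10). A minimal sketch of the conversion is shown here; the function name is illustrative and is not part of the pipeline.

```python
def phred_to_error_probability(quality: float) -> float:
    """Convert a Phred quality score Q to an error probability, P = 10 ** (-Q / 10)."""
    return 10 ** (-quality / 10)


# Print the predicted error rate for a few common quality thresholds.
for q in (20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: about 1 error in {round(1 / p):,} bases")
```

A quality of 30 therefore corresponds to roughly one miscalled base per 1,000.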

Quality of read 1:

[Plot: mean quality per base position. x-axis: Base; y-axis: Mean quality]

Quality of read 2:

[Plot: mean quality per base position. x-axis: Base; y-axis: Mean quality]

Figure 2: Histogram showing the number of reads for each sample in raw data. The sequencing depth statistics for all samples were as follows:

the median depth was 42,837,616 reads,

the mean depth was 44,940,902 reads,

the standard deviation was 7,617,043 reads.

[Bar plot: number of input reads per sample (Aire_C313Y_Het2, Aire_C313Y_Het3, Aire_C313Y_WT1, Aire_C313Y_WT2). Axis: Number of input reads]

Figure 3: Percentage of reads discarded after trimming.

No figure presented since the percentage of reads discarded after trimming for all samples is lower than 1%.

Figure 4: Histogram with the number of reads for each sample in each step of the pipeline.

[Bar plot: read counts per sample (Aire_C313Y_Het2, Aire_C313Y_Het3, Aire_C313Y_WT1, Aire_C313Y_WT2) at each pipeline step. Axis: Counts. Steps: 0_Raw_counts, 1_After_cutadapt, 2_Mitochondrial_genes_removed, 3_Mapped_uniquely, 4_Counts_after_PCR_deduplication, 5_Nucleosome-free_Counts]

Figure 5: Coverage plot on Genebody

Plot of mean read coverage (counts per million mapped reads) across gene regions. This plot displays the mean coverage for all genes, from 2,000 bases upstream of the transcription start site (TSS) to 2,000 bases downstream of the transcription end site (TES).

Figure 6: Coverage plot on TSS

This plot displays the mean coverage for all genes, from 2,000 bases upstream of the transcription start site (TSS) to 2,000 bases downstream of it.

Figure 7: This plot displays the insert-size histogram for each sample.

MACS peak calling

MACS results for each sample

Sample Type    Sample             Total Fragments
Treatment      Aire_C313Y_WT1     3920338
Treatment      Aire_C313Y_WT2     7939748
Treatment      Aire_C313Y_Het3    7005306
Treatment      Aire_C313Y_Het2    9707626

MACS results for each comparison

Comparison         Fragment size (bp)    MACS model d length    Total number of peaks    Total number of peaks after blacklist filtering
Aire_C313Y_WT1     33                    33                     138793                   136902
Aire_C313Y_WT2     33                    33                     83949                    82665
Aire_C313Y_Het3    33                    33                     245524                   242566
Aire_C313Y_Het2    33                    33                     114515                   112921

The final number of peaks for all comparisons is: 575,054
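
The reported total equals the sum of the blacklist-filtered peak counts listed per sample in the MACS results for each comparison. This can be checked with a few lines of Python; the variable names are illustrative.

```python
# Blacklist-filtered peak counts per sample, copied from the MACS comparison table.
filtered_peaks = {
    "Aire_C313Y_WT1": 136902,
    "Aire_C313Y_WT2": 82665,
    "Aire_C313Y_Het3": 242566,
    "Aire_C313Y_Het2": 112921,
}

# Sum the per-sample counts to obtain the overall number of peaks.
total_peaks = sum(filtered_peaks.values())
print(f"{total_peaks:,}")  # 575,054
```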

Link for Integrated annotated peaks file Download table

Figure 8: Number of peaks for all samples

[Bar plot: number of peaks per sample (Aire_C313Y_Het2, Aire_C313Y_Het3, Aire_C313Y_WT1, Aire_C313Y_WT2). Axis: Number of peaks. Legend: Number_of_peaks, Number_of_black_list_filtered_peaks]

Figure 9: Peaks distribution in genomic regions

Figure 10: Peaks distribution around TSS

Figure 11: Overlap of peaks among the first 4 samples

Venn plot legend

samples mark
Aire_C313Y_WT1 1
Aire_C313Y_WT2 2
Aire_C313Y_Het3 3
Aire_C313Y_Het2 4

Bioinformatics pipeline methods

Reads were trimmed using cutadapt (DOI: 10.14806/ej.17.1.200) with the parameters: -a CTGTCTCTTATACACATCTCCGAGCCCACGAGAC -A CTGTCTCTTATACACATCTGACGCTGCCGACGAGTGTAGATCTCGGTGGTCGCCGTATCATT --times TIMES -q 25 -m 30.
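
As a rough illustration, the trimming parameters could be assembled into a paired-end cutadapt call as sketched here. This is not the pipeline's own code. The file names are hypothetical, the `-o` and `-p` output options are added for completeness, and the value passed for `times` is a placeholder because the report does not state the TIMES value.

```python
import shlex

# Adapter sequences taken from the trimming parameters in this report.
ADAPTER_R1 = "CTGTCTCTTATACACATCTCCGAGCCCACGAGAC"
ADAPTER_R2 = "CTGTCTCTTATACACATCTGACGCTGCCGACGAGTGTAGATCTCGGTGGTCGCCGTATCATT"


def build_cutadapt_command(r1_in, r2_in, r1_out, r2_out, times):
    """Assemble a paired-end cutadapt argument list.

    `times` stands in for the TIMES value, which the report does not specify.
    All file names are hypothetical.
    """
    return [
        "cutadapt",
        "-a", ADAPTER_R1,        # adapter trimmed from read 1
        "-A", ADAPTER_R2,        # adapter trimmed from read 2
        "--times", str(times),   # maximum number of adapter removal rounds per read
        "-q", "25",              # quality-trimming cutoff
        "-m", "30",              # discard reads shorter than 30 bases after trimming
        "-o", r1_out,            # trimmed read 1 output (assumed)
        "-p", r2_out,            # trimmed read 2 output (assumed)
        r1_in, r2_in,
    ]


cmd = build_cutadapt_command(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz",
    "sample_R1.trimmed.fastq.gz", "sample_R2.trimmed.fastq.gz",
    times=1,  # placeholder value, not taken from the report
)
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting issues.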

Reads were mapped to mm10 genome using Bowtie2 (DOI: 10.1038/nmeth.1923).

Uniquely mapped reads were extracted using samtools with the parameters: -F 4, -f 0x2, -q 39.
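
The effect of these three filters can be sketched in plain Python. In samtools view, `-F 4` drops alignments with the unmapped bit set, `-f 0x2` keeps only alignments with the proper-pair bit set, and `-q 39` skips alignments with a mapping quality below 39. The function and constant names are illustrative, and the sketch assumes that all three filters apply to each alignment together.

```python
FLAG_PROPER_PAIR = 0x2  # SAM flag bit: read mapped in a proper pair
FLAG_UNMAPPED = 0x4     # SAM flag bit: read is unmapped


def passes_filter(flag: int, mapq: int, min_mapq: int = 39) -> bool:
    """Return True if an alignment survives -F 4, -f 0x2 and -q 39."""
    if flag & FLAG_UNMAPPED:          # -F 4: exclude unmapped reads
        return False
    if not flag & FLAG_PROPER_PAIR:   # -f 0x2: require a proper pair
        return False
    return mapq >= min_mapq           # -q 39: require MAPQ of at least 39


# Hypothetical (flag, MAPQ) pairs used only to exercise the filter.
examples = [(99, 42), (99, 30), (77, 0), (65, 42)]
for flag, mapq in examples:
    print(flag, mapq, passes_filter(flag, mapq))
```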

The reads were graphically visualized using ngsplot with the parameters: -G -R genebody -C -O samples -D refseq -L 50000.

Following alignment, mitochondrial genes were removed from the analysis, and duplicated reads were removed using picard-tools.

Nucleosome-free fragments shorter than 120 bp were selected from the remaining unique reads, and broad peaks were called using MACS2 callpeak (https://doi.org/10.1186/gb-2008-9-9-r137) with the parameters: --bw 120 -B -f BAMPE --SPMR --B --shift -50 --extsize 100 --keep-dup all -q 0.05.
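
The size selection can be illustrated with a minimal sketch. It assumes fragment length is read from the signed template length (TLEN) of each read pair, where 0 means the length is unavailable. The report does not state how the pipeline performs this step, so the function name and the input values are hypothetical.

```python
def select_nucleosome_free(template_lengths, max_length=120):
    """Keep fragments shorter than max_length, using absolute template length.

    TLEN is signed in SAM/BAM records, so the absolute value is used.
    A TLEN of 0 means the length is unavailable and is skipped.
    """
    return [abs(t) for t in template_lengths if 0 < abs(t) < max_length]


# Hypothetical template lengths for six read pairs.
lengths = [45, 119, 120, 180, -95, 0]
print(select_nucleosome_free(lengths))  # [45, 119, 95]
```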

If selected, a TSS file is used for the mouse genome. It contains either a broad or a narrow definition of each gene’s TSS (transcription start site) regions, based on "The landscape of accessible chromatin in mammalian preimplantation embryos" (Nature. 2016 Jun 30;534(7609):652-7).

The predicted peaks were annotated according to the mm10 genome using Homer with default parameters after merging all peaks from all samples together with bedtools multiinter.

The distribution of peaks in genomic regions and their proximity to TSS (transcription start sites) were examined using ChIPseeker (DOI: 10.18129/B9.bioc.ChIPseeker). The overlap of peaks for the first 4 samples in the Venn diagram was also analyzed using ChIPseeker.

The resulting peaks underwent filtering to exclude peaks from the blacklist.

The blacklist for the mm10 genome was taken from (https://github.com/Boyle-Lab/Blacklist/tree/master/lists).
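
Blacklist filtering amounts to an interval-overlap test, sketched here. The report does not state which tool or overlap criterion the pipeline uses, so the sketch assumes that a peak is excluded if it overlaps a blacklisted region by at least one base. All coordinates and function names are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """Half-open intervals [start, end) overlap when each starts before the other ends."""
    return a_start < b_end and b_start < a_end


def filter_blacklisted(peaks, blacklist):
    """Return the peaks that do not overlap any blacklisted region.

    Both inputs are lists of (chromosome, start, end) tuples.
    """
    kept = []
    for chrom, start, end in peaks:
        hit = any(
            chrom == b_chrom and overlaps(start, end, b_start, b_end)
            for b_chrom, b_start, b_end in blacklist
        )
        if not hit:
            kept.append((chrom, start, end))
    return kept


# Hypothetical peaks and one hypothetical blacklisted region.
peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
blacklist = [("chr1", 150, 300)]
print(filter_blacklisted(peaks, blacklist))
```

This linear scan is adequate for illustration. For the peak counts in this report, a sorted or indexed interval structure would be the more practical choice.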

The pipeline was constructed using Snakemake (DOI: 10.1093/bioinformatics/bts480).

Acknowledgments

Citing UTAP:

Kohen R, Barlev J, Hornung G, Stelzer G, Feldmesser E, Kogan K, Safran M, Leshkowitz D: UTAP: User-friendly Transcriptome Analysis Pipeline. BMC Bioinformatics 2019, 20(1):154.