stLFR resequencing analysis report

stLFR-reSeq V2.0.0.0

1.Fastq product

Generation of stLFR resequencing data starts from a high-molecular-weight (HMW) gDNA sample. A library is constructed from the HMW gDNA with the MGIEasy stLFR Library Prep Kit and then sequenced on DNBSEQ platforms manufactured by MGI, producing the stLFR fastq files. Low-quality reads are removed with SOAPnuke to obtain the clean data of the stLFR sample. The base distribution, quality distribution, Q20 and Q30 of the clean stLFR data are shown below.

Sample

V300022071A

Fastq report

Sample            V300022071A
Raw reads         1,685,759,388
Raw bases (bp)    168,575,938,800
Clean reads       1,348,332,758
Clean bases (bp)  134,833,275,800
Q20 (%)           97.01
Q30 (%)           88.65
Total depth (X)   56.19

 a. The clean fastq is built by removing low-quality reads from the raw fastq. A read pair is defined as low-quality when (1) it contains adapter sequences, (2) its ratio of N bases is > 1%, or (3) its ratio of bases with quality lower than 10 is > 10%.

 b. Q20 and Q30 are the proportions of bases in the clean fastq whose quality is > 20 or > 30, respectively.

 c. Total depth is calculated from the raw fastq.
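
The filtering rule in note a and the depth figure in note c can be sketched as follows. This is a minimal illustration, not SOAPnuke's implementation: the function names are hypothetical, adapter detection is reduced to an exact substring match, and the rule is assumed to apply when either mate of a pair fails. The 3.0 Gb genome size is also an assumption; it is simply the size with which the raw bases reproduce the reported 56.19X.

```python
def is_low_quality_read(seq, quals, adapters=(), max_n_ratio=0.01,
                        low_qual=10, max_low_qual_ratio=0.10):
    """Return True if one read fails any of the three criteria in note a."""
    if not seq:
        return True
    # (1) adapter contamination (simplified to an exact substring match)
    if any(adapter in seq for adapter in adapters):
        return True
    # (2) ratio of N bases above 1%
    if seq.upper().count("N") / len(seq) > max_n_ratio:
        return True
    # (3) ratio of bases with quality below 10 above 10%
    low = sum(1 for q in quals if q < low_qual)
    return low / len(quals) > max_low_qual_ratio


def is_low_quality_pair(read1, read2, adapters=()):
    """Assumption: a pair is discarded when either mate is low-quality."""
    return any(is_low_quality_read(seq, quals, adapters)
               for seq, quals in (read1, read2))


def total_depth(bases, genome_size):
    """Sequencing depth as total bases divided by genome size."""
    return bases / genome_size


RAW_BASES = 168_575_938_800           # from the Fastq report
ASSUMED_GENOME_SIZE = 3_000_000_000   # assumption, not stated in the report

print(round(total_depth(RAW_BASES, ASSUMED_GENOME_SIZE), 2))  # 56.19
```

Each read is passed as a `(sequence, qualities)` tuple, where the qualities are already decoded to integer Phred scores.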

Base distribution

Base distribution (x-axis is the sequencing cycle, y-axis is the proportion of each of the five base types (A, C, G, T and N) at a given cycle).

Base quality distribution

Quality distribution (x-axis is the cycle along the reads, y-axis is the base quality, shown as a heatmap for each cycle).

2.stLFR long-fragment product

The long-fragment read is a feature of stLFR technology and is built by the virtual co-barcoding method. The performance of the long-fragment reads is shown by analyzing the relationship between long fragments and barcodes, the long-fragment length distribution and the long-fragment coverage.
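
The report does not describe how long fragments are reconstructed from co-barcoded reads. As a rough illustration of the idea only, the sketch below groups mapped reads that share a barcode and chromosome and starts a new fragment whenever the gap to the previous read exceeds a threshold. The function name, the input tuple layout and the 50 kb `max_gap` are all hypothetical choices, not the pipeline's actual settings.

```python
from collections import defaultdict


def infer_fragments(reads, max_gap=50_000):
    """Group co-barcoded reads into putative long fragments.

    reads: iterable of (barcode, chrom, start, end) for mapped reads.
    Returns a list of (barcode, chrom, frag_start, frag_end, read_bases).
    """
    by_key = defaultdict(list)
    for barcode, chrom, start, end in reads:
        by_key[(barcode, chrom)].append((start, end))

    fragments = []
    for (barcode, chrom), spans in by_key.items():
        spans.sort()
        frag_start, frag_end = spans[0]
        read_bases = frag_end - frag_start
        for start, end in spans[1:]:
            if start - frag_end > max_gap:
                # Gap too large: close the current fragment, open a new one.
                fragments.append((barcode, chrom, frag_start, frag_end, read_bases))
                frag_start, frag_end, read_bases = start, end, 0
            frag_end = max(frag_end, end)
            read_bases += end - start
        fragments.append((barcode, chrom, frag_start, frag_end, read_bases))
    return fragments


# Toy input (made-up coordinates, not data from this sample).
toy_reads = [
    ("bc1", "chr1", 1_000, 1_100),
    ("bc1", "chr1", 20_000, 20_100),
    ("bc1", "chr1", 500_000, 500_100),
    ("bc2", "chr1", 3_000, 3_100),
]
for barcode, chrom, start, end, bases in infer_fragments(toy_reads):
    length = end - start
    print(barcode, chrom, length, round(bases / length, 4))
```

With fragments in this form, the fragment length is `frag_end - frag_start` and the fragment coverage is the read bases divided by that length, which are the quantities plotted in the length and coverage distributions.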

Fragment & Barcode report

Sample                           V300022071A
Total barcode type               3,623,878,656
Barcode number                   59,735,965
Barcode type rate                1.65%
Reads pair number                842,879,694
Reads pair number (after split)  713,953,863
Barcode split rate               84.7%
Split barcode (G)                142.79

 a. The barcodes of stLFR data identify the long fragment that each read comes from. The barcodes must be filtered and split before any downstream analysis; after this step, each read is assigned a barcode ID. The table above shows the barcode split statistics.
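
The derived values in the table can be reproduced from the counts with simple ratios. The sketch below is a consistency check under stated assumptions: the 100 bp read length is inferred from raw bases divided by raw reads, and reading "Split barcode (G)" as the number of gigabases retained after barcode splitting is an interpretation that happens to match the reported 142.79, not something the report states.

```python
TOTAL_BARCODE_TYPES = 3_623_878_656   # equals 1536 ** 3
OBSERVED_BARCODES = 59_735_965
READ_PAIRS = 842_879_694
READ_PAIRS_AFTER_SPLIT = 713_953_863
READ_LENGTH = 100                     # assumption: raw bases / raw reads

barcode_type_rate = 100 * OBSERVED_BARCODES / TOTAL_BARCODE_TYPES
barcode_split_rate = 100 * READ_PAIRS_AFTER_SPLIT / READ_PAIRS
# Two reads per pair, READ_LENGTH bases per read, expressed in gigabases.
split_gigabases = READ_PAIRS_AFTER_SPLIT * 2 * READ_LENGTH / 1e9

print(f"{barcode_type_rate:.2f}% {barcode_split_rate:.2f}% {split_gigabases:.2f}")
# 1.65% 84.70% 142.79
```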

Fragment barcode distribution

Long fragments vs. barcodes in stLFR data (x-axis is the average number of long-fragment DNA molecules captured per barcoded bead, y-axis is the density of barcodes).

Fragment length distribution

Length distribution of long fragments in stLFR data (x-axis is the fragment length, y-axis is the count of long fragments with a given length).

Fragment coverage distribution

Coverage of long fragments in stLFR data (x-axis is the coverage of a long fragment, y-axis is the density of long fragments at a given coverage).

3.Alignment Statistics

Clean stLFR data are aligned to the reference genome with BWA or MegaBOLT. Alignment performance is evaluated by mapping rate, coverage, insert size distribution, etc.

Alignment report

Sample                        V300022071A
Mapped reads                  1,346,475,379
Mapped bases (bp)             134,647,537,900
Mapping rate                  99.86%
Paired mapping rate           99.33%
Mismatch bases (bp)           746,868,575
Mismatch rate                 0.55%
Duplicate reads               314,482,226
Duplicate rate                23.32%
Dup depth (X)                 26.822664
Average sequencing depth (X)  34.98
Coverage (≥1X)                99.13%
Coverage (≥4X)                98.91%
Coverage (≥10X)               98.34%
Coverage (≥20X)               93.84%
Mean insert size (bp)         228.31

 a. The statistics shown here are calculated from the clean data remaining after the low-quality filter and the barcode-split filter.

 b. The mapping rate and paired mapping rate are the proportions of reads that are mapped or pair-mapped to the reference genome.

 c. Mismatch bases are aligned bases that differ between the read and the reference genome.

 d. Duplication refers to duplicated reads introduced by PCR or other processes.

 e. The average sequencing depth is calculated from mapped reads after removing duplicated reads.

 f. The coverage is the proportion of the reference genome covered at a depth of at least 1, 4, 10 or 20X.

 g. The mean insert size is calculated from paired mapped reads.
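
The rates in the alignment report can be checked against the counts. In the sketch below the denominators are inferred, not documented: using the clean read count for the mapping and duplicate rates and the mapped bases for the mismatch rate reproduces the reported percentages. The depth values are left out because the genome size used for them is not given.

```python
CLEAN_READS = 1_348_332_758      # from the Fastq report
MAPPED_READS = 1_346_475_379
MAPPED_BASES = 134_647_537_900
MISMATCH_BASES = 746_868_575
DUPLICATE_READS = 314_482_226


def percent(numerator, denominator):
    """Ratio expressed as a percentage."""
    return 100 * numerator / denominator


mapping_rate = percent(MAPPED_READS, CLEAN_READS)
mismatch_rate = percent(MISMATCH_BASES, MAPPED_BASES)
duplicate_rate = percent(DUPLICATE_READS, CLEAN_READS)  # inferred denominator

print(f"{mapping_rate:.2f} {mismatch_rate:.2f} {duplicate_rate:.2f}")
# 99.86 0.55 23.32
```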

Insert size distribution

Insert size distribution (x-axis is insert size, y-axis is the count of paired reads at a certain insert size).

Depth accumulative distribution

Cumulative sequencing depth distribution (x-axis is the depth, y-axis is the proportion of the reference genome covered at or above that depth).

Depth distribution

Sequencing depth distribution (x-axis is the depth, y-axis is the proportion of reference genome at a certain depth).
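
The two depth plots and the Coverage (≥NX) rows are different views of the same per-base depth values. The following sketch shows both calculations on a made-up list of depths; the function names and the toy data are illustrative only, and a real run would read per-base depths from the alignment.

```python
from collections import Counter


def depth_distribution(depths):
    """Proportion of positions at each exact depth."""
    counts = Counter(depths)
    total = len(depths)
    return {depth: counts[depth] / total for depth in sorted(counts)}


def cumulative_coverage(depths, thresholds=(1, 4, 10, 20)):
    """Proportion of positions covered at or above each threshold."""
    total = len(depths)
    return {t: sum(1 for d in depths if d >= t) / total for t in thresholds}


# Toy per-base depths for ten positions (not data from this sample).
toy_depths = [0, 3, 5, 12, 25, 30, 8, 1, 22, 40]

print(cumulative_coverage(toy_depths))
# {1: 0.9, 4: 0.7, 10: 0.5, 20: 0.4}
```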

GC bias distribution

GC-bias distribution (x-axis shows the GC content; the y-axis plots the NORMALIZED_COVERAGE, the distribution of WINDOWS across GC percentages, and the base quality for each GC content bin).

4.Variant Statistics

The variants of stLFR data are called by GATK or MegaBOLT together with BGI's self-developed CNV/SV calling software, and visualized with CIRCOS.

Variant statistics

Sample           V300022071A
Total_SNP        3,976,683
dbSNP_rate       93.19%
Novel_SNP        270,752
Novel_SNP_Rate   6.81%
Ti/Tv            1.95
Total_INDEL      913,668
dbINDEL_Rate     73.8%
CNV deletion     568
CNV duplication  36
SV DEL           28
SV DUP           6
SV INV           7
SV TRA           0

 a. SNP: single nucleotide polymorphism.

 b. Ti/Tv: the ratio of transition (Ti) to transversion (Tv) SNPs.

 c. INDEL: insertion and deletion.

 d. CNV: copy number variant.

 e. SV: structural variant.

 f. DEL: deletion.

 g. DUP: duplication.

 h. INV: inversion.

 i. TRA: translocation.
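
As an illustration of the SNP statistics, the sketch below classifies substitutions as transitions (purine to purine or pyrimidine to pyrimidine) or transversions and computes the Ti/Tv ratio, then recomputes the novel and dbSNP rates from the counts in the table. The function names and the toy SNP list are hypothetical; the report's own Ti/Tv value comes from the variant caller's output.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for A<->G or C<->T substitutions."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_ratio(snps):
    """snps: list of (ref, alt) single-base substitutions with ref != alt."""
    ti = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ti
    return ti / tv if tv else float("inf")


# Toy SNP list (not data from this sample): 2 transitions, 1 transversion.
toy_snps = [("A", "G"), ("C", "T"), ("A", "C")]
print(ti_tv_ratio(toy_snps))  # 2.0

TOTAL_SNP = 3_976_683
NOVEL_SNP = 270_752
novel_rate = 100 * NOVEL_SNP / TOTAL_SNP
dbsnp_rate = 100 * (TOTAL_SNP - NOVEL_SNP) / TOTAL_SNP
print(f"{novel_rate:.2f}% {dbsnp_rate:.2f}%")  # 6.81% 93.19%
```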

Variant evaluation

Type   TP         FP      FN      PPV     Sensitivity  F-measure
SNP    3,189,733  10,168  20,524  0.9968  0.9936       0.9952
Indel  455,215    23,130  26,048  0.9516  0.9459       0.9488

 a. PPV: positive predictive value, precision, TP/(TP+FP).

 b. Sensitivity: true positive rate, TP/(TP+FN).

 c. F-measure: the harmonic mean of precision and sensitivity, 2*TP/(2*TP+FP+FN).
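
The three formulas in the notes can be applied directly to the TP, FP and FN counts in the table. The short sketch below (with a hypothetical function name) reproduces the reported values to four decimal places.

```python
def evaluate(tp, fp, fn):
    """Return (PPV, sensitivity, F-measure) rounded to four decimals."""
    ppv = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    f_measure = 2 * tp / (2 * tp + fp + fn)
    return round(ppv, 4), round(sensitivity, 4), round(f_measure, 4)


print(evaluate(3_189_733, 10_168, 20_524))  # SNP:   (0.9968, 0.9936, 0.9952)
print(evaluate(455_215, 23_130, 26_048))    # Indel: (0.9516, 0.9459, 0.9488)
```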

Circos of variants

CIRCOS plot of variants across the whole genome (from the outside to the inside: (i) chromosomes, (ii) SNP density, (iii) INDEL density, (iv) CNV deletions, (v) CNV duplications and (vi) structural variants).

5.Phasing Statistics

High-quality haplotype assembly results are obtained from the stLFR data with HapCUT2.

Phasing statistics

chr     switch rate  mismatch rate  flat rate  missing rate  phased count  AN50        N50         max block snp frac  phasing rate
1       0.0101       0.0048         0.274      0.0009        176,556       10,335,812  9,395,666   0.1156              0.9954
2       0.0005       0.0003         0.3231     0.0005        179,146       15,956,811  10,450,179  0.1527              0.9963
3       0.0004       0.0003         0.1759     0.0004        149,647       10,121,468  9,445,829   0.107               0.9967
4       0.0105       0.0048         0.2967     0.0006        158,041       17,745,172  17,799,199  0.1331              0.9965
5       0.0098       0.004          0.2735     0.0005        144,073       29,257,525  29,402,537  0.2341              0.9965
6       0.0108       0.0044         0.2778     0.0004        156,463       12,084,355  11,848,071  0.2095              0.9966
7       0.0039       0.0017         0.2481     0.0006        125,839       16,590,963  16,637,924  0.1858              0.9964
8       0.0094       0.0039         0.2713     0.0006        120,389       10,515,273  9,614,379   0.1618              0.9967
9       0.01         0.0048         0.2148     0.0017        102,697       14,486,483  13,793,937  0.1703              0.9943
10      0.0046       0.0025         0.2501     0.0005        113,966       16,063,792  16,124,211  0.17                0.9963
11      0.0109       0.0049         0.2194     0.0005        106,382       13,220,229  12,827,541  0.1509              0.9966
12      0.0003       0.0003         0.2002     0.0004        102,261       11,496,497  8,466,599   0.1766              0.9965
13      0.0006       0.0004         0.3623     0.0003        78,911        27,849,573  27,930,581  0.3384              0.9968
14      0.0005       0.0003         0.1483     0.0011        72,458        7,483,231   4,494,188   0.2227              0.9958
15      0.0009       0.0003         0.269      0.0011        63,196        11,982,019  7,995,680   0.2294              0.995
16      0.0007       0.0004         0.4446     0.0008        75,174        22,472,703  22,524,734  0.4137              0.9967
17      0.0007       0.0004         0.2507     0.0006        56,982        6,892,706   6,906,435   0.1822              0.9958
18      0.0095       0.0047         0.2976     0.0003        62,859        13,105,918  13,142,411  0.2562              0.9969
19      0.0144       0.0065         0.4133     0.0006        52,982        20,934,917  21,011,360  0.4451              0.9963
20      0.0004       0.0002         0.2256     0.0004        49,754        12,073,894  12,120,654  0.413               0.9968
21      0.0005       0.0003         0.2689     0.0014        40,176        14,433,347  14,479,268  0.3819              0.9956
22      0.0111       0.005          0.3713     0.0016        33,850        22,849,327  12,138,969  0.6306              0.9959
X       0.001        0.0005         0.1803     0.0006        71,828        7,022,670   5,134,656   0.1248              0.9946
Y       0            0              0          0.0077        1,804         64,249      0           0.4525              0.955
Genome  0.0057       0.0026         0.2663     0.0007        2,295,434     13,899,410  11,697,688  0.205               0.9962

 a. switch rate: the fraction of switch errors.

 b. mismatch rate: the fraction of mismatch errors.

 c. flat rate: the fraction of flat errors.

 d. missing rate: the fraction of missing errors.

 e. phased count: counts of total SNVs phased in the test haplotype.

 f. AN50: the AN50 (adjusted N50) length of the phasing blocks, a measure of haplotype completeness.

 g. N50: the N50 length of the phasing blocks, a measure of haplotype completeness.

 h. max block snp frac: the fraction of SNVs in the largest (most variants phased) block.

 i. phasing rate: the fraction of SNVs in all blocks.
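
For readers unfamiliar with the N50 statistic, the sketch below shows the generic definition applied to a list of phasing block lengths: the length of the block at which the cumulative length, taken from the longest block down, first reaches half of the total. This is the textbook definition with a hypothetical function name and toy data. The values in the table come from the pipeline's own statistics and may use a different denominator, and AN50 additionally adjusts each block by the fraction of variants it phases, which is not reproduced here.

```python
def n50(block_lengths):
    """Generic N50 of a collection of block lengths (0 for an empty list)."""
    lengths = sorted(block_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


# Toy block lengths in bp (not data from this sample).
print(n50([40, 30, 20, 10]))  # 30
```

The per-chromosome phased counts in the table can also be cross-checked: they sum to the Genome row's 2,295,434.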

Phasing distribution

The haplotype phasing plot (each line shows one chromosome; white shows regions with no phasing block, whereas grey and dark blue show phasing blocks).