1.Fastq product

stLFR resequencing data are generated from a high-molecular-weight (HMW) gDNA sample. A library is constructed from the HMW gDNA with the MGIEasy stLFR Library Prep Kit and then sequenced on a DNBSEQ platform manufactured by MGI, which yields the stLFR fastq files. Low-quality reads are then removed with SOAPnuke to obtain the clean data of the stLFR sample. The base distribution, quality distribution, Q20 and Q30 of the clean stLFR data are shown below.

Sample

V300022071A

Fastq report

Sample	V300022071A
Raw reads	1,685,759,388
Raw bases(bp)	168,575,938,800
Clean reads	1,348,332,758
Clean bases(bp)	134,833,275,800
Q20(%)	97.01
Q30(%)	88.65
Total depth(X)	56.19

a. The clean fastq is built by removing low-quality reads from the raw fastq. A read pair is defined as low-quality when (1) it contains adapter sequences, (2) its N ratio is > 1%, or (3) more than 10% of its bases have a base quality lower than 10.

b. Q20 and Q30 are the proportions of bases in the clean fastq whose quality is > 20 or > 30, respectively.

c. Total depth is calculated from the raw fastq.
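The filter criteria in note a and the Q20/Q30 definition in note b can be illustrated with a minimal Python sketch. This is not the SOAPnuke implementation: the function names are hypothetical, adapter detection (criterion 1) is omitted, and the rule that a pair is dropped when either mate fails is an assumption, since the report does not say whether the ratios apply per read or per pair.

```python
def is_low_quality(seq, quals, max_n_ratio=0.01, low_q=10, max_low_q_ratio=0.10):
    """Return True if one read fails criterion (2) or (3) of note a.

    seq   -- base calls as a string
    quals -- list of integer base qualities, same length as seq
    Adapter detection, criterion (1), is not covered by this sketch.
    """
    n_ratio = seq.upper().count("N") / len(seq)
    low_q_ratio = sum(q < low_q for q in quals) / len(quals)
    return n_ratio > max_n_ratio or low_q_ratio > max_low_q_ratio


def pair_is_low_quality(read1, read2):
    """Assumption: a pair is discarded when either mate is low-quality."""
    return is_low_quality(*read1) or is_low_quality(*read2)


def q_fraction(quals, threshold):
    """Fraction of bases whose quality exceeds threshold (note b uses '>')."""
    return sum(q > threshold for q in quals) / len(quals)
```

Calling q_fraction(quals, 20) and q_fraction(quals, 30) over all clean bases would give values comparable to the Q20 and Q30 rows of the table.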

Base distribution

Base distribution (x-axis is the sequencing cycle, y-axis is the proportion of each of the five base calls (A, C, G, T and N) at a given cycle).

Base quality distribution

Quality distribution (x-axis is the cycle along the reads, y-axis is the base quality, shown as a heatmap for each cycle).

2.stLFR long-fragment product

Long-fragment reads are a feature of stLFR technology, built by the virtual co-barcoding method. Their performance is shown by analyzing the relationship between long fragments and barcodes, the long-fragment length distribution and the long-fragment coverage.

Fragment & Barcode report

Sample	V300022071A
Total barcode type	3,623,878,656
Barcode number	59,735,965
Barcode type rate	1.65%
Reads pair number	842,879,694
Reads pair number(after split)	713,953,863
Barcode split rate	84.7%
Split barcode(G)	142.79

a. The barcodes of stLFR data are associated with the long-fragment reads. The barcodes must be filtered and split before downstream analysis; after this step, each read is assigned a barcode ID. The table above shows the barcode split statistics.
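The derived rows of the Fragment & Barcode report can be reproduced from the count rows. The sketch below is a consistency check rather than the pipeline's own code: the formulas are inferred because they match the reported values, the read length of 100 bp is inferred from raw bases divided by raw reads in the Fastq report, and reading "Split barcode(G)" as gigabases of split data is an assumption.

```python
# Counts copied from the Fragment & Barcode report
total_barcode_types = 3_623_878_656
barcode_number = 59_735_965
read_pairs = 842_879_694
read_pairs_after_split = 713_953_863

# Assumption: 100 bp per read (raw bases / raw reads in the Fastq report)
read_length = 100

barcode_type_rate = barcode_number / total_barcode_types
barcode_split_rate = read_pairs_after_split / read_pairs
# Assumption: "Split barcode(G)" is the base count of split pairs in Gb
split_bases_gb = read_pairs_after_split * 2 * read_length / 1e9

print(f"Barcode type rate:  {barcode_type_rate:.2%}")
print(f"Barcode split rate: {barcode_split_rate:.1%}")
print(f"Split barcode (G):  {split_bases_gb:.2f}")
```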

Fragment barcode distribution

Long-fragment vs. barcode in stLFR data (x-axis is the average number of long-fragment DNA molecules captured per barcode bead, y-axis is the density of barcodes).

Fragment length distribution

Length distribution of long fragments in stLFR data (x-axis is the fragment length, y-axis is the count of long fragments of a given length).

Fragment coverage distribution

Coverage of long fragments in stLFR data (x-axis is the long-fragment coverage, y-axis is the density of long fragments at a given coverage).

3.Alignment Statistics

Clean stLFR data are aligned to the reference genome with BWA or MegaBOLT. Alignment performance is evaluated by mapping rate, coverage, insert size distribution, etc.

Alignment report

Sample	V300022071A
Mapped reads	1,346,475,379
Mapped bases(bp)	134,647,537,900
Mapping rate	99.86%
Paired mapping rate	99.33%
Mismatch bases(bp)	746,868,575
Mismatch rate	0.55%
Duplicate reads	314,482,226
Duplicate rate	23.32%
Dup depth(X)	26.822664
Average sequencing depth(X)	34.98
Coverage(≥1X)	99.13%
Coverage(≥4X)	98.91%
Coverage(≥10X)	98.34%
Coverage(≥20X)	93.84%
Mean insert size(bp)	228.31

a. The statistics shown here are calculated from the clean data after the low-quality filter and the barcode-split filter.

b. The mapping rate and paired mapping rate are the proportions of reads that are mapped or paired-mapped to the reference genome.

c. Mismatch bases are aligned bases that differ from the reference genome.

d. Duplicate reads are duplicated reads introduced by PCR or other processes.

e. The average sequencing depth is calculated from mapped reads after removing duplicate reads.

f. Coverage is the proportion of the reference genome covered by at least 1, 4, 10 or 20 reads.

g. The mean insert size is calculated from the paired mapped reads.
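The rate rows of the Alignment report can be checked against the count rows. In the sketch below the denominators are inferred rather than documented: clean reads (from the Fastq report) for the mapping and duplicate rates, and mapped bases for the mismatch rate, chosen because they reproduce the reported percentages.

```python
# Counts copied from the Fastq and Alignment reports
clean_reads = 1_348_332_758
mapped_reads = 1_346_475_379
mapped_bases = 134_647_537_900
mismatch_bases = 746_868_575
duplicate_reads = 314_482_226

# Assumed denominators (they reproduce the reported values)
mapping_rate = mapped_reads / clean_reads
mismatch_rate = mismatch_bases / mapped_bases
duplicate_rate = duplicate_reads / clean_reads

print(f"Mapping rate:   {mapping_rate:.2%}")
print(f"Mismatch rate:  {mismatch_rate:.2%}")
print(f"Duplicate rate: {duplicate_rate:.2%}")
```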

Insert size distribution

Insert size distribution (x-axis is insert size, y-axis is the count of paired reads at a certain insert size).

Depth accumulative distribution

Cumulative sequencing depth distribution (x-axis is the depth, y-axis is the proportion of the reference genome covered at or above that depth).
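The relationship between a per-depth histogram and the cumulative proportions (including the Coverage(≥1X), (≥4X), (≥10X) and (≥20X) rows of the Alignment report) can be sketched as follows. The function name and the toy histogram are hypothetical; the report does not describe how the pipeline stores depth counts.

```python
def coverage_at_least(depth_hist, thresholds=(1, 4, 10, 20)):
    """Proportion of reference positions with depth >= each threshold.

    depth_hist maps a depth value to the number of reference positions
    observed at exactly that depth (depth 0 included).
    """
    total = sum(depth_hist.values())
    return {
        t: sum(n for depth, n in depth_hist.items() if depth >= t) / total
        for t in thresholds
    }


# Toy histogram: 100 positions in total (illustrative values only)
toy_hist = {0: 10, 1: 20, 5: 30, 12: 25, 30: 15}
toy_coverage = coverage_at_least(toy_hist)
```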

Depth distribution

Sequencing depth distribution (x-axis is the depth, y-axis is the proportion of reference genome at a certain depth).

GC bias distribution

GC-bias distribution (x-axis shows GC content; the y-axes plot NORMALIZED_COVERAGE, the distribution of WINDOWS across GC percentages, and the base quality for each GC content bin).

4.Variant Statistics

The variants in stLFR data are called with GATK or MegaBOLT and with BGI's self-developed CNV/SV calling software, and visualized with CIRCOS.

Variant statistics

Sample	V300022071A
Total_SNP	3,976,683
dbSNP_rate	93.19%
Novel_SNP	270,752
Novel_SNP_Rate	6.81%
Ti/Tv	1.95
Total_INDEL	913,668
dbINDEL_Rate	73.8%
CNV deletion	568
CNV duplication	36
SV DEL	28
SV DUP	6
SV INV	7
SV TRA	0

a. SNP: single nucleotide polymorphism.

b. Ti/Tv: the ratio of transition (Ti) to transversion (Tv) SNPs.

c. INDEL: insertion and deletion.

d. CNV: copy number variant.

e. SV: structural variant.

f. DEL: deletion.

g. DUP: duplication.

h. INV: inversion.

i. TRA: translocation.
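The Ti/Tv definition in note b can be sketched in a few lines. The helper names are hypothetical, and this is a sketch of the standard definition, not the pipeline's own code. The last lines check the Novel_SNP_Rate row, assuming it is Novel_SNP divided by Total_SNP, which reproduces the reported value.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_ratio(snps):
    """Ti/Tv for a list of (ref, alt) pairs; assumes every pair is a valid SNP."""
    ti = sum(is_transition(ref, alt) for ref, alt in snps)
    tv = len(snps) - ti
    return ti / tv


# Toy input: three transitions and two transversions
toy_ratio = ti_tv_ratio([("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T")])

# Assumed formula for the Novel_SNP_Rate row
total_snp = 3_976_683
novel_snp = 270_752
novel_snp_rate = novel_snp / total_snp
```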

Variant evaluation

	TP	FP	FN	PPV	Sensitivity	F-measure
SNP	3,189,733	10,168	20,524	0.9968	0.9936	0.9952
Indel	455,215	23,130	26,048	0.9516	0.9459	0.9488

a. PPV: positive predictive value, precision, TP/(TP+FP).

b. Sensitivity: true positive rate, TP/(TP+FN).

c. F-measure: the harmonic mean of precision and sensitivity, 2*TP/(2*TP+FP+FN).
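The three formulas in notes a to c can be applied directly to the TP, FP and FN columns of the Variant evaluation table. The function name below is hypothetical; the formulas are the ones stated in the notes.

```python
def evaluate(tp, fp, fn):
    """Return (PPV, sensitivity, F-measure) from TP, FP and FN counts."""
    ppv = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    f_measure = 2 * tp / (2 * tp + fp + fn)
    return ppv, sensitivity, f_measure


# Counts copied from the Variant evaluation table
snp_metrics = evaluate(tp=3_189_733, fp=10_168, fn=20_524)
indel_metrics = evaluate(tp=455_215, fp=23_130, fn=26_048)

for name, metrics in (("SNP", snp_metrics), ("Indel", indel_metrics)):
    print(name, " ".join(f"{value:.4f}" for value in metrics))
```

Rounded to four decimal places, the results agree with the PPV, Sensitivity and F-measure columns of the table.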

Circos of variants

CIRCOS plot of whole-genome variants (from the outside to the inside: (i) chromosomes, (ii) SNP density, (iii) INDEL density, (iv) CNV deletions, (v) CNV duplications and (vi) structural variants).

5.Phasing Statistics

High-quality haplotype assembly results are obtained from stLFR data with HapCUT2.

Phasing statistics

chr	switch rate	mismatch rate	flat rate	missing rate	phased count	AN50	N50	max block snp frac	phasing rate
1	0.0101	0.0048	0.274	0.0009	176,556	10,335,812	9,395,666	0.1156	0.9954
2	0.0005	0.0003	0.3231	0.0005	179,146	15,956,811	10,450,179	0.1527	0.9963
3	0.0004	0.0003	0.1759	0.0004	149,647	10,121,468	9,445,829	0.107	0.9967
4	0.0105	0.0048	0.2967	0.0006	158,041	17,745,172	17,799,199	0.1331	0.9965
5	0.0098	0.004	0.2735	0.0005	144,073	29,257,525	29,402,537	0.2341	0.9965
6	0.0108	0.0044	0.2778	0.0004	156,463	12,084,355	11,848,071	0.2095	0.9966
7	0.0039	0.0017	0.2481	0.0006	125,839	16,590,963	16,637,924	0.1858	0.9964
8	0.0094	0.0039	0.2713	0.0006	120,389	10,515,273	9,614,379	0.1618	0.9967
9	0.01	0.0048	0.2148	0.0017	102,697	14,486,483	13,793,937	0.1703	0.9943
10	0.0046	0.0025	0.2501	0.0005	113,966	16,063,792	16,124,211	0.17	0.9963
11	0.0109	0.0049	0.2194	0.0005	106,382	13,220,229	12,827,541	0.1509	0.9966
12	0.0003	0.0003	0.2002	0.0004	102,261	11,496,497	8,466,599	0.1766	0.9965
13	0.0006	0.0004	0.3623	0.0003	78,911	27,849,573	27,930,581	0.3384	0.9968
14	0.0005	0.0003	0.1483	0.0011	72,458	7,483,231	4,494,188	0.2227	0.9958
15	0.0009	0.0003	0.269	0.0011	63,196	11,982,019	7,995,680	0.2294	0.995
16	0.0007	0.0004	0.4446	0.0008	75,174	22,472,703	22,524,734	0.4137	0.9967
17	0.0007	0.0004	0.2507	0.0006	56,982	6,892,706	6,906,435	0.1822	0.9958
18	0.0095	0.0047	0.2976	0.0003	62,859	13,105,918	13,142,411	0.2562	0.9969
19	0.0144	0.0065	0.4133	0.0006	52,982	20,934,917	21,011,360	0.4451	0.9963
20	0.0004	0.0002	0.2256	0.0004	49,754	12,073,894	12,120,654	0.413	0.9968
21	0.0005	0.0003	0.2689	0.0014	40,176	14,433,347	14,479,268	0.3819	0.9956
22	0.0111	0.005	0.3713	0.0016	33,850	22,849,327	12,138,969	0.6306	0.9959
X	0.001	0.0005	0.1803	0.0006	71,828	7,022,670	5,134,656	0.1248	0.9946
Y	0	0	0	0.0077	1,804	64,249	0	0.4525	0.955
Genome	0.0057	0.0026	0.2663	0.0007	2,295,434	13,899,410	11,697,688	0.205	0.9962

a. switch rate: the fraction of switch errors.

b. mismatch rate: the fraction of mismatch errors.

c. flat rate: the fraction of flat errors.

d. missing rate: the fraction of missing errors.

e. phased count: the total number of SNVs phased in the test haplotype.

f. AN50: the AN50 length of the phasing blocks, a measure of haplotype completeness.

g. N50: the N50 length of the phasing blocks, a measure of haplotype completeness.

h. max block snp frac: the fraction of SNVs in the largest (most variants phased) block.

i. phasing rate: the fraction of SNVs in all blocks.
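For readers unfamiliar with the N50 metric in note g, the sketch below uses the conventional definition: the block length at which blocks of that length or longer contain at least half of the total phased length. This is an assumption about the general metric, not a reproduction of the HapCUT2 calculation, which may normalize differently. AN50 applies a further adjustment to block spans and is not sketched here. The function name and input values are hypothetical.

```python
def n50(block_lengths):
    """Conventional N50 of a collection of phase-block lengths (in bp).

    Returns 0 for an empty input.
    """
    lengths = sorted(block_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0


# Toy block lengths (illustrative values only): total 54, half 27
toy_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```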

Phasing distribution

The haplotype phasing plot (each line shows one chromosome; white indicates no phasing block, whereas grey and dark blue indicate phasing blocks).

stLFR resequencing analysis report

stLFR-reSeq V2.0.0.0