stLFR resequencing analysis report

stLFR-reSeq V2.0.0.0

1. Fastq product

Generation of stLFR resequencing data starts from a high-molecular-weight (HMW) gDNA sample. A library is constructed from the HMW gDNA with the MGIEasy stLFR Library Prep Kit and then sequenced on a DNBSEQ platform manufactured by MGI, which produces the stLFR fastq files. After low-quality reads are removed with SOAPnuke, the clean data of the stLFR sample are obtained. The base distribution, quality distribution, Q20 and Q30 of the clean stLFR data are shown below.

Sample

V300022071A

Fastq report

Sample             V300022071A
Raw reads          1,685,759,388
Raw bases(bp)      168,575,938,800
Clean reads        1,348,332,758
Clean bases(bp)    134,833,275,800
Q20(%)             97.01
Q30(%)             88.65
Total depth(X)     56.19

 a. Clean fastq is built by removing low-quality reads from the raw fastq. A read pair is defined as low-quality when (1) it contains adapter sequences, (2) its N ratio is > 1%, or (3) more than 10% of its bases have a quality lower than 10.

 b. Q20 and Q30 are the proportions of bases in the clean fastq whose quality is > 20 or > 30, respectively.

 c. Total depth is calculated from the raw fastq.
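The filter criteria in note a and the definitions in notes b and c can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not SOAPnuke's implementation: the function names, the plain substring adapter check, and the rule that a pair is dropped when either read fails are all assumptions, and reads are assumed to be non-empty. The thresholds are the ones given in the notes.

```python
# Illustrative sketch only; SOAPnuke's actual logic and options may differ.

def is_low_quality(seq, quals, adapter, max_n_ratio=0.01,
                   low_qual=10, max_low_qual_ratio=0.10):
    """Return True if a single read fails any criterion from note a.

    seq     -- read bases as a string
    quals   -- list of integer base qualities, one per base
    adapter -- adapter sequence to search for (hypothetical placeholder)
    """
    if adapter and adapter in seq:                   # (1) adapter present
        return True
    if seq.count("N") / len(seq) > max_n_ratio:      # (2) N ratio > 1%
        return True
    n_low = sum(1 for q in quals if q < low_qual)    # bases with quality below 10
    return n_low / len(quals) > max_low_qual_ratio   # (3) low-quality ratio > 10%


def keep_pair(read1, read2, adapter):
    """Keep a pair only if neither read is low-quality (an assumption)."""
    return not (is_low_quality(*read1, adapter)
                or is_low_quality(*read2, adapter))


def q_fraction(all_quals, threshold):
    """Percentage of bases whose quality is above `threshold` (note b)."""
    total = sum(len(q) for q in all_quals)
    passed = sum(1 for q in all_quals for x in q if x > threshold)
    return 100.0 * passed / total


def total_depth(raw_bases, genome_size):
    """Raw bases divided by reference length (note c)."""
    return raw_bases / genome_size
```

The report does not state the reference length. Assuming a reference of 3.0 Gb, `total_depth(168_575_938_800, 3_000_000_000)` gives about 56.19, which agrees with the Total depth(X) value in the table.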

Base distribution

Base distribution (x-axis is the sequencing cycle, y-axis is the proportion of each of the five bases at that cycle).

Base quality distribution

Quality distribution (x-axis is the cycle along the reads, y-axis is the base quality; the heatmap shows the distribution of quality values at each cycle).

2. stLFR long-fragment product

Long-fragment reads are a defining feature of stLFR technology and are built by the virtual co-barcoding method. Their performance is shown by analyzing the relationship between long fragments and barcodes, the long-fragment length distribution and the long-fragment coverage.

Fragment & Barcode report

Sample                                V300022071A
Total barcode type                    3,623,878,656
Barcode number                        59,735,965
Barcode type rate                     1.65%
Reads pair number                     842,879,694
Reads pair number(after split)        713,953,863
Barcode split rate                    84.7%
Split barcode(G)                      142.79

 a. The barcodes of stLFR data link reads to their long fragments. The barcodes must be filtered and split before any further analysis; after this step, each read is assigned a barcode ID. The table above shows the barcode split statistics.
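The ratios in the Fragment & Barcode report can be checked with a few lines of arithmetic. The sketch below is only a consistency check on the table values: the report does not define the rates, so the formulas are inferred from the numbers, and the 200 bp per read pair is an assumption based on the Fastq report (raw bases divided by raw reads gives 100 bp per read).

```python
# Values copied from the Fragment & Barcode report table.
total_barcode_types = 3_623_878_656
barcode_number = 59_735_965
read_pairs = 842_879_694
read_pairs_after_split = 713_953_863

# Assumption: 100 bp per read, so one read pair contributes 200 bp.
bases_per_pair = 200

# Inferred definitions (not stated in the report).
barcode_type_rate = 100.0 * barcode_number / total_barcode_types
barcode_split_rate = 100.0 * read_pairs_after_split / read_pairs
split_barcode_gb = read_pairs_after_split * bases_per_pair / 1e9

print(f"Barcode type rate:  {barcode_type_rate:.2f}%")
print(f"Barcode split rate: {barcode_split_rate:.1f}%")
print(f"Split barcode (G):  {split_barcode_gb:.2f}")
```

Under these assumptions the three computed values round to 1.65%, 84.7% and 142.79, matching the table.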

Fragment barcode distribution

Long fragments vs. barcodes in stLFR data (x-axis is the average number of long-fragment DNA molecules captured per barcode bead, y-axis is the density of barcodes).

Fragment length distribution

Length distribution of long fragments in stLFR data (x-axis is the length, y-axis is the number of long fragments of that length).

Fragment coverage distribution

Coverage of long fragments in stLFR data (x-axis is the long-fragment coverage, y-axis is the density of long fragments at that coverage).

3. Alignment Statistics

Clean stLFR data are aligned to the reference genome with BWA or MegaBOLT. Alignment performance is evaluated by mapping rate, coverage, insert size distribution, etc.

Alignment report

Sample                           V300022071A
Mapped reads                     1,346,475,379
Mapped bases(bp)                 134,647,537,900
Mapping rate                     99.86%
Paired mapping rate              99.33%
Mismatch bases(bp)               746,868,575
Mismatch rate                    0.55%
Duplicate reads                  314,482,226
Duplicate rate                   23.32%
Dup depth(X)                     26.822664
Average sequencing depth(X)      34.98
Coverage(≥1X)                    99.13%
Coverage(≥4X)                    98.91%
Coverage(≥10X)                   98.34%
Coverage(≥20X)                   93.84%
Mean insert size(bp)             228.31

 a. The statistics shown here are calculated from clean data after the low-quality filter and the barcode-split filter.

 b. The mapping rate and paired mapping rate are the proportions of reads that are mapped or pair-mapped to the reference genome.

 c. Mismatches are bases that do not match between a read and the reference genome.

 d. Duplication refers to duplicated reads introduced by PCR or other processes.

 e. The average sequencing depth is calculated from mapped reads with duplicate reads excluded.

 f. The coverage is the proportion of the reference genome covered at a depth of at least 1X, 4X, 10X or 20X.

 g. The mean insert size is calculated from paired mapped reads.
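The rates and coverage thresholds in the alignment report can be illustrated with a short Python sketch. The denominators used here are inferred from the table values rather than stated in the report, so treat them as assumptions; `coverage_at` is a hypothetical helper that takes a per-position depth list, shown on toy data rather than on the real genome.

```python
def percent(part, whole):
    """Express part/whole as a percentage."""
    return 100.0 * part / whole


# Values copied from the Fastq report and Alignment report tables.
clean_reads = 1_348_332_758
mapped_reads = 1_346_475_379
mapped_bases = 134_647_537_900
mismatch_bases = 746_868_575
duplicate_reads = 314_482_226

# Inferred definitions (assumptions, chosen because they reproduce the table).
mapping_rate = percent(mapped_reads, clean_reads)
mismatch_rate = percent(mismatch_bases, mapped_bases)
duplicate_rate = percent(duplicate_reads, clean_reads)


def coverage_at(depths, thresholds=(1, 4, 10, 20)):
    """Percentage of positions whose depth is at least each threshold."""
    n = len(depths)
    return {t: percent(sum(1 for d in depths if d >= t), n)
            for t in thresholds}


# Toy per-position depths, for illustration only.
toy_depths = [0, 3, 5, 12, 25, 30, 8, 1, 22, 40]
print(coverage_at(toy_depths))
```

With these assumed denominators the three rates round to 99.86%, 0.55% and 23.32%, in line with the table. On real data, the depth list would come from a per-base depth file produced by the alignment step.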

Insert size distribution

Insert size distribution (x-axis is the insert size, y-axis is the number of read pairs with that insert size).

Depth accumulative distribution

Cumulative sequencing depth distribution (x-axis is the depth, y-axis is the proportion of the reference genome at or above that depth).

Depth distribution

Sequencing depth distribution (x-axis is the depth, y-axis is the proportion of the reference genome at that depth).
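The data behind the depth distribution and the cumulative depth distribution can be derived from per-position depths along the following lines. This is a sketch on toy data, not the pipeline's own code; both function names are hypothetical.

```python
from collections import Counter


def depth_distribution(depths):
    """Map each depth to the fraction of positions at exactly that depth."""
    n = len(depths)
    counts = Counter(depths)
    return {d: counts[d] / n for d in sorted(counts)}


def cumulative_distribution(depths):
    """Map each depth d to the fraction of positions with depth >= d."""
    n = len(depths)
    counts = Counter(depths)
    cumulative, running = {}, 0
    # Walk from the highest depth down, accumulating position counts.
    for d in sorted(counts, reverse=True):
        running += counts[d]
        cumulative[d] = running / n
    return dict(sorted(cumulative.items()))


# Toy per-position depths, for illustration only.
toy_depths = [0, 1, 1, 2]
print(depth_distribution(toy_depths))
print(cumulative_distribution(toy_depths))
```

The cumulative value at depth 1 is the same quantity reported as Coverage(≥1X) in the alignment table, expressed as a fraction instead of a percentage.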

GC bias distribution