Statistical calculations of mixed DNA samples

12/17/2023

DNA sequencing can discover not only single-base variants but also copy-number alterations (CNAs). In shotgun sequencing, regions of CNAs show stepwise changes in read depth when compared to adjacent “normal” regions, allowing their detection by parametric statistical tests that compare the mean coverage in suspected regions against that of a baseline distribution. Traditionally, the power of such a test depends on (1) the integer copy-number change, (2) the overall sequencing depth, (3) the length of the CNA region, (4) the read length, and (5) the variation of coverage along the genome, which depends on many experimental factors, including whether the chosen platform is whole-genome, whole-exome, or targeted-panel sequencing. In cases involving inadvertent sample mixing or genuine somatic mosaicism, power also depends on the mixing ratio. However, the analysis of statistical power that considers the interplay of all these factors has not been systematically developed.

Here we present a general analytical framework and a series of simulations that explore situations from the simplest to the increasingly multifactorial. Specifically, we expand the expression of power to include not just the known factors but also one or both of two complications: (1) the dispersion of read depth around the mean beyond the independent sampling-by-sequencing assumption, and (2) the reduced fraction of the CNA-bearing sample (“purity”), as seen in studies of intratumor heterogeneity or in clinical monitoring of minimal residual disease. We describe the analytical formulas and their simplifications in special cases, and share extendable scripts so that others can perform customized power analysis using study-specific parameters. As study designs vary and technologies continue to evolve, the input data and the noise characteristics will change depending on the practical situation. We present two use cases commonly encountered in cancer research: ultra-shallow whole-genome sequencing for detecting large, chromosome-scale events, and targeted ultra-deep sequencing for surveillance of known CNAs in rare tumor clones, for sensitive detection of cancer relapse or metastasis.

A common approach for CNA detection using genomic data relies on the telltale stepwise change of apparent DNA “dosage” in a genomic region (Olshen et al., 2004). For data from microarray-based comparative genomic hybridization (arrayCGH) or single nucleotide polymorphism (SNP) genotyping, the raw signal for DNA dosage is the total hybridization intensity at the individual locus, which is either the arrayCGH probe location or the SNP site.
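As a rough illustration of how copy-number change, depth, region length, read length, dispersion and purity can combine into a single power figure, here is a minimal Python sketch. It is not the authors' formula or script. It assumes a normal approximation to the read count in the region, a baseline that is known exactly, a diploid background, and a single variance inflation factor (called dispersion here, with 1.0 meaning Poisson sampling). The function name cna_power and all of its parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def cna_power(delta_copies, depth, region_len, read_len,
              purity=1.0, dispersion=1.0, alpha=0.05, ploidy=2):
    """Approximate power of a two-sided z-test on the read count in a region.

    Assumptions (illustrative only, not taken from the source text):
    - read counts are approximately normal with variance = dispersion * mean
    - the baseline (copy-neutral) expectation is known without error
    - depth is the mean per-base coverage, so the expected number of reads
      in the region is depth * region_len / read_len
    """
    # Expected read count in the region when no CNA is present.
    n0 = depth * region_len / read_len

    # A CNA present in a fraction `purity` of cells shifts the mean by this ratio.
    ratio = 1.0 + purity * delta_copies / ploidy
    if ratio <= 0:
        raise ValueError("expected depth ratio must be positive for this approximation")
    n1 = n0 * ratio

    # Rejection thresholds are set under the null distribution.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    sd0 = sqrt(dispersion * n0)
    lower, upper = n0 - z * sd0, n0 + z * sd0

    # Power is the probability of landing outside the thresholds under the alternative.
    alt = NormalDist(n1, sqrt(dispersion * n1))
    return alt.cdf(lower) + (1.0 - alt.cdf(upper))


if __name__ == "__main__":
    # Single-copy gain over 1 Mb at 0.1x depth with 100 bp reads, at several purities.
    for purity in (1.0, 0.5, 0.1, 0.05):
        p = cna_power(1, depth=0.1, region_len=1_000_000, read_len=100, purity=purity)
        print(f"purity={purity:<4} power={p:.3f}")
```

Under these assumptions, power falls as purity drops or as dispersion rises, and it recovers with more depth or a longer region. With no copy-number change, the function returns the significance level alpha.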
