Bibliographic Reference
Tarabichi, M., Salcedo, A., Deshwar, A. G., Ni Leathlobhair, M., Wintersinger, J., Wedge, D. C., Van Loo, P., Morris, Q. D., & Boutros, P. C. (2021). A practical guide to cancer subclonal reconstruction from DNA sequencing. Nature Methods, 18(2), 144–155. https://doi.org/10.1038/s41592-020-01013-2
Core Argument
Subclonal reconstruction from bulk tumor DNA sequencing has become a pillar of cancer evolution studies, but the computational approaches involved carry complex assumptions and uncertainties at each step. This paper is a pragmatic guide for the growing community of users of subclonal reconstruction methods. It outlines those steps (from study design through CNA reconstruction, SNV clustering, and phylogenetic inference), identifies their limitations, and recommends best practices for analysis and quality assessment.
Methods
This is a perspective/review article (not an original data study). The authors synthesize knowledge from the subclonal reconstruction literature and their collective experience to outline the standard computational workflow: (1) somatic mutation calling from aligned reads, (2) copy number alteration (CNA) reconstruction using logR and B-allele frequency (BAF), (3) translation of SNV variant allele frequencies (VAFs) to cellular prevalence (CP) or cancer cell fraction (CCF) using copy number and purity estimates, (4) clustering of SNVs by CCF, and (5) phylogenetic reconstruction of clone trees. The paper covers both single-sample and multisample designs and includes a checklist of best practices (Table 1) and a worked example in the Supplementary Note.
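Step (3) of the workflow above can be illustrated with a minimal Python sketch. It assumes the standard mixture relationship in which normal cells carry two copies of the locus and none of the mutation; the function and parameter names (vaf_to_ccf, ccf_to_cp, tumor_total_cn, multiplicity) are illustrative and not taken from the paper or from any specific tool.

```python
def vaf_to_ccf(vaf, purity, tumor_total_cn, multiplicity=1, normal_total_cn=2):
    """Convert a variant allele frequency (VAF) to a cancer cell fraction (CCF).

    Assumes a simple two-population mixture: tumor cells with total copy
    number `tumor_total_cn` at the locus, and normal cells with
    `normal_total_cn` copies that do not carry the mutation.
    `multiplicity` is the number of mutated copies per mutated tumor cell.
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    if multiplicity < 1:
        raise ValueError("multiplicity must be at least 1")
    # Average number of copies of the locus per cell in the sequenced mixture.
    avg_copies = purity * tumor_total_cn + (1 - purity) * normal_total_cn
    # VAF * avg_copies is the average number of mutated copies per cell;
    # dividing by (purity * multiplicity) rescales it to the fraction of
    # tumor cells that carry the mutation.
    return vaf * avg_copies / (purity * multiplicity)


def ccf_to_cp(ccf, purity):
    """Cellular prevalence: fraction of all cells (tumor and normal) carrying the SNV."""
    return ccf * purity
```

Under these assumptions, a clonal heterozygous SNV in a diploid region of an 80% pure sample is expected at a VAF of about 0.4, which maps back to a CCF of about 1 and a CP of about 0.8. In practice the multiplicity is itself estimated, and read-sampling noise can push raw CCF estimates above 1.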
Key Findings
- Single-sample sequencing systematically underestimates the number of subclones and can mistake locally dominant subclones for clonal populations (the “illusion of clonality”); multiregion sequencing improves subclone resolution and enables phylogenetic inference.
- Sequencing at high depth (>60x) on high-purity samples increases detection power for minor subclones; the number of reads per tumor chromosomal copy (NRPCC) is a useful composite metric for evaluating sequencing depth given tumor purity and ploidy.
- CNA reconstruction carries intrinsic ambiguity — particularly around whole-genome duplication (WGD) and purity/ploidy estimation — and the authors recommend reviewing multiple solutions, using experimental ploidy validation where possible, and restricting subclonal reconstruction to SNVs in regions of normal or clonal copy number.
- SNV clustering methods rely on the “weak parsimony” assumption (most detectable SNVs belong to a small number of subclonal lineages) and the “infinite sites” assumption (each genomic position mutates at most once); binomial or beta-binomial noise models capture read-sampling noise better than fixed-variance Gaussian models.
- Phylogenetic inference is generally advisable only for multisample data; single-sample CPs are often consistent with a linear phylogeny, and the authors recommend that methods report uncertainty in the inferred tree.
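The NRPCC metric mentioned in the findings above can be sketched as follows. The formula used here (depth scaled by purity and divided by the average number of chromosomal copies per cell in the mixture) is one common way of writing it and should be checked against the paper before use; the function name and the diploid-normal default are assumptions made for illustration.

```python
def nrpcc(depth, purity, tumor_ploidy, normal_ploidy=2.0):
    """Approximate number of reads per tumor chromosomal copy (NRPCC).

    depth        : mean sequencing coverage of the sample
    purity       : fraction of tumor cells in the sample, in (0, 1]
    tumor_ploidy : average copy number of the tumor genome
    normal_ploidy: average copy number of admixed normal cells (assumed 2)
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    if depth < 0 or tumor_ploidy <= 0:
        raise ValueError("depth must be non-negative and ploidy positive")
    # Average number of chromosomal copies per cell in the tumor/normal mixture.
    avg_copies_per_cell = purity * tumor_ploidy + (1 - purity) * normal_ploidy
    # Reads are shared across all copies; only a `purity` share of cells is tumor.
    return depth * purity / avg_copies_per_cell
```

For example, under this formula a 60x diploid tumor at 80% purity gives an NRPCC of about 24, whereas a 60x tetraploid tumor at 50% purity gives about 10, which shows why nominal depth alone can overstate the power to detect minor subclones.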
Concepts Introduced or Used
- subclonal-reconstruction
- cellular-prevalence
- cancer-cell-fraction
- variant-allele-frequency
- copy-number-alteration
- illusion-of-clonality
- sufficiency-of-subclonality
- weak-parsimony-assumption
- infinite-sites-hypothesis
- pigeonhole-principle
- crossing-rule
- NRPCC
- whole-genome-duplication
- logR
- B-allele-frequency
- SNV-clustering
- clone-tree
- multiregion-sequencing
- superclonal-cluster
- Dollo-process
- tumor-purity
- multiplicity-of-mutation
- branching-clones
- linear-clones
Entities Referenced
- Methods/tools: PhyloWGS, PyClone, DPClust, Battenberg, LICHEe, CloneHD, THetA, TITAN, SciClone, PairClone, ddClone, SPhyR, TrAp, ReMixT, HMMcopy
- Algorithms: circular binary segmentation, piecewise constant fitting, hidden Markov models, Dirichlet process clustering
- Consortia/projects: Pan-Cancer Analysis of Whole Genomes (PCAWG), TRACERx, TRACERx Renal, 100,000 Genomes Project
- Sequencing technologies: whole-genome sequencing (WGS), whole-exome sequencing (WES), targeted gene panels, single-cell WGS, long-read sequencing, liquid biopsy (ctDNA)
Limitations
- Single-sample sequencing systematically underestimates subclone number, and even multiregion bulk sequencing still misses many subclones, particularly when subclones are not evenly distributed across the tumor mass.
- CNA reconstruction carries intrinsic ambiguity: any CNA reconstruction is equivalent to another with each copy number doubled and purity lowered (the WGD ambiguity), and subclonal copy number states cannot be robustly inferred for more than two states within a single region because only two informative inputs (BAF and logR) are available.
- The weak parsimony assumption is somewhat controversial, as the lowest-VAF cluster may contain a mixture of SNVs from numerous parallel lineages growing neutrally.
- The infinite sites assumption is occasionally violated (parallel acquisition of driver SNVs, triallelic loci, subclonal chromosomal losses), though violations are rare enough that they do not affect clustering based on hundreds to thousands of SNVs.
- Formalin-fixed, paraffin-embedded (FFPE) samples introduce artifacts; fresh frozen tissue is recommended for subclonal reconstruction.
- Low mutation burden (e.g., low coding substitution rate in exome studies) can produce insufficient data for accurate subclonal reconstruction.
- Mechanistic models and approaches remain in their infancy, and a thorough assessment of methods — particularly for subclonal CNA detection — is still needed.
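The WGD ambiguity listed above can be checked numerically. The sketch below assumes the usual tumor/normal mixture model for expected BAF and logR, with the B allele taken as the minor allele and normal cells carrying one copy of each allele. The purity mapping purity / (2 - purity) is derived here from that model rather than quoted from the paper, and all function names are illustrative.

```python
import math


def expected_baf(purity, n_major, n_minor):
    """Expected B-allele frequency of a segment (B taken as the minor allele)."""
    b_copies = purity * n_minor + (1 - purity) * 1
    total_copies = purity * (n_major + n_minor) + (1 - purity) * 2
    return b_copies / total_copies


def expected_logr(purity, n_major, n_minor, tumor_ploidy):
    """Expected log2 ratio of segment coverage to genome-wide average coverage."""
    segment_copies = purity * (n_major + n_minor) + (1 - purity) * 2
    average_copies = purity * tumor_ploidy + (1 - purity) * 2
    return math.log2(segment_copies / average_copies)


def doubled_solution(purity, n_major, n_minor, tumor_ploidy):
    """Return the alternative fit with every copy number doubled.

    The lower purity that keeps BAF and logR unchanged under this model
    is purity / (2 - purity).
    """
    return purity / (2 - purity), 2 * n_major, 2 * n_minor, 2 * tumor_ploidy


# Illustrative segment: 2+1 copies in a tumor of ploidy 2.5 at 60% purity.
original = (0.6, 2, 1, 2.5)
doubled = doubled_solution(*original)
same_baf = math.isclose(expected_baf(*original[:3]), expected_baf(*doubled[:3]))
same_logr = math.isclose(expected_logr(*original), expected_logr(*doubled))
```

Both comparisons hold in this model, so BAF and logR alone cannot distinguish the two fits. This is consistent with the recommendation to review multiple solutions and to validate ploidy experimentally where possible.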
Relevance to Clonal Evolution
This paper serves as a foundational methodological reference for the field of clonal evolution in cancer. It codifies the standard computational pipeline by which tumor subclonal architecture is reconstructed from sequencing data and explicitly surfaces the assumptions and uncertainties that downstream evolutionary inferences (selection, drift, phylogeny, metastatic seeding) depend on. As such, it is essential reading for interpreting any study that draws evolutionary conclusions from bulk tumor sequencing data.