Bibliographic Reference
Satas, G., Zaccaria, S., El-Kebir, M., & Raphael, B. J. (2021). DeCiFering the elusive cancer cell fraction in tumor heterogeneity and evolution. Cell Systems, 12(10), 1004–1018.e10. https://doi.org/10.1016/j.cels.2021.07.006
DOI: 10.1016/j.cels.2021.07.006
Software: https://github.com/raphael-group/decifer
Core Argument
The cancer cell fraction (CCF) — the proportion of cancer cells in a tumor sample that contain a given single-nucleotide variant (SNV) — is the foundational statistic for quantifying tumor heterogeneity and reconstructing tumor evolution from bulk DNA sequencing data. However, current CCF estimation methods suffer from two major and interconnected limitations. First, nearly all existing methods rely on the constant mutation multiplicity assumption, which posits that every cell harboring an SNV carries the same number of copies of that SNV. This assumption is routinely violated in solid tumors, where copy-number aberrations (CNAs) alter SNV multiplicities across subclones. Second, the CCF itself is the wrong quantity for phylogenetic analysis when SNVs can be lost due to subsequent deletions: two SNVs that arose in the same ancestral cell will have the same CCF only if neither is lost, but a deletion affecting one SNV will give it a misleadingly low CCF that suggests it arose later. These two problems — an unrealistic assumption and a mismatch between the measured quantity and the evolutionary target — jointly undermine downstream analyses including clonal clustering, phylogenetic reconstruction, and mutation-timing inference.
The authors address both limitations by (i) introducing the single split copy number (SSCN) assumption, a less restrictive evolutionary model for SNV–CNA interactions; (ii) defining the descendant cell fraction (DCF), a novel statistic that generalizes the CCF to account for mutation losses by measuring the proportion of cells descended from the ancestral cell where the SNV first arose, regardless of whether the SNV was later deleted; and (iii) developing DeCiFer, an algorithm that simultaneously selects a genotype tree for each SNV (determining its DCF) and clusters SNVs into a small number of groups by their DCF values across multiple samples. DeCiFer occupies a middle ground between fast but restrictive CCF-based methods and comprehensive but computationally intractable phylogenetic methods, scaling to thousands of SNVs while avoiding the constant mutation multiplicity assumption.
Methods
Formal framework. The authors formalize the relationship between sequencing read counts, copy-number proportions, and the unobserved genotype at an SNV locus. A genotype is a triple (x, y, m) of non-negative integers, where (x, y) are the allele-specific copy numbers at the locus and m is the mutation multiplicity (number of copies carrying the SNV), with 0 ≤ m ≤ x + y. The CCF c is defined as the fraction of cancer cells with m ≥ 1. The variant allele frequency (VAF) v is related to the CCF by a linear transformation that depends on the unknown genotype set and the tumor purity.
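As a minimal sketch of these definitions (not code from the paper), the snippet below computes the VAF and the CCF from a table of genotype proportions. The function name `vaf_and_ccf`, the dictionary layout, and the choice to express proportions over all cells with purity passed separately are illustrative assumptions.

```python
def vaf_and_ccf(genotype_props, purity):
    """genotype_props maps (x, y, m) -> proportion of ALL cells (assumed layout).

    purity is the fraction of cells that are cancer cells.
    """
    # Average number of copies of the locus per cell (fractional copy number).
    total_copies = sum(p * (x + y) for (x, y, m), p in genotype_props.items())
    # Average number of mutated copies per cell.
    mutated_copies = sum(p * m for (x, y, m), p in genotype_props.items())
    vaf = mutated_copies / total_copies
    # Cells carrying at least one mutated copy, rescaled to cancer cells only.
    mutated_cells = sum(p for (x, y, m), p in genotype_props.items() if m >= 1)
    ccf = mutated_cells / purity
    return vaf, ccf


# Hypothetical sample: 20% normal cells and 20% unmutated cancer cells share
# genotype (1, 1, 0); the remaining 60% are cancer cells with one mutated copy.
props = {(1, 1, 0): 0.4, (1, 1, 1): 0.6}
vaf, ccf = vaf_and_ccf(props, purity=0.8)  # vaf is about 0.30, ccf about 0.75
```

For a fixed genotype set, both quantities are weighted sums of the same proportions, which is why the VAF and the CCF are linearly related.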
Constant mutation multiplicity assumption. Nearly all existing methods assume that at every SNV locus, all genotypes with m > 0 have the same fixed multiplicity M (i.e., m = M). This reduces ambiguity but is violated by subclonal CNAs and excludes biologically plausible scenarios such as an SNV arising before an amplification, which produces a mixture of cells with 1 copy of the SNV and cells with 2 or more copies.
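To illustrate why this assumption can fail, the sketch below applies the commonly used constant-multiplicity correction c = v·F / (M·ρ), where F is the fractional copy number and ρ the purity, to a hypothetical sample in which an SNV preceded a subclonal amplification. The function name and the sample composition are assumptions made for illustration only.

```python
def ccf_constant_multiplicity(vaf, fractional_cn, purity, multiplicity):
    """CCF under the assumption that every mutated cell has `multiplicity` copies."""
    return vaf * fractional_cn / (multiplicity * purity)


# Hypothetical sample (proportions over all cells):
#   20% normal cells          (1, 1, 0)
#   40% cancer cells          (1, 1, 1)   SNV present, no amplification
#   40% cancer cells          (2, 1, 2)   SNV amplified together with its allele
purity = 0.8
fractional_cn = 0.2 * 2 + 0.4 * 2 + 0.4 * 3      # about 2.4
vaf = (0.4 * 1 + 0.4 * 2) / fractional_cn        # about 0.5

# Every cancer cell carries the SNV, so the true CCF is 1.0.
ccf_m1 = ccf_constant_multiplicity(vaf, fractional_cn, purity, 1)  # about 1.5
ccf_m2 = ccf_constant_multiplicity(vaf, fractional_cn, purity, 2)  # about 0.75
```

Neither choice of M recovers the true value: M = 1 gives an impossible CCF above 1, and M = 2 makes a clonal SNV look subclonal.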
Single split copy number (SSCN) assumption. The authors propose that at every SNV locus, there is exactly one copy-number state (x*, y*) that has two distinct genotypes: (x*, y*, 0) and (x*, y*, m*). This assumption follows from standard evolutionary models: the Dollo model for SNVs (each mutation is introduced exactly once but may later be lost) and the infinite-alleles model for allele-specific copy numbers. Under the SSCN assumption, given VAF v and copy-number proportions μ, the genotype proportions are uniquely determined (Lemma 1), and so is the CCF c (Theorem 1). Feasibility conditions for v are derived (Lemma 3).
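The uniqueness claim can be made concrete with a small sketch derived directly from the definitions above rather than from the paper's proofs. If every copy-number state other than (x*, y*) has a single known multiplicity, the only unknown is the proportion of cells with genotype (x*, y*, m*), and the VAF equation fixes it. The function name, argument layout, and tolerance are hypothetical.

```python
def sscn_split_proportion(vaf, cn_props, mult, split_state, split_mult, eps=1e-9):
    """Proportion of cells with genotype (x*, y*, m*), or None if infeasible.

    cn_props:    maps (x, y) -> proportion of all cells with that copy-number state
    mult:        maps each non-split state (x, y) -> its single multiplicity m
    split_state: the state (x*, y*) that has both m = 0 and m = split_mult
    """
    fractional_cn = sum(p * (x + y) for (x, y), p in cn_props.items())
    # Mutated copies contributed by states whose genotype is already fixed.
    fixed = sum(p * mult[s] for s, p in cn_props.items() if s != split_state)
    # Solve  vaf * F = fixed + split_mult * g  for g.
    g = (vaf * fractional_cn - fixed) / split_mult
    if g < -eps or g > cn_props[split_state] + eps:
        return None  # the observed VAF is not achievable with this genotype set
    return g


# Hypothetical locus: 60% of cells are (1, 1), 40% are (2, 1) with m = 2.
cn_props = {(1, 1): 0.6, (2, 1): 0.4}
mult = {(2, 1): 2}
g = sscn_split_proportion(0.5, cn_props, mult, split_state=(1, 1), split_mult=1)
# g is about 0.4: two thirds of the (1, 1) cells carry the SNV.
```

The range check plays the role of a feasibility condition on v: a VAF that would require a negative proportion, or more mutated cells than the split state contains, rules out that genotype set.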
Descendant cell fraction (DCF). The DCF d is defined as the proportion of cells in a sample that are descendants of the ancestral cell where the mutation was introduced, formally summing over all genotypes that descend from the (x*, y*, m*) node in a genotype tree — a rooted tree describing the evolutionary history of a single SNV locus. The critical insight is that two SNVs that arose in the same cell division will have the same DCF even if one is later deleted, whereas their CCFs will diverge. Theorem 2 gives a closed-form expression for the DCF that parallels the CCF formula but replaces Γ_CCF (genotypes with m ≥ 1 at present) with Γ_DCF (descendant genotypes in the tree).
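The difference between the two statistics can be sketched on a toy genotype tree. The code below is an illustration under stated assumptions, not DeCiFer's implementation: the tree is a hypothetical parent-to-children dictionary, proportions are over all cells, and both statistics are divided by purity to express them as fractions of cancer cells (drop the division to obtain fractions of all cells).

```python
def subtree(children, root):
    """All genotypes reachable from root in the genotype tree, root included."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, []))
    return seen


def ccf_and_dcf(props, children, mutation_node, purity):
    """props maps (x, y, m) -> proportion of all cells (assumed layout)."""
    # CCF: cells that still carry the SNV at sampling time.
    ccf = sum(p for (x, y, m), p in props.items() if m >= 1) / purity
    # DCF: cells descended from the genotype where the SNV was introduced.
    descendants = subtree(children, mutation_node)
    dcf = sum(p for g, p in props.items() if g in descendants) / purity
    return ccf, dcf


# Hypothetical history: the SNV arises in a diploid cell, then a deletion
# removes the mutated allele in a descendant lineage.
children = {(1, 1, 0): [(1, 1, 1)], (1, 1, 1): [(1, 0, 0)]}
props = {(1, 1, 0): 0.2, (1, 1, 1): 0.3, (1, 0, 0): 0.5}  # (1, 1, 0) = normal cells
ccf, dcf = ccf_and_dcf(props, children, mutation_node=(1, 1, 1), purity=0.8)
# ccf is about 0.375, dcf is about 1.0
```

In this example the SNV is ancestral to every cancer cell, which the DCF reports, while the CCF suggests a subclonal mutation because most descendants have lost it.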
DeCiFer algorithm. DeCiFer solves the Probabilistic Mutation Clustering and Genotype Selection Problem: given variant read counts a_i, total read counts t_i, and copy-number proportions μ_i for each SNV i across p samples, and an integer k > 0, find DCF values d_1, …, d_k for k clusters, and for each SNV select a genotype tree and a cluster assignment, maximizing the posterior probability of the observed read counts. The algorithm uses coordinate ascent: (1) given current cluster DCFs, assign each SNV to the best genotype tree and cluster; (2) given assignments, optimize cluster DCFs using Brent’s algorithm. Multiple restarts mitigate local optima. A variant where VAFs are observed instead of read counts is NP-complete (equivalent to HITTING SET).
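The two alternating steps can be sketched in a heavily simplified form. This is not DeCiFer's code. It assumes a single sample, a plain binomial likelihood, no restarts, and a fixed iteration count. Each candidate genotype tree is reduced to a hypothetical linear map (slope, intercept) from DCF to expected VAF. A golden-section search stands in for Brent's algorithm to keep the example free of external dependencies.

```python
import math


def log_binom(a, t, v):
    """Binomial log-likelihood of a variant reads out of t, up to a constant."""
    v = min(max(v, 1e-9), 1 - 1e-9)
    return a * math.log(v) + (t - a) * math.log(1 - v)


def golden_max(f, lo=0.0, hi=1.0, tol=1e-7):
    """Maximize a unimodal f on [lo, hi] (stand-in for Brent's algorithm)."""
    phi = (math.sqrt(5) - 1) / 2
    while hi - lo > tol:
        x1, x2 = hi - phi * (hi - lo), lo + phi * (hi - lo)
        if f(x1) < f(x2):
            lo = x1
        else:
            hi = x2
    return (lo + hi) / 2


def cluster_loglik(d, members):
    """Log-likelihood of one cluster's SNVs at DCF d, given their chosen trees."""
    return sum(log_binom(a, t, trees[j][0] * d + trees[j][1])
               for (a, t, trees), j in members)


def coordinate_ascent(snvs, init_dcfs, iters=20):
    """snvs: list of (variant_reads, total_reads, candidate_trees)."""
    dcfs, assign = list(init_dcfs), []
    for _ in range(iters):
        # Step 1: for each SNV pick the best (cluster, tree) pair.
        assign = []
        for a, t, trees in snvs:
            _, k, j = max((log_binom(a, t, s * d + b), k, j)
                          for k, d in enumerate(dcfs)
                          for j, (s, b) in enumerate(trees))
            assign.append((k, j))
        # Step 2: re-optimize each cluster DCF with assignments held fixed.
        for k in range(len(dcfs)):
            members = [(snv, j) for snv, (kk, j) in zip(snvs, assign) if kk == k]
            if members:
                dcfs[k] = golden_max(lambda d: cluster_loglik(d, members))
    return dcfs, assign


# Hypothetical inputs: VAF = d/2 in a diploid region; d/3 or 2d/3 in a gained one.
diploid = [(0.5, 0.0)]
gained = [(1 / 3, 0.0), (2 / 3, 0.0)]
snvs = [(50, 100, diploid), (49, 100, diploid), (20, 100, diploid),
        (21, 100, diploid), (66, 100, gained)]
dcfs, assign = coordinate_ascent(snvs, init_dcfs=[0.9, 0.3])
```

On this toy input the last SNV is assigned the second candidate map, which places it in the same high-DCF cluster as the diploid SNVs with VAF near 0.5, even though its raw VAF is different.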
Model selection. DeCiFer includes p + 2 fixed clusters (a truncal cluster fixed at sample purity, an absent cluster at DCF = 0, and p sample-specific clusters) plus a variable number of additional clusters chosen via an elbow method.
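The summary does not specify which elbow criterion is used, so the following is only one common heuristic, shown as an assumption: pick the number of clusters at which the improvement in the objective drops most sharply (the largest second difference). The function name and the objective values are invented for illustration.

```python
def elbow(objective_by_k):
    """objective_by_k: list of (k, objective) sorted by k; lower is better.

    Returns the k with the largest drop in marginal improvement.
    """
    best_k, best_bend = None, float("-inf")
    triples = zip(objective_by_k, objective_by_k[1:], objective_by_k[2:])
    for (_, f_prev), (k, f_here), (_, f_next) in triples:
        # Gain from reaching k, minus gain from going one cluster further.
        bend = (f_prev - f_here) - (f_here - f_next)
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Invented objective values (e.g. negative log-likelihood) for k = 1..6.
curve = [(1, 1000), (2, 600), (3, 300), (4, 280), (5, 270), (6, 265)]
chosen_k = elbow(curve)  # 3
```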
Probabilistic model. DeCiFer uses a beta-binomial likelihood (parameterized by mean v and precision s) to model overdispersion in sequencing read counts, with the precision s estimated from germline SNPs. Alternatively, a binomial model can be used.
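A beta-binomial with mean v and precision s is conventionally obtained by setting the beta parameters to α = v·s and β = (1 − v)·s. The sketch below assumes that parameterization (the summary does not spell it out) and evaluates the log-probability with the standard library only. The function names are illustrative.

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_logpmf(a, t, v, s):
    """Log-probability of a variant reads out of t total reads.

    v: expected VAF, 0 < v < 1;  s: precision, s > 0 (larger s = less overdispersion).
    """
    alpha, beta = v * s, (1 - v) * s
    log_choose = math.lgamma(t + 1) - math.lgamma(a + 1) - math.lgamma(t - a + 1)
    return log_choose + log_beta(a + alpha, t - a + beta) - log_beta(alpha, beta)


# Sanity check on a small example: probabilities sum to 1 and the mean is t * v.
t, v, s = 20, 0.3, 50.0
pmf = [math.exp(betabinom_logpmf(a, t, v, s)) for a in range(t + 1)]
total = sum(pmf)
mean = sum(a * p for a, p in enumerate(pmf))
```

As s grows, the distribution approaches the binomial with success probability v, which corresponds to the alternative binomial model mentioned above.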
Validation. The authors compared DeCiFer against PyClone and an adjusted PyClone (using CCFs pre-computed by a constant-multiplicity method) on simulated data with 2, 4, 8, and 12 clusters, 1000 SNVs, 5 samples, and 100× coverage. They also compared against PhyloWGS on smaller instances (100 SNVs). DeCiFer was then applied to 49 metastatic prostate cancer samples (10 patients) from Gundem et al. (2015), using SNVs and copy numbers previously identified by Zaccaria & Raphael (2020) via HATCHet.
Key Findings
- DeCiFer outperforms existing methods on simulated data across all metrics. On simulated instances with 2, 4, 8, and 12 SNV clusters, DeCiFer achieved lower cell fraction error (mean absolute difference between inferred and true cell fractions) and higher adjusted Rand index (cluster accuracy) than PyClone and adjusted PyClone. DeCiFer also accurately estimated the number of clusters, while PyClone systematically overestimated it (inferring 40–120 clusters on several instances). PhyloWGS had the lowest accuracy on 100-SNV instances and failed to converge within 3 days on 1000-SNV instances, whereas DeCiFer completed the largest instances in under 1.5 hours.
- Over 23,000 SNVs across 49 prostate cancer samples changed classification when using DCFs instead of constant-multiplicity CCFs. Specifically: (a) ~8,500 SNVs classified as subclonal by standard CCFs were reclassified as truncal by DeCiFer, attributable to mutation losses by deletions; (b) ~12,000 SNVs classified as clonal by CCFs were reclassified as subtruncal by DeCiFer, reflecting differences in mutation multiplicity assignments; and (c) 1,560 SNVs classified as absent (VAF = 0 in a sample) by standard CCFs were reclassified as truncal or subtruncal by DeCiFer, indicating SNVs deleted from all cancer cells in a sample.
- DeCiFer yields more parsimonious evolutionary reconstructions that avoid homoplasy. In prostate cancer patient A17, standard CCF analysis of 284 SNVs on chromosome 5q required 142 independent homoplasy events (each such SNV arising independently on both homologous chromosomes) under the constant mutation multiplicity assumption. DeCiFer’s DCF-based analysis resolved all 284 SNVs as truncal with differing mutation multiplicities, requiring no homoplasy. Similar findings were reproduced in patients A12 and A24.
- The DCF enables detection of mutations that are truncal but appear absent due to clonal deletion. In patient A12, DeCiFer identified 58 SNVs on chromosome 6q as truncal (DCF ≈ 1) even in sample A12-A where their VAF = 0, because all cancer cells in that sample had undergone a copy-neutral loss of heterozygosity that deleted the mutations. Standard CCF methods classified these SNVs as absent, and the phyloCCF correction could not recover them because it only considers SNVs with VAF > 0.
- The single split copy number assumption is more realistic than the constant mutation multiplicity assumption, being more restrictive in some respects and more permissive in others. The SSCN assumption excludes implausible genotype sets (e.g., those requiring homoplasy) while including biologically realistic scenarios (e.g., SNVs on amplified or deleted alleles) that the constant mutation multiplicity assumption cannot represent. When genotype proportions exist under SSCN, they are unique, a property not shared by the constant mutation multiplicity assumption.
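The two simulation metrics named in the first finding above, cell fraction error and adjusted Rand index, can be computed as in the following sketch. It uses the standard pair-counting formula for the adjusted Rand index. The function names and the toy inputs are illustrative, and the paper's exact evaluation code is not reproduced here.

```python
from collections import Counter
from math import comb


def cell_fraction_error(inferred, true):
    """Mean absolute difference between inferred and true cell fractions."""
    return sum(abs(i - t) for i, t in zip(inferred, true)) / len(true)


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two clusterings of the same items."""
    n = len(labels_a)
    # Pairs of items placed together in both clusterings.
    together = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every item in a single cluster
    return (together - expected) / (maximum - expected)


same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])      # 1.0, labels are arbitrary
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])   # -0.5
error = cell_fraction_error([0.9, 0.4], [1.0, 0.5])         # about 0.1
```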
Concepts Introduced or Used
- Descendant cell fraction (DCF): A novel statistic that measures the proportion of cells in a sample descended from the ancestral cell where an SNV was introduced, generalizing the CCF to account for mutation losses. SNVs on the same phylogenetic branch have the same DCF even if some are later lost.
- Single split copy number (SSCN) assumption: An assumption that at each SNV locus, exactly one copy-number state (x*, y*) has two distinct genotypes — one with mutation multiplicity 0 and one with mutation multiplicity m*. This follows from standard evolutionary models (Dollo for SNVs, infinite alleles for CNAs).
- Genotype tree: A rooted tree describing the evolutionary history of a single SNV locus, encoding which genotype states are ancestral to others. Combines the Dollo model for SNVs with the infinite-alleles model for allele-specific copy numbers.
- Constant mutation multiplicity assumption: The standard but overly restrictive assumption that all cells carrying an SNV have the same number of copies M of that SNV.
- Cancer cell fraction (CCF): The fraction of cancer cells in a sample that contain at least one copy of a given SNV.
- Variant allele frequency (VAF): The proportion of sequenced reads at a locus that carry the variant allele.
- Fractional copy number (F): The average copy number over all cells (cancer + normal) at a locus.
- PhyloCCF: An ad hoc correction for CCF in regions affected by subclonal deletions, proposed by the TRACERx study (Jamal-Hanjani et al. 2017), which the authors note is limited to VAF > 0 and sample-independent analysis.
Entities Referenced
- DeCiFer: The algorithm introduced in this paper. Available at https://github.com/raphael-group/decifer and through Bioconda.
- PyClone (Roth et al. 2014): A method for clustering SNVs by CCF that simultaneously infers mutation multiplicities. Does not model subclonal CNAs.
- PhyloWGS (Deshwar et al. 2015): A phylogenetic method that simultaneously clusters SNVs and reconstructs tumor phylogenies. Does not scale to large numbers of SNVs.
- SPRUCE (El-Kebir et al. 2016): A phylogenetic method using multi-state perfect phylogeny mixtures; the genotype tree enumeration procedure in DeCiFer extends SPRUCE’s approach.
- HATCHet (Zaccaria & Raphael 2020): A tool for inferring allele-specific copy numbers and clone proportions from multi-sample bulk sequencing data, used as input to DeCiFer for the prostate cancer analysis.
- TRACERx (Jamal-Hanjani et al. 2017): The non-small-cell lung cancer sequencing study that proposed the phyloCCF correction.
- Gundem et al. (2015): The metastatic prostate cancer study whose 49 samples (10 patients) were reanalyzed with DeCiFer.
- VarScan 2 (Koboldt et al. 2012): The somatic SNV caller used to identify mutations in the prostate cancer samples.
- BCFtools (Li 2011): Used to obtain read counts across samples for the prostate cancer analysis.
Limitations (as stated by authors)
- Copy-number inputs are assumed exact. DeCiFer takes copy-number proportions μ as given, but methods that infer these from bulk DNA sequencing data are subject to errors and may miss small or low-prevalence CNAs. The authors note this uncertainty could be incorporated into the DeCiFer model.
- No explicit modeling of neutral evolution tail. The clustering model could be improved by better modeling the tail of low-prevalence SNVs expected under neutral evolution, as addressed by methods like Caravagna et al. (2020).
- Structural variant breakpoints excluded. While the prevalence of structural variant breakpoints is proportional to read counts (Cmero et al. 2020), DeCiFer does not analyze them; this is a future direction.
- Genotype trees not combined into tumor phylogenies. DeCiFer selects genotype trees for individual SNVs but does not combine these into a comprehensive tumor phylogeny. The authors note that consensus tree methods could address this as future work.
- Model selection via elbow method. The number of clusters is chosen by an elbow criterion, which may be sensitive to the shape of the objective function.
- Coordinate ascent does not guarantee global optimum. DeCiFer uses multiple restarts to mitigate local optima, but convergence to the global maximum is not guaranteed.
- Genomic regions with high copy numbers excluded. The metastatic prostate cancer analysis excluded regions where max(x, y) > 4 and min(x, y) > 2 to enable efficient genotype tree enumeration and limit potential errors.
Relevance to Clonal Evolution
DeCiFer’s introduction of the descendant cell fraction (DCF) addresses a fundamental problem in clonal evolution research: the CCF, as conventionally computed, confounds two distinct evolutionary phenomena — the presence of an SNV in a cell at sampling time and the evolutionary ancestry of that SNV. When CNAs cause SNV losses, the CCF systematically misrepresents the timing and clonality of mutations, leading to inflated numbers of inferred clones, spurious homoplasy events, and phylogenetically implausible tree topologies.
Relationship to the ITH empirical test. The DCF is methodologically important for the intratumor heterogeneity (ITH) empirical test, which currently uses standard CCF correction. The ITH test’s measure of selection (δ) relies on accurate discrimination between neutral and selected dynamics, which in turn depends on correct assignment of mutations to clones and correct estimation of their prevalence. DeCiFer’s DCF-based clustering could provide a more principled input to the ITH test than CCF-based clustering, particularly for tumors with extensive CNAs where the constant mutation multiplicity assumption is violated.
Relationship to the cancer-evolution-olog (CEvo-olog). DeCiFer directly informs multiple nodes in the cancer evolution ontology:
- CCF correction: The DCF represents a more rigorous alternative to the standard CCF correction formula, replacing the constant mutation multiplicity assumption with the SSCN assumption and adding the genotype tree framework to account for mutation losses.
- Mutation classification: DeCiFer’s reclassification of >23,000 SNVs across 49 samples — from subclonal to truncal, clonal to subtruncal, and absent to truncal — demonstrates that mutation classification is not merely a technical detail but a substantive empirical finding that changes with the model.
- Phylogenetic reconstruction: DCF clusters are evolutionarily coherent (SNVs on the same branch have the same DCF) and can serve as direct input to tumor phylogeny methods (Popic et al. 2015, Qiao et al. 2014, El-Kebir et al. 2015, Malikic et al. 2015, Husic et al. 2019), bridging the gap between fast heuristic clustering and full phylogenetic inference.
Broader implications. The paper demonstrates that a substantial fraction of SNVs in CCF-based analyses may be misclassified as subclonal when they are in fact truncal but lost, or as clonal when they are subtruncal with differing multiplicities. This has downstream consequences for any analysis that depends on accurate clone assignment, including neoantigen prediction (where clonal mutations are prioritized), mutational signature decomposition over time, and selection inference. The DCF framework provides a principled path forward for tumors with CNAs — the majority of solid tumors.
Revision history
- 2026-07-05 — Initial source summary created.