Lung cancer and the spread of the disease are characterized by geno-transcriptomic evolution

admin

2 years ago

Nature: https://www.nature.com/articles/s41586-023-05729-x

The TRACERx study (ClinicalTrials.gov): Prospective cohort study of NSCLC tumours and metastatic samples

The TRACERx study (ClinicalTrials.gov identifier NCT01888601) is a prospective observational cohort study that aims to transform our understanding of NSCLC, the design of which has been approved by an independent research ethics committee (13/LO/1546). Informed consent for entry into the TRACERx study was mandatory and obtained from every participant. The study identity number was assigned to all participants. The participants were no longer able to identify themselves in the publications of the study. The study sponsor only oversaw the centralized database, which tracked the human samples, and they were linked to the study identity number.

The cohort represents the first 421 patients whose primary tumour and metastatic samples were received for processing, who met the eligibility criteria as outlined in ref. 17 and from whom collected tumour samples could be sequenced prospectively according to the filtering steps outlined in the CONSORT diagram (CONSORT flow chart; Extended Data Fig. 1).

Fresh frozen samples have sample extracts and sequencing summarized in the accompanying article. Multiregion sequencing was not done where smaller samples were obtained. The sequence of germline DNA from aliquots was resequenced for fresh frozen recurrence/progression samples.

2 20 m sections of Cresyl-Violet stained slides were mounted onto Leica glass slides with a polyethylene naphthalate stain and then on to a sandwich. The area was marked by a histopathologist, and any lesions of less than 3 mm in diameter underwent laser-capture microdissection, with larger lesions undergoing macrodissection with a sterile scalpel.

Sequencing and characterization of the overall genetic origin of FFPE-associated C > T artefacts using a modified KAPA hyperPrep library preparation kit

According to the manufacturer of the kit, the DNA was taken within 48 h. This kit contains the UNG (uracil-N glycosilase) to minimize FFPE-associated C > T artefacts. Only samples with a DNA integrity number greater than 2 were used for downstream processing and the samples that were quantified, Invitrogen, andTapeStation were used. The sample were sheared using the Covaris instrument. Libraries were prepared using 50–200 ng of sheared DNA as input for a modified version of the KAPA HyperPrep library preparation kit (Roche). The inclusion of the Sure Select XT primer and the SureSelect XT oligo were included in the modifications. The remainder of the protocol was performed according to the fresh frozen TRACERx WES sequencing pipeline, with 7–9 PCR cycles used to amplify the DNA to the required 750 ng for hybridization. Sequencing was performed as for the fresh frozen samples, although no additional germline sequencing was performed.

For case-level definitions, a similar approach was used. The overall origin is also defined as polyphyletic if any metastasis was defined as such. Conversely, if all metastases were monophyletic in origin, all branches containing shared clusters were counted. If only a single such branch existed, the case-level origin was classified as monophyletic.

FFPE-Sample LogR Segmentation in the Somatic Copy-Number Aberration Detector and Seeding Methodology

For FFPE samples, modifications to the somatic copy-number aberration detection pipeline were incorporated to address the increase in the fluctuations seen in FFPE-sample logR segmentation. The mean logR value for all SNPs within a BAF segment was assigned as the segmented logR value for that BAF segment. Many small segments remained after this adjustment. These small segments corresponded to logR segments that do not have heterozygous SNPs within them and, therefore, no corresponding BAF segments. Each of the non-BAF segments had to be compared to its preceding segment within the same chromosome, and joined to the segment with the best logR value until there were no logR only segments present. The logR in newly joined segments was changed and used for downstream analyses. There were segments that were related to the lowest logR values removed.

The results of MACHINA can be used to identify different types of seeds. Thus, to provide further evidence to the identified seeding clones, we compared the results of MACHINA with those inferred by the new method in this study. We only took the MACHINA results into account when using the same model as the parallel single-source seeding assumption. Moreover, the definition of monoclonal and polyclonal seeding from MACHINA does not take into account the tree, as done in this study. In addition to MACHINA defining cases as polyclonal only if at least one metastasis sample is polyclonal, they also define cases as monoclonal if there is a single or multiple metastases. All cases of polycloNAL, but that have multiple monoclonal metastases, were re- defined as monoclonal for this comparison.

Allele-specific arm-level LOH events were defined as primary-ubiquitous if the same allele was lost in all primary tumour regions. The arm-level loss was determined by the percentage of the arm being lost. The proportion of primary-ubiquitous LOH events shared in the metastases was compared in the phylogeny-defined early and late divergence cases.

The secondary tumors that were identified have the same WGD event as the primary tumors and the status of the paired metastases was explored. A metastasis was defined as diverging early if no WGD was seen in the metastasis, or a separate WGD event was identified. If the sameWGD event in the primary tumour regions was seen in the metastases, then they were defined as having diverged late.

This approach was repeated until n −1 regions were considered, and the average proportion of shared clonal mutations as well as the classification of the timing of divergence were highlighted.

All simulations were run for mutation rates of either 0.4 or 0.6 mutations per division per base pair (bp) in the exome (6.6 × 10−9 and 10 × 10−9 bp per division, respectively) on the basis of observed mutation rates in lung cancer56. Twenty replicates of simulated primary–metastatic pairs were run for each combination of parameters.

An agent-based model of cancer growth and evolution9-47 was modified to reflect the timing and mode of dispersal. The original model simulations the growth of a tumours through the division of individual cells with a set mutation rate. When the simulation stops the tumour grows in populations of 5,000 cells and reaches a size of more than 100 cells. The simulated tumour is then ‘sampled’ in 8 regions of around 50,000 cells. For each region, exome sequencing is simulated taking into account sequencing error rates for standard Illumina short read sequencing and a mean depth of coverage of 400×, similar to that used in the sequencing of the TRACERx cohort. The simulation produces a file with the minor allele frequency of the detected mutations in each sample.

The model used here was modified from the original to include a dynamic selection landscape. Each cell has a fitness value that controls its probability of dividing. If the maximum fitness of the cell is larger than a random number between 0 and 1 the cell will divide. Cells with large fitness values will therefore be more likely to divide than those with lower values. As the deme approaches its population limit of 5,000 cells, division is more likely with low populations and will become more unlikely. Given that the growth rate is a combination of the division and death rates, the death rate was fixed to avoid further increasing the stochasticity of the model. The death rate of 0.2 was chosen so that the modeled mutation burden was comparable to the mutation burden observed in the TRACERx cohort.

Determining regional subclonal mutations in metastatic tumours using the MACHINA algorithm I. Seeding status and clustering analysis

The seeding region classifier was based on a cohort of regions from primary tumours that had metastasized or that had not metastasized after 3 years of follow up. Only the regions with a seeding clone of >2 CCF were considered for this analysis. More than 500 primary tumour regions from 206 tumours were analysed for seeding status and all other metrics could be measured. The following features were also considered for the classifier:

Cell volume was calculated assuming a parenchymal cell 57 has a side of 15 m. The total amount of tumours is based on the number of cells in the tumours and the cell size. A percentage of the total tumour cells in the tumour were added to account for purity.

If the shared clusters mapped to multiple branches of the phylogenetic tree, each branch was considered separately in the manner described above. If a parent cluster was shared between multiple branches, CCF values of both branches were added together, and the iterative approach continued until the first cluster was found to be clonal in the metastasis.

An input to MACHINA was used to infer migration patterns from the estimated clone proportions. Specifically, MACHINA was run by specifying the primary lung tumour and implementing each metastatic tumour as a separate site. Moreover, MACHINA was run considering all of the possible assumptions about the possible migration patterns that can be evaluated (parallel single source seeding, single source seeding, multi-source seeding, reseeding). The results from the single-SOURCE seeding output from MACHINA was used to explore seeding one metastasis by another site.

Proportion of regionally subclonal mutations: the number of mutations belonging to subclonal mutational clusters in the focal tumour region divided by the total number of mutations in that region. Any subgroup with a cancer cell fraction less than 1 is considered to be subclonal. Details on how clonal clusters were determined are available in a companion paper7. The proportion of smaller clones being present in the tumours region is given by this metric.

Where ({p}{i}=\,\frac{{x}{i}}{{\sum }{i=1}^{n}{x}{i}}) is the vector of CCF proportions. The clone was spread evenly across the entire map with each subclone being scored from 1 to 0 for being unique to a single region. We compared the maximum CCF and subclonal dispersion to investigate both how dominant in any region and spread out across the regions the clusters were to quantify subclonal expansion.

To characterize the phenotype of the tumour region, we measured the tumour-region enrichment score through ssGSEA (using the R package fgsea (v.1.10.1)71 using a Gaussian distribution and default parameters on VST counts) for three cancer-specific gene sets: (1) CIN70 (ref. There are at least three genes that have been linked to poor progess and metastasis, an expression signature associated with genome instability, a lung cancer-specific prognostic maker and a high-plasticity cell. 5. biomaRt (v.2.40.1)99 1:1 orthologues between human and mice genes from cluster 5 were used to calculate this signature, as described in the publication.

Number of region subclonal driver mutations: the number of driver mutations that belong to subclonal mutation clusters (cancer cell fraction below 1) in the focal region. There are details about how driver and clonal clusters were determined in a companion manuscript.

When performing the unpaired SCNA analyses separately for primary LN/satellite lesions and recurrence/progression samples, we constructed a single copy number profile for each sample type (that is, primary tumour, primary LN/satellite lesions and recurrence/progression samples), and performed comparison analyses as described above.

Estimates of CCF subclones based on the cloneMap R package. III. Sample Selection and Quality Testing in RNA-Seq Data

The depictions of CCFs of subclones estimated using our WES has been included in two other books. The depictions were created by using the cloneMap R package63.

All statistical tests were performed in R (v.3.6.3 and 4.1.1). No statistical methods were used to predetermine sample size. Tests involving comparisons of distributions were performed using two-sided Wilcoxon tests (‘wilcox.test’) using paired or unpaired options where appropriate. Tests involving comparison of groups were performed using two-sided Fisher’s exact tests (‘fisher.test’). Hazard ratios and P values were calculated using the survival package (v.3.2.13). For all statistical tests, the number of data points included is plotted or annotated in the figure and the testing was two-sided unless otherwise stated.

The cohort in this manuscript includes the fraction of samples from the first 421 participants (described in detail in two companion manuscripts6,7) with RNA-seq data available after quality checking before and after sequencing (CONSORT diagram in the Supplementary Information).

The differential between transcript tumour fraction and tumour purity was added to test the potential effect of overall tumours-specific expression.

The first 100 participants of the study with multiple tumour regions were selected for the subset of data previously published.

Illumina adapters were trimmed from raw sequencing reads using Cutadapt (v.2.10)52 with standard parameters, and the quality of the trimmed reads were estimated per flow cell lane using FastQC (v.0.11.9)51. Fastq files with less than 80% of their total reads being duplicate were kept for alignment. The UCSC hg19 human reference genome build was aligned to the Fastq read files using the two-pass mode with the ENCODE 3 parameters. The same reads were also mapped to the human transcriptome (RefSeq GCA_000001405.1 build) using the same STAR parameters to generate gene expression data. Next, we marked duplicates using the MarkDuplicates function from GATK (v.4.1.7.0)55. Aligned reads were quality checked using QoRTs (v.1.3.6)56 to assess RNA integrity. Somalier (v.0.2.7)57 was used to detect potential instances of sample mislabelling and FastQ Screen (0.14.0)58 was used to detect potential instances of contamination. The outputs were visualized with MultiQC (v.1.9)59. The default parameters were used to quantify the expression of genes in the files aligned to the transcriptome. Further quality checking of each sample was done using Gene expression patterns. Tumour regions with at most 40% of the genes with zero counts were excluded. Additionally, samples with <20% of reads mapping to a genomic area covered by exactly one gene in a coding sequence genomic region (estimated using the QoRTS output ReadPairs_UniqueGene_CDS) were excluded. Next, RNA coverage was calculated for single nucleotide variants (SNVs) detected in matched whole-exome sequencing data per tumour region using SAMtools (v.1.9)61 mpileup. Mutation expression was used to further quality check the mapping of RNA reads. The expression of SNVs exclusive to a given tumour region was used to detect potential instances of within-patient mislabelling of RNA–DNA matched tumour regions as well as to exclude normal adjacent lung tissue regions that expressed mutations present in paired tumour regions. A similar approach was applied to germline SNPs in order to assess the potential for sample swaps based on the similarities and differences between tumour region and genotypic pattern. Tumour regions in which fewer than 10 mutations, or fewer than 25% of the total mutation count, had evidence of expression, and/or less than 10% of SNPs had evidence of biallelic expression, were excluded. Finally, tumour regions clustering with tumour-adjacent normal tissue regions (see the section ‘UMAP clustering’) and tumour regions with a low purity were also excluded from further analyses. To ensure the reproducibility and portability of the above pipeline, all steps described were implemented through the Nextflow (v.20.07.1)62 pipeline manager.

Unless explicitly specified otherwise, all Wilcoxon tests performed in this work are two-sided, using the function wilcox.test() in base R. To account for the effect of each individual tumour when comparing tumour regions in the cohort, we use linear mixed-effects models throughout the manuscript. These were fitted using the package lmerTest (v.3.1-3)63 in R, using the parent tumour from which the tumour region was derived as a random effect. Significance was obtained comparing a null model with the model containing the variable of interest using the base R function anova(), and setting refit = TRUE to test the impact of fixed effects.

Generating, analysis and profiling of driver alterations in non-LUAD NSCLCs for GSEA and application to ASE73

Unless stated otherwise, plots were generated in the R environment (v.3.6.3), using ggplot2 (v.3.2.1)64, ggpubr (v.0.4.0), cowplot (v.1.0.0), scales(v.1.0.0) and ggrepel (v.0.8.1).

VST counts from all samples in the cohort were used to generate a UMAP67 of expression patterns across the cohort. The package for UMAP was in R and it had default parameters.

Differential mutation analysis was conducted to establish driver alterations that were enriched in LUADs relative to non-LUAD NSCLCs. When determining the number of genes that should be considered for each driver, a total of 266 genes were compared to the numbers of non LUAD and LUAD tumours regions. The four genes that the driver of LUADs enriched were identified using the Benjamini–Hochberg method. The test was performed to compare the relative enrichment of the events in non-LUADs with those in LUADs in the UMAP.

The histologic features of the non-LUAD tumours were further reviewed in relation to the TTF-1, p63, and p40 staining profiles.

A PCA was performed using different values for LUAD and LUSC tumours. PCA was performed using the prcomp() function in base R, centring the data but not scaling, as expression data had already been scaled through VST. Up to 30% of the variance explained in both LUAD and L USC were added to the PCs.

The relationship between I-TED and purity, the number of regions per tumour and hisological subtype was examined using a multivariable linear regression. The percentage of variance explained by each type of alteration was calculated using the Anova function from the R package car (v.3.0-6)69.

The t-statistic generated by limma was used as input for GSEA for MSigDB hallmark gene sets14 using the R package fgsea (v.1.10.1)71 with default parameters.

In accordance with the expected distribution of allelic expression, a beta-binomial test was used to test for ASE73 using the pbetabinom function from the R package VGAM (v.1.1-2)74 and an overdispersion parameter σ of 0.05. The major and minor all areas are inferred from multi region nucleic acid data. The major allele is the one with the most reads in the corresponding whole-exome data. The major and the minor alleleRNA reads were reported in the same way.

With an expected allelic expression ratio of 0.4, a test was performed to see if the probability of getting all-ele specific read counts is as high as the observed distribution.

where t is the total RNA reads at that heterozygous SNP and CPNratio is the raw major allele copy number divided by the total copy number at that site. This eliminates sites with low read counts and/or high-level amplifications, since they were removed by this.

For a subgroup of samples with previously published RRBS data, we obtained a list of differentially-MEADATED TARGETS using the tumor-adjacent normal methylation rate as a proxy. For each of these, we computed the number of CpGs that were significantly hypomethylated and hypermethylated in tumour samples compared to the normal samples, taking only loci that had coverage in all samples (minnormal = 10, mintumour = 3). We then calculated the fraction of positions that were hypomethylated using differentially methylated positions. We compared the metric with the percentage of genes that showed evidence of the cancer’s independence, seperately for L USC and LUAD.

The amplification of a subset of oncogenes or the loss of a subset of tumours suppressor genes can be found in the presence or absence of a CN drivers. There were at least 10% of the genes that were included with copy number driver alterations. These include SOX2, TERT, TERC, CDKN2A, MYC, CCND1, FGFR1, NKX2-1, AKT2, EGFR and CCNE1. There is details in a companion paper about how the driver and clonal clusters are determined.

Presence or absence of driver mutations in cancer genes with a driver mutation in at least 5% of the cohort. There were driver changes in ARID1A, ATM, ATRX, CDKN2A, and CREBBP. See a companion manuscript7 on the definition of cancer genes in our cohort.

SETD2 had three different publications,one for KDM5C83 and none for KMT2B. Three biological replicates with shRNA23), three single replicates with the ZFn and a single knockout with the CRISPR were done in lung cells.

Variant reads from each strand were compared to total depth in order to find the difference in strandedness.

Additionally, we selected ten putative editing events for Sanger sequencing, all of which were validated with this orthogonal method (Extended Data Table 2).

To test the potential relationship between signature activity and the expression of specific genes, we performed a linear mixed-effects model using the number of RNA variants attributed to each signature as the dependent variable, and gene expression of all genes in our dataset (n = 20,136). The expression was measured as log10 for genes with at least 5 read counts and 20% of the RNA-seq cohort.

To further test the relationship between APOBEC expression and RNA-SBS2, a linear mixed-effects model was performed, using the number of RNA variants attributed to RNA-SBS2 as the dependent variable, the expression of all APOBEC genes in the transcriptome as explanatory variables and the tumour identifier as a random effect.

A previously reported local enrichment method was used to conduct the enrichment analyses of the APOBEC motif. In brief, for each C>T variant site, a Fisher’s test was performed to test whether C>T changes within 20 upstream or downstream nucleotides occurred more than expected by chance at specific motifs (CAT[C>T]) in either strand.

Source: https://www.nature.com/articles/s41586-023-05706-4

I-TED and the ITH of other forms of alterations. Clinical characteristics and APOBEC study of LUAD-specific subtypes

The relationship between I-TED and the ITH of other forms of alterations was tested using a multivariable linear regression in a similar fashion as that detailed in the section ‘I-TED’.

Clinical features, including age of the patient, sex, years spent smoking cigarettes and TNM stage of the primary tumour at resection. See methods in a companion manuscript7 for details on how these features were obtained.

The central pathological review describes the LUAD-specific subtype as consisting of acetic, lepidic, cribriform, micropapillary and papillary. It was only available for LUAD tumours, so it’s described more in the companion paper.

There is a ratio of the number of expressed genes divided by the total burden in the tumours region. If you have three reads of the allelic in the data, a mutation is expressed. This metric serves as a proxy for the proportion of tumour mutations that were present in the bulk RNA-seq transcripts.

Similarly, the number of genome-doubling events per tumour region was also considered. There are details on how the events were calculated in the companion paper.

COSMIC mutational signatures SBS1, SBS2, SBS4, SBS5, SBS13 and SBS92 (ref. 34). Signature activity was measured as the fraction of mutations per tumour region corresponding to each signature’s weight. The combined strengths of the two types of SBS were used for the purpose of APOBEC activity. Details on how mutational signatures were extracted are available in a companion manuscript7.

The per tumour region I-TED score was included in order to measure the impact of expression diversity. The median score for samples was I-TED which only had a single region for each tumours.

We additionally included the RNA-editing levels (fraction of RNA molecules with edited sites) of three genes reported to play a role in cancer development from the literature101,102,103: AZIN1, COPA and COG3. This feature was added only for tumour regions with at least 30 unique RNA reads covering the editing sites of interest.

Source: https://www.nature.com/articles/s41586-023-05706-4

Python Framework for Learning Multilayer Perceptron and Logistic Regression Models Using Tensorflow.keras and sklearn

We built the framework in Python using two libraries. Specifically, we built an ensemble classifier that used three different model types: (1) logistic regression, (2) random forest and (3) multilayer perceptron with support vector machine embedded in the final layer. We describe the structure of the machine-learning pipeline in more detail below.

We looked at the correlation structure among the potential explanatory features and then removed the features that had high correlation coefficients. We split the data into training and test datasets, using categorical features from Pandas. After encoding, we had a total of 60 features. Using SMOTENC from imblearn. over_sampling and MinMaxScaler, we scaled the features and improved the balance of the dataset. Adding variable selection before using a LinearSVC model and keeping those features important was done with the sklearn (v.0.0)105 framework. This threshold reduced the number of features. In the process of generating different subsets of the dataset, we generated different streams based on the source of the input features.

To tune model hyperparameters we used a randomized grid search and cross-validation.

We implemented a sequential model from tensorflow.keras (v.2.6.0)104 with dropout layers (dropout = 0.2) to reduce overfitting and used a categorical hinge loss function with an l2 kernel regularizer and sigmoid activation function in the final layer. The final layer of the sequential model is supported by this approach. We used the Adam optimizer from tensorflow.keras.optimizers (v.2.6.0)104. We defined a search grid to tune parameters such as the learning rate and the number of hidden layers. The model we chose was the highest performing model according to the balance of accuracy. We then extracted feature weights from this selected model using PermutationImportance from eli5.sklearn (v.0.11.0). The model that we selected on the held-out test dataset was used to predict the seeding of the test region and compared to the true labels. The machine-learning pipeline was developed using Python (v.3.5.5), and plots of results were generated in R (v.4.0.3) using ggplot2 (v.3.3.5).