Epigenetic data will be used to uncover gene interactions


A machine-learning model for understanding epigenetic influences on chromatin-accessibility landscapes and transcriptional states of adult and fetal human cells

Each cell needs access to a different set of genes to carry out its functions, enforce its gene-regulatory state and respond to stimuli. Although each cell has the same genome, the epigenetic modifications that control gene expression depend on the context in which the cell exists, such as developmental stage, disease state or tissue environment. Accessibility of chromatin (the packaged form of DNA) is epigenetically regulated and dictates which regulatory regions of the genome can be accessed by transcription factors, the proteins that turn genes on or off or tune their expression levels1. Writing in Nature, Fu et al.2 report the development of a machine-learning model trained on chromatin-accessibility data from more than 200 adult and fetal cell types, with the aim of understanding how different chromatin-accessibility ‘landscapes’ yield different transcriptional states. The researchers apply the model to make predictions about those states, about the regulatory sequences that drive expression and about interactions between transcription factors.

We first collected inference samples across the whole genome by generating windows of 200 regions. Given a specific gene g on strand \(s\in \{0,1\}\), the expression value can be inferred using the GET model f applied to an input matrix \(X\in {\mathbb{R}}^{r\times m}\), where r denotes the number of regions, and m is the number of features, comprising motifs and (optionally) accessibility:

We stratified the K562 lentiMPRA elements (approximately 200,000) by overlapping the annotated 15 ENCODE ChromHMM states computed from histone mark and other ChIP–seq data for K562. We selected the elements overlapping with states ‘12 EnhBiv’, ‘6 EnhG’ and ‘7 Enh’ as enhancers, and those overlapping with ‘13 ReprPC’, ‘14 ReprPCWk’ and ‘15 Quies’ as repressive and quiescent regions.

Genome-by-Motif Analysis Using a Global Jacobian of the Feature Embedding and its Applications in Hepatocytes

The Jacobian matrix describes how each output dimension changes in response to a change in each input dimension. We select the strand and the output dimension that correspond to the given gene.
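To make the role of the Jacobian concrete, the following minimal sketch estimates it by central finite differences for a toy two-output function. The function `toy_model`, the step size and the input values are illustrative assumptions; the original analysis would obtain gradients from the trained model by automatic differentiation rather than by finite differences.

```python
def jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian: jac[i][j] = d f(x)[i] / d x[j]."""
    n_out = len(f(x))
    jac = [[0.0] * len(x) for _ in range(n_out)]
    for j in range(len(x)):
        upper = list(x)
        lower = list(x)
        upper[j] += eps
        lower[j] -= eps
        f_upper, f_lower = f(upper), f(lower)
        for i in range(n_out):
            jac[i][j] = (f_upper[i] - f_lower[i]) / (2 * eps)
    return jac


def toy_model(x):
    # Hypothetical stand-in for a model with one output per strand.
    return [2.0 * x[0] + x[1] * x[2], x[0] * x[0] - 3.0 * x[2]]


x0 = [1.0, 2.0, 3.0]
J = jacobian(toy_model, x0)

# Selecting the row for one strand gives the sensitivity of that output
# to every input dimension.
strand = 1
gene_row = J[strand]
```

Each row of `J` corresponds to one output dimension, so choosing a strand amounts to choosing a row.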

The motif-level feature vector for the gene, \({\rm{v}}_{g}\), is obtained by combining elements from different regions.

where \(\odot\) signifies the element-wise (Hadamard) product. \(X\) is used for the analysis of feature–feature interactions, regardless of what the model is used for. This allows study of the relationship between the regulators and observed accessibility.

The cell-type-specific genome-wide gene-by-motif matrix for cell type c, \(V_{c}\), is acquired by concatenating the \({\rm{v}}_{g}\) across the genome. The same process can be applied to different cell types.
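As an illustration of how per-gene vectors could be stacked into a gene-by-motif matrix, the sketch below assumes that each gene's vector is the element-wise product of a Jacobian and the input matrix, summed over regions. That aggregation rule, the function names and the toy numbers are assumptions made for this example; the exact formula is not recoverable from the text above.

```python
def gene_by_motif_vector(jac, x):
    """Sum the element-wise product jac * x over regions, per motif.

    jac and x are lists of rows (regions) by columns (motifs).
    Summation over regions is an assumption of this sketch.
    """
    n_regions, n_motifs = len(x), len(x[0])
    return [
        sum(jac[r][m] * x[r][m] for r in range(n_regions))
        for m in range(n_motifs)
    ]


def gene_by_motif_matrix(per_gene_inputs):
    """Stack one vector per gene; per_gene_inputs is a list of (jac, x) pairs."""
    return [gene_by_motif_vector(jac, x) for jac, x in per_gene_inputs]


# Two hypothetical genes, each with two regions and two motifs.
gene_1 = ([[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [2.0, 1.0]])
gene_2 = ([[0.5, 0.0], [0.0, 2.0]], [[2.0, 1.0], [1.0, 3.0]])
V_c = gene_by_motif_matrix([gene_1, gene_2])
```

Rows of `V_c` are genes and columns are motifs, so a motif column can be read off directly for downstream ranking.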

In practice, we use the l2-norm of the Jacobian of the region embedding with respect to output for calculating the region importance score as the embedding score distribution is less skewed than the input motif binding score, potentially making the Jacobian more comparable across regions. The per-region Jacobian score is normalized by the maximum score per gene to make scores comparable across genes with different expression levels.
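The region importance score described above can be sketched as follows, assuming the Jacobian with respect to each region embedding is available as one row of numbers per region. The function name and the handling of an all-zero Jacobian are choices made for this example.

```python
import math


def region_importance(jacobian_by_region):
    """l2-norm of each region's Jacobian row, divided by the per-gene maximum."""
    norms = [math.sqrt(sum(v * v for v in row)) for row in jacobian_by_region]
    top = max(norms)
    if top == 0:
        # Assumption: a gene with no gradient signal gets all-zero scores.
        return [0.0] * len(norms)
    return [n / top for n in norms]


# Three hypothetical regions with two embedding dimensions each.
scores = region_importance([[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]])
```

After normalization the most important region of every gene has score 1, which is what makes scores comparable between genes.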

To find the genes most affected by a given TF, we used the gene-by-motif matrix and identified the largest entries in the corresponding motif column. We performed gene enrichment analysis using g:Profiler with the default g_SCS multiple hypothesis testing correction. We filtered the results using term size (gene number in a term definition) greater than 500 and less than 1,000. Terms with adjusted P value less than 0.05 were retained as significant terms. Genes with expression log10(TPM) > 1 were chosen for visualization against the GATA score.
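The two selection steps above can be sketched in plain Python. The gene names, matrix values and the dictionary keys `term_size` and `adjusted_p` are hypothetical placeholders and are not the field names returned by g:Profiler.

```python
def top_genes_for_motif(matrix, genes, motif_index, k):
    """Return the k genes with the largest entries in one motif column."""
    ranked = sorted(
        range(len(genes)),
        key=lambda g: matrix[g][motif_index],
        reverse=True,
    )
    return [genes[g] for g in ranked[:k]]


def filter_terms(terms, min_size=500, max_size=1000, alpha=0.05):
    """Keep terms with min_size < size < max_size and adjusted P < alpha."""
    return [
        t for t in terms
        if min_size < t["term_size"] < max_size and t["adjusted_p"] < alpha
    ]


genes = ["GENE_A", "GENE_B", "GENE_C"]
matrix = [[0.1, 0.9], [0.7, 0.2], [0.4, 0.5]]  # genes x motifs
top = top_genes_for_motif(matrix, genes, motif_index=0, k=2)

terms = [
    {"name": "term_1", "term_size": 600, "adjusted_p": 0.01},
    {"name": "term_2", "term_size": 400, "adjusted_p": 0.001},
    {"name": "term_3", "term_size": 800, "adjusted_p": 0.2},
    {"name": "term_4", "term_size": 1000, "adjusted_p": 0.01},
]
significant = filter_terms(terms)
```

Both size bounds are strict here, matching the wording "greater than 500 and less than 1,000".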

The input matrix was collected for hepatocytes or concatenated across all fetal and adult cell types. Pairwise Pearson correlation was computed across all collected regions, resulting in a score for every pair of motifs. We performed discovery using the matrix for all cell types in the GET catalogue. The final database retains interactions with the top 5% absolute effect size. For each interaction, we performed structural analysis of the two TFs with the highest expression in the corresponding cell types.
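A minimal sketch of the pairwise scoring step, assuming a small regions-by-motifs matrix. The rule for how many pairs count as the top 5% (at least one pair, rounded down otherwise) and the toy values are assumptions of this example.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def motif_pair_correlations(matrix):
    """Correlate every pair of motif columns across regions (rows)."""
    n_motifs = len(matrix[0])
    columns = [[row[m] for row in matrix] for m in range(n_motifs)]
    return {
        (i, j): pearson(columns[i], columns[j])
        for i, j in combinations(range(n_motifs), 2)
    }


def top_fraction(scores, fraction=0.05):
    """Keep the pairs with the largest absolute score."""
    n_keep = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=lambda pair: abs(scores[pair]), reverse=True)
    return ranked[:n_keep]


# Four hypothetical regions (rows) and three motifs (columns).
matrix = [[1, 2, 1], [2, 4, 3], [3, 6, 2], [4, 8, 4]]
correlations = motif_pair_correlations(matrix)
strongest = top_fraction(correlations)
```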

The GET architecture is similar to the state-of-the-art model Enformer4. The following changes, centred on the regulatory element embedding layer, helped GET to exceed the performance of that model. A masked regulatory element mechanism was used to learn the general cis and trans interactions between regulatory elements and TFs across human cell types. A random set of positions was selected uniformly to be masked out.
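The uniform masking step can be illustrated as follows. The mask ratio of 0.5, the fixed mask vector and the function names are assumptions for this sketch; in the model the mask token is a learnable parameter.

```python
import random


def sample_mask(n_regions, mask_ratio, rng):
    """Choose positions to mask uniformly at random, without replacement."""
    n_mask = int(round(n_regions * mask_ratio))
    return sorted(rng.sample(range(n_regions), n_mask))


def apply_mask(x, positions, mask_token):
    """Return a copy of x in which the chosen rows are replaced by mask_token."""
    masked = [list(row) for row in x]
    for p in positions:
        masked[p] = list(mask_token)
    return masked


rng = random.Random(0)
x = [[float(i)] * 3 for i in range(200)]  # 200 regions, 3 toy features
mask_token = [-1.0, -1.0, -1.0]           # stand-in for a learnable token
positions = sample_mask(len(x), mask_ratio=0.5, rng=rng)
x_masked = apply_mask(x, positions, mask_token)
```

A training objective would then compare the model's predictions at `positions` with the original rows of `x`.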

We computed Spearman correlations in the cell-type-specific and cell-type-agnostic settings. Input × gradient scores were used to construct the matrix for computational efficiency. All genes with promoter overlap with open chromatin peaks were used in the calculation for the cell-type-specific settings. Causal discovery was performed using LiNGAM69. To create a matrix for the cell-type-agnostic settings, 50,000 genes were randomly sampled from all the cell types and subjected to the LiNGAM algorithm with the default parameters.

We downloaded the known physical interaction subnetwork from the STRING database and kept interactions with a score greater than 400 as the ground truth labels to benchmark the predicted causal edges. Physical interactions between TFs were mapped to motif clusters on the basis of the motif cluster annotations. The resulting motif–motif physical interaction network was then compared with our predictions to calculate the precision. We also downloaded and compiled all significant interactions determined by mass spectrometry40 and mapped them to motif–motif interactions for comparison. For comparison with ChIP–seq colocalization, we acquired colocalization results between ChIP–seq tracks for 677 TFs in HepG2 from TF Atlas. The method for calculating colocalization is documented in the ChIP-Atlas repo. Each ChIP–seq peak set was stratified into three tiers (high, mid and low). Colocalization was then evaluated between tiers and scored, with a preference for high–high colocalization. If the strong binding peaks of P1 overlap with the strong binding peaks of P2, the P1–P2 interaction is considered more robust than if they overlap with the weak binding sites of P2. We kept colocalizations stronger than mid–mid interactions in Fig. 4b, as these represent the more reliable interactions. A stronger cutoff (score ≥ 9, keeping only high–high interactions) reduced the performance to a 0.097 macro F1 score at 2% recall.
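The benchmarking logic can be sketched as a set comparison between predicted and reference edges. The edge names and scores below are invented, edges are treated as undirected, and a single F1 value is computed here rather than the macro-averaged F1 reported above.

```python
def undirected(edges):
    """Treat (a, b) and (b, a) as the same interaction."""
    return {frozenset(edge) for edge in edges}


def precision_recall_f1(predicted, truth):
    pred, true = undirected(predicted), undirected(truth)
    tp = len(pred & true)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Hypothetical reference edges with confidence scores; keep those above 400.
reference = [
    ("A", "B", 900),
    ("C", "D", 450),
    ("E", "F", 700),
    ("G", "H", 410),
    ("A", "C", 150),
]
truth = [(a, b) for a, b, score in reference if score > 400]
predicted = [("B", "A"), ("B", "C"), ("D", "C")]
precision, recall, f1 = precision_recall_f1(predicted, truth)
```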

pLDDT from AlphaFold is a reliable protein domain caller owing to its accurate structure prediction performance. We segmented each TF protein sequence into low and high pLDDT regions. Empirically, we found that 80% (recall) of known DNA-binding domains could be easily identified using high pLDDT regions plus a high ratio of positively charged residues. We first computed the smoothed pLDDT with a ten-amino-acid moving average and we normalized it by dividing by the maximum. After that, any region that had a smoothed pLDDT score less than 0.6 was defined as a low pLDDT region. If two low pLDDT regions were close (less than 30 amino acids), they were merged into one. Any region that was not a low pLDDT region was labelled as a high pLDDT region.
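The segmentation procedure can be sketched as follows. The window alignment at the sequence ends, the half-open interval convention and the synthetic pLDDT profile are assumptions of this example; the thresholds (ten-residue window, 0.6 cutoff, 30-residue merge distance) come from the text.

```python
def smooth(values, window=10):
    """Moving average over [i - window//2, i + window//2), truncated at the ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half):min(len(values), i + half)]
        out.append(sum(chunk) / len(chunk))
    return out


def low_plddt_regions(plddt, window=10, threshold=0.6, merge_gap=30):
    """Return merged half-open (start, end) intervals of low smoothed pLDDT."""
    smoothed = smooth(plddt, window)
    peak = max(smoothed)
    normalized = [v / peak for v in smoothed]
    regions, start = [], None
    for i, value in enumerate(normalized):
        if value < threshold:
            if start is None:
                start = i
        elif start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(normalized)))
    merged = []
    for s, e in regions:
        if merged and s - merged[-1][1] < merge_gap:
            merged[-1] = (merged[-1][0], e)  # close neighbours become one region
        else:
            merged.append((s, e))
    return merged


def high_plddt_regions(low_regions, length):
    """Everything that is not a low region is labelled high."""
    out, position = [], 0
    for s, e in low_regions:
        if s > position:
            out.append((position, s))
        position = e
    if position < length:
        out.append((position, length))
    return out


# Synthetic profile: two disordered stretches separated by a short ordered one.
profile = [90.0] * 50 + [30.0] * 40 + [90.0] * 10 + [30.0] * 40 + [90.0] * 60
low = low_plddt_regions(profile)
high = high_plddt_regions(low, len(profile))
```

In this toy profile the two low stretches are 13 residues apart after smoothing, so they are merged into a single low pLDDT region.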

The initial configuration was prepared from a predicted PDB file. The Amber99SB-dispersion (a99SBdisp) force field was used for system parameterization. The box size was defined as 1 nm. Subsequently, the system was solvated using the TIP4P water model through the solvate module. To neutralize the system and generate physiological ion concentrations, sodium (Na+) and chloride (Cl−) ions were added using the genion module. Energy minimization terminated upon reaching a maximum force below 1,000 kJ mol−1 nm−1. Each minimization iteration used a step size of 0.01, and minimization was configured to run for a maximum of 50,000 steps. The system was then equilibrated in two steps, first in the NVT (constant number, volume, temperature) ensemble and then in the NPT (constant number, pressure, temperature) ensemble, for 100 ps of simulation time. A 100 ns production run was then performed, and trajectories and energy profiles were stored for analysis. All configuration files are available at the Proscope repo (https://github.com/fuxialexander/proscope). In our analysis, we found a correlation between pLDDT scores and the instability of the multimer structure, consistent with the results of previous studies.

Source: A foundation model of transcription across human cell types

HeLa and REH Cells Transduced with Proximity Labelling Constructs and Coimmunoprecipitation Negative Controls

HeLa and REH cells were purchased from ATCC. Cell lines purchased from a certified cell line bank were not further authenticated. All cell lines tested negative for mycoplasma. No commonly misidentified cell lines were used in the study.

HeLa cells were cultured in medium supplemented with 10% fetal bovine serum at 37 °C and 10% CO2. HeLa cell lysates were generated with a 0.5% NP-40 lysis buffer containing a phosphatase and protease inhibitor cocktail. Samples were incubated with 5 µg agarose-conjugated TFAP2A primary antibody (Santa Cruz Biotechnology, sc-12726 AC) overnight at 4 °C before being run in Laemmli loading buffer (BioRad, 1610737). Samples were separated on Tris–glycine gels and transferred to membranes. A repeat experiment was performed for coimmunoprecipitation negative controls, which were probed with primary antibodies against SRF (Abclonal, A16718, 1:750) and β-actin (Cell Signaling Technology, 4967, 1:10000), followed by chemiluminescence detection.

The PAX5 G183S mutant was cloned into pcDNA3.1-MCS-13Xlinker-BioID2-HA (no. 80899)71. After verification, we subcloned PAX5-WT-13Xlinker-BioID2-HA and PAX5-G183S-13Xlinker-BioID2-HA into the pCDH-GFP-puro vector (System Bioscience, CD513B-1). We used pCDH-PAX5-WT-13Xlinker-BioID and pCDH-G183S-13Xlinker-BioID to transduce the REH B-ALL cell line. The proximity labelling assay was performed following previously published methods71,72,73. The following REH stable cell lines were used as controls: pCDH-13Xlinker-HA-GFP, pCDH-PAX5-WT-13Xlinker-HA-GFP and pCDH-G183S-13Xlink… The cells were collected, washed twice and then placed in lysis buffer (10 mM Tris-HCl pH 8.0 and NaCl, 10 mM K, 10 mM; supplemented with protease and phosphatase inhibitors from Life Technologies) for 50 minutes on ice. Lysates were clarified by centrifugation at 21,000g for 15 min at 4 °C. We quantified total protein with a Pierce BCA kit and incubated 1 mg of extract with 100 µl of magnetic streptavidin beads. We washed the beads with lysis buffer, then with 2 M urea in Tris-HCl pH 8.0, and twice again with lysis buffer. Biotinylated proteins were eluted by boiling in 4× protein loading buffer supplemented with 2 mM biotin and 50 mM dithiothreitol at 95 °C for 10 min. Biotinylated proteins in total protein extracts or immunoprecipitates were detected by western blotting using standard protocols and the following antibodies: streptavidin–HRP antibody (Life Technologies, catalogue no. …), anti-HA (Cell Signaling, catalogue no. 3724) and anti-Nr2C2 (Cell Signaling, catalogue no. …). GelAnalyzer 23.1 software was used for detection and quantification.

The experimental procedure involves designing a library of lentivirus vectors that contain both desired sequence elements and a mini promoter. The vector is randomly inserted into the genome through viral infection; the regulatory activity is then measured through sequencing and counting the log copy number of transcribed RNAs and integrated DNA copies.

Modeling contact frequencies across genes with a convolutional neural network with SCALE-normalized observed frequencies: Enformer, DeepSEA and DNase/ATAC

The model could be improved by learning to predict cell-type-specific three-dimensional contacts from the GET region embeddings.

All scores in this benchmark (ABC, Enformer, GET, HyenaDNA, DeepSEA and DNase/ATAC) were further normalized across each gene’s ±100 peaks to make them comparable across genes.

The importance of one-dimensional genomic distance in governing knockout effects has been highlighted in recent studies. Most methods in this benchmark include a component of the genomic distance. For example, Enformer incorporates exponential decay in its positional encodings. The benchmarking results of HyenaDNA follow an exponential decay from the TSS, and the model includes a sinusoidal encoding over the DNA sequence. We therefore added distance information to GET. We designed a simple module, DistanceContactMap, that converts the distance map between peaks to a pseudo-Hi-C contact map. DistanceContactMap is a two-dimensional convolutional neural network that takes the distance map as input and predicts SCALE-normalized observed contact frequencies. A Poisson negative log-likelihood loss was used to train the model. We trained DistanceContactMap with the same K562 Hi-C data (ENCFF621AIY) used for training ABC Powerlaw, resulting in a 0.855 Pearson correlation, which mostly captured the exponential decay in contact frequency. The model predictions were termed GET Powerlaw. The other two scores shown in Fig. 3d are defined as follows:
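For illustration, the two ingredients named above (a pairwise distance map between peaks and a Poisson negative log-likelihood) can be written in a few lines. This sketch omits the convolutional network itself and the constant term of the likelihood that does not depend on the prediction; the peak coordinates and the `eps` guard are assumptions.

```python
import math


def distance_map(peak_positions):
    """Pairwise absolute genomic distance between peak centres."""
    return [[abs(a - b) for b in peak_positions] for a in peak_positions]


def poisson_nll(predicted, observed, eps=1e-8):
    """Mean Poisson negative log-likelihood, dropping the log(observed!) term."""
    total = 0.0
    for p, t in zip(predicted, observed):
        total += p - t * math.log(p + eps)
    return total / len(predicted)


peaks = [1000, 6000, 26000]  # hypothetical peak centres in base pairs
distances = distance_map(peaks)

# The loss is smallest when the predicted rate equals the observed count.
loss_exact = poisson_nll([2.0], [2.0])
loss_under = poisson_nll([1.0], [2.0])
loss_over = poisson_nll([3.0], [2.0])
```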

HyenaDNA: we used the largest pretrained model available through Hugging Face. To score enhancer–gene pairs, we performed in silico mutagenesis by knocking out the enhancer element (that is, setting each base pair in the enhancer region to the unknown nucleotide N in the vocabulary set) and comparing against the wild-type likelihood of observing the promoter sequence.
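The in silico mutagenesis step can be sketched independently of any particular sequence model. Here `toy_log_likelihood` is a made-up stand-in for a language model's promoter log-likelihood, and the coordinates are hypothetical; only the masking with N and the wild-type versus mutant comparison reflect the description above.

```python
def knock_out(sequence, start, end):
    """Replace every base in [start, end) with the unknown nucleotide N."""
    return sequence[:start] + "N" * (end - start) + sequence[end:]


def enhancer_score(sequence, enhancer, promoter, log_likelihood):
    """Drop in promoter log-likelihood caused by masking the enhancer."""
    wild_type = log_likelihood(sequence, promoter)
    mutant = log_likelihood(knock_out(sequence, *enhancer), promoter)
    return wild_type - mutant


def toy_log_likelihood(sequence, promoter):
    # Hypothetical scorer: every informative context base adds 0.1.
    start, end = promoter
    context = sequence[:start] + sequence[end:]
    informative = sum(1 for base in context if base != "N")
    return -float(end - start) + 0.1 * informative


sequence = "ACGTACGTACGT"
mutated = knock_out(sequence, 2, 5)
score = enhancer_score(
    sequence, enhancer=(2, 5), promoter=(8, 12),
    log_likelihood=toy_log_likelihood,
)
```

With a real model, `log_likelihood` would be replaced by a call that sums token log-probabilities over the promoter positions.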

Enformer: we used Enformer’s contribution score (gradient × input) with background normalization, following the normalization procedure described by Gschwind et al.33.

Understanding the powerlaw function in the ABC repo from K562 Hi-C data using scikit-learn Linear Regression

ABC Powerlaw: the powerlaw function in the official ABC repo was evaluated using the scale and parameter values that were fitted on K562 Hi-C data.
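A power-law contact estimate has a simple log-linear form, sketched below. The exact parameterization, the `+ 1` offset and the parameter values are assumptions chosen for illustration and should not be read as the ABC repository's implementation or its fitted K562 values.

```python
import math


def powerlaw_contact(distance, gamma, scale):
    """Contact estimate that decays as a power of genomic distance.

    In log space: log(contact) = scale - gamma * log(distance + 1).
    """
    return math.exp(scale - gamma * math.log(distance + 1))


# Hypothetical parameters, not fitted values.
near = powerlaw_contact(9, gamma=1.0, scale=0.0)
far = powerlaw_contact(99, gamma=1.0, scale=0.0)
```

Because the relationship is linear in log-log space, the two parameters can be estimated with an ordinary least-squares fit of log contact against log distance.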

In our analysis and regulatory interpretation, we primarily used the binary ATAC model. This approach offers improved attribution to sequence features, ensuring that the model does not overly depend on accessibility signal strength as a surrogate for sequence characteristics.

In this study, we made sure that GET learns useful regulatory information and that it gives valuable biological insights. The method used to understand GET is outlined below.

Cell type purity and heterogeneity: the dynamic and heterogeneous nature of certain cell types, such as stem cells, and the precision of identification and classification of cell types can introduce variability in gene expression profiles, complicating the prediction task.

Cell type rarity and library size: both can affect the accuracy of predictions.

We chose linear regression because our setting aligned better with regression than with classification. We used scikit-learn LinearRegression with default parameters.

Getting Enformer up to speed: Precision and fine-tuning of the leave-one-chromosome-out assay for K562 CAGE

We found an average Pearson correlation of 0.81 for leave-one-chromosome-out predictions across all 22 autosomes. We used GET to make the prediction for K562 CAGE. We note that this comparison privileges Enformer, which was trained extensively on CAGE tracks, including K562 (track ID: 4828 and 5111), whereas GET needed to be transferred to the new assay. We evaluated the predictions summed across the two CAGE output tracks for the left-out peaks. We selected chromosome 14 because it did not appear in the public Enformer checkpoint’s training or validation set. GET was trained in three settings.
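The leave-one-chromosome-out scheme amounts to one train/test split per chromosome, as in this minimal sketch. The record layout (chromosome, peak identifier) and the example peaks are assumptions.

```python
def leave_one_chromosome_out(records):
    """Yield (held_out, train, test) with one chromosome held out per fold.

    records is a list of (chromosome, item) tuples.
    """
    chromosomes = sorted({chrom for chrom, _ in records})
    for held_out in chromosomes:
        train = [r for r in records if r[0] != held_out]
        test = [r for r in records if r[0] == held_out]
        yield held_out, train, test


records = [
    ("chr1", "peak_a"),
    ("chr1", "peak_b"),
    ("chr2", "peak_c"),
    ("chr14", "peak_d"),
]
folds = list(leave_one_chromosome_out(records))
held_out_14 = [test for name, _, test in folds if name == "chr14"][0]
```

A per-fold correlation would be computed on each `test` set and then averaged over folds to give a single summary value.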

In pretraining, the base model was trained on the fetal–adult atlas with binarized ATAC signal; in fine-tuning, we also used binarized ATAC.

QATAC from QATAC fine-tuned: in this setting, the base model was the leave-out-astrocyte RNA-seq prediction model trained on the fetal accessibility and expression atlas. We further fine-tuned this model using quantitative ATAC signal.

These experiments leveraged LoRA parameter-efficient fine-tuning to achieve significant gains in time and storage complexity. On a single RTX 3090 GPU, all fine-tuning converged within 30 min, resulting in a 3 MB K562-CAGE-specific adaptor that could be merged into the base model.

To explore the impact of omitting motifs in the input features, we used K562 scATAC-seq data from ENCODE (accession: ENCFF998SLH) and evaluated the ATAC prediction performance when holding out randomly selected motifs. Peaks were called using a q-value threshold. Then, we merged this peak set with the union peak set from the fetal pretraining data, keeping the peaks with at least ten counts in K562. The pretrained checkpoint used for motif analysis was the one used for the parameter-efficient fine-tuning described above.

In general, GET showed robust performance when leaving out one to ten motifs. The performance degraded heavily when leaving out 20 motifs with a top 20% cutoff for each motif independently, owing to removal of most of the training data.

Owing to these biases, it is difficult to directly apply a model trained on one dataset to a new platform without fine-tuning. For a new dataset with multiple cell types, we took the leave-out cell type approach. For a dataset with only one cell type, we used leave-out-chromosome training.

In summary, we used ATAC-seq and expression data from refs. 1,22. In total, the dataset encompassed 1.3 million single nuclei. The data were only presented in pseudobulked format. All cell types were primary cell types from normal tissue. The pretraining dataset did not include disease states. We incorporated further datasets in downstream tasks such as K562 and zero-shot analysis in tumour cells.

Low-Rank Adaptation for Random Forest Regression: Masked Regulatory Element Embeddings and Output Activation

Random forest: we used scikit-learn RandomForestRegressor with ten estimators and max depth 10. Two-dimensional output was handled by MultiOutputRegressor.

CNN: three Conv1d layers (layer dimensions: 283 input, 128, 64, 32, 3 kernel size) followed by FC(32, 512) → ReLU → FC(512, 2); SoftPlus was used for output activation. We used the same parameters and optimizer that we used in GET.

GET provides an option to perform parameter-efficient fine-tuning over any specific layer through low-rank adaptation (LoRA)64. This is commonly used to adapt to a new assay or platform; we apply LoRA to the region embedding and encoder layers, while doing full fine-tuning on the prediction head. This reduces the number of trainable parameters by 99%.
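The idea behind low-rank adaptation can be shown with plain lists: a frozen weight matrix is augmented by the product of two thin matrices, and only those thin matrices are trained. The layer width of 768, the rank of 8 and the scaling factor are hypothetical values for this sketch and are not GET's configuration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_merge(weight, lora_a, lora_b, alpha):
    """Merge an adaptor into the base weight: W + (alpha / rank) * B @ A.

    weight is d_out x d_in, lora_a is rank x d_in, lora_b is d_out x rank.
    """
    rank = len(lora_a)
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [
        [weight[i][j] + scale * delta[i][j] for j in range(len(weight[0]))]
        for i in range(len(weight))
    ]


def trainable_fraction(d_out, d_in, rank):
    """Adaptor parameters as a fraction of the full weight matrix."""
    return rank * (d_in + d_out) / (d_in * d_out)


weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]    # rank 1
lora_b = [[1.0], [0.0]]
merged = lora_merge(weight, lora_a, lora_b, alpha=1.0)
fraction = trainable_fraction(d_out=768, d_in=768, rank=8)
```

Merging the adaptor back into the base weight means inference costs nothing extra, which is consistent with storing only a small task-specific adaptor per assay.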

We selected the best model checkpoints for subsequent evaluation on the basis of validation loss.

where \({\rm{z}}_{l}^{\prime}\) denotes the intermediate representation in block \(l\), \({\rm{z}}_{l-1}\) denotes the output from block \(l-1\), LN is the layer normalization and FFN is the feed-forward network. The GELU activation is applied in the feed-forward network layer.

Similar to the Vision-Transformer-based Masked Autoencoders62, we replaced the regions in the selected positions with a shared but learnable \([{\rm{MASK}}]\) token; the masked input regulatory element is denoted by \(X^{\text{masked}}=(X,M,[{\rm{MASK}}])\), where \(X=\{x_{i}\}_{i=1}^{n}\) is the input sample with \(n\) regulatory elements. The training goal is to predict the original values of the masked elements \(M\). Specifically, we take masked regulatory element embeddings \(X^{\text{masked}}\) as input to GET, while a simple linear layer is appended as the prediction head. Therefore, the overall objective of self-supervised training can be formulated as:

The GET implementation is based on the PyTorch framework. For the first training stage, we applied AdamW as our optimizer with a weight decay of 0.05 and a batch size of 256. The model was trained to scale from 800 to 40. The maximum learning rate is 1.5 × 10−4. The training takes about a week on a cluster with 16 V100 GPUs. For the second fine-tuning stage, we used AdamW63 as our optimizer with a weight decay of 0.05 and a batch size of 256. Fine-tuning took around 8 h to complete, using eight A100 GPUs. Inference for all genes in a single cell type takes several minutes, making it possible to perform large-scale screening.

We include a more detailed description of the optimization hyperparameters, computation infrastructure and convergence criteria used in the development of the model in the section below.

Infrastructure: fine-tuning was performed on eight NVIDIA A100 GPUs to ensure consistency.

Epochs and duration: the fine-tuning process was shorter, consisting of 100 epochs, and completed in around one day. This phase is essential for adapting the pretrained model to different tasks.

Modeling Accessibility of Pseudobulk Genomic Regions for MHA Sequencing and Prediction of GenCode V.40 Experiments

\[{\rm{z}}_{l}^{\prime}={\rm{MHA}}({\rm{LN}}({\rm{z}}_{l-1}))+{\rm{z}}_{l-1},\qquad {\rm{z}}_{l}={\rm{FFN}}({\rm{LN}}({\rm{z}}_{l}^{\prime}))+{\rm{z}}_{l}^{\prime}\]

where \(W_{q},W_{k}\in {\mathbb{R}}^{(n\times D)\times d_{\rm{k}}}\) and \(W_{v}\in {\mathbb{R}}^{(n\times D)\times d_{\rm{v}}}\) are learnable linear transformations.

For identification of cell-type-specific accessible regions, the peak calling results from the original studies of each dataset were used to obtain a union set of peaks. Subsequently, to compile a list of accessible regions specific to each cell type, we filtered out peaks with no counts.

The number of fragments in a given region for a certain cell type was used to determine the accessibility score for that genomic region. The counts were normalized through the log CPM procedure to enhance the model generalizability. Specifically, let t be the total fragment count in a pseudobulk, and let ci be the fragment count in region i. Then, the accessibility score si can be computed as:
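A log CPM transform of the kind described can be sketched as below. The base-10 logarithm and the pseudocount of 1 are assumptions of this example, since the formula itself is not reproduced in the text above.

```python
import math


def log_cpm(counts):
    """Counts per million followed by log10(x + 1).

    counts holds the fragment count c_i of every region in one pseudobulk;
    their sum plays the role of the total fragment count t.
    """
    total = sum(counts)
    return [math.log10(c / total * 1e6 + 1) for c in counts]


scores = log_cpm([0, 1, 999_999])
```

Dividing by the total makes pseudobulks of different sequencing depth comparable, and the logarithm compresses the dynamic range of highly accessible regions.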

For experiments encompassing multiomics, the correspondence between accessibility and expression was inherently determined through cell barcodes. Cell type annotations were used in pseudobulk cases to facilitate the mapping. Specifically, the fetal expression atlas from Cao et al.23 was used for fetal cell types, whereas adult data were extracted from Tabula Sapiens24. When several ATAC pseudobulks shared the same cell type annotation, identical expression labels were assigned. This compromise was necessary owing to the shortage of multiome sequencing data, a situation that is expected to change in the near future.

Each region was assigned expression values. Owing to the limitations of poly(A) scRNA-seq data, only aggregated mRNA levels could be captured, resulting in values that do not reflect the nascent transcription rate, which is more closely tied to regulatory events. Nevertheless, these values provided valuable cell-type-specific information. The process begins with intersecting the input region list with the GenCode v.40 transcript annotations, followed by the assignment of log CPM values to regions related to a promoter. The value of the remaining regions is set to 0. The zero label on non-promoter regions helps to provide negative labels to the model, although it does not represent all transcription events happening in a cell.

In alignment with the 200 × 283 input matrix, the target is a 200 × 2 matrix, representing the transcription levels of the corresponding 200 regions on the positive and negative strands.
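The construction of the strand-specific target matrix can be sketched as follows. The tuple layout, the convention that column 0 is the positive strand and column 1 the negative strand, and the use of the maximum when several promoters fall in one region are assumptions of this example.

```python
def build_targets(regions, promoters):
    """Build a regions x 2 target matrix of expression labels.

    regions: list of (start, end) half-open intervals.
    promoters: list of (tss_position, strand_index, expression_value).
    Regions without a promoter keep the zero label on both strands.
    """
    targets = [[0.0, 0.0] for _ in regions]
    for index, (start, end) in enumerate(regions):
        for tss, strand, value in promoters:
            if start <= tss < end:
                targets[index][strand] = max(targets[index][strand], value)
    return targets


regions = [(100, 600), (1000, 1500), (3000, 3400)]
promoters = [(250, 0, 2.5), (1200, 1, 1.0), (1300, 1, 1.8)]
targets = build_targets(regions, promoters)
```

With 200 regions per sample this produces the 200 × 2 target that pairs with the 200 × 283 input matrix.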