The adult mouse brain is made of several different types of cells

admin

2 years ago

Nature: https://www.nature.com/articles/s41586-023-06818-7

On the Hi-C signal at subclass-specific cCRE-gene pair anchors: An extended analysis using 10X Genomics (v.3) kit

The 10X Genomics (v.3) kit was used for all single-nucleus experiments according to the manufacturer’s protocol recommendations. The manufacturer recommended the preparation of the library. Libraries were pooled and sequenced on NovaSeq S2.

The mouse brain’s orthologous cCREs were identified using the liftover tool. The orthologous cCREs were defined as the ones for which at least half of the bases in the human genome came from the mouse. We next compared these orthologous cCREs in the mouse brain with our previously identified cCREs in the human brain23. Those orthologous cCREs, which both were DNA sequence conserved across species and had open chromatin in orthologous regions, were defined as chromatin-accessibility-conserved cCREs. The other orthologous cCREs, which were only sequence conserved to orthologous regions but had not been identified as open chromatin regions in other species, were defined as chromatin-accessibility-divergent cCREs. Mouse-specific cCREs were those ones that were not able to find orthologous regions in the human genome.

To evaluate the confidence of identified subclass-specific cCRE–gene pairs, we randomly selected 11 major subclasses (Sst_GABA, Pvalb_GABA, CBX_MLI_Megf11_GABA, Vip_GABA, CA1-ProS_Glut, CB_granule_Glut, L6_CT_CTX_Glut, L2-3_IT_CTX_Glut, Astro-TE_NN, Microglia_NN, Bergmann_NN), and calculated the Hi-C signal enrichment (at 1 kb resolution) at the top 20% subclass-specific cCRE–gene pair anchors identified in this study. We found that there is statistically significant higher enrichment (P = 0.004) of chromatin interaction signal at the corresponding subclass-specific cCRE–gene pair anchors, compared with non-corresponding pair anchors (Extended Data Fig. 10g), suggesting that subclass-specific cCRE–gene pairs are more likely to interact in the cell types in which the cCREs are active.

A subclass-level deep-learning model with hypo-D MRIs based on the script basenji_data.py from Basenji65

We trained the subclass-level deep-learning model on four NVIDIA A100 80 GB GPUs using the script basenji_train.py from Basenji65. The parameters were set for training to be 32, 150 and 30.

The i represent the cell type, and j represents genomic bins. The ytrue is a matrix that is calculated from signals. The ypred represents the predicted genomic bin-by-type matrix. The pairwise covariance wi,i was calculated between cell types. We then sum the scores across rows and normalize the number of cell types as weights. Last, the weights w was dot multiplied by the original poisson loss.

The model architecture, layers, and parameters are adapted from the mouse model and only the last output head layer with “units” is changed. To encourage the model to predict cCREs in under-represented cell types, we created one novel loss function:

Our model was trained on all 275 subclasses annotated based on the integration with the scRNA-seq data. A collection of genome signal tracks were generated with the help ofMACS 236. The script basenji_data.py from Basenji65 has been used to create the training, validation, and testing datasets.

GRNs were constructed in each major brain region, with genes (TF and non-TF) and DMRs as nodes and three types of edge. The GRNs were summarized into many triplet combinations, each of which contains a TF, a DMR, a target gene (including TFs) and edges between each other. A section is available in the “TF motifs enrichment” section, where we include the characters whose motif is enriched in hypo-D MRs. We use only a union set of genes with absolute log2 fold change > 1 and FDR < 0.01 (described in the section “DEGs”) in any pairwise comparisons between clusters. The weights of edges were computed on the basis of the PCC between clusters, only the edges showing a significant correlation in the shuffle tests were kept. Before computation of the PCC, all data were quantile normalized within each cluster (across features). The edges include the following three categories: TF–DMR edges, connected if the PCC between TF expression and mCG level of DMRs is significant (FDR < 0.01) and the TF has a binding site in the DMR predicted by Cluster-Buster; TF–gene edges, connected if the PCC between TF expression and target gene expression is significant (FDR < 0.01); and DMR–gene edges, connected if the PCC between mCG level of DMRs and target gene expression is significant and the distance between gene transcription start site and the DMR is within 1 megabase.

To analyse the variability of noise, the signal from the TE-cCREs was aggregated. We normalized and averaged the correlation of the two things to find out what subclasses the correlation was for. mCG signal for each TE in matched subclasses from the companion paper41. To calculate the correlation between chromatin accessibility and RNA expression, we aggregated RNA signals at TE-cCREs of each TE in matched subclasses from a previous study99.

The RepeatMasker and ref Gene annotations were used to create the annotated version of cCREs. To define the high TE-cCREs fraction of subclasses, we fitted a mixture model for the TE-cCRE fraction across all subclasses using the R package mixtools97 (v.2.0.0). The null distribution is used to calculate the P value.

We selected the two peak modules that show global accessibility based on the NMF analysis. We then selected all of the proximal–distal connections with cCREs in the peak modules above and ranked the proximal–distal connections based on the highest Cicero scores they have. We treated them as global proximal–distal connections and performed the Hi-C signals by aggregating all of the Hi-C data. The strong enrichment signals for the global connections were observed from the heat maps.

Clustering the Occupancy Profile of cCREs at FIREs Using Permutation Analysis. A Study of Mouse Genome

We next tested whether cCREs are enriched at FIREs through permutation analysis. In brief, we shuffled the mouse genome 1,000 times, each time generating 1,053,811 random regions with equivalent sizes as the cCREs. We took the number of overlaps between the randomly generated regions and the fires and figured it out. The number of overlaps on FIREs was higher than expected and cCREs were enriched significantly.

We called frequently interacting regions (FIREs) in the mouse cortex38 by applying the criteria in our group’s FIRE paper95. The results showed that FIREs overlap with the cCREs in the mouse brain, and a small fraction of the cCREs do.

We used the coefficients to associate the modules with the cell clusters. In the coefficient matrix, each row represents a module, and each column represents a cell cluster. The values in the matrix indicate the weights of the clusters in their corresponding module. The coefficients were scaled by column from 0 to 1. Subsequently, we used a coefficient > 0.1 (~95th percentile of the whole matrix) as a threshold to associate a cluster with a module.

The basis matrix and the coefficients Matrix define accessible cCREs and the component weights in each module. The key issue to decompose the occupancy profile matrix was to find a reasonable value for the rank R (that is, the number of modules). There are several criteria that have been proposed to decide if a rank R will break down the Occupancy profile matrix into meaningful clusters. Here we applied some measurements, Sparseness. and Entropy42, to evaluate the clustering result. The measurements are stable because five NMF runs with a random seed were used to calculate the average values.

We applied the function to extract the cell from the bin count matrix. The size of a consecutive genomic region was chosen as 500 bp. The top 2.5% and the bottom 2.5% bins were removed based on the read coverage from the count matrix. Only the first two copies of Y and X were considered. For our L1-level clustering, we used all of the bin features (over 4 million) that passed the criteria above as non-neuronal cells and diverse neuronal cells were all included. The default top 500,000 features were used for clustering other levels.

For every cell cluster above, we combined all properly paired reads to generate a pseudobulk ATAC–seq dataset for individual biological replicates. Half of the reads from each biological replicate were generated by us. We called peaks for each of the four datasets and a pool of both replicates independently. Peak calling was performed on the Tn5-corrected single-base insertions. The shift was -75 with no model call-summits. We extended peak summits by 250 bp on either side to a final width of 501 bp so we could do a merging and downstream analysis. If the number of cells in any of the pseudobulk ATAC–seq from either individual biological replicates or individual pseudoreplicates is fewer than 200, we did not run MACS2 for it. This was done to reduce the potential false negatives caused by limited cells in the replicates.

We retained peaks that were detected in the pooled dataset and were found to overlap 50% of peak length with a peak in both replicates, in order to create a list of reproducible peaks.

Peak calling was performed on 1,482 L4-level subtypes at the same time as the filter for the peaks at the single-cell level. Our previous study 19 has 9a in it. Before calling peak times, we merged clusters with less than 200 cells if they shared a common cell cluster annotation and were in the same L3-level cluster. 1,463 different types were used.

Act-seq is what the scRNA-seq data in ref. 7 is known as and they were downloaded from Mendeley Data. In the dataset there are 168,877 cells and 78,476 of them are labelled as neurons and used to integrate the scRNA-seq data for the hypothalamic neuron. PCA was fitted with scRNA-seq data and the Act-seq data were transformed. The CCA framework was used to find anchors between the two datasets, and the transformed PCs of Act-seq were aligned to the scRNA-seq PCs. We used the label transfer method described in the previous section to transfer the mC–RNA co-cluster labels from scRNA-seq cells to the Act-seq cells, considering five neighbouring scRNA-seq cells for each Act-seq cell. The Act-seq cells with Fos expression level > 0 were considered to be Fos+ cells, and the proportions of Fos+ cells were compared between control and each behaviour using one-sided Fisher exact tests. The Fos expression levels were compared with one-sided Wilcoxon rank-sum tests.

Brain clustering with the knn function and the dendroGram analysis using Scanpy, UMAP and SnapATAC2

The function knn was used to construct the k-nearest neighbour graph with the input of 50 and the method set to ‘kd tree’. We next used the function leiden of SnapATAC2 for clustering with the parameter object_function set as modularity. The parameter resolution, which affected the number of clusters a lot, was selected from 0.1 to 2 with a step size 0.1 based on the silhouette coefficient89 using the Python package Scikit-learn90. We also manually checked the UMAP81 for each clustering result to make sure that the resolution was suitable corresponding to the top silhouette coefficient. UMAP projections were calculated in the python package using parameters a as 1.866, b as 0.8008, and init as spectral. All of the resolution parameters during clustering are provided in Supplementary Table 3. The final clusters from L3-level clustering and L4-level clustering were represented by the term subtypes in our analysis.

To build the graph of cell type similarity, we used Scanpy to compute connectivities over 20 neighbours using 250 principal components, which are detailed in the section ‘Visualization of clusters’. We aggregated this weighted adjacency matrix row and column wise by taking the average weights of all cells in a given cell type. We used the scikit-network v.0.28.1 to build a dendroGram from our cell type adjacency matrix86. We plotted major cell type markers and examined spatial localization patterns to organize our neuronal clusters into larger sets, comprising a total of 223 groups (metaclusters). Using Scanpy’s rank_genes_groups with the Wilcoxon method, we generated a table of the top 50 differentially expressed genes per metacluster (Supplementary Table 6).

All experimental procedures using live animals were approved by the SALK Institute Animal Care and Use Committee under protocol number 18-00006. Male mice were purchased from Jackson Laboratories. Brains were extracted from 56–63-day-old mice and sectioned into 600 µm coronal sections along the anterior–posterior axis in ice-cold dissection medium2,83. Specific brain regions were dissected according to the Allen Brain Reference Atlas26 (Extended Data Fig. 1) and nuclei were isolated as described previously26. The dissections of the mouse CB region had 2 animals for snATAC–seq library construction, but the other dissections had 4 or more animals of the same sex. We shared the same fluorescence-activated cell sorting (FACS) sequential gating/sorting strategy and the Sony SH800S software with our previous study19.

Paired-end sequencing reads were demultiplexed and the cell index was transferred to the read name. Sequencing reads were aligned to the mm10 reference genome using bwa84. After alignment, we looked at the fragment length contribution from each of the 234 samples. We then combined the sequencing reads to fragments using the make_fragment_file function of SnapATAC229 and, for each fragment, we applied the following quality control criteria: (1) retain only fragments with quality scores MAPQ > 30; (2) remove PCR duplicates. Cell barcodes in read names were used as a basis for sorting the reads, and a shift of +6 bp was made for positive strand and +5 bp for negative strand.

Enrichment of ATAC–seq accessibility at TSSs was used to quantify data quality without the need for a defined peak set. We followed a previously described procedure86, and used the function filter_cells in SnapATAC2 to calculate TSS enrichment (TSSe). TSS positions were obtained from the GENCODE87 database v.16. There were Tn5-corrected insertions that were aligned to positive and negative strands and there were Tn5-corrected copies that were aligned to a negative strand. This profile was then normalized to the mean accessibility ±1,900–2,000 bp from the TSS and smoothed every 11 bp. The maximum profile was taken as the TSSe.

According to the data standards of the EN Code ATAC–seq, 1,000 uniquely mapped fragments and 10 individually mapped units were removed from each of 234 samples. We used the filter_cells function of SnapATAC2 to achieve this.

We used a modified version of Scrublet28 to remove potential doublets for every sample independently using SnapATAC2. We used the add-tile_matrix function to add the 500 bp genomic bin features, then used the Select_features function to find the features with frequencies between 0.5% and 99.5%. We then applied the scrublet function of SnapATAC2 to get the doublet scores. The expected_doublet_rate was set to 0.08, as a result of our previous experiment. Potential doublets with scrublet scores greater than 1.5 were treated as potential doublets and removed from our analysis.

Scrublet was compared to another method called AMULET30, which is used for doublet detection and removal. We evaluated the performance of the two methods using the precision-recall curve (PRC) and the area under the PRC when we simulations contained artificial doublets from eight samples in the primary motor area.

Clustering of Projection- and Behaviour-related mC-seq neurons in SCENIC+ using Hidden Markov Models

Note that because the GRNs were constructed based only on correlations between data modalities and the binding motifs, they inevitably capture indirect and false-positive relationships. Perturbation experiments would be necessary to validate the connections between cis or trans regulators and target genes.

We used an ensemble motif database from SCENIC+ (ref. 44), which contained 49,504 motif position weight matrices (PWMs) collected from 29 sources. SCENIC+ authors used TOMTOM45’s PWM distances to calculate clustering based onundant motifs. Each cluster was made up of at least one mouse TF genes. To use theCluster-Buster46 implementation in SCENIC+, we scanned motifs in the same cluster together with hidden Markov models.

The unbiased snmC-seq cells from each mC–RNA co-cluster were merged to generate pseudobulk methylation profiles. The different genomes of the mice made use of the epi-retro-seq cells impossible. All Cools identified the DMRs in each region group. We then calculated the PCC between DMR mCG and gene mCH fraction. We shuffled the DMRs and genes within each sample to calculate the null PCC and estimate FDR. FDR was used to modify the DMR–target edges.

The total UMI count of the cell and log transformed normalized the Gene expression level of each cell. We did a pairwise comparison of projection neurons clusters. For each cluster pair, the P values were calculated using Wilcoxon rank-sum tests and the fold change was calculated as the average expression level across cells in the two clusters. The genes with absolute values of log2 fold are differentially expressed. The DEGs from all cluster pairs were merged to generate the heatmaps in Figs. 3 and 4. The top 100 DEGs were ranked by FDRs if more than 100 of them were identified in a pair of clusters.

The weak associations between projection-associated and behaviour-associated clusters are probably caused by the small overlap between brain regions profiled in the two datasets. The difference of clusters used for projection association and behaviour association is different than in the case of Epi-retro-seq. 7c–e is a letter. Increasing the size of the dataset could help to further association between the different types of cells in the same area. Our study aimed at a comprehensive view of a large number of projections across the whole brain and focused on targets that do not seem to receive strong input from VMH. This limited the ability to compare studies and the data overlap between them.

Finally, the neurons from each region group were selected and integrated with the scRNA-seq cells from the same region group, using the same procedure as described above. The mC–RNA co-cluster labels were transferred from the scRNA-seq cell to the MERFISH cells. The MM and DCO subclasses had MERFISH cells assigned to them in the last step, but their clusters were not included in the co-cluster analysis.

We believe that this can be accomplished through the registration of MERFISH DAPI images to the common coordinate framework. The procedure is relatively easy and it is important to use cell types that are known landmarks to refine the registration.

The experiments were described in the ref. The genes panel design, tissue preparation, and data processing are included. The dataset has two slices of sagittal and a few slices of coronal, which correspond to 2, 4, 6, 8, 10. The same naming of slices was used throughout this manuscript (Figs. 3e, 4d and 5c and Extended Data Fig. 6).

In this manuscript, we used the absolute proportions but not the relative ones to the unbiased data owing to the different profiling strategies between the two datasets. The same amount of cells from different sources were combined into the 32 different dissections to carry out FANS, so that there is a proportion of cells from different sources. However, the unbiased snmC-seq profiled all of the dissection regions separately and sequenced the same number of cells in each dissection, which manually amplified the proportion of cells from the smaller or sparser dissection regions relative to the larger or denser ones, and limited the power to estimate the real proportion of neurons in each cluster from the sources.

In the fourth step of the process, the cortical cells from 286 experiments were furtherFILTERED to exclude those with a high proportion of neuron types known to not project to the intended injection site. Specifically, for each FANS run, we counted the number of neurons that were observed in known on-target cell types (({O}{{\rm{on}}})) and off-target cell types (({O}{{\rm{off}}})). Assuming that the proportions of contaminated cells in each subclass would be similar to those of a sample without projection-type enrichment, we compared the observed counts to the counts from unbiased snmC-seq data (({E}{{\rm{on}}}) and ({E}{{\rm{off}}})) collected from the corresponding dissections in Extended Data Fig. 1. The fold enrichment was computed as (\frac{{O}{{\rm{on}}}{E}{{\rm{off}}}}{{O}{{\rm{off}}}{E}{{\rm{on}}}}). The enrichment of on-target cells was assessed by using a one-sided binomial test. The P value was computed as (\Pr (X\ge {O}{{on}}{;n},p)), in which (X \sim {\rm{binomial}}(n,p)), where ~ represents distributed as, (n={O}{{\rm{on}}}+{O}{{\rm{off}}}) and (p=\frac{{E}{{\rm{on}}}}{{E}{{\rm{on}}}+{E}{{\rm{off}}}}). For each ET target, we considered ET as on-target subclasses and IT+inhibitory neurons as off-target. There were thresholds for FDR and fold enrichment. For IT targets, we considered IT as on-target subclasses and layer 6 corticothalamic+inhibitory neurons as off-target. The thresholds were for FDR and fold enrichment. This eliminated 32 out of 286 sorting cases (Extended Data Fig. 2e).

To find anchors between snmC-seq and epi-retro-seq, we used the snmC-seq data to fit a PCA model and use the model to transform epi-retro-seq cells to the same space and find the five nearest snmC-seq cells for each epi-retro-seq cell. Reciprocally, we fit another PCA model with the epi-retro-seq cells and transform the snmC-seq cells and find five nearest epi-retro-seq cells for each snmC-seq cell. The mutual neighbours between the two datasets were used to score using the same method as Seurat v3.

We adopted a similar framework to that of Seurat v3 (ref. 41) for data integration by first identifying the mutual nearest neighbours as anchors between datasets, and then aligning the datasets through the anchors.

Predicting the Projector Group Label of Each Neuron using a Computational Replica and a Comparison with Biological Replications

Logistic regression models were trained to predict the group label of each cell to see if the two groups of cells were similar. We used different features to see if we could predict the projection target from the same source. These include the posterior mCH level of 100-kb-bin and gene-body, and the dimension reduction results of the two matrices. 50 PCs were used as part of a project that used snmC-seq to fit PCA models and transform epi-retro-seq data. We also used two methods to split the cells into training and testing sets. The other used a random selection of cells to be used for both training and testing, using the sex of the mice from which the cells were collected. After the QC steps, we have 168 source–target combinations with data from both sexes and the other 57 with cells from only one sex. All of the comparisons of 959 target pairs were quantified with the computational replicates, but less than half of them were quantified with biological replicates. We noticed significant congruence of model performance between the different features and different train/test splits (Extended Data Fig. 3b–c. The performance when using 100 kb bins was very similar to the performance when using gene bodies. The performance of the raw features was better than that of the PCs. 3b). Computational replicates were better than biological replicates in terms of performance. The predictions were made easier with the dismissal of heterogeneity between biological replicates made by the computational replicates. Nevertheless, the computational replicates still provided strongly correlated results to biological replicates (Extended Data Fig. 3c), which allowed the comparison between different target pairs to evaluate their epigenomic differences.

Technical reasons would include the following. The models can’t capture the difference between real and projected differences because of the high levels of contamination in some experiments. Different replicates vary in the epigenetic differences between neurons projecting to different targets. Third, the sample sizes of some projections are small, which makes learning more challenging. Fourth, the models are not powerful enough to capture the complex differences between projections.

Each cell type was assigned to a neurotransmitter identity based upon the percentage of its cells with non-zero counts of genes essential for the function of that neurotransmitter. Specifically, we used a non-zero threshold nz = 0.35.

To identify variable genes, we used a binomial model of homogenous expression and looked for deviations from that expectation. Specifically, for each gene we computed the relative bulk expression by summing the counts across cells and dividing by the total UMIs of the population. The percentage of all counts that are assigned to that gene is this. np is the equivalent of p in a binomial model for observing the gene in a cell with n counts. The expected proportion with non-zero counts is thus

Cell-type classification from a two-dataset with not labeled data: Application to Brain-wide correspondence of neuronal epigenomics and distant projections

Let’s assume that we have two datasets that are not labeled, A and B. When selecting a cell for a query, we first find its nearest neighbours in A and then the distance from them as a k-dimensional vector d. After the transformation, the closer neighbours have higher weights, and the weights of all neighbours sum up to 1. To get a categorical label from A to B, we needed to use one-hotEncode to represent the label and the vectors of the k neighbours in A. The resulting vector ({{\bf{L}}}{{\rm{q}}{\rm{r}}{\rm{y}}}={\bf{w}}{L}{{\rm{r}}{\rm{e}}{\rm{f}}}) represents the probability of the query cell belonging to each category. The final assignment uses the category with the maximum probability.

In the 3rd step, the experiments with less than 20 cells were excluded in order to ensure statistical power of projection analysis. The non-neuronal cells were also removed from the dataset, after which 34,643 neurons remained. The cell-type classification method is described in the next section.

In quality control (QC) step 1, the cells included in the analysis are required to have a median mCCC level of the experiment < 0.025; 500,000 < nonclonal reads < 10,000,000; and mCCC level < 0.05. In total, 56,843 cells from 703 experiments satisfied these requirements (Extended Data Fig. 2a,b

Source: Brain-wide correspondence of neuronal epigenomics and distant projections

rAAV retrograde labelling in the brain area using injections from granular insular cortex, ENT, pIR, AMY, HY and CBN

As described before, the rAAV-retro-Cre injections can be used to label cells projecting to regions of interest. Animals were placed in a stereotaxic frame after being anaesthetized with either isoflurane or xylazine. The pressure injections of 0.05 to 0.4 l were done using glass micropipettes, which had stereotaxic coordinates on them. To precisely target PFC, agranular insular cortex, ENT, PIR, AMY, HY and CBN, AAV was injected using iontophoresis to ensure confined viral infection. The injections were made with glass pipettes and had a tip diameter of 10 m. For most of the desired target areas, injections were made at different depths, and/or at different anterior–posterior or medial–lateral coordinates to label neurons throughout the target area. There are conditions and injection coordinates listed in Supplementary Table 1. At least two male and two female mice were injected for each desired target. There was no sample size calculation done. We determined that we needed two mice from the same sex for each injection in order to achieve minimum reproducibility. Animals used for injections into the brain area were randomly selected.

As described previously2, all experimental procedures using live animals were approved by the Salk Institute Animal Care and Use Committee. The mouse line used in epi-retro-seq2 is known as the hit-in mouse line. There were retrograde labelling experiments done with adult male and female INTACT mice. Animals were housed in an Association for Assessment and Accreditation of Laboratory Animal Care-accredited facility at the Salk Institute. There was a 12 h light and a 12 h dark cycle. Temperature was monitored and adjusted in accordance with Guide for the Care and Use of Laboratory Animals. Humidity was not controlled but monitored. As there is no recirculated air coming in, the animal facilities have the same humidity as the outside air. San Diego averages 40–60% humidity year-round. The animals were killed 13 days after the surgery and were older than 50 days on the day of dissection. The mice were used in MERFISH experiments.

Enrichment Analysis of SnRNA-Seq Cell Types with scDRS and a Gene Set Analysis on a ranked list to identify telencephalic inhibitory populations

To determine which of our snRNA-seq cell types was enriched for specific GWAS traits, we used scDRS (v.1.0.2)50 with default settings. scDRS computes disease association scores at the single-cell level and considers the distribution of control gene scores to identify significant associated cells.

We used MAGMA v.1.10 (ref. A window of 10 kb is needed to map single-nucleotide polymorphisms to genes. We used the resulting annotations and GWAS summary statistics to calculate each gene’s MAGMA z score (association with a given trait). Mouse genes were converted to their human orthologs using a database. The top score on the MAGMAZ was used to calculate the weighted number of disease genes used in scDRS. Many of the traits we tested for enrichment had previously been calculated with MAGMA scores, so those were used instead.

We performed a Gene set enrichment analysis on a ranked list to identify the transcription factors which are enriched for telencephalic excitatory populations. genes from our telencephalic excitatory cells were ranked by their average correlations with fo and a background ranking of the average telencephalic inhibitory fo. Genes from our telencephalic inhibitory cells were ranked in reverse order. We used multiple database to build a new set of genes and transcription factors. We included human transcription factors that showed up in two or more databases and had at least ten unique targets in each of them. The final gene set had human transcription factors with targets across any two databases. We used the R package to perform an enrichment analysis with a restriction of the size of the genes and against a background of all the genes in our data. P values were computed using a positive one-tailed test and FDR corrected by fgsea.

We constructed a force-directed graph of the weighted bipartite network to identify activity controlled relationships between candidate ARGs and regions of the brain. We used the R package igraph v.1.2.7 to build the network from an incidence matrix of candidate ARGs and excitatory/inhibitory cell types localized to different regions. An entry in the matrix is a representation of the correlation of a gene with a brain region scaled up so all entries are greater than or equal to one. There would be no edge between genes or regions if there were two disjoint sets in the network. The correlation entry between a gene and region was used to determine the weighted edges. To emphasize the most central nodes in the network, we pruned edges with e less than 1.3. We decided on our core IEGs based on the degree of the network in which we live, and then picked them based on the degree of centrality in the network.

The expression of pseudocell-level genes was normalized by the log 2-transformed count. We further normalized the expression of each gene to have a mean expression of zero and a standard deviation of one. The pseudo cell counts were normalized for downstream analysis.

To generate our pseudocells, we first performed dimensionality reduction at the single-cell level. Single cells were divided into 27 groups, consisting of glial cell classes and neuronal populations further divided by neurotransmitter usage. We used a specific number of mouse donors, and selected genes that were highly variable in the number of mice, for each cell group. The principal component analysis ran on the scaled expression data. Next, we constructed pseudocells by grouping single cells within each cell type. Within a cell type of size n, cells were assigned to pseudocells of size s such that the pseudocell size correlated with cell type size:

We aggregated our snRNA-seq expression data into pseudocells using scOnline. The issue of low capture rate from dropouts and pseudoreplication is eliminated by working at a pseudocell resolution, while avoiding issues of pseudobulk approaches.

We used the exact method of bin conf function in the Hmisc R package to calculate the 85% confidence interval for the binomial distribution. For plotting clarity, regions with fewer than five total inhibitory and excitatory cells were excluded.

The percentage of cells with non-zero expression of one NPR was assigned to the cell type, and the average expression of at least one NPR was assigned to the cell type.

The NP identity of each cell type was dependent on the percentage of cells with NP expression and the average expression of NP per cell. We observed that the expression of four NPs showed greater contamination across other cell types: OXT, AVP, PMCH and AGRP. We required the number of cells with non zero expression to be greater than or equal to 0.8 and average expression to be at least five counts per cell for these NPs.

We examined the top expressing transporter and neurotransmitters of the 166 cells that did not meet the conditions.

Source: The molecular cytoarchitecture of the adult mouse brain

Generating Index Cell Types in the Universe Using Mixed Linear Programming and Dynamic Tree-Crossing Minimization

We framed the query as a set covering problem, because we were able to find minimally sized gene lists that allowed us to distinguish one cell type from the others. In the set cover problem, we find the smallest subfamily of a family of sets that can still cover all the elements in the universe set. We can define this as a mixed integer linear programming model programmatically using the JuMP domain-specific modelling language in Julia (refs. 33,89. It was found that we could use the HiGHS open-source solution or the IBM ILOG CPLEX commercial solution. Supplementary methods includes a mixed linear programming model derivation and CPLEX solver parameters.

To create ridge plots in the neighbourhood. Theinterneuron and projection metaclusters for the isocortex, striatum and cerebellum were identified first in Supplementary Table 13. Supplementary Table 5 shows the cell types within each metacluster.

The graph of regional cell type similarity can be generated. We used the weighted jaccard similarity metric to determine the weighted pair of DeepCCF regions. We then used the R package qgraph v.1.9 to generate a force-directed graph.

To evaluate cluster heterogeneity, we analysed the minimum number of cell types that should be covered by the mapped beads. For each region, we computed the number of confidently mapped beads for each cell type sorted in descending order by the number of beads. Next, we looked at the number of cell types that were needed to reach 95 percent of the total beads.

Descendants from distant ancestors were aggregated to find the neighbourhood in an index cell type. If more than 50% of the cell types in the neighbourhood were non-neuronal, we continued to add them.

The genes and cell types were first placed in order using the permuting method. The cell types were then reordered to comply with the cell type dendrogram structure using a dynamic programming tree-crossing minimization optimization88.

The tree structure gives us a chance to modify the leaf sequence in the tree. We did so by iteratively permuting the columns and rows of a normalized cell type by gene matrix so that the elements are grouped around the diagonal. The genes Tbr1, Fezf2, Dlx1, Lhx6, Foxg1, Neurod6, Lhx8, Sim1, Lmx1a, Lhx9, Tal1, Pax7, Hoxc4, Gata3, Hoxb5 and Phox2b were chosen to be discrete, biologically interpretable markers—mostly transcription factors that relate to overall neuronal cell lineage.

RCTD in doublet mode models how well explicit pairs of cell types match a bead’s expression. RCTD takes cell type pairs and looks at them per bead. We found that larger cell type references discouraged us from confidently mapping fine-grained cell types. To balance this sparsity, we added an additional ridge regression term to RCTD’s quadratic optimization tunable with a ridge strength parameter, which allowed us to control the relative sparsity and potential overfitting of the prefiltering stage. Using RCTDs full mode and two ridge strength parameters, we were able to detect a subset of potential cell types for each bead.

It was necessary to modify how RCTD scores explicit pairs of cell types in doublet mode in order to aid in mapping large references. Rather than using the result of the single-cell type pair that fit best, we identified the cell type pairs that scored similar to the best-scoring pair (with likelihood score within 30). We took the number of each cell type occurring in the well-fitting pairs and divided them into the total occurrences of all cell types to come up with a confidence score. We use a maximum score of 0.5 in the paper to set the threshold for a confident mapping.

We put out the 10 cell types most likely to be associated with the given bead, using the explicit cell types used within RCTD’s doublet mode. When modelling how well these cell types mapped to a given bead, we exhaustively used one cell type from the top 10 list and one cell type from the rest of the prefiltered list. For the cerebellum and striatum, the number of cell types considered was sufficiently low that we were able to run the algorithm using all pairs.

For mapping we deployed a modification of the RCTD algorithm23, in which we increased the computational efficiency and throughput, modified cell type prefiltering and adjusted the metric used for the decomposition assignment (see below).

Source: The molecular cytoarchitecture of the adult mouse brain

Distributed Clustering of Immune Cells in the snRNA-seq Dataset I. Samples and Data Saturation

The 16 cell classes we identified in the snRNA-seq data came from 20 different cell types. Most of these excluded clusters are classified as immune cell types and are mentioned in the following figure and tables: Extended Data Fig. Supplementary Tables 2 and 3 are included. We mapped a lot of immune cell populations.

We used the SCOPIT package to determine the saturation of our data. To sample at least c cells in each cell type, SCOPIT calculates the multinomial probability of sequencing enough cells, n, above some success probability, p.

Clustering was performed hierarchically starting from the full dataset of approximately 6 million single nuclei. Each round of clustering consisted of (1) gene selection based on a binomial model; (2) square-root transformation of the counts; (3) construction of the k nearest neighbour and shared neighbour graphs; and (4) Leiden clustering over a range of resolution parameters to find the lowest resolution that yielded multiple clusters. The resulting clusters were then each iteratively re-clustered, and the process was repeated until either (1) no Leiden resolution resulted in a valid clustering or (2) the resulting clusters did not have at least three differentially expressed genes distinguishing them. A key goal of this clustering strategy was to re-calculate gene selection for every clustering, as the relevant variable genes depend on the overall context of the cells being clustered. This resulted in a distributed design in which the data were stored on a disk in a compressed representation that could be efficiently accessed using parallel processes. This allowed us to perform clustering thousands of times without creating redundant copies of the data.

For high-dimensional visualization as in this picture. 1a, we first subsampled each of the clusters to a maximum of 2,000 nuclei. We took the first 250 components and calculated them using the Scanpy package. The t-SNE was generated using the principal component space as a model and an exaggerate factor of four and a perplexity of 350.

The only exception to the above was if the next level of clustering resulted in a set of differential clusters that passed this test; these were situations where the first round of clustering split the cells on a continuous difference in expression but the next round resolved the discrete clusters. We retained these clusters for further subclustering as they may contain additional structure.

Each leaf was compared with its sibling leaves to see if there were differential markers. We used a Mann–Whitney U-test to assess whether any genes are differentially expressed. To ensure that there is a distinct difference in expression between the two populations, we required that a gene be observed in less than 10% of the lower population and more than 20% in the higher population. Every cluster was required to have at least three marker genes that separated it from its neighbours as well as three marker genes in the other direction. The parent was the terminal cluster if a cluster failed the test.

The resolution sweep concluded at the highest resolution without finding multiple clusters, which also indicates a homogeneity of the population.

If the shared neighbour graph was not a single connected component, there is no resolution low enough to form a single cluster, and so, the resolution sweep was not possible. This would typically occur if there were very few variable genes, which is indicative of a homogenous cell population.

We used the Constant Potts Model to identify cell clusters after we computed the shared nearest neighbour graph. This method is sensitive to a resolution parameter, which can be interpreted as a density threshold that separates intercluster and intracluster connections. To find a resolution parameter manually, we implemented a sweep strategy. We started with a very low-resolution value, which results in all cells in one cluster. We gradually increased the resolution until there were at least two clusters and the size ratio between the largest and second-largest cluster was at most 20, meaning that at least 5% of the cells are not in the largest cluster. If a lot of cells were clustered in a round, the number was discarded. Roughly 1% of the cells are contained in this discarded set.

We created a shared nearest neighbour graph after selecting variable genes. The counts were transformed using the square- root function and then computed with the k-nearest neighbour graph using the coefficients of and (not including self-edges) From the kNN graph, we compute the shared neighbour graph, where the weight between a pair of cells is the Jaccard similarity over their neighbours:

We compared this expected value with the observed percentage of non-zero counts and selected all genes that are observed at least 5% less than expected in a given population.

Dsp, Ccn2, and Tmem212 have genes that correspond to the CCF regions in Supplementary Table 12.

The CCF hierarchy was divided into 12 main regions, so that it could be visualized easily. For many of our analyses, we also grouped into ‘DeepCCF’ regions, detailed in Supplementary Table 4.

Source: The molecular cytoarchitecture of the adult mouse brain

Joint dissimilarity function for large deformation diffeomorphic metrics and application to Nissl atlas and comparison with Slide-seq data

This dissimilarity function, subject to Large Deformation Diffeomorphic Metric Mapping regularization, is minimized jointly over all parameters using a gradient-based approach, with estimation of parameters for linear transforms accelerated using Reimannian gradient descent as recently described71. The source code for our standard registration is available online at www.twardlab.com. The transformations above were used to map annotations from the CCF onto each slice. The black curves of the boundaries of each anatomical region were used to make sure the images were accurate. We were able to identify 15 outliers where the rigid motion model was not enough because of the large distortions of tissue slices. For these slices, we included an additional two-dimensional diffeomorphism to model distortions that are independent from slice to slice and cannot be represented as a three-dimensional shape change, as in our previous work72. The data is extended. After the additional two-dimensional diffeomorphism, 2a shows accuracy.

The similarity between the transformed atlas and our data was quantified by using an objective function we developed called previously69,70 which is equal to the weighted sum of square error between the two. To transform contrasts, a third-order polynomial was estimated on each slice of the transformed atlas to best match the red, green and blue channels of our Nissl dataset (12 degrees of freedom per slice). During this process, outlier pixels (artifacts or missing tissue) are estimated using an expectation maximization algorithm, and the posterior probabilities that pixels are not outliers are used as weights in our weighted sum of square error.

As a pre-processing step for the alignment of Slide-seq arrays to Nissl images, for each puck we generated a greyscale intensity image from the Slide-seq data by summing the UMI counts (across all genes) at each bead location on the puck and normalizing by the maximum UMI count value across the entire puck. We used two steps to align the images to the Nissl images. Each image was transformed into an intermediate coordinate space using a manual rigid transformation. The purpose of this first transformation is to bring the images to a similar upright orientation, which will help the second step of alignment. In the second step, we manually identified fiducial markers in the Nissl images and the Slide-seq intensity images. We calculated the bead positions for all beads using thin-plate Spline interpolation and fiducial markers.

The sequenced reads were aligned to GRCm39.103 reference and processed using the Slide-seq tools pipeline (https://github.com/MacoskoLab/slideseq-tools; v.0.2) to generate the gene count matrix and match the bead barcode between array and sequenced reads.

Source: The molecular cytoarchitecture of the adult mouse brain

Nissl Microscopy and Analysis of the Brain Molecular Cytoarchitecture Using Pike 505C VC50 Cameras under Autoexposure Mode

We acquired Nissl images on an Olympus VS120 microscope using a ×20, 0.75 numerical aperture objective. Images were captured with a Pike 505C VC50 camera under autoexposure mode with a halogen lamp at 92% power. The pixel size in all images was 0.3428 μm in both the height and width directions. We acquired a total of 114 Nissl images, each from an adjacent section of the brain to a corresponding section that was processed using the Slide-seq pipeline. Of the 114 sections, we removed 10 from the posterior medulla and upper spinal cord that were outside of the area of the CCF reference brain. There were three additional sections removed from the remaining 104 images due to the poor quality of the corresponding data. The remaining 101 images comprise the final dataset that we use for all our analyses.

Slide-seq arrays were generated as previously described2 with slight modifications. Larger-diameter gaskets were used to generate 5.5 × 5.5 mm2, 6.0 × 6.2 mm2 and 6.5 × 7.5 mm2 bead arrays. The sizes that were chosen were designed for the different parts of the body. To facilitate image processing, we utilized 2 × 2 digital binning on the collected data, resulting in 1.3 μm per pixel.

The read was demultiplexed and aligned to a reference using CellRanger v.5.0.1, except for an additional parameters to include introns. We used CellBender v.3-alpha63 to remove cells.

Source: The molecular cytoarchitecture of the adult mouse brain

C57BL/6J Mice Anaesthetized by Isoflurane in a Gas Chamber. Care and Use of Laboratory Animals for Regional Tissue Dissections

At 56 days of age, C57BL/6J mice were anaesthetized by administration of isoflurane in a gas chamber flowing 3% isoflurane for 1 min. A negative tail pinch response was used to confirm Anaesthesia. A nose cone was flowing for a duration of the procedure after the animals were moved to a dissection tray. Transcardial perfusions were performed with ice-cold pH 7.4 HEPES buffer containing 110 mM Blood should be removed from brain and other organs with the help of 10 mM HEPES, 25 mM glucose, 75 mM sucrose, 7.5 mM Mgcl2 and 2.5 mM KCl. For use in regional tissue dissections, the brain was removed immediately; the meninges was peeled away from the entire brain surface, then frozen for 3 min in liquid nitrogen vapour and moved to −80 °C for long-term storage. For use in generation of the Slide-seq dataset through serial sectioning, the brains were removed immediately, blotted free of residual liquid, rinsed twice with OCT to assure good surface adhesion and then oriented carefully in plastic freezing cassettes filled with OCT. For 5 minutes at room temperature, the cassettes were vibrated to remove air bubbles and to adhere OCT well to the brain surface. The x–y–z axes were adjusted just before the brain was put through a bath of liquid nitrogen. Blocks were cooled to 80 C.

Animals were group housed with a 12-h light–dark schedule and allowed to acclimate to their housing environment (20–22.2 °C, 30–50% humidity) for 2 weeks post-arrival. All procedures involving animals at Massachusetts Institute of Technology were conducted in accordance with the US National Institutes of Health Guide for the Care and Use of Laboratory Animals under protocol number 1115-111-18 and approved by the Massachusetts Institute of Technology Committee on Animal Care. All procedures involving animals at the Broad Institute were conducted in accordance with the US National Institutes of Health Guide for the Care and Use of Laboratory Animals under protocol number 0120-09-16.

The BICCN atlas was built using a technology called high-resolution single-cell transcriptomics, which defines cells by the genes that they express and uses imaging technologies to map the genes’ spatial distribution. The atlas details how many brain cell types there are, as well as their proportions and spatial arrangement in the brain.

The BICCN, a collaboration between 250 scientists and 45 institutions, was funded by the National Institutes of Health to the tune of US$400 million.

These same technologies underpin the China Brain Project, which is being funded with 3.2 billion Chinese yuan (US$446 million). As with the NIH study, its goal is to map macaque brain cell types using genomic technologies — it has already used these tools to generate an atlas of 143 regions in the macaque brain cortex and has identified 264 cortical cell types7.

The Human Brain Project, which was based inGeneva, Switzerland, and ended in September, has created the Human Brain Atlas. This is available on a platform open to the public. The atlas uses post-mortem brain-imaging data and depicts the multilevel organization of the brain, from its cellular and molecular architecture to its functional modules and neural connections.

The Swiss Blue Brain Project, which will wrap up in 2024 after nearly 20 years, will also release a digital model of the mouse brain, based on imaging data, with detailed mapping of neurons and connectivity circuits.

Towards a Common Framework for Cell Clustering, Classifications and Names: A Workshop on Data Processing and Analysis in Cellular Networks

There is a need for better communication between the projects, as they progress, but they are important in their own right. Several of the projects are employing the same technologies. It makes sense for the teams to liaise more closely, at the very least to begin a discusson on how to establish shared data standards, which they have not yet done.

At a minimum, the data, models and code need to be open. Adherence to large data sets is still a challenge. It’s necessary for a common framework for data collection and analysis that includes definitions of types of cell clusters, as well as unified cell classifications and names across species.