Integration of DNA Methylation, scATAC-seq, and scRNA-seq to Infer Transcription Factor Activity

Gene expression regulation depends on the interaction between transcription factors (TFs) and regulatory regions of the genome such as promoters and enhancers. However, the ability of TFs to bind DNA is strongly influenced by the local epigenetic state, including chromatin accessibility and DNA methylation.

In this context, three types of data provide complementary information: scRNA-seq, scATAC-seq, and DNA methylation. Integrating these layers makes it possible to infer which TFs are active and which genes they regulate, providing a more complete view of the regulatory networks that define cellular states.

In this notebook, I summarize an analysis pipeline that integrates these three data types at single-cell resolution. Each omic layer answers a different biological question.

Data type	What it tells us
scRNA-seq	Which genes and TFs are expressed
scATAC-seq	Which genomic regions are accessible
DNA methylation	Whether regulatory regions are epigenetically permissive or repressed

Overall:

TF activity is inferred, not directly measured. To infer TF activity, we combine multiple signals:

TF expressed
+ motif present
+ chromatin accessible
+ low methylation
→ strong evidence that the TF is active

Integration reveals regulatory programs. When these signals are combined across thousands of cells, we can:

identify active transcription factors
map enhancers to target genes
reconstruct cell-type–specific regulatory networks

These approaches allow researchers to move from descriptive transcriptomics to mechanistic models of gene regulation.

Single-cell multi-omics integrates expression, chromatin accessibility, and DNA methylation to identify which transcription factors are active and how they control gene regulatory networks that define cellular identity.

Summary of the Pipeline

Process scATAC-seq (peak detection)
Process DNA methylation (epigenetic state of regulatory regions)
Perform TF motif analysis
Estimate TF accessibility scores
Refine TF activity with methylation at motif sites
Process scRNA-seq (gene expression) and identify TF expression levels
Integrate multi-omic matrices and infer regulatory activity

Step 1. Processing scATAC-seq and Peak Identification

First, chromatin accessibility data are processed to define open regions of the genome.

Typical steps include read alignment, quality filtering, peak calling, and construction of a peak × cell matrix.

Conceptual Example

Peak	Cell1	Cell2	Cell3
peak_1	0	4	2
peak_2	1	0	0
peak_3	5	3	1

These peaks represent potential regulatory regions where transcription factors may bind.
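
As a rough illustration of the last step, the sketch below counts fragments overlapping each peak per cell. The coordinates, barcodes, and function names are hypothetical, intervals are assumed to be half-open, and real tools differ in what they count (for example, insertion sites rather than whole fragments).

```python
from collections import defaultdict

# Hypothetical toy inputs: peaks as (name, chrom, start, end), half-open coordinates.
peaks = [
    ("peak_1", "chr1", 100, 200),
    ("peak_2", "chr1", 500, 600),
]
# Fragments as (chrom, start, end, cell_barcode), e.g. parsed from a fragments file.
fragments = [
    ("chr1", 120, 180, "Cell2"),
    ("chr1", 150, 260, "Cell2"),
    ("chr1", 190, 320, "Cell3"),
    ("chr1", 510, 590, "Cell1"),
    ("chr1", 700, 800, "Cell1"),  # overlaps no peak
]

def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end

def peak_by_cell(peaks, fragments):
    """Count fragments overlapping each peak, per cell barcode."""
    counts = {name: defaultdict(int) for name, *_ in peaks}
    for chrom, start, end, cell in fragments:
        for name, p_chrom, p_start, p_end in peaks:
            if chrom == p_chrom and overlaps(start, end, p_start, p_end):
                counts[name][cell] += 1
    return counts

matrix = peak_by_cell(peaks, fragments)
cells = sorted({f[3] for f in fragments})
for name, *_ in peaks:
    print(name, [matrix[name][c] for c in cells])
```

The nested loop is only suitable for toy data; at real scale an interval index or a dedicated toolkit would replace it, and the result would be stored as a sparse matrix.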

Step 2. Integrating Methylation with Accessible Regions

Once peaks are defined, they are intersected with methylation data:

scATAC peaks ∩ CpGs with methylation calls

This allows the calculation of methylation levels within accessible regions.

Conceptual Example

Peak	Methylation
peak_1	0.12
peak_2	0.78
peak_3	0.25

Interpretation

Low methylation + high accessibility → active regulatory region
High methylation + low accessibility → repressed region
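
The intersection can be sketched as follows, assuming CpG calls are available as methylated and total read counts per position. All coordinates, counts, and names here are made up for illustration, and pooling reads across CpGs is one of several reasonable ways to summarize a region.

```python
# Hypothetical CpG calls: (chrom, position, methylated_reads, total_reads).
cpgs = [
    ("chr1", 110, 1, 10),
    ("chr1", 150, 2, 10),
    ("chr1", 520, 8, 10),
    ("chr1", 560, 7, 10),
    ("chr1", 900, 5, 10),  # outside every peak
]
# Hypothetical peaks: (name, chrom, start, end), half-open coordinates.
peaks = [
    ("peak_1", "chr1", 100, 200),
    ("peak_2", "chr1", 500, 600),
    ("peak_3", "chr1", 1000, 1100),
]

def peak_methylation(peaks, cpgs):
    """Return {peak: methylated / total reads} over CpGs inside each peak.

    Peaks without any covered CpG get None rather than 0, so that
    "no data" is not mistaken for "unmethylated".
    """
    result = {}
    for name, p_chrom, p_start, p_end in peaks:
        meth = total = 0
        for chrom, pos, m, t in cpgs:
            if chrom == p_chrom and p_start <= pos < p_end:
                meth += m
                total += t
        result[name] = meth / total if total else None
    return result

methylation = peak_methylation(peaks, cpgs)
for name, value in methylation.items():
    print(name, value)
```

Keeping uncovered peaks as missing values matters in single-cell methylation data, where coverage per cell is sparse.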

Step 3. Identification of TF Motifs

Next, the identified peaks are scanned for TF motifs to detect potential TF binding sites.

Example

Peak region chr1:10200–10400

Motifs detected

SOX2
GATA1
RUNX1

Resulting Matrix (Peak × TF Motif)

Peak	SOX2	GATA1	RUNX1
peak_1	1	0	1
peak_2	0	1	0
peak_3	1	1	0
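
A toy version of the scan is sketched below. It assumes each motif can be approximated by a short IUPAC consensus string; the strings and peak sequences are simplified stand-ins chosen for illustration, not curated motif models. Real analyses score position weight matrices from a motif database instead.

```python
import re

# IUPAC codes used by the toy consensus strings below.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "[AT]", "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def has_motif(seq, consensus):
    """True if the consensus matches on either strand of seq."""
    pattern = re.compile("".join(IUPAC[base] for base in consensus))
    return bool(pattern.search(seq) or pattern.search(reverse_complement(seq)))

# Assumption: simplified consensus strings, for illustration only.
motifs = {"SOX2": "WWCAAWG", "GATA1": "WGATAR", "RUNX1": "TGTGGT"}

# Hypothetical peak sequences (real peaks span hundreds of bases).
peak_seqs = {
    "peak_1": "GGCAACAAAGCCTGTGGTCC",
    "peak_2": "CCCAGATAAGGGCCC",
    "peak_3": "TTCATTGTTCCTTATCTGG",
}

# Binary peak x TF motif matrix as nested dicts.
motif_matrix = {
    peak: {tf: int(has_motif(seq, consensus)) for tf, consensus in motifs.items()}
    for peak, seq in peak_seqs.items()
}
for peak, row in motif_matrix.items():
    print(peak, row)
```

Scanning both strands is needed because a TF can bind a site in either orientation.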

Step 4. Estimating TF Activity Using Chromatin Accessibility

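Peaks containing motifs for a given TF can be aggregated to estimate regulatory activity based on accessibility. Methods such as chromVAR build on this idea, scoring each cell by how far accessibility at motif-containing peaks deviates from the expected value after correcting for technical biases such as GC content.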

Conceptual Workflow

TF motif peaks
↓
average accessibility
↓
TF accessibility score
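
The workflow above can be sketched with the toy matrices from Steps 1 and 3. This is a plain per-cell average over motif-containing peaks, which is a deliberate simplification: it applies no depth normalization or bias correction, so it is not the chromVAR computation. The function name is hypothetical.

```python
# Peak x cell accessibility counts (toy values from Step 1).
accessibility = {
    "peak_1": {"Cell1": 0, "Cell2": 4, "Cell3": 2},
    "peak_2": {"Cell1": 1, "Cell2": 0, "Cell3": 0},
    "peak_3": {"Cell1": 5, "Cell2": 3, "Cell3": 1},
}
# Peak x TF motif matrix (toy values from Step 3).
motif_hits = {
    "peak_1": {"SOX2": 1, "GATA1": 0, "RUNX1": 1},
    "peak_2": {"SOX2": 0, "GATA1": 1, "RUNX1": 0},
    "peak_3": {"SOX2": 1, "GATA1": 1, "RUNX1": 0},
}

def tf_accessibility(accessibility, motif_hits):
    """Average accessibility per cell over the peaks that contain each TF motif."""
    tfs = sorted({tf for hits in motif_hits.values() for tf in hits})
    cells = sorted({c for counts in accessibility.values() for c in counts})
    scores = {}
    for tf in tfs:
        peaks_with_motif = [p for p, hits in motif_hits.items() if hits.get(tf)]
        if not peaks_with_motif:
            continue  # no motif-containing peak: no score for this TF
        scores[tf] = {
            cell: sum(accessibility[p][cell] for p in peaks_with_motif)
            / len(peaks_with_motif)
            for cell in cells
        }
    return scores

scores = tf_accessibility(accessibility, motif_hits)
for tf, per_cell in scores.items():
    print(tf, per_cell)
```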

Step 5. Integrating Methylation at Motif Sites

DNA methylation is then used to refine TF activity inference. For each motif:

identify nearby CpGs
calculate average methylation
evaluate patterns of hypomethylation

Conceptual Example

TF	Accessibility	Methylation
SOX2	high	low
GATA1	low	high

Interpretation

High accessibility + low methylation → TF likely active
Low accessibility + high methylation → TF likely inactive
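
One way to sketch the per-motif calculation is shown below, assuming CpG positions on a single chromosome are sorted and that "nearby" means within a fixed flank of the motif site. The positions, methylation values, 25 bp flank, and 0.3 hypomethylation cutoff are all hypothetical choices.

```python
from bisect import bisect_left, bisect_right

# Hypothetical CpGs on one chromosome: sorted positions and methylation fractions.
cpg_positions = [105, 150, 160, 520, 540, 560]
cpg_methylation = [0.10, 0.05, 0.15, 0.80, 0.90, 0.70]

# Hypothetical motif site midpoints per TF.
motif_sites = {"SOX2": [152], "GATA1": [530], "RUNX1": [300]}

def mean_methylation_near(site, flank=25):
    """Mean methylation of CpGs within +/- flank of a site, None if none are covered."""
    lo = bisect_left(cpg_positions, site - flank)
    hi = bisect_right(cpg_positions, site + flank)
    values = cpg_methylation[lo:hi]
    return sum(values) / len(values) if values else None

def tf_methylation(sites, flank=25):
    """Average the per-site means over the sites that have CpG coverage."""
    covered = [v for v in (mean_methylation_near(s, flank) for s in sites)
               if v is not None]
    return sum(covered) / len(covered) if covered else None

summary = {tf: tf_methylation(sites) for tf, sites in motif_sites.items()}
# Assumption: 0.3 is an arbitrary illustrative cutoff for "hypomethylated".
hypomethylated = {tf: v is not None and v < 0.3 for tf, v in summary.items()}
print(summary)
print(hypomethylated)
```

Binary search on sorted positions keeps each lookup logarithmic in the number of CpGs, which helps when many motif sites are queried.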

Step 6. Integration with Gene Expression (scRNA-seq)

Finally, transcriptomic information is incorporated. For each TF, the following evidence is evaluated:

Evidence	Interpretation
TF highly expressed	potentially active regulator
motifs in accessible regions	possible binding
low methylation at those regions	permissive epigenetic state

Example

TF: RUNX1

high expression
motifs in accessible peaks
low methylation

→ strong evidence of regulatory activity
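
The evidence table can be turned into a simple rule, sketched below. The thresholds, the example values, and the three-level labels are assumptions made for illustration; in practice the cutoffs would be chosen per dataset, or replaced by a statistical model.

```python
from dataclasses import dataclass

@dataclass
class TFEvidence:
    expression: float     # e.g. mean normalized expression of the TF in a cluster
    accessibility: float  # accessibility score over motif-containing peaks
    methylation: float    # mean methylation fraction at motif sites

def supporting_evidence(ev, min_expr=1.0, min_access=1.0, max_meth=0.3):
    """List which of the three evidence lines pass their (hypothetical) threshold."""
    checks = {
        "TF expressed": ev.expression >= min_expr,
        "motifs accessible": ev.accessibility >= min_access,
        "low methylation": ev.methylation <= max_meth,
    }
    return [name for name, passed in checks.items() if passed]

def call_activity(ev):
    """Label the TF by how many evidence lines agree."""
    n = len(supporting_evidence(ev))
    return {3: "strong", 2: "partial"}.get(n, "weak")

# Hypothetical values for two TFs in one cell cluster.
runx1 = TFEvidence(expression=2.4, accessibility=3.0, methylation=0.10)
gata1 = TFEvidence(expression=0.2, accessibility=0.5, methylation=0.80)

print("RUNX1", call_activity(runx1), supporting_evidence(runx1))
print("GATA1", call_activity(gata1), supporting_evidence(gata1))
```

Returning the list of passing checks, not only the label, makes it easy to see which layer is missing when a TF falls short of "strong".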

Step 7. Inference of Regulatory Networks

Once active TFs are identified:

peaks are linked to nearby genes (for example, by distance to the transcription start site or by peak-to-gene correlation)
TF → gene relationships are inferred
cellular regulatory networks are reconstructed

Conceptual Model

TF
↓
enhancer / promoter
↓
target gene

These networks help identify cell-type-specific regulatory programs.
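
A minimal sketch of the TF → peak → gene model follows, assuming peaks are linked to the nearest transcription start site (TSS) within a fixed window. Gene names, coordinates, the 50 kb window, and the set of active TFs are hypothetical, and correlation-based peak-to-gene linkage would be a common alternative to the distance rule used here.

```python
# Hypothetical TSS positions and peak midpoints on one chromosome.
tss = {"GENE_A": 1_000, "GENE_B": 60_000}
peak_mid = {"peak_1": 3_000, "peak_2": 58_000, "peak_3": 500_000}

# Motifs found in each peak (Step 3) and TFs called active (Step 6).
motif_hits = {
    "peak_1": ["SOX2", "RUNX1"],
    "peak_2": ["GATA1"],
    "peak_3": ["SOX2", "GATA1"],
}
active_tfs = {"SOX2", "RUNX1"}

def nearest_gene(pos, tss, max_dist=50_000):
    """Return the gene with the closest TSS, or None if it is beyond max_dist."""
    gene, dist = min(((g, abs(pos - p)) for g, p in tss.items()),
                     key=lambda pair: pair[1])
    return gene if dist <= max_dist else None

def build_edges(peak_mid, tss, motif_hits, active_tfs):
    """Collect (TF, peak, gene) edges for active TFs with a motif in a linked peak."""
    edges = set()
    for peak, pos in peak_mid.items():
        gene = nearest_gene(pos, tss)
        if gene is None:
            continue  # peak too far from any TSS to be linked
        for tf in motif_hits.get(peak, []):
            if tf in active_tfs:
                edges.add((tf, peak, gene))
    return sorted(edges)

edges = build_edges(peak_mid, tss, motif_hits, active_tfs)
for tf, peak, gene in edges:
    print(f"{tf} -> {peak} -> {gene}")
```

Keeping the peak in each edge preserves which regulatory element supports the TF → gene link, which is useful when the same gene has several candidate enhancers.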

Conclusion

Integrating chromatin accessibility, DNA methylation, and gene expression enables more accurate inference of transcription factor activity and reconstruction of regulatory networks in complex biological systems.

In single-cell epigenomics, this type of multi-omic integration is increasingly used to characterize cellular states and differentiation trajectories. Computational frameworks such as SCENIC (regulatory network inference from scRNA-seq), ArchR (scATAC-seq analysis), and MOFA+ (multi-omic factor analysis) allow these analyses to be implemented systematically and are widely used in single-cell studies.

Key References

The following publications and resources were consulted as key methodological and conceptual references for the approaches described in this document.

chromVAR: Inferring transcription-factor–associated accessibility from single-cell epigenomic data
Introduces a statistical framework to infer transcription factor (TF) activity from chromatin accessibility data by analyzing deviations in accessibility at TF motif sites in single-cell ATAC-seq datasets.

SCENIC: Single-cell regulatory network inference and clustering
Describes a workflow for reconstructing gene regulatory networks and identifying active transcription factors using single-cell RNA-seq data combined with motif enrichment analysis.

MOFA+: A statistical framework for comprehensive integration of multi-modal single-cell data
Presents a factor analysis framework designed to integrate multiple omics layers (such as gene expression, chromatin accessibility, and DNA methylation) to identify shared and modality-specific sources of biological variation.

ArchR: scalable software for integrative single-cell chromatin accessibility analysis
Provides a scalable computational platform for large-scale analysis of single-cell ATAC-seq data, including integration with other modalities, TF motif enrichment, peak-to-gene linkage, and regulatory network inference.

ENCODE (Encyclopedia of DNA Elements) Project
A major international consortium that provides comprehensive reference annotations of regulatory elements, transcription factor binding sites, and epigenomic datasets widely used in regulatory genomics studies.

CSC. March 13, 2026