An open exploration of viable human-AI systems.
View the Project on GitHub algoplexity/cybernetic-intelligence
Toward Algorithmic Fidelity in Synthetic Text Generation: A Minimum Description Length Approach with Large Language Models
This paper proposes and evaluates a novel framework for generating high-fidelity synthetic textual data from sensitive corpora using a Minimum Description Length (MDL)-guided large language model (LLM) pipeline. The goal is to retain semantic and structural properties of the original data while minimizing information leakage risk. We hypothesize that MDL-guided representation learning and generation outperform standard prompt-based methods in thematic fidelity, compressive structure, and task utility. Through a series of experiments on real-world survey data, we demonstrate that our approach yields synthetic data with superior topic coherence, lower algorithmic redundancy, and comparable downstream utility to real data.
The increasing use of language models in sensitive domains (e.g., law, policy, healthcare) necessitates rigorous strategies for generating synthetic data that preserves utility without compromising privacy. Traditional prompt-based generation approaches often overfit or fail to capture deep structure. Meanwhile, formal methods like MDL offer principled ways to discover and compress structure in data. We present a hybrid approach that integrates MDL principles into an LLM-driven synthetic generation pipeline and empirically test its effectiveness.
We build upon three intersecting lines of research: (1) MDL-based modeling and motif discovery; (2) LLM-powered text generation and compression (e.g., LLMZip, LLMLingua); and (3) synthetic data generation for privacy preservation. Our work also aligns conceptually with ergodicity-aware critiques of static statistical models, reframing text generation as a path-dependent process.
Our pipeline comprises four stages:
We adapt the K*-Means algorithm to identify clusterings that minimize total description length (data fit + model complexity), enabling theme discovery without predefining the number of clusters.
Seed texts from each cluster are used to condition LLM generations via few-shot meta-prompting. To encourage diversity, a memory-aware prompt filter excludes redundant generations.
We test the following:
We compare against:
We present experimental results from synthetic generation on a sensitive public policy dataset. MDL-guided clustering consistently improves topic coherence and diversity. Synthetic outputs exhibit lower algorithmic redundancy (as measured by BDM), and privacy audits show reduced leakage. Models fine-tuned on our synthetic data retain 92–96% of downstream task performance relative to real data.
The results support our hypothesis that MDL provides a principled backbone for structuring synthetic generation workflows. Moreover, LLMs amplify this by learning compressive semantic patterns. We discuss implications for privacy-preserving AI and ethical document analysis.
This work bridges information theory and language modeling to enable secure, structure-preserving synthetic data pipelines. Future directions include differentiable MDL optimization, federated MDL-guided generation, and extending the framework to multilingual or multimodal corpora.
[To be populated with MDL, BDM, LLMZip, LLMLingua, MetaSynth, and ergodicity economics sources.]