An open exploration of viable human-AI systems.
View the Project on GitHub algoplexity/cybernetic-intelligence
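This document captures the current architecture of the long-document classification and information extraction system, as extracted from code_summary.txt. It includes a system overview, an architecture diagram, a table of component responsibilities and collaborators, and a table of upgrade targets with risk levels.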
The system processes legal and compliance documents (often PDFs) to classify them, extract relevant named entities, and detect the presence of digital or visual signatures. It is built from modular components that combine traditional ML and deep learning, and it relies on caching, multiprocessing, and format-specific processing pipelines.

                        ┌─────────────────────────┐
                        │     Input PDF Files     │
                        └────────────┬────────────┘
                                     │
                              [1] PDF LOADING
                                     │
                 ┌───────────────────┴───────────────────┐
                 │                                       │
       ┌─────────▼─────────┐                ┌────────────▼────────────┐
       │   OCR Pipeline    │                │ Digital Signature Check │
       │   (tesserocr +    │                │   (PDF bytes + regex)   │
       │    pdf2texts)     │                └────────────┬────────────┘
       └─────────┬─────────┘                             │
                 │                                       ▼
        ┌────────▼────────┐                ┌───────────────────────────┐
        │    Chunking     │                │ Visual Signature Detector │
        │   (semchunk)    │                │    (signitractor ONNX)    │
        └────────┬────────┘                └─────────────┬─────────────┘
                 │                                       ▼
       ┌─────────▼─────────┐                 ┌───────────────────────┐
       │ Embedding Module  │                 │ Signature Classifier  │
       │ (RoBERTa, TF-IDF) │                 │  (signitector LGBM)   │
       └─────────┬─────────┘                 └───────────┬───────────┘
                 │
    ┌────────────▼────────────┐
    │  Chunk Classification   │
    │ (RoBERTa + LSTM fusion) │
    └────────────┬────────────┘
                 │
       ┌─────────▼─────────┐
       │ Entity Extraction │
       │ (TokenClassifier) │
       └─────────┬─────────┘
                 │
           ┌─────▼─────┐
           │   Cache   │
           │ + Persist │
           └───────────┘

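
The diagram labels the digital signature check as "PDF bytes + regex". A minimal sketch of what such a check could look like follows. The two patterns and the function name `has_digital_signature` are assumptions made for illustration and are not taken from the project's code. The check only tests whether the raw bytes appear to contain a signature dictionary; it does not validate the signature.

```python
import re

# Heuristic patterns (assumptions, not the project's actual regexes).
# A signed PDF normally carries a signature dictionary with a /ByteRange
# array of four integers, plus a /Type /Sig or /SubFilter entry.
BYTE_RANGE = re.compile(rb"/ByteRange\s*\[\s*\d+\s+\d+\s+\d+\s+\d+\s*\]")
SIG_MARKER = re.compile(rb"/Type\s*/Sig\b|/SubFilter\s*/(?:adbe|ETSI)\.")


def has_digital_signature(pdf_bytes: bytes) -> bool:
    """Return True if the bytes look like they contain a signature dictionary."""
    return bool(BYTE_RANGE.search(pdf_bytes)) and bool(SIG_MARKER.search(pdf_bytes))
```

Requiring both patterns lowers the chance of a false positive from a stray `/ByteRange` string, at the cost of missing unusual encodings. A byte-level scan like this is cheap enough to run before OCR, which fits its position as a separate branch right after PDF loading.
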
| Component | Responsibilities | Collaborators |
|---|---|---|
| PDF Loader | Load PDF files and manage byte-level access | Digital Signature Check |
| OCR Pipeline | Convert scanned or image-based PDFs into text using Tesseract | Chunker, Visual Signature Detector |
| Digital Signature Check | Check for cryptographic digital signatures in the PDF structure | PDF Loader |
| Visual Signature Detector | Detect the presence of visual signatures (bounding boxes) using an ONNX model | OCR Pipeline, Signature Classifier |
| Signature Classifier | Classify detected visual regions as true or false signatures | Visual Signature Detector |
| Chunker | Split the full text into semantically coherent chunks (e.g. ~512 tokens) | OCR Pipeline, Embedding Module |
| Embedding Module | Generate chunk-level representations using RoBERTa and TF-IDF | Chunker, Chunk Classifier |
| Chunk Classifier | Classify the document using embeddings + LSTM fusion | Embedding Module, Cache Manager |
| Entity Extractor | Perform named entity recognition using RoBERTa-style token classification | Chunker |
| Cache Manager | Cache all intermediate outputs on disk for efficiency and fault tolerance | All major components |

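
The Cache Manager row describes on-disk caching of intermediate outputs. One way this could work is sketched below, using only the standard library. The class name `DiskCache`, the `get_or_compute` method, and the choice of pickle files keyed by stage name plus a SHA-256 of the input are all hypothetical and do not describe the project's actual cache format.

```python
import hashlib
import pickle
from pathlib import Path
from typing import Any, Callable


class DiskCache:
    """Tiny on-disk cache keyed by stage name and input bytes (illustrative only)."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, stage: str, payload: bytes) -> Path:
        digest = hashlib.sha256(payload).hexdigest()
        return self.root / f"{stage}-{digest}.pkl"

    def get_or_compute(
        self, stage: str, payload: bytes, compute: Callable[[bytes], Any]
    ) -> Any:
        path = self._path(stage, payload)
        if path.exists():
            # Cache hit: reuse the stored result instead of recomputing.
            with path.open("rb") as fh:
                return pickle.load(fh)
        result = compute(payload)
        # Write to a temporary file first, then rename, so an interrupted
        # run does not leave a half-written cache entry behind.
        tmp = path.with_suffix(".tmp")
        with tmp.open("wb") as fh:
            pickle.dump(result, fh)
        tmp.replace(path)
        return result
```

With a cache of this shape, a run that fails partway through can be restarted and will skip every stage whose output is already on disk, which is the fault-tolerance benefit the table refers to.
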
| Upgrade Target | Risk Level | Recommendation |
|---|---|---|
| PDF Loader + OCR | Medium | Replace with unified MLLM-based model (e.g., Donut, MMDocReader) |
| Signature Detection | High | Maintain for now, refactor to MLLM only after testing |
| Chunking + Embedding | Medium | Replace with long-context model inference over entire doc |
| Chunk Classifier | Low | Easy to refactor into FusionNet-style or MoE classifier |
| Entity Extractor | Low | Migrate to InstructNER or few-shot NER |
| Caching Framework | Low | Can retain as-is; decouples stability from model upgrades |
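
The table above treats each component as independently replaceable. One way to make that concrete is to have the pipeline depend on a small interface instead of a concrete model. The sketch below illustrates this for the chunk classifier. Every name in it (`ChunkClassifier`, `KeywordClassifier`, `MajorityVoteClassifier`, `classify_document`) is hypothetical, and the toy classifiers are stand-ins, not the RoBERTa + LSTM model or any proposed replacement.

```python
from collections import Counter
from typing import Protocol, Sequence


class ChunkClassifier(Protocol):
    """Hypothetical interface: maps a document's chunks to a single label."""

    def classify(self, chunks: Sequence[str]) -> str: ...


class KeywordClassifier:
    """Toy stand-in for the current classifier."""

    def classify(self, chunks: Sequence[str]) -> str:
        text = " ".join(chunks).lower()
        return "contract" if "agreement" in text else "other"


class MajorityVoteClassifier:
    """Toy stand-in for a replacement: label each chunk, then take the majority."""

    def __init__(self, per_chunk: ChunkClassifier) -> None:
        self.per_chunk = per_chunk

    def classify(self, chunks: Sequence[str]) -> str:
        votes = Counter(self.per_chunk.classify([c]) for c in chunks)
        return votes.most_common(1)[0][0] if votes else "other"


def classify_document(chunks: Sequence[str], model: ChunkClassifier) -> str:
    """The pipeline only relies on the interface, so models can be swapped."""
    return model.classify(chunks)
```

Because `classify_document` accepts anything with a matching `classify` method, a new classifier can be trialled alongside the old one on the same cached chunks before it replaces it.
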

Because each component in this modular architecture can be swapped independently, the system can migrate to newer models gradually, without disrupting the whole pipeline. The first candidates for upgrade include: