What is a Document Transformer?
A Document Transformer is a pipeline component that cleans, splits, enriches, filters, or restructures loaded documents before they are embedded, indexed, retrieved, or consumed by a language model.
How It Works
A Document Transformer is where raw loaded content becomes retrieval-ready knowledge. It may remove boilerplate, normalize whitespace, preserve headings, extract tables, redact sensitive fields, deduplicate repeated content, add metadata, detect language, or split documents into chunks. This stage has disproportionate impact on RAG quality: poor transformation can destroy document structure, break citations, leak sensitive data, or create chunks that are too broad or too narrow for reliable retrieval.
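As a rough illustration of the stage described above, a transformer can be modeled as a function from a list of documents to a list of documents, with several such functions chained in order. The sketch below is a minimal, framework-free version; the `Document` class, the step names, and `run_pipeline` are hypothetical names chosen for this example, not the API of any particular library.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Document:
    """Hypothetical minimal document object: text plus free-form metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


# A transformer step takes a list of documents and returns a new list.
Transformer = Callable[[list[Document]], list[Document]]


def normalize_whitespace(docs: list[Document]) -> list[Document]:
    """Collapse runs of spaces/tabs and squeeze 3+ newlines down to 2."""
    out = []
    for doc in docs:
        text = re.sub(r"[ \t]+", " ", doc.text)
        text = re.sub(r"\n{3,}", "\n\n", text).strip()
        out.append(Document(text, dict(doc.metadata)))
    return out


def drop_empty(docs: list[Document]) -> list[Document]:
    """Filter out documents that have no text left after cleaning."""
    return [doc for doc in docs if doc.text]


def drop_duplicates(docs: list[Document]) -> list[Document]:
    """Keep the first occurrence of each exact text, preserving order."""
    seen: set[str] = set()
    out = []
    for doc in docs:
        if doc.text not in seen:
            seen.add(doc.text)
            out.append(doc)
    return out


def run_pipeline(docs: list[Document], steps: list[Transformer]) -> list[Document]:
    """Apply each step in order; the order matters (clean before dedup)."""
    for step in steps:
        docs = step(docs)
    return docs
```

Running normalization before deduplication is a deliberate choice in this sketch: two documents that differ only in whitespace become identical after cleaning and are then collapsed into one.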
Key Characteristics
- Post-load processing: operates after a document loader and before embedding, indexing, or model consumption
- Structure-aware transformation: may preserve headings, lists, tables, sections, page numbers, and semantic boundaries
- Quality control: removes noise, duplicates, boilerplate, malformed text, and irrelevant sections
- Governance support: can apply redaction, filtering, metadata enrichment, and policy-driven exclusion
- Deterministic design: should be reproducible so index versions can be rebuilt and audited
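One way to make the reproducibility point above concrete is to derive chunk identifiers from content rather than from random values or timestamps. The sketch below assumes a hypothetical `TRANSFORMER_VERSION` string and hypothetical helper names (`chunk_id`, `index_fingerprint`); it uses only `hashlib` and `json` from the Python standard library.

```python
import hashlib
import json

# Hypothetical version tag; bump it whenever transformation logic changes.
TRANSFORMER_VERSION = "v1"


def chunk_id(source: str, text: str, version: str = TRANSFORMER_VERSION) -> str:
    """Derive a stable ID from the source, the chunk text, and the pipeline version."""
    payload = json.dumps(
        {"source": source, "text": text, "version": version},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def index_fingerprint(ids: list[str]) -> str:
    """Order-independent fingerprint of a whole index build, for audits."""
    joined = "\n".join(sorted(ids))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

With identifiers like these, rebuilding an index from the same inputs and the same pipeline version should yield the same fingerprint, and a changed fingerprint signals that either the content or the transformation changed.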
Common Use Cases
- Splitting a long policy manual into heading-aware chunks for RAG retrieval
- Removing navigation, cookie banners, and footer text from crawled HTML
- Extracting tables and preserving page references from PDF documents
- Redacting personally identifiable information before indexing enterprise content
- Adding product, locale, department, permission, or document-version metadata
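Two of the use cases above, redaction and metadata enrichment, can be sketched with the standard library alone. The patterns below are illustrative heuristics for email addresses and phone-like numbers, not a complete PII detector, and the function names `redact` and `enrich` are hypothetical.

```python
import re

# Heuristic patterns for illustration only; real PII detection needs
# broader coverage and review.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


def enrich(metadata: dict, **extra) -> dict:
    """Return a copy of the metadata with extra fields added or overridden."""
    return {**metadata, **extra}


print(redact("Contact jane.doe@example.com or +1 (555) 123-4567."))
# prints: Contact [EMAIL] or [PHONE].
print(enrich({"source": "handbook.pdf"}, locale="en-US", department="HR"))
```

Returning a new metadata dictionary instead of mutating the input keeps the step free of side effects, which makes it easier to rerun and compare pipeline versions.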
Example
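The following is a minimal sketch of a heading-aware splitter, not a reference implementation. It assumes the loaded text uses Markdown-style `#` headings; `Chunk`, `split_by_heading`, and the `max_chars` parameter are hypothetical names introduced for this example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical chunk object: text plus the metadata needed for citations."""
    text: str
    metadata: dict = field(default_factory=dict)


# Assumption: headings look like "# Title" through "###### Title".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def _emit(chunks, lines, heading, source, max_chars):
    """Turn buffered section lines into one or more chunks under one heading."""
    body = "\n".join(lines).strip()
    if not body:
        return
    piece = ""
    # Split oversized sections on paragraph boundaries only. A single
    # paragraph longer than max_chars is kept whole in this sketch.
    for para in body.split("\n\n"):
        candidate = f"{piece}\n\n{para}" if piece else para
        if piece and len(candidate) > max_chars:
            chunks.append(Chunk(piece, {"source": source, "heading": heading}))
            piece = para
        else:
            piece = candidate
    chunks.append(Chunk(piece, {"source": source, "heading": heading}))


def split_by_heading(text: str, source: str, max_chars: int = 500) -> list[Chunk]:
    """Split text into chunks that never cross a heading boundary."""
    chunks: list[Chunk] = []
    heading = None  # text before the first heading gets heading=None
    buffer: list[str] = []
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            _emit(chunks, buffer, heading, source, max_chars)
            buffer = []
            heading = match.group(2).strip()
        else:
            buffer.append(line)
    _emit(chunks, buffer, heading, source, max_chars)
    return chunks


sample = "# Intro\nWelcome.\n\n# Policy\nRule one.\n\nRule two."
for chunk in split_by_heading(sample, source="manual.md", max_chars=12):
    print(chunk.metadata["heading"], "|", chunk.text)
# prints:
# Intro | Welcome.
# Policy | Rule one.
# Policy | Rule two.
```

Carrying the heading and source in each chunk's metadata is what lets a downstream retriever cite the section a passage came from, even after a long section has been split into several pieces.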
Frequently Asked Questions
How is a Document Transformer different from a Document Loader?
A Document Loader reads content from a source and normalizes it into document objects. A Document Transformer modifies those documents: cleaning, splitting, enriching, filtering, or restructuring them before indexing or model use.
Why does transformation quality matter for RAG?
Retrieval quality depends heavily on the documents being indexed. If transformation destroys headings, mixes unrelated sections, drops tables, or produces poor chunk boundaries, the retriever may return misleading evidence even with a good embedding model.
Should document transformation be deterministic?
Yes, production transformations should be deterministic whenever possible. Determinism makes reindexing, audits, regression tests, and quality comparisons reliable across pipeline versions.
Can a Document Transformer enforce data governance?
It can help, but it should not be the only control. Transformers can redact, filter, and tag sensitive content, while permission checks should also be enforced during loading, indexing, retrieval, and answer generation.
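To illustrate how a tag written during transformation can support a later control, the sketch below filters retrieved chunks by group membership at query time. It assumes chunks are plain dictionaries and that the transformer stored a hypothetical `allowed_groups` list in each chunk's metadata; untagged chunks are denied by default, which is a design choice of this example rather than a requirement.

```python
def filter_by_permission(chunks: list[dict], user_groups: list[str]) -> list[dict]:
    """Return only the chunks the user is allowed to see.

    Assumes each chunk looks like {"text": ..., "metadata": {"allowed_groups": [...]}}.
    Chunks with no allowed_groups tag are hidden (deny by default).
    """
    allowed = set(user_groups)
    visible = []
    for chunk in chunks:
        required = set(chunk.get("metadata", {}).get("allowed_groups", []))
        if required & allowed:
            visible.append(chunk)
    return visible


retrieved = [
    {"text": "Salary bands", "metadata": {"allowed_groups": ["hr"]}},
    {"text": "Deploy guide", "metadata": {"allowed_groups": ["eng", "hr"]}},
    {"text": "Untagged note", "metadata": {}},
]
print([c["text"] for c in filter_by_permission(retrieved, ["eng"])])
# prints: ['Deploy guide']
```

Denying untagged chunks means a gap in the tagging step fails closed instead of exposing content.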