Amazing!!! We always want you to try it out with the "hardest" documents - So glad to hear it worked!
The goal of an effective document parser SHOULD be to help with the hardest problems. A low-quality scan of a Norwegian document with statistical tables from 1926 is the perfect test doc 🔥
Posts by Tensorlake
The Tensorlake playground was, unlike AWS Textract and every other tool I have tried, able to parse my angled, low-quality scan of Norwegian pay statistics from 1926.
Not that 1926 Norwegian statistical tables are a generally useful benchmark…
We published:
✓ Full methodology
✓ Corrected OCRBench v2 ground truth
✓ Comparative analysis across all major providers
Read the full benchmark: tlake.link/benchmarks
Stop benchmarking vanity metrics. Start measuring what breaks.
The results were clear:
Tensorlake: 86.8% TEDS, 91.7% F1
AWS Textract: 80.7% TEDS, 88.4% F1
Azure: 78.1% TEDS, 88.1% F1
Docling: 63.8% TEDS, 68.9% F1
The gap? 670 fewer manual reviews per 10k documents.
We evaluated on OCRBench v2, OmniDocBench, and 100 real enterprise docs using two metrics that predict production success:
TEDS (Tree Edit Distance Similarity): Measures whether tables stay tables
JSON F1: Measures whether downstream systems can use the output
Not "is the text similar?" but "can automation actually work?"
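As a rough illustration, field-level JSON F1 can be sketched in a few lines. This is a simplified stand-in, not the exact benchmark harness; the sample documents are invented:

```python
# Sketch: field-level JSON F1 as precision/recall over (path, value) pairs.
def flatten(obj, prefix=""):
    """Flatten nested JSON into a set of (path, value) pairs."""
    pairs = set()
    if isinstance(obj, dict):
        for k, v in obj.items():
            pairs |= flatten(v, f"{prefix}.{k}")
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            pairs |= flatten(v, f"{prefix}[{i}]")
    else:
        pairs.add((prefix, str(obj)))
    return pairs

def json_f1(pred, truth):
    p, t = flatten(pred), flatten(truth)
    if not p or not t:
        return 0.0
    tp = len(p & t)  # fields that match exactly, path and value
    precision, recall = tp / len(p), tp / len(t)
    return 2 * precision * recall / (precision + recall) if tp else 0.0

truth = {"revenue": 100, "items": [{"sku": "A1"}]}
pred  = {"revenue": 100, "items": [{"sku": "A2"}]}
print(json_f1(pred, truth))  # one of two fields wrong → 0.5
```

A parser that garbles one field in a two-field extraction scores 0.5, which is exactly the kind of degradation text-similarity metrics hide.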
Traditional benchmarks test on clean PDFs and measure text accuracy.
But your production failures come from:
- Collapsed tables
- Jumbled reading order
- Missing visual content
- Hallucinated extractions
None of this shows up in your scores.
Two dense document pages flank a skeptical person’s sticker-style portrait against a green gradient, link text centered below.
Document parsing benchmarks have been measuring the wrong thing.
We tested every major parser on real enterprise documents.
The results will change how you think about OCR accuracy 🧵
Promotional banner for the Qdrant Essentials Course featuring Tensorlake. Text reads: ‘Improve collection querying with knowledge graphs.’ On the left is the Qdrant logo and course title; on the right is a smiling woman with long curly brown hair wearing a cream-colored top, set against a purple gradient grid background.
Want to build scalable data lakes w/ Tensorlake + @qdrant.bsky.social?
In the free Qdrant Essentials Course, learn how to:
- Architect vector-powered data lakes
- Optimize ETL pipelines
- Create knowledge graphs
- Integrate @langchain.bsky.social agents for natural language queries
t.co/OoPZswrL7z
Try it yourself with our SEC filing analysis notebook:
tlake.link/notebooks/vl...
Shows how to extract cryptocurrency metrics from 10-Ks and 10-Qs using page classification
Full changelog: tlake.link/changelog/vlm
What would you build with this?
Where we leverage VLM support:
📄 Page Classification: Large docs, specific sections needed
📊 Table/Figure Summarization: Visual data in reports
⚡ skip_ocr=True: For complex reading order, diagrams, and scanned docs
Text extraction still uses OCR for best quality
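A minimal sketch of what enabling these features together might look like. The parameter names below are illustrative assumptions, not the exact SDK surface; check the Tensorlake docs for the real options:

```python
# Hypothetical parse configuration; names are illustrative, not the real API.
parse_options = {
    "page_classification": ["financial_data", "other"],  # VLM routes pages to labels
    "summarize_tables": True,   # VLM summaries for tables/figures in reports
    "skip_ocr": True,           # VLM-only read for diagrams/scans with complex reading order
}

print(sorted(parse_options))
```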
Real results from analyzing 8 SEC filings:
- 1,500+ total pages
- 427 relevant pages identified by VLM
- Processing time: 5 minutes → 45 seconds per document
All without sacrificing accuracy
Our solution: VLMs understand document structure visually
Example: Extracting crypto holdings from SEC filings
1. VLM classifies which pages contain financial data (~50 out of 200 pages)
2. Extract only from relevant pages
3. Skip 70% of processing
Result: 80-90% faster ⚡
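The classify-then-extract pattern above can be sketched in plain Python. `classify_page` is a stand-in for the VLM call (the keyword check is a placeholder, and the sample pages are invented):

```python
# Sketch of classify-then-extract: label pages first, then only process the hits.
def classify_page(page_text):
    # A real pipeline sends the page image to a VLM;
    # this keyword check is a placeholder.
    return "crypto" if "bitcoin" in page_text.lower() else "other"

def extract_relevant(pages):
    # Step 1: classify. Step 2: keep only relevant pages. Step 3: skip the rest.
    return [p for p in pages if classify_page(p) == "crypto"]

pages = ["Risk factors...", "Bitcoin holdings: 9,720 BTC", "Legal proceedings..."]
print(extract_relevant(pages))  # only 1 of 3 pages goes to extraction
```

Skipping two thirds of the pages before extraction is where the 80-90% speedup comes from.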
The problem: Processing 200-page documents when you only need specific information is slow and expensive
Traditional approach:
OCR everything → Convert to text → Search → Extract
This wastes time processing irrelevant pages
New: Vision Language Models now power key document processing features
We're using VLMs for:
- Page classification in large documents
- Table/figure summarization
- Fast structured extraction (skip_ocr mode)
Here's what this means for document processing 🧵
The company I work for, @tensorlake.ai, is hiring for a couple of remote roles within the US: tensorlake.ai/careers
You might be a great fit if you like working with Rust, Python, K8s, me?, and you enjoy building products for developers.
Build approval workflows that trigger on specific feedback. Extract complete edit history for regulatory compliance. Route documents based on flagged sections, all programmatically.
Live now in our API, SDK, and Cloud.
Now you can parse .docx files with tracked changes preserved as clean, structured HTML:
- <del> tags for deletions
- <ins> tags for insertions
- <span class="comment"> for reviewer notes
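A minimal stdlib sketch of consuming that HTML downstream. The tag names follow the post above; the sample snippet is invented:

```python
# Pull tracked changes out of the returned HTML using only the stdlib.
from html.parser import HTMLParser

class EditCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.edits = []      # (kind, text) pairs: del / ins / comment
        self._kind = None

    def handle_starttag(self, tag, attrs):
        if tag in ("del", "ins"):
            self._kind = tag
        elif tag == "span" and ("class", "comment") in attrs:
            self._kind = "comment"

    def handle_data(self, data):
        if self._kind:
            self.edits.append((self._kind, data))

    def handle_endtag(self, tag):
        self._kind = None

html = 'Claim <del>denied</del><ins>approved</ins> <span class="comment">verify policy</span>'
c = EditCollector()
c.feed(html)
print(c.edits)
```

From here, routing on a flagged edit is just a filter over `c.edits`.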
Tensorlake interface showing parsed Word document with tracked changes preserved as HTML tags, displaying an insurance claim report
Most parsers strip all tracked changes when you extract the text.
That means:
❌ Lost audit trails
❌ Manual review of revision history
❌ No programmatic access to reviewer comments
❌ Workflows that can't route based on specific edits
Perfect for:
→ RAG pipelines (better chunking)
→ Knowledge graphs (accurate trees)
→ Document navigation
→ Table of contents generation
Changelog: tlake.link/changelog/he...
Try it: tlake.link/notebooks/he...
Every section header now returns:
- level: 0 for #, 1 for ##, 2 for ###, etc
- content: clean text
- proper nesting for up to 6 levels
Enable with:
cross_page_header_detection=True
That's it.
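For illustration, the (level, content) pairs described above fold straight into an indented table of contents. The header data here is made up:

```python
# Sketch: build an indented TOC from (level, content) header pairs.
headers = [
    (0, "Introduction"),   # level 0 == '#'
    (1, "Background"),     # level 1 == '##'
    (1, "Related Work"),
    (0, "Methods"),
    (2, "Ablations"),      # level 2 == '###'
]

toc = "\n".join("  " * level + content for level, content in headers)
print(toc)
```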
Tensorlake analyzes numbering patterns (1, 1.1, 1.2) and visual structure across the ENTIRE document.
Then corrects misidentified header levels automatically.
Works even when headers span page breaks.
Comparison of document header detection. Left side "Just OCR" shows incorrect hierarchy with section 2.2 at wrong indent level. Right side "Header Correction" shows proper nesting where 2.2 is correctly indented under section 2. Bottom shows Python code: doc_ai.parse_and_wait() with cross_page_header_detection=True parameter. Green gradient background with Tensorlake logo.
OCR engines constantly mess up document hierarchy.
Section 2.2 becomes a top-level header (##) instead of nested (###).
We just shipped automatic header correction.
🧵 How it works:
There's no reason your applications should not be citation-ready.
Dive deeper and try out the Colab notebook linked at the bottom of the blog
RAG citation workflow diagram on dark green background showing document processing pipeline: Document (PDF/Image) → Tensorlake Document AI → Parsed Elements (Text, Tables, Figures, and Bounding Box) → merge and insert anchors → Chunks and Anchors (Clean text and citation IDs) → splits to Citation Metadata (page, bounding box, citation IDs) and Vector DB (embeddings, text, and metadata). URL: https://tlake.link/blog/rag-citations
Step 3: Generating AI responses with verifiable citations
Once your chunks carry anchors, retrieval doesn't change. You can use the dense, hybrid, or reranker setup you already have. Consider hiding the anchors in the prose shown to users while keeping them in the model output and making the IDs clickable.
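A small sketch of resolving anchors at answer time. The `<c>ID</c>` anchor format and the metadata shape are illustrative assumptions:

```python
import re

# Replace <c>ID</c> anchors in a model answer with clickable citation text,
# using the metadata captured at chunking time (shape is illustrative).
citation_meta = {"2.1": {"page": 23, "bbox": [74, 211, 540, 260]}}

def render_citations(answer):
    def link(m):
        cid = m.group(1)
        page = citation_meta[cid]["page"]
        return f"[{cid} (p.{page})]"
    return re.sub(r"<c>(.*?)</c>", link, answer)

print(render_citations("SMOTE broadens the minority decision region <c>2.1</c>."))
```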
Before and after comparison of document chunking on dark green background. Top panel "Without Contextualized Chunking" shows plain text: "SMOTE creates a broader decision region for the minority class...". Bottom panel "With Contextualized Chunking" shows same text with citation anchor "<c>2.1</c>" and metadata: {"2.1": {"page": 23, "bbox": {...}}}. URL: https://tlake.link/blog/rag-citations
Step 2: Create contextualized chunks
Iterate through the page fragment objects and combine them into appropriately sized chunks. As you build the chunks, attach contextualized metadata to help during retrieval.
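One way to sketch this step. The fragment shape and the `page.index` anchor scheme are illustrative assumptions, not the exact API:

```python
# Merge page fragments into ~max_chars chunks, recording an anchor per fragment.
def build_chunks(fragments, max_chars=200):
    chunks, meta, buf = [], {}, []
    for i, frag in enumerate(fragments):
        anchor = f"{frag['page']}.{i}"  # hypothetical page.index anchor scheme
        meta[anchor] = {"page": frag["page"], "bbox": frag["bbox"]}
        buf.append(f"{frag['content']} <c>{anchor}</c>")
        if sum(len(s) for s in buf) >= max_chars:
            chunks.append(" ".join(buf))
            buf = []
    if buf:
        chunks.append(" ".join(buf))
    return chunks, meta

frags = [{"page": 23, "content": "SMOTE creates a broader region...",
          "bbox": [74, 211, 540, 260]}]
chunks, meta = build_chunks(frags)
print(chunks[0])
```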
Tensorlake Document AI interface showing document layout analysis with JSON output on left displaying fragment types, content, and bounding box coordinates, and PDF preview on right with highlighted text regions and yellow bounding boxes overlaid on research paper content
Step 1: Parse docs with bounding boxes
Using our Document AI API, you get the full document layout. For each page fragment you have access to the page number, fragment type, content, and bounding box, making it easy to add metadata and anchor points to chunks before embedding.
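A sketch of what the per-fragment data might look like. Field names follow the post (page number, fragment type, content, bounding box), but the real SDK objects may differ:

```python
# Illustrative parse result; the real response shape may differ.
result = {
    "pages": [
        {"page_number": 23,
         "fragments": [
             {"type": "text",
              "content": "SMOTE creates a broader decision region...",
              "bbox": {"x1": 74, "y1": 211, "x2": 540, "y2": 260}},
         ]},
    ]
}

# Everything needed for a citation anchor travels with each fragment.
for page in result["pages"]:
    for frag in page["fragments"]:
        print(page["page_number"], frag["type"], frag["bbox"]["x1"])
```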
Citations.
When users ask "where did this come from?" your system should point to the exact page fragment...not just "file_name.pdf".
Build citation-aware RAG with spatial metadata:
→ Parse docs with bounding boxes
→ Embed citation anchors in chunks
→ Return page numbers + coordinates
A 🧵
Job update: a couple of weeks ago, I joined @tensorlake.ai full time. I’m having a lot of fun building the product with @diptanu.bsky.social and the rest of this wonderful team.
We have a few open positions if you’d like to work with us: www.linkedin.com/jobs/search/...
Trust in retrieval comes from evidence. Tensorlake ties every lookup back to the original table cell.
Read the blog and try it out for yourself 👇
Side-by-side comparison of a dense healthcare data table in PDF format and its structured DataFrame output. A green background with the Tensorlake logo shows an arrow pointing from the PDF to the DataFrame. The caption reads “Parse Dense Tables Reliably” with the link “tlake.link/blog/dense-tables” at the bottom.
In finance, clinical trials, or performance benchmarks, dense tables contain mission-critical data.
But flatten that data like most parsers do and trust is lost.
Tensorlake restores trust by preserving structure, generating summaries for effective embeddings, and attaching evidence via bounding boxes.
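As a sketch, preserving structure plus cell-level evidence might look like this. The table shape below is invented for illustration:

```python
# Keep structure + evidence when materializing a parsed table:
# each cell carries its own bbox, so a lookup can cite the exact source cell.
table = {
    "columns": ["Drug", "N", "Response %"],
    "rows": [
        [{"value": "A",    "bbox": [10, 40, 60, 55]},
         {"value": "120",  "bbox": [70, 40, 100, 55]},
         {"value": "38.5", "bbox": [110, 40, 160, 55]}],
    ],
}

records = [
    {col: cell["value"] for col, cell in zip(table["columns"], row)}
    for row in table["rows"]
]
evidence = {(r, c): cell["bbox"]
            for r, row in enumerate(table["rows"])
            for c, cell in enumerate(row)}
print(records[0]["Response %"], evidence[(0, 2)])
```

`records` feeds a DataFrame or vector store; `evidence` ties every value back to its cell on the page.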