Amazing!!! We always want you to try it out with the "hardest" documents - So glad to hear it worked!
The goal of an effective document parser SHOULD be to help with the hardest problems. A low-quality scan of a Norwegian document with statistical tables from 1926 is the perfect test doc 🔥
Posts by Tensorlake
The Tensorlake playground was, unlike AWS Textract and every other tool I have tried, able to parse my angled, low-quality scan of Norwegian pay statistics from 1926.
Not that 1926 Norwegian statistical tables are a generally useful benchmark…
We published:
✓ Full methodology
✓ Corrected OCRBench v2 ground truth
✓ Comparative analysis across all major providers
Read the full benchmark: tlake.link/benchmarks
Stop benchmarking vanity metrics. Start measuring what breaks.
The results were clear:
Tensorlake: 86.8% TEDS, 91.7% F1
AWS Textract: 80.7% TEDS, 88.4% F1
Azure: 78.1% TEDS, 88.1% F1
Docling: 63.8% TEDS, 68.9% F1
The gap? 670 fewer manual reviews per 10k documents.
We evaluated on OCRBench v2, OmniDocBench, and 100 real enterprise docs using two metrics that predict production success:
TEDS (Tree Edit Distance Similarity): Measures whether tables stay tables
JSON F1: Measures whether downstream systems can use the output
Not "is the text similar?" but "can automation actually work?"
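As a rough illustration, field-level JSON F1 can be sketched in a few lines. This is a simplified stand-in, not the exact benchmark harness; the sample documents are invented:

```python
# Sketch: field-level JSON F1 as precision/recall over (path, value) pairs.
def flatten(obj, prefix=""):
    """Flatten nested JSON into a set of (path, value) pairs."""
    pairs = set()
    if isinstance(obj, dict):
        for k, v in obj.items():
            pairs |= flatten(v, f"{prefix}.{k}")
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            pairs |= flatten(v, f"{prefix}[{i}]")
    else:
        pairs.add((prefix, str(obj)))
    return pairs

def json_f1(pred, truth):
    p, t = flatten(pred), flatten(truth)
    if not p or not t:
        return 0.0
    tp = len(p & t)  # fields that match exactly, path and value
    precision, recall = tp / len(p), tp / len(t)
    return 2 * precision * recall / (precision + recall) if tp else 0.0

truth = {"revenue": 100, "items": [{"sku": "A1"}]}
pred  = {"revenue": 100, "items": [{"sku": "A2"}]}
print(json_f1(pred, truth))  # one of two fields wrong → 0.5
```

A parser that garbles one field in a two-field extraction scores 0.5, which is exactly the kind of degradation text-similarity metrics hide.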
Traditional benchmarks test on clean PDFs and measure text accuracy.
But your production failures come from:
- Collapsed tables
- Jumbled reading order
- Missing visual content
- Hallucinated extractions
None of this shows up in your scores.
Two dense document pages flank a skeptical person’s sticker-style portrait against a green gradient, link text centered below.
Document parsing benchmarks have been measuring the wrong thing.
We tested every major parser on real enterprise documents.
The results will change how you think about OCR accuracy 🧵
Promotional banner for the Qdrant Essentials Course featuring Tensorlake. Text reads: ‘Improve collection querying with knowledge graphs.’ On the left is the Qdrant logo and course title; on the right is a smiling woman with long curly brown hair wearing a cream-colored top, set against a purple gradient grid background.
Want to build scalable data lakes w/ Tensorlake + @qdrant.bsky.social?
In the free Qdrant Essentials Course, learn how to:
- Architect vector-powered data lakes
- Optimize ETL pipelines
- Create knowledge graphs
- Integrate @langchain.bsky.social agents for natural language queries
t.co/OoPZswrL7z
Try it yourself with our SEC filing analysis notebook:
tlake.link/notebooks/vl...
Shows how to extract cryptocurrency metrics from 10-Ks and 10-Qs using page classification
Full changelog: tlake.link/changelog/vlm
What would you build with this?
Where we leverage VLM support:
📄 Page Classification: Large docs, specific sections needed
📊 Table/Figure Summarization: Visual data in reports
⚡ skip_ocr=True: For complex reading order, diagrams, and scanned docs
Text extraction still uses OCR for best quality
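A minimal sketch of what enabling these features together might look like. The parameter names below are illustrative assumptions, not the exact SDK surface; check the Tensorlake docs for the real options:

```python
# Hypothetical parse configuration; names are illustrative, not the real API.
parse_options = {
    "page_classification": ["financial_data", "other"],  # VLM routes pages to labels
    "summarize_tables": True,   # VLM summaries for tables/figures in reports
    "skip_ocr": True,           # VLM-only read for diagrams/scans with complex reading order
}

print(sorted(parse_options))
```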
Real results from analyzing 8 SEC filings:
- 1,500+ total pages
- 427 relevant pages identified by VLM
- Processing time: 5 minutes → 45 seconds per document
All without sacrificing accuracy
Our solution: VLMs understand document structure visually
Example: Extracting crypto holdings from SEC filings
1. VLM classifies which pages contain financial data (~50 out of 200 pages)
2. Extract only from relevant pages
3. Skip 70% of processing
Result: 80-90% faster ⚡
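The classify-then-extract pattern above can be sketched in plain Python. `classify_page` is a stand-in for the VLM call (the keyword check is a placeholder, and the sample pages are invented):

```python
# Sketch of classify-then-extract: label pages first, then only process the hits.
def classify_page(page_text):
    # A real pipeline sends the page image to a VLM;
    # this keyword check is a placeholder.
    return "crypto" if "bitcoin" in page_text.lower() else "other"

def extract_relevant(pages):
    # Step 1: classify. Step 2: keep only relevant pages. Step 3: skip the rest.
    return [p for p in pages if classify_page(p) == "crypto"]

pages = ["Risk factors...", "Bitcoin holdings: 9,720 BTC", "Legal proceedings..."]
print(extract_relevant(pages))  # only 1 of 3 pages goes to extraction
```

Skipping two thirds of the pages before extraction is where the 80-90% speedup comes from.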
The problem: Processing 200-page documents when you only need specific information is slow and expensive
Traditional approach:
OCR everything → Convert to text → Search → Extract
This wastes time processing irrelevant pages
New: Vision Language Models now power key document processing features
We're using VLMs for:
- Page classification in large documents
- Table/figure summarization
- Fast structured extraction (skip_ocr mode)
Here's what this means for document processing 🧵
The company I work for, @tensorlake.ai, is hiring for a couple of remote roles within the US: tensorlake.ai/careers
You might be a great fit if you like working with Rust, Python, K8s, me?, and you enjoy building products for developers.
Build approval workflows that trigger on specific feedback. Extract complete edit history for regulatory compliance. Route documents based on flagged sections, all programmatically.
Live now in our API, SDK, and Cloud.
Now you can parse .docx files with tracked changes preserved as clean, structured HTML:
- <del> tags for deletions
- <ins> tags for insertions
- <span class="comment"> for reviewer notes
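A minimal stdlib sketch of consuming that HTML downstream. The tag names follow the post above; the sample snippet is invented:

```python
# Pull tracked changes out of the returned HTML using only the stdlib.
from html.parser import HTMLParser

class EditCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.edits = []      # (kind, text) pairs: del / ins / comment
        self._kind = None

    def handle_starttag(self, tag, attrs):
        if tag in ("del", "ins"):
            self._kind = tag
        elif tag == "span" and ("class", "comment") in attrs:
            self._kind = "comment"

    def handle_data(self, data):
        if self._kind:
            self.edits.append((self._kind, data))

    def handle_endtag(self, tag):
        self._kind = None

html = 'Claim <del>denied</del><ins>approved</ins> <span class="comment">verify policy</span>'
c = EditCollector()
c.feed(html)
print(c.edits)
```

From here, routing on a flagged edit is just a filter over `c.edits`.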
Tensorlake interface showing parsed Word document with tracked changes preserved as HTML tags, displaying an insurance claim report
Most parsers strip all tracked changes when you extract the text.
That means:
❌ Lost audit trails
❌ Manual review of revision history
❌ No programmatic access to reviewer comments
❌ Workflows that can't route based on specific edits
Perfect for:
→ RAG pipelines (better chunking)
→ Knowledge graphs (accurate trees)
→ Document navigation
→ Table of contents generation
Changelog: tlake.link/changelog/he...
Try it: tlake.link/notebooks/he...
Every section header now returns:
- level: 0 for #, 1 for ##, 2 for ###, etc
- content: clean text
- proper nesting for up to 6 levels
Enable with:
cross_page_header_detection=True
That's it.
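For illustration, the (level, content) pairs described above fold straight into an indented table of contents. The header data here is made up:

```python
# Sketch: build an indented TOC from (level, content) header pairs.
headers = [
    (0, "Introduction"),   # level 0 == '#'
    (1, "Background"),     # level 1 == '##'
    (1, "Related Work"),
    (0, "Methods"),
    (2, "Ablations"),      # level 2 == '###'
]

toc = "\n".join("  " * level + content for level, content in headers)
print(toc)
```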
Tensorlake analyzes numbering patterns (1, 1.1, 1.2) and visual structure across the ENTIRE document.
Then corrects misidentified header levels automatically.
Works even when headers span page breaks.
Comparison of document header detection. Left side "Just OCR" shows incorrect hierarchy with section 2.2 at wrong indent level. Right side "Header Correction" shows proper nesting where 2.2 is correctly indented under section 2. Bottom shows Python code: doc_ai.parse_and_wait() with cross_page_header_detection=True parameter. Green gradient background with Tensorlake logo.
OCR engines constantly mess up document hierarchy.
Section 2.2 becomes a top-level header (##) instead of nested (###).
We just shipped automatic header correction.
🧵 How it works:
There's no reason your applications should not be citation-ready.
Dive deeper and try out the Colab notebook linked at the bottom of the blog
RAG citation workflow diagram on dark green background showing document processing pipeline: Document (PDF/Image) → Tensorlake Document AI → Parsed Elements (Text, Tables, Figures, and Bounding Box) → merge and insert anchors → Chunks and Anchors (Clean text and citation IDs) → splits to Citation Metadata (page, bounding box, citation IDs) and Vector DB (embeddings, text, and metadata). URL: https://tlake.link/blog/rag-citations
Step 3: Generating AI responses with verifiable citations
Once your chunks carry anchors, retrieval doesn't change. You can use the dense, hybrid, or reranker setup you already have. Consider hiding the anchors in the prose shown to users while keeping them in the model output and making the IDs clickable.
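A small sketch of resolving anchors at answer time. The `<c>ID</c>` anchor format and the metadata shape are illustrative assumptions:

```python
import re

# Replace <c>ID</c> anchors in a model answer with clickable citation text,
# using the metadata captured at chunking time (shape is illustrative).
citation_meta = {"2.1": {"page": 23, "bbox": [74, 211, 540, 260]}}

def render_citations(answer):
    def link(m):
        cid = m.group(1)
        page = citation_meta[cid]["page"]
        return f"[{cid} (p.{page})]"
    return re.sub(r"<c>(.*?)</c>", link, answer)

print(render_citations("SMOTE broadens the minority decision region <c>2.1</c>."))
```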
Before and after comparison of document chunking on dark green background. Top panel "Without Contextualized Chunking" shows plain text: "SMOTE creates a broader decision region for the minority class...". Bottom panel "With Contextualized Chunking" shows same text with citation anchor "<c>2.1</c>" and metadata: {"2.1": {"page": 23, "bbox": {...}}}. URL: https://tlake.link/blog/rag-citations
Step 2: Create contextualized chunks
Iterate through the page fragment objects and combine them into appropriately sized chunks. As you build the chunks, attach contextualized metadata to help during retrieval.
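One way to sketch this step. The fragment shape and the `page.index` anchor scheme are illustrative assumptions, not the exact API:

```python
# Merge page fragments into ~max_chars chunks, recording an anchor per fragment.
def build_chunks(fragments, max_chars=200):
    chunks, meta, buf = [], {}, []
    for i, frag in enumerate(fragments):
        anchor = f"{frag['page']}.{i}"  # hypothetical page.index anchor scheme
        meta[anchor] = {"page": frag["page"], "bbox": frag["bbox"]}
        buf.append(f"{frag['content']} <c>{anchor}</c>")
        if sum(len(s) for s in buf) >= max_chars:
            chunks.append(" ".join(buf))
            buf = []
    if buf:
        chunks.append(" ".join(buf))
    return chunks, meta

frags = [{"page": 23, "content": "SMOTE creates a broader region...",
          "bbox": [74, 211, 540, 260]}]
chunks, meta = build_chunks(frags)
print(chunks[0])
```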
Tensorlake Document AI interface showing document layout analysis with JSON output on left displaying fragment types, content, and bounding box coordinates, and PDF preview on right with highlighted text regions and yellow bounding boxes overlaid on research paper content
Step 1: Parse docs with bounding boxes
Using our Document AI API, you get the full document layout. For each page fragment you have access to the page number, fragment type, content, and bounding box, making it easy to add metadata and anchor points to chunks before embedding.
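A sketch of what the per-fragment data might look like. Field names follow the post (page number, fragment type, content, bounding box), but the real SDK objects may differ:

```python
# Illustrative parse result; the real response shape may differ.
result = {
    "pages": [
        {"page_number": 23,
         "fragments": [
             {"type": "text",
              "content": "SMOTE creates a broader decision region...",
              "bbox": {"x1": 74, "y1": 211, "x2": 540, "y2": 260}},
         ]},
    ]
}

# Everything needed for a citation anchor travels with each fragment.
for page in result["pages"]:
    for frag in page["fragments"]:
        print(page["page_number"], frag["type"], frag["bbox"]["x1"])
```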
Citations.
When users ask "where did this come from?" your system should point to the exact page fragment...not just "file_name.pdf".
Build citation-aware RAG with spatial metadata:
→ Parse docs with bounding boxes
→ Embed citation anchors in chunks
→ Return page numbers + coordinates
A 🧵
Job update: a couple of weeks ago, I joined @tensorlake.ai full time. I’m having a lot of fun building the product with @diptanu.bsky.social and the rest of this wonderful team.
We have a few open positions if you’d like to work with us: www.linkedin.com/jobs/search/...
Trust in retrieval comes from evidence. Tensorlake ties every lookup back to the original table cell.
Read the blog and try it out for yourself 👇
Side-by-side comparison of a dense healthcare data table in PDF format and its structured DataFrame output. A green background with the Tensorlake logo shows an arrow pointing from the PDF to the DataFrame. The caption reads “Parse Dense Tables Reliably” with the link “tlake.link/blog/dense-tables” at the bottom.
In finance, clinical trials, or performance benchmarks, dense tables contain mission-critical data.
But flatten that data like most parsers do and trust is lost.
Tensorlake restores trust by preserving structure, generating summaries for effective embeddings, and attaching evidence via bounding boxes.
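As a sketch, preserving structure plus cell-level evidence might look like this. The table shape below is invented for illustration:

```python
# Keep structure + evidence when materializing a parsed table:
# each cell carries its own bbox, so a lookup can cite the exact source cell.
table = {
    "columns": ["Drug", "N", "Response %"],
    "rows": [
        [{"value": "A",    "bbox": [10, 40, 60, 55]},
         {"value": "120",  "bbox": [70, 40, 100, 55]},
         {"value": "38.5", "bbox": [110, 40, 160, 55]}],
    ],
}

records = [
    {col: cell["value"] for col, cell in zip(table["columns"], row)}
    for row in table["rows"]
]
evidence = {(r, c): cell["bbox"]
            for r, row in enumerate(table["rows"])
            for c, cell in enumerate(row)}
print(records[0]["Response %"], evidence[(0, 2)])
```

`records` feeds a DataFrame or vector store; `evidence` ties every value back to its cell on the page.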