Tables broke our document AI pipeline twice. We open-sourced the fixes.
What fell out of building DocuPrism: two MIT-licensed libraries.
- Artificial Intelligence
- Intelligent automation
- Technology
A financial table extracted from a five-page PDF, broken into four separate tables in the output. The continuation rows on pages two and three orphaned from their headers on page one. The total at the bottom of page four sitting on its own as a one-column “table” with no idea what it was the total of.
A graph extraction over a quarterly report that landed the value 155 in the graph as a node — orphaned from its row header (Revenue), orphaned from its column path (2024 > Q2), orphaned from the entity the report was even about. The number was real. The fact it was meant to encode — in 2024 Q2, this company’s revenue was 155 — never got assembled.
Where this came from
The two failures showed up while we were building DocuPrism — our document intelligence engine that turns claims, contracts, case files, and other dense enterprise documents into trust-scored structured data and temporal knowledge graphs. DocuPrism is a graph factory: a system that asks what is true, of whom, and when across thousands of documents and assembles a queryable answer. It is intended to run in production for clients in regulated industries, so failure modes get a lot of attention. (More on DocuPrism here.)
If you process a few hundred documents, the failures we hit don’t really show up — you read the output, you fix the bad row by hand, you move on. At volume, in regulated environments, “fix it by hand” is not an option. You either solve it once in code or you keep losing data.
Failure mode 1: tables that span pages
PDF table extractors are page-bound. They process one page at a time and emit one table per visual table they find. When a logical table spans pages, the extractor doesn’t know that — it produces a separate table for each piece.
Five flavours of the same problem:
- Data orphans. Body rows continue on page two without their headers.
- Header orphans. Headers at the bottom of one page, data on the next.
- Spillover. A long URL or description cut at the page margin appears as a separate one-column “table.”
- Split cells. A single cell’s content fragmented across the page break.
- Width drift. The same logical table extracted with slightly different column counts on different pages.
For a graph factory, this is fatal. The graph extraction step that runs over the table doesn’t know that the headerless continuation rows on page two belong to the table on page one. The values on page two either lose their context entirely — no row label, no column label, no entity to attach to — or worse, get attached to whatever headers the extractor finds in their immediate neighbourhood, which might be the next table on the page. Bad data in the graph compounds: every downstream entity merge, every temporal stitch, every query starts compromised.
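To make the detection problem concrete, here is a minimal heuristic sketch for the data-orphan flavour. The field names are hypothetical and the check is deliberately simpler than any real merge engine's rules:

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    """Minimal stand-in for one extracted table fragment (hypothetical shape)."""
    page: int
    n_cols: int
    has_header_row: bool  # did the extractor mark the first row as a header?

def looks_like_continuation(prev: Fragment, curr: Fragment) -> bool:
    """Heuristic: a headerless fragment on the next page with the same
    column count is probably a continuation of the previous table."""
    return (
        curr.page == prev.page + 1
        and not curr.has_header_row
        and curr.n_cols == prev.n_cols
    )

# A table on page 1 followed by a headerless, same-width fragment on page 2:
t1 = Fragment(page=1, n_cols=4, has_header_row=True)
t2 = Fragment(page=2, n_cols=4, has_header_row=False)
print(looks_like_continuation(t1, t2))  # True
```

A real stitcher has to handle width drift, split cells and spillover on top of this, which is exactly why the naive version above stops being enough.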
We built table-stitcher to detect these fragments and merge them back. Its core is parser-agnostic: the merge engine never sees parser-native objects, only a small TableMeta dataclass. Adapters translate between the parser’s world and that dataclass.
A word on the adapter that ships in the box: we use Docling in production at PebbleRoad. It’s one of the strongest open-source document extractors out there, and table-stitcher is what we needed on top of it, not a replacement. Hat tip to the DS4SD team for setting the bar. The library ships with a Docling adapter; anything else (HTML, Camelot, your own extractor) is about 50 lines.
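To make the adapter idea concrete, here is a sketch under assumed names. `TableMeta`'s real fields live in the library; everything below is illustrative shape, not its API:

```python
from dataclasses import dataclass, field

@dataclass
class TableMeta:
    """Parser-neutral table description. Field names are illustrative;
    see the table-stitcher docs for the real dataclass."""
    page: int
    n_rows: int
    n_cols: int
    header_rows: list = field(default_factory=list)
    body_rows: list = field(default_factory=list)

def rows_to_meta(page: int, rows: list) -> TableMeta:
    """The core of a small adapter: translate a parser's native output
    (here, plain lists of cell strings) into the neutral dataclass."""
    return TableMeta(
        page=page,
        n_rows=len(rows),
        n_cols=max(len(r) for r in rows),
        header_rows=rows[:1],
        body_rows=rows[1:],
    )

meta = rows_to_meta(1, [["Year", "Q1", "Q2"], ["2024", "130", "155"]])
print(meta.n_cols)  # 3
```

The point of the pattern is that the merge engine only ever sees `TableMeta`, so swapping Docling for Camelot or your own extractor means rewriting one translation function, not the merge rules.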
The design move that matters most for enterprise readers is the pass-through guarantee. If extraction fails for a table, the original is preserved unchanged. If the whole stitching pipeline fails, the original document comes back as is. The worst case is that nothing changes. Never that data is lost.
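The guarantee is easy to picture as a fail-open wrapper. This is a sketch of the pattern, not the library's actual code:

```python
def stitch_safely(tables, stitch_fn):
    """Fail-open wrapper illustrating the pass-through guarantee: if
    merging raises for any reason, the caller gets the original tables
    back unchanged. `stitch_fn` is a placeholder for a real merge engine."""
    try:
        return stitch_fn(tables)
    except Exception:
        # Worst case: nothing changes. Never: data is lost.
        return tables

# A stitcher that always blows up still returns its input untouched:
def broken(_tables):
    raise RuntimeError("merge failed")

original = [{"id": 1}, {"id": 2}]
print(stitch_safely(original, broken) == original)  # True
```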
Multilingual headers work without a language model — the merge rules read structural signals, not cell text, so Latin, CJK, Thai, Arabic, Cyrillic and every other writing system parse the same way.
GitHub → · pip install table-stitcher[docling]
Failure mode 2: tables that lose their context
This one is sneakier. The extractor does its job. The table comes out clean — headers intact, all the data present. But every cell carries an implicit relationship to its row header and its column header, and standard table representations flatten that relationship away. By the time a graph extractor reaches the cell, the headers have been merged, dropped, or rendered into a shape no parser can reliably recover.
Watch what happens to a two-level financial table. A standard markdown extraction collapses the year/quarter hierarchy into a single header row:
| | Q1 | Q2 | Q1 | Q2 |
| --- | --- | --- | --- | --- |
| Revenue | 130 | 155 | 118 | 125 |
Now ask the graph extractor: what is Revenue for 2024 Q2? It cannot tell. There are two Q2 columns and the year that distinguished them is gone. The triple it should produce — (Company, has-revenue-for, 2024-Q2) = 155 — cannot be assembled from this representation. The graph either records garbage or skips the cell.
table2rules transforms the same table into one fact per line, with the full row-header path and full column-header path on every line:
```
Revenue | 2024 > Q1: 130
Revenue | 2024 > Q2: 155
Revenue | 2023 > Q1: 118
Revenue | 2023 > Q2: 125
Operating Costs | 2024 > Q1: 55
Operating Costs | 2024 > Q2: 60
Operating Costs | 2023 > Q1: 48
Operating Costs | 2023 > Q2: 52
```
Now every line is a self-contained fact: row entity, column path, value. Graph extraction reads it cell by cell and produces clean triples without re-discovering structure. No flattening, no orphaning, no inferred meaning that turns out to be wrong.
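The transformation itself is simple enough to sketch. This toy function mirrors the output shape above — one fact per line, full row and column paths — while skipping the real library's HTML parsing and header pathfinding:

```python
def table_to_rules(row_headers, col_paths, values):
    """Pair every cell with its row header and its full column-header
    path ("2024 > Q2"). Mirrors table2rules' output shape only; the
    library derives these paths from HTML markup, not from clean lists."""
    lines = []
    for row_label, row_values in zip(row_headers, values):
        for col_path, value in zip(col_paths, row_values):
            lines.append(f"{row_label} | {' > '.join(col_path)}: {value}")
    return lines

rules = table_to_rules(
    row_headers=["Revenue", "Operating Costs"],
    col_paths=[("2024", "Q1"), ("2024", "Q2"), ("2023", "Q1"), ("2023", "Q2")],
    values=[(130, 155, 118, 125), (55, 60, 48, 52)],
)
print(rules[1])  # Revenue | 2024 > Q2: 155
```

The hard part the sketch dodges is recovering `col_paths` and `row_headers` from messy markup in the first place, which is where the pathfinding below comes in.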
The design move underneath is what I find most satisfying about this library. Tables are mazes. Each cell finds its headers by pathfinding — left for row context, above for column context — driven by HTML markup signals (th, thead, scope, rowspan, colspan). It is not pattern matching. It is not table-type classification. The algorithm discovers structure rather than memorising patterns. When markup is hostile, it fails open: it preserves the raw HTML rather than fabricating structure.
The same property — every line is a self-contained fact carrying its full header path — also makes the output chunk-safe for RAG pipelines, where token-based chunkers routinely split header rows away from data rows. We came at this from the graph side; teams running retrieval systems hit the same library from the chunker side. Same fix, two audiences.
A few numbers that matter. The library has been tested against 200 PubTabNet tables with per-cell oracle matching, plus around 2,000 mutation tests applying ten HTML-noise patterns on top. In tokens, the rules output is a median 27% smaller than the source HTML — though on dense, deep-header tables it can grow by up to 59%, the deliberate cost of carrying the full header path on every line. And because the parser operates on table geometry rather than cell text, it is language-agnostic by construction. Every writing system reduces to the same single content-level check: does this cell contain any letter?
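That check maps directly onto Unicode. Python's `str.isalpha` already treats every letter category the same, so a sketch of the test is one line (illustrative, not the library's code):

```python
def has_letter(cell_text: str) -> bool:
    """The single content-level check: does this cell contain any letter?
    str.isalpha covers every Unicode letter category, so Latin, CJK,
    Thai, Arabic and Cyrillic all answer the same way."""
    return any(ch.isalpha() for ch in cell_text)

print(has_letter("155"))      # False  (pure number)
print(has_letter("收入"))      # True   (CJK, "revenue")
print(has_letter("Q2 2024"))  # True   (mixed)
```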
GitHub → · pip install table2rules
Why we open-sourced this
Three reasons, all honest.
These are infrastructure problems, not differentiators. Table extraction and table chunking are solved-or-not-solved engineering problems. They are not where DocuPrism’s moat lives. Table parsing is not strategic IP. It is plumbing.
We’ve benefited from open source for twenty years. The DocuPrism stack rests on many open-source libraries. Releasing back is part of the contract. We picked MIT specifically — not GPL, not Apache 2 with patent clauses — so commercial enterprise teams can adopt the libraries without a legal review.
And there is a practical reason. Every team running production document AI hits these problems eventually. If we save someone two weeks of debugging, they can spend it on the part of their pipeline that actually differentiates their product. That is good for the field.
The two libraries are small. They solve narrow, specific problems. But they are real artefacts of the work — open, MIT-licensed, taxonomy-tested, and ready for anyone else stuck at the same handoff.
If this is useful to you
- Use the libraries. table-stitcher · table2rules. Issues and PRs welcome.
- Hit the same problems we did? Schedule a DocuPrism demo.
- Have a workflow you’d like to make intelligent? Talk to us.