Cleaning Multilingual PDFs: Ligatures, IPA, and Broken Unicode

A PDF is a layout format, not a text format. Pulling clean text out of one, especially a LaTeX-compiled linguistics paper full of IPA and accents, is most of the battle in a retrieval pipeline. Garbage text produces garbage embeddings, and the search is only as good as the text underneath it.

// 01 — RECOVERING MIS-DECODED CHARACTERS

pypdf sometimes hands back IPA symbols and accented letters that were mis-decoded at the byte level. _decode_page_text() recovers them with a latin-1 → UTF-8 re-encode trick, reinterpreting the bytes under the correct encoding so ʃ, é, and friends come back intact instead of as mojibake.

// 02 — THE CLEANING ORDER

clean_text() applies six transformations, and the order is load-bearing: each step assumes the previous one ran:

Expand ligatures: ﬁ → fi, ﬂ → fl, ﬀ/ﬃ/ﬄ, ﬆ → st. PDF typesetting artifacts.
NFC normalize: collapse combining diacritics to precomposed form, so é (e + ◌́) equals é (single codepoint). Without this, two visually identical strings embed differently.
Strip Unicode spaces: remove non-breaking and zero-width spaces PDF renderers inject.
Join hyphenated line-breaks: seman-\ntic → semantic, the LaTeX line-wrap artifact.
Collapse excess newlines: 3+ newlines → 2, preserving paragraph breaks.
Strip trailing whitespace per line: column layouts pad lines to page width with spaces.

// 03 — WHY ORDER MATTERS

Run NFC normalization before ligature expansion and you can normalize a ligature into a form the expander no longer recognizes. Join line-breaks before stripping the zero-width spaces and the hyphen match can miss. Each step is cheap; the sequence is the design. Get it wrong and the corruption is subtle: the text looks fine to a human and embeds wrong.

TAKEAWAYS

Text extraction quality is retrieval quality. Bad characters in means bad vectors in means bad search out. Clean aggressively before embedding.
Unicode normalization (NFC) is not optional for multilingual text: visually identical strings must become byte-identical or they won’t match.
When transformations interact, the order is part of the algorithm. Document it, test it, don’t reshuffle it casually.

Build log 03: chunking with provenance: 512 chars, 64 overlap, full lineage.

// 01 — RECOVERING MIS-DECODED CHARACTERS

// 02 — THE CLEANING ORDER

// 03 — WHY ORDER MATTERS

TAKEAWAYS

NEXT