Build log
Model release · May 2026

Stronger translator: v7 lands at chrF 62.

The new v7 NLLB-1.3B fine-tune is live: 179k cleaned parallel pairs, fresh data from JW publications and the IUMTBN New Testament, a curated dictionary, and a +4 chrF jump on ium → eng over the previous version.

§ 01

What's new.

Learn Iu Mien just shipped v7 of its neural translator. The model serving the site at api.learniumien.org is now this newer fine-tune, and the dictionary backing /dictionary has been audited and expanded.

Headline numbers: chrF 62 on ium → eng (up from 58 in v6), BLEU 47 on the same direction. The model now trains on 179,386 parallel sentence pairs — 12,610 new pairs over v6 — pulled from sources we hadn't tapped yet.

The reverse direction (eng → ium) is harder for any low-resource MT model, and we know it. v7 nudges it from chrF 28 to 29; the next training run will close more of that gap.
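For reference, chrF and BLEU figures like these are conventionally computed with sacrebleu. A minimal scoring sketch, with placeholder file names rather than the project's actual evaluation paths:

```python
# Sketch: scoring a translated test set with sacrebleu (pip install sacrebleu).
# File names and the test split are placeholders, not the project's real paths.
import sacrebleu

with open("test.ium_Latn-eng.hyp") as f:   # model outputs, one sentence per line
    hypotheses = [line.strip() for line in f]
with open("test.ium_Latn-eng.ref") as f:   # English references, same order
    references = [line.strip() for line in f]

chrf = sacrebleu.corpus_chrf(hypotheses, [references])
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"chrF: {chrf.score:.1f}  BLEU: {bleu.score:.1f}")
```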

§ 02

Where the new data came from.

Three sources got added to the corpus this round. Watchtower Online Library yielded ~700 paragraph-level Iu Mien ↔ English alignments from JW publications crawled through their wol.jw.org doc-id system. The IUMTBN 1991 New Testament — the Thailand Bible Society's Iu Mien translation — was paired against the King James USFM by book.chapter.verse keys, contributing 4,500 verse-level pairs after sanitization.
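The pairing step itself is simple once both sides share consistent keys. A minimal sketch, assuming the IUMTBN and KJV texts have already been parsed into dicts keyed by book.chapter.verse; the key format and function name are illustrative:

```python
# Sketch: pair verses by shared book.chapter.verse keys.
# Assumes ium_verses and kjv_verses are dicts like {"MAT.5.3": "Blessed are ..."}
# produced by an upstream USFM parser (not shown).
def align_verses(ium_verses: dict[str, str], kjv_verses: dict[str, str]) -> list[tuple[str, str]]:
    pairs = []
    for key in sorted(ium_verses.keys() & kjv_verses.keys()):
        ium, eng = ium_verses[key].strip(), kjv_verses[key].strip()
        if ium and eng:                     # drop empty or missing verses
            pairs.append((ium, eng))
    return pairs
```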

Glosbe yielded smaller but cleaner pairs — community-curated phrase pairs that fill out conversational coverage the Bible-heavy corpus misses. We hit Glosbe with seed-based query expansion to extract 250 unique pairs.

Wiktionary entries for Iu Mien words were folded into the dictionary table after sanitization. Multi-definition entries were split ("country;nation" became two rows), concatenated forms ("tocut") were spaced ("to cut"), and lexical markers like "Antonym:hlang" were stripped.
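A minimal sketch of those normalizations, with illustrative patterns rather than the exact rules used on the Wiktionary dump; the concatenation fix ("tocut" → "to cut") needs a wordlist and is only noted in a comment:

```python
import re

# Sketch: normalize one raw English gloss into zero or more clean definitions.
# Patterns are illustrative; re-spacing concatenated forms like "tocut" relies
# on a wordlist lookup that is not shown here.
def clean_gloss(raw: str) -> list[str]:
    # Strip lexical markers such as "Antonym:hlang" appended to the gloss.
    raw = re.sub(r"\b(Antonym|Synonym|See also):\s*\S+", "", raw)
    # Split multi-definition entries: "country;nation" -> ["country", "nation"].
    return [part.strip() for part in raw.split(";") if part.strip()]
```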

§ 03

Cleanup pass.

Before training, we ran a heuristic audit on the full corpus. The audit caught 186 rows of OCR-fragment garbage from older Purnell-2007 dictionary scans, JW.org donation footer URLs that had leaked through, cross-split contamination between train and test, and zero-width characters embedded in Bible.com text.
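A sketch of the flavor of those heuristics; the patterns and thresholds below are assumptions for illustration, not the audit's exact rules:

```python
# Sketch of the kinds of per-row checks the audit applies.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def strip_zero_width(text: str) -> str:
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)

def is_junk(ium: str, eng: str) -> bool:
    if "jw.org" in eng.lower() or "donate" in eng.lower():    # leaked footer/URL text
        return True
    if len(ium.split()) < 2 and len(eng.split()) > 15:        # OCR fragment vs full sentence
        return True
    return False

def drop_test_overlap(train_pairs, test_pairs):
    # Cross-split contamination: remove train rows that also appear in test.
    seen = {(i.strip().lower(), e.strip().lower()) for i, e in test_pairs}
    return [(i, e) for i, e in train_pairs
            if (i.strip().lower(), e.strip().lower()) not in seen]
```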

The dictionary table got a parallel audit. 52 unrecoverable pairs were deleted — entries where both sides were English fragments, or where the Iu Mien column held "[Contraction" debug brackets from a PDF parser. 23 pairs were cosmetically cleaned (leading numbering and trailing semicolons stripped) rather than deleted. The dictionary now sits at 4,672 entries across learnmien, iumienliteracy, wiktionary, purnell-2007, and glosbe.

Net effect: fewer junk rows for the model to memorize, and a dictionary UI that no longer shows "are either not" → "people whose names" as a real lookup result.

§ 04

Architecture and training.

Same base as v6: Meta's NLLB-200-distilled-1.3B, with the tokenizer extended to recognize ium_Latn as a target language. The fine-tune runs LoRA at rank 32 on the q/k/v/out projections and the feed-forward layers, trained in bf16 (no quantization) on a single NVIDIA A30 in W&M's SciClone astral subcluster.
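A sketch of what that setup looks like with transformers and peft. The target module names follow NLLB's (M2M100) architecture; lora_alpha, dropout, and the new-token handling are assumptions, not confirmed hyperparameters:

```python
# Sketch of the v7 setup with transformers + peft; hyperparameters beyond the
# rank-32 LoRA are illustrative.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "facebook/nllb-200-distilled-1.3B"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSeq2SeqLM.from_pretrained(base, torch_dtype=torch.bfloat16)

# Register ium_Latn as an extra language token and grow the embedding table.
# (In practice the new token's embedding also has to be trainable, e.g. via
# modules_to_save; omitted here for brevity.)
tokenizer.add_special_tokens({"additional_special_tokens": ["ium_Latn"]})
model.resize_token_embeddings(len(tokenizer))

lora = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2"],
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```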

Training ran for 22,000 steps at an effective batch size of 32, with a cosine LR schedule from 1.5e-4 and 2% warmup. The best evaluation loss, 0.9758, was reached at step 22,000 of the 22,424-step schedule; early stopping caught the plateau just before completion.
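Expressed as Hugging Face training arguments, that schedule looks roughly like the sketch below; the per-device batch / gradient-accumulation split and the evaluation cadence are assumptions:

```python
from transformers import Seq2SeqTrainingArguments

# Sketch matching the quoted schedule. Early stopping would be wired up
# separately via an EarlyStoppingCallback on the Trainer.
args = Seq2SeqTrainingArguments(
    output_dir="nllb-ium-v7",           # placeholder path
    max_steps=22_424,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,      # 8 x 4 = effective batch size 32
    learning_rate=1.5e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.02,
    bf16=True,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
```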

The merged adapter weights are 2.6 GB on disk, deployed as a static checkpoint behind a FastAPI inference endpoint. Inference latency ranges from 80 to 500 ms per sentence depending on length, on a single GPU on the production server.
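A minimal sketch of such an endpoint; the checkpoint path, route name, and request shape are placeholders, not the production API's actual contract:

```python
# Minimal sketch of a FastAPI inference endpoint serving a merged checkpoint.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("./nllb-ium-v7-merged")   # placeholder path
model = AutoModelForSeq2SeqLM.from_pretrained("./nllb-ium-v7-merged").to("cuda").eval()

class TranslateRequest(BaseModel):
    text: str
    src_lang: str = "ium_Latn"
    tgt_lang: str = "eng_Latn"

@app.post("/translate")
def translate(req: TranslateRequest):
    tokenizer.src_lang = req.src_lang
    inputs = tokenizer(req.text, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(req.tgt_lang),
        max_new_tokens=256,
    )
    return {"translation": tokenizer.batch_decode(out, skip_special_tokens=True)[0]}
```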

§ 05

What's next.

v8 is mid-training right now. It drops the Thai-script and Lao-script Bible parallel rows (~110k rows of redundant content already covered by Latn pairs), bumps LoRA rank to 128, and runs 10 epochs on the focused 69k corpus. The bet: concentrating capacity on the unified Latin orthography the website actually serves should narrow the eng → ium gap.
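A sketch of how that script filter could look; the Unicode-block check is a generic approach, not the confirmed v8 preprocessing:

```python
# Sketch: keep only rows whose source side is not in Thai or Lao script.
def keep_latn_rows(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    def is_thai_or_lao(text: str) -> bool:
        return any(0x0E00 <= ord(ch) <= 0x0EFF for ch in text)  # Thai + Lao Unicode blocks
    return [(src, eng) for src, eng in pairs if not is_thai_or_lao(src)]
```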

Beyond v8, the roadmap calls for back-translation augmentation against the 16k filtered ium-only paragraphs from FineWeb-2, broader community contribution to the dictionary, and audio pairing for tone-aware learning.

If you want to help — submit dictionary entries from the /dictionary page when signed in, send recordings or scanned materials, or reach the team at [email protected]. Mienh waac belongs to everyone who speaks it.
