What's new.
Learn Iu Mien just shipped v7 of its neural translator. The model serving the site at api.learniumien.org is now this newer fine-tune, and the dictionary backing /dictionary has been audited and expanded.
Headline numbers: chrF 62 on ium → eng (up from 58 in v6) and BLEU 47 in the same direction. The model now trains on 179,386 parallel sentence pairs, 12,610 more than v6, pulled from sources we hadn't tapped yet.
The reverse direction (eng → ium) is harder for any low-resource MT model, and we know it. v7 nudges it from chrF 28 to 29. The next training run will close more of that gap.
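Both metrics are standard corpus-level scores. Here is a minimal sketch of how they can be reproduced, assuming sacrebleu and hypothetical file names (the post doesn't say which toolkit produced the numbers):

```python
# Sketch: scoring ium -> eng outputs with sacrebleu. File names are hypothetical.
from sacrebleu.metrics import BLEU, CHRF

with open("test.ium2eng.hyp", encoding="utf-8") as f:   # model's English outputs
    hyps = [line.strip() for line in f]
with open("test.eng.ref", encoding="utf-8") as f:       # English references
    refs = [line.strip() for line in f]

chrf = CHRF().corpus_score(hyps, [refs])   # character n-gram F-score
bleu = BLEU().corpus_score(hyps, [refs])   # corpus-level BLEU
print(f"chrF {chrf.score:.1f}  BLEU {bleu.score:.1f}")
```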
Where the new data came from.
Three sources got added to the corpus this round. Watchtower Online Library yielded ~700 paragraph-level Iu Mien ↔ English alignments from JW publications crawled through their wol.jw.org doc-id system. The IUMTBN 1991 New Testament — the Thailand Bible Society's Iu Mien translation — was paired against the King James USFM by book.chapter.verse keys, contributing 4,500 verse-level pairs after sanitization.
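The core move in the Bible pairing is keying both translations the same way. A minimal sketch under assumed file names, with a deliberately naive USFM reader (real USFM has footnotes, poetry markers, and inline formatting this ignores):

```python
# Sketch: reduce each Bible to a {book.chapter.verse: text} dict and pair on shared keys.
import re

def parse_usfm_to_verses(path: str) -> dict[str, str]:
    verses, book, chapter = {}, None, None
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("\\id "):
                book = line.split()[1]          # e.g. MAT
            elif line.startswith("\\c "):
                chapter = line.split()[1]       # e.g. 5
            elif line.startswith("\\v ") and book and chapter:
                num, _, text = line[3:].partition(" ")
                text = re.sub(r"\\\S+\s?", "", text).strip()   # drop inline markers
                verses[f"{book}.{chapter}.{num}"] = text
    return verses

ium = parse_usfm_to_verses("iumtbn_1991_nt.usfm")   # hypothetical file names
eng = parse_usfm_to_verses("kjv_nt.usfm")
pairs = [(ium[k], eng[k]) for k in sorted(ium.keys() & eng.keys())]
print(f"{len(pairs)} verse-level pairs")
```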
Glosbe yielded a smaller but cleaner batch: community-curated phrase pairs that fill out conversational coverage the Bible-heavy corpus misses. We hit Glosbe with seed-based query expansion to extract 250 unique pairs.
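Seed-based query expansion here just means: start from a few common words, collect phrase pairs containing them, and feed new words from the results back into the queue. A rough sketch with a hypothetical fetcher stub (Glosbe's actual pages, API, and rate limits are not modeled):

```python
# Sketch: seed-based query expansion with a placeholder fetcher.
from collections import deque

def fetch_phrase_pairs(term: str) -> list[tuple[str, str]]:
    """Return (ium, eng) phrase pairs mentioning `term` (placeholder stub)."""
    return []

seeds = deque(["mienh", "yiem", "nyei"])        # illustrative seed words
seen, pairs = set(seeds), set()

while seeds:
    term = seeds.popleft()
    for ium, eng in fetch_phrase_pairs(term):
        pairs.add((ium, eng))
        for word in ium.split():                # new words become new queries
            if word not in seen:
                seen.add(word)
                seeds.append(word)

print(f"{len(pairs)} unique pairs collected")
```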
Wiktionary entries for Iu Mien words were folded into the dictionary table after sanitization. Multi-definition entries were split ("country;nation" became two rows), concatenated forms ("tocut") were spaced ("to cut"), and lexical markers like "Antonym:hlang" were stripped.
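The split and strip rules are cheap string operations; a sketch of roughly what they look like (marker names other than "Antonym" are assumptions, and respacing concatenated forms needs a wordlist heuristic not shown here):

```python
# Sketch of the split-and-strip sanitization rules.
import re

def sanitize_definition(raw: str) -> list[str]:
    raw = re.sub(r"\b(Antonym|Synonym):\S+", "", raw)     # strip lexical markers
    return [part.strip() for part in raw.split(";") if part.strip()]

print(sanitize_definition("country;nation"))              # ['country', 'nation']
print(sanitize_definition("example gloss Antonym:hlang")) # ['example gloss']
```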
Cleanup pass.
Before training, we ran a heuristic audit on the full corpus. The audit caught 186 rows of OCR-fragment garbage from older Purnell-2007 dictionary scans, JW.org donation footer URLs that had leaked through, cross-split contamination between train and test, and zero-width characters embedded in Bible.com text.
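The audit is heuristic in the plainest sense: regexes and ratios. A sketch of the kinds of checks involved, with illustrative thresholds rather than the production rules:

```python
# Sketch of audit heuristics; regexes and the OCR-fragment threshold are illustrative.
import re

ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\ufeff]")    # zero-width chars from web text
URL = re.compile(r"https?://\S+|\bwww\.\S+")              # leaked footer URLs

def flag_row(ium: str, eng: str) -> list[str]:
    reasons = []
    if ZERO_WIDTH.search(ium) or ZERO_WIDTH.search(eng):
        reasons.append("zero-width chars")
    if URL.search(ium) or URL.search(eng):
        reasons.append("leaked URL")
    if len(ium) < 3 or sum(c.isalpha() or c.isspace() for c in ium) / len(ium) < 0.6:
        reasons.append("OCR fragment")
    return reasons

def cross_split_contamination(train, test):
    """Train rows whose source sentence also appears in the test split."""
    test_sources = {ium for ium, _ in test}
    return [(ium, eng) for ium, eng in train if ium in test_sources]
```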
The dictionary table got a parallel audit. 52 unrecoverable pairs were deleted: entries where both sides were English fragments, or where the Iu Mien column held "[Contraction" debug brackets from a PDF parser. Another 23 pairs were cosmetically cleaned (leading numbering and trailing semicolons stripped) rather than deleted. The dictionary now sits at 4,672 entries across learnmien, iumienliteracy, wiktionary, purnell-2007, and glosbe.
Net effect: fewer junk rows for the model to memorize, and a dictionary UI that no longer shows "are either not" → "people whose names" as a real lookup result.
Architecture and training.
Same base as v6: Meta's NLLB-200-distilled-1.3B, with the tokenizer extended to recognize ium_Latn as a target language. The fine-tune runs LoRA at rank 32 on the q/k/v/out projections and the feed-forward layers, trained in bf16 (no quantization) on a single NVIDIA A30 in W&M's SciClone astral subcluster.
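In Hugging Face terms, that configuration looks roughly like the sketch below. The module names follow NLLB's M2M100-style layout; the LoRA alpha/dropout values and the way the ium_Latn code is registered are assumptions, since the post only pins the rank and target layers:

```python
# Sketch of the v7 LoRA setup, assuming transformers + peft.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "facebook/nllb-200-distilled-1.3B"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.add_special_tokens({"additional_special_tokens": ["ium_Latn"]})  # new lang code

model = AutoModelForSeq2SeqLM.from_pretrained(base, torch_dtype=torch.bfloat16)
model.resize_token_embeddings(len(tokenizer))   # make room for the added token

lora = LoraConfig(
    r=32, lora_alpha=64, lora_dropout=0.05,     # alpha/dropout are assumptions
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2"],
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```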
Training ran for 22,000 steps at an effective batch size of 32, with a cosine LR schedule from 1.5e-4 and 2% warmup. The best evaluation loss, 0.9758, came at step 22,000 of a 22,424-step schedule; early stopping caught the plateau just before completion.
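In Seq2SeqTrainingArguments terms, that schedule looks roughly like this. The per-device batch size, accumulation steps, and evaluation cadence are assumptions (only their effective product and the LR schedule are stated above); early stopping would be wired in separately via transformers' EarlyStoppingCallback:

```python
# Sketch of the training schedule; several values are assumptions, noted inline.
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="nllb-ium-v7",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,          # 8 x 4 = effective batch size 32
    learning_rate=1.5e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.02,
    max_steps=22_424,
    bf16=True,
    eval_strategy="steps",
    eval_steps=500,                         # assumed evaluation cadence
    save_strategy="steps",
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
```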
The merged weights come to 2.6 GB on disk and are deployed as a static checkpoint behind a FastAPI inference endpoint. Inference latency runs 80-500 ms per sentence depending on length, on a single GPU on the production server.
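A minimal sketch of an endpoint in that shape, assuming a merged checkpoint directory and a /translate route (the production route names, request fields, and generation settings aren't documented here):

```python
# Sketch of a FastAPI translation endpoint over the merged checkpoint.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_DIR = "nllb-ium-v7-merged"          # hypothetical path to the merged checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_DIR, torch_dtype=torch.bfloat16).to("cuda")

app = FastAPI()

class TranslateRequest(BaseModel):
    text: str
    src_lang: str = "ium_Latn"
    tgt_lang: str = "eng_Latn"

@app.post("/translate")
def translate(req: TranslateRequest):
    tokenizer.src_lang = req.src_lang
    inputs = tokenizer(req.text, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(req.tgt_lang),
        max_new_tokens=256,
    )
    return {"translation": tokenizer.batch_decode(out, skip_special_tokens=True)[0]}
```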
What's next.
v8 is mid-training right now. It drops the Thai-script and Lao-script Bible parallel rows (~110k rows of redundant content already covered by the Latn pairs), bumps the LoRA rank to 128, and runs 10 epochs on the focused 69k-pair corpus. The bet: concentrating capacity on the unified Latin orthography the website actually serves should narrow the eng → ium gap.
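The script cut itself is a one-liner over Unicode blocks; a sketch with placeholder rows:

```python
# Sketch of the script filter: drop rows whose Iu Mien side uses Thai or Lao script.
def is_thai_or_lao(text: str) -> bool:
    return any(
        "\u0e00" <= ch <= "\u0e7f"       # Thai block
        or "\u0e80" <= ch <= "\u0eff"    # Lao block
        for ch in text
    )

corpus = [
    ("placeholder Iu Mien sentence in Latin script", "placeholder English sentence"),
    ("\u0e40\u0e21\u0e35\u0e48\u0e22\u0e19 ...", "..."),   # Thai-script placeholder row
]
latn_only = [(ium, eng) for ium, eng in corpus if not is_thai_or_lao(ium)]
print(f"kept {len(latn_only)} of {len(corpus)} rows")
```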
Beyond v8, the roadmap calls for back-translation augmentation against the 16k filtered ium-only paragraphs from FineWeb-2, broader community contribution to the dictionary, and audio pairing for tone-aware learning.
If you want to help: submit dictionary entries from the /dictionary page when signed in, send recordings or scanned materials, or reach the team at [email protected]. Mienh waac belongs to everyone who speaks it.
