Google Translate is a good default, but LLMs are really good at translation: they're better at understanding context and providing culturally appropriate translations.
(I live in Cambodia where they speak Khmer)
(Sorry I had to)
I actually found Facebook’s translations pretty good (better than Google Translate for anything longer than a sentence). From my understanding of Khmer, it is a bit more verbose and context-dependent, so LLMs would be a big help in understanding those nuances.
In the inverse case (LLMs generating Khmer from English), I’ve heard from locals that it sounds formal and “robotic”, which I found quite interesting.
I'm interested in finding some thorough testing of translations across different LLMs vs. translation APIs.
And the errors are really basic, like translating "shortly" as "short", which is not the same thing at all!
https://www.amnesty.org/en/latest/news/2025/02/meta-new-poli... https://www.amnesty.org/en/latest/news/2023/10/meta-failure-...
You hear that folks? Lack of localization kills.
Is it open weight? If so, why isn't there just a straight link to the models?
They say their leaderboard and evaluation datasets are freely available. Closest statement I've seen in the paper, "Our translation models are built on top of freely available models."
I'm currently concentrating on better data gathering for low-resource languages.
When you look in detail at data like Common Crawl, finepdfs, and fineweb, (1) they are really missing quality data sources that you can find if you know where to look, and (2) the sources they do have are not processed "finely" enough (e.g. finepdfs classifies each page of a PDF as having a single language, whereas many language-learning sources contain language pairs, etc.).
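To illustrate what finer-grained processing could look like for a language-pair page, here's a minimal sketch that splits mixed text into per-script segments instead of assigning one page-level label. It's a toy: the Unicode-block heuristic (Khmer is U+1780–U+17FF) only works for pairs in different scripts; same-script pairs like Jacalteco–Spanish would need a real language-ID model in place of `script_of`.

```python
# Toy sketch: segment mixed-script text (e.g. Khmer/English language pairs)
# rather than labeling the whole page with one language. The script-range
# check is an assumption standing in for a trained language-ID model.

def script_of(ch: str) -> str:
    """Classify a character by Unicode block (Khmer block: U+1780-U+17FF)."""
    cp = ord(ch)
    if 0x1780 <= cp <= 0x17FF:
        return "khmer"
    if ch.isalpha():
        return "latin"
    return "other"  # spaces, punctuation, digits

def split_by_script(text: str):
    """Return a list of (script, segment) runs, splitting on script changes."""
    segments = []
    current_script, buf = None, []
    for ch in text:
        s = script_of(ch)
        if s == "other":
            buf.append(ch)  # punctuation/whitespace stays with the current run
            continue
        if current_script is not None and s != current_script and buf:
            segments.append((current_script, "".join(buf).strip()))
            buf = []
        current_script = s
        buf.append(ch)
    if buf and current_script is not None:
        segments.append((current_script, "".join(buf).strip()))
    return segments
```

For a line like `"hello សួស្តី world"` this yields three labeled segments, which is closer to what a parallel-text corpus builder needs than a single per-page language tag.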
What languages are you prioritizing?
I'm living in Guatemala, so have been focusing on the Mayan languages here (22 languages, millions of speakers).
In one of the villages we visited, there was a language school where foreigners were learning Jacalteco. One student was from Israel: while most of the students had vocabulary lists in three columns (Jacalteco - Spanish - English), his had four, adding one more step of translation into Hebrew.
On the data side, I’ve found that the biggest bottleneck isn’t collecting text (it’s out there!) but reliable language identification. It’s often difficult to separate languages cleanly in datasets like Common Crawl, Fineweb, or others. I worked on improving this a bit for Fineweb 2 for my native language, which might inspire you [3].
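One common way to make per-line language labels more reliable is to keep a line only when the classifier's per-token votes are decisive, and hold out ambiguous or mixed lines for review. A minimal sketch, with a hypothetical toy classifier (`toy_classify`, script-based) standing in for a real LID model:

```python
from collections import Counter

def dominant_language(tokens, classify, min_fraction=0.8):
    """Return a language label only when one label wins a clear majority
    of per-token votes; return None for ambiguous / mixed-language lines."""
    votes = Counter(classify(tok) for tok in tokens)
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    if count / sum(votes.values()) >= min_fraction:
        return label
    return None

def toy_classify(token):
    """Hypothetical stand-in classifier: a real pipeline would call a
    trained language-ID model here. Uses the Khmer block U+1780-U+17FF."""
    if any(0x1780 <= ord(c) <= 0x17FF for c in token):
        return "khmer"
    return "latin"
```

The `min_fraction` threshold is the knob: raising it trades corpus size for label purity, which matters more for low-resource languages where a few percent of mislabeled lines can dominate the signal.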
Many of the challenges you mention seem to recur across regions and language families, so I’d love to connect and compare notes sometime. Feel free to reach me at omar [at] the labs site below.
2: https://huggingface.co/blog/omarkamali/gherbal-multilingual-...
I’ve also recently started in this space: building, for a client, an agent that can communicate in multiple languages.
Australian languages are definitely interesting! And I will say, from what I’ve seen, the Australian government (and other orgs) have done better than most at helping to document them (in recent years, at least).
It looks like Meta found a way forward.
Reading Meta’s abstract, it seems that they have found ways to improve the quality of the training data, and have also built new evaluation tools?
They are also saying that OMT-LLaMA does a better job at text generation than other baseline models.
Can't achieve subject-verb agreement in the 1st sentence of their English abstract.
Advances made through No Language Left Behind (NLLB) have demonstrated that high-quality machine translation (MT) scale to 200 languages.
Advances have demonstrated... The NLLB part is an adjectival(sic) phrase that modifies the noun "Advances".
Hopefully I'm not wrong...