Google Translate is a good default, but LLMs are really good at translation: they're better at understanding context and providing culturally appropriate translations.
(I live in Cambodia where they speak Khmer)
(Sorry I had to)
I actually found Facebook’s translations pretty good (better than Google Translate for anything longer than a sentence). From my understanding of Khmer, it is a bit more verbose and context-dependent, so LLMs would be a big help in understanding those nuances.
In the inverse case (LLMs generating Khmer from English), I’ve heard from locals that it sounds formal and “robotic”, which I found quite interesting.
I'm interested in finding some thorough testing of translations across different LLMs vs. translation APIs.
And the errors are really basic, like translating "shortly" as "short", which is not the same thing at all!
https://www.amnesty.org/en/latest/news/2025/02/meta-new-poli... https://www.amnesty.org/en/latest/news/2023/10/meta-failure-...
You hear that folks? Lack of localization kills.
Is it open weight? If so, why isn't there just a straight link to the models?
They say their leaderboard and evaluation datasets are freely available. Closest statement I've seen in the paper, "Our translation models are built on top of freely available models."
I'm currently concentrating on better data gathering for low-resource languages.
When you look in detail at data like Common Crawl, finepdfs, and fineweb, (1) they are really missing quality data sources that you can find if you know where to look, and (2) the sources they do have are not processed "finely" enough (e.g. finepdfs classifies each page of a PDF as having a single language, whereas many language-learning sources contain language pairs, etc.).
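To illustrate what finer-grained processing could look like for a language-pair page, here's a minimal sketch that splits mixed text into per-script segments instead of assigning one page-level label. It's a toy: the Unicode-block heuristic (Khmer is U+1780–U+17FF) only works for pairs in different scripts; same-script pairs like Jacalteco–Spanish would need a real language-ID model in place of `script_of`.

```python
# Toy sketch: segment mixed-script text (e.g. Khmer/English language pairs)
# rather than labeling the whole page with one language. The script-range
# check is an assumption standing in for a trained language-ID model.

def script_of(ch: str) -> str:
    """Classify a character by Unicode block (Khmer block: U+1780-U+17FF)."""
    cp = ord(ch)
    if 0x1780 <= cp <= 0x17FF:
        return "khmer"
    if ch.isalpha():
        return "latin"
    return "other"  # spaces, punctuation, digits

def split_by_script(text: str):
    """Return a list of (script, segment) runs, splitting on script changes."""
    segments = []
    current_script, buf = None, []
    for ch in text:
        s = script_of(ch)
        if s == "other":
            buf.append(ch)  # punctuation/whitespace stays with the current run
            continue
        if current_script is not None and s != current_script and buf:
            segments.append((current_script, "".join(buf).strip()))
            buf = []
        current_script = s
        buf.append(ch)
    if buf and current_script is not None:
        segments.append((current_script, "".join(buf).strip()))
    return segments
```

For a line like `"hello សួស្តី world"` this yields three labeled segments, which is closer to what a parallel-text corpus builder needs than a single per-page language tag.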
What languages are you prioritizing?
I'm living in Guatemala, so have been focusing on the Mayan languages here (22 languages, millions of speakers).
In one of the villages we visited, there was a language school where foreigners were learning Jacalteco. One student was from Israel: while most of the students had vocabulary lists in three columns (Jacalteco - Spanish - English), his had four, adding one more step of translation into Hebrew.
On the data side, I’ve found that the biggest bottleneck isn’t collecting text (it’s out there!) but reliable language identification. It’s often difficult to separate languages cleanly in datasets like Common Crawl, Fineweb, or others. I worked on improving this a bit for Fineweb 2 for my native language, which might inspire you [3].
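One common way to make per-line language labels more reliable is to keep a line only when the classifier's per-token votes are decisive, and hold out ambiguous or mixed lines for review. A minimal sketch, with a hypothetical toy classifier (`toy_classify`, script-based) standing in for a real LID model:

```python
from collections import Counter

def dominant_language(tokens, classify, min_fraction=0.8):
    """Return a language label only when one label wins a clear majority
    of per-token votes; return None for ambiguous / mixed-language lines."""
    votes = Counter(classify(tok) for tok in tokens)
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    if count / sum(votes.values()) >= min_fraction:
        return label
    return None

def toy_classify(token):
    """Hypothetical stand-in classifier: a real pipeline would call a
    trained language-ID model here. Uses the Khmer block U+1780-U+17FF."""
    if any(0x1780 <= ord(c) <= 0x17FF for c in token):
        return "khmer"
    return "latin"
```

The `min_fraction` threshold is the knob: raising it trades corpus size for label purity, which matters more for low-resource languages where a few percent of mislabeled lines can dominate the signal.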
Many of the challenges you mention seem to recur across regions and language families, so I’d love to connect and compare notes sometime. Feel free to reach me at omar [at] the labs site below.
2: https://huggingface.co/blog/omarkamali/gherbal-multilingual-...
I’ve also recently started in this space: building, for a client, an agent that can communicate in multiple languages.
Australian languages are definitely interesting! And I will say, from what I’ve seen, the Australian government (and other orgs) have done better than most at helping to document them (in recent years, at least).
It looks like Meta found a way forward.
Reading Meta’s abstract, it seems that they have found ways to improve the quality of the training data, and have also built new evaluation tools?
They are also saying that OMT-LLaMA does a better job at text generation than other baseline models.
Can't achieve subject-verb agreement in the 1st sentence of their English abstract.
Advances made through No Language Left Behind (NLLB) have demonstrated that high-quality machine translation (MT) scale to 200 languages.
Advances have demonstrated... The NLLB part is an adjectival(sic) phrase that modifies the noun "Advances".
Hopefully I'm not wrong...