Facebook has begun using unsupervised machine learning to translate content on its platform when it doesn’t have many examples of translations from one language to another — such as from English to Urdu.
The method was devised by Facebook AI Research (FAIR) and is being used on the platform in a collaborative effort between FAIR and the Applied Machine Learning division of the company, FAIR Paris lab director Antoine Bordes told VentureBeat in a phone interview.
The approach performs about as well as supervised models trained on 100,000 example translations from one language to another, and it outperforms supervised systems on language pairs for which Facebook has few examples.
“When you are on cases like English-Urdu, where there’s very few [translations], there we show that our system is better than the supervised system. So it’s better to train an unsupervised system than a supervised system that doesn’t have enough data,” Bordes said.
The results of the effort led by Facebook AI researchers Guillaume Lample and Marc’Aurelio Ranzato will be presented at EMNLP 2018 this fall.
Bordes was an early FAIR hire and called the research some of the best he’s ever seen. The study draws attention to translation, a crucial task for Facebook and an issue FAIR has focused on since it began in 2013, Bordes said.
“We could go now on a planet where people speak a language that nobody else speaks — okay, the aliens — and you can actually go and try to have a decent translation of what is said there,” Bordes said. “You could go to an ancient manuscript for language that has not been deciphered, and you could actually get a sense of what it does, so this is really the kind of breakthrough that this work has achieved, and I think that’s why I’m pretty excited.”
Like other FAIR projects, the AI system will be open-sourced and made available for download on GitHub. Earlier this year, Facebook open-sourced Translate, an AI system currently used to power translation on Facebook.
Systems like Translate required huge amounts of labeled data to be trained. Building a system that translates from French to English, for example, required millions of sample sentences. Because of this, translation has been difficult when Facebook doesn’t have many examples of translations from one particular language to another.
The AI system now being used in these sorts of cases combines three elements: word-for-word translation, language models, and back translation.
Word-for-word translation relies on word embeddings trained to predict a word from its context: the five words preceding and the five words following it in a sentence. This word embedding method was laid out in a paper co-authored by Lample and Ranzato last fall.
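The five-word window described above can be illustrated with a short sketch. This is only a minimal illustration of how context could be gathered for each word, assuming a whitespace-tokenized sentence; the function name `context_pairs` and the example sentence are hypothetical and are not taken from Facebook's code.

```python
def context_pairs(tokens, window=5):
    """Return (target, context) pairs, taking up to `window` words on each side."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to five preceding words
        right = tokens[i + 1:i + 1 + window]   # up to five following words
        pairs.append((target, left + right))
    return pairs


# Hypothetical example sentence, split on whitespace.
sentence = "the cat sat on the mat near the door".split()
pairs = context_pairs(sentence, window=5)
```

An embedding model would then be trained on pairs like these so that words appearing in similar contexts end up with similar vectors.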
Language models trained with large amounts of data — like books or other written text — are then used to arrange sentences in a structure that makes sense for an English speaker or Urdu speaker, for example.
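As a rough illustration of how a language model can favor a natural word order, the sketch below scores candidate orderings with bigram counts from a tiny corpus. It is a toy under stated assumptions: the corpus, the function names, and the use of raw counts are all hypothetical, and production systems use far larger models that assign probabilities rather than counts.

```python
from collections import Counter
from itertools import permutations


def train_bigram_counts(corpus):
    """Count adjacent word pairs, with sentence start and end markers."""
    counts = Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        counts.update(zip(tokens, tokens[1:]))
    return counts


def fluency(tokens, counts):
    """Sum the bigram counts of an ordering; higher means more familiar."""
    padded = ["<s>"] + list(tokens) + ["</s>"]
    return sum(counts[pair] for pair in zip(padded, padded[1:]))


def best_order(words, counts):
    """Try every ordering of a short word list and keep the most fluent one."""
    return max(permutations(words), key=lambda order: fluency(order, counts))


# Hypothetical monolingual training text.
corpus = ["the red car", "the red house", "a red car"]
counts = train_bigram_counts(corpus)
reordered = best_order(["car", "the", "red"], counts)
```

Trying every permutation only works for very short inputs; it is used here to keep the idea visible.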
Finally, back translation is used to refine the translations produced by word-for-word translation and language models. The methods aren’t new, but the combination of the three is producing results, Bordes said.
“Using these two systems [and] translating back and forth between the two languages, I can train them together to try to improve against each other, so this is really the core of the paper, using the words [translation model], using the language model to do first translation, then using the back translation idea to try to improve,” he said.
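The back-and-forth training Bordes describes can be sketched in miniature. The example below assumes toy lookup tables stand in for the two translation models; every name in it (`translate`, `back_translation_pairs`, `learn_table`, the French and English words) is hypothetical, and the position-by-position alignment is a deliberate simplification, not the method from the paper.

```python
def translate(sentence, table):
    """Word-for-word lookup; unknown words pass through unchanged."""
    return " ".join(table.get(word, word) for word in sentence.split())


def back_translation_pairs(target_sentences, target_to_source):
    """Turn monolingual target text into synthetic (source, target) pairs."""
    return [(translate(t, target_to_source), t) for t in target_sentences]


def learn_table(pairs):
    """Naive training step: align words by position (an assumption for this toy)."""
    table = {}
    for source, target in pairs:
        for s_word, t_word in zip(source.split(), target.split()):
            table[s_word] = t_word
    return table


# Hypothetical French-to-English model and French-only text.
fr_to_en = {"le": "the", "chat": "cat", "dort": "sleeps"}
monolingual_fr = ["le chat dort"]

# Translate real French into synthetic English, then train the reverse model
# on (synthetic English, real French) pairs.
synthetic = back_translation_pairs(monolingual_fr, fr_to_en)
en_to_fr = learn_table(synthetic)
round_trip = translate("the cat sleeps", en_to_fr)
```

In a full system the same step would then run in the opposite direction, with each model generating training data for the other over repeated rounds.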
Facebook will explore this AI system for other forms of translation in the future, but more data and work with specialized translators are needed to verify the results, Bordes said.