Moroccan Arabic: a major challenge for computer models and researchers
Translation software, like the kind on your phone, struggles with Darija, the Moroccan Arabic dialect. The gap between Darija and Standard Arabic is as wide as that between Limburgish and Standard Dutch. Abder Issam and colleagues have taken on the challenge of machine translation for Darija, while also looking at Limburgish.
Dialects and accents pose a problem for translation software. Whether it’s spoken Darija or written Limburgish, the results are far from perfect. The more widely a language is used, especially in writing and online, the more data is available to train computer models. That’s the crux: there’s far less data for Darija or Limburgish, making training much harder.
Translating using maths
Most translation software uses neural networks, computer models loosely inspired by the human brain. These networks consist of artificial ‘nerve cells’ (neurons) that learn to recognise patterns in data, such as texts. They adapt based on input and work together to find the best answer: the translation.
Abder Issam, a PhD candidate at the Department of Advanced Computing Sciences, explains in more detail: “A neural translation network has two main parts: an encoder and a decoder. The encoder converts the original sentence (for example, English) into a mathematical representation: a matrix of numbers. Each sentence becomes a series of figures. The decoder then finds a new matrix that matches the original, but in the target language (for example, Darija). Through mathematical calculations, a translated sentence emerges.”
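To make the encoder and decoder idea a little more concrete, here is a toy sketch in Python. It is not a neural network and not the researchers' model: the vectors are written by hand rather than learned, the target tokens (`TGT_HELLO`, `TGT_FRIEND`) are placeholders rather than real Darija, and every name in it is hypothetical. It only shows the flow Abder describes: a sentence becomes a matrix of numbers, and the output is chosen by matching against that matrix.

```python
# Toy illustration only: hand-made vectors stand in for what a trained
# network would learn from data. All names and values are hypothetical.
SOURCE_VECTORS = {
    "hello": [1.0, 0.0],
    "friend": [0.0, 1.0],
}
TARGET_VECTORS = {
    "TGT_HELLO": [0.9, 0.1],   # placeholder target-language token
    "TGT_FRIEND": [0.1, 0.9],  # placeholder target-language token
}


def encode(sentence):
    """Turn a source sentence into a matrix: one row of numbers per word."""
    return [SOURCE_VECTORS[word] for word in sentence.lower().split()]


def dot(a, b):
    """Similarity between two vectors (larger means a closer match)."""
    return sum(x * y for x, y in zip(a, b))


def decode(matrix):
    """For each row, pick the target token whose vector matches it best."""
    output = []
    for row in matrix:
        best = max(TARGET_VECTORS, key=lambda tok: dot(TARGET_VECTORS[tok], row))
        output.append(best)
    return " ".join(output)


matrix = encode("hello friend")
translation = decode(matrix)
```

A real decoder does not translate word by word like this. It generates the target sentence step by step while looking at the whole encoded matrix, which is how it can cope with different word orders and sentence lengths.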
Human effort
Just as a toddler learns to speak by listening, a neural network learns from countless examples it is given. The data used to train the network consists of pairs of sentences in both languages. In Abder’s research, these are Darija and English. Abder says: “Large language models learn from hundreds of thousands of examples. When we started, we had a dataset of 10,000 sentences in both English and Darija. Thanks to the work of many volunteers, this open-source dataset from Morocco has now grown to 45,000 sentences. Building a language model is truly a human effort.”
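As a rough picture of what such training data looks like, the sketch below stores sentence pairs and sets some aside for testing. The placeholder sentences, the `SentencePair` and `split_corpus` names, and the size of the held-out set are all assumptions for illustration. The article does not describe how the researchers actually divided their data.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """One training example: the same sentence in both languages."""
    english: str
    darija: str


def split_corpus(pairs, eval_size, seed=0):
    """Shuffle reproducibly, then hold out eval_size pairs for testing."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return shuffled[eval_size:], shuffled[:eval_size]


# Placeholder text standing in for a 45,000-pair dataset.
corpus = [
    SentencePair(f"english sentence {i}", f"darija sentence {i}")
    for i in range(45_000)
]
train, held_out = split_corpus(corpus, eval_size=1_000)
```

Keeping the held-out pairs away from training matters: a model should be judged on sentences it has never seen.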
Working with a limited dataset leads to lower-quality translations. Abder compared various translation techniques and discovered a combination that yields the best results. He validated his model using, among other texts, the New Testament in Darija. “You calculate the quality by translating 1,000–2,000 sentences and then comparing the result with the existing translation. A good model scores above 40. Our model scores nearly 27 for Darija-to-English translation and 10 for English-to-Darija.”
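The article does not name the quality score, but scores on a 0 to 100 scale like these usually come from metrics that count how many word sequences a machine translation shares with a human reference, BLEU being the best-known example. The sketch below is a deliberately simplified version of that idea (clipped n-gram precision, averaged over sentences), not full BLEU and not necessarily the researchers' metric. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All runs of n consecutive words in a tokenised sentence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap_score(candidate, reference, n=2):
    """Clipped n-gram precision on a 0-100 scale (simplified, not full BLEU)."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Count each candidate n-gram at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return 100.0 * matched / total


def corpus_score(candidates, references, n=2):
    """Average the sentence scores over a test set (a simplification)."""
    scores = [overlap_score(c, r, n) for c, r in zip(candidates, references)]
    return sum(scores) / len(scores)


example = overlap_score("the cat sat on the mat", "the cat is on the mat")
```

Here `example` is 60.0: three of the candidate's five word pairs also appear in the reference. A metric of this kind rewards matching the reference wording exactly, which is one reason a language with many spelling variants is hard to score well on.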
The gap between the two translation directions surprised the researchers. “It shows how important a large dataset is. English is a language with a lot of available data, which is why translating into English works better than the other way around.”
But there’s another reason for the difference. Darija is not a uniform language. “Even within Darija, there are dialects. In northern Morocco, Spanish influence is strong; in central Morocco, it’s French. That’s why Moroccans write the same sentence in different ways. On top of that, loanwords are sometimes written in their original form and sometimes converted into Arabic script.”
Languages of Morocco
The original language of Morocco is Tamazight (Berber). Due to Arab influence, a new language emerged: Darija, or Moroccan Arabic. Later, French, Spanish, and more recently, English also influenced the language. In Morocco, Standard Arabic is used for official matters and in the media. Almost all Moroccans also speak Darija and/or Tamazight.
More human effort
How can translation quality be improved? “The simplest way is with more data. Fortunately, there’s a growing group of people in Morocco working on this, including researchers trying to develop new, better language models. They can compare their models with ours and see if they score better. That’s also the goal of our research: to lay a foundation for others to build on.” Abder and his colleagues were among the first, if not the first, to examine language models for Darija at an academic level.
The diversity of languages is the subject of Abder’s PhD research. He looks at the difficulty of translating speech with an accent, for example from dialect speakers or from people speaking a foreign language. Think of the Limburgish accent or the accent of a Dutch person speaking English. In addition to Darija, Limburgish is part of his research. “Limburgish is also very diverse. That’s why we’re first working on a computer model that can recognise local variants of Limburgish.”
In the end, the digital future of languages like Darija and Limburgish will depend not just on algorithms, but on the people who speak them and the researchers determined to give them a voice in the digital world.
Read the scientific paper here: Low-Resource Machine Translation for Moroccan Arabic
Alexei Rosca, Abderrahmane Issam, Gerasimos Spanakis
Text: Patrick Marx