Arabic NLP Is Not a Translation Problem
The default playbook for Arabic AI is to translate to English, do the work, translate back. We have seen this fail at every serious customer. Here is why, and what works instead.

The default playbook for Arabic AI inside large companies looks like this: translate the Arabic to English, run your English NLP pipeline, translate the result back to Arabic. It is fast to scope. It is even faster to ship a demo. We have watched it fail at every serious customer.
Arabic is not a translation problem. It is a language problem.
What translation actually loses
The translate-first pipeline assumes that Arabic and English are isomorphic enough for a round-trip to preserve meaning. They are not.
A few of the things that get destroyed in the round-trip:
- Dialect signal: a translation model may handle a Khaleeji speaker and an Egyptian speaker differently, but its English output carries no trace of the dialect, so the downstream English-only pipeline cannot use that signal.
- Code-switching: real Arabic business communication mixes Arabic and English mid-sentence. The translator forces one direction or the other, losing the actual register.
- Named entities: Arabic names, places, and company forms get mangled in transliteration. "Al-Mannai" becomes "Almanai" becomes "Al Manai" depending on the model and the day.
- Right-to-left structure: dates, addresses, and numbers in mixed-direction documents come out garbled if the pipeline does not understand bidirectional text.
- Cultural and legal terms: there is no clean English for "majlis al-shura", "wakala", or several dozen contract concepts. The translator approximates and you lose the precision the user needed.
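To make the code-switching point concrete, here is a minimal Python sketch that splits a mixed message into Arabic-script and Latin-script runs so each span can be handled in its own language. The function names `script_of` and `segment_by_script` are hypothetical, and splitting on whitespace is a simplifying assumption rather than a production tokeniser.

```python
import unicodedata


def script_of(token: str) -> str:
    """Classify a token as 'arabic', 'latin', or 'other' by its first letter."""
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name.startswith("ARABIC"):
                return "arabic"
            if name.startswith("LATIN"):
                return "latin"
            return "other"
    return "other"  # digits, punctuation, empty tokens


def segment_by_script(text: str) -> list[tuple[str, str]]:
    """Group whitespace-separated tokens into consecutive same-script runs."""
    runs: list[tuple[str, str]] = []
    for token in text.split():
        script = script_of(token)
        if runs and runs[-1][0] == script:
            # Extend the current run instead of starting a new one.
            runs[-1] = (script, runs[-1][1] + " " + token)
        else:
            runs.append((script, token))
    return runs


msg = "أرسل لي ال invoice قبل الاجتماع"
print([script for script, _ in segment_by_script(msg)])
# ['arabic', 'latin', 'arabic']
```

The mixed message survives as three ordered spans, which is exactly the register information a translate-first step would flatten.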
Where it shows up in production
We saw this most clearly building LAW360, our Arabic-English legal AI. The first prototype was translate-first. The legal team rejected it within an hour. Specific clauses that the translator had paraphrased no longer matched the original Arabic source. For legal review, "close enough" is not close enough.
The version that shipped processes Arabic in Arabic. Translation happens at the edge, only when the user explicitly wants it, with the original Arabic preserved as the source of truth.
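One way to picture "translation at the edge" is a record type where the Arabic text is immutable and any translation is a derived, cached view produced only on request. This is an illustrative sketch, not a description of how LAW360 is implemented; `Clause`, `view`, and the `translator` callable are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# Hypothetical translator signature: (arabic_text, target_lang) -> translated text
Translator = Callable[[str, str], str]


@dataclass(frozen=True)
class Clause:
    clause_id: str
    source_ar: str  # original Arabic: the source of truth, never reassigned
    translations: Dict[str, str] = field(default_factory=dict)

    def view(self, lang: str = "ar", translator: Optional[Translator] = None) -> str:
        """Return the Arabic source by default; translate only when asked."""
        if lang == "ar":
            return self.source_ar
        if lang not in self.translations:
            if translator is None:
                raise ValueError(f"no translation for {lang!r} and no translator given")
            # Always translate from the Arabic source, never from another translation.
            self.translations[lang] = translator(self.source_ar, lang)
        return self.translations[lang]
```

Because the dataclass is frozen, nothing downstream can overwrite `source_ar`, and every translated view can be traced back to the clause it was derived from.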
The dialect problem
Arabic in the GCC is not one language. Modern Standard Arabic (MSA) is the formal register: news, government, contracts. Then there are the dialects: Khaleeji, Levantine, Egyptian, Maghrebi, each with sub-variants.
A model trained on MSA alone hears Khaleeji and misclassifies it as "broken Arabic." A model trained on Egyptian alone misreads Khaleeji vocabulary as nonsense. We train on regional data because regional data is what our customers speak.
Voicexa, our Arabic speech recognition system, handles Khaleeji, MSA, Egyptian, and Levantine because the customers we ship into have callers who use all of them, often in the same five-minute call.
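Because dialects can change within a single call, it helps to tag each utterance rather than the whole conversation. The toy sketch below does this with a handful of well-known marker words. It is purely illustrative: a real system would use a trained classifier, and the `DIALECT_MARKERS` table and `tag_utterance` function are hypothetical, not part of Voicexa.

```python
from collections import Counter

# Tiny hand-picked marker words for illustration only, not a real lexicon.
DIALECT_MARKERS = {
    "khaleeji": {"الحين", "شلونك", "أبغى"},
    "egyptian": {"دلوقتي", "إزيك", "عايز"},
    "levantine": {"هلق", "كيفك", "بدي"},
}


def tag_utterance(utterance: str) -> str:
    """Return the dialect whose markers appear most often, or 'unknown'."""
    tokens = utterance.split()
    scores = Counter()
    for dialect, markers in DIALECT_MARKERS.items():
        scores[dialect] = sum(1 for tok in tokens if tok in markers)
    best, hits = scores.most_common(1)[0]
    return best if hits > 0 else "unknown"


call = [
    "أبغى أحول فلوس الحين",
    "عايز أحول فلوس دلوقتي",
]
print([tag_utterance(u) for u in call])
# ['khaleeji', 'egyptian']
```

Tagging per utterance lets later stages pick the right vocabulary and reply style turn by turn instead of committing to one dialect for the whole call.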
What works instead
The Arabic-native pipeline starts with the assumption that Arabic is the source language and stays that way. Specifically:
- Tokenisation is Arabic-aware: it accounts for morphology, including attached prefixes and suffixes
- Embeddings are trained on Arabic data, not translated from English
- Retrieval indexes Arabic text directly, not its English translation
- The model writes back in Arabic when the user wrote in Arabic, in the dialect they used
- Translation is a separate, opt-in service, not a step on the critical path
Building this is more work up front than the translate-first shortcut. The payoff is a system that does not embarrass itself when a real Arabic-speaking user shows up.
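As a rough illustration of the tokenisation and retrieval items above, the sketch below normalises Arabic orthography (diacritics, tatweel, alef variants) and indexes the Arabic text directly, with no English in the loop. It is a minimal sketch under stated assumptions: the function names are hypothetical, and it covers orthographic normalisation only. Splitting attached prefixes and suffixes would need a proper morphological analyser, which is not shown.

```python
import re
from collections import defaultdict

# Tanween, harakat, shadda, sukun (U+064B..U+0652) and dagger alef (U+0670).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
_TATWEEL = "\u0640"  # elongation character, carries no meaning
# Map alef with hamza above/below and alef with madda to bare alef.
_ALEF_VARIANTS = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"})
_ARABIC_WORD = re.compile("[\u0621-\u064A]+")


def normalize_ar(text: str) -> str:
    """Strip diacritics and tatweel, and unify alef variants."""
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    return text.translate(_ALEF_VARIANTS)


def tokenize_ar(text: str) -> list[str]:
    """Return normalised Arabic-letter tokens."""
    return _ARABIC_WORD.findall(normalize_ar(text))


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Inverted index from normalised Arabic token to document ids."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokenize_ar(text):
            index[tok].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return ids of documents containing every query token."""
    tokens = tokenize_ar(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for tok in tokens[1:]:
        result &= index.get(tok, set())
    return result
```

Queries and documents pass through the same `normalize_ar` step, so a vowelled clause in a contract and an unvowelled query typed by a user resolve to the same index entries.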
When translation is fine
There are cases where translate-first is perfectly acceptable. Internal employee search across mixed Arabic/English documents. Rough summarisation. Anything where the output is consumed by a human who can sanity-check it.
Translate-first is not the enemy. Translate-first as the default is.
Written by
INFINITEWARE Engineering
We are a Bahrain-based AI company shipping sovereign, on-premise systems for government, finance, energy, and legal across the GCC since 2008. Forty-plus clients. Sixteen products in production. We write here when we have something specific worth sharing from the work.


