
Arabic NLP Is Not a Translation Problem

The default playbook for Arabic AI is to translate to English, do the work, translate back. We have seen this fail at every serious customer. Here is why, and what works instead.

INFINITEWARE Engineering · April 27, 2026 · 6 min read

The default playbook for Arabic AI inside large companies looks like this: translate the Arabic to English, run your English NLP pipeline, translate the result back to Arabic. It is fast to scope. It is even faster to ship a demo. We have watched it fail at every serious customer.

Arabic is not a translation problem. It is a language problem.

What translation actually loses

The translate-first pipeline assumes that Arabic and English are isomorphic enough for a round-trip to preserve meaning. They are not.

A few of the things that get destroyed in the round-trip:

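Dialect signal: a Khaleeji speaker and an Egyptian speaker phrase the same request differently, and translation models handle the two differently. The English output carries no trace of which dialect it came from, so the downstream English-only pipeline cannot use that signal. It was thrown away at the first step.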
Code-switching: real Arabic business communication mixes Arabic and English mid-sentence. The translator forces one direction or the other, losing the actual register.
Named entities: Arabic names, places, and company forms get mangled in transliteration. "Al-Mannai" becomes "Almanai" becomes "Al Manai" depending on the model and the day.
Right-to-left structure: dates, addresses, and numbers in mixed-direction documents come out garbled if the pipeline does not understand bidirectional text.
Cultural and legal terms: there is no clean English for "majlis al-shura", "wakala", or several dozen contract concepts. The translator approximates and you lose the precision the user needed.
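To make the code-switching point concrete, here is a minimal Python sketch that splits a sentence into script runs before any translation happens, so the mixed register stays visible to later stages. The function names `script_of` and `script_runs` are hypothetical, and classifying by Unicode character name is a simplifying assumption, not a description of any production system.

```python
import unicodedata


def script_of(ch: str) -> str:
    """Classify one character as 'arabic', 'latin', or 'other' by its Unicode name."""
    if not ch.isalpha():
        return "other"
    name = unicodedata.name(ch, "")
    if name.startswith("ARABIC"):
        return "arabic"
    if name.startswith("LATIN"):
        return "latin"
    return "other"


def script_runs(text: str) -> list[tuple[str, str]]:
    """Group whitespace-separated tokens into consecutive (script, text) runs."""
    runs: list[tuple[str, str]] = []
    for token in text.split():
        # Ignore digits and punctuation when deciding the token's script.
        scripts = {script_of(c) for c in token} - {"other"}
        if len(scripts) == 1:
            label = scripts.pop()
        elif scripts:
            label = "mixed"
        else:
            label = "other"
        if runs and runs[-1][0] == label:
            # Same script as the previous token: extend the current run.
            runs[-1] = (label, runs[-1][1] + " " + token)
        else:
            runs.append((label, token))
    return runs


# Illustrative sentence: Arabic with two English business terms embedded.
for label, run in script_runs("أرسل لي invoice قبل meeting غدا"):
    print(label, "|", run)
```

A translate-first pipeline would collapse all five runs into one English string. Keeping the runs lets each span be handled in its own language and keeps the record of where the switch happened.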

Where it shows up in production

Translate-first loses signal at every step. Arabic-native preserves it end to end.

We saw this most clearly building LAW360, our Arabic-English legal AI. The first prototype was translate-first. The legal team rejected it within an hour. Specific clauses that the translator had paraphrased no longer matched the original Arabic source. For legal review, "close enough" is not close enough.

The version that shipped processes Arabic in Arabic. Translation happens at the edge, only when the user explicitly wants it, with the original Arabic preserved as the source of truth.
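One way to picture "translation at the edge" is a record type in which the Arabic text is the only required field and translations are filled in lazily. This is a hedged sketch, not LAW360's data model: `Clause`, `find_clauses`, and the `translate` callable are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Clause:
    clause_id: str
    source_ar: str  # original Arabic: the source of truth, never overwritten
    translations: dict[str, str] = field(default_factory=dict)

    def translated(self, lang: str, translate: Callable[[str, str], str]) -> str:
        """Return a translation only on explicit request, caching the result."""
        if lang not in self.translations:
            self.translations[lang] = translate(self.source_ar, lang)
        return self.translations[lang]


def find_clauses(clauses: list[Clause], query_ar: str) -> list[Clause]:
    """Match against the Arabic source only; translations are never searched."""
    return [c for c in clauses if query_ar in c.source_ar]
```

Nothing in the search path touches `translations`, so a paraphrased translation cannot cause a clause to stop matching its own source text.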

The dialect problem

Arabic is not one language. A model trained on one register fails on the others.

Even within the GCC, Arabic is not one language. Modern Standard Arabic (MSA) is the formal register: news, government, contracts. Then there are the dialects: Khaleeji, Levantine, Egyptian, Maghrebi, each with sub-variants.

A model trained on MSA alone hears Khaleeji and misclassifies it as "broken Arabic." A model trained on Egyptian alone misreads Khaleeji vocabulary as nonsense. We train on regional data because regional data is what our customers speak.

Voicexa, our Arabic speech recognition system, handles Khaleeji, MSA, Egyptian, and Levantine because the customers we ship into have callers who use all of them, often in the same five-minute call.
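As a toy illustration of why dialect has to be tracked per utterance rather than per call, the Python sketch below tags each utterance with a dialect hint from a tiny marker lexicon. It is not how Voicexa works: a real system would use trained models. The `MARKERS` table holds a few commonly cited everyday words chosen for illustration, and real marker words overlap between dialects far more than this suggests.

```python
from collections import Counter

# Hypothetical, deliberately tiny lexicon: words for "I want", "how", "now".
MARKERS = {
    "msa": {"أريد", "الآن"},
    "khaleeji": {"أبغى", "شلون", "الحين"},
    "egyptian": {"عايز", "ازاي", "دلوقتي"},
    "levantine": {"بدي", "هلق"},
}


def dialect_hint(utterance: str) -> str:
    """Return the dialect with the most marker hits, or 'unknown' if none match."""
    counts: Counter[str] = Counter()
    for token in utterance.split():
        for dialect, words in MARKERS.items():
            if token in words:
                counts[dialect] += 1
    if not counts:
        return "unknown"
    return counts.most_common(1)[0][0]


# Two utterances from one hypothetical call, each tagged independently.
call = ["أبغى أعرف الرصيد الحين", "عايز أعرف الرصيد دلوقتي"]
print([dialect_hint(u) for u in call])
```

The point of the sketch is the shape of the output: one label per utterance, so a call that moves between dialects is represented as such instead of being forced into a single label.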

What works instead

The Arabic-native pipeline starts with the assumption that Arabic is the source language and stays that way. Specifically:

Tokenisation is Arabic-aware: morphology, prefixes, suffixes
Embeddings are trained on Arabic data, not translated from English
Retrieval indexes Arabic text directly, not its English translation
The model writes back in Arabic when the user wrote in Arabic, in the dialect they used
Translation is a separate, opt-in service, not a step on the critical path
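The first and third points can be sketched together in Python: normalise Arabic orthography, then index the Arabic text directly. This is a minimal sketch under stated assumptions. It only strips diacritics and tatweel and unifies alef variants; real morphological handling of prefixes and suffixes is a much larger job, and `normalize_ar`, `build_index`, and `search` are hypothetical names.

```python
import re
from collections import defaultdict

# Tashkeel marks (fathatan through sukun) plus the superscript alef.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # elongation character, carries no meaning
# Map alef with hamza above, hamza below, and madda to bare alef.
ALEF_VARIANTS = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"})


def normalize_ar(text: str) -> str:
    """Light orthographic normalisation for matching; not morphological analysis."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    return text.translate(ALEF_VARIANTS)


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Inverted index from normalised Arabic token to document ids."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in normalize_ar(text).split():
            index[token].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return ids of documents containing every normalised query token."""
    tokens = normalize_ar(query).split()
    if not tokens:
        return set()
    return set.intersection(*(index.get(t, set()) for t in tokens))


docs = {"d1": "عقد إيجار", "d2": "عقد بيع"}
index = build_index(docs)
print(search(index, "ايجار"))  # query typed without the hamza still matches d1
```

The documents themselves are stored and returned unmodified. Normalisation applies only to the index keys and the query, so the original Arabic remains the source of truth.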

Building this is more work up front than the translate-first shortcut. The payoff is a system that does not embarrass itself when a real Arabic-speaking user shows up.

When translation is fine

There are cases where translate-first is perfectly acceptable. Internal employee search across mixed Arabic/English documents. Rough summarisation. Anything where the output is consumed by a human who can sanity-check it.

Translate-first is not the enemy. Translate-first as the default is.


Written by

INFINITEWARE Engineering

We are a Bahrain-based AI company shipping sovereign, on-premise systems for government, finance, energy, and legal across the GCC since 2008. Forty-plus clients. Seventeen products in production. We write here when we have something specific worth sharing from the work.
