Working Paper | AI Systems

Bahraini Dialect Text-to-Speech: A Diacritization-First Approach to Front-End Design

The claim: for Gulf-dialect TTS the front-end is where the real work is, and the acoustic model is the easy part.

Authors

Ameen Altajer - Chief Executive Officer, INFINITEWARE

February 18, 2025


Abstract

This working paper argues that the dominant design constraint for Bahraini-dialect and Gulf-dialect text-to-speech (TTS) is not the acoustic model but the front-end that converts written dialect input to the phoneme sequence the acoustic model consumes. Recent open-weight neural TTS acoustic models are competitive with proprietary alternatives on English and Modern Standard Arabic. On Bahraini dialect, the same acoustic models produce degraded output because the input they receive is degraded. The degradation is upstream of the acoustic model, at the diacritization and grapheme-to-phoneme (G2P) layer. We describe the specific ways in which Arabic dialect breaks conventional diacritization and G2P, propose a diacritization-first front-end design, and outline the coupling this creates between corpus construction, front-end training, and acoustic model fine-tuning. The paper reports early results from the INFINITEWARE R&D programme on Bahraini TTS.

1. Introduction

The recent generation of open-weight neural TTS acoustic models has narrowed the quality gap between open and proprietary speech synthesis for major languages. Systems trained on English or Modern Standard Arabic corpora now produce speech that is close to indistinguishable from human speech under standard evaluation conditions. The intuition that follows from this progress is that Gulf-dialect TTS is now a matter of collecting a dialect corpus and fine-tuning the acoustic model against it. In our experience building Bahraini-dialect TTS as part of the INFINITEWARE R&D programme, this intuition is wrong.

The remaining gap for Gulf-dialect TTS is not at the acoustic model. It is at the front-end, the stage that converts input text into the phoneme sequence the acoustic model consumes. Fine-tuning a strong acoustic model against a dialect corpus produces speech that is technically fluent but linguistically wrong, because the phoneme sequence the acoustic model is being asked to render does not correspond to how the dialect is actually pronounced. The fix is not more acoustic training. It is a diacritization-first front-end designed for dialect input.

This paper describes the specific ways in which Arabic dialect breaks conventional Arabic diacritization and G2P; proposes a front-end design that treats diacritization as the primary object rather than a preprocessing step; describes the coupling this design creates between corpus construction, front-end training, and acoustic model fine-tuning; and reports early results from the INFINITEWARE Bahraini TTS programme.

2. Why the front-end is the bottleneck

Written Arabic is under-specified with respect to pronunciation. Short vowels and gemination markers are typically omitted in modern written Arabic, and the reader supplies them from context. A written Arabic word may correspond to several pronunciations, and the correct pronunciation depends on the syntactic and semantic context in which the word appears. Diacritization is the process of restoring the missing short vowels and other pronunciation markers to disambiguate the reading. G2P is the process of converting the fully diacritized text to a phoneme sequence.
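The under-specification can be made concrete with a small sketch. It builds three fully diacritized MSA words on the consonant skeleton k-t-b, strips the diacritics, and shows that all three collapse to the same written form. The helper name `strip_diacritics` and the choice of example words are ours, for illustration only; the letters are spelled as Unicode code points so the example does not depend on how combining marks survive copy and paste.

```python
# Arabic letters and marks, given as code points.
KAF, TA, BA = "\u0643", "\u062A", "\u0628"
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"

# Tanween, short vowels, shadda (gemination), and sukun: U+064B through U+0652.
DIACRITICS = frozenset(chr(cp) for cp in range(0x064B, 0x0653))


def strip_diacritics(text: str) -> str:
    """Return the text as it is usually written, without pronunciation marks."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


# Three diacritized readings that share one consonant skeleton.
readings = {
    KAF + FATHA + TA + FATHA + BA + FATHA: "kataba, 'he wrote'",
    KAF + DAMMA + TA + KASRA + BA + FATHA: "kutiba, 'it was written'",
    KAF + DAMMA + TA + DAMMA + BA: "kutub, 'books'",
}

# Stripping the marks leaves a single written form for all three readings.
written_forms = {strip_diacritics(word) for word in readings}
assert written_forms == {KAF + TA + BA}
```

A diacritizer has to invert this many-to-one mapping from context, which is why the choice of training register (MSA or dialect) determines which reading it restores.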

For Modern Standard Arabic, diacritization is a well-studied problem with usable tools. The tools are trained on MSA corpora and produce MSA diacritization: they insert the vowels and markers that a formal-register reader would supply. For dialect input, these tools produce diacritizations that are formally correct as MSA and wrong as dialect. The phoneme sequence downstream of an MSA diacritization of dialect text corresponds to how the sentence would be pronounced if it were read aloud as MSA, which is not how any Bahraini speaker would pronounce it.

The failure mode is subtle enough that it is often mistaken for an acoustic-model shortcoming. The synthesised speech is fluent, and the fluency is convincing, but it is not dialect. Physicians who hear a Bahraini-dialect TTS system built by fine-tuning the acoustic model against dialect audio, without a dialect-aware front-end, describe the output as MSA read with a Bahraini accent rather than as Bahraini dialect. The distinction matters: in many practical applications it is the difference between a system that is useful and one that is politely ignored.

Figure: The neural TTS pipeline with the front-end responsibilities highlighted. Diacritization and G2P jointly determine the phoneme sequence the acoustic model renders.

3. A diacritization-first front-end

The design we propose treats diacritization as the primary object of the front-end rather than as a preprocessing step whose output flows into G2P.

Concretely: the front-end is trained on a dialect corpus in which each written form is annotated with the dialect diacritization that corresponds to how the form is pronounced in the target register, not the MSA diacritization. The G2P step is trained on the same dialect corpus, and its target phoneme inventory includes phonemes that appear in Bahraini dialect but not in MSA. The two steps are trained jointly rather than sequentially, because the correct diacritization depends on the intended pronunciation and the phoneme inventory it will be rendered against.
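One way to picture the joint target is as a single label that carries both the diacritization and the phoneme sequence for a written form. The sketch below is a toy annotation table, not the programme's data format: `JointTarget`, `ANNOTATIONS`, and the register keys are hypothetical names. The dialect entry uses a commonly cited Gulf Arabic pattern (no case ending, kaf realised as an affricate) purely as an illustration; it is not taken from the paper's corpus.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

KAF, LAM, BA = "\u0643", "\u0644", "\u0628"
FATHA, SUKUN, DAMMATAN = "\u064E", "\u0652", "\u064C"


@dataclass(frozen=True)
class JointTarget:
    """One training label: diacritization and phonemes are predicted together."""
    diacritized: str
    phonemes: Tuple[str, ...]


WRITTEN = KAF + LAM + BA  # undiacritized skeleton k-l-b, 'dog'

# Hypothetical toy annotations keyed by (written form, register).
ANNOTATIONS: Dict[Tuple[str, str], JointTarget] = {
    # MSA reading with a nominative case ending: kalbun.
    (WRITTEN, "msa"): JointTarget(
        KAF + FATHA + LAM + SUKUN + BA + DAMMATAN,
        ("k", "a", "l", "b", "u", "n"),
    ),
    # Illustrative Gulf-style reading: no case ending, kaf realised as "tʃ".
    (WRITTEN, "dialect"): JointTarget(
        KAF + FATHA + LAM + SUKUN + BA,
        ("tʃ", "a", "l", "b"),
    ),
}


def joint_target(written: str, register: str) -> JointTarget:
    """Look up the paired diacritization and phoneme label for a written form."""
    return ANNOTATIONS[(written, register)]


msa = joint_target(WRITTEN, "msa")
dialect = joint_target(WRITTEN, "dialect")
# Same written input, different diacritization and different phonemes.
assert msa.diacritized != dialect.diacritized
assert "tʃ" in dialect.phonemes and "tʃ" not in msa.phonemes
```

In this toy entry the two registers differ both in the marks that are restored and in a phoneme outside the MSA set, so neither label can be fixed without knowing the other. That interdependence is the motivation for training the two steps jointly.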

The design has consequences upstream at the corpus construction stage and downstream at the acoustic model. Upstream, the corpus must include dialect-diacritized text and cannot be built by running an MSA diacritizer over a dialect text collection. This is the most expensive part of the programme. Downstream, the acoustic model is fine-tuned against a dialect audio corpus whose transcriptions have been processed through the dialect-aware front-end, so that the phoneme sequences the acoustic model learns to render are the ones the front-end will subsequently produce.
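The downstream requirement can be sketched as a data-preparation step: every transcript in the dialect audio corpus passes through the same front-end that will run at synthesis time, and the acoustic model is fine-tuned on the resulting phoneme sequences. The function `build_manifest`, the `FrontEnd` callable type, and the stub front-end are assumptions made for this example; a real front-end would be the trained dialect model.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

# Assumed interface: a front-end maps raw text to a phoneme sequence.
FrontEnd = Callable[[str], Sequence[str]]


def build_manifest(
    utterances: Iterable[Tuple[str, str]],
    front_end: FrontEnd,
) -> List[Tuple[str, Tuple[str, ...]]]:
    """Pair each audio path with the phonemes the front-end emits for its transcript."""
    manifest = []
    for audio_path, transcript in utterances:
        phonemes = tuple(front_end(transcript))
        if not phonemes:
            # An empty sequence would give the acoustic model nothing to align.
            raise ValueError(f"front-end produced no phonemes for {audio_path}")
        manifest.append((audio_path, phonemes))
    return manifest


def stub_front_end(text: str) -> List[str]:
    """Placeholder only: one symbol per non-space character."""
    return [ch for ch in text if not ch.isspace()]


manifest = build_manifest([("utt_0001.wav", "ab c")], stub_front_end)
```

Passing the front-end in as a parameter keeps training-time and synthesis-time phoneme sequences tied to one implementation, which is the consistency the paragraph above asks for.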

4. Coupling with corpus construction and acoustic training

The diacritization-first design couples the three stages of a TTS programme in a way that a naive acoustic-first design does not. This coupling has practical consequences for how a Gulf-dialect TTS programme should be sequenced.

Corpus construction must be planned around dialect diacritization from the outset. Retrofitting dialect diacritizations onto a corpus assembled for MSA-first training is slower than assembling a dialect-first corpus from the start.
The front-end and the acoustic model share a phoneme inventory. Introducing new phonemes into the front-end without extending the acoustic model's inventory produces phoneme sequences the acoustic model cannot render. Extending the acoustic model's inventory without updating the front-end produces phoneme sequences the front-end will never request.
Evaluation must include a dialect-fidelity metric, not only an audio-quality metric. Systems that score well on standard mean opinion score (MOS) evaluations can nevertheless produce output that native speakers reject as not-dialect. A separate dialect-fidelity evaluation, ideally with native-speaker annotators, is necessary.
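The shared-inventory constraint in the list above lends itself to a mechanical check. The sketch below compares the symbols the front-end can emit with the symbols the acoustic model can render and reports mismatches in both directions. `compare_inventories`, `InventoryReport`, and the toy symbol sets are hypothetical; real inventories would come from the front-end's output vocabulary and the acoustic model's symbol table.

```python
from typing import FrozenSet, Iterable, NamedTuple


class InventoryReport(NamedTuple):
    unrenderable: FrozenSet[str]  # emitted by the front-end, unknown to the acoustic model
    unrequested: FrozenSet[str]   # known to the acoustic model, never emitted by the front-end


def compare_inventories(
    front_end_symbols: Iterable[str],
    acoustic_symbols: Iterable[str],
) -> InventoryReport:
    """Report phoneme symbols that exist on only one side of the interface."""
    front_end = frozenset(front_end_symbols)
    acoustic = frozenset(acoustic_symbols)
    return InventoryReport(
        unrenderable=front_end - acoustic,
        unrequested=acoustic - front_end,
    )


# Toy inventories: the front-end gained "tʃ", the acoustic model gained "x_unused".
shared = {"a", "b", "k", "l"}
report = compare_inventories(shared | {"tʃ"}, shared | {"x_unused"})
```

Both fields are empty when the two stages agree. Running a check of this kind whenever either inventory changes would surface the two failure cases described above before any audio is synthesised.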

5. Early results

The INFINITEWARE Bahraini TTS programme is in an early stage. We report preliminary observations rather than benchmark numbers, because the evaluation methodology (particularly the dialect-fidelity component) is not yet standardised and premature quantitative claims would mislead.

The dialect-aware front-end produces phoneme sequences that native Bahraini annotators judge to be closer to how the input would be spoken by a Bahraini speaker than the sequences produced by an MSA-diacritizer front-end. The acoustic model fine-tuned against front-end-consistent phoneme sequences produces speech that native speakers accept as Bahraini dialect, whereas the same acoustic model fine-tuned against MSA-diacritized inputs produces speech that native speakers describe as MSA with a Bahraini accent. These are qualitative results; the quantitative evaluation is in preparation and will be reported in a subsequent paper.

6. Limitations and open questions

The results reported here are preliminary. The corpus behind the programme is small by the standards of production TTS training and is not yet released. The evaluation methodology, particularly the dialect-fidelity component, is not standardised and requires further work. The generalisation from Bahraini to other Gulf dialects is not automatic; the design principle transfers, but the corpus, the front-end training, and the phoneme inventory all need to be redone per dialect.

We flag one broader open question. The diacritization-first design places substantial weight on the availability of dialect-diacritized text. Producing that text at scale requires expert annotation, and the pool of qualified annotators for any given Gulf dialect is small. The programme's throughput is currently bounded by annotation capacity rather than by model training capacity, and we do not have a general solution to this constraint.

7. Conclusion

Gulf-dialect TTS is not primarily an acoustic-model problem. It is a front-end problem. The remaining quality gap between the best open TTS systems and native Bahraini dialect speech sits at the diacritization and G2P layer, and closing that gap requires a front-end designed around dialect diacritization from the outset. The design we propose treats diacritization as the primary object of the front-end, trains it jointly with G2P against a dialect-diacritized corpus, and couples the front-end tightly with the acoustic model through a shared phoneme inventory. Early qualitative results from the INFINITEWARE Bahraini TTS programme support the design. The next paper in this line will report quantitative results and describe the dialect-fidelity evaluation methodology in detail.

Keywords

Text-to-Speech, Arabic Speech Synthesis, Bahraini Dialect, Diacritization, Grapheme-to-Phoneme, Neural TTS, Voicexa
