Sovereign LLMs in Production: What Actually Runs On-Premise
Every executive who hears the word AI eventually asks for it on-premise. Few have priced what that means at 13B versus 70B parameters. Here is the engineering reality of running language models inside a customer's own infrastructure.

Every executive who hears the word AI eventually says the same sentence: we want it on-premise. Almost nobody who says it has priced what that means.
At INFINITEWARE we ship sovereign, on-premise LLMs into government, banking, oil and gas, and legal. The work is not glamorous. Most of it is hardware sizing, quantization, network isolation, and tolerating the gap between what the executive asks for and what the budget supports. Here is what "sovereign AI" actually looks like when the rack is in your data centre.
The first conversation: model size versus hardware
The single most expensive misconception in sovereign AI is that you will run a 70B-parameter model at full precision on a single GPU. You will not. At FP16 the weights alone take roughly 140GB (70 billion parameters at 2 bytes each), and a single H100 has 80GB.
Rough orders of magnitude we use in scoping calls:
- 7B at FP16: one consumer GPU (24GB) does it, slowly
- 13B at FP16: needs 32 to 40GB VRAM, so a single 48GB A6000, or an 80GB H100 with room to spare
- 70B at FP16: 2 to 4 H100s minimum, more for any real concurrency
- 70B at Q4: comfortably on one H100 80GB with concurrency
- 405B and above: serious multi-node infrastructure, almost never the right starting point
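The figures above follow from simple arithmetic on parameter count and bytes per parameter. A minimal sketch, assuming a flat 20% overhead for KV cache and activations (the `overhead` value and the function names are our illustration, not a measured figure; real overhead grows with context length and concurrency):

```python
import math

# Bytes per parameter for each precision. Q8 and Q4 ignore the small
# per-block scale overhead that real quantization formats add.
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def min_gpus(params_billion: float, precision: str,
             gpu_memory_gb: float = 80.0, overhead: float = 1.2) -> int:
    """Smallest GPU count whose combined memory holds weights plus overhead.

    `overhead` is an assumed multiplier, not a measurement.
    """
    needed = weight_memory_gb(params_billion, precision) * overhead
    return max(1, math.ceil(needed / gpu_memory_gb))
```

Under these assumptions, `min_gpus(70, "fp16")` gives 3 and `min_gpus(70, "q4")` gives 1 on 80GB cards, which matches the ranges in the list. Treat it as a first pass for a scoping call, not a capacity plan.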
In practice we deploy 7B to 13B for almost every customer's first sovereign system. Quality is more than acceptable once we fine-tune on the customer's domain. The conversation we end up having is not "how do we get GPT-4 on premise." It is "how do we get an excellent 13B fine-tuned model on premise this quarter."
Quantization is the real lever
Most of the perceived gap between cloud-hosted frontier models and on-premise models closes if you accept Q4 or Q8 quantization. The accuracy delta is small. The hardware delta is enormous.
At Q4, a 70B model fits on one H100 80GB with concurrency. At FP16 the same model does not fit on that card at all; the weights alone come to roughly 140GB. Most production teams who reject "small models" are actually rejecting under-quantized large models. The win is in the quant.
We default to Q4 or Q8 for anything that does not require numerical precision (legal classification, customer support, document understanding). We default to FP16 only when we have direct evidence that the use case is hurt by quantization, which is rarer than people assume.
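To make the trade-off concrete, here is a toy round trip through symmetric absmax quantization in pure Python. This is a simplified illustration of the general idea, not the format any particular runtime uses; production formats differ in block size and in how scales are stored, and the function names here are hypothetical.

```python
def quantize_block(weights, bits=4):
    """Symmetric absmax quantization of one block of floats to signed ints."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit, 127 for 8-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_block(q, scale):
    """Map integer levels back to approximate float weights."""
    return [v * scale for v in q]


def max_abs_error(weights, bits):
    """Largest absolute difference between original and restored weights."""
    q, scale = quantize_block(weights, bits)
    restored = dequantize_block(q, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))
```

In this scheme the per-weight error is bounded by half the scale, so going from 4 to 8 bits shrinks the worst case by roughly a factor of 18 (127 levels instead of 7). Whether that difference matters for a given task is an empirical question, which is why we look for direct evidence before moving to FP16.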
Air-gapped, DMZ, or hybrid
Three deployment topologies cover almost every sovereign engagement:
- Fully air-gapped: the model lives on a network with zero external connectivity. Updates ride in by physical media or a strictly controlled jump host. Used in defence, intelligence, and the most conservative banking deployments.
- DMZ: the model lives inside a perimeter, with controlled egress for telemetry only. The customer's own services connect inward. Used in most government and enterprise deployments.
- Hybrid: model inference lives on-premise, but a tightly scoped service (say, search over a public knowledge base) is allowed an outbound call. Common when the customer also runs a public product.
The topology is decided in the procurement conversation, not the engineering one. Treat it as a constraint, then design the rest of the architecture around it.
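One way to treat the topology as a constraint is to encode it once and have every outbound call check against it. A minimal sketch; the purpose labels (`telemetry`, `public_search`) are hypothetical, and whether a hybrid deployment also permits telemetry is a per-engagement decision this example does not settle.

```python
from enum import Enum


class Topology(Enum):
    AIR_GAPPED = "air_gapped"
    DMZ = "dmz"
    HYBRID = "hybrid"


# Outbound traffic purposes permitted per topology (illustrative labels).
ALLOWED_EGRESS = {
    Topology.AIR_GAPPED: frozenset(),                 # zero external connectivity
    Topology.DMZ: frozenset({"telemetry"}),           # controlled egress, telemetry only
    Topology.HYBRID: frozenset({"public_search"}),    # one tightly scoped outbound service
}


def egress_allowed(topology: Topology, purpose: str) -> bool:
    """Return True only if this topology explicitly permits the purpose."""
    return purpose in ALLOWED_EGRESS[topology]
```

The default is deny: anything not listed for a topology is refused, so adding a new outbound dependency forces a visible change to the table.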
Fine-tuning beats RAG more often on-premise
In a cloud deployment with frontier models, RAG is usually the right first move. Frontier models are smart enough that you can hand them context and trust them to reason over it.
On-premise, the dynamic flips. Smaller fine-tuned models often outperform larger un-tuned ones at the customer's specific task, with lower latency and lower cost. Fine-tuning becomes the lever that justifies the smaller model.
Our usual sequence: ship a small fine-tuned model first, add RAG on top for freshness and citation. Reverse it and you usually end up with a slower, more expensive system that performs worse on the customer's actual workload.
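The sequence can be sketched as a single entry point where retrieval is optional. Here `generate` stands in for the fine-tuned model and `retrieve` for the retrieval layer; both are hypothetical callables supplied by the caller, and the prompt wording is an assumption for illustration only.

```python
from typing import Callable, List, Optional, Sequence


def build_prompt(question: str, passages: Sequence[str] = ()) -> str:
    """Prompt for the fine-tuned model, with optional numbered sources."""
    if not passages:
        return f"Question: {question}\nAnswer:"
    sources = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (f"Sources:\n{sources}\n\n"
            f"Question: {question}\n"
            "Answer, citing sources by number:")


def answer(question: str,
           generate: Callable[[str], str],
           retrieve: Optional[Callable[[str], List[str]]] = None,
           top_k: int = 3) -> str:
    """Stage 1: fine-tuned model alone. Stage 2: same model plus retrieval."""
    passages = retrieve(question)[:top_k] if retrieve else []
    return generate(build_prompt(question, passages))
```

Shipping with `retrieve=None` first means the model has to stand on its fine-tuning; passing a retriever later adds freshness and citations without changing the calling code.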
What we actually deploy
In rough proportion, what an INFINITEWARE sovereign deployment looks like today:
- A 7B to 13B base model, almost always open-weight (Llama, Qwen, Mistral)
- Q4 or Q8 quantization unless there is specific reason to go FP16
- Fine-tuned on the customer's domain data (legal, financial, automotive, industrial)
- A retrieval layer over the customer's documents
- A small validator or guardrails layer for high-stakes outputs
- Logging that never leaves the customer's perimeter
- One or two GPUs, occasionally four
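The list above can be written down as a deployment record that flags departures from the usual shape. A rough sketch; the class, field names, and the `"local"` log destination label are hypothetical, and the thresholds simply restate the list.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SovereignDeployment:
    base_model: str                  # an open-weight model identifier
    params_billion: float
    precision: str = "q4"            # "q4", "q8", or "fp16"
    fine_tuned: bool = True
    retrieval: bool = True
    guardrails: bool = True
    log_destination: str = "local"   # logs stay inside the perimeter
    gpu_count: int = 1

    def deviations(self) -> List[str]:
        """List the ways this deployment departs from the usual shape."""
        notes = []
        if not 7 <= self.params_billion <= 13:
            notes.append("model size outside the usual 7B to 13B range")
        if self.precision == "fp16":
            notes.append("FP16 needs a specific, documented reason")
        if self.log_destination != "local":
            notes.append("logs must not leave the customer's perimeter")
        if self.gpu_count > 4:
            notes.append("more than four GPUs is unusual")
        return notes
```

An empty list from `deviations()` means the deployment matches the common case; anything else is a prompt for a conversation, not an automatic rejection.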
That is the boring truth of sovereign AI. The interesting work is fitting all of that to a real workload, on a customer's actual hardware, with their security people in the room.
Written by
INFINITEWARE Engineering
We are a Bahrain-based AI company shipping sovereign, on-premise systems for government, finance, energy, and legal across the GCC since 2008. Forty-plus clients. Sixteen products in production. We write here when we have something specific worth sharing from the work.


