
Sovereign LLMs in Production: What Actually Runs On-Premise

Every executive who hears the word AI eventually asks for it on-premise. Few have priced what that means at 13B versus 70B parameters. Here is the engineering reality of running language models inside a customer's own infrastructure.

INFINITEWARE Engineering · May 13, 2026 · 7 min read

Every executive who hears the word AI eventually says the same sentence: we want it on-premise. Almost nobody who says it has priced what that means.

At INFINITEWARE we ship sovereign, on-premise LLMs into government, banking, oil and gas, and legal. The work is not glamorous. Most of it is hardware sizing, quantization, network isolation, and tolerating the gap between what the executive asks for and what the budget supports. Here is what "sovereign AI" actually looks like when the rack is in your data centre.

The first conversation: model size versus hardware

The single most expensive misconception in sovereign AI is that you will run a 70B-parameter model at full precision on a single GPU. You will not. At FP16 each parameter takes two bytes, so the weights alone come to roughly 140GB, and a single H100 holds 80GB.

Rough orders of magnitude we use in scoping calls:

7B at FP16: one consumer GPU (24GB) does it, slowly
13B at FP16: needs 32 to 40GB VRAM, so a single A6000 (48GB), or an H100 with plenty of headroom
70B at FP16: 2 to 4 H100s minimum, more for any real concurrency
70B at Q4: comfortably on one H100 80GB with concurrency
405B and above: serious multi-node infrastructure, almost never the right starting point
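
The arithmetic behind these tiers can be sketched in a few lines. This is a rough estimate, not a sizing tool: the function names are hypothetical, Q4 and Q8 are treated as exactly 4 and 8 bits per parameter, and the 20% allowance for KV cache and activations is an assumption that grows with context length and concurrency.

```python
import math

# Bytes per parameter for each precision. Q4 and Q8 are treated as exactly
# 4 and 8 bits; real quantized formats carry some extra bytes for scales.
BYTES_PER_PARAM = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}


def weight_gb(params_billions: float, precision: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


def min_gpus(params_billions: float, precision: str,
             gpu_gb: float = 80.0, overhead: float = 1.2) -> int:
    """Smallest GPU count whose combined VRAM holds weights plus overhead.

    `overhead` is an assumed 20% allowance for KV cache and activations.
    """
    needed = weight_gb(params_billions, precision) * overhead
    return math.ceil(needed / gpu_gb)


if __name__ == "__main__":
    for size, prec in [(7, "FP16"), (13, "FP16"), (70, "FP16"), (70, "Q4")]:
        print(f"{size}B {prec}: ~{weight_gb(size, prec):.0f} GB weights, "
              f"{min_gpus(size, prec)} x 80GB GPU(s)")
```

Under these assumptions 70B at FP16 lands on three 80GB cards and 70B at Q4 on one, which matches the brackets above. Real concurrency targets will push the overhead figure up.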

In practice we deploy 7B to 13B for almost every customer's first sovereign system. Quality is more than acceptable once we fine-tune on the customer's domain. The conversation we end up having is not "how do we get GPT-4 on premise." It is "how do we get an excellent 13B fine-tuned model on premise this quarter."

Sizing tiers we use in scoping. Quantization shifts every row down by one bracket of cost.

Quantization is the real lever

Most of the perceived gap between cloud-hosted frontier models and on-premise models closes if you accept Q4 or Q8 quantization. The accuracy delta is small. The hardware delta is enormous.

At Q4, a 70B model fits on one H100 80GB with room for concurrency. At FP16 the same model needs roughly 140GB for weights alone, which means at least two cards before any concurrency. Most production teams who reject "small models" are actually rejecting under-quantized large models. The win is in the quant.

We default to Q4 or Q8 for anything that does not require numerical precision (legal classification, customer support, document understanding). We default to FP16 only when we have direct evidence that the use case is hurt by quantization, which is rarer than people assume.
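
To make the trade-off concrete, here is a toy illustration of what quantization does to a block of weights. It uses plain symmetric absmax rounding per block, which is an assumption chosen for clarity; production quantizers use more elaborate schemes, and the synthetic Gaussian weights stand in for a real checkpoint.

```python
import random


def quantize_block(values, bits=4):
    """Symmetric absmax quantization of one block to signed `bits`-bit ints."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit, 127 for 8-bit
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


def mean_abs_error(weights, bits, block_size=32):
    """Quantize in blocks, reconstruct, and report the mean absolute error."""
    total = 0.0
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        codes, scale = quantize_block(block, bits)
        restored = dequantize_block(codes, scale)
        total += sum(abs(a - b) for a, b in zip(block, restored))
    return total / len(weights)


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic weights
    for bits in (8, 4):
        print(f"{bits}-bit mean abs error: {mean_abs_error(weights, bits):.6f}")
```

Reconstruction error on weights is only a proxy. Whether a given use case is hurt by quantization has to be measured on that task, which is the evidence the FP16 exception above asks for.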

Air-gapped, DMZ, or hybrid


Three deployment topologies cover almost every sovereign engagement:

Fully air-gapped: the model lives on a network with zero external connectivity. Updates ride in by physical media or a strictly controlled jump host. Used in defence, intelligence, and the most conservative banking deployments.
DMZ: the model lives inside a perimeter, with controlled egress for telemetry only. The customer's own services connect inward. Used in most government and enterprise deployments.
Hybrid: model inference lives on-premise, but a tightly scoped service (say, search over a public knowledge base) is allowed an outbound call. Common when the customer also runs a public product.

The topology is decided in the procurement conversation, not the engineering one. Treat it as a constraint, then design the rest of the architecture around it.
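
One way to treat the topology as a constraint is to encode it as a default-deny egress policy that the rest of the system checks against. The sketch below is illustrative only: the class names, purpose labels, and host names are hypothetical, and the allowed sets are a reading of the three descriptions above rather than a real firewall configuration.

```python
from dataclasses import dataclass
from enum import Enum


class Topology(Enum):
    AIR_GAPPED = "air_gapped"
    DMZ = "dmz"
    HYBRID = "hybrid"


@dataclass(frozen=True)
class EgressRequest:
    purpose: str        # e.g. "telemetry", "public_search", "model_update"
    destination: str    # hypothetical host name, for illustration only


# Outbound purposes each topology permits. Assumption: hybrid allows only
# the one scoped service; a real engagement would list its own exceptions.
ALLOWED_EGRESS = {
    Topology.AIR_GAPPED: frozenset(),
    Topology.DMZ: frozenset({"telemetry"}),
    Topology.HYBRID: frozenset({"public_search"}),
}


def is_egress_allowed(topology: Topology, request: EgressRequest) -> bool:
    """Default-deny: only purposes listed for the topology may leave."""
    return request.purpose in ALLOWED_EGRESS[topology]


if __name__ == "__main__":
    req = EgressRequest(purpose="telemetry", destination="metrics.example.internal")
    for topo in Topology:
        print(topo.value, is_egress_allowed(topo, req))
```

Keeping the policy in one table makes the procurement decision visible in code: changing topology means changing one entry, not hunting for outbound calls.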

Fine-tuning beats RAG more often on-premise

In a cloud deployment with frontier models, RAG is usually the right first move. Frontier models are smart enough that you can hand them context and trust them to reason over it.

On-premise, the dynamic flips. Smaller fine-tuned models often outperform larger untuned ones at the customer's specific task, with lower latency and lower cost. Fine-tuning becomes the lever that justifies the smaller model.

Our usual sequence: ship a small fine-tuned model first, add RAG on top for freshness and citation. Reverse it and you usually end up with a slower, more expensive system that performs worse on the customer's actual workload.
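
The layering in that sequence can be sketched as follows. This is a minimal illustration under stated assumptions: `generate` stands in for whatever serves the fine-tuned model, the word-overlap retriever is a toy substitute for a real index, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str


def retrieve(query: str, corpus: List[Document], k: int = 2) -> List[Document]:
    """Toy retriever: rank by shared lowercase words. Stand-in for a real index."""
    q_words = set(query.lower().split())
    scored = [(len(q_words & set(d.text.lower().split())), d) for d in corpus]
    scored = [(s, d) for s, d in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]


def build_prompt(query: str, docs: List[Document]) -> str:
    """Attach retrieved passages with ids so the answer can cite them."""
    context = "\n".join(f"[{d.doc_id}] {d.text}" for d in docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nCite sources by id."


def answer(query: str, corpus: List[Document],
           generate: Callable[[str], str]) -> str:
    """The fine-tuned model (`generate`) works alone; retrieval adds context."""
    docs = retrieve(query, corpus)
    prompt = build_prompt(query, docs) if docs else query
    return generate(prompt)
```

Because the model is called the same way with or without retrieved context, the fine-tuned model can ship first and the retrieval layer can be added later without changing the serving path.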

What we actually deploy

Roughly, here is what a typical INFINITEWARE sovereign deployment looks like today:

A 7B to 13B base model, almost always open-weight (Llama, Qwen, Mistral)
Q4 or Q8 quantization unless there is specific reason to go FP16
Fine-tuned on the customer's domain data (legal, financial, automotive, industrial)
A retrieval layer over the customer's documents
A small validator or guardrails layer for high-stakes outputs
Logging that never leaves the customer's perimeter
One or two GPUs, occasionally four
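
The list above can be written down as a spec and checked mechanically during scoping. The sketch below is a hypothetical structure, not INFINITEWARE's actual tooling: field names, the warning wording, and the idea of recording a reason for FP16 are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

OPEN_WEIGHT_FAMILIES = {"llama", "qwen", "mistral"}
PRECISIONS = {"Q4", "Q8", "FP16"}


@dataclass
class DeploymentSpec:
    model_family: str
    params_billions: float
    precision: str
    fine_tuned: bool
    retrieval_enabled: bool
    guardrails_enabled: bool
    log_dir: str            # hypothetical local path inside the perimeter
    gpu_count: int
    fp16_justification: str = ""


def review(spec: DeploymentSpec) -> List[str]:
    """Return warnings where a spec departs from the usual shape listed above."""
    warnings = []
    if spec.model_family.lower() not in OPEN_WEIGHT_FAMILIES:
        warnings.append("model family is not one of the usual open-weight choices")
    if not 7 <= spec.params_billions <= 13:
        warnings.append("model size is outside the usual 7B to 13B range")
    if spec.precision not in PRECISIONS:
        warnings.append(f"unknown precision {spec.precision!r}")
    elif spec.precision == "FP16" and not spec.fp16_justification:
        warnings.append("FP16 chosen without a recorded reason")
    if not spec.fine_tuned:
        warnings.append("model is not fine-tuned on customer data")
    if not 1 <= spec.gpu_count <= 4:
        warnings.append("GPU count is outside the usual 1 to 4")
    return warnings
```

A spec that matches the list returns no warnings. Anything else surfaces as a question to raise in the scoping call rather than a hard failure.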

That is the boring truth of sovereign AI. The interesting work is fitting all of that to a real workload, on a customer's actual hardware, with their security people in the room.


Written by

INFINITEWARE Engineering

We are a Bahrain-based AI company shipping sovereign, on-premise systems for government, finance, energy, and legal across the GCC since 2008. Forty-plus clients. Seventeen products in production. We write here when we have something specific worth sharing from the work.
