FHE and LLM inference: the hardest open problem in private AI

Running large language models on fully encrypted data is theoretically sound and practically brutal. Here is where the research stands and what actually works today.

Running a large language model on data that never becomes plaintext to the server is one of the most compelling open problems at the intersection of cryptography and AI.

Fully homomorphic encryption makes it theoretically possible. The gap between theory and production is where things get interesting.

This is an opinionated engineering synthesis, current as of April 6, 2026, not a systematic literature review.

The core idea

FHE allows arbitrary computation on ciphertext such that Decrypt(F(Encrypt(x))) = F(x). A model server could process encrypted tokens and return an encrypted result the client decrypts, without ever touching plaintext. To see additive homomorphic encryption in action, try the Paillier Homomorphic Addition demo.
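As a minimal sketch of the additive case, the toy Paillier code below combines two ciphertexts without decrypting either one. Paillier supports only addition, so it illustrates the homomorphic property rather than FHE itself. The fixed Mersenne primes and the function names are assumptions chosen for readability; keys this small are not secure, and real use calls for a vetted library.

```python
import math
import secrets

# Toy parameters (assumption): two known Mersenne primes.
# Real keys use randomly generated primes of 1024 bits or more.
P = 2**31 - 1
Q = 2**61 - 1

def keygen(p: int = P, q: int = Q):
    """Return (public modulus n, secret key) using the common choice g = n + 1."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # inverse of lambda mod n; valid because g = n + 1
    return n, (lam, mu)

def encrypt(n: int, m: int) -> int:
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    # With g = n + 1, g^m mod n^2 simplifies to 1 + m*n.
    return ((1 + m * n) * pow(r, n, n2)) % n2

def decrypt(n: int, sk, c: int) -> int:
    lam, mu = sk
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

def add_encrypted(n: int, c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)

n, sk = keygen()
total = add_encrypted(n, encrypt(n, 20), encrypt(n, 22))
print(decrypt(n, sk, total))  # 42
```

The party calling add_encrypted needs only n and the two ciphertexts, which is the property the server side of the architecture relies on.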

For LLM inference, the architecture looks clean on paper: encrypt the prompt, run inference as an FHE circuit, return encrypted logits, decrypt locally. The model never sees your input in plaintext. Whether model weights are also protected depends on the protocol and threat model.

In practice, every layer of a transformer fights you.

Why transformers are hostile to FHE

Non-linear operations are the wall

FHE schemes like CKKS handle polynomial operations efficiently: matrix multiplies, dot products, additions. Transformers are full of operations that are not polynomial:

  • Softmax requires exp() and division.
  • GELU and ReLU activations are non-polynomial.
  • Layer normalization requires sqrt() and division.

Each of these must be replaced with high-degree polynomial approximations. The approximations introduce error, and that error accumulates across layers.
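To make the degree-versus-error trade-off concrete, here is a small sketch that fits GELU with Chebyshev interpolants of increasing degree and reports the worst-case error on a fixed interval. The interval [-4, 4], the degrees, and the helper names are assumptions for illustration; published systems choose their own ranges and approximation methods.

```python
import math

def gelu(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def cheb_coeffs(f, a: float, b: float, degree: int) -> list[float]:
    """Chebyshev interpolation coefficients of f on [a, b]."""
    n = degree + 1
    nodes = [math.cos(math.pi * (k + 0.5) / n) for k in range(n)]
    vals = [f(0.5 * (b - a) * u + 0.5 * (b + a)) for u in nodes]
    return [
        (2.0 / n) * sum(v * math.cos(math.pi * j * (k + 0.5) / n)
                        for k, v in enumerate(vals))
        for j in range(n)
    ]

def cheb_eval(coeffs: list[float], a: float, b: float, x: float) -> float:
    """Evaluate the interpolant using only additions and multiplications."""
    u = (2.0 * x - a - b) / (b - a)  # map [a, b] onto [-1, 1]
    t_prev, t_cur = 1.0, u
    total = 0.5 * coeffs[0]
    for c in coeffs[1:]:
        total += c * t_cur
        t_prev, t_cur = t_cur, 2.0 * u * t_cur - t_prev
    return total

def max_error(degree: int, a: float = -4.0, b: float = 4.0, samples: int = 801) -> float:
    coeffs = cheb_coeffs(gelu, a, b, degree)
    xs = [a + (b - a) * i / (samples - 1) for i in range(samples)]
    return max(abs(cheb_eval(coeffs, a, b, x) - gelu(x)) for x in xs)

for degree in (4, 8, 16):
    print(degree, max_error(degree))
```

Higher degree buys accuracy at the cost of more multiplications, and the fit is only meaningful inside the chosen interval: activations that drift outside it are evaluated by a polynomial that no longer tracks GELU.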

Multiplicative depth and bootstrapping

Each multiplication along a circuit's critical path consumes one level of a finite noise budget, so the quantity that matters is multiplicative depth. A deep network, say 96 transformer layers, burns through that budget fast.

Once the budget is exhausted, you need bootstrapping: a massively expensive operation that refreshes the noise budget. A single bootstrapping step can take seconds to minutes. Recent work on programmable bootstrapping, notably Zama’s TFHE-rs library, is narrowing this cost for certain operation types, but the overhead remains significant for deep circuits like full transformer stacks.
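A back-of-the-envelope sketch of how depth turns into bootstrapping count follows. Every number in it is a hypothetical placeholder (the per-layer linear depth, the polynomial degrees, the usable levels between refreshes), and it counts refreshes along a single sequential path only; real circuits refresh many ciphertexts.

```python
import math

def poly_depth(degree: int) -> int:
    """Multiplicative depth needed to reach x**degree by repeated squaring."""
    return math.ceil(math.log2(degree)) if degree > 1 else 0

def refreshes_needed(total_depth: int, usable_levels: int) -> int:
    """Bootstraps along one sequential path, given the levels available between refreshes."""
    return max(0, math.ceil(total_depth / usable_levels) - 1)

# Hypothetical per-layer budget (assumptions, not measurements):
LINEAR_DEPTH = 6      # projections, QK^T, attention times V, MLP matmuls
SOFTMAX_DEGREE = 16   # degree of the softmax approximation
GELU_DEGREE = 16      # degree of the GELU approximation
NORM_DEGREE = 8       # degree of the layer-norm approximation

per_layer = (LINEAR_DEPTH + poly_depth(SOFTMAX_DEGREE)
             + poly_depth(GELU_DEGREE) + poly_depth(NORM_DEGREE))
total = 96 * per_layer
print(per_layer, total, refreshes_needed(total, usable_levels=10))  # 17 1632 163
```

Under these assumed figures a 96-layer stack needs on the order of a hundred or more sequential refreshes, which is why per-bootstrap cost dominates the discussion.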

The attention mechanism

The core of a transformer:

Attention(Q, K, V) = softmax(QK^T / √d) · V

The QK^T matrix multiply is fine under FHE. The softmax is a disaster. Current approaches approximate it with polynomial expansions, but approximation error accumulates catastrophically over many layers. By the time you reach the output, the signal may be buried in noise.
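One way to see what "polynomial softmax" means is to build it from additions and multiplications only. The sketch below uses two textbook numerical tricks: exp(x) approximated by (1 + x/2^r)^(2^r) through repeated squaring, and the reciprocal computed by Newton iteration. It assumes the scores are already known to lie in a bounded range at or below zero, which is itself non-trivial to guarantee under encryption. These are generic techniques with hypothetical function names, not a description of what any of the cited papers implement.

```python
import math

def exp_approx(x: float, r: int = 8) -> float:
    """(1 + x / 2**r) ** (2**r), computed with r squarings."""
    y = 1.0 + x / (1 << r)
    for _ in range(r):
        y *= y
    return y

def reciprocal_approx(a: float, y0: float, iters: int = 8) -> float:
    """Newton iteration y <- y * (2 - a*y); converges when 0 < y0 < 2/a."""
    y = y0
    for _ in range(iters):
        y = y * (2.0 - a * y)
    return y

def softmax_approx(scores: list[float]) -> list[float]:
    # Assumption: every score lies in a known range [-B, 0], so each
    # exponential is at most 1 and the sum is at most len(scores).
    exps = [exp_approx(s) for s in scores]
    total = sum(exps)
    inv = reciprocal_approx(total, y0=1.0 / len(scores))
    return [e * inv for e in exps]

def softmax_exact(scores: list[float]) -> list[float]:
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.0, -1.0, -2.0, -3.0]
approx, exact = softmax_approx(scores), softmax_exact(scores)
print(max(abs(p - q) for p, q in zip(approx, exact)))
```

Each squaring and each Newton step is another multiplication on the critical path, so tightening the approximation directly spends multiplicative depth.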

The performance gap

For practical calibration:

  • Plaintext GEMM on a modern GPU: microseconds.
  • The same GEMM under CKKS-FHE: 10,000 to 1,000,000× slower.

For a 7B parameter model, pure-FHE inference of a single token can still take hours in many published setups, and in some cases much longer. Treat these as order-of-magnitude directional ranges: exact numbers vary heavily with parameters, packing strategy, hardware, and accuracy targets.
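The arithmetic behind that range can be sketched by scaling an assumed plaintext per-token latency by the slowdown factors quoted above. The 20 ms plaintext figure is a hypothetical placeholder, not a measurement.

```python
def fhe_token_latency_s(plaintext_s: float, slowdown: float) -> float:
    """Scale a plaintext per-token latency by an FHE slowdown factor."""
    return plaintext_s * slowdown

PLAINTEXT_TOKEN_S = 0.020  # assumed plaintext latency per token, in seconds

for slowdown in (1e4, 1e5, 1e6):
    seconds = fhe_token_latency_s(PLAINTEXT_TOKEN_S, slowdown)
    print(f"{slowdown:.0e}x slowdown: {seconds:,.0f} s ({seconds / 3600:.2f} h)")
```

With this assumed baseline, the low end of the slowdown range lands in minutes per token and the high end in hours, which matches the directional claim.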

What research has actually demonstrated

Real results exist, but the picture is now more nuanced than “only tiny demos.”

  • CryptoNets (MSR, 2016): First neural network inference on encrypted data, tiny CNNs only.
  • Iron (2022): Private transformer inference on BERT-scale models using a 2PC + HE hybrid, minutes per inference. Not pure FHE.
  • THE-X (2022): Full transformer with polynomial approximations, demonstrated on BERT.
  • MPCFormer (2023): MPC-based transformer (not pure FHE), much faster but relies on a trusted third-party dealer for Beaver triple generation.
  • Bolt (2024): Optimized private transformer inference using an MPC + FHE hybrid.
  • THOR (2024): Improved HE transformer inference on BERT-base, still measured in minutes rather than interactive latency.
  • MOAI / LEAF / Tricycle (2025): Meaningful speedups for HE-based transformer inference. MOAI demonstrated extensibility to LLaMA-3-8B, though most results are still on BERT-scale models or transformer subcomponents.
  • BumbleBee and related MPC work: Pushed farther toward larger-model private inference, including LLaMA-7B-class evaluation in hybrid/two-party settings. This is important progress, but it is not the same thing as pure FHE.

GPT-4-scale pure-FHE inference still looks infeasible for production. The best recent results show real movement, but mostly in three categories: smaller transformers, faster HE evaluation of specific transformer components, and hybrid MPC/FHE systems that relax the problem.

The weight privacy question

One design choice that drastically changes feasibility: do you need to hide the model weights?

If not, that is, if the weights are public and you only need input and output confidentiality, the problem becomes much more tractable. Some recent work on private inference assumes public weights and focuses only on protecting the user’s data, which enables MPC approaches that are often dramatically faster than full FHE.

This matters for protocol design. If you are building a private inference API, deciding early whether you need full two-sided privacy or just input privacy changes the entire architecture.

What actually works (as of April 6, 2026)

Trusted execution environments

TEEs (Intel TDX, AMD SEV-SNP, ARM CCA) give you cryptographic attestation that the model runs in an isolated enclave the host OS cannot inspect. The data is decrypted inside the enclave, but the enclave boundary is hardware-enforced.

This is the only approach I consider credibly certifiable for production workloads today. Companies like Edgeless Systems and Anjuna are building toward TEE-based confidential AI.

MPC with non-colluding servers

Secure multi-party computation splits the computation across two or more servers that do not collude. The overhead can be single-digit multiples over plaintext in favorable settings, but may be much higher in others.

The trust model is different: you need to believe the servers will not cooperate against you. For regulated environments, this can be enforced contractually.
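A minimal sketch of the underlying idea, using additive secret sharing for a linear layer with public weights: each server sees only a random-looking share, computes locally, and the client recombines the results. The modulus, the integer encoding of inputs, and the function names are assumptions; real protocols also need interaction for nonlinear steps and for products of two secret values.

```python
import secrets

MODULUS = 2**61 - 1  # assumed modulus; inputs are encoded as integers mod this value

def share(x: int) -> tuple[int, int]:
    """Split x into two shares that sum to x modulo MODULUS."""
    r = secrets.randbelow(MODULUS)
    return (x - r) % MODULUS, r

def local_dot(weights: list[int], shares: list[int]) -> int:
    """What one server computes: a dot product of public weights with its shares."""
    return sum(w * s for w, s in zip(weights, shares)) % MODULUS

def reconstruct(a: int, b: int) -> int:
    return (a + b) % MODULUS

weights = [3, 1, 4]      # public model weights
inputs = [10, 20, 30]    # private client input

pairs = [share(x) for x in inputs]
shares_a = [p[0] for p in pairs]  # sent to server A
shares_b = [p[1] for p in pairs]  # sent to server B

result = reconstruct(local_dot(weights, shares_a), local_dot(weights, shares_b))
print(result)  # 170
```

Either server's shares alone are uniformly random, which is exactly why the guarantee collapses if the two servers pool what they hold.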

A practical hybrid architecture

For a deployment that works now, the pattern looks like this:

Client                    TEE Server                  Model
  │                            │                        │
  ├─ Encrypt tokens ───────────►                        │
  │                            ├─ Decrypt in enclave ───►
  │                            │                  ◄─────┤
  ◄── Encrypt result ──────────┤                        │
  │                            │                        │
  Decrypt locally

The client encrypts. The enclave decrypts, runs inference, and re-encrypts the result. The host never sees plaintext. This is deployable now and provides strong confidentiality guarantees without the performance penalty of FHE.
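The diagram leaves the attestation step implicit: before encrypting anything, the client should check that the enclave is running the code it expects. The sketch below shows only that gate, and it is deliberately simplified. The report structure, field names, and the use of SHA-256 as the measurement are assumptions; real TDX or SEV-SNP reports have vendor-specific formats and are signed by hardware-rooted keys, and that signature verification is omitted here.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass

@dataclass(frozen=True)
class AttestationReport:
    """Hypothetical, simplified report; real reports are vendor-signed structures."""
    measurement: bytes  # hash of the code and config loaded in the enclave
    nonce: bytes        # freshness value chosen by the client

def measure(image: bytes) -> bytes:
    """Stand-in for the hardware's launch measurement (assumption: SHA-256)."""
    return hashlib.sha256(image).digest()

def client_accepts(report: AttestationReport, expected: bytes, nonce: bytes) -> bool:
    """Release data only if the measurement is the pinned one and the report is fresh."""
    return (hmac.compare_digest(report.measurement, expected)
            and hmac.compare_digest(report.nonce, nonce))

audited_image = b"inference-server build A"  # placeholder for the audited build
expected = measure(audited_image)            # value the client pins in advance

nonce = secrets.token_bytes(16)
good = AttestationReport(measure(audited_image), nonce)
tampered = AttestationReport(measure(b"inference-server build B"), nonce)

print(client_accepts(good, expected, nonce))      # True
print(client_accepts(tampered, expected, nonce))  # False
```

Only after this check passes would the client establish a session key with the enclave and send the encrypted tokens shown in the diagram.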

Feasibility summary

Approach                | Input privacy | Feasibility today                        | Latency
Pure FHE LLM            | Full          | Research only                            | Minutes to hours per token
TEE (TDX / SEV-SNP)     | Strong        | Deployable now                           | Normal
MPC (two-server)        | Strong        | Limited but real deployments             | Single-digit× to much higher
FHE with public weights | Input only    | Small-transformer research               | Minutes per token
FHE at GPT-4 scale      | Full          | No clear path to production this decade  | Days per token

The long-term trajectory

Hardware accelerators built specifically for FHE, from companies like Cornami, are trying to close the performance gap. On the software side, Duality Technologies (creators of OpenFHE) continues to push the library ecosystem forward and has contributed to the DARPA DPRIVE hardware research program. If FHE-specific silicon can deliver even a 1,000× speedup, that changes the calculus for smaller models.

The academic community is making real progress. But for a production use case today, especially anything that needs to be certifiable or auditable, TEE-based confidential computing is the only credible path.

FHE-based LLM inference is the right long-term direction. It is also, honestly, one of the hardest engineering problems in applied cryptography. The gap between “theoretically sound” and “practically deployable” is measured in orders of magnitude, and closing it will take both algorithmic breakthroughs and purpose-built hardware.

The vision is correct. The timeline is long.

For background on what FHE enables and where it is practical today outside the LLM context, see Fully Homomorphic Encryption: Compute on Encrypted Data.