
The First Trillion-Parameter Model That Doesn't Need NVIDIA

AI · April 17, 2026 · 14 min read

AWZ Team, AI Engineering

DeepSeek V4 is targeting the last two weeks of April 2026 for launch. It will have roughly 1 trillion total parameters, a 1 million token context window, native multimodal support across text, images, video, and audio, and it will run on Huawei's Ascend 950PR processor. Not NVIDIA. Not AMD. Huawei.

Reuters confirmed the chip detail on April 3. According to The Information, DeepSeek deliberately gave Chinese chipmakers early optimization access while denying NVIDIA and AMD that window. That decision turns V4 from a model launch into a geopolitical statement about the future of AI compute infrastructure.

The model has already been delayed twice. The original target was February, then it slipped to mid-March. But V4-Lite, a smaller ~200B parameter variant, appeared on DeepSeek's API nodes in early March, which usually signals the full model is close.

What 1 Trillion Parameters Actually Means

V4 uses a Mixture-of-Experts (MoE) architecture. The full model contains roughly 1 trillion parameters, but only 32 to 37 billion activate for any given token. This is the same approach DeepSeek used with V3, which had 671 billion total parameters and roughly 37 billion active.

The practical effect: inference costs stay roughly flat even as the model gets dramatically larger. You get the capability of a trillion-parameter model while only paying for the compute of a 37-billion-parameter one. That is why DeepSeek's pricing has been consistently aggressive. They are not subsidizing cheap prices. The architecture is genuinely cheaper to run.
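The arithmetic behind that claim can be sketched in a few lines. This is a simplified, compute-bound cost model; the ~2 FLOPs-per-parameter-per-token rule of thumb is a common estimate, not a DeepSeek-published figure:

```python
# Back-of-envelope: why MoE inference cost tracks active (not total) parameters.
# Assumes decoding is compute-bound at ~2 FLOPs per active parameter per token.

def flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs per generated token."""
    return 2 * active_params

dense_1t = flops_per_token(1e12)   # hypothetical dense 1T-parameter model
moe_v4 = flops_per_token(37e9)     # V4: ~37B active out of ~1T total

# A 1T-total MoE costs roughly the same per token as a 37B dense model:
print(f"dense 1T vs 37B-active MoE: {dense_1t / moe_v4:.0f}x more compute per token")
```

Under these assumptions a dense trillion-parameter model would burn roughly 27x the compute per token, which is the gap the MoE routing avoids.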

For comparison, nobody outside OpenAI and Anthropic knows exactly how many parameters GPT-5.4 or Claude Opus 4.5 use. But their API pricing tells the story:

Model             Input (per 1M tokens)   Output (per 1M tokens)
DeepSeek V4       $0.14-$0.30             $0.28-$0.50
GPT-5.4           $1.75-$15.00            $5.00-$60.00
Claude Opus 4.5   $5.00                   $25.00

That is not a marginal difference. DeepSeek V4 is 10 to 50 times cheaper depending on the task.

Three Architectural Breakthroughs

The raw parameter count matters less than three specific engineering decisions that make V4 possible at this scale.

Engram Conditional Memory

The hardest problem with million-token context windows is retrieval accuracy. Stuffing a million tokens into the context is useless if the model can't find the right needle in that haystack.

DeepSeek's answer is Engram, a conditional memory system published in January 2026 that separates static facts (API signatures, database schemas, configuration patterns) from dynamic reasoning. On the Needle-in-a-Haystack benchmark at 1 million tokens, Engram pushes accuracy from 84.2% to 97%.

That is the difference between a model that technically supports long context and one that actually uses it reliably.
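Engram's internals are not public, but the routing idea (answer static-fact lookups from an exact store, reserve the expensive attention path for dynamic reasoning) can be illustrated with a toy sketch. All names and the dictionary-backed store below are our own invention:

```python
# Toy illustration of conditional memory routing. This is NOT DeepSeek's
# implementation; it only shows the static/dynamic split described above.

static_memory = {
    "db.users.schema": "id INT, email TEXT, created_at TIMESTAMP",
    "api.create_user": "POST /users {email: str}",
}

def run_attention_over_context(query: str) -> str:
    """Placeholder for the expensive long-context reasoning path."""
    return f"<reason about: {query}>"

def answer(query: str) -> str:
    # Conditional routing: exact-match memory first, attention as fallback.
    if query in static_memory:
        return static_memory[query]           # cheap, reliable retrieval
    return run_attention_over_context(query)  # dynamic reasoning

print(answer("db.users.schema"))
```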

Manifold-Constrained Hyper-Connections

Training instability is a real problem at trillion-parameter scale. Gradients explode. Loss curves spike. Entire training runs fail. DeepSeek's solution is a mathematical framework called mHC (Manifold-Constrained Hyper-Connections) that constrains signal amplification to under 2x. For context, unconstrained amplification at this scale can hit 3000x.

The overhead is 6.7% additional compute. That is a small price for stable training of a model this size.
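DeepSeek has not published mHC's manifold construction, but the core "cap the gain" idea can be shown with a toy residual layer. The norm-based rescaling rule below is our own illustration, not the actual mHC math:

```python
import numpy as np

# Toy sketch: constrain how much a residual connection can amplify its input.
# mHC's real constraint operates on the hyper-connection weights themselves;
# this only demonstrates the effect of bounding per-layer gain.

def constrained_residual(x: np.ndarray, branch_out: np.ndarray,
                         max_gain: float = 2.0) -> np.ndarray:
    y = x + branch_out
    gain = np.linalg.norm(y) / np.linalg.norm(x)
    if gain > max_gain:
        # Rescale so the layer's amplification is exactly max_gain.
        y = y * (max_gain / gain)
    return y

x = np.ones(8)
big_branch = 100.0 * np.ones(8)        # would amplify ~101x unconstrained
y = constrained_residual(x, big_branch)
print(np.linalg.norm(y) / np.linalg.norm(x))  # ~2.0, capped
```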

Sparse Attention with Lightning Indexer

Standard dense attention across 1 million tokens would make inference prohibitively slow. V4 uses a Sparse Attention mechanism with what DeepSeek calls a "Lightning Indexer" that scans the full context to find relevant excerpts first, then focuses attention only on those specific tokens.

The result is roughly 50% lower computational overhead for long-context scenarios compared to dense attention. This is how a model with a 1 million token window can run at practical speeds on inference hardware that is not top-of-the-line NVIDIA.
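The two-stage pattern (a cheap relevance scan over everything, then full attention over the survivors) can be sketched as a generic top-k sparse attention toy. This is not DeepSeek's actual Lightning Indexer, whose design is unpublished:

```python
import numpy as np

# Generic two-stage sparse attention: a lightweight indexer scores every
# position, then softmax attention runs only over the top-k tokens.

def sparse_attention(q: np.ndarray, K: np.ndarray, V: np.ndarray,
                     k: int = 32) -> np.ndarray:
    # Stage 1: cheap relevance scan across the whole context.
    scores = K @ q                        # one dot product per token
    top = np.argsort(scores)[-k:]         # keep the k most relevant positions
    # Stage 2: full attention restricted to the selected tokens.
    sel = K[top] @ q / np.sqrt(q.size)
    w = np.exp(sel - sel.max())
    w /= w.sum()
    return w @ V[top]

rng = np.random.default_rng(0)
n, d = 1_000, 64                          # stand-in for a million-token context
q = rng.normal(size=d)
K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d))
out = sparse_attention(q, K, V, k=32)     # attends to 32 of 1,000 tokens
print(out.shape)
```

The savings come from stage 2: the quadratic attention cost now scales with k, not with the full context length.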

The Huawei Chip Story

This is the part that matters beyond benchmarks.

US export controls have cut Chinese AI labs off from NVIDIA's most advanced GPUs. The H200, the B300, none of these ship to China legally. The chips China can legally buy, like the H20, are significantly weaker.

Huawei's Ascend 950PR is the answer. According to TrendForce, the chip delivers roughly 2.87x the compute performance of the NVIDIA H20 and integrates 112GB of Huawei's in-house HiBL memory at 1.4 TB/s bandwidth. It is manufactured by SMIC using their N+3 process, which puts it roughly at 5nm-class performance.

The chip launched in Q1 2026, and the demand surge has been immediate. Reuters reports that Alibaba, ByteDance, and Tencent have placed orders totaling hundreds of thousands of units. EE Times China reports chip prices jumped roughly 20% on the back of V4 anticipation alone.

Huawei plans to produce around 600,000 Ascend 910C chips in 2026, doubling 2025 output, with total Ascend capacity hitting 1.6 million units. The 950PR is already shipping. Next-generation Ascend 960 and 970 chips are in the pipeline, each targeting roughly 2x performance gains over their predecessors.

If DeepSeek succeeds in running both training and inference on Ascend chips within the next year or two, and stabilizes the full software stack (compilers, operators, communication libraries, distributed training, inference frameworks), the entire model development pipeline becomes independent of CUDA.

That is not a hypothetical scenario. It is a stated objective being actively engineered.

Benchmarks, With Caveats

Leaked internal benchmarks put V4 in competitive range with Western frontier models:

Benchmark            DeepSeek V4   GPT-5.4   Claude Opus 4.5   DeepSeek V3
SWE-bench Verified   >80%          ~80%      80.9%             ~49%
HumanEval            ~90%          ~92%      ~92%              ~85%
Context window       1M tokens     256K      200K              128K

These numbers are unverified by independent third parties. DeepSeek's internal benchmarks have historically been directionally accurate but somewhat optimistic. Wait for independent testing before making infrastructure decisions based on these.

The context window is the clearest advantage on paper. At 1 million tokens, V4 can hold roughly 25,000 to 30,000 lines of code or 15 to 20 full-length novels in a single prompt. GPT-5.4 maxes out at 256K. Claude Opus 4.5 at 200K. If the retrieval accuracy holds at the levels Engram promises, the practical gap is even larger than the raw numbers suggest.

Open Source, Open Weights

DeepSeek is expected to release V4's weights under the Apache 2.0 license. They did this with V3, and the community has come to expect it. Thanks to MoE efficiency and quantization (INT8/INT4), the model should be runnable locally on consumer hardware. Dual RTX 4090s or a single RTX 5090 are the commonly cited minimum specs.
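A quick back-of-envelope check on the memory side of that claim. The assumption that only the active expert set needs to be GPU-resident (with the full weights in system RAM or on NVMe) is ours, and real deployments also need headroom for KV cache and activations:

```python
# VRAM math for a ~1T-parameter MoE at INT4 quantization (4 bits per weight).

def gib(n_params: float, bits: int) -> float:
    """Memory footprint in GiB for n_params weights at the given bit width."""
    return n_params * bits / 8 / 2**30

total = gib(1e12, 4)    # all experts at INT4: lives in CPU RAM / disk
active = gib(37e9, 4)   # ~37B active parameters at INT4: must be GPU-resident

print(f"all weights at INT4: {total:.0f} GiB")
print(f"active set at INT4:  {active:.1f} GiB")  # fits in dual RTX 4090s (48 GB)
```

Under these assumptions the active set is around 17 GiB, which is why the commonly cited dual-4090 setup is at least plausible, expert-swapping overhead aside.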

We wrote about why open-source AI matters for small teams back in March. DeepSeek V4 landing open-weight at this capability level adds another data point. If you are building internal tools, running code review, or doing document analysis, and you do not want your data leaving your network, self-hosted DeepSeek V4 at $0 in API costs is a real option.

For developers already using AI coding tools, V4's positioning is interesting. Claude Code currently leads independent rankings with 80.8% on SWE-bench Verified. GPT-5.4 Codex is close behind. DeepSeek V4 claims parity, and even if the real number is a few points lower, the cost difference makes it compelling for teams running large volumes of API calls. You can pair it with open-source coding agents like OpenCode or Aider and get 90% of the capability for a fraction of the cost.

We covered how Claude Code's internals work after the source leak in April. The model layer is the most replaceable part of any AI coding stack. The tooling, context management, and prompt engineering around the model often matter more than which LLM is underneath. If V4 delivers on the benchmarks, swapping it into existing agent workflows will be straightforward.

What This Means for AI Development

The bigger picture is that frontier AI development is no longer a single-track race. For the past five years, every major model ran on NVIDIA GPUs using CUDA. Training clusters were built around A100s, then H100s, then H200s and B300s. The assumption was that access to the best NVIDIA silicon was a prerequisite for building competitive models.

V4 challenges that assumption directly. A trillion-parameter MoE model, trained and optimized for non-NVIDIA hardware, performing at frontier level. If the benchmarks hold under independent testing, the implication is clear: export controls are not stopping Chinese AI development. They are accelerating the development of an alternative hardware ecosystem.

For builders outside the US-China competition, this creates options. More hardware suppliers means more competition on price. More open-weight frontier models means more choices for self-hosting. The cost of running advanced AI drops for everyone.

We noted in our guide to agentic AI that the biggest bottleneck for agent deployment is inference cost. Agents make dozens or hundreds of LLM calls per task. At GPT-5.4 pricing, that adds up fast. At DeepSeek V4 pricing, you can run 10 to 50 times more agent calls for the same budget. That changes which workflows are economically viable.
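To make that concrete, here is the arithmetic for a hypothetical 100-call agent task. The call count and per-call token volumes are our assumptions; the prices are midpoints of the ranges listed earlier:

```python
# Cost of one agent task at different API prices (per 1M tokens).

def task_cost(in_price: float, out_price: float, calls: int = 100,
              in_tok: int = 2_000, out_tok: int = 500) -> float:
    """Dollar cost of `calls` LLM calls at the given per-1M-token prices."""
    return calls * (in_tok * in_price + out_tok * out_price) / 1_000_000

deepseek = task_cost(0.22, 0.39)    # DeepSeek V4 midpoint pricing
gpt = task_cost(8.375, 32.5)        # GPT-5.4 midpoint pricing

print(f"DeepSeek V4: ${deepseek:.4f} per task")
print(f"GPT-5.4:     ${gpt:.2f} per task")
print(f"ratio:       {gpt / deepseek:.0f}x")
```

At these midpoints the same task costs about 6 cents on V4 versus over $3 on GPT-5.4, a roughly 52x gap, which is exactly the kind of difference that decides whether a high-volume agent workflow is viable.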

The Timing Question

V4 has been delayed twice. The current "late April" window is the third target. DeepSeek has not published an official date.

The strongest signal that a launch is imminent: V4-Lite has been live on API infrastructure since early March. When DeepSeek starts stress-testing infrastructure variants, a full launch usually follows within weeks.

If you are making decisions about AI model providers for Q2 2026 projects, the smart move is to wait for the benchmarks. Not the leaked ones. The independent ones from the community, from SWE-bench runners, from the developers who will put V4 through real codebases in the first week.

The specs are ambitious. The pricing is disruptive. The hardware story is geopolitically significant. Whether V4 delivers on all three depends on what happens in the next few weeks.

At AWZ Digital, we build AI-powered products for clients across the Middle East and Southeast Asia, and model selection is one of the first architecture decisions we make. Whether it is GPT-5.4 for reasoning-heavy workflows, Claude for code generation, or a cost-optimized stack using open-weight models like DeepSeek, the choice depends on the use case, the data sensitivity, and the budget. If you are evaluating AI models for your next project, talk to us.

Tags

DeepSeek
AI Models
Huawei
Open Source AI
LLM
NVIDIA
Mixture of Experts
