
How to Deploy DeepSeek V4 Locally? Hardware Requirements & Installation Tutorial
Want to run the most powerful open-source model locally? This article details DeepSeek V4's hardware requirements (VRAM needs) and step-by-step deployment instructions, including quantized version solutions.
1. Introduction
Local LLM deployment is the ultimate romance for geeks and the best guarantee for enterprise data privacy. DeepSeek V4, as the champion of the open-source world, naturally supports local private deployment. But the 671B parameter scale is no joke. This article will tell you how big of a "fish tank" you need to fit this "giant whale" in your home computer.
2. Hardware Requirements: Can Your GPU Handle It?
DeepSeek V4 is a Mixture of Experts (MoE) model. Although only a fraction of its parameters are active per token, loading the full weights still requires massive VRAM.
Option A: Full Version (BF16 / FP16)
Suitable for research institutions and wealthy enthusiasts
- VRAM Required: ~1.3TB - 1.5TB
- Recommended Config: 16x NVIDIA A100 (80GB) or H100 cluster
- Cost: Extremely high, not suitable for individuals.
Option B: 4-bit Quantized Version (Highly Recommended)
Suitable for enthusiasts and SMEs. Quantizing the weights to 4 bits cuts the footprint to roughly a quarter of the BF16 version, which brings VRAM requirements down dramatically.
- VRAM Required: ~350GB - 400GB
- Recommended Config: 16x RTX 4090 (24GB) or 8x A100 (80GB)
- Mac Users: Mac Studio / Mac Pro with 192GB unified memory (M2/M3 Ultra) can barely run specially optimized quantized versions.
Option C: Extreme Quantization (1.58-bit / 2-bit)
For early adopters. Community quantizers (like TheBloke) may release extreme low-bit versions.
- VRAM Required: Potentially ~150GB
- Recommended Config: 2-3 machines with dual 3090/4090 for inference parallelization (vLLM / llama.cpp).
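The VRAM figures above can be sanity-checked with simple arithmetic. The sketch below counts raw weight storage only; KV cache, activations, and framework overhead come on top, which is why the quoted ranges sit somewhat above these numbers:

```python
PARAMS = 671e9  # total parameter count of DeepSeek V4

def weight_footprint_gb(bits_per_param: float) -> float:
    """Raw weight storage in decimal GB at a given quantization width."""
    return PARAMS * bits_per_param / 8 / 1e9

for label, bits in [("BF16", 16), ("4-bit", 4), ("1.58-bit", 1.58)]:
    print(f"{label:>8}: ~{weight_footprint_gb(bits):,.0f} GB")
```

This yields roughly 1,342 GB for BF16, 336 GB at 4-bit, and 133 GB at 1.58-bit, consistent with the three options above once runtime overhead is added.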
3. Installation Steps (Pre-release Version)
The following tutorial is based on Linux (Ubuntu 22.04), assuming you have NVIDIA drivers and CUDA 12.x installed.
Step 1: Prepare Python Environment
conda create -n deepseek python=3.10
conda activate deepseek
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install "vllm>=0.4.0"  # vLLM recommended for high-speed inference

Step 2: Download Model Weights
Please wait patiently for the HuggingFace repository update. Assume the repo name is deepseek-ai/deepseek-v4-instruct, with an AWQ-quantized variant at deepseek-ai/deepseek-v4-instruct-awq.
# Install git-lfs
git lfs install
# Download model (ensure 500GB+ disk space)
git clone https://huggingface.co/deepseek-ai/deepseek-v4-instruct-awq

Step 3: Start Inference Service
Use vLLM to start an OpenAI API compatible service:
python -m vllm.entrypoints.openai.api_server \
--model ./deepseek-v4-instruct-awq \
--trust-remote-code \
--tensor-parallel-size 8 \
--host 0.0.0.0 \
--port 8000  # set --tensor-parallel-size to match your GPU count

Step 4: Test the Call
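If you prefer testing from Python rather than curl, the sketch below builds the same request body. The endpoint URL and model name are assumptions; they must match your vLLM launch flags:

```python
import json

# Assumed endpoint -- change host/port to match your vLLM server
API_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(prompt: str, model: str = "deepseek-v4-instruct-awq") -> dict:
    """Build the same JSON body as the curl example."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

body = build_chat_request("Hello, DeepSeek!")
print(json.dumps(body))
# To actually send it (requires the server to be running):
#   import urllib.request
#   req = urllib.request.Request(API_URL, data=json.dumps(body).encode(),
#                                headers={"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read().decode())
```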
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-v4-instruct-awq",
"messages": [{"role": "user", "content": "Hello, DeepSeek!"}]
}'

4. Quantization Options: The Key to Lowering the Barrier
If you don't have a multi-GPU rig like that, quantization is the only way out.
DeepSeek V4 may officially provide AWQ or GPTQ format quantized weights.
Using llama.cpp is recommended as it's extremely friendly to Apple Silicon (Mac).
# Mac users with llama.cpp
./main -m deepseek-v4-q4_k_m.gguf -n 128 --n-gpu-layers 99

5. FAQ
Q: Will it crash if VRAM is insufficient?
A: Yes. OOM (Out of Memory) is common; vLLM won't even start if total VRAM is insufficient. Calculate your total VRAM strictly before launching.
Q: What if inference speed is slow?
A: In multi-GPU inference, inter-card communication (NVLink/PCIe) is the bottleneck. Use NVLink-capable hardware if possible, or go directly to server-grade equipment.
Q: Can I run it on CPU?
A: Theoretically llama.cpp supports CPU inference, but for a 671B parameter model, generating a single token can take anywhere from several seconds (weights in RAM) to minutes (weights spilling to disk) - it has no practical value.
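To see why, a back-of-envelope model helps: each generated token requires streaming the weights from memory at least once, so memory bandwidth caps throughput. The numbers below (bandwidths, bit width) are illustrative assumptions, not benchmarks:

```python
def sec_per_token(params: float, bits_per_param: float, bandwidth_gbs: float) -> float:
    """Time to stream the weights once for a single generated token."""
    bytes_per_token = params * bits_per_param / 8
    return bytes_per_token / (bandwidth_gbs * 1e9)

# Full 671B weights at 4-bit over ~80 GB/s dual-channel DDR5
print(f"RAM:  {sec_per_token(671e9, 4, 80):.1f} s/token")
# Same read from an SSD at ~1 GB/s effective, if the weights don't fit in RAM
print(f"Disk: {sec_per_token(671e9, 4, 1):.0f} s/token")
```

Even in the optimistic in-RAM case this is seconds per token; once the weights spill to disk it degrades to minutes per token.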
Note: Please refer to the official README for specific configuration parameters.