How to Run DeepSeek V4 Flash Locally on Mac: A Deep Dive into the DS4 Engine

Developer ToolsInference & PerformanceModel Architecture

DeepSeek V4 Flash is the closest open-source model to frontier-level performance right now. 284B total parameters, 13B activated (MoE), 1 million token context window. In thinking mode, its reasoning chains are far shorter than peer models — often just one-fifth the length — and the chain length scales proportionally with problem complexity, so it doesn’t overthink simple questions. Its performance on GPQA Diamond and AIME 2025 already approaches the GPT-5 family, and its prose quality in both English and Chinese holds up. If you have a Mac with 96GB or more of RAM, this is probably the best model you can run locally right now.

And then there’s the catch: not a single mainstream general-purpose inference engine on macOS can host it. llama.cpp mainline doesn’t support it — the FP4+FP8 mixed-precision weights, hash MoE routing, and compressed sparse attention of V4 Flash are architecturally too far from what traditional GGUF runners expect. An issue is open and the community is working on it, but there’s no converter yet. Ollama depends on llama.cpp under the hood, so it’s in the same boat — cloud proxy only. vLLM and SGLang have full support but are CUDA-only, which means nothing on a Mac.

Into this gap, antirez released a native inference engine purpose-built for DeepSeek V4 Flash: DwarfStar 4, or ds4. Pure C, Metal-first, one make produces five binaries. It’s not a llama.cpp fork and doesn’t depend on GGML — everything from tokenizer to HTTP server to coding agent is built from scratch, serving only this one model. antirez openly states in the README that the project was developed with heavy assistance from GPT-5.5 — “humans leading the ideas, testing, and debugging” — and you can see the AI’s fingerprints in the code: clean, well-structured, with thorough edge-case handling.

How to Use It: Start the Server, Connect Your Existing Agent

./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192

It supports three API protocols simultaneously — OpenAI, Anthropic, and Codex. This means the coding agent you already use can connect directly — no new tools to learn, no workflow changes required.

There’s a problem here that’s easy to miss but expensive to ignore: tool-call translation loss. DeepSeek V4 Flash generates tool calls in its native DSML format (an XML-like tag language), but agents like Claude Code and Codex only understand JSON. When the agent translates the model’s DSML tool call into JSON and sends it back on the next round, even the slightest formatting difference can cause the model to not recognize what it said last turn — forcing it to reprocess the entire conversation from scratch.

A prerequisite to understand here: when a model processes multi-turn conversations, it doesn’t recompute the entire history from scratch every time. It caches previously computed attention states — the KV cache. New tokens only need to interact with this cache. The cache is entirely determined by the exact text of previous outputs — if the text doesn’t match, the cache is invalidated, and the entire history must be recomputed.

Here’s an analogy: you write an email in Chinese, the recipient replies in English, you translate the English back to Chinese and reply again — the wording can never be identical to the original. The model faces the same issue: when it sees text that doesn’t match what it originally output, it has to re-read the entire context. In a 100K-token conversation, this re-computation can take tens of seconds.

ds4’s approach is to remember the model’s exact words. Every time the model generates a tool call, ds4 assigns it a unique ID and saves the original text. When the agent returns with that ID on the next request, ds4 replays the exact original bytes — the model sees its own verbatim output and picks up right where it left off. Only when that ID is lost (e.g., server restart without KV cache persistence enabled) does ds4 fall back to the backup path of translating JSON back to DSML.

There’s another optimization: when generating DSML protocol structure (tags, parameter names, JSON punctuation), ds4 forces deterministic sampling — this part must be precise, with zero tolerance for drift. But argument payloads — file contents, edit text — still use normal sampling, because forced determinism on long text bodies causes repetition.

In addition to server mode, ds4 has an alpha-stage native agent (ds4-agent). Its design goes further: it eliminates the server middleman entirely, running inference and agent logic in the same process. With no HTTP boundary, responses are instant (with a live progress bar during prefill), and switching between coding sessions with /switch is also instant — restoring a session is just loading a file. It comes with 9 built-in tools (bash, read, write, edit, search, etc.), all vertically tuned for V4 Flash’s DSML format. Still alpha quality, but it’s exploring an interesting direction.

Why a 284B Model Can Run on a Laptop

Here we need to introduce a concept that every LLM inference engine depends on: the KV cache.

When a large model generates text, each new token must attend to all previous tokens. If it recomputed from scratch every time, a 1M-token context would make per-token generation time astronomically slow. So every inference engine caches previously computed attention states — this cache is the KV cache. New tokens only need to interact with the cached state, without replaying the entire history.

KV cache size directly determines how much context you can run. A standard transformer’s KV cache scales linearly with token count: 1M token context requires roughly 180GB just for the cache — not counting the model itself. This is why 1M-context models are impossible to run on most people’s hardware.

DeepSeek V4 Flash redefines this problem at the architecture level. Its Multi-head Latent Attention (MLA) doesn’t store the full attention state for every token. Instead, it performs layered compression continuously during the conversation: the first few layers retain full state for the most recent 128 tokens (handling what’s happening right now), while deeper layers apply different compression ratios for long-range summarization — some layers store one-quarter, others one-hundred-twenty-eighth. This design is motivated by the specific demands of agent workloads: an agent needs to repeatedly search back through old logs and tool results across dozens of tool-calling rounds, not just do a one-shot long-document Q&A. I analyzed V4 Flash’s three-tier attention architecture (HCA for global rough scanning, CSA for indexed detailed lookup, sliding window for near-range full attention) in detail in a previous article, so I won’t repeat it here. The result: from 2048 to 65536 tokens, KV cache grows from 52MB to 926MB, roughly 14MB per 1000 tokens. Extrapolating to 1M tokens gives about 13.4GB — roughly 13× smaller than a standard transformer.

On top of MLA compression, ds4 does three things that upgrade the KV cache from a transient runtime structure into a persistent, cross-session reusable system.

First, disk persistence. Most inference engines treat the KV cache as temporary GPU memory — session ends, memory is wiped. ds4 writes it to files. Cache keys are SHA1 hashes of the rendered text prefix, and cache files contain three parts: the text prefix, token IDs, and the complete graph state. The next time you restart the server and continue the same conversation, the model loads this file directly from disk — no need to reprocess that 25K-token Claude Code system prompt from scratch. Files use ordinary read/write I/O rather than mmap (the model itself already mmaps 80GB+, and adding mmap for each cache file would exhaust virtual memory space). Checkpoints are saved at four key moments: after the initial prompt settles, every ~10K tokens, before eviction by a new session, and on server shutdown. Two additional details in saving: trim the last 32 tokens and align to 2048-token boundaries — this prevents the tokenizer from splitting the same word into different tokens at the boundary on reload, which would cause a mismatch.

Second, tool-call context preservation. This layer connects the KV cache to the exact replay mechanism discussed in the server section above. The model’s tool-call exact-text mappings are also written into KV cache files. When the server loads a cache file after restart, it restores both the conversation memory and the exact-text mappings simultaneously — the model sees its own verbatim previous output when continuing the conversation, with no need for canonicalization to approximate the original.

Third, session switching. For ds4-agent, /switch between sessions is just loading a .kv file — restoration is restoration, with no prefill. This is a capability that naturally follows from making the KV cache a first-class disk citizen: if in-memory KV state can be saved to and restored from disk at any time, a session is no longer an exclusive resource.

Under these designs, the benchmark numbers: on M4 Max (q2 quant), prefill ranges from 344 t/s to 205 t/s (from 2K to 65K context), with generation stable at 23-27 t/s. On M3 Ultra 512GB (q2), prefill at 84-468 t/s, generation at 27-37 t/s. antirez noted on Hacker News that his M3 Max peaks at only 50W during full-speed generation.

Bonus Capability: Change the Model’s Speaking Style with One Parameter

ds4 supports activation steering — a technique for modifying model behavior at inference time without retraining. In practice, it’s a single parameter: --dir-steering-ffn -1 for more concise answers, --dir-steering-ffn 2 for more verbose ones.

I discussed the theory behind this technique in detail in a previous article. In brief: a model’s internals form a high-dimensional space, and certain meaningful concepts (refusal tendency, verbosity, formality) each correspond to a linear direction in that space. Find the direction, push on it, and behavior shifts — like an equalizer on a stereo: push the fader left for less vocals, right for more vocals, but the underlying music doesn’t change.

ds4’s steering operates at runtime without modifying model weights. This differs from the better-known abliteration (which directly edits weight files) in a crucial way: because weights are untouched, the model’s determinism is preserved — the exact tool-call replay mechanism described earlier continues to work. You can also use different strengths across different sessions, or even switch within the same session.

Direction vectors are generated entirely locally: a Python script calls the local ds4 CLI, runs inference on two sets of prompts (target behavior vs. contrast behavior), dumps per-layer 4096-dimensional FFN output activations, computes the difference, normalizes, and de-projects — producing a 43×4096 float32 direction file. The project ships with a verbosity example built from 100 prompt pairs for concise vs. verbose style. On the same prompt “Explain why databases use indexes.”, -1 outputs 67 words in one compact paragraph, the default outputs 136 words, and 2 outputs 171 words with section-by-section elaboration.

The same method can be applied to other behavioral dimensions: reducing coding tendency (for a customer service chatbot that should answer fewer programming questions), adjusting formality, experimenting with specific behavioral tendencies. But it has boundaries — coarse-grained style and tendency adjustments work reliably, while precise factuality or complex reasoning ability are largely unaffected by steering.

Closing

ds4’s approach is a narrow bet: one model, one engine, polished end to end. From custom quantization to the Metal compute graph to KV cache disk persistence to precise tool-call context preservation — every layer serves only V4 Flash.

V4 Flash was released a month ago, and ds4 was released just over two weeks ago. Together, they let a 96GB MacBook run a near-frontier-quality model, callable through Claude Code, Codex, opencode — the agents you already use — without changing your workflow or learning new tools.

The project is still beta quality, and ds4-agent is alpha. But the direction is clear: make a model fully usable on your hardware from end to end, not just capable of spitting out tokens.