If you’ve already decided that open-source models meet your needs, the next question is a procurement decision: where do you buy inference? The same model can be accessed through the official pay-per-token API, the vendor’s own subscription plan (z.ai Coding Plan, Kimi Chat Pro), or Ollama Cloud’s monthly subscription. The billing logic differs so much across channels that it can reshape your entire workflow design. Price is only the first layer. When your prompts contain client data or internal code, the difference in billing models matters far less than the difference in where your data ends up.
This article focuses on three of the most talked-about open-source models right now: GLM-5.1, DeepSeek V4 Pro, and Kimi K2.6. We compare them across three dimensions — price, privacy, and speed — and end with practical recommendations for different usage scenarios.
First, let’s clarify what each channel is actually selling.
Official API is the pay-per-token service run by the model vendors themselves. DeepSeek is at platform.deepseek.com, GLM-5.1 at z.ai, and Kimi at platform.kimi.ai. Input and output tokens are priced separately; you pay for what you use. All three bill in US dollars.
Vendor subscriptions are the vendors’ own monthly plans. z.ai has a Coding Plan (Lite ~$10/month, Pro ~$30/month, Max ~$80/month) covering the full model lineup including GLM-5.1, billed by request count, and compatible with agent coding tools. Kimi also offers a developer-oriented Kimi Code subscription, with tiers named after musical tempos: Moderato (~$19/month), Allegretto (~$49/month), and Vivace (higher, price undisclosed), including CLI access and priority compute resources. DeepSeek currently only has a consumer-facing Chat subscription (Growth $35/month), which does not expose an API and cannot be used for agent coding workflows — it is excluded from this comparison.
Ollama Cloud is Ollama’s cloud inference service, subscription-based: Free $0, Pro $20/month, Max $100/month. Usage is measured by GPU time consumed, not by tokens. All three models are available (glm-5.1:cloud, deepseek-v4-pro:cloud, kimi-k2.6:cloud). Ollama likewise does not disclose specific token limits, stating only that Pro is 50x Free and Max is 5x Pro.
Let’s start with official API unit prices. All prices are per million tokens.
| Model | Input ($/M tokens) | Output ($/M tokens) | Notes |
|---|---|---|---|
| DeepSeek V4 Pro (discount) | $0.435 | $0.87 | 75% discount, through 2026/05/31 |
| DeepSeek V4 Pro (full price) | $1.74 | $3.48 | Restores on June 1 |
| GLM-5.1 (z.ai) | $1.40 | $4.40 | 202K context, cached input $0.26 |
| Kimi K2.6 | $0.95 | $4.00 | 256K context, $1 min top-up |
Sources: DeepSeek pricing, z.ai pricing, Kimi pricing
DeepSeek V4 Pro during the discount period is the cheapest of the three. Once the discount ends, the picture flips: GLM-5.1’s input price ($1.40) is lower than DeepSeek’s full price ($1.74), but its output price ($4.40) is higher than both DeepSeek ($3.48) and Kimi ($4.00). If your workload is output-heavy (code generation, long-form writing), GLM-5.1’s API unit price is actually the most expensive of the three.
z.ai’s API also has an easy-to-miss feature: cached input costs only $0.26/M tokens, less than one-fifth of the full input price. If your prompt has a large repeated prefix (e.g., system prompt + unchanging context), your actual input cost will be significantly lower than the sticker price.
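To see how much the cache discount matters, here is a quick sketch of the blended input price under an assumed cache hit rate. The 75% hit rate is illustrative, not a z.ai figure; only the two unit prices come from the pricing table.

```python
# Blended GLM-5.1 input price ($/M tokens) given a cache hit rate.
# Unit prices from z.ai's published rates: $1.40/M full, $0.26/M cached.
FULL_INPUT = 1.40    # $/M tokens, uncached input
CACHED_INPUT = 0.26  # $/M tokens, cache-hit input

def effective_input_price(cache_hit_rate: float) -> float:
    """Weighted average of cached and uncached input prices."""
    return cache_hit_rate * CACHED_INPUT + (1 - cache_hit_rate) * FULL_INPUT

# With 75% of input tokens hitting the cache (plausible for agent loops
# that resend a large fixed prefix), the blended rate drops well below
# the sticker price:
print(round(effective_input_price(0.75), 3))  # 0.545
```

At a 75% hit rate the effective input price is about $0.55/M, roughly 60% off the $1.40 sticker price, which changes GLM-5.1's position in the cost ranking for prefix-heavy workloads.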
Scenario A: Lightweight agent coding. 10K input / 2K output per call, 100 calls per day, 3,000 calls per month. Totals: 30M input + 6M output tokens. This is the typical usage of someone running an agent for coding assistance daily, with dozens to hundreds of calls per day.
| Model | Monthly Cost |
|---|---|
| DeepSeek V4 Pro (discount) | $18.27 |
| Kimi K2.6 | $52.50 |
| GLM-5.1 | $68.40 |
| DeepSeek V4 Pro (full price) | $73.08 |
At this usage level, DeepSeek’s discount price still holds up well — $18 is all it takes. After the discount expires, all three models land in the $50-73 range. Meanwhile, z.ai Coding Plan Lite is only $10 and Ollama Cloud Pro is $20 — subscriptions already show a clear price advantage at this usage level.
Scenario B: Heavy agent coding. Roughly 750M input + 85M output tokens per month (total ~835M tokens), with an input-to-output ratio of about 9:1. The basis for this volume estimate: users of Claude Code’s $200 Max plan tracked with the ccusage tool report monthly consumption in the $2,000-3,000 range (converted to Sonnet 4.6 API price equivalents, assuming 75% of input hits cache), which back-projects to approximately 800M-1.3B tokens/month. We use the midpoint here.
| Model | Monthly Cost |
|---|---|
| DeepSeek V4 Pro (discount) | $400 |
| Kimi K2.6 | $1,053 |
| GLM-5.1 | $1,424 |
| DeepSeek V4 Pro (full price) | $1,601 |
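The 5-20x subscription advantage claimed below can be checked with the same arithmetic, using z.ai Coding Plan Max ($80/month) as the subscription baseline:

```python
# Scenario B: 750M input + 85M output tokens per month, API vs subscription.
def api_cost(input_m: float, output_m: float,
             in_price: float, out_price: float) -> float:
    """Monthly API cost in dollars (token volumes in millions)."""
    return input_m * in_price + output_m * out_price

deepseek_discount = api_cost(750, 85, 0.435, 0.87)  # cheapest API option
deepseek_full     = api_cost(750, 85, 1.74, 3.48)   # most expensive API option
ZAI_MAX = 80  # z.ai Coding Plan Max, $/month

print(round(deepseek_discount / ZAI_MAX, 1))  # ~5x cheaper via subscription
print(round(deepseek_full / ZAI_MAX, 1))      # ~20x cheaper via subscription
```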
At this volume, pure API monthly costs range from $400 to $1,601 depending on the model and discount. In contrast, z.ai Coding Plan Max is only $80/month and Ollama Cloud Max is $100/month — if the subscriptions can handle this load (z.ai Max can, based on real-world testing), the cost gap is 5-20x.
Now let’s bring subscription plans into the comparison. The difficulty is that each vendor’s subscription doesn’t bill by tokens, and their limit descriptions use inconsistent units, making direct translation nearly impossible. The table below lists what is known, followed by a rough estimation.
| Subscription | Monthly Fee | Models Included | Limit Description |
|---|---|---|---|
| z.ai Coding Plan Lite | ~$10 | GLM-5.1, 5-Turbo, 4.7, 4.6, 4.5-Air | “3× Claude Pro usage” |
| z.ai Coding Plan Pro | ~$30 | Above + GLM-5 | “5× Lite usage” |
| z.ai Coding Plan Max | ~$80 | Above + GLM-5 | “4× Pro usage” |
| Kimi Code Moderato | ~$19 | Kimi K2.6 and full lineup | CLI access, priority compute |
| Kimi Code Allegretto | ~$49 | Same | Higher limit |
| Kimi Code Vivace | Undisclosed | Same | Highest limit |
| Ollama Cloud Pro | $20 | All cloud models | “Day-to-day work,” 50x Free |
| Ollama Cloud Max | $100 | All cloud models | 5x Pro |
Sources: z.ai Coding Plan, Kimi Code guide, Ollama pricing
A critical data point here: z.ai’s Coding Plan describes limits as “N times Claude Pro usage,” not as specific request counts or token numbers. In practice the cap appears generous: users on the Max tier (~$80/month) report consuming millions or even tens of millions of tokens per month without hitting it, and real-world heavy-usage testing shows Max sustaining roughly 800M tokens/month. At that volume, pure API access would cost $400-$1,601, so z.ai Max saves 5-20x.
Ollama Cloud’s limits are similarly opaque. Community reverse-estimation puts the Free tier at roughly 28 GPU-hours/month, with Pro at 50x Free. But converting GPU hours to tokens depends on model size and request complexity, so precise translation isn’t possible. At 800M tokens/month of heavy usage, Ollama Cloud Max ($100/month) might just barely handle it — or it might not — depending on which model you primarily use (larger models consume GPU time faster). The Pro tier ($20) is almost certainly insufficient at this volume.
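To make the uncertainty concrete, here is a back-of-envelope conversion of the community's GPU-hour estimate into tokens. The 28 GPU-hours/month Free figure is the community reverse-estimate cited above, and the throughput numbers (40 t/s vs 20 t/s) are illustrative assumptions, not published data:

```python
# Rough conversion of Ollama Cloud's estimated GPU-hour allowance to tokens.
FREE_GPU_HOURS = 28   # community reverse-estimate for the Free tier
PRO_MULTIPLIER = 50   # Pro is stated as 50x Free
MAX_MULTIPLIER = 5    # Max is stated as 5x Pro

def tokens_per_month(gpu_hours: float, tokens_per_second: float) -> float:
    """Token budget if every GPU-second were spent generating at this rate."""
    return gpu_hours * 3600 * tokens_per_second

max_hours = FREE_GPU_HOURS * PRO_MULTIPLIER * MAX_MULTIPLIER  # 7000 GPU-hours

# At a hypothetical 40 t/s sustained, Max covers ~1B tokens/month;
# at 20 t/s (a larger, slower model) only ~500M -- hence the uncertainty.
print(round(tokens_per_month(max_hours, 40) / 1e6))  # ~1008 (million tokens)
print(round(tokens_per_month(max_hours, 20) / 1e6))  # ~504  (million tokens)
```

Under these assumptions, whether Max covers an 800M tokens/month workload swings entirely on which model dominates your usage, which matches the "might just barely handle it" assessment above.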
Kimi Code’s Allegretto (~$49/month) or Vivace may be enough for heavy agent scenarios, but Vivace’s price is undisclosed and its limit is opaque — testing is needed. DeepSeek V4 Pro has no usable subscription option for heavy agent scenarios — its Chat product doesn’t expose an API. Heavy DeepSeek users either go through Ollama Cloud’s subscription or accept high pay-per-token costs.
Back to Scenario A numbers (30M+6M tokens/month). Lightweight agent usage costs $18-$73 on pure API. Ollama Cloud Pro at $20 and z.ai Lite at $10 are both in range — if the subscription can cover this volume (highly likely), the subscription is cheaper.
By Scenario B (750M+85M tokens/month), pure API costs $400-$1,601 per month while z.ai Max is $80 and Ollama Cloud Max is $100. Subscriptions still hold a 5-20x price advantage. z.ai Max is the most clearly viable option at this volume. Ollama Cloud Max carries some risk but is worth trying.
One caveat: this conclusion assumes you primarily use a single vendor’s models. If your workflow spans multiple models (e.g., GLM-5.1 for coding, Kimi for research, DeepSeek for reasoning), buying a separate subscription from each vendor no longer makes sense. This is where Ollama Cloud’s advantage shines: one subscription covers all three models. Conversely, if you only use GLM-5.1, z.ai’s Coding Plan is more focused and offers looser limits.
The conclusions in this section may overturn the pure-price judgments from earlier.
The four service providers — three Chinese model vendors (DeepSeek, Moonshot, Zhipu) and one overseas service provider (Ollama) — differ more dramatically on data policy than on price. Here are the results of a line-by-line review.
Ollama Cloud has the clearest privacy commitment among the four. The official privacy policy states that when using cloud models, prompt and response content exists only transiently during request processing (“process this content transiently”) and is “not stored beyond the time required to fulfill the request,” with technical measures implemented to minimize retention time. The terms of service separately declare: “We do not use your inputs or outputs to train AI models.” In its partnership contract with NVIDIA as the cloud service provider, Ollama requires compliance with a “no logging, no training, zero data retention” policy. Sources: ollama.com/privacy, ollama.com/terms
DeepSeek’s privacy policy uses the phrase “may collect” when describing the handling of prompts and input content, without the same explicit exclusion of storage found in Ollama’s policy. The policy mentions that entities within the corporate group may use data for “foundation model training and improvement” but simultaneously offers an opt-out right. However, how opt-out is executed in the API context (whether via dashboard settings or API parameters) is not explained in the policy. Source: DeepSeek privacy policy
Kimi / Moonshot explicitly lists conversation content as one of the categories of collected information in its privacy policy and states that the purpose includes “model optimization.” Article 5 of the user service agreement states: “You grant us a royalty-free license to use your input and output content and feedback for model service optimization to the extent permitted by law.” Sources: Moonshot privacy policy, Moonshot user agreement
Zhipu / z.ai’s user agreement, Article 10, Section 3, grants an exceptionally broad license: to the extent permitted by law, users “grant Zhipu and its affiliates a non-exclusive, geographically unlimited, perpetual, royalty-free license to use (including storing, using, reproducing, revising, editing, publishing, displaying, translating, distributing said information, or creating derivative works) and the right to sublicense to third parties.” This authorization covers all non-personal information content and permits sublicensing to third parties. Source: Zhipu user agreement
Summarized into a table:
| Dimension | Ollama Cloud | DeepSeek | Kimi / Moonshot | z.ai / Zhipu |
|---|---|---|---|---|
| Stores prompt/response | No | May collect | Explicitly collects | Explicitly collects |
| Used for training | No | Possible by default, opt-out available | Explicitly for optimization | Extremely broad license, training not excluded |
| Third-party audit | None | None | None | None |
| Data storage location | US | China (inferred) | China | China |
No value judgment is made here. Different usage scenarios have different privacy requirements. Readers need to determine which provider’s policy fits their scenario based on whether their prompts contain client data, commercially sensitive information, or personal data.
Output speeds (tokens/second) for the three models on their respective official APIs, based on independent testing by Artificial Analysis (10K input token workload, P50 median):
| Model | Official API Output Speed | Time to First Token |
|---|---|---|
| Kimi K2.6 | ~106 t/s | Not disclosed (includes thinking time) |
| GLM-5.1 | ~52 t/s | 1.43s |
| DeepSeek V4 Pro | ~38 t/s | 2.02s |
Kimi K2.6’s 106 t/s is roughly 2.8× DeepSeek V4 Pro’s 38 t/s. In terms of perceived experience, generating a 500-token response takes about 4.7 seconds on Kimi, about 13 seconds on DeepSeek V4 Pro, and roughly 9.6 seconds on GLM-5.1, in between.
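The perceived-time figures are just response length divided by throughput; a small sketch (time-to-first-token is excluded here, but adds the 1.4-2s from the table on top for GLM and DeepSeek):

```python
# Generation time for an n-token response at a given output speed (t/s).
def generation_time(n_tokens: int, tokens_per_second: float) -> float:
    return n_tokens / tokens_per_second

for model, tps in [("Kimi K2.6", 106), ("GLM-5.1", 52), ("DeepSeek V4 Pro", 38)]:
    print(f"{model}: {generation_time(500, tps):.1f}s for 500 tokens")
```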
That said, the GLM series shows the most room for speed improvement with third-party optimization. Inference provider Novita ran GLM-5 at FP8 precision and achieved 213 t/s, while Baseten reached 186 t/s using MTP speculative decoding. This suggests that the official API’s 52 t/s is more a limitation of server-side optimization than the model’s inherent capability ceiling. If you call GLM-5.1 through a third-party provider (such as Together AI or Fireworks), speeds may be noticeably better than z.ai’s official API.
Ollama Cloud has not published throughput data for these three models, and no independent third-party benchmark exists. The underlying hardware is NVIDIA Blackwell series, which should theoretically be significantly faster than consumer-grade local hardware, but there is no public data for precise comparison.
On the rate limit front, DeepSeek V4 Pro’s 60 RPM is a practical bottleneck. If your workload involves frequent short requests (e.g., multiple agents calling concurrently), you’ll hit the RPM cap before the TPM cap. Kimi’s rate limit scales with top-up amount: the lowest tier ($1 top-up) gives only 3 RPM and 1 concurrent request, jumping to 200 RPM after topping up to $10. z.ai’s rate limits are not publicly documented; some users report that GLM-5 and 5.1 trigger rate limiting after several days of use.
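A quick way to estimate when a multi-agent setup hits DeepSeek's 60 RPM cap; the agent counts and per-call cadence here are illustrative assumptions, not measured workloads:

```python
# Aggregate request rate of n agents each issuing one call every
# `seconds_per_call` seconds, compared against DeepSeek's 60 RPM cap.
def requests_per_minute(n_agents: int, seconds_per_call: float) -> float:
    return n_agents * 60 / seconds_per_call

RPM_CAP = 60  # DeepSeek V4 Pro

# 5 agents calling every 10 seconds stays under the cap...
print(requests_per_minute(5, 10))   # 30.0
# ...but 15 agents at the same cadence exceeds it.
print(requests_per_minute(15, 10))  # 90.0
```

The point: with short, frequent requests, the call *count* binds long before total token volume does, regardless of how cheap the per-token price is.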
Combining usage volume and privacy sensitivity yields roughly four quadrants:
Low volume + privacy-insensitive (lightweight agent coding, dozens to hundreds of calls per day): Both pay-per-token API and low-cost subscriptions work. DeepSeek V4 Pro during the discount period is $18/month, nearly matching Ollama Cloud Pro ($20). z.ai Coding Plan Lite ($10/month) is the best value if the limit is sufficient. This tier has multiple reasonable options depending on your model preference.
High volume + privacy-insensitive (heavy agent coding, hundreds of millions to a billion tokens per month): Subscriptions are practically the only option. Pure API at this volume costs $400-$1,600/month. z.ai Coding Plan Max ($80/month) is currently the most clearly viable subscription for heavy usage. Ollama Cloud Max ($100/month) is also worth trying, but GPU-time billing depletes faster on large models — testing is needed. If you work across multiple models, Ollama Cloud’s single subscription covering all models beats buying multiple vendor subscriptions.
Low volume + privacy-sensitive (prompts containing client data or commercial information, lightweight agent): Ollama Cloud is currently the provider with the clearest privacy commitment. Its “no storage, no training” policy has the strongest wording among the four, though it lacks third-party audit certification. The Pro tier ($20) should be sufficient for lightweight agent use.
High volume + privacy-sensitive (enterprise-grade agent workflows, sensitive data): There is no perfect option today. Ollama Cloud has the strongest privacy commitment but no audit certification, and whether the Max tier can sustain heavy usage is uncertain. The Chinese vendors’ data policies (particularly Zhipu’s broad licensing clause and Kimi’s explicit collection statement) need to be reviewed clause by clause against your compliance requirements. If audit compliance is a hard requirement, self-hosting infrastructure or finding a dedicated inference service with SOC 2 certification may be the only way forward.
One final note on the time dimension. DeepSeek V4 Pro’s 75% discount expires on May 31, 2026, after which the API price jumps to four times the current level. In Scenario A (lightweight agent), once the discount expires, DeepSeek’s monthly cost jumps from $18 to $73, immediately widening the gap with z.ai Coding Plan Lite ($10) or Ollama Cloud Pro ($20). In Scenario B (heavy agent), after discount expiry, DeepSeek’s monthly cost leaps from $400 to $1,601, making the subscription advantage even more dramatic.