If you’ve already decided that open-source models meet your needs, the next question is a procurement decision: where do you buy inference? The same model can be accessed through the official pay-per-token API, the vendor’s own subscription plan (z.ai Coding Plan, Kimi Chat Pro), or Ollama Cloud’s monthly subscription. The billing logic differs so much across channels that it can reshape your entire workflow design. Price is only the first layer. When your prompts contain client data or internal code, the difference in billing models matters far less than the difference in where your data ends up.
This article focuses on three of the most talked-about open-source models right now: GLM-5.1, DeepSeek V4 Pro, and Kimi K2.6. We compare them across three dimensions — price, privacy, and speed — and end with practical recommendations for different usage scenarios.
First, let’s clarify what each channel is actually selling.
Official API is the pay-per-token service run by the model vendors themselves. DeepSeek is at platform.deepseek.com, GLM-5.1 at z.ai, and Kimi at platform.kimi.ai. Input and output tokens are priced separately; you pay for what you use. All three bill in US dollars.
Vendor subscriptions are the vendors’ own monthly plans. z.ai has a Coding Plan (Lite ~$10/month, Pro ~$30/month, Max ~$80/month) covering the full model lineup including GLM-5.1, billed by request count, and compatible with agent coding tools. Kimi also offers a developer-oriented Kimi Code subscription, with tiers named after musical tempos: Moderato (~$19/month), Allegretto (~$49/month), and Vivace (higher, price undisclosed), including CLI access and priority compute resources. DeepSeek currently only has a consumer-facing Chat subscription (Growth $35/month), which does not expose an API and cannot be used for agent coding workflows — it is excluded from this comparison.
Ollama Cloud is Ollama’s cloud inference service, subscription-based: Free $0, Pro $20/month, Max $100/month. Usage is measured by GPU time consumed, not by tokens. All three models are available (glm-5.1:cloud, deepseek-v4-pro:cloud, kimi-k2.6:cloud). Ollama likewise does not disclose specific token limits, stating only that Pro is 50x Free and Max is 5x Pro.
Let’s start with official API unit prices. All prices are per million tokens.
| Model | Input ($/M tokens) | Output ($/M tokens) | Notes |
|---|---|---|---|
| DeepSeek V4 Pro (discount) | $0.435 | $0.87 | 75% discount, through 2026/05/31 |
| DeepSeek V4 Pro (full price) | $1.74 | $3.48 | Restores on June 1 |
| GLM-5.1 (z.ai) | $1.40 | $4.40 | 202K context, cached input $0.26 |
| Kimi K2.6 | $0.95 | $4.00 | 256K context, $1 min top-up |
Sources: DeepSeek pricing, z.ai pricing, Kimi pricing
DeepSeek V4 Pro during the discount period is the cheapest of the three. Once the discount ends, the picture flips: GLM-5.1’s input price ($1.40) is lower than DeepSeek’s full price ($1.74), but its output price ($4.40) is higher than both DeepSeek ($3.48) and Kimi ($4.00). If your workload is output-heavy (code generation, long-form writing), GLM-5.1’s API unit price is actually the most expensive of the three.
z.ai’s API also has an easy-to-miss feature: cached input costs only $0.26/M tokens, less than one-fifth of the full input price. If your prompt has a large repeated prefix (e.g., system prompt + unchanging context), your actual input cost will be significantly lower than the sticker price.
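To see how much the cache discount matters, here is a quick sketch of the blended input price under an assumed cache hit rate. The 75% hit rate is illustrative, not a z.ai figure; only the two unit prices come from the pricing table.

```python
# Blended GLM-5.1 input price ($/M tokens) given a cache hit rate.
# Unit prices from z.ai's published rates: $1.40/M full, $0.26/M cached.
FULL_INPUT = 1.40    # $/M tokens, uncached input
CACHED_INPUT = 0.26  # $/M tokens, cache-hit input

def effective_input_price(cache_hit_rate: float) -> float:
    """Weighted average of cached and uncached input prices."""
    return cache_hit_rate * CACHED_INPUT + (1 - cache_hit_rate) * FULL_INPUT

# With 75% of input tokens hitting the cache (plausible for agent loops
# that resend a large fixed prefix), the blended rate drops well below
# the sticker price:
print(round(effective_input_price(0.75), 3))  # 0.545
```

At a 75% hit rate the effective input price is about $0.55/M, roughly 60% off the $1.40 sticker price, which changes GLM-5.1's position in the cost ranking for prefix-heavy workloads.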
Scenario A: Lightweight agent coding. 10K input / 2K output per call, 100 calls per day, 3,000 calls per month. Totals: 30M input + 6M output tokens. This is the typical usage of someone running an agent for coding assistance daily, with dozens to hundreds of calls per day.
| Model | Monthly Cost |
|---|---|
| DeepSeek V4 Pro (discount) | $18.27 |
| Kimi K2.6 | $52.50 |
| GLM-5.1 | $68.40 |
| DeepSeek V4 Pro (full price) | $73.08 |
At this usage level, DeepSeek’s discount price still holds up well — $18 is all it takes. After the discount expires, all three models land in the $50-73 range. Meanwhile, z.ai Coding Plan Lite is only $10 and Ollama Cloud Pro is $20 — subscriptions already show a clear price advantage at this usage level.
Scenario B: Heavy agent coding. Roughly 750M input + 85M output tokens per month (total ~835M tokens), with an input-to-output ratio of about 9:1. The basis for this volume estimate: users of Claude Code’s $200 Max plan tracked with the ccusage tool report monthly consumption in the $2,000-3,000 range (converted to Sonnet 4.6 API price equivalents, assuming 75% of input hits cache), which back-projects to approximately 800M-1.3B tokens/month. We use the midpoint here.
| Model | Monthly Cost |
|---|---|
| DeepSeek V4 Pro (discount) | $400 |
| Kimi K2.6 | $1,053 |
| GLM-5.1 | $1,424 |
| DeepSeek V4 Pro (full price) | $1,601 |
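The 5-20x subscription advantage claimed below can be checked with the same arithmetic, using z.ai Coding Plan Max ($80/month) as the subscription baseline:

```python
# Scenario B: 750M input + 85M output tokens per month, API vs subscription.
def api_cost(input_m: float, output_m: float,
             in_price: float, out_price: float) -> float:
    """Monthly API cost in dollars (token volumes in millions)."""
    return input_m * in_price + output_m * out_price

deepseek_discount = api_cost(750, 85, 0.435, 0.87)  # cheapest API option
deepseek_full     = api_cost(750, 85, 1.74, 3.48)   # most expensive API option
ZAI_MAX = 80  # z.ai Coding Plan Max, $/month

print(round(deepseek_discount / ZAI_MAX, 1))  # ~5x cheaper via subscription
print(round(deepseek_full / ZAI_MAX, 1))      # ~20x cheaper via subscription
```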
At this volume, pure API monthly costs range from $400 to $1,601 depending on the model and discount. In contrast, z.ai Coding Plan Max is only $80/month and Ollama Cloud Max is $100/month — if the subscriptions can handle this load (z.ai Max can, based on real-world testing), the cost gap is 5-20x.
Now let’s bring subscription plans into the comparison. The difficulty is that each vendor’s subscription doesn’t bill by tokens, and their limit descriptions use inconsistent units, making direct translation nearly impossible. The table below lists what is known, followed by a rough estimation.
| Subscription | Monthly Fee | Models Included | Limit Description |
|---|---|---|---|
| z.ai Coding Plan Lite | ~$10 | GLM-5.1, 5-Turbo, 4.7, 4.6, 4.5-Air | “3× Claude Pro usage” |
| z.ai Coding Plan Pro | ~$30 | Above + GLM-5 | “5× Lite usage” |
| z.ai Coding Plan Max | ~$80 | Above + GLM-5 | “4× Pro usage” |
| Kimi Code Moderato | ~$19 | Kimi K2.6 and full lineup | CLI access, priority compute |
| Kimi Code Allegretto | ~$49 | Same | Higher limit |
| Kimi Code Vivace | Undisclosed | Same | Highest limit |
| Ollama Cloud Pro | $20 | All cloud models | “Day-to-day work,” 50x Free |
| Ollama Cloud Max | $100 | All cloud models | 5x Pro |
Sources: z.ai Coding Plan, Kimi Code guide, Ollama pricing
A critical data point here: z.ai’s Coding Plan describes limits as “N times Claude Pro usage,” not as specific request counts or token numbers. In practice the cap appears generous: users on the Max tier (~$80/month) report consuming millions or even tens of millions of tokens per month without hitting it, and real-world heavy-usage testing shows Max sustaining roughly 800M tokens/month. At that volume, pure API access would cost $400-$1,601, so z.ai Max saves 5-20x.
Ollama Cloud’s limits are similarly opaque. Community reverse-estimation puts the Free tier at roughly 28 GPU-hours/month, with Pro at 50x Free. But converting GPU hours to tokens depends on model size and request complexity, so precise translation isn’t possible. At 800M tokens/month of heavy usage, Ollama Cloud Max ($100/month) might just barely handle it — or it might not — depending on which model you primarily use (larger models consume GPU time faster). The Pro tier ($20) is almost certainly insufficient at this volume.
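To make the uncertainty concrete, here is a back-of-envelope conversion of the community's GPU-hour estimate into tokens. The 28 GPU-hours/month Free figure is the community reverse-estimate cited above, and the throughput numbers (40 t/s vs 20 t/s) are illustrative assumptions, not published data:

```python
# Rough conversion of Ollama Cloud's estimated GPU-hour allowance to tokens.
FREE_GPU_HOURS = 28   # community reverse-estimate for the Free tier
PRO_MULTIPLIER = 50   # Pro is stated as 50x Free
MAX_MULTIPLIER = 5    # Max is stated as 5x Pro

def tokens_per_month(gpu_hours: float, tokens_per_second: float) -> float:
    """Token budget if every GPU-second were spent generating at this rate."""
    return gpu_hours * 3600 * tokens_per_second

max_hours = FREE_GPU_HOURS * PRO_MULTIPLIER * MAX_MULTIPLIER  # 7000 GPU-hours

# At a hypothetical 40 t/s sustained, Max covers ~1B tokens/month;
# at 20 t/s (a larger, slower model) only ~500M -- hence the uncertainty.
print(round(tokens_per_month(max_hours, 40) / 1e6))  # ~1008 (million tokens)
print(round(tokens_per_month(max_hours, 20) / 1e6))  # ~504  (million tokens)
```

Under these assumptions, whether Max covers an 800M tokens/month workload swings entirely on which model dominates your usage, which matches the "might just barely handle it" assessment above.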
Kimi Code’s Allegretto (~$49/month) or Vivace may be enough for heavy agent scenarios, but Vivace’s price is undisclosed and its limit is opaque — testing is needed. DeepSeek V4 Pro has no usable subscription option for heavy agent scenarios — its Chat product doesn’t expose an API. Heavy DeepSeek users either go through Ollama Cloud’s subscription or accept high pay-per-token costs.
Back to Scenario A numbers (30M+6M tokens/month). Lightweight agent usage costs $18-$73 on pure API. Ollama Cloud Pro at $20 and z.ai Lite at $10 are both in range — if the subscription can cover this volume (highly likely), the subscription is cheaper.
By Scenario B (750M+85M tokens/month), pure API costs $400-$1,601 per month while z.ai Max is $80 and Ollama Cloud Max is $100. Subscriptions still hold a 5-20x price advantage. z.ai Max is the most clearly viable option at this volume. Ollama Cloud Max carries some risk but is worth trying.
One caveat: this conclusion assumes you primarily use a single vendor’s models. If your workflow spans multiple models (e.g., GLM-5.1 for coding, Kimi for research, DeepSeek for reasoning), buying a separate subscription from each vendor no longer makes sense. This is where Ollama Cloud’s advantage shines: one subscription covers all three models. Conversely, if you only use GLM-5.1, z.ai’s Coding Plan is more focused and offers looser limits.
The conclusions in this section may overturn the pure-price judgments from earlier.
The four service providers — three Chinese model vendors (DeepSeek, Moonshot, Zhipu) and one overseas service provider (Ollama) — differ more dramatically on data policy than on price. Here are the results of a line-by-line review.
Ollama Cloud has the clearest privacy commitment among the four. The official privacy policy states that when using cloud models, prompt and response content exists only transiently during request processing (“process this content transiently”) and is “not stored beyond the time required to fulfill the request,” with technical measures implemented to minimize retention time. The terms of service separately declare: “We do not use your inputs or outputs to train AI models.” In its partnership contract with NVIDIA as the cloud service provider, Ollama requires compliance with a “no logging, no training, zero data retention” policy. Sources: ollama.com/privacy, ollama.com/terms
DeepSeek’s privacy policy uses the phrase “may collect” when describing the handling of prompts and input content, without the same explicit exclusion of storage found in Ollama’s policy. The policy mentions that entities within the corporate group may use data for “foundation model training and improvement” but simultaneously offers an opt-out right. However, how opt-out is executed in the API context (whether via dashboard settings or API parameters) is not explained in the policy. Source: DeepSeek privacy policy
Kimi / Moonshot explicitly lists conversation content as one of the categories of collected information in its privacy policy and states that the purpose includes “model optimization.” Article 5 of the user service agreement states: “You grant us a royalty-free license to use your input and output content and feedback for model service optimization to the extent permitted by law.” Sources: Moonshot privacy policy, Moonshot user agreement
Zhipu / z.ai’s user agreement, Article 10, Section 3, grants an exceptionally broad license: to the extent permitted by law, users “grant Zhipu and its affiliates a non-exclusive, geographically unlimited, perpetual, royalty-free license to use (including storing, using, reproducing, revising, editing, publishing, displaying, translating, distributing said information, or creating derivative works) and the right to sublicense to third parties.” This authorization covers all non-personal information content and permits sublicensing to third parties. Source: Zhipu user agreement
Summarized into a table:
| Dimension | Ollama Cloud | DeepSeek | Kimi / Moonshot | z.ai / Zhipu |
|---|---|---|---|---|
| Stores prompt/response | No | May collect | Explicitly collects | Explicitly collects |
| Used for training | No | Possible by default, opt-out available | Explicitly for optimization | Extremely broad license, training not excluded |
| Third-party audit | None | None | None | None |
| Data storage location | US | China (inferred) | China | China |
No value judgment is made here. Different usage scenarios have different privacy requirements. Readers need to determine which provider’s policy fits their scenario based on whether their prompts contain client data, commercially sensitive information, or personal data.
Output speeds (tokens/second) for the three models on their respective official APIs, based on independent testing by Artificial Analysis (10K input token workload, P50 median):
| Model | Official API Output Speed | Time to First Token |
|---|---|---|
| Kimi K2.6 | ~106 t/s | Not disclosed (includes thinking time) |
| GLM-5.1 | ~52 t/s | 1.43s |
| DeepSeek V4 Pro | ~38 t/s | 2.02s |
Kimi K2.6’s 106 t/s is roughly 2.8× DeepSeek V4 Pro’s 38 t/s. In terms of perceived experience, generating a 500-token response takes about 4.7 seconds on Kimi, about 13 seconds on DeepSeek V4 Pro, and roughly 9.6 seconds on GLM-5.1, in between.
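The perceived-time figures are just response length divided by throughput; a small sketch (time-to-first-token is excluded here, but adds the 1.4-2s from the table on top for GLM and DeepSeek):

```python
# Generation time for an n-token response at a given output speed (t/s).
def generation_time(n_tokens: int, tokens_per_second: float) -> float:
    return n_tokens / tokens_per_second

for model, tps in [("Kimi K2.6", 106), ("GLM-5.1", 52), ("DeepSeek V4 Pro", 38)]:
    print(f"{model}: {generation_time(500, tps):.1f}s for 500 tokens")
```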
That said, the GLM series shows the most room for speed improvement with third-party optimization. Inference provider Novita ran GLM-5 at FP8 precision and achieved 213 t/s, while Baseten reached 186 t/s using MTP speculative decoding. This suggests that the official API’s 52 t/s is more a limitation of server-side optimization than the model’s inherent capability ceiling. If you call GLM-5.1 through a third-party provider (such as Together AI or Fireworks), speeds may be noticeably better than z.ai’s official API.
Ollama Cloud has not published throughput data for these three models, and no independent third-party benchmark exists. The underlying hardware is NVIDIA Blackwell series, which should theoretically be significantly faster than consumer-grade local hardware, but there is no public data for precise comparison.
On the rate limit front, DeepSeek V4 Pro’s 60 RPM is a practical bottleneck. If your workload involves frequent short requests (e.g., multiple agents calling concurrently), you’ll hit the RPM cap before the TPM cap. Kimi’s rate limit scales with top-up amount: the lowest tier ($1 top-up) gives only 3 RPM and 1 concurrent request, jumping to 200 RPM after topping up to $10. z.ai’s rate limits are not publicly documented; some users report that GLM-5 and 5.1 trigger rate limiting after several days of use.
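A quick way to estimate when a multi-agent setup hits DeepSeek's 60 RPM cap; the agent counts and per-call cadence here are illustrative assumptions, not measured workloads:

```python
# Aggregate request rate of n agents each issuing one call every
# `seconds_per_call` seconds, compared against DeepSeek's 60 RPM cap.
def requests_per_minute(n_agents: int, seconds_per_call: float) -> float:
    return n_agents * 60 / seconds_per_call

RPM_CAP = 60  # DeepSeek V4 Pro

# 5 agents calling every 10 seconds stays under the cap...
print(requests_per_minute(5, 10))   # 30.0
# ...but 15 agents at the same cadence exceeds it.
print(requests_per_minute(15, 10))  # 90.0
```

The point: with short, frequent requests, the call *count* binds long before total token volume does, regardless of how cheap the per-token price is.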
Combining usage volume and privacy sensitivity yields roughly four quadrants:
Low volume + privacy-insensitive (lightweight agent coding, dozens to hundreds of calls per day): Both pay-per-token API and low-cost subscriptions work. DeepSeek V4 Pro during the discount period is $18/month, nearly matching Ollama Cloud Pro ($20). z.ai Coding Plan Lite ($10/month) is the best value if the limit is sufficient. This tier has multiple reasonable options depending on your model preference.
High volume + privacy-insensitive (heavy agent coding, hundreds of millions to a billion tokens per month): Subscriptions are practically the only option. Pure API at this volume costs $400-$1,600/month. z.ai Coding Plan Max ($80/month) is currently the most clearly viable subscription for heavy usage. Ollama Cloud Max ($100/month) is also worth trying, but GPU-time billing depletes faster on large models — testing is needed. If you work across multiple models, Ollama Cloud’s single subscription covering all models beats buying multiple vendor subscriptions.
Low volume + privacy-sensitive (prompts containing client data or commercial information, lightweight agent): Ollama Cloud is currently the provider with the clearest privacy commitment. Its “no storage, no training” policy has the strongest wording among the four, though it lacks third-party audit certification. The Pro tier ($20) should be sufficient for lightweight agent use.
High volume + privacy-sensitive (enterprise-grade agent workflows, sensitive data): There is no perfect option today. Ollama Cloud has the strongest privacy commitment but no audit certification, and whether the Max tier can sustain heavy usage is uncertain. The Chinese vendors’ data policies (particularly Zhipu’s broad licensing clause and Kimi’s explicit collection statement) need to be reviewed clause by clause against your compliance requirements. If audit compliance is a hard requirement, self-hosting infrastructure or finding a dedicated inference service with SOC 2 certification may be the only way forward.
One final note on the time dimension. DeepSeek V4 Pro’s 75% discount expires on May 31, 2026, after which the API price jumps to four times the current level. In Scenario A (lightweight agent), once the discount expires, DeepSeek’s monthly cost jumps from $18 to $73, immediately widening the gap with z.ai Coding Plan Lite ($10) or Ollama Cloud Pro ($20). In Scenario B (heavy agent), after discount expiry, DeepSeek’s monthly cost leaps from $400 to $1,601, making the subscription advantage even more dramatic.