April 2026 — an arXiv paper (Incompressible Knowledge Probes) opens with a concrete story. The author, Li Bojie, has been testing every new model with the same question for three consecutive years: “Do you know USTC Hackergame?” It is an annual CTF competition whose Chinese-named problems draw on specific cultural references: obscure challenges by design. As of May 2024, GPT-4o knew the competition existed but fabricated challenge names. Nine months later, Claude 3.7 Sonnet could accurately list all 19 challenges from Hackergame 2023. By April 2026, Kimi K2.6, Claude Opus 4.7, and Gemini 3.1 Pro could list specific challenges across multiple consecutive years.
This is more than an anecdote about one competition. It points to a more general constraint: the number of facts a model knows that cannot be derived through reasoning is fundamentally bounded by its parameter count. Closed-source labs can withhold parameter counts, but how much a model knows about obscure facts is hard to fully conceal.
The IKP (Incompressible Knowledge Probes) framework proposed in the paper is designed to measure precisely this dimension.
IKP does not measure how many parameters a model physically occupies in hardware. It measures something else: the model’s effective knowledge capacity, meaning how much obscure factual knowledge it holds, expressed as the parameter count an open-source model would need to match it. At the same parameter count, data mixing ratios, training recipes, and safety alignment can all push effective knowledge capacity away from the raw parameter count; the two are not the same thing.
This distinction matters because model capabilities divide into two qualitatively different kinds. One is method-type ability (reasoning, parsing, instruction following, tool use), which can be compressed into fewer parameters through better architecture, training recipes, distillation, and post-training. The other is factual capacity (long-tail factual associations: the founding year of some institution, the specific work of a low-citation researcher), which is closer to a storage problem. Allen-Zhu & Li 2025 found in synthetic-fact experiments that language models can store roughly 2 bits of knowledge per parameter, which provides a reference point for the physical upper bound.
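For a sense of scale, that bound implies simple back-of-envelope arithmetic. A minimal sketch; the 7B model size is an arbitrary illustration, not a figure from the paper:

```python
# Back-of-envelope implication of the ~2 bits/parameter bound reported by
# Allen-Zhu & Li 2025. The 7B model size is an arbitrary example.
params = 7e9
capacity_bits = 2 * params              # ~14 gigabits of storable facts
capacity_gb = capacity_bits / 8 / 1e9   # ~1.75 GB at most, and that budget
                                        # is shared with method-type abilities
print(f"{capacity_gb:.2f} GB")          # 1.75 GB
```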
Reasoning is like method; facts are like storage. The two draw on the same parameter budget but behave like different resources. In recent years, smaller models have caught up to larger ones on many benchmarks, which mainly reflects efficiency gains in method-type capabilities. On the factual-capacity side, a strong dependence on parameter count remains.
The design of these probes revolves around three principles.
First, it selects only facts that “must truly be remembered.” The 1,400 questions are not ordinary factual QA: they deliberately exclude anything that can be answered through reasoning.
Second, rarity stratification is central. Questions are sorted into seven tiers by rarity (T1 being the most common, T7 the most obscure). Of these, 345 researcher probes ask the model to identify a computer scientist’s research area and provide a verifiable artifact: a paper title, a named system, an institution, or a collaborator. If the model guesses a plausible-sounding subfield but cannot name specific work, it is scored as weak; fabricated evidence is penalized. Another 557 Wikidata probes sample attributes such as institution founding years and capital cities from Wikidata, stratified by page-view quartiles to determine rarity tiers (a sketch of this step follows below).
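As a concrete illustration of the stratification step, here is a minimal sketch assuming a pandas DataFrame of candidate facts with a pageviews column. The column names, the per-tier sample size, and the generalization from quartiles to an arbitrary number of quantile bins are all hypothetical, not the paper’s actual pipeline:

```python
import pandas as pd

def sample_stratified(candidates: pd.DataFrame, n_tiers: int = 7,
                      per_tier: int = 80, seed: int = 0) -> pd.DataFrame:
    """Bin candidate facts into rarity tiers by page-view quantiles,
    then draw an equal number of probes from each tier."""
    df = candidates.copy()
    # qcut assigns bin 0 to the least-viewed (rarest) entities; relabel so
    # that T1 = most common and T7 = most obscure, matching the paper.
    bins = pd.qcut(df["pageviews"], q=n_tiers, labels=False, duplicates="drop")
    df["tier"] = n_tiers - bins
    return (df.groupby("tier", group_keys=False)
              .apply(lambda g: g.sample(min(per_tier, len(g)), random_state=seed)))
```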
Third, a wrong answer is worse than no answer. The scoring rule is: correct +1, weak +0.5, refusal 0, incorrect -1. This means confidently fabricating carries a worse penalty than saying “I don’t know,” suppressing the model’s tendency toward overconfidence.
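The rule fits in a few lines. A sketch, assuming that grading each response into one of the four categories happens upstream (by a human or a judge model; the paper’s grading procedure is not reproduced here):

```python
# The asymmetric scoring rule: fabrication costs more than admitting ignorance.
SCORES = {"correct": 1.0, "weak": 0.5, "refusal": 0.0, "incorrect": -1.0}

def ikp_score(graded: list[str]) -> float:
    """Mean score over a list of graded probe responses."""
    return sum(SCORES[g] for g in graded) / len(graded)

# A model that refuses half the time outscores one that guesses wrong half the time:
ikp_score(["correct", "refusal"])    # 0.5
ikp_score(["correct", "incorrect"])  # 0.0
```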
IKP was calibrated on 89 open-source models with known parameter counts (ranging from 135M to 1,600B, covering 19 vendors). It found that IKP accuracy is linear in log10(parameter count), with R²=0.917. This correlation is strong, but predictive precision is limited: the 90% prediction interval extends roughly a factor of 3 on either side of the point estimate (about 9× end to end). In other words, for a closed-source model, IKP can estimate what parameter count its effective knowledge capacity corresponds to, but that estimate could be off by a factor of about 3 in either direction.
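A minimal sketch of how such a calibration could work, assuming arrays of open-model parameter counts and measured IKP accuracies. Regressing log10(params) on accuracy, and the simplified prediction interval below, are illustrative choices, not necessarily the paper’s exact procedure:

```python
import numpy as np
from scipy import stats

def fit_capacity_estimator(params: np.ndarray, accuracy: np.ndarray):
    """Regress log10(parameter count) on IKP accuracy, so that a new
    accuracy reading maps directly to an effective-capacity estimate."""
    x, y = accuracy, np.log10(params)
    slope, intercept, r, _, _ = stats.linregress(x, y)
    resid = y - (intercept + slope * x)
    se = np.sqrt(np.sum(resid**2) / (len(x) - 2))  # residual std. error
    return slope, intercept, se, r**2

def predict_capacity(acc: float, slope: float, intercept: float, se: float,
                     level: float = 0.90, n: int = 89):
    """Point estimate plus a rough prediction interval (omitting the
    1 + 1/n + (x - mean)^2/Sxx inflation for simplicity)."""
    log_n = intercept + slope * acc
    t = stats.t.ppf(0.5 + level / 2, df=n - 2)
    return 10**log_n, (10**(log_n - t * se), 10**(log_n + t * se))
```

Under this framing, the reported factor-of-3 half-width corresponds to t·se ≈ log10(3) ≈ 0.48 in log10-parameter space.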
Applying this yardstick to frontier closed-source models: GPT-5.5’s effective knowledge capacity is approximately 9.7T, with a 90% interval of [3.2T, 28.7T]; Claude Opus 4.6 is approximately 5.3T, [1.8T, 15.6T]; GPT-5, Claude Opus 4.7, o3, and Grok-4 roughly fall in the 3.0T to 4.1T range. However, it must be noted that above 1T parameters there are only two open-source anchor points: DeepSeek V4 Pro and the Kimi series. The estimates for frontier closed-source models are in fact extrapolations, and the uncertainty may be larger than the global factor-of-3 interval. This interval is not precise enough for fine-grained ranking, but it is sufficient to reveal orders of magnitude: it narrows the discussion from vague guesses of hundreds of billions, a few trillion, or tens of trillions down to a comparable range.
IKP’s most valuable output is splitting the vague notion of “model capability” into two distinct resources.
First, the question of small models catching up to large models. Over the past few years, there has been a strong industry intuition: the same benchmark score requires smaller and smaller models. The Densing Law (Huang et al.) quantifies this intuition: capability density (score achievable per parameter) roughly doubles every 3.5 months, meaning the parameter count needed to reach a fixed benchmark score roughly halves every 3.5 months. In simple terms, a small model in 2026 can match the scores of a large model from 2023 on certain benchmarks (a literal extrapolation is sketched below).
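Taken literally, that doubling time implies aggressive shrinkage. A sketch of the arithmetic; the 70B starting point is an arbitrary example, and the extreme result is exactly the kind of extrapolation the next paragraph cautions against:

```python
# Literal extrapolation of the Densing Law as cited above: the parameters
# needed to hit a fixed benchmark score halve every 3.5 months.
def equivalent_params(params_then: float, months_elapsed: float,
                      doubling_months: float = 3.5) -> float:
    """Parameter count matching the old score after `months_elapsed`."""
    return params_then / 2 ** (months_elapsed / doubling_months)

equivalent_params(70e9, 36)  # ~5.6e7: a 2023-era 70B score from ~56M params (!)
```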
IKP asks a different question: if we look only at obscure facts, do newer models know more at the same parameter count? The paper’s finding is that after controlling for parameter count, no consistent improvement is visible between older and newer models. The implication is not to discard the Densing Law, but rather to read it as better suited to reasoning and problem-solving benchmarks and not directly extrapolate it to long-tail factual capacity. It is good news that small models catch up to large ones on reasoning tasks, but it does not automatically follow that they also remember the same number of obscure facts.
Second, MoE. IKP’s data has a direct practical implication: for MoE models, the fit against total parameters (R²=0.79) is far better than the fit against active parameters (R²=0.51). This suggests that factual knowledge is distributed across all expert weights, not just the subset activated per token. Active parameters measure compute cost; total parameters measure the size of the knowledge storage pool. An MoE model that “only activates 40B” is not equivalent to one with only 40B of knowledge capacity.
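The comparison itself is a one-liner per covariate. A sketch, assuming per-model arrays of total parameters, active parameters, and IKP accuracy (the variable names are hypothetical):

```python
import numpy as np
from scipy import stats

def r_squared(params: np.ndarray, accuracy: np.ndarray) -> float:
    """R-squared of IKP accuracy against log10(parameter count)."""
    return stats.linregress(np.log10(params), accuracy).rvalue ** 2

# If knowledge lives in all experts, total params should predict better:
# r_squared(total_params, acc)   -> ~0.79 in the paper
# r_squared(active_params, acc)  -> ~0.51 in the paper
```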
Third, researcher recognition. One of IKP’s interesting findings is that citation count and h-index explain only about 35% of the variance in model recognition rates. What really matters is a more complex set of factors: whether you have a widely used artifact (such as FlashAttention or IPFS), whether your name is easily confusable, and the density of derivative content about your subfield in the training corpus. Models do not remember the paper itself; they remember the derivative content generated around work that gets repeatedly mentioned.
The above judgments all depend on IKP being a valid measurement tool. The framework has just been released, has no independent replication yet, and three sets of caveats need to be stated explicitly here.
First, it measures effective knowledge capacity, not actual parameter count. Calibration points above 1T are sparse, frontier-model estimates are extrapolations, and the 90% prediction interval extends roughly a factor of 3 on either side of the point estimate. The [3.2T, 28.7T] interval for GPT-5.5 means its factual knowledge capacity is comparable to that of a model with anywhere from 3T to 28T parameters. This is not a precision measurement.
Second, API behavior can affect the readings. Safety alignment can cause models to know but not say; Anthropic’s Haiku series and GPT’s nano/mini variants in particular show clear underestimation. The paper’s own data shows that Claude Sonnet 4’s refusal rate on T5 rose to 88%, from 54% in the previous generation: essentially an artifact of alignment strategy, not a loss of knowledge. Similarly, if a closed-source API backend applies retrieval augmentation, IKP cannot distinguish whether knowledge comes from weights or from retrieval. The paper offers as counterevidence that nearly all T7 scores are near zero (a retrieval-backed system should not fail uniformly on the rarest facts), but retrieval cannot be ruled out absolutely.
Third, the probes are publicly available on GitHub, which improves reproducibility but also exposes future evaluations to contamination risk. The paper’s countermeasure is that the probe generation method is reproducible — the stratified Wikidata sampling and DBLP sampling procedures can regenerate an equivalent set of probes within hours. Additionally, criticism from LifeArchitect.ai notes that IKP fits well on some models but shows clear deviations on GPT-4, o1, and certain DeepSeek/Kimi models. These tensions do not overturn the basic judgment that “factual capacity carries a signal,” but they indicate that IKP reads a mixed signal of knowledge capacity, refusal, and post-training effects, not parameter count alone.
Closed-source labs not disclosing parameter counts has been an industry norm for several years, and the factor-of-2-or-greater uncertainty in estimates derived from inference economics makes parameter count difficult to pin down from the outside. IKP does not answer the question “how many parameters does GPT-5.5 actually have?”; its prediction interval is too wide to be read as an actual parameter count. But a 9× interval is still useful, because orders of magnitude are at least discernible. It re-anchors parameter count from a vague “capability score” to the more concrete dimension of long-tail factual storage.
Going forward, whenever claims like “small models catching up to large models” or “distillation closing the gap with frontier models” appear, one more question should be asked: which capability are they catching up on? For reasoning format, problem-solving, and instruction following, small models and distillation can indeed compress. For long-tail facts, obscure expert knowledge, and attributes of specific entities, you need either a larger parameter pool or a retrieval system to compensate. This echoes findings that predate IKP: Kandpal et al. 2023 showed on the BLOOM family that long-tail factual accuracy is log-linear in model scale (R²=0.98), improving roughly 14 to 15 percentage points per order of magnitude, closely matching IKP’s 14.7 pp/decade (see the arithmetic below).
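That shared slope translates into simple arithmetic. A minimal sketch; the helper function is illustrative:

```python
import math

# Both results report roughly the same slope: ~14.7 percentage points of
# long-tail factual accuracy per 10x increase in parameter count.
def accuracy_gain_pp(scale_factor: float, slope_pp_per_decade: float = 14.7) -> float:
    """Expected long-tail accuracy gain, in percentage points."""
    return slope_pp_per_decade * math.log10(scale_factor)

accuracy_gain_pp(10)   # 14.7 pp for one order of magnitude
accuracy_gain_pp(100)  # ~29.4 pp for two
```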
Parameter count is not equally important across all dimensions. Its significance for benchmark reasoning is declining, but it remains important for long-tail factual capacity. This is not a conclusion any single paper can settle — IKP has just been released and has no independent replication yet — but it offers a measurement approach that can be tracked over time, and it poses a question that must be answered whenever discussing model scale: what kind of capacity do you mean by scale?