Programs have many properties that are obvious at the semantic level but require complex analysis for a compiler to formalize, and in some cases are impossible to prove at all.
For example, consider a function that receives five pointer parameters and, within a loop, alternately reads from and writes to the memory they point to. A person reading the code immediately knows these pointers refer to five independently allocated arrays and cannot alias each other. For a compiler to prove the same thing, it must perform interprocedural alias analysis: tracing the source of each pointer, crossing function call boundaries, and analyzing all possible execution paths. This often fails in practice, especially with cross-translation-unit calls, memory allocated through opaque wrappers, or multi-dimensional arrays. The compiler’s response is to be conservative: either insert runtime alias checks (increasing code size) or simply give up on the optimization (losing performance).
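To make this concrete, here is a minimal sketch (a hypothetical function, not taken from the benchmarks) of the pattern described above:

```c
#include <stddef.h>

/* A human who sees the call sites knows the five buffers come from
 * independent allocations. Inside this function the compiler cannot prove
 * it, so every store to out1/out2 may alias later loads of in1..in3. */
void combine(double *out1, double *out2,
             const double *in1, const double *in2, const double *in3,
             size_t n) {
    for (size_t i = 0; i < n; i++) {
        out1[i] = in1[i] * in2[i];
        out2[i] = out1[i] + in3[i];
    }
}

/* The same function with the non-aliasing assumption stated explicitly: */
void combine_restrict(double *restrict out1, double *restrict out2,
                      const double *restrict in1, const double *restrict in2,
                      const double *restrict in3, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out1[i] = in1[i] * in2[i];
        out2[i] = out1[i] + in3[i];
    }
}
```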
LLMs naturally stand on the side of semantic understanding. Compilers naturally stand on the side of formal proof. The existence of this gap means that if an LLM can translate its semantic understanding into a form consumable by the compiler (attributes, metadata), it can directly unlock optimizations that the compiler already knows how to do but is afraid to perform.
This is not just about the restrict keyword. LLVM IR has
a series of attributes (noalias, nonnull,
align, nsw/nuw,
readonly, dereferenceable,
branch weights, etc.), each corresponding to a class of
properties the compiler wants to know but cannot prove. Systematically
identifying which attributes have the largest semantic-proof gap, how
far LLMs can go, and the reasons for their limitations is a meaningful
research contribution in itself.
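Several of these IR attributes already have source-level spellings an annotator could emit. A hedged sketch using standard Clang/GCC extensions (the function names here are made up for illustration):

```c
#include <stdlib.h>

/* nonnull parameter -> `nonnull` attribute in IR */
__attribute__((nonnull))
void scale_in_place(double *p, double s) { *p *= s; }

/* malloc-like allocator -> `noalias` on the return value in IR */
__attribute__((malloc))
double *fresh_buffer(size_t n) { return malloc(n * sizeof(double)); }

/* __builtin_expect -> branch-weights metadata in IR */
int check_status(int err) {
    if (__builtin_expect(err != 0, 0))
        return -1;   /* cold error path, rarely taken */
    return 0;
}
```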
A compiler’s optimization capability far exceeds its proof capability. It knows hundreds of optimization techniques but gets stuck at the “cannot prove safety” step in a large amount of real-world code. The fundamental reason is that the C/C++ type system is too weak and lacks sufficient means to encode semantic information.
Take gesummv from Polybench as an example: the function
receives five array parameters: matrices A, B and vectors
tmp, x, y, and reads and writes multiple arrays in the
inner loop. What the compiler sees are five pointers of type
double (*)[N]. The C type system has no mechanism to
express that “these five pointers point to non-overlapping memory.” Even
if each array is independently malloced, as long as the
allocation occurs in another translation unit (such as Polybench’s
polybench_alloc_data function), the compiler cannot see the
allocation point in the current translation unit and thus cannot make
the inference.
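The shape of the kernel, as a close paraphrase of Polybench’s gesummv (the real version gets alpha/beta and the data from the harness):

```c
/* gesummv computes y = alpha*A*x + beta*B*x. Five array parameters arrive
 * as bare pointers; inside the function, nothing rules out the store to
 * tmp[i] clobbering x, or A overlapping y. */
void gesummv(int n, double alpha, double beta,
             double A[n][n], double B[n][n],
             double tmp[n], double x[n], double y[n]) {
    for (int i = 0; i < n; i++) {
        tmp[i] = 0.0;
        y[i] = 0.0;
        for (int j = 0; j < n; j++) {
            tmp[i] += A[i][j] * x[j];  /* could this store change x[j]?      */
            y[i]   += B[i][j] * x[j];  /* the compiler cannot prove it can't */
        }
        y[i] = alpha * tmp[i] + beta * y[i];
    }
}
```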
Control group: Rust’s borrow checker naturally provides LLVM with
alias information that C cannot express (&mut reference
automatically carries noalias semantics). With the same
LLVM backend, IR generated by the Rust frontend usually achieves better
optimization. This isn’t because the Rust compiler is smarter, but
because it sees more information.
The direction of this article is essentially: let AI act as a “virtual type system” to provide C/C++ code with semantic information similar to the Rust level, without requiring programmers to change their programming habits.
Experimental Verification: The size of the gap depends on code
complexity. We used restrict (corresponding to
noalias in LLVM IR) as the first test attribute to verify
the existence and size distribution of the gap on two sets of benchmarks
with different complexities.
The first set of experiments used 10 hand-written kernels, covering
calculation patterns such as vector update, reduction, matrix
multiplication, stencil, sparse matrix, convolution, prefix sum, and
histogram, with one-dimensional array pointer parameters. All kernels
were marked noinline to prevent the compiler from seeing
the allocation point through inlining. Clang 19, -O2, Apple Silicon.
Results: Only 1/10 kernels (Stream Triad) showed a measurable runtime improvement (+5.4%). 3/10 kernels had an 8-18% reduction in code size. The remaining 7/10 were completely unaffected.
Opportunity Sizing
The reasons for being unaffected fall into three categories. First, reduction patterns (Dot Product, GEMV, Conv 2D): the inner loop accumulates into local scalar variables and never writes to arrays, so the compiler can vectorize without alias information. Second, patterns that cannot be vectorized at all (Prefix Sum, Histogram, SpMV), where the performance bottleneck is not alias analysis. Third, kernels where the compiler had generated runtime alias checks in the plain version (comparing pointer address ranges, then choosing the vectorized or scalar fallback path) and restrict merely eliminated those checks: the check itself is O(1) while the loop body is O(N), so for large arrays the check overhead is diluted to the point of being negligible.
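What the runtime alias check looks like, written out by hand (a conceptual sketch of the shape the vectorizer emits, not literal compiler output):

```c
#include <stdint.h>

/* Conceptual shape of Clang's generated code for the plain version: an
 * O(1) range-overlap test guards an O(N) loop, so for large N the runtime
 * cost of the check vanishes while the code-size cost (two copies of the
 * loop body) remains. */
void axpy_with_check(double *y, const double *x, double s, int n) {
    uintptr_t ys = (uintptr_t)y, ye = (uintptr_t)(y + n);
    uintptr_t xs = (uintptr_t)x, xe = (uintptr_t)(x + n);
    if (ye <= xs || xe <= ys) {
        for (int i = 0; i < n; i++)   /* fast path: would be vectorized */
            y[i] += s * x[i];
    } else {
        for (int i = 0; i < n; i++)   /* scalar fallback for overlap */
            y[i] += s * x[i];
    }
}
```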
At first glance, the value of restrict seems limited.
But this conclusion only holds for simple one-dimensional array
patterns.
The second set of experiments used 10 kernels from Polybench/C 4.2.1
(a standard benchmark suite in the field of compiler optimization),
covering four subcategories: BLAS, linear algebra kernels, stencil, and
data mining. Polybench kernels use two-dimensional VLA parameters
(double A[restrict N][N]), involving more complex
multi-dimensional array access patterns and more pointer parameters.
Polybench comes with restrict support
(-DPOLYBENCH_USE_RESTRICT), and noinline was
similarly injected.
The results were completely different: 6/10 kernels showed a runtime
improvement of >5%, reaching up to +34.9% (BiCG) and +25.7% (SYR2K).
Correctness verification confirmed that the outputs of the plain and
restrict versions of the 10 kernels under
-DPOLYBENCH_DUMP_ARRAYS were identical;
restrict only affected performance and did not change the
calculation results.
Polybench Results
| Kernel | Runtime Speedup | Code Size Change | Mechanism |
|---|---|---|---|
| BiCG | +34.9% | +88% | Vectorization enabled |
| SYR2K | +25.7% | +80% | Vectorization enabled |
| FDTD 2D | +13.2% | -5% | Alias check elimination |
| Jacobi 2D | +9.8% | -28% | Alias check elimination |
| Heat 3D | +9.5% | -35% | Alias check elimination |
| SYMM | +6.9% | -4% | Alias check elimination |
| 2MM | +2.7% | -2% | — |
| Correlation | +1.3% | -17% | — |
| MVT | +1.1% | -31% | — |
| GEMM | +0.4% | -16% | — |
By comparing -Rpass vectorization reports, the gains
from restrict come from two distinct mechanisms.
Mechanism 1: Vectorization enabled. In BiCG and SYR2K, the compiler
completely gave up on vectorizing the inner loop in the plain version
(reporting “loop not vectorized” or “vectorization not beneficial”).
After adding restrict, the same loop became “vectorized
(width: 2, interleaved: 4)”. It’s not that the compiler doesn’t know how
to vectorize, but that it doesn’t dare to—because it cannot prove that
multiple array parameters do not alias. This is exactly the scenario
where the semantic-proof gap is largest: an LLM can see at a glance that
five array parameters come from independent allocations, while the
compiler needs to traverse the entire call chain to (attempt to) prove
it. This gap produced a 25-35% runtime gain. The code actually became
larger (+80-88%) because the vectorized loop itself occupies more
instruction space than the scalar loop.
Mechanism 2: Alias check elimination. For the three stencil kernels
(Jacobi 2D, FDTD 2D, Heat 3D), the vectorization decisions were
identical in both versions. restrict eliminated the runtime
alias check branches and scalar fallback paths, resulting in a 9-13%
runtime gain and a 5-35% reduction in code size.
The comparison between the two sets of experiments reveals a key pattern: the size of the gap depends on the difficulty of the compiler’s alias analysis. One-dimensional arrays, two pointers → the compiler can handle it itself (small gap). Two-dimensional VLA, five pointers with interleaved reads and writes → the compiler is helpless (large gap). And it is precisely the latter—complex data structures and algorithm implementations—that require AI assistance.
restrict/noalias is just one in the LLVM
attribute system. Systematically mapping the “semantic understanding
difficulty” and “formal proof difficulty” of each attribute is necessary
to find the full opportunity space for AI-assisted compiler
optimization.
Below is a preliminary analysis of major attributes in LLVM:
| Attribute | Compiler Proof Difficulty | LLM Identification Difficulty | Unlocked Optimization | Analysis |
|---|---|---|---|---|
| noalias | High (interprocedural alias analysis) | Low (just look at allocation points) | Vectorization, LICM | Verified. Up to +35% on Polybench |
| nonnull | Medium-high (interprocedural null flow) | Low (look at allocation and error handling) | Null check elimination | malloc return values, pointers after null checks. LLM can directly see control flow |
| align(N) | Medium-high (needs to track alignment through allocation chains) | Low (knows alignment guarantees of allocation APIs) | Aligned SIMD load/store | Verified. After injecting assume_aligned(64) for posix_memalign(4096) return values, only 5 of 30 kernels had >2% speedup, and 14 slowed down. Weak leverage on Apple Silicon |
| nsw/nuw | High (value range analysis) | Medium (needs to understand value ranges) | Loop optimization, arithmetic simplification | Loop variables incrementing from 0 to N cannot overflow; LLM can infer this from loop structure. Overflow reasoning for nested calculations is harder |
| readonly/writeonly | High (requires whole-program analysis) | Low-medium (read the function body) | CSE, dead store elimination, LICM | LLM is very accurate for single-function analysis; may miss cross-module dependencies |
| dereferenceable(N) | High (escape analysis + lifetime) | Low (look at buffer size and lifetime) | Speculative load, LICM | LLM knows the pointer returned by malloc(100) is dereferenceable(100) |
| branch weights | Impossible (unknown without PGO data) | Medium (understand error handling vs. normal path) | Code layout, branch prediction | The most extreme gap: the compiler is completely blind without PGO data, while an LLM can understand "this is error handling, rarely taken" |
| !range | High (value range propagation) | Medium (understand input domain constraints) | Bounds check elimination, narrowing | Enum values, array indices, validated inputs. LLM can understand "this value is a pixel in 0-255" |
A few observations.
Attributes with the largest gap: noalias (verified) and
branch weights. The former because interprocedural alias
analysis is one of the hardest analyses for a compiler. The latter
because the compiler has zero information about branch probabilities
without a PGO profile, while an LLM can understand from code semantics
which are error paths and which are hot paths.
Dimensions LLMs excel at: pointer-related attributes
(noalias, nonnull,
dereferenceable, align), because these
properties are usually determined at the allocation point, and the LLM
can directly see the semantics of the allocation API.
Dimensions LLMs might struggle with: value range-related attributes
(nsw/nuw, !range), which require
reasoning about arithmetic operation chains; LLM’s numerical reasoning
capability is a known weakness. However, simple cases for loop variables
(incrementing from 0 to N) should be fine.
The most practically valuable direction: attributes that
simultaneously satisfy the three conditions of “high compiler proof
difficulty” + “low LLM identification difficulty” + “large unlocked
optimization gain.” Currently, noalias,
branch weights, and nonnull +
dereferenceable seem to be the highest priority
targets.
As of 2025, there are five main categories of work in this direction, none of which overlap with the idea in this article.
Replacing compiler heuristics. Google MLGO (2021+) uses RL to replace LLVM’s inlining/register allocation decisions and has been deployed in Chrome, Fuchsia, and Android. Magellan/AlphaEvolve (2025) uses Gemini to evolve C++ code for LLVM heuristic functions, performing 4.27% better than upstream heuristics on inlining-for-size. This type of work does not change the amount of information the compiler sees, but only makes better decisions under fixed information conditions.
Predicting optimization pass sequences. Meta LLM Compiler (2024, 7B/13B, trained on 546B token LLVM-IR) predicts the optimal pass sequence, achieving 77% of the effect of autotuning search. IR-OptSet (NeurIPS 2025) provides a 170K-sample LLVM IR optimization dataset; fine-tuned LLMs exceed -O3 on some cases. Again, this is about making choices under existing information.
Rewriting source code. LLNL’s CompilerGPT (2025) has LLMs rewrite code after reading compiler optimization remarks, achieving a 6.5x improvement on prefix sum. This changes the program implementation itself and requires proof of semantic equivalence.
Domain-specific pragma insertion. TimelyHLS / LIFT (2024-2025) has
LLMs insert FPGA HLS pragmas (#pragma HLS pipeline,
#pragma HLS unroll) into C code, achieving 3.5x speedup.
This verified the feasibility of the “AI inserts pragma → compiler
consumes” pipeline in the HLS vertical field, but the correctness
constraints for HLS pragmas are much looser than semantic annotations
for general-purpose compilers (wrong HLS pragmas usually just result in
poor performance, not wrong results).
LLM-enhanced static analysis. The CAFD system by Cheng et
al. (PKU/HUST/NTU/Ant/UNSW, 2025) uses LLMs (Qwen3-14B, LLaMA-3.2-3B,
Phi-4, DeepSeek-R1) to detect custom memory allocation functions in C
code. Core observation: a large number of heap allocations in real C
projects are completed through project-specific wrappers like
xmalloc and g_new, which standard pointer
analysis tools (such as SVF) cannot identify, leading to alias set
explosion. CAFD found 700+ custom allocation functions in 17 large C
projects, reducing the alias set by 41.5%. This work verified the path
of “LLM semantic understanding → more precise alias information,” but
the direction is analyzing existing code to recover lost semantic
information. This article proposes embedding it directly during the
generation phase. The two are complementary: CAFD handles legacy code,
while this solution handles incremental code.
Multiple studies (Licorish et al. 2025, Molison et al. MSR 2025, Abbassi et al. ICSME 2025) compared the quality of LLM-generated code and human-written code. The key fact is: these studies all focus on dimensions such as correctness, maintainability, and security; no study has discussed the “compiler optimizability” of AI-generated code.
The LLVM community has firsthand experience with this problem.
“Type-Alias Analysis: Enabling LLVM IR with Accurate Types” (Zhou et
al., ISSTA 2025) pointed out that LLVM lost type information after
switching to opaque pointers, leading to alias analysis degradation.
nikic’s 2024 LLVM annual review also mentioned that the new
nuw flag on GEP was because “frontends previously had no
means to convey to LLVM that the offset is non-negative.” These are all
direct evidence of the semantic gap.
Alive2 (Lopes et al., PLDI 2021) is a translation validation tool for LLVM that can formally verify the correctness of IR transformations. It can be used to verify whether AI-generated attributes introduce undefined behavior. In 2024, there was work combining fine-tuned LLMs with Alive2, using LLMs to predict transformation correctness when the SMT solver times out, and then confirming with fuzzing. This “formal + AI + fuzzing” layered verification strategy can be directly reused.
Metadata generated by AI at time t=0 is correct, but the code is
constantly modified at t=1, t=2, … Modifications may break the
assumptions the metadata depends on. For example, initially
a and b point to independent
mallocs, but later someone changes it to
b = a + offset, at which point restrict no
longer holds.
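The drift hazard, sketched with hypothetical names: the annotation was valid at t=0, then an ordinary-looking edit at t=1 silently violates it.

```c
#include <stdlib.h>

/* The annotated kernel: valid under the t=0 call pattern. */
void double_into(double *restrict dst, const double *restrict src, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = 2.0 * src[i];
}

/* t=0: two independent allocations -- the restrict contract holds. */
void caller_t0(int n) {
    double *a = calloc((size_t)n, sizeof *a);   /* zero-initialized */
    double *b = malloc((size_t)n * sizeof *b);
    if (a && b)
        double_into(b, a, n);
    free(a);
    free(b);
}

/* t=1: a later edit reuses the same buffer at an offset. The call now
 * violates the restrict contract -- undefined behavior, with no warning. */
void caller_t1(double *a, int n) {
    double *b = a + 1;
    double_into(b, a, n - 1);   /* dst and src overlap */
}
```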
This is the core challenge and the most research-worthy problem.
Even at the time of generation, an LLM may generate inconsistent code
and metadata. For example, the code actually does
b = a + 1, but the metadata declares
restrict(a, b). The correctness of metadata is a
second-layer requirement on top of code correctness.
Existing training data rarely contains code with optimization
annotations. Human programmers almost never write restrict,
__builtin_expect, or alignment attributes. LLMs need to be
explicitly guided to generate these annotations.
A function’s metadata may depend on the behavior of the caller or callee. Change detection across modules is harder than within a module.
JIT compilers (V8, HotSpot/Graal, PyPy) have long used a mature pattern to handle “assumption-based optimization”: collect profile → generate optimized code based on assumptions and insert guards → deoptimize when guards fail. “Formally Verified Speculation and Deoptimization in a JIT Compiler” (POPL 2021) formally verified this pattern using Coq.
AI-generated metadata is equivalent to “assumptions” in JIT, guards are equivalent to runtime assertions, and deoptimization is equivalent to falling back to conservative compilation without metadata. The difference is that JIT does this dynamically at runtime, while AOT scenarios need to complete it at compile-time + test-time.
AI-generated metadata is essentially a contract.
restrict(a, b) is an invariant: “During the execution of
this function, a and b do not point to
overlapping memory.” The reason DbC hasn’t gained traction in mainstream
languages is that the maintenance cost is too high. But if contracts are
automatically generated and maintained by AI, this cost barrier
disappears.
Layer 1 (compile-time): Static analysis checks if metadata clearly contradicts the code. Layer 2 (test-time): Each metadata is accompanied by a runtime assertion, enabled in debug builds and CI. Layer 3 (formal verification): Use Alive2 to verify performance-critical paths. If any layer detects a problem, the metadata is automatically removed (graceful degradation).
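A sketch of what the Layer 2 runtime assertion could look like for a restrict(a, b) contract (the macro name is made up; it compiles away under NDEBUG):

```c
#include <assert.h>
#include <stdint.h>

/* Debug-build companion to a restrict annotation: assert that the two
 * n-element ranges do not overlap. Enabled in debug builds and CI. */
#ifndef NDEBUG
#define ASSERT_NOALIAS(a, b, n)                              \
    assert((uintptr_t)((a) + (n)) <= (uintptr_t)(b) ||       \
           (uintptr_t)((b) + (n)) <= (uintptr_t)(a))
#else
#define ASSERT_NOALIAS(a, b, n) ((void)0)
#endif

void vadd(double *restrict a, const double *restrict b, int n) {
    ASSERT_NOALIAS(a, b, n);    /* contract check mirrors the annotation */
    for (int i = 0; i < n; i++)
        a[i] += b[i];
}
```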
Goal: Determine the gain distribution of
restrict/noalias under different code
complexities.
Completed: (1) 10 hand-written kernels (one-dimensional arrays, simple patterns), (2) 10 Polybench kernels (two-dimensional VLAs, complex patterns). Measurements of runtime performance and assembly code size differences are complete.
Core finding: the size of the gap depends on code complexity. For simple one-dimensional patterns, the compiler can cope on its own (1/10 kernels with >5% speedup); for complex multi-dimensional patterns, the compiler’s alias analysis fails (6/10 kernels with >5% speedup, up to +35%).
restrict generates gains through two mechanisms:
vectorization enablement (when the gap is largest) and alias check
elimination (when the gap is medium).
Goal: Push the performance upper bound for more optimizations on all 30 Polybench kernels, while performing attribution analysis on the boundaries of LLM’s capabilities.
Methodology: holistic profile → cross-kernel aggregation → batch optimization
Instead of analyzing kernel by kernel, first collect
-Rpass-missed reports for all kernels and aggregate
statistics on the type distribution of missed optimizations across
kernels. This way, even if a certain optimization is not the largest
bottleneck in each kernel, as long as it appears in most kernels, its
total leverage is worth prioritizing. After determining priorities,
apply the same class of optimization to all kernels in batch and measure
the global effect at once.
Missed optimization panorama: Complete -Rpass-missed
remarks were collected for the plain and restrict versions
of the 30 kernels. Restrict resolved 126 GVN (redundant
load), 96 SLP vectorizer, 71 LICM (load hoisting), and 23 loop-vectorize
missed optimizations. Residual problems ranked by kernel coverage: LICM
still fails (19/30 kernels), cost model rejects vectorization (14/30),
GVN not eliminated (11/30). A typical residual pattern in LICM is that
in the accumulator q[i] += A[i][j] * p[j],
q[i] cannot be hoisted out of the loop, presumably because
LLVM’s alias analysis cannot fully propagate parameter-level
noalias to the load/store of VLA derived pointers.
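The residual pattern, with the hoisting that LICM fails to perform written out by hand:

```c
/* In q[i] += A[i][j] * p[j], the load and store of q[i] are invariant
 * across the j loop, but LLVM will not hoist them unless it can prove the
 * stores to q cannot clobber A or p. Accumulating into a scalar performs
 * the transformation manually: */
void gemv_acc(int n, double q[n], double A[n][n], const double p[n]) {
    for (int i = 0; i < n; i++) {
        double acc = q[i];           /* hoisted load           */
        for (int j = 0; j < n; j++)
            acc += A[i][j] * p[j];   /* register accumulation  */
        q[i] = acc;                  /* store sunk out of loop */
    }
}
```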
Forced vectorization verification: After overriding cost model
decisions with -mllvm -force-vector-width=N, 13 out of 30
kernels achieved >10% additional speedup, reaching up to +84%
(gesummv). At the same time, 2 kernels slowed down
(symm -22%, correlation -10%), indicating that
the cost model’s conservative judgment in some cases is correct.
The cumulative effect of the two optimizations is significant.
Restrict alone: 10/30 kernels have >5% speedup, mean
+7.3%. After adding fvw: 19/30 have >5% speedup (63%),
13/30 have >20% speedup (43%), mean +30.6%. Maximum cumulative
speedup +88.4% (bicg) and +88.2% (trisolv).
Correctness verification: the output of the restrict
version of 30/30 kernels is identical to the plain version.
Attribution analysis of four types of kernels. The 30 kernels are
divided into four categories based on “which optimization technique is
effective”: (A) Dual gains from restrict + fvw
(4 kernels, e.g., bicg/trisolv), with complex
multi-pointer patterns AND inaccurate cost model. (B) Primarily
benefiting from restrict (3 kernels, e.g.,
syrk +32%), where alias analysis is the only bottleneck.
(C) Primarily benefiting from fvw (9 kernels, e.g.,
gesummv +85%/durbin +78%), where alias is not
an issue but the cost model misjudged. (D) No significant gain (7
kernels, e.g., floyd-warshall), where the calculation
pattern is unsuitable or the compiler is already good enough.
This reveals two different types of semantic-proof gaps. The first is the correctness proof gap: the compiler lacks alias information and cannot prove the safety of optimization. LLMs compensate by understanding memory layout (high risk, wrong labeling is UB). The second is the strategy judgment gap: the compiler has the ability to perform optimization but the cost model makes a wrong judgment. LLMs compensate by understanding calculation patterns (low risk, wrong choice just results in slowdown). The two are complementary and cumulative; a single technique can only cover 23-43% of kernels, while the combination covers 63%.
Experiments with alignment annotations negated the significance of a
third gap. Polybench’s memory allocation function internally uses
posix_memalign(&ret, 4096, ...), so all arrays are at
least 4096-byte aligned. But what the compiler sees is a
void* return value and doesn’t know the alignment
guarantee. After passing this information to the compiler by adding
__attribute__((assume_aligned(64))) to the function
declaration, only 5 out of 30 kernels had >2% speedup
(gesummv +30%, durbin +29%,
gemver +29%), and 14 actually slowed down (average -12%),
with an overall mean of -2.2%. The reason is that the performance
difference between unaligned load and aligned load on Apple Silicon NEON
is almost non-existent, so the impact of alignment information on the
cost model is very small on this architecture. Extra attributes might
also disturb LLVM’s pass sequence, leading to suboptimal instruction
scheduling. This negative result confirms the priority ranking:
noalias >> vectorization advice >> alignment
annotation. However, the conclusion might be different on x86 platforms
(especially early SSE requiring 16-byte alignment), which is worth
verifying in the future.
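The annotation used in this experiment, sketched (the function name stands in for Polybench’s allocator wrapper):

```c
#include <stdint.h>
#include <stdlib.h>

/* Tell the compiler what posix_memalign(..., 4096, ...) already
 * guarantees. 64 is claimed rather than 4096 because cache-line alignment
 * is all the vectorizer can exploit. */
__attribute__((assume_aligned(64)))
void *alloc_data(size_t n) {
    void *p = NULL;
    if (posix_memalign(&p, 4096, n) != 0)
        return NULL;
    return p;
}
```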
From learnings to agentic optimizer: An effective agentic optimizer
needs two modules. Alias annotator analyzes function signatures and call
contexts to provide noalias information (conservative
strategy). Vectorization advisor analyzes loop structures and
-Rpass-missed output to provide vectorization suggestions
(speculate-then-verify strategy). The two modules run independently, and
their results are cumulative. Both modules have been implemented as
opencode skills, with a design philosophy based on “result determinism”:
not prescribing the process, but only defining verification standards,
allowing the agent to iterate on its own in a feedback loop. The
decision results for each kernel (including failed samples) are appended
to optimizer_learnings.md, forming a knowledge
flywheel.
Polybench’s code patterns are relatively homogeneous (linear algebra and stencil). Is the discovery of the two types of gaps just a coincidence of a specific benchmark? TSVC-2 (Test Suite for Vectorizing Compilers, v2) provides a different verification scenario: 151 kernels, covering various calculation patterns such as reduction, loop restructuring, induction variable identification, control flow, and indirect addressing.
The structural characteristics of TSVC-2 eliminate Gap 1. TSVC-2 uses
global static arrays (__attribute__((aligned(64)))), so the
compiler naturally knows these arrays do not alias each other.
Therefore, restrict experiments are meaningless on TSVC-2.
This in turn provides a clean isolation: experimental results on TSVC-2
purely reflect Gap 2 (cost model misjudgment), without mixing in
contributions from Gap 1.
fvw experiments on 146 valid kernels show that Gap 2 is
a universal phenomenon. 33/146 kernels (23%) achieved >5% speedup
through fvw override, 19 achieved >20%, mean +9.2%.
Compared to Polybench (fvw contributing 63% coverage),
TSVC-2’s 23% seems lower, but the difference mainly stems from the
simultaneous existence of Gap 1 in Polybench. Looking only at the
contribution of Gap 2 (Category C kernels in Polybench 9/30 = 30%), the
performance of the two benchmarks is consistent.
The most important discovery: 14 pure cost model gap cases. In these
kernels, the compiler completely failed to auto-vectorize (autovec gain
~0%), but achieved 25-94% speedup after forced vectorization. By
pattern: the first 6 are all reductions (sum: s311 +93%, product: s312
+94%, coupled: s319 +89%), and the rest involve indirect addressing and
induction variable identification. The compiler rejects vectorizing FP
reduction under -O3 (without -ffast-math) because the
reduction tree changes the operation order. But in fact, the numerical
sensitivity of these kernels is very low, and the FP difference brought
by vectorization (maximum relative diff 6.87e-4) is completely
acceptable. This is a typical strategy judgment gap: the compiler has
the ability to do this optimization, and safety is not an issue (just FP
order change, not UB), but the cost model judges it “not worth it.”
Hardware implications of FVW4 being superior to FVW2. Under float32 +
128-bit NEON (4 lanes natural width), fvw4 mean +8.9% while
fvw2 mean -3.6%. Forcing SIMD narrower than the natural
width actually results in degradation. This provides a reliable
heuristic for the vectorization advisor: default to hardware natural
vector width.
The compiler cost model is correct in most scenarios. 107/146 kernels
(73%) are unaffected by fvw within ±5%, indicating that the
cost model’s conservatism is correct in most cases. The value of FVW is
concentrated in a few scenarios where the cost model fails, and the
characteristics of these scenarios are learnable (reduction, indirect
addressing, induction variable patterns).
Continue pushing the upper bound. Alignment annotations have been
tested (weak leverage, see above). Next, test the incremental gain of
IR-level !noalias scope metadata, which may resolve the
residual LICM issues in 19/30 kernels (restrict provides
noalias between parameters at the source level, but LLVM’s
alias analysis fails to fully propagate it to the load/store of VLA
derived pointers).
Build an agentic optimizer prototype. Two modules have been
implemented as opencode skills
(skill_compiler_alias_annotator.md and
skill_compiler_vectorization_advisor.md). Each skill
defines verification standards rather than step-by-step processes,
following the “result determinism” design philosophy: alias annotator
requires diff pass + clean compilation + performance quantification;
vectorization advisor requires numerical tolerance within 1e-2 +
performance quantification. Both skills include a knowledge accumulation
mechanism, where the decision for each completed kernel is appended to
optimizer_learnings.md, forming a compounding effect. Next,
test the accuracy of the skills on the 23 non-Category D kernels of
Polybench.
Writing: The data from the full Polybench set + TSVC-2 is sufficient to support a paper. Core contributions: (1) Discovery and quantification of two types of semantic-proof gaps (correctness proof + strategy judgment), (2) Cumulative effect of 63% coverage and +30.6% mean speedup on 30 Polybench kernels, (3) Verification of the universality of Gap 2 on 146 TSVC-2 kernels (23% coverage), (4) Attribution analysis of four types of kernels and the design framework of an agentic optimizer.
The core thesis is not “AI can make code run faster”—that’s too broad. The core thesis is: there exists a gap in program properties that are “easy for semantic understanding but difficult for formal proof,” LLMs are naturally suited to bridge this gap, and this gap directly maps to the optimization capability boundary of compilers.
Experiments on two benchmarks verified this thesis. 30 kernels from
Polybench/C verified the existence and cumulative effect of the two
types of gaps: the correctness proof gap
(restrict/noalias) covers 10/30 kernels, and
the strategy judgment gap (fvw) covers an additional 9/30
kernels; after accumulation, 19/30 kernels (63%) achieved >5%
speedup, with a mean of +30.6% and a maximum of +88.4%. 146 kernels from
TSVC-2 verified the universality of Gap 2: 33/146 (23%) achieved >5%
speedup, of which 14 were pure cost model gap cases (compiler completely
failed to auto-vectorize, but actually achieved 25-94% speedup). The
negative result of alignment annotations indicates that not all semantic
information has equal value, and the choice of attributes itself
requires architecture-aware judgment.
The two types of gaps correspond to two agentic optimizer modules. Alias annotator handles Gap 1 (conservative strategy, wrong judgment is UB), and vectorization advisor handles Gap 2 (speculate-then-verify, wrong judgment is just a slowdown). Both modules have been implemented as opencode skills, following the “result determinism” principle in design, replacing process prescriptions with verifiable delivery standards. Next, test the accuracy and degree of automation of the skills on the 23 non-Category D kernels of Polybench.
First draft: 2026-03-07, Updated: 2026-03-08. Experiments on 30
Polybench kernels + 146 TSVC-2 kernels completed. Includes four rounds
of experiments: restrict, forced vectorization, alignment
annotations, and TSVC-2 generalization verification. Correctness passed
for 30/30 Polybench kernels, with 19/30 having >5% speedup. 33/146
TSVC-2 kernels have >5% speedup (pure Gap 2). Design of two agentic
optimizer skills completed.