LLM Semantic Hints for Compiler Optimization


Core Observation

Programs have many properties that are obvious at the semantic level but require complex analysis for a compiler to formalize, or may even be impossible to prove.

For example, a function receives five pointer parameters and alternately reads from and writes to the memory pointed to by these pointers within a loop. A person reading the code immediately knows these pointers point to five independently allocated arrays and cannot alias each other. However, for a compiler to prove this, it needs to perform interprocedural alias analysis: tracing the source of each pointer, crossing function call boundaries, and analyzing all possible execution paths. This often fails in practice, especially in scenarios involving cross-translation unit calls, memory allocation through opaque wrappers, or the use of multi-dimensional arrays. The compiler’s response is to handle this conservatively: either by inserting runtime alias checks (increasing code size) or by simply giving up on optimization (losing performance).
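A minimal sketch of the contrast (hypothetical function names, not code from any benchmark): the compiler sees only pointer types, not allocation sites, so without `restrict` it must assume every store through `out` may change what `a` or `b` point to.

```c
#include <stddef.h>

/* Without restrict: the compiler must assume stores through `out`
   can alias `a` or `b`, forcing reloads or runtime alias checks. */
void scale_add(double *out, const double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * 2.0 + b[i];
}

/* With restrict: the programmer asserts the three regions never
   overlap, so the compiler may vectorize without any alias checks. */
void scale_add_restrict(double *restrict out, const double *restrict a,
                        const double *restrict b, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * 2.0 + b[i];
}
```

Both versions compute the same result on non-overlapping inputs; only the compiler's freedom to optimize differs.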

LLMs naturally stand on the side of semantic understanding. Compilers naturally stand on the side of formal proof. The existence of this gap means that if an LLM can translate its semantic understanding into a form consumable by the compiler (attributes, metadata), it can directly unlock optimizations that the compiler already knows how to do but is afraid to perform.

This is not just about the restrict keyword. LLVM IR has a series of attributes (noalias, nonnull, align, nsw/nuw, readonly, dereferenceable, branch weights, etc.), each corresponding to a class of properties the compiler wants to know but cannot prove. Systematically identifying which attributes have the largest semantic-proof gap, how far LLMs can go, and the reasons for their limitations is a meaningful research contribution in itself.

Why Compilers Can’t Do It

A compiler’s optimization capability far exceeds its proof capability. It knows hundreds of optimization techniques but gets stuck at the “cannot prove safety” step on a large fraction of real-world code. The fundamental reason is that the C/C++ type system is too weak, lacking the means to encode semantic information.

Take gesummv from Polybench as an example: the function receives five array parameters: matrices A, B and vectors tmp, x, y, and reads and writes multiple arrays in the inner loop. What the compiler sees are five pointers of type double (*)[N]. The C type system has no mechanism to express that “these five pointers point to non-overlapping memory.” Even if each array is independently malloced, as long as the allocation occurs in another translation unit (such as Polybench’s polybench_alloc_data function), the compiler cannot see the allocation point in the current translation unit and thus cannot make the inference.
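A simplified sketch of the kernel’s shape may help (fixed N and abbreviated body; the real Polybench source uses runtime problem sizes and its own scaffolding):

```c
#define N 4  /* Polybench uses runtime problem sizes; fixed here for brevity */

/* Simplified sketch of gesummv (y = alpha*A*x + beta*B*x). The compiler
   sees five bare pointers; with -DPOLYBENCH_USE_RESTRICT each parameter
   gains a restrict qualifier, e.g. double A[restrict N][N]. */
void gesummv_sketch(double alpha, double beta,
                    double A[N][N], double B[N][N],
                    double tmp[N], double x[N], double y[N]) {
    for (int i = 0; i < N; i++) {
        tmp[i] = 0.0;
        y[i] = 0.0;
        for (int j = 0; j < N; j++) {
            tmp[i] += A[i][j] * x[j];  /* reads A, x; writes tmp */
            y[i]   += B[i][j] * x[j];  /* reads B, x; writes y   */
        }
        y[i] = alpha * tmp[i] + beta * y[i];
    }
}
```

The inner loop interleaves reads of A, B, x with writes to tmp and y, which is exactly the pattern where the compiler must prove five pointers pairwise non-aliasing.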

Control group: Rust’s borrow checker naturally provides LLVM with alias information that C cannot express (&mut reference automatically carries noalias semantics). With the same LLVM backend, IR generated by the Rust frontend usually achieves better optimization. This isn’t because the Rust compiler is smarter, but because it sees more information.

The direction of this article is essentially: let AI act as a “virtual type system” to provide C/C++ code with semantic information similar to the Rust level, without requiring programmers to change their programming habits.

Experimental Verification: The size of the gap depends on code complexity. We used restrict (corresponding to noalias in LLVM IR) as the first test attribute to verify the existence and size distribution of the gap on two sets of benchmarks with different complexities.

Simple Mode: Compilers Already Handle It Well

The first set of experiments used 10 hand-written kernels, covering calculation patterns such as vector update, reduction, matrix multiplication, stencil, sparse matrix, convolution, prefix sum, and histogram, with one-dimensional array pointer parameters. All kernels were marked noinline to prevent the compiler from seeing the allocation point through inlining. Clang 19, -O2, Apple Silicon.

Results: Only 1/10 kernels (Stream Triad) showed a measurable runtime improvement (+5.4%). 3/10 kernels had an 8-18% reduction in code size. The remaining 7/10 were completely unaffected.

Opportunity Sizing

The reasons for being unaffected fall into three categories. First, reduction patterns (Dot Product, GEMV, Conv 2D): the inner loop accumulates results into local scalar variables and the loop body does not write to arrays, so the compiler can vectorize without alias information. Second, patterns that cannot be vectorized (Prefix Sum, Histogram, SpMV): the performance bottleneck is not alias analysis. Third, diluted check overhead: for the affected kernels, the compiler generated runtime alias checks in the plain version (comparing pointer address differences, then branching to the vectorized or scalar fallback path), and restrict eliminated these checks; but the check itself is O(1) while the loop body is O(N), so for large arrays the check overhead is negligible.
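Written out by hand, the runtime check the compiler emits amounts to something like the following (an illustration of the generated structure for a Stream Triad kernel, not actual clang codegen):

```c
#include <stddef.h>
#include <stdint.h>

/* O(1) overlap test on two pointer ranges of `bytes` bytes each. */
static int may_overlap(const void *p, const void *q, size_t bytes) {
    uintptr_t a = (uintptr_t)p, b = (uintptr_t)q;
    return a < b + bytes && b < a + bytes;
}

/* Hand-written sketch of the structure the compiler generates for the
   plain (non-restrict) triad: check overlap once, then branch between
   a vectorizable path and a scalar fallback. */
void triad(double *a, const double *b, const double *c,
           double scalar, size_t n) {
    size_t bytes = n * sizeof(double);
    if (!may_overlap(a, b, bytes) && !may_overlap(a, c, bytes)) {
        /* fast path: stores to a[] provably cannot modify b[] or c[],
           so the compiler may vectorize this loop */
        for (size_t i = 0; i < n; i++) a[i] = b[i] + scalar * c[i];
    } else {
        /* scalar fallback preserving the aliasing semantics */
        for (size_t i = 0; i < n; i++) a[i] = b[i] + scalar * c[i];
    }
}
```

The check runs once per call while the loop runs N times, which is why its cost vanishes for large arrays; restrict’s benefit here is mostly the removed duplicate loop body (code size), not runtime.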

At first glance, the value of restrict seems limited. But this conclusion only holds for simple one-dimensional array patterns.

Complex Mode: The Compiler Gap Significantly Widens

The second set of experiments used 10 kernels from Polybench/C 4.2.1 (a standard benchmark suite in the field of compiler optimization), covering four subcategories: BLAS, linear algebra kernels, stencil, and data mining. Polybench kernels use two-dimensional VLA parameters (double A[restrict N][N]), involving more complex multi-dimensional array access patterns and more pointer parameters. Polybench comes with restrict support (-DPOLYBENCH_USE_RESTRICT), and noinline was similarly injected.

The results were completely different: 6/10 kernels showed a runtime improvement of >5%, reaching up to +34.9% (BiCG) and +25.7% (SYR2K). Correctness verification confirmed that the outputs of the plain and restrict versions of the 10 kernels under -DPOLYBENCH_DUMP_ARRAYS were identical; restrict only affected performance and did not change the calculation results.

Polybench Results

| Kernel | Runtime Speedup | Code Size Change | Mechanism |
|---|---|---|---|
| BiCG | +34.9% | +88% | Vectorization enabled |
| SYR2K | +25.7% | +80% | Vectorization enabled |
| FDTD 2D | +13.2% | -5% | Alias check elimination |
| Jacobi 2D | +9.8% | -28% | Alias check elimination |
| Heat 3D | +9.5% | -35% | Alias check elimination |
| SYMM | +6.9% | -4% | Alias check elimination |
| 2MM | +2.7% | -2% | — |
| Correlation | +1.3% | -17% | — |
| MVT | +1.1% | -31% | — |
| GEMM | +0.4% | -16% | — |

Comparing the -Rpass vectorization reports shows that the gains from restrict come from two distinct mechanisms.

Mechanism 1: Vectorization enabled. In BiCG and SYR2K, the compiler completely gave up on vectorizing the inner loop in the plain version (reporting “loop not vectorized” or “vectorization not beneficial”). After adding restrict, the same loop became “vectorized (width: 2, interleaved: 4)”. It’s not that the compiler doesn’t know how to vectorize, but that it doesn’t dare to—because it cannot prove that multiple array parameters do not alias. This is exactly the scenario where the semantic-proof gap is largest: an LLM can see at a glance that five array parameters come from independent allocations, while the compiler needs to traverse the entire call chain to (attempt to) prove it. This gap produced a 25-35% runtime gain. The code actually became larger (+80-88%) because the vectorized loop itself occupies more instruction space than the scalar loop.

Mechanism 2: Alias check elimination. For the three stencil kernels (Jacobi 2D, FDTD 2D, Heat 3D), the vectorization decisions were identical in both versions. restrict eliminated the runtime alias check branches and scalar fallback paths, resulting in a 9-13% runtime gain and a 5-35% reduction in code size.

The comparison between the two sets of experiments reveals a key pattern: the size of the gap depends on the difficulty of the compiler’s alias analysis. One-dimensional arrays, two pointers → the compiler can handle it itself (small gap). Two-dimensional VLA, five pointers with interleaved reads and writes → the compiler is helpless (large gap). And it is precisely the latter—complex data structures and algorithm implementations—that require AI assistance.

Beyond Restrict: Which Attributes Have the Largest Semantic-Proof Gap?

restrict/noalias is just one attribute in the LLVM attribute system. Systematically mapping the “semantic understanding difficulty” and “formal proof difficulty” of each attribute is necessary to find the full opportunity space for AI-assisted compiler optimization.

Below is a preliminary analysis of major attributes in LLVM:

| Attribute | Compiler Proof Difficulty | LLM Identification Difficulty | Unlocked Optimizations | Notes |
|---|---|---|---|---|
| noalias | High (interprocedural alias analysis) | Low (just look at allocation points) | Vectorization, LICM | Verified. Up to +35% on Polybench |
| nonnull | Medium-high (interprocedural null flow) | Low (look at allocation and error handling) | Null check elimination | malloc return values, pointers after null checks. LLM can directly see control flow |
| align(N) | Medium-high (needs to track alignment through allocation chains) | Low (knows alignment guarantees of allocation APIs) | Aligned SIMD load/store | Verified. After injecting assume_aligned(64) for posix_memalign(4096) return values, only 5 of 30 kernels had >2% speedup, and 14 actually slowed down. Weak leverage on Apple Silicon |
| nsw/nuw | High (value range analysis) | Medium (needs to understand value ranges) | Loop optimization, arithmetic simplification | Loop variables incrementing from 0 to N will not overflow; LLM can infer from loop structure. But overflow reasoning for nested calculations is harder |
| readonly/writeonly | High (requires whole-program analysis) | Low-medium (read function body) | CSE, dead store elimination, LICM | LLM is very accurate for single-function analysis. May not see everything with cross-module dependencies |
| dereferenceable(N) | High (escape analysis + lifetime) | Low (look at buffer size and lifetime) | Speculative load, LICM | LLM knows the pointer returned by malloc(100) is dereferenceable(100) |
| branch weights | Impossible (unknown without PGO data) | Medium (understand error handling vs. normal path) | Code layout, branch prediction | Most extreme gap: compiler is completely blind without PGO data; LLM can understand “this is error handling, rarely taken” |
| !range | High (value range propagation) | Medium (understand input domain constraints) | Bounds check elimination, narrowing | Enum values, array indices, validated inputs. LLM can understand “this value is a pixel from 0-255” |
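For concreteness, here are Clang/GCC source-level spellings that lower to several of the attributes above (illustrative functions, not benchmark code):

```c
/* nonnull: promise that callers never pass NULL, enabling null-check
   elimination inside and around the call. */
__attribute__((nonnull))
int first_byte(const unsigned char *p) { return p[0]; }

/* branch weights: __builtin_expect marks the error path as cold,
   which drives code layout even without PGO data. */
int checked_div(int a, int b, int *out) {
    if (__builtin_expect(b == 0, 0)) return -1; /* rarely taken */
    *out = a / b;
    return 0;
}

/* align: assert 64-byte alignment of a pointer so the compiler may use
   aligned SIMD loads/stores. Passing a less-aligned pointer is UB. */
double sum_aligned(const double *p, int n) {
    const double *q = __builtin_assume_aligned(p, 64);
    double s = 0.0;
    for (int i = 0; i < n; i++) s += q[i];
    return s;
}
```

Each of these is a one-line annotation at the declaration or use site, which is what makes them a plausible target for automatic insertion.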

A few observations.

Attributes with the largest gap: noalias (verified) and branch weights. The former because interprocedural alias analysis is one of the hardest analyses for a compiler. The latter because the compiler has zero information about branch probabilities without a PGO profile, while an LLM can understand from code semantics which are error paths and which are hot paths.

Dimensions LLMs excel at: pointer-related attributes (noalias, nonnull, dereferenceable, align), because these properties are usually determined at the allocation point, and the LLM can directly see the semantics of the allocation API.

Dimensions LLMs might struggle with: value range-related attributes (nsw/nuw, !range), which require reasoning about arithmetic operation chains; LLM’s numerical reasoning capability is a known weakness. However, simple cases for loop variables (incrementing from 0 to N) should be fine.

The most practically valuable direction: attributes that simultaneously satisfy the three conditions of “high compiler proof difficulty” + “low LLM identification difficulty” + “large unlocked optimization gain.” Currently, noalias, branch weights, and nonnull + dereferenceable seem to be the highest priority targets.

AI × Compiler Optimization

As of 2025, there are five main categories of work in this direction, none of which overlap with the idea in this article.

Replacing compiler heuristics. Google MLGO (2021+) uses RL to replace LLVM’s inlining/register allocation decisions and has been deployed in Chrome, Fuchsia, and Android. Magellan/AlphaEvolve (2025) uses Gemini to evolve C++ code for LLVM heuristic functions, performing 4.27% better than upstream heuristics on inlining-for-size. This type of work does not change the amount of information the compiler sees, but only makes better decisions under fixed information conditions.

Predicting optimization pass sequences. Meta LLM Compiler (2024, 7B/13B, trained on 546B tokens of LLVM IR) predicts the optimal pass sequence, achieving 77% of the effect of autotuning search. IR-OptSet (NeurIPS 2025) provides a 170K-sample LLVM IR optimization dataset; fine-tuned LLMs exceed -O3 on some cases. Again, this is about making choices under existing information.

Rewriting source code. LLNL’s CompilerGPT (2025) has LLMs rewrite code after reading compiler optimization remarks, achieving a 6.5x improvement on prefix sum. This changes the program implementation itself and requires proof of semantic equivalence.

Domain-specific pragma insertion. TimelyHLS / LIFT (2024-2025) has LLMs insert FPGA HLS pragmas (#pragma HLS pipeline, #pragma HLS unroll) into C code, achieving 3.5x speedup. This verified the feasibility of the “AI inserts pragma → compiler consumes” pipeline in the HLS vertical field, but the correctness constraints for HLS pragmas are much looser than semantic annotations for general-purpose compilers (wrong HLS pragmas usually just result in poor performance, not wrong results).

LLM-enhanced static analysis. The CAFD system by Cheng et al. (PKU/HUST/NTU/Ant/UNSW, 2025) uses LLMs (Qwen3-14B, LLaMA-3.2-3B, Phi-4, DeepSeek-R1) to detect custom memory allocation functions in C code. Core observation: a large number of heap allocations in real C projects go through project-specific wrappers like xmalloc and g_new, which standard pointer analysis tools (such as SVF) cannot identify, leading to alias set explosion. CAFD found 700+ custom allocation functions in 17 large C projects, reducing alias sets by 41.5%. This work validated the path “LLM semantic understanding → more precise alias information,” but its direction is analyzing existing code to recover lost semantic information, whereas this article proposes embedding semantic information directly at generation time. The two are complementary: CAFD handles legacy code; this approach handles newly written code.

Research on the Quality of AI-Generated Code

Multiple studies (Licorish et al. 2025, Molison et al. MSR 2025, Abbassi et al. ICSME 2025) compared the quality of LLM-generated code and human-written code. The key fact is: these studies all focus on dimensions such as correctness, maintainability, and security; no study has discussed the “compiler optimizability” of AI-generated code.

Semantic Information Loss Problem

The LLVM community has firsthand experience with this problem. “Type-Alias Analysis: Enabling LLVM IR with Accurate Types” (Zhou et al., ISSTA 2025) pointed out that LLVM lost type information after switching to opaque pointers, leading to alias analysis degradation. nikic’s 2024 LLVM annual review also mentioned that the new nuw flag on GEP was because “frontends previously had no means to convey to LLVM that the offset is non-negative.” These are all direct evidence of the semantic gap.

Formal Verification Tools

Alive2 (Lopes et al., PLDI 2021) is a translation validation tool for LLVM that can formally verify the correctness of IR transformations. It can be used to verify whether AI-generated attributes introduce undefined behavior. In 2024, there was work combining fine-tuned LLMs with Alive2, using LLMs to predict transformation correctness when the SMT solver times out, and then confirming with fuzzing. This “formal + AI + fuzzing” layered verification strategy can be directly reused.

Core Challenges

Challenge 1: Metadata Invalidation During Modification

Metadata generated by AI at time t=0 is correct, but the code is constantly modified at t=1, t=2, … Modifications may break the assumptions the metadata depends on. For example, initially a and b point to independent mallocs, but later someone changes it to b = a + offset, at which point restrict no longer holds.
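A minimal illustration of this failure mode (hypothetical functions; caller_t1 deliberately contains undefined behavior and should never be executed):

```c
#include <stdlib.h>

/* The annotated function: restrict records the contract
   "dst and src never overlap during a call". */
void add_one(double *restrict dst, const double *restrict src, int n) {
    for (int i = 0; i < n; i++) dst[i] = src[i] + 1.0;
}

/* t=0: the contract holds, the arguments are independent allocations. */
double caller_t0(void) {
    double *a = calloc(8, sizeof *a);
    double *b = calloc(8, sizeof *b);
    add_one(a, b, 8);
    double r = a[0];            /* 0.0 + 1.0 */
    free(a); free(b);
    return r;
}

/* t=1: a later edit reuses the same buffer. The call below now violates
   the restrict contract (undefined behavior) even though the code looks
   similar. Shown for illustration only; do not execute. */
double caller_t1(void) {
    double *a = calloc(8, sizeof *a);
    add_one(a, a + 1, 7);       /* dst and src overlap: restrict is a lie */
    double r = a[0];
    free(a);
    return r;
}
```

Nothing in the type system flags the t=1 edit; the annotation silently changes from a proof aid into a latent miscompilation.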

This is the core challenge and the most research-worthy problem.

Challenge 2: Reliability of the LLM Itself

Even at the time of generation, an LLM may generate inconsistent code and metadata. For example, the code actually does b = a + 1, but the metadata declares restrict(a, b). The correctness of metadata is a second-layer requirement on top of code correctness.

Challenge 3: Lack of Training Data

Existing training data rarely contains code with optimization annotations. Human programmers almost never write restrict, __builtin_expect, or alignment attributes. LLMs need to be explicitly guided to generate these annotations.

Challenge 4: Cross-Module Invalidation

A function’s metadata may depend on the behavior of the caller or callee. Change detection across modules is harder than within a module.

Possible Solutions

Precedent from JIT Compilers: Speculative Optimization + Deoptimization

JIT compilers (V8, HotSpot/Graal, PyPy) have long used a mature pattern to handle “assumption-based optimization”: collect profile → generate optimized code based on assumptions and insert guards → deoptimize when guards fail. “Formally Verified Speculation and Deoptimization in a JIT Compiler” (POPL 2021) formally verified this pattern using Coq.

AI-generated metadata is equivalent to “assumptions” in JIT, guards are equivalent to runtime assertions, and deoptimization is equivalent to falling back to conservative compilation without metadata. The difference is that JIT does this dynamically at runtime, while AOT scenarios need to complete it at compile-time + test-time.

Design by Contract (Eiffel/Meyer)

AI-generated metadata is essentially a contract. restrict(a, b) is an invariant: “During the execution of this function, a and b do not point to overlapping memory.” The reason DbC hasn’t gained traction in mainstream languages is that the maintenance cost is too high. But if contracts are automatically generated and maintained by AI, this cost barrier disappears.

Layered Verification Strategy

Layer 1 (compile-time): static analysis checks whether the metadata clearly contradicts the code. Layer 2 (test-time): each piece of metadata is accompanied by a runtime assertion, enabled in debug builds and CI. Layer 3 (formal verification): use Alive2 to verify performance-critical paths. If any layer detects a problem, the metadata is automatically removed (graceful degradation).
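A sketch of what a Layer-2 assertion could look like for a restrict annotation (the NOALIAS_ASSERT macro name is hypothetical, not an existing API):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical Layer-2 guard: in debug builds, check the noalias
   contract at function entry before the optimizer relies on restrict.
   In release builds (-DNDEBUG) it compiles to nothing. */
#ifndef NDEBUG
#define NOALIAS_ASSERT(p, q, bytes)                          \
    assert((uintptr_t)(p) + (bytes) <= (uintptr_t)(q) ||     \
           (uintptr_t)(q) + (bytes) <= (uintptr_t)(p))
#else
#define NOALIAS_ASSERT(p, q, bytes) ((void)0)
#endif

void axpy(double *restrict y, const double *restrict x,
          double a, size_t n) {
    NOALIAS_ASSERT(y, x, n * sizeof(double));  /* check the contract */
    for (size_t i = 0; i < n; i++) y[i] += a * x[i];
}
```

The assertion is the machine-checkable form of the contract: any edit that makes the two regions overlap fails in CI rather than silently producing miscompiled release builds.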

Verification Plan

Experiment 1: General Opportunity Sizing (Completed)

Goal: Determine the gain distribution of restrict/noalias under different code complexities.

Completed content: (1) 10 hand-written kernels (one-dimensional arrays, simple mode), (2) 10 Polybench kernels (two-dimensional VLA, complex mode). Measurement of runtime performance and assembly code size differences completed.

Core finding: The size of the gap depends on code complexity. In simple one-dimensional mode, the compiler can handle it itself (1/10 has >5% speedup); in complex multi-dimensional mode, the compiler’s alias analysis fails (6/10 have >5% speedup, up to +35%). restrict generates gains through two mechanisms: vectorization enablement (when the gap is largest) and alias check elimination (when the gap is medium).

Experiment 2: Pushing the Upper Bound + Attribution Analysis (Full Polybench, 30 kernels)

Goal: Push the performance upper bound for more optimizations on all 30 Polybench kernels, while performing attribution analysis on the boundaries of LLM’s capabilities.

Methodology: holistic profile → cross-kernel aggregation → batch optimization

Instead of analyzing kernel by kernel, first collect -Rpass-missed reports for all kernels and aggregate statistics on the type distribution of missed optimizations across kernels. This way, even if a certain optimization is not the largest bottleneck in each kernel, as long as it appears in most kernels, its total leverage is worth prioritizing. After determining priorities, apply the same class of optimization to all kernels in batch and measure the global effect at once.

Missed optimization panorama: Complete -Rpass-missed remarks were collected for the plain and restrict versions of the 30 kernels. Restrict resolved 126 GVN (redundant load), 96 SLP vectorizer, 71 LICM (load hoisting), and 23 loop-vectorize missed optimizations. Residual problems ranked by kernel coverage: LICM still fails (19/30 kernels), cost model rejects vectorization (14/30), GVN not eliminated (11/30). A typical residual pattern in LICM is that in the accumulator q[i] += A[i][j] * p[j], q[i] cannot be hoisted out of the loop, presumably because LLVM’s alias analysis cannot fully propagate parameter-level noalias to the load/store of VLA derived pointers.

Forced vectorization verification: After overriding cost model decisions with -mllvm -force-vector-width=N, 13 out of 30 kernels achieved >10% additional speedup, reaching up to +84% (gesummv). At the same time, 2 kernels slowed down (symm -22%, correlation -10%), indicating that the cost model’s conservative judgment in some cases is correct.

The cumulative effect of the two optimizations is significant. Restrict alone: 10/30 kernels have >5% speedup, mean +7.3%. After adding fvw: 19/30 have >5% speedup (63%), 13/30 have >20% speedup (43%), mean +30.6%. Maximum cumulative speedup +88.4% (bicg) and +88.2% (trisolv). Correctness verification: the output of the restrict version of 30/30 kernels is identical to the plain version.

Attribution analysis of four types of kernels. The 30 kernels are divided into four categories based on “which optimization technique is effective”: (A) Dual gains from restrict + fvw (4 kernels, e.g., bicg/trisolv), with complex multi-pointer patterns AND inaccurate cost model. (B) Primarily benefiting from restrict (3 kernels, e.g., syrk +32%), where alias analysis is the only bottleneck. (C) Primarily benefiting from fvw (9 kernels, e.g., gesummv +85%/durbin +78%), where alias is not an issue but the cost model misjudged. (D) No significant gain (7 kernels, e.g., floyd-warshall), where the calculation pattern is unsuitable or the compiler is already good enough.

This reveals two different types of semantic-proof gaps. The first is the correctness proof gap: the compiler lacks alias information and cannot prove the safety of optimization. LLMs compensate by understanding memory layout (high risk, wrong labeling is UB). The second is the strategy judgment gap: the compiler has the ability to perform optimization but the cost model makes a wrong judgment. LLMs compensate by understanding calculation patterns (low risk, wrong choice just results in slowdown). The two are complementary and cumulative; a single technique can only cover 23-43% of kernels, while the combination covers 63%.

Experiments with alignment annotations ruled out a significant third gap. Polybench’s memory allocation function internally uses posix_memalign(&ret, 4096, ...), so all arrays are at least 4096-byte aligned. But what the compiler sees is a void* return value, so it does not know the alignment guarantee. After passing this information to the compiler by adding __attribute__((assume_aligned(64))) to the function declaration, only 5 out of 30 kernels had >2% speedup (gesummv +30%, durbin +29%, gemver +29%), and 14 actually slowed down (average -12%), for an overall mean of -2.2%. The reason is that on Apple Silicon NEON the performance difference between unaligned and aligned loads is almost nonexistent, so alignment information barely moves the cost model on this architecture. The extra attributes may also perturb LLVM’s pass pipeline, leading to suboptimal instruction scheduling. This negative result confirms the priority ranking: noalias >> vectorization advice >> alignment annotation. The conclusion might differ on x86 (especially early SSE, which requires 16-byte alignment) and is worth verifying in the future.
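The annotation from this experiment, sketched as a wrapper declaration (alloc_data is an illustrative stand-in; in Polybench the real target is polybench_alloc_data):

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <stdint.h>

/* assume_aligned on the declaration tells callers' translation units
   that the returned pointer is 64-byte aligned, even though the
   posix_memalign call is invisible to them. */
__attribute__((assume_aligned(64)))
void *alloc_data(size_t bytes) {
    void *p = NULL;
    if (posix_memalign(&p, 4096, bytes) != 0) return NULL;
    return p;
}
```

Note the asymmetry with restrict: a wrong assume_aligned is just as much UB as a wrong restrict, yet on this architecture it buys almost nothing, which is why attribute selection itself needs architecture-aware judgment.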

From learnings to agentic optimizer: An effective agentic optimizer needs two modules. Alias annotator analyzes function signatures and call contexts to provide noalias information (conservative strategy). Vectorization advisor analyzes loop structures and -Rpass-missed output to provide vectorization suggestions (speculate-then-verify strategy). The two modules run independently, and their results are cumulative. Both modules have been implemented as opencode skills, with a design philosophy based on “result determinism”: not prescribing the process, but only defining verification standards, allowing the agent to iterate on its own in a feedback loop. The decision results for each kernel (including failed samples) are appended to optimizer_learnings.md, forming a knowledge flywheel.

Experiment 3: TSVC-2 Generalization Verification (Gap 2 across benchmarks)

Polybench’s code patterns are relatively homogeneous (linear algebra and stencil). Is the discovery of the two types of gaps just a coincidence of a specific benchmark? TSVC-2 (Test Suite for Vectorizing Compilers, v2) provides a different verification scenario: 151 kernels, covering various calculation patterns such as reduction, loop restructuring, induction variable identification, control flow, and indirect addressing.

The structural characteristics of TSVC-2 eliminate Gap 1. TSVC-2 uses global static arrays (__attribute__((aligned(64)))), so the compiler naturally knows these arrays do not alias each other. Therefore, restrict experiments are meaningless on TSVC-2. This in turn provides a clean isolation: experimental results on TSVC-2 purely reflect Gap 2 (cost model misjudgment), without mixing in contributions from Gap 1.
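A sketch of the TSVC-2 data layout (simplified, not actual TSVC source):

```c
/* TSVC-2 style data: global static arrays with known addresses and
   declared alignment. The compiler can prove a[] and b[] never alias,
   so restrict adds nothing; any remaining missed vectorization is a
   pure cost-model (Gap 2) issue. */
__attribute__((aligned(64))) static float a[256];
__attribute__((aligned(64))) static float b[256];

float saxpy_globals(float s, int n) {
    for (int i = 0; i < n; i++) a[i] += s * b[i];
    return a[0];
}
```

Because the arrays are distinct named objects in the same translation unit, alias analysis is trivial here, which is exactly what makes TSVC-2 a clean probe of the cost model alone.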

fvw experiments on 146 valid kernels show that Gap 2 is a universal phenomenon. 33/146 kernels (23%) achieved >5% speedup through fvw override, 19 achieved >20%, mean +9.2%. Compared to Polybench (fvw contributing 63% coverage), TSVC-2’s 23% seems lower, but the difference mainly stems from the simultaneous existence of Gap 1 in Polybench. Looking only at the contribution of Gap 2 (Category C kernels in Polybench 9/30 = 30%), the performance of the two benchmarks is consistent.

The most important discovery: 14 pure cost model gap cases. In these kernels, the compiler completely failed to auto-vectorize (autovec gain ~0%), but achieved 25-94% speedup after forced vectorization. By pattern: the first 6 are all reductions (sum: s311 +93%, product: s312 +94%, coupled: s319 +89%), and the rest involve indirect addressing and induction variable identification. The compiler rejects vectorizing FP reduction under -O3 (without -ffast-math) because the reduction tree changes the operation order. But in fact, the numerical sensitivity of these kernels is very low, and the FP difference brought by vectorization (maximum relative diff 6.87e-4) is completely acceptable. This is a typical strategy judgment gap: the compiler has the ability to do this optimization, and safety is not an issue (just FP order change, not UB), but the cost model judges it “not worth it.”
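A minimal example of the pattern (modeled on s311, simplified):

```c
/* A sum reduction in the style of TSVC s311. At -O3 without
   -ffast-math the compiler refuses to vectorize this loop: a vector
   reduction tree reassociates the additions and may perturb the FP
   result, even though for most data the difference is tiny. */
float sum_reduce(const float *x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += x[i];
    return s;
}
```

Forcing vectorization here changes only the summation order, never the memory semantics, which is why a wrong call by the advisor costs accuracy in the last few bits at worst, not correctness of the program.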

Hardware implications of FVW4 being superior to FVW2. Under float32 + 128-bit NEON (4 lanes natural width), fvw4 mean +8.9% while fvw2 mean -3.6%. Forcing SIMD narrower than the natural width actually results in degradation. This provides a reliable heuristic for the vectorization advisor: default to hardware natural vector width.

The compiler cost model is correct in most scenarios. 107/146 kernels (73%) are unaffected by fvw within ±5%, indicating that the cost model’s conservatism is correct in most cases. The value of FVW is concentrated in a few scenarios where the cost model fails, and the characteristics of these scenarios are learnable (reduction, indirect addressing, induction variable patterns).

Next Steps

Continue pushing the upper bound. Alignment annotations have been tested (weak leverage, see above). Next, test the incremental gain of IR-level !noalias scope metadata, which may resolve the residual LICM issues in 19/30 kernels (restrict provides noalias between parameters at the source level, but LLVM’s alias analysis fails to fully propagate it to the load/store of VLA derived pointers).

Build an agentic optimizer prototype. Two modules have been implemented as opencode skills (skill_compiler_alias_annotator.md and skill_compiler_vectorization_advisor.md). Each skill defines verification standards rather than step-by-step processes, following the “result determinism” design philosophy: alias annotator requires diff pass + clean compilation + performance quantification; vectorization advisor requires numerical tolerance within 1e-2 + performance quantification. Both skills include a knowledge accumulation mechanism, where the decision for each completed kernel is appended to optimizer_learnings.md, forming a compounding effect. Next, test the accuracy of the skills on the 23 non-Category D kernels of Polybench.

Writing: The data from the full Polybench set + TSVC-2 is sufficient to support a paper. Core contributions: (1) Discovery and quantification of two types of semantic-proof gaps (correctness proof + strategy judgment), (2) Cumulative effect of 63% coverage and +30.6% mean speedup on 30 Polybench kernels, (3) Verification of the universality of Gap 2 on 146 TSVC-2 kernels (23% coverage), (4) Attribution analysis of four types of kernels and the design framework of an agentic optimizer.

Summary

The core thesis is not “AI can make code run faster”—that’s too broad. The core thesis is: there exists a gap in program properties that are “easy for semantic understanding but difficult for formal proof,” LLMs are naturally suited to bridge this gap, and this gap directly maps to the optimization capability boundary of compilers.

Experiments on two benchmarks verified this thesis. 30 kernels from Polybench/C verified the existence and cumulative effect of the two types of gaps: the correctness proof gap (restrict/noalias) covers 10/30 kernels, and the strategy judgment gap (fvw) covers an additional 9/30 kernels; after accumulation, 19/30 kernels (63%) achieved >5% speedup, with a mean of +30.6% and a maximum of +88.4%. 146 kernels from TSVC-2 verified the universality of Gap 2: 33/146 (23%) achieved >5% speedup, of which 14 were pure cost model gap cases (compiler completely failed to auto-vectorize, but actually achieved 25-94% speedup). The negative result of alignment annotations indicates that not all semantic information has equal value, and the choice of attributes itself requires architecture-aware judgment.

The two types of gaps correspond to two agentic optimizer modules. Alias annotator handles Gap 1 (conservative strategy, wrong judgment is UB), and vectorization advisor handles Gap 2 (speculate-then-verify, wrong judgment is just a slowdown). Both modules have been implemented as opencode skills, following the “result determinism” principle in design, replacing process prescriptions with verifiable delivery standards. Next, test the accuracy and degree of automation of the skills on the 23 non-Category D kernels of Polybench.

First draft: 2026-03-07, Updated: 2026-03-08. Experiments on 30 Polybench kernels + 146 TSVC-2 kernels completed. Includes four rounds of experiments: restrict, forced vectorization, alignment annotations, and TSVC-2 generalization verification. Correctness passed for 30/30 Polybench kernels, with 19/30 having >5% speedup. 33/146 TSVC-2 kernels have >5% speedup (pure Gap 2). Design of two agentic optimizer skills completed.
