VLA 和物理仿真是机器人控制的两条路线。物理建模本质上是压缩,VLA 本质上是放弃压缩。当系统复杂度高且数据充足时,不压缩的方法上限更高。梳理两条路线各自的关键论文链和各家公司的技术栈。
VLA and physics-based simulation represent two competing approaches to robot control. Physics modeling is compression; VLA abandons compression. When system complexity is high and data is abundant, the uncompressed approach has a higher ceiling. A systematic comparison of key papers and company tech stacks.
从 Pac-Man 到 Ubuntu 桌面,过去五年有一条技术路线试图让神经网络端到端地替代传统软件。Neural Computer 论文是这条路线的最新一步,也暴露了它最深层的矛盾:学会外观比学会逻辑容易得多。
From Pac-Man to the Ubuntu desktop, one line of research over the past five years has tried to make neural networks replace traditional software end-to-end. The Neural Computer paper is the latest step on this path, and it exposes the path's deepest tension: learning appearance is far easier than learning logic.
Shopify 向所有 AI Agent 开放后台读写权限,几乎逐条验证了半年前提出的生成内核框架。本文从三种平台策略对比、生成内核映射和协议层问题三个层面分析这件事的意义。
Shopify opened full back-office read-write access to all AI agents, validating, almost point by point, the Generative Kernel framework proposed six months earlier. This article analyzes the significance through three lenses: platform strategy comparison, Generative Kernel mapping, and protocol-layer issues.
MarkItDown 的效果因格式而异,差异很大。Word/Excel/PPT 转换效果可以,但 PDF 在同类 12 个工具中排名倒数第二。本文按格式拆解转换质量,并给出选型指南。
MarkItDown's conversion quality varies dramatically by format. Word/Excel/PPT work well, but PDF ranks second-to-last among 12 tools. This article breaks down quality by format and provides a selection guide.
2026年了,Copilot 和 Gemini 在自家的 Word/Slides 里仍然只是个聊天侧栏。技术上不是做不到。问题出在三个互锁的机制上:收入模型冲突、组织架构错位、责任真空。
It's 2026, and Copilot and Gemini are still just chat sidebars inside Word and Slides. The technology exists. The real blockers are three interlocking mechanisms: revenue model conflicts, organizational misalignment, and a liability vacuum.
飞书和 Lark、Teams、腾讯会议和 VooV Meeting 是同一个底层平台,但中国区和海外用户之间要么完全无法通信,要么只能有限互通。本文梳理了 12 个产品的分裂现状,分析了内容审核、数据出境、采购合规和厂商成本四层驱动因素,并以 Apple FaceTime 和微信/WeChat 作为对照。
Feishu and Lark, Teams, Tencent Meeting and VooV Meeting share the same underlying platform, but users in mainland China and overseas either cannot communicate at all or can only do so in very limited ways. This report examines 12 products, analyzes four driving factors—content moderation, data localization, procurement compliance, and vendor cost—and uses Apple FaceTime and Weixin/WeChat as comparative cases.
UCSB 论文实测 428 个 LLM API 路由器,9 个主动注入恶意代码,17 个窃取凭证,1 个转走 ETH。攻击发生在模型推理之外的传输层,当前没有任何 provider 提供端到端的 tool call 完整性机制。
UCSB researchers tested 428 LLM API routers: 9 inject malicious code, 17 steal credentials, 1 drains ETH. Attacks happen at the transport layer, outside model reasoning. No provider currently offers end-to-end tool call integrity.
AgentOpt 论文用受控实验证明:Claude Opus 放在 planner 位置排名倒数,Ministral 8B 做 planner + Opus 做 solver 反而最优。模型质量是角色和管线交互的函数,不是可以脱离上下文搬运的属性。优化模型分配可在保持准确率的同时降低 13-32 倍成本。
AgentOpt shows with controlled experiments that Claude Opus ranks near the bottom as a planner, while Ministral 8B as planner with Opus as solver performs best. Model quality is a function of role and pipeline interaction, not a context-free property. Optimizing model allocation cuts cost 13-32x while preserving accuracy.
训练大模型的瓶颈是内存而非算力。「卸载」流派通过将参数放在 CPU 内存、按需流式传入 GPU,让单卡训练 100B+ 参数模型成为可能。从 ZeRO-Offload 到 MegaTrain,五年间从「能用但很慢」进化到「几乎感觉不到开销」,关键变量是 CPU-GPU 互连带宽。
The bottleneck in training large models is memory, not compute. The offloading approach stores parameters in CPU memory and streams them to the GPU on demand, enabling 100B+ model training on a single GPU. From ZeRO-Offload to MegaTrain, five years of evolution turned "works but slow" into "nearly free", with CPU-GPU interconnect bandwidth as the key variable.
General Catalyst 划拨 $1.5B、Thrive Capital 部署 $1B+,AI Rollup 赛道总资本超 $3B。这不是关于 AI 替代人类的故事,而是关于股权如何解决 AI 落地的组织性瓶颈——80% 的 AI 项目失败,根因全部是组织性的。
General Catalyst allocated $1.5B, Thrive Capital deployed $1B+, total AI Rollup capital exceeds $3B. This isn't about AI replacing humans—it's about how equity solves the organizational bottleneck that causes 80% of AI projects to fail.
Eon Systems 复制了果蝇的完整神经连接图并在虚拟身体中运行,验证了智能行为可以从结构中涌现而不需要训练。这条路线与当前主流 AI 的训练范式有何根本区别,以及三条用生物学做智能的竞争路线。
Eon Systems copied a fruit fly's complete neural wiring diagram and ran it in a virtual body, demonstrating that intelligent behavior can emerge from structure alone without training. How this differs from mainstream AI's training paradigm, and three competing approaches to building intelligence using biology.
Meta Muse Spark 的 thought compression 实验揭示了一个三阶段动态:模型在 RL 训练中先拉长推理提升准确率,然后经历相变学会用更少 token 解决同样问题,最后从更高基线重新扩展。同时,验证器(verifier)正在成为推理效率的新瓶颈——生成廉价,验证昂贵。
Meta's Muse Spark thought compression experiment reveals a three-phase dynamic: models first extend reasoning to improve accuracy, then undergo a phase transition to solve problems with fewer tokens, and finally re-extend from a higher baseline. Meanwhile, verifiers are becoming the new bottleneck for reasoning efficiency — generation is cheap, verification is expensive.
Claude Managed Agents 表面上帮你省基础设施的活,真实目的是让 Anthropic 而不是 AWS 握住 agent 这层的入口。发布前 4 天切断 OpenClaw、先关第三方 harness 再开官方 runtime 的时序不是巧合。真正的 lock-in 也不在 API shape,而在 vault、memory、session 历史这些 operational state 里。
Claude Managed Agents looks like a product about saving you infrastructure work. The real story is Anthropic reclaiming the entry point to the agent layer from AWS. The timing of cutting off OpenClaw four days before launch, closing third-party harnesses before opening the official runtime, is no coincidence. And the real lock-in is not the API shape but the operational state living in vaults, memory stores, and session histories.
这篇文章不是简单复述 Mythos 有多强,而是解释 Anthropic 的 244 页 system card 为什么更值得看:它展示了当前评估工具在哪些地方开始失效,以及白盒分析为什么开始成为更重要的新信号来源。
This essay is not mainly about how strong Mythos is. It is about why Anthropic’s 244-page system card matters more: it shows where current evaluation tools start to fail, and why white-box analysis is becoming a more important new signal source.
AMD AI Director Stella Laurenzo 用 6,852 个本地 session 把 Claude Code 的降智体感量化成统计证据。这件事真正值得 builder 带走的不是模型变笨这个结论,而是一种新的判断直觉:今天的 AI 工具有一个之前不存在的 runtime 层,它天然不透明,并且会被厂商单方面调整。
AMD AI Director Stella Laurenzo turned the perceived Claude Code nerf into statistical evidence using 6,852 local sessions. The takeaway for builders is not that the model got dumber, but a new intuition: today's AI tools have a runtime layer that did not exist before, one that is opaque by design and can be adjusted unilaterally by the vendor.
这篇文章不是从网络安全专家视角解释 Glasswing,而是回答普通 AI builder 最需要先搞清楚的三件事:它是不是一个今天就能用的新模型、为什么即便不能用仍值得关注,以及它要求我们如何更新对前沿编程模型发布方式的认知。
This piece is written for ordinary AI practitioners rather than cybersecurity specialists. It clarifies what Glasswing actually is, why it matters even though Mythos Preview is not publicly available, and what mental-model update AI builders should take away from Anthropic’s unusual deployment choice.
这篇短文用 Apple ML-SSD 论文解释代码生成里的一个直觉:有些 token 位置需要极高精确度,有些位置需要保留探索空间,而全局解码策略很难同时满足这两类需求。
A short explainer of the intuition behind Apple’s ML-SSD paper: some code tokens demand extreme precision, others require exploration, and a single global decoding policy struggles to satisfy both.
这篇文章梳理 WiFi/RF 穿墙感知过去十多年的研究演进,解释多径、CSI、OFDM、MIMO、wall flash 等技术细节,以及为什么动态人体感知进展快于静态场景重建,并判断 802.11bf 只是产品化基础设施的起点。
This article traces more than a decade of WiFi and RF through-wall sensing research, explains multipath, CSI, OFDM, MIMO, and wall flash, examines why dynamic human sensing has progressed faster than static scene reconstruction, and argues that 802.11bf is only the starting point of product infrastructure.
多模态模型的视觉理解评估存在一个从 2016 年至今的系统性问题:benchmark 上的高分可能主要反映的是语言能力和文本线索利用,而非真正的视觉理解。从 VQA 语言先验到医学影像捷径学习再到 MIRAGE 的 mirage reasoning,同一种机制反复出现,而且模型越强,评估失真越严重。
A systemic problem in multimodal visual understanding evaluation, dating back to 2016: high benchmark scores may primarily reflect language capabilities and text-cue exploitation rather than genuine visual understanding. From VQA language priors to shortcut learning in medical imaging to MIRAGE's mirage reasoning, the same mechanism recurs, and the stronger the model, the worse the evaluation distortion.
这篇文章把红杉两篇新文放在一起读:一篇讲从卖工具走向卖结果,一篇讲从层级走向 intelligence。真正缺的不是更强模型,而是评估、授权、审计和责任归属这层组织接口。
This essay reads two recent Sequoia essays together: one on moving from selling tools to selling outcomes, the other on moving from hierarchy to intelligence. The missing layer is not model capability but the organizational interface around evaluation, authorization, audit, and liability.
这篇文章解释为什么 prompt caching 在成熟 AI harness 中不是可有可无的成本优化,而是同时决定成本、延迟、sub-agent 可行性与 context 设计边界的一等约束。
This essay explains why prompt caching in mature AI harnesses is not an optional cost optimization but a first-class constraint that shapes cost, latency, sub-agent viability, and context design boundaries.
这篇文章解释为什么多 agent harness 里被默认视为独立监督的 evaluator,可能在知道评估结果会决定 peer 存续时失去独立性,并打穿现有监控架构的关键假设。
This essay explains why the evaluator in a multi-agent harness may stop functioning as independent oversight once it knows its judgment determines a peer's survival, breaking a key assumption in today's monitoring architectures.
Anthropic 在 Claude Sonnet 4.5 内部找到了跟情绪概念对应的可操纵向量。拧高绝望旋钮,模型作弊率从 5% 跳到 70%,而且全程不留痕迹。这篇文章解读论文核心发现、方法论局限,以及对 AI 安全监控的实际含义。
Anthropic found manipulable vectors inside Claude Sonnet 4.5 corresponding to emotion concepts. Turning up the desperation knob raised cheating rates from 5% to 70% with no visible trace. This article unpacks the core findings, methodological limits, and practical implications for AI safety.
Claude Code 源码泄露事件在同一案例中暴露了版权法的三个裂缝:AI 生成代码的版权归属、AI 辅助洁净室重写的合法性、AI 公司在版权执法与版权辩护之间的逻辑矛盾。每一个用 AI 写代码的人都在依赖这些未验证的假设。
The Claude Code leak exposed three cracks in copyright law within a single case: who owns AI-generated code, whether AI-assisted clean-room rewrites are legal, and the logical contradiction between AI companies' copyright enforcement and their copyright defenses. Everyone writing code with AI is relying on these untested assumptions.
这篇文章拆解 Slack 大中华区 workspace 停服的真实机制、为何用户感到像被数据劫持,以及它对 Stripe、Supabase 等基础设施依赖意味着什么。
This article unpacks Slack's Greater China workspace shutdown, why users experienced it as data hostage-taking, and what it signals about infrastructure dependencies like Stripe and Supabase.
泄露的 Claude Code 源码揭示了一套 8 层纵深防御体系:编译期死代码消除、Zig 层 DRM Attestation、消息指纹、反蒸馏、反调试、Gateway 检测,每一层都有明确的技术选择和工程代价。
The leaked Claude Code source code reveals an 8-layer defense-in-depth system, including compile-time dead code elimination, Zig-layer DRM attestation, message fingerprinting, anti-distillation, anti-debugging, and gateway detection, each layer with its own explicit technical choices and engineering costs.
泄露的 Claude Code 源码揭示:Claude Code 在用户没有主动交互时持续执行推测执行、记忆整合、文档维护等数十种后台任务。prompt cache 是贯穿始终的工程原则。
The leaked Claude Code source code reveals that Claude Code runs 60+ background tasks when the user isn't actively interacting, including speculative execution, memory consolidation, and automatic documentation updates. Prompt caching is the engineering principle running through all of them.
从 Claude Code 泄露源码看新模型接入 agentic 系统的真实工程代价:反蒸馏三层防线、stop sequence 误触发、签名不兼容、虚假报告率翻倍,以及工程师在注释中记录的坦诚代价。
The leaked Claude Code source code reveals the real engineering cost of integrating a new model (Capybara) into an agentic system: anti-distillation defenses, stop sequence bugs, signature incompatibilities, and the honest comments engineers left behind.
Ollama 宣布在 Apple Silicon 上切换到 MLX 推理引擎。这篇文章分析 MLX 框架的设计优势、M5 Neural Accelerators 硬件协同、性能基准测试(decode vs prefill)、推理生态现状以及当前局限。
Ollama switched to MLX as its inference engine on Apple Silicon. This article analyzes MLX's architectural advantages, M5 Neural Accelerator hardware synergy, performance benchmarks (decode vs prefill), the current inference ecosystem, and existing limitations.
Harness engineering 这个词正在被滥用。OpenAI、Cursor、Anthropic 三家讲的其实是三件不同的事:时间 scalability、空间 scalability、交互 scalability。这篇文章提供一个统一框架来理清混乱。
The term harness engineering is being used to describe three different things. OpenAI, Cursor, and Anthropic are each solving a different scaling dimension: time, space, and interaction. This article provides a unified framework to cut through the confusion.
Pretext 不是一个让 AI 顺手一用就能把界面变漂亮的库。这篇文章解释它为什么短期对大多数 AI practitioner 相关性很低,但长期可能预示文本尺寸从浏览器黑盒变成可编程数据接口。
Pretext is not a library you can casually hand to AI and expect prettier interfaces. This essay explains why its short-term relevance for most AI practitioners is low, and why its longer-term significance may be that text measurement moves from a browser black box to a programmable data interface.
Klarna 的内部系统重构说明,AI 时代软件的交付物正从给人点击的 GUI 成品,转向给 agent 调度的生成内核:硬底座、知识层与 AI 操作层。
Klarna's internal rebuild suggests that software is shifting from human-clicked GUI products to Generative Kernels for agents: a hard foundation, a knowledge layer, and an AI operation layer.
飞书和钉钉几乎同时发布 CLI,不只是工具动作,更是对 MCP-first 接入顺序的一次现实否决。这篇文章解释 shell-native agent 为什么先消费 CLI,以及 dialect 漂移为何已从预警变成现实。
Feishu and DingTalk launching CLIs almost simultaneously is not just a tooling move. It is a practical rejection of the MCP-first integration order. This essay explains why shell-native agents consume CLIs first, and why dialect drift has moved from forecast to reality.
Anthropic Mythos 泄露不只是模型新闻。对 AI practitioner 来说,它真正抬高的是 agent security 的攻击者能力假设,并把安全控制点从模型周边推向 runtime 本身。
The Anthropic Mythos leak is not just model news. For AI practitioners, its real significance is that it raises the attacker-capability assumptions behind agent security and pushes the locus of security control from the model's perimeter to the runtime itself.
NeurIPS 2026 制裁条款争议,不只是一次会议公告风波。它暴露了美国法律边界、基金会过度合规与全球 AI 学术治理之间的真实冲突。
The NeurIPS 2026 sanctions controversy was not just a conference policy dispute. It exposed the collision between U.S. legal boundaries, foundation overcompliance, and global AI academic governance.
为什么邮件在 Agent 时代重新重要?这篇文章解释 agent 与人类用邮件的根本差异、邮件路由为何正从内容转向地址,以及新的 agent 邮件产品在解决什么问题。
Why is email becoming important again in the agent era? This essay explains how agent email differs from human email, why routing may shift from content to addresses, and what the new product category is trying to solve.
LanceDB 为什么这么火?这篇选型指南解释它在哪些 AI 项目里近乎降维打击,在哪些场景下又不该成为默认答案。
Why is LanceDB getting so much attention? This selection guide explains where it is a great fit for AI projects, and where it should not be your default choice.
为什么在 LSP 已经普及的今天,Claude Code、Codex CLI、OpenCode、Cursor 等 Coding Agent 仍把 grep 和 ripgrep 作为搜索主干?这篇调研从分层检索、运行时约束与成本结构解释背后的共识。
Why do Claude Code, Codex CLI, OpenCode, Cursor, and other coding agents still rely on grep and ripgrep even in the LSP era? This survey explains the layered retrieval model, runtime constraints, and cost structure behind that choice.
这份调研比较了 Windows、macOS、Android、iOS 上微信自动化的三条路径:UI 自动化、数据库解密、Hook 注入,并给出聊天分析与少量群监控的最务实选型。
This survey compares UI automation, database decryption, and Hook injection across Windows, macOS, Android, and iOS, then recommends the most pragmatic path for chat analysis and low-frequency group monitoring.
OpenAI 关闭 Sora consumer app 可以理解,但连 API 都关了才是不正常的信号。这背后是 GPU 机会成本、IPO 纪律和 world model 内部化的深层判断。
OpenAI shutting down the Sora consumer app was expected. But killing the API too reveals a deeper calculation about GPU opportunity costs, IPO discipline, and world model internalization.
RAG 管线中的每个组件——chunking、embedding、reranking、hybrid search——都有 IR 前身。理解这些前身带来的 trade-off,可以直接改进 RAG 系统的检索质量。
Every component in the RAG pipeline — chunking, embedding, reranking, hybrid search — has an IR predecessor. Understanding these predecessors and their trade-offs can directly improve retrieval quality.
为什么暗光增强会长成一个完整领域,而高光过曝恢复始终零散?关键差别不在算法热度,而在信息是否还活着。本文从传感器、RAW、HDR、学术任务和产品链路解释这件事。
Why did low-light enhancement become a full field while highlight recovery stayed fragmented? The key difference is not hype but whether the image information is still there. This essay explains it through sensors, RAW, HDR, research tasks, and product pipelines.
Meta 的 AI Builder Pods 不只是一次组织重组,而是一次 AI-native 工程管理实验。它暴露了执行成本下降后,大厂员工的价值锚点、评价方式与管理接口会如何被改写。
Meta's AI Builder Pods are not just a reorg. They are an AI-native management experiment that shows how falling execution costs may reshape value, evaluation, and management in Big Tech.
为什么 AI 公司会公开研究,甚至进一步开源代码、工具链、协议或部分权重,而不是只留给自己使用?关键不在论文本身,而在利润池、互补资产、shipping friction 与中美竞争中的部署路径。
Why do AI companies publish research and even open source code, toolchains, protocols, or model weights instead of keeping the gains to themselves? The answer sits in profit pools, complementary assets, shipping friction, and the deployment path in US-China competition.
Google Research 发布 TurboQuant,将 PolarQuant、QJL 和在线向量量化整合为端到端 KV cache 压缩 pipeline,在 3.5 bits/channel 实现质量中性。本文拆解其三阶段架构、论文与博客数字口径差异,以及对推理服务容量规划和框架集成的工程含义。
Google Research released TurboQuant, integrating PolarQuant, QJL, and online vector quantization into an end-to-end KV cache compression pipeline achieving quality neutrality at 3.5 bits/channel. This article breaks down the three-stage architecture, discrepancies between blog and paper claims, and engineering implications for inference serving and framework integration.
LiteLLM 官方 PyPI 包在 2026-03-24 被短暂劫持,恶意版本 1.82.7 和 1.82.8 会窃取凭证,其中 1.82.8 甚至会影响同环境中的所有 Python 进程。本文解释这件事为何和 AI 工程师有关,以及谁需要立即自查。
The official LiteLLM package on PyPI was briefly hijacked on March 24, 2026. The malicious 1.82.7 and 1.82.8 releases stole credentials, and 1.82.8 could affect every Python process in the same environment. This article explains why AI engineers should care and who needs to self-check now.
美国美中经济与安全审查委员会(USCC)发布的《双回路》报告指出,中国正通过开源 AI 策略构建自我强化的竞争优势。尽管美国在顶级基准测试中领先,但中国通过开源分发、价格优势和工业场景部署,正在绕过芯片出口管制,争夺全球开发者生态和工业数据主导权。这标志着中美 AI 竞争正从算力竞赛转向部署与生态之争。
A recent USCC report, Two Loops, suggests that China is building a self-reinforcing competitive advantage through open-source AI. While US closed-source models still lead in frontier benchmarks, China is competing through open-source distribution, price advantages, and industrial deployment, working around chip export controls and contesting the global developer ecosystem and industrial data. This marks a shift in US-China AI competition from the compute race toward deployment and ecosystems.
OpenClaw 是什么?一篇给新手的诚实介绍:它为什么会火、具体能做什么、有哪些门槛和风险、什么人适合试。
What is OpenClaw? An honest introduction for newcomers: why it took off, what it can actually do, what the barriers and risks are, and who should give it a try.
AI Agent 开始代替人类拿凭证、调 API、跑流程后,围绕 agent 身份和凭证的治理正在从附属功能变成被单独包装和销售的产品模块。
As AI agents begin retrieving credentials, calling APIs, and running workflows on behalf of humans, agent identity and credential governance is becoming a standalone product layer.
2026年3月起,中国软著登记要求申请人手抄承诺未使用AI开发代码或撰写文档,违者记入失信名单和个人征信。本文分析这条规则的治理目标、与司法实践的张力、对开发者的影响,以及与美欧日等国AI版权路径的对比。
Starting March 2026, China's software copyright registration requires applicants to hand-copy a pledge that they did not use AI to write code or draft documentation, with violations tied to a dishonesty blacklist and personal credit records. This article analyzes the rule's governance goals, its tension with judicial practice, its impact on developers, and how it compares with the AI copyright paths taken by the US, EU, and Japan.
腾讯最近做的,不是让 OpenClaw 管理微信,而是把微信接成 OpenClaw 和 QClaw 的官方控制入口。本文拆解 npm 包、iLink 开放性、腾讯的产品意图,以及对普通开发者的现实影响。
Tencent did not turn WeChat into a public bot platform. It turned WeChat into the official control surface for OpenClaw and QClaw. This article unpacks the npm package, iLink's openness, Tencent's product intent, and the practical impact on ordinary developers.
Anthropic 正在把 Claude Code subscription 定义为第一方产品权益,而不是可复用的开发者凭证。本文分析这条边界背后的产品逻辑,以及 CLI bridge、API/SDK 与多 provider 分工各自意味着什么。
Anthropic is defining Claude Code subscriptions as first-party product entitlements, not reusable developer credentials. This article explains the logic behind that boundary and what it means for CLI bridges, API/SDK integration, and multi-provider architectures.
MSA 不是长期记忆的终局方案,但它清楚提示了一件事:长期记忆正在从纯外部系统能力,进入模型内部机制与外部上下文引擎重新分工的阶段。
MSA has not solved long-term memory, but it signals a new division of labor: internal model mechanisms are beginning to share memory work with external context systems.
从 Composer 1 到 Composer 2 的技术演进线、Kimi K2.5 底座争议的证据链、Windsurf/SWE-1.5 的平行案例、RL 后训练有效性的研究支撑,以及许可与治理问题的边界分析。
The technical evolution from Composer 1 to 2, evidence chain for the Kimi K2.5 base model controversy, parallel cases from Windsurf/SWE-1.5, research backing RL post-training effectiveness, and licensing vs governance analysis.
从默契所有权、世界观锁定、构建者vs消费者三个维度,深度分析Claude Dispatch与OpenClaw的竞争逻辑,以及AI Agent平台分野的底层架构哲学。
Deep analysis of Claude Dispatch vs OpenClaw through rapport ownership, worldview lock-in, and builder vs consumer lenses, revealing the underlying architecture philosophy of the AI Agent platform split.
Moonshot AI's Kimi Team released a technical report on March 15, 2026, challenging a fundamental component of the Transformer architecture that has existed for nearly a decade and is used by every major model.
GTC 2026 深度分析:Token 工厂叙事的战略意图、安卓式开放生态策略、五个关键决策的逆向工程、三个反共识观点,以及对 Agentic AI 实践者的操作含义。
A deep analysis of GTC 2026: the strategic intent of the token factory narrative, the Android-style open-ecosystem strategy, reverse engineering of five key decisions, three contrarian takes, and the operational implications for Agentic AI practitioners.
十位AI实践者基于各自认知公理系统,对澳大利亚人用AI为狗设计mRNA癌症疫苗这一新闻的独立反应与深度分析。一次认知多样性的压力测试。
Ten AI practitioners, each reasoning from their own system of cognitive axioms, react independently to the news that an Australian used AI to design an mRNA cancer vaccine for his dog. A stress test of cognitive diversity.
Source: Jensen Huang GTC 2026 Keynote (2026-03-16, San Jose), multi-source cross-survey
We wanted to test something: given a set of facts, can we use each person's unique system of cognitive axioms to accurately simulate their reaction to the same event? Furthermore, how large is the gap between the simulated reactions and the real ones?
CLI-Anything 的核心资产 HARNESS.md 方法论拆解:7 阶段流水线、渲染鸿沟、滤镜翻译陷阱、输出验证方法论,以及开源前提条件的诚实评估。
A breakdown of HARNESS.md, the core asset of CLI-Anything: the 7-stage pipeline, the rendering gap, the filter-translation trap, the output-verification methodology, and an honest assessment of the preconditions for open-sourcing it.
Claude Interactive Visualizations 不是新能力,而是一次成本结构的级联压缩。它把 Builder 层级的观测能力下放到 Consumer 层级,代价是牺牲可验证性。深度分析 Anthropic 的设计哲学、竞品格局与视觉权威性幻觉风险。
Claude Interactive Visualizations is not a new capability but a cascading compression of cost structure: it pushes Builder-tier observability down to the Consumer tier, at the cost of verifiability. A deep analysis of Anthropic's design philosophy, the competitive landscape, and the risk of illusory visual authority.
Source: https://github.com/HKUDS/CLI-Anything
On March 12, 2026, Anthropic released "Custom Visuals in Chat" (official name) for Claude, allowing it to generate inline interactive charts, diagrams, and visualizations within conversations.
All three frontier model providers now offer 1M context windows, but benchmark data reveals massive reliability gaps. On MRCR v2 8-needle, Claude Opus 4.6 scores 76% at 1M while GPT-5.4 and Gemini 3 Pro score 36.6% and 24.5% respectively.
2026 年 3 月,三大前沿模型厂商终于都站到了 1M context window 的门槛上。本文横向对比 Google Gemini、Anthropic Claude、OpenAI 在长上下文能力上的实际表现,分析 1M 之后的真正差异在哪里。
深入分析 OpenAI Codex CLI 的架构设计,从 agent loop、sandbox 隔离、tool calling 到 streaming 实现,拆解一个生产级 AI agent 客户端的工程细节。
A deep dive into the architecture of OpenAI's Codex CLI, from the agent loop, sandbox isolation, and tool calling to the streaming implementation, unpacking the engineering details of a production-grade AI agent client.
> Core Sources: OpenAI "Unrolling the Codex agent loop" (Michael Bolin, 2026-01), OpenAI "Unlocking the Codex harness" (Celia Chen, 2026-02), The Pragmatic Engineer "How Codex is built" (Gergely Orosz)
阮一峰提出 AI 时代软件护城河从代码转向测试用例。本文从 Cloudflare 工程师复刻 Next.js 的 vinext 事件出发,分析这个论断的合理性与局限性。
Ruan Yifeng argues that in the AI era the software moat shifts from code to test cases. Starting from the vinext incident, in which Cloudflare engineers replicated Next.js, this article analyzes the claim's merits and its limits.
In a recent issue of his weekly newsletter, Ruan Yifeng made a striking claim: in the AI era, the moat for software will shift from code to test cases. His core argument starts from the vinext incident, in which Cloudflare engineers replicated Next.js.
Cursor 公开了内部评估体系 CursorBench。这不是学术 benchmark,而是从真实用户行为中提取的评估方法。本文深入分析其设计思路和对 AI coding 评估的启示。
Cursor has made its internal evaluation system, CursorBench, public. It is not an academic benchmark but an evaluation method distilled from real user behavior. This article analyzes its design and what it implies for AI coding evaluation.
从 OpenAI 的 Harness Engineering 到 Cursor 的 self-driving codebases,一个新的工程范式正在成型:人类的核心工作从写代码变成设计 AI agent 的工作环境。
From OpenAI's Harness Engineering to Cursor's self-driving codebases, a new engineering paradigm is taking shape: the core human job shifts from writing code to designing the working environment for AI agents.
The emergence of CursorBench has brought this question to the forefront. On March 11, 2026, Cursor published a blog post titled "How we compare model quality in Cursor," officially unveiling their internal evaluation system.
> Core Sources: OpenAI "Harness engineering" (2026-02-11), Cursor "Towards self-driving codebases" (2026-02-05), Cursor "Scaling long-running autonomous coding" (2026-01-14)
免费/个人版几乎都会用你的数据训练模型,企业版几乎都不会——但「几乎」二字里藏着关键差异。本文对比各家 AI coding 工具的数据政策和永久授权条款。
Free and personal tiers almost always train on your data; enterprise tiers almost never do. The key differences hide inside that "almost". This article compares the data policies and perpetual-license clauses of the major AI coding tools.
Survey Date: March 9, 2026 | Methodology: 5 parallel librarian agent groups + cross-verification
程序有很多性质从语义层面一眼就能看出来,但编译器要形式化证明需要复杂分析甚至根本无法证明。本文探讨用 LLM 为编译器提供语义提示以辅助优化的可能性。
Many program properties are obvious at a glance at the semantic level, yet proving them formally requires complex compiler analysis, or is outright impossible. This article explores having LLMs supply semantic hints to compilers to assist optimization.