Date: 2026-03-16
Source: https://github.com/HKUDS/CLI-Anything
Institution: Data Intelligence Lab, University of Hong Kong (HKUDS)
The core asset of CLI-Anything is HARNESS.md, an SOP document that teaches AI Agents how to systematically analyze the source code of GUI software and generate production-grade CLI wrappers. The project itself is a Claude Code plugin, but the real value lies in this methodology rather than the code.
CLI-Anything is a Claude Code plugin. After installation, executing `/cli-anything ./gimp` triggers the AI Agent to read the GIMP source code, follow the 7-phase pipeline defined in the HARNESS.md SOP, and ultimately produce a Python CLI package installable via `pip install`. The Agent can then control GIMP through structured commands like `cli-anything-gimp --json filter add --name gaussian_blur --radius 5`, without needing to simulate mouse clicks.
Once the generation is complete, CLI-Anything’s mission is over. The output is a standalone Python package that can be used by any Agent capable of executing shell commands.
The title of HARNESS.md is “Agent Harness: GUI-to-CLI for Open Source Software.” Note the qualifier: Open Source.
The SOP requires the Agent to do five things: identify the backend engine, map GUI operations to API calls, identify data models (file formats), find existing CLI tools, and catalog commands/undo systems.
The prerequisite for this step is that the Agent can read the source code. In practice, the Agent’s job is to find the boundary between the presentation layer and the logic layer in the source code. GIMP’s image processing core is GEGL, Shotcut’s video editing logic is in the MLT framework, and Blender exposes the bpy Python interface. The GUI is merely a shell for these backend capabilities.
HARNESS.md also specifically points out “Find existing CLI tools,” because many backends already come with their own CLIs: `melt`, `ffmpeg`, `convert`, `inkscape --actions`. These are ready-made building blocks.
Design command groupings (corresponding to the software’s logical domains), state models (what needs to be persisted and how to serialize it), and output formats (`--json` for Agents, tables for humans). A dual mode of REPL + subcommands is recommended.
There is a practical design decision here: the REPL mode exists because GUI software is naturally stateful (open projects, currently selected layers), whereas traditional CLIs are stateless. Cold-starting a LibreOffice process for every command execution would have unacceptable overhead. REPL solves this by maintaining a persistent process.
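The dual mode can be captured in a minimal stdlib sketch (the command names and `Session` fields are illustrative, not the project’s actual code; the real wrappers use Click): one dispatch function serves both one-shot subcommands and a persistent REPL, which is what lets state like the open project survive between commands.

```python
import json
import shlex

class Session:
    """State a GUI app keeps implicitly: the open project, the selection."""
    def __init__(self):
        self.project = None

def cmd_open(session, path):
    session.project = path
    return {"ok": True, "project": path}

def cmd_status(session):
    return {"project": session.project}

COMMANDS = {"open": cmd_open, "status": cmd_status}

def dispatch(session, line):
    """Shared entry point. Subcommand mode creates a fresh Session per
    process; REPL mode reuses one Session, so state persists without
    cold-starting the backend for every command."""
    parts = shlex.split(line)
    result = COMMANDS[parts[0]](session, *parts[1:])
    return json.dumps(result)  # --json style machine-readable output

if __name__ == "__main__":
    s = Session()
    dispatch(s, "open demo.mlt")   # REPL: state set here...
    print(dispatch(s, "status"))   # ...is still visible here
```

In subcommand mode the same `dispatch` would simply be called once per process with a fresh `Session`, trading statefulness for isolation.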
The implementation order prescribed by the SOP is: data layer first (XML/JSON for manipulating project files), then adding probe commands (info, list, status), followed by modification commands (one command per logical operation), and finally backend integration and rendering/export.
The actual backend integration code looks like this (imports and the `find_*` helpers added here for completeness; the original `convert` neither returned its result nor guarded against a `None` output directory):

```python
import shutil
import subprocess

def find_blender():
    return shutil.which("blender")

def find_gimp():
    return shutil.which("gimp")

def find_libreoffice():
    return shutil.which("libreoffice") or shutil.which("soffice")

# Blender: generate a bpy script and pass it to blender for execution
def render_script(script_path, timeout=300):
    blender = find_blender()
    cmd = [blender, "--background", "--python", script_path]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return {"returncode": result.returncode, "stdout": result.stdout, "stderr": result.stderr}

# GIMP: generate a Script-Fu command and pass it to gimp for execution
def batch_script_fu(script, timeout=120):
    gimp = find_gimp()
    cmd = [gimp, "-i", "-b", script, "-b", "(gimp-quit 0)"]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return {"returncode": result.returncode, "stdout": result.stdout, "stderr": result.stderr}

# LibreOffice: generate an ODF file and pass it to libreoffice for format conversion
def convert(input_path, output_format, output_dir=None, timeout=120):
    lo = find_libreoffice()
    cmd = [lo, "--headless", "--convert-to", output_format,
           "--outdir", output_dir or ".", input_path]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return {"returncode": result.returncode, "stdout": result.stdout, "stderr": result.stderr}
```

The pattern is identical: `shutil.which()` finds the executable → generate intermediate products (scripts/files) → `subprocess.run()` calls the actual software → parse the output. There are no direct Python API imports or FFI calls to shared libraries; it’s all inter-process communication.
Testing is divided into four levels: unit tests (synthetic data), end-to-end tests (verifying intermediate file structures), real backend tests (calling the actual software to produce PDF/MP4 and verifying them), and CLI subprocess tests (simulating how an Agent would use it).
The SOP explicitly requires “No graceful degradation”: if the actual software is not installed, the test fails immediately, without skipping or downgrading. This is a deliberate design decision.
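The rule fits in a few lines (the helper name is mine, not from HARNESS.md): a hard assertion instead of a test-framework skip, so a missing backend surfaces as a test failure.

```python
import shutil

def require_backend(name):
    """'No graceful degradation': if the real software is not installed,
    fail immediately rather than skipping or downgrading the test."""
    path = shutil.which(name)
    assert path is not None, f"required backend '{name}' is not installed"
    return path

# A real-backend test would begin with, e.g.:
# blender = require_backend("blender")
```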
Releases use PEP 420 namespace packages, allowing packages like `cli_anything.gimp` and `cli_anything.blender` to be installed independently without conflict.
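A sketch of the resulting layout (the distribution names are assumptions; only `cli_anything.gimp` and `cli_anything.blender` appear in the article). The shared `cli_anything/` directory ships no `__init__.py`, which is exactly what makes it a PEP 420 namespace package:

```
cli-anything-gimp/
└── cli_anything/        # no __init__.py → PEP 420 namespace package
    └── gimp/
        ├── __init__.py
        └── cli.py

cli-anything-blender/
└── cli_anything/        # same namespace, separate distribution
    └── blender/
        ├── __init__.py
        └── cli.py
```

Because neither distribution owns `cli_anything/__init__.py`, pip can install both into the same site-packages without a file conflict.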
This is the most valuable part of HARNESS.md—lessons learned from real-world experience, not just generic best practices.
“Use the Real Software — Don’t Reimplement It”
The SOP explicitly warns: do not use Pillow to reimplement GIMP’s image composition, and do not generate bpy scripts without ever calling Blender. Such approaches produce toys that cannot handle real workloads. The correct way is to generate intermediate files and let the actual software render them.
The underlying logic of this rule is that the rendering engines of professional software have been refined over decades; rewriting an equivalent in Python is neither realistic nor necessary. The positioning of a CLI is as a “structured remote control,” not a “replacement.”
“The Rendering Gap”
This is the most profound technical insight in the document. The problem is this: an Agent modifies a project file (e.g., adding a brightness filter in an MLT XML), but if a simple tool (like ffmpeg concat demuxer) is used for rendering, it reads the raw media files and completely ignores the project-level effect definitions. The output is identical to the input; the Agent thinks the operation succeeded, but in reality, nothing happened.
The solution is to establish a “filter translation layer.” The first choice is to use the software’s native renderer (`melt` directly reading MLT project files); the second choice is to translate the project format’s effect parameters into the rendering tool’s native syntax (MLT filter → `ffmpeg -filter_complex`); and the final fallback is to generate a rendering script for manual execution.
This pattern has broad parallels in software engineering: Terraform state files vs. actual cloud resources, Virtual DOM vs. browser rendering, AST vs. machine code. The commonality is that modifying the intermediate representation (IR) does not equal a change in the final product; there must be a materialization step in between.
Specific Pitfalls in Filter Translation
HARNESS.md lists several pitfalls encountered in practice: ffmpeg does not allow the same type of filter to appear twice in the same filter chain (brightness + saturation both map to `eq=`, so they must be merged into one `eq=brightness=X:saturation=Y`); the ffmpeg concat filter requires an interleaved stream order (`[v0][a0][v1][a1]`, not `[v0][v1][a0][a1]`); and the value ranges of effect parameters vary between tools (MLT brightness 1.15 means +15%, while ffmpeg uses `eq=brightness=0.06` on a −1..1 scale).
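The merge rule can be sketched in a few lines (the function name and the set of `eq` parameters are illustrative, not the project’s code): collect every parameter that maps to ffmpeg’s `eq` filter and emit a single instance instead of chaining duplicates.

```python
def merge_eq_params(effects):
    """Per HARNESS.md, brightness/saturation both map to ffmpeg's `eq`
    filter and must be merged into one instance, not chained twice."""
    eq_parts = []
    other = []
    for name, value in effects:
        if name in ("brightness", "contrast", "saturation", "gamma"):
            eq_parts.append(f"{name}={value}")
        else:
            other.append(f"{name}={value}")  # illustrative passthrough
    chain = []
    if eq_parts:
        chain.append("eq=" + ":".join(eq_parts))
    chain.extend(other)
    return ",".join(chain)

# brightness + saturation collapse into one eq= filter:
merge_eq_params([("brightness", 0.06), ("saturation", 1.2)])
# → "eq=brightness=0.06:saturation=1.2"
```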
Output Verification Methodology
“Never trust that export worked because it exited 0.” A normal exit code does not mean the export succeeded. You must verify the artifact itself: are the magic bytes correct? Is the ZIP/OOXML structure intact? Sample video frames to check that brightness/color matches expectations; check audio RMS levels.
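The first two checks need only the standard library (the helper names are mine; the magic-byte values are the standard ones): PDFs start with `%PDF`, and ODF/OOXML documents are ZIP containers.

```python
import zipfile

def looks_like_pdf(path):
    """A PDF must begin with the magic bytes b'%PDF'."""
    with open(path, "rb") as f:
        return f.read(4) == b"%PDF"

def valid_office_zip(path):
    """ODF/OOXML exports are ZIP archives; a truncated export fails
    here even though the converting process exited 0."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as z:
        return z.testzip() is None  # None → no corrupt members
```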
This experience is particularly important for non-integer frame rates: at 29.97 fps, using `int()` for frame-number conversion leads to cumulative errors; `round()` must be used, and tests should tolerate an error of ±1 frame.
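The arithmetic is easy to check at a 10-second timestamp: 10 × 29.97 = 299.7, so `int()` truncates to frame 299 while `round()` gives 300.

```python
FPS = 29.97  # NTSC non-integer frame rate

def to_frame(seconds):
    # int() truncates toward zero, so the error grows with the timestamp;
    # round() keeps every conversion within half a frame.
    return round(seconds * FPS)

assert to_frame(10.0) == 300   # round(): correct frame
assert int(10.0 * FPS) == 299  # int(): already one frame short
```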
A sharp question: every piece of software listed in HARNESS.md already has a public programming interface. Blender has bpy, GIMP has Script-Fu, LibreOffice has UNO and headless mode, Inkscape has `--actions`, and OBS has a WebSocket API. What does CLI-Anything add on top of these existing interfaces?
The answer is standardization. The native interfaces of these software packages vary wildly: bpy is Python OOP, Script-Fu is a Lisp dialect, UNO is CORBA-style IDL, and OBS WebSocket is JSON-RPC. Mastering all of these paradigms places a heavy burden on an Agent’s context window. CLI-Anything unifies them into a single interaction protocol: Click CLI + `--json` + `--help` + REPL. Once an Agent learns one pattern, it can operate all the software.
This value is real, but it is also limited. It solves the problem of “interface heterogeneity,” not the problem of “interface non-existence.”
From a code perspective, the operation of CLI-Anything requires three prerequisites to be met simultaneously: readable (open) source code, a public headless or scripting interface in the backend, and an editable, documented project/file format. If any of these three conditions is missing, it won’t work.
This leads to an interesting engineering trade-off: if the software itself is open-source and has good programming interfaces, why not just submit a PR upstream to add an official agent-friendly CLI layer?
Resistance to the PR route comes from several factors. Large open-source projects (GIMP with 6 million lines, Blender with over 2 million, LibreOffice with over 10 million) have their own governance structures and priorities. Adding a complete set of CLI interfaces to such a project involves a review scope far beyond a typical PR. Maintainers also have to consider long-term maintenance responsibility, integration with existing architecture, and whether “serving AI Agents” aligns with the project’s direction. In fact, the Blender community is very strict even about changes to the Python API.
CLI-Anything’s external wrapping route bypasses these issues. There’s no need to convince any maintainer, wait for reviews, or follow upstream coding standards. The cost is bearing the maintenance yourself: if the upstream API changes, the wrapper might break. However, since the wrapper is AI-generated, running the pipeline again should theoretically update it.
Another key engineering choice: CLI-Anything relies on the most stable layer of the software. Not the GUI code (which changes frequently), not the internal APIs (which may be unstable), but the CLI interfaces of the backend engines (`libreoffice --headless`, `blender --background --python`, `melt`). These interfaces are the software’s stable contracts with the outside world and change much more slowly than internal APIs. Thus, the actual maintenance burden of the wrapper is lighter than one might imagine.
The trade-offs between the three paths can be summarized as follows:
| Dimension | Upstream PR | Manual External Wrapping | CLI-Anything Auto-generation |
|---|---|---|---|
| Initial Cost | High (code + communication) | Medium | Low (one command) |
| Quality Ceiling | Highest (official maintenance) | Depends on the engineer | Depends on AI understanding |
| Maintenance Cost | Zero (upstream responsible) | High (manual updates) | Low (re-generation) |
| Deployment Speed | Months to years | Weeks | Minutes to hours |
| Political Resistance | High | Zero | Zero |
| Coverage | Depends on review | Controllable | Depends on AI understanding |
The title of HARNESS.md says “for Open Source Software,” and that’s not modesty; it’s the truth.
For closed-source software like WeChat, Xiaohongshu, or Photoshop, none of the three prerequisites are met: no readable source code (Phase 1 fails), no public headless/scripting interface (Phase 3 fails), and the file formats are proprietary binary (data layer fails). Not to mention that such platforms often have active anti-automation mechanisms.
There is a Zoom example in the project repository, which works by directly wrapping Zoom’s REST API (using the `requests` library to call `api.zoom.us/v2/`). This works, but only because Zoom has a public REST API. WeChat and Xiaohongshu have no such public third-party APIs, so this route is also a dead end.
Therefore, the slogan “Make ALL Software Agent-Native” is marketing speak; a more accurate statement would be “Make Open-Source Software with Existing Programmatic Interfaces Agent-Native.” The scope is much narrower, but within that scope, it is indeed effective.
Modal Dialog Hangs. When GUI software in headless mode encounters an exception (missing fonts, corrupted files), it sometimes pops up a hidden modal dialog waiting for a user click. Since there is no GUI, the process hangs indefinitely. Although HARNESS.md sets a timeout for `subprocess.run()`, it does not provide more refined strategies (such as stderr heuristic detection, watchdog processes, or IPC timeout mechanisms).
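The coarse defense HARNESS.md does prescribe looks roughly like this (the wrapper is illustrative): with a bounded `subprocess.run()`, a hidden modal dialog surfaces as `TimeoutExpired` instead of an indefinite hang.

```python
import subprocess

def run_bounded(cmd, timeout):
    """Bound every backend invocation. subprocess.run() kills the child
    process on timeout before raising TimeoutExpired."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout)
        return {"timed_out": False, "returncode": result.returncode}
    except subprocess.TimeoutExpired:
        # A more refined strategy would add stderr heuristics or a
        # watchdog process; HARNESS.md stops at the timeout itself.
        return {"timed_out": True, "returncode": None}
```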
State Synchronization. If an Agent is manipulating a project through a CLI REPL while a human user simultaneously opens the same file in the GUI, what happens? HARNESS.md does not discuss file locking, two-way state synchronization, or conflict resolution.
Version Drift. The GUI→API mapping may break when the software is upgraded. What if the `--actions` syntax changes after an Inkscape upgrade? HARNESS.md lacks a regression-testing strategy to detect breaking changes in the underlying software.
As “instructions for AI to read,” HARNESS.md’s approach of forcing phased execution (analyze first, then design, then implement) is effective: it prevents the AI from skipping thinking and jumping straight to code, reducing API hallucinations. However, Phase 1’s requirement for the Agent to “scan source code” to map GUI→API is impossible for a codebase like Blender’s with millions of lines. The actual effectiveness depends heavily on the Agent’s code search tools (grep, AST analysis) and the quality of project documentation. If the codebase lacks good documentation or has unclear naming, the AI will hallucinate non-existent APIs.
Setting aside the CLI-Anything branding, HARNESS.md offers several lessons worth remembering for any engineer doing AI-tool integration:
- Output machine-readable results via a `--json` flag.
- Prefer `--headless` CLI interfaces over internal APIs, and choose public file formats over proprietary data structures. This minimizes maintenance costs.

The 15.8k stars for the CLI-Anything project mainly stem from the appealing narrative of “making any software Agent-Native with one command.” But looking deeper, its applicability is much narrower than the marketing suggests: it only works for software that is open-source, has existing programming interfaces, and has editable file formats. Its core work is standardized wrapping, not capability discovery.
That said, HARNESS.md itself is a high-quality engineering document. It turns the task of “how to systematically generate a CLI interface for GUI software” from an experience-based craft into a standardized process that can be handed over to an AI. The summaries of experience regarding the rendering gap, filter translation, and output verification are distilled from real-world pitfalls and are valuable for any engineer doing AI-tool integration.
The project’s greatest contribution is not the code, but this methodology itself.