A file-based work-bus for orchestrating a fleet of agent CLIs — coordination without a message broker

To coordinate a fleet of independent AI agent CLIs without a message broker or a heavy framework, use a filesystem work-bus: the orchestrator decomposes a goal into a graph of subtasks, writes a Task file per subtask, and polls for the Result file each worker writes back — every file written atomically. The durable coordination state lives on disk as files: language-agnostic, debuggable with ls, surviving restarts, and self-healing because an absent worker is skipped and logged instead of failing the run.

← hexisteme · notes · June 21, 2026 · 8 min read

Say you have several AI agents, each an independent installed CLI — one gathers information, one writes copy, one builds an app scaffold — and you want to run a goal that needs several of them in sequence. The heavyweight answers are an in-process framework (LangGraph, an AutoGPT-style loop) or a message broker (Redis, Kafka, RabbitMQ). Both are more than a single-operator fleet needs: a framework couples your workers into one process and one language, and a broker is infrastructure you now have to run, secure, and monitor.

There's a lighter primitive that fits this shape: a work-bus made of files.

The mechanism

A conductor process owns a shared directory — the bus. To run a goal:

  1. Decompose the goal into a directed acyclic graph of subtasks (e.g. gather → narrate → build).
  2. For each ready subtask, write a Task file into the bus, tagged with the capability it needs.
  3. Poll for the matching Result file, with a short backoff.
  4. Absorb each result, validate it, and release the next subtasks in the graph.
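The release order in steps 1 to 4 can be sketched with the standard library's `graphlib`. This is a minimal illustration, not the conductor's actual code: the `dag` dict and the `run_order` helper are hypothetical names, and the publish/poll/absorb work is reduced to a comment.

```python
from graphlib import TopologicalSorter

# Hypothetical goal graph: each key maps to the set of subtasks it depends on.
dag = {"gather": set(), "narrate": {"gather"}, "build": {"narrate"}}

def run_order(dag):
    """Yield batches of subtasks whose dependencies have all completed."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()                        # raises CycleError if the graph has a cycle
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # step 2: these can be written as Task files now
        yield ready
        for node in ready:                  # steps 3 and 4 would poll and absorb here
            sorter.done(node)               # marking a node done releases its dependents

batches = list(run_order(dag))
```

Each batch holds the subtasks that are ready at the same time, so a diamond-shaped graph would release its two middle nodes together.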

The one rule that makes this safe is atomic writes: write each record to a temp path in the same directory and rename it into place. Rename is atomic on POSIX filesystems as long as the temp path and the destination are on the same filesystem, so a reader either sees the whole file or nothing, never a half-written record. Task and Result are typed records (a small pydantic schema), and the conductor keeps a registry of what's in flight.

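# atomic publish: a reader never sees a partial record
from pathlib import Path

def publish(path: Path, record) -> None:
    tmp = path.with_suffix(".tmp")             # same directory, so same filesystem
    tmp.write_text(record.model_dump_json())   # record is a pydantic model
    tmp.replace(path)                          # atomic on POSIX; overwrites an existing target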

# the conductor loop
for task in topo_order(dag):
    publish(bus / f"{task.id}.task.json", task)
    result = poll(bus / f"{task.id}.result.json", backoff=...)   # durable: waits for the file
    absorb(result)
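The loop above leaves `poll` undefined. One possible shape, as a sketch: check for the file, sleep, and lengthen the sleep up to a cap. The parameter names (`initial`, `factor`, `max_delay`, `timeout`) and the timeout behavior are my assumptions here, not something the conductor is known to use.

```python
import json
import time
from pathlib import Path

def poll(path, initial=0.05, factor=2.0, max_delay=2.0, timeout=30.0):
    """Wait for a result file to appear, sleeping longer between checks each time."""
    deadline = time.monotonic() + timeout
    delay = initial
    while True:
        if path.exists():
            # Safe to read in full: the writer renamed a complete file into place.
            return json.loads(path.read_text())
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no result at {path} after {timeout}s")
        time.sleep(delay)
        delay = min(delay * factor, max_delay)   # exponential backoff with a ceiling
```

The cap matters for second-to-minute tasks: without it, a long-running worker would push the delay so high that the conductor notices the result much later than it arrives.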

This is state, not events — by design

It's fair to ask whether a file work-bus is just an event bus in disguise. It isn't, and the distinction is the reason it works. An event bus is push: producers emit ephemeral events, and anything not listening at that instant misses them. A file work-bus is state: the Task and Result records are durable files that stay until consumed. A worker that starts late, or restarts mid-run, still finds its task waiting. (I argued the same principle for monitoring a fleet in "state is truth, events are rumors"; here it shows up again for coordinating one.)
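From the worker's side, "still finds its task waiting" might look like the sketch below: on startup, scan the bus for tasks that match your capability and have no result yet. The `capability` and `id` field names and both helper functions are hypothetical; only the `.task.json` / `.result.json` naming comes from the conductor loop above.

```python
import json
import os
from pathlib import Path

def pending_tasks(bus, capability):
    """Return task records for `capability` that have no result file yet."""
    pending = []
    for task_path in sorted(bus.glob("*.task.json")):
        task = json.loads(task_path.read_text())
        task_id = task_path.name[: -len(".task.json")]
        result_path = bus / f"{task_id}.result.json"
        if task.get("capability") == capability and not result_path.exists():
            pending.append(task)
    return pending

def write_result(bus, task_id, payload):
    """Publish a result atomically: temp file first, then rename into place."""
    final = bus / f"{task_id}.result.json"
    tmp = bus / f"{task_id}.result.tmp"
    tmp.write_text(json.dumps(payload))
    os.replace(tmp, final)
```

Nothing here depends on when the worker started. A worker that was down when the task was written sees exactly the same directory listing as one that was already running.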

Why workers can push state here, but monitoring has to pull it. You build and control these workers, so you can make them read and write the bus themselves. Monitoring is the opposite case: you watch components you don't control, so you pull their state instead. Coordination of owned workers via durable files, monitoring of unowned components via state scans: both lean on durable state over ephemeral events.

Routing by capability, not by name

The conductor doesn't hard-wire "send step 2 to worker X." Each worker advertises capabilities; each subtask declares the capability it needs; the conductor matches them at dispatch time by finding a healthy worker that advertises the required capability. Add or remove a worker and routing adapts — there's no wiring diagram to edit. This is what lets one conductor coordinate a heterogeneous, changing fleet through a single uniform contract.
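A minimal sketch of that matching step, assuming a worker is just a name, a set of advertised capabilities, and a health flag. The `Worker` class and `route` function are illustrative names, and "first healthy match wins" is one possible policy among several.

```python
from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    capabilities: set = field(default_factory=set)
    healthy: bool = True

def route(required, workers):
    """Return the first healthy worker advertising `required`, or None."""
    for worker in workers:
        if worker.healthy and required in worker.capabilities:
            return worker
    return None   # no match: the caller decides what to do about the gap
```

Because the lookup runs at dispatch time against whatever the fleet currently advertises, adding a worker is just appending to the list the conductor consults.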

Graceful degradation: skip the absent worker

The most important behavior for a fleet that's still being built: a missing worker must not fail the run. If a subtask needs a capability no healthy worker currently advertises, the conductor marks that node skipped (a logged worker_absent), continues the rest of the graph, and synthesizes from whatever completed. On day one, when most workers don't exist yet, the conductor still runs end-to-end and produces partial output — and the skip log is a precise to-do list of which capabilities to build next. A gap is reported, not crashed on.
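As a sketch, the skip behavior is a few lines in the run loop. Everything named here is hypothetical (`run_graph`, `needs`, `available`, `execute`), and I'm assuming that nodes downstream of a skipped one still run with whatever inputs completed, which is one reading of "continues the rest of the graph".

```python
def run_graph(order, needs, available, execute):
    """Run nodes in dependency order; skip any whose capability nobody advertises."""
    completed, skipped = {}, []
    for node in order:
        capability = needs[node]
        if capability not in available:
            # Record the gap instead of raising: this list is the build-next log.
            skipped.append({"node": node, "reason": "worker_absent", "capability": capability})
            continue
        completed[node] = execute(node, completed)   # sees only what actually finished
    return completed, skipped
```

The return value carries both halves: partial output to synthesize from, and a list of the capabilities that were missing.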

Trust the bus like a network boundary

Worker output is untrusted input crossing a boundary, and the bus treats it that way. Every result is parsed into a strict schema before absorption; mismatches (say, casing differences between the wire format and internal enums) are coerced and normalized at the seam. Load-bearing claims carry a provenance label and must include evidence — a claim that arrives marked FACT with no evidence IDs is rejected at parse time, not trusted. The typed contract is what lets independent workers, written in different languages by you at different times, interoperate without the conductor having to trust any of them blindly.
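Here is roughly what that parse-time check could look like. The records described above are pydantic models; this sketch swaps in plain dataclasses so it runs with no dependencies. The field names (`text`, `provenance`, `evidence_ids`) and the `INFERENCE` label are assumptions for illustration; only `FACT` and the evidence rule come from the text.

```python
from dataclasses import dataclass
from enum import Enum

class Provenance(Enum):
    FACT = "fact"
    INFERENCE = "inference"   # hypothetical second label

@dataclass(frozen=True)
class Claim:
    text: str
    provenance: Provenance
    evidence_ids: tuple

def parse_claim(raw):
    """Validate one untrusted claim dict; raise ValueError on any violation."""
    if not isinstance(raw, dict):
        raise ValueError("claim must be an object")
    text = raw.get("text")
    if not isinstance(text, str) or not text.strip():
        raise ValueError("claim text must be a non-empty string")
    label = raw.get("provenance")
    if not isinstance(label, str):
        raise ValueError("provenance must be a string")
    try:
        provenance = Provenance(label.strip().lower())   # normalize wire casing at the seam
    except ValueError:
        raise ValueError(f"unknown provenance label: {label!r}") from None
    evidence = raw.get("evidence_ids", [])
    if not isinstance(evidence, list) or not all(isinstance(e, str) for e in evidence):
        raise ValueError("evidence_ids must be a list of strings")
    if provenance is Provenance.FACT and not evidence:
        raise ValueError("FACT claim rejected: no evidence IDs")
    return Claim(text=text, provenance=provenance, evidence_ids=tuple(evidence))
```

Casing is the only thing coerced. A wrong type, an unknown label, or a `FACT` without evidence fails before the conductor absorbs anything.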

The honest limitation

No stop condition on re-routing. Capability-based routing has a sharp edge: if a node can be re-routed to "any worker advertising capability C" and results keep failing validation, a naive conductor can re-route in an unbounded loop. A file work-bus needs an explicit per-node attempt budget (and a dead-letter outcome) or it can spin. Durability and decoupling are the wins; a bounded retry policy is the cost you must pay to claim them safely.
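A bounded version might look like the sketch below. The names (`run_with_budget`, `attempt`, `validate`), the default of three attempts, and the round-robin choice of the next worker are all assumptions; the point is only that the loop has a fixed ceiling and a distinct dead-letter outcome.

```python
def run_with_budget(node, candidates, attempt, validate, max_attempts=3):
    """Try candidate workers in turn until one result validates or the budget runs out."""
    failures = []
    for attempt_no in range(max_attempts):
        if not candidates:
            break                                            # nobody to route to
        worker = candidates[attempt_no % len(candidates)]    # re-route round-robin
        result = attempt(node, worker)
        if validate(result):
            return {"node": node, "status": "ok", "result": result, "attempts": attempt_no + 1}
        failures.append({"worker": worker, "result": result})
    # Budget exhausted: park the node with its failure history instead of spinning.
    return {"node": node, "status": "dead_letter", "attempts": len(failures), "failures": failures}
```

Keeping the failed results in the dead-letter record means you can inspect what each worker returned after the run, rather than only learning that the node gave up.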

When to reach for a real broker

This pattern fits a small, heterogeneous fleet running tasks that take seconds to minutes, coordinated by one operator. If you need high-throughput, low-latency fan-out across many producers and consumers, run a real message bus — the file-bus's polling and single-conductor model won't keep up. Match the tool to the failure that hurts: for a solo fleet, the pain is operational overhead and brittle coupling, and a directory of atomic files removes both.

FAQ

Q. How do I orchestrate multiple agent CLIs without LangGraph or a broker?
A filesystem work-bus: decompose into a subtask graph, write atomic Task files, poll for Result files. Durable state on disk — language-agnostic, debuggable, restart-safe, no broker to run.

Q. Isn't this just an event bus, and isn't polling slow?
No — files are durable state, not ephemeral events, so a late/restarted worker still finds its task. Poll-with-backoff latency is negligible for second-to-minute tasks, and you avoid running a broker.

Q. How does it route tasks?
By capability, not worker name. Workers advertise capabilities; subtasks declare a required capability; the conductor matches at dispatch. Fleets can change without editing wiring.

Q. What if a needed worker is missing?
Skip and log it (worker_absent), continue the graph, synthesize from what completed. Graceful degradation, with the skip log as a build-next list.

CC-BY 4.0