How do I orchestrate multiple AI agent CLIs without LangGraph or a message broker?

Use a filesystem work-bus. The orchestrator decomposes a goal into a directed graph of subtasks, writes a Task file per subtask into a shared directory, and polls for the corresponding Result file each worker writes back. Each file is written atomically (write to a temp path, then rename) so a reader never sees a half-written record. The durable coordination state is just files on disk: language-agnostic (any worker in any language reads and writes JSON), debuggable with ls and cat, and able to survive a restart of the orchestrator or any worker. You get most of what a broker gives you — durable hand-off, decoupling — with no broker to run or monitor.

Is a file-based work-bus the same as an event bus, and isn't polling slow?

No — and that difference is the point. An event bus is push: producers emit ephemeral events, and anything that isn't listening at that instant misses them. A file work-bus is state: the Task and Result records are durable files that stay until consumed, so a worker that starts late or restarts still finds its work. Polling with a short backoff is the trade-off you make for that durability and simplicity; for a fleet coordinating tasks that take seconds to minutes, poll latency is negligible and you avoid running a broker. Reserve a real message bus for high-throughput, low-latency fan-out where you own and can instrument every producer.

How does the orchestrator route a task to the right agent?

By capability, not by hard-coded worker name. Each worker registers the capabilities it advertises; the orchestrator decomposes the goal into subtasks tagged with a required capability and, for each, finds a healthy worker that advertises it. This keeps the orchestrator decoupled from the specific fleet: add or remove a worker and routing adapts, because nodes are matched to capabilities rather than to a fixed wiring diagram. It also lets the orchestrator coordinate a heterogeneous fleet — different workers, different languages — through one uniform contract.

What happens when a worker the plan needs is missing or down?

Skip it and log it — never fail the whole run. If a subtask requires a capability no healthy worker currently advertises, the orchestrator marks that node skipped (worker_absent) and records it, then continues with the rest of the graph and synthesizes from what did complete. This is graceful degradation: on day one, when most workers don't exist yet, the orchestrator still runs and produces partial output, and the skip log tells you exactly which capabilities to build next. A missing worker is a gap to report, not a crash.

How do you trust the results that workers write back to the bus?

Put a typed contract on the bus and validate at the boundary. Every result a worker writes is parsed into a strict schema before the orchestrator absorbs it; untrusted fields are coerced and normalized (for example, casing differences between the bus wire format and internal enums). Load-bearing claims carry a provenance label and must include evidence — a claim marked FACT without evidence IDs is rejected at parse time rather than trusted. The orchestrator treats worker output as untrusted input crossing a boundary, exactly as it would treat data from the network.

A file-based work-bus for orchestrating a fleet of agent CLIs — coordination without a message broker

To coordinate a fleet of independent AI agent CLIs without a message broker or a heavy framework, use a filesystem work-bus: the orchestrator decomposes a goal into a graph of subtasks, writes a Task file per subtask, and polls for the Result file each worker writes back — every file written atomically. The durable coordination state lives on disk as files: language-agnostic, debuggable with ls, surviving restarts, and self-healing because an absent worker is skipped and logged instead of failing the run.

Say you have several AI agents, each an independent installed CLI — one gathers information, one writes copy, one builds an app scaffold — and you want to run a goal that needs several of them in sequence. The heavyweight answers are an in-process framework (LangGraph, an AutoGPT-style loop) or a message broker (Redis, Kafka, RabbitMQ). Both are more than a single-operator fleet needs: a framework couples your workers into one process and one language, and a broker is infrastructure you now have to run, secure, and monitor.

The mechanism

The one rule that makes this safe is atomic writes: write each record to a temp path in the same directory and rename it into place. Rename is atomic on POSIX filesystems when both paths are on the same filesystem, so a reader either sees the whole file or nothing, never a half-written record. Task and Result are typed records (a small pydantic schema), and the conductor (the orchestrator process) keeps a registry of what's in flight.
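The temp-then-rename rule can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name write_atomic, the ".tmp-" prefix, and the fsync before the rename are assumptions added for the example.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def write_atomic(path: Path, record: dict) -> None:
    """Write `record` as JSON so a reader sees the whole file or no file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the destination directory: a rename is only
    # atomic when source and target are on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(record, f)
            f.flush()
            os.fsync(f.fileno())  # flush bytes to disk before publishing the file
        os.replace(tmp_name, path)  # atomic rename into place
    except BaseException:
        # On any failure, remove the temp file so it never looks like a record.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

A reader that lists the directory should ignore names starting with ".tmp-" (a convention assumed here), since those are records still being written.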

This is state, not events — by design

It's fair to ask whether a file work-bus is just an event bus in disguise. It isn't, and the distinction is the reason it works. An event bus is push: producers emit ephemeral events, and anything not listening at that instant misses them. A file work-bus is state: the Task and Result records are durable files that stay until consumed. A worker that starts late, or restarts mid-run, still finds its task waiting. (I argued the same principle for monitoring a fleet in "state is truth, events are rumors"; here it shows up again for coordinating one.)
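Because the record is state rather than an event, the reading side can be a plain poll loop. The sketch below shows one way the conductor might wait for a Result file with exponential backoff; the function name, the delay values, and the timeout are all illustrative assumptions.

```python
import json
import time
from pathlib import Path
from typing import Optional


def wait_for_result(result_path: Path, timeout_s: float = 300.0,
                    first_delay_s: float = 0.1,
                    max_delay_s: float = 5.0) -> Optional[dict]:
    """Poll for a Result file, doubling the delay between checks.

    Returns the parsed record, or None if the timeout passes first.
    """
    deadline = time.monotonic() + timeout_s
    delay = first_delay_s
    while True:
        try:
            # Reading in one go is safe only because writers rename
            # complete files into place.
            return json.loads(result_path.read_text(encoding="utf-8"))
        except FileNotFoundError:
            pass  # not written yet; keep waiting
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None
        time.sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay_s)
```

The same loop works whether the file appears a second from now or was already there before the poller started, which is the practical difference from listening for an event.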

Why workers can write their own state here, but monitoring has to pull. You build and control these workers, so you can make them read and write the bus. Monitoring is the opposite case: you watch components you don't control, so you pull their state instead. Coordinating owned workers via durable files and monitoring unowned components via state scans both lean on durable state over ephemeral events.

Routing by capability, not by name

The conductor doesn't hard-wire "send step 2 to worker X." Each worker advertises capabilities; each subtask declares the capability it needs; the conductor matches them at dispatch time by finding a healthy worker that advertises the required capability. Add or remove a worker and routing adapts — there's no wiring diagram to edit. This is what lets one conductor coordinate a heterogeneous, changing fleet through a single uniform contract.
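Matching at dispatch time can be as small as a lookup over the registry. This is a sketch under assumed names: the Worker record, its healthy flag, and the first-match policy are illustrative, and the source does not say how ties between several matching workers are broken.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Worker:
    name: str
    capabilities: frozenset[str]
    healthy: bool = True


def route(required_capability: str, registry: list[Worker]) -> Optional[Worker]:
    """Return the first healthy worker advertising the capability, else None."""
    for worker in registry:
        if worker.healthy and required_capability in worker.capabilities:
            return worker
    return None


# Hypothetical fleet: names and capabilities are made up for the example.
registry = [
    Worker("gatherer", frozenset({"research"})),
    Worker("writer-a", frozenset({"copywriting"}), healthy=False),
    Worker("writer-b", frozenset({"copywriting", "research"})),
]
```

Here route("copywriting", registry) picks writer-b because writer-a is unhealthy, and route("scaffold", registry) returns None, which is the signal the conductor uses to skip the node.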

Graceful degradation: skip the absent worker

The most important behavior for a fleet that's still being built: a missing worker must not fail the run. If a subtask needs a capability no healthy worker currently advertises, the conductor marks that node skipped (a logged worker_absent), continues the rest of the graph, and synthesizes from whatever completed. On day one, when most workers don't exist yet, the conductor still runs end-to-end and produces partial output — and the skip log is a precise to-do list of which capabilities to build next. A gap is reported, not crashed on.
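The skip-and-continue behavior can be sketched as a single pass over the graph. Two things here are assumptions, not statements from the design above: nodes are supplied in dependency order, and a node downstream of a skipped node is itself skipped with a hypothetical upstream_skipped reason (the text only specifies worker_absent).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    capability: str
    depends_on: list[str] = field(default_factory=list)


def plan_dispatch(nodes: list[Node], available: set[str]) -> dict[str, str]:
    """Map each node to 'dispatch' or a skip reason; never raise on a gap.

    `nodes` must be listed in dependency order; `available` is the set of
    capabilities some healthy worker currently advertises.
    """
    outcome: dict[str, str] = {}
    for node in nodes:
        if node.capability not in available:
            outcome[node.node_id] = "skipped:worker_absent"
        elif any(outcome.get(dep) != "dispatch" for dep in node.depends_on):
            # Assumption: a node cannot run without its upstream results.
            outcome[node.node_id] = "skipped:upstream_skipped"
        else:
            outcome[node.node_id] = "dispatch"
    return outcome


def missing_capabilities(nodes: list[Node], outcome: dict[str, str]) -> list[str]:
    """The build-next list: capabilities that caused a worker_absent skip."""
    return sorted({n.capability for n in nodes
                   if outcome[n.node_id] == "skipped:worker_absent"})
```

The second function turns the skip log into the to-do list: it returns only the capabilities that no worker advertised, not the nodes that were skipped as a consequence.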

Trust the bus like a network boundary

Worker output is untrusted input crossing a boundary, and the bus treats it that way. Every result is parsed into a strict schema before absorption; mismatches (say, casing differences between the wire format and internal enums) are coerced and normalized at the seam. Load-bearing claims carry a provenance label and must include evidence: a claim that arrives marked FACT with no evidence IDs is rejected at parse time, not trusted. The typed contract is what lets independent workers, written in different languages at different times, interoperate without the conductor having to trust any of them blindly.
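The boundary check might look like the following. The original schema is pydantic; this sketch swaps in stdlib dataclasses and hand-written checks to stay dependency-free, so it illustrates the rule rather than the actual models. The field names and the INFERENCE label are hypothetical; only FACT is named above.

```python
from dataclasses import dataclass
from enum import Enum


class Provenance(Enum):
    FACT = "fact"
    INFERENCE = "inference"  # hypothetical second label, for illustration


@dataclass(frozen=True)
class Claim:
    text: str
    provenance: Provenance
    evidence_ids: tuple[str, ...]


def parse_claim(raw: dict) -> Claim:
    """Validate one untrusted claim; raise ValueError rather than absorb it."""
    text = raw.get("text")
    if not isinstance(text, str) or not text.strip():
        raise ValueError("claim.text must be a non-empty string")
    label = raw.get("provenance")
    if not isinstance(label, str):
        raise ValueError("claim.provenance must be a string")
    try:
        # Normalize wire casing ("FACT", "Fact") to the internal enum value.
        provenance = Provenance(label.strip().lower())
    except ValueError:
        raise ValueError(f"unknown provenance label: {label!r}") from None
    evidence = raw.get("evidence_ids", [])
    if not isinstance(evidence, list) or not all(isinstance(e, str) for e in evidence):
        raise ValueError("claim.evidence_ids must be a list of strings")
    if provenance is Provenance.FACT and not evidence:
        raise ValueError("a FACT claim must cite at least one evidence ID")
    return Claim(text, provenance, tuple(evidence))
```

A result that fails this parse is never merged into the conductor's state; what happens to it next (retry, re-route, discard) is a separate policy decision.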

The honest limitation

No stop condition on re-routing. Capability-based routing has a sharp edge: if a node can be re-routed to "any worker advertising capability C" and results keep failing validation, a naive conductor can re-route in an unbounded loop. A file work-bus needs an explicit per-node attempt budget (and a dead-letter outcome) or it can spin. Durability and decoupling are the wins; a bounded retry policy is the cost you must pay to claim them safely.
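A per-node attempt budget with a dead-letter outcome can be sketched as below. The rotation policy, the default budget of three, and the outcome names other than worker_absent are assumptions made for the example.

```python
from typing import Callable, Optional


def run_with_budget(candidates: list[str],
                    attempt: Callable[[str], Optional[dict]],
                    max_attempts: int = 3) -> dict:
    """Try workers in turn until one returns a valid result or the budget is spent.

    `attempt(worker)` returns a validated result dict, or None when the
    worker's output failed validation.
    """
    if not candidates:
        return {"outcome": "skipped", "reason": "worker_absent", "attempts": 0}
    for n in range(max_attempts):
        worker = candidates[n % len(candidates)]  # rotate through candidates
        result = attempt(worker)
        if result is not None:
            return {"outcome": "completed", "worker": worker,
                    "attempts": n + 1, "result": result}
    # Budget exhausted: record a terminal outcome instead of re-routing again.
    return {"outcome": "dead_letter", "attempts": max_attempts}
```

The important property is that the loop has a fixed upper bound regardless of how many workers advertise the capability, so a node always ends in one of three states: completed, skipped, or dead-lettered.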

When to reach for a real broker

This pattern fits a small, heterogeneous fleet running tasks that take seconds to minutes, coordinated by one operator. If you need high-throughput, low-latency fan-out across many producers and consumers, run a real message bus — the file-bus's polling and single-conductor model won't keep up. Match the tool to the failure that hurts: for a solo fleet, the pain is operational overhead and brittle coupling, and a directory of atomic files removes both.

FAQ

Q. How do I orchestrate multiple agent CLIs without LangGraph or a broker?
A filesystem work-bus: decompose into a subtask graph, write atomic Task files, poll for Result files. Durable state on disk — language-agnostic, debuggable, restart-safe, no broker to run.

Q. Isn't this just an event bus, and isn't polling slow?
No — files are durable state, not ephemeral events, so a late/restarted worker still finds its task. Poll-with-backoff latency is negligible for second-to-minute tasks, and you avoid running a broker.

Q. How does it route tasks?
By capability, not worker name. Workers advertise capabilities; subtasks declare a required capability; the conductor matches at dispatch. Fleets can change without editing wiring.

Q. What if a needed worker is missing?
Skip and log it (worker_absent), continue the graph, synthesize from what completed. Graceful degradation, with the skip log as a build-next list.