thespian/docs/remoting-design.md

Thespian Remoting Layer - Design Document

Overview

This document describes the design of a remoting layer for the Thespian actor framework. The remoting layer allows actors in separate OS processes - and eventually separate machines - to communicate transparently, as if they were in the same actor system. Once established, the communication channel is symmetric and bidirectional: either side can send messages, establish links, and propagate exit signals to the other.

The implementation is written entirely in Zig and lives under src/remote/. No new C or C++ primitives are expected to be required.


Goals

  • Enable transparent actor-to-actor communication across process boundaries.
  • Propagate actor links and exit signals across the boundary faithfully.
  • Clean up fully on transport collapse, with best-effort notification on both sides.
  • Keep the initial scope narrow: child process transport only, core infrastructure only.
  • Design the abstractions so that Unix socket and TCP transports can be added later with minimal disruption.

Non-Goals (Today)

  • Unix socket transport.
  • TCP transport.
  • Remote spawn (spawning an actor on a remote system).
  • Transport recovery or reconnection.
  • Multi-hop routing.
  • Authentication or encryption.

Core Concepts

Actor System Identity

Each running Thespian instance is an actor system. For remoting purposes, systems do not need globally unique identities in the initial implementation (child process transport is strictly 1:1). Identity concerns will become relevant when Unix socket and TCP transports introduce the listen/accept/connect model, at which point a system identifier will be assigned at connection time.

Well-Known Actors

An actor can be registered under a name in the local env proc table (e.g. env.proc("log")). Remoting extends this: a well-known actor on a remote system can be looked up by name via the endpoint. The endpoint itself is registered as a well-known actor in the local system under a configurable name (e.g. "remote"), making it discoverable by other local actors.

Remote Actor Identity

A remote actor is identified on the wire by an opaque 64-bit integer ID assigned by its home system at the time the proxy is created. Well-known actors are additionally reachable by name for the initial lookup, but thereafter addressed by ID. This keeps message routing in steady state uniform regardless of whether the original actor was named or anonymous.

Proxies

A proxy is a local actor that represents a remote actor within the local system. It holds a pid that local actors can send to, link against, and receive exit signals from, just as with any local actor. Internally, the proxy forwards everything to the endpoint for transmission over the transport.

Proxies are created on demand. The primary trigger is receiving a message from a remote actor that carries a remote PID as its from address: the endpoint creates a proxy for that remote PID if one does not already exist, and substitutes the proxy's local pid as the from before delivering the message to the local destination.

The endpoint maintains a table mapping remote actor IDs to local proxy pids. On transport collapse, all proxies are sent an exit signal.


Architecture

┌─────────────────────────────────────┐         ┌─────────────────────────────────────┐
│           System A                  │         │           System B                  │
│                                     │         │                                     │
│  Actor A ──> Proxy B ──> Endpoint ──┼─────────┼──> Endpoint ──> Proxy A ──> Actor B │
│                                     │Transport│                                     │
│  Actor A <── Proxy B <── Endpoint <─┼─────────┼─── Endpoint <── Proxy A <── Actor B │
└─────────────────────────────────────┘         └─────────────────────────────────────┘

The two key actors are:

  • Endpoint (src/remote/endpoint.zig) - one per remote connection. Owns the transport I/O, manages the proxy table, serialises outgoing messages and deserialises incoming ones.
  • Proxy (src/remote/proxy.zig) - one per remote actor that is currently referenced locally. Forwards messages and link/exit operations to the endpoint.

Transport: Child Process

The child process transport uses a pair of OS pipes: the parent writes to the child's stdin and reads from the child's stdout. The child does the reverse. Both sides run an endpoint actor; the parent spawns the child and passes the file descriptors to its endpoint.

This is the simplest possible 1:1 connection topology. There is no listen/accept phase - the connection is established at process spawn time and is torn down when the child exits or the parent closes the pipes.

The existing subprocess.zig already handles spawning and pipe I/O at the message level. The remoting endpoint will use the underlying file descriptor stream primitives directly to get a byte-stream interface suitable for framed CBOR.

Future Transports

When Unix socket and TCP transports are added, the endpoint will accept a transport interface - a pair of read/write abstractions over a byte stream. The endpoint logic (framing, message dispatch, proxy management) is identical regardless of transport; only the connection establishment differs. For Unix sockets and TCP, a separate listener actor will accept incoming connections and spawn an endpoint actor for each one.


Wire Protocol

Framing

Messages are framed with a 4-byte big-endian unsigned length prefix followed by a CBOR-encoded payload. The length field encodes the byte length of the CBOR payload only, not including itself.

┌───────────────┬─────────────────────────────┐
│ Length: u32BE │ CBOR Payload (Length bytes) │
└───────────────┴─────────────────────────────┘

Maximum payload size matches thespian.max_message_size (currently 32 KB). Frames exceeding this limit cause the transport to be torn down with an error.
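The framing rules above can be sketched in a few lines. This is an illustrative Python model of what framing.zig would implement (the real helpers are the Zig write_frame/read_frame described later); the function names and the partial-frame return convention here are assumptions for the sketch.

```python
import struct

MAX_MESSAGE_SIZE = 32 * 1024  # mirrors thespian.max_message_size

def write_frame(out: bytearray, payload: bytes) -> None:
    """Append a 4-byte big-endian length prefix followed by the payload."""
    if len(payload) > MAX_MESSAGE_SIZE:
        raise ValueError("frame exceeds max_message_size: tear down transport")
    out += struct.pack(">I", len(payload))  # length of the CBOR payload only
    out += payload

def read_frame(buf: bytes):
    """Return (payload, remaining_bytes) if a complete frame is buffered, else None."""
    if len(buf) < 4:
        return None
    (length,) = struct.unpack(">I", buf[:4])
    if length > MAX_MESSAGE_SIZE:
        raise ValueError("frame exceeds max_message_size: tear down transport")
    if len(buf) < 4 + length:
        return None  # partial frame; wait for more bytes
    return buf[4:4 + length], buf[4 + length:]
```

Note that the length field excludes the 4-byte prefix itself, so a 5-byte payload produces a 9-byte frame.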

CBOR Message Envelope

Every wire message is a CBOR array. The first element is a tag identifying the message type. The envelope format is:

["msg_type", ...fields]

The defined message types are:

| Tag | Direction | Fields | Meaning |
| --- | --- | --- | --- |
| "send" | both | from_id: u64, to_id: u64, payload: cbor | Deliver payload to actor to_id, from proxy of from_id |
| "send_named" | both | from_id: u64, to_name: text, payload: cbor | Deliver to a well-known actor by name |
| "link" | both | local_id: u64, remote_id: u64 | Establish a link between a local actor and a remote actor |
| "exit" | both | id: u64, reason: text | Remote actor id has exited with reason |
| "proxy_id" | both | name: text, id: u64 | Response to send_named: here is the opaque ID for this named actor |
| "transport_error" | both | reason: text | Signal that the sending side is tearing down |

The from_id and to_id fields are opaque 64-bit integers assigned by the home system of the actor. ID 0 is reserved and invalid. Named lookups (send_named) trigger the remote side to create or locate a proxy for the named actor and reply with its assigned ID via proxy_id, after which the sender can use the ID directly.
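To make the envelope concrete, the sketch below hand-encodes an exit message as CBOR. This is illustrative Python, not the Zig protocol.zig code; the tiny encoder covers only the three CBOR major types the envelope needs (unsigned integer, text string, array). A real send envelope would splice the already-encoded payload bytes in verbatim rather than re-encoding them.

```python
import struct

def cbor_head(major: int, arg: int) -> bytes:
    """Encode a CBOR item head: 3-bit major type plus length/value argument."""
    if arg < 24:
        return bytes([(major << 5) | arg])
    for additional, fmt in ((24, ">B"), (25, ">H"), (26, ">I"), (27, ">Q")):
        if arg < (1 << (8 * struct.calcsize(fmt))):
            return bytes([(major << 5) | additional]) + struct.pack(fmt, arg)
    raise ValueError("argument too large")

def encode(value) -> bytes:
    """Encode unsigned ints, text strings, and arrays - enough for envelopes."""
    if isinstance(value, int):
        return cbor_head(0, value)  # major type 0: unsigned integer
    if isinstance(value, str):
        data = value.encode("utf-8")
        return cbor_head(3, len(data)) + data  # major type 3: text string
    if isinstance(value, list):
        return cbor_head(4, len(value)) + b"".join(encode(v) for v in value)
    raise TypeError(f"unsupported type: {type(value)}")

# An "exit" envelope: remote actor 7 exited with reason "normal".
wire = encode(["exit", 7, "normal"])
```

Here `wire` is `0x83` (3-element array), the text string "exit", the integer 7, and the text string "normal" - fourteen bytes in total before the length prefix is applied.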


Endpoint Actor

Responsibilities

  • Own the transport read/write loops.
  • Assign and track remote actor IDs.
  • Maintain the proxy table (remote_id → local proxy pid).
  • Serialise outbound messages into framed CBOR.
  • Deserialise inbound frames and dispatch to the correct local actor or proxy.
  • On transport collapse: exit all proxies, unregister from env, exit self.

State

const Endpoint = struct {
    allocator: std.mem.Allocator,
    reader: *FileStream,                          // inbound byte stream
    writer: *FileStream,                          // outbound byte stream
    inbound: std.AutoHashMap(u64, tp.pid),        // remote_id  -> local proxy pid
    outbound: std.AutoHashMap(usize, u64),        // local pid handle -> assigned outbound ID
    next_id: u64,                                 // monotonically increasing ID allocator
    name: [:0]const u8,                           // registered name in env (e.g. "remote")
};

The outbound table maps local actor handle identities to the IDs by which they are known on the wire. When the proxy passes a from pid to the endpoint, the endpoint looks up or assigns an outbound ID for it. The remote side will create an inbound proxy for that ID on first reference.
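The lookup-or-assign step can be modelled as follows. This is an illustrative Python sketch of the outbound table's behaviour, not the Zig implementation; the class and method names are invented for the example.

```python
class OutboundIds:
    """Connection-scoped outbound ID assignment. ID 0 is reserved and invalid,
    so the monotonic allocator starts at 1."""

    def __init__(self):
        self.outbound = {}  # local pid handle -> assigned outbound ID
        self.next_id = 1

    def id_for(self, local_handle):
        """Look up the wire ID for a local actor, assigning one on first use."""
        existing = self.outbound.get(local_handle)
        if existing is not None:
            return existing
        assigned = self.next_id
        self.next_id += 1
        self.outbound[local_handle] = assigned
        return assigned
```

Because IDs are connection-scoped, two endpoints may hand out overlapping numbers without conflict; each wire message is interpreted only in the context of its own connection.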

Message Interface (local actors → endpoint)

Local actors interact with the endpoint by sending messages to its pid. The endpoint understands the following local messages:

| Message | Meaning |
| --- | --- |
| {"send", from_id: u64, remote_id: u64, payload: cbor} | Forward payload to remote actor, with originating actor's ID |
| {"send_named", from_id: u64, name: text, payload: cbor} | Forward payload to remote well-known actor |
| {"link", remote_id: u64} | Link the calling actor to a remote actor |
| {"proxy_exit", remote_id: u64, reason: text} | A local proxy is reporting that its remote peer exited |

Startup Sequence

  1. Endpoint actor is spawned (by the parent process after forking, or by the child after it starts).
  2. Endpoint registers itself in the local env under its configured name.
  3. Endpoint links to the env logger (if present) so it is cleaned up on exit.
  4. Endpoint starts the read loop: a dedicated receive of framed bytes from the transport.
  5. Endpoint is now ready to accept local send requests and inbound wire messages.

Read Loop

The endpoint issues a read on the transport stream. On each completion:

  1. Accumulate bytes until a complete frame is available (length prefix satisfied).
  2. Decode the CBOR envelope.
  3. Dispatch based on message type tag.
  4. Issue the next read.

On any read error or EOF, the endpoint initiates teardown.
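The accumulate-and-dispatch step can be sketched as a pure function over the buffered bytes. This is an illustrative Python model (the real read loop is Zig and event-driven); the feed name and the carry-over return value are assumptions for the sketch.

```python
import struct

MAX_MESSAGE_SIZE = 32 * 1024  # mirrors thespian.max_message_size

def feed(buf: bytes, chunk: bytes, dispatch) -> bytes:
    """Accumulate transport bytes; call dispatch for each complete frame's
    payload. Returns leftover bytes to carry into the next read completion."""
    buf += chunk
    while len(buf) >= 4:
        (length,) = struct.unpack(">I", buf[:4])  # step 1: length prefix
        if length > MAX_MESSAGE_SIZE:
            raise ValueError("oversized frame: tear down transport")
        if len(buf) < 4 + length:
            break  # partial frame: wait for the next read
        dispatch(buf[4:4 + length])  # steps 2-3: decode envelope, dispatch on tag
        buf = buf[4 + length:]
    return buf  # step 4 corresponds to issuing the next read
```

A single read completion may carry several frames, or a fraction of one, which is why the loop drains everything available before returning the remainder.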

Teardown

On transport collapse (read error, EOF, or transport_error received):

  1. Send transport_error to the remote side if the connection is still writable.
  2. Send an exit signal to every proxy in the proxy table.
  3. Clear the proxy table.
  4. Unregister from the local env.
  5. Exit self with reason "transport_error".

Because the endpoint is linked to all its proxies, and the proxies are linked to local actors that hold references to them, the exit propagates naturally through the local actor graph.


Proxy Actor

Responsibilities

  • Present a local pid that any local actor can send to or link against.
  • Forward all received messages to the endpoint for transmission.
  • Forward link requests to the endpoint.
  • On exit signal received from the endpoint (transport collapse or remote actor exit): exit self, propagating to any linked local actors.

State

const Proxy = struct {
    allocator: std.mem.Allocator,
    endpoint: tp.pid,      // the local endpoint actor
    remote_id: u64,        // the remote actor's opaque ID
};

Lifecycle

A proxy is created by the endpoint in two situations:

  1. Inbound message with unknown from_id: The endpoint receives a wire message from a remote actor ID it has not seen before. It spawns a proxy for that ID and records it in the proxy table before delivering the message locally.
  2. Explicit lookup response (proxy_id): After a send_named exchange, the endpoint now knows the remote ID for a named actor and creates a proxy for it.
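The on-demand path in situation 1 amounts to a lookup-or-spawn over the proxy table. The sketch below is illustrative Python; proxy_for and spawn_proxy are hypothetical names standing in for the endpoint's Zig-side table handling and actor spawn.

```python
def proxy_for(inbound, remote_id, spawn_proxy):
    """Look up or create the local proxy for a remote actor ID.

    inbound maps remote_id -> local proxy pid; spawn_proxy stands in for
    spawning a proxy actor and returning its pid."""
    pid = inbound.get(remote_id)
    if pid is None:
        pid = spawn_proxy(remote_id)  # spawn once, on first reference
        inbound[remote_id] = pid
    return pid
```

The table entry is recorded before the triggering message is delivered locally, so a reply sent from the message handler already finds the proxy in place.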

A proxy is destroyed when:

  • It receives an exit signal from the endpoint (which forwards the remote actor's exit reason).
  • The transport collapses and the endpoint exits all proxies.

When a proxy exits, it notifies the endpoint via {"proxy_exit", remote_id, reason} so the endpoint can remove it from the proxy table. This prevents the table from growing without bound over a long-lived connection.

Message Handling

Every message received by the proxy is forwarded to the endpoint, including the from address:

receive(from, m):
    endpoint.send({"send", from_id(from), self.remote_id, m})

The from address is a local pid_ref. Before forwarding, the proxy must resolve it to a local actor ID - or request that the endpoint assign one if this is the first time this local actor has sent outbound. The remote side will create a proxy for the from_id if one does not already exist, so that the remote actor can reply directly to the originating actor.


File Layout

src/remote/
├── endpoint.zig       # Endpoint actor
├── proxy.zig          # Proxy actor
├── framing.zig        # Length-prefix read/write helpers
└── protocol.zig       # Wire message encoding/decoding (CBOR envelopes)

The framing.zig module provides two functions:

pub fn write_frame(writer: anytype, payload: []const u8) !void;
pub fn read_frame(reader: anytype, buf: []u8) ![]u8;

The protocol.zig module provides encode/decode for each wire message type, working directly with cbor values.


Design Decisions and Rationale

Why one endpoint per connection?

With child process transport the relationship is strictly 1:1, so there is never more than one endpoint per remote system. When TCP is added, a listener will spawn one endpoint per accepted connection - the same model. Keeping endpoint state entirely local to one actor avoids shared mutable state and fits the actor model cleanly.

Why on-demand proxy creation?

Explicit proxy management (create before use, destroy explicitly) would require a handshake protocol and additional message types. On-demand creation based on observed from_id values is simpler and covers the primary use case: an actor on the remote side sends you a message, and you need to be able to reply to it. Explicit creation can be added later for the named-lookup case.

Child Process Transport and subprocess.zig

The parent-side endpoint uses subprocess.zig as-is to spawn the child and communicate over pipes. subprocess.zig delivers incoming bytes as {"stream", "stdout", "read_complete", bytes} messages which the endpoint receives and accumulates into frames.

The child side cannot use subprocess.zig - it must read from its own stdin and write to its own stdout. A separate stdio endpoint variant wraps file descriptor 0 (stdin) and file descriptor 1 (stdout) directly, using the same thespian/c/file_stream.h primitives, and presents identical behaviour to the parent-side endpoint once running.

Child Process Endpoint Modes

The child process endpoint will eventually support two modes:

  • Fork+exec (different binary): the parent spawns an entirely separate executable. The child starts fresh, initialises a Thespian context, and runs the stdio endpoint. This is the general case for connecting to actors in a different binary.

  • Fork-only (same binary, no exec, or re-exec): the child is a fork of the parent process. This avoids the overhead of loading a new binary and allows the child to share code with the parent. Re-exec (where the child exec's itself with a flag indicating it should run as a remote endpoint) is an alternative that gives a clean address space without a separate binary. Both variants use the stdio endpoint on the child side.

The distinction between modes is an implementation detail of how the child is launched; the endpoint protocol is identical in both cases.

Why CBOR for the wire protocol?

CBOR is already the native message format throughout Thespian. Using it on the wire means the payload of a send message is the actor message verbatim - no transcoding required. The framing overhead is minimal (4 bytes per message).

Why 64-bit opaque IDs?

A simple monotonic counter per endpoint is collision-free within a connection lifetime and requires no coordination. Named actors get an ID assigned at first reference. IDs are connection-scoped, not globally unique, which is sufficient for the 1:1 child process model.


Open Questions (Deferred)

  • System identity: When TCP is added, endpoints will need to identify themselves to each other (to detect loops, to route correctly in multi-hop scenarios). A UUID or similar token exchanged in a handshake is the likely approach.
  • Backpressure: The current model has no backpressure - a fast sender can overwhelm a slow transport. This is acceptable for the initial implementation but will need attention under load.
  • Named actor re-registration: If a well-known actor exits and is restarted under the same name, proxies on the remote side will hold stale IDs. A generation counter or re-lookup mechanism will be needed.