# Thespian Remoting Layer - Design Document
## Overview
This document describes the design of a remoting layer for the Thespian
actor framework. The remoting layer allows actors in separate OS processes
- and eventually separate machines - to communicate transparently, as
if they were in the same actor system. Once established, the communication
channel is symmetric and bidirectional: either side can send messages,
establish links, and propagate exit signals to the other.

The implementation is written entirely in Zig and lives under
`src/remote/`. No new C or C++ primitives are expected to be required.

---
## Goals
- Enable transparent actor-to-actor communication across process
boundaries.
- Propagate actor links and exit signals across the boundary faithfully.
- Clean up fully on transport collapse, with best-effort notification on
both sides.
- Keep the initial scope narrow: child process transport only, core
infrastructure only.
- Design the abstractions so that Unix socket and TCP transports can be
added later with minimal disruption.
---
## Non-Goals (Today)
- Unix socket transport.
- TCP transport.
- Remote spawn (spawning an actor on a remote system).
- Transport recovery or reconnection.
- Multi-hop routing.
- Authentication or encryption.
---
## Core Concepts
### Actor System Identity
Each running Thespian instance is an _actor system_. For remoting purposes,
systems do not need globally unique identities in the initial
implementation (child process transport is strictly 1:1). Identity concerns
will become relevant when Unix socket and TCP transports introduce the
listen/accept/connect model, at which point a system identifier will be
assigned at connection time.
### Well-Known Actors
An actor can be registered under a name in the local `env` proc table (e.g.
`env.proc("log")`). Remoting extends this: a well-known actor on a remote
system can be looked up by name via the endpoint. The endpoint itself is
registered as a well-known actor in the local system under a configurable
name (e.g. `"remote"`), making it discoverable by other local actors.
### Remote Actor Identity
A remote actor is identified on the wire by an opaque 64-bit integer ID
assigned by its home system at the time the proxy is created. Well-known
actors are additionally reachable by name for the initial lookup, but
thereafter addressed by ID. This keeps message routing in steady state
uniform regardless of whether the original actor was named or anonymous.
### Proxies
A _proxy_ is a local actor that represents a remote actor within the local
system. It holds a `pid` that local actors can send to, link against, and
receive exit signals from, just as with any local actor. Internally, the
proxy forwards everything to the endpoint for transmission over the
transport.
Proxies are created on demand. The primary trigger is receiving a message
from a remote actor that carries a remote actor ID as its `from` address:
the endpoint creates a proxy for that remote ID if one does not already
exist, and substitutes the proxy's local `pid` as the `from` before
delivering the message to the local destination.

The endpoint maintains a table mapping remote actor IDs to local proxy
`pid`s. On transport collapse, all proxies are sent an exit signal.

---
## Architecture
```
┌──────────────────────────────────────┐           ┌──────────────────────────────────────┐
│ System A                             │           │ System B                             │
│                                      │           │                                      │
│ Actor A ──> Proxy B ──> Endpoint ────┼───────────┼──── Endpoint ──> Proxy A ──> Actor B │
│                                      │ Transport │                                      │
│ Actor A <── Proxy B <── Endpoint <───┼───────────┼──── Endpoint <── Proxy A <── Actor B │
└──────────────────────────────────────┘           └──────────────────────────────────────┘
```
The two key actors are:
- **Endpoint** (`src/remote/endpoint.zig`) - one per remote connection.
Owns the transport I/O, manages the proxy table, serialises outgoing
messages and deserialises incoming ones.
- **Proxy** (`src/remote/proxy.zig`) - one per remote actor that is
currently referenced locally. Forwards messages and link/exit operations
to the endpoint.
---
## Transport: Child Process
The child process transport uses a pair of OS pipes: the parent writes to
the child's stdin and reads from the child's stdout. The child does the
reverse. Both sides run an endpoint actor; the parent spawns the child and
passes the file descriptors to its endpoint.
This is the simplest possible 1:1 connection topology. There is no
listen/accept phase - the connection is established at process spawn time
and is torn down when the child exits or the parent closes the pipes.
The existing `subprocess.zig` already handles spawning and pipe I/O at the
message level. The remoting endpoint will use the underlying file
descriptor stream primitives directly to get a byte-stream interface
suitable for framed CBOR.
### Future Transports
When Unix socket and TCP transports are added, the endpoint will accept a
_transport interface_ - a pair of read/write abstractions over a byte
stream. The endpoint logic (framing, message dispatch, proxy management) is
identical regardless of transport; only the connection establishment
differs. For Unix sockets and TCP, a separate _listener_ actor will accept
incoming connections and spawn an endpoint actor for each one.

---
## Wire Protocol
### Framing
Messages are framed with a 4-byte big-endian unsigned length prefix
followed by a CBOR-encoded payload. The length field encodes the byte
length of the CBOR payload only, not including itself.
```
┌───────────────┬─────────────────────────────┐
│ Length: u32BE │ CBOR Payload (Length bytes) │
└───────────────┴─────────────────────────────┘
```
Maximum payload size matches `thespian.max_message_size` (currently 32 KB).
Frames exceeding this limit cause the transport to be torn down with an
error.
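The framing rules are simple enough to sketch end to end. The following
Python fragment is illustrative only (the real implementation is
`framing.zig` and operates on byte streams rather than in-memory buffers):

```python
import struct

MAX_MESSAGE_SIZE = 32 * 1024  # mirrors thespian.max_message_size

def write_frame(payload: bytes) -> bytes:
    """Return the payload prefixed with its 4-byte big-endian length."""
    if len(payload) > MAX_MESSAGE_SIZE:
        raise ValueError("frame exceeds max_message_size")
    return struct.pack(">I", len(payload)) + payload

def read_frame(buf: bytes) -> tuple[bytes, bytes]:
    """Split one complete frame off the front of buf; return (payload, rest)."""
    if len(buf) < 4:
        raise ValueError("incomplete length prefix")
    (length,) = struct.unpack(">I", buf[:4])
    if length > MAX_MESSAGE_SIZE:
        raise ValueError("frame exceeds max_message_size")
    if len(buf) < 4 + length:
        raise ValueError("incomplete payload")
    return buf[4 : 4 + length], buf[4 + length :]
```

Note that the length check is applied on both sides: an oversized outbound
payload is rejected before it is written, and an oversized inbound length
prefix triggers teardown before any payload bytes are consumed.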
### CBOR Message Envelope
Every wire message is a CBOR array. The first element is a tag identifying
the message type. The envelope format is:
```
["msg_type", ...fields]
```
The defined message types are:

| Tag | Direction | Fields | Meaning |
| ------------------- | --------- | ------------------------------------------------ | -------------------------------------------------------------------- |
| `"send"` | both | `from_id: u64`, `to_id: u64`, `payload: cbor` | Deliver payload to actor `to_id`, from proxy of `from_id` |
| `"send_named"` | both | `from_id: u64`, `to_name: text`, `payload: cbor` | Deliver to a well-known actor by name |
| `"link"` | both | `local_id: u64`, `remote_id: u64` | Establish a link between a local actor and a remote actor |
| `"exit"` | both | `id: u64`, `reason: text` | Remote actor `id` has exited with reason |
| `"proxy_id"` | both | `name: text`, `id: u64` | Response to `send_named`: here is the opaque ID for this named actor |
| `"transport_error"` | both | `reason: text` | Signal that the sending side is tearing down |
The `from_id` and `to_id` fields are opaque 64-bit integers assigned by the
_home system_ of the actor. ID `0` is reserved and invalid. Named lookups
(`send_named`) trigger the remote side to create or locate a proxy for the
named actor and reply with its assigned ID via `proxy_id`, after which the
sender can use the ID directly.
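To make the envelope bytes concrete, here is a minimal CBOR encoder
covering only the types the envelopes need (unsigned integers, text
strings, byte strings, arrays). This is a Python sketch for illustration,
not the encoder Thespian actually uses:

```python
def _head(major: int, n: int) -> bytes:
    # CBOR header byte: 3-bit major type plus 5-bit additional info.
    # Values below 24 are encoded inline; larger values use a 1/2/4/8-byte
    # big-endian argument selected by additional info 24..27.
    if n < 24:
        return bytes([(major << 5) | n])
    for ai, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if n < 1 << (8 * size):
            return bytes([(major << 5) | ai]) + n.to_bytes(size, "big")
    raise ValueError("value too large")

def cbor_encode(value) -> bytes:
    """Encode the subset of CBOR used by the wire envelopes."""
    if isinstance(value, int):
        return _head(0, value)                       # major 0: unsigned int
    if isinstance(value, bytes):
        return _head(2, len(value)) + value          # major 2: byte string
    if isinstance(value, str):
        data = value.encode("utf-8")
        return _head(3, len(data)) + data            # major 3: text string
    if isinstance(value, list):
        body = b"".join(cbor_encode(v) for v in value)
        return _head(4, len(value)) + body           # major 4: array
    raise TypeError(f"unsupported envelope type: {type(value)}")

# An "exit" envelope: remote actor 7 exited normally.
frame = cbor_encode(["exit", 7, "normal"])  # -> b"\x83dexit\x07fnormal"
```

For a `send` envelope, the `payload` field is carried as an already-encoded
CBOR item, so the actor message crosses the wire verbatim.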

---
## Endpoint Actor
### Responsibilities
- Own the transport read/write loops.
- Assign and track remote actor IDs.
- Maintain the proxy table (`remote_id → local proxy pid`).
- Serialise outbound messages into framed CBOR.
- Deserialise inbound frames and dispatch to the correct local actor or
proxy.
- On transport collapse: exit all proxies, unregister from env, exit self.
### State
```zig
const Endpoint = struct {
    allocator: std.mem.Allocator,
    reader: *FileStream, // inbound byte stream
    writer: *FileStream, // outbound byte stream
    inbound: std.AutoHashMap(u64, tp.pid), // remote_id -> local proxy pid
    outbound: std.AutoHashMap(usize, u64), // local pid handle -> assigned outbound ID
    next_id: u64, // monotonically increasing ID allocator
    name: [:0]const u8, // registered name in env (e.g. "remote")
};
```
The `outbound` table maps local actor handle identities to the IDs by which
they are known on the wire. When the proxy passes a `from` pid to the
endpoint, the endpoint looks up or assigns an outbound ID for it. The remote
side will create an inbound proxy for that ID on first reference.
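The assign-or-reuse step can be sketched as follows (Python for
illustration; `pid_handle` stands in for however the local pid's identity
is keyed, and the class name is invented for this sketch):

```python
class OutboundIds:
    """Assign connection-scoped wire IDs to local actors on first outbound use.

    Mirrors the `outbound` table and `next_id` counter in the Endpoint state.
    """

    def __init__(self):
        self.outbound: dict[int, int] = {}  # local pid handle -> wire ID
        self.next_id = 1                    # ID 0 is reserved and invalid

    def id_for(self, pid_handle: int) -> int:
        # Reuse the existing ID so the remote side keeps a single proxy
        # per local actor; otherwise allocate the next counter value.
        wire_id = self.outbound.get(pid_handle)
        if wire_id is None:
            wire_id = self.next_id
            self.next_id += 1
            self.outbound[pid_handle] = wire_id
        return wire_id
```

Because the counter is per-endpoint and monotonic, IDs never collide within
a connection's lifetime and no coordination with the remote side is needed.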
### Message Interface (local actors → endpoint)
Local actors interact with the endpoint by sending messages to its `pid`.
The endpoint understands the following local messages:

| Message | Meaning |
| --------------------------------------------------------- | ------------------------------------------------------------ |
| `{"send", from_id: u64, remote_id: u64, payload: cbor}` | Forward payload to remote actor, with originating actor's ID |
| `{"send_named", from_id: u64, name: text, payload: cbor}` | Forward payload to remote well-known actor |
| `{"link", remote_id: u64}` | Link the calling actor to a remote actor |
| `{"proxy_exit", remote_id: u64, reason: text}` | A local proxy is reporting that its remote peer exited |
### Startup Sequence
1. Endpoint actor is spawned (by the parent process after forking, or by
the child after it starts).
2. Endpoint registers itself in the local `env` under its configured name.
3. Endpoint links to the `env` logger (if present) so it is cleaned up on
exit.
4. Endpoint starts the read loop: a dedicated receive of framed bytes from
the transport.
5. Endpoint is now ready to accept local send requests and inbound wire
messages.
### Read Loop
The endpoint issues a read on the transport stream. On each completion:
1. Accumulate bytes until a complete frame is available (length prefix
satisfied).
2. Decode the CBOR envelope.
3. Dispatch based on message type tag.
4. Issue the next read.
On any read error or EOF, the endpoint initiates teardown.
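Step 1, frame accumulation, can be sketched as an incremental parser
(Python, illustrative; the class and method names are invented for this
sketch):

```python
import struct

class FrameAccumulator:
    """Buffer transport bytes and extract every complete frame."""

    def __init__(self, max_size: int = 32 * 1024):
        self.buf = bytearray()
        self.max_size = max_size

    def feed(self, data: bytes) -> list[bytes]:
        """Append newly read bytes; return all frames completed by them."""
        self.buf.extend(data)
        frames = []
        while len(self.buf) >= 4:
            (length,) = struct.unpack(">I", self.buf[:4])
            if length > self.max_size:
                # Oversized frame: the endpoint tears the transport down.
                raise ValueError("frame exceeds max_message_size")
            if len(self.buf) < 4 + length:
                break  # length prefix not yet satisfied; wait for more bytes
            frames.append(bytes(self.buf[4 : 4 + length]))
            del self.buf[: 4 + length]
        return frames
```

A single read completion may finish zero, one, or several frames, which is
why the extraction loops until the buffer no longer holds a full frame.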
### Teardown
On transport collapse (read error, EOF, or `transport_error` received):
1. Send `transport_error` to the remote side if the connection is still
writable.
2. Send an exit signal to every proxy in the proxy table.
3. Clear the proxy table.
4. Unregister from the local `env`.
5. Exit self with reason `"transport_error"`.
Because the endpoint is linked to all its proxies, and the proxies are
linked to local actors that hold references to them, the exit propagates
naturally through the local actor graph.

---
## Proxy Actor
### Responsibilities
- Present a local `pid` that any local actor can send to or link against.
- Forward all received messages to the endpoint for transmission.
- Forward link requests to the endpoint.
- On exit signal received from the endpoint (transport collapse or remote
actor exit): exit self, propagating to any linked local actors.
### State
```zig
const Proxy = struct {
    allocator: std.mem.Allocator,
    endpoint: tp.pid, // the local endpoint actor
    remote_id: u64, // the remote actor's opaque ID
};
```
### Lifecycle
A proxy is created by the endpoint in two situations:
1. **Inbound message with unknown `from_id`**: The endpoint receives a wire
message from a remote actor ID it has not seen before. It spawns a proxy
for that ID and records it in the proxy table before delivering the message
locally.
2. **Explicit lookup response (`proxy_id`)**: After a `send_named`
exchange, the endpoint now knows the remote ID for a named actor and
creates a proxy for it.
A proxy is destroyed when:
- It receives an exit signal from the endpoint (which forwards the remote
actor's exit reason).
- The transport collapses and the endpoint exits all proxies.
When a proxy exits, it notifies the endpoint via `{"proxy_exit", remote_id,
reason}` so the endpoint can remove it from the proxy table. This prevents
the table from growing unboundedly over a long-lived connection.
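The on-demand creation and `proxy_exit` cleanup can be sketched together
(Python, illustrative; `spawn_proxy` stands in for spawning a real proxy
actor and returning its pid):

```python
class ProxyTable:
    """Endpoint-side table of remote_id -> local proxy, created on demand."""

    def __init__(self, spawn_proxy):
        self.proxies: dict[int, object] = {}
        self.spawn_proxy = spawn_proxy  # callable: remote_id -> proxy pid

    def ensure(self, remote_id: int):
        # Inbound message with an unknown from_id: create the proxy first,
        # then the message is delivered with the proxy substituted as `from`.
        proxy = self.proxies.get(remote_id)
        if proxy is None:
            proxy = self.spawn_proxy(remote_id)
            self.proxies[remote_id] = proxy
        return proxy

    def on_proxy_exit(self, remote_id: int):
        # A proxy reported its exit: drop the entry so the table does not
        # grow without bound over a long-lived connection.
        self.proxies.pop(remote_id, None)
```

If the same remote actor is referenced again after its proxy exits, a fresh
proxy is created on the next inbound reference, which matches the on-demand
lifecycle above.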
### Message Handling
Every message received by the proxy is forwarded to the endpoint, including
the `from` address:

```
receive(from, m):
    endpoint.send({"send", from_id(from), self.remote_id, m})
```
The `from` address is a local `pid_ref`. Before forwarding, the proxy must
resolve it to a local actor ID — or request that the endpoint assign one if
this is the first time this local actor has sent outbound. The remote side
will create a proxy for the `from_id` if one does not already exist, so
that the remote actor can reply directly to the originating actor.

---
## File Layout
```
src/remote/
├── endpoint.zig # Endpoint actor
├── proxy.zig # Proxy actor
├── framing.zig # Length-prefix read/write helpers
└── protocol.zig # Wire message encoding/decoding (CBOR envelopes)
```
The `framing.zig` module provides two functions:
```zig
pub fn write_frame(writer: anytype, payload: []const u8) !void;
pub fn read_frame(reader: anytype, buf: []u8) ![]u8;
```
The `protocol.zig` module provides encode/decode for each wire message
type, working directly with `cbor` values.

---
## Design Decisions and Rationale
### Why one endpoint per connection?
With child process transport the relationship is strictly 1:1, so there is
never more than one endpoint per remote system. When TCP is added, a
listener will spawn one endpoint per accepted connection - the same model.
Keeping endpoint state entirely local to one actor avoids shared mutable
state and fits the actor model cleanly.
### Why on-demand proxy creation?
Explicit proxy management (create before use, destroy explicitly) would
require a handshake protocol and additional message types. On-demand
creation based on observed `from_id` values is simpler and covers the
primary use case: an actor on the remote side sends you a message, and you
need to be able to reply to it. Explicit creation can be added later for
the named-lookup case.
### Child Process Transport and `subprocess.zig`
The parent-side endpoint uses `subprocess.zig` as-is to spawn the child and
communicate over pipes. `subprocess.zig` delivers incoming bytes as
`{"stream", "stdout", "read_complete", bytes}` messages which the endpoint
receives and accumulates into frames.
The child side cannot use `subprocess.zig` — it must read from its own
stdin and write to its own stdout. A separate _stdio endpoint_ variant
wraps file descriptor 0 (stdin) and file descriptor 1 (stdout) directly,
using the same `thespian/c/file_stream.h` primitives, and presents
identical behaviour to the parent-side endpoint once running.
### Child Process Endpoint Modes
The child process endpoint will eventually support two modes:
- **Fork+exec** (different binary): the parent spawns an entirely separate
executable. The child starts fresh, initialises a Thespian context, and
runs the stdio endpoint. This is the general case for connecting to actors
in a different binary.
- **Fork-only** (same binary, no exec, or re-exec): the child is a fork of
the parent process. This avoids the overhead of loading a new binary and
allows the child to share code with the parent. Re-exec (where the child
exec's itself with a flag indicating it should run as a remote endpoint)
is an alternative that gives a clean address space without a separate
binary. Both variants use the stdio endpoint on the child side.
The distinction between modes is an implementation detail of how the child
is launched; the endpoint protocol is identical in both cases.
### Why CBOR for the wire protocol?
CBOR is already the native message format throughout Thespian. Using it on
the wire means the payload of a `send` message is the actor message
verbatim - no transcoding required. The framing overhead is minimal (4
bytes per message).
### Why 64-bit opaque IDs?
A simple monotonic counter per endpoint is collision-free within a
connection lifetime and requires no coordination. Named actors get an ID
assigned at first reference. IDs are connection-scoped, not globally
unique, which is sufficient for the 1:1 child process model.

---
## Open Questions (Deferred)
- **System identity**: When TCP is added, endpoints will need to identify
themselves to each other (to detect loops, to route correctly in multi-hop
scenarios). A UUID or similar token exchanged in a handshake is the likely
approach.
- **Backpressure**: The current model has no backpressure - a fast sender
can overwhelm a slow transport. This is acceptable for the initial
implementation but will need attention under load.
- **Named actor re-registration**: If a well-known actor exits and is
restarted under the same name, proxies on the remote side will hold stale
IDs. A generation counter or re-lookup mechanism will be needed.