Server Lifecycle & Health System¶
HoloMUSH's core server uses a structured lifecycle system that manages how components start up, report health, and shut down. This page explains the design — why it works this way, and how the pieces fit together.
The Problem It Solves¶
A MUSH server has a lot of moving parts: a database, an access control system, a plugin stack, session management, and more. If the server starts accepting player connections before all of these are ready, bad things happen — commands get denied, sessions fail to resolve, the game appears broken even though nothing is actually wrong.
The lifecycle system prevents this by:
- Giving each major component a formal startup/shutdown contract
- Starting components in the right order based on declared dependencies
- Gating player traffic until everything reports healthy
- Detecting and responding to runtime health degradation
Subsystems¶
Every major server component implements the Subsystem interface:
type Subsystem interface {
ID() SubsystemID
DependsOn() []SubsystemID
Start(ctx context.Context) error
Stop(ctx context.Context) error
}
Each subsystem has a typed identifier (not a string — typos are compile errors, not runtime surprises) and declares what it depends on. The server uses these declarations to figure out the right startup order automatically.
Current Subsystems¶
graph TD
DB[Database]
TLS[TLS Manager]
ABAC[ABAC Stack]
Auth[Auth Services]
World[World Service]
Plugins[Plugin Stack]
Sessions[Session Store]
Bootstrap[Bootstrap Runner]
GRPC[gRPC Server]
ABAC --> DB
Auth --> DB
World --> DB
World --> ABAC
Plugins --> DB
Sessions --> DB
Bootstrap --> ABAC
Bootstrap --> World
Bootstrap --> Plugins
GRPC --> Bootstrap
GRPC --> Sessions
classDef health fill:#1a3a2a,stroke:#22c55e,color:#e2e8f0
classDef nohealth fill:#1a2433,stroke:#3b82f6,color:#e2e8f0
class DB,ABAC,Plugins,Sessions health
class TLS,Auth,World,Bootstrap,GRPC nohealth
Green nodes report ongoing health. Blue nodes participate in lifecycle (start/stop) but have no meaningful ongoing health — they either work after startup or they don't.
Startup Sequence¶
The server starts in distinct phases:
stateDiagram-v2
[*] --> Infrastructure
Infrastructure --> SubsystemInit
SubsystemInit --> ReadinessGate
ReadinessGate --> AcceptTraffic
AcceptTraffic --> Runtime
state Infrastructure {
[*] --> Config: Parse & validate
Config --> Logging: Initialize telemetry
Logging --> Database: Connect to PostgreSQL
Database --> Migrations: Run schema migrations
Migrations --> Observability: Start metrics server
note right of Observability
Liveness probe available NOW
Readiness probe returns 503
end note
}
state SubsystemInit {
[*] --> TopoSort: Topological sort by DependsOn()
TopoSort --> StartAll: Start in dependency order
note right of StartAll
Independent subsystems
start concurrently
end note
}
state ReadinessGate {
[*] --> WaitReady: registry.WaitReady(ctx)
WaitReady --> Ready: All reporters healthy
WaitReady --> Timeout: 30s exceeded
Timeout --> [*]: Fatal error, exit
note right of Ready
Readiness probe flips to 200
end note
}
state AcceptTraffic {
[*] --> BindGRPC: Bind TCP listener
BindGRPC --> Serve: grpcServer.Serve()
note right of Serve
NOW accepting player connections
end note
}
The key guarantee: no player connections are accepted until every subsystem
with a health reporter says "ready." The liveness probe (/healthz/liveness)
is available immediately — it just says "the process is alive." The readiness
probe (/healthz/readiness) starts at 503 and flips to 200 only when the
system is genuinely ready to serve.
Health Tiers¶
Subsystems with ongoing runtime state (database connections, caches, background loops) report their health through a tiered system:
stateDiagram-v2
direction LR
[*] --> Warm: Initial load succeeds
Warm --> Degraded: Failure detected
Degraded --> Warm: Recovery succeeds
Degraded --> Stale: Grace period expires
Stale --> Warm: Recovery succeeds
Stale --> Dead: Max retries exhausted
Dead --> [*]: Graceful shutdown
state Warm {
note right of Warm
Serving normally
Readiness: 200
end note
}
state Degraded {
note right of Degraded
Serving last-known-good data
Readiness: 200
Metrics bumped
end note
}
state Stale {
note right of Stale
Fail-closed: denying requests
Readiness: 503
Still attempting recovery
end note
}
state Dead {
note right of Dead
Fail-closed
Server shutting down
No recovery possible
end note
}
What Each Tier Means¶
| Tier | Serving? | What's Happening |
|---|---|---|
| Warm | Yes | Everything is working. Normal operation. |
| Degraded | Yes | Something failed (a poll, a compilation), but the last-known-good data is almost certainly still correct. Players don't notice. |
| Stale | No | Failures have persisted long enough that we can't trust the data. Requests are denied until recovery. |
| Dead | No | Unrecoverable failure. The server is shutting down. An operator needs to investigate. |
The design philosophy: a single hiccup shouldn't kick every player. Policies change rarely during a game session. If the database has a brief connectivity blip, the server continues serving from its cached snapshot. Only sustained failure — long enough that the cached data might genuinely be wrong — triggers denial.
Configurable Thresholds¶
| Transition | Default | Meaning |
|---|---|---|
| Warm → Degraded | 1 failure | First sign of trouble |
| Degraded → Stale | 60 seconds | Grace period for transient issues |
| Stale → Dead | 30 failures (~5 min) | Exhausted recovery attempts |
| Any → Warm | 1 success | Single successful operation = recovery |
Policy Cache¶
The ABAC policy cache is the primary consumer of the health system. It has two invalidation paths:
flowchart LR
subgraph "Fast Path"
Store[Policy Store] -->|"Invalidate()"| Cache[Policy Cache]
end
subgraph "Safety Net"
Poller[Policy Poller] -->|"check MAX(updated_at)"| DB[(Database)]
DB -->|changed?| Poller
Poller -->|"Reload()"| Cache
end
Cache -->|success/failure| Tracker[Health Tracker]
Poller -->|success/failure| Tracker
Tracker -->|"tier ≥ Stale"| Engine["Engine.EnterDegradedMode()"]
Tracker -->|"tier = Warm"| Engine2["Engine.ClearDegradedMode()"]
Fast path: When game code changes policies (creating, updating, deleting),
the store layer calls cache.Invalidate() directly. The cache reloads
immediately. No round-trip through the database notification system.
Safety net: A background poller queries the database every 10 seconds for changes. This catches anything the fast path missed — manual database edits, multi-instance deployments, or edge cases where the in-process invalidation didn't fire.
The health tracker watches both paths. Successful reloads keep the tier at Warm. Failures escalate through the tiers. When the tier reaches Stale, the tracker tells the ABAC engine to enter degraded mode — all non-system requests are denied until recovery.
Plugin Health¶
Plugins can optionally report health by implementing the HealthReporter
interface. The plugin stack aggregates individual plugin health into its own
report:
- Simple plugins (most Lua scripts): No health concept. They load successfully or they don't. Assumed healthy after load.
- Complex plugins (future: Discord, external APIs): Implement
HealthReporterto track connection state, rate limits, authentication tokens. Their health feeds into the plugin stack's aggregate report.
Shutdown¶
Shutdown is the reverse of startup. The server stops subsystems in reverse
dependency order — the gRPC server stops accepting connections first, then
the bootstrap runner, then plugins, world service, ABAC, auth, database.
Each subsystem's Stop() method handles its own cleanup (flushing buffers,
closing connections, cancelling background goroutines).