Operations¶

This guide covers monitoring and maintaining HoloMUSH in production.

Health Checks¶

Both core and gateway expose health endpoints.

Endpoints¶

Endpoint	Description
`/healthz/liveness`	Process is alive
`/healthz/readiness`	Ready to accept traffic

Core (default port 9100):

curl http://localhost:9100/healthz/readiness

Gateway (default port 9101):

curl http://localhost:9101/healthz/readiness

Status Command¶

Check overall system health:

holomush status

This queries both core and gateway health endpoints via gRPC.

Prometheus Metrics¶

HoloMUSH exposes Prometheus metrics at /metrics.

Available Metrics¶

Metric	Type	Description
`holomush_events_total`	Counter	Total events processed
`holomush_sessions_active`	Gauge	Current active sessions
`holomush_grpc_requests_total`	Counter	gRPC requests by method
`holomush_grpc_latency_seconds`	Histogram	gRPC request latency

Scrape Configuration¶

scrape_configs:
  - job_name: holomush-core
    static_configs:
      - targets: ["localhost:9100"]

  - job_name: holomush-gateway
    static_configs:
      - targets: ["localhost:9101"]

PostgreSQL Extensions¶

HoloMUSH requires and optionally uses these PostgreSQL extensions:

Extension	Purpose	Required
`pg_trgm`	Fuzzy text matching for exits	Yes
`pg_stat_statements`	Query performance monitoring	Optional

Enable pg_stat_statements¶

For query performance monitoring, enable pg_stat_statements:

# postgresql.conf
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all

Restart PostgreSQL after configuration changes.

Query Performance Monitoring¶

Slowest Queries¶

Find queries consuming the most time:

SELECT
    calls,
    round(total_exec_time::numeric, 2) as total_ms,
    round(mean_exec_time::numeric, 2) as mean_ms,
    round(stddev_exec_time::numeric, 2) as stddev_ms,
    rows,
    query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

High I/O Queries¶

Identify queries causing disk reads:

SELECT
    calls,
    shared_blks_hit + shared_blks_read as total_blocks,
    round(100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0), 2) as cache_hit_pct,
    query
FROM pg_stat_statements
WHERE shared_blks_hit + shared_blks_read > 0
ORDER BY shared_blks_read DESC
LIMIT 10;

Most Frequent Queries¶

Find queries called most often:

SELECT
    calls,
    round(total_exec_time::numeric, 2) as total_ms,
    round(mean_exec_time::numeric, 2) as mean_ms,
    query
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 10;

Reset Statistics¶

Clear accumulated statistics:

SELECT pg_stat_statements_reset();

Key Metrics¶

Metric	Description	Alert Threshold
`mean_exec_time`	Average query execution time	> 100ms
`stddev_exec_time`	Variance in execution time	High variance
`rows`	Total rows returned	Depends on query
`shared_blks_read`	Blocks read from disk	High = slow

Connection Monitoring¶

Active Connections¶

View current database connections:

SELECT
    datname,
    usename,
    application_name,
    client_addr,
    state,
    query_start,
    now() - query_start as query_duration
FROM pg_stat_activity
WHERE datname = 'holomush'
ORDER BY query_start;

Connection Counts¶

Connections grouped by state:

SELECT state, count(*)
FROM pg_stat_activity
WHERE datname = 'holomush'
GROUP BY state;

LISTEN/NOTIFY Listeners¶

HoloMUSH uses PostgreSQL LISTEN/NOTIFY for real-time events:

SELECT pid, datname, usename, application_name, state, query
FROM pg_stat_activity
WHERE query LIKE 'LISTEN%' OR wait_event_type = 'Client';

Index Health¶

Unused Indexes¶

Find indexes that may be candidates for removal:

SELECT
    schemaname,
    relname as table_name,
    indexrelname as index_name,
    idx_scan,
    pg_size_pretty(pg_relation_size(indexrelid)) as index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

Index Usage¶

Verify indexes are being used:

SELECT
    relname as table_name,
    round(100.0 * idx_scan / nullif(seq_scan + idx_scan, 0), 2) as index_usage_pct,
    seq_scan,
    idx_scan
FROM pg_stat_user_tables
WHERE seq_scan + idx_scan > 0
ORDER BY seq_scan DESC;

Table Statistics¶

Table Sizes¶

Monitor table and index sizes:

SELECT
    relname as table_name,
    pg_size_pretty(pg_total_relation_size(relid)) as total_size,
    pg_size_pretty(pg_relation_size(relid)) as table_size,
    pg_size_pretty(pg_indexes_size(relid)) as index_size,
    n_live_tup as live_rows,
    n_dead_tup as dead_rows
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

Tables Needing Vacuum¶

Identify tables with dead tuples:

SELECT
    relname,
    n_dead_tup,
    last_vacuum,
    last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC;

Backup and Recovery¶

PostgreSQL Backups¶

Use pg_dump for logical backups:

pg_dump -U holomush -h localhost holomush > backup.sql

For production, consider:

Continuous archiving with WAL shipping
Point-in-time recovery (PITR)
Streaming replication for high availability

Restore¶

Restore from a logical backup:

psql -U holomush -h localhost holomush < backup.sql

Log Management¶

Log Locations¶

Component	Location
Core logs	stdout (use log aggregation)
Gateway logs	stdout (use log aggregation)
PostgreSQL	`/var/log/postgresql/` (varies)

Log Aggregation¶

For production, aggregate logs with:

Loki - Grafana's log aggregation
Elasticsearch - Full-text search
CloudWatch - AWS managed logging

Docker Logging¶

View container logs:

docker compose logs -f core
docker compose logs -f gateway

Troubleshooting¶

Connection Issues¶

Symptom: Gateway cannot connect to core.

Check:

Core is running: holomush status
Network connectivity: nc -zv localhost 9000
Firewall rules allow traffic
TLS certificates are valid

Database Connection Failed¶

Symptom: Core fails with database errors.

Check:

PostgreSQL is running: pg_isready -h localhost
DATABASE_URL is correct
User has permissions: \du in psql
Database exists: \l in psql

High Memory Usage¶

Possible causes:

Too many active sessions
Large query result sets
Plugin memory leaks

Investigate:

# Check active sessions
curl http://localhost:9100/metrics | grep sessions_active

# Check PostgreSQL connections
psql -c "SELECT count(*) FROM pg_stat_activity WHERE datname='holomush';"

Slow Queries¶

Investigate:

Enable pg_stat_statements (see above)
Check for missing indexes
Analyze query plans with EXPLAIN ANALYZE

EXPLAIN ANALYZE SELECT * FROM events WHERE stream = 'room:123' ORDER BY id DESC LIMIT 100;

Next Steps¶

Configuration - Adjust server settings
Installation - Deployment options