Operations¶
This guide covers monitoring and maintaining HoloMUSH in production.
Health Checks¶
Both core and gateway expose health endpoints.
Endpoints¶
| Endpoint | Description |
|---|---|
/healthz/liveness |
Process is alive |
/healthz/readiness |
Ready to accept traffic |
Core (default port 9100):
Gateway (default port 9101):
Status Command¶
Check overall system health:
This queries both core and gateway health endpoints via gRPC.
Prometheus Metrics¶
HoloMUSH exposes Prometheus metrics at /metrics.
Available Metrics¶
| Metric | Type | Description |
|---|---|---|
holomush_events_total |
Counter | Total events processed |
holomush_sessions_active |
Gauge | Current active sessions |
holomush_grpc_requests_total |
Counter | gRPC requests by method |
holomush_grpc_latency_seconds |
Histogram | gRPC request latency |
Scrape Configuration¶
scrape_configs:
- job_name: holomush-core
static_configs:
- targets: ["localhost:9100"]
- job_name: holomush-gateway
static_configs:
- targets: ["localhost:9101"]
PostgreSQL Extensions¶
HoloMUSH requires and optionally uses these PostgreSQL extensions:
| Extension | Purpose | Required |
|---|---|---|
pg_trgm |
Fuzzy text matching for exits | Yes |
pg_stat_statements |
Query performance monitoring | Optional |
Enable pg_stat_statements¶
For query performance monitoring, enable pg_stat_statements:
Restart PostgreSQL after configuration changes.
Query Performance Monitoring¶
Slowest Queries¶
Find queries consuming the most time:
SELECT
calls,
round(total_exec_time::numeric, 2) as total_ms,
round(mean_exec_time::numeric, 2) as mean_ms,
round(stddev_exec_time::numeric, 2) as stddev_ms,
rows,
query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
High I/O Queries¶
Identify queries causing disk reads:
SELECT
calls,
shared_blks_hit + shared_blks_read as total_blocks,
round(100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0), 2) as cache_hit_pct,
query
FROM pg_stat_statements
WHERE shared_blks_hit + shared_blks_read > 0
ORDER BY shared_blks_read DESC
LIMIT 10;
Most Frequent Queries¶
Find queries called most often:
SELECT
calls,
round(total_exec_time::numeric, 2) as total_ms,
round(mean_exec_time::numeric, 2) as mean_ms,
query
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 10;
Reset Statistics¶
Clear accumulated statistics:
Key Metrics¶
| Metric | Description | Alert Threshold |
|---|---|---|
mean_exec_time |
Average query execution time | > 100ms |
stddev_exec_time |
Variance in execution time | High variance |
rows |
Total rows returned | Depends on query |
shared_blks_read |
Blocks read from disk | High = slow |
Connection Monitoring¶
Active Connections¶
View current database connections:
SELECT
datname,
usename,
application_name,
client_addr,
state,
query_start,
now() - query_start as query_duration
FROM pg_stat_activity
WHERE datname = 'holomush'
ORDER BY query_start;
Connection Counts¶
Connections grouped by state:
LISTEN/NOTIFY Listeners¶
HoloMUSH uses PostgreSQL LISTEN/NOTIFY for real-time events:
SELECT pid, datname, usename, application_name, state, query
FROM pg_stat_activity
WHERE query LIKE 'LISTEN%' OR wait_event_type = 'Client';
Index Health¶
Unused Indexes¶
Find indexes that may be candidates for removal:
SELECT
schemaname,
relname as table_name,
indexrelname as index_name,
idx_scan,
pg_size_pretty(pg_relation_size(indexrelid)) as index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
Index Usage¶
Verify indexes are being used:
SELECT
relname as table_name,
round(100.0 * idx_scan / nullif(seq_scan + idx_scan, 0), 2) as index_usage_pct,
seq_scan,
idx_scan
FROM pg_stat_user_tables
WHERE seq_scan + idx_scan > 0
ORDER BY seq_scan DESC;
Table Statistics¶
Table Sizes¶
Monitor table and index sizes:
SELECT
relname as table_name,
pg_size_pretty(pg_total_relation_size(relid)) as total_size,
pg_size_pretty(pg_relation_size(relid)) as table_size,
pg_size_pretty(pg_indexes_size(relid)) as index_size,
n_live_tup as live_rows,
n_dead_tup as dead_rows
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC;
Tables Needing Vacuum¶
Identify tables with dead tuples:
SELECT
relname,
n_dead_tup,
last_vacuum,
last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC;
Backup and Recovery¶
PostgreSQL Backups¶
Use pg_dump for logical backups:
For production, consider:
- Continuous archiving with WAL shipping
- Point-in-time recovery (PITR)
- Streaming replication for high availability
Restore¶
Restore from a logical backup:
Log Management¶
Log Locations¶
| Component | Location |
|---|---|
| Core logs | stdout (use log aggregation) |
| Gateway logs | stdout (use log aggregation) |
| PostgreSQL | /var/log/postgresql/ (varies) |
Log Aggregation¶
For production, aggregate logs with:
- Loki - Grafana's log aggregation
- Elasticsearch - Full-text search
- CloudWatch - AWS managed logging
Docker Logging¶
View container logs:
Troubleshooting¶
Connection Issues¶
Symptom: Gateway cannot connect to core.
Check:
- Core is running:
holomush status - Network connectivity:
nc -zv localhost 9000 - Firewall rules allow traffic
- TLS certificates are valid
Database Connection Failed¶
Symptom: Core fails with database errors.
Check:
- PostgreSQL is running:
pg_isready -h localhost DATABASE_URLis correct- User has permissions:
\duin psql - Database exists:
\lin psql
High Memory Usage¶
Possible causes:
- Too many active sessions
- Large query result sets
- Plugin memory leaks
Investigate:
# Check active sessions
curl http://localhost:9100/metrics | grep sessions_active
# Check PostgreSQL connections
psql -c "SELECT count(*) FROM pg_stat_activity WHERE datname='holomush';"
Slow Queries¶
Investigate:
- Enable
pg_stat_statements(see above) - Check for missing indexes
- Analyze query plans with
EXPLAIN ANALYZE
Next Steps¶
- Configuration - Adjust server settings
- Installation - Deployment options