Production best practices for SkillFlaw on Kubernetes
This guide describes production operational practices for running SkillFlaw safely and predictably on Kubernetes.
Choose the smallest topology that matches reality
Adopt a deployment topology based on required service capabilities rather than image availability in the repository.
Choose one of these patterns deliberately:
- API-first production: backend + PostgreSQL + Redis
- Full web production: backend + frontend + PostgreSQL + Redis
- Public docs: add the docs service only when required
Keeping the topology explicit makes scaling, security review, and incident response much easier.
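One way to keep the choice explicit is to encode it in the deployment tooling itself. Below is a minimal Kustomize sketch for the API-first pattern; the manifest file names are hypothetical and assume one manifest per service:

```yaml
# kustomization.yaml for the API-first production pattern.
# Manifest file names are illustrative placeholders.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: skillflaw
resources:
  - backend-deployment.yaml
  - postgres-statefulset.yaml
  - redis-deployment.yaml
  # frontend and docs manifests are deliberately omitted in this
  # pattern; add them only for the full web or public docs topologies.
```

One overlay per pattern makes it obvious in review which services a cluster is actually expected to run.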
Externalize state before you scale
Horizontal scaling only works well when state is no longer local to a single pod.
Before increasing backend replicas, confirm all of the following:
- `SKILLFLAW_DATABASE_URL` points to a real PostgreSQL service
- `SKILLFLAW_CACHE_TYPE=redis` is backed by a reachable Redis deployment
- `SKILLFLAW_CONFIG_DIR` is mounted on durable storage
- `SKILLFLAW_SECRET_KEY_FILE` is mounted from a secret-backed file
If any one of these is still pod-local, fix that first.
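As a sketch of what this looks like in a backend pod template (the Secret and PersistentVolumeClaim names, mount paths, and image tag are assumptions for illustration):

```yaml
# Fragment of the backend Deployment's pod template: every
# stateful dependency points at a resource that outlives the pod.
spec:
  containers:
    - name: backend
      image: skillflaw/backend:1.2.3        # illustrative tag
      env:
        - name: SKILLFLAW_DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: skillflaw-db            # hypothetical Secret
              key: url
        - name: SKILLFLAW_CACHE_TYPE
          value: "redis"
        - name: SKILLFLAW_CONFIG_DIR
          value: /app/config                # illustrative path
        - name: SKILLFLAW_SECRET_KEY_FILE
          value: /secrets/secret-key        # illustrative path
      volumeMounts:
        - name: config
          mountPath: /app/config
        - name: secret-key
          mountPath: /secrets
          readOnly: true
  volumes:
    - name: config
      persistentVolumeClaim:
        claimName: skillflaw-config         # hypothetical PVC
    - name: secret-key
      secret:
        secretName: skillflaw-secret-key    # hypothetical Secret
```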
Scale backend and frontend independently
The backend and frontend have different scaling behavior.
Backend
The backend carries flow execution, API traffic, and most operational risk.
Scale the backend based on:
- request concurrency
- flow complexity
- file upload or large payload pressure
- queueing or latency during peak periods
Start with explicit resource requests, observe real load, and then add replicas or CPU/memory.
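A starting point for those explicit requests; the numbers are placeholders to replace with observed usage, not sizing advice:

```yaml
# Backend container resources: requests give the scheduler a
# baseline to bin-pack against; limits bound the blast radius
# of a runaway flow or oversized payload.
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "2"
    memory: "4Gi"
```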
Frontend
The frontend is stateless and generally cheaper to scale.
Scale it for:
- concurrent browser users
- slow page load under peak traffic
- isolation from backend rollout cadence
Do not tie frontend replica count to backend replica count unless you have measured a real need.
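Because the frontend is stateless, a CPU-based HorizontalPodAutoscaler is usually sufficient; the Deployment name, replica bounds, and utilization target below are assumptions:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: skillflaw-frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: skillflaw-frontend   # hypothetical Deployment name
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Keeping this autoscaler separate from the backend's makes frontend capacity a function of browser traffic rather than of backend rollout decisions.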
Treat PostgreSQL as a first-class production dependency
PostgreSQL is not an afterthought in SkillFlaw production environments. It is part of the core runtime contract.
Recommended practices:
- persistent storage
- backup and restore procedures
- controlled schema migration workflow
- connection monitoring and alerting
- explicit failover plan if you operate HA PostgreSQL
If you run multiple backend replicas, your database and storage strategy must already support that shape.
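As one hedged example of the backup practice, a nightly logical dump can run as a CronJob; the schedule, image tag, Secret, and PVC names are assumptions, and a dump is only useful if the restore path is also tested:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: skillflaw-pg-backup
spec:
  schedule: "0 3 * * *"            # nightly; adjust to your RPO
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: pg-dump
              image: postgres:16   # match your server version
              command: ["/bin/sh", "-c"]
              args:
                - pg_dump "$SKILLFLAW_DATABASE_URL" -Fc -f /backups/skillflaw-$(date +%F).dump
              env:
                - name: SKILLFLAW_DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: skillflaw-db       # hypothetical Secret
                      key: url
              volumeMounts:
                - name: backups
                  mountPath: /backups
          volumes:
            - name: backups
              persistentVolumeClaim:
                claimName: skillflaw-backups   # hypothetical PVC
```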
Use Redis intentionally
If you enable Redis-backed caching, run Redis as a managed dependency with clear memory and persistence settings.
Make cache behavior observable. Undocumented cache behavior can become a material source of production instability.
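A minimal sketch of Redis with explicit memory settings, so eviction behavior is declared up front rather than discovered during an incident; the memory size, eviction policy, and resource numbers are assumptions:

```yaml
# Redis with explicit memory and eviction settings so cache
# pressure fails predictably instead of OOM-killing the pod.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: skillflaw-redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: skillflaw-redis
  template:
    metadata:
      labels:
        app: skillflaw-redis
    spec:
      containers:
        - name: redis
          image: redis:7
          args:
            - --maxmemory
            - 512mb
            - --maxmemory-policy
            - allkeys-lru
          resources:
            requests:
              memory: 640Mi   # headroom above maxmemory
            limits:
              memory: 768Mi
```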
Preserve routing semantics
The current frontend Nginx configuration supports these behaviors:
- `/` serves the app shell
- `/chat` and `/flow/...` use dedicated HTML fallbacks
When you add ingress rules, preserve these semantics instead of applying a generic catch-all SPA rule.
In particular:
- route `/api/` traffic directly to the backend service (see the ingress sketch below)
- prefer separate hostnames when public API, public UI, and public docs have different audiences
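A sketch of ingress rules that preserve these semantics; the hostname, Service names, and ports are assumptions:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: skillflaw
spec:
  rules:
    - host: app.example.com             # hypothetical hostname
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: skillflaw-backend   # hypothetical Service
                port:
                  number: 7860            # illustrative port
          - path: /
            pathType: Prefix
            backend:
              service:
                name: skillflaw-frontend  # hypothetical Service
                port:
                  number: 80
```

Routing `/` to the frontend service keeps the `/chat` and `/flow/...` fallbacks under the Nginx configuration that actually implements them.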
Secure secrets and configuration boundaries
Follow these principles consistently:
- store API keys, database credentials, and secret key files in Kubernetes secrets or an external secret manager
- mount `SKILLFLAW_SECRET_KEY_FILE` as a file
- avoid baking environment-specific secrets into images
- scope network access so PostgreSQL and Redis are not broadly reachable
- require HTTPS/TLS at the ingress edge for any real user traffic
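Network scoping is the piece most often skipped. A NetworkPolicy such as the following admits only backend pods to PostgreSQL; the pod labels are assumptions:

```yaml
# Only pods labeled as the SkillFlaw backend may open connections
# to PostgreSQL; other pods in the namespace are denied.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: postgres-ingress
spec:
  podSelector:
    matchLabels:
      app: skillflaw-postgres        # hypothetical pod label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: skillflaw-backend # hypothetical pod label
      ports:
        - protocol: TCP
          port: 5432
```

An equivalent policy for Redis follows the same shape on port 6379.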
Validate with production-like traffic
Before rollout, test more than just "the pod started":
- authenticated requests to `/api/v1/run/{flow_id}`
- OpenAI-compatible requests to `/api/v1/responses` if clients depend on them
- MCP traffic to `/api/v1/mcp/streamable` if you expose MCP tools
- browser login and flow execution if the UI is enabled
- docs hostname rendering if docs are exposed publicly
Then add targeted load testing with representative flows instead of synthetic hello-world-only checks.
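An in-cluster smoke test exercises the same Service DNS and auth path that real clients will use. The sketch below assumes a public curl image, a hypothetical API-key Secret and header name, an illustrative Service name and port, and a placeholder flow id:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: skillflaw-smoke-test
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: smoke
          image: curlimages/curl:8.8.0
          command: ["/bin/sh", "-c"]
          args:
            - |
              # -f fails the Job on any non-2xx response.
              # Header name is an assumption; use your auth scheme.
              curl -fsS -X POST \
                -H "x-api-key: $API_KEY" \
                -H "Content-Type: application/json" \
                -d '{"input_value": "smoke test"}' \
                http://skillflaw-backend:7860/api/v1/run/REPLACE_FLOW_ID
          env:
            - name: API_KEY
              valueFrom:
                secretKeyRef:
                  name: skillflaw-api-key   # hypothetical Secret
                  key: key
```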
Keep source and image deployment discipline separate
For source deployments, remember that frontend files must be rebuilt into `src/backend/base/skillflaw/frontend`.
For image deployments, remember that the published docs image is the supported way to serve public documentation.
Confusing these two models is a reliable way to ship stale UI or stale docs.
Prefer explicit rollout and rollback steps
For each production deployment, be ready to answer:
- which image tag is being deployed
- which config or secret changed
- how to revert the backend
- whether frontend and docs should roll back with it or remain pinned
If these questions cannot be answered before rollout, the deployment process requires additional release controls.
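Pinning exact image tags and recording the change cause in the manifest makes most of these questions answerable from the cluster itself via `kubectl rollout history`; the tag and annotation text below are illustrative:

```yaml
# Deployment fragment: an exact tag plus a recorded change cause
# gives `kubectl rollout undo` meaningful revisions to return to.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: skillflaw-backend
  annotations:
    kubernetes.io/change-cause: "backend 1.2.3; rotated secret key"
spec:
  template:
    spec:
      containers:
        - name: backend
          image: skillflaw/backend:1.2.3   # exact tag, never :latest
```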
Observe health continuously
At minimum, collect and alert on:
- backend `/health` failures
- container restarts
- API latency and error rate
- PostgreSQL saturation or failed connections
- Redis connectivity issues
- ingress 4xx/5xx spikes
Centralized logs are strongly recommended so that backend, ingress, database, and Redis events can be correlated during incidents.
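Catching `/health` failures starts with letting Kubernetes probe the endpoint itself; the port and timings in this backend container fragment are assumptions to tune:

```yaml
# Readiness gates traffic to the pod, liveness restarts a wedged
# one, and both feed the restart metrics you should alert on.
readinessProbe:
  httpGet:
    path: /health
    port: 7860        # illustrative backend port
  initialDelaySeconds: 10
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 7860
  initialDelaySeconds: 30
  periodSeconds: 20
  failureThreshold: 3
```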