Production best practices for SkillFlaw on Kubernetes
This guide describes production operational practices for running SkillFlaw safely and predictably on Kubernetes.
Choose the smallest topology that matches reality
Adopt a deployment topology based on required service capabilities rather than image availability in the repository.
Choose one of these patterns deliberately:
- API-first production: backend + PostgreSQL + Redis
- Full web production: backend + frontend + PostgreSQL + Redis
- Public docs: add the docs service only when required
Keeping the topology explicit makes scaling, security review, and incident response much easier.
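One way to keep the choice explicit is to encode it in the deployment tooling itself. Below is a minimal Kustomize sketch for the API-first pattern; the manifest file names are hypothetical and assume one manifest per service:

```yaml
# kustomization.yaml for the API-first production pattern.
# Manifest file names are illustrative placeholders.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: skillflaw
resources:
  - backend-deployment.yaml
  - postgres-statefulset.yaml
  - redis-deployment.yaml
  # frontend and docs manifests are deliberately omitted in this
  # pattern; add them only for the full web or public docs topologies.
```

One overlay per pattern makes it obvious in review which services a cluster is actually expected to run.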
Externalize state before you scale
Horizontal scaling only works well when state is no longer local to a single pod.
Before increasing backend replicas, confirm all of the following:
- `SKILLFLAW_DATABASE_URL` points to a real PostgreSQL service
- `SKILLFLAW_CACHE_TYPE=redis` is backed by a reachable Redis deployment
- `SKILLFLAW_CONFIG_DIR` is mounted on durable storage
- `SKILLFLAW_SECRET_KEY_FILE` is mounted from a secret-backed file
If any one of these is still pod-local, fix that first.
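As a sketch of what this looks like in a backend pod template (the Secret and PersistentVolumeClaim names, mount paths, and image tag are assumptions for illustration):

```yaml
# Fragment of the backend Deployment's pod template: every
# stateful dependency points at a resource that outlives the pod.
spec:
  containers:
    - name: backend
      image: skillflaw/backend:1.2.3        # illustrative tag
      env:
        - name: SKILLFLAW_DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: skillflaw-db            # hypothetical Secret
              key: url
        - name: SKILLFLAW_CACHE_TYPE
          value: "redis"
        - name: SKILLFLAW_CONFIG_DIR
          value: /app/config                # illustrative path
        - name: SKILLFLAW_SECRET_KEY_FILE
          value: /secrets/secret-key        # illustrative path
      volumeMounts:
        - name: config
          mountPath: /app/config
        - name: secret-key
          mountPath: /secrets
          readOnly: true
  volumes:
    - name: config
      persistentVolumeClaim:
        claimName: skillflaw-config         # hypothetical PVC
    - name: secret-key
      secret:
        secretName: skillflaw-secret-key    # hypothetical Secret
```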
Scale backend and frontend independently
The backend and frontend have different scaling behavior.
Backend
The backend carries flow execution, API traffic, and most operational risk.
Scale the backend based on:
- request concurrency
- flow complexity
- file upload or large payload pressure
- queueing or latency during peak periods
Start with explicit resource requests, observe real load, and then add replicas or CPU/memory.
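A starting point for those explicit requests; the numbers are placeholders to replace with observed usage, not sizing advice:

```yaml
# Backend container resources: requests give the scheduler a
# baseline to bin-pack against; limits bound the blast radius
# of a runaway flow or oversized payload.
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "2"
    memory: "4Gi"
```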
Frontend
The frontend is stateless and generally cheaper to scale.
Scale it for:
- concurrent browser users
- slow page load under peak traffic
- isolation from backend rollout cadence
Do not tie frontend replica count to backend replica count unless you have measured a real need.
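Because the frontend is stateless, a CPU-based HorizontalPodAutoscaler is usually sufficient; the Deployment name, replica bounds, and utilization target below are assumptions:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: skillflaw-frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: skillflaw-frontend   # hypothetical Deployment name
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Keeping this autoscaler separate from the backend's makes frontend capacity a function of browser traffic rather than of backend rollout decisions.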
Treat PostgreSQL as a first-class production dependency
PostgreSQL is not an afterthought in SkillFlaw production environments. It is part of the core runtime contract.
Recommended practices:
- persistent storage
- backup and restore procedures
- controlled schema migration workflow
- connection monitoring and alerting
- explicit failover plan if you operate HA PostgreSQL
If you run multiple backend replicas, your database and storage strategy must already support that shape.
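As one hedged example of the backup practice, a nightly logical dump can run as a CronJob; the schedule, image tag, Secret, and PVC names are assumptions, and a dump is only useful if the restore path is also tested:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: skillflaw-pg-backup
spec:
  schedule: "0 3 * * *"            # nightly; adjust to your RPO
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: pg-dump
              image: postgres:16   # match your server version
              command: ["/bin/sh", "-c"]
              args:
                - pg_dump "$SKILLFLAW_DATABASE_URL" -Fc -f /backups/skillflaw-$(date +%F).dump
              env:
                - name: SKILLFLAW_DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: skillflaw-db       # hypothetical Secret
                      key: url
              volumeMounts:
                - name: backups
                  mountPath: /backups
          volumes:
            - name: backups
              persistentVolumeClaim:
                claimName: skillflaw-backups   # hypothetical PVC
```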
Use Redis intentionally
If you enable Redis-backed caching, run Redis as a managed dependency with clear memory and persistence settings.
Make cache behavior observable. Undocumented cache behavior can become a material source of production instability.
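A minimal sketch of Redis with explicit memory settings, so eviction behavior is declared up front rather than discovered during an incident; the memory size, eviction policy, and resource numbers are assumptions:

```yaml
# Redis with explicit memory and eviction settings so cache
# pressure fails predictably instead of OOM-killing the pod.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: skillflaw-redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: skillflaw-redis
  template:
    metadata:
      labels:
        app: skillflaw-redis
    spec:
      containers:
        - name: redis
          image: redis:7
          args:
            - --maxmemory
            - 512mb
            - --maxmemory-policy
            - allkeys-lru
          resources:
            requests:
              memory: 640Mi   # headroom above maxmemory
            limits:
              memory: 768Mi
```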
Preserve routing semantics
The current frontend Nginx configuration supports these behaviors:
- `/` serves the app shell
- `/chat` and `/flow/...` use dedicated HTML fallbacks
When you add ingress rules, preserve these semantics instead of applying a generic catch-all SPA rule.
In particular:
- route `/api/` traffic directly to the backend service (see the ingress sketch below)
- prefer separate hostnames when public API, public UI, and public docs have different audiences
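A sketch of ingress rules that preserve these semantics; the hostname, Service names, and ports are assumptions:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: skillflaw
spec:
  rules:
    - host: app.example.com             # hypothetical hostname
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: skillflaw-backend   # hypothetical Service
                port:
                  number: 7860            # illustrative port
          - path: /
            pathType: Prefix
            backend:
              service:
                name: skillflaw-frontend  # hypothetical Service
                port:
                  number: 80
```

Routing `/` to the frontend service keeps the `/chat` and `/flow/...` fallbacks under the Nginx configuration that actually implements them.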
Secure secrets and configuration boundaries
Follow these principles consistently:
- store API keys, database credentials, and secret key files in Kubernetes secrets or an external secret manager
- mount `SKILLFLAW_SECRET_KEY_FILE` as a file
- avoid baking environment-specific secrets into images
- scope network access so PostgreSQL and Redis are not broadly reachable
- require HTTPS/TLS at the ingress edge for any real user traffic
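Network scoping is the piece most often skipped. A NetworkPolicy such as the following admits only backend pods to PostgreSQL; the pod labels are assumptions:

```yaml
# Only pods labeled as the SkillFlaw backend may open connections
# to PostgreSQL; other pods in the namespace are denied.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: postgres-ingress
spec:
  podSelector:
    matchLabels:
      app: skillflaw-postgres        # hypothetical pod label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: skillflaw-backend # hypothetical pod label
      ports:
        - protocol: TCP
          port: 5432
```

An equivalent policy for Redis follows the same shape on port 6379.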
Validate with production-like traffic
Before rollout, test more than just "the pod started":
- authenticated requests to `/api/v1/run/{flow_id}`
- OpenAI-compatible requests to `/api/v1/responses` if clients depend on them
- MCP traffic to `/api/v1/mcp/streamable` if you expose MCP tools
- browser login and flow execution if the UI is enabled
- docs hostname rendering if docs are exposed publicly
Then add targeted load testing with representative flows instead of synthetic hello-world-only checks.
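An in-cluster smoke test exercises the same Service DNS and auth path that real clients will use. The sketch below assumes a public curl image, a hypothetical API-key Secret and header name, an illustrative Service name and port, and a placeholder flow id:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: skillflaw-smoke-test
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: smoke
          image: curlimages/curl:8.8.0
          command: ["/bin/sh", "-c"]
          args:
            - |
              # -f fails the Job on any non-2xx response.
              # Header name is an assumption; use your auth scheme.
              curl -fsS -X POST \
                -H "x-api-key: $API_KEY" \
                -H "Content-Type: application/json" \
                -d '{"input_value": "smoke test"}' \
                http://skillflaw-backend:7860/api/v1/run/REPLACE_FLOW_ID
          env:
            - name: API_KEY
              valueFrom:
                secretKeyRef:
                  name: skillflaw-api-key   # hypothetical Secret
                  key: key
```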
Keep source and image deployment discipline separate
For source deployments, remember that frontend files must be rebuilt into `src/backend/base/skillflaw/frontend`.
For image deployments, remember that the published docs image is the supported way to serve public documentation.
Confusing these two models is a reliable way to ship stale UI or stale docs.
Prefer explicit rollout and rollback steps
For each production deployment, be ready to answer:
- which image tag is being deployed
- which config or secret changed
- how to revert the backend
- whether frontend and docs should roll back with it or remain pinned
If these questions cannot be answered before rollout, the deployment process requires additional release controls.
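Pinning exact image tags and recording the change cause in the manifest makes most of these questions answerable from the cluster itself via `kubectl rollout history`; the tag and annotation text below are illustrative:

```yaml
# Deployment fragment: an exact tag plus a recorded change cause
# gives `kubectl rollout undo` meaningful revisions to return to.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: skillflaw-backend
  annotations:
    kubernetes.io/change-cause: "backend 1.2.3; rotated secret key"
spec:
  template:
    spec:
      containers:
        - name: backend
          image: skillflaw/backend:1.2.3   # exact tag, never :latest
```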
Observe health continuously
At minimum, collect and alert on:
- backend `/health` failures
- container restarts
- API latency and error rate
- PostgreSQL saturation or failed connections
- Redis connectivity issues
- ingress 4xx/5xx spikes
Centralized logs are strongly recommended so that backend, ingress, database, and Redis events can be correlated during incidents.
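Catching `/health` failures starts with letting Kubernetes probe the endpoint itself; the port and timings in this backend container fragment are assumptions to tune:

```yaml
# Readiness gates traffic to the pod, liveness restarts a wedged
# one, and both feed the restart metrics you should alert on.
readinessProbe:
  httpGet:
    path: /health
    port: 7860        # illustrative backend port
  initialDelaySeconds: 10
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 7860
  initialDelaySeconds: 30
  periodSeconds: 20
  failureThreshold: 3
```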