Production best practices for SkillFlaw on Kubernetes

This guide describes production operational practices for running SkillFlaw safely and predictably on Kubernetes.

Choose the smallest topology that matches reality

Choose a deployment topology based on the service capabilities you actually need, not on which images happen to be available in the repository.

Choose one of these patterns deliberately:

  • API-first production: backend + PostgreSQL + Redis
  • Full web production: backend + frontend + PostgreSQL + Redis
  • Public docs: add the docs service only when required

Keeping the topology explicit makes scaling, security review, and incident response much easier.
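One way to keep the topology explicit is a per-pattern overlay. The sketch below assumes a Kustomize layout with one base per service; the directory names are illustrative, not part of the SkillFlaw distribution:

```yaml
# overlays/api-first/kustomization.yaml (hypothetical layout)
# API-first production: backend + PostgreSQL + Redis, nothing else.
resources:
  - ../../base/backend
  - ../../base/postgresql
  - ../../base/redis
# A "full web" overlay would add ../../base/frontend, and a docs
# overlay would add the docs service only when it is required.
```

Because each overlay names exactly the services it runs, a security review or incident responder can read the topology directly from the manifest.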

Externalize state before you scale

Horizontal scaling only works well when state is no longer local to a single pod.

Before increasing backend replicas, confirm all of the following:

  • SKILLFLAW_DATABASE_URL points to a real PostgreSQL service
  • SKILLFLAW_CACHE_TYPE=redis is backed by a reachable Redis deployment
  • SKILLFLAW_CONFIG_DIR is mounted on durable storage
  • SKILLFLAW_SECRET_KEY_FILE is mounted from a secret-backed file

If any one of these is still pod-local, fix that first.
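The checklist above maps to pod spec fields. In this sketch the Service, Secret, and PVC names (`skillflaw-db`, `skillflaw-secret-key`, `skillflaw-config`) are assumptions about your manifests:

```yaml
# Illustrative backend container spec with all state externalized.
containers:
  - name: backend
    env:
      - name: SKILLFLAW_DATABASE_URL        # points at a real PostgreSQL service
        valueFrom:
          secretKeyRef:
            name: skillflaw-db
            key: url
      - name: SKILLFLAW_CACHE_TYPE          # backed by a reachable Redis deployment
        value: redis
      - name: SKILLFLAW_CONFIG_DIR          # durable storage, not the container filesystem
        value: /app/config
      - name: SKILLFLAW_SECRET_KEY_FILE     # secret-backed file, not an env var value
        value: /secrets/secret-key
    volumeMounts:
      - name: config
        mountPath: /app/config
      - name: secret-key
        mountPath: /secrets
        readOnly: true
volumes:
  - name: config
    persistentVolumeClaim:
      claimName: skillflaw-config
  - name: secret-key
    secret:
      secretName: skillflaw-secret-key
```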

Scale backend and frontend independently

The backend and frontend have different scaling behavior.

Backend

The backend carries flow execution, API traffic, and most operational risk.

Scale the backend based on:

  • request concurrency
  • flow complexity
  • file upload or large payload pressure
  • queueing or latency during peak periods

Start with explicit resource requests, observe real load, and then add replicas or CPU/memory.
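A minimal starting point, assuming a Deployment named `skillflaw-backend`; every number here is a tuning baseline to measure against, not a recommendation:

```yaml
# Explicit requests first, then autoscaling once real load is observed.
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    memory: "2Gi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: skillflaw-backend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: skillflaw-backend
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

CPU utilization is only a proxy for the pressures listed above; if queueing or latency is the real constraint, scale on a custom metric instead.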

Frontend

The frontend is stateless and generally cheaper to scale.

Scale it for:

  • concurrent browser users
  • slow page load under peak traffic
  • isolation from backend rollout cadence

Do not tie frontend replica count to backend replica count unless you have measured a real need.
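In practice this usually means a small, independently sized replica count, as in this sketch (the Deployment name is an assumption):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: skillflaw-frontend
spec:
  replicas: 2   # sized for browser traffic, not coupled to backend replicas
```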

Treat PostgreSQL as a first-class production dependency

PostgreSQL is not an afterthought in SkillFlaw production environments. It is part of the core runtime contract.

Recommended practices:

  • persistent storage
  • backup and restore procedures
  • controlled schema migration workflow
  • connection monitoring and alerting
  • explicit failover plan if you operate HA PostgreSQL

If you run multiple backend replicas, your database and storage strategy must already support that shape.
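As one example of a backup procedure, a nightly logical dump can run as a CronJob. This is a hypothetical sketch: the image, secret name, and PVC are placeholders for your own backup tooling, and `pg_dump` alone is not a complete strategy for large or HA databases:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
spec:
  schedule: "0 2 * * *"           # nightly, during the quietest window
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: pg-dump
              image: postgres:16
              command: ["/bin/sh", "-c"]
              args:
                - pg_dump "$SKILLFLAW_DATABASE_URL" -Fc -f "/backup/skillflaw-$(date +%F).dump"
              env:
                - name: SKILLFLAW_DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: skillflaw-db
                      key: url
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: postgres-backups
```

Whatever mechanism you use, rehearse the restore path; an untested backup is not a backup.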

Use Redis intentionally

If you enable Redis-backed caching, run Redis as a managed dependency with clear memory and persistence settings.

Make cache behavior observable. Undocumented cache behavior can become a material source of production instability.
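"Clear memory and persistence settings" means stating them in the manifest rather than accepting defaults. The values below are illustrative:

```yaml
containers:
  - name: redis
    image: redis:7
    args:
      - --maxmemory
      - 512mb
      - --maxmemory-policy
      - allkeys-lru     # evict least-recently-used keys under memory pressure
      - --save
      - ""              # cache-only: snapshots deliberately disabled
```

Writing the eviction policy and persistence choice into the spec makes cache behavior reviewable instead of implicit.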

Preserve routing semantics

The current frontend Nginx configuration supports these behaviors:

  • / serves the app shell
  • /chat and /flow/... use dedicated HTML fallbacks

When you add ingress rules, preserve these semantics instead of applying a generic catch-all SPA rule.

In particular:

  • route /api/ traffic directly to the backend service
  • prefer separate hostnames when public API, public UI, and public docs have different audiences
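An Ingress that preserves this split might look like the following; the hostname, service names, and backend port are assumptions about your environment:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: skillflaw
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /api/          # API traffic goes straight to the backend
            pathType: Prefix
            backend:
              service:
                name: skillflaw-backend
                port:
                  number: 7860
          - path: /              # everything else reaches the frontend Nginx,
            pathType: Prefix     # which owns the /chat and /flow/... fallbacks
            backend:
              service:
                name: skillflaw-frontend
                port:
                  number: 80
```

Routing `/` to the frontend service, rather than rewriting paths at the ingress, keeps the existing Nginx fallback behavior intact.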

Secure secrets and configuration boundaries

Follow these principles consistently:

  • store API keys, database credentials, and secret key files in Kubernetes secrets or an external secret manager
  • mount SKILLFLAW_SECRET_KEY_FILE as a file
  • avoid baking environment-specific secrets into images
  • scope network access so PostgreSQL and Redis are not broadly reachable
  • require HTTPS/TLS at the ingress edge for any real user traffic
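Scoping network access can be expressed as a NetworkPolicy. This hypothetical example allows only backend pods to reach PostgreSQL; the pod labels are assumptions about your manifests:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: postgres-ingress
spec:
  podSelector:
    matchLabels:
      app: postgres
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: skillflaw-backend
      ports:
        - protocol: TCP
          port: 5432
```

A matching policy for Redis on port 6379 follows the same shape.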

Validate with production-like traffic

Before rollout, test more than just "the pod started":

  • authenticated requests to /api/v1/run/{flow_id}
  • OpenAI-compatible requests to /api/v1/responses if clients depend on them
  • MCP traffic to /api/v1/mcp/streamable if you expose MCP tools
  • browser login and flow execution if the UI is enabled
  • docs hostname rendering if docs are exposed publicly

Then add targeted load testing with representative flows instead of synthetic hello-world-only checks.
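The checks above can be scripted. In this sketch the base URL, the `x-api-key` header, and the request bodies are assumptions about your environment, not documented contracts; adjust them to match your actual API clients:

```shell
# Hypothetical pre-rollout smoke checks for a SkillFlaw deployment.
BASE="${SKILLFLAW_BASE_URL:-https://skillflaw.example.com}"

smoke_health() {     # plain health probe
  curl -fsS "$BASE/health"
}

smoke_flow_run() {   # authenticated run of a real flow; $1 = flow id
  curl -fsS -X POST "$BASE/api/v1/run/$1" \
    -H "x-api-key: $SKILLFLAW_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"input_value": "ping"}'
}

smoke_responses() {  # OpenAI-compatible endpoint, if clients depend on it
  curl -fsS -X POST "$BASE/api/v1/responses" \
    -H "x-api-key: $SKILLFLAW_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"input": "ping"}'
}
```

Run these against a staging environment with representative flows before every production rollout.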

Keep source and image deployment discipline separate

For source deployments, remember that frontend files must be rebuilt into src/backend/base/skillflaw/frontend.

For image deployments, remember that the published docs image is the supported way to serve public documentation.

Confusing these two models is a reliable way to ship stale UI or stale docs.
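For the source-deployment case, a rebuild step might look like the following sketch. The `src/frontend` path, npm script, and build output directory are assumptions about the repository layout; only the destination path comes from the text above:

```shell
# Hypothetical source-deployment frontend rebuild.
rebuild_frontend() {
  (cd src/frontend && npm ci && npm run build) || return 1
  rm -rf src/backend/base/skillflaw/frontend
  cp -r src/frontend/build src/backend/base/skillflaw/frontend
}
```

Running this before packaging a source deployment is what prevents shipping a stale UI.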

Prefer explicit rollout and rollback steps

For each production deployment, be ready to answer:

  • which image tag is being deployed
  • which config or secret changed
  • how to revert the backend
  • whether frontend and docs should roll back with it or remain pinned

If these questions cannot be answered before rollout, the deployment process requires additional release controls.
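One way to force those answers is to make the image tag an explicit argument of the rollout. In this sketch the Deployment name, container name, and image repository are assumptions:

```shell
# Illustrative rollout helpers for the backend.
deploy_backend() {   # $1 = the image tag being deployed
  kubectl set image deployment/skillflaw-backend \
    backend="skillflaw/backend:$1"
  kubectl rollout status deployment/skillflaw-backend --timeout=300s
}

rollback_backend() { # revert the backend; frontend and docs stay pinned
  kubectl rollout undo deployment/skillflaw-backend
  kubectl rollout status deployment/skillflaw-backend --timeout=300s
}
```

Because `deploy_backend` refuses to run without a tag, "which image tag is being deployed" is answered by construction.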

Observe health continuously

At minimum, collect and alert on:

  • backend /health failures
  • container restarts
  • API latency and error rate
  • PostgreSQL saturation or failed connections
  • Redis connectivity issues
  • ingress 4xx/5xx spikes

Centralized logs are strongly recommended so that backend, ingress, database, and Redis events can be correlated during incidents.
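The backend /health endpoint mentioned above can also drive the pod's own probes, so Kubernetes restarts unhealthy containers before users notice. The port and timings here are assumptions to tune:

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 7860
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health
    port: 7860
  periodSeconds: 5
```

Probe failures and the resulting restarts should feed the same alerting pipeline as the metrics listed above.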

See also