The Problem
Every new Cosmos chain we wanted to validate on required:
- Provisioning a new Kubernetes namespace and node-affinity rules
- Hand-rolling YAML for the validator pod, persistent volumes, and services
- Manual key generation and distribution — the highest-risk step
- Wiring up monitoring and alerting from scratch
With multiple chains in scope, this process consumed days of engineering time per validator, and no two deployments ended up configured the same way.
What We Built
We designed a Helm chart library that treats a validator as a first-class Kubernetes workload. The chart encapsulates:
- StatefulSet for the validator binary with chain-specific image tags
- Horcrux sidecar integration for threshold signing (2-of-3 by default)
- PersistentVolumeClaims with configurable storage classes
- ServiceMonitor resources for Prometheus scraping out of the box
- Pre-install hooks for key import and genesis validation
A single values.yaml file describes the full validator configuration. Onboarding a new chain became a helm install with a chain-specific values override.
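To make the override idea concrete, here is a minimal Python sketch of how a chain-specific values file layers over chart defaults. It mirrors the way Helm combines a values file passed with -f with the chart's own values.yaml: nested maps are merged key by key, while scalars and lists are replaced outright. Every key name and value below is hypothetical and chosen only for illustration; none of it is taken from the actual chart.

```python
from copy import deepcopy


def merge_values(base: dict, override: dict) -> dict:
    """Return a new dict with `override` layered on top of `base`.

    Nested mappings are merged recursively; any other value in the
    override (scalar or list) replaces the base value outright.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical chart defaults (illustrative key names only).
DEFAULTS = {
    "image": {"repository": "example/validator", "tag": "latest"},
    "horcrux": {"enabled": True, "threshold": 2, "signers": 3},
    "storage": {"className": "standard", "size": "500Gi"},
    "serviceMonitor": {"enabled": True},
}

# Hypothetical per-chain override: only the keys that differ.
CHAIN_OVERRIDE = {
    "image": {"tag": "v1.2.3"},
    "storage": {"size": "1Ti"},
}

values = merge_values(DEFAULTS, CHAIN_OVERRIDE)
```

The point of the design is that the per-chain file stays small: it names only what differs, and everything else (signing threshold, monitoring, storage class) is inherited from the defaults.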
Horcrux: Eliminating the Single Point of Failure
Before Horcrux, the validator key lived on a single node. One compromised machine meant a compromised validator — a serious security and financial risk.
Horcrux implements threshold signing: the private key is split into shares distributed across multiple signers. The validator binary never holds the full key. A quorum of signers (e.g., 2 of 3) must collaborate to produce each signature.
Integrating this into the Helm chart meant the key ceremony happens only once, at onboarding: the shares are stored in Kubernetes Secrets, and the signing cluster manages itself thereafter.
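The 2-of-3 quorum property described above can be illustrated with Shamir's secret sharing. The sketch below is not Horcrux's signing protocol: it reconstructs the secret in one place, which is exactly what a threshold signer avoids, and it uses a toy integer secret over an arbitrarily chosen prime field. It only shows why any two of three shares are enough. The function names and the choice of prime are assumptions made for this example.

```python
import secrets
from typing import List, Tuple

# Field modulus for the demo: the Mersenne prime 2^127 - 1 (illustrative choice).
PRIME = 2**127 - 1

Share = Tuple[int, int]  # (x, y) point on the sharing polynomial


def split_secret(secret: int, threshold: int, shares: int) -> List[Share]:
    """Split `secret` so that any `threshold` of `shares` points recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must be in [0, PRIME)")
    if not 1 <= threshold <= shares:
        raise ValueError("need 1 <= threshold <= shares")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def recover_secret(points: List[Share]) -> int:
    """Lagrange-interpolate the polynomial at x = 0 to recover the secret."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


demo_shares = split_secret(123456789, threshold=2, shares=3)
recovered = recover_secret(demo_shares[:2])  # any two shares suffice
```

A single share on its own is just one point on a random line and reveals nothing about the constant term, which is the intuition behind "one compromised machine is no longer a compromised validator".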
The Outcome
- Validator onboarding: days → hours
- Zero single points of failure in key management
- Consistent configuration across all validators via Helm values
- Monitoring live from the first block with zero extra setup
What I'd Do Differently
Helm is flexible, but its templating language (Go templates) becomes painful at scale. If we were starting today, I'd evaluate a Kubernetes Operator pattern: CRDs with a controller that manages the validator lifecycle, rather than coordinating through Helm hooks.
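To show what the Operator approach changes, here is a rough Python sketch of the reconcile step at the heart of a controller: compare the desired state declared in a custom resource with the state observed in the cluster, and derive the actions needed to converge. The ValidatorSpec and ValidatorStatus types, their fields, and the action strings are all hypothetical; a real controller would call the Kubernetes API instead of returning descriptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ValidatorSpec:
    """Desired state, as a hypothetical custom resource might declare it."""
    chain_id: str
    image_tag: str
    signers: int = 3
    threshold: int = 2


@dataclass(frozen=True)
class ValidatorStatus:
    """Observed state of the cluster (hypothetical fields)."""
    image_tag: Optional[str] = None  # None means nothing is deployed yet
    ready_signers: int = 0


def reconcile(spec: ValidatorSpec, status: ValidatorStatus) -> List[str]:
    """Return the actions needed to move the observed state toward the spec."""
    actions: List[str] = []
    if status.image_tag is None:
        actions.append(f"create StatefulSet for {spec.chain_id} at {spec.image_tag}")
    elif status.image_tag != spec.image_tag:
        actions.append(f"roll StatefulSet from {status.image_tag} to {spec.image_tag}")
    if status.ready_signers < spec.signers:
        actions.append(f"restore signers: {status.ready_signers}/{spec.signers} ready")
    if status.ready_signers < spec.threshold:
        actions.append("alert: signing quorum lost")
    return actions
```

Unlike a Helm hook, which runs once at install or upgrade time, a controller runs this comparison repeatedly, so drift such as a lost signer is noticed and acted on without anyone re-running a release.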
The Horcrux architecture is solid, but the operational runbook for signer failover needs to be drilled regularly. Automation is only as good as the recovery procedure it replaces.
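One way to structure such drills is to enumerate every combination of signer outages and note, for each, whether signing is expected to continue. The sketch below does this for a generic threshold setup; the signer names are placeholders, and the function is an illustrative planning aid, not part of Horcrux or of our tooling.

```python
from itertools import combinations
from typing import List, Tuple

Scenario = Tuple[Tuple[str, ...], bool]  # (signers taken down, can still sign?)


def drill_scenarios(signers: List[str], threshold: int) -> List[Scenario]:
    """List every non-empty outage set and whether a quorum remains."""
    scenarios: List[Scenario] = []
    for failed_count in range(1, len(signers) + 1):
        for failed in combinations(signers, failed_count):
            can_sign = len(signers) - len(failed) >= threshold
            scenarios.append((failed, can_sign))
    return scenarios


# Placeholder signer names for a 2-of-3 setup.
plan = drill_scenarios(["signer-0", "signer-1", "signer-2"], threshold=2)
for failed, can_sign in plan:
    outcome = "keeps signing" if can_sign else "halts until recovery"
    print(f"down: {', '.join(failed)} -> {outcome}")
```

With 2-of-3, every single-signer outage should be a non-event, and every multi-signer outage should exercise the recovery runbook. Both halves are worth rehearsing.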