How We Automated Cosmos Validator Onboarding with Helm

The Problem

Every new Cosmos chain we wanted to validate on required:

Provisioning a new Kubernetes namespace and node-affinity rules
Hand-rolling YAML for the validator pod, persistent volumes, and services
Generating and distributing keys by hand, the highest-risk step
Wiring up monitoring and alerting from scratch

With multiple chains in scope, this process consumed days of engineering time per validator, and no two deployments ended up configured the same way.

What We Built

We designed a Helm chart library that treats a validator as a first-class Kubernetes workload. The chart encapsulates:

StatefulSet for the validator binary with chain-specific image tags
Horcrux sidecar integration for threshold signing (2-of-3 by default)
PersistentVolumeClaims with configurable storage classes
ServiceMonitor resources for Prometheus scraping out of the box
Pre-install hooks for key import and genesis validation

A single values.yaml file describes the full validator configuration. Onboarding a new chain became a helm install with a chain-specific values override.
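
The override mechanics are easy to sketch. Helm merges values files map by map, with later files winning and lists replaced wholesale; the Python below mimics that basic behavior. The keys shown (image.tag, horcrux.threshold, storage.size) are hypothetical and only illustrate the idea; they are not the chart's actual schema.

```python
import copy


def merge_values(base, override):
    """Recursively merge `override` into a copy of `base`, Helm-style."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested maps are merged key by key.
            merged[key] = merge_values(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale, not combined.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical chart defaults (stand-in for the chart's values.yaml).
DEFAULTS = {
    "image": {"repository": "example/validator", "tag": "latest"},
    "horcrux": {"enabled": True, "threshold": 2, "signers": 3},
    "storage": {"className": "standard", "size": "500Gi"},
}

# Hypothetical chain-specific override file.
CHAIN_OVERRIDE = {
    "image": {"tag": "v1.2.3"},
    "storage": {"size": "1Ti"},
}

effective = merge_values(DEFAULTS, CHAIN_OVERRIDE)
```

Only the keys a chain actually changes need to appear in its override; everything else, including the 2-of-3 signing default, is inherited from the chart.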

Horcrux: Eliminating the Single Point of Failure

Before Horcrux, the validator key lived on a single node. One compromised machine meant a compromised validator — a serious security and financial risk.

Horcrux implements threshold signing: the private key is split into shares distributed across multiple signers. The validator binary never holds the full key. A quorum of signers (e.g., 2 of 3) must collaborate to produce each signature.
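
The 2-of-3 property can be illustrated with a Shamir-style split. This is a teaching sketch, not Horcrux's code: it shows only that any two of three shares determine the secret while the shares are generated from a random polynomial. Unlike this sketch, a threshold signer never reassembles the key; signers combine partial signatures instead. The field prime and function names are choices made for the example.

```python
import secrets

# A prime comfortably larger than a 256-bit secret (2**521 - 1 is a Mersenne prime).
PRIME = 2**521 - 1


def split_secret(secret, threshold, total):
    """Split `secret` into `total` shares; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range")
    # Random polynomial of degree threshold-1 with the secret as constant term.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, total + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation of the polynomial at x
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares):
    """Lagrange-interpolate the polynomial at x = 0 to get the constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


key = secrets.randbits(256)
shares = split_secret(key, threshold=2, total=3)
```

Calling recover_secret with any two entries of shares returns key, which is why losing one signer does not stop the validator and compromising one signer does not expose the key.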

Integrating this into the Helm chart means the key ceremony happens only once, at onboarding: the shares are stored in Kubernetes Secrets, and the signing cluster runs without manual intervention in normal operation.

The Outcome

Validator onboarding: days → hours
No single point of failure in key management: no one machine holds the full validator key
Consistent configuration across all validators via Helm values
Monitoring live from the first block with zero extra setup

What I'd Do Differently

Helm is flexible, but its templating language (Go templates) becomes painful at scale. If we were starting today, I'd evaluate the Kubernetes Operator pattern: custom resources (CRDs) with a controller that manages the validator lifecycle, rather than coordinating it through Helm hooks.
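
To make the contrast concrete: a Helm hook runs once at install or upgrade time, whereas a controller repeatedly compares desired state with observed state and acts on the difference. Here is a minimal sketch of that reconcile step. ValidatorSpec, ValidatorStatus, and the action strings are all hypothetical; this is not a real operator API, just the shape of the logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ValidatorSpec:
    """Desired state, as it might appear in a hypothetical custom resource."""
    chain_id: str
    image_tag: str
    signers: int


@dataclass(frozen=True)
class ValidatorStatus:
    """Observed state reported by the cluster (hypothetical fields)."""
    image_tag: Optional[str]  # None means the validator does not exist yet
    ready_signers: int


def reconcile(spec, status):
    """Return the actions needed to move observed state toward desired state."""
    actions = []
    if status.image_tag is None:
        actions.append(f"create validator for {spec.chain_id}")
    elif status.image_tag != spec.image_tag:
        actions.append(f"roll image to {spec.image_tag}")
    if status.ready_signers < spec.signers:
        missing = spec.signers - status.ready_signers
        actions.append(f"restore {missing} signer(s)")
    return actions
```

A real controller would call something like this on every watch event or resync. An empty action list means the cluster already matches the spec, so the same code path handles first install, upgrades, and drift.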

The Horcrux architecture is solid, but the operational runbook for signer failover needs to be drilled regularly. Automation is only as good as the recovery procedure it replaces.