15 September 2024 · 2 min read

How We Automated Cosmos Validator Onboarding with Helm

Onboarding a new Cosmos validator used to take days of manual YAML and key ceremony. Here's how we reduced it to a single Helm install, with Horcrux threshold signing built in from the start.

Tags: Kubernetes, Helm, Blockchain, Cosmos

The Problem

Every new Cosmos chain we wanted to validate on required:

  1. Provisioning a new Kubernetes namespace and node-affinity rules
  2. Hand-rolling YAML for the validator pod, persistent volumes, and services
  3. Manual key generation and distribution — the highest-risk step
  4. Wiring up monitoring and alerting from scratch

With multiple chains in scope, this process consumed days of engineering time per validator, and no two deployments ended up configured the same way.

What We Built

We designed a Helm chart library that treats a validator as a first-class Kubernetes workload. The chart encapsulates:

  • StatefulSet for the validator binary with chain-specific image tags
  • Horcrux sidecar integration for threshold signing (2-of-3 by default)
  • PersistentVolumeClaims with configurable storage classes
  • ServiceMonitor resources for Prometheus scraping out of the box
  • Pre-install hooks for key import and genesis validation
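To make the genesis-validation hook concrete, here is a minimal sketch of the kind of check such a hook could run before the validator starts. It is not the chart's actual hook: the function name validate_genesis and the idea of pinning an expected SHA-256 are assumptions for illustration. The only thing taken from Cosmos itself is that genesis.json carries a top-level chain_id field.

```python
import hashlib
import json


def validate_genesis(genesis_bytes: bytes, expected_chain_id: str, expected_sha256: str) -> None:
    """Raise ValueError if the genesis file is not the one we expect.

    Hypothetical helper: checks the file checksum first, then the chain_id.
    """
    digest = hashlib.sha256(genesis_bytes).hexdigest()
    if digest != expected_sha256:
        raise ValueError(f"genesis checksum mismatch: got {digest}")

    genesis = json.loads(genesis_bytes)
    chain_id = genesis.get("chain_id")
    if chain_id != expected_chain_id:
        raise ValueError(f"unexpected chain_id: {chain_id!r}")


# Example with a made-up chain ID; a real hook would read the file from disk.
sample = json.dumps({"chain_id": "examplechain-1"}).encode()
validate_genesis(sample, "examplechain-1", hashlib.sha256(sample).hexdigest())
```

Failing fast here matters because a hook that exits non-zero aborts the install before a validator ever starts against the wrong genesis.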

A single values.yaml file describes the full validator configuration. Onboarding a new chain became a helm install with a chain-specific values override.
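The override model is easier to see in code. The sketch below mimics how Helm layers a values override on top of chart defaults: maps are merged recursively, while scalars and lists are replaced. Every key shown (image, horcrux, storage, serviceMonitor) and every value is a hypothetical stand-in, not our chart's real schema, and the sketch ignores Helm's rule that a null override deletes a key.

```python
import copy


def merge_values(base: dict, override: dict) -> dict:
    """Recursively merge override into a copy of base, Helm-style."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            # Scalars and lists replace the default outright.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical chart defaults (what a values.yaml might contain).
DEFAULTS = {
    "image": {"repository": "example.invalid/validator", "tag": "latest"},
    "horcrux": {"enabled": True, "threshold": 2, "shares": 3},
    "storage": {"className": "standard", "size": "500Gi"},
    "serviceMonitor": {"enabled": True},
}

# Hypothetical per-chain override: only the fields that differ.
CHAIN_OVERRIDE = {
    "image": {"tag": "v1.2.3"},
    "storage": {"size": "1Ti"},
}

values = merge_values(DEFAULTS, CHAIN_OVERRIDE)
```

In practice Helm does this merge itself when an override file is passed with -f (or --values), so the per-chain file only has to state what differs from the defaults.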

Horcrux: Eliminating the Single Point of Failure

Before Horcrux, the validator key lived on a single node. One compromised machine meant a compromised validator — a serious security and financial risk.

Horcrux implements threshold signing: the private key is split into shares distributed across multiple signers. The validator binary never holds the full key. A quorum of signers (e.g., 2 of 3) must collaborate to produce each signature.
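The quorum idea can be illustrated with a toy Shamir secret-sharing example. This is not Horcrux's implementation: Horcrux performs threshold signing, so the full key is never reassembled when signing, whereas this sketch recovers the secret purely to show that any 2 of 3 shares are enough. The prime, function names, and secret value are all chosen for illustration.

```python
import secrets

# A prime larger than the secret; 2**255 - 19 is used here only as a convenient known prime.
PRIME = 2**255 - 19


def split_secret(secret: int, threshold: int, shares: int) -> list:
    """Split secret into (x, y) points on a random polynomial with f(0) = secret."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def f(x: int) -> int:
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME

    return [(x, f(x)) for x in range(1, shares + 1)]


def recover_secret(points: list) -> int:
    """Lagrange-interpolate the polynomial at x = 0 from a quorum of points."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        total = (total + yi * num * pow(den, -1, PRIME)) % PRIME
    return total


demo_shares = split_secret(123456789, threshold=2, shares=3)
# Any two shares reconstruct the secret; which two does not matter.
recovered = recover_secret([demo_shares[0], demo_shares[2]])
```

With a threshold of 2, a single share on its own is consistent with every possible secret, which is the property that removes the single compromised machine as a path to the key.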

Integrating this into the Helm chart meant the key ceremony happens only once, at onboarding. The shares are stored in Kubernetes Secrets, and the signing cluster runs without further manual key handling thereafter.
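As a rough illustration of the storage step, the sketch below builds one Kubernetes Secret manifest per share. The naming scheme, the share.json data key, and the one-Secret-per-signer layout are assumptions made for this example, not a description of our chart. The only hard requirement reflected here is that values under data in a Secret must be base64-encoded.

```python
import base64
import json


def share_secret_manifest(chain: str, signer_index: int, share_bytes: bytes, namespace: str) -> dict:
    """Build a Secret manifest holding a single key share (hypothetical layout)."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": f"{chain}-horcrux-share-{signer_index}",
            "namespace": namespace,
        },
        "type": "Opaque",
        "data": {
            # Kubernetes expects base64-encoded values under "data".
            "share.json": base64.b64encode(share_bytes).decode("ascii"),
        },
    }


# Placeholder share contents; real shares come out of the key ceremony.
manifests = [
    share_secret_manifest("examplechain", i, f"placeholder-share-{i}".encode(), "validators")
    for i in range(1, 4)
]
rendered = json.dumps(manifests[0], indent=2)
```

Keeping each share in its own Secret means each signer only needs to mount the one share that belongs to it.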

The Outcome

  • Validator onboarding: days → hours
  • No single point of failure in key management: no one node holds the full validator key
  • Consistent configuration across all validators via Helm values
  • Monitoring live from the first block with zero extra setup

What I'd Do Differently

Helm is flexible, but its templating language (Go templates) becomes painful at scale. If we were starting today, I'd evaluate the Kubernetes Operator pattern: CRDs with a controller that manages the validator lifecycle, rather than coordinating through Helm hooks.
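To show what that shift would look like, here is a deliberately small sketch of the reconcile step at the heart of a controller: compare the desired state (what a custom resource would declare) with the observed state and decide what to do. The ValidatorSpec and ValidatorStatus types, their fields, and the action names are all invented for illustration. A real operator would be built on a controller framework and talk to the Kubernetes API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ValidatorSpec:
    """Desired state, as a hypothetical custom resource might declare it."""
    chain_id: str
    image_tag: str
    signer_replicas: int


@dataclass(frozen=True)
class ValidatorStatus:
    """Observed state; image_tag is None when nothing is deployed yet."""
    image_tag: Optional[str]
    ready_signers: int


def reconcile(spec: ValidatorSpec, status: ValidatorStatus) -> List[str]:
    """Return the actions needed to move the observed state toward the spec."""
    actions = []
    if status.image_tag is None:
        actions.append("create-statefulset")
    elif status.image_tag != spec.image_tag:
        actions.append("roll-image")
    if status.ready_signers < spec.signer_replicas:
        actions.append("restore-signers")
    return actions


spec = ValidatorSpec(chain_id="examplechain-1", image_tag="v1.2.3", signer_replicas=3)
pending = reconcile(spec, ValidatorStatus(image_tag="v1.2.2", ready_signers=2))
```

The appeal over hooks is that this comparison runs continuously, so drift after install gets corrected instead of waiting for the next upgrade.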

The Horcrux architecture is solid, but the operational runbook for signer failover needs to be drilled regularly. Automation is only as good as the recovery procedure it replaces.