Work
Projects
Infrastructure problems I've solved. Problem → approach → outcome.
Cosmos Validator Onboarding Automation
Problem
Manually spinning up each Cosmos validator on Kubernetes took days of error-prone YAML editing and key-management ceremony.
Approach
Built a Helm chart library for validator lifecycle management, integrated Horcrux for threshold signing, and wired everything into GitLab CI for one-command onboarding.
Outcome
Validator onboarding time dropped from days to hours, with no single point of failure in key management.
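The "no single point of failure" property of threshold signing comes down to simple arithmetic: with a t-of-n cosigner set, signing continues as long as t cosigners are reachable. A minimal sketch of that reasoning, assuming a 2-of-3 layout purely for illustration (the actual cosigner count is not stated here, and the function names are hypothetical, not part of Horcrux):

```python
def tolerated_failures(total_cosigners: int, threshold: int) -> int:
    """Cosigners that may be offline while a t-of-n signature can still be produced."""
    if not 1 <= threshold <= total_cosigners:
        raise ValueError("threshold must be between 1 and total_cosigners")
    return total_cosigners - threshold


def can_sign(online_cosigners: int, threshold: int) -> bool:
    """True when enough cosigners are reachable to meet the signing threshold."""
    return online_cosigners >= threshold


# Assumed 2-of-3 layout: one cosigner can fail without halting signing.
assert tolerated_failures(total_cosigners=3, threshold=2) == 1
assert can_sign(online_cosigners=2, threshold=2)
assert not can_sign(online_cosigners=1, threshold=2)
```

Raising the threshold improves resistance to key compromise but lowers the number of tolerated outages, so the two have to be balanced per validator.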
Unified Logging System (ELK + Kafka)
Problem
Logs were siloed per service with no central search or alerting. Incidents required manual log hunting across 20+ servers.
Approach
Designed a Kafka-backed ELK pipeline with structured logging standards enforced at the shipper level. Tuned Elasticsearch sharding and retention policies for sustained high throughput.
Outcome
600 GB/day ingested reliably, ~300 hr MTBF on the logging cluster, sub-second search across all services.
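Enforcing a logging standard at the shipper level means rejecting or quarantining records before they reach Kafka. A minimal sketch of such a check, assuming JSON log lines and a hypothetical required-field set (the real schema is not given here):

```python
import json
from typing import Optional

# Assumed schema for illustration only; the real standard may differ.
REQUIRED_FIELDS = {"timestamp", "service", "level", "message"}


def validate_log_line(raw: str) -> Optional[dict]:
    """Return the parsed record if it is a JSON object with all required fields, else None."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    if not REQUIRED_FIELDS.issubset(record):
        return None
    return record


good = '{"timestamp": "2024-01-01T00:00:00Z", "service": "api", "level": "info", "message": "ok"}'
assert validate_log_line(good) is not None
assert validate_log_line('{"message": "missing fields"}') is None
assert validate_log_line("not json at all") is None
```

Rejecting malformed records this early keeps index mappings stable downstream, which is one reason to validate at the shipper rather than in Elasticsearch.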
Multi-Region Hybrid Video Delivery
Problem
RTMP video streams from a UAE-based platform had high latency for Asian viewers because of a single-region origin.
Approach
Deployed a hybrid origin-edge topology across AWS (us-east, ap-southeast) and Alicloud (cn-shanghai), with intelligent routing and Nginx RTMP relays.
Outcome
5% reduction in end-to-end RTMP latency for Asian viewers; 99.9% stream availability across regions.
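One simple form of region-aware routing is an ordered preference list per viewer region, with fallback to the next healthy edge. A minimal sketch under that assumption; the viewer-region keys, the preference order, and the `pick_edge` helper are hypothetical and do not describe the routing logic actually deployed:

```python
from typing import Optional, Set

# Hypothetical preference lists built from the regions named above.
PREFERRED_EDGES = {
    "southeast-asia": ["aws:ap-southeast", "alicloud:cn-shanghai", "aws:us-east"],
    "china": ["alicloud:cn-shanghai", "aws:ap-southeast", "aws:us-east"],
    "americas": ["aws:us-east", "aws:ap-southeast", "alicloud:cn-shanghai"],
}
DEFAULT_ORDER = ["aws:us-east", "aws:ap-southeast", "alicloud:cn-shanghai"]


def pick_edge(viewer_region: str, healthy_edges: Set[str]) -> Optional[str]:
    """Return the first healthy edge in the viewer's preference order, or None if all are down."""
    for edge in PREFERRED_EDGES.get(viewer_region, DEFAULT_ORDER):
        if edge in healthy_edges:
            return edge
    return None


all_up = {"aws:us-east", "aws:ap-southeast", "alicloud:cn-shanghai"}
assert pick_edge("china", all_up) == "alicloud:cn-shanghai"
# If the preferred edge is unhealthy, the next one in the list is used.
assert pick_edge("china", all_up - {"alicloud:cn-shanghai"}) == "aws:ap-southeast"
```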
Cloud Gaming Microservices Infrastructure
Problem
A monolithic game-server provisioner couldn't scale to concurrent player demand across 14 game titles.
Approach
Re-architected as Terraform-provisioned microservices on AWS with auto-scaling groups, load balancers, and per-game resource quotas.
Outcome
Sustained 450,000 requests/min across all titles with linear horizontal scaling and zero cross-title blast radius.
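For a sense of scale, 450,000 requests/min is 7,500 requests/s. A rough capacity sketch for sizing an auto-scaling group from that figure; the per-instance throughput and headroom values are illustrative assumptions, not measured numbers from this project:

```python
import math


def requests_per_second(requests_per_min: int) -> float:
    """Convert a per-minute request rate to per-second."""
    return requests_per_min / 60


def instances_needed(requests_per_min: int, per_instance_rps: int, headroom_pct: int = 30) -> int:
    """Instances required to serve the load with the given spare-capacity percentage.

    Integer arithmetic is used until the final division to avoid rounding surprises.
    """
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    numerator = requests_per_min * (100 + headroom_pct)
    denominator = 60 * 100 * per_instance_rps
    return math.ceil(numerator / denominator)


assert requests_per_second(450_000) == 7500.0
# Assumed 500 req/s per instance and 30% headroom: 9,750 / 500 = 19.5, rounded up.
assert instances_needed(450_000, per_instance_rps=500, headroom_pct=30) == 20
```

Running the same calculation per title, rather than once globally, is what lets each game get its own quota and scaling limit.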
CI/CD Pipeline Automation
Problem
70% of the engineering time spent on releases went to manual work: build, sign, stage, and deploy steps done by hand across iOS, Android, and backend.
Approach
Built a unified Jenkins + Ansible + Fastlane pipeline with environment promotion gates, Slack notifications, and one-click rollback.
Outcome
80% reduction in manual release work; deployment frequency doubled within 6 months of rollout.
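An environment promotion gate can be thought of as a rule that a build only moves to the next environment when every required check has passed. A minimal sketch of that rule, assuming a hypothetical dev → staging → production order and made-up check names; it is not the Jenkins pipeline itself:

```python
from typing import Dict, Optional

# Assumed promotion order for illustration.
PROMOTION_ORDER = ["dev", "staging", "production"]


def next_environment(current: str) -> Optional[str]:
    """Return the environment after `current`, or None if it is the last one."""
    idx = PROMOTION_ORDER.index(current)  # raises ValueError for unknown environments
    return PROMOTION_ORDER[idx + 1] if idx + 1 < len(PROMOTION_ORDER) else None


def can_promote(current: str, checks: Dict[str, bool]) -> bool:
    """Allow promotion only if a next environment exists and all checks passed."""
    return next_environment(current) is not None and bool(checks) and all(checks.values())


assert next_environment("dev") == "staging"
assert can_promote("staging", {"tests": True, "signing": True})
assert not can_promote("staging", {"tests": True, "signing": False})
assert not can_promote("production", {"tests": True})
```

Treating an empty check set as a failure is a deliberate choice in this sketch: a gate with nothing to verify should not silently pass.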
ETL Containerisation & Cost Optimisation
Problem
Long-running ETL jobs ran on always-on EC2 instances, incurring costs 24/7 regardless of actual workload.
Approach
Containerised ETL workloads with Docker, migrated orchestration to AWS Glue and BigQuery scheduled jobs, and right-sized compute.
Outcome
20% reduction in monthly infrastructure cost while improving job reliability and observability.
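The case for moving off always-on instances rests on how much of the day the jobs actually run. A back-of-the-envelope sketch of that idle-time arithmetic; the 6 hours/day figure is a hypothetical input, and the result is not how the 20% figure above was derived:

```python
def always_on_hours(days: int = 30) -> int:
    """Billable hours for an instance that never stops."""
    return 24 * days


def job_hours(job_hours_per_day: float, days: int = 30) -> float:
    """Billable hours when compute runs only while jobs execute."""
    return job_hours_per_day * days


def idle_fraction(job_hours_per_day: float) -> float:
    """Share of an always-on instance's time spent doing no ETL work."""
    if not 0 <= job_hours_per_day <= 24:
        raise ValueError("job_hours_per_day must be between 0 and 24")
    return 1 - job_hours_per_day / 24


# Hypothetical workload: jobs run 6 hours a day.
assert always_on_hours() == 720
assert job_hours(6) == 180
assert idle_fraction(6) == 0.75
```

Actual savings are smaller than the idle fraction suggests, since managed services bill at different rates and ETL compute is only part of total infrastructure spend.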