Work

Projects

Infrastructure problems I've solved. Problem → approach → outcome.

Cosmos Validator Onboarding Automation

Problem

Spinning up each Cosmos validator on Kubernetes by hand took days of error-prone YAML editing and key-management ceremony.

Approach

Built a Helm chart library for validator lifecycle management, integrated Horcrux for threshold signing, and wired everything into GitLab CI for one-command onboarding.
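The availability argument behind threshold signing can be sketched in a few lines. This is a minimal model, not Horcrux's implementation: it ignores the cryptography entirely and only checks whether a t-of-n cosigner set still has a quorum after losing nodes. The function names and the 2-of-3 layout in the usage note are assumptions for illustration.

```python
from itertools import combinations


def can_sign(online_cosigners: int, threshold: int) -> bool:
    """A block can be signed only if at least `threshold` cosigners respond."""
    return online_cosigners >= threshold


def survives_failures(total: int, threshold: int, failures: int) -> bool:
    """True if every possible set of `failures` lost cosigners still leaves a quorum."""
    if not 0 <= failures <= total:
        raise ValueError("failures must be between 0 and total")
    cosigners = range(total)
    # Enumerate each combination of lost cosigners and check the survivors.
    return all(
        can_sign(total - len(lost), threshold)
        for lost in combinations(cosigners, failures)
    )
```

Under this model, a hypothetical 2-of-3 setup tolerates any single cosigner loss (`survives_failures(3, 2, 1)` is true), while a single-key setup does not (`survives_failures(1, 1, 1)` is false).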

Outcome

Validator onboarding time reduced from days to hours; zero single points of failure in key management.

Kubernetes, Helm, Horcrux, Cosmos SDK, GitLab CI, Terraform

Unified Logging System (ELK + Kafka)

Problem

Logs were siloed per service with no central search or alerting. Incidents required manual log hunting across 20+ servers.

Approach

Designed a Kafka-backed ELK pipeline with structured logging standards enforced at the shipper level. Tuned Elasticsearch sharding and retention policies for sustained high throughput.
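Enforcing a logging standard at the shipper amounts to rejecting (or quarantining) lines that do not match an agreed schema before they reach Kafka. The sketch below shows that check in isolation. The required fields, the allowed levels, and the function name are hypothetical; the actual standard used in this project is not described here.

```python
import json
from datetime import datetime

# Hypothetical schema: the real field list is an assumption.
REQUIRED_FIELDS = {"timestamp", "service", "level", "message"}
ALLOWED_LEVELS = {"DEBUG", "INFO", "WARN", "ERROR"}


def validate_log_line(raw: str) -> tuple[bool, str]:
    """Return (ok, reason) for one raw log line."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(record, dict):
        return False, "not a JSON object"

    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        return False, "missing fields: " + ", ".join(sorted(missing))

    level = record["level"]
    if not isinstance(level, str) or level not in ALLOWED_LEVELS:
        return False, f"unknown level: {level!r}"

    # Timestamps must parse as ISO 8601 so they can be indexed as dates.
    try:
        datetime.fromisoformat(record["timestamp"])
    except (TypeError, ValueError):
        return False, "timestamp is not ISO 8601"

    return True, "ok"
```

A shipper-side hook could call `validate_log_line` on each line and route failures to a separate dead-letter topic instead of dropping them, so malformed producers stay visible.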

Outcome

600 GB/day ingested reliably, ~300 hr MTBF on the logging cluster, sub-second search across all services.

Elasticsearch, Logstash, Kibana, Kafka, Filebeat, Docker

Multi-Region Hybrid Video Delivery

Problem

RTMP video streams from a UAE-based platform had high latency for Asian viewers because all traffic was served from a single-region origin.

Approach

Deployed a hybrid origin-edge topology across AWS (us-east, ap-southeast) and Alicloud (cn-shanghai), with intelligent routing and Nginx RTMP relays.
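One plausible reading of the routing step is "send each viewer to the healthy relay with the lowest measured latency". The sketch below illustrates that policy only. The `Origin` type, the latency figures in the usage note, and the selection rule itself are assumptions; the routing logic actually deployed is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Origin:
    name: str          # region label, e.g. "ap-southeast"
    healthy: bool      # result of the most recent health check
    latency_ms: float  # latency measured from the viewer's region (hypothetical)


def pick_origin(origins: list[Origin]) -> Origin:
    """Choose the healthy origin with the lowest latency for this viewer."""
    candidates = [o for o in origins if o.healthy]
    if not candidates:
        raise RuntimeError("no healthy origin available")
    return min(candidates, key=lambda o: o.latency_ms)
```

For example, given made-up measurements of 210 ms to us-east, 45 ms to ap-southeast, and 60 ms to cn-shanghai, a viewer is routed to ap-southeast, and falls back to cn-shanghai if ap-southeast fails its health check.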

Outcome

5% reduction in end-to-end RTMP latency for Asian viewers; 99.9% stream availability across regions.

AWS, Alicloud, Nginx, Terraform, CloudFront, Route 53

Cloud Gaming Microservices Infrastructure

Problem

A monolithic game-server provisioner couldn't scale to meet concurrent player demand across 14 game titles.

Approach

Re-architected as Terraform-provisioned microservices on AWS with auto-scaling groups, load balancers, and per-game resource quotas.
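Per-game quotas are what keep one title's traffic spike from starving the others. A minimal sketch of that isolation, assuming a simple in-memory ledger of instance counts per title (the class, the title names, and the quota numbers are all hypothetical):

```python
class QuotaExceeded(Exception):
    """Raised when a title asks for more capacity than its quota allows."""


class QuotaLedger:
    def __init__(self, quotas: dict[str, int]):
        # Each title gets its own independent budget of instances.
        self._quotas = dict(quotas)
        self._used = {title: 0 for title in quotas}

    def reserve(self, title: str, instances: int) -> None:
        if title not in self._quotas:
            raise KeyError(f"unknown title: {title}")
        if instances <= 0:
            raise ValueError("instances must be positive")
        if self._used[title] + instances > self._quotas[title]:
            raise QuotaExceeded(f"{title} would exceed its quota")
        self._used[title] += instances

    def release(self, title: str, instances: int) -> None:
        # Never let usage go negative if a release is reported twice.
        self._used[title] = max(0, self._used[title] - instances)

    def remaining(self, title: str) -> int:
        return self._quotas[title] - self._used[title]
```

Because each title is checked only against its own budget, exhausting `title-a` leaves `title-b` with its full allocation, which is the "zero cross-title blast radius" property in miniature.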

Outcome

Sustained 450,000 requests/min across all titles with linear horizontal scaling and zero cross-title blast radius.

AWS, Terraform, Docker, Kubernetes, Nginx, Python

CI/CD Pipeline Automation

Problem

70% of the engineering time spent on releases was manual: build, sign, stage, and deploy steps were done by hand across iOS, Android, and backend.

Approach

Built a unified Jenkins + Ansible + Fastlane pipeline with environment promotion gates, Slack notifications, and one-click rollback.
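A promotion gate boils down to two rules: a build moves only to the next environment in sequence, and only when every required check has passed. The sketch below models that decision outside of Jenkins. The environment names, check names, and function names are assumptions for illustration.

```python
from typing import Optional

# Hypothetical promotion order; the real pipeline's stages are an assumption.
ENVIRONMENTS = ("dev", "staging", "production")


def next_environment(current: str) -> Optional[str]:
    """Return the environment after `current`, or None if it is the last one."""
    index = ENVIRONMENTS.index(current)  # raises ValueError for unknown names
    if index == len(ENVIRONMENTS) - 1:
        return None
    return ENVIRONMENTS[index + 1]


def evaluate_gate(current: str, checks: dict[str, bool]) -> tuple[Optional[str], list[str]]:
    """Return (target, blockers): the promotion target, or the reasons it is blocked."""
    target = next_environment(current)
    if target is None:
        return None, ["already at final environment"]
    failed = sorted(name for name, passed in checks.items() if not passed)
    if failed:
        return None, failed
    return target, []
```

The list of blockers is returned rather than raised so that a notification step (Slack, in this pipeline) can report exactly which checks held a release back.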

Outcome

80% reduction in manual release work; deployment frequency doubled within 6 months of rollout.

Jenkins, Ansible, Fastlane, Docker, AWS, Python

ETL Containerisation & Cost Optimisation

Problem

Long-running ETL jobs ran on always-on EC2 instances, incurring cost 24/7 regardless of actual workload.

Approach

Containerised ETL workloads with Docker, migrated orchestration to AWS Glue and BigQuery scheduled jobs, and right-sized compute.
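The cost argument is simple arithmetic: an always-on instance bills for every hour in the month, while scheduled jobs bill only for the hours they run. A rough sketch of that comparison follows. It assumes the same hourly rate for both models and ignores per-run overhead and managed-service pricing differences, so it illustrates the reasoning rather than reproducing the figures from this project.

```python
def always_on_cost(hourly_rate: float, hours_in_month: float) -> float:
    """Cost of compute that runs 24/7 regardless of workload."""
    return hourly_rate * hours_in_month


def on_demand_cost(hourly_rate: float, busy_hours: float) -> float:
    """Cost of compute that is billed only while jobs are running."""
    return hourly_rate * busy_hours


def savings_fraction(hourly_rate: float, hours_in_month: float, busy_hours: float) -> float:
    """Fraction of the always-on bill saved by paying only for busy hours."""
    baseline = always_on_cost(hourly_rate, hours_in_month)
    if baseline <= 0:
        raise ValueError("baseline cost must be positive")
    return (baseline - on_demand_cost(hourly_rate, busy_hours)) / baseline
```

With made-up inputs, `savings_fraction(0.5, 720, 360)` shows that jobs busy for half of a 720-hour month would halve the compute bill under these simplifying assumptions.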

Outcome

20% reduction in monthly infrastructure cost while improving job reliability and observability.

Docker, AWS Glue, BigQuery, Terraform, Python, Airflow