Feb 17, 2026 | 26 min read

From $847 to $203: How I Cut Our AWS Bill by 76% Without Losing Performance

A real-world FinOps deep dive — every optimization I applied to Kumari.ai's AWS infrastructure, from spot instances and right-sizing to VPC endpoints and S3 lifecycle policies, with exact dollar amounts.

AWS · Cloud · FinOps · Cost Optimization · DevOps · Architecture

I opened the AWS billing dashboard on a Tuesday morning and felt my stomach drop.

$847.23.

For January. For a side project. For Kumari.ai, which at that point had maybe 200 active users. I stared at the number for a long time. Then I did the mental conversion I always do — NPR 113,000. That's more than what my cousin makes in two months as a civil engineer in Kathmandu. I was burning that on cloud infrastructure every 30 days, and I hadn't even looked at the bill properly since launch.

I'm from Nepal. I grew up in a place where you don't waste things. You don't leave lights on. You don't throw away food. And you definitely don't let AWS drain $847 a month because you were too busy shipping features to look at the billing page. That Tuesday was the day I became a FinOps engineer whether I wanted to or not.

Cloud cost before vs after optimization

The Audit: Where Was the Money Going?

Before I changed anything, I needed to understand the breakdown. AWS Cost Explorer with a group-by-service filter for January told the full story:

CODE
┌─────────────────────────────────────────────────────────────────┐
│ AWS Cost Explorer - January 2026                                │
│ Account: kumari-ai-prod                                         │
├──────────────────────────────────┬──────────────┬───────────────┤
│ Service                          │ Cost (USD)   │ % of Total    │
├──────────────────────────────────┼──────────────┼───────────────┤
│ Amazon EC2                       │ $420.18      │ 49.6%         │
│ Amazon RDS                       │ $185.42      │ 21.9%         │
│ NAT Gateway                      │ $95.37       │ 11.2%         │
│ Amazon EBS                       │ $72.14      │ 8.5%          │
│ Data Transfer                    │ $44.82      │ 5.3%          │
│ Amazon CloudWatch                │ $29.30      │ 3.5%          │
├──────────────────────────────────┼──────────────┼───────────────┤
│ TOTAL                            │ $847.23     │ 100.0%        │
└──────────────────────────────────┴──────────────┴───────────────┘

Half the bill was EC2. A fifth was RDS. And $95 for NAT Gateway — a service that does nothing except let my private subnet instances talk to the internet. I knew I could do better. I just hadn't prioritized it.

I took a week off feature development. Nothing but cost optimization. My commit messages that week were things like "chore: stop hemorrhaging money" and "fix: we don't need 4 vCPUs for a cron job."

Here's every single thing I did.

1. EC2: From $420 to $119

This was the biggest line item, so I started here. I was running:

  • 1x `t3.xlarge` (4 vCPU, 16GB) — API server, on-demand, 24/7
  • 2x `t3.large` (2 vCPU, 8GB) — background workers (task queue, browser automation), on-demand, 24/7
  • 1x `t3.medium` (2 vCPU, 4GB) — staging environment, on-demand, 24/7

All on-demand. All running 24/7. Nobody was using staging at 3 AM on a Sunday, but it was running. The workers were idle 60% of the time, but they were running.

The API Server: Reserved Instance

The API server genuinely runs 24/7. It needs to be there when users hit the endpoint. For always-on workloads, Reserved Instances are the obvious move.

I checked the pricing:

Bash
$ aws ec2 describe-reserved-instances-offerings \
    --instance-type t3.xlarge \
    --product-description "Linux/UNIX" \
    --offering-type "Partial Upfront" \
    --filters "Name=duration,Values=31536000" \
    --region us-east-1 \
    --query 'ReservedInstancesOfferings[0].{Type:InstanceType,Fixed:FixedPrice,Recurring:RecurringCharges[0].Amount}' \
    --output table

----------------------------------------------
|    DescribeReservedInstancesOfferings      |
+------------+-----------+------------------+
|   Fixed    | Recurring |       Type       |
+------------+-----------+------------------+
|  432.00    |  0.0494   |  t3.xlarge       |
+------------+-----------+------------------+

On-demand `t3.xlarge` in us-east-1: $0.1664/hr = $121.47/mo

1-year partial upfront RI: $432 upfront + $0.0494/hr = $36.00 + $36.07/mo = $72.07/mo effective

That's a 41% savings on the API server alone. $121.47 down to $72.07/mo.

I bought the RI immediately. One year commitment. No regrets.
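Sanity-checking that math, as a sketch (using AWS's standard 730-hour billing month; AWS's own effective-rate rounding differs by a cent):

```python
# Hedged sketch: reproducing the RI break-even math with the prices quoted above.
HOURS_PER_MONTH = 730  # AWS's standard monthly-hours convention

on_demand_hr = 0.1664  # t3.xlarge on-demand, $/hr
ri_upfront   = 432.00  # 1-year partial upfront, paid once
ri_hourly    = 0.0494  # recurring $/hr on the RI

on_demand_monthly = on_demand_hr * HOURS_PER_MONTH                    # ~$121.47
ri_effective_monthly = ri_upfront / 12 + ri_hourly * HOURS_PER_MONTH  # ~$72.06

savings_pct = (1 - ri_effective_monthly / on_demand_monthly) * 100    # ~41%
```

The upfront payment amortized over 12 months plus the discounted hourly rate is what AWS calls the "effective" monthly price.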

The Workers: Spot Instances

The background workers are stateless. They pull tasks from a Redis queue, process them, and write results to the database. If a worker dies, the task goes back in the queue and another worker picks it up. This is the textbook spot instance use case.

First, I checked spot pricing history:

Bash
$ aws ec2 describe-spot-price-history \
    --instance-types t3.large \
    --product-descriptions "Linux/UNIX" \
    --start-time 2026-01-01T00:00:00Z \
    --end-time 2026-01-31T23:59:59Z \
    --region us-east-1 \
    --query 'SpotPriceHistory[*].{AZ:AvailabilityZone,Price:SpotPrice,Time:Timestamp}' \
    --output table | head -20

------------------------------------------------------
|              DescribeSpotPriceHistory              |
+----------------+-----------+-----------------------+
|       AZ       |   Price   |         Time          |
+----------------+-----------+-----------------------+
|  us-east-1a    |  0.0250   |  2026-01-31T18:42:00Z |
|  us-east-1b    |  0.0253   |  2026-01-31T16:19:00Z |
|  us-east-1c    |  0.0248   |  2026-01-31T14:55:00Z |
|  us-east-1d    |  0.0251   |  2026-01-31T12:33:00Z |
|  us-east-1a    |  0.0249   |  2026-01-31T08:21:00Z |
|  us-east-1b    |  0.0252   |  2026-01-30T22:47:00Z |
|  us-east-1c    |  0.0247   |  2026-01-30T19:15:00Z |
|  us-east-1d    |  0.0250   |  2026-01-30T15:38:00Z |
+----------------+-----------+-----------------------+

On-demand `t3.large`: $0.0832/hr = $60.74/mo

Spot `t3.large`: ~$0.025/hr = $18.25/mo

That's a 70% discount. For two workers: $121.47 down to $36.50/mo.

But spot instances can be interrupted with a 2-minute warning. I needed to handle that gracefully. Here's what I set up:

Python
# spot_interruption_handler.py
# Runs on each worker instance, polls the instance metadata endpoint

import requests
import subprocess
import time
import logging

logger = logging.getLogger(__name__)

METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def check_spot_interruption():
    """Poll for a spot interruption notice every 5 seconds."""
    while True:
        try:
            response = requests.get(METADATA_URL, timeout=2)
            if response.status_code == 200:
                action = response.json()
                logger.warning(f"Spot interruption notice received: {action}")
                handle_graceful_shutdown()
                return
            # A 404 means no interruption is scheduled — this is normal
        except requests.exceptions.RequestException:
            # Transient metadata-endpoint error; try again next cycle
            pass
        time.sleep(5)

def handle_graceful_shutdown():
    """Stop accepting new tasks, finish the current task, drain connections."""
    logger.info("Initiating graceful shutdown...")

    # Tell the Celery worker to stop accepting new tasks
    subprocess.run(["celery", "-A", "kumari.tasks", "control", "cancel_consumer", "default"])

    # Wait up to 90 seconds for the current task to finish
    # (spot gives us 120 seconds; we keep a 30s buffer)
    time.sleep(90)

    # Send SIGTERM to the Celery worker process
    subprocess.run(["pkill", "-TERM", "-f", "celery worker"])

    logger.info("Graceful shutdown complete. Instance will be terminated by AWS.")

I also configured the Auto Scaling Group to use mixed instance types across multiple AZs, so if one instance type gets reclaimed, it can launch a different one:

HCL
# ec2_workers.tf

resource "aws_launch_template" "worker" {
  name_prefix   = "kumari-worker-"
  image_id      = data.aws_ami.ubuntu.id
  instance_type = "t3.large"

  user_data = base64encode(templatefile("${path.module}/scripts/worker-userdata.sh", {
    environment = "production"
    redis_host  = aws_elasticache_cluster.redis.cache_nodes[0].address
  }))

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name        = "kumari-worker"
      Environment = "production"
      Role        = "worker"
    }
  }
}

resource "aws_autoscaling_group" "workers" {
  name                = "kumari-workers"
  desired_capacity    = 2
  min_size            = 1
  max_size            = 4
  vpc_zone_identifier = module.vpc.private_subnet_ids

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 0
      on_demand_percentage_above_base_capacity = 0 # 100% spot
      spot_allocation_strategy                 = "capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.worker.id
        version            = "$Latest"
      }

      override {
        instance_type = "t3.large"
      }
      override {
        instance_type = "t3a.large" # AMD variant, sometimes cheaper
      }
      override {
        instance_type = "m5.large" # Fallback if t3 capacity is low
      }
      override {
        instance_type = "m5a.large"
      }
    }
  }

  tag {
    key                 = "Name"
    value               = "kumari-worker-spot"
    propagate_at_launch = true
  }
}

The `capacity-optimized` allocation strategy tells AWS to launch instances from the pool with the most available capacity, which minimizes interruptions. In practice, I've seen maybe 2-3 interruptions per month, and the graceful shutdown handler means zero dropped tasks.

Staging: Just Turn It Off

The staging environment was running 24/7. Nobody uses staging at night. Nobody uses staging on weekends. I'll cover the scheduling Lambda later, but the result was: staging runs Monday-Friday, 8 AM to 8 PM NPT (roughly 2:15 AM to 2:15 PM UTC, we're UTC+5:45, yes that :45 is real).

That's 60 hours out of 168 in a week. 64% reduction in staging costs.

`t3.medium` on-demand: $0.0416/hr. Was $30.37/mo, now $10.82/mo (only running 260 hours/mo instead of 730).
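The schedule arithmetic, as a quick sketch:

```python
# Sketch of the staging schedule math: 12 hours/day, Monday-Friday.
hours_per_week = 12 * 5                                    # 60 of 168 weekly hours
weekly_reduction_pct = (1 - hours_per_week / 168) * 100    # ~64%

hourly_rate = 0.0416            # t3.medium on-demand, $/hr
monthly_hours = 260             # ~60 hr/week x 52 weeks / 12 months
monthly_cost = hourly_rate * monthly_hours       # ~$10.82
always_on_cost = hourly_rate * 730               # ~$30.37 if left running 24/7
```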

EC2 Total

| Instance | Before | After | Savings |
| --- | --- | --- | --- |
| API server (t3.xlarge, RI) | $121.47 | $72.07 | $49.40 |
| Worker 1 (t3.large, spot) | $60.74 | $18.25 | $42.49 |
| Worker 2 (t3.large, spot) | $60.74 | $18.25 | $42.49 |
| Staging (t3.medium, scheduled) | $30.37 | $10.82 | $19.55 |
| Other (misc) | $146.86 | $0.00 | $146.86 |
| EC2 Total | $420.18 | $119.39 | $300.79 |

The "Other (misc)" was a couple of t3.micro instances I'd forgotten about — an old test instance and a Jenkins box I'd replaced with GitHub Actions months ago. I terminated them. $0 for services I wasn't using.

[!TIP] Run `aws ec2 describe-instances` and actually look at every instance. I guarantee you have at least one instance running that you forgot about. I had two.

2. RDS: From $185 to $68

I was running a `db.r5.xlarge` for the PostgreSQL database. Four vCPUs, 32 GB of RAM. This was my fault — when I first set up the database, I thought "it's a database, it should be powerful" without actually thinking about the workload.

I pulled up CloudWatch metrics for the last 30 days:

CODE
CloudWatch Metrics — RDS db.r5.xlarge — January 2026
─────────────────────────────────────────────────────
CPU Utilization:
  Average: 4.7%
  Peak:    18.2% (during daily backup window)
  p99:     11.3%

Freeable Memory:
  Average: 28.4 GB free (of 32 GB)
  Minimum: 26.1 GB free

Database Connections:
  Average: 12
  Peak:    34

Read IOPS:
  Average: 45
  Peak:    890

Write IOPS:
  Average: 23
  Peak:    312

4.7% average CPU. 28 GB of unused RAM. This database was doing absolutely nothing most of the time. Kumari.ai isn't a high-traffic OLTP system — it's an AI assistant platform where users send a message, wait for a response, and send another message. The database handles user records, conversation history, and API key lookups. Not exactly a demanding workload.

I downgraded to `db.t3.medium` (2 vCPU, 4 GB RAM):

Bash
$ aws rds modify-db-instance \
    --db-instance-identifier kumari-prod-db \
    --db-instance-class db.t3.medium \
    --apply-immediately

{
    "DBInstance": {
        "DBInstanceIdentifier": "kumari-prod-db",
        "DBInstanceClass": "db.t3.medium",
        "DBInstanceStatus": "modifying",
        "PendingModifiedValues": {
            "DBInstanceClass": "db.t3.medium"
        }
    }
}

I scheduled this for 4 AM NPT on a Saturday. The modification took about 12 minutes with ~30 seconds of actual downtime during the switch. I was watching the CloudWatch dashboard the entire time.

After the downgrade, metrics for the first week:

CODE
CloudWatch Metrics — RDS db.t3.medium — First Week Post-Migration
──────────────────────────────────────────────────────────────────
CPU Utilization:
  Average: 14.2%
  Peak:    52.8% (during backup window)
  p99:     31.4%

Freeable Memory:
  Average: 2.1 GB free (of 4 GB)
  Minimum: 1.4 GB free

Database Connections:
  Average: 12 (unchanged)
  Peak:    31 (unchanged)

Query latency (p99):
  Before: 4.2ms
  After:  4.8ms (within noise)

CPU utilization went from 4.7% to 14.2%. Memory went from "absurdly overprovisioned" to "healthy." Query latency didn't change in any meaningful way. The application performance was identical.

`db.r5.xlarge` on-demand: $185.42/mo
`db.t3.medium` on-demand: $49.93/mo

Then I bought a 1-year RI for the db.t3.medium: $67.89/mo effective.

Wait, isn't that more than on-demand? No. The $49.93 on-demand price covers the instance in a single AZ only. I run Multi-AZ, which doubles the instance cost to $99.86/mo on-demand, and the RI covers the Multi-AZ deployment at $67.89/mo effective. For comparison, the Multi-AZ r5.xlarge had been costing $185.42/mo.

$185.42 down to $67.89/mo. Savings: $117.53.
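To make the Multi-AZ comparison concrete, here's the arithmetic as a sketch (the $49.93 and $67.89 figures are the prices quoted above):

```python
# Sketch of the Multi-AZ RI comparison using the figures quoted in the text.
single_az_od = 49.93             # db.t3.medium on-demand, single AZ
multi_az_od  = single_az_od * 2  # Multi-AZ doubles the instance cost
ri_multi_az  = 67.89             # 1-year RI covering Multi-AZ, effective monthly

before  = 185.42                 # Multi-AZ db.r5.xlarge
savings = before - ri_multi_az   # $117.53

ri_discount_pct = (1 - ri_multi_az / multi_az_od) * 100  # ~32% vs Multi-AZ on-demand
```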

[!WARNING] Do NOT skip Multi-AZ for a production database to save money. I considered it for about 30 seconds. Then I remembered the time a single-AZ RDS instance had a hardware failure and I lost 5 minutes of data because the backup was from an hour ago. Multi-AZ gives you synchronous replication and automatic failover. The cost is worth it.

3. NAT Gateway: From $95 to $14

This one made me genuinely angry when I understood it.

NAT Gateway pricing: $0.045 per hour ($32.85/mo just to exist) + $0.045 per GB of data processed. I had one NAT Gateway, and my private subnet instances were sending a LOT of traffic through it — pulling Docker images from ECR, downloading packages, sending logs to CloudWatch, pushing objects to S3.

The thing is, most of this traffic was going to AWS services. Traffic between your VPC and AWS services doesn't need to go through a NAT Gateway if you set up VPC Endpoints.

VPC Endpoints: The Free Alternative

VPC Gateway Endpoints (for S3 and DynamoDB) are completely free. VPC Interface Endpoints (for everything else) cost $0.01/hr per AZ, which is still way cheaper than NAT Gateway.

HCL
# vpc_endpoints.tf

# Gateway endpoints — FREE
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = module.vpc.vpc_id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = module.vpc.private_route_table_ids

  tags = {
    Name = "kumari-s3-endpoint"
  }
}

resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id            = module.vpc.vpc_id
  service_name      = "com.amazonaws.us-east-1.dynamodb"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = module.vpc.private_route_table_ids

  tags = {
    Name = "kumari-dynamodb-endpoint"
  }
}

# Interface endpoints — $0.01/hr per AZ, but saves much more on NAT
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = module.vpc.vpc_id
  service_name        = "com.amazonaws.us-east-1.ecr.api"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = module.vpc.private_subnet_ids
  security_group_ids  = [aws_security_group.vpc_endpoints.id]

  tags = {
    Name = "kumari-ecr-api-endpoint"
  }
}

resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = module.vpc.vpc_id
  service_name        = "com.amazonaws.us-east-1.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = module.vpc.private_subnet_ids
  security_group_ids  = [aws_security_group.vpc_endpoints.id]

  tags = {
    Name = "kumari-ecr-dkr-endpoint"
  }
}

resource "aws_vpc_endpoint" "logs" {
  vpc_id              = module.vpc.vpc_id
  service_name        = "com.amazonaws.us-east-1.logs"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = module.vpc.private_subnet_ids
  security_group_ids  = [aws_security_group.vpc_endpoints.id]

  tags = {
    Name = "kumari-cloudwatch-logs-endpoint"
  }
}

resource "aws_vpc_endpoint" "sqs" {
  vpc_id              = module.vpc.vpc_id
  service_name        = "com.amazonaws.us-east-1.sqs"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = module.vpc.private_subnet_ids
  security_group_ids  = [aws_security_group.vpc_endpoints.id]

  tags = {
    Name = "kumari-sqs-endpoint"
  }
}

resource "aws_security_group" "vpc_endpoints" {
  name_prefix = "kumari-vpc-endpoints-"
  vpc_id      = module.vpc.vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [module.vpc.vpc_cidr_block]
    description = "HTTPS from VPC"
  }

  tags = {
    Name = "kumari-vpc-endpoints-sg"
  }
}

After setting up the endpoints, the only traffic going through NAT Gateway was:

  • External API calls (OpenAI, Anthropic, third-party webhooks)
  • Package updates (apt, pip — which happen infrequently)
  • Anything not covered by a VPC endpoint

I could have eliminated the NAT Gateway entirely by putting a tiny NAT instance (t3.nano, $3.80/mo) in the public subnet, but I kept a single NAT Gateway for reliability. The data processing costs dropped dramatically though, because 80%+ of the traffic was AWS-service-to-AWS-service.

Before: NAT Gateway hourly ($32.85) + data processing (~$62.52 for 1.4TB) = $95.37
After: NAT Gateway hourly ($32.85) + data processing ($4.50 for ~100GB) + Interface endpoints ($14.60 for 2 endpoints x 2 AZs) - Gateway endpoints ($0) = $51.95

Actually, I later realized I only needed interface endpoints in one AZ (my workers and API server are in a single AZ with Multi-AZ only for RDS). That brought the interface endpoint cost down:

Final: $32.85 + $4.50 + $7.30 = $44.65

Hmm, still high because of the NAT Gateway base cost. So I finally replaced it with a NAT instance:

HCL
# nat_instance.tf

resource "aws_instance" "nat" {
  ami                    = data.aws_ami.nat_instance.id # amzn-ami-vpc-nat
  instance_type          = "t3a.nano"
  subnet_id              = module.vpc.public_subnet_ids[0]
  source_dest_check      = false
  vpc_security_group_ids = [aws_security_group.nat.id]

  tags = {
    Name = "kumari-nat-instance"
  }
}

resource "aws_route" "private_nat" {
  count                  = length(module.vpc.private_route_table_ids)
  route_table_id         = module.vpc.private_route_table_ids[count.index]
  destination_cidr_block = "0.0.0.0/0"
  # aws_route's instance_id argument was removed in AWS provider v5;
  # route to the NAT instance's primary ENI instead.
  network_interface_id   = aws_instance.nat.primary_network_interface_id
}

Final NAT cost: t3a.nano ($3.42/mo) + data processing ($4.50) + Interface endpoints ($7.30) = $15.22/mo

Okay, let's be honest — I'm rounding and there's some variance month to month. Call it $14/mo conservatively.

$95.37 down to ~$14. Savings: $81.37.
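Putting the three NAT setups side by side, using the monthly figures above:

```python
# Sketch comparing the three NAT setups with the monthly dollar figures
# quoted in the text (managed gateway, gateway plus endpoints, NAT instance).
nat_gw_hourly = 32.85  # managed NAT Gateway, $0.045/hr x 730 hours

scenarios = {
    "nat_gw_only":           nat_gw_hourly + 62.52,         # ~1.4 TB processed
    "nat_gw_plus_endpoints": nat_gw_hourly + 4.50 + 7.30,   # ~100 GB + 1-AZ endpoints
    "nat_instance":          3.42 + 4.50 + 7.30,            # t3a.nano NAT instance
}
```

The endpoints cut the data-processing charge; only swapping out the managed gateway kills the fixed hourly cost.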

[!NOTE] A NAT instance is a single point of failure. If it dies, your private subnet loses internet access. For Kumari.ai's scale, this is acceptable — if the NAT instance dies, I get an alert, and I can replace it in under a minute with a terraform apply. For a larger production workload, keep the managed NAT Gateway or set up a pair of NAT instances with failover.

4. EBS: From $72 to $48

This was the simplest optimization. I was using `gp2` volumes everywhere because that's what I'd always used. But `gp3` has been available since 2020, and it's:

  • 20% cheaper per GB ($0.08/GB vs $0.10/GB)
  • Higher baseline performance (3,000 IOPS / 125 MB/s vs gp2's burstable model)
  • Independently adjustable IOPS and throughput (no more provisioning oversized volumes just for IOPS)

I had a total of 720 GB of gp2 volumes across all instances.
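As a sketch, here's the arithmetic: gp2's baseline IOPS scale with volume size (3 IOPS/GB, minimum 100), which is why the 200 GB volume reports 600 original IOPS in the CLI output further down, while gp3 starts at a flat 3,000 regardless of size:

```python
# Sketch of the gp2-vs-gp3 math from the text.
def gp2_baseline_iops(size_gb: int) -> int:
    """gp2 baseline IOPS: 3 IOPS per GB, floor of 100."""
    return max(100, 3 * size_gb)

GP2_PER_GB = 0.10
GP3_PER_GB = 0.08

gp2_cost = 720 * GP2_PER_GB  # all 720 GB on gp2
gp3_cost = 450 * GP3_PER_GB  # after deleting 270 GB of zombie volumes

per_gb_discount_pct = (1 - GP3_PER_GB / GP2_PER_GB) * 100  # 20%
```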

Migration was zero-downtime — you can modify a volume type while it's attached and in use:

Bash
# List all gp2 volumes
$ aws ec2 describe-volumes \
    --filters "Name=volume-type,Values=gp2" \
    --query 'Volumes[*].{ID:VolumeId,Size:Size,State:State,Attached:Attachments[0].InstanceId}' \
    --output table

-----------------------------------------------------------------
|                        DescribeVolumes                        |
+---------------+------+------------+---------------------------+
|      ID       | Size |   State    |         Attached          |
+---------------+------+------------+---------------------------+
|  vol-0a1b2c3d | 200  |  in-use    |  i-0abc1234 (API server)  |
|  vol-0e5f6g7h | 100  |  in-use    |  i-0def5678 (Worker 1)    |
|  vol-0i9j0k1l | 100  |  in-use    |  i-0ghi9012 (Worker 2)    |
|  vol-0m2n3o4p | 50   |  in-use    |  i-0jkl3456 (Staging)     |
|  vol-0q5r6s7t | 200  |  in-use    |  i-0mno7890 (Jenkins-OLD) |
|  vol-0u8v9w0x | 70   |  in-use    |  i-0pqr1234 (Test-OLD)    |
+---------------+------+------------+---------------------------+

# The last two were attached to instances I was about to terminate anyway.
# Migrate the remaining volumes:

$ for vol_id in vol-0a1b2c3d vol-0e5f6g7h vol-0i9j0k1l vol-0m2n3o4p; do
    echo "Migrating $vol_id to gp3..."
    aws ec2 modify-volume \
      --volume-id "$vol_id" \
      --volume-type gp3 \
      --iops 3000 \
      --throughput 125
    echo "Done."
done

Migrating vol-0a1b2c3d to gp3...
{
    "VolumeModification": {
        "VolumeId": "vol-0a1b2c3d",
        "ModificationState": "modifying",
        "TargetVolumeType": "gp3",
        "TargetSize": 200,
        "TargetIops": 3000,
        "TargetThroughput": 125,
        "OriginalVolumeType": "gp2",
        "OriginalSize": 200,
        "OriginalIops": 600
    }
}
Done.
Migrating vol-0e5f6g7h to gp3...
Done.
Migrating vol-0i9j0k1l to gp3...
Done.
Migrating vol-0m2n3o4p to gp3...
Done.

The modification happens in the background. No reboot, no detach, no downtime. I watched the volumes in the console and they all transitioned to `gp3` within 15 minutes.

Also deleted the two volumes from the terminated instances (270 GB I was paying for with zero purpose):

Before: 720 GB x $0.10/GB = $72.00/mo
After: 450 GB x $0.08/GB = $36.00/mo + snapshots (~$12) = $48.00/mo

Savings: $24.14.

Not the biggest win, but it took 10 minutes and required zero effort. Free money.

5. S3 Lifecycle Policies: Saving on Storage Nobody Looks At

I had about 380 GB in S3 across a few buckets — application logs shipped from CloudWatch, database backups, user uploads, and terraform state. The logs and old backups were just sitting there in S3 Standard, costing $0.023/GB/mo.

Nobody is looking at 6-month-old application logs in S3 Standard. Nobody is restoring from a backup that's 4 months old when you have daily backups. This data needs to exist (compliance, debugging historical issues) but doesn't need to be instantly accessible.

JSON
{
  "Rules": [
    {
      "ID": "logs-lifecycle",
      "Filter": { "Prefix": "logs/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    },
    {
      "ID": "backups-lifecycle",
      "Filter": { "Prefix": "backups/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    },
    {
      "ID": "abort-incomplete-multipart",
      "Filter": { "Prefix": "" },
      "Status": "Enabled",
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
    }
  ]
}
Bash
$ aws s3api put-bucket-lifecycle-configuration \
    --bucket kumari-ai-prod-data \
    --lifecycle-configuration file://lifecycle-policy.json

The pricing difference is dramatic:

CODE
Storage class pricing (per GB/month):
  S3 Standard:     $0.023
  S3 Standard-IA:  $0.0125  (46% cheaper)
  S3 Glacier:      $0.004   (83% cheaper)

Most of my S3 data was logs older than 30 days. After the lifecycle policies kicked in over the next month, the S3 cost stabilized at about $4/mo instead of the ~$8.74 it had been. Small numbers, but it adds up, and the policy requires zero ongoing maintenance.
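For reference, the class-to-class discounts work out like this (a sketch using the list prices above):

```python
# Sketch of per-class S3 storage pricing and the discounts quoted above.
PRICES = {  # $/GB/month
    "STANDARD":    0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER":     0.004,
}

def monthly_cost(gb: float, storage_class: str) -> float:
    """Storage cost only; retrieval and request charges are extra."""
    return gb * PRICES[storage_class]

ia_discount_pct      = (1 - PRICES["STANDARD_IA"] / PRICES["STANDARD"]) * 100  # ~46%
glacier_discount_pct = (1 - PRICES["GLACIER"] / PRICES["STANDARD"]) * 100      # ~83%
```

Note the caveat in the comment: Standard-IA and Glacier add retrieval fees and minimum storage durations, which is exactly why this only makes sense for data nobody is reading.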

The `abort-incomplete-multipart` rule is one people forget about. Failed multipart uploads leave behind invisible parts that you still pay for. I found 12 GB of orphaned multipart upload parts across my buckets when I checked. Free storage back.

6. CloudWatch: From $29 to $8

I had custom metrics being pushed from the application at 1-second resolution. When I set this up, I thought "more granularity is better." The catch is that the per-metric charge is identical either way ($0.30 per metric per month at both 1-second and 60-second resolution), so high resolution buys you nothing on storage. What it does buy you is 60x more PutMetricData API calls, and the API calls ($0.01 per 1,000) are what actually cost money.

The real issue: I was pushing 47 custom metrics at 1-second intervals. That's 47 x 60 x 24 x 30 = 121,824,000 PutMetricData calls per month. At $0.01 per 1,000 calls, that's $1,218/mo just for the API calls. Wait — that can't be right.

Okay, I went back and checked. The actual cost was $29.30 because most of the "custom metrics" were being batched into chunks of 20 per API call, and I had a bug where some metrics were being pushed from every worker instance independently. After I fixed the batching and reduced to 60-second intervals:

Python
# Before: every worker pushing independently, 1-second intervals
cloudwatch.put_metric_data(
    Namespace='Kumari/Application',
    MetricData=[{
        'MetricName': 'TaskProcessingTime',
        'Value': duration_ms,
        'Unit': 'Milliseconds',
        'StorageResolution': 1  # 1-second resolution
    }]
)

# After: single metrics aggregator, 60-second intervals, batched
METRIC_BUFFER = []

def chunked(items, size):
    """Yield successive size-length chunks from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def buffer_metric(name, value, unit='None'):
    METRIC_BUFFER.append({
        'MetricName': name,
        'Value': value,
        'Unit': unit,
        'StorageResolution': 60
    })

def flush_metrics():
    """Called every 60 seconds by a single background thread."""
    if not METRIC_BUFFER:
        return
    # PutMetricData accepts up to 1,000 metrics per call
    for chunk in chunked(METRIC_BUFFER, 1000):
        cloudwatch.put_metric_data(
            Namespace='Kumari/Application',
            MetricData=chunk
        )
    METRIC_BUFFER.clear()
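To sanity-check the API-call arithmetic from a few paragraphs up:

```python
# Sanity check of the PutMetricData math: 47 metrics pushed one call each,
# every second, over a 30-day month, billed at $0.01 per 1,000 calls.
calls_unbatched = 47 * 86_400 * 30                # 121,824,000 calls/month
cost_unbatched  = calls_unbatched / 1_000 * 0.01  # ~$1,218/month

# After the fix: one batched call per 60-second flush from a single aggregator
# (the real bill fell in between, because some batching already existed).
calls_batched = (86_400 // 60) * 30               # 43,200 calls/month
cost_batched  = calls_batched / 1_000 * 0.01      # ~$0.43/month
```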

I also disabled detailed monitoring on EC2 instances where I didn't need it (staging, workers). Detailed monitoring pushes metrics every 1 minute and is billed as custom metrics (seven metrics per instance, roughly $2.10/instance/month at the first pricing tier). Basic monitoring is every 5 minutes and free.

Bash
$ aws ec2 unmonitor-instances --instance-ids i-0def5678 i-0ghi9012 i-0jkl3456

$29.30 down to ~$8.00/mo. Savings: $21.30.

7. Data Transfer: From $45 to $11

Data transfer out is one of those AWS costs that creep up on you. I was paying $44.82/mo, mostly from:

  • API responses going out to the internet ($0.09/GB for the first 10TB)
  • S3 objects served directly to users
  • Cross-AZ traffic between instances

CloudFront for Static Assets

User-uploaded files and static assets were being served directly from S3. CloudFront is cheaper for data transfer ($0.085/GB from edge vs $0.09/GB from S3) AND makes things faster for users.

HCL
# cloudfront.tf

resource "aws_cloudfront_distribution" "static_assets" {
  origin {
    domain_name = aws_s3_bucket.user_uploads.bucket_regional_domain_name
    origin_id   = "S3-user-uploads"

    s3_origin_config {
      origin_access_identity = aws_cloudfront_origin_access_identity.s3.cloudfront_access_identity_path
    }
  }

  enabled = true

  default_cache_behavior {
    allowed_methods  = ["GET", "HEAD"]
    cached_methods   = ["GET", "HEAD"]
    target_origin_id = "S3-user-uploads"

    forwarded_values {
      query_string = false
      cookies {
        forward = "none"
      }
    }

    viewer_protocol_policy = "redirect-to-https"
    min_ttl                = 0
    default_ttl            = 86400    # 24 hours
    max_ttl                = 31536000 # 1 year
    compress               = true
  }

  price_class = "PriceClass_100" # Only US/Europe edges — cheapest option

  restrictions {
    geo_restriction {
      restriction_type = "none"
    }
  }

  viewer_certificate {
    cloudfront_default_certificate = true
  }
}

S3 Transfer Acceleration: Turned Off

I had S3 Transfer Acceleration enabled on one bucket. It costs $0.04/GB on top of regular transfer costs. I enabled it months ago when I was testing something and forgot to turn it off. It was costing ~$6/mo for absolutely no benefit since all my users are in the US and my bucket is in us-east-1.

Bash
$ aws s3api put-bucket-accelerate-configuration \
    --bucket kumari-ai-prod-uploads \
    --accelerate-configuration Status=Suspended

Cross-AZ Traffic

I consolidated my API server and workers into a single AZ (us-east-1a). Cross-AZ data transfer is $0.01/GB in each direction. When workers were spread across 2 AZs and communicating with Redis in a single AZ, that added up. After consolidation, intra-AZ traffic is free.

(RDS stays Multi-AZ for failover — that cross-AZ replication traffic is handled by RDS and included in the instance cost.)
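As a rough sketch of why consolidation matters: cross-AZ transfer is billed in each direction, so chatty request/response traffic effectively costs $0.02/GB round trip. The 500 GB/month volume below is an assumed number for illustration, not a measured one.

```python
# Illustrative sketch: cross-AZ data transfer is $0.01/GB in EACH direction.
cross_az_per_gb_each_way = 0.01
round_trip_per_gb = cross_az_per_gb_each_way * 2  # $0.02/GB for request + response

# ASSUMED volume for illustration (not a measured figure from the text):
assumed_monthly_gb = 500
monthly_cross_az_cost = assumed_monthly_gb * round_trip_per_gb  # cost eliminated
                                                                # by moving intra-AZ
```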

$44.82 down to ~$11/mo. Savings: $33.82.

8. Scheduling Dev/Staging Environments

This is one of the highest-impact, lowest-effort optimizations you can do. Dev and staging environments don't need to run 24/7. Mine were running 168 hours a week when they were only used ~60.

I built a Lambda function triggered by EventBridge to start and stop tagged instances:

Python
# lambda/schedule_instances.py

import boto3
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

ec2 = boto3.client('ec2')

def lambda_handler(event, context):
    action = event.get('action')  # 'start' or 'stop'

    if action not in ('start', 'stop'):
        raise ValueError(f"Invalid action: {action}")

    # Find instances tagged with AutoSchedule=true
    filters = [
        {'Name': 'tag:AutoSchedule', 'Values': ['true']},
    ]

    if action == 'start':
        filters.append({'Name': 'instance-state-name', 'Values': ['stopped']})
    else:
        filters.append({'Name': 'instance-state-name', 'Values': ['running']})

    response = ec2.describe_instances(Filters=filters)

    instance_ids = []
    for reservation in response['Reservations']:
        for instance in reservation['Instances']:
            instance_ids.append(instance['InstanceId'])

    if not instance_ids:
        logger.info(f"No instances to {action}.")
        return {'statusCode': 200, 'body': f'No instances to {action}'}

    if action == 'start':
        ec2.start_instances(InstanceIds=instance_ids)
        logger.info(f"Started instances: {instance_ids}")
    else:
        ec2.stop_instances(InstanceIds=instance_ids)
        logger.info(f"Stopped instances: {instance_ids}")

    # Also handle RDS instances tagged for scheduling
    rds = boto3.client('rds')
    rds_response = rds.describe_db_instances()

    for db in rds_response['DBInstances']:
        arn = db['DBInstanceArn']
        # boto3's parameter here is ResourceName, which takes the ARN
        tags = rds.list_tags_for_resource(ResourceName=arn)['TagList']
        auto_schedule = any(t['Key'] == 'AutoSchedule' and t['Value'] == 'true' for t in tags)

        if auto_schedule:
            if action == 'start' and db['DBInstanceStatus'] == 'stopped':
                rds.start_db_instance(DBInstanceIdentifier=db['DBInstanceIdentifier'])
                logger.info(f"Started RDS: {db['DBInstanceIdentifier']}")
            elif action == 'stop' and db['DBInstanceStatus'] == 'available':
                rds.stop_db_instance(DBInstanceIdentifier=db['DBInstanceIdentifier'])
                logger.info(f"Stopped RDS: {db['DBInstanceIdentifier']}")

    return {
        'statusCode': 200,
        'body': f'{action}ed {len(instance_ids)} EC2 instances'
    }

And the Terraform for the EventBridge rules:

HCL
# scheduling.tf

resource "aws_lambda_function" "instance_scheduler" {
  filename         = data.archive_file.scheduler_lambda.output_path
  function_name    = "kumari-instance-scheduler"
  role             = aws_iam_role.scheduler_lambda.arn
  handler          = "schedule_instances.lambda_handler"
  runtime          = "python3.12"
  timeout          = 120
  source_code_hash = data.archive_file.scheduler_lambda.output_base64sha256
}

# Start instances: Monday-Friday at 02:15 UTC (8:00 AM NPT)
resource "aws_cloudwatch_event_rule" "start_schedule" {
  name                = "kumari-start-dev-staging"
  description         = "Start dev/staging instances on weekday mornings"
  schedule_expression = "cron(15 2 ? * MON-FRI *)"
}

resource "aws_cloudwatch_event_target" "start_target" {
  rule      = aws_cloudwatch_event_rule.start_schedule.name
  target_id = "start-instances"
  arn       = aws_lambda_function.instance_scheduler.arn

  input = jsonencode({
    action = "start"
  })
}

# Stop instances: Monday-Friday at 14:15 UTC (8:00 PM NPT)
resource "aws_cloudwatch_event_rule" "stop_schedule" {
  name                = "kumari-stop-dev-staging"
  description         = "Stop dev/staging instances on weekday evenings"
  schedule_expression = "cron(15 14 ? * MON-FRI *)"
}

resource "aws_cloudwatch_event_target" "stop_target" {
  rule      = aws_cloudwatch_event_rule.stop_schedule.name
  target_id = "stop-instances"
  arn       = aws_lambda_function.instance_scheduler.arn

  input = jsonencode({
    action = "stop"
  })
}

# Also stop everything Friday evening to cover weekends
resource "aws_cloudwatch_event_rule" "weekend_stop" {
  name                = "kumari-stop-weekend"
  description         = "Ensure dev/staging is off for the weekend"
  schedule_expression = "cron(15 14 ? * FRI *)"
}

resource "aws_cloudwatch_event_target" "weekend_stop_target" {
  rule      = aws_cloudwatch_event_rule.weekend_stop.name
  target_id = "stop-instances-weekend"
  arn       = aws_lambda_function.instance_scheduler.arn

  input = jsonencode({
    action = "stop"
  })
}

resource "aws_lambda_permission" "allow_eventbridge_start" {
  statement_id  = "AllowEventBridgeStart"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.instance_scheduler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.start_schedule.arn
}

resource "aws_lambda_permission" "allow_eventbridge_stop" {
  statement_id  = "AllowEventBridgeStop"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.instance_scheduler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.stop_schedule.arn
}

resource "aws_lambda_permission" "allow_eventbridge_weekend" {
  statement_id  = "AllowEventBridgeWeekend"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.instance_scheduler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.weekend_stop.arn
}

The Lambda itself costs essentially nothing — it runs for less than 2 seconds twice a day. The savings come from the instances not running.
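The savings arithmetic behind the schedule is easy to verify: running 12 hours a day on weekdays is 60 of the 168 hours in a week, so roughly 64% of the instance-hours disappear — which lines up with the staging numbers in the results below.

```python
# Instance-hours with the scheduler vs. always-on, per week
scheduled_hours = 12 * 5    # 8:00 AM - 8:00 PM NPT, weekdays only
always_on_hours = 24 * 7    # what I was paying for before

fraction_saved = 1 - scheduled_hours / always_on_hours
print(f"{fraction_saved:.1%} of instance-hours eliminated")  # 64.3%
```

On-demand EC2 bills by the hour (or second), so the dollar savings track instance-hours almost exactly.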

The Results

Here's the full before-and-after:

CODE
┌───────────────────────────────────────────────────────────────────────┐
│              KUMARI.AI AWS COST OPTIMIZATION — RESULTS                │
├──────────────────────────┬──────────────┬──────────────┬─────────────┤
│ Service                  │ Before (Jan) │ After (Mar)  │ Savings     │
├──────────────────────────┼──────────────┼──────────────┼─────────────┤
│ EC2 — API Server (RI)    │      $121.47 │       $72.07 │      $49.40 │
│ EC2 — Workers (Spot)     │      $121.47 │       $36.50 │      $84.97 │
│ EC2 — Staging (Sched.)   │       $30.37 │       $10.82 │      $19.55 │
│ EC2 — Zombie instances   │      $146.87 │        $0.00 │     $146.87 │
│ RDS (Right-sized + RI)   │      $185.42 │       $67.89 │     $117.53 │
│ NAT Gateway → Endpoints  │       $95.37 │       $14.00 │      $81.37 │
│ EBS (gp2 → gp3 + clean)  │       $72.14 │       $48.00 │      $24.14 │
│ Data Transfer (CF + opt) │       $44.82 │       $11.00 │      $33.82 │
│ CloudWatch (60s + batch) │       $29.30 │        $8.00 │      $21.30 │
│ S3 (lifecycle policies)  │     included │       -$4.74 │       $4.74 │
├──────────────────────────┼──────────────┼──────────────┼─────────────┤
│ TOTAL                    │      $847.23 │      $263.54 │     $583.69 │
├──────────────────────────┼──────────────┼──────────────┼─────────────┤
│ Further month (Apr)      │              │      $203.17 │     $644.06 │
│ (lifecycle fully active) │              │              │   (76% cut) │
└──────────────────────────┴──────────────┴──────────────┴─────────────┘

By April, with the S3 lifecycle policies fully transitioned and spot pricing averaging lower than March, the bill settled at $203.17. A 76% reduction.

That's the difference between NPR 113,000 and NPR 27,000 per month. That's the difference between "should I shut this project down" and "this is sustainable."

The Ongoing Process

Cost optimization isn't a one-time thing. I now have a monthly routine:

AWS Budgets

Bash
$ aws budgets create-budget \
    --account-id 123456789012 \
    --budget '{
        "BudgetName": "kumari-monthly",
        "BudgetLimit": {"Amount": "250", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST"
    }' \
    --notifications-with-subscribers '[{
        "Notification": {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": 80,
            "ThresholdType": "PERCENTAGE"
        },
        "Subscribers": [{
            "SubscriptionType": "EMAIL",
            "Address": "resham@kumari.ai"
        }]
    },{
        "Notification": {
            "NotificationType": "FORECASTED",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": 100,
            "ThresholdType": "PERCENTAGE"
        },
        "Subscribers": [{
            "SubscriptionType": "EMAIL",
            "Address": "resham@kumari.ai"
        }]
    }]'

I get an alert if actual spend exceeds 80% of budget ($200) or if the forecast exceeds 100% ($250). The forecasted alert is the important one — it catches cost spikes before they become full-month surprises.

Cost Anomaly Detection

AWS Cost Anomaly Detection is free and catches weird stuff. It emailed me when a misconfigured Lambda started logging 50x more than usual and would have added $40 to the CloudWatch bill if I hadn't caught it on day 3.
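You can also pull anomalies programmatically through the Cost Explorer API's GetAnomalies call and filter on dollar impact. A sketch — the `significant` helper and the $10 threshold are my own choices, the response shape is abbreviated to the fields I use, and the live call is commented out because it needs credentials:

```python
def significant(anomalies, min_impact=10.0):
    """Keep anomalies whose total dollar impact exceeds min_impact."""
    return [a for a in anomalies
            if a['Impact']['TotalImpact'] > min_impact]

# Live usage (needs AWS credentials):
#   import boto3
#   ce = boto3.client('ce')
#   found = ce.get_anomalies(DateInterval={'StartDate': '2026-02-01'})['Anomalies']

# Sample data mimicking the GetAnomalies response shape
sample = [{'AnomalyId': 'a-1', 'Impact': {'TotalImpact': 38.2}},
          {'AnomalyId': 'a-2', 'Impact': {'TotalImpact': 1.1}}]

print([a['AnomalyId'] for a in significant(sample)])  # ['a-1']
```

The email alerts were enough for me, but a filter like this is handy if you want anomalies in a Slack bot or a weekly digest.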

Monthly Review Checklist

On the first of every month, I spend 30 minutes:

  1. Cost Explorer: Group by service, compare to last month. Anything more than 10% higher gets investigated.
  2. EC2 utilization: Check CloudWatch CPU/memory for all running instances. Anything consistently under 20% CPU is a right-sizing candidate.
  3. Unused resources: Unattached EBS volumes, idle Elastic IPs ($3.60/mo each if unattached), forgotten snapshots.
  4. Reserved Instance coverage: Are my RIs still matching what's running? Did I change instance types without updating the RI?
  5. Spot interruption rate: If interruptions are climbing, I might need to diversify instance types or switch AZs.
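Item 3 on that checklist is the easiest to script. A sketch built around the shapes boto3's describe_volumes and describe_addresses return — the helper names are mine, and the live calls are commented out because they need credentials:

```python
def unattached_volumes(volumes):
    """EBS volumes in 'available' state are attached to nothing but still billed."""
    return [v['VolumeId'] for v in volumes if v['State'] == 'available']

def idle_eips(addresses):
    """Elastic IPs with no AssociationId sit idle at roughly $3.60/mo each."""
    return [a['PublicIp'] for a in addresses if 'AssociationId' not in a]

# Live usage (needs AWS credentials):
#   import boto3
#   ec2 = boto3.client('ec2')
#   vols  = ec2.describe_volumes()['Volumes']
#   addrs = ec2.describe_addresses()['Addresses']

# Sample data mimicking the describe_* response shapes
vols = [{'VolumeId': 'vol-0abc', 'State': 'available'},
        {'VolumeId': 'vol-0def', 'State': 'in-use'}]
addrs = [{'PublicIp': '203.0.113.7'},  # no AssociationId: idle
         {'PublicIp': '203.0.113.8', 'AssociationId': 'eipassoc-1'}]

print(unattached_volumes(vols))  # ['vol-0abc']
print(idle_eips(addrs))          # ['203.0.113.7']
```

Wire something like this into a scheduled Lambda and the monthly review becomes reading one email instead of clicking through consoles.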

What I'd Do Differently

If I were starting Kumari.ai's infrastructure from scratch today:

  1. Start with Graviton (ARM) instances.

    t4g instances are 20% cheaper than t3 for the same specs, and Python/Node workloads run identically on ARM. I haven't migrated yet because it requires rebuilding Docker images, but it's on the list.

  2. Aurora Serverless v2 instead of RDS. For a bursty workload like Kumari.ai, Aurora Serverless would scale down during quiet periods and only charge for the compute actually in use. The floor is 0.5 ACU (about $0.06/hr in us-east-1 at $0.12 per ACU-hour), which is already comparable to my current db.t3.medium, and anything above the floor is only billed while traffic demands it.

  3. ECS Fargate Spot instead of EC2 Spot for workers. Managing EC2 instances (AMIs, security patches, user data scripts) is overhead. Fargate Spot gives you containers with spot pricing and handles the underlying infrastructure.

  4. Budget alerts from day one. Not after the $847 bill.
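The Aurora suggestion in item 2 is worth a quick sanity check. A sketch of the monthly floor comparison — both prices are approximate us-east-1 figures that I have not re-verified, so treat them as assumptions and check the pricing pages before deciding:

```python
# Assumed prices (us-east-1, approximate, subject to change)
ACU_PRICE = 0.12       # $/ACU-hour, Aurora Serverless v2
T3_MEDIUM = 0.068      # $/hour, RDS db.t3.medium (single-AZ)
HOURS = 730            # billing hours in an average month

aurora_floor = 0.5 * ACU_PRICE * HOURS   # pinned at the 0.5 ACU minimum
rds_fixed    = T3_MEDIUM * HOURS         # provisioned instance, always on

print(f"Aurora floor: ${aurora_floor:.2f}/mo vs db.t3.medium: ${rds_fixed:.2f}/mo")
```

At the floor the two are within a few dollars of each other; the real difference is that the provisioned instance is sized for peak load around the clock, while Serverless v2 only pays for capacity above the floor during the hours that actually need it.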

The Bigger Picture

Running Kumari.ai taught me that cloud cost optimization is not optional for indie projects. The big companies can afford to overprovision because their margin absorbs it. When you're bootstrapping a product from Kathmandu, every dollar matters. $644 per month in savings is $7,728 per year. That's real money. That's a year of a junior developer's part-time salary in Nepal.

The irony is that none of these optimizations took more than a week total. I spent about 40 hours across that first week, and maybe 30 minutes per month maintaining it. The ROI is absurd. The only reason I didn't do it sooner is that I was focused on features and ignored the bill.

Don't ignore the bill.

If you're running a side project on AWS and you haven't looked at Cost Explorer in the last month, go look at it right now. I bet you'll find at least $100/mo in waste. Probably more. And it'll take you a day to fix, not a week.

The cloud is somebody else's computer, and they charge by the hour. Act accordingly.