Mar 19, 2026 | 26 min read

Terraform at Scale: Modules, State Surgery, and the Import That Saved My Weekend

Advanced Terraform patterns I learned managing 1,200+ resources across my homelab and AWS — remote state with locking, custom modules, workspace isolation, state surgery, and the mistakes that taught me the hard way.

Terraform, IaC, DevOps, AWS, Proxmox, Homelab, Cloud Architecture

I stared at the terminal for a solid thirty seconds before I pressed Enter.

Bash
terraform state mv 'module.old_infra.proxmox_vm_qemu.vm["grafana"]' 'module.monitoring.proxmox_vm_qemu.vm["grafana"]'

One wrong character and I'd orphan a running production VM from Terraform's knowledge. No undo button. No confirmation dialog. Just me, my coffee, and the quiet hum of the rack in the closet behind me.

This is the story of how I took a single 800-line main.tf that took four minutes to plan and turned it into a modular, workspace-isolated, CI/CD-driven system managing 1,237 resources across my homelab Proxmox cluster and the AWS infrastructure behind Kumari.ai. The lessons here cost me weekends. Hopefully they save yours.

Terraform remote state and workspace architecture

The Tipping Point

Every Terraform project starts the same way. A single main.tf. Maybe a variables.tf if you're feeling organized. It works beautifully for the first twenty resources. Then forty. Then eighty.

My inflection point came on a Tuesday night. I was adding a new LXC container for a Loki instance and ran terraform plan:

Bash
resham@devbox:~/homelab-iac/terraform$ time terraform plan
...
Plan: 1 to add, 0 to change, 0 to destroy.

real    4m12.387s
user    0m8.241s
sys     0m1.093s

Four minutes. To add a single container. Terraform was refreshing every resource in state — every VM, every DNS record, every firewall rule, every AWS ECS service — just to tell me it needed to create one LXC.

I opened main.tf and scrolled. And scrolled. Eight hundred and fourteen lines. Proxmox VMs mixed with AWS VPC definitions mixed with Cloudflare DNS records mixed with Let's Encrypt certificates. Variables scattered across three files with no naming convention. Outputs that referenced resources by index because I was too lazy to use for_each early on.

It was a mess. And I was the one who made it.

The real wake-up call came a week later when my laptop's SSD died. I had the Terraform code in git, sure. But the state file? Local. terraform.tfstate sitting in the project directory, listed in .gitignore because I'd read somewhere that you shouldn't commit state. Which is correct. But I also hadn't set up remote state, which meant my state file lived on exactly one drive. The drive that was now making clicking noises.

I spent that weekend recreating state by hand. Importing resources one at a time. Forty-seven terraform import commands. I swore I'd never be in that position again.

Remote State: The Foundation

The first thing I fixed was state storage. If your Terraform state is local, stop reading this and go fix that. Right now. I'll wait.

Here's my backend configuration:

HCL
# backend.tf
terraform {
  backend "s3" {
    bucket         = "kumari-terraform-state"
    key            = "infrastructure/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"

    # I use a dedicated IAM user for state access
    # with minimal permissions — just S3 and DynamoDB
    profile = "terraform-state"
  }
}

The S3 bucket has versioning enabled. Every time Terraform writes state, S3 keeps the previous version. This has saved me twice — once when a bad apply corrupted state, and once when I accidentally removed a resource block without running state rm first.
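Moving an existing project's local state into this backend is a one-time step, and Terraform handles the copy itself when you re-initialize. A sketch of the migration, assuming the backend block above is already in place:

```shell
# After adding the backend "s3" block to backend.tf:
terraform init -migrate-state
# Terraform notices the backend change and asks whether to copy the
# existing local state into S3. Answer yes, then sanity-check:
#   terraform state list
# Once you trust the S3 copy, delete the local terraform.tfstate* files.
```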

HCL
# state-backend/main.tf
# I manage the state backend itself with a SEPARATE Terraform config
# that uses local state. Yes, it's turtles all the way down.

provider "aws" {
  region  = "us-east-1"
  profile = "terraform-admin"
}

resource "aws_s3_bucket" "terraform_state" {
  bucket = "kumari-terraform-state"

  tags = {
    Name        = "Terraform State"
    ManagedBy   = "terraform"
    Environment = "global"
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  tags = {
    Name      = "Terraform State Lock"
    ManagedBy = "terraform"
  }
}

The DynamoDB table is critical. Without it, if two people (or two CI jobs, or you in two terminal tabs — don't ask how I know) run terraform apply at the same time, they'll both read the same state, make different changes, and one will overwrite the other. DynamoDB provides a distributed lock. When Terraform acquires the lock, it writes a lock entry. If another process tries to acquire it, Terraform tells you:

CODE
Error: Error acquiring the state lock

Error message: ConditionalCheckFailedException: The conditional request failed
Lock Info:
  ID:        a1b2c3d4-e5f6-7890-abcd-ef1234567890
  Path:      kumari-terraform-state/infrastructure/terraform.tfstate
  Operation: OperationTypeApply
  Who:       resham@devbox
  Version:   1.9.8
  Created:   2026-03-18 02:14:33.048293 +0000 UTC

The Who field has saved me from myself more than once. "Oh, that's me in the other terminal. Right."

Workspaces: One Codebase, Four Environments

My infrastructure spans four distinct environments:

  • homelab-prod — The Proxmox cluster running actual services. Grafana, Prometheus, DNS, media, the NAS.
  • homelab-staging — A smaller set of VMs where I test changes before rolling them to prod. Yes, I have staging for my homelab. I've been burned enough times.
  • cloud-aws — Production AWS infrastructure for Kumari.ai. ECS, RDS, ElastiCache, CloudFront, the works.
  • cloud-aws-dr — Disaster recovery in us-west-2. Minimal footprint, ready to scale up.

Terraform workspaces give each environment its own state file within the same backend:

Bash
# Create workspaces
terraform workspace new homelab-prod
terraform workspace new homelab-staging
terraform workspace new cloud-aws
terraform workspace new cloud-aws-dr

# Switch between them
terraform workspace select homelab-prod

# List all workspaces
resham@devbox:~/homelab-iac/terraform$ terraform workspace list
  default
  homelab-prod
* homelab-staging
  cloud-aws
  cloud-aws-dr

Each workspace uses a different .tfvars file:

HCL
# envs/homelab-prod.tfvars
environment     = "homelab-prod"
proxmox_api_url = "https://pve1.internal.resham.dev:8006/api2/json"
proxmox_node    = "pve1"
vm_defaults = {
  cores    = 2
  memory   = 2048
  disk     = "local-zfs"
  bridge   = "vmbr0"
  os_type  = "cloud-init"
  template = "ubuntu-2404-cloud"
}
monitoring_enabled = true
backup_schedule    = "0 2 * * *"

# envs/homelab-staging.tfvars
environment     = "homelab-staging"
proxmox_api_url = "https://pve3.internal.resham.dev:8006/api2/json"
proxmox_node    = "pve3"
vm_defaults = {
  cores    = 1
  memory   = 1024
  disk     = "local-zfs"
  bridge   = "vmbr1" # Isolated staging VLAN
  os_type  = "cloud-init"
  template = "ubuntu-2404-cloud"
}
monitoring_enabled = false
backup_schedule    = "" # No backups for staging

# envs/cloud-aws.tfvars
environment        = "cloud-aws"
aws_region         = "us-east-1"
vpc_cidr           = "10.100.0.0/16"
ecs_cluster_name   = "kumari-prod"
rds_instance_class = "db.r6g.large"
redis_node_type    = "cache.r6g.large"
enable_waf         = true
min_ecs_tasks      = 2
max_ecs_tasks      = 20

The plan/apply commands always specify the vars file explicitly:

Bash
terraform plan -var-file="envs/$(terraform workspace show).tfvars"
terraform apply -var-file="envs/$(terraform workspace show).tfvars"

I have a shell alias for this because I got tired of typing it:

Bash
# ~/.zshrc
tplan() { terraform plan -var-file="envs/$(terraform workspace show).tfvars" "$@"; }
tapply() { terraform apply -var-file="envs/$(terraform workspace show).tfvars" "$@"; }

One lesson learned the painful way: always check which workspace you're in before running apply. I once applied homelab-staging config to homelab-prod because I forgot to switch. Downscaled every VM to 1 core and 1GB RAM. My monitoring stack collapsed, which meant I didn't even get alerts about it. Found out when Nextcloud became unusable.

Now my shell prompt shows the active workspace:

TOML
# Part of my starship.toml
[custom.terraform]
command = "terraform workspace show 2>/dev/null"
when = "test -f main.tf"
format = "[tf:$output]($style) "
style = "bold purple"
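The prompt helps, but a guard in the alias layer is stricter. A sketch of a small function (tguard is my own hypothetical name, not a Terraform command) that refuses to proceed unless the active workspace is the one you named:

```shell
# Hypothetical guard: refuse to proceed unless the active workspace
# matches the environment you intended to touch.
tguard() {
  local intended=$1 active
  active=$(terraform workspace show) || return 1
  if [[ "$active" != "$intended" ]]; then
    echo "refusing: active workspace is '${active}', not '${intended}'" >&2
    return 1
  fi
  echo "workspace check passed: ${active}"
}

# Usage: tguard homelab-prod && tapply
```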

Module Architecture

Here's the directory structure after the refactor:

Bash
resham@devbox:~/homelab-iac/terraform$ tree -L 3
.
├── backend.tf
├── main.tf            # Root module — just module calls
├── variables.tf
├── outputs.tf
├── versions.tf
├── envs/
│   ├── homelab-prod.tfvars
│   ├── homelab-staging.tfvars
│   ├── cloud-aws.tfvars
│   └── cloud-aws-dr.tfvars
├── modules/
│   ├── proxmox-vm/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── versions.tf
│   ├── proxmox-lxc/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── versions.tf
│   ├── aws-vpc/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── versions.tf
│   ├── aws-ecs/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── versions.tf
│   ├── monitoring/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── dns/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf

Here's the complete proxmox-vm module. This is the one I use the most:

HCL
# modules/proxmox-vm/versions.tf
terraform {
  required_version = ">= 1.9.0"
  required_providers {
    proxmox = {
      source  = "Telmate/proxmox"
      version = "~> 3.0"
    }
  }
}

# modules/proxmox-vm/variables.tf
variable "vms" {
  description = "Map of VMs to create"
  type = map(object({
    cores        = optional(number, 2)
    memory       = optional(number, 2048)
    disk_size    = optional(string, "20G")
    disk_storage = optional(string, "local-zfs")
    bridge       = optional(string, "vmbr0")
    vlan_tag     = optional(number, -1)
    ip_address   = optional(string)
    gateway      = optional(string, "10.0.0.1")
    dns_servers  = optional(string, "10.0.0.5")
    template     = optional(string, "ubuntu-2404-cloud")
    onboot       = optional(bool, true)
    tags         = optional(list(string), [])
    description  = optional(string, "")
  }))
}

variable "target_node" {
  description = "Proxmox node to deploy VMs on"
  type        = string
}

variable "ssh_keys" {
  description = "SSH public keys for cloud-init"
  type        = string
  sensitive   = true
}

variable "default_user" {
  description = "Default user created by cloud-init"
  type        = string
  default     = "resham"
}

# modules/proxmox-vm/main.tf
resource "proxmox_vm_qemu" "vm" {
  for_each = var.vms

  name        = each.key
  target_node = var.target_node
  clone       = each.value.template
  full_clone  = true
  agent       = 1
  onboot      = each.value.onboot
  cores       = each.value.cores
  memory      = each.value.memory
  scsihw      = "virtio-scsi-single"
  tags        = join(";", each.value.tags)
  desc        = each.value.description

  disks {
    scsi {
      scsi0 {
        disk {
          size    = each.value.disk_size
          storage = each.value.disk_storage
        }
      }
    }
  }

  network {
    model  = "virtio"
    bridge = each.value.bridge
    tag    = each.value.vlan_tag
  }

  os_type    = "cloud-init"
  ciuser     = var.default_user
  sshkeys    = var.ssh_keys
  ipconfig0  = each.value.ip_address != null ? "ip=${each.value.ip_address}/24,gw=${each.value.gateway}" : "ip=dhcp"
  nameserver = each.value.dns_servers

  lifecycle {
    ignore_changes = [
      network, # Proxmox sometimes reorders network devices
      desc,    # Don't fight manual description edits
    ]
  }
}

# modules/proxmox-vm/outputs.tf
output "vm_ids" {
  description = "Map of VM names to their Proxmox VMIDs"
  value       = { for name, vm in proxmox_vm_qemu.vm : name => vm.vmid }
}

output "vm_ips" {
  description = "Map of VM names to their IP addresses"
  value       = { for name, vm in proxmox_vm_qemu.vm : name => vm.default_ipv4_address }
}

output "vm_names" {
  description = "List of all VM names created by this module"
  value       = keys(proxmox_vm_qemu.vm)
}

And the root module calls it like this:

HCL
# main.tf (root module)
module "monitoring" {
  source      = "./modules/proxmox-vm"
  target_node = var.proxmox_node
  ssh_keys    = var.ssh_public_key

  vms = {
    "grafana" = {
      cores      = 2
      memory     = 4096
      disk_size  = "50G"
      ip_address = "10.0.0.20"
      tags       = ["monitoring", "prod"]
    }
    "prometheus" = {
      cores      = 4
      memory     = 8192
      disk_size  = "200G"
      ip_address = "10.0.0.21"
      tags       = ["monitoring", "prod"]
    }
    "loki" = {
      cores      = 2
      memory     = 4096
      disk_size  = "100G"
      ip_address = "10.0.0.22"
      tags       = ["monitoring", "prod"]
    }
  }
}

module "services" {
  source      = "./modules/proxmox-vm"
  target_node = var.proxmox_node
  ssh_keys    = var.ssh_public_key

  vms = {
    "nginx-proxy" = {
      cores      = 2
      memory     = 2048
      disk_size  = "20G"
      ip_address = "10.0.0.10"
      tags       = ["network", "prod"]
    }
    "docker-host-01" = {
      cores      = 4
      memory     = 16384
      disk_size  = "100G"
      ip_address = "10.0.0.30"
      tags       = ["docker", "prod"]
    }
    "docker-host-02" = {
      cores      = 4
      memory     = 16384
      disk_size  = "100G"
      ip_address = "10.0.0.31"
      tags       = ["docker", "prod"]
    }
  }
}

The DRY principle in action. Before modules, each VM was a separate 25-line resource block. Fifteen VMs meant 375 lines just for VM definitions, with every parameter copy-pasted and slightly different. Now the module handles the boilerplate, and the root config is just a clean map of what I actually care about: name, resources, IP.

State Surgery

This is the section that gives Terraform practitioners cold sweats.

When I refactored from a flat structure to modules, every resource path changed. What was proxmox_vm_qemu.grafana became module.monitoring.proxmox_vm_qemu.vm["grafana"]. If I'd just reorganized the code and run terraform plan, Terraform would've shown:

CODE
Plan: 15 to add, 0 to change, 15 to destroy.

Fifteen VMs destroyed and recreated. Production services down. Data lost. Not acceptable.

Instead, I used terraform state mv to update resource addresses in state without touching the actual infrastructure:

Bash
# Move monitoring VMs into the monitoring module
terraform state mv \
  'proxmox_vm_qemu.grafana' \
  'module.monitoring.proxmox_vm_qemu.vm["grafana"]'

terraform state mv \
  'proxmox_vm_qemu.prometheus' \
  'module.monitoring.proxmox_vm_qemu.vm["prometheus"]'

terraform state mv \
  'proxmox_vm_qemu.loki' \
  'module.monitoring.proxmox_vm_qemu.vm["loki"]'

# Move service VMs into the services module
terraform state mv \
  'proxmox_vm_qemu.nginx_proxy' \
  'module.services.proxmox_vm_qemu.vm["nginx-proxy"]'

terraform state mv \
  'proxmox_vm_qemu.docker_host_01' \
  'module.services.proxmox_vm_qemu.vm["docker-host-01"]'

terraform state mv \
  'proxmox_vm_qemu.docker_host_02' \
  'module.services.proxmox_vm_qemu.vm["docker-host-02"]'

Each command outputs something like:

CODE
Move "proxmox_vm_qemu.grafana" to "module.monitoring.proxmox_vm_qemu.vm[\"grafana\"]"
Successfully moved 1 object(s).

After all the moves, the moment of truth:

Bash
resham@devbox:~/homelab-iac/terraform$ terraform plan -var-file="envs/homelab-prod.tfvars"

No changes. Your infrastructure matches the configuration.

I have never felt more relief from a terminal output in my life.

The rules I follow for state surgery:

  1. Always back up state first: terraform state pull > state-backup-$(date +%Y%m%d-%H%M%S).json
  2. Do it during a maintenance window. Lock deploys. Tell anyone who touches the infra.
  3. Move one resource at a time. Run terraform plan after each move to catch mistakes early.
  4. Never combine state moves with code changes. Move state first, verify with plan, then merge the code.
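Rules 1 and 3 are mechanical enough to script. A sketch of a wrapper (safe_state_mv is my own hypothetical name; a real run would also pass the right -var-file to plan):

```shell
# Hypothetical wrapper: back up state, move exactly one address, then plan.
safe_state_mv() {
  local src=$1 dst=$2
  local backup="state-backup-$(date +%Y%m%d-%H%M%S).json"

  terraform state pull > "$backup" || return 1
  echo "state backed up to ${backup}"

  terraform state mv "$src" "$dst" || return 1

  # -detailed-exitcode: 0 = clean, 1 = error, 2 = changes pending
  terraform plan -detailed-exitcode >/dev/null 2>&1
  local rc=$?
  if [[ $rc -ne 0 ]]; then
    echo "plan not clean after move (exit ${rc}), investigate before the next move" >&2
    return 1
  fi
  echo "moved ${src} -> ${dst}, plan clean"
}
```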

For resources that should no longer be managed by Terraform (maybe you're moving them to a different tool, or they're being decommissioned manually):

Bash
# Remove from state without destroying the resource
terraform state rm 'module.legacy.proxmox_vm_qemu.vm["old-jenkins"]'
# Removed module.legacy.proxmox_vm_qemu.vm["old-jenkins"]
# The VM keeps running. Terraform just forgets about it.

I used state rm when I decommissioned my old Jenkins server. Didn't want Terraform to destroy it — I needed to pull build history off it first. Removed it from state, did my data migration over the next week, then manually deleted the VM through the Proxmox UI.

The Import That Saved My Weekend

This is the story from the title.

It was a Saturday morning. I'd been running my Proxmox cluster for about eight months, and over that time I'd created fifteen VMs manually through the Proxmox web UI. Dev boxes, test environments, a Minecraft server for friends, a Kali box for security practice. None of them were in Terraform.

Every time I ran terraform plan, these VMs were invisible. Terraform didn't know they existed. Which meant if I ever needed to recreate my environment, those fifteen VMs and all their configuration details lived exclusively in my head and in the Proxmox database.

So I decided to import them. All of them. On a Saturday.

The process for each VM:

Step 1: Write the resource block that matches the existing VM

HCL
# I had to look up each VM's config in Proxmox first:
# pvesh get /nodes/pve1/qemu/110/config

module "dev_vms" {
  source      = "./modules/proxmox-vm"
  target_node = "pve1"
  ssh_keys    = var.ssh_public_key

  vms = {
    "devbox-golang" = {
      cores      = 4
      memory     = 8192
      disk_size  = "80G"
      ip_address = "10.0.0.50"
      tags       = ["dev", "golang"]
    }
    # ... 14 more VMs
  }
}

Step 2: Import each VM into state

Bash
# The format is: terraform import <address> <proxmox_node>/<vmtype>/<vmid>
terraform import \
  'module.dev_vms.proxmox_vm_qemu.vm["devbox-golang"]' \
  pve1/qemu/110

terraform import \
  'module.dev_vms.proxmox_vm_qemu.vm["devbox-rust"]' \
  pve1/qemu/111

terraform import \
  'module.dev_vms.proxmox_vm_qemu.vm["minecraft"]' \
  pve2/qemu/200

terraform import \
  'module.dev_vms.proxmox_vm_qemu.vm["kali-lab"]' \
  pve2/qemu/201

terraform import \
  'module.dev_vms.proxmox_vm_qemu.vm["pihole-primary"]' \
  pve1/qemu/105

terraform import \
  'module.dev_vms.proxmox_vm_qemu.vm["pihole-secondary"]' \
  pve3/qemu/106

# ... and so on for all 15 VMs

Each import took about 10-15 seconds as Terraform queried the Proxmox API:

CODE
module.dev_vms.proxmox_vm_qemu.vm["devbox-golang"]: Importing from ID "pve1/qemu/110"...
module.dev_vms.proxmox_vm_qemu.vm["devbox-golang"]: Import prepared!
  Prepared proxmox_vm_qemu for import
module.dev_vms.proxmox_vm_qemu.vm["devbox-golang"]: Refreshing state... [id=pve1/qemu/110]

Import successful!

The resources that were imported are shown above. These resources are now in
your Terraform state and will henceforth be managed by Terraform.
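Typing those commands one at a time invites typos. If you have the VM list handy, a small loop does the same job; a sketch (import_vms is my own hypothetical helper, and the name/ID pairs mirror the commands above):

```shell
# Hypothetical helper: batch-import VMs from name=node/type/vmid pairs
# so each address/ID pair is only typed once.
import_vms() {
  local entry name id
  for entry in "$@"; do
    name=${entry%%=*}   # part before the first '='
    id=${entry#*=}      # part after the first '='
    terraform import \
      "module.dev_vms.proxmox_vm_qemu.vm[\"${name}\"]" \
      "$id" || { echo "import failed for ${name}" >&2; return 1; }
  done
}

# Usage, mirroring the commands above:
# import_vms devbox-golang=pve1/qemu/110 devbox-rust=pve1/qemu/111 minecraft=pve2/qemu/200
```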

Step 3: Iterate until plan is clean

This was the tedious part. After importing, I'd run terraform plan and see a wall of changes because my resource block didn't perfectly match the actual VM config:

CODE
# module.dev_vms.proxmox_vm_qemu.vm["devbox-golang"] will be updated in-place
~ resource "proxmox_vm_qemu" "vm" {
    ~ cores  = 4 -> 2       # Oops, it actually had 2 cores
    ~ memory = 8192 -> 4096 # And 4GB, not 8GB
      name   = "devbox-golang"
    ~ scsihw = "virtio-scsi-single" -> "lsi" # Different SCSI controller
      # (12 unchanged attributes hidden)
  }

So I'd update my HCL to match reality, run plan again, fix more discrepancies, repeat. For fifteen VMs. Some had settings I didn't even know I'd configured — NUMA topology, CPU types, specific BIOS settings.

It took about four hours. Fifteen VMs. Lots of pvesh get queries to the Proxmox API. Lots of back-and-forth between the editor and the terminal.
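If you're doing a bulk import today, it's worth knowing that Terraform 1.5+ lets you declare imports in configuration and can draft the matching HCL with -generate-config-out, which removes most of that back-and-forth. One caveat: config generation only works for root-module addresses, so you'd generate first and fold the result into a module afterwards. A sketch (the ID mirrors the example above; the resource name is illustrative):

```shell
# Declare the import in config instead of running `terraform import`:
cat > imports.tf <<'EOF'
import {
  to = proxmox_vm_qemu.devbox_golang
  id = "pve1/qemu/110"
}
EOF

# Then let Terraform write a resource block that matches reality:
#   terraform plan -generate-config-out=generated.tf
# Review generated.tf, trim the noise, and move what you keep into your modules.
```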

But at the end:

Bash
resham@devbox:~/homelab-iac/terraform$ terraform plan -var-file="envs/homelab-prod.tfvars"

No changes. Your infrastructure matches the configuration.

Fifteen VMs, fully managed. If my house burns down tomorrow, I can terraform apply on a new Proxmox cluster and have my entire environment back. That's worth a Saturday.

Handling Drift

Drift is what happens when reality diverges from your Terraform state. And in a homelab, it happens constantly, because sometimes it's 11 PM and you just need to bump a VM's RAM from the Proxmox UI and you'll "fix it in Terraform later."

You won't fix it later. You'll forget, and three weeks from now you'll run terraform plan and see:

CODE
# module.services.proxmox_vm_qemu.vm["docker-host-01"] will be updated in-place
~ resource "proxmox_vm_qemu" "vm" {
    ~ memory = 32768 -> 16384
      name   = "docker-host-01"
      # (15 unchanged attributes hidden)
  }

Terraform wants to revert the memory back to what's in your config (16GB) because you manually bumped it to 32GB three weeks ago. If you apply this, your Docker host loses half its RAM and every container on it starts OOMing.

I handle drift in one of three ways:

Option 1: Update the config to match reality

HCL
# If the manual change was intentional, update the HCL
"docker-host-01" = {
  cores      = 4
  memory     = 32768 # Updated from 16384 — bumped for memory-hungry containers
  disk_size  = "100G"
  ip_address = "10.0.0.30"
  tags       = ["docker", "prod"]
}

Option 2: Revert the manual change

If the manual change was a mistake or a temporary fix, let Terraform revert it:

Bash
terraform apply -var-file="envs/homelab-prod.tfvars" -target='module.services.proxmox_vm_qemu.vm["docker-host-01"]'

Option 3: Ignore it (carefully)

For attributes that you know will drift and you don't care about, use lifecycle.ignore_changes:

HCL
lifecycle {
  ignore_changes = [
    network, # Proxmox reorders NICs sometimes
    desc,    # I edit descriptions in the UI
    tags,    # Tags get added by automation scripts
  ]
}

Be very careful with ignore_changes. It's a trapdoor. Once you ignore an attribute, Terraform will never manage it again, and you'll forget it's there until the day you desperately need it to work.

I run a weekly drift check as a cron job:

Bash
# /etc/cron.d/terraform-drift-check
0 8 * * 1 resham cd /home/resham/homelab-iac/terraform && \
  terraform workspace select homelab-prod && \
  terraform plan -var-file="envs/homelab-prod.tfvars" -detailed-exitcode -no-color \
    > /tmp/terraform-drift-report.txt 2>&1; \
  if [ $? -eq 2 ]; then \
    curl -X POST "$DISCORD_WEBHOOK" \
      -H "Content-Type: application/json" \
      -d "{\"content\": \"Terraform drift detected in homelab-prod. Check /tmp/terraform-drift-report.txt\"}"; \
  fi

The -detailed-exitcode flag is key: exit code 0 means no changes, 1 means an error, 2 means changes detected. If I get a Discord ping on Monday morning, I know something drifted over the weekend.

Sensitive Values: The Hard Lesson

Let me tell you about the worst commit I've ever made.

It was early in my Terraform journey. I had a terraform.tfvars file with my Proxmox API token, my AWS access keys, and my Cloudflare API token. I committed it. Pushed it to my private GitHub repo. Didn't notice for two weeks.

Now, it was a private repo. Nobody saw it. But the credentials were in the git history forever. Even after I deleted the file, git log would happily show anyone who cloned the repo every secret I'd committed.

Here's how I fixed it:

Bash
# Step 1: Rotate every compromised credential immediately.
# Don't clean up first. Rotate FIRST. Assume compromise.

# Step 2: Remove the file from ALL git history
pip install git-filter-repo
git filter-repo --invert-paths --path terraform.tfvars --force

# Step 3: Force push (only time I'll advocate for force push)
git push origin --force --all

# Step 4: Tell GitHub to garbage collect
# (GitHub support can do this for you if you ask nicely)

After that experience, here's my hierarchy for sensitive values:

Tier 1: Environment variables (for CI/CD)

Bash
export TF_VAR_proxmox_api_token="your-token-here"
export TF_VAR_aws_access_key="AKIA..."
export TF_VAR_cloudflare_api_token="..."

Terraform automatically reads TF_VAR_<name> environment variables as input variables. No file to accidentally commit.

Tier 2: .tfvars with .gitignore (for local development)

CODE
# .gitignore
*.tfvars
!envs/*.tfvars  # Non-sensitive workspace configs ARE committed
secrets.tfvars  # Explicit ignore

HCL
# secrets.tfvars — NEVER committed
proxmox_api_token    = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
ssh_public_key       = "ssh-ed25519 AAAA... resham@devbox"
cloudflare_api_token = "..."

Bash
terraform plan -var-file="envs/homelab-prod.tfvars" -var-file="secrets.tfvars"

Tier 3: HashiCorp Vault (for production)

HCL
data "vault_generic_secret" "aws_creds" {
  path = "secret/terraform/aws"
}

provider "aws" {
  access_key = data.vault_generic_secret.aws_creds.data["access_key"]
  secret_key = data.vault_generic_secret.aws_creds.data["secret_key"]
  region     = var.aws_region
}

I use Vault for the AWS production infrastructure. For the homelab, environment variables are sufficient — the blast radius of a leaked Proxmox token on my local network is limited.

Also: mark sensitive variables as sensitive = true in your variable declarations. Terraform will redact them from plan output:

HCL
variable "proxmox_api_token" {
  description = "Proxmox API token"
  type        = string
  sensitive   = true
}

CODE
# Plan output shows:
  + proxmox_api_token = (sensitive value)

CI/CD for Terraform

Manual applies are fine when you're learning. They're not fine when you're managing production infrastructure for a real product. Here's my GitHub Actions workflow:

YAML
# .github/workflows/terraform.yml
name: Terraform

on:
  pull_request:
    branches: [main]
    paths:
      - 'terraform/**'
  push:
    branches: [main]
    paths:
      - 'terraform/**'

permissions:
  contents: read
  pull-requests: write
  id-token: write

env:
  TF_VERSION: "1.9.8"
  AWS_REGION: "us-east-1"

jobs:
  plan:
    name: "Terraform Plan"
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    strategy:
      matrix:
        workspace: [homelab-prod, cloud-aws, cloud-aws-dr]
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::<ACCOUNT_ID>:role/terraform-github-actions
          aws-region: ${{ env.AWS_REGION }}

      - name: Terraform Init
        working-directory: terraform
        run: terraform init

      - name: Select Workspace
        working-directory: terraform
        run: terraform workspace select ${{ matrix.workspace }}

      - name: Terraform Plan
        id: plan
        working-directory: terraform
        run: |
          terraform plan \
            -var-file="envs/${{ matrix.workspace }}.tfvars" \
            -no-color \
            -out=tfplan-${{ matrix.workspace }} \
            2>&1 | tee plan-output.txt
        env:
          TF_VAR_proxmox_api_token: ${{ secrets.PROXMOX_API_TOKEN }}
          TF_VAR_ssh_public_key: ${{ secrets.SSH_PUBLIC_KEY }}
          TF_VAR_cloudflare_api_token: ${{ secrets.CLOUDFLARE_API_TOKEN }}

      - name: Comment Plan on PR
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('terraform/plan-output.txt', 'utf8');
            const workspace = '${{ matrix.workspace }}';
            const body = `### Terraform Plan: \`${workspace}\`
            \`\`\`hcl
            ${plan.substring(0, 65000)}
            \`\`\``;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });

      - name: Upload Plan Artifact
        uses: actions/upload-artifact@v4
        with:
          name: tfplan-${{ matrix.workspace }}
          path: terraform/tfplan-${{ matrix.workspace }}

  apply:
    name: "Terraform Apply"
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    environment: production # Requires manual approval in GitHub
    strategy:
      matrix:
        workspace: [homelab-prod, cloud-aws, cloud-aws-dr]
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::<ACCOUNT_ID>:role/terraform-github-actions
          aws-region: ${{ env.AWS_REGION }}

      - name: Terraform Init
        working-directory: terraform
        run: terraform init

      - name: Select Workspace
        working-directory: terraform
        run: terraform workspace select ${{ matrix.workspace }}

      - name: Terraform Apply
        working-directory: terraform
        run: |
          terraform apply \
            -var-file="envs/${{ matrix.workspace }}.tfvars" \
            -auto-approve
        env:
          TF_VAR_proxmox_api_token: ${{ secrets.PROXMOX_API_TOKEN }}
          TF_VAR_ssh_public_key: ${{ secrets.SSH_PUBLIC_KEY }}
          TF_VAR_cloudflare_api_token: ${{ secrets.CLOUDFLARE_API_TOKEN }}

The key design decisions:

  • Plan on PR, apply on merge. You review the plan in the PR comments before approving.
  • Matrix strategy runs plan/apply for each workspace in parallel. If cloud-aws has issues, homelab-prod still applies.
  • The production environment in GitHub requires manual approval. Merging to main doesn't automatically apply — a human has to click "Approve" in the Actions UI.
  • Plan artifacts are saved so the apply can use the exact same plan that was reviewed, not a new one.
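One caveat on that last point: as the workflow stands, the apply job re-plans with -auto-approve rather than consuming the artifact. To apply the reviewed plan directly, the apply job would download the artifact (actions/download-artifact) and pass the plan file to terraform apply; a saved plan has its variables baked in, so no -var-file is needed. A sketch (apply_saved_plan is my own hypothetical helper):

```shell
# Sketch of an apply step that consumes the reviewed artifact, assuming
# actions/download-artifact already fetched tfplan-<workspace> into cwd.
apply_saved_plan() {
  local workspace=$1
  # A saved plan file carries its own variables; no -var-file here.
  terraform apply "tfplan-${workspace}"
}

# Usage: apply_saved_plan cloud-aws
# Note: a saved plan is only valid against the exact state it was planned
# from; if state has moved on since, apply fails and you re-plan.
```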

Performance: From 4 Minutes to 18 Seconds

The biggest win was splitting resources into separate root modules, each with its own state (covered below). But there are several other tricks:

Targeted plans when you know what you're changing:

Bash
# Only plan the monitoring module
terraform plan -var-file="envs/homelab-prod.tfvars" \
  -target=module.monitoring

# Plan time: 3 seconds instead of 18

Parallelism tuning:

Bash
# Default parallelism is 10. For my Proxmox API that can't handle many
# concurrent requests, I lower it:
terraform apply -var-file="envs/homelab-prod.tfvars" -parallelism=5

# For AWS with high API rate limits, I raise it:
terraform apply -var-file="envs/cloud-aws.tfvars" -parallelism=30

Skip refresh when you know nothing changed:

Bash
# Useful for rapid iteration on HCL syntax
terraform plan -var-file="envs/homelab-prod.tfvars" -refresh=false
# Plan time: 2 seconds (no API calls)

Be careful with -refresh=false, though. If something changed outside Terraform, you won't see it. I only use it when I'm iterating on a new resource definition and I know nothing else has changed.

State splitting was the nuclear option. When I realized the homelab and AWS resources didn't need to be in the same state at all, I split them into separate root modules:

Bash
terraform/
├── homelab/          # Proxmox resources only
│   ├── backend.tf    # key = "homelab/terraform.tfstate"
│   ├── main.tf
│   └── ...
├── cloud/            # AWS resources only
│   ├── backend.tf    # key = "cloud/terraform.tfstate"
│   ├── main.tf
│   └── ...
└── shared/           # DNS, monitoring that spans both
    ├── backend.tf    # key = "shared/terraform.tfstate"
    ├── main.tf
    └── ...

When

CODE
shared/
needs outputs from
CODE
homelab/
or
CODE
cloud/
, it uses
CODE
terraform_remote_state
:

HCL
# shared/data.tf
data "terraform_remote_state" "homelab" {
  backend = "s3"
  config = {
    bucket = "kumari-terraform-state"
    key    = "homelab/terraform.tfstate"
    region = "us-east-1"
  }
}

data "terraform_remote_state" "cloud" {
  backend = "s3"
  config = {
    bucket = "kumari-terraform-state"
    key    = "cloud/terraform.tfstate"
    region = "us-east-1"
  }
}

# Now I can reference outputs from other states
resource "cloudflare_record" "grafana" {
  zone_id = var.cloudflare_zone_id
  name    = "grafana"
  content = data.terraform_remote_state.homelab.outputs.monitoring_vm_ips["grafana"]
  type    = "A"
  proxied = true
}

This alone cut plan time by 60% because each state only refreshes its own resources.

Disaster Recovery

State files are the crown jewels. If you lose state, Terraform doesn't know what it manages. You're back to importing everything by hand. Here's my backup strategy:

Layer 1: S3 versioning. Every state write creates a new version. I can roll back to any previous state.

Bash
# List state file versions
aws s3api list-object-versions \
  --bucket kumari-terraform-state \
  --prefix "homelab/terraform.tfstate" \
  --max-items 5

# Restore a previous version
aws s3api get-object \
  --bucket kumari-terraform-state \
  --key "homelab/terraform.tfstate" \
  --version-id "abc123..." \
  restored-state.json

# Push restored state
terraform state push restored-state.json

Layer 2: Pre-apply state backup. My CI/CD workflow pulls state before every apply:

Bash
terraform state pull > "backups/state-$(date +%Y%m%d-%H%M%S)-pre-apply.json"
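That one-liner eventually grew a retention rule so `backups/` doesn't fill the runner's disk. A sketch, with the actual `terraform state pull` left as a comment and the 30-file cap being my arbitrary choice:

```shell
#!/usr/bin/env sh
# Same naming scheme as the CI step, plus retention.
set -u
mkdir -p backups
STAMP=$(date +%Y%m%d-%H%M%S)
BACKUP="backups/state-${STAMP}-pre-apply.json"
# terraform state pull > "$BACKUP"   # the actual pull, as in the CI step

# Keep only the 30 most recent pre-apply backups
ls -1t backups/state-*-pre-apply.json 2>/dev/null | tail -n +31 | xargs -r rm -f --

echo "next backup: $BACKUP"
```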

Layer 3: Cross-region replication. The S3 bucket replicates to us-west-2. If us-east-1 goes down, my state is still accessible.
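The replication rule itself is only a few lines of HCL. A hedged sketch — the resource names and IAM role are placeholders, and both buckets need versioning enabled before replication will work:

```hcl
# Sketch of state-bucket replication to us-west-2.
# Assumes aws_s3_bucket.state, aws_s3_bucket.state_dr, and a replication
# IAM role are defined elsewhere.
resource "aws_s3_bucket_replication_configuration" "state" {
  bucket = aws_s3_bucket.state.id
  role   = aws_iam_role.replication.arn

  rule {
    id     = "state-to-dr"
    status = "Enabled"

    filter {}   # empty filter = replicate every object

    destination {
      bucket        = aws_s3_bucket.state_dr.arn   # bucket in us-west-2
      storage_class = "STANDARD"
    }
  }
}
```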

When state gets corrupted (it happened once, after a crash during apply):

Bash
# Check state integrity
terraform state list
# If this errors, state is corrupted

# Option 1: Roll back to last good version from S3
aws s3api list-object-versions --bucket kumari-terraform-state \
  --prefix "homelab/terraform.tfstate" --max-items 10

# Option 2: If lock is stuck from the crashed process
terraform force-unlock LOCK_ID_HERE
# Only use this if you're SURE no other process is running

CODE
terraform force-unlock
is the last resort. I've used it exactly once, when my CI runner crashed mid-apply and left a stale lock. The lock had been held for 45 minutes and the runner was confirmed dead. Even then I double-checked that nothing else was running before unlocking.
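Before reaching for the unlock, you can inspect the lock item directly in DynamoDB; its `Info` attribute records the operation, who started it, and when. A sketch — the lock table name here is my assumption, and the `LockID` is `<bucket>/<state key>`:

```shell
#!/usr/bin/env sh
# Inspect the S3 backend's DynamoDB lock before force-unlocking.
# Table name is an assumption; check your backend "dynamodb_table" setting.
show_lock() {
  aws dynamodb get-item \
    --table-name kumari-terraform-locks \
    --key '{"LockID": {"S": "kumari-terraform-state/homelab/terraform.tfstate"}}'
}

# If the Info field shows an operation started an hour ago by a runner
# you've confirmed is dead, THEN force-unlock with the ID it contains.
```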

The Numbers

After six months of iterating on this setup:

Metric | Before | After
Total managed resources | ~80 (rest unmanaged) | 1,237
Plan time (full) | 4 min 12 sec | 18 sec
Plan time (targeted) | N/A | 2-5 sec
State files | 1 (local) | 4 (remote, locked)
Manual infrastructure changes | Weekly | Zero in 6 months
Time to recreate full environment | "Unknown, pray it doesn't happen" | ~45 min
CI/CD pipeline | None | Plan on PR, apply on merge
Secrets committed to git | 1 (that I know of) | 0

The 1,237 resources break down roughly as:

  • Homelab prod: 487 resources (VMs, LXCs, storage, network, firewall rules)
  • Homelab staging: 89 resources (minimal mirror of prod)
  • Cloud AWS: 584 resources (VPC, subnets, ECS services, RDS, ElastiCache, CloudFront, WAF rules, IAM, the whole stack)
  • Cloud AWS DR: 77 resources (warm standby, ready to scale)

What I'd Do Differently

If I could go back and start over:

  1. Remote state from day one. Not day thirty. Not "when I have more resources." Day one.
  2. Modules from the start. Even if you only have three resources, put them in a module. You'll thank yourself in six months.
  3. Never make manual changes. Not even "just this once." The five minutes you save now becomes two hours of drift debugging later.
  4. Use
    CODE
    for_each
    instead of
    CODE
    count
    .
    I started with
    CODE
    count
    for my VMs and regretted it immediately. Removing a VM from the middle of a list reindexes everything.
    CODE
    for_each
    with a map is the way.
  5. Tag everything. Tags are free. Put the environment, the owner, the Terraform workspace, and the module that manages the resource. Future you running
    CODE
    terraform state list | grep monitoring
    will appreciate it.
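Point 4 in HCL, for the record (the resource arguments are trimmed to the relevant bits):

```hcl
# count: identity is positional. Remove "loki" from the list and vm[2]
# shifts into vm[1] -- Terraform plans a destroy/recreate for everything
# after the removed element.
#   count = length(var.vm_names)          # ["grafana", "loki", "prometheus"]
#   name  = var.vm_names[count.index]

# for_each: identity is the key. Removing "loki" touches only vm["loki"].
resource "proxmox_vm_qemu" "vm" {
  for_each = toset(["grafana", "loki", "prometheus"])
  name     = each.key
  # ...
}
```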

The homelab is where I make my mistakes so that Kumari.ai's production infrastructure doesn't suffer them. Every pattern in this post — remote state, workspace isolation, module architecture, CI/CD gates — started as a homelab experiment before I trusted it with customer data.

Terraform is not a tool you master by reading docs. It's a tool you master by running

CODE
terraform plan
at 2 AM, seeing "15 to destroy," feeling your stomach drop, and learning to never let that happen again.

My state files are backed up. My locks are working. My drift checks are running. And I haven't touched the Proxmox UI in six months.

That's the goal.