Mar 19, 2026 | 26 min read

Terraform at Scale: Modules, State Surgery, and the Import That Saved My Weekend

Advanced Terraform patterns I learned managing 1,200+ resources across my homelab and AWS — remote state with locking, custom modules, workspace isolation, state surgery, and the mistakes that taught me the hard way.

Terraform, IaC, DevOps, AWS, Proxmox, Homelab, Cloud Architecture

I stared at the terminal for a solid thirty seconds before I pressed Enter.

Bash
terraform state mv 'module.old_infra.proxmox_vm_qemu.vm["grafana"]' 'module.monitoring.proxmox_vm_qemu.vm["grafana"]'

One wrong character and I'd orphan a running production VM from Terraform's knowledge. No undo button. No confirmation dialog. Just me, my coffee, and the quiet hum of the rack in the closet behind me.

This is the story of how I took a single 800-line main.tf that took four minutes to plan and turned it into a modular, workspace-isolated, CI/CD-driven system managing 1,237 resources across my homelab Proxmox cluster and the AWS infrastructure behind Kumari.ai. The lessons here cost me weekends. Hopefully they save yours.

Terraform remote state and workspace architecture

The Tipping Point

Every Terraform project starts the same way. A single main.tf. Maybe a variables.tf if you're feeling organized. It works beautifully for the first twenty resources. Then forty. Then eighty.

My inflection point came on a Tuesday night. I was adding a new LXC container for a Loki instance and ran terraform plan:

Bash
resham@devbox:~/homelab-iac/terraform$ time terraform plan
...
Plan: 1 to add, 0 to change, 0 to destroy.

real    4m12.387s
user    0m8.241s
sys     0m1.093s

Four minutes. To add a single container. Terraform was refreshing every resource in state — every VM, every DNS record, every firewall rule, every AWS ECS service — just to tell me it needed to create one LXC.

I opened main.tf and scrolled. And scrolled. Eight hundred and fourteen lines. Proxmox VMs mixed with AWS VPC definitions mixed with Cloudflare DNS records mixed with Let's Encrypt certificates. Variables scattered across three files with no naming convention. Outputs that referenced resources by index because I was too lazy to use for_each early on.

It was a mess. And I was the one who made it.

The real wake-up call came a week later when my laptop's SSD died. I had the Terraform code in git, sure. But the state file? Local. terraform.tfstate sitting in the project directory, listed in .gitignore because I'd read somewhere that you shouldn't commit state. Which is correct. But I also hadn't set up remote state, which meant my state file lived on exactly one drive. The drive that was now making clicking noises.

I spent that weekend recreating state by hand. Importing resources one at a time. Forty-seven terraform import commands. I swore I'd never be in that position again.

Remote State: The Foundation

The first thing I fixed was state storage. If your Terraform state is local, stop reading this and go fix that. Right now. I'll wait.

Here's my backend configuration:

HCL
# backend.tf
terraform {
  backend "s3" {
    bucket         = "kumari-terraform-state"
    key            = "infrastructure/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"

    # I use a dedicated IAM user for state access
    # with minimal permissions — just S3 and DynamoDB
    profile = "terraform-state"
  }
}

The S3 bucket has versioning enabled. Every time Terraform writes state, S3 keeps the previous version. This has saved me twice — once when a bad apply corrupted state, and once when I accidentally removed a resource block without running state rm first.
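Moving an existing project's local state into this backend is a one-time step, and Terraform handles the copy itself when you re-initialize. A sketch of the migration, assuming the backend block above is already in place:

```shell
# After adding the backend "s3" block to backend.tf:
terraform init -migrate-state
# Terraform notices the backend change and asks whether to copy the
# existing local state into S3. Answer yes, then sanity-check:
#   terraform state list
# Once you trust the S3 copy, delete the local terraform.tfstate* files.
```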

HCL
# state-backend/main.tf
# I manage the state backend itself with a SEPARATE Terraform config
# that uses local state. Yes, it's turtles all the way down.

provider "aws" {
  region  = "us-east-1"
  profile = "terraform-admin"
}

resource "aws_s3_bucket" "terraform_state" {
  bucket = "kumari-terraform-state"

  tags = {
    Name        = "Terraform State"
    ManagedBy   = "terraform"
    Environment = "global"
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  tags = {
    Name      = "Terraform State Lock"
    ManagedBy = "terraform"
  }
}

The DynamoDB table is critical. Without it, if two people (or two CI jobs, or you in two terminal tabs — don't ask how I know) run terraform apply at the same time, they'll both read the same state, make different changes, and one will overwrite the other. DynamoDB provides a distributed lock. When Terraform acquires the lock, it writes a lock entry. If another process tries to acquire it, Terraform tells you:

CODE
Error: Error acquiring the state lock

Error message: ConditionalCheckFailedException: The conditional request failed
Lock Info:
  ID:        a1b2c3d4-e5f6-7890-abcd-ef1234567890
  Path:      kumari-terraform-state/infrastructure/terraform.tfstate
  Operation: OperationTypeApply
  Who:       resham@devbox
  Version:   1.9.8
  Created:   2026-03-18 02:14:33.048293 +0000 UTC

The Who field has saved me from myself more than once. "Oh, that's me in the other terminal. Right."

Workspaces: One Codebase, Four Environments

My infrastructure spans four distinct environments:

  • homelab-prod — The Proxmox cluster running actual services. Grafana, Prometheus, DNS, media, the NAS.
  • homelab-staging — A smaller set of VMs where I test changes before rolling them to prod. Yes, I have staging for my homelab. I've been burned enough times.
  • cloud-aws — Production AWS infrastructure for Kumari.ai. ECS, RDS, ElastiCache, CloudFront, the works.
  • cloud-aws-dr — Disaster recovery in us-west-2. Minimal footprint, ready to scale up.

Terraform workspaces give each environment its own state file within the same backend:

Bash
# Create workspaces
terraform workspace new homelab-prod
terraform workspace new homelab-staging
terraform workspace new cloud-aws
terraform workspace new cloud-aws-dr

# Switch between them
terraform workspace select homelab-prod

# List all workspaces
resham@devbox:~/homelab-iac/terraform$ terraform workspace list
  default
  homelab-prod
* homelab-staging
  cloud-aws
  cloud-aws-dr

Each workspace uses a different .tfvars file:

HCL
# envs/homelab-prod.tfvars
environment     = "homelab-prod"
proxmox_api_url = "https://pve1.internal.resham.dev:8006/api2/json"
proxmox_node    = "pve1"
vm_defaults = {
  cores    = 2
  memory   = 2048
  disk     = "local-zfs"
  bridge   = "vmbr0"
  os_type  = "cloud-init"
  template = "ubuntu-2404-cloud"
}
monitoring_enabled = true
backup_schedule    = "0 2 * * *"

# envs/homelab-staging.tfvars
environment     = "homelab-staging"
proxmox_api_url = "https://pve3.internal.resham.dev:8006/api2/json"
proxmox_node    = "pve3"
vm_defaults = {
  cores    = 1
  memory   = 1024
  disk     = "local-zfs"
  bridge   = "vmbr1" # Isolated staging VLAN
  os_type  = "cloud-init"
  template = "ubuntu-2404-cloud"
}
monitoring_enabled = false
backup_schedule    = "" # No backups for staging

# envs/cloud-aws.tfvars
environment        = "cloud-aws"
aws_region         = "us-east-1"
vpc_cidr           = "10.100.0.0/16"
ecs_cluster_name   = "kumari-prod"
rds_instance_class = "db.r6g.large"
redis_node_type    = "cache.r6g.large"
enable_waf         = true
min_ecs_tasks      = 2
max_ecs_tasks      = 20

The plan/apply commands always specify the vars file explicitly:

Bash
terraform plan -var-file="envs/$(terraform workspace show).tfvars"
terraform apply -var-file="envs/$(terraform workspace show).tfvars"

I have a shell alias for this because I got tired of typing it:

Bash
# ~/.zshrc
tplan() { terraform plan -var-file="envs/$(terraform workspace show).tfvars" "$@"; }
tapply() { terraform apply -var-file="envs/$(terraform workspace show).tfvars" "$@"; }

One lesson learned the painful way: always check which workspace you're in before running apply. I once applied homelab-staging config to homelab-prod because I forgot to switch. Downscaled every VM to 1 core and 1GB RAM. My monitoring stack collapsed, which meant I didn't even get alerts about it. Found out when Nextcloud became unusable.

Now my shell prompt shows the active workspace:

TOML
# Part of my starship.toml
[custom.terraform]
command = "terraform workspace show 2>/dev/null"
when = "test -f main.tf"
format = "[tf:$output]($style) "
style = "bold purple"
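The prompt helps, but a guard in the alias layer is stricter. A sketch of a small function (tguard is my own hypothetical name, not a Terraform command) that refuses to proceed unless the active workspace is the one you named:

```shell
# Hypothetical guard: refuse to proceed unless the active workspace
# matches the environment you intended to touch.
tguard() {
  local intended=$1 active
  active=$(terraform workspace show) || return 1
  if [[ "$active" != "$intended" ]]; then
    echo "refusing: active workspace is '${active}', not '${intended}'" >&2
    return 1
  fi
  echo "workspace check passed: ${active}"
}

# Usage: tguard homelab-prod && tapply
```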

Module Architecture

Here's the directory structure after the refactor:

Bash
resham@devbox:~/homelab-iac/terraform$ tree -L 3
.
├── backend.tf
├── main.tf            # Root module — just module calls
├── variables.tf
├── outputs.tf
├── versions.tf
├── envs/
│   ├── homelab-prod.tfvars
│   ├── homelab-staging.tfvars
│   ├── cloud-aws.tfvars
│   └── cloud-aws-dr.tfvars
├── modules/
│   ├── proxmox-vm/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── versions.tf
│   ├── proxmox-lxc/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── versions.tf
│   ├── aws-vpc/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── versions.tf
│   ├── aws-ecs/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── versions.tf
│   ├── monitoring/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── dns/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf

Here's the complete proxmox-vm module. This is the one I use the most:

HCL
# modules/proxmox-vm/versions.tf
terraform {
  required_version = ">= 1.9.0"
  required_providers {
    proxmox = {
      source  = "Telmate/proxmox"
      version = "~> 3.0"
    }
  }
}

# modules/proxmox-vm/variables.tf
variable "vms" {
  description = "Map of VMs to create"
  type = map(object({
    cores        = optional(number, 2)
    memory       = optional(number, 2048)
    disk_size    = optional(string, "20G")
    disk_storage = optional(string, "local-zfs")
    bridge       = optional(string, "vmbr0")
    vlan_tag     = optional(number, -1)
    ip_address   = optional(string)
    gateway      = optional(string, "10.0.0.1")
    dns_servers  = optional(string, "10.0.0.5")
    template     = optional(string, "ubuntu-2404-cloud")
    onboot       = optional(bool, true)
    tags         = optional(list(string), [])
    description  = optional(string, "")
  }))
}

variable "target_node" {
  description = "Proxmox node to deploy VMs on"
  type        = string
}

variable "ssh_keys" {
  description = "SSH public keys for cloud-init"
  type        = string
  sensitive   = true
}

variable "default_user" {
  description = "Default user created by cloud-init"
  type        = string
  default     = "resham"
}

# modules/proxmox-vm/main.tf
resource "proxmox_vm_qemu" "vm" {
  for_each = var.vms

  name        = each.key
  target_node = var.target_node
  clone       = each.value.template
  full_clone  = true
  agent       = 1
  onboot      = each.value.onboot
  cores       = each.value.cores
  memory      = each.value.memory
  scsihw      = "virtio-scsi-single"
  tags        = join(";", each.value.tags)
  desc        = each.value.description

  disks {
    scsi {
      scsi0 {
        disk {
          size    = each.value.disk_size
          storage = each.value.disk_storage
        }
      }
    }
  }

  network {
    model  = "virtio"
    bridge = each.value.bridge
    tag    = each.value.vlan_tag
  }

  os_type    = "cloud-init"
  ciuser     = var.default_user
  sshkeys    = var.ssh_keys
  ipconfig0  = each.value.ip_address != null ? "ip=${each.value.ip_address}/24,gw=${each.value.gateway}" : "ip=dhcp"
  nameserver = each.value.dns_servers

  lifecycle {
    ignore_changes = [
      network, # Proxmox sometimes reorders network devices
      desc,    # Don't fight manual description edits
    ]
  }
}

# modules/proxmox-vm/outputs.tf
output "vm_ids" {
  description = "Map of VM names to their Proxmox VMIDs"
  value       = { for name, vm in proxmox_vm_qemu.vm : name => vm.vmid }
}

output "vm_ips" {
  description = "Map of VM names to their IP addresses"
  value       = { for name, vm in proxmox_vm_qemu.vm : name => vm.default_ipv4_address }
}

output "vm_names" {
  description = "List of all VM names created by this module"
  value       = keys(proxmox_vm_qemu.vm)
}

And the root module calls it like this:

HCL
# main.tf (root module)
module "monitoring" {
  source      = "./modules/proxmox-vm"
  target_node = var.proxmox_node
  ssh_keys    = var.ssh_public_key

  vms = {
    "grafana" = {
      cores      = 2
      memory     = 4096
      disk_size  = "50G"
      ip_address = "10.0.0.20"
      tags       = ["monitoring", "prod"]
    }
    "prometheus" = {
      cores      = 4
      memory     = 8192
      disk_size  = "200G"
      ip_address = "10.0.0.21"
      tags       = ["monitoring", "prod"]
    }
    "loki" = {
      cores      = 2
      memory     = 4096
      disk_size  = "100G"
      ip_address = "10.0.0.22"
      tags       = ["monitoring", "prod"]
    }
  }
}

module "services" {
  source      = "./modules/proxmox-vm"
  target_node = var.proxmox_node
  ssh_keys    = var.ssh_public_key

  vms = {
    "nginx-proxy" = {
      cores      = 2
      memory     = 2048
      disk_size  = "20G"
      ip_address = "10.0.0.10"
      tags       = ["network", "prod"]
    }
    "docker-host-01" = {
      cores      = 4
      memory     = 16384
      disk_size  = "100G"
      ip_address = "10.0.0.30"
      tags       = ["docker", "prod"]
    }
    "docker-host-02" = {
      cores      = 4
      memory     = 16384
      disk_size  = "100G"
      ip_address = "10.0.0.31"
      tags       = ["docker", "prod"]
    }
  }
}

The DRY principle in action. Before modules, each VM was a separate 25-line resource block. Fifteen VMs meant 375 lines just for VM definitions, with every parameter copy-pasted and slightly different. Now the module handles the boilerplate, and the root config is just a clean map of what I actually care about: name, resources, IP.

State Surgery

This is the section that gives Terraform practitioners cold sweats.

When I refactored from a flat structure to modules, every resource path changed. What was proxmox_vm_qemu.grafana became module.monitoring.proxmox_vm_qemu.vm["grafana"]. If I'd just reorganized the code and run terraform plan, Terraform would've shown:

CODE
Plan: 15 to add, 0 to change, 15 to destroy.

Fifteen VMs destroyed and recreated. Production services down. Data lost. Not acceptable.

Instead, I used terraform state mv to update resource addresses in state without touching the actual infrastructure:

Bash
# Move monitoring VMs into the monitoring module
terraform state mv \
  'proxmox_vm_qemu.grafana' \
  'module.monitoring.proxmox_vm_qemu.vm["grafana"]'

terraform state mv \
  'proxmox_vm_qemu.prometheus' \
  'module.monitoring.proxmox_vm_qemu.vm["prometheus"]'

terraform state mv \
  'proxmox_vm_qemu.loki' \
  'module.monitoring.proxmox_vm_qemu.vm["loki"]'

# Move service VMs into the services module
terraform state mv \
  'proxmox_vm_qemu.nginx_proxy' \
  'module.services.proxmox_vm_qemu.vm["nginx-proxy"]'

terraform state mv \
  'proxmox_vm_qemu.docker_host_01' \
  'module.services.proxmox_vm_qemu.vm["docker-host-01"]'

terraform state mv \
  'proxmox_vm_qemu.docker_host_02' \
  'module.services.proxmox_vm_qemu.vm["docker-host-02"]'

Each command outputs something like:

CODE
Move "proxmox_vm_qemu.grafana" to "module.monitoring.proxmox_vm_qemu.vm[\"grafana\"]"
Successfully moved 1 object(s).

After all the moves, the moment of truth:

Bash
resham@devbox:~/homelab-iac/terraform$ terraform plan -var-file="envs/homelab-prod.tfvars"

No changes. Your infrastructure matches the configuration.

I have never felt more relief from a terminal output in my life.

The rules I follow for state surgery:

  1. Always back up state first: terraform state pull > state-backup-$(date +%Y%m%d-%H%M%S).json
  2. Do it during a maintenance window. Lock deploys. Tell anyone who touches the infra.
  3. Move one resource at a time. Run terraform plan after each move to catch mistakes early.
  4. Never combine state moves with code changes. Move state first, verify with plan, then merge the code.
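Rules 1 and 3 are mechanical enough to script. A sketch of a wrapper (safe_state_mv is my own hypothetical name; a real run would also pass the right -var-file to plan):

```shell
# Hypothetical wrapper: back up state, move exactly one address, then plan.
safe_state_mv() {
  local src=$1 dst=$2
  local backup="state-backup-$(date +%Y%m%d-%H%M%S).json"

  terraform state pull > "$backup" || return 1
  echo "state backed up to ${backup}"

  terraform state mv "$src" "$dst" || return 1

  # -detailed-exitcode: 0 = clean, 1 = error, 2 = changes pending
  terraform plan -detailed-exitcode >/dev/null 2>&1
  local rc=$?
  if [[ $rc -ne 0 ]]; then
    echo "plan not clean after move (exit ${rc}), investigate before the next move" >&2
    return 1
  fi
  echo "moved ${src} -> ${dst}, plan clean"
}
```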

For resources that should no longer be managed by Terraform (maybe you're moving them to a different tool, or they're being decommissioned manually):

Bash
# Remove from state without destroying the resource
terraform state rm 'module.legacy.proxmox_vm_qemu.vm["old-jenkins"]'
# Removed module.legacy.proxmox_vm_qemu.vm["old-jenkins"]
# The VM keeps running. Terraform just forgets about it.

I used state rm when I decommissioned my old Jenkins server. Didn't want Terraform to destroy it — I needed to pull build history off it first. Removed it from state, did my data migration over the next week, then manually deleted the VM through the Proxmox UI.

The Import That Saved My Weekend

This is the story from the title.

It was a Saturday morning. I'd been running my Proxmox cluster for about eight months, and over that time I'd created fifteen VMs manually through the Proxmox web UI. Dev boxes, test environments, a Minecraft server for friends, a Kali box for security practice. None of them were in Terraform.

Every time I ran terraform plan, these VMs were invisible. Terraform didn't know they existed. Which meant if I ever needed to recreate my environment, those fifteen VMs and all their configuration details lived exclusively in my head and in the Proxmox database.

So I decided to import them. All of them. On a Saturday.

The process for each VM:

Step 1: Write the resource block that matches the existing VM

HCL
# I had to look up each VM's config in Proxmox first:
# pvesh get /nodes/pve1/qemu/110/config

module "dev_vms" {
  source      = "./modules/proxmox-vm"
  target_node = "pve1"
  ssh_keys    = var.ssh_public_key

  vms = {
    "devbox-golang" = {
      cores      = 4
      memory     = 8192
      disk_size  = "80G"
      ip_address = "10.0.0.50"
      tags       = ["dev", "golang"]
    }
    # ... 14 more VMs
  }
}

Step 2: Import each VM into state

Bash
# The format is: terraform import <address> <proxmox_node>/<vmtype>/<vmid>
terraform import \
  'module.dev_vms.proxmox_vm_qemu.vm["devbox-golang"]' \
  pve1/qemu/110

terraform import \
  'module.dev_vms.proxmox_vm_qemu.vm["devbox-rust"]' \
  pve1/qemu/111

terraform import \
  'module.dev_vms.proxmox_vm_qemu.vm["minecraft"]' \
  pve2/qemu/200

terraform import \
  'module.dev_vms.proxmox_vm_qemu.vm["kali-lab"]' \
  pve2/qemu/201

terraform import \
  'module.dev_vms.proxmox_vm_qemu.vm["pihole-primary"]' \
  pve1/qemu/105

terraform import \
  'module.dev_vms.proxmox_vm_qemu.vm["pihole-secondary"]' \
  pve3/qemu/106

# ... and so on for all 15 VMs

Each import took about 10-15 seconds as Terraform queried the Proxmox API:

CODE
module.dev_vms.proxmox_vm_qemu.vm["devbox-golang"]: Importing from ID "pve1/qemu/110"...
module.dev_vms.proxmox_vm_qemu.vm["devbox-golang"]: Import prepared!
  Prepared proxmox_vm_qemu for import
module.dev_vms.proxmox_vm_qemu.vm["devbox-golang"]: Refreshing state... [id=pve1/qemu/110]

Import successful!

The resources that were imported are shown above. These resources are now in
your Terraform state and will henceforth be managed by Terraform.
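Typing those commands one at a time invites typos. If you have the VM list handy, a small loop does the same job; a sketch (import_vms is my own hypothetical helper, and the name/ID pairs mirror the commands above):

```shell
# Hypothetical helper: batch-import VMs from name=node/type/vmid pairs
# so each address/ID pair is only typed once.
import_vms() {
  local entry name id
  for entry in "$@"; do
    name=${entry%%=*}   # part before the first '='
    id=${entry#*=}      # part after the first '='
    terraform import \
      "module.dev_vms.proxmox_vm_qemu.vm[\"${name}\"]" \
      "$id" || { echo "import failed for ${name}" >&2; return 1; }
  done
}

# Usage, mirroring the commands above:
# import_vms devbox-golang=pve1/qemu/110 devbox-rust=pve1/qemu/111 minecraft=pve2/qemu/200
```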

Step 3: Iterate until plan is clean

This was the tedious part. After importing, I'd run terraform plan and see a wall of changes because my resource block didn't perfectly match the actual VM config:

CODE
# module.dev_vms.proxmox_vm_qemu.vm["devbox-golang"] will be updated in-place
~ resource "proxmox_vm_qemu" "vm" {
    ~ cores  = 4 -> 2       # Oops, it actually had 2 cores
    ~ memory = 8192 -> 4096 # And 4GB, not 8GB
      name   = "devbox-golang"
    ~ scsihw = "virtio-scsi-single" -> "lsi" # Different SCSI controller
      # (12 unchanged attributes hidden)
  }

So I'd update my HCL to match reality, run plan again, fix more discrepancies, repeat. For fifteen VMs. Some had settings I didn't even know I'd configured — NUMA topology, CPU types, specific BIOS settings.

It took about four hours. Fifteen VMs. Lots of pvesh get queries to the Proxmox API. Lots of back-and-forth between the editor and the terminal.
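If you're doing a bulk import today, it's worth knowing that Terraform 1.5+ lets you declare imports in configuration and can draft the matching HCL with -generate-config-out, which removes most of that back-and-forth. One caveat: config generation only works for root-module addresses, so you'd generate first and fold the result into a module afterwards. A sketch (the ID mirrors the example above; the resource name is illustrative):

```shell
# Declare the import in config instead of running `terraform import`:
cat > imports.tf <<'EOF'
import {
  to = proxmox_vm_qemu.devbox_golang
  id = "pve1/qemu/110"
}
EOF

# Then let Terraform write a resource block that matches reality:
#   terraform plan -generate-config-out=generated.tf
# Review generated.tf, trim the noise, and move what you keep into your modules.
```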

But at the end:

Bash
resham@devbox:~/homelab-iac/terraform$ terraform plan -var-file="envs/homelab-prod.tfvars"

No changes. Your infrastructure matches the configuration.

Fifteen VMs, fully managed. If my house burns down tomorrow, I can terraform apply on a new Proxmox cluster and have my entire environment back. That's worth a Saturday.

Handling Drift

Drift is what happens when reality diverges from your Terraform state. And in a homelab, it happens constantly, because sometimes it's 11 PM and you just need to bump a VM's RAM from the Proxmox UI and you'll "fix it in Terraform later."

You won't fix it later. You'll forget, and three weeks from now you'll run terraform plan and see:

CODE
# module.services.proxmox_vm_qemu.vm["docker-host-01"] will be updated in-place
~ resource "proxmox_vm_qemu" "vm" {
    ~ memory = 32768 -> 16384
      name   = "docker-host-01"
      # (15 unchanged attributes hidden)
  }

Terraform wants to revert the memory back to what's in your config (16GB) because you manually bumped it to 32GB three weeks ago. If you apply this, your Docker host loses half its RAM and every container on it starts OOMing.

I handle drift in one of three ways:

Option 1: Update the config to match reality

HCL
# If the manual change was intentional, update the HCL
"docker-host-01" = {
  cores      = 4
  memory     = 32768 # Updated from 16384 — bumped for memory-hungry containers
  disk_size  = "100G"
  ip_address = "10.0.0.30"
  tags       = ["docker", "prod"]
}

Option 2: Revert the manual change

If the manual change was a mistake or a temporary fix, let Terraform revert it:

Bash
terraform apply -var-file="envs/homelab-prod.tfvars" -target='module.services.proxmox_vm_qemu.vm["docker-host-01"]'

Option 3: Ignore it (carefully)

For attributes that you know will drift and you don't care about, use lifecycle.ignore_changes:

HCL
lifecycle {
  ignore_changes = [
    network, # Proxmox reorders NICs sometimes
    desc,    # I edit descriptions in the UI
    tags,    # Tags get added by automation scripts
  ]
}

Be very careful with ignore_changes. It's a trapdoor. Once you ignore an attribute, Terraform will never manage it again, and you'll forget it's there until the day you desperately need it to work.

I run a weekly drift check as a cron job:

Bash
# /etc/cron.d/terraform-drift-check
0 8 * * 1 resham cd /home/resham/homelab-iac/terraform && \
  terraform workspace select homelab-prod && \
  terraform plan -var-file="envs/homelab-prod.tfvars" -detailed-exitcode -no-color \
    > /tmp/terraform-drift-report.txt 2>&1; \
  if [ $? -eq 2 ]; then \
    curl -X POST "$DISCORD_WEBHOOK" \
      -H "Content-Type: application/json" \
      -d "{\"content\": \"Terraform drift detected in homelab-prod. Check /tmp/terraform-drift-report.txt\"}"; \
  fi

The -detailed-exitcode flag is key: exit code 0 means no changes, 1 means an error, 2 means changes detected. If I get a Discord ping on Monday morning, I know something drifted over the weekend.

Sensitive Values: The Hard Lesson

Let me tell you about the worst commit I've ever made.

It was early in my Terraform journey. I had a terraform.tfvars file with my Proxmox API token, my AWS access keys, and my Cloudflare API token. I committed it. Pushed it to my private GitHub repo. Didn't notice for two weeks.

Now, it was a private repo. Nobody saw it. But the credentials were in the git history forever. Even after I deleted the file, git log would happily show anyone who cloned the repo every secret I'd committed.

Here's how I fixed it:

Bash
# Step 1: Rotate every compromised credential immediately.
# Don't clean up first. Rotate FIRST. Assume compromise.

# Step 2: Remove the file from ALL git history
pip install git-filter-repo
git filter-repo --invert-paths --path terraform.tfvars --force

# Step 3: Force push (only time I'll advocate for force push)
git push origin --force --all

# Step 4: Tell GitHub to garbage collect
# (GitHub support can do this for you if you ask nicely)

After that experience, here's my hierarchy for sensitive values:

Tier 1: Environment variables (for CI/CD)

Bash
export TF_VAR_proxmox_api_token="your-token-here"
export TF_VAR_aws_access_key="AKIA..."
export TF_VAR_cloudflare_api_token="..."

Terraform automatically reads TF_VAR_<name> environment variables as input variables. No file to accidentally commit.

Tier 2: .tfvars with .gitignore (for local development)

CODE
# .gitignore
*.tfvars
!envs/*.tfvars  # Non-sensitive workspace configs ARE committed
secrets.tfvars  # Explicit ignore

HCL
# secrets.tfvars — NEVER committed
proxmox_api_token    = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
ssh_public_key       = "ssh-ed25519 AAAA... resham@devbox"
cloudflare_api_token = "..."

Bash
terraform plan -var-file="envs/homelab-prod.tfvars" -var-file="secrets.tfvars"

Tier 3: HashiCorp Vault (for production)

HCL
data "vault_generic_secret" "aws_creds" {
  path = "secret/terraform/aws"
}

provider "aws" {
  access_key = data.vault_generic_secret.aws_creds.data["access_key"]
  secret_key = data.vault_generic_secret.aws_creds.data["secret_key"]
  region     = var.aws_region
}

I use Vault for the AWS production infrastructure. For the homelab, environment variables are sufficient — the blast radius of a leaked Proxmox token on my local network is limited.

Also: mark sensitive variables as sensitive = true in your variable declarations. Terraform will redact them from plan output:

HCL
variable "proxmox_api_token" {
  description = "Proxmox API token"
  type        = string
  sensitive   = true
}

CODE
# Plan output shows:
  + proxmox_api_token = (sensitive value)

CI/CD for Terraform

Manual applies are fine when you're learning. They're not fine when you're managing production infrastructure for a real product. Here's my GitHub Actions workflow:

YAML
# .github/workflows/terraform.yml
name: Terraform

on:
  pull_request:
    branches: [main]
    paths:
      - 'terraform/**'
  push:
    branches: [main]
    paths:
      - 'terraform/**'

permissions:
  contents: read
  pull-requests: write
  id-token: write

env:
  TF_VERSION: "1.9.8"
  AWS_REGION: "us-east-1"

jobs:
  plan:
    name: "Terraform Plan"
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    strategy:
      matrix:
        workspace: [homelab-prod, cloud-aws, cloud-aws-dr]
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::<ACCOUNT_ID>:role/terraform-github-actions
          aws-region: ${{ env.AWS_REGION }}

      - name: Terraform Init
        working-directory: terraform
        run: terraform init

      - name: Select Workspace
        working-directory: terraform
        run: terraform workspace select ${{ matrix.workspace }}

      - name: Terraform Plan
        id: plan
        working-directory: terraform
        run: |
          terraform plan \
            -var-file="envs/${{ matrix.workspace }}.tfvars" \
            -no-color \
            -out=tfplan-${{ matrix.workspace }} \
            2>&1 | tee plan-output.txt
        env:
          TF_VAR_proxmox_api_token: ${{ secrets.PROXMOX_API_TOKEN }}
          TF_VAR_ssh_public_key: ${{ secrets.SSH_PUBLIC_KEY }}
          TF_VAR_cloudflare_api_token: ${{ secrets.CLOUDFLARE_API_TOKEN }}

      - name: Comment Plan on PR
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('terraform/plan-output.txt', 'utf8');
            const workspace = '${{ matrix.workspace }}';
            const body = `### Terraform Plan: \`${workspace}\`
            \`\`\`hcl
            ${plan.substring(0, 65000)}
            \`\`\``;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });

      - name: Upload Plan Artifact
        uses: actions/upload-artifact@v4
        with:
          name: tfplan-${{ matrix.workspace }}
          path: terraform/tfplan-${{ matrix.workspace }}

  apply:
    name: "Terraform Apply"
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    environment: production # Requires manual approval in GitHub
    strategy:
      matrix:
        workspace: [homelab-prod, cloud-aws, cloud-aws-dr]
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::<ACCOUNT_ID>:role/terraform-github-actions
          aws-region: ${{ env.AWS_REGION }}

      - name: Terraform Init
        working-directory: terraform
        run: terraform init

      - name: Select Workspace
        working-directory: terraform
        run: terraform workspace select ${{ matrix.workspace }}

      - name: Terraform Apply
        working-directory: terraform
        run: |
          terraform apply \
            -var-file="envs/${{ matrix.workspace }}.tfvars" \
            -auto-approve
        env:
          TF_VAR_proxmox_api_token: ${{ secrets.PROXMOX_API_TOKEN }}
          TF_VAR_ssh_public_key: ${{ secrets.SSH_PUBLIC_KEY }}
          TF_VAR_cloudflare_api_token: ${{ secrets.CLOUDFLARE_API_TOKEN }}

The key design decisions:

  • Plan on PR, apply on merge. You review the plan in the PR comments before approving.
  • Matrix strategy runs plan/apply for each workspace in parallel. If cloud-aws has issues, homelab-prod still applies.
  • The production environment in GitHub requires manual approval. Merging to main doesn't automatically apply — a human has to click "Approve" in the Actions UI.
  • Plan artifacts are saved so the apply can use the exact same plan that was reviewed, not a new one.
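One caveat on that last point: as the workflow stands, the apply job re-plans with -auto-approve rather than consuming the artifact. To apply the reviewed plan directly, the apply job would download the artifact (actions/download-artifact) and pass the plan file to terraform apply; a saved plan has its variables baked in, so no -var-file is needed. A sketch (apply_saved_plan is my own hypothetical helper):

```shell
# Sketch of an apply step that consumes the reviewed artifact, assuming
# actions/download-artifact already fetched tfplan-<workspace> into cwd.
apply_saved_plan() {
  local workspace=$1
  # A saved plan file carries its own variables; no -var-file here.
  terraform apply "tfplan-${workspace}"
}

# Usage: apply_saved_plan cloud-aws
# Note: a saved plan is only valid against the exact state it was planned
# from; if state has moved on since, apply fails and you re-plan.
```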

Performance: From 4 Minutes to 18 Seconds

The biggest win was splitting resources into separate root modules, each with its own state (covered below). But there are several other tricks:

Targeted plans when you know what you're changing:

Bash
# Only plan the monitoring module
terraform plan -var-file="envs/homelab-prod.tfvars" \
  -target=module.monitoring

# Plan time: 3 seconds instead of 18

Parallelism tuning:

Bash
# Default parallelism is 10. For my Proxmox API that can't handle many
# concurrent requests, I lower it:
terraform apply -var-file="envs/homelab-prod.tfvars" -parallelism=5

# For AWS with high API rate limits, I raise it:
terraform apply -var-file="envs/cloud-aws.tfvars" -parallelism=30

Skip refresh when you know nothing changed:

Bash
# Useful for rapid iteration on HCL syntax
terraform plan -var-file="envs/homelab-prod.tfvars" -refresh=false
# Plan time: 2 seconds (no API calls)

Be careful with -refresh=false, though. If something changed outside Terraform, you won't see it. I only use it when I'm iterating on a new resource definition and I know nothing else has changed.

State splitting was the nuclear option. When I realized the homelab and AWS resources didn't need to be in the same state at all, I split them into separate root modules:

Bash
terraform/
├── homelab/          # Proxmox resources only
│   ├── backend.tf    # key = "homelab/terraform.tfstate"
│   ├── main.tf
│   └── ...
├── cloud/            # AWS resources only
│   ├── backend.tf    # key = "cloud/terraform.tfstate"
│   ├── main.tf
│   └── ...
└── shared/           # DNS, monitoring that spans both
    ├── backend.tf    # key = "shared/terraform.tfstate"
    ├── main.tf
    └── ...

When

CODE
shared/
needs outputs from
CODE
homelab/
or
CODE
cloud/
, it uses
CODE
terraform_remote_state
:

HCL
# shared/data.tf
data "terraform_remote_state" "homelab" {
  backend = "s3"
  config = {
    bucket = "kumari-terraform-state"
    key    = "homelab/terraform.tfstate"
    region = "us-east-1"
  }
}

data "terraform_remote_state" "cloud" {
  backend = "s3"
  config = {
    bucket = "kumari-terraform-state"
    key    = "cloud/terraform.tfstate"
    region = "us-east-1"
  }
}

# Now I can reference outputs from other states
resource "cloudflare_record" "grafana" {
  zone_id = var.cloudflare_zone_id
  name    = "grafana"
  content = data.terraform_remote_state.homelab.outputs.monitoring_vm_ips["grafana"]
  type    = "A"
  proxied = true
}

This alone cut plan time by 60% because each state only refreshes its own resources.

Disaster Recovery

State files are the crown jewels. If you lose state, Terraform doesn't know what it manages. You're back to importing everything by hand. Here's my backup strategy:

Layer 1: S3 versioning. Every state write creates a new version. I can roll back to any previous state.

Bash
# List state file versions
aws s3api list-object-versions \
  --bucket kumari-terraform-state \
  --prefix "homelab/terraform.tfstate" \
  --max-items 5

# Restore a previous version
aws s3api get-object \
  --bucket kumari-terraform-state \
  --key "homelab/terraform.tfstate" \
  --version-id "abc123..." \
  restored-state.json

# Push restored state
terraform state push restored-state.json

Layer 2: Pre-apply state backup. My CI/CD workflow pulls state before every apply:

Bash
terraform state pull > "backups/state-$(date +%Y%m%d-%H%M%S)-pre-apply.json"
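That one-liner eventually grew a retention rule so `backups/` doesn't fill the runner's disk. A sketch, with the actual `terraform state pull` left as a comment and the 30-file cap being my arbitrary choice:

```shell
#!/usr/bin/env sh
# Same naming scheme as the CI step, plus retention.
set -u
mkdir -p backups
STAMP=$(date +%Y%m%d-%H%M%S)
BACKUP="backups/state-${STAMP}-pre-apply.json"
# terraform state pull > "$BACKUP"   # the actual pull, as in the CI step

# Keep only the 30 most recent pre-apply backups
ls -1t backups/state-*-pre-apply.json 2>/dev/null | tail -n +31 | xargs -r rm -f --

echo "next backup: $BACKUP"
```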

Layer 3: Cross-region replication. The S3 bucket replicates to us-west-2. If us-east-1 goes down, my state is still accessible.
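The replication rule itself is only a few lines of HCL. A hedged sketch — the resource names and IAM role are placeholders, and both buckets need versioning enabled before replication will work:

```hcl
# Sketch of state-bucket replication to us-west-2.
# Assumes aws_s3_bucket.state, aws_s3_bucket.state_dr, and a replication
# IAM role are defined elsewhere.
resource "aws_s3_bucket_replication_configuration" "state" {
  bucket = aws_s3_bucket.state.id
  role   = aws_iam_role.replication.arn

  rule {
    id     = "state-to-dr"
    status = "Enabled"

    filter {}   # empty filter = replicate every object

    destination {
      bucket        = aws_s3_bucket.state_dr.arn   # bucket in us-west-2
      storage_class = "STANDARD"
    }
  }
}
```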

When state gets corrupted (it happened once, after a crash during apply):

Bash
# Check state integrity
terraform state list
# If this errors, state is corrupted

# Option 1: Roll back to last good version from S3
aws s3api list-object-versions --bucket kumari-terraform-state \
  --prefix "homelab/terraform.tfstate" --max-items 10

# Option 2: If lock is stuck from the crashed process
terraform force-unlock LOCK_ID_HERE
# Only use this if you're SURE no other process is running

CODE
terraform force-unlock
is the last resort. I've used it exactly once, when my CI runner crashed mid-apply and left a stale lock. The lock had been held for 45 minutes and the runner was confirmed dead. Even then I double-checked that nothing else was running before unlocking.
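Before reaching for the unlock, you can inspect the lock item directly in DynamoDB; its `Info` attribute records the operation, who started it, and when. A sketch — the lock table name here is my assumption, and the `LockID` is `<bucket>/<state key>`:

```shell
#!/usr/bin/env sh
# Inspect the S3 backend's DynamoDB lock before force-unlocking.
# Table name is an assumption; check your backend "dynamodb_table" setting.
show_lock() {
  aws dynamodb get-item \
    --table-name kumari-terraform-locks \
    --key '{"LockID": {"S": "kumari-terraform-state/homelab/terraform.tfstate"}}'
}

# If the Info field shows an operation started an hour ago by a runner
# you've confirmed is dead, THEN force-unlock with the ID it contains.
```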

The Numbers

After six months of iterating on this setup:

Metric | Before | After
Total managed resources | ~80 (rest unmanaged) | 1,237
Plan time (full) | 4 min 12 sec | 18 sec
Plan time (targeted) | N/A | 2-5 sec
State files | 1 (local) | 4 (remote, locked)
Manual infrastructure changes | Weekly | Zero in 6 months
Time to recreate full environment | "Unknown, pray it doesn't happen" | ~45 min
CI/CD pipeline | None | Plan on PR, apply on merge
Secrets committed to git | 1 (that I know of) | 0

The 1,237 resources break down roughly as:

  • Homelab prod: 487 resources (VMs, LXCs, storage, network, firewall rules)
  • Homelab staging: 89 resources (minimal mirror of prod)
  • Cloud AWS: 584 resources (VPC, subnets, ECS services, RDS, ElastiCache, CloudFront, WAF rules, IAM, the whole stack)
  • Cloud AWS DR: 77 resources (warm standby, ready to scale)

What I'd Do Differently

If I could go back and start over:

  1. Remote state from day one. Not day thirty. Not "when I have more resources." Day one.
  2. Modules from the start. Even if you only have three resources, put them in a module. You'll thank yourself in six months.
  3. Never make manual changes. Not even "just this once." The five minutes you save now becomes two hours of drift debugging later.
  4. Use
    CODE
    for_each
    instead of
    CODE
    count
    .
    I started with
    CODE
    count
    for my VMs and regretted it immediately. Removing a VM from the middle of a list reindexes everything.
    CODE
    for_each
    with a map is the way.
  5. Tag everything. Tags are free. Put the environment, the owner, the Terraform workspace, and the module that manages the resource. Future you running
    CODE
    terraform state list | grep monitoring
    will appreciate it.
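Point 4 in HCL, for the record (the resource arguments are trimmed to the relevant bits):

```hcl
# count: identity is positional. Remove "loki" from the list and vm[2]
# shifts into vm[1] -- Terraform plans a destroy/recreate for everything
# after the removed element.
#   count = length(var.vm_names)          # ["grafana", "loki", "prometheus"]
#   name  = var.vm_names[count.index]

# for_each: identity is the key. Removing "loki" touches only vm["loki"].
resource "proxmox_vm_qemu" "vm" {
  for_each = toset(["grafana", "loki", "prometheus"])
  name     = each.key
  # ...
}
```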

The homelab is where I make my mistakes so that Kumari.ai's production infrastructure doesn't suffer them. Every pattern in this post — remote state, workspace isolation, module architecture, CI/CD gates — started as a homelab experiment before I trusted it with customer data.

Terraform is not a tool you master by reading docs. It's a tool you master by running

CODE
terraform plan
at 2 AM, seeing "15 to destroy," feeling your stomach drop, and learning to never let that happen again.

My state files are backed up. My locks are working. My drift checks are running. And I haven't touched the Proxmox UI in six months.

That's the goal.