Terraform Drift: The Bad, the Ugly and the Black Swan

What is Terraform Drift? What problems does it cause? And how can we fix it?

What is Terraform drift?
Terraform drift is a well-known problem. It occurs when resources in your cloud environment change outside of a Terraform workflow, which leads to differences between what is actually configured in your cloud and what is declared in your Terraform code. In other words, your cloud has “drifted away” from your Terraform. These differences can arise through the following channels:
Engineers making changes via CLI commands, like Google’s gcloud tool or AWS’s CLI.
Application logic or deployment pipelines creating or editing resources outside of a Terraform workflow. Think boto3 or other cloud SDKs.
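To make the definition concrete, drift for a single resource can be thought of as the set of attributes whose live value differs from the declared one. The sketch below is only an illustration of that idea, not how Terraform itself computes a plan; the function name, attribute names, and values are all hypothetical.

```python
from typing import Any


def find_drift(declared: dict[str, Any], actual: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {attribute: (declared_value, actual_value)} for every mismatch.

    Only attributes present in the declared configuration are compared;
    anything the declaration does not mention is ignored.
    """
    return {
        key: (value, actual.get(key))
        for key, value in declared.items()
        if actual.get(key) != value
    }


# Hypothetical values: what the Terraform code declares for an instance...
declared = {"instance_type": "t3.micro", "monitoring": True}
# ...and what the cloud reports after someone resized it by hand.
actual = {"instance_type": "t3.large", "monitoring": True, "tags": {"owner": "alice"}}

drift = find_drift(declared, actual)
print(drift)  # {'instance_type': ('t3.micro', 't3.large')}
```

The hand-added tags do not show up here because the declaration never mentions them, which hints at why changes to undeclared attributes and unmanaged resources are easy to miss.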
Terraform drift: The Bad
Chances are high that the infrastructure defined by Terraform, reviewed by other team members, and then deployed, has very intentional specifications. That EC2 instance size, that subnet’s CIDR block range, and the access privileges for your S3 bucket were all chosen for a reason.
Terraform drift: The Ugly
As we noted above, for most teams drift is discovered when running terraform plan. Generally, engineers do not run Terraform workflows arbitrarily (as far as we know!). In fact, they often run these workflows only when trying to deploy new functionality or bug fixes. This means that:
Terraform drift: The Black Swan
Undetected changes and resource provisioning outside of a Terraform workflow can lead to “Black Swan” events, which is why so many engineers endure “Bad” and “Ugly” costs to mitigate drift before it becomes a problem. For example:
Security-optimized configurations are relaxed and a data leak or hack occurs.
Maybe a hack does not occur, but when your organization is audited for business-critical compliance certifications, you fail the audit due to at-risk infrastructure.
Possible Solutions
Run a Terraform workflow regularly. This is possible, but to catch all drift, it needs to be done for each Terraform state file. Furthermore, the output of terraform plan is verbose and still requires manual review. Lastly, resources outside of Terraform’s control will be completely missed by this method.
Lock down your cloud environment to only allow changes via Terraform. Again, this is possible (and often a best practice!), but the devil is in the details. Enforcing this policy means restricting engineers who have years of experience managing resources through the AWS console or a CLI tool. This may put a heavy burden on the team members who know Terraform, since all infrastructure provisioning would need to flow through them (no more engineers quickly spinning something up in Dev for themselves via the Azure portal). It also eliminates the opportunity for the occasional hot-fix, for which a legitimate need can arise from time to time.
Use an open-source tool like cloud-concierge. cloud-concierge can be configured to perform regular, automated scans of your cloud environment, identify changes made outside of a Terraform workflow, and recommend necessary mitigation steps. Unlike running a Terraform workflow, cloud-concierge offers the following benefits:
(a) Scan for drift across an arbitrary number of state files at one time
(b) Identify and codify resources not controlled by Terraform
(c) Output results into human-readable formatting within a Pull Request
(d) Surface the entities making changes outside of your Terraform workflow so that sources of drift can be locked down
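The first option above, running a Terraform workflow regularly, could be scripted roughly as follows. This is a minimal sketch under stated assumptions: the terraform binary is on the PATH, each directory passed in is an already-initialized root module with its own state file, and every helper name here is hypothetical. It relies on `terraform plan -detailed-exitcode`, which exits with 0 when there is nothing to change, 1 on error, and 2 when the plan contains changes. It also reads the `resource_changes` list from `terraform show -json` output to trim the plan down to the affected addresses.

```python
import subprocess
from pathlib import Path

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_STATUS = {0: "in sync", 1: "error", 2: "changes pending"}


def classify_plan_exit_code(code: int) -> str:
    """Map a `terraform plan -detailed-exitcode` exit code to a label."""
    return PLAN_STATUS.get(code, "unknown")


def check_workspace(directory: Path) -> str:
    """Run a non-interactive plan in one root module and classify the result."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=directory,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)


def check_all(directories: list[Path]) -> dict[str, str]:
    """Check every root module, since each state file must be planned separately."""
    return {str(d): check_workspace(d) for d in directories}


def changed_addresses(plan: dict) -> list[str]:
    """List resource addresses whose planned actions are anything other than no-op.

    `plan` is the parsed JSON produced by `terraform show -json <planfile>`.
    """
    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if rc["change"]["actions"] != ["no-op"]
    ]
```

Calling `check_all` from a cron job or scheduled CI pipeline would flag the modules that need a closer look. Note that exit code 2 also fires when the code itself has unapplied changes, so a flagged module still needs a human to separate drift from pending work, and resources that were never brought under Terraform remain invisible to this approach.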
Conclusion
We hope this post gives you a better understanding of what Terraform drift is, the problems it causes, and possible solutions. Unaddressed Terraform drift creates immediate, short-term costs and exposes organizations to significant long-tail risks. Thankfully, automated solutions exist for organizations to adopt as they move towards best practices.
dragondrop.cloud’s mission is to automate developer best practices while working with Infrastructure as Code. Our flagship OSS product, cloud-concierge, allows developers to codify their cloud, detect drift, estimate cloud costs and security risks, and more — while delivering the results via a Pull Request. For enterprises running cloud-concierge at scale, we provide a management platform. To learn more, schedule a demo or get started today!