Terraform, Version Constraint and Debugging

This week I encountered some issues with Terraform (and, well, Kubernetes) again. This time, the problem was way more interesting than I thought.

Problem

When deploying to Kubernetes, the deployment failed with a connection refused error on dial tcp 127.0.0.1:80.

The full error message was:

Error: Get "http://localhost/apis/apps/v1/namespaces/default/deployments/xxx": dial tcp 127.0.0.1:80: connect: connection refused

As this error happened in our deployment pipeline (we use Terraform to deploy stuff to Kubernetes), my first thought was that a retry would fix it. So I retried the deployment right away, and it still failed.

When I finally stopped what I was working on and started to examine the message carefully, I realized it was quite strange: why was the pipeline (or kubectl, for that matter) trying to connect to localhost when it was meant to connect to a Kubernetes cluster located somewhere else?

As you will see from my solution, this message was not helpful at all, and in some sense it was quite misleading for anyone trying to debug the problem.

After comparing the log from a previous successful deployment with the log from the failed one, I realized the issue was with the Kubernetes provider for Terraform: in the successful build, the terraform init command printed something like Installing hashicorp/kubernetes v1.13.3..., while in the failed build the same command printed something like Installing hashicorp/kubernetes v2.0.2....

At that point it was quite obvious that the issue was caused by breaking changes in the Kubernetes provider. According to its changelog, version 2.0.0 contained several breaking changes, including these two:

Remove load_config_file attribute from provider block (#1052)
Remove default of ~/.kube/config for config_path (#1052)

In our deployment Terraform code, we set load_config_file to true so that the provider would load the kubeconfig file from the default config_path of ~/.kube/config. After the breaking changes quoted above, neither the load_config_file attribute nor the default config_path exists any more. With no kubeconfig to load, the provider falls back to connecting to 127.0.0.1 (aka localhost), which caused the connection refused error.
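For illustration, a provider block of the kind described above might have looked roughly like this. It is a reconstruction from the description, not a copy of our actual file, and it only works with the 1.x provider:

```hcl
# Sketch of a 1.x-style provider block (reconstructed for illustration).
provider "kubernetes" {
  # Removed in provider 2.0.0 (#1052).
  load_config_file = true

  # config_path is not set here, so the 1.x default of ~/.kube/config applies.
  # In 2.0.0 that default was removed as well (#1052).
}
```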

Solution

There are two ways to solve this issue:

  • Update the Terraform code so that it is compatible with the 2.0.0 version of the Kubernetes provider
    OR
  • Downgrade to the last working version of the Kubernetes provider and keep the existing Terraform code
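The first option could be sketched as follows. This is a minimal example that assumes the kubeconfig still lives at ~/.kube/config on the machine running Terraform; it is not the change I actually shipped:

```hcl
# Sketch of a provider block compatible with provider 2.x.
provider "kubernetes" {
  # 2.x no longer defaults to ~/.kube/config, so the path is set explicitly.
  # The load_config_file attribute is gone and must not appear here.
  config_path = "~/.kube/config"
}
```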

Due to the urgency of getting the pipeline and deployment back online, I chose the downgrade route. Essentially, I added the version constraint that was previously missing for the Kubernetes provider:

    terraform {
      required_providers {
        kubernetes = {
          source  = "registry.terraform.io/hashicorp/kubernetes"
          version = "~> 1.0"
        }
      }
    }

The ~> operator allows only the rightmost version component in the constraint to increase. With ~> 1.0, Terraform can install any 1.x release but will never upgrade to 2.0.0 automatically, which avoids this specific problem caused by breaking changes.
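To make the effect of ~> more concrete, here is a small sketch showing how the allowed range changes with the number of version components. The 1.13.3 value is simply the last working version from our logs; a tighter pin like this is one option, not what I used:

```hcl
terraform {
  required_providers {
    kubernetes = {
      source = "registry.terraform.io/hashicorp/kubernetes"

      # Only the rightmost component of the constraint may increase:
      #   "~> 1.0"    allows >= 1.0,    < 2.0
      #   "~> 1.13"   allows >= 1.13,   < 2.0
      #   "~> 1.13.3" allows >= 1.13.3, < 1.14.0
      version = "~> 1.13.3"
    }
  }
}
```

The tighter the constraint, the fewer surprises on terraform init, at the cost of having to bump the version by hand more often.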

Takeaways

On debugging:

  • Generally speaking, if you find that your Terraform behavior changed without you making any changes, you could be making the same mistake I did: not specifying a version constraint for the provider. You can find clues in the output of terraform init, for example by checking whether the same provider version was installed in the successful build and in the failed one.
  • Personally, I was never familiar enough with Kubernetes to know that the default behavior of kubectl is to use 127.0.0.1 when there's no config file present. Now that I have come across this gotcha, I realize this kind of behavior is not that uncommon: Knex, the library we use with Node.js, behaves similarly, and I will keep this in mind if I encounter something similar in the future.

On Terraform:

  • When no version constraint is specified, Terraform will always use the latest provider version, so it is important to specify one. Terraform likewise recommends always using a specific version when depending on third-party modules. For more information on specifying version constraints, read the documentation on the Terraform website.
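For modules, pinning a specific version looks similar. In the sketch below the module name, the registry source, and the version number are all hypothetical placeholders:

```hcl
# Hypothetical third-party module pinned to one exact version.
module "example" {
  source  = "example-org/example/aws" # placeholder registry address
  version = "1.2.3"                   # exact version: upgrades happen only when this line is edited
}
```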