Kubernetes Terraform installer for Hetzner Cloud
This Terraform module creates a Kubernetes Cluster on Hetzner Cloud infrastructure running Ubuntu 22.04. The module aims to be simple to use while providing an out-of-the-box secure and maintainable setup. Thanks to Ubuntu's LTS version we get up to 5 years of peace and quiet before having to upgrade the cluster's operating system!
Terraform module published at: https://registry.terraform.io/modules/identiops/k3s/hcloud
What changed in the latest version? See CHANGELOG.md.
Features
k3s based Kubernetes cluster.
Node pools for managing cluster resources efficiently. Pools can be added, resized, and removed at any time.
Automated Kubernetes update management via System Upgrade Controller.
Automated operating system updates with automatic system reboots via kured.
Creation of placement groups for to improve availability.
Multi-region deployments.
Secured default configuration:
Deletion protection for all cloud resources.
SSH key required for remote access.
fail2ban limits SSH brute force attacks.
Cluster nodes have no public network interface.
Internal firewall active on all nodes for minimal exposure.
Support for network policies via Cilium.
CSI hardening guide applied:
Kernel parameters defined.
Audit log directory created.
Network policies, pod security policies, admission policies and the enabling of audit logs are up to the administrator of the cluster to configure.
Integration of Hetzner Cloud Controller Manager for managing cloud resources from the within the cluster.
Integration of Hetzner Cloud Storage Interface for managing volumes from the within the cluster.
Ansible integration with automatically generated inventory.
Convenience scripts for retrieving the Kubernetes configuration and accessing nodes via SSH and SCP.
Calculation of monthly costs for every part of the deployment (see
terraform output
).Documentation of common administrative tasks and troubleshooting approaches.
To be added
OIDC support for user authentication. Some configuration is in place, but it hasn't been tested, yet.
Support for cluster auto scaler.
Contents
Getting Started
Prerequisites
Terraform or OpenTofu. Note that you'll need Terraform v1.0 or newer to run this project.
bash
for executing the generated scripts.jq
for executing the generated scripts.kubectl
for interacting wthe the Kubernetes cluster.ssh
for connecting to cluster nodes.
Recommended Tools
Ansible cli for executing commands simultaneously on multiple cluster nodes.
Cilium cli for verifying and interacting with the Kubernetes networking layer.
Helm cli for installing and updating components inside the cluster.
Installation
Create a Hetzner Cloud API token.
Register with Hetzner Cloud.
Create a new project.
Navigate to the security settings.
Select the "API tokens" tab and add a new token with read & write access and a second token with just read access.
- Store Hetzner Cloud API token locally. Either, pass the tokens to terraform via an environment variable or create a file called
terraform.tfvars
:
# Either, enter your Hetzner Cloud API Token (it will be hidden)
read -sp "Hetzner Cloud API Token: " TF_VAR_hcloud_token
export TF_VAR_hcloud_token
read -sp "Hetzner Cloud API read only Token: " TF_VAR_hcloud_token_read_only
export TF_VAR_hcloud_token_read_only
# Or store the token in terraform.tfvars
touch terraform.tfvars
chmod 600 terraform.tfvars
cat >terraform.tfvars <<EOF
hcloud_token = "XYZ"
hcloud_token_read_only = "ABC"
EOF
Download
examples/1Region_3ControlPlane_3Worker_Nodes/main.tf
:curl -LO https://github.com/identiops/terraform-hcloud-k3s/raw/main/examples/1Region_3ControlPlane_3Worker_Nodes/main.tf
Adjust the cluster configuration in
main.tf
, e.g.cluster_name
default_location
k3s_version
ssh_keys
(to create a new ssh key run:ssh-keygen -t ed25519
)node_pools
For multi-region deployments, there are a few things to consider:
It is recommended to distribute the control plane nodes across multiple regions. If 3 control plane nodes shall be used, create 3 node pools and configure a different
location
for each pool.etcd's default configuration expects a low-latency local network. When distributing nodes across multiple regions, latency will increase. The timing parameters therefore need to be adjusted, see etcd Tuning. Set
control_plane_k3s_additional_options
, e.g. to--etcd-arg=heartbeat-interval=120 --etcd-arg=election-timeout=1200
Measurements between Falkenstein, Nuremberg and Helsinki: I measured a latency of 0.7ms (within Nuremberg region), 3ms (Nuremberg -> Falkenstein), and 24ms (Nuremberg -> Helsinki).Hetzner doesn't support mounting volumes on servers in another region! The most simple setup is to just distribute the control plane nodes across multiple regions, disable the scheduling of workloads on control plane nodes and keep all worker nodes pools within one region. For better availability of workloads, worker node pools should be distributed across regions. This requires a configuration of taints and tolerations and node affinity to ensure that pods with volumes are schedules in the correct region.
Depending on the selected server regions the
network_zone
setting should be adjusted.Hetzner's load balancers are bound to one region. Therefore, a multi-regional cluster with one load balancer is not sufficent for dealing with zone outages. The load balancer would go down with the region it's hosted in. If a protection against zone outages is required, a global load balancer should be deployed elsewhere, as described by Google.
Initialize the configuration:
terraform init
Apply the configuration:
terraform apply
Grab a coffee and enjoy the servers popping up in Hetzner's cloud console. Wait for about 5 minutes.
Test SSH access to the cluster:
./ssh-node cluster
- ATTENTION: don't hammer the cluster with failing SSH requests, or you'll be banned by your cluster automatically! If the request fails, because the cluster node isn't ready yet, wait another minute.
Once SSH connection is established, double check that everything is working as expected:
Did the node initialization finish successfully?
cloud-init status
Is the cluster up and running?
kubectl cluster-info
If the tests were successful, retrieve the Kubernetes configuration and store it locally:
./setkubeconfig
Forward the cluster port locally since it's not exposed to the Internet by default. Do this every time you want to interact with the cluster:
./ssh-node gateway
Test cluster access from your local machine:
kubectl get nodes
Enjoy your new cluster! 🚀
Usage
Start using your favorite Kubernetes tools to interact with the cluster. One of the first steps usually involves deploying an ingress controller since this configuration doesn't ship one.
In addition, a few convenience scripts were created to help with maintenance:
setkubeconfig
: retrieves and stores the Kubernetes configuration locally.unsetkubeconfig
: removes the cluster from the local Kubernetes configuration.ls-nodes
: lists the nodes that are part of the cluster for access viassh-node
andscp-node
.ssh-node
: SSH wrapper for connecting to cluster nodes.scp-node
: SCP wrapper for connecting to cluster nodes..ssh/config
: SSH configuration for connecting to cluster nodes..ansible/hosts
: Ansible hosts configuration for executing commands on multiple nodes in parallel.
Maintenance
Ansible: Execute Commands on Nodes
This module automatically generates an Ansible inventory in file .ansible/hosts
. It can be leveraged to interact with the nodes and node pools of the cluster.
Example: Execute a command on all control plane nodes
ANSIBLE_INVENTORY="$PWD/.ansible/hosts" ansible all_control_plane_nodes -a "kubectl cluster-info"
Add Ingress Controller and Load Balancer
Since this module doesn't ship an ingress controller, one of the first configurations applied to the cluster is usually an ingress controller. Good starting points for an ingress controller are:
The ingress controller, like the rest of the cluster, is not directly exposed to the Internet. Therefore, it is necessary to add a load balancer that is directly exposed to the Internet and has access to the local network of the cluster. The load balancer is added to the cluster simply by adding annotations to the ingress controller's service. Hetzner's Cloud Controller Manager will use the annotations to deploy and configure the load balancer.
The following annotations should be used:
Resource name:
load-balancer.hetzner.cloud/name: "ingress-lb"
Protocol, just
tcp
- the ingress controller will take care of the HTTP connection:load-balancer.hetzner.cloud/protocol: "tcp"
Location, same as the one used for the cluster:
load-balancer.hetzner.cloud/location: "nbg1"
Connection to the servers, must be set to
true
:load-balancer.hetzner.cloud/use-private-ip: "true"
Size, see options:
load-balancer.hetzner.cloud/type: "lb11"
Furthermore, for domain names to work, it is required to point DNS records to the IP address of load balancer. external-dns is a helpful tool that can automate this task from within the cluster. For this to work well with Ingress resources, the ingress controller needs to expose the published service information on the Ingress resources.
Add Nodes or Node Pools
The number of nodes in a node pool can be increased at any point. Just increase the count and apply the new configuration via terraform apply
. After a few minutes the additional nodes will appear in the cluster.
In the same way, node pools can be added to the configuration without any precaution.
Remove Nodes or Node Pools
Removing nodes requires the following steps:
Identify the nodes and node pools that shall be removed. If the number of nodes in a node pool needs to be decreased, the nodes will be removed from the highest to the lowest number. Example: when the number of nodes in pool
system
is decreased from3
to2
, nodecluster-system-02
will be removed and nodescluster-system-01
andcluster-system-00
will remain.Drain all nodes that will be removed of pods:
kubectl drain cluster-system-02
Wait until all pods have been migrated to other nodes before continuing.
- If you drained the wrong node, you can reactivate the node with:
kubectl uncordon cluster-system-02
- If you drained the wrong node, you can reactivate the node with:
Update the terrafrom configuration and apply it:
terraform apply
- Review the plan to verify that the drained nodes will be deleted.
Delete nodes from cluster:
kubectl delete node cluster-system-02
Stop Automated Node Reboots
Nodes are rebooting automatically when they receive updates that require a reboot. The kured service triggers reboots of nodes one by one. Reboots can be disabled system-wide by annotating the Daemonset, see https://kured.dev/docs/operation/.
Upgrade Operating System
WARNING: untested!
An operating system update is not recommended, e.g. from Ubuntu 22.04 to 24.04. Instead, the corresponding nodes should be replaced!
Gateway Node
Set new image as
default_image
. Attention: before changing the default image, make sure that each node pool has its own oppropriateimage
setting.Delete the node in the Hetzner Console
Reapply the configuration:
terraform apply
The gateway will reappear again within a few minutes. This will disrupt the Internet access of the cluster's nodes for tasks like fetching package updates. However, it will not affect the services that are provided via load balancers!
After redeploying the gateway, ssh connections will fail because a new cryptopraphic has been generated for the node. Delete the deprecated key from the .ssh/known_hosts
file, open a new ssh connection and accept the new public key.
Node Pools
Nodes should not be updated manually via agt-get
, but be replaced. For control plane nodes, it is recommended to create a back-up of the etcd data store on an external s3 storage, see k3s Cluster Datastore.
For control plane pools only: Create a new etcd snapshot, see k3s etcd-snapshot.
Then, perform the following steps on consecutively on all existing node pools until they have all been replaced.
Start the replacement with the node pool with the cluster_can_init
setting:
Ensure that there's another control plane node pool. If there's no other control plane node pool, create a temporary one that is deleted after the successful replacement of the node pool with the
cluster_can_init
setting.When the second control plane node pool is up and running, the node pool with the
cluster_can_init
setting must be deleted and replaced in one application of the configuration.Ensure that the
cluster_init_action.init
andcluster_init_action.reset
settings are disabled.Drain the old nodes:
kubectl drain node-xyz
Once all pods have been migrated, delete the old nodes:
kubectl delete node node-xyz
Then rename the node pool to achieve deletion and replacement in one configuration change.
Apply the configuration:
terraform apply
Once the new control plane node pool with the
cluster_can_init
setting is again up and running, the temporary control plane node pool can be deleted.
Perform these steps for all remaining node pools:
Add a new node pool and set the
image
setting to the new version.Once the new node pool is up and running, drain the old nodes:
kubectl drain node-xyz
Once all pods have been migrated, delete the old nodes:
kubectl delete node node-xyz
Remove the node pool from the terraform configuration.
Reapply the configuration:
terraform apply
Update Kubernetes
Determine the next Kubernetes version, see k3s channels, k3s images tags, and k3s upgrade image tags.
Write the upgrade plan, see instructions and examples.
Apply an upgrade plan.
Update the
image
variable in the configuration for future nodes to be deployed with the correct image.
Update Cilium
Available versions: https://github.com/cilium/cilium
Update instructions: https://docs.cilium.io/en/stable/operations/upgrade/
helm repo add cilium https://helm.cilium.io/
helm repo update
helm upgrade --reuse-values cilium cilium/cilium -n kube-system --version '<NEW_VERSION>'
values.yaml
:
# Documentation: https://artifacthub.io/packages/helm/cilium/cilium
# WARNING: needs to be in line with the cluster configuration
routingMode: native
ipv4NativeRoutingCIDR: 10.0.0.0/8
ipam:
operator:
clusterPoolIPv4PodCIDRList: 10.244.0.0/16
k8sServiceHost: 10.0.1.1
k8sServicePort: "6443"
operator:
replicas: 2
Update Hetzner Cloud Controller Manager (CCM)
Available versions: https://github.com/hetznercloud/hcloud-cloud-controller-manager#versioning-policy
Update instructions: https://github.com/hetznercloud/hcloud-cloud-controller-manager/blob/main/CHANGELOG.md
helm repo add hcloud https://charts.hetzner.cloud
helm repo update
helm upgrade --reuse-values hcloud-ccm hcloud/hcloud-cloud-controller-manager -n kube-system --version '<NEW_VERSION>'
values.yaml
:
# Documentation: https://github.com/hetznercloud/hcloud-cloud-controller-manager/tree/main/chart
# WARNING: needs to be in line with the cluster configuration
networking:
enabled: true
clusterCIDR: 10.244.0.0/16
additionalTolerations:
# INFO: this taint occurred but isn't coveryd by default .. and caused the
# whole cluster to not start properly
- key: node.kubernetes.io/not-ready
value: NoSchedule
Update Hetzner Cloud Storage Interface (CSI)
Available versions: https://github.com/hetznercloud/csi-driver/blob/main/docs/kubernetes/README.md#versioning-policy
Update instructions: https://github.com/hetznercloud/csi-driver/blob/main/CHANGELOG.md
helm repo add hcloud https://charts.hetzner.cloud
helm repo update
helm upgrade --reuse-values hcloud-csi hcloud/hcloud-csi -n kube-system --version '<NEW_VERSION>'
values.yaml
:
# Documentation: https://github.com/hetznercloud/csi-driver/tree/main/chart
storageClasses:
- name: hcloud-volumes
defaultStorageClass: true
retainPolicy: Retain
Update Kured
Available versions: https://artifacthub.io/packages/helm/kured/kured
Update instructions: https://github.com/kubereboot/kured
helm repo add kubereboot https://kubereboot.github.io/charts
helm repo update
helm upgrade --reuse-values kured kubereboot/kured -n kube-system --version '<NEW_VERSION>'
values.yaml
:
# Documentation: https://artifacthub.io/packages/helm/kured/kured
configuration:
timeZone: Europe/Berlin
startTime: 1am
endTime: 5am
rebootDays:
- mo
- tu
- we
- th
- fr
- sa
- su
tolerations:
- key: CriticalAddonsOnly
operator: Exists
Update Metrics Server
Available versions: https://artifacthub.io/packages/helm/metrics-server/metrics-server
Update instructions: https://github.com/kubernetes-sigs/metrics-server
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo update
helm upgrade --reuse-values metrics-server metrics-server/metrics-server -n kube-system --version '<NEW_VERSION>'
values.yaml
:
# Documentation: https://artifacthub.io/packages/helm/metrics-server/metrics-server
Update System Upgrade Controller
Available versions: https://github.com/rancher/charts/tree/dev-v2.9/charts/system-upgrade-controller
Update instructions: https://github.com/rancher/system-upgrade-controller
helm repo add rancher https://charts.rancher.io
helm repo update
helm upgrade --reuse-values system-upgrade-controller rancher/system-upgrade-controller -n cattle-system --version '<NEW_VERSION>'
values.yaml
:
# Documentation: https://github.com/rancher/system-upgrade-controller
# Documentation: https://github.com/rancher/charts/tree/dev-v2.9/charts/system-upgrade-controller
global:
cattle:
psp:
enabled: false
Deletion
After applying the Terraform plan you'll see several output variables like the load balancer's, control plane's, and node pools' IP addresses.
terraform destroy -force
Be sure to clean up any CSI created Block Storage Volumes, and CCM created NodeBalancers that you no longer require.
Troubleshooting
Gateway
Ensure gateway is set up correctly: ./ssh-node gateway
Verify packet masquerading is set up properly
iptables -L -t nat
# Expected output:
# Chain PREROUTING (policy ACCEPT)
# target prot opt source destination
#
# Chain INPUT (policy ACCEPT)
# target prot opt source destination
#
# Chain OUTPUT (policy ACCEPT)
# target prot opt source destination
#
# Chain POSTROUTING (policy ACCEPT)
# target prot opt source destination
# MASQUERADE all -- 10.0.1.0/24 anywhere
Verify firewall is set up properly
ufw status
# Expected output:
# Status: active
#
# To Action From
# -- ------ ----
# 22,6443/tcp ALLOW Anywhere
# 22,6443/tcp (v6) ALLOW Anywhere (v6)
#
# Anywhere on eth0 ALLOW FWD Anywhere on ens10
# Anywhere (v6) on eth0 ALLOW FWD Anywhere (v6) on ens10
Nodes
Inspect local firewall settings
ufw status
Verify correctness of date/timezone and locale
date
echo $LANG
Inspect cloud-init logs
# Retrieve status
cloud-init status
# Verify configuration
cloud-init schema --system
# Collect logs for inspection
cloud-init collect-logs
tar xvzf cloud-init.tar.gz
# Inspect cloud-init.log for error messages
# Quickly find runcmd
find /var/lib/cloud/instances -name runcmd
sh -ex PATH_TO_RUNCMD
Cluster
Ensure cluster is set up correctly: ./ssh-node cluster
Verify default route
ip r s
# Expected output:
# default via 10.0.0.1 dev ens10 proto static onlink <-- this is the important line
# 10.0.0.0/8 via 10.0.0.1 dev ens10 proto dhcp src 10.0.1.2 metric 1024
# 10.0.0.1 dev ens10 proto dhcp scope link src 10.0.1.2 metric 1024
# 169.254.169.254 via 10.0.0.1 dev ens10 proto dhcp src 10.0.1.2 metric 1024
Verify connectivity to the Internet
ping 1.1.1.1
# Expected output:
# PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
# 64 bytes from 1.1.1.1: icmp_seq=1 ttl=53 time=4.60 ms
# 64 bytes from 1.1.1.1: icmp_seq=2 ttl=53 time=6.82 ms
# ...
Verify name resolution
host k3s.io
# Expected output:
# k3s.io has address 185.199.108.153
# k3s.io has address 185.199.110.153
# k3s.io has address 185.199.111.153
# k3s.io has address 185.199.109.153
# ...
Verify cluster status
k3s kubectl get nodes
# Expected output:
# k3s.io has address 185.199.108.153
# k3s.io has address 185.199.110.153
# k3s.io has address 185.199.111.153
# k3s.io has address 185.199.109.153
# ...
Verify Cilium Networking Status
This command only works after installing the cilium cli.
cilium status
# Expected output:
# /¯¯\
# /¯¯\__/¯¯\ Cilium: OK
# \__/¯¯\__/ Operator: OK
# /¯¯\__/¯¯\ Envoy DaemonSet: disabled (using embedded mode)
# \__/¯¯\__/ Hubble Relay: disabled
# \__/ ClusterMesh: disabled
#
# Deployment cilium-operator Desired: 1, Ready: 1/1, Available: 1/1
# DaemonSet cilium Desired: 3, Ready: 3/3, Available: 3/3
# Containers: cilium Running: 3
# cilium-operator Running: 1
# Cluster Pods: 9/9 managed by Cilium
# Helm chart version: 1.14.5
# Image versions cilium quay.io/cilium/cilium:v1.14.5@sha256:d3b287029755b6a47dee01420e2ea469469f1b174a2089c10af7e5e9289ef05b: 3
# cilium-operator quay.io/cilium/operator-generic:v1.14.5@sha256:303f9076bdc73b3fc32aaedee64a14f6f44c8bb08ee9e3956d443021103ebe7a: 1
Verify k3s Cluster Configuration
This command only works out of the box on the first node of the control plane node pool with the cluster_can_init
setting.
k3s check-config
# Expected output:
# ...
# STATUS: pass
Inspect cluster status and logs
systemctl status k3s.service
journalctl -u k3s.service
Related Documentation
Similar Projects
hcloud-kube-hetzner very popular k3s stack based on openSUSE MicroOS.
hcloud-k3s Original project that this project has been forked from.
hetzner-cloud-k3s A fully functional, super cheap Kubernetes cluster in Hetzner Cloud in 1m30s or less
Not terraform-based.
Scripts that make it easy to manage a cluster.
hetzner-k3s A CLI tool to create and manage Kubernetes clusters in Hetzner Cloud using the lightweight distribution k3s by Rancher. Successor of hetzner-cloud-k3s.
- Not terraform-based.
k-andy Zero friction Kubernetes stack on Hetzner Cloud.
Terraform-based stack.
Distributed across multiple Hetzner sites and data centers.
Support for multiple control-plane servers.
terraform-hcloud-kube-hetzner. Optimized and Maintenance-free Kubernetes on Hetzner Cloud in one command!
Special Thanks
The initiators of and contributors to this project for getting the k3s cluster running via terraform.
And to God for providing and enabling me to do my share of this work. Solo Deo Gloria.