DIY: Create Your Own Cloud with Kubernetes (Part 1)


By Andrei Kvapil (Ænix) | Friday, April 05, 2024

At Ænix, we have a deep affection for Kubernetes and dream that all modern technologies will soon start utilizing its remarkable patterns.

Have you ever thought about building your own cloud? I bet you have. But is it possible to do this using only modern technologies and approaches, without leaving the cozy Kubernetes ecosystem? Our experience in developing Cozystack required us to delve deeply into it.

You might argue that Kubernetes is not intended for this purpose: why not simply use OpenStack for bare metal servers and run Kubernetes inside it, as intended? But by doing so, you would simply shift the responsibility from your hands to the hands of OpenStack administrators, adding at least one more huge and complex system to your ecosystem.

Why complicate things? After all, Kubernetes already has everything needed to run tenant Kubernetes clusters.

I want to share with you our experience in developing a cloud platform based on Kubernetes, highlighting the open-source projects that we use ourselves and believe deserve your attention.

In this series of articles, I will tell you the story of how we build managed Kubernetes on bare metal using only open-source technologies: from basic data center preparation, running virtual machines, isolating networks, and setting up fault-tolerant storage, all the way to provisioning full-featured Kubernetes clusters with dynamic volume provisioning, load balancers, and autoscaling.

With this article, I start a series consisting of several parts:

  • Part 1: Preparing the groundwork for your cloud. Challenges faced during the preparation and operation of Kubernetes on bare metal and a ready-made recipe for provisioning infrastructure.

  • Part 2: Networking, storage, and virtualization. How to turn Kubernetes into a tool for launching virtual machines and what is needed for this.

  • Part 3: Cluster API and how to start provisioning Kubernetes clusters at the push of a button. How autoscaling works, dynamic provisioning of volumes, and load balancers.

I will try to describe various technologies as independently as possible, but at the same time, I will share our experience and why we came to one solution or another.

To begin with, let's understand the main advantage of Kubernetes and how it has changed the approach to using cloud resources.

It is important to understand that the use of Kubernetes in the cloud and on bare metal differs.

Kubernetes in the cloud

When you operate Kubernetes in the cloud, you don't worry about persistent volumes, cloud load balancers, or the process of provisioning nodes. All of this is handled by your cloud provider, who accepts your requests in the form of Kubernetes objects. In other words, the server side is completely hidden from you, and you don't really want to know how exactly the cloud provider implements it, as it's not in your area of responsibility.

A diagram showing cloud Kubernetes, with load balancing and storage done outside the cluster

Kubernetes offers convenient abstractions that work the same everywhere, allowing you to deploy your application on any Kubernetes in any cloud.

In the cloud, you commonly have several distinct entities: the Kubernetes control plane, virtual machines, persistent volumes, and load balancers. Using these entities, you can create highly dynamic environments.

Thanks to Kubernetes, virtual machines are now seen merely as a utility for consuming cloud resources. You no longer store data inside virtual machines. You can delete all your virtual machines at any moment and recreate them without breaking your application. The Kubernetes control plane will continue to hold information about what should run in your cluster. The load balancer will keep sending traffic to your workload, simply changing the endpoint to send traffic to a new node. And your data will be safely stored in external persistent volumes provided by the cloud.

This approach is fundamental when using Kubernetes in clouds. The reason for it is quite obvious: the simpler the system, the more stable it is, and it is this simplicity that you buy when you choose Kubernetes in the cloud.

Kubernetes on bare metal

Using Kubernetes in the clouds is really simple and convenient, which cannot be said about bare metal installations. In the bare metal world, Kubernetes, on the contrary, becomes unbearably complex. Firstly, because the entire network, backend storage, load balancers, etc. are usually run not outside, but inside your cluster. As a result, such a system is much more difficult to update and maintain.

A diagram showing bare metal Kubernetes, with load balancing and storage done inside the cluster

Judge for yourself: in the cloud, to update a node, you typically delete the virtual machine (or even use kubectl delete node) and let your node management tooling create a new one, based on an immutable image. The new node will join the cluster and "just work" as a node, following a very simple and commonly used pattern in the Kubernetes world. Many clusters order new virtual machines every few minutes, simply because they can use cheaper spot instances. However, when you have a physical server, you can't just delete and recreate it: firstly, because it often runs cluster services and stores data, and its update process is significantly more complicated.

There are different approaches to solving this problem, ranging from in-place updates, as done by kubeadm, kubespray, and k3s, to full automation of provisioning physical nodes through Cluster API and Metal3.

I like the hybrid approach offered by Talos Linux, where your entire system is described in a single configuration file. Most parameters of this file can be applied without rebooting or recreating the node, including the version of the Kubernetes control-plane components. At the same time, it preserves the declarative nature of Kubernetes as much as possible. This approach minimizes unnecessary impact on cluster services when updating bare metal nodes. In most cases, you won't need to migrate your virtual machines or rebuild the cluster filesystem on minor updates.
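To illustrate what this looks like in practice, here is a minimal fragment of a Talos machine configuration; the endpoint, target disk, and module list are placeholders, so treat it as a sketch rather than a complete, working config:

machine:
  install:
    disk: /dev/sda                               # target disk (placeholder)
    image: ghcr.io/siderolabs/installer:v1.6.4   # installer image (a custom build is shown below)
  kernel:
    modules:
      - name: drbd                               # extra modules shipped as system extensions
      - name: zfs
cluster:
  controlPlane:
    endpoint: https://192.168.100.10:6443        # placeholder cluster endpoint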

Preparing a base for your future cloud

So, suppose you've decided to build your own cloud. To start, you need a base layer. You need to think not only about how you will install Kubernetes on your servers but also about how you will update and maintain it. Consider the fact that you will have to handle things like updating the kernel, installing necessary modules, as well as packages and security patches. Now you have to think about much more than you would when using ready-made Kubernetes in the cloud.

Of course, you can use standard distributions like Ubuntu or Debian, or you can consider specialized ones like Flatcar Container Linux, Fedora CoreOS, and Talos Linux. Each has its advantages and disadvantages.

What about us? At Ænix, we use quite a few specific kernel modules like ZFS, DRBD, and OpenvSwitch, so we decided to go the route of forming a system image with all the necessary modules in advance. In this case, Talos Linux turned out to be the most convenient for us. For example, such a config is enough to build a system image with all the necessary kernel modules:

arch: amd64
platform: metal
secureboot: false
version: v1.6.4
input:
  kernel:
    path: /usr/install/amd64/vmlinuz
  initramfs:
    path: /usr/install/amd64/initramfs.xz
  baseInstaller:
    imageRef: ghcr.io/siderolabs/installer:v1.6.4
  systemExtensions:
    - imageRef: ghcr.io/siderolabs/amd-ucode:20240115
    - imageRef: ghcr.io/siderolabs/amdgpu-firmware:20240115
    - imageRef: ghcr.io/siderolabs/bnx2-bnx2x:20240115
    - imageRef: ghcr.io/siderolabs/i915-ucode:20240115
    - imageRef: ghcr.io/siderolabs/intel-ice-firmware:20240115
    - imageRef: ghcr.io/siderolabs/intel-ucode:20231114
    - imageRef: ghcr.io/siderolabs/qlogic-firmware:20240115
    - imageRef: ghcr.io/siderolabs/drbd:9.2.6-v1.6.4
    - imageRef: ghcr.io/siderolabs/zfs:2.1.14-v1.6.4
output:
  kind: installer
  outFormat: raw

Then we use the docker command line tool to build an OS image:

cat config.yaml | docker run --rm -i -v /dev:/dev --privileged "ghcr.io/siderolabs/imager:v1.6.4" -

And as a result, we get a Docker container image with everything we need, which we can use to install Talos Linux on our servers. You can do the same; this image will contain all the necessary firmware and kernel modules.

But the question arises, how do you deliver the freshly formed image to your nodes?

I have been contemplating the idea of PXE booting for quite some time. For example, the Kubefarm project that I wrote an article about two years ago was entirely built using this approach. But unfortunately, it does not help you deploy your very first parent cluster that will hold the others. So now we have prepared a solution that will help you do the same using the PXE approach.

Essentially, all you need to do is run temporary DHCP and PXE servers inside containers. Then your nodes will boot from your image, and you can use a simple Debian-flavored script to help you bootstrap your nodes.

An asciicast recording demonstrating bootstrapping a node with talos-bootstrap

The source for that talos-bootstrap script is available on GitHub.

This script allows you to deploy Kubernetes on bare metal in five minutes and obtain a kubeconfig for accessing it. However, many unresolved issues still lie ahead.

Delivering system components

At this stage, you already have a Kubernetes cluster capable of running various workloads. However, it is not fully functional yet: you still need to set up networking and storage, as well as install necessary cluster extensions like KubeVirt to run virtual machines, along with the monitoring stack and other system-wide components.

Traditionally, this is solved by installing Helm charts into your cluster. You can do this by running helm install commands locally, but this approach becomes inconvenient when you want to track updates, or when you have multiple clusters and want to keep them uniform. In fact, there are plenty of ways to do this declaratively. To solve this, I recommend using GitOps best practices, with tools like ArgoCD and FluxCD.

While ArgoCD is more convenient for development purposes with its graphical interface and central control plane, FluxCD is better suited for creating Kubernetes distributions. With FluxCD, you can specify which charts should be launched with which parameters and describe dependencies. Then, FluxCD will take care of everything for you.
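As an illustration, a single chart managed by FluxCD is described by a HelmRelease resource roughly like the one below; the namespaces, chart, and repository names are made up for this example:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: cilium
  namespace: cozy-cilium            # example namespace
spec:
  interval: 5m
  dependsOn:
    - name: cert-manager            # FluxCD reconciles this release only after its dependency is ready
      namespace: cozy-cert-manager
  chart:
    spec:
      chart: cilium
      version: "1.14.x"             # example version constraint
      sourceRef:
        kind: HelmRepository
        name: cilium                # assumes a HelmRepository object with this name exists
        namespace: cozy-fluxcd
  values:
    ipam:
      mode: kubernetes              # example chart value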

It is suggested to perform a one-time installation of FluxCD in your newly created cluster and provide it with the configuration. This will install everything necessary, bringing the cluster to the expected state. For example, after installing our platform you'll see the following pre-configured Helm charts with system components:

NAMESPACE                        NAME                        AGE    READY   STATUS
cozy-cert-manager                cert-manager                4m1s   True    Release reconciliation succeeded
cozy-cert-manager                cert-manager-issuers        4m1s   True    Release reconciliation succeeded
cozy-cilium                      cilium                      4m1s   True    Release reconciliation succeeded
cozy-cluster-api                 capi-operator               4m1s   True    Release reconciliation succeeded
cozy-cluster-api                 capi-providers              4m1s   True    Release reconciliation succeeded
cozy-dashboard                   dashboard                   4m1s   True    Release reconciliation succeeded
cozy-fluxcd                      cozy-fluxcd                 4m1s   True    Release reconciliation succeeded
cozy-grafana-operator            grafana-operator            4m1s   True    Release reconciliation succeeded
cozy-kamaji                      kamaji                      4m1s   True    Release reconciliation succeeded
cozy-kubeovn                     kubeovn                     4m1s   True    Release reconciliation succeeded
cozy-kubevirt-cdi                kubevirt-cdi                4m1s   True    Release reconciliation succeeded
cozy-kubevirt-cdi                kubevirt-cdi-operator       4m1s   True    Release reconciliation succeeded
cozy-kubevirt                    kubevirt                    4m1s   True    Release reconciliation succeeded
cozy-kubevirt                    kubevirt-operator           4m1s   True    Release reconciliation succeeded
cozy-linstor                     linstor                     4m1s   True    Release reconciliation succeeded
cozy-linstor                     piraeus-operator            4m1s   True    Release reconciliation succeeded
cozy-mariadb-operator            mariadb-operator            4m1s   True    Release reconciliation succeeded
cozy-metallb                     metallb                     4m1s   True    Release reconciliation succeeded
cozy-monitoring                  monitoring                  4m1s   True    Release reconciliation succeeded
cozy-postgres-operator           postgres-operator           4m1s   True    Release reconciliation succeeded
cozy-rabbitmq-operator           rabbitmq-operator           4m1s   True    Release reconciliation succeeded
cozy-redis-operator              redis-operator              4m1s   True    Release reconciliation succeeded
cozy-telepresence                telepresence                4m1s   True    Release reconciliation succeeded
cozy-victoria-metrics-operator   victoria-metrics-operator   4m1s   True    Release reconciliation succeeded

Conclusion

As a result, you achieve a highly repeatable environment that you can provide to anyone, knowing that it operates exactly as intended. This is actually what the Cozystack project does, which you can try out for yourself absolutely free.

In the following articles, I will discuss how to prepare Kubernetes for running virtual machines and how to run Kubernetes clusters with the click of a button. Stay tuned, it'll be fun!

DIY: Create Your Own Cloud with Kubernetes (Part 2)

By Andrei Kvapil (Ænix) | Friday, April 05, 2024

Continuing our series of posts on how to build your own cloud using just the Kubernetes ecosystem. In the previous article, we explained how we prepare a basic Kubernetes distribution based on Talos Linux and Flux CD. In this article, we'll show you a few virtualization technologies in Kubernetes and prepare everything needed to run virtual machines in Kubernetes, primarily storage and networking.

We will talk about technologies such as KubeVirt, LINSTOR, and Kube-OVN.

But first, let's explain what virtual machines are needed for, and why you can't just use Docker containers to build a cloud. The reason is that containers do not provide a sufficient level of isolation. Although the situation improves year by year, we often encounter vulnerabilities that allow escaping the container sandbox and elevating privileges in the system.

On the other hand, Kubernetes was not originally designed to be a multi-tenant system, meaning the basic usage pattern involves creating a separate Kubernetes cluster for every independent project and development team.

Virtual machines are the primary means of isolating tenants from each other in a cloud environment. In virtual machines, users can execute code and programs with administrative privileges, but this doesn't affect other tenants or the environment itself. In other words, virtual machines allow you to achieve hard multi-tenancy isolation and run in environments where tenants do not trust each other.

Virtualization technologies in Kubernetes

There are several different technologies that bring virtualization into the Kubernetes world: KubeVirt and Kata Containers are the most popular ones. But you should know that they work differently.

Kata Containers implements the CRI (Container Runtime Interface) and provides an additional level of isolation for standard containers by running them in virtual machines. But they still work within a single Kubernetes cluster.

A diagram showing how container isolation is ensured by running containers in virtual machines with Kata Containers

KubeVirt allows running traditional virtual machines using the Kubernetes API. KubeVirt virtual machines are run as regular linux processes in containers. In other words, in KubeVirt, a container is used as a sandbox for running virtual machine (QEMU) processes. This can be clearly seen in the figure below, by looking at how live migration of virtual machines is implemented in KubeVirt. When migration is needed, the virtual machine moves from one container to another.

A diagram showing live migration of a virtual machine from one container to another in KubeVirt

There is also an alternative project - Virtink, which implements lightweight virtualization using Cloud-Hypervisor and is initially focused on running virtual Kubernetes clusters using the Cluster API.

Considering our goals, we decided to use KubeVirt as the most popular project in this area. Besides, we have extensive expertise with it and have already made a lot of contributions to KubeVirt.

KubeVirt is easy to install and allows you to run virtual machines out-of-the-box using the containerDisk feature - this allows you to store and distribute VM images directly as OCI images from a container image registry. Virtual machines with containerDisk are well suited for creating Kubernetes worker nodes and other VMs that do not require state persistence.
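For example, a minimal VirtualMachine backed by a containerDisk might look like the following; the image reference is only an illustration, and any OCI image built for containerDisk would work:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: worker-example
spec:
  running: true
  template:
    spec:
      domain:
        devices:
          disks:
            - name: containerdisk
              disk:
                bus: virtio
        resources:
          requests:
            cpu: "2"
            memory: 2Gi
      volumes:
        - name: containerdisk
          containerDisk:
            image: quay.io/containerdisks/ubuntu:22.04   # example image; the disk is ephemeral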

For managing persistent data, KubeVirt offers a separate tool, Containerized Data Importer (CDI). It allows for cloning PVCs and populating them with data from base images. CDI is necessary if you want to automatically provision persistent volumes for your virtual machines, and it is also required for the KubeVirt CSI Driver, which is used to handle persistent volume claims from tenant Kubernetes clusters.
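A typical pattern is a DataVolume that asks CDI to populate a fresh PVC from a base image; the URL and sizes below are placeholders:

apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: vm-root-disk
spec:
  source:
    http:
      url: https://example.com/images/debian-12.qcow2   # placeholder image URL
  pvc:
    accessModes:
      - ReadWriteMany          # concurrent access, needed for live migration (see the storage section below)
    volumeMode: Block          # block volumes are preferable for VMs
    resources:
      requests:
        storage: 20Gi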

But first, you have to decide where and how you will store this data.

Storage for Kubernetes VMs

With the introduction of the CSI (Container Storage Interface), a wide range of technologies that integrate with Kubernetes has become available. In fact, KubeVirt fully utilizes the CSI interface, aligning the choice of storage for virtualization closely with the choice of storage for Kubernetes itself. However, there are nuances which you need to consider. Unlike containers, which typically use a standard filesystem, block devices are more efficient for virtual machines.

Although the CSI interface in Kubernetes allows requesting both types of volumes - filesystems and block devices - it's important to verify that your storage backend supports the latter.

Using block devices for virtual machines eliminates the need for an additional abstraction layer, such as a filesystem, which makes them more performant and in most cases enables the use of the ReadWriteMany mode. This mode allows concurrent access to the volume from multiple nodes, which is a critical feature for enabling the live migration of virtual machines in KubeVirt.

The storage system can be external or internal (in the case of hyper-converged infrastructure). Using external storage in many cases makes the whole system more stable, as your data is stored separately from compute nodes.

A diagram showing external data storage communication with the compute nodes

External storage solutions are often popular in enterprise systems because such storage is frequently provided by an external vendor that takes care of its operation. The integration with Kubernetes involves only a small component installed in the cluster - the CSI driver. This driver is responsible for provisioning volumes in this storage and attaching them to pods run by Kubernetes. However, such storage solutions can also be implemented using purely open-source technologies. One of the popular solutions is TrueNAS powered by the democratic-csi driver.

A diagram showing local data storage running on the compute nodes

On the other hand, hyper-converged systems are often implemented using local storage (when you do not need replication) and software-defined storage, often installed directly in Kubernetes, such as Rook/Ceph, OpenEBS, Longhorn, LINSTOR, and others.

A diagram showing clustered data storage running on the compute nodes

A hyper-converged system has its advantages. For example, data locality: when your data is stored locally, access to such data is faster. But there are disadvantages as such a system is usually more difficult to manage and maintain.

At Ænix, we wanted to provide a ready-to-use solution that could be used without the need to purchase and set up additional external storage, and that was optimal in terms of speed and resource utilization. LINSTOR became that solution. Time-tested, industry-popular technologies such as LVM and ZFS as backends give confidence that data is stored securely. DRBD-based replication is incredibly fast and consumes a small amount of computing resources.

For installing LINSTOR in Kubernetes, there is the Piraeus project, which already provides ready-made block storage for use with KubeVirt.
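Once Piraeus is running, volumes are consumed through a StorageClass backed by the LINSTOR CSI driver. A rough sketch is shown below; the storage pool name and the exact parameter set are assumptions that depend on how your LINSTOR cluster is configured:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: linstor-replicated
provisioner: linstor.csi.linbit.com                  # LINSTOR CSI driver
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
parameters:
  linstor.csi.linbit.com/storagePool: "data"         # assumed storage pool name
  linstor.csi.linbit.com/placementCount: "2"         # number of DRBD replicas (assumption)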

Note:

In case you are using Talos Linux, as we described in the previous article, you will need to enable the necessary kernel modules in advance and configure Piraeus as described in the instructions.

Networking for Kubernetes VMs

Despite having a similar interface - CNI - the network architecture in Kubernetes is actually more complex and typically consists of many independent components that are not directly connected to each other. In fact, you can split Kubernetes networking into four layers, which are described below.

Node Network (Data Center Network)

The network through which nodes are interconnected with each other. This network is usually not managed by Kubernetes, but it is an important one because, without it, nothing would work. In practice, bare metal infrastructure usually has more than one such network, e.g. one for node-to-node communication, a second for storage replication, a third for external access, and so on.

A diagram showing the role of the node network (data center network) on the Kubernetes networking scheme

Configuring the physical network interaction between nodes goes beyond the scope of this article, as in most situations, Kubernetes utilizes already existing network infrastructure.

Pod Network

This is the network provided by your CNI plugin. The task of the CNI plugin is to ensure transparent connectivity between all containers and nodes in the cluster. Most CNI plugins implement a flat network from which separate blocks of IP addresses are allocated for use on each node.

A diagram showing the role of the pod network (CNI-plugin) on the Kubernetes network scheme

In practice, your cluster can have several CNI plugins managed by Multus. This approach is often used in KubeVirt-based virtualization solutions such as Rancher and OpenShift. The primary CNI plugin is used for integration with Kubernetes services, while additional CNI plugins are used to implement private networks (VPC) and integration with the physical networks of your data center.

The default CNI plugins can be used to connect bridges or physical interfaces. Additionally, there are specialized plugins such as macvtap-cni which are designed to provide better performance.

One additional aspect to keep in mind when running virtual machines in Kubernetes is the need for IPAM (IP Address Management), especially for secondary interfaces provided by Multus. This is commonly managed by a DHCP server operating within your infrastructure. Additionally, the allocation of MAC addresses for virtual machines can be managed by Kubemacpool.

In our platform, however, we decided to go another way and fully rely on Kube-OVN. This CNI plugin is based on OVN (Open Virtual Network), originally developed for OpenStack. It provides a complete network solution for virtual machines in Kubernetes, features Custom Resources for managing IPs and MAC addresses, supports live migration while preserving IP addresses between nodes, and enables the creation of VPCs for physical network separation between tenants.

In Kube-OVN you can assign separate subnets to an entire namespace or connect them as additional network interfaces using Multus.
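For instance, a dedicated subnet for a tenant namespace can be described with a Subnet custom resource along these lines; the CIDR, gateway, and namespace name are made up, so check the Kube-OVN documentation for the full schema:

apiVersion: kubeovn.io/v1
kind: Subnet
metadata:
  name: tenant-a
spec:
  protocol: IPv4
  cidrBlock: 10.100.0.0/24     # example CIDR
  gateway: 10.100.0.1          # example gateway
  namespaces:
    - tenant-a                 # pods in this namespace get addresses from this subnet
  natOutgoing: true            # SNAT traffic leaving the subnet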

Services Network

In addition to the CNI plugin, Kubernetes also has a services network, which is primarily needed for service discovery. Unlike traditional virtual machines, Kubernetes was originally designed to run pods with random addresses. The services network provides a convenient abstraction (stable IP addresses and DNS names) that will always direct traffic to the correct pod. The same approach is also commonly used with virtual machines in clouds, despite the fact that their IPs are usually static.

A diagram showing the role of the services network (services network plugin) on the Kubernetes network scheme

The implementation of the services network in Kubernetes is handled by the services network plugin. The standard implementation is called kube-proxy and is used in most clusters. But nowadays, this functionality might be provided as part of the CNI plugin. The most advanced implementation is offered by the Cilium project, which can be run in kube-proxy replacement mode.

Cilium is based on the eBPF technology, which allows for efficient offloading of the Linux networking stack, thereby improving performance and security compared to traditional methods based on iptables.
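If you install Cilium with Helm, kube-proxy replacement is enabled through chart values roughly like these; the exact keys and accepted values vary between Cilium versions, and the API server address is a placeholder:

# values.yaml for the Cilium Helm chart (sketch)
kubeProxyReplacement: true        # let Cilium handle Services via eBPF instead of kube-proxy
k8sServiceHost: 192.168.100.10    # API server address (placeholder)
k8sServicePort: 6443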

In practice, Cilium and Kube-OVN can be easily integrated to provide a unified solution that offers seamless, multi-tenant networking for virtual machines, as well as advanced network policies and combined services network functionality.

External Traffic Load Balancer

At this stage, you already have everything needed to run virtual machines in Kubernetes. But there is actually one more thing. You still need to access your services from outside your cluster, and an external load balancer will help you with organizing this.

For bare metal Kubernetes clusters, there are several load balancers available: MetalLB, kube-vip, and LoxiLB, while Cilium and Kube-OVN also provide built-in implementations.

The role of an external load balancer is to provide a stable address available externally and direct external traffic to the services network. The services network plugin will direct it to your pods and virtual machines as usual.

A diagram showing the role of the external load balancer on the Kubernetes network scheme

In most cases, setting up a load balancer on bare metal is achieved by creating a floating IP address on the nodes within the cluster and announcing it externally using ARP/NDP or BGP protocols.

After exploring various options, we decided that MetalLB is the simplest and most reliable solution, although we do not strictly enforce its use.

Another benefit is that in L2 mode, MetalLB speakers continuously check their neighbours' state by performing liveness checks using the memberlist protocol. This enables failover that works independently of the Kubernetes control plane.
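A minimal L2 setup with MetalLB consists of an address pool and an advertisement; the address range below is a placeholder for a range from your data center network:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: external
  namespace: metallb-system
spec:
  addresses:
    - 192.168.100.200-192.168.100.250   # placeholder range
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: external
  namespace: metallb-system
spec:
  ipAddressPools:
    - external                          # announce addresses from the pool above via ARP/NDP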

Conclusion

This concludes our overview of virtualization, storage, and networking in Kubernetes. The technologies mentioned here are available and already pre-configured on the Cozystack platform, where you can try them with no limitations.

In the next article, I'll detail how, on top of this, you can implement the provisioning of fully functional Kubernetes clusters with just the click of a button.

DIY: Create Your Own Cloud with Kubernetes (Part 3)

By Andrei Kvapil (Ænix) | Friday, April 05, 2024

Approaching the most interesting phase, this article delves into running Kubernetes within Kubernetes. Technologies such as Kamaji and Cluster API are highlighted, along with their integration with KubeVirt.

Previous discussions have covered preparing Kubernetes on bare metal and how to turn Kubernetes into a virtual machine management system. This article concludes the series by explaining how, using all of the above, you can build a full-fledged managed Kubernetes and run virtual Kubernetes clusters with just a click.

First up, let's dive into the Cluster API.

Cluster API

Cluster API is an extension for Kubernetes that allows the management of Kubernetes clusters as custom resources within another Kubernetes cluster.

The main goal of the Cluster API is to provide a unified interface for describing the basic entities of a Kubernetes cluster and managing their lifecycle. This enables the automation of processes for creating, updating, and deleting clusters, simplifying scaling, and infrastructure management.

Within the context of Cluster API, there are two terms: management cluster and tenant clusters.

  • Management cluster is a Kubernetes cluster used to deploy and manage other clusters. This cluster contains all the necessary Cluster API components and is responsible for describing, creating, and updating tenant clusters. It is often used just for this purpose.

  • Tenant clusters are the user clusters or clusters deployed using the Cluster API. They are created by describing the relevant resources in the management cluster. They are then used for deploying applications and services by end-users.

It's important to understand that, physically, tenant clusters do not necessarily have to run on the same infrastructure as the management cluster; more often, they run elsewhere.

A diagram showing interaction of management Kubernetes cluster and tenant Kubernetes clusters using Cluster API

For its operation, Cluster API utilizes the concept of providers which are separate controllers responsible for specific components of the cluster being created. Within Cluster API, there are several types of providers. The major ones are:

  • Infrastructure Provider, which is responsible for providing the computing infrastructure, such as virtual machines or physical servers.

  • Control Plane Provider, which provides the Kubernetes control plane, namely the components kube-apiserver, kube-scheduler, and kube-controller-manager.

  • Bootstrap Provider, which is used for generating cloud-init configuration for the virtual machines and servers being created.

To get started, you will need to install the Cluster API itself and one provider of each type. You can find a complete list of supported providers in the project's documentation.

For installation, you can use the clusterctl utility, or Cluster API Operator as the more declarative method.

Choosing providers

Infrastructure provider

To run Kubernetes clusters using KubeVirt, the KubeVirt Infrastructure Provider must be installed. It enables the deployment of virtual machines for worker nodes in the same management cluster, where the Cluster API operates.

Control plane provider

The Kamaji project offers a ready solution for running the Kubernetes control plane for tenant clusters as containers within the management cluster. This approach has several significant advantages:

  • Cost-effectiveness: Running the control plane in containers avoids the use of separate control plane nodes for each cluster, thereby significantly reducing infrastructure costs.

  • Stability: Simplifying architecture by eliminating complex multi-layered deployment schemes. Instead of sequentially launching a virtual machine and then installing etcd and Kubernetes components inside it, there's a simple control plane that is deployed and run as a regular application inside Kubernetes and managed by an operator.

  • Security: The cluster's control plane is hidden from the end user, reducing the possibility of its components being compromised, and also eliminates user access to the cluster's certificate store. This approach to organizing a control plane invisible to the user is often used by cloud providers.

Bootstrap provider

We use Kubeadm as the Bootstrap Provider - the standard method for preparing clusters in Cluster API. This provider is developed as part of the Cluster API itself. It requires only a prepared system image with kubelet and kubeadm installed and allows generating configs in the cloud-init and ignition formats.

It's worth noting that Talos Linux also supports provisioning via the Cluster API and has providers for this. Although previous articles discussed using Talos Linux to set up a management cluster on bare-metal nodes, to provision tenant clusters the Kamaji+Kubeadm approach has more advantages. It facilitates the deployment of Kubernetes control planes in containers, thus removing the need for separate virtual machines for control plane instances. This simplifies the management and reduces costs.

How it works

The primary object in Cluster API is the Cluster resource, which acts as the parent for all the others. Typically, this resource references two others: a resource describing the control plane and a resource describing the infrastructure, each managed by a separate provider.

Unlike the Cluster, these two resources are not standardized, and their kind depends on the specific provider you are using:

A diagram showing the relationship of a Cluster resource and the resources it links to in Cluster API

Within Cluster API, there is also a resource named MachineDeployment, which describes a group of nodes, whether they are physical servers or virtual machines. This resource functions similarly to standard Kubernetes resources such as Deployment, ReplicaSet, and Pod, providing a mechanism for the declarative description of a group of nodes and automatic scaling.

In other words, the MachineDeployment resource allows you to declaratively describe nodes for your cluster, automating their creation, deletion, and updating according to specified parameters and the requested number of replicas.
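A MachineDeployment for a KubeVirt-backed cluster could look roughly like this; the names, the Kubernetes version, and the referenced templates are placeholders, and the exact kinds and API versions come from the providers you install:

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: tenant1-md-0
  namespace: default
spec:
  clusterName: tenant1
  replicas: 3                          # changed declaratively or by the Cluster Autoscaler
  selector:
    matchLabels: {}                    # normally defaulted by Cluster API
  template:
    spec:
      clusterName: tenant1
      version: v1.29.3                 # placeholder Kubernetes version
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: tenant1-md-0
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1   # version depends on the KubeVirt provider release
        kind: KubevirtMachineTemplate
        name: tenant1-md-0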

A diagram showing the relationship of a MachineDeployment resource and its children in Cluster API

To create machines, MachineDeployment refers to a template for generating the machine itself and a template for generating its cloud-init config:

A diagram showing the relationship of a MachineDeployment resource and the resources it links to in Cluster API

To deploy a new Kubernetes cluster using Cluster API, you will need to prepare the following set of resources (a minimal sketch follows the list):

  • A general Cluster resource

  • A KamajiControlPlane resource, responsible for the control plane operated by Kamaji

  • A KubevirtCluster resource, describing the cluster configuration in KubeVirt

  • A KubevirtMachineTemplate resource, responsible for the virtual machine template

  • A KubeadmConfigTemplate resource, responsible for generating tokens and cloud-init

  • At least one MachineDeployment to create some workers
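For orientation, the top-level Cluster object that ties the control plane and the infrastructure together looks roughly like this; the API groups and versions of the referenced resources depend on the provider releases you install, so treat them as assumptions:

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: tenant1
  namespace: default
spec:
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha1    # Kamaji control plane provider (version may differ)
    kind: KamajiControlPlane
    name: tenant1
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1  # KubeVirt infrastructure provider (version may differ)
    kind: KubevirtCluster
    name: tenant1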

Polishing the cluster

In most cases, this is sufficient, but depending on the providers used, you may need other resources as well. You can find examples of the resources created for each type of provider in the Kamaji project documentation.

At this stage, you already have a ready tenant Kubernetes cluster, but so far, it contains nothing but the API server, worker nodes, and a few core plugins that are standard in any Kubernetes cluster installation: kube-proxy and CoreDNS. For full integration, you will need to install several more components.

To install additional components, you can use a separate Cluster API Add-on Provider for Helm, or the same FluxCD discussed in previous articles.

When creating resources in FluxCD, it's possible to specify the target cluster by referring to the kubeconfig generated by Cluster API. Then, the installation will be performed directly into it. Thus, FluxCD becomes a universal tool for managing resources both in the management cluster and in the user tenant clusters.
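A sketch of what this looks like: the HelmRelease below targets the tenant cluster through the kubeconfig Secret that Cluster API generates, conventionally named after the cluster; the chart and repository names are illustrative:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: cilium
  namespace: default
spec:
  interval: 5m
  kubeConfig:
    secretRef:
      name: tenant1-kubeconfig   # Secret created by Cluster API for the tenant cluster
  chart:
    spec:
      chart: cilium
      sourceRef:
        kind: HelmRepository
        name: cilium             # assumes a HelmRepository object with this name exists
        namespace: default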

A diagram showing the interaction scheme of fluxcd, which can install components in both management and tenant Kubernetes clusters

What components are being discussed here? Generally, the set includes the following:

CNI Plugin

To ensure communication between pods in a tenant Kubernetes cluster, it's necessary to deploy a CNI plugin. This plugin creates a virtual network that allows pods to interact with each other and is traditionally deployed as a Daemonset on the cluster's worker nodes. You can choose and install any CNI plugin that you find suitable.

A diagram showing a CNI plugin installed inside the tenant Kubernetes cluster on a scheme of nested Kubernetes clusters

Cloud Controller Manager

The main task of the Cloud Controller Manager (CCM) is to integrate Kubernetes with the cloud infrastructure provider's environment (in our case, the management Kubernetes cluster in which all the workers of the tenant Kubernetes clusters are provisioned). Here are some tasks it performs:

  1. When a service of type LoadBalancer is created, the CCM initiates the process of creating a cloud load balancer, which directs traffic to your Kubernetes cluster.

  2. If a node is removed from the cloud infrastructure, the CCM ensures its removal from your cluster as well, maintaining the cluster's current state.

  3. When using the CCM, nodes are added to the cluster with a special taint, node.cloudprovider.kubernetes.io/uninitialized, which allows for the processing of additional business logic if necessary. After successful initialization, this taint is removed from the node.

Depending on the cloud provider, the CCM can operate both inside and outside the tenant cluster.

The KubeVirt Cloud Provider is designed to be installed in the external parent management cluster. Thus, creating services of type LoadBalancer in the tenant cluster initiates the creation of LoadBalancer services in the parent cluster, which direct traffic into the tenant cluster.

A diagram showing a Cloud Controller Manager installed outside of a tenant Kubernetes cluster on a scheme of nested Kubernetes clusters and the mapping of services it manages from the parent to the child Kubernetes cluster

CSI Driver

The Container Storage Interface (CSI) is divided into two main parts for interacting with storage in Kubernetes:

  • csi-controller: This component is responsible for interacting with the cloud provider's API to create, delete, attach, detach, and resize volumes.

  • csi-node: This component runs on each node and facilitates the mounting of volumes to pods as requested by kubelet.

In the context of using the KubeVirt CSI Driver, a unique opportunity arises. Since virtual machines in KubeVirt run within the management Kubernetes cluster, where a full-fledged Kubernetes API is available, this opens the path for running the csi-controller outside of the user's tenant cluster. This approach is popular in the KubeVirt community and offers several key advantages:

  • Security: This method hides the internal cloud API from the end-user, providing access to resources exclusively through the Kubernetes interface. Thus, it reduces the risk of direct access to the management cluster from user clusters.

  • Simplicity and Convenience: Users don't need to manage additional controllers in their clusters, simplifying the architecture and reducing the management burden.

However, the CSI-node must necessarily run inside the tenant cluster, as it directly interacts with kubelet on each node. This component is responsible for the mounting and unmounting of volumes into pods, requiring close integration with processes occurring directly on the cluster nodes.

The KubeVirt CSI Driver acts as a proxy for ordering volumes. When a PVC is created inside the tenant cluster, a PVC is created in the management cluster, and then the created PV is connected to the virtual machine.

A diagram showing the CSI plugin components installed both inside and outside of a tenant Kubernetes cluster on a scheme of nested Kubernetes clusters and the mapping of persistent volumes it manages from the parent to the child Kubernetes cluster

Cluster Autoscaler

The Cluster Autoscaler is a versatile component that can work with various cloud APIs, and its integration with Cluster API is just one of the available functions. For proper configuration, it requires access to two clusters: the tenant cluster, to track pods and determine the need for adding new nodes, and the management Kubernetes cluster, where it interacts with the MachineDeployment resource and adjusts the number of replicas.

Although Cluster Autoscaler usually runs inside the tenant Kubernetes cluster, in this situation, it is suggested to install it outside for the same reasons described before. This approach is simpler to maintain and more secure as it prevents users of tenant clusters from accessing the management API of the management cluster.

A diagram showing a Cluster Autoscaler installed outside of a tenant Kubernetes cluster on a scheme of nested Kubernetes clusters

Konnectivity

There's another additional component I'd like to mention - Konnectivity. You will likely need it later on to get webhooks and the API aggregation layer working in your tenant Kubernetes cluster. This topic is covered in detail in one of my previous articles.

Unlike the components presented above, Kamaji allows you to easily enable Konnectivity and manage it as one of the core components of your tenant cluster, alongside kube-proxy and CoreDNS.

Conclusion

Now you have a fully functional Kubernetes cluster with the capability for dynamic scaling, automatic provisioning of volumes, and load balancers.

Going forward, you might consider metrics and logs collection from your tenant clusters, but that goes beyond the scope of this article.

Of course, all the components necessary for deploying a Kubernetes cluster can be packaged into a single Helm chart and deployed as a unified application. This is precisely how we organize the deployment of managed Kubernetes clusters with the click of a button on our open PaaS platform, Cozystack, where you can try all the technologies described in the article for free.

Spotlight on SIG Architecture: Code Organization

By Frederico Muñoz (SAS Institute) | Thursday, April 11, 2024

This is the third interview of a SIG Architecture Spotlight series that will cover the different subprojects. We will cover SIG Architecture: Code Organization.

In this SIG Architecture spotlight I talked with Madhav Jivrajani (VMware), a member of the Code Organization subproject.

Introducing the Code Organization subproject

Frederico (FSM): Hello Madhav, thank you for your availability. Could you start by telling us a bit about yourself, your role and how you got involved in Kubernetes?

Madhav Jivrajani (MJ): Hello! My name is Madhav Jivrajani, I serve as a technical lead for SIG Contributor Experience and a GitHub Admin for the Kubernetes project. Apart from that I also contribute to SIG API Machinery and SIG Etcd, but more recently, I’ve been helping out with the work that is needed to help Kubernetes stay on supported versions of Go, and it is through this that I am involved with the Code Organization subproject of SIG Architecture.

FSM: A project the size of Kubernetes must have unique challenges in terms of code organization -- is this a fair assumption? If so, what would you pick as some of the main challenges that are specific to Kubernetes?

MJ: That’s a fair assumption! The first interesting challenge comes from the sheer size of the Kubernetes codebase. We have ≅2.2 million lines of Go code (which is steadily decreasing thanks to dims and other folks in this sub-project!), and a little over 240 dependencies that we rely on either directly or indirectly, which is why having a sub-project dedicated to helping out with dependency management is crucial: we need to know what dependencies we’re pulling in, what versions these dependencies are at, and tooling to help make sure we are managing these dependencies across different parts of the codebase in a consistent manner.

Another interesting challenge with Kubernetes is that we publish a lot of Go modules as part of the Kubernetes release cycles; one example of this is client-go. However, we as a project would also like the benefits of having everything in one repository to get the advantages of using a monorepo, like atomic commits... so, because of this, code organization works with other SIGs (like SIG Release) to automate the process of publishing code from the monorepo to downstream individual repositories which are much easier to consume, and this way you won't have to import the entire Kubernetes codebase!

Code organization and Kubernetes

FSM: For someone just starting contributing to Kubernetes code-wise, what are the main things they should consider in terms of code organization? How would you sum up the key concepts?

MJ: I think one of the key things to keep in mind at least as you’re starting off is the concept of staging directories. In the kubernetes/kubernetes repository, you will come across a directory called staging/. The sub-folders in this directory serve as a bunch of pseudo-repositories. For example, the kubernetes/client-go repository that publishes releases for client-go is actually a staging repo.

FSM: So the concept of staging directories fundamentally impacts contributions?

MJ: Precisely, because if you’d like to contribute to any of the staging repos, you will need to send in a PR to its corresponding staging directory in kubernetes/kubernetes. Once the code merges there, we have a bot called the publishing-bot that will sync the merged commits to the required staging repositories (like kubernetes/client-go). This way we get the benefits of a monorepo but we also can modularly publish code for downstream consumption. PS: The publishing-bot needs more folks to help out!

For more information on staging repositories, please see the contributor documentation.

FSM: Speaking of contributions, the very high number of contributors, both individuals and companies, must also be a challenge: how does the subproject operate in terms of making sure that standards are being followed?

MJ: When it comes to dependency management in the project, there is a dedicated team that helps review and approve dependency changes. These are folks who have helped lay the foundation of much of the tooling that Kubernetes uses today for dependency management. This tooling helps ensure there is a consistent way that contributors can make changes to dependencies. The project has also worked on additional tooling to signal statistics of dependencies that are being added or removed: depstat.

Apart from dependency management, another crucial task that the project does is management of the staging repositories. The tooling for achieving this (publishing-bot) is completely transparent to contributors and helps ensure that the staging repos get a consistent view of contributions that are submitted to kubernetes/kubernetes.

Code Organization also works towards making sure that Kubernetes stays on supported versions of Go. The linked KEP provides more context on why we need to do this. We collaborate with SIG Release to ensure that we are testing Kubernetes as rigorously and as early as we can on Go releases and working on changes that break our CI as a part of this. An example of how we track this process can be found here.

Release cycle and current priorities

FSM: Is there anything that changes during the release cycle?

MJ: During the release cycle, specifically before code freeze, there are often changes that go in that add/update/delete dependencies, or fix code that needs fixing as part of our effort to stay on supported versions of Go.

Furthermore, some of these changes are also candidates for backporting to our supported release branches.

FSM: Is there any major project or theme the subproject is working on right now that you would like to highlight?

MJ: I think one very interesting and immensely useful change that has been recently added (and I take the opportunity to specifically highlight the work of Tim Hockin on this) is the introduction of Go workspaces to the Kubernetes repo. A lot of our current tooling for dependency management and code publishing, as well as the experience of editing code in the Kubernetes repo, can be significantly improved by this change.

Wrapping up

FSM: How would someone interested in the topic start helping the subproject?

MJ: The first step, as is the first step with any project in Kubernetes, is to join our Slack: slack.k8s.io, and after that join the #k8s-code-organization channel. There are also code organization office hours that you can choose to attend. Timezones are hard, so feel free to also look at the recordings or meeting notes and follow up on Slack!

FSM: Excellent, thank you! Any final comments you would like to share?

MJ: The Code Organization subproject always needs help! Especially areas like the publishing bot, so don’t hesitate to get involved in the #k8s-code-organization Slack channel.

Kubernetes v1.30: Uwubernetes

By Kubernetes v1.30 Release Team | Wednesday, April 17, 2024

Editors: Amit Dsouza, Frederick Kautz, Kristin Martin, Abigail McCarthy, Natali Vlatko

Announcing the release of Kubernetes v1.30: Uwubernetes, the cutest release!

Similar to previous releases, the release of Kubernetes v1.30 introduces new stable, beta, and alpha features. The consistent delivery of top-notch releases underscores the strength of our development cycle and the vibrant support from our community.

This release consists of 45 enhancements. Of those enhancements, 17 have graduated to Stable, 18 are entering Beta, and 10 have graduated to Alpha.

Kubernetes v1.30: Uwubernetes

Kubernetes v1.30 Uwubernetes logo

Kubernetes v1.30 makes your clusters cuter!

Kubernetes is built and released by thousands of people from all over the world and all walks of life. Most contributors are not being paid to do this; we build it for fun, to solve a problem, to learn something, or for the simple love of the community. Many of us found our homes, our friends, and our careers here. The Release Team is honored to be a part of the continued growth of Kubernetes.

For the people who built it, for the people who release it, and for the furries who keep all of our clusters online, we present to you Kubernetes v1.30: Uwubernetes, the cutest release to date. The name is a portmanteau of “kubernetes” and “UwU,” an emoticon used to indicate happiness or cuteness. We’ve found joy here, but we’ve also brought joy from our outside lives that helps to make this community as weird and wonderful and welcoming as it is. We’re so happy to share our work with you.

UwU ♥️

Improvements that graduated to stable in Kubernetes v1.30

This is a selection of some of the improvements that are now stable following the v1.30 release.

Robust VolumeManager reconstruction after kubelet restart (SIG Storage)

This is a volume manager refactoring that allows the kubelet to populate additional information about how existing volumes are mounted during the kubelet startup. In general, this makes volume cleanup after kubelet restart or machine reboot more robust.

This does not bring any changes for users or cluster administrators. We used the feature process and feature gate NewVolumeManagerReconstruction to be able to fall back to the previous behavior in case something went wrong. Now that the feature is stable, the feature gate is locked and cannot be disabled.

Prevent unauthorized volume mode conversion during volume restore (SIG Storage)

For Kubernetes v1.30, the control plane always prevents unauthorized changes to volume modes when restoring a snapshot into a PersistentVolume. As a cluster administrator, you'll need to grant permissions to the appropriate identity principals (for example: ServiceAccounts representing a storage integration) if you need to allow that kind of change at restore time.

Warning:

Action required before upgrading. The prevent-volume-mode-conversion feature flag is enabled by default in the external-provisioner v4.0.0 and external-snapshotter v7.0.0. Volume mode change will be rejected when creating a PVC from a VolumeSnapshot unless you perform the steps described in the "Urgent Upgrade Notes" sections for the external-provisioner 4.0.0 and the external-snapshotter v7.0.0.

For more information on this feature also read converting the volume mode of a Snapshot.

Pod Scheduling Readiness (SIG Scheduling)

Pod scheduling readiness graduates to stable this release, after being promoted to beta in Kubernetes v1.27.

This now-stable feature lets Kubernetes avoid trying to schedule a Pod that has been defined, when the cluster doesn't yet have the resources provisioned to allow actually binding that Pod to a node. That's not the only use case; the custom control on whether a Pod can be allowed to schedule also lets you implement quota mechanisms, security controls, and more.

Crucially, marking these Pods as exempt from scheduling cuts the work that the scheduler would otherwise do, churning through Pods that can't or won't schedule onto the nodes your cluster currently has. If you have cluster autoscaling active, using scheduling gates doesn't just cut the load on the scheduler, it can also save money. Without scheduling gates, the autoscaler might otherwise launch a node that doesn't need to be started.

In Kubernetes v1.30, by specifying (or removing) a Pod's .spec.schedulingGates, you can control when a Pod is ready to be considered for scheduling. This is a stable feature and is now formally part of the Kubernetes API definition for Pod.
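For example, a Pod created with a scheduling gate stays unscheduled (reported as SchedulingGated) until the gate is removed; the gate name below is arbitrary:

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  schedulingGates:
    - name: example.com/quota-check   # arbitrary gate name; remove it to make the Pod schedulable
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9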

Min domains in PodTopologySpread (SIG Scheduling)

The minDomains parameter for PodTopologySpread constraints graduates to stable this release, which allows you to define the minimum number of domains. This feature is designed to be used with Cluster Autoscaler.

If you previously attempted to use this feature and there weren't enough domains already present, Pods would be marked as unschedulable. The Cluster Autoscaler would then provision node(s) in new domain(s), and you'd eventually get Pods spreading over enough domains.
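A spread constraint using minDomains might look like this; the label selector and the value of 3 are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: example-spread
  labels:
    app: web
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      minDomains: 3                              # fewer than 3 zones counts as unsatisfied, nudging the autoscaler
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule           # minDomains only applies with DoNotSchedule
      labelSelector:
        matchLabels:
          app: web
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9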

Go workspaces for k/k (SIG Architecture)

The Kubernetes repo now uses Go workspaces. This should not impact end users at all, but it does have an impact on developers of downstream projects. Switching to workspaces caused some breaking changes in the flags to the various k8s.io/code-generator tools. Downstream consumers should look at staging/src/k8s.io/code-generator/kube_codegen.sh to see the changes.

For full details on the changes and reasons why Go workspaces was introduced, read Using Go workspaces in Kubernetes.

Improvements that graduated to beta in Kubernetes v1.30

This is a selection of some of the improvements that are now beta following the v1.30 release.

Node log query (SIG Windows)

To help with debugging issues on nodes, Kubernetes v1.27 introduced a feature that allows fetching logs of services running on the node. To use the feature, ensure that the NodeLogQuery feature gate is enabled for that node, and that the kubelet configuration options enableSystemLogHandler and enableSystemLogQuery are both set to true.

Following the v1.30 release, this is now beta (you still need to enable the feature to use it, though).

On Linux the assumption is that service logs are available via journald. On Windows the assumption is that service logs are available in the application log provider. Logs are also available by reading files within /var/log/ (Linux) or C:\var\log\ (Windows). For more information, see the log query documentation.
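
A minimal kubelet configuration sketch that turns the feature on for a node might look like this (the rest of the kubelet configuration is assumed):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  NodeLogQuery: true
enableSystemLogHandler: true
enableSystemLogQuery: true
# Logs can then be fetched through the kubelet's proxied "logs" endpoint, for example:
#   kubectl get --raw "/api/v1/nodes/<node-name>/proxy/logs/?query=kubelet"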

CRD validation ratcheting (SIG API Machinery)

You need to enable the CRDValidationRatcheting feature gate to use this behavior, which then applies to all CustomResourceDefinitions in your cluster.

Provided you enabled the feature gate, Kubernetes implements validation ratcheting for CustomResourceDefinitions. The API server is willing to accept updates to resources that are not valid after the update, provided that each part of the resource that failed to validate was not changed by the update operation. In other words, any invalid part of the resource that remains invalid must have already been wrong. You cannot use this mechanism to update a valid resource so that it becomes invalid.

This feature allows authors of CRDs to confidently add new validations to the OpenAPIV3 schema under certain conditions. Users can update to the new schema safely without bumping the version of the object or breaking workflows.
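
As a hypothetical example, suppose the author of a widgets.example.com CRD (a made-up resource) tightens the schema by adding a maxLength constraint to an existing string field. With ratcheting in effect, existing Widget objects whose displayName already exceeds the limit can still be updated, as long as the update does not change displayName itself:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com
spec:
  group: example.com
  names:
    kind: Widget
    plural: widgets
    singular: widget
  scope: Namespaced
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              displayName:
                type: string
                # newly added validation; pre-existing objects that violate it
                # are only rejected if an update modifies this field
                maxLength: 63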

Contextual logging (SIG Instrumentation)

Contextual Logging advances to beta in this release, empowering developers and operators to inject customizable, correlatable contextual details like service names and transaction IDs into logs through WithValues and WithName. This enhancement simplifies the correlation and analysis of log data across distributed systems, significantly improving the efficiency of troubleshooting efforts. By offering a clearer insight into the workings of your Kubernetes environments, Contextual Logging ensures that operational challenges are more manageable, marking a notable step forward in Kubernetes observability.

Make Kubernetes aware of the LoadBalancer behaviour (SIG Network)

The LoadBalancerIPMode feature gate is now beta and is now enabled by default. This feature allows you to set the .status.loadBalancer.ingress.ipMode for a Service with type set to LoadBalancer. The .status.loadBalancer.ingress.ipMode specifies how the load-balancer IP behaves. It may be specified only when the .status.loadBalancer.ingress.ip field is also specified. See more details about specifying IPMode of load balancer status.
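
You don't set this field yourself; the cloud provider's controller populates it. An illustrative status excerpt for a Service whose load-balancer traffic is delivered through a proxy rather than to the virtual IP directly could look like this:

# excerpt of a Service's status, written by the cloud controller manager
status:
  loadBalancer:
    ingress:
    - ip: 203.0.113.10
      ipMode: Proxy   # the other supported value is VIP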

Structured Authentication Configuration (SIG Auth)

Structured Authentication Configuration graduates to beta in this release.

Kubernetes has had a long-standing need for a more flexible and extensible authentication system. The current system, while powerful, has some limitations that make it difficult to use in certain scenarios. For example, it is not possible to use multiple authenticators of the same type (e.g., multiple JWT authenticators) or to change the configuration without restarting the API server. The Structured Authentication Configuration feature is the first step towards addressing these limitations and providing a more flexible and extensible way to configure authentication in Kubernetes. See more details about structured authentication configuration.

Structured Authorization Configuration (SIG Auth)

Structured Authorization Configuration graduates to beta in this release.

Kubernetes continues to evolve to meet the intricate requirements of system administrators and developers alike. A critical aspect of Kubernetes that ensures the security and integrity of the cluster is the API server authorization. Until recently, the configuration of the authorization chain in kube-apiserver was somewhat rigid, limited to a set of command-line flags and allowing only a single webhook in the authorization chain. This approach, while functional, restricted the flexibility needed by cluster administrators to define complex, fine-grained authorization policies. The latest Structured Authorization Configuration feature aims to revolutionize this aspect by introducing a more structured and versatile way to configure the authorization chain, focusing on enabling multiple webhooks and providing explicit control mechanisms. See more details about structured authorization configuration.

New alpha features

Speed up recursive SELinux label change (SIG Storage)

From the v1.27 release, Kubernetes already included an optimization that sets SELinux labels on the contents of volumes, using only constant time. Kubernetes achieves that speed up using a mount option. The slower legacy behavior requires the container runtime to recursively walk through the whole volume and apply SELinux labelling individually to each file and directory; this is especially noticeable for volumes with a large number of files and directories.

Kubernetes v1.27 graduated this feature as beta, but limited it to ReadWriteOncePod volumes. The corresponding feature gate is SELinuxMountReadWriteOncePod. It's still enabled by default and remains beta in v1.30.

Kubernetes v1.30 extends support for SELinux mount option to all volumes as alpha, with a separate feature gate: SELinuxMount. This feature gate introduces a behavioral change when multiple Pods with different SELinux labels share the same volume. See KEP for details.

We strongly encourage users that run Kubernetes with SELinux enabled to test this feature and provide any feedback on the KEP issue.

Feature gate                  | Stage in v1.30 | Behavior change
SELinuxMountReadWriteOncePod  | Beta           | No
SELinuxMount                  | Alpha          | Yes

Both feature gates SELinuxMountReadWriteOncePod and SELinuxMount must be enabled to test this feature on all volumes.
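
For example, on the kubelet the gates can be switched on through its configuration file; this is only a sketch, and other components may also need the gates enabled depending on your setup:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  SELinuxMountReadWriteOncePod: true   # beta, enabled by default
  SELinuxMount: true                   # alpha, disabled by default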

This feature has no effect on Windows nodes or on Linux nodes without SELinux support.

Recursive Read-only (RRO) mounts (SIG Node)

Introduced as alpha in this release, Recursive Read-Only (RRO) mounts add a new layer of security for your data. This feature lets you set volumes and their submounts as read-only, preventing accidental modifications. Imagine deploying a critical application where data integrity is key: RRO mounts ensure that your data stays untouched, reinforcing your cluster's security with an extra safeguard. This is especially crucial in tightly controlled environments, where even the slightest change can have significant implications.

Job success/completion policy (SIG Apps)

From Kubernetes v1.30, indexed Jobs support .spec.successPolicy to define when a Job can be declared succeeded based on succeeded Pods. This allows you to define two types of criteria:

  • succeededIndexes indicates that the Job can be declared succeeded when these indexes succeeded, even if other indexes failed.

  • succeededCount indicates that the Job can be declared succeeded when the number of succeeded indexes reaches this threshold.

After the Job meets the success policy, the Job controller terminates the lingering Pods.
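
A sketch of an Indexed Job that is declared successful as soon as index 0 succeeds, regardless of the other indexes (the image and command are placeholders, and the JobSuccessPolicy feature gate must be enabled):

apiVersion: batch/v1
kind: Job
metadata:
  name: job-with-success-policy
spec:
  completions: 10
  parallelism: 10
  completionMode: Indexed        # successPolicy only applies to Indexed Jobs
  successPolicy:
    rules:
    - succeededIndexes: "0"      # the Job succeeds once index 0 succeeds
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox
        command: ["sh", "-c", "echo done"]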

Traffic distribution for services (SIG Network)

Kubernetes v1.30 introduces the spec.trafficDistribution field within a Kubernetes Service as alpha. This allows you to express preferences for how traffic should be routed to Service endpoints. While traffic policies focus on strict semantic guarantees, traffic distribution allows you to express preferences (such as routing to topologically closer endpoints). This can help optimize for performance, cost, or reliability. You can use this field by enabling the ServiceTrafficDistribution feature gate for your cluster and all of its nodes. In Kubernetes v1.30, the following field value is supported:

PreferClose: Indicates a preference for routing traffic to endpoints that are topologically proximate to the client. The interpretation of "topologically proximate" may vary across implementations and could encompass endpoints within the same node, rack, zone, or even region. Setting this value gives implementations permission to make different tradeoffs, for example optimizing for proximity rather than equal distribution of load. You should not set this value if such tradeoffs are not acceptable.

If the field is not set, the implementation (like kube-proxy) will apply its default routing strategy.
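
A minimal sketch of a Service opting in to the new preference (the selector and ports are placeholders):

apiVersion: v1
kind: Service
metadata:
  name: example
spec:
  selector:
    app: example
  ports:
  - port: 80
    targetPort: 8080
  # requires the ServiceTrafficDistribution feature gate to be enabled
  trafficDistribution: PreferClose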

See Traffic Distribution for more details.

Storage Version Migration (SIG API Machinery)

Kubernetes v1.30 introduces a new built-in API for StorageVersionMigration. Kubernetes relies on API data being actively re-written to support some maintenance activities related to storage at rest. Two prominent examples are the versioned schema of stored resources (that is, the preferred storage schema changing from v1 to v2 for a given resource) and encryption at rest (that is, rewriting stale data based on a change in how the data should be encrypted).

StorageVersionMigration is an alpha API that was previously available out of tree.

See storage version migration for more details.
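
As a sketch, a migration that asks the control plane to rewrite all stored Secrets (for example, after rotating the encryption-at-rest configuration) could look like this; being an alpha API, it first has to be enabled on the API server:

apiVersion: storagemigration.k8s.io/v1alpha1
kind: StorageVersionMigration
metadata:
  name: secrets-migration
spec:
  resource:
    group: ""        # core API group
    version: v1
    resource: secrets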

Graduations, deprecations and removals for Kubernetes v1.30

Graduated to stable

This lists all the features that graduated to stable (also known as general availability). For a full list of updates including new features and graduations from alpha to beta, see the release notes.

This release includes a total of 17 enhancements promoted to Stable:

Deprecations and removals

Removed the SecurityContextDeny admission plugin, deprecated since v1.27

(SIG Auth, SIG Security, and SIG Testing) With the removal of the SecurityContextDeny admission plugin, the Pod Security Admission plugin, available since v1.25, is recommended instead.

Release notes

Check out the full details of the Kubernetes v1.30 release in our release notes.

Availability

Kubernetes v1.30 is available for download on GitHub. To get started with Kubernetes, check out these interactive tutorials or run local Kubernetes clusters using minikube. You can also easily install v1.30 using kubeadm.

Release team

Kubernetes is only possible with the support, commitment, and hard work of its community. Each release team is made up of dedicated community volunteers who work together to build the many pieces that make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management.

We would like to thank the entire release team for the hours spent hard at work to deliver the Kubernetes v1.30 release to our community. The Release Team's membership ranges from first-time shadows to returning team leads with experience forged over several release cycles. A very special thanks goes out to our release lead, Kat Cosgrove, for supporting us through a successful release cycle, advocating for us, making sure that we could all contribute in the best way possible, and challenging us to improve the release process.

Project velocity

The CNCF K8s DevStats project aggregates a number of interesting data points related to the velocity of Kubernetes and various sub-projects. This includes everything from individual contributions to the number of companies that are contributing and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.

In the v1.30 release cycle, which ran for 14 weeks (January 8 to April 17), we saw contributions from 863 companies and 1391 individuals.

Event update

  • KubeCon + CloudNativeCon China 2024 will take place in Hong Kong, from 21 – 23 August 2024! You can find more information about the conference and registration on the event site.

  • KubeCon + CloudNativeCon North America 2024 will take place in Salt Lake City, Utah, the United States of America, from 12 – 15 November 2024! You can find more information about the conference and registration on the event site.

Upcoming release webinar

Join members of the Kubernetes v1.30 release team on Thursday, May 23rd, 2024, at 9 A.M. PT to learn about the major features of this release, as well as deprecations and removals to help plan for upgrades. For more information and registration, visit the event page on the CNCF Online Programs site.

Get involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.

This blog was updated on April 19th, 2024 to highlight two additional changes not originally included in the release blog.

Kubernetes 1.30: Beta Support For Pods With User Namespaces

By Rodrigo Campos Catelin (Microsoft), Giuseppe Scrivano (Red Hat), Sascha Grunert (Red Hat) | Monday, April 22, 2024

Linux provides different namespaces to isolate processes from each other. For example, a typical Kubernetes pod runs within a network namespace to isolate the network identity and a PID namespace to isolate the processes.

One Linux namespace that was left behind is the user namespace. This namespace allows us to isolate the user and group identifiers (UIDs and GIDs) we use inside the container from the ones on the host.

This is a powerful abstraction that allows us to run containers as "root": we are root inside the container and can do everything root can inside the pod, but our interactions with the host are limited to what a non-privileged user can do. This is great for limiting the impact of a container breakout.

A container breakout is when a process inside a container can break out onto the host using some unpatched vulnerability in the container runtime or the kernel and can access/modify files on the host or other containers. If we run our pods with user namespaces, the privileges the container has over the rest of the host are reduced, and the files outside the container it can access are limited too.

In Kubernetes v1.25, we introduced support for user namespaces only for stateless pods. Kubernetes 1.28 lifted that restriction, and now, with Kubernetes 1.30, we are moving to beta!

What is a user namespace?

Note: Linux user namespaces are a different concept from Kubernetes namespaces. The former is a Linux kernel feature; the latter is a Kubernetes feature.

User namespaces are a Linux feature that isolates the UIDs and GIDs of the containers from the ones on the host. The identifiers in the container can be mapped to identifiers on the host in a way where the host UID/GIDs used for different containers never overlap. Furthermore, the identifiers can be mapped to unprivileged, non-overlapping UIDs and GIDs on the host. This brings two key benefits:

  • Prevention of lateral movement: As the UIDs and GIDs for different containers are mapped to different UIDs and GIDs on the host, containers have a harder time attacking each other, even if they escape the container boundaries. For example, suppose container A runs with different UIDs and GIDs on the host than container B. In that case, the operations it can do on container B's files and processes are limited: it can only read/write what a file allows to others, as it will never have owner or group permission (the UIDs/GIDs on the host are guaranteed to be different for different containers).

  • Increased host isolation: As the UIDs and GIDs are mapped to unprivileged users on the host, if a container escapes the container boundaries, even if it runs as root inside the container, it has no privileges on the host. This greatly protects what host files it can read/write, which process it can send signals to, etc. Furthermore, capabilities granted are only valid inside the user namespace and not on the host, limiting the impact a container escape can have.

Image showing IDs 0-65535 are reserved to the host, pods use higher IDs

User namespace IDs allocation

Without using a user namespace, a container running as root in the case of a container breakout has root privileges on the node. If some capabilities were granted to the container, the capabilities are valid on the host too. None of this is true when using user namespaces (modulo bugs, of course 🙂).

Changes in 1.30

In Kubernetes 1.30, besides moving user namespaces to beta, the contributors working on this feature:

  • Introduced a way for the kubelet to use custom ranges for the UIDs/GIDs mapping

  • Have added a way for Kubernetes to enforce that the runtime supports all the features needed for user namespaces. If they are not supported, Kubernetes will show a clear error when trying to create a pod with user namespaces. Before 1.30, if the container runtime didn't support user namespaces, the pod could be created without a user namespace.

  • Added more tests, including tests in the cri-tools repository.

You can check the documentation on user namespaces for how to configure custom ranges for the mapping.
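
A minimal sketch of a pod that runs in its own user namespace; in v1.30 the feature is beta, so the UserNamespacesSupport feature gate still needs to be enabled and the node must meet the requirements described below:

apiVersion: v1
kind: Pod
metadata:
  name: userns-pod
spec:
  hostUsers: false          # false gives this pod its own user namespace
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9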

Demo

A few months ago, CVE-2024-21626 was disclosed. This vulnerability has a CVSS score of 8.6 (HIGH). It allows an attacker to escape a container and read/write to any path on the node and other pods hosted on the same node.

Rodrigo created a demo that exploits CVE-2024-21626 and shows how the exploit, which works without user namespaces, is mitigated when user namespaces are in use.

Please note that with user namespaces, an attacker can do on the host file system what the permission bits for "others" allow. Therefore, the CVE is not completely prevented, but the impact is greatly reduced.

Node system requirements

There are requirements on the Linux kernel version and the container runtime to use this feature.

On Linux you need Linux 6.3 or greater. This is because the feature relies on a kernel feature named idmap mounts, and support for using idmap mounts with tmpfs was merged in Linux 6.3.

If you are using CRI-O with crun then, as always, you can expect support for Kubernetes 1.30 with CRI-O 1.30. Please note that you also need crun 1.9 or greater. If you are using CRI-O with runc, this is still not supported.

Containerd support is currently targeted for containerd 2.0, and the same crun version requirements apply. If you are using containerd with runc, this is still not supported.

Please note that containerd 1.7 added experimental support for user namespaces, as implemented in Kubernetes 1.25 and 1.26. We did a redesign in Kubernetes 1.27, which requires changes in the container runtime. Those changes are not present in containerd 1.7, so it only works with user namespaces support in Kubernetes 1.25 and 1.26.

Another limitation of containerd 1.7 is that it needs to change the ownership of every file and directory inside the container image during Pod startup. This has a storage overhead and can significantly impact the container startup latency. Containerd 2.0 will probably include an implementation that will eliminate the added startup latency and storage overhead. Consider this if you plan to use containerd 1.7 with user namespaces in production.

None of these containerd 1.7 limitations apply to CRI-O.

Kubernetes 1.30: Read-only volume mounts can be finally literally read-only

By Akihiro Suda (NTT) | Tuesday, April 23, 2024

Read-only volume mounts have been a feature of Kubernetes since the beginning. Surprisingly, read-only mounts are not completely read-only under certain conditions on Linux. As of the v1.30 release, they can be made completely read-only, with alpha support for recursive read-only mounts.

Read-only volume mounts are not really read-only by default

Volume mounts can be deceptively complicated.

You might expect that the following manifest makes everything under /mnt in the containers read-only:

---
apiVersion: v1
kind: Pod
spec:
  volumes:
    - name: mnt
      hostPath:
        path: /mnt
  containers:
    - volumeMounts:
        - name: mnt
          mountPath: /mnt
          readOnly: true

However, any sub-mounts beneath /mnt may still be writable! For example, consider that /mnt/my-nfs-server is writeable on the host. Inside the container, writes to /mnt/* will be rejected but /mnt/my-nfs-server/* will still be writeable.

New mount option: recursiveReadOnly

Kubernetes 1.30 added a new mount option recursiveReadOnly to make submounts recursively read-only.

The option can be enabled as follows:

---
apiVersion: v1
kind: Pod
spec:
  volumes:
    - name: mnt
      hostPath:
        path: /mnt
  containers:
    - volumeMounts:
        - name: mnt
          mountPath: /mnt
          readOnly: true
          # NEW
          # Possible values are `Enabled`, `IfPossible`, and `Disabled`.
          # Needs to be specified in conjunction with `readOnly: true`.
          recursiveReadOnly: Enabled

This is implemented by applying the MOUNT_ATTR_RDONLY attribute with the AT_RECURSIVE flag using mount_setattr(2) added in Linux kernel v5.12.

For backwards compatibility, the recursiveReadOnly field is not a replacement for readOnly, but is used in conjunction with it. To get a properly recursive read-only mount, you must set both fields.

Feature availability

To enable recursiveReadOnly mounts, the following components have to be used:

  • Kubernetes: v1.30 or later, with the RecursiveReadOnlyMounts feature gate enabled. As of v1.30, the gate is marked as alpha.

  • CRI runtime:

    • containerd: v2.0 or later

  • OCI runtime:

    • runc: v1.1 or later

    • crun: v1.8.6 or later

  • Linux kernel: v5.12 or later

What's next?

Kubernetes SIG Node hopes - and expects - that the feature will be promoted to beta and eventually general availability (GA) in future releases of Kubernetes, so that users no longer need to enable the feature gate manually.

The default value of recursiveReadOnly will still remain Disabled, for backwards compatibility.

How can I learn more?

Please check out the documentation for the further details of recursiveReadOnly mounts.

How to get involved?

This feature is driven by the SIG Node community. Please join us to connect with the community and share your ideas and feedback around the above feature and beyond. We look forward to hearing from you!

Kubernetes 1.30: Validating Admission Policy Is Generally Available

By Jiahui Feng (Google) | Wednesday, April 24, 2024

On behalf of the Kubernetes project, I am excited to announce that ValidatingAdmissionPolicy has reached general availability as part of Kubernetes 1.30 release. If you have not yet read about this new declarative alternative to validating admission webhooks, it may be interesting to read our previous post about the new feature. If you have already heard about ValidatingAdmissionPolicies and you are eager to try them out, there is no better time to do it than now.

Let's have a taste of a ValidatingAdmissionPolicy, by replacing a simple webhook.

Example admission webhook

First, let's take a look at an example of a simple webhook. Here is an excerpt from a webhook that enforces runAsNonRoot, readOnlyRootFilesystem, allowPrivilegeEscalation, and privileged to be set to the least permissive values.

func verifyDeployment(deploy *appsv1.Deployment) error {
    var errs []error
    for i, c := range deploy.Spec.Template.Spec.Containers {
        if c.Name == "" {
            return fmt.Errorf("container %d has no name", i)
        }
        if c.SecurityContext == nil {
            errs = append(errs, fmt.Errorf("container %q does not have SecurityContext", c.Name))
            // skip the remaining checks to avoid dereferencing a nil SecurityContext
            continue
        }
        if c.SecurityContext.RunAsNonRoot == nil || !*c.SecurityContext.RunAsNonRoot {
            errs = append(errs, fmt.Errorf("container %q must set RunAsNonRoot to true in its SecurityContext", c.Name))
        }
        if c.SecurityContext.ReadOnlyRootFilesystem == nil || !*c.SecurityContext.ReadOnlyRootFilesystem {
            errs = append(errs, fmt.Errorf("container %q must set ReadOnlyRootFilesystem to true in its SecurityContext", c.Name))
        }
        if c.SecurityContext.AllowPrivilegeEscalation != nil && *c.SecurityContext.AllowPrivilegeEscalation {
            errs = append(errs, fmt.Errorf("container %q must NOT set AllowPrivilegeEscalation to true in its SecurityContext", c.Name))
        }
        if c.SecurityContext.Privileged != nil && *c.SecurityContext.Privileged {
            errs = append(errs, fmt.Errorf("container %q must NOT set Privileged to true in its SecurityContext", c.Name))
        }
    }
    return errors.NewAggregate(errs)
}

Check out What are admission webhooks? Or, see the full code of this webhook to follow along with this walkthrough.

The policy

Now let's try to recreate the validation faithfully with a ValidatingAdmissionPolicy.

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: "pod-security.policy.example.com"
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups:   ["apps"]
      apiVersions: ["v1"]
      operations:  ["CREATE", "UPDATE"]
      resources:   ["deployments"]
  validations:
  - expression: object.spec.template.spec.containers.all(c, has(c.securityContext) && has(c.securityContext.runAsNonRoot) && c.securityContext.runAsNonRoot)
    message: 'all containers must set runAsNonRoot to true'
  - expression: object.spec.template.spec.containers.all(c, has(c.securityContext) && has(c.securityContext.readOnlyRootFilesystem) && c.securityContext.readOnlyRootFilesystem)
    message: 'all containers must set readOnlyRootFilesystem to true'
  - expression: object.spec.template.spec.containers.all(c, !has(c.securityContext) || !has(c.securityContext.allowPrivilegeEscalation) || !c.securityContext.allowPrivilegeEscalation)
    message: 'all containers must NOT set allowPrivilegeEscalation to true'
  - expression: object.spec.template.spec.containers.all(c, !has(c.securityContext) || !has(c.securityContext.Privileged) || !c.securityContext.Privileged)
    message: 'all containers must NOT set privileged to true'

Create the policy with kubectl. Great, no complaints so far. But let's get the policy object back and take a look at its status.

kubectl get -oyaml validatingadmissionpolicies/pod-security.policy.example.com
  status:
    typeChecking:
      expressionWarnings:
      - fieldRef: spec.validations[3].expression
        warning: |
          apps/v1, Kind=Deployment: ERROR: <input>:1:76: undefined field 'Privileged'
           | object.spec.template.spec.containers.all(c, !has(c.securityContext) || !has(c.securityContext.Privileged) || !c.securityContext.Privileged)
           | ...........................................................................^
          ERROR: <input>:1:128: undefined field 'Privileged'
           | object.spec.template.spec.containers.all(c, !has(c.securityContext) || !has(c.securityContext.Privileged) || !c.securityContext.Privileged)
           | ...............................................................................................................................^

The policy was checked against its matched type, which is apps/v1.Deployment. Looking at the fieldRef, the problem was with the 3rd expression (index starts with 0). The expression in question accessed an undefined Privileged field. Ahh, looks like it was a copy-and-paste error. The field name should be in lowercase.

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: "pod-security.policy.example.com"
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups:   ["apps"]
      apiVersions: ["v1"]
      operations:  ["CREATE", "UPDATE"]
      resources:   ["deployments"]
  validations:
  - expression: object.spec.template.spec.containers.all(c, has(c.securityContext) && has(c.securityContext.runAsNonRoot) && c.securityContext.runAsNonRoot)
    message: 'all containers must set runAsNonRoot to true'
  - expression: object.spec.template.spec.containers.all(c, has(c.securityContext) && has(c.securityContext.readOnlyRootFilesystem) && c.securityContext.readOnlyRootFilesystem)
    message: 'all containers must set readOnlyRootFilesystem to true'
  - expression: object.spec.template.spec.containers.all(c, !has(c.securityContext) || !has(c.securityContext.allowPrivilegeEscalation) || !c.securityContext.allowPrivilegeEscalation)
    message: 'all containers must NOT set allowPrivilegeEscalation to true'
  - expression: object.spec.template.spec.containers.all(c, !has(c.securityContext) || !has(c.securityContext.privileged) || !c.securityContext.privileged)
    message: 'all containers must NOT set privileged to true'

Check its status again, and you should see all warnings cleared.

Next, let's create a namespace for our tests.

kubectl create namespace policy-test

Then, I bind the policy to the namespace. But at this point, I set the action to Warn so that the policy prints out warnings instead of rejecting the requests. This is especially useful to collect results from all expressions during development and automated testing.

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "pod-security.policy-binding.example.com"
spec:
  policyName: "pod-security.policy.example.com"
  validationActions: ["Warn"]
  matchResources:
    namespaceSelector:
      matchLabels:
        "kubernetes.io/metadata.name": "policy-test"

Test out policy enforcement.

kubectl create -n policy-test -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
        securityContext:
          privileged: true
          allowPrivilegeEscalation: true
EOF
Warning: Validation failed for ValidatingAdmissionPolicy 'pod-security.policy.example.com' with binding 'pod-security.policy-binding.example.com': all containers must set runAsNonRoot to true
Warning: Validation failed for ValidatingAdmissionPolicy 'pod-security.policy.example.com' with binding 'pod-security.policy-binding.example.com': all containers must set readOnlyRootFilesystem to true
Warning: Validation failed for ValidatingAdmissionPolicy 'pod-security.policy.example.com' with binding 'pod-security.policy-binding.example.com': all containers must NOT set allowPrivilegeEscalation to true
Warning: Validation failed for ValidatingAdmissionPolicy 'pod-security.policy.example.com' with binding 'pod-security.policy-binding.example.com': all containers must NOT set privileged to true
Error from server: error when creating "STDIN": admission webhook "webhook.example.com" denied the request: [container "nginx" must set RunAsNonRoot to true in its SecurityContext, container "nginx" must set ReadOnlyRootFilesystem to true in its SecurityContext, container "nginx" must NOT set AllowPrivilegeEscalation to true in its SecurityContext, container "nginx" must NOT set Privileged to true in its SecurityContext]

Looks great! The policy and the webhook give equivalent results. After a few more test cases, once we are confident in our policy, it may be time to do some cleanup.

  • For every expression, we repeat access to object.spec.template.spec.containers and to each securityContext;

  • There is a pattern of checking presence of a field and then accessing it, which looks a bit verbose.

Fortunately, since Kubernetes 1.28, we have new solutions for both issues. Variable Composition allows us to extract repeated sub-expressions into their own variables. Kubernetes also enables the optional library for CEL, which is excellent for working with fields that are, you guessed it, optional.

With both features in mind, let's refactor the policy a bit.

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: "pod-security.policy.example.com"
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups:   ["apps"]
      apiVersions: ["v1"]
      operations:  ["CREATE", "UPDATE"]
      resources:   ["deployments"]
  variables:
  - name: containers
    expression: object.spec.template.spec.containers
  - name: securityContexts
    expression: 'variables.containers.map(c, c.?securityContext)'
  validations:
  - expression: variables.securityContexts.all(c, c.?runAsNonRoot == optional.of(true))
    message: 'all containers must set runAsNonRoot to true'
  - expression: variables.securityContexts.all(c, c.?readOnlyRootFilesystem == optional.of(true))
    message: 'all containers must set readOnlyRootFilesystem to true'
  - expression: variables.securityContexts.all(c, c.?allowPrivilegeEscalation != optional.of(true))
    message: 'all containers must NOT set allowPrivilegeEscalation to true'
  - expression: variables.securityContexts.all(c, c.?privileged != optional.of(true))
    message: 'all containers must NOT set privileged to true'

The policy is now much cleaner and more readable. Update the policy, and you should see it function the same as before.

Now let's change the policy binding from warning to actually denying requests that fail validation.

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "pod-security.policy-binding.example.com"
spec:
  policyName: "pod-security.policy.example.com"
  validationActions: ["Deny"]
  matchResources:
    namespaceSelector:
      matchLabels:
        "kubernetes.io/metadata.name": "policy-test"

And finally, remove the webhook. Now the result should include only messages from the policy.

kubectl create -n policy-test -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
        securityContext:
          privileged: true
          allowPrivilegeEscalation: true
EOF
The deployments "nginx" is invalid: : ValidatingAdmissionPolicy 'pod-security.policy.example.com' with binding 'pod-security.policy-binding.example.com' denied request: all containers must set runAsNonRoot to true

Please notice that, by design, the policy will stop evaluation after the first expression that causes the request to be denied. This is different from what happens when the expressions generate only warnings.

Set up monitoring

Unlike a webhook, a policy is not a dedicated process that can expose its own metrics. Instead, you can use metrics from the API server in their place.

Here are some examples in Prometheus Query Language of common monitoring tasks.

To find the 95th percentile execution duration of the policy shown above.

histogram_quantile(0.95, sum(rate(apiserver_validating_admission_policy_check_duration_seconds_bucket{policy="pod-security.policy.example.com"}[5m])) by (le))

To find the rate of the policy evaluation.

rate(apiserver_validating_admission_policy_check_total{policy="pod-security.policy.example.com"}[5m])

You can read the metrics reference to learn more about the metrics above. The metrics of ValidatingAdmissionPolicy are currently in alpha, and more and better metrics will come as the feature's stability graduates in future releases.

Kubernetes 1.30: Structured Authentication Configuration Moves to Beta

By Anish Ramasekar (Microsoft) | Thursday, April 25, 2024

With Kubernetes 1.30, we (SIG Auth) are moving Structured Authentication Configuration to beta.

Today's article is about authentication: finding out who's performing a task, and checking that they are who they say they are. Check back in tomorrow to find about what's new in Kubernetes v1.30 around authorization (deciding what someone can and can't access).

Motivation

Kubernetes has had a long-standing need for a more flexible and extensible authentication system. The current system, while powerful, has some limitations that make it difficult to use in certain scenarios. For example, it is not possible to use multiple authenticators of the same type (e.g., multiple JWT authenticators) or to change the configuration without restarting the API server. The Structured Authentication Configuration feature is the first step towards addressing these limitations and providing a more flexible and extensible way to configure authentication in Kubernetes.

What is structured authentication configuration?

Kubernetes v1.30 builds on the experimental support for configuring authentication based on a file, which was added as alpha in Kubernetes v1.29. At this beta stage, Kubernetes only supports configuring JWT authenticators, which serve as the next iteration of the existing OIDC authenticator. A JWT authenticator authenticates Kubernetes users using JWT-compliant tokens. The authenticator will attempt to parse a raw ID token and verify that it has been signed by the configured issuer.

The Kubernetes project added configuration from a file so that it can provide more flexibility than using command line options (which continue to work, and are still supported). Supporting a configuration file also makes it easy to deliver further improvements in upcoming releases.

Benefits of structured authentication configuration

Here's why using a configuration file to configure cluster authentication is a benefit:

  1. Multiple JWT authenticators: You can configure multiple JWT authenticators simultaneously. This allows you to use multiple identity providers (e.g., Okta, Keycloak, GitLab) without needing to use an intermediary like Dex that handles multiplexing between multiple identity providers.

  2. Dynamic configuration: You can change the configuration without restarting the API server. This allows you to add, remove, or modify authenticators without disrupting the API server.

  3. Any JWT-compliant token: You can use any JWT-compliant token for authentication. This allows you to use tokens from any identity provider that supports JWT. The minimum valid JWT payload must contain the claims documented in structured authentication configuration page in the Kubernetes documentation.

  4. CEL (Common Expression Language) support: You can use CEL to determine whether the token's claims match the user's attributes in Kubernetes (e.g., username, group). This allows you to use complex logic to determine whether a token is valid.

  5. Multiple audiences: You can configure multiple audiences for a single authenticator. This allows you to use the same authenticator for multiple audiences, such as using a different OAuth client for kubectl and dashboard.

  6. Using identity providers that don't support OpenID connect discovery: You can use identity providers that don't support OpenID Connect discovery. The only requirement is to host the discovery document at a different location than the issuer (such as locally in the cluster) and specify the issuer.discoveryURL in the configuration file.

How to use Structured Authentication Configuration

To use structured authentication configuration, you specify the path to the authentication configuration using the --authentication-config command line argument in the API server. The configuration file is a YAML file that specifies the authenticators and their configuration. Here is an example configuration file that configures two JWT authenticators:

apiVersion: apiserver.config.k8s.io/v1beta1
kind: AuthenticationConfiguration
# Someone with a valid token from either of these issuers could authenticate
# against this cluster.
jwt:
- issuer:
    url: https://issuer1.example.com
    audiences:
    - audience1
    - audience2
    audienceMatchPolicy: MatchAny
  claimValidationRules:
  - expression: 'claims.hd == "example.com"'
    message: "the hosted domain name must be example.com"
  claimMappings:
    username:
      expression: 'claims.username'
    groups:
      expression: 'claims.groups'
    uid:
      expression: 'claims.uid'
    extra:
    - key: 'example.com/tenant'
      expression: 'claims.tenant'
  userValidationRules:
  - expression: "!user.username.startsWith('system:')"
    message: "username cannot use reserved system: prefix"
# second authenticator that exposes the discovery document at a different location
# than the issuer
- issuer:
    url: https://issuer2.example.com
    discoveryURL: https://discovery.example.com/.well-known/openid-configuration
    audiences:
    - audience3
    - audience4
    audienceMatchPolicy: MatchAny
  claimValidationRules:
  - expression: 'claims.hd == "example.com"'
    message: "the hosted domain name must be example.com"
  claimMappings:
    username:
      expression: 'claims.username'
    groups:
      expression: 'claims.groups'
    uid:
      expression: 'claims.uid'
    extra:
    - key: 'example.com/tenant'
      expression: 'claims.tenant'
  userValidationRules:
  - expression: "!user.username.startsWith('system:')"
    message: "username cannot use reserved system: prefix"

Migration from command line arguments to configuration file

The Structured Authentication Configuration feature is designed to be backwards-compatible with the existing approach, based on command line options, for configuring the JWT authenticator. This means that you can continue to use the existing command-line options to configure the JWT authenticator. However, we (Kubernetes SIG Auth) recommend migrating to the new configuration file-based approach, as it provides more flexibility and extensibility.

Note

If you specify --authentication-config along with any of the --oidc-* command line arguments, this is a misconfiguration. In this situation, the API server reports an error and then immediately exits.

If you want to switch to using structured authentication configuration, you have to remove the --oidc-* command line arguments, and use the configuration file instead.

Here is an example of how to migrate from the command-line flags to the configuration file:

Command-line arguments

--oidc-issuer-url=https://issuer.example.com
--oidc-client-id=example-client-id
--oidc-username-claim=username
--oidc-groups-claim=groups
--oidc-username-prefix=oidc:
--oidc-groups-prefix=oidc:
--oidc-required-claim="hd=example.com"
--oidc-required-claim="admin=true"
--oidc-ca-file=/path/to/ca.pem

There is no equivalent in the configuration file for the --oidc-signing-algs flag. For Kubernetes v1.30, the authenticator supports all the asymmetric algorithms listed in oidc.go.

Configuration file

apiVersion: apiserver.config.k8s.io/v1beta1
kind: AuthenticationConfiguration
jwt:
- issuer:
    url: https://issuer.example.com
    audiences:
    - example-client-id
    certificateAuthority: <value is the content of file /path/to/ca.pem>
  claimMappings:
    username:
      claim: username
      prefix: "oidc:"
    groups:
      claim: groups
      prefix: "oidc:"
  claimValidationRules:
  - claim: hd
    requiredValue: "example.com"
  - claim: admin
    requiredValue: "true"

What's next?

For Kubernetes v1.31, we expect the feature to stay in beta while we get more feedback. In the coming releases, we want to investigate:

  • Making distributed claims work via CEL expressions.

  • Egress selector configuration support for calls to issuer.url and issuer.discoveryURL.

You can learn more about this feature on the structured authentication configuration page in the Kubernetes documentation. You can also follow along on the KEP-3331 to track progress across the coming Kubernetes releases.

Try it out

In this post, I have covered the benefits the Structured Authentication Configuration feature brings in Kubernetes v1.30. To use this feature, you must specify the path to the authentication configuration using the --authentication-config command line argument. From Kubernetes v1.30, the feature is in beta and enabled by default. If you want to keep using command line arguments instead of a configuration file, those will continue to work as-is.

We would love to hear your feedback on this feature. Please reach out to us on the #sig-auth-authenticators-dev channel on Kubernetes Slack (for an invitation, visit https://slack.k8s.io/).

How to get involved

If you are interested in getting involved in the development of this feature, share feedback, or participate in any other ongoing SIG Auth projects, please reach out on the #sig-auth channel on Kubernetes Slack.

You are also welcome to join the bi-weekly SIG Auth meetings held every other Wednesday.

Kubernetes 1.30: Multi-Webhook and Modular Authorization Made Much Easier

By Rita Zhang (Microsoft), Jordan Liggitt (Google), Nabarun Pal (VMware), Leigh Capili (VMware) | Friday, April 26, 2024

With Kubernetes 1.30, we (SIG Auth) are moving Structured Authorization Configuration to beta.

Today's article is about authorization: deciding what someone can and cannot access. Check a previous article from yesterday to find about what's new in Kubernetes v1.30 around authentication (finding out who's performing a task, and checking that they are who they say they are).

Introduction

Kubernetes continues to evolve to meet the intricate requirements of system administrators and developers alike. A critical aspect of Kubernetes that ensures the security and integrity of the cluster is the API server authorization. Until recently, the configuration of the authorization chain in kube-apiserver was somewhat rigid, limited to a set of command-line flags and allowing only a single webhook in the authorization chain. This approach, while functional, restricted the flexibility needed by cluster administrators to define complex, fine-grained authorization policies. The latest Structured Authorization Configuration feature (KEP-3221) aims to revolutionize this aspect by introducing a more structured and versatile way to configure the authorization chain, focusing on enabling multiple webhooks and providing explicit control mechanisms.

The Need for Improvement

Cluster administrators have long sought the ability to specify multiple authorization webhooks within the API Server handler chain and have control over detailed behavior like timeout and failure policy for each webhook. This need arises from the desire to create layered security policies, where requests can be validated against multiple criteria or sets of rules in a specific order. The previous limitations also made it difficult to dynamically configure the authorizer chain, leaving no room to manage complex authorization scenarios efficiently.

The Structured Authorization Configuration feature addresses these limitations by introducing a configuration file format to configure the Kubernetes API Server Authorization chain. This format allows specifying multiple webhooks in the authorization chain (all other authorization types are specified no more than once). Each webhook authorizer has well-defined parameters, including timeout settings, failure policies, and conditions for invocation with CEL rules to pre-filter requests before they are dispatched to webhooks, helping you prevent unnecessary invocations. The configuration also supports automatic reloading, ensuring changes can be applied dynamically without restarting the kube-apiserver. This feature addresses current limitations and opens up new possibilities for securing and managing Kubernetes clusters more effectively.

Sample Configurations

Here is a sample structured authorization configuration along with descriptions for all fields, their defaults, and possible values.

apiVersion: apiserver.config.k8s.io/v1beta1
kind: AuthorizationConfiguration
authorizers:
  - type: Webhook
    # Name used to describe the authorizer
    # This is explicitly used in monitoring machinery for metrics
    # Note:
    #   - Validation for this field is similar to how K8s labels are validated today.
    # Required, with no default
    name: webhook
    webhook:
      # The duration to cache 'authorized' responses from the webhook
      # authorizer.
      # Same as setting `--authorization-webhook-cache-authorized-ttl` flag
      # Default: 5m0s
      authorizedTTL: 30s
      # The duration to cache 'unauthorized' responses from the webhook
      # authorizer.
      # Same as setting `--authorization-webhook-cache-unauthorized-ttl` flag
      # Default: 30s
      unauthorizedTTL: 30s
      # Timeout for the webhook request
      # Maximum allowed is 30s.
      # Required, with no default.
      timeout: 3s
      # The API version of the authorization.k8s.io SubjectAccessReview to
      # send to and expect from the webhook.
      # Same as setting `--authorization-webhook-version` flag
      # Required, with no default
      # Valid values: v1beta1, v1
      subjectAccessReviewVersion: v1
      # MatchConditionSubjectAccessReviewVersion specifies the SubjectAccessReview
      # version the CEL expressions are evaluated against
      # Valid values: v1
      # Required, no default value
      matchConditionSubjectAccessReviewVersion: v1
      # Controls the authorization decision when a webhook request fails to
      # complete or returns a malformed response or errors evaluating
      # matchConditions.
      # Valid values:
      #   - NoOpinion: continue to subsequent authorizers to see if one of
      #     them allows the request
      #   - Deny: reject the request without consulting subsequent authorizers
      # Required, with no default.
      failurePolicy: Deny
      connectionInfo:
        # Controls how the webhook should communicate with the server.
        # Valid values:
        # - KubeConfig: use the file specified in kubeConfigFile to locate the
        #   server.
        # - InClusterConfig: use the in-cluster configuration to call the
        #   SubjectAccessReview API hosted by kube-apiserver. This mode is not
        #   allowed for kube-apiserver.
        type: KubeConfig
        # Path to KubeConfigFile for connection info
        # Required, if connectionInfo.Type is KubeConfig
        kubeConfigFile: /kube-system-authz-webhook.yaml
        # matchConditions is a list of conditions that must be met for a request to be sent to this
        # webhook. An empty list of matchConditions matches all requests.
        # There are a maximum of 64 match conditions allowed.
        #
        # The exact matching logic is (in order):
        #   1. If at least one matchCondition evaluates to FALSE, then the webhook is skipped.
        #   2. If ALL matchConditions evaluate to TRUE, then the webhook is called.
        #   3. If at least one matchCondition evaluates to an error (but none are FALSE):
        #      - If failurePolicy=Deny, then the webhook rejects the request
        #      - If failurePolicy=NoOpinion, then the error is ignored and the webhook is skipped
      matchConditions:
      # expression represents the expression which will be evaluated by CEL. Must evaluate to bool.
      # CEL expressions have access to the contents of the SubjectAccessReview in v1 version.
      # If version specified by subjectAccessReviewVersion in the request variable is v1beta1,
      # the contents would be converted to the v1 version before evaluating the CEL expression.
      #
      # Documentation on CEL: https://kubernetes.io/docs/reference/using-api/cel/
      #
      # only send resource requests to the webhook
      - expression: has(request.resourceAttributes)
      # only intercept requests to kube-system
      - expression: request.resourceAttributes.namespace == 'kube-system'
      # don't intercept requests from kube-system service accounts
      - expression: !('system:serviceaccounts:kube-system' in request.user.groups)
  - type: Node
    name: node
  - type: RBAC
    name: rbac
  - type: Webhook
    name: in-cluster-authorizer
    webhook:
      authorizedTTL: 5m
      unauthorizedTTL: 30s
      timeout: 3s
      subjectAccessReviewVersion: v1
      failurePolicy: NoOpinion
      connectionInfo:
        type: InClusterConfig

The following configuration examples illustrate real-world scenarios that need the ability to specify multiple webhooks with distinct settings, precedence order, and failure modes.

Protecting Installed CRDs

Ensuring the availability of Custom Resource Definitions (CRDs) at cluster startup has been a key demand. One of the blockers to having a controller reconcile those CRDs is having a protection mechanism for them, which can be achieved through multiple authorization webhooks. This was previously out of reach because the Kubernetes API server allowed only a single authorization webhook in its chain. Now, with the Structured Authorization Configuration feature, administrators can specify multiple webhooks, offering a solution where RBAC falls short, especially when denying permissions to 'non-system' users for certain CRDs.

Assuming the following for this scenario:

  • The "protected" CRDs are installed.

  • They can only be modified by users in the group admin.

apiVersion: apiserver.config.k8s.io/v1beta1
kind: AuthorizationConfiguration
authorizers:
  - type: Webhook
    name: system-crd-protector
    webhook:
      unauthorizedTTL: 30s
      timeout: 3s
      subjectAccessReviewVersion: v1
      matchConditionSubjectAccessReviewVersion: v1
      failurePolicy: Deny
      connectionInfo:
        type: KubeConfig
        kubeConfigFile: /files/kube-system-authz-webhook.yaml
      matchConditions:
      # only send resource requests to the webhook
      - expression: has(request.resourceAttributes)
      # only intercept requests for CRDs
      - expression: request.resourceAttributes.resource == "customresourcedefinitions"
      - expression: request.resourceAttributes.group == "apiextensions.k8s.io"
      # only intercept update, patch, delete, or deletecollection requests
      - expression: request.resourceAttributes.verb in ['update', 'patch', 'delete', 'deletecollection']
  - type: Node
  - type: RBAC

Preventing unnecessarily nested webhooks

A system administrator wants to apply specific validations to requests before handing them off to webhooks using frameworks like Open Policy Agent. In the past, this would require running nested webhooks within the one added to the authorization chain to achieve the desired result. The Structured Authorization Configuration feature simplifies this process, offering a structured API to selectively trigger additional webhooks when needed. It also enables administrators to set distinct failure policies for each webhook, ensuring more consistent and predictable responses.

apiVersion: apiserver.config.k8s.io/v1beta1
kind: AuthorizationConfiguration
authorizers:
  - type: Webhook
    name: system-crd-protector
    webhook:
      unauthorizedTTL: 30s
      timeout: 3s
      subjectAccessReviewVersion: v1
      matchConditionSubjectAccessReviewVersion: v1
      failurePolicy: Deny
      connectionInfo:
        type: KubeConfig
        kubeConfigFile: /files/kube-system-authz-webhook.yaml
      matchConditions:
      # only send resource requests to the webhook
      - expression: has(request.resourceAttributes)
      # only intercept requests for CRDs
      - expression: request.resourceAttributes.resource == "customresourcedefinitions"
      - expression: request.resourceAttributes.group == "apiextensions.k8s.io"
      # only intercept update, patch, delete, or deletecollection requests
      - expression: request.resourceAttributes.verb in ['update', 'patch', 'delete', 'deletecollection']
  - type: Node
  - type: RBAC
  - name: opa
    type: Webhook
    webhook:
      unauthorizedTTL: 30s
      timeout: 3s
      subjectAccessReviewVersion: v1
      matchConditionSubjectAccessReviewVersion: v1
      failurePolicy: Deny
      connectionInfo:
        type: KubeConfig
        kubeConfigFile: /files/opa-default-authz-webhook.yaml
      matchConditions:
      # only send resource requests to the webhook
      - expression: has(request.resourceAttributes)
      # only intercept requests to default namespace
      - expression: request.resourceAttributes.namespace == 'default'
      # don't intercept requests from default service accounts
      - expression: !('system:serviceaccounts:default' in request.user.groups)

What's next?

From Kubernetes 1.30, the feature is in beta and enabled by default. For Kubernetes v1.31, we expect the feature to stay in beta while we get more feedback from users. Once it is ready for GA, the feature flag will be removed, and the configuration file version will be promoted to v1.

Learn more about this feature on the structured authorization configuration Kubernetes doc website. You can also follow along with KEP-3221 to track progress in coming Kubernetes releases.

Call to action

In this post, we have covered the benefits of the Structured Authorization Configuration feature in Kubernetes v1.30 and a few sample configurations for real-world scenarios. To use this feature, you must specify the path to the authorization configuration using the --authorization-config command line argument. From Kubernetes 1.30, the feature is in beta and enabled by default. If you want to keep using command line flags instead of a configuration file, those will continue to work as-is. Specifying both --authorization-config and --authorization-modes/--authorization-webhook-* won't work. You need to drop the older flags from your kube-apiserver command.

The following kind Cluster configuration sets that command argument on the API server to load an AuthorizationConfiguration from a file (authorization_config.yaml) in the files folder. Any needed kubeconfig and certificate files can also be put in the files directory.

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  StructuredAuthorizationConfiguration: true  # enabled by default in v1.30
kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    metadata:
      name: config
    apiServer:
      extraArgs:
        authorization-config: "/files/authorization_config.yaml"
      extraVolumes:
      - name: files
        hostPath: "/files"
        mountPath: "/files"
        readOnly: true    
nodes:
- role: control-plane
  extraMounts:
  - hostPath: files
    containerPath: /files

We would love to hear your feedback on this feature. In particular, we would like feedback from Kubernetes cluster administrators and authorization webhook implementors as they build their integrations with this new API. Please reach out to us on the #sig-auth-authorizers-dev channel on Kubernetes Slack.

How to get involved

If you are interested in helping develop this feature, sharing feedback, or participating in any other ongoing SIG Auth projects, please reach out on the #sig-auth channel on Kubernetes Slack.

You are also welcome to join the bi-weekly SIG Auth meetings held every other Wednesday.

Acknowledgments

This feature was driven by contributors from several different companies. We would like to extend a huge thank you to everyone who contributed their time and effort to make this possible.

Kubernetes 1.30: Preventing unauthorized volume mode conversion moves to GA

By Raunak Pradip Shah (Mirantis) | Tuesday, April 30, 2024

With the release of Kubernetes 1.30, the feature that prevents modifying the volume mode of a PersistentVolumeClaim created from an existing VolumeSnapshot has moved to GA!

The problem

The Volume Mode of a PersistentVolumeClaim refers to whether the underlying volume on the storage device is formatted into a filesystem or presented as a raw block device to the Pod that uses it.

Users can leverage the VolumeSnapshot feature, which has been stable since Kubernetes v1.20, to create a PersistentVolumeClaim (shortened as PVC) from an existing VolumeSnapshot in the Kubernetes cluster. The PVC spec includes a dataSource field, which can point to an existing VolumeSnapshot instance. Visit Create a PersistentVolumeClaim from a Volume Snapshot for more details on how to create a PVC from an existing VolumeSnapshot in a Kubernetes cluster.
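To make that dataSource reference concrete, here is a minimal client-go sketch (assuming client-go v0.30; the csi-hostpath-sc storage class, the namespace, and the snapshot name existing-snapshot are made up for illustration) that requests a PVC restored from an existing VolumeSnapshot:

package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a clientset from the local kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	apiGroup := "snapshot.storage.k8s.io"
	volumeMode := corev1.PersistentVolumeFilesystem // must match the snapshot's source volume, unless authorized
	storageClass := "csi-hostpath-sc"               // hypothetical storage class

	pvc := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "restored-pvc", Namespace: "default"},
		Spec: corev1.PersistentVolumeClaimSpec{
			StorageClassName: &storageClass,
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			VolumeMode:       &volumeMode,
			// dataSource points at an existing VolumeSnapshot in the same namespace.
			DataSource: &corev1.TypedLocalObjectReference{
				APIGroup: &apiGroup,
				Kind:     "VolumeSnapshot",
				Name:     "existing-snapshot", // hypothetical snapshot name
			},
			Resources: corev1.VolumeResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("1Gi"),
				},
			},
		},
	}

	if _, err := clientset.CoreV1().PersistentVolumeClaims("default").
		Create(context.TODO(), pvc, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}

With the GA behavior described below, the volume mode requested here must match the mode of the snapshot's source volume unless an authorized user has explicitly allowed the conversion.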

When leveraging the above capability, there is no logic that validates whether the mode of the original volume, whose snapshot was taken, matches the mode of the newly created volume.

This presents a security gap that allows malicious users to potentially exploit an as-yet-unknown vulnerability in the host operating system.

There is, however, a valid use case for allowing some users to perform such conversions: storage backup vendors typically convert the volume mode during a backup operation to retrieve changed blocks more efficiently. Because of this, Kubernetes cannot simply block the conversion outright; the challenge is to distinguish trusted users from malicious ones.

Preventing unauthorized users from converting the volume mode

In this context, an authorized user is one who has access rights to perform update or patch operations on VolumeSnapshotContents, which is a cluster-level resource.
It is up to the cluster administrator to grant these rights only to trusted users or applications, like backup vendors. Users other than these authorized ones will never be allowed to modify the volume mode of a PVC when it is being created from a VolumeSnapshot.
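As an illustration, such rights could be expressed with a ClusterRole similar to the sketch below and bound only to the backup application's ServiceAccount; the role name, the extra get/list verbs, and the client-go bootstrapping are assumptions for this example, not requirements of the feature:

package main

import (
	"context"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// A ClusterRole granting the rights that mark a subject as "authorized"
	// for volume mode conversion: update/patch on VolumeSnapshotContents.
	role := &rbacv1.ClusterRole{
		ObjectMeta: metav1.ObjectMeta{Name: "snapshot-content-modifier"}, // hypothetical name
		Rules: []rbacv1.PolicyRule{{
			APIGroups: []string{"snapshot.storage.k8s.io"},
			Resources: []string{"volumesnapshotcontents"},
			Verbs:     []string{"get", "list", "update", "patch"},
		}},
	}

	if _, err := clientset.RbacV1().ClusterRoles().
		Create(context.TODO(), role, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
	// A ClusterRoleBinding to the backup vendor's ServiceAccount would follow.
}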

To convert the volume mode, an authorized user must do the following:

  1. Identify the VolumeSnapshot that is to be used as the data source for a newly created PVC in the given namespace.

  2. Identify the VolumeSnapshotContent bound to the above VolumeSnapshot:

kubectl describe volumesnapshot -n <namespace> <name>

  3. Add the annotation snapshot.storage.kubernetes.io/allow-volume-mode-change: "true" to the above VolumeSnapshotContent. The VolumeSnapshotContent annotations must include one similar to the following manifest fragment:

kind: VolumeSnapshotContent
metadata:
  annotations:
    snapshot.storage.kubernetes.io/allow-volume-mode-change: "true"
...

Note: For pre-provisioned VolumeSnapshotContents, you must take an extra step of setting spec.sourceVolumeMode field to either Filesystem or Block, depending on the mode of the volume from which this snapshot was taken.

An example is shown below:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  annotations:
    snapshot.storage.kubernetes.io/allow-volume-mode-change: "true"
  name: <volume-snapshot-content-name>
spec:
  deletionPolicy: Delete
  driver: hostpath.csi.k8s.io
  source:
    snapshotHandle: <snapshot-handle>
  sourceVolumeMode: Filesystem
  volumeSnapshotRef:
    name: <volume-snapshot-name>
    namespace: <namespace>

Repeat steps 1 to 3 for all VolumeSnapshotContents whose volume mode needs to be converted during a backup or restore operation. This can be done either via software with credentials of an authorized user or manually by the authorized user(s).
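For the "via software" route, one possible approach is a merge patch through client-go's dynamic client, sketched below with a hypothetical VolumeSnapshotContent name; interactively, kubectl annotate on the same object achieves the same result:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// VolumeSnapshotContent is a cluster-scoped custom resource installed by
	// the external-snapshotter project.
	gvr := schema.GroupVersionResource{
		Group:    "snapshot.storage.k8s.io",
		Version:  "v1",
		Resource: "volumesnapshotcontents",
	}

	// Merge patch that only adds the allow-volume-mode-change annotation.
	patch := []byte(`{"metadata":{"annotations":{"snapshot.storage.kubernetes.io/allow-volume-mode-change":"true"}}}`)

	name := "snapcontent-example" // hypothetical VolumeSnapshotContent name
	obj, err := client.Resource(gvr).Patch(context.TODO(), name, types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("annotated:", obj.GetName())
}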

If the annotation shown above is present on a VolumeSnapshotContent object, Kubernetes will not prevent the volume mode from being converted. Users should keep this in mind before they attempt to add the annotation to any VolumeSnapshotContent.

Action required

The prevent-volume-mode-conversion feature flag is enabled by default in the external-provisioner v4.0.0 and external-snapshotter v7.0.0. Volume mode change will be rejected when creating a PVC from a VolumeSnapshot unless the steps described above have been performed.

What's next

To determine which CSI external sidecar versions support this feature, please head over to the CSI docs page. For any queries or issues, join Kubernetes on Slack and create a thread in the #csi or #sig-storage channel. Alternatively, create an issue in the CSI external-snapshotter repository.

Container Runtime Interface streaming explained

By Sascha Grunert | Wednesday, May 01, 2024

The Kubernetes Container Runtime Interface (CRI) acts as the main connection between the kubelet and the Container Runtime. Those runtimes have to provide a gRPC server that fulfills a Kubernetes-defined Protocol Buffer interface. This API definition evolves over time, for example when contributors add new features or deprecate fields.

In this blog post, I'd like to dive into the functionality and history of three extraordinary Remote Procedure Calls (RPCs), which are truly outstanding in terms of how they work: Exec, Attach and PortForward.

Exec can be used to run dedicated commands within the container and stream the output to a client like kubectl or crictl. It also allows interaction with that process using standard input (stdin), for example if users want to run a new shell instance within an existing workload.

Attach streams the output of the currently running process via standard I/O from the container to the client and also allows interaction with it. This is particularly useful if users want to see what is going on in the container and be able to interact with the process.

PortForward can be used to forward a port from the host to the container so that clients can interact with it using third-party network tools. This makes it possible to bypass Kubernetes services for a certain workload and interact with its network interface directly.

What is so special about them?

All RPCs of the CRI either use gRPC unary calls for communication or the server-side streaming feature (only GetContainerEvents right now). This means that nearly all RPCs receive a single client request and have to return a single server response. The same applies to Exec, Attach, and PortForward, whose protocol definition looks like this:

// Exec prepares a streaming endpoint to execute a command in the container.
rpc Exec(ExecRequest) returns (ExecResponse) {}
// Attach prepares a streaming endpoint to attach to a running container.
rpc Attach(AttachRequest) returns (AttachResponse) {}
// PortForward prepares a streaming endpoint to forward ports from a PodSandbox.
rpc PortForward(PortForwardRequest) returns (PortForwardResponse) {}

The requests carry everything required to allow the server to do the work, for example, the ContainerId or command (Cmd) to be run in case of Exec. More interestingly, all of their responses only contain a url:

message ExecResponse {
    // Fully qualified URL of the exec streaming server.
    string url = 1;
}
message AttachResponse {
    // Fully qualified URL of the attach streaming server.
    string url = 1;
}
message PortForwardResponse {
    // Fully qualified URL of the port-forward streaming server.
    string url = 1;
}

Why is it implemented like that? Well, the original design document for those RPCs even predates Kubernetes Enhancement Proposals (KEPs) and was originally outlined back in 2016. The kubelet had a native implementation for Exec, Attach, and PortForward before the initiative to bring the functionality to the CRI started. Before that, everything was bound to Docker or the later abandoned container runtime rkt.

The CRI-related design document also considered using native RPC streaming for exec, attach, and port forward, but the downsides outweighed that approach: the kubelet would still be a network bottleneck, and future runtimes would not be free to choose their server implementation details. Another option, in which the kubelet would provide a portable, runtime-agnostic solution, was also rejected in favor of the final design, because it would have meant maintaining yet another project that would nevertheless remain runtime dependent.

This means that the basic flow for Exec, Attach and PortForward was proposed to look like this:

CRI Streaming flow

Clients like crictl or the kubelet (via kubectl) request a new exec, attach or port forward session from the runtime using the gRPC interface. The runtime implements a streaming server that also manages the active sessions. This streaming server provides an HTTP endpoint for the client to connect to. The client upgrades the connection to use the SPDY streaming protocol or (in the future) to a WebSocket connection and starts to stream the data back and forth.
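For comparison, the same request-upgrade-stream pattern is what kubectl-style clients go through using client-go's remotecommand package, talking to the Kubernetes API server, which in turn reaches the kubelet and the runtime's streaming endpoint. Below is a minimal sketch that assumes a running pod named web in the default namespace:

package main

import (
	"context"
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/remotecommand"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Build the exec subresource request, similar to what kubectl exec does.
	req := clientset.CoreV1().RESTClient().Post().
		Resource("pods").Namespace("default").Name("web").
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Command: []string{"sh", "-c", "echo hello from the container"},
			Stdout:  true,
			Stderr:  true,
		}, scheme.ParameterCodec)

	// NewSPDYExecutor performs the connection upgrade; recent client-go
	// versions also offer a WebSocket-based executor.
	exec, err := remotecommand.NewSPDYExecutor(config, "POST", req.URL())
	if err != nil {
		panic(err)
	}

	// Stream stdout/stderr back to this process until the command exits.
	if err := exec.StreamWithContext(context.TODO(), remotecommand.StreamOptions{
		Stdout: os.Stdout,
		Stderr: os.Stderr,
	}); err != nil {
		panic(err)
	}
}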

This implementation gives runtimes the flexibility to implement Exec, Attach and PortForward the way they want, and also allows a simple test path. Runtimes can change the underlying implementation to support any kind of feature without having to modify the CRI at all.

Many smaller enhancements to this overall approach have been merged into Kubernetes over the past years, but the general pattern has always stayed the same. The kubelet source code was transformed into a reusable library, which is nowadays usable by container runtimes to implement the basic streaming capability.

How does the streaming actually work?

At first glance, it looks like all three RPCs work the same way, but that's not the case. It's possible to group the functionality of Exec and Attach, while PortForward follows a distinct internal protocol definition.

Exec and Attach

Kubernetes defines Exec and Attach as remote commands, whose protocol definition exists in five different versions:

#   Version             Note
1   channel.k8s.io      Initial (unversioned) SPDY sub protocol (#13394, #13395)
2   v2.channel.k8s.io   Resolves the issues present in the first version (#15961)
3   v3.channel.k8s.io   Adds support for resizing container terminals (#25273)
4   v4.channel.k8s.io   Adds support for exit codes using JSON errors (#26541)
5   v5.channel.k8s.io   Adds support for a CLOSE signal (#119157)

On top of that, there is an overall effort to replace the SPDY transport protocol with WebSockets as part of KEP #4006. Runtimes have to support those protocols over their life cycle to stay up to date with the Kubernetes implementation.

Let's assume that a client uses the latest (v5) version of the protocol and communicates over WebSockets. In that case, the general flow would be:

  1. The client requests a URL endpoint for Exec or Attach using the CRI.

    • The server (runtime) validates the request, inserts it into a connection tracking cache, and provides the HTTP endpoint URL for that request.
  2. The client connects to that URL, upgrades the connection to establish a WebSocket, and starts to stream data.

    • In the case of Attach, the server has to stream the main container process data to the client.

    • In the case of Exec, the server has to create the subprocess command within the container and then stream the output to the client.

If stdin is required, then the server needs to listen for that as well and redirect it to the corresponding process.

Interpreting data for the defined protocol is fairly simple: The first byte of every input and output packet defines the actual stream:

First byte   Type              Description
0            standard input    Data streamed from stdin
1            standard output   Data streamed to stdout
2            standard error    Data streamed to stderr
3            stream error      A streaming error occurred
4            stream resize     A terminal resize event
255          stream close      Stream should be closed (for WebSockets)
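To illustrate that framing, the toy demultiplexer below routes each message based on its first byte. This is a simplified sketch of the idea only, not the kubelet library's actual implementation, which also handles stdin, terminal resize events, and structured error payloads:

package main

import (
	"fmt"
	"io"
	"os"
)

// Stream identifiers as defined by the remote command sub protocol.
const (
	streamStdin  = 0
	streamStdout = 1
	streamStderr = 2
	streamErr    = 3
	streamResize = 4
	streamClose  = 255
)

// demux inspects the first byte of a message and writes the payload to the
// matching destination.
func demux(msg []byte, stdout, stderr io.Writer) error {
	if len(msg) == 0 {
		return nil
	}
	payload := msg[1:]
	switch msg[0] {
	case streamStdout:
		_, err := stdout.Write(payload)
		return err
	case streamStderr:
		_, err := stderr.Write(payload)
		return err
	case streamErr:
		return fmt.Errorf("stream error: %s", payload)
	case streamClose:
		return io.EOF
	default:
		// stdin and resize events are handled elsewhere in a real client/server.
		return nil
	}
}

func main() {
	// Example: a stdout frame followed by a close frame.
	frames := [][]byte{
		append([]byte{streamStdout}, []byte("hello\n")...),
		{streamClose},
	}
	for _, f := range frames {
		if err := demux(f, os.Stdout, os.Stderr); err == io.EOF {
			return
		}
	}
}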

How should runtimes implement the streaming server methods for Exec and Attach using the provided kubelet library? The key is that the streaming server implementation in the kubelet defines an interface called Runtime, which has to be fulfilled by the actual container runtime if it wants to use that library:

// Runtime is the interface to execute the commands and provide the streams.
type Runtime interface {
        Exec(ctx context.Context, containerID string, cmd []string, in io.Reader, out, err io.WriteCloser, tty bool, resize <-chan remotecommand.TerminalSize) error
        Attach(ctx context.Context, containerID string, in io.Reader, out, err io.WriteCloser, tty bool, resize <-chan remotecommand.TerminalSize) error
        PortForward(ctx context.Context, podSandboxID string, port int32, stream io.ReadWriteCloser) error
}

Everything related to the protocol interpretation is already in place, and runtimes only have to implement the actual Exec and Attach logic. For example, the container runtime CRI-O does it roughly like this (simplified pseudocode):

func (s StreamService) Exec(
    ctx context.Context,
    containerID string,
    cmd []string,
    stdin io.Reader, stdout, stderr io.WriteCloser,
    tty bool,
    resizeChan <-chan remotecommand.TerminalSize,
) error {
    // Retrieve the container by the provided containerID
    // …

    // Update the container status and verify that the workload is running
    // …

    // Execute the command and stream the data
    return s.runtimeServer.Runtime().ExecContainer(
        s.ctx, c, cmd, stdin, stdout, stderr, tty, resizeChan,
    )
}

PortForward

Forwarding ports to a container works a bit differently compared to streaming I/O data from a workload. The server still has to provide a URL endpoint for the client to connect to, but then the container runtime has to enter the network namespace of the container, allocate the port, and stream the data back and forth. There is no simple protocol definition available like for Exec or Attach. This means that the client will stream the plain SPDY frames (with or without an additional WebSocket connection), which can be interpreted using libraries like moby/spdystream.

Luckily, the kubelet library already provides the PortForward interface method which has to be implemented by the runtime. CRI-O does that by (simplified):

func (s StreamService) PortForward(
    ctx context.Context,
    podSandboxID string,
    port int32,
    stream io.ReadWriteCloser,
) error {
    // Retrieve the pod sandbox by the provided podSandboxID
    sandboxID, err := s.runtimeServer.PodIDIndex().Get(podSandboxID)
    sb := s.runtimeServer.GetSandbox(sandboxID)
    // …

    // Get the network namespace path on disk for that sandbox
    netNsPath := sb.NetNsPath()
    // …

    // Enter the network namespace and stream the data
    return s.runtimeServer.Runtime().PortForwardContainer(
        ctx, sb.InfraContainer(), netNsPath, port, stream,
    )
}

Future work

The flexibility Kubernetes provides for the RPCs Exec, Attach and PortForward is truly outstanding compared to other methods. Nevertheless, container runtimes have to keep up with the latest and greatest implementations to support those features in a meaningful way. The general effort to support WebSockets is not only a Kubernetes matter; it also has to be supported by container runtimes as well as clients like crictl.

For example, crictl v1.30 features a new --transport flag for the subcommands exec, attach and port-forward (#1383, #1385) to allow choosing between websocket and spdy.

CRI-O is taking an experimental path by moving the streaming server implementation into conmon-rs (a substitute for the container monitor conmon). conmon-rs is a Rust implementation of the original container monitor and allows streaming WebSockets directly using supported libraries (#2070). The major benefit of this approach is that CRI-O does not even have to be running while conmon-rs can keep active Exec, Attach and PortForward sessions open. The simplified flow when using crictl directly will then look like this:

[Sequence diagram: crictl, the container runtime, and conmon-rs. 1. crictl sends an Exec, Attach, or PortForward request to the container runtime over the CRI; 2. the runtime asks conmon-rs to serve the Exec, Attach, or PortForward session via Cap'n Proto; 3. conmon-rs returns an HTTP endpoint (URL); 4. the runtime passes the URL back to crictl in its response; 5. crictl upgrades the connection to a WebSocket; 6. the data is streamed between crictl and conmon-rs.]

All of those enhancements require iterative design decisions, while the original well-conceived implementation acts as the foundation for those. I really hope you've enjoyed this compact journey through the history of CRI RPCs. Feel free to reach out to me anytime for suggestions or feedback using the official Kubernetes Slack.