Kubernetes Architecture

Intro

At the highest level, the architecture of Kubernetes consists of a Control Plane and a Data Plane. This split is nothing new: it is a common architectural design pattern in networking, where the control plane and the data plane are the two most commonly referenced "planes." In networking, a plane is an abstract conception of where certain processes take place; the term is used in the sense of "plane of existence."

In a network, the control plane decides how data packets should be forwarded, that is, how data gets from one place to another. The data plane (sometimes called the forwarding plane) is what actually forwards those packets.

Kubernetes has its own components that reside in its control plane and data plane, all of which this blog post explores in more detail. The main components are:

  • Control Plane
    • API Server
    • Scheduler
    • Controller Manager
    • etcd
  • Data Plane
    • Kube-Proxy
    • Kubelet
    • Container Runtime
Kubernetes system components communicate only with the API server, meaning they don’t talk to each other directly. The API server is the only component that communicates with etcd. None of the other components communicate with etcd directly, but instead modify the cluster state by talking to the API server. Connections between the API server and the other components are almost always initiated by the components, but the API server does connect to the Kubelet when you use kubectl to fetch logs, use kubectl attach to connect to a running container, or use the kubectl port-forward command.
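For example, these are the kubectl commands mentioned above that cause the API server to open a connection to the Kubelet; the pod and container names are placeholders:

    $ kubectl logs my-pod                         # API server fetches the logs from the Kubelet
    $ kubectl attach -it my-pod -c my-container   # attach to a running container through the API server
    $ kubectl port-forward my-pod 8080:80         # forward a local port to the pod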

    Nodes
    A Kubernetes cluster consists of a set of worker machines, called nodes, that run containerized applications. The node or nodes in the control plane are called the "master" node(s), and the nodes in the data plane are referred to as "worker" nodes. Every cluster has at least one worker node. The worker node(s) host the Pods that are the components of the application workload. The control plane manages the whole system: the worker nodes and the Pods in the cluster. In production environments, the control plane usually runs across multiple computers and a cluster usually runs multiple nodes, providing fault tolerance and high availability.

    • Nodes: Nodes are VMs or physical servers that host containerized applications. Each node in a cluster can run one or more application instances. There can be as few as one node, however, a typical Kubernetes cluster will have several nodes (and deployments with hundreds or more nodes are not uncommon).
    • Image Registry: Container images are kept in the registry and transferred to nodes by the control plane for execution in container pods.
    • Pods: Pods are where containerized applications run. They can include one or more containers and are the smallest unit of deployment for applications in a Kubernetes cluster (a minimal manifest sketch follows this list).
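    As a minimal illustration, here is a sketch of a single-container Pod manifest; the names and image are placeholders:

        apiVersion: v1
        kind: Pod
        metadata:
          name: hello-pod            # placeholder name
          labels:
            app: hello
        spec:
          containers:
          - name: hello              # a single application container
            image: nginx:1.27        # any image the nodes can pull from a registry
            ports:
            - containerPort: 80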


    Kubernetes Control Plane And Data Plane Components

    Kubernetes Control Plane Architecture

    The control plane's components make global decisions about the cluster (for example, scheduling), as well as detecting and responding to cluster events (for example, starting up a new pod when a deployment's replicas field is unsatisfied). Control plane components can be run on any machine in the cluster. In managed Kubernetes offerings (GKE, EKS, AKS, and so on), however, the control plane is hosted and operated by the cloud provider. The control plane holds and controls the state of the cluster, but it does not run the application workloads; that is the job of the worker nodes.

    The Kubernetes control plane manages the cluster as a whole. Its components include:

  • kube-apiserver: As its name suggests, the API server exposes the Kubernetes API, which is communications central. External communications via the command line interface (CLI) or other user interfaces (UIs) pass through the kube-apiserver, and all control-plane-to-node communication also goes through the API server.
  • https://kubernetes.io/docs/concepts/overview/components/#kube-apiserver
  • etcd: The key-value store where all data relating to the cluster is stored. etcd is distributed, highly available, and consistent, and the only component that accesses it directly is the API server. The API server stores resources in etcd as their complete JSON representations.
  • https://kubernetes.io/docs/concepts/overview/components/#etcd
  • kube-scheduler: When a new Pod is created, this component assigns it to a node for execution based on resource requirements, policies, and ‘affinity’ specifications regarding geolocation and interference with other workloads (a node affinity sketch follows this list).
  • https://kubernetes.io/docs/concepts/overview/components/#kube-scheduler
  • kube-controller-manager: Although a Kubernetes cluster has several controller functions, they are all compiled into a single binary known as kube-controller-manager.
  • Controller functions included in this process consist of:
  • Replication Controller: Ensures that the correct number of pods exists for each ReplicationController resource in the cluster.
  • Node Controller: Monitors the health of each node and notifies the cluster when nodes come online or become unresponsive.
  • Endpoints Controller: Connects Pods and Services to populate the Endpoints object.
  • Service Account and Token Controllers: Allocates API access tokens and default accounts to new namespaces in the cluster.
  • https://kubernetes.io/docs/concepts/overview/components/#kube-controller-manager
  • cloud-controller-manager: If the cluster is partly or entirely cloud-based, the cloud controller manager links the cluster to the cloud provider’s API. Only those controls specific to the cloud provider will run. The cloud controller manager does not exist on clusters that are entirely on-premises. More than one cloud controller manager can be running in a cluster for fault tolerance or to improve overall cloud performance.
  • Elements of the cloud controller manager include:
  • Node Controller: Checks the cloud provider to determine whether a node that has stopped responding has been deleted from the cloud.
  • Route Controller: Establishes routes in the cloud provider infrastructure.
  • Service Controller: Manages cloud provider’s load balancers.
  • https://kubernetes.io/docs/concepts/overview/components/#cloud-controller-manager
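    To make the scheduler's ‘affinity’ handling concrete, here is a sketch of a Pod spec that uses node affinity to restrict scheduling to particular zones; the name, image, and zone values are placeholders:

        apiVersion: v1
        kind: Pod
        metadata:
          name: affinity-demo                        # placeholder name
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: topology.kubernetes.io/zone # well-known node label
                    operator: In
                    values: ["us-east-1a", "us-east-1b"]   # placeholder zones
          containers:
          - name: app
            image: nginx:1.27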

    Kubernetes Node Architecture

    Nodes are the machines, either VMs or physical servers, where Kubernetes places Pods to execute (a quick way to inspect them follows the list below). Node components include:

  • kubelet: Every node has an agent called the kubelet. It ensures that the containers described in PodSpecs are up and running properly.
  • https://kubernetes.io/docs/concepts/overview/components/#kubelet
  • kube-proxy: A network proxy that runs on each node and maintains network rules on that node. These rules allow network communication to Pods from sessions inside or outside the cluster, using operating system (OS) packet filtering if available.
  • https://kubernetes.io/docs/concepts/overview/components/#kube-proxy
  • container runtime: Software responsible for running the containerized applications. Kubernetes supports any runtime that adheres to the Kubernetes CRI (Container Runtime Interface); containerd and CRI-O are the most common choices today.
  • https://kubernetes.io/docs/concepts/overview/components/#container-runtime
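    A quick way to see a cluster's nodes and what each one reports about its kubelet and container runtime (the node name is a placeholder):

        $ kubectl get nodes -o wide           # kubelet version, OS image, and container runtime per node
        $ kubectl describe node my-node       # capacity, conditions, and the pods scheduled on the node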

    Other Kubernetes Infrastructure Components

    • Pods: By encapsulating one (or more) application containers, pods are the most basic execution unit of a Kubernetes application. Each Pod contains the code and storage resources required for execution and has its own IP address. Pods include configuration options as well. Typically, a Pod contains a single container or a few containers that are coupled into an application or business function and that share a set of resources and data.
    • Deployments: A method of deploying containerized application Pods. A desired state described in a Deployment causes controllers to change the actual state of the cluster to achieve that state in an orderly manner (see the sketch after this list).
    • ReplicaSet: Ensures that a specified number of identical Pods are running at any given point in time.
    • Cluster DNS: serves DNS records needed to operate Kubernetes services.
    • Container Resource Monitoring: Captures and records container metrics in a central database.
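    Here is a sketch of a Deployment that asks for three replicas; the Deployment controller creates a ReplicaSet, which in turn keeps three identical Pods running (names and image are placeholders):

        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: hello-deployment       # placeholder name
        spec:
          replicas: 3                  # desired state: three identical Pods
          selector:
            matchLabels:
              app: hello
          template:                    # Pod template used by the underlying ReplicaSet
            metadata:
              labels:
                app: hello
            spec:
              containers:
              - name: hello
                image: nginx:1.27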

    Addons

    Addons use Kubernetes resources (DaemonSet, Deployment, etc.) to implement cluster features. Because they provide cluster-level features, namespaced resources for addons belong within the kube-system namespace. There are many; a few selected addons are described below:

  • DNS: https://kubernetes.io/docs/concepts/overview/components/#dns
  • Web UI (Dashboard): https://kubernetes.io/docs/concepts/overview/components/#web-ui-dashboard
  • Container Resource Monitoring: https://kubernetes.io/docs/concepts/overview/components/#container-resource-monitoring
  • Cluster-level Logging: https://kubernetes.io/docs/concepts/overview/components/#cluster-level-logging

    Kubernetes Architecture Best Practices and Design Principles

    Gartner’s Container Best Practices suggest a platform strategy that considers security, governance, monitoring, storage, networking, container lifecycle management and orchestration like Kubernetes. Here are some best practices for architecting Kubernetes clusters:

    • Ensure you have updated to the latest Kubernetes version.
    • Invest up-front in training for developer and operations teams.
    • Establish governance enterprise-wide. Ensure tools and vendors are aligned and integrated with Kubernetes orchestration.
    • Enhance security by integrating image-scanning processes as part of your CI/CD process, scanning during both build and run phases. Open-source code pulled from a GitHub repository should always be considered suspect.
    • Adopt role-based access control (RBAC) across the cluster. Least privilege, zero-trust models should be the standard.
    • Further secure containers by using only non-root users and making the file system read-only.
    • Avoid relying on default values; simple, explicit declarations are less error-prone and demonstrate intent more clearly.
    • Be careful when using basic Docker Hub images, which can contain malware or be bloated with unnecessary code. Start with lean, clean code and build packages up from there. Small images build faster, are smaller on disk, and image pulls are faster as well.
    • Keep containers simple. One process per container will let the orchestrator report if that one process is healthy or not.
    • When in doubt, crash. Kubernetes will restart a failed container, so do not restart on failure.
    • Be verbose. Descriptive labels help current developers and will be invaluable to the developers who follow in their footsteps.
    • Don’t get too granular with microservices. Not every function within a logical code component need be its own microservice.
    • Automate, where it makes sense. Automating CI/CD pipeline lets you avoid manual Kubernetes deployments entirely.
    • Use livenessProbe and readinessProbe to help manage Pod lifecycles; otherwise Pods may be terminated while they are still initializing or may begin receiving user requests before they are ready (a sketch follows this list).
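    As a sketch of that last point, here is a container spec fragment with both probes; the paths and timings are placeholders to adapt to the application:

        containers:
        - name: app
          image: nginx:1.27
          readinessProbe:              # traffic is withheld until this succeeds
            httpGet:
              path: /healthz/ready     # placeholder path
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:               # the container is restarted if this keeps failing
            httpGet:
              path: /healthz/live      # placeholder path
              port: 80
            initialDelaySeconds: 15
            periodSeconds: 20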


    Control Plane: Etcd

    Pods, ReplicationControllers, Services, Secrets, and so on need to be stored somewhere in a persistent manner so their manifests survive API server restarts and failures. For this, Kubernetes uses etcd, a fast, distributed, and consistent key-value store.

    Because it’s distributed, you can run more than one etcd instance to provide both high availability and better performance. The only component that talks to etcd directly is the Kubernetes API server. All other components read and write data to etcd indirectly through the API server. This brings a few benefits, among them a more robust optimistic locking system as well as validation; and, by abstracting away the actual storage mechanism from all the other components, it’s much simpler to replace it in the future. It’s worth emphasizing that etcd is the only place Kubernetes stores cluster state and metadata.

    While multiple instances of etcd and the API server can be active at the same time and perform their jobs in parallel, only a single instance of the Scheduler and the Controller Manager may be active at a given time, with the others in standby mode.

    The Kubelet is the only component that always runs as a regular system component, and it’s the Kubelet that then runs all the other components as pods. To run the Control Plane components as pods, the Kubelet is also deployed on the master. A listing such as the following shows these pods in the kube-system namespace.
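    The exact pod names vary by distribution and version; the output below is only illustrative of a single-master, kubeadm-style cluster:

        $ kubectl get pods -n kube-system
        NAME                             READY   STATUS    RESTARTS   AGE
        etcd-master                      1/1     Running   0          10d
        kube-apiserver-master            1/1     Running   0          10d
        kube-controller-manager-master   1/1     Running   0          10d
        kube-scheduler-master            1/1     Running   0          10d
        kube-proxy-abc12                 1/1     Running   0          10d
        coredns-5dd5756b68-xyz34         1/1     Running   0          10d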

    etcd v2 stores keys in a hierarchical key space, which makes key-value pairs similar to files in a file system. Each key in etcd is either:

    • a directory, which contains other keys, or
    • a regular key with a corresponding value.

    etcd v3 doesn’t support directories, but because the key format remains the same (keys can include slashes), you can still think of the keys as being grouped into directories. Kubernetes stores all its data in etcd under /registry.
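    If you have direct access to an etcd instance, you can peek at those keys with etcdctl; the endpoint and certificate paths below are placeholders that depend on how the cluster was set up:

        # endpoint and certificate paths are placeholders (kubeadm-style defaults shown)
        $ ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
            --cacert=/etc/kubernetes/pki/etcd/ca.crt \
            --cert=/etc/kubernetes/pki/etcd/server.crt \
            --key=/etc/kubernetes/pki/etcd/server.key \
            get /registry --prefix --keys-only | head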


    Ensuring Consistency and Availability

    Kubernetes requires all other Control Plane components to go through the API server. This way updates to the cluster state are always consistent, because the optimistic locking mechanism is implemented in a single place, so less chance exists, if any, of error. The API server also makes sure that the data written to the store is always valid and that changes to the data are only performed by authorized clients.

    The values stored under these keys are the resource definitions themselves: the API server stores the complete JSON representation of each resource in etcd. Because of etcd's hierarchical key space, you can think of all the stored resources as JSON files in a file system.

    For ensuring high availability, you’ll usually run more than a single instance of etcd. Multiple etcd instances will need to remain consistent. Such a distributed system needs to reach a consensus on what the actual state is. etcd uses the RAFT consensus algorithm to achieve this, which ensures that at any given moment, each node’s state is either what the majority of the nodes agrees is the current state or is one of the previously agreed upon states.

    Clients connecting to different nodes of an etcd cluster will either see the actual current state or one of the states from the past (in Kubernetes, the only etcd client is the API server, but there may be multiple instances). The consensus algorithm requires a majority (or quorum) for the cluster to progress to the next state. As a result, if the cluster splits into two disconnected groups of nodes, the state in the two groups can never diverge, because to transition from the previous state to the new one, there needs to be more than half of the nodes taking part in the state change. If one group contains the majority of all nodes, the other one obviously doesn’t. The first group can modify the cluster state, whereas the other one can’t. When the two groups reconnect, the second group can catch up with the state in the first group.

    etcd is usually deployed with an odd number of instances. Compare having two instances vs. having one:
    • Having two instances requires both instances to be present to have a majority.
    • If either of them fails, the etcd cluster can’t transition to a new state because no majority exists, so having two instances is actually worse than having only a single instance.
    • By having two, the chance of the whole cluster failing has increased by 100%, compared to that of a single-node cluster failing.

    The same applies when comparing three vs. four etcd instances:
    • With three instances, one instance can fail and a majority (of two) still exists.
    • With four instances, you need three nodes for a majority (two aren’t enough), so in both three- and four-instance clusters only a single instance may fail.
    • But when running four instances, if one fails, there is a higher chance of an additional instance (of the three remaining) failing than in a three-node cluster with one failed node and two remaining nodes.
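    The arithmetic behind these points: the quorum for n instances is floor(n/2) + 1, and the number of failures the cluster tolerates is n minus the quorum:

        instances (n)    quorum    tolerated failures
        1                1         0
        2                2         0
        3                2         1
        4                3         1
        5                3         2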

    Control Plane: The API Server

    First, the API server needs to authenticate the client sending the request. This is performed by one or more authentication plugins configured in the API server. The API server calls these plugins in turn, until one of them determines who is sending the request. It does this by inspecting the HTTP request.

    Besides authentication plugins, the API server is also configured to use one or more authorization plugins. Their job is to determine whether the authenticated user can perform the requested action on the requested resource. For example, when creating pods, the API server consults all authorization plugins in turn, to determine whether the user can create pods in the requested namespace. As soon as a plugin says the user can perform the action, the API server progresses to the next stage.

    If the request is trying to create, modify, or delete a resource, the request is sent through Admission Control. Again, the server is configured with multiple Admission Control plugins. These plugins can modify the resource for different reasons. They may initialize fields missing from the resource specification to the configured default values or even override them. They may even modify other related resources, which aren’t in the request, and can also reject a request for whatever reason. The resource passes through all Admission Control plugins.

    Examples of Admission Control plugins include:

  • AlwaysPullImages: Overrides the pod’s imagePullPolicy to Always, forcing the image to be pulled every time the pod is deployed.
  • ServiceAccount: Applies the default service account to pods that don’t specify it explicitly.
  • Namespace Lifecycle: Prevents creation of pods in namespaces that are in the process of being deleted, as well as in non-existing namespaces.
  • ResourceQuota: Ensures pods in a certain namespace only use as much CPU and memory as has been allotted to the namespace.
    The API server doesn’t create pods when you create a ReplicaSet resource, and it doesn’t manage the endpoints of a service. That’s what controllers in the Controller Manager do. But the API server doesn’t even tell these controllers what to do. All it does is enable those controllers and other components to observe changes to deployed resources. A Control Plane component can request to be notified when a resource is created, modified, or deleted. This enables the component to perform whatever task it needs in response to a change of the cluster metadata. Clients watch for changes by opening an HTTP connection to the API server.

    Through this connection, the client will then receive a stream of modifications to the watched objects. Every time an object is updated, the server sends the new version of the object to all connected clients watching the object.
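    You can observe the same watch mechanism yourself; the second command streams the raw watch events over a single long-lived HTTP connection:

        # stream changes to pods in the current namespace; each update is a new version of an object
        $ kubectl get pods --watch

        # the same mechanism at the HTTP level (the API server keeps the connection open)
        $ kubectl get --raw '/api/v1/namespaces/default/pods?watch=true'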


    Control Plane: The Scheduler

    In short, the Scheduler waits for newly created Pods through the API server's watch mechanism and assigns a node to each Pod that doesn't have one yet. It doesn't instruct the selected node to run the pod; it simply updates the pod definition through the API server, and the Kubelet on that node then notices the pod has been scheduled to it and runs the pod's containers.

    Control Plane: The Controller Manager

    The Controller Manager is the single process that runs the various controllers. Each controller watches the API server for changes to its resource type and works to reconcile the actual state of the cluster with the desired state written in those resources.

    The Replication Manager

    The Replication Manager is the controller for ReplicationController resources. It watches for changes to ReplicationControllers and to the pods they match, and whenever the number of running pods drifts from the desired replica count, it creates or deletes Pod manifests through the API server. It never runs the pods itself; as with the other controllers described below, the Kubelet on the scheduled node does that.

    Control Plane: Other Controllers

    The ReplicaSet, The DaemonSet, and Job Controllers

    The ReplicaSet controller does almost the same thing as the Replication Manager described previously, so we don’t have much to add here. The DaemonSet and Job controllers are similar. They create Pod resources from the pod template defined in their respective resources. Like the Replication Manager, these controllers don’t run the pods, but post Pod definitions to the API server, letting the Kubelet create their containers and run them.


    The Deployment Controller

    The Deployment controller takes care of keeping the actual state of a deployment in sync with the desired state specified in the corresponding Deployment API object. The Deployment controller performs a rollout of a new version each time a Deployment object is modified (if the modification should affect the deployed pods). It does this by creating a ReplicaSet and then appropriately scaling both the old and the new ReplicaSet based on the strategy specified in the Deployment, until all the old pods have been replaced with new ones. It doesn’t create any pods directly.
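    A typical way to trigger and then observe such a rollout, assuming a Deployment named hello-deployment with a container named hello (placeholders from the earlier sketch):

        $ kubectl set image deployment/hello-deployment hello=nginx:1.28   # modify the Deployment object
        $ kubectl rollout status deployment/hello-deployment               # watch old pods being replaced
        $ kubectl get replicasets                                          # old and new ReplicaSets, scaled in opposite directions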


    The StatefulSet Controller

    The StatefulSet controller, similarly to the ReplicaSet controller and other related controllers, creates, manages, and deletes Pods according to the spec of a StatefulSet resource. But while those other controllers only manage Pods, the StatefulSet controller also instantiates and manages PersistentVolumeClaims for each Pod instance.
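    The distinguishing piece is the volumeClaimTemplates section, from which the controller creates one PersistentVolumeClaim per Pod (data-db-0, data-db-1, and so on); all names, sizes, and images below are placeholders:

        apiVersion: apps/v1
        kind: StatefulSet
        metadata:
          name: db
        spec:
          serviceName: db              # headless Service governing the StatefulSet
          replicas: 2
          selector:
            matchLabels:
              app: db
          template:
            metadata:
              labels:
                app: db
            spec:
              containers:
              - name: db
                image: postgres:16
                volumeMounts:
                - name: data
                  mountPath: /var/lib/postgresql/data
          volumeClaimTemplates:        # one PVC is created and managed per Pod
          - metadata:
              name: data
            spec:
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 1Gi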


    The Node Controller

    The Node controller manages the Node resources, which describe the cluster’s worker nodes. Among other things, a Node controller keeps the list of Node objects in sync with the actual list of machines running in the cluster. It also monitors each node’s health and evicts pods from unreachable nodes. The Node controller isn’t the only component making changes to Node objects. They’re also changed by the Kubelet, and can obviously also be modified by users through REST API calls.


    The Service Controller

    One type of Service is the LoadBalancer Service, which requests a load balancer from the infrastructure to make the service available externally. The Service controller is the component that requests and releases a load balancer from the infrastructure when a LoadBalancer-type Service is created or deleted.


    The Endpoints Controller

    Services aren’t linked directly to pods, but instead contain a list of endpoints (IPs and ports), which is created and updated either manually or automatically according to the pod selector defined on the Service. The Endpoints controller is the active component that keeps the endpoint list constantly updated with the IPs and ports of pods matching the label selector.

    The controller watches both Services and Pods. When Services are added or updated or Pods are added, updated, or deleted, it selects Pods matching the Service’s pod selector and adds their IPs and ports to the Endpoints resource. Remember, the Endpoints object is a standalone object, so the controller creates it if necessary. Likewise, it also deletes the Endpoints object when the Service is deleted.
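    A sketch of the relationship the controller maintains: the Service selects Pods by label, and the Endpoints controller fills in the matching Pod IPs and ports (names and labels are placeholders):

        apiVersion: v1
        kind: Service
        metadata:
          name: hello-svc
        spec:
          selector:
            app: hello                 # the pod selector the Endpoints controller evaluates
          ports:
          - port: 80
            targetPort: 80

        # the controller-maintained endpoint list can then be inspected with:
        #   kubectl get endpoints hello-svc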


    The Namespace Controller

    When a Namespace resource is deleted, all the resources in that namespace must also be deleted. This is what the Namespace controller does. When it’s notified of the deletion of a Namespace object, it deletes all the resources belonging to the namespace through the API server.


    The PersistentVolume Controller

    Once a user creates a PersistentVolumeClaim, Kubernetes must find an appropriate PersistentVolume and bind it to the claim. This is performed by the PersistentVolume controller. When a PersistentVolumeClaim pops up, the controller finds the best match for the claim by selecting the smallest PersistentVolume with the access mode matching the one requested in the claim and the declared capacity above the capacity requested in the claim. It does this by keeping an ordered list of PersistentVolumes for each access mode by ascending capacity and returning the first volume from the list. Then, when the user deletes the PersistentVolumeClaim, the volume is unbound and reclaimed according to the volume’s reclaim policy (left as is, deleted, or emptied).
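    A sketch of a claim the PersistentVolume controller would try to bind; it looks for the smallest available PersistentVolume with a matching access mode and at least the requested capacity (the name and size are placeholders):

        apiVersion: v1
        kind: PersistentVolumeClaim
        metadata:
          name: data-claim
        spec:
          accessModes:
          - ReadWriteOnce              # must be offered by the bound PersistentVolume
          resources:
            requests:
              storage: 5Gi             # the bound volume's capacity must be at least this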


    Again, all of these controllers operate on the API objects through the API server. They don’t communicate with the Kubelets directly or issue any kind of instructions to them. In fact, they don’t even know Kubelets exist. After a controller updates a resource in the API server, the Kubelets and Kubernetes Service Proxies, also oblivious of the controllers’ existence, perform their work, such as spinning up a pod’s containers and attaching network storage to them, or in the case of services, setting up the actual load balancing across pods.


    Data Plane: The Kubelet and Service Proxy

    The Kubelet

    In a nutshell, the Kubelet is the component responsible for everything running on a worker node. Its initial job is to register the node it’s running on by creating a Node resource in the API server. Then it needs to continuously monitor the API server for Pods that have been scheduled to the node, and start the pod’s containers. It does this by telling the configured container runtime (containerd, CRI-O, or another CRI-compatible runtime) to run a container from a specific container image. The Kubelet then constantly monitors running containers and reports their status, events, and resource consumption to the API server.

    The Kubelet is also the component that runs the container liveness probes, restarting containers when the probes fail. Lastly, it terminates containers when their Pod is deleted from the API server and notifies the server that the pod has terminated.

    Although the Kubelet talks to the Kubernetes API server and gets the pod manifests from there, it can also run pods based on pod manifest files in a specific local directory. This feature is used to run the containerized versions of the Control Plane components as pods. Instead of running Kubernetes system components natively, you can put their pod manifests into the Kubelet’s manifest directory and have the Kubelet run and manage them. You can also use the same method to run your custom system containers, but doing it through a DaemonSet is the recommended method.
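    For example, in a kubeadm-style setup the manifest directory is typically /etc/kubernetes/manifests, configured through the staticPodPath field of the kubelet configuration; the paths shown here are common defaults, not guaranteed for every distribution:

        $ ls /etc/kubernetes/manifests
        etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml

        # the directory is set in the kubelet configuration file:
        $ grep staticPodPath /var/lib/kubelet/config.yaml
        staticPodPath: /etc/kubernetes/manifests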


    The Service Proxy

    Beside the Kubelet, every worker node also runs the kube-proxy, whose purpose is to make sure clients can connect to the services you define through the Kubernetes API. The kube-proxy makes sure connections to the service IP and port end up at one of the pods backing that service (or other, non-pod service endpoints). When a service is backed by more than one pod, the proxy performs load balancing across those pods.

    The kube-proxy got its name because it was an actual proxy, but the current, much better performing implementation only uses iptables rules to redirect packets to a randomly selected backend pod without passing them through an actual proxy server. This mode is called the iptables proxy mode.
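    On a node you can look at those rules directly; the chain names (KUBE-SERVICES, KUBE-SVC-*) are the ones the iptables proxy mode typically creates:

        # dump the NAT table and show the chains kube-proxy maintains
        $ sudo iptables-save -t nat | grep -E 'KUBE-(SERVICES|SVC)' | head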

    The DNS server pod is exposed through the kube-dns service, allowing the pod to be moved around the cluster like any other pod. The service’s IP address is specified as the nameserver in the /etc/resolv.conf file inside every container deployed in the cluster. The kube-dns pod uses the API server’s watch mechanism to observe changes to Services and Endpoints and keeps its DNS records up to date.
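    You can confirm this from inside any pod; the pod name is a placeholder, and the nameserver IP depends on the cluster's service CIDR (10.96.0.10 is a common default, shown only for illustration):

        $ kubectl exec hello-pod -- cat /etc/resolv.conf
        nameserver 10.96.0.10
        search default.svc.cluster.local svc.cluster.local cluster.local
        options ndots:5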

    An Ingress controller runs a reverse proxy server (like Nginx, for example), and keeps it configured according to the Ingress, Service, and Endpoints resources defined in the cluster. The controller thus needs to observe those resources (again, through the watch mechanism) and change the proxy server’s config every time one of them changes. Although the Ingress resource’s definition points to a Service, Ingress controllers forward traffic to the service’s pods directly instead of going through the service IP. This preserves client IPs when external clients connect through the Ingress controller, which makes Ingress preferable to Services in certain use cases.
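    A sketch of an Ingress such a controller would watch; the host, service name, and port are placeholders, and an Ingress controller must be installed for the resource to have any effect:

        apiVersion: networking.k8s.io/v1
        kind: Ingress
        metadata:
          name: hello-ingress
        spec:
          rules:
          - host: hello.example.com    # placeholder host
            http:
              paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: hello-svc    # the Service whose pods receive the traffic
                    port:
                      number: 80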


    Written: December 21, 2024