Kubernetes is an open-source platform for running cloud-native apps.
A cloud-native app is a collection of interacting services that come together to form something useful. Because it is made from small components, it is easy to scale and update, and it lends itself to things like canary releases and blue-green deployments.
A Kubernetes cluster is made up of a bunch of Linux nodes (VMs or physical machines).
Some of those nodes form the control plane; the others are worker nodes.
The control plane is the "brains" of the cluster. The workers do the heavy lifting: that's where your apps run.
Control plane composition
- API Server
- Persistent store (etcd, Stateful)
Beware of etcd at scale (doesn't scale that well)
The Kubernetes API
The API is where every Kubernetes resource is defined.
It is an HTTP REST API that supports the standard CRUD operations (Create, Read, Update, Delete).
For the most part, interaction with the API happens via the kubectl command-line utility. It handles the requests in the background (e.g. it POSTs .yaml configs to the API server).
The API can be used declaratively or imperatively, but declarative is preferred, which basically means describing the desired state in configuration files instead of issuing commands.
The API used to be a monolith, where all resources were defined in the core API group, often written as empty quotation marks ("").
Resources are now spread across API groups, and each group is versioned:
- Alpha - Beware, can be dropped
- Beta - Becoming stable
- GA - Ready for production
kubectl get apiservices
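The group/version split shows up directly in the REST paths. As a rough sketch (the helper names `split_api_version` and `api_path` are made up for illustration), the core group is served under `/api/<version>`, while named groups are served under `/apis/<group>/<version>`:

```python
from typing import Tuple


def split_api_version(api_version: str) -> Tuple[str, str]:
    """Split an apiVersion string into (group, version).

    The core group has an empty name, so "v1" maps to ("", "v1").
    """
    if "/" in api_version:
        group, version = api_version.split("/", 1)
        return group, version
    return "", api_version


def api_path(api_version: str, namespace: str, resource: str) -> str:
    """Build the REST collection path for a namespaced resource."""
    group, version = split_api_version(api_version)
    # The core group lives under /api, every named group under /apis.
    prefix = f"/api/{version}" if group == "" else f"/apis/{group}/{version}"
    return f"{prefix}/namespaces/{namespace}/{resource}"


print(api_path("v1", "default", "pods"))
print(api_path("apps/v1", "default", "deployments"))
```

A GET on the first path lists Pods in the `default` namespace; a POST to it creates one.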
Containers (e.g. Docker) are a major force driving Kubernetes, but Kubernetes doesn't run containers directly. Instead, it wraps them in a higher-level construct called a pod.
A pod can run one or more containers.
If the atomic unit for a virtualization environment is a VM, the atomic unit for the Kubernetes environment is the pod.
App code exists in containers, containers live in pods, and pods run on Kubernetes.
Pods are defined in the core API group (apiVersion v1) and can be used on their own, but it's more useful to wrap them in a Deployment, a resource defined in the apps API group (apiVersion apps/v1).
Deployments are more useful because they are flexible: they offer rolling updates, scaling, etc.
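To make the wrapping concrete, here is a minimal sketch of a Deployment manifest built as a Python dict and printed as JSON (kubectl accepts JSON as well as YAML). The name `web`, the `app: web` label, and the `nginx:1.25` image are placeholders, not anything these notes prescribe.

```python
import json

# Hypothetical label; the selector and the pod template must agree on it.
labels = {"app": "web"}

deployment = {
    "apiVersion": "apps/v1",          # apps group, version v1
    "kind": "Deployment",
    "metadata": {"name": "web"},
    "spec": {
        "replicas": 3,
        "selector": {"matchLabels": labels},
        "template": {                  # the Pod being wrapped
            "metadata": {"labels": labels},
            "spec": {
                "containers": [
                    {
                        "name": "web",
                        "image": "nginx:1.25",
                        "ports": [{"containerPort": 80}],
                    }
                ]
            },
        },
    },
}

print(json.dumps(deployment, indent=2))
```

Saving the output to a file and running `kubectl apply -f` on it is the declarative workflow described earlier.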
Besides Deployments, there are other objects that can wrap your Pods:
- DaemonSet - One Pod per node
- StatefulSet - Stateful app components
Other objects include Secrets, volumes, load balancers, etc.
Networking Basic Rules
- All Nodes can talk
- All Pods can talk (No NAT)
- Every Pod gets its own IP
- Node Network
  - Kubernetes does not implement the Node network
  - 443 (HTTPS)
  - Pod networking
- Pod Network
  - CNI plugin for the network interface
  - Third parties provide the plugins for implementing the pod network
  - Large and flat (like Overlay)
  - Each node gets allocated a subset of addresses
  - Pod sees itself "as the allocated address", the same way other pods see it
  - Open to talk between pods
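The "each node gets a subset of addresses" rule can be sketched with Python's `ipaddress` module. The cluster CIDR `10.244.0.0/16`, the `/24` per-node size, and the node names are assumptions picked for illustration; real values depend on the cluster and the CNI plugin.

```python
import ipaddress


def allocate_pod_cidrs(cluster_cidr, nodes, node_prefix=24):
    """Carve one per-node subnet out of the flat pod network."""
    subnets = ipaddress.ip_network(cluster_cidr).subnets(new_prefix=node_prefix)
    # zip stops at the shorter input, so each node gets the next free subnet.
    return dict(zip(nodes, subnets))


cidrs = allocate_pod_cidrs("10.244.0.0/16", ["node-a", "node-b", "node-c"])
for node, cidr in cidrs.items():
    print(node, cidr, f"({cidr.num_addresses} addresses)")

# A pod IP belongs to exactly one node's range.
pod_ip = ipaddress.ip_address("10.244.1.7")
print([node for node, cidr in cidrs.items() if pod_ip in cidr])
```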
Pods come and go, and so do the IPs associated with them. That's hard to keep up with, so we rely on Kubernetes Services.
Kubernetes Services are a stable abstraction point for a bunch of pods.
Every service gets a name and an IP. Kubernetes guarantees these are stable for the lifetime of a service.
Services get registered with the cluster's DNS.
Every Kubernetes cluster has a native DNS service (typically CoreDNS).
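The registered name follows a fixed pattern. A small sketch, assuming the common default cluster domain `cluster.local` (clusters can be configured with a different one) and made-up service names:

```python
def service_fqdn(service, namespace="default", cluster_domain="cluster.local"):
    """Return the DNS name under which a Service is registered."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


print(service_fqdn("web"))
print(service_fqdn("db", namespace="prod"))
```

Pods in the same namespace can usually reach the Service by its short name alone (here `web`), because of the DNS search domains configured in the Pod.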
A Service load-balances traffic across the Pods that match its label selector.
When you create a Service object with a label selector, Kubernetes automatically creates another object in the cluster called an Endpoints object.
The Endpoints object contains the IPs and ports of the Pods that match the Service's label selector, and Kubernetes updates it automatically as matching Pods come and go.
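Label selection is easy to picture in code. The sketch below handles only equality-based selectors and ignores Pod readiness, which the real controller also takes into account; the Pod names, IPs, and labels are invented.

```python
def matches(selector, labels):
    """True if every key/value in the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def endpoints_for(selector, pods, port):
    """Collect ip:port pairs for all pods matched by the selector."""
    return [f"{pod['ip']}:{port}" for pod in pods if matches(selector, pod["labels"])]


pods = [
    {"name": "web-1", "ip": "10.244.0.5", "labels": {"app": "web", "tier": "frontend"}},
    {"name": "web-2", "ip": "10.244.1.7", "labels": {"app": "web"}},
    {"name": "db-1", "ip": "10.244.2.9", "labels": {"app": "db"}},
]

print(endpoints_for({"app": "web"}, pods, 8080))
```

Extra labels on a Pod (like `tier` above) don't prevent a match; only the keys named in the selector matter.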
LoadBalancer
- Integrates with public cloud platform
NodePort
- Gets cluster-wide port
- Also accessible from outside of cluster
- Port is allocated from the 30000-32767 range by default
ClusterIP
- Default and most basic
- Gets own IP
- Only accessible from within the cluster
All Services provide a stable abstraction point for a number of Pods.
When you create a Service it gets a unique, long-lived IP.
This IP, however, is not on any network that you recognize. You cannot find it on the Pod network or on the Node network: it is on a third network called the service network.
The service network is not a proper network, in the sense that no matter how much digging you do with your normal networking tools, you are not going to find any interfaces on it. There are no routes to it either.
Traffic gets to the service network via kube-proxy.
You get one kube-proxy pod on each node.
One of kube-proxy's responsibilities is to write a bunch of iptables rules that send requests to the appropriate Pods on the Pod network.
The Pod sends traffic to its virtual ethernet interface, which has no clue about the service network (normal networking etiquette applies), and the interface sends the packets to its default gateway.
This happens to be a Linux bridge called cbr0 (similar to docker0), which sends the packets upstream again to the node's eth0 interface.
When this happens, the packets get processed by the host's kernel. At this point the kernel checks the iptables rules, which state that the destination address should be rewritten to one on the Pod network, and the packets are sent on their way.
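The destination rewrite can be modelled as a lookup table. This is only a toy picture of what the iptables rules achieve, not how kube-proxy is implemented; the service IP `10.96.0.10` and the backend addresses are made up.

```python
import random

# Hypothetical rule table: (service IP, port) -> backend pod endpoints.
SERVICE_RULES = {
    ("10.96.0.10", 80): [("10.244.0.5", 8080), ("10.244.1.7", 8080)],
}


def dnat(dst_ip, dst_port, rules, rng=random):
    """Rewrite a service destination to one randomly chosen backend pod."""
    backends = rules.get((dst_ip, dst_port))
    if not backends:
        # Not a service address: leave the packet's destination untouched.
        return dst_ip, dst_port
    return rng.choice(backends)


print(dnat("10.96.0.10", 80, SERVICE_RULES))    # one of the two backends
print(dnat("10.244.2.9", 5432, SERVICE_RULES))  # unchanged, already a pod IP
```

Random backend selection mirrors iptables mode, where there is no smarter balancing algorithm to choose from.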
kube-proxy in iptables mode has been the default since Kubernetes 1.2, but it doesn't scale well.
kube-proxy in IPVS mode has been stable (GA) since Kubernetes 1.11, and is more scalable.
- It uses Linux kernel IP Virtual Server
- Native Layer-4 load balancer
- Supports more algorithms
- It was expected to become the default going forward (iptables mode is still the default on Linux)
Kubernetes Volumes decouple storage from Pods.
Storage is vital. File and block storage are first-class citizens in Kubernetes.
- Pluggable backend
- Rich API
Kubernetes offers an interface through which you can consume the storage of your choice. This interface is called the Container Storage Interface (CSI).
The PersistentVolume subsystem (pv, pvc, sc) is used to consume the storage.
- PersistentVolume (PV) — actual storage e.g. 2GB
- PersistentVolumeClaim (PVC) — Ticket/voucher to use PV
- StorageClass (SC) — Makes it dynamic
Container Storage Interface
Initially Kubernetes didn't support CSI.
The link between the PV subsystem and the storage backends was built into the Kubernetes codebase (in-tree), which wasn't a good approach: plugin updates were tied to Kubernetes releases, and vendor plugins had to be open source.
Container Storage Interface (CSI) is an initiative to unify the storage interface of Container Orchestration Systems (e.g. Swarm, Kubernetes, etc.) with storage vendors/providers (e.g. Ceph, Portworx, GCS, etc.).
CSI is out-of-tree and is an independent open standard.
It was in beta as of Kubernetes 1.11 and went GA in 1.13.
PV and PVC
A PersistentVolume resource is a piece of storage that has been provisioned manually by an administrator, or dynamically using a Storage Class.
It is a resource just like any other in the cluster. PVs are volume plugins like Volumes, but have a lifecycle that is independent of any individual Pod.
A PersistentVolumeClaim is a request for storage by a user. It is similar to a Pod.
Pods consume node resources, and PVCs consume PV resources, in the sense that pods can request specific levels of resources (CPU and Memory), while claims can request specific size and access modes.
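The claim/volume relationship can be sketched as a matching problem. This toy binder only looks at size and access modes and picks the smallest volume that fits; the real controller also considers storage class, selectors, and more. All names and sizes are invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PV:
    name: str
    capacity_gi: int
    access_modes: frozenset
    bound: bool = False


@dataclass
class PVC:
    name: str
    request_gi: int
    access_modes: frozenset


def bind(claim: PVC, volumes: List[PV]) -> Optional[PV]:
    """Bind the claim to the smallest free PV that satisfies it."""
    candidates = [
        pv for pv in volumes
        if not pv.bound
        and pv.capacity_gi >= claim.request_gi
        and claim.access_modes <= pv.access_modes  # requested modes are a subset
    ]
    if not candidates:
        return None  # the claim stays pending
    best = min(candidates, key=lambda pv: pv.capacity_gi)
    best.bound = True
    return best


volumes = [
    PV("pv-large", 10, frozenset({"ReadWriteOnce", "ReadOnlyMany"})),
    PV("pv-small", 2, frozenset({"ReadWriteOnce"})),
]
claim = PVC("data", 2, frozenset({"ReadWriteOnce"}))
print(bind(claim, volumes).name)
```

A PV is bound to at most one claim, which is why the sketch marks it as `bound` and skips it for later claims.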
Storage Classes enable dynamic provisioning of volumes.
Each StorageClass contains the fields provisioner, parameters, and reclaimPolicy, which are used when a PersistentVolume belonging to the class needs to be dynamically provisioned.
The name of a StorageClass object is significant, and is how users can request a particular class.
A default StorageClass is often present out of the box (e.g. on managed cloud clusters).
A Deployment provides declarative updates for Pods and ReplicaSets.
Deployments are the most common form of deploying containerized applications.
- Build Docker Image
- Image runs in Container
- Container is dressed up in a Kubernetes Pod
- Pod is managed by Deployment (for scaling, etc)
- (behind the scenes there is something called a ReplicaSet that sits between the Deployment and the Pod - but we don't manually mess with it)
A ReplicaSet is like an array of Pods. Its purpose is to maintain a stable set of identical Pods running at any given time.
A Deployment is responsible for managing one type of Pod (e.g. redis, mongo, etc).
If we need a different type of Pod, we make another Deployment.
The strategy property refers to the update strategy.
RollingUpdate is the one most frequently used. In this context, the maxSurge property is the maximum number of extra Pods we are willing to run above the desired replica count, and maxUnavailable is how many Pods we are willing to have unavailable during an update.
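A quick way to see what these two knobs allow is to compute the bounds they imply. The sketch below assumes the documented rounding behaviour (percentages round up for maxSurge and down for maxUnavailable) and the 25% defaults; the function names are made up.

```python
import math


def resolve(value, replicas, round_up):
    """Resolve an absolute number or a percentage string such as "25%"."""
    if isinstance(value, str) and value.endswith("%"):
        exact = replicas * int(value[:-1]) / 100
        return math.ceil(exact) if round_up else math.floor(exact)
    return int(value)


def rolling_update_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return (max total pods, min available pods) during a rollout."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return replicas + surge, replicas - unavailable


print(rolling_update_bounds(10))            # (13, 8)
print(rolling_update_bounds(4, "50%", 1))   # (6, 3)
```

With 10 replicas and the defaults, a rollout may run up to 13 Pods at once and must keep at least 8 available.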
Kubernetes Auto-scaling Apps
Kubernetes Auto-scaling Technologies
- Horizontal Pod Autoscaler (HPA)
- Cluster Autoscaler (CA)
- Vertical Pod Autoscaler (VPA)
The Horizontal Pod Autoscaler automatically scales the number of Pods in a Deployment based on CPU utilization. It works with ReplicationControllers, ReplicaSets, and StatefulSets as well, and more recently with memory and other metrics.
The HPA is a resource like everything else, and can be created from a config file.
For utilization-based auto-scaling to work, the Pods in the Deployment need resource requests configured.
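The core of the HPA calculation is a simple ratio: desired = ceil(current replicas × current metric / target metric). The sketch below shows just that formula plus min/max clamping; the real controller also ignores small deviations and applies other safeguards, which are left out here. The numbers are illustrative.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=1, max_replicas=10):
    """Scale the replica count by the ratio of observed to target utilization."""
    ratio = current_utilization / target_utilization
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured on the autoscaler.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, 90, 50))   # 8: running hot, scale out
print(desired_replicas(4, 25, 50))   # 2: mostly idle, scale in
```

Utilization is measured relative to the Pods' requests, which is why the requests have to be set.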
The Cluster Autoscaler wakes up every 10 seconds, checking for pending pods.
The Cluster Autoscaler uses the cloud platform's API to allocate nodes, which means autoscaling has to be enabled at the cloud level.
Different clouds offer different levels of support (Google has been the best to support the scaling API).
A client (e.g. kubectl) authenticates with the API server using credentials.
- Bearer tokens
- Client certs
- Bootstrap tokens
- External systems
Kubernetes has no internal user support. User accounts cannot be stored and managed locally within Kubernetes.
All regular users and groups must be created and managed outside the cluster.
- Subject - e.g. jack, sam, john, user1
- Verb - e.g. get, create, delete, update
- Resource - e.g. pods, deployments, services, crd
When you create a new cluster, you get a context and a user that has very powerful permissions (too powerful for production).
- ClusterRole - verb and resource specified here
- RoleBinding - subject is bound to the role
kubectl config current-context
kubectl get clusterrolebindings
kubectl get clusterroles
Admission controllers
- Mutating - changes the request
- Validating - just validates
There is no such thing as a DENY rule in RBAC: everything is denied by default, and the system is purely additive.
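The additive, deny-by-default model fits in a few lines. This is a simplified sketch of the subject/verb/resource check with roles and bindings as plain Python data; the role names and the particular bindings are made up, and namespaces and API groups are ignored.

```python
# Hypothetical roles: role name -> set of (verb, resource) pairs it grants.
ROLES = {
    "pod-reader": {("get", "pods"), ("list", "pods")},
    "deployer": {("create", "deployments"), ("update", "deployments")},
}

# Hypothetical bindings: (subject, role name).
BINDINGS = [("jack", "pod-reader"), ("sam", "deployer"), ("sam", "pod-reader")]


def is_allowed(subject, verb, resource, roles=ROLES, bindings=BINDINGS):
    """Allow only if some role bound to the subject grants the action."""
    return any(
        bound_subject == subject and (verb, resource) in roles.get(role, set())
        for bound_subject, role in bindings
    )


print(is_allowed("jack", "get", "pods"))      # True: granted by pod-reader
print(is_allowed("jack", "delete", "pods"))   # False: nothing grants it
print(is_allowed("sam", "list", "pods"))      # True: permissions add up
```

There is no way to express "jack may not delete pods"; the action is refused simply because no rule grants it.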
Other Notable Resources
DaemonSet
- Like Deployment, but makes sure all (or some) Nodes run a copy of a Pod
StatefulSet
- Used to manage stateful apps; provides guarantees about the ordering and uniqueness of Pods
Job
- Ensures Pods run to completion (successfully terminate)
CronJob
- Job, but on a time-based schedule
PodSecurityPolicy
- Controls security-sensitive aspects of the pod spec
- Defines a set of conditions that pods must adhere to
ResourceQuota
- Limits resource consumption per namespace
- E.g. limit number of objects created in a namespace by type
CustomResourceDefinition
- Add your own resources to Kubernetes, and have them be treated like first-class citizens