In this lab, we will explore the Kubernetes architecture and its components, such as the kubelet and the scheduler.
Note
YAML files for this lab are located in the directory ~/k8s-examples/architecture/.
Chapter Details
Chapter Goal: Understand Kubernetes architecture
Chapter Sections:
Static pods are managed directly by the kubelet daemon on a specific node. They do not have an associated replication controller; the kubelet daemon itself watches each static pod and restarts it if it crashes. Static pods are always bound to one kubelet daemon and always run on the same node as it.
Step 1 We will create a new static pod on the master node. First, you need to find where the kubelet looks for static pod manifests: either the --pod-manifest-path command line argument or the staticPodPath option in the kubelet configuration file. In your cluster, the kubelet is controlled by systemd, and you can find the kubelet parameters in the corresponding configuration file:
$ sudo cat /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
# Note: This dropin only works with kubeadm and kubelet v1.11+
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
# This is a file that "kubeadm init" and "kubeadm join" generates at runtime, populating the KUBELET_KUBEADM_ARGS variable dynamically
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
# This is a file that the user can use for overrides of the kubelet args as a last resort. Preferably, the user should use
# the .NodeRegistration.KubeletExtraArgs object in the configuration files instead. KUBELET_EXTRA_ARGS should be sourced from this file.
EnvironmentFile=-/etc/default/kubelet
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
The kubelet configuration parameters are in the file referenced by KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml. To verify the path for static pod declarations, execute:
$ sudo grep staticPod /var/lib/kubelet/config.yaml
staticPodPath: /etc/kubernetes/manifests
The kubelet periodically scans this directory and creates/deletes pods as their corresponding YAML/JSON declaration files appear/disappear there. Note that the kubelet ignores files starting with a dot when scanning the directory. If this parameter is not set in your cluster, you need to set it - there is no default value for it - and restart the kubelet.
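If staticPodPath is missing, a minimal sketch of the change (assuming the kubeadm layout shown above, where the kubelet reads /var/lib/kubelet/config.yaml and is managed by systemd) is to append the option and restart the kubelet:
$ sudo sh -c 'echo "staticPodPath: /etc/kubernetes/manifests" >> /var/lib/kubelet/config.yaml'
$ sudo systemctl restart kubelet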
Step 2 Define a new pod in the file echoserver-pod-1.yaml (or use the existing one from the directory ~/k8s-examples/architecture/), similar to the pod from section 6.3, Create a Pod:
apiVersion: v1
kind: Pod
metadata:
  name: echoserver
spec:
  containers:
  - name: echoserver
    image: gcr.io/google_containers/echoserver:1.4
    ports:
    - containerPort: 8080
Step 3 Copy the file to the directory /etc/kubernetes/manifests:
$ sudo cp echoserver-pod-1.yaml /etc/kubernetes/manifests/
Step 4 Check that the echoserver container is running on the master node:
$ docker ps | grep echoserver
... gcr.io/google_containers/echoserver@... k8s_echoserver_echoserver-master...
Step 5 The kubelet automatically creates a so-called mirror pod on the Kubernetes API server for each static pod, so static pods are visible there, but they cannot be controlled from the API server. Check that there is a mirror pod visible to the Kubernetes API:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
echoserver-master 1/1 Running 0 3m ... master
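If you want to confirm that echoserver-master is a mirror pod rather than an API-managed pod, you can inspect its annotations; on a kubeadm-based cluster the kubelet typically marks mirror pods with annotations such as kubernetes.io/config.mirror and kubernetes.io/config.source (the exact keys may vary by Kubernetes version):
$ kubectl get pod echoserver-master -o jsonpath='{.metadata.annotations}{"\n"}'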
Step 6 Try to delete the pod using kubectl:
$ kubectl delete pod echoserver-master
pod "echoserver-master" deleted
Check that the pod is still running and there are no restarts:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
echoserver-master 1/1 Running 0 4m ... master
Step 7 To delete the pod, remove its declaration from /etc/kubernetes/manifests/:
$ sudo rm /etc/kubernetes/manifests/echoserver-pod-1.yaml
Check that the pod is no longer running:
$ kubectl get pods -o wide
No resources found.
Step 8 Note that some of the cluster components are declared as static pods:
$ ls -1 /etc/kubernetes/manifests
etcd.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
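You can inspect one of these manifests to confirm that the control-plane components are ordinary pod definitions; for example (the exact output depends on your Kubernetes version):
$ sudo grep 'image:' /etc/kubernetes/manifests/kube-scheduler.yaml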
You can constrain a pod so that it is only able to run on particular nodes, or so that it prefers to run on particular nodes. To do this, you can attach labels to nodes and use label selectors.
Step 1 In addition to labels you attach, nodes come pre-populated with a standard set of labels. Let’s check the standard labels for nodes:
$ kubectl get node node1 -o jsonpath="{.metadata.labels}"
map[beta.kubernetes.io/arch:amd64 beta.kubernetes.io/os:linux kubernetes.io/hostname:node1]
$ kubectl get node master -o jsonpath="{.metadata.labels}"
map[node-role.kubernetes.io/master: beta.kubernetes.io/arch:amd64 beta.kubernetes.io/os:linux kubernetes.io/hostname:master]
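You can also use these labels directly in queries; for example, the following commands list nodes matching one of the standard labels and show all node labels:
$ kubectl get nodes -l kubernetes.io/hostname=node1
$ kubectl get nodes --show-labels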
Step 2 Define a new pod in the file echoserver-pod-2.yaml (or use the existing one from the directory ~/k8s-examples/architecture/), similar to the pod from section 6.3, Create a Pod, but with the additional key nodeSelector:
apiVersion: v1
kind: Pod
metadata:
  name: echoserver
spec:
  nodeSelector:          # Define a node selector for the pod
    apptype: echoserver  # that selects nodes with the specified label and value
  containers:
  - name: echoserver
    image: gcr.io/google_containers/echoserver:1.4
    ports:
    - containerPort: 8080
Step 3 Create the pod:
$ kubectl create -f echoserver-pod-2.yaml
pod "echoserver" created
Step 4 Check the pod:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
echoserver 0/1 Pending 0 5s
Note that the pod is in the Pending status, because there are no nodes with the apptype label. You can also check the details of the pod using kubectl describe pod echoserver. The output should contain the following event:
$ kubectl describe pod echoserver
...
Events:
Type Reason Message
-------- ------ -------
Warning FailedScheduling No nodes are available that match all of the following predicates ...
...
Step 5 Attach the apptype label to, for example, node1:
$ kubectl label node node1 apptype=echoserver
node "node1" labeled
Step 6 Check the pod again:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
echoserver 1/1 Running 0 3m ... node1
Note that our pod is in the Running status and is scheduled to node1.
Step 7 Remove our label from node1:
$ kubectl label node node1 apptype-
node "node1" labeled
Step 8 Check the pod again:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
echoserver 1/1 Running 0 4m ... node1
Note that the pod is still running: the node selector is evaluated by the Kubernetes scheduler only when it schedules a pod to a node, so removing the label does not evict an already running pod.
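One way to see that the scheduling decision is already recorded in the pod itself, and is therefore unaffected by removing the label, is to print spec.nodeName:
$ kubectl get pod echoserver -o jsonpath='{.spec.nodeName}{"\n"}'
node1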
Step 9 Delete the pod echoserver:
$ kubectl delete pod echoserver
pod "echoserver" deleted
Node affinity was introduced as alpha in Kubernetes 1.2. It is conceptually similar to a node selector: it allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on the node. There are currently two types of node affinity: requiredDuringSchedulingIgnoredDuringExecution ("hard") and preferredDuringSchedulingIgnoredDuringExecution ("soft").
“Hard” node affinity is similar to specifying a node selector, but with a more expressive syntax.
Step 1 Define a new pod in the file echoserver-na-1.yaml (or use the existing one from the directory ~/k8s-examples/architecture/), similar to the pod from section 6.3, Create a Pod, but with the additional key affinity in the pod spec:
apiVersion: v1
kind: Pod
metadata:
  name: echoserver
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: apptype
            operator: In
            values:
            - echoserver
  containers:
  - name: echoserver
    image: gcr.io/google_containers/echoserver:1.4
    ports:
    - containerPort: 8080
This node affinity rule says the pod can only be placed on a node with a label whose key is apptype and whose value is echoserver.
Step 2 Create the pod:
$ kubectl create -f echoserver-na-1.yaml
pod "echoserver" created
Step 3 Check the pod:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
echoserver 0/1 Pending 0 5s
Note that the pod is in the Pending status, because there are no nodes with the apptype label. You can also check the details of the pod using kubectl describe pod echoserver. The output should contain the following event:
$ kubectl describe pod echoserver
...
Events:
Type Reason Message
-------- ------ -------
Warning FailedScheduling No nodes are available that match all of the following predicates ...
...
Step 4 Attach the apptype label to, for example, node1:
$ kubectl label node node1 apptype=echoserver
node "node1" labeled
Step 5 Check the pod again:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
echoserver 1/1 Running 0 3m ... node1
Note that our pod is in the Running status and is scheduled to node1.
Step 6 Remove the label from node1:
$ kubectl label node node1 apptype-
node "node1" labeled
Step 7 Check the pod again:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
echoserver 1/1 Running 0 4m ... node1
Note that the pod is still running: with requiredDuringSchedulingIgnoredDuringExecution, node affinity is evaluated by the Kubernetes scheduler only at scheduling time, so removing the label does not evict the running pod.
Step 8 Delete the pod echoserver:
$ kubectl delete pod echoserver
pod "echoserver" deleted
For node anti-affinity you can use the operators NotIn and DoesNotExist.
Step 1 Define a new pod in the file echoserver-na-2.yaml (or use the existing one from the directory ~/k8s-examples/architecture/), similar to the pod from the previous section, but with a different operator (NotIn instead of In) and a different value in matchExpressions (otherserver instead of echoserver):
apiVersion: v1
kind: Pod
metadata:
  name: echoserver
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: apptype
            operator: NotIn
            values:
            - otherserver
  containers:
  - name: echoserver
    image: gcr.io/google_containers/echoserver:1.4
    ports:
    - containerPort: 8080
This node affinity rule says the pod can only be placed on a node without a label whose key is apptype and whose value is otherserver.
Step 2 Attach the apptype label to, for example, node1:
$ kubectl label node node1 apptype=otherserver
node "node1" labeled
Step 3 Create the pod:
$ kubectl create -f echoserver-na-2.yaml
pod "echoserver" created
Step 4 Check the pod:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
echoserver 1/1 Running 0 1m ... node2
Note that our pod is in the Running status and is scheduled to node2, because we attached the label apptype=otherserver to node1.
Step 5 Remove the label from node1:
$ kubectl label node node1 apptype-
node "node1" labeled
Step 6 Delete the pod echoserver:
$ kubectl delete pod echoserver
pod "echoserver" deleted
Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes. One or more taints can be applied to a node, and they allow the node to repel any pods that do not tolerate the taints. Tolerations are applied to pods, and allow (but do not require) the pods to schedule onto nodes with matching taints.
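For example, with a hypothetical taint key and value (not used elsewhere in this lab), a node can be tainted and a pod can declare a matching toleration as follows:
$ kubectl taint node node1 dedicated=experimental:NoSchedule

# toleration in the pod spec:
tolerations:
- key: dedicated
  operator: Equal
  value: experimental
  effect: NoSchedule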
Step 1 In section 6.14, Create a Daemon Set, we created a daemon set that starts one replica of the echoserver pod on each node. Let's define that daemon set in the file daemonset-1.yaml (also available in ~/k8s-examples/architecture/):
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: echoserver
  labels:
    name: echoserver
spec:
  template:
    metadata:
      labels:
        name: echoserver
    spec:
      containers:
      - name: echoserver
        image: gcr.io/google_containers/echoserver:1.4
        ports:
        - containerPort: 8080
Step 2 Create a daemon set:
$ kubectl apply -f daemonset-1.yaml
daemonset "echoserver" created
Step 3 Check the pods started by our daemon set:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
echoserver-... 1/1 Running 0 5m ... node2
echoserver-... 1/1 Running 0 5m ... node1
You see that echoserver is not running on the master node. The best practice is not to use the master node to run workloads, and, in our case, the master node is “tainted” with the special system key node-role.kubernetes.io/master and the NoSchedule effect.
Step 4 Let’s check if the master node is “tainted”, so the Kubernetes scheduler does not use the master node to run pods:
$ kubectl get node master -o jsonpath='{.spec.taints}{"\n"}'
[map[key:node-role.kubernetes.io/master effect:NoSchedule timeAdded:<nil>]]
You can compare the result with the output of the same command, for example, for node1:
$ kubectl get node node1 -o jsonpath='{.spec.taints}{"\n"}'
Step 5 To enable pod scheduling on the master node, remove the taint from it, allowing the scheduler to use the master node for pods:
$ kubectl taint node master node-role.kubernetes.io/master-
node "master" untainted
Step 6 Check the pods again:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
echoserver-... 1/1 Running 0 15m ... node2
echoserver-... 1/1 Running 0 51s ... master
echoserver-... 1/1 Running 0 15m ... node1
As you see, now we have one echoserver pod replica running on each cluster node, including the master node.
Step 7 Node taints are key-value pairs associated with an effect. Available effects are:
NoSchedule: Pods that do not tolerate this taint are not scheduled on the node.
PreferNoSchedule: Kubernetes avoids scheduling Pods that do not tolerate this taint onto the node.
NoExecute: Pod is evicted from the node if it is already running on the node, and is not scheduled onto the node if it is not yet running on the node.
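For the NoExecute effect, a toleration may additionally specify tolerationSeconds, which lets an already running pod stay on the tainted node only for a limited time before it is evicted; a minimal sketch with a hypothetical taint key:
tolerations:
- key: maintenance
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300   # the pod may keep running on the tainted node for up to 5 minutes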
We can add back the taint that we removed from the master node, and no new pods will be scheduled there. However, because the effect is NoSchedule (not NoExecute), this will not delete or evict the pod that is already running on the master node. Let's taint the master node again and delete the pod on the master node manually:
$ kubectl taint node master node-role.kubernetes.io/master=:NoSchedule
node "master" tainted
$ kubectl delete pod <echoserver-pod-on-master-node>
pod "echoserver-..." deleted
Step 8 Check the pods again:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
echoserver-... 1/1 Running 0 15m ... node2
echoserver-... 1/1 Running 0 15m ... node1
As you see, a new pod replica was not launched on the master node again because of the taint.
Step 9 Instead of removing the system taint from the master node, let’s make the echoserver pod tolerant to the taint. Create a new file daemonset-2.yaml (also available in ~/k8s-examples/architecture/) similar to the existing daemon set, but add a new tolerations key into the pod’s spec:
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: echoserver
  labels:
    name: echoserver
spec:
  template:
    metadata:
      labels:
        name: echoserver
    spec:
      tolerations:                            # Add these 3 lines to define a new toleration
      - key: node-role.kubernetes.io/master   # The 'tolerations' key should be in the pod's spec
        operator: Exists                      # with the same indentation as, for example, the 'containers' key
      containers:
      - name: echoserver
        image: gcr.io/google_containers/echoserver:1.4
        ports:
        - containerPort: 8080
Update the existing daemon set using the file daemonset-2.yaml:
$ kubectl apply -f daemonset-2.yaml
daemonset "echoserver" replaced
Step 10 Check the pods again:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
echoserver-... 1/1 Running 0 15m ... node2
echoserver-... 1/1 Running 0 51s ... master
echoserver-... 1/1 Running 0 15m ... node1
As you see, now we have one echoserver pod replica running on each cluster node, including the master node.
Step 11 Remove the daemon set and check that there are no pod’s replicas running:
$ kubectl delete ds echoserver
daemonset "echoserver" deleted
$ kubectl get pods
No resources found.
Pod affinity and anti-affinity were introduced as beta in Kubernetes 1.4. They allow you to constrain which nodes your pod is eligible to be scheduled on based on labels on pods that are already running on the node, rather than based on labels on the nodes themselves. Pod affinity is specified as the field podAffinity of the field affinity in the pod specification. Pod anti-affinity is specified as the field podAntiAffinity of the field affinity in the pod specification.
For pod affinity or anti-affinity, you need to specify a topology domain such as a node, rack, cloud provider zone, or cloud provider region. You express it using topologyKey, which is the key of the node label that the system uses to denote such a topology domain (you can use built-in node labels for that). When a topology domain is set, a label selector is used to determine whether a pod with the corresponding label is running in the specified topology domain. If such a pod exists, the affinity rule allows (and the anti-affinity rule, respectively, disallows) using this node for scheduling.
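Pod anti-affinity uses the same structure under podAntiAffinity; for example, a sketch (not one of the lab files) that prevents two pods labeled app=web-server from being placed on the same node would be:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - web-server
      topologyKey: "kubernetes.io/hostname"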
In the following example, we will launch a web application that uses an in-memory cache such as Redis. We want the web servers to be co-located with the cache as much as possible. Ideally, you would use section 9.2, Multi-Container Pods, and run the web server container and its cache in the same pod; however, our example still makes sense if you want to control the web servers and their in-memory cache independently.
Step 1 Define a new deployment for our web cache in the file web-cache.yaml (also available in ~/k8s-examples/architecture/):
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: web-cache
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: web-cache
    spec:
      containers:
      - name: redis-server
        image: redis:3.2-alpine
Step 2 Create the web-cache deployment:
$ kubectl create -f web-cache.yaml
deployment "web-cache" created
Step 3 Check that pod is running:
$ kubectl get pods -o wide --show-labels
NAME READY STATUS RESTARTS AGE IP NODE LABELS
web-cache-... 1/1 Running 0 22s 192.168.1.33 node2 app=web-cache,pod-template-hash=3811522112
Step 4 Define a new deployment for our web server in the file web-server.yaml (also available in ~/k8s-examples/architecture/):
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: web-server
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: web-server
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: cache
                operator: In
                values:
                - active
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: echoserver
        image: gcr.io/google_containers/echoserver:1.4
Note that we define a pod affinity rule (spec.affinity.podAffinity) in which we specify the node name (using the built-in node label kubernetes.io/hostname) as the topology domain. Also, in the pod affinity rule, we specify the label selector cache=active.
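If you want to see the value of the topology key on each node, you can display that label as an extra column:
$ kubectl get nodes -L kubernetes.io/hostname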
Step 5 Create the web-server deployment:
$ kubectl create -f web-server.yaml
deployment "web-server" created
Step 6 Check the pods:
$ kubectl get pods -o wide --show-labels --watch
NAME READY STATUS RESTARTS AGE IP NODE LABELS
web-cache-... 1/1 Running 0 56s 192.168.1.33 node2 app=web-cache,pod-template-hash=3811522112
web-server-... 0/1 Pending 0 5s <none> <none> app=web-server,pod-template-hash=1505979478
Note that the web-server pod is in the Pending status, because there are no pods with the label cache=active in the specified topology domain.
Step 7 In a second window edit the deployment web-cache to attach the required label to its pods:
$ kubectl edit deployment web-cache
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  ...
  labels:
    app: web-cache
  ...
spec:
  ...
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: web-cache
        cache: active    # <-- Add this line
  ...
Step 8 In the first window, wait while the deployment is updating its pods:
$ kubectl get pods -o wide --show-labels --watch
NAME READY STATUS RESTARTS AGE IP NODE
...
web-cache-... 0/1 Pending 0 0s <none> <none> app=web-cache,cache=active,pod-template-hash=1656162424
web-cache-... 0/1 Pending 0 0s <none> node1 app=web-cache,cache=active,pod-template-hash=1656162424
web-server-... 0/1 Pending 0 1m <none> node1 app=web-server,pod-template-hash=1505979478
web-cache-... 0/1 ContainerCreating 0 0s <none> node1 app=web-cache,cache=active,pod-template-hash=1656162424
web-server-... 0/1 ContainerCreating 0 1m <none> node1 app=web-server,pod-template-hash=1505979478
web-cache-... 0/1 ContainerCreating 0 1s <none> node1 app=web-cache,cache=active,pod-template-hash=1656162424
web-server-... 0/1 ContainerCreating 0 1m <none> node1 app=web-server,pod-template-hash=1505979478
web-cache-... 1/1 Running 0 4s 192.168.2.52 node1 app=web-cache,cache=active,pod-template-hash=1656162424
web-cache-... 1/1 Terminating 0 2m 192.168.1.33 node2 app=web-cache,pod-template-hash=3811522112
web-cache-... 0/1 Terminating 0 2m 192.168.1.33 node2 app=web-cache,pod-template-hash=3811522112
web-cache-... 0/1 Terminating 0 2m 192.168.1.33 node2 app=web-cache,pod-template-hash=3811522112
web-cache-... 0/1 Terminating 0 2m 192.168.1.33 node2 app=web-cache,pod-template-hash=3811522112
web-server-... 1/1 Running 0 1m 192.168.2.53 node1 app=web-server,pod-template-hash=1505979478
As you can see, the web-server pod is now running and is automatically co-located with the cache, as expected. Note that we could have used the label app=web-cache in our pod affinity rule; in that case, no additional label would be required to automatically co-locate the web servers with the cache.
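A sketch of that alternative rule, reusing the app=web-cache label that the deployment already sets (only the labelSelector differs from web-server.yaml):
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - web-cache
      topologyKey: "kubernetes.io/hostname"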
Step 9 Delete our deployments:
$ kubectl delete deployment web-cache web-server
deployment "web-cache" deleted
deployment "web-server" deleted
Kubernetes ships with a default scheduler. If the default scheduler does not suit your needs, you can implement your own scheduler. You can even run multiple schedulers simultaneously alongside the default scheduler and instruct Kubernetes which scheduler to use for each of your pods. Let's learn how to use multiple schedulers in Kubernetes.
Each new pod is normally scheduled by the default scheduler. If you provide the name of your own custom scheduler in the pod specification, the default scheduler will ignore that pod and allow your scheduler to bind the pod to a node.
Step 1 Define a new pod that uses a custom scheduler in the file nginx-cs.yaml (also available in ~/k8s-examples/architecture/):
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  schedulerName: my-scheduler
  containers:
  - name: nginx
    image: nginx:1.10
Note that we set the schedulerName key to my-scheduler.
Step 2 Create the pod:
$ kubectl create -f nginx-cs.yaml
pod "nginx" created
Step 3 Check the pods:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx 0/1 Pending 0 1m
Note that the pod is in the Pending state, because the default scheduler ignores our pod.
A custom scheduler can be written in any language and can be as simple or complex as you need. You can also use a Kubernetes Deployment to run and manage a custom scheduler. As you know, a Deployment manages a Replica Set, which in turn manages the pods, thereby making the scheduler resilient to failures. In our example, we will not implement a new application, build a Docker image, and run a custom scheduler as a Kubernetes deployment. Instead, we will schedule pods directly from Bash using the Kubernetes API. A real implementation of a custom scheduler might perform the same steps.
Step 1 Open a new terminal and log in to the lab again (you can also use tmux or screen to open a new window in the same SSH session). In the new terminal, run the following command to start a local proxy to the Kubernetes API server:
$ kubectl proxy
Starting to serve on 127.0.0.1:8001
Step 2 Return to the first terminal and execute the following commands to list the pods that should be scheduled by the scheduler my-scheduler:
$ SERVER='localhost:8001'
$ PODS=$(kubectl --server $SERVER get pods -o json | jq \
'.items[] | select(.spec.schedulerName == "my-scheduler") | select(.spec.nodeName == null) | .metadata.name' | \
tr -d '"')
$ echo $PODS
nginx
The variable PODS now contains the names of the pods (nginx in our case) that should be scheduled by the scheduler my-scheduler.
Step 3 Let’s schedule pods to the master node:
$ NODE="master"
$ for PODNAME in $PODS; do
curl --header "Content-Type:application/json" --request POST --data \
'{"apiVersion":"v1", "kind": "Binding", "metadata": {"name": "'$PODNAME'"},
"target": {"apiVersion": "v1", "kind": "Node", "name": "'$NODE'"}}' \
http://$SERVER/api/v1/namespaces/default/pods/$PODNAME/binding/
done
Note that we accessed the Kubernetes API directly through the exposed proxy and created a Binding object to bind the pod to the node.
Step 4 Verify that our pod is assigned to the master node:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
nginx 0/1 ContainerCreating 0 8m <none> master
Step 1 Switch to the SSH session where kubectl proxy is running. Terminate kubectl proxy (press Ctrl-C) and close the terminal window; we will not use it any more.
Step 2 In the remaining SSH session, delete the pod nginx:
$ kubectl delete pod nginx
pod "nginx" deleted
Checkpoint