10. Cluster Management

Time estimate: 45 minutes

In this section, we will learn how to tear down and reinstall a Kubernetes cluster. We will also work with calicoctl and etcdctl to examine our Calico and etcd resources.

Chapter Details

Chapter Goal: Kubernetes cluster tear down and reinstallation

Chapter Sections:

10.1. Calico Management
10.2. Etcd Management
10.3. Tear Down the Cluster
10.4. Install a Cluster

10.1. Calico Management

Step 1 Find the Calico CNI pods:

$ kubectl -n kube-system get pod -o wide -l k8s-app=calico-node
NAME                READY     STATUS    RESTARTS   AGE       IP            NODE   ...
calico-node-9qmdk   1/1       Running   0          2d        172.16.1.90   node1
calico-node-bscn7   1/1       Running   0          2d        172.16.1.56   node2
calico-node-zfmx4   1/1       Running   0          2d        172.16.1.41   master
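
The calico-node pods are managed by a DaemonSet, so one pod runs on every node (the object name matches the Controlled By field shown in the next step). Optionally, you can inspect the DaemonSet itself:

$ kubectl -n kube-system get daemonset calico-node
$ kubectl -n kube-system describe daemonset calico-node | head -20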

Step 2 Investigate one of the calico-node pods. Note the use of an init container in the output:

$ kubectl -n kube-system describe pod calico-node-xxxxx
Name:               calico-node-zfmx4
Namespace:          kube-system
....
Node:               master/172.16.1.41
Start Time:         Tue, 11 Mar 2020 16:51:49 +0000
Labels:             controller-revision-hash=2178918083
                    k8s-app=calico-node
                    pod-template-generation=1
Annotations:        scheduler.alpha.kubernetes.io/critical-pod=
Status:             Running
IP:                 172.16.1.41
Controlled By:      DaemonSet/calico-node
Init Containers:
  install-cni:
    Container ID:  docker://94253d62fd04282b54b72d72a0e3f9544910517342245fe7bfc47c4222e40137
    Image:         calico/cni:v3.13.1
    Image ID:      docker-pullable://calico/cni@sha256:c699d5ec4d0799ca5785e9134cfb1f55a1376ebdbb607f5601394736fceef7c8
    Port:          <none>
    Host Port:     <none>
    Command:
      /install-cni.sh
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 11 Mar 2020 21:08:41 +0000
      Finished:     Fri, 11 Mar 2020 21:08:41 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      CNI_CONF_NAME:         10-calico.conflist
      CNI_NETWORK_CONFIG:    <set to the key 'cni_network_config' of config map 'calico-config'>  Optional: false
      KUBERNETES_NODE_NAME:   (v1:spec.nodeName)
      CNI_MTU:               <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      SLEEP:                 false
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-zfmx4 (ro)
Containers:
  calico-node:
    Container ID:   docker://b8984f13d50a367af13665755e16c5c1774b289848b20e49111025ee2197e6be
    Image:          calico/node:v3.13.1
    Image ID:       docker-pullable://calico/node@sha256:f24c59e93881178bfae85ee1375889fe9399edf1e15b0026713b2870cef079be
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Fri, 11 Mar 2020 21:08:42 +0000
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:      250m
    Liveness:   exec [/bin/calico-node -felix-live -bird-live] delay=10s timeout=1s period=10s #success=1 #failure=6
    Readiness:  exec [/bin/calico-node -felix-ready -bird-ready] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      DATASTORE_TYPE:                     kubernetes
      WAIT_FOR_DATASTORE:                 true
      NODENAME:                            (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:          <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
      CLUSTER_TYPE:                       k8s,bgp
      IP:                                 autodetect
      CALICO_IPV4POOL_IPIP:               Always
      FELIX_IPINIPMTU:                    <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      CALICO_IPV4POOL_CIDR:               192.168.0.0/16
      CALICO_DISABLE_FILE_LOGGING:        true
      FELIX_DEFAULTENDPOINTTOHOSTACTION:  ACCEPT
      FELIX_IPV6SUPPORT:                  false
      FELIX_LOGSEVERITYSCREEN:            info
      FELIX_HEALTHENABLED:                true
    Mounts:
      /lib/modules from lib-modules (ro)
      /run/xtables.lock from xtables-lock (rw)
      /var/lib/calico from var-lib-calico (rw)
      /var/run/calico from var-run-calico (rw)
      /var/run/nodeagent from policysync (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-zfmx4 (ro)
Conditions:
....
Tolerations:     :NoSchedule
                 :NoExecute
                 CriticalAddonsOnly
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:          <none>
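
If you only want the init container details rather than the full describe output, a jsonpath query is a compact alternative (the pod name is a placeholder, as above):

$ kubectl -n kube-system get pod calico-node-xxxxx \
    -o jsonpath='{range .spec.initContainers[*]}{.name}{"\t"}{.image}{"\n"}{end}'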

Step 3 Take a look at the install-cni.sh script. Since the script resides in an init container, we will use Docker to start a container from the same image and redirect the output to the terminal. Observe the option to keep the script running so it can monitor for updates and refresh the pod's CNI configuration. If we needed that additional flexibility, we would reconfigure this container to run indefinitely instead of as an init container:

$ docker run -it --rm calico/cni:v3.13.1 tail -20 /install-cni.sh
# Unless told otherwise, sleep forever.
# This prevents Kubernetes from restarting the pod repeatedly.
should_sleep=${SLEEP:-"true"}
echo "Done configuring CNI.  Sleep=$should_sleep"
while [ "$should_sleep" == "true"  ]; do
  # Kubernetes Secrets can be updated.  If so, we need to install the updated
  # version to the host. Just check the timestamp on the certificate to see if it
  # has been updated.  A bit hokey, but likely good enough.
  if [ -e ${SECRETS_MOUNT_DIR}/etcd-cert ];
  then
    stat_output=$(stat -c%y ${SECRETS_MOUNT_DIR}/etcd-cert 2>/dev/null)
    sleep 10;
    if [ "$stat_output" != "$(stat -c%y ${SECRETS_MOUNT_DIR}/etcd-cert 2>/dev/null)" ]; then
      echo "Updating installed secrets at: $(date)"
      cp -p ${SECRETS_MOUNT_DIR}/* /host/etc/cni/net.d/calico-tls/
    fi
  else
    sleep 10
  fi
done
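
The script's main job is to copy the CNI binary and configuration onto the host. Based on the CNI_CONF_NAME variable and the /host/etc/cni/net.d and /host/opt/cni/bin mounts shown in the previous step, you can check the result directly on any node (paths assume the defaults used by this manifest):

$ ls /etc/cni/net.d/
$ sudo cat /etc/cni/net.d/10-calico.conflist
$ ls /opt/cni/bin/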

Step 4 Check the processes running in the calico-node container, which is a good example of a complex, multi-process container:

$ kubectl -n kube-system exec calico-node-xxxxx -- sh -c "ps axf"
PID   USER     TIME   COMMAND
    1 root       0:00 /sbin/runsvdir -P /etc/service/enabled
   69 root       0:00 runsv felix
   70 root       0:00 runsv bird
   71 root       0:00 runsv confd
   72 root       0:00 runsv bird6
   73 root       0:01 bird6 -R -s /var/run/calico/bird6.ctl -d -c /etc/calico/confd/config/bird6.cfg
   74 root       0:17 confd -confdir=/etc/calico/confd
   75 root       0:01 bird -R -s /var/run/calico/bird.ctl -d -c /etc/calico/confd/config/bird.cfg
   76 root       2:29 calico-felix
 2201 root       0:00 sh -c ps axf
 2205 root       0:00 ps axf
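
The PID 1 process, runsvdir, supervises one runit service per subdirectory of /etc/service/enabled. Listing that directory shows the same set of services seen in the process tree:

$ kubectl -n kube-system exec calico-node-xxxxx -- ls /etc/service/enabled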

Step 5 Check the calico-node configuration files for felix, bird, and confd:

$ kubectl -n kube-system exec calico-node-xxxxx -- sh -c "ls -R /etc/calico"
/etc/calico:
confd      felix.cfg

/etc/calico/confd:
conf.d     config     templates

/etc/calico/confd/conf.d:
bird.toml        bird6_aggr.toml  bird_aggr.toml   tunl-ip.toml
bird6.toml       bird6_ipam.toml  bird_ipam.toml

/etc/calico/confd/config:
bird.cfg        bird6_aggr.cfg  bird_aggr.cfg
bird6.cfg       bird6_ipam.cfg  bird_ipam.cfg
....
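
The files under /etc/calico/confd/config are rendered by confd from the templates directory and consumed by the BGP daemons. For example, you can view the generated BIRD configuration:

$ kubectl -n kube-system exec calico-node-xxxxx -- cat /etc/calico/confd/config/bird.cfg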

Step 6 Download the calicoctl client application:

$ curl -O -L  https://github.com/projectcalico/calicoctl/releases/download/v3.13.1/calicoctl
$ chmod +x calicoctl
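
Optionally, verify that the binary runs before using it against the cluster:

$ ./calicoctl version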

Step 7 Use calicoctl to interact with the Calico CNI:

$ export DATASTORE_TYPE=kubernetes
$ export KUBECONFIG=~/.kube/config
$ sudo -E ./calicoctl node status
Calico process is running.

IPv4 BGP status
+--------------+-------------------+-------+----------+-------------+
| PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |    INFO     |
+--------------+-------------------+-------+----------+-------------+
| 172.16.1.14  | node-to-node mesh | up    | 17:25:07 | Established |
| 172.16.1.119 | node-to-node mesh | up    | 17:25:35 | Established |
+--------------+-------------------+-------+----------+-------------+

IPv6 BGP status
No IPv6 peers found.

$ sudo -E ./calicoctl get workloadendpoints -n kube-system
NAMESPACE     WORKLOAD                                   NODE     NETWORKS            INTERFACE
kube-system   calico-kube-controllers-788d6b9876-kcmd6   master   192.168.219.65/32   caliab5181af135
kube-system   coredns-6955765f44-wjz2d                   master   192.168.219.66/32   calif23cafbf8fb
kube-system   coredns-6955765f44-z5vbv                   master   192.168.219.67/32   cali0302f58d39d

Notes

calicoctl is namespace-aware, so try running ./calicoctl get workloadendpoints with a different namespace (e.g., sudo -E ./calicoctl get wep -n monitoring).
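
calicoctl can also display other Calico resources. For example, to see the IP pool that backs the CALICO_IPV4POOL_CIDR setting observed earlier:

$ sudo -E ./calicoctl get ippool -o wide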

10.2. Etcd Management

Step 1 Find etcd pods:

$ kubectl -n kube-system get pod -o wide -l component=etcd
NAME          READY     STATUS    RESTARTS   AGE       IP             NODE
etcd-master   1/1       Running   0          5d        172.16.1.182   master

Step 2 Copy the etcdctl client application shipped with the etcd container image:

$ kubectl -n kube-system cp etcd-master:usr/local/bin/etcdctl ./etcdctl
$ chmod +x etcdctl
$ sudo cp etcdctl /usr/local/bin/etcdctl

Step 3 The etcdctl command requires many parameters. To simplify it, let's define a shell alias for interacting with etcd:

$ sudo -i
# alias ctl8="ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --cacert=/etc/kubernetes/pki/etcd/ca.crt"
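
If you prefer, etcdctl can also read these settings from environment variables, which achieves the same effect as the alias (a sketch using the same endpoint and certificate paths as above; the rest of this section keeps using the ctl8 alias):

# export ETCDCTL_API=3
# export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
# export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
# export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
# export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key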

Step 4 Use etcdctl to manage etcd:

# ctl8 version
etcdctl version: 3.4.3
API version: 3.4

# ctl8 member list
a874c87fd42044f, started, master, https://127.0.0.1:2380, https://127.0.0.1:2379, false

# ctl8 endpoint status
https://127.0.0.1:2379, a874c87fd42044f, 3.3.3, 11 MB, true, false, 2, 7783047, 7783047

# ctl8 check perf
 60 / 60 Boooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00%1m0s
PASS: Throughput is 151 writes/s
PASS: Slowest request took 0.025225s
PASS: Stddev is 0.002121s
PASS

# ctl8 snapshot save snap1.etcdb
...
Snapshot saved at snap1.etcdb

# ctl8 --write-out=table snapshot status snap1.etcdb
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| bbfe515e |  6793301 |      19150 |      24 MB |
+----------+----------+------------+------------+
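
The snapshot can later be restored with etcdctl snapshot restore. Restoring writes out a fresh data directory, so point --data-dir at an empty location, never at the live /var/lib/etcd of a running etcd (shown here only as a sketch; do not run it against your working cluster):

# ctl8 snapshot restore snap1.etcdb --data-dir /var/lib/etcd-restore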

Step 5 Use etcdctl to query etcd:

# ctl8 get --prefix --keys-only /registry/services
/registry/services/endpoints/default/kubernetes

/registry/services/endpoints/kube-system/calico-typha

/registry/services/endpoints/kube-system/kube-controller-manager
...

# ctl8 get --prefix --keys-only /registry/pods/kube-system
/registry/pods/kube-system/calico-node-9b2sz

/registry/pods/kube-system/calico-node-ggwrd

/registry/pods/kube-system/calico-node-w7zq6
...
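
The values stored under these keys are protobuf-encoded Kubernetes objects, so they are mostly not human-readable. You can still peek at one, or simply count the keys under a prefix:

# ctl8 get /registry/namespaces/default --print-value-only | head -c 256
# ctl8 get --prefix --count-only --write-out=fields /registry/pods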

Step 6 Log out from the root user before moving on to the next section:

# logout
stack@master:~$

10.3. Tear Down the Cluster

Tear down and reinstallation involves all the nodes in your cluster.

Step 1 Before proceeding further, record the IP addresses of the nodes in your cluster and make sure you can ssh to each one as follows:

$ for node in node1 node2; do
ssh $node uptime
done

Step 2 For each worker node in the cluster, drain the node, delete it, and reset it:

$ for node in node1 node2; do
kubectl drain $node --delete-local-data --force --ignore-daemonsets
kubectl delete node $node
ssh $node sudo kubeadm reset
ssh $node sudo rm -f /etc/cni/net.d/*
done
node/node1 cordoned
WARNING: Ignoring DaemonSet-managed pods: calico-node-zqfxt, kube-proxy-4t4rr
node "node1" deleted
[reset] WARNING: changes made to this host by 'kubeadm init' or 'kubeadm join' will be reverted.
[reset] are you sure you want to proceed? [y/N]: y
[preflight] running pre-flight checks
[reset] stopping the kubelet service
[reset] unmounting mounted directories in "/var/lib/kubelet"
[reset] removing kubernetes-managed containers
[reset] cleaning up running containers using crictl with socket /var/run/dockershim.sock
[reset] failed to list running pods using crictl: exit status 1. Trying to use docker instead
[reset] no etcd manifest found in "/etc/kubernetes/manifests/etcd.yaml". Assuming external etcd
[reset] deleting contents of stateful directories: [/var/lib/kubelet /etc/cni/net.d /var/lib/dockershim /var/run/kubernetes]
[reset] deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
[reset] deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]
....

Step 3 Check for available nodes:

$ kubectl get node
NAME      STATUS    ROLES     AGE       VERSION
master    Ready     master    1d        v1.17.4

Step 4 Drain, delete, and reset the master node:

$ kubectl drain master --delete-local-data --force --ignore-daemonsets
node/master cordoned
WARNING: Ignoring DaemonSet-managed pods: calico-node-kj7w6, kube-proxy-p5frg; Deleting pods with local storage: metrics-server-5c4945fb9f-kmsbn
pod/metrics-server-5c4945fb9f-kmsbn evicted
pod/coredns-78fcdf6894-cmnpk evicted
pod/coredns-78fcdf6894-gsbjl evicted
node/master evicted

If the above command does not complete because it hangs while evicting a Calico pod, you can safely terminate it with ^C and continue with:

$ kubectl delete node master
node "master" deleted

Reset the master node:

$ sudo kubeadm reset
[reset] WARNING: changes made to this host by 'kubeadm init' or 'kubeadm join' will be reverted.
[reset] are you sure you want to proceed? [y/N]: y
[preflight] running pre-flight checks
[reset] stopping the kubelet service
[reset] unmounting mounted directories in "/var/lib/kubelet"
[reset] removing kubernetes-managed containers
[reset] cleaning up running containers using crictl with socket /var/run/dockershim.sock
[reset] failed to list running pods using crictl: exit status 1. Trying to use docker instead
[reset] deleting contents of stateful directories: [/var/lib/kubelet /etc/cni/net.d /var/lib/dockershim /var/run/kubernetes /var/lib/etcd]
[reset] deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
[reset] deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]

Step 5 Remove kubectl configuration and any cache files:

$ sudo rm -rf ~/.kube/*
$ sudo rm -f /etc/cni/net.d/*
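
kubeadm reset does not clean up iptables or IPVS rules. If you also want a clean networking state before reinstalling, the kubeadm documentation suggests flushing them manually; note that this removes all iptables rules on the host, not just the Kubernetes ones:

$ sudo iptables -F && sudo iptables -t nat -F && sudo iptables -t mangle -F && sudo iptables -X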

10.4. Install a Cluster

Step 1 Before installing a cluster with kubeadm, make sure that each node has docker and kubeadm installed:

$ docker version
$ kubeadm version
$ for node in node1 node2; do
ssh $node "docker version && kubeadm version"
done

Also, there should be no cluster running on the nodes already (see 10.3. Tear Down the Cluster).

Step 2 Set the pod network CIDR according to the CNI plugin (Calico or Flannel) you plan to install:

For Calico:

$ export POD_NETWORK="192.168.0.0/16"

For Flannel:

$ export POD_NETWORK="10.244.0.0/16"

Step 3 Initialize the control plane on the master node ($PrivateIP should hold the master node's private IP address):

$ sudo kubeadm init --kubernetes-version v1.17.4 \
    --apiserver-advertise-address $PrivateIP \
    --pod-network-cidr $POD_NETWORK
...
Your Kubernetes master has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

You can now join any number of machines by running the following on each node
as root:

  kubeadm join 172.16.1.43:6443 --token bx8ny9.61p0sedk22w6qfev --discovery-token-ca-cert-hash sha256:b1101068444867fcc00fd612474bdec560cc67870e3fa37115356c7ec6435369

Record the token value, the --discovery-token-ca-cert-hash value, and your API server IP address and port:

$ token=$(sudo kubeadm token list | awk 'NR > 1 {print $1}')
$ discoveryhash=sha256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
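
If you did not capture the hash from the kubeadm init output, it can be recomputed from the cluster CA certificate using the command given in the kubeadm documentation (assuming the default CA path on the master):

$ discoveryhash=sha256:$(openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | \
    openssl rsa -pubin -outform der 2>/dev/null | \
    openssl dgst -sha256 -hex | sed 's/^.* //')
$ echo $discoveryhash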

Step 4 Follow the instructions in the output from the previous step:

$ sudo cp -i /etc/kubernetes/admin.conf ~/.kube/config
$ sudo chown stack:stack ~/.kube/config

Step 5 Apply the Calico or Flannel CNI plugin for the Kubernetes cluster.

For Calico execute:

stack@master:~$ kubectl apply -f ~/k8s-examples/addons/calico/kube-calico.yaml
configmap/calico-config created
customresourcedefinition.apiextensions.k8s.io/bgpconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/bgppeers.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/blockaffinities.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/clusterinformations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/felixconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworksets.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/hostendpoints.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamblocks.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamconfigs.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamhandles.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ippools.crd.projectcalico.org created

For Flannel execute:

$ kubectl apply -f ~/k8s-examples/addons/flannel/kube-flannel.yaml
podsecuritypolicy.policy/psp.flannel.unprivileged created
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
serviceaccount/flannel created
configmap/kube-flannel-cfg created
daemonset.apps/kube-flannel-ds-amd64 created
daemonset.apps/kube-flannel-ds-arm64 created
daemonset.apps/kube-flannel-ds-arm created
daemonset.apps/kube-flannel-ds-ppc64le created
daemonset.apps/kube-flannel-ds-s390x created

Step 6 Test the control plane on your master node (the output below is for Flannel):

$ kubectl -n kube-system get pod
NAME                             READY     STATUS    RESTARTS   AGE
coredns-58b5ccf64b-swwp5         1/1     Running   0          19m
coredns-58b5ccf64b-x6g9w         1/1     Running   0          19m
etcd-master                      1/1     Running   0          18m
kube-apiserver-master            1/1     Running   0          18m
kube-controller-manager-master   1/1     Running   0          18m
kube-flannel-ds-amd64-jh5hh      1/1     Running   0          3m9s
kube-flannel-ds-amd64-qlpth      1/1     Running   0          3m11s
kube-flannel-ds-amd64-rls7k      1/1     Running   0          4m44s
kube-proxy-mfg8j                 1/1     Running   0          19m
kube-proxy-wjbdn                 1/1     Running   0          3m9s
kube-proxy-zx2tl                 1/1     Running   0          3m11s
kube-scheduler-master            1/1     Running   0          18m

Step 7 Add node1 and node2 to the cluster using the master’s IP address and token:

$ for node in node1 node2 ; do
ssh $node sudo kubeadm join ${PrivateIP}:6443 --token $token --discovery-token-ca-cert-hash $discoveryhash
done
[preflight] running pre-flight checks
I0909 17:04:37.833965   29232 kernel_validator.go:81] Validating kernel version
I0909 17:04:37.834034   29232 kernel_validator.go:96] Validating kernel config
[discovery] Trying to connect to API Server "172.16.1.43:6443"
[discovery] Created cluster-info discovery client, requesting info from "https://172.16.1.43:6443"
[discovery] Requesting info from "https://172.16.1.43:6443" again to validate TLS against the pinned public key
[discovery] Cluster info signature and contents are valid and TLS certificate validates against pinned roots, will use API Server "172.16.1.43:6443"
[discovery] Successfully established connection with API Server "172.16.1.43:6443"
[kubelet] Downloading configuration for the kubelet from the "kubelet-config-1.11" ConfigMap in the kube-system namespace
[kubelet] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[preflight] Activating the kubelet service
[tlsbootstrap] Waiting for the kubelet to perform the TLS Bootstrap...
[patchnode] Uploading the CRI Socket information "/var/run/dockershim.sock" to the Node API object "node1" as an annotation

This node has joined the cluster:
* Certificate signing request was sent to master and a response
  was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the master to see this node join the cluster.
....
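
Optionally confirm that all three nodes have registered and reached the Ready state before continuing (a node only reports Ready once its CNI pod is running):

$ kubectl get node -o wide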

Step 8 Test your cluster by creating a Deployment and testing the Pod endpoints:

$ cat <<EOF | kubectl create -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echoserver
spec:
  replicas: 2
  selector:
    matchLabels:
      app: echoserver
  template:
    metadata:
      labels:
        app: echoserver
    spec:
      containers:
      - name: echoserver
        image: k8s.gcr.io/echoserver:1.6
        ports:
        - containerPort: 8080
EOF
deployment.apps/echoserver created

$ kubectl get pod -o wide
NAME                          READY     STATUS    RESTARTS   AGE       IP           NODE      ...
echoserver-5d5c779b47-7zvnq   1/1       Running   0          39s       10.244.2.3   node2
echoserver-5d5c779b47-l7prm   1/1       Running   0          39s       10.244.1.3   node1

$ curl 10.244.2.3:8080

Hostname: echoserver-5d5c779b47-7zvnq

Pod Information:
....

$ curl 10.244.1.3:8080

Hostname: echoserver-5d5c779b47-l7prm

Pod Information:
....
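
As an optional extra check of cluster networking, you can expose the Deployment as a ClusterIP Service and curl it through the service address (the Service name echoserver is simply reused from the Deployment):

$ kubectl expose deployment echoserver --port=8080
$ curl $(kubectl get svc echoserver -o jsonpath='{.spec.clusterIP}'):8080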

Congratulations! You have successfully reinstalled your cluster with the network plugin of your choice!