13. Cluster Management

In this section, we will learn how to manage the Calico CNI plugin and etcd, and how to tear down and reinstall a Kubernetes cluster.

Chapter Details
Chapter Goal      Kubernetes cluster tear down and reinstallation
Chapter Sections  13.1. Calico Management
                  13.2. Etcd Management
                  13.3. Tear Down the Cluster
                  13.4. Install a Cluster

13.1. Calico Management

Step 1 Find the Calico CNI pods:

$ kubectl -n kube-system get pod -o wide -l k8s-app=calico-node
NAME                READY     STATUS    RESTARTS   AGE       IP            NODE
calico-node-9qmdk   2/2       Running   0          2d        172.16.1.90   node1
calico-node-bscn7   2/2       Running   0          2d        172.16.1.56   node2
calico-node-zfmx4   2/2       Running   0          2d        172.16.1.41   master
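
These pods are managed by the calico-node DaemonSet (one pod per node); if you want to confirm that, you can also list the DaemonSet itself:

$ kubectl -n kube-system get daemonset calico-node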

Step 2 Investigate the calico-node pod, which comprises two containers:

$ kubectl -n kube-system describe pod calico-node-xxxxx
Name:               calico-node-zfmx4
Namespace:          kube-system
....
Node:               master/172.16.1.41
Start Time:         Tue, 11 Sep 2018 16:51:49 +0000
Labels:             controller-revision-hash=2178918083
                    k8s-app=calico-node
                    pod-template-generation=1
Annotations:        scheduler.alpha.kubernetes.io/critical-pod=
Status:             Running
IP:                 172.16.1.41
Controlled By:      DaemonSet/calico-node
Containers:
  calico-node:
    Container ID:   docker://b1e9ff232809d730791e34ade1ac63377335a45f93044858aabb1b2a0e42a00e
    Image:          quay.io/calico/node:v3.1.3
    ....
    Requests:
      cpu:      250m
    Liveness:   http-get http://:9099/liveness delay=10s timeout=1s period=10s #success=1 #failure=6
    Readiness:  http-get http://:9099/readiness delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      DATASTORE_TYPE:                     kubernetes
      FELIX_LOGSEVERITYSCREEN:            info
      CLUSTER_TYPE:                       k8s,bgp
      ....
    Mounts:
      /lib/modules from lib-modules (ro)
      /var/lib/calico from var-lib-calico (rw)
      /var/run/calico from var-run-calico (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-t6zwv (ro)
  install-cni:
    Container ID:  docker://edd0926e4884c33bca135c55290ae1375f77134ab7bc567d0ac2f0205532e423
    Image:         quay.io/calico/cni:v3.1.3
    ....
    Command:
      /install-cni.sh
    ....
    Environment:
      CNI_CONF_NAME:         10-calico.conflist
      ....
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-t6zwv (ro)
Conditions:
....
Tolerations:     :NoSchedule
                 :NoExecute
                 CriticalAddonsOnly
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:          <none>
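
If you only need the container names and images rather than the full description, a jsonpath query such as the following (using your own pod name) can be handy:

$ kubectl -n kube-system get pod calico-node-xxxxx \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'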

Step 3 Check the processes running in the install-cni container:

$ kubectl -n kube-system exec calico-node-xxxxx -c install-cni -- sh -c "ps axf"
PID   USER     TIME  COMMAND
    1 root      0:00 {install-cni.sh} /bin/sh /install-cni.sh
 1467 root      0:00 sleep 10
 1468 root      0:00 ps axf
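
Since this container mounts the host's CNI configuration directory at /host/etc/cni/net.d, you can also use it to inspect what has been installed on the node, for example:

$ kubectl -n kube-system exec calico-node-xxxxx -c install-cni -- sh -c "ls /host/etc/cni/net.d"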

Step 4 Take a look at the install-cni.sh script. Observe how it keeps running, watching the mounted Kubernetes Secrets and copying any updated certificates to the host's CNI directory:

$ kubectl -n kube-system exec calico-node-xxxxx -c install-cni -- sh -c "tail -20 /install-cni.sh"
# Unless told otherwise, sleep forever.
# This prevents Kubernetes from restarting the pod repeatedly.
should_sleep=${SLEEP:-"true"}
echo "Done configuring CNI.  Sleep=$should_sleep"
while [ "$should_sleep" == "true"  ]; do
  # Kubernetes Secrets can be updated.  If so, we need to install the updated
  # version to the host. Just check the timestamp on the certificate to see if it
  # has been updated.  A bit hokey, but likely good enough.
  if [ -e ${SECRETS_MOUNT_DIR}/etcd-cert ];
  then
    stat_output=$(stat -c%y ${SECRETS_MOUNT_DIR}/etcd-cert 2>/dev/null)
    sleep 10;
    if [ "$stat_output" != "$(stat -c%y ${SECRETS_MOUNT_DIR}/etcd-cert 2>/dev/null)" ]; then
      echo "Updating installed secrets at: $(date)"
      cp -p ${SECRETS_MOUNT_DIR}/* /host/etc/cni/net.d/calico-tls/
    fi
  else
    sleep 10
  fi
done
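
To see the CNI configuration the script installs on the host (the file name comes from the CNI_CONF_NAME environment variable shown earlier), you can print it from the same container:

$ kubectl -n kube-system exec calico-node-xxxxx -c install-cni -- sh -c "cat /host/etc/cni/net.d/10-calico.conflist"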

Step 5 Check the processes running in the calico-node container, which is a good example of a complex, multi-process container:

$ kubectl -n kube-system exec calico-node-xxxxx -c calico-node -- sh -c "ps axf"
PID   USER     TIME   COMMAND
    1 root       0:00 /sbin/runsvdir -P /etc/service/enabled
   69 root       0:00 runsv felix
   70 root       0:00 runsv bird
   71 root       0:00 runsv confd
   72 root       0:00 runsv bird6
   73 root       0:01 bird6 -R -s /var/run/calico/bird6.ctl -d -c /etc/calico/confd/config/bird6.cfg
   74 root       0:17 confd -confdir=/etc/calico/confd
   75 root       0:01 bird -R -s /var/run/calico/bird.ctl -d -c /etc/calico/confd/config/bird.cfg
   76 root       2:29 calico-felix
 2201 root       0:00 sh -c ps axf
 2205 root       0:00 ps axf
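
The parent process, runsvdir, supervises one runit service per directory under /etc/service/enabled; listing that directory should show the same set of services seen in the process listing:

$ kubectl -n kube-system exec calico-node-xxxxx -c calico-node -- sh -c "ls /etc/service/enabled"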

Step 6 Check the calico-node configuration files for felix, bird, and confd:

$ kubectl -n kube-system exec calico-node-xxxxx -it -c calico-node -- sh -c "ls -R /etc/calico"
/etc/calico:
confd      felix.cfg

/etc/calico/confd:
conf.d     config     templates

/etc/calico/confd/conf.d:
bird.toml        bird6_aggr.toml  bird_aggr.toml   tunl-ip.toml
bird6.toml       bird6_ipam.toml  bird_ipam.toml

/etc/calico/confd/config:
bird.cfg        bird6_aggr.cfg  bird_aggr.cfg
bird6.cfg       bird6_ipam.cfg  bird_ipam.cfg
....
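
confd renders the files under /etc/calico/confd/config from the templates directory. To see the BGP configuration bird is currently running with, you can, for example, print bird.cfg:

$ kubectl -n kube-system exec calico-node-xxxxx -c calico-node -- sh -c "cat /etc/calico/confd/config/bird.cfg"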

Step 7 Download the calicoctl client application:

$ curl -O -L https://github.com/projectcalico/calicoctl/releases/download/v3.1.3/calicoctl
$ chmod +x calicoctl
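
A quick way to verify the download is to ask the binary for its version:

$ ./calicoctl version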

Step 8 Use calicoctl to interact with the Calico CNI plugin:

$ export DATASTORE_TYPE=kubernetes
$ export KUBECONFIG=~/.kube/config
$ sudo -E ./calicoctl node status
Calico process is running.

IPv4 BGP status
+--------------+-------------------+-------+----------+-------------+
| PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |    INFO     |
+--------------+-------------------+-------+----------+-------------+
| 172.16.1.14  | node-to-node mesh | up    | 17:25:07 | Established |
| 172.16.1.119 | node-to-node mesh | up    | 17:25:35 | Established |
+--------------+-------------------+-------+----------+-------------+

IPv6 BGP status
No IPv6 peers found.

$ sudo -E ./calicoctl get workloadendpoints
WORKLOAD                      NODE    NETWORKS         INTERFACE
php-apache-55b8c5f78f-xpp5b   node1   192.168.2.3/32   cali2c0e9a68ed4
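
Other useful read-only queries include listing the registered Calico nodes and the configured IP pools, for example:

$ sudo -E ./calicoctl get nodes
$ sudo -E ./calicoctl get ippools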

13.2. Etcd Management

Step 1 Find etcd pods:

$ kubectl -n kube-system get pod -o wide -l component=etcd
NAME          READY     STATUS    RESTARTS   AGE       IP             NODE
etcd-master   1/1       Running   0          50d       172.16.1.182   master

Step 2 Copy the etcdctl client binary shipped with the etcd container image out of the running pod:

$ kubectl -n kube-system cp etcd-master:/usr/local/bin/etcdctl-3.2.18 .
$ chmod +x etcdctl-3.2.18
$ sudo cp etcdctl-3.2.18 /usr/local/bin/etcdctl
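
If you are unsure which client binaries the image ships, you can list them in the running pod first:

$ kubectl -n kube-system exec etcd-master -- ls /usr/local/bin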

Step 3 Running etcdctl against a kubeadm cluster requires several TLS parameters. To simplify this, define a shell alias for etcdctl to interact with etcd:

$ sudo -i
# alias ctl8="ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --cacert=/etc/kubernetes/pki/etcd/ca.crt"

Step 4 Use etcdctl to manage etcd:

# ctl8 version
etcdctl version: 3.2.18

# ctl8 member list
a874c87fd42044f, started, master, https://127.0.0.1:2380, https://127.0.0.1:2379

# ctl8 endpoint status
https://127.0.0.1:2379, a874c87fd42044f, 3.2.18, 2.4 MB, true, 2, 7783047

# ctl8 check perf
 60 / 60 Boooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00%1m0s
PASS: Throughput is 151 writes/s
Slowest request took too long: 0.549516s
PASS: Stddev is 0.050280s
FAIL

# ctl8 snapshot save snap1.etcdb
Snapshot saved at snap1.etcdb

# ctl8 --write-out=table snapshot status snap1.etcdb
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| bbfe515e |  6793301 |      19150 |      24 MB |
+----------+----------+------------+------------+
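
Should you ever need the snapshot, it can be restored into a fresh data directory with etcdctl; note that actually bringing a kubeadm cluster back onto the restored data involves additional steps (stopping the etcd static pod and pointing it at the new directory), so the command below is only the first piece:

# ctl8 snapshot restore snap1.etcdb --data-dir /var/lib/etcd-restore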

Step 5 Use etcdctl to query etcd:

# ctl8 get --prefix --keys-only /registry/services
/registry/services/endpoints/default/kubernetes

/registry/services/endpoints/kube-system/calico-typha

/registry/services/endpoints/kube-system/kube-controller-manager
...

# ctl8 get --prefix --keys-only /registry/pods/kube-system
/registry/pods/kube-system/calico-node-9b2sz

/registry/pods/kube-system/calico-node-ggwrd

/registry/pods/kube-system/calico-node-w7zq6
...
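
Kubernetes stores its objects under the /registry prefix, so you can get a rough count of everything held in etcd with a simple pipe:

# ctl8 get --prefix --keys-only /registry | grep -c ^/registry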

13.3. Tear Down the Cluster

Tear down and reinstallation involve all the nodes in your cluster.

Step 1 Before proceeding further, record the IP addresses of the nodes in your cluster and make sure you can SSH to each one as follows:

$ for node in node1 node2; do
ssh $node uptime
done
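
The node IP addresses themselves can be read from the API before anything is torn down, for example:

$ kubectl get node -o wide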

Step 2 For each worker node in the cluster, drain the node, delete it, and reset it:

$ for node in node1 node2; do
kubectl drain $node --delete-local-data --force --ignore-daemonsets
kubectl delete node $node
ssh $node sudo kubeadm reset
done
node/node1 cordoned
WARNING: Ignoring DaemonSet-managed pods: calico-node-zqfxt, kube-proxy-4t4rr
node "node1" deleted
[reset] WARNING: changes made to this host by 'kubeadm init' or 'kubeadm join' will be reverted.
[reset] are you sure you want to proceed? [y/N]: y
[preflight] running pre-flight checks
[reset] stopping the kubelet service
[reset] unmounting mounted directories in "/var/lib/kubelet"
[reset] removing kubernetes-managed containers
[reset] cleaning up running containers using crictl with socket /var/run/dockershim.sock
[reset] failed to list running pods using crictl: exit status 1. Trying to use docker instead
[reset] no etcd manifest found in "/etc/kubernetes/manifests/etcd.yaml". Assuming external etcd
[reset] deleting contents of stateful directories: [/var/lib/kubelet /etc/cni/net.d /var/lib/dockershim /var/run/kubernetes]
[reset] deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
[reset] deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]
....

Step 3 Check for available nodes:

$ kubectl get node
NAME      STATUS    ROLES     AGE       VERSION
master    Ready     master    1d        v1.11.1

Step 4 Drain, delete, and reset the master node:

$ kubectl drain master --delete-local-data --force --ignore-daemonsets
node/master cordoned
WARNING: Ignoring DaemonSet-managed pods: calico-node-kj7w6, kube-proxy-p5frg; Deleting pods with local storage: metrics-server-5c4945fb9f-kmsbn
pod/metrics-server-5c4945fb9f-kmsbn evicted
pod/coredns-78fcdf6894-cmnpk evicted
pod/coredns-78fcdf6894-gsbjl evicted

If the above command doesn’t complete because the Calico pod is being evicted, you can safely terminate it with ^C and continue with:

$ kubectl delete node master
node "master" deleted

$ sudo kubeadm reset
[reset] WARNING: changes made to this host by 'kubeadm init' or 'kubeadm join' will be reverted.
[reset] are you sure you want to proceed? [y/N]: y
[preflight] running pre-flight checks
[reset] stopping the kubelet service
[reset] unmounting mounted directories in "/var/lib/kubelet"
[reset] removing kubernetes-managed containers
[reset] cleaning up running containers using crictl with socket /var/run/dockershim.sock
[reset] failed to list running pods using crictl: exit status 1. Trying to use docker instead
[reset] deleting contents of stateful directories: [/var/lib/kubelet /etc/cni/net.d /var/lib/dockershim /var/run/kubernetes /var/lib/etcd]
[reset] deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
[reset] deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]

Step 5 Remove the kubectl configuration and any cache files:

$ sudo rm -rf ~/.kube/*

13.4. Install a Cluster

Step 1 Before installing a cluster with kubeadm, make sure that each node has Docker and kubeadm installed:

$ docker version
$ kubeadm version
$ for node in node1 node2; do
ssh $node "docker version && kubeadm version"
done

Also, there should be no cluster running on the nodes already (see 13.3. Tear Down the Cluster).
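
kubeadm’s preflight checks will also complain if the kubelet package is missing or swap is enabled, so it can save time to verify those up front as well:

$ kubelet --version && cat /proc/swaps
$ for node in node1 node2; do
ssh $node "kubelet --version && cat /proc/swaps"
done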

Step 2 Choose the pod network CIDR required by the CNI plugin (Calico or Flannel) you intend to install.

For Calico:

$ export POD_NETWORK="192.168.0.0/16"

For Flannel:

$ export POD_NETWORK="10.244.0.0/16"

Step 3 On the master node, initialize the cluster:

$ sudo kubeadm init --kubernetes-version v1.11.1 \
    --apiserver-advertise-address $PrivateIP \
    --pod-network-cidr $POD_NETWORK
...
Your Kubernetes master has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

You can now join any number of machines by running the following on each node
as root:

  kubeadm join 172.16.1.43:6443 --token bx8ny9.61p0sedk22w6qfev --discovery-token-ca-cert-hash sha256:b1101068444867fcc00fd612474bdec560cc67870e3fa37115356c7ec6435369

Record the token, the --discovery-token-ca-cert-hash value, and your API server’s IP address and port:

$ token=$(sudo kubeadm token list | awk 'NR > 1 {print $1}')
$ discoveryhash=sha256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
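
If you did not capture the hash from the kubeadm init output, it can be recomputed from the cluster CA certificate on the master (this is the approach described in the kubeadm documentation):

$ discoveryhash=sha256:$(openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | \
    openssl rsa -pubin -outform der 2>/dev/null | \
    openssl dgst -sha256 -hex | sed 's/^.* //')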

Step 4 Follow the instructions in the output from the previous step:

$ sudo cp -i /etc/kubernetes/admin.conf ~/.kube/config
$ sudo chown stack:stack ~/.kube/config

Step 5 Apply the Calico or Flannel CNI plugin for Kubernetes v1.11.1.

For Calico execute:

stack@master:~$ kubectl apply -f \
  https://docs.projectcalico.org/v3.1/getting-started/kubernetes/installation/hosted/rbac-kdd.yaml
clusterrole.rbac.authorization.k8s.io/calico-node created
clusterrolebinding.rbac.authorization.k8s.io/calico-node created

stack@master:~$ kubectl apply -f \
  https://docs.projectcalico.org/v3.1/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml
configmap/calico-config created
service/calico-typha created
deployment.apps/calico-typha created
daemonset.extensions/calico-node created
customresourcedefinition.apiextensions.k8s.io/felixconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/bgppeers.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/bgpconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ippools.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/hostendpoints.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/clusterinformations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworksets.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networkpolicies.crd.projectcalico.org created
serviceaccount/calico-node created

For Flannel execute:

$ kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/v0.10.0/Documentation/kube-flannel.yml
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
serviceaccount/flannel created
configmap/kube-flannel-cfg created
daemonset.extensions/kube-flannel-ds created

Step 6 Check your master node’s control plane (the output below is for Flannel):

$ kubectl -n kube-system get pod
NAME                             READY     STATUS    RESTARTS   AGE
coredns-78fcdf6894-8v94c         1/1       Running   0          27m
coredns-78fcdf6894-f9tsx         1/1       Running   0          27m
etcd-master                      1/1       Running   0          5m
kube-apiserver-master            1/1       Running   0          5m
kube-controller-manager-master   1/1       Running   0          5m
kube-flannel-ds-hr5jh            1/1       Running   0          5m
kube-proxy-8fj87                 1/1       Running   0          27m
kube-scheduler-master            1/1       Running   0          5m

Step 7 Add node1 and node2 to the cluster using the master’s IP address and token:

$ for node in node1 node2 ; do
ssh $node sudo kubeadm join ${PrivateIP}:6443 --token $token --discovery-token-ca-cert-hash $discoveryhash
done
[preflight] running pre-flight checks
I0909 17:04:37.833965   29232 kernel_validator.go:81] Validating kernel version
I0909 17:04:37.834034   29232 kernel_validator.go:96] Validating kernel config
[discovery] Trying to connect to API Server "172.16.1.43:6443"
[discovery] Created cluster-info discovery client, requesting info from "https://172.16.1.43:6443"
[discovery] Requesting info from "https://172.16.1.43:6443" again to validate TLS against the pinned public key
[discovery] Cluster info signature and contents are valid and TLS certificate validates against pinned roots, will use API Server "172.16.1.43:6443"
[discovery] Successfully established connection with API Server "172.16.1.43:6443"
[kubelet] Downloading configuration for the kubelet from the "kubelet-config-1.11" ConfigMap in the kube-system namespace
[kubelet] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[preflight] Activating the kubelet service
[tlsbootstrap] Waiting for the kubelet to perform the TLS Bootstrap...
[patchnode] Uploading the CRI Socket information "/var/run/dockershim.sock" to the Node API object "node1" as an annotation

This node has joined the cluster:
* Certificate signing request was sent to master and a response
  was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the master to see this node join the cluster.
....
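
As the join output suggests, verify from the master that both workers have registered; they should report Ready once their CNI pods are up:

$ kubectl get node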

Step 8 Test your cluster:

$ cat <<EOF | kubectl create -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echoserver
spec:
  replicas: 2
  selector:
    matchLabels:
      app: echoserver
  template:
    metadata:
      labels:
        app: echoserver
    spec:
      containers:
      - name: echoserver
        image: k8s.gcr.io/echoserver:1.6
        ports:
        - containerPort: 8080
EOF
deployment.apps/echoserver created

$ kubectl get pod -o wide
NAME                          READY     STATUS    RESTARTS   AGE       IP           NODE
echoserver-5d5c779b47-7zvnq   1/1       Running   0          39s       10.244.2.3   node2
echoserver-5d5c779b47-l7prm   1/1       Running   0          39s       10.244.1.3   node1

$ curl 10.244.2.3:8080

Hostname: echoserver-5d5c779b47-7zvnq

Pod Information:
....
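
When you are done testing, the deployment can be removed again:

$ kubectl delete deployment echoserver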