Revision 224 (Nico Schottelius, 12/31/2024 04:44 AM) → Revision 225/226 (Nico Schottelius, 12/31/2024 05:00 AM)
h1. The ungleich kubernetes infrastructure and ungleich kubernetes manual
{{toc}}
h2. Status
This document is **production**.
This document is the ungleich kubernetes infrastructure overview as well as the ungleich kubernetes manual.
h2. k8s clusters
| Cluster | Purpose/Setup | Maintainer | Master(s) | argo | v4 http proxy | last verified |
| c0.k8s.ooo | Dev | - | UNUSED | | | 2021-10-05 |
| c1.k8s.ooo | retired | | - | | | 2022-03-15 |
| c2.k8s.ooo | Dev p7 HW | Nico | server47 server53 server54 | "argo":https://argocd-server.argocd.svc.c2.k8s.ooo | | 2021-10-05 |
| c3.k8s.ooo | retired | - | - | | | 2021-10-05 |
| c4.k8s.ooo | Dev2 p7 HW | Jin-Guk | server52 server53 server54 | | | - |
| c5.k8s.ooo | retired | | - | | | 2022-03-15 |
| c6.k8s.ooo | Dev p6 VM Jin-Guk | Jin-Guk | | | | |
| [[p5.k8s.ooo]] | production | | server34 server36 server38 | "argo":https://argocd-server.argocd.svc.p5.k8s.ooo | - | |
| [[p5-cow.k8s.ooo]] | production | Nico | server47 server51 server55 | "argo":https://argocd-server.argocd.svc.p5-cow.k8s.ooo | | 2022-08-27 |
| [[p6.k8s.ooo]] | production | | server67 server69 server71 | "argo":https://argocd-server.argocd.svc.p6.k8s.ooo | 147.78.194.13 | 2021-10-05 |
| [[p6-cow.k8s.ooo]] | production | | server134 server135 server136 | "argo":https://argocd-server.argocd.svc.p6in10.k8s.ooo | ? | 2023-05-17 |
| [[p10.k8s.ooo]] | production | | server131 server132 server133 | "argo":https://argocd-server.argocd.svc.p10.k8s.ooo | 147.78.194.12 | 2021-10-05 |
| [[k8s.ge.nau.so]] | development | | server107 server108 server109 | "argo":https://argocd-server.argocd.svc.k8s.ge.nau.so | | |
| [[dev.k8s.ooo]] | development | | server110 server111 server112 | "argo":https://argocd-server.argocd.svc.dev.k8s.ooo | - | 2022-07-08 |
| [[r1r2p15k8sooo|r1.p15.k8s.ooo]] | production | Nico | server120 | | | 2022-10-30 |
| [[r1r2p15k8sooo|r2.p15.k8s.ooo]] | production | Nico | server121 | | | 2022-09-06 |
| [[r1r2p10k8sooo|r1.p10.k8s.ooo]] | production | Nico | server122 | | | 2022-10-30 |
| [[r1r2p10k8sooo|r2.p10.k8s.ooo]] | production | Nico | server123 | | | 2022-10-15 |
| [[r1r2p5k8sooo|r1.p5.k8s.ooo]] | production | Nico | server137 | | | 2022-10-30 |
| [[r1r2p5k8sooo|r2.p5.k8s.ooo]] | production | Nico | server138 | | | 2022-10-30 |
| [[r1r2p6k8sooo|r1.p6.k8s.ooo]] | production | Nico | server139 | | | 2022-10-30 |
| [[r1r2p6k8sooo|r2.p6.k8s.ooo]] | production | Nico | server140 | | | 2022-10-30 |
h2. General architecture and components overview
* All k8s clusters are IPv6 only
* We use BGP peering to propagate podcidr and serviceCidr networks to our infrastructure
* The main public testing repository is "ungleich-k8s":https://code.ungleich.ch/ungleich-public/ungleich-k8s
** Private configurations are found in the **k8s-config** repository
h3. Cluster types
| **Type/Feature** | **Development** | **Production** |
| Min No. nodes | 3 (1 master, 3 worker) | 5 (3 master, 3 worker) |
| Recommended minimum | 4 (dedicated master, 3 worker) | 8 (3 master, 5 worker) |
| Separation of control plane | optional | recommended |
| Persistent storage | required | required |
| Number of storage monitors | 3 | 5 |
h2. General k8s operations
h3. Cheat sheet / external great references
* "kubectl cheatsheet":https://kubernetes.io/docs/reference/kubectl/cheatsheet/
Some examples:
h4. Use kubectl to print only the node names
<pre>
kubectl get nodes -o jsonpath='{.items[*].metadata.name}'
</pre>
Can easily be used in a shell loop like this:
<pre>
for host in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do echo $host; ssh root@${host} uptime; done
</pre>
h3. Allowing to schedule work on the control plane / removing node taints
* Mostly for single node / test / development clusters
* Just remove the master taint as follows
<pre>
kubectl taint nodes --all node-role.kubernetes.io/master-
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
</pre>
You can check the node taints using @kubectl describe node ...@
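To see the taints of all nodes at once instead of describing each node, a custom-columns query can be used. This is a minimal sketch: the helper name @list_node_taints@ is our own invention, only the kubectl flags are standard.

```shell
# List every node together with the keys of its taints.
# list_node_taints is a hypothetical helper name, not part of kubectl.
list_node_taints() {
    kubectl get nodes \
        -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
}
```

Calling @list_node_taints@ before and after removing a taint is a quick way to confirm the change.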
h3. Adding taints
* For instance to limit nodes to specific customers
<pre>
kubectl taint nodes serverXX customer=CUSTOMERNAME:NoSchedule
</pre>
h3. Get the cluster admin.conf
* On the masters of each cluster you can find the file @/etc/kubernetes/admin.conf@
* To be able to administrate the cluster you can copy the admin.conf to your local machine
* Multi-cluster debugging becomes very easy if you name the config ~/cX-admin.conf (see example below)
<pre>
% scp root@server47.place7.ungleich.ch:/etc/kubernetes/admin.conf ~/c2-admin.conf
% export KUBECONFIG=~/c2-admin.conf
% kubectl get nodes
NAME STATUS ROLES AGE VERSION
server47 Ready control-plane,master 82d v1.22.0
server48 Ready control-plane,master 82d v1.22.0
server49 Ready <none> 82d v1.22.0
server50 Ready <none> 82d v1.22.0
server59 Ready control-plane,master 82d v1.22.0
server60 Ready,SchedulingDisabled <none> 82d v1.22.0
server61 Ready <none> 82d v1.22.0
server62 Ready <none> 82d v1.22.0
</pre>
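When several admin.conf files are stored with the naming scheme above, switching between clusters can be wrapped in a small shell function. This is only a sketch: @use_cluster@ is a hypothetical helper name and it assumes the files are stored as @~/NAME-admin.conf@.

```shell
# Select the admin.conf of a cluster by its short name (c2, p6, p10, ...).
# Assumes the file was saved as ~/NAME-admin.conf as described above.
# use_cluster is a hypothetical helper name.
use_cluster() {
    KUBECONFIG="$HOME/$1-admin.conf"
    export KUBECONFIG
}
```

For example, @use_cluster c2@ followed by @kubectl get nodes@ queries c2.k8s.ooo.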
h3. Installing a new k8s cluster
* Decide on the cluster name (usually *cX.k8s.ooo*), X counting upwards
** Using pXX.k8s.ooo for production clusters of placeXX
* Use cdist to configure the nodes with requirements like crio
* Decide between single or multi node control plane setups (see below)
** Single control plane suitable for development clusters
Typical init procedure:
h4. Single control plane:
<pre>
kubeadm init --config bootstrap/XXX/kubeadm.yaml
</pre>
h4. Multi control plane (HA):
<pre>
kubeadm init --config bootstrap/XXX/kubeadm.yaml --upload-certs
</pre>
h3. Deleting a pod that is hanging in terminating state
<pre>
kubectl delete pod <PODNAME> --grace-period=0 --force --namespace <NAMESPACE>
</pre>
(from https://stackoverflow.com/questions/35453792/pods-stuck-in-terminating-status)
h3. Listing nodes of a cluster
<pre>
[15:05] bridge:~% kubectl get nodes
NAME STATUS ROLES AGE VERSION
server22 Ready <none> 52d v1.22.0
server23 Ready <none> 52d v1.22.2
server24 Ready <none> 52d v1.22.0
server25 Ready <none> 52d v1.22.0
server26 Ready <none> 52d v1.22.0
server27 Ready <none> 52d v1.22.0
server63 Ready control-plane,master 52d v1.22.0
server64 Ready <none> 52d v1.22.0
server65 Ready control-plane,master 52d v1.22.0
server66 Ready <none> 52d v1.22.0
server83 Ready control-plane,master 52d v1.22.0
server84 Ready <none> 52d v1.22.0
server85 Ready <none> 52d v1.22.0
server86 Ready <none> 52d v1.22.0
</pre>
h3. Removing / draining a node
Usually @kubectl drain server@ should do the job, but sometimes we need to be more aggressive:
<pre>
kubectl drain --delete-emptydir-data --ignore-daemonsets serverXX
</pre>
h3. Re-adding a node after draining
<pre>
kubectl uncordon serverXX
</pre>
h3. (Re-)joining worker nodes after creating the cluster
* We need to have an up-to-date token
* We use different join commands for the workers and control plane nodes
Generating the join command on an existing control plane node:
<pre>
kubeadm token create --print-join-command
</pre>
h3. (Re-)joining control plane nodes after creating the cluster
* We generate the token again
* We upload the certificates
* We need to combine/create the join command for the control plane node
Example session:
<pre>
% kubeadm token create --print-join-command
kubeadm join p10-api.k8s.ooo:6443 --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash
% kubeadm init phase upload-certs --upload-certs
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:
CERTKEY
# Then we use these two outputs on the joining node:
kubeadm join p10-api.k8s.ooo:6443 --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash --control-plane --certificate-key CERTKEY
</pre>
Commands to be used on a control plane node:
<pre>
kubeadm token create --print-join-command
kubeadm init phase upload-certs --upload-certs
</pre>
Commands to be used on the joining node:
<pre>
JOINCOMMAND --control-plane --certificate-key CERTKEY
</pre>
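Combining the two outputs can be sketched as a small function. The name @control_plane_join_command@ is hypothetical; it only concatenates the strings and does not call kubeadm itself.

```shell
# Combine the two outputs into the control plane join command.
#   $1: the line printed by "kubeadm token create --print-join-command"
#   $2: the certificate key printed by "kubeadm init phase upload-certs --upload-certs"
# control_plane_join_command is a hypothetical helper name.
control_plane_join_command() {
    printf '%s --control-plane --certificate-key %s\n' "$1" "$2"
}
```

The printed line is then executed on the joining node.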
SEE ALSO
* https://stackoverflow.com/questions/63936268/how-to-generate-kubeadm-token-for-secondary-control-plane-nodes
* https://blog.scottlowe.org/2019/08/15/reconstructing-the-join-command-for-kubeadm/
h3. How to fix etcd not starting when rejoining a kubernetes cluster as a control plane node
If during the above step etcd does not come up, @kubeadm join@ can hang as follows:
<pre>
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[check-etcd] Checking that the etcd cluster is healthy
error execution phase check-etcd: etcd cluster is not healthy: failed to dial endpoint https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2379 with maintenance client: context deadline exceeded
To see the stack trace of this error execute with --v=5 or higher
</pre>
Then the problem is likely that the etcd server is still a member of the cluster. We first need to remove it from the etcd cluster and then the join works.
To fix this we do:
* Find a working etcd pod
* Find the etcd members / member list
* Remove the etcd member that we want to re-join the cluster
<pre>
# Find the etcd pods
kubectl -n kube-system get pods -l component=etcd,tier=control-plane
# Get the list of etcd servers with the member id
kubectl exec -n kube-system -ti ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
# Remove the member
kubectl exec -n kube-system -ti ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove MEMBERID
</pre>
Sample session:
<pre>
[10:48] line:~% kubectl -n kube-system get pods -l component=etcd,tier=control-plane
NAME READY STATUS RESTARTS AGE
etcd-server63 1/1 Running 0 3m11s
etcd-server65 1/1 Running 3 7d2h
etcd-server83 1/1 Running 8 (6d ago) 7d2h
[10:48] line:~% kubectl exec -n kube-system -ti etcd-server65 -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
356891cd676df6e4, started, server65, https://[2a0a:e5c0:10:1:225:b3ff:fe20:375c]:2380, https://[2a0a:e5c0:10:1:225:b3ff:fe20:375c]:2379, false
371b8a07185dee7e, started, server63, https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2380, https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2379, false
5942bc58307f8af9, started, server83, https://[2a0a:e5c0:10:1:3e4a:92ff:fe79:bb98]:2380, https://[2a0a:e5c0:10:1:3e4a:92ff:fe79:bb98]:2379, false
[10:48] line:~% kubectl exec -n kube-system -ti etcd-server65 -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove 371b8a07185dee7e
Member 371b8a07185dee7e removed from cluster e3c0805f592a8f77
</pre>
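To avoid copying the member ID by hand, it can be extracted from the member list by node name. A minimal sketch, assuming the comma separated output format shown in the sample session; @etcd_member_id@ is a hypothetical helper name.

```shell
# Print the member ID for a given node name from "etcdctl member list" output.
# Reads the list on stdin; fields are separated by ", " as in the session above.
# etcd_member_id is a hypothetical helper name.
etcd_member_id() {
    awk -F', ' -v name="$1" '$3 == name { print $1 }'
}
```

Pipe the output of the member list command into @etcd_member_id server63@ (use @-i@ instead of @-ti@ for kubectl exec when piping) and pass the result to @member remove@.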
SEE ALSO
* We found the solution using https://stackoverflow.com/questions/67921552/re-installed-node-cannot-join-kubernetes-cluster
h4. Updating the members
1) get alive member
<pre>
% kubectl -n kube-system get pods -l component=etcd,tier=control-plane
NAME READY STATUS RESTARTS AGE
etcd-server67 1/1 Running 1 185d
etcd-server69 1/1 Running 1 185d
etcd-server71 1/1 Running 2 185d
[20:57] sun:~%
</pre>
2) get member list
* in this case via crictl, as the api does not work correctly anymore
<pre>
</pre>
3) update
<pre>
etcdctl member update MEMBERID --peer-urls=https://[...]:2380
</pre>
h3. Node labels (adding, showing, removing)
Listing the labels:
<pre>
kubectl get nodes --show-labels
</pre>
Adding labels:
<pre>
kubectl label nodes LIST-OF-NODES label1=value1
</pre>
For instance:
<pre>
kubectl label nodes router2 router3 hosttype=router
</pre>
Selecting nodes in pods:
<pre>
apiVersion: v1
kind: Pod
...
spec:
  nodeSelector:
    hosttype: router
</pre>
Removing labels by adding a minus at the end of the label name:
<pre>
kubectl label node <nodename> <labelname>-
</pre>
For instance:
<pre>
kubectl label nodes router2 router3 hosttype-
</pre>
SEE ALSO
* https://kubernetes.io/docs/tasks/configure-pod-container/assign-pods-nodes/
* https://stackoverflow.com/questions/34067979/how-to-delete-a-node-label-by-command-and-api
h3. Listing all pods on a node
<pre>
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=serverXX
</pre>
Found on https://stackoverflow.com/questions/62000559/how-to-list-all-the-pods-running-in-a-particular-worker-node-by-executing-a-comm
h3. Hardware Maintenance using ungleich-hardware
Use the following manifest and replace the HOST with the actual host:
<pre>
apiVersion: v1
kind: Pod
metadata:
  name: ungleich-hardware-HOST
spec:
  containers:
  - name: ungleich-hardware
    image: ungleich/ungleich-hardware:0.0.5
    args:
    - sleep
    - "1000000"
    volumeMounts:
      - mountPath: /dev
        name: dev
    securityContext:
      privileged: true
  nodeSelector:
    kubernetes.io/hostname: "HOST"
  volumes:
    - name: dev
      hostPath:
        path: /dev
</pre>
Also see: [[The_ungleich_hardware_maintenance_guide]]
h3. Triggering a cronjob / creating a job from a cronjob
To test a cronjob, we can create a job from a cronjob:
<pre>
kubectl create job --from=cronjob/volume2-daily-backup volume2-manual
</pre>
This creates a job named volume2-manual based on the cronjob volume2-daily-backup.
h3. su-ing into a user that has nologin shell set
Users inside a container often have nologin set as their shell. To execute maintenance commands within the container, we can use @su -s /bin/sh@ like this:
<pre>
su -s /bin/sh -c '/path/to/your/script' testuser
</pre>
Found on https://serverfault.com/questions/351046/how-to-run-command-as-user-who-has-usr-sbin-nologin-as-shell
h3. How to print a secret value
Assuming you want the "password" item from a secret, use:
<pre>
kubectl get secret SECRETNAME -o jsonpath="{.data.password}" | base64 -d; echo ""
</pre>
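The same pattern can be generalised to any item of any secret. This is a sketch only: @secret_value@ is a hypothetical helper name, and keys containing dots (such as tls.crt) would additionally need the dot escaped in the jsonpath expression.

```shell
# Print one decoded item of a secret.
#   $1: secret name, $2: key inside .data, $3: namespace (defaults to "default")
# secret_value is a hypothetical helper name.
secret_value() {
    kubectl -n "${3:-default}" get secret "$1" -o "jsonpath={.data.$2}" | base64 -d
    echo
}
```

For example, @secret_value argocd-initial-admin-secret password argocd@ prints the initial argocd admin password.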
h3. Fixing the "ImageInspectError"
If you see this problem:
<pre>
# kubectl get pods
NAME READY STATUS RESTARTS AGE
bird-router-server137-bird-767f65bb47-g4xsh 0/1 Init:ImageInspectError 0 77d
bird-router-server137-openvpn-server120-5c987b7ffb-cn9xf 0/1 ImageInspectError 1 159d
bird-router-server137-unbound-5c6f5d4bb6-cxbpr 0/1 ImageInspectError 1 159d
</pre>
Fixes so far:
* correct registries.conf
h3. Automatic cleanup of images
* options to kubelet
<pre>
--image-gc-high-threshold=90: The percent of disk usage after which image garbage collection is always run. Default: 90%
--image-gc-low-threshold=80: The percent of disk usage before which image garbage collection is never run. Lowest disk usage to garbage collect to. Default: 80%
</pre>
h3. How to upgrade a kubernetes cluster
h4. General
* Should be done every X months to stay up-to-date
** X probably something like 3-6
* kubeadm based clusters
* Needs specific kubeadm versions for upgrade
* Follow instructions on https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
* Finding releases: https://github.com/kubernetes/kubernetes/tree/master/CHANGELOG
h4. Getting a specific kubeadm or kubelet version
<pre>
# Pick ONE release; if all lines are kept, the last assignment wins
RELEASE=v1.22.17
RELEASE=v1.23.17
RELEASE=v1.24.9
RELEASE=v1.25.9
RELEASE=v1.26.6
RELEASE=v1.27.2
ARCH=amd64
curl -L --remote-name-all https://dl.k8s.io/release/${RELEASE}/bin/linux/${ARCH}/{kubeadm,kubelet}
chmod u+x kubeadm kubelet
</pre>
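When stepping through several releases, it can help to keep each version in its own file (for instance @kubeadm-v1.22.17@). The following sketch assumes the same dl.k8s.io URL layout as above; the function names @k8s_binary_url@ and @fetch_k8s_binaries@ are hypothetical.

```shell
# Build the download URL for a Kubernetes release binary.
#   $1: release (e.g. v1.27.2), $2: architecture (e.g. amd64), $3: binary name
# k8s_binary_url is a hypothetical helper name; the URL layout is the one used above.
k8s_binary_url() {
    printf 'https://dl.k8s.io/release/%s/bin/linux/%s/%s\n' "$1" "$2" "$3"
}

# Download kubeadm and kubelet of one release into version-suffixed files,
# so that several versions can be kept side by side.
#   $1: release, $2: architecture (defaults to amd64)
fetch_k8s_binaries() {
    for bin in kubeadm kubelet; do
        curl -L -o "${bin}-$1" "$(k8s_binary_url "$1" "${2:-amd64}" "$bin")" || return 1
        chmod u+x "${bin}-$1"
    done
}
```

Running @fetch_k8s_binaries v1.27.2@ would leave @kubeadm-v1.27.2@ and @kubelet-v1.27.2@ in the current directory.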
h4. Steps
* kubeadm upgrade plan
** On one control plane node
* kubeadm upgrade apply vXX.YY.ZZ
** On one control plane node
* kubeadm upgrade node
** On all other control plane nodes
** On all worker nodes afterwards
Repeat for all control plane nodes. Then upgrade kubelet on all other nodes via the package manager.
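kubeadm does not support skipping minor versions, which is why the per-version notes below walk through every release. A small guard can catch an accidental jump before running @kubeadm upgrade apply@. This is a sketch that assumes both versions share the same major number; the function names are hypothetical.

```shell
# Extract the minor number from a version such as v1.26.9 or 1.27.
minor_of() {
    v=${1#v}          # drop a leading "v"
    v=${v#*.}         # drop the major part
    printf '%s\n' "${v%%.*}"
}

# Succeed if upgrading from $1 to $2 stays within the same minor release
# or moves up by exactly one minor release.
is_supported_upgrade_step() {
    from=$(minor_of "$1")
    to=$(minor_of "$2")
    [ "$to" -ge "$from" ] && [ "$to" -le $((from + 1)) ]
}
```

For example, @is_supported_upgrade_step v1.26.9 v1.27.2@ succeeds, while @is_supported_upgrade_step v1.25.14 v1.27.2@ fails.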
h4. Upgrading to 1.22.17
* https://v1-22.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
* Need to create a kubeadm config map
** f.i. using the following
** @/usr/local/bin/kubeadm-v1.22.17 upgrade --config kubeadm.yaml --ignore-preflight-errors=CoreDNSUnsupportedPlugins,CoreDNSMigration apply -y v1.22.17@
* Done for p6 on 2023-10-04
h4. Upgrading to 1.23.17
* https://v1-23.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
* No special notes
* Done for p6 on 2023-10-04
h4. Upgrading to 1.24.17
* https://v1-24.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
* No special notes
* Done for p6 on 2023-10-04
h4. Upgrading to 1.25.14
* https://v1-25.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
* No special notes
* Done for p6 on 2023-10-04
h4. Upgrading to 1.26.9
* https://v1-26.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
* No special notes
* Done for p6 on 2023-10-04
h4. Upgrading to 1.27
* https://v1-27.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
* kubelet will not start anymore
* reason: @"command failed" err="failed to parse kubelet flag: unknown flag: --container-runtime"@
* /var/lib/kubelet/kubeadm-flags.env contains that parameter
* remove it, start kubelet
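Removing the parameter can be scripted. The sketch below assumes the flag is written in the form @--container-runtime=VALUE@ inside the file; @strip_container_runtime_flag@ is a hypothetical helper name. Keep a backup of the file before overwriting it.

```shell
# Remove the obsolete --container-runtime=... flag from kubelet arguments.
# Reads the content of kubeadm-flags.env on stdin and writes the result to stdout.
# --container-runtime-endpoint is left untouched because the pattern requires
# "=" directly after "--container-runtime".
# strip_container_runtime_flag is a hypothetical helper name.
strip_container_runtime_flag() {
    sed -E 's/--container-runtime=[^ "]+ ?//'
}

# Possible usage (not executed here):
#   f=/var/lib/kubelet/kubeadm-flags.env
#   cp "$f" "$f.bak" && strip_container_runtime_flag < "$f.bak" > "$f"
```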
h4. Upgrading to 1.28
* https://v1-28.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
h4. Upgrading to 1.29
* Done for many clusters around 2024-01-10
* Unsure if it was properly released
* https://v1-29.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
h4. Upgrading to 1.31
* The cluster needs to be updated FIRST, before kubelet/the OS
Otherwise you run into errors in the pod like this:
<pre>
Warning Failed 11s (x3 over 12s) kubelet Error: services have not yet been read at least once, cannot construct envvars
</pre>
And the resulting pod state is:
<pre>
Init:CreateContainerConfigError
</pre>
Fix:
* find an old 1.30 kubelet package, downgrade kubelet, upgrade the control plane, upgrade kubelet again
<pre>
wget https://mirror.ungleich.ch/mirror/packages/alpine/v3.20/community/x86_64/kubelet-1.30.0-r3.apk
wget https://mirror.ungleich.ch/mirror/packages/alpine/v3.20/community/x86_64/kubelet-openrc-1.30.0-r3.apk
apk add ./kubelet-1.30.0-r3.apk ./kubelet-openrc-1.30.0-r3.apk
</pre>
h4. Upgrade to crio 1.27: missing crun
Error message
<pre>
level=fatal msg="validating runtime config: runtime validation: \"crun\" not found in $PATH: exec: \"crun\": executable file not found in $PATH"
</pre>
Fix:
<pre>
apk add crun
</pre>
h2. Reference CNI
* Mainly "stupid", but effective plugins
* Main documentation on https://www.cni.dev/plugins/current/
* Plugins
** bridge
*** Can create the bridge on the host
*** But seems not to be able to add host interfaces to it as well
*** Has support for vlan tags
** vlan
*** creates vlan tagged sub interface on the host
*** "It's a 1:1 mapping (i.e. no bridge in between)":https://github.com/k8snetworkplumbingwg/multus-cni/issues/569
** host-device
*** moves the interface from the host into the container
*** very easy for physical connections to containers
** ipvlan
*** "virtualisation" of a host device
*** routing based on IP
*** Same MAC for everyone
*** Cannot reach the master interface
** macvlan
*** With mac addresses
*** Supports various modes (to be checked)
** ptp ("point to point")
*** Creates a host device and connects it to the container
** win*
*** Windows implementations
h2. Calico CNI
h3. Calico Installation
* We install "calico using helm":https://docs.projectcalico.org/getting-started/kubernetes/helm
* This has the following advantages:
** Easy to upgrade
** Does not require us to configure IPv6/dual stack settings as the tigera operator figures out things on its own
Usually plain calico can be installed directly using:
<pre>
VERSION=v3.25.0
helm repo add projectcalico https://docs.projectcalico.org/charts
helm repo update
helm upgrade --install --namespace tigera calico projectcalico/tigera-operator --version $VERSION --create-namespace
</pre>
* Check the tags on https://github.com/projectcalico/calico/tags for the latest release
h3. Installing calicoctl
* General installation instructions, including binary download: https://projectcalico.docs.tigera.io/maintenance/clis/calicoctl/install
To be able to manage and configure calico, we need to
"install calicoctl (we choose the version as a pod)":https://docs.projectcalico.org/getting-started/clis/calicoctl/install#install-calicoctl-as-a-kubernetes-pod
<pre>
kubectl apply -f https://docs.projectcalico.org/manifests/calicoctl.yaml
</pre>
Or version specific:
<pre>
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.20.4/manifests/calicoctl.yaml
# For 3.22
kubectl apply -f https://projectcalico.docs.tigera.io/archive/v3.22/manifests/calicoctl.yaml
</pre>
And making it more easily accessible via an alias:
<pre>
alias calicoctl="kubectl exec -i -n kube-system calicoctl -- /calicoctl"
</pre>
h3. Calico configuration
By default our k8s clusters "BGP peer":https://docs.projectcalico.org/networking/bgp
with an upstream router to propagate podcidr and servicecidr.
Default settings in our infrastructure:
* We use a full-mesh using the @nodeToNodeMeshEnabled: true@ option
* We keep the original next hop so that *only* the server with the pod is announcing it (instead of ecmp)
* We use private ASNs for k8s clusters
* We do *not* use any overlay
After installing calico and calicoctl the last step of the installation is usually:
<pre>
calicoctl create -f - < calico-bgp.yaml
</pre>
A sample BGP configuration:
<pre>
---
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: true
  asNumber: 65534
  serviceClusterIPs:
  - cidr: 2a0a:e5c0:10:3::/108
  serviceExternalIPs:
  - cidr: 2a0a:e5c0:10:3::/108
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: router1-place10
spec:
  peerIP: 2a0a:e5c0:10:1::50
  asNumber: 213081
  keepOriginalNextHop: true
</pre>
h2. Cilium CNI (experimental)
h3. Status
*NO WORKING CILIUM CONFIGURATION FOR IPV6 only modes*
h3. Latest error
It seems cilium does not run on IPv6 only hosts:
<pre>
level=info msg="Validating configured node address ranges" subsys=daemon
level=fatal msg="postinit failed" error="external IPv4 node address could not be derived, please configure via --ipv4-node" subsys=daemon
level=info msg="Starting IP identity watcher" subsys=ipcache
</pre>
It crashes after that log entry
h3. BGP configuration
* The cilium-operator will not start without a correct configmap being present beforehand (see error message below)
* Creating the bgp config beforehand as a configmap is thus required.
Without the configmap present, the pods hang as follows:
<pre>
cilium-bpqm6 0/1 Init:0/4 0 9s
cilium-operator-5947d94f7f-5bmh2 0/1 ContainerCreating 0 9s
</pre>
The error message in the cilium-operator is:
<pre>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 80s default-scheduler Successfully assigned kube-system/cilium-operator-5947d94f7f-lqcsp to server56
Warning FailedMount 16s (x8 over 80s) kubelet MountVolume.SetUp failed for volume "bgp-config-path" : configmap "bgp-config" not found
</pre>
A correct bgp config looks like this:
<pre>
apiVersion: v1
kind: ConfigMap
metadata:
  name: bgp-config
  namespace: kube-system
data:
  config.yaml: |
    peers:
      - peer-address: 2a0a:e5c0::46
        peer-asn: 209898
        my-asn: 65533
      - peer-address: 2a0a:e5c0::47
        peer-asn: 209898
        my-asn: 65533
    address-pools:
      - name: default
        protocol: bgp
        addresses:
          - 2a0a:e5c0:0:14::/64
</pre>
h3. Installation
Adding the repo
<pre>
helm repo add cilium https://helm.cilium.io/
helm repo update
</pre>
Installing + configuring cilium
<pre>
ipv6pool=2a0a:e5c0:0:14::/112
version=1.12.2
helm upgrade --install cilium cilium/cilium --version $version \
--namespace kube-system \
--set ipv4.enabled=false \
--set ipv6.enabled=true \
--set enableIPv6Masquerade=false \
--set bgpControlPlane.enabled=true
# --set ipam.operator.clusterPoolIPv6PodCIDRList=$ipv6pool
# Old style bgp?
# --set bgp.enabled=true --set bgp.announce.podCIDR=true \
# Show possible configuration options
helm show values cilium/cilium
</pre>
Using a /64 for ipam.operator.clusterPoolIPv6PodCIDRList fails with:
<pre>
level=fatal msg="Unable to init cluster-pool allocator" error="unable to initialize IPv6 allocator New CIDR set failed; the node CIDR size is too big" subsys=cilium-operator-generic
</pre>
See also https://github.com/cilium/cilium/issues/20756
Seems a /112 is actually working.
h3. Kernel modules
Cilium requires the following modules to be loaded on the host (not loaded by default):
<pre>
modprobe ip6table_raw
modprobe ip6table_filter
</pre>
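Whether the modules are already loaded can be checked against @/proc/modules@. A minimal sketch with hypothetical function names; note that modules compiled directly into the kernel do not show up in @/proc/modules@, so a "missing" result is only a hint.

```shell
# Succeed if the named kernel module appears in a /proc/modules style listing on stdin.
# module_loaded is a hypothetical helper name.
module_loaded() {
    awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }'
}

# Print the modules listed above that are not currently loaded on this host.
# missing_cilium_modules is a hypothetical helper name.
missing_cilium_modules() {
    for mod in ip6table_raw ip6table_filter; do
        module_loaded "$mod" < /proc/modules || echo "$mod"
    done
}
```

Running @missing_cilium_modules@ on a node prints nothing when both modules are loaded.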
h3. Interesting helm flags
* autoDirectNodeRoutes
* bgpControlPlane.enabled = true
h3. SEE ALSO
* https://docs.cilium.io/en/v1.12/helm-reference/
h2. Multus
* https://github.com/k8snetworkplumbingwg/multus-cni
* Installing a deployment w/ CRDs
<pre>
VERSION=v4.0.1
kubectl apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg/multus-cni/${VERSION}/deployments/multus-daemonset-crio.yml
</pre>
h2. ArgoCD
h3. Argocd Installation
* See https://argo-cd.readthedocs.io/en/stable/
As there is no configuration management present yet, argocd is installed using
<pre>
kubectl create namespace argocd
# OR: latest stable
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
# OR Specific Version
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/v2.3.2/manifests/install.yaml
</pre>
h3. Get the argocd credentials
<pre>
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d; echo ""
</pre>
h3. Accessing argocd
In regular IPv6 clusters:
* Navigate to https://argocd-server.argocd.CLUSTERDOMAIN
In legacy IPv4 clusters
<pre>
kubectl --namespace argocd port-forward svc/argocd-server 8080:80
</pre>
* Navigate to https://localhost:8080
h3. Using the argocd webhook to trigger changes
* To trigger changes, POST JSON to https://argocd.example.com/api/webhook
h3. Deploying an application
* Applications are deployed via git towards gitea (code.ungleich.ch) and then pulled by argo
* Always include the *redmine-url* pointing to the (customer) ticket
** Also add the support-url if it exists
Application sample
<pre>
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gitea-CUSTOMER
  namespace: argocd
spec:
  destination:
    namespace: default
    server: 'https://kubernetes.default.svc'
  source:
    path: apps/prod/gitea
    repoURL: 'https://code.ungleich.ch/ungleich-intern/k8s-config.git'
    targetRevision: HEAD
    helm:
      parameters:
        - name: storage.data.storageClass
          value: rook-ceph-block-hdd
        - name: storage.data.size
          value: 200Gi
        - name: storage.db.storageClass
          value: rook-ceph-block-ssd
        - name: storage.db.size
          value: 10Gi
        - name: storage.letsencrypt.storageClass
          value: rook-ceph-block-hdd
        - name: storage.letsencrypt.size
          value: 50Mi
        - name: letsencryptStaging
          value: 'no'
        - name: fqdn
          value: 'code.verua.online'
  project: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
  info:
    - name: 'redmine-url'
      value: 'https://redmine.ungleich.ch/issues/ISSUEID'
    - name: 'support-url'
      value: 'https://support.ungleich.ch/Ticket/Display.html?id=TICKETID'
</pre>
h2. Helm related operations and conventions
We use helm charts extensively.
* In production, they are managed via argocd
* In development, helm charts can be developed and deployed manually using the helm utility.
h3. Installing a helm chart
One can use the usual pattern of
<pre>
helm install <releasename> <chartdirectory>
</pre>
However, when testing helm charts you often want to reinstall or update a release. The following pattern is more convenient, because it installs the release if it is missing and upgrades it if it is already installed:
<pre>
helm upgrade --install <releasename> <chartdirectory>
</pre>
h3. Naming services and deployments in helm charts [Application labels]
* We always have {{ .Release.Name }} to identify the current "instance"
* Deployments:
** use @app: <what it is>@, f.i. @app: nginx@, @app: postgres@, ...
* See more about standard labels on
** https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/
** https://helm.sh/docs/chart_best_practices/labels/
h3. Show all versions of a helm chart
<pre>
helm search repo -l repo/chart
</pre>
For example:
<pre>
% helm search repo -l projectcalico/tigera-operator
NAME CHART VERSION APP VERSION DESCRIPTION
projectcalico/tigera-operator v3.23.3 v3.23.3 Installs the Tigera operator for Calico
projectcalico/tigera-operator v3.23.2 v3.23.2 Installs the Tigera operator for Calico
....
</pre>
h3. Show possible values of a chart
<pre>
helm show values <repo/chart>
</pre>
Example:
<pre>
helm show values ingress-nginx/ingress-nginx
</pre>
h3. Show all possible charts in a repo
<pre>
helm search repo REPO
</pre>
h3. Download a chart
For instance for checking it out locally. Use:
<pre>
helm pull <repo/chart>
</pre>
h2. Rook + Ceph
h3. Installation
* Usually directly via argocd
h3. Executing ceph commands
Using the ceph-tools pod as follows:
<pre>
kubectl exec -n rook-ceph -ti $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o jsonpath='{.items[*].metadata.name}') -- ceph -s
</pre>
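For repeated use, the lookup of the tools pod can be wrapped so that any ceph command is passed through. This is a sketch: @rook_ceph@ is a hypothetical helper name, and it assumes a single rook-ceph-tools pod in the rook-ceph namespace (the first match is used).

```shell
# Run an arbitrary ceph command inside the rook-ceph-tools pod.
# rook_ceph is a hypothetical helper name; it assumes one tools pod
# in the rook-ceph namespace and picks the first match.
rook_ceph() {
    pod=$(kubectl -n rook-ceph get pods -l app=rook-ceph-tools \
        -o jsonpath='{.items[0].metadata.name}') || return 1
    kubectl -n rook-ceph exec -i "$pod" -- ceph "$@"
}
```

For example, @rook_ceph -s@ shows the cluster status and @rook_ceph osd tree@ lists the OSDs.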
h3. Inspecting the logs of a specific server
<pre>
# Get the related pods
kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare
...
# Inspect the logs of a specific pod
kubectl -n rook-ceph logs -f rook-ceph-osd-prepare-server23--1-444qx
</pre>
h3. Inspecting the logs of the rook-ceph-operator
<pre>
kubectl -n rook-ceph logs -f -l app=rook-ceph-operator
</pre>
h3. (Temporarily) Disabling the rook-operator
* first disabling the sync in argocd
* then scale it down
<pre>
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0
</pre>
When done with the work/maintenance, re-enable sync in argocd.
The following command is thus strictly speaking not required, as argocd will fix it on its own:
<pre>
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=1
</pre>
h3. Restarting the rook operator
<pre>
kubectl -n rook-ceph delete pods -l app=rook-ceph-operator
</pre>
h3. Triggering server prepare / adding new osds
The rook-ceph-operator triggers/watches/creates pods to maintain hosts. To trigger a full "re scan", simply delete that pod:
<pre>
kubectl -n rook-ceph delete pods -l app=rook-ceph-operator
</pre>
This will cause all the @rook-ceph-osd-prepare-..@ jobs to be recreated and thus OSDs to be created, if new disks have been added.
h3. Removing an OSD
* See "Ceph OSD Management":https://rook.io/docs/rook/v1.7/ceph-osd-mgmt.html
* More specifically: https://github.com/rook/rook/blob/release-1.7/cluster/examples/kubernetes/ceph/osd-purge.yaml
* Then delete the related deployment
Set the OSD ID in osd-purge.yaml and apply it. The OSD should already be down before the purge job runs.
<pre>
apiVersion: batch/v1
kind: Job
metadata:
  name: rook-ceph-purge-osd
  namespace: rook-ceph # namespace:cluster
  labels:
    app: rook-ceph-purge-osd
spec:
  template:
    metadata:
      labels:
        app: rook-ceph-purge-osd
    spec:
      serviceAccountName: rook-ceph-purge-osd
      containers:
        - name: osd-removal
          image: rook/ceph:master
          # TODO: Insert the OSD ID in the last parameter that is to be removed
          # The OSD IDs are a comma-separated list. For example: "0" or "0,2".
          # If you want to preserve the OSD PVCs, set `--preserve-pvc true`.
          #
          # A --force-osd-removal option is available if the OSD should be destroyed even though the
          # removal could lead to data loss.
          args:
            - "ceph"
            - "osd"
            - "remove"
            - "--preserve-pvc"
            - "false"
            - "--force-osd-removal"
            - "false"
            - "--osd-ids"
            - "SETTHEOSDIDHERE"
          env:
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: ROOK_MON_ENDPOINTS
              valueFrom:
                configMapKeyRef:
                  key: data
                  name: rook-ceph-mon-endpoints
            - name: ROOK_CEPH_USERNAME
              valueFrom:
                secretKeyRef:
                  key: ceph-username
                  name: rook-ceph-mon
            - name: ROOK_CEPH_SECRET
              valueFrom:
                secretKeyRef:
                  key: ceph-secret
                  name: rook-ceph-mon
            - name: ROOK_CONFIG_DIR
              value: /var/lib/rook
            - name: ROOK_CEPH_CONFIG_OVERRIDE
              value: /etc/rook/config/override.conf
            - name: ROOK_FSID
              valueFrom:
                secretKeyRef:
                  key: fsid
                  name: rook-ceph-mon
            - name: ROOK_LOG_LEVEL
              value: DEBUG
          volumeMounts:
            - mountPath: /etc/ceph
              name: ceph-conf-emptydir
            - mountPath: /var/lib/rook
              name: rook-config
      volumes:
        - emptyDir: {}
          name: ceph-conf-emptydir
        - emptyDir: {}
          name: rook-config
      restartPolicy: Never
</pre>
Deleting the deployment:
<pre>
[18:05] bridge:~% kubectl -n rook-ceph delete deployment rook-ceph-osd-6
deployment.apps "rook-ceph-osd-6" deleted
</pre>
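The steps around the purge job can be sketched as shell helpers. This is an illustration under assumptions, not the canonical procedure: the function names are made up, the rook toolbox is assumed to run as deployment @rook-ceph-tools@, and marking the OSD out before stopping it follows the general rook recommendation rather than anything stated on this page.

```shell
# Hypothetical helpers for removing one OSD. Assumes the rook toolbox is
# deployed as "rook-ceph-tools" and osd-purge.yaml is in the current directory.
osd_take_down() {
    osd_id="${1:?usage: osd_take_down OSD_ID}"
    # Mark the OSD out so that ceph starts moving data away from it
    kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd out "osd.${osd_id}"
    # Stop the OSD pod so that the OSD is down when the purge job runs
    kubectl -n rook-ceph scale deployment "rook-ceph-osd-${osd_id}" --replicas=0
}

osd_purge_run() {
    # Apply the purge job (OSD ID already filled in) and wait until it is done
    kubectl apply -f osd-purge.yaml
    kubectl -n rook-ceph wait --for=condition=complete job/rook-ceph-purge-osd --timeout=10m
    # Show what the job did
    kubectl -n rook-ceph logs job/rook-ceph-purge-osd
}
```

For OSD 6 this would be @osd_take_down 6@, then @osd_purge_run@, followed by deleting the deployment as shown above. Wait for the cluster to finish rebalancing between the two steps if the data on the OSD is still needed.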
h3. Placement of mons/osds/etc.
See https://rook.io/docs/rook/v1.11/CRDs/Cluster/ceph-cluster-crd/#placement-configuration-settings
h3. Setting up and managing S3 object storage
h4. Endpoints
| Location | Endpoint |
| p5 | https://s3.k8s.place5.ungleich.ch |
h4. Setting up a storage class
* This will store the buckets of a specific customer
Similar to this:
<pre>
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ungleich-archive-bucket-sc
  namespace: rook-ceph
provisioner: rook-ceph.ceph.rook.io/bucket
reclaimPolicy: Delete
parameters:
  objectStoreName: place5
  objectStoreNamespace: rook-ceph
</pre>
h4. Setting up the Bucket
Similar to this:
<pre>
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: ungleich-archive-bucket-claim
  namespace: rook-ceph
spec:
  generateBucketName: ungleich-archive-ceph-bkt
  storageClassName: ungleich-archive-bucket-sc
  additionalConfig:
    # Quota settings for the OBC
    #maxObjects: "1000"
    maxSize: "100G"
</pre>
* See also: https://rook.io/docs/rook/latest-release/Storage-Configuration/Object-Storage-RGW/ceph-object-bucket-claim/#obc-custom-resource
h4. Getting the credentials for the bucket
* Get the "public" information (such as the bucket name) from the configmap
* Get the access key and the secret key from the secret
<pre>
name=BUCKETNAME
s3host=s3.k8s.place5.ungleich.ch
endpoint=https://${s3host}
cm=$(kubectl -n rook-ceph get configmap -o yaml "${name}-bucket-claim")
sec=$(kubectl -n rook-ceph get secrets -o yaml "${name}-bucket-claim")
# Quote the variables so that the YAML keeps its line breaks in every shell
export AWS_ACCESS_KEY_ID=$(echo "$sec" | yq .data.AWS_ACCESS_KEY_ID | base64 -d)
export AWS_SECRET_ACCESS_KEY=$(echo "$sec" | yq .data.AWS_SECRET_ACCESS_KEY | base64 -d)
bucket_name=$(echo "$cm" | yq .data.BUCKET_NAME)
</pre>
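If @yq@ is not available, the same values can be read with kubectl's own jsonpath output. A minimal sketch: the function name @obc_secret_value@ is made up for illustration, and it assumes that the secret carries the same name as the bucket claim, as in the commands above.

```shell
# Hypothetical helper: print one decoded key of a bucket claim secret.
obc_secret_value() {
    claim="${1:?usage: obc_secret_value CLAIM KEY}"
    key="${2:?usage: obc_secret_value CLAIM KEY}"
    # jsonpath returns the base64 encoded value, which is decoded right away
    kubectl -n rook-ceph get secret "${claim}" -o "jsonpath={.data.${key}}" | base64 -d
}

# Usage (needs cluster access):
#   export AWS_ACCESS_KEY_ID=$(obc_secret_value "${name}-bucket-claim" AWS_ACCESS_KEY_ID)
#   export AWS_SECRET_ACCESS_KEY=$(obc_secret_value "${name}-bucket-claim" AWS_SECRET_ACCESS_KEY)
```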
h5. Access via s3cmd
The following invocation does *NOT* work:
<pre>
s3cmd --host ${s3host}:443 --access_key=${AWS_ACCESS_KEY_ID} --secret_key=${AWS_SECRET_ACCESS_KEY} ls s3://${name}
</pre>
h5. Access via s4cmd
<pre>
s4cmd --endpoint-url ${endpoint} --access-key=${AWS_ACCESS_KEY_ID} --secret-key=${AWS_SECRET_ACCESS_KEY} ls
</pre>
h5. Access via s5cmd
* Uses environment variables
<pre>
s5cmd --endpoint-url ${endpoint} ls
</pre>
h2. Ingress + Cert Manager
* We deploy "nginx-ingress":https://docs.nginx.com/nginx-ingress-controller/ to get an ingress
* We deploy "cert-manager":https://cert-manager.io/ to handle certificates
* We independently deploy @ClusterIssuer@ to allow the cert-manager app to deploy and the issuer to be created once the CRDs from cert manager are in place
h3. IPv4 reachability
The ingress is by default IPv6 only. To make it reachable from the IPv4 world, get its IPv6 address and configure a NAT64 mapping in Jool.
Steps:
h4. Get the ingress IPv6 address
Use @kubectl -n ingress-nginx get svc ingress-nginx-controller -o jsonpath='{.spec.clusterIP}'; echo ''@
Example:
<pre>
kubectl -n ingress-nginx get svc ingress-nginx-controller -o jsonpath='{.spec.clusterIP}'; echo ''
2a0a:e5c0:10:1b::ce11
</pre>
h4. Add NAT64 mapping
* Update the @__dcl_jool_siit@ cdist type
* Record the two IPs (IPv6 and IPv4)
* Configure all routers
h4. Add DNS record
To make the ingress usable as a CNAME destination, create an "ingress" DNS record, such as:
<pre>
; k8s ingress for dev
dev-ingress AAAA 2a0a:e5c0:10:1b::ce11
dev-ingress A 147.78.194.23
</pre>
h4. Add supporting wildcard DNS
If you plan to add various sites under a specific domain, add a wildcard DNS entry such as @*.k8s-dev.django-hosting.ch@:
<pre>
*.k8s-dev CNAME dev-ingress.ungleich.ch.
</pre>
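Once the NAT64 mapping and the DNS records are in place, reachability over both address families can be checked from outside the cluster. This is a sketch only: the function name @ingress_check@ is made up, and it assumes that @dig@ and @curl@ are installed and that the ingress answers plain HTTP on port 80.

```shell
# Hypothetical helper; pass the FQDN of the ingress record or of a site behind it.
ingress_check() {
    host="${1:?usage: ingress_check FQDN}"
    # Both record types should resolve once DNS is in place
    echo "AAAA: $(dig +short AAAA "${host}")"
    echo "A:    $(dig +short A "${host}")"
    # Print the HTTP status code seen via IPv6 and via IPv4 (the NAT64 mapping)
    for family in 6 4; do
        code=$(curl "-${family}" -s -o /dev/null -w '%{http_code}' "http://${host}/")
        echo "IPv${family}: HTTP ${code}"
    done
}
```

For the example records above this would be @ingress_check dev-ingress.ungleich.ch@. Any HTTP status on both families shows that the ingress is reachable; a status of 000 means that curl could not connect.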
h2. Harbor
* We use "Harbor":https://goharbor.io/ as an image registry for our own images. Internal app reference: apps/prod/harbor.
* The admin password is in the password store. If it has not been changed, it is the default @Harbor12345@.
* At the moment harbor only authenticates against the internal ldap tree
h3. LDAP configuration
* The url needs to be ldaps://...
* uid = uid
* Everything else: default values
h2. Monitoring / Prometheus
* Via "kube-prometheus":https://github.com/prometheus-operator/kube-prometheus/
Access via:
* http://prometheus-k8s.monitoring.svc:9090
* http://grafana.monitoring.svc:3000
* http://alertmanager.monitoring.svc:9093
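If these service names do not resolve from your workstation, a port forward is one way to reach them. A minimal sketch: the function name @monitoring_forward@ is made up, and the service names and ports are taken from the URLs above.

```shell
# Hypothetical helper: forward one monitoring service to the same port on localhost.
monitoring_forward() {
    svc="${1:?usage: monitoring_forward SERVICE PORT}"
    port="${2:?usage: monitoring_forward SERVICE PORT}"
    # Blocks until interrupted; the service is then reachable on http://localhost:PORT
    kubectl -n monitoring port-forward "svc/${svc}" "${port}:${port}"
}

# Usage (needs cluster access):
#   monitoring_forward grafana 3000
#   monitoring_forward prometheus-k8s 9090
```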
h3. Prometheus Options
* "helm/kube-prometheus-stack":https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
** Includes dashboards and co.
* "manifest based kube-prometheus":https://github.com/prometheus-operator/kube-prometheus
** Includes dashboards and co.
* "Prometheus Operator":https://github.com/prometheus-operator/prometheus-operator (mainly CRD manifests)
h3. Grafana default password
* If not changed: admin / @prom-operator@
** Can be changed via:
<pre>
helm:
  values: |-
    configurations: |-
      grafana:
        adminPassword: "..."
</pre>
h2. Nextcloud
h3. How to get the nextcloud credentials
* The initial username is set to "nextcloud"
* The password is autogenerated and saved in a kubernetes secret
<pre>
kubectl get secret RELEASENAME-nextcloud -o jsonpath="{.data.PASSWORD}" | base64 -d; echo ""
</pre>
h3. How to fix "Access through untrusted domain"
* Nextcloud stores the initial domain configuration
* If the FQDN is changed, it will show the error message "Access through untrusted domain"
* To fix, edit /var/www/html/config/config.php and correct the domain
* Then delete the pods
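As an alternative to editing the file by hand, the domain list can be changed with @occ@. This is a sketch under assumptions: the function name is made up, it has to run inside the nextcloud container in the directory that contains @occ@, and it relies on the domain being stored in the @trusted_domains@ array of config.php.

```shell
# Hypothetical helper, to be run inside the nextcloud container.
# INDEX is the position in the trusted_domains array that is overwritten.
nextcloud_set_trusted_domain() {
    index="${1:?usage: nextcloud_set_trusted_domain INDEX FQDN}"
    domain="${2:?usage: nextcloud_set_trusted_domain INDEX FQDN}"
    # Show the current list first, then overwrite the given entry
    su www-data -s /bin/sh -c "./occ config:system:get trusted_domains"
    su www-data -s /bin/sh -c "./occ config:system:set trusted_domains ${index} --value=${domain}"
}
```

Check the output of the first command before choosing the index, so that an entry that is still needed does not get overwritten.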
h3. Running occ commands inside the nextcloud container
* Find the pod in the right namespace
Exec:
<pre>
su www-data -s /bin/sh -c ./occ
</pre>
* @-s /bin/sh@ is needed because the default shell is set to @/bin/false@
h4. Rescanning files
* If files have been added without nextcloud's knowledge
<pre>
su www-data -s /bin/sh -c "./occ files:scan --all"
</pre>
h2. Sealed Secrets
* Install kubeseal
<pre>
KUBESEAL_VERSION='0.23.0'
wget "https://github.com/bitnami-labs/sealed-secrets/releases/download/v${KUBESEAL_VERSION:?}/kubeseal-${KUBESEAL_VERSION:?}-linux-amd64.tar.gz"
tar -xvzf kubeseal-${KUBESEAL_VERSION:?}-linux-amd64.tar.gz kubeseal
sudo install -m 755 kubeseal /usr/local/bin/kubeseal
</pre>
* Fetch the public certificate that is used for sealing secrets
<pre>
kubeseal --fetch-cert > /tmp/public-key-cert.pem
</pre>
* Create the secret, for example:
<pre>
# Example secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: Release.Name-postgres-config
  annotations:
    secret-generator.v1.mittwald.de/autogenerate: POSTGRES_PASSWORD
    hosting: Release.Name
  labels:
    app.kubernetes.io/instance: Release.Name
    app.kubernetes.io/component: postgres
stringData:
  POSTGRES_USER: postgresUser
  POSTGRES_DB: postgresDBName
  POSTGRES_INITDB_ARGS: "--no-locale --encoding=UTF8"
</pre>
* Convert secret.yaml to sealed-secret.yaml
<pre>
kubeseal -n <namespace> --cert=/tmp/public-key-cert.pem --format=yaml < ./secret.yaml > ./sealed-secret.yaml
</pre>
* Use sealed-secret.yaml in the helm chart directory
* See tickets #11989 and #12120
h2. Infrastructure versions
h3. ungleich kubernetes infrastructure v5 (2021-10)
Clusters are set up in this order:
* Bootstrap via kubeadm
* "Networking via calico + BGP (non ECMP) using helm":https://docs.projectcalico.org/getting-started/kubernetes/helm
* "ArgoCD for CD":https://argo-cd.readthedocs.io/en/stable/
** "rook for storage via argocd":https://rook.io/
** haproxy as in-cluster IPv4-to-IPv6 proxy for the IPv6 only clusters, via argocd
** "kubernetes-secret-generator for in cluster secrets":https://github.com/mittwald/kubernetes-secret-generator
** "ungleich-certbot managing certs and nginx":https://hub.docker.com/repository/docker/ungleich/ungleich-certbot
h3. ungleich kubernetes infrastructure v4 (2021-09)
* rook is configured via manifests instead of using the rook-ceph-cluster helm chart
* The rook operator is still being installed via helm
h3. ungleich kubernetes infrastructure v3 (2021-07)
* rook is now installed via helm via argocd instead of directly via manifests
h3. ungleich kubernetes infrastructure v2 (2021-05)
* Replaced fluxv2 from ungleich k8s v1 with argocd
** argocd can apply helm templates directly without needing to go through Chart releases
* We are also using argoflow for build flows
* Planned to add "kaniko":https://github.com/GoogleContainerTools/kaniko for image building
h3. ungleich kubernetes infrastructure v1 (2021-01)
We are using the following components:
* "Calico as a CNI":https://www.projectcalico.org/ with BGP, IPv6 only, no encapsulation
** Needed for basic networking
* "kubernetes-secret-generator":https://github.com/mittwald/kubernetes-secret-generator for creating secrets
** Needed so that secrets are not stored in the git repository, but only in the cluster
* "ungleich-certbot":https://hub.docker.com/repository/docker/ungleich/ungleich-certbot
** Needed to get letsencrypt certificates for services
* "rook with ceph rbd + cephfs":https://rook.io/ for storage
** rbd for almost everything, *ReadWriteOnce*
** cephfs for smaller things, multi access *ReadWriteMany*
** Needed for providing persistent storage
* "flux v2":https://fluxcd.io/
** Needed to manage resources automatically