The ungleich kubernetes infrastructure » History » Revision 106

« Previous | Revision 106/112 (diff) | Next »
Nico Schottelius, 06/14/2022 05:08 PM

The ungleich kubernetes infrastructure and ungleich kubernetes manual


This document is pre-production.
This document is to become the ungleich kubernetes infrastructure overview as well as the ungleich kubernetes manual.

k8s clusters

Cluster Purpose/Setup Maintainer Master(s) argo v4 http proxy last verified Dev - UNUSED 2021-10-05 retired - 2022-03-15 Dev p7 HW Nico server47 server53 server54 argo 2021-10-05 retired - - 2021-10-05 Dev2 p7 HW Jin-Guk server52 server53 server54 - retired - 2022-03-15 Dev p6 VM Jin-Guk Jin-Guk production server34 server36 server38 argo - production server67 server69 server71 argo 2021-10-05 production server63 server65 server83 argo 2021-10-05
nau development Nico server75

General architecture and components overview

  • All k8s clusters are IPv6 only
  • We use BGP peering to propagate podcidr and serviceCidr networks to our infrastructure
  • The main public testing repository is ungleich-k8s
    • Private configurations are found in the k8s-config repository

Cluster types

Type/Feature Development Production
Min No. nodes 3 (1 master, 3 worker) 5 (3 master, 3 worker)
Recommended minimum 4 (dedicated master, 3 worker) 8 (3 master, 5 worker)
Separation of control plane optional recommended
Persistent storage required required
Number of storage monitors 3 5

General k8s operations

Cheat sheet / external great references

Allowing to schedule work on the control plane

  • Mostly for single node / test / development clusters
  • Just remove the master taint as follows
kubectl taint nodes --all

Get the cluster admin.conf

  • On the masters of each cluster you can find the file /etc/kubernetes/admin.conf
  • To be able to administrate the cluster you can copy the admin.conf to your local machine
  • Multi cluster debugging can very easy if you name the config ~/cX-admin.conf (see example below)
% scp ~/c2-admin.conf
% export KUBECONFIG=~/c2-admin.conf    
% kubectl get nodes
NAME       STATUS                     ROLES                  AGE   VERSION
server47   Ready                      control-plane,master   82d   v1.22.0
server48   Ready                      control-plane,master   82d   v1.22.0
server49   Ready                      <none>                 82d   v1.22.0
server50   Ready                      <none>                 82d   v1.22.0
server59   Ready                      control-plane,master   82d   v1.22.0
server60   Ready,SchedulingDisabled   <none>                 82d   v1.22.0
server61   Ready                      <none>                 82d   v1.22.0
server62   Ready                      <none>                 82d   v1.22.0               

Installing a new k8s cluster

  • Decide on the cluster name (usually, X counting upwards
    • Using for production clusters of placeXX
  • Use cdist to configure the nodes with requirements like crio
  • Decide between single or multi node control plane setups (see below)
    • Single control plane suitable for development clusters

Typical init procedure:

  • Single control plane: kubeadm init --config bootstrap/XXX/kubeadm.yaml
  • Multi control plane (HA): kubeadm init --config bootstrap/XXX/kubeadm.yaml --upload-certs

Deleting a pod that is hanging in terminating state

kubectl delete pod <PODNAME> --grace-period=0 --force --namespace <NAMESPACE>


Listing nodes of a cluster

[15:05] bridge:~% kubectl get nodes
NAME       STATUS   ROLES                  AGE   VERSION
server22   Ready    <none>                 52d   v1.22.0
server23   Ready    <none>                 52d   v1.22.2
server24   Ready    <none>                 52d   v1.22.0
server25   Ready    <none>                 52d   v1.22.0
server26   Ready    <none>                 52d   v1.22.0
server27   Ready    <none>                 52d   v1.22.0
server63   Ready    control-plane,master   52d   v1.22.0
server64   Ready    <none>                 52d   v1.22.0
server65   Ready    control-plane,master   52d   v1.22.0
server66   Ready    <none>                 52d   v1.22.0
server83   Ready    control-plane,master   52d   v1.22.0
server84   Ready    <none>                 52d   v1.22.0
server85   Ready    <none>                 52d   v1.22.0
server86   Ready    <none>                 52d   v1.22.0

Removing / draining a node

Usually kubectl drain server should do the job, but sometimes we need to be more aggressive:

kubectl drain --delete-emptydir-data --ignore-daemonsets serverXX

Readding a node after draining

kubectl uncordon serverXX

(Re-)joining worker nodes after creating the cluster

  • We need to have an up-to-date token
  • We use different join commands for the workers and control plane nodes

Generating the join command on an existing control plane node:

kubeadm token create --print-join-command

(Re-)joining control plane nodes after creating the cluster

  • We generate the token again
  • We upload the certificates
  • We need to combine/create the join command for the control plane node

Example session:

% kubeadm token create --print-join-command
kubeadm join --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash 

% kubeadm init phase upload-certs --upload-certs
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:

# Then we use these two outputs on the joining node:

kubeadm join --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash --control-plane --certificate-key CERTKEY

Commands to be used on a control plane node:

kubeadm token create --print-join-command
kubeadm init phase upload-certs --upload-certs

Commands to be used on the joining node:

JOINCOMMAND --control-plane --certificate-key CERTKEY


How to fix etcd does not start when rejoining a kubernetes cluster as a control plane

If during the above step etcd does not come up, kubeadm join can hang as follows:

[control-plane] Creating static Pod manifest for "kube-apiserver"                                                              
[control-plane] Creating static Pod manifest for "kube-controller-manager"                                                     
[control-plane] Creating static Pod manifest for "kube-scheduler"                                                              
[check-etcd] Checking that the etcd cluster is healthy                                                                         
error execution phase check-etcd: etcd cluster is not healthy: failed to dial endpoint https://[2a0a:e5c0:10:1:225:b3ff:fe20:37
8a]:2379 with maintenance client: context deadline exceeded                                                                    
To see the stack trace of this error execute with --v=5 or higher         

Then the problem is likely that the etcd server is still a member of the cluster. We first need to remove it from the etcd cluster and then the join works.

To fix this we do:

  • Find a working etcd pod
  • Find the etcd members / member list
  • Remove the etcd member that we want to re-join the cluster
# Find the etcd pods
kubectl -n kube-system get pods -l component=etcd,tier=control-plane

# Get the list of etcd servers with the member id 
kubectl exec -n kube-system -ti ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list

# Remove the member
kubectl exec -n kube-system -ti ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove MEMBERID

Sample session:

[10:48] line:~% kubectl -n kube-system get pods -l component=etcd,tier=control-plane
etcd-server63   1/1     Running   0            3m11s
etcd-server65   1/1     Running   3            7d2h
etcd-server83   1/1     Running   8 (6d ago)   7d2h
[10:48] line:~% kubectl exec -n kube-system -ti etcd-server65 -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
356891cd676df6e4, started, server65, https://[2a0a:e5c0:10:1:225:b3ff:fe20:375c]:2380, https://[2a0a:e5c0:10:1:225:b3ff:fe20:375c]:2379, false
371b8a07185dee7e, started, server63, https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2380, https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2379, false
5942bc58307f8af9, started, server83, https://[2a0a:e5c0:10:1:3e4a:92ff:fe79:bb98]:2380, https://[2a0a:e5c0:10:1:3e4a:92ff:fe79:bb98]:2379, false

[10:48] line:~% kubectl exec -n kube-system -ti etcd-server65 -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove 371b8a07185dee7e
Member 371b8a07185dee7e removed from cluster e3c0805f592a8f77


Hardware Maintenance using ungleich-hardware

Use the following manifest and replace the HOST with the actual host:

apiVersion: v1
kind: Pod
  name: ungleich-hardware-HOST
  - name: ungleich-hardware
    image: ungleich/ungleich-hardware:0.0.5
    - sleep
    - "1000000" 
      - mountPath: /dev
        name: dev
      privileged: true
  nodeSelector: "HOST" 

    - name: dev
        path: /dev

Also see: The_ungleich_hardware_maintenance_guide

Triggering a cronjob / creating a job from a cronjob

To test a cronjob, we can create a job from a cronjob:

kubectl create job --from=cronjob/volume2-daily-backup volume2-manual

This creates a job volume2-manual based on the cronjob volume2-daily

Calico CNI

Calico Installation

  • We install calico using helm
  • This has the following advantages:
    • Easy to upgrade
    • Does not require os to configure IPv6/dual stack settings as the tigera operator figures out things on its own

Usually plain calico can be installed directly using:

helm repo add projectcalico
helm install calico projectcalico/tigera-operator --version v3.20.4

Installing calicoctl

To be able to manage and configure calico, we need to
install calicoctl

kubectl apply -f

Or version specific:

kubectl apply -f

# For 3.22
kubectl apply -f

And making it easier accessible by alias:

alias calicoctl="kubectl exec -i -n kube-system calicoctl -- /calicoctl" 

Calico configuration

By default our k8s clusters BGP peer
with an upstream router to propagate podcidr and servicecidr.

Default settings in our infrastructure:

  • We use a full-mesh using the nodeToNodeMeshEnabled: true option
  • We keep the original next hop so that only the server with the pod is announcing it (instead of ecmp)
  • We use private ASNs for k8s clusters
  • We do not use any overlay

After installing calico and calicoctl the last step of the installation is usually:

calicoctl create -f - < calico-bgp.yaml

A sample BGP configuration:

kind: BGPConfiguration
  name: default
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: true
  asNumber: 65534
  - cidr: 2a0a:e5c0:10:3::/108
  - cidr: 2a0a:e5c0:10:3::/108
kind: BGPPeer
  name: router1-place10
  peerIP: 2a0a:e5c0:10:1::50
  asNumber: 213081
  keepOriginalNextHop: true

ArgoCD / ArgoWorkFlow

Argocd Installation

As there is no configuration management present yet, argocd is installed using

kubectl create namespace argocd

# Specific Version
kubectl apply -n argocd -f

# OR: latest stable
kubectl apply -n argocd -f

Get the argocd credentials

kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d; echo "" 

Accessing argocd

In regular IPv6 clusters:

In legacy IPv4 clusters

kubectl --namespace argocd port-forward svc/argocd-server 8080:80

Using the argocd webhook to trigger changes

Deploying an application

  • Applications are deployed via git towards gitea ( and then pulled by argo
  • Always include the redmine-url pointing to the (customer) ticket
    • Also add the support-url if it exists

Application sample

kind: Application
  name: gitea-CUSTOMER
  namespace: argocd
    namespace: default
    server: 'https://kubernetes.default.svc'
    path: apps/prod/gitea
    repoURL: ''
    targetRevision: HEAD
        - name:
          value: rook-ceph-block-hdd
        - name:
          value: 200Gi
        - name: storage.db.storageClass
          value: rook-ceph-block-ssd
        - name: storage.db.size
          value: 10Gi
        - name: storage.letsencrypt.storageClass
          value: rook-ceph-block-hdd
        - name: storage.letsencrypt.size
          value: 50Mi
        - name: letsencryptStaging
          value: 'no'
        - name: fqdn
          value: ''
  project: default
      prune: true
      selfHeal: true
    - name: 'redmine-url'
      value: ''
    - name: 'support-url'
      value: ''

Helm related operations and conventions

We use helm charts extensively.

  • In production, they are managed via argocd
  • In development, helm chart can de developed and deployed manually using the helm utility.

Installing a helm chart

One can use the usual pattern of

helm install <releasename> <chartdirectory>

However often you want to reinstall/update when testing helm charts. The following pattern is "better", because it allows you to reinstall, if it is already installed:

helm upgrade --install <releasename> <chartdirectory>

Naming services and deployments in helm charts [Application labels]

Rook / Ceph Related Operations

Executing ceph commands

Using the ceph-tools pod as follows:

kubectl exec -n rook-ceph -ti $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o jsonpath='{.items[*]}') -- ceph -s

Inspecting the logs of a specific server

# Get the related pods
kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare 

# Inspect the logs of a specific pod
kubectl -n rook-ceph logs -f rook-ceph-osd-prepare-server23--1-444qx

Inspecting the logs of the rook-ceph-operator

kubectl -n rook-ceph logs -f -l app=rook-ceph-operator

Triggering server prepare / adding new osds

The rook-ceph-operator triggers/watches/creates pods to maintain hosts. To trigger a full "re scan", simply delete that pod:

kubectl -n rook-ceph delete pods -l app=rook-ceph-operator

This will cause all the rook-ceph-osd-prepare-.. jobs to be recreated and thus OSDs to be created, if new disks have been added.

Removing an OSD

Set osd id in the osd-purge.yaml and apply it. OSD should be down before.

apiVersion: batch/v1
kind: Job
  name: rook-ceph-purge-osd
  namespace: rook-ceph # namespace:cluster
    app: rook-ceph-purge-osd
        app: rook-ceph-purge-osd
      serviceAccountName: rook-ceph-purge-osd
        - name: osd-removal
          image: rook/ceph:master
          # TODO: Insert the OSD ID in the last parameter that is to be removed
          # The OSD IDs are a comma-separated list. For example: "0" or "0,2".
          # If you want to preserve the OSD PVCs, set `--preserve-pvc true`.
          # A --force-osd-removal option is available if the OSD should be destroyed even though the
          # removal could lead to data loss.
            - "ceph" 
            - "osd" 
            - "remove" 
            - "--preserve-pvc" 
            - "false" 
            - "--force-osd-removal" 
            - "false" 
            - "--osd-ids" 
            - "SETTHEOSDIDHERE" 
            - name: POD_NAMESPACE
                  fieldPath: metadata.namespace
            - name: ROOK_MON_ENDPOINTS
                  key: data
                  name: rook-ceph-mon-endpoints
            - name: ROOK_CEPH_USERNAME
                  key: ceph-username
                  name: rook-ceph-mon
            - name: ROOK_CEPH_SECRET
                  key: ceph-secret
                  name: rook-ceph-mon
            - name: ROOK_CONFIG_DIR
              value: /var/lib/rook
            - name: ROOK_CEPH_CONFIG_OVERRIDE
              value: /etc/rook/config/override.conf
            - name: ROOK_FSID
                  key: fsid
                  name: rook-ceph-mon
            - name: ROOK_LOG_LEVEL
              value: DEBUG
            - mountPath: /etc/ceph
              name: ceph-conf-emptydir
            - mountPath: /var/lib/rook
              name: rook-config
        - emptyDir: {}
          name: ceph-conf-emptydir
        - emptyDir: {}
          name: rook-config
      restartPolicy: Never

Deleting the deployment:

[18:05] bridge:~% kubectl -n rook-ceph delete deployment rook-ceph-osd-6
deployment.apps "rook-ceph-osd-6" deleted


  • We user Harbor for caching and as an image registry. Internal app reference: apps/prod/harbor.
  • The admin password is in the password store, auto generated per cluster
  • At the moment harbor only authenticates against the internal ldap tree

LDAP configuration

  • The url needs to be ldaps://...
  • uid = uid
  • rest standard

Monitoring / Prometheus

Access via ...

Prometheus Options


How to get the nextcloud credentials

  • The initial username is set to "nextcloud"
  • The password is autogenerated and saved in a kubernetes secret
kubectl get secret RELEASENAME-nextcloud -o jsonpath="{.data.PASSWORD}" | base64 -d; echo "" 

How to fix "Access through untrusted domain"

  • Nextcloud stores the initial domain configuration
  • If the FQDN is changed, it will show the error message "Access through untrusted domain"
  • To fix, edit /var/www/html/config/config.php and correct the domain
  • Then delete the pods

Infrastructure versions

ungleich kubernetes infrastructure v5 (2021-10)

Clusters are configured / setup in this order:

ungleich kubernetes infrastructure v4 (2021-09)

  • rook is configured via manifests instead of using the rook-ceph-cluster helm chart
  • The rook operator is still being installed via helm

ungleich kubernetes infrastructure v3 (2021-07)

  • rook is now installed via helm via argocd instead of directly via manifests

ungleich kubernetes infrastructure v2 (2021-05)

  • Replaced fluxv2 from ungleich k8s v1 with argocd
    • argocd can apply helm templates directly without needing to go through Chart releases
  • We are also using argoflow for build flows
  • Planned to add kaniko for image building

ungleich kubernetes infrastructure v1 (2021-01)

We are using the following components:

  • Calico as a CNI with BGP, IPv6 only, no encapsulation
    • Needed for basic networking
  • kubernetes-secret-generator for creating secrets
    • Needed so that secrets are not stored in the git repository, but only in the cluster
  • ungleich-certbot
    • Needed to get letsencrypt certificates for services
  • rook with ceph rbd + cephfs for storage
    • rbd for almost everything, ReadWriteOnce
    • cephfs for smaller things, multi access ReadWriteMany
    • Needed for providing persistent storage
  • flux v2
    • Needed to manage resources automatically

Updated by Nico Schottelius 20 days ago · 106 revisions