
The ungleich kubernetes infrastructure and ungleich kubernetes manual

Status

This document is in production and serves as the ungleich kubernetes infrastructure overview as well as the ungleich kubernetes manual.

k8s clusters

Cluster Purpose/Setup Maintainer Master(s) argo v4 http proxy last verified
c0.k8s.ooo Dev - UNUSED 2021-10-05
c1.k8s.ooo retired - 2022-03-15
c2.k8s.ooo Dev p7 HW Nico server47 server53 server54 argo 2021-10-05
c3.k8s.ooo retired - - 2021-10-05
c4.k8s.ooo Dev2 p7 HW Jin-Guk server52 server53 server54 -
c5.k8s.ooo retired - 2022-03-15
c6.k8s.ooo Dev p6 VM Jin-Guk Jin-Guk
p5.k8s.ooo production server34 server36 server38 argo -
p5-cow.k8s.ooo production Nico server47 server51 server55 argo 2022-08-27
p6.k8s.ooo production server67 server69 server71 argo 147.78.194.13 2021-10-05
p6-cow.k8s.ooo production server134 server135 server136 argo ? 2023-05-17
p10.k8s.ooo production server131 server132 server133 argo 147.78.194.12 2021-10-05
k8s.ge.nau.so development server107 server108 server109 argo
dev.k8s.ooo development server110 server111 server112 argo - 2022-07-08
r1.p15.k8s.ooo production Nico server120 2022-10-30
r2.p15.k8s.ooo production Nico server121 2022-09-06
r1.p10.k8s.ooo production Nico server122 2022-10-30
r2.p10.k8s.ooo production Nico server123 2022-10-15
r1.p5.k8s.ooo production Nico server137 2022-10-30
r2.p5.k8s.ooo production Nico server138 2022-10-30
r1.p6.k8s.ooo production Nico server139 2022-10-30
r2.p6.k8s.ooo production Nico server140 2022-10-30

General architecture and components overview

  • All k8s clusters are IPv6 only
  • We use BGP peering to propagate podcidr and serviceCidr networks to our infrastructure
  • The main public testing repository is ungleich-k8s
    • Private configurations are found in the k8s-config repository

Cluster types

Type/Feature Development Production
Min No. nodes 3 (1 master, 3 worker) 5 (3 master, 3 worker)
Recommended minimum 4 (dedicated master, 3 worker) 8 (3 master, 5 worker)
Separation of control plane optional recommended
Persistent storage required required
Number of storage monitors 3 5

General k8s operations

Cheat sheet / great external references

Some examples:

Use kubectl to print only the node names

kubectl get nodes -o jsonpath='{.items[*].metadata.name}'

Can easily be used in a shell loop like this:

for host in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do echo $host; ssh root@${host} uptime; done

Allowing work to be scheduled on the control plane / removing node taints

  • Mostly for single node / test / development clusters
  • Just remove the master taint as follows
kubectl taint nodes --all node-role.kubernetes.io/master-
kubectl taint nodes --all node-role.kubernetes.io/control-plane-

You can check the node taints using kubectl describe node ...

Adding taints

  • For instance to limit nodes to specific customers
kubectl taint nodes serverXX customer=CUSTOMERNAME:NoSchedule

Get the cluster admin.conf

  • On the masters of each cluster you can find the file /etc/kubernetes/admin.conf
  • To be able to administer the cluster you can copy the admin.conf to your local machine
  • Multi cluster debugging can be very easy if you name the config ~/cX-admin.conf (see example below)
% scp root@server47.place7.ungleich.ch:/etc/kubernetes/admin.conf ~/c2-admin.conf
% export KUBECONFIG=~/c2-admin.conf    
% kubectl get nodes
NAME       STATUS                     ROLES                  AGE   VERSION
server47   Ready                      control-plane,master   82d   v1.22.0
server48   Ready                      control-plane,master   82d   v1.22.0
server49   Ready                      <none>                 82d   v1.22.0
server50   Ready                      <none>                 82d   v1.22.0
server59   Ready                      control-plane,master   82d   v1.22.0
server60   Ready,SchedulingDisabled   <none>                 82d   v1.22.0
server61   Ready                      <none>                 82d   v1.22.0
server62   Ready                      <none>                 82d   v1.22.0               

Installing a new k8s cluster

  • Decide on the cluster name (usually cX.k8s.ooo), X counting upwards
    • Using pXX.k8s.ooo for production clusters of placeXX
  • Use cdist to configure the nodes with requirements like crio
  • Decide between single or multi node control plane setups (see below)
    • Single control plane suitable for development clusters

Typical init procedure:

Single control plane:

kubeadm init --config bootstrap/XXX/kubeadm.yaml

Multi control plane (HA):

kubeadm init --config bootstrap/XXX/kubeadm.yaml --upload-certs

Deleting a pod that is hanging in terminating state

kubectl delete pod <PODNAME> --grace-period=0 --force --namespace <NAMESPACE>

(from https://stackoverflow.com/questions/35453792/pods-stuck-in-terminating-status)

Listing nodes of a cluster

[15:05] bridge:~% kubectl get nodes
NAME       STATUS   ROLES                  AGE   VERSION
server22   Ready    <none>                 52d   v1.22.0
server23   Ready    <none>                 52d   v1.22.2
server24   Ready    <none>                 52d   v1.22.0
server25   Ready    <none>                 52d   v1.22.0
server26   Ready    <none>                 52d   v1.22.0
server27   Ready    <none>                 52d   v1.22.0
server63   Ready    control-plane,master   52d   v1.22.0
server64   Ready    <none>                 52d   v1.22.0
server65   Ready    control-plane,master   52d   v1.22.0
server66   Ready    <none>                 52d   v1.22.0
server83   Ready    control-plane,master   52d   v1.22.0
server84   Ready    <none>                 52d   v1.22.0
server85   Ready    <none>                 52d   v1.22.0
server86   Ready    <none>                 52d   v1.22.0

Removing / draining a node

Usually kubectl drain server should do the job, but sometimes we need to be more aggressive:

kubectl drain --delete-emptydir-data --ignore-daemonsets serverXX

Readding a node after draining

kubectl uncordon serverXX

(Re-)joining worker nodes after creating the cluster

  • We need to have an up-to-date token
  • We use different join commands for the workers and control plane nodes

Generating the join command on an existing control plane node:

kubeadm token create --print-join-command

(Re-)joining control plane nodes after creating the cluster

  • We generate the token again
  • We upload the certificates
  • We need to combine/create the join command for the control plane node

Example session:

% kubeadm token create --print-join-command
kubeadm join p10-api.k8s.ooo:6443 --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash 

% kubeadm init phase upload-certs --upload-certs
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:
CERTKEY

# Then we use these two outputs on the joining node:

kubeadm join p10-api.k8s.ooo:6443 --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash --control-plane --certificate-key CERTKEY

Commands to be used on a control plane node:

kubeadm token create --print-join-command
kubeadm init phase upload-certs --upload-certs

Commands to be used on the joining node:

JOINCOMMAND --control-plane --certificate-key CERTKEY

SEE ALSO

How to fix etcd does not start when rejoining a kubernetes cluster as a control plane

If during the above step etcd does not come up, kubeadm join can hang as follows:

[control-plane] Creating static Pod manifest for "kube-apiserver"                                                              
[control-plane] Creating static Pod manifest for "kube-controller-manager"                                                     
[control-plane] Creating static Pod manifest for "kube-scheduler"                                                              
[check-etcd] Checking that the etcd cluster is healthy                                                                         
error execution phase check-etcd: etcd cluster is not healthy: failed to dial endpoint https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2379 with maintenance client: context deadline exceeded
To see the stack trace of this error execute with --v=5 or higher         

Then the problem is likely that the etcd server is still a member of the cluster. We first need to remove it from the etcd cluster and then the join works.

To fix this we do:

  • Find a working etcd pod
  • Find the etcd members / member list
  • Remove the etcd member that we want to re-join the cluster
# Find the etcd pods
kubectl -n kube-system get pods -l component=etcd,tier=control-plane

# Get the list of etcd servers with the member id 
kubectl exec -n kube-system -ti ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list

# Remove the member
kubectl exec -n kube-system -ti ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove MEMBERID

Sample session:

[10:48] line:~% kubectl -n kube-system get pods -l component=etcd,tier=control-plane
NAME            READY   STATUS    RESTARTS     AGE
etcd-server63   1/1     Running   0            3m11s
etcd-server65   1/1     Running   3            7d2h
etcd-server83   1/1     Running   8 (6d ago)   7d2h
[10:48] line:~% kubectl exec -n kube-system -ti etcd-server65 -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
356891cd676df6e4, started, server65, https://[2a0a:e5c0:10:1:225:b3ff:fe20:375c]:2380, https://[2a0a:e5c0:10:1:225:b3ff:fe20:375c]:2379, false
371b8a07185dee7e, started, server63, https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2380, https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2379, false
5942bc58307f8af9, started, server83, https://[2a0a:e5c0:10:1:3e4a:92ff:fe79:bb98]:2380, https://[2a0a:e5c0:10:1:3e4a:92ff:fe79:bb98]:2379, false

[10:48] line:~% kubectl exec -n kube-system -ti etcd-server65 -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove 371b8a07185dee7e
Member 371b8a07185dee7e removed from cluster e3c0805f592a8f77

SEE ALSO

Updating the members

1) get an alive member

% kubectl -n kube-system get pods -l component=etcd,tier=control-plane
NAME            READY   STATUS    RESTARTS   AGE
etcd-server67   1/1     Running   1          185d
etcd-server69   1/1     Running   1          185d
etcd-server71   1/1     Running   2          185d
[20:57] sun:~% 

2) get member list

  • in this case via crictl, as the API does not work correctly anymore (see the sketch below)
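
A hedged sketch of how this can be done directly on a control plane node (assuming crio, that the etcd container is simply named etcd, and that the certificate paths match the kubectl examples above):

# Find the etcd container id on the node
ETCD_CONTAINER=$(crictl ps --name etcd -q)

# List the members, using the certificates mounted into the etcd container
crictl exec -it ${ETCD_CONTAINER} etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list

The member update command from step 3 can be run the same way.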

3) update

etcdctl member update MEMBERID  --peer-urls=https://[...]:2380

Node labels (adding, showing, removing)

Listing the labels:

kubectl get nodes --show-labels

Adding labels:

kubectl label nodes LIST-OF-NODES label1=value1 

For instance:

kubectl label nodes router2 router3 hosttype=router 

Selecting nodes in pods:

apiVersion: v1
kind: Pod
...
spec:
  nodeSelector:
    hosttype: router

Removing labels by adding a minus at the end of the label name:

kubectl label node <nodename> <labelname>-

For instance:

kubectl label nodes router2 router3 hosttype- 

SEE ALSO

Listing all pods on a node

kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=serverXX

Found on https://stackoverflow.com/questions/62000559/how-to-list-all-the-pods-running-in-a-particular-worker-node-by-executing-a-comm

Hardware Maintenance using ungleich-hardware

Use the following manifest and replace the HOST with the actual host:

apiVersion: v1
kind: Pod
metadata:
  name: ungleich-hardware-HOST
spec:
  containers:
  - name: ungleich-hardware
    image: ungleich/ungleich-hardware:0.0.5
    args:
    - sleep
    - "1000000" 
    volumeMounts:
      - mountPath: /dev
        name: dev
    securityContext:
      privileged: true
  nodeSelector:
    kubernetes.io/hostname: "HOST" 

  volumes:
    - name: dev
      hostPath:
        path: /dev
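
Once the pod is running, a shell inside it can be obtained like this (a sketch; the pod name follows the manifest above):

kubectl exec -ti ungleich-hardware-HOST -- /bin/sh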

Also see: The_ungleich_hardware_maintenance_guide

Triggering a cronjob / creating a job from a cronjob

To test a cronjob, we can create a job from a cronjob:

kubectl create job --from=cronjob/volume2-daily-backup volume2-manual

This creates a job volume2-manual based on the cronjob volume2-daily-backup.

su-ing into a user that has nologin shell set

Often users have nologin set as their shell inside the container. To be able to execute maintenance commands within the container, we can use su -s /bin/sh like this:

su -s /bin/sh -c '/path/to/your/script' testuser

Found on https://serverfault.com/questions/351046/how-to-run-command-as-user-who-has-usr-sbin-nologin-as-shell

How to print a secret value

Assuming you want the "password" item from a secret, use:

kubectl get secret SECRETNAME -o jsonpath="{.data.password}" | base64 -d; echo "" 

Fixing the "ImageInspectError"

If you see this problem:

# kubectl get pods
NAME                                                       READY   STATUS                   RESTARTS   AGE
bird-router-server137-bird-767f65bb47-g4xsh                0/1     Init:ImageInspectError   0          77d
bird-router-server137-openvpn-server120-5c987b7ffb-cn9xf   0/1     ImageInspectError        1          159d
bird-router-server137-unbound-5c6f5d4bb6-cxbpr             0/1     ImageInspectError        1          159d

Fixes so far:

  • correct registries.conf (see the sketch below)
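
A hedged sketch of what this can look like on the affected node (assuming crio with the standard /etc/containers/registries.conf and an OpenRC based host):

# Inspect and correct the registry configuration on the node
vi /etc/containers/registries.conf

# Restart crio so the change takes effect
rc-service crio restart

# Verify that pulling the affected image works again
crictl pull IMAGENAME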

Automatic cleanup of images

  • options to kubelet (the config-file equivalents are sketched below)
  --image-gc-high-threshold=90: The percent of disk usage after which image garbage collection is always run. Default: 90%
  --image-gc-low-threshold=80: The percent of disk usage before which image garbage collection is never run. Lowest disk usage to garbage collect to. Default: 80%
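
The same thresholds can also be set via the kubelet configuration file on kubeadm provisioned nodes (a sketch, assuming the default /var/lib/kubelet/config.yaml and an OpenRC based host):

# Add or adjust these KubeletConfiguration fields in /var/lib/kubelet/config.yaml:
#
#   imageGCHighThresholdPercent: 90
#   imageGCLowThresholdPercent: 80

# Then restart the kubelet
rc-service kubelet restart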

How to upgrade a kubernetes cluster

General

Getting a specific kubeadm or kubelet version

RELEASE=v1.22.17
RELEASE=v1.23.17
RELEASE=v1.24.9
RELEASE=v1.25.9
RELEASE=v1.26.6
RELEASE=v1.27.2

ARCH=amd64

curl -L --remote-name-all https://dl.k8s.io/release/${RELEASE}/bin/linux/${ARCH}/{kubeadm,kubelet}
chmod u+x kubeadm kubelet

Steps

  • kubeadm upgrade plan
    • On one control plane node
  • kubeadm upgrade apply vXX.YY.ZZ
    • On one control plane node
  • kubeadm upgrade node
    • On all other control plane nodes
    • On all worker nodes afterwards

Repeat this for all control plane nodes, then upgrade the kubelet on all other nodes via the package manager. A sketch of the full sequence follows below.
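
A hedged sketch of the whole sequence using the downloaded binaries from above (the version is only an example):

# On the first control plane node
./kubeadm upgrade plan
./kubeadm upgrade apply v1.24.9

# On every other control plane node, then on each worker node
./kubeadm upgrade node

# Finally replace the kubelet binary / package on each node and restart it
rc-service kubelet restart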

Upgrading to 1.22.17

Upgrading to 1.23.17

Upgrading to 1.24.17

Upgrading to 1.25.14

Upgrading to 1.26.9

Upgrading to 1.27

Upgrading to 1.28

Upgrading to 1.29

Upgrade to crio 1.27: missing crun

Error message

level=fatal msg="validating runtime config: runtime validation: \"crun\" not found in $PATH: exec: \"crun\": executable file not found in $PATH" 

Fix:

apk add crun

Reference CNI

  • Mainly "stupid", but effective plugins
  • Main documentation on https://www.cni.dev/plugins/current/
  • Plugins
    • bridge
      • Can create the bridge on the host
      • But seems not to be able to add host interfaces to it as well
      • Has support for vlan tags
    • vlan
    • host-device
      • moves the interface from the host into the container
      • very easy for physical connections to containers
    • ipvlan
      • "virtualisation" of a host device
      • routing based on IP
      • Same MAC for everyone
      • Cannot reach the master interface
    • macvlan
      • With mac addresses
      • Supports various modes (to be checked)
    • ptp ("point to point")
      • Creates a host device and connects it to the container
    • win*
      • Windows implementations

Calico CNI

Calico Installation

  • We install calico using helm
  • This has the following advantages:
    • Easy to upgrade
    • Does not require us to configure IPv6/dual stack settings, as the tigera operator figures things out on its own

Usually plain calico can be installed directly using:

VERSION=v3.25.0

helm repo add projectcalico https://docs.projectcalico.org/charts
helm repo update
helm upgrade --install --namespace tigera calico projectcalico/tigera-operator --version $VERSION --create-namespace

Installing calicoctl

To be able to manage and configure calico, we need to install calicoctl.

kubectl apply -f https://docs.projectcalico.org/manifests/calicoctl.yaml

Or version specific:

kubectl apply -f https://github.com/projectcalico/calico/blob/v3.20.4/manifests/calicoctl.yaml

# For 3.22
kubectl apply -f https://projectcalico.docs.tigera.io/archive/v3.22/manifests/calicoctl.yaml

And making it more easily accessible via an alias:

alias calicoctl="kubectl exec -i -n kube-system calicoctl -- /calicoctl" 

Calico configuration

By default our k8s clusters BGP peer with an upstream router to propagate the podcidr and servicecidr.

Default settings in our infrastructure:

  • We use a full-mesh using the nodeToNodeMeshEnabled: true option
  • We keep the original next hop so that only the server with the pod is announcing it (instead of ecmp)
  • We use private ASNs for k8s clusters
  • We do not use any overlay

After installing calico and calicoctl the last step of the installation is usually:

calicoctl create -f - < calico-bgp.yaml

A sample BGP configuration:

---
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: true
  asNumber: 65534
  serviceClusterIPs:
  - cidr: 2a0a:e5c0:10:3::/108
  serviceExternalIPs:
  - cidr: 2a0a:e5c0:10:3::/108
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: router1-place10
spec:
  peerIP: 2a0a:e5c0:10:1::50
  asNumber: 213081
  keepOriginalNextHop: true

Cilium CNI (experimental)

Status

NO WORKING CILIUM CONFIGURATION FOR IPV6-ONLY MODE

Latest error

It seems cilium does not run on IPv6 only hosts:

level=info msg="Validating configured node address ranges" subsys=daemon
level=fatal msg="postinit failed" error="external IPv4 node address could not be derived, please configure via --ipv4-node" subsys=daemon
level=info msg="Starting IP identity watcher" subsys=ipcache

It crashes after that log entry

BGP configuration

  • The cilium-operator will not start without a correct configmap being present beforehand (see error message below)
  • Creating the bgp config beforehand as a configmap is thus required.

The error one gets without the configmap present:

Pods are hanging with:

cilium-bpqm6                       0/1     Init:0/4            0             9s
cilium-operator-5947d94f7f-5bmh2   0/1     ContainerCreating   0             9s

The error message in the cilium-operator is:

Events:
  Type     Reason       Age                From               Message
  ----     ------       ----               ----               -------
  Normal   Scheduled    80s                default-scheduler  Successfully assigned kube-system/cilium-operator-5947d94f7f-lqcsp to server56
  Warning  FailedMount  16s (x8 over 80s)  kubelet            MountVolume.SetUp failed for volume "bgp-config-path" : configmap "bgp-config" not found

A correct bgp config looks like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: bgp-config
  namespace: kube-system
data:
  config.yaml: |
    peers:
      - peer-address: 2a0a:e5c0::46
        peer-asn: 209898
        my-asn: 65533
      - peer-address: 2a0a:e5c0::47
        peer-asn: 209898
        my-asn: 65533
    address-pools:
      - name: default
        protocol: bgp
        addresses:
          - 2a0a:e5c0:0:14::/64

Installation

Adding the repo


helm repo add cilium https://helm.cilium.io/
helm repo update

Installing + configuring cilium

ipv6pool=2a0a:e5c0:0:14::/112

version=1.12.2

helm upgrade --install cilium cilium/cilium --version $version \
  --namespace kube-system \
  --set ipv4.enabled=false \
  --set ipv6.enabled=true \
  --set enableIPv6Masquerade=false \
  --set bgpControlPlane.enabled=true 

#  --set ipam.operator.clusterPoolIPv6PodCIDRList=$ipv6pool

# Old style bgp?
#   --set bgp.enabled=true --set bgp.announce.podCIDR=true \

# Show possible configuration options
helm show values cilium/cilium

Using a /64 for ipam.operator.clusterPoolIPv6PodCIDRList fails with:

level=fatal msg="Unable to init cluster-pool allocator" error="unable to initialize IPv6 allocator New CIDR set failed; the node CIDR size is too big" subsys=cilium-operator-generic

See also https://github.com/cilium/cilium/issues/20756

Seems a /112 is actually working.

Kernel modules

Cilium requires the following modules to be loaded on the host (not loaded by default):

modprobe  ip6table_raw
modprobe  ip6table_filter

Interesting helm flags

  • autoDirectNodeRoutes
  • bgpControlPlane.enabled = true (see the example below)
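
For example, both can be enabled on an existing installation like this (a sketch; both are regular values of the cilium helm chart):

helm upgrade --install cilium cilium/cilium --version $version \
  --namespace kube-system \
  --reuse-values \
  --set autoDirectNodeRoutes=true \
  --set bgpControlPlane.enabled=true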

SEE ALSO

Multus

VERSION=v4.0.1

kubectl apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg/multus-cni/${VERSION}/deployments/multus-daemonset-crio.yml

ArgoCD

Argocd Installation

As there is no configuration management present yet, argocd is installed using

kubectl create namespace argocd

# OR: latest stable
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# OR Specific Version
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/v2.3.2/manifests/install.yaml

Get the argocd credentials

kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d; echo "" 

Accessing argocd

In regular IPv6 clusters:

In legacy IPv4 clusters

kubectl --namespace argocd port-forward svc/argocd-server 8080:80

Using the argocd webhook to trigger changes

Deploying an application

  • Applications are deployed via git towards gitea (code.ungleich.ch) and then pulled by argo
  • Always include the redmine-url pointing to the (customer) ticket
    • Also add the support-url if it exists

Application sample

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gitea-CUSTOMER
  namespace: argocd
spec:
  destination:
    namespace: default
    server: 'https://kubernetes.default.svc'
  source:
    path: apps/prod/gitea
    repoURL: 'https://code.ungleich.ch/ungleich-intern/k8s-config.git'
    targetRevision: HEAD
    helm:
      parameters:
        - name: storage.data.storageClass
          value: rook-ceph-block-hdd
        - name: storage.data.size
          value: 200Gi
        - name: storage.db.storageClass
          value: rook-ceph-block-ssd
        - name: storage.db.size
          value: 10Gi
        - name: storage.letsencrypt.storageClass
          value: rook-ceph-block-hdd
        - name: storage.letsencrypt.size
          value: 50Mi
        - name: letsencryptStaging
          value: 'no'
        - name: fqdn
          value: 'code.verua.online'
  project: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
  info:
    - name: 'redmine-url'
      value: 'https://redmine.ungleich.ch/issues/ISSUEID'
    - name: 'support-url'
      value: 'https://support.ungleich.ch/Ticket/Display.html?id=TICKETID'

Helm related operations and conventions

We use helm charts extensively.

  • In production, they are managed via argocd
  • In development, helm charts can be developed and deployed manually using the helm utility.

Installing a helm chart

One can use the usual pattern of

helm install <releasename> <chartdirectory>

However, you often want to reinstall/update when testing helm charts. The following pattern is "better", because it also works if the release is already installed:

helm upgrade --install <releasename> <chartdirectory>

Naming services and deployments in helm charts [Application labels]

Show all versions of a helm chart

helm search repo -l repo/chart

For example:

% helm search repo -l projectcalico/tigera-operator 
NAME                             CHART VERSION    APP VERSION    DESCRIPTION                            
projectcalico/tigera-operator    v3.23.3          v3.23.3        Installs the Tigera operator for Calico
projectcalico/tigera-operator    v3.23.2          v3.23.2        Installs the Tigera operator for Calico
....

Show possible values of a chart

helm show values <repo/chart>

Example:

helm show values ingress-nginx/ingress-nginx

Show all possible charts in a repo

helm search repo REPO

Download a chart

For instance, for checking it out locally, use:

helm pull <repo/chart>

Rook + Ceph

Installation

  • Usually directly via argocd

Executing ceph commands

Using the ceph-tools pod as follows:

kubectl exec -n rook-ceph -ti $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o jsonpath='{.items[*].metadata.name}') -- ceph -s

Inspecting the logs of a specific server

# Get the related pods
kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare 
...

# Inspect the logs of a specific pod
kubectl -n rook-ceph logs -f rook-ceph-osd-prepare-server23--1-444qx

Inspecting the logs of the rook-ceph-operator

kubectl -n rook-ceph logs -f -l app=rook-ceph-operator

(Temporarily) Disabling the rook-ceph-operator

  • first disabling the sync in argocd
  • then scale it down
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0

When done with the work/maintenance, re-enable sync in argocd.
The following command is thus strictly speaking not required, as argocd will fix it on its own:

kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=1

Restarting the rook operator

kubectl -n rook-ceph delete pods  -l app=rook-ceph-operator

Triggering server prepare / adding new osds

The rook-ceph-operator triggers/watches/creates pods to maintain hosts. To trigger a full "re scan", simply delete that pod:

kubectl -n rook-ceph delete pods -l app=rook-ceph-operator

This will cause all the rook-ceph-osd-prepare-.. jobs to be recreated and thus OSDs to be created, if new disks have been added.

Removing an OSD

Set the OSD id in the osd-purge.yaml and apply it. The OSD should be down beforehand.

apiVersion: batch/v1
kind: Job
metadata:
  name: rook-ceph-purge-osd
  namespace: rook-ceph # namespace:cluster
  labels:
    app: rook-ceph-purge-osd
spec:
  template:
    metadata:
      labels:
        app: rook-ceph-purge-osd
    spec:
      serviceAccountName: rook-ceph-purge-osd
      containers:
        - name: osd-removal
          image: rook/ceph:master
          # TODO: Insert the OSD ID in the last parameter that is to be removed
          # The OSD IDs are a comma-separated list. For example: "0" or "0,2".
          # If you want to preserve the OSD PVCs, set `--preserve-pvc true`.
          #
          # A --force-osd-removal option is available if the OSD should be destroyed even though the
          # removal could lead to data loss.
          args:
            - "ceph" 
            - "osd" 
            - "remove" 
            - "--preserve-pvc" 
            - "false" 
            - "--force-osd-removal" 
            - "false" 
            - "--osd-ids" 
            - "SETTHEOSDIDHERE" 
          env:
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: ROOK_MON_ENDPOINTS
              valueFrom:
                configMapKeyRef:
                  key: data
                  name: rook-ceph-mon-endpoints
            - name: ROOK_CEPH_USERNAME
              valueFrom:
                secretKeyRef:
                  key: ceph-username
                  name: rook-ceph-mon
            - name: ROOK_CEPH_SECRET
              valueFrom:
                secretKeyRef:
                  key: ceph-secret
                  name: rook-ceph-mon
            - name: ROOK_CONFIG_DIR
              value: /var/lib/rook
            - name: ROOK_CEPH_CONFIG_OVERRIDE
              value: /etc/rook/config/override.conf
            - name: ROOK_FSID
              valueFrom:
                secretKeyRef:
                  key: fsid
                  name: rook-ceph-mon
            - name: ROOK_LOG_LEVEL
              value: DEBUG
          volumeMounts:
            - mountPath: /etc/ceph
              name: ceph-conf-emptydir
            - mountPath: /var/lib/rook
              name: rook-config
      volumes:
        - emptyDir: {}
          name: ceph-conf-emptydir
        - emptyDir: {}
          name: rook-config
      restartPolicy: Never

Deleting the deployment:

[18:05] bridge:~% kubectl -n rook-ceph delete deployment rook-ceph-osd-6
deployment.apps "rook-ceph-osd-6" deleted

Placement of mons/osds/etc.

See https://rook.io/docs/rook/v1.11/CRDs/Cluster/ceph-cluster-crd/#placement-configuration-settings

Setting up and managing S3 object storage

Endpoints

Location Endpoint
p5 https://s3.k8s.place5.ungleich.ch

Setting up a storage class

  • This will store the buckets of a specific customer

Similar to this:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ungleich-archive-bucket-sc
  namespace: rook-ceph
provisioner: rook-ceph.ceph.rook.io/bucket
reclaimPolicy: Delete
parameters:
  objectStoreName: place5
  objectStoreNamespace: rook-ceph

Setting up the Bucket

Similar to this:

apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: ungleich-archive-bucket-claim
  namespace: rook-ceph
spec:
  generateBucketName: ungleich-archive-ceph-bkt
  storageClassName: ungleich-archive-bucket-sc
  additionalConfig:
    # To set a quota for the OBC
    #maxObjects: "1000" 
    maxSize: "100G" 

Getting the credentials for the bucket

  • Get "public" information from the configmap
  • Get secret from the secret
name=BUCKETNAME
endpoint=https://s3.k8s.place5.ungleich.ch

cm=$(kubectl -n rook-ceph get configmap -o yaml ${name}-bucket-claim)

sec=$(kubectl -n rook-ceph get secrets -o yaml ${name}-bucket-claim)
AWS_ACCESS_KEY_ID=$(echo $sec | yq .data.AWS_ACCESS_KEY_ID | base64 -d ; echo "")
AWS_SECRET_ACCESS_KEY=$(echo $sec | yq .data.AWS_SECRET_ACCESS_KEY | base64 -d ; echo "")

bucket_name=$(echo $cm | yq .data.BUCKET_NAME)

Access via s4cmd:

s4cmd --endpoint-url ${endpoint} --access-key=${AWS_ACCESS_KEY_ID} --secret-key=${AWS_SECRET_ACCESS_KEY} ls

Ingress + Cert Manager

  • We deploy nginx-ingress to get an ingress
  • we deploy cert-manager to handle certificates
  • We deploy the ClusterIssuer independently, so that the cert-manager app can deploy first and the issuer is created once the CRDs from cert-manager are in place (see the sketch below)
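
A minimal ClusterIssuer sketch (assuming Let's Encrypt with HTTP-01 solving via the nginx ingress class; the name and email are placeholders):

kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-production
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: EMAIL
    privateKeySecretRef:
      name: letsencrypt-production
    solvers:
      - http01:
          ingress:
            class: nginx
EOF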

IPv4 reachability

The ingress is by default IPv6 only. To make it reachable from the IPv4 world, get its IPv6 address and configure a NAT64 mapping in Jool.

Steps:

Get the ingress IPv6 address

Use kubectl -n ingress-nginx get svc ingress-nginx-controller -o jsonpath='{.spec.clusterIP}'; echo ''

Example:

kubectl -n ingress-nginx get svc ingress-nginx-controller -o jsonpath='{.spec.clusterIP}'; echo ''
2a0a:e5c0:10:1b::ce11

Add NAT64 mapping

  • Update the __dcl_jool_siit cdist type
  • Record the two IPs (IPv6 and IPv4)
  • Configure all routers

Add DNS record

To make the ingress usable as a CNAME destination, create an "ingress" DNS record, such as:

; k8s ingress for dev
dev-ingress                 AAAA 2a0a:e5c0:10:1b::ce11
dev-ingress                 A 147.78.194.23

Add supporting wildcard DNS

If you plan to add various sites under a specific domain, a wildcard DNS entry can be added, such as *.k8s-dev.django-hosting.ch:

*.k8s-dev         CNAME dev-ingress.ungleich.ch.

Harbor

  • We use Harbor as an image registry for our own images. Internal app reference: apps/prod/harbor.
  • The admin password is in the password store; it is Harbor12345 by default
  • At the moment harbor only authenticates against the internal ldap tree

LDAP configuration

  • The url needs to be ldaps://...
  • uid = uid
  • the rest is standard

Monitoring / Prometheus

Access via ...

Prometheus Options

Grafana default password

  • If not changed: prom-operator

Nextcloud

How to get the nextcloud credentials

  • The initial username is set to "nextcloud"
  • The password is autogenerated and saved in a kubernetes secret
kubectl get secret RELEASENAME-nextcloud -o jsonpath="{.data.PASSWORD}" | base64 -d; echo "" 

How to fix "Access through untrusted domain"

  • Nextcloud stores the initial domain configuration
  • If the FQDN is changed, it will show the error message "Access through untrusted domain"
  • To fix, edit /var/www/html/config/config.php and correct the domain
  • Then delete the pods (see the sketch below)
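
A hedged sketch of the fix using occ instead of editing config.php by hand (the namespace and label selector are assumptions and depend on the release):

# Correct the trusted domain via occ inside the running pod
kubectl -n NAMESPACE exec -ti NEXTCLOUDPOD -- su www-data -s /bin/sh -c "./occ config:system:set trusted_domains 1 --value=NEW.FQDN"

# Then delete the pods so they come up with the corrected configuration
kubectl -n NAMESPACE delete pods -l app.kubernetes.io/name=nextcloud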

Running occ commands inside the nextcloud container

  • Find the pod in the right namespace

Exec:

su www-data -s /bin/sh -c ./occ
  • -s /bin/sh is needed as the default shell is set to /bin/false (a combined example follows below)
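
Combined into one command it can look like this (a sketch; the label selector is an assumption and may differ per release):

kubectl -n NAMESPACE exec -ti $(kubectl -n NAMESPACE get pods -l app.kubernetes.io/name=nextcloud -o jsonpath='{.items[0].metadata.name}') -- su www-data -s /bin/sh -c "./occ status"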

Rescanning files

  • If files have been added without nextcloud's knowledge
su www-data -s /bin/sh -c "./occ files:scan --all" 

Sealed Secrets

  • install kubeseal
KUBESEAL_VERSION='0.23.0'
wget "https://github.com/bitnami-labs/sealed-secrets/releases/download/v${KUBESEAL_VERSION:?}/kubeseal-${KUBESEAL_VERSION:?}-linux-amd64.tar.gz" 
tar -xvzf kubeseal-${KUBESEAL_VERSION:?}-linux-amd64.tar.gz kubeseal
sudo install -m 755 kubeseal /usr/local/bin/kubeseal
  • fetch the public certificate for sealed-secrets
kubeseal --fetch-cert > /tmp/public-key-cert.pem
  • create the secret
Example:
apiVersion: v1
kind: Secret
metadata:
  name: Release.Name-postgres-config
  annotations:
    secret-generator.v1.mittwald.de/autogenerate: POSTGRES_PASSWORD
    hosting: Release.Name
  labels:
    app.kubernetes.io/instance: Release.Name
    app.kubernetes.io/component: postgres
stringData:
  POSTGRES_USER: postgresUser
  POSTGRES_DB: postgresDBName
  POSTGRES_INITDB_ARGS: "--no-locale --encoding=UTF8" 
  • convert secret.yaml to sealed-secret.yaml
kubeseal -n <namespace> --cert=/tmp/public-key-cert.pem --format=yaml < ./secret.yaml  > ./sealed-secret.yaml
  • use sealed-secret.yaml in the helm chart directory
  • refer to tickets #11989, #12120

Infrastructure versions

ungleich kubernetes infrastructure v5 (2021-10)

Clusters are configured / set up in this order:

ungleich kubernetes infrastructure v4 (2021-09)

  • rook is configured via manifests instead of using the rook-ceph-cluster helm chart
  • The rook operator is still being installed via helm

ungleich kubernetes infrastructure v3 (2021-07)

  • rook is now installed via helm via argocd instead of directly via manifests

ungleich kubernetes infrastructure v2 (2021-05)

  • Replaced fluxv2 from ungleich k8s v1 with argocd
    • argocd can apply helm templates directly without needing to go through Chart releases
  • We are also using argoflow for build flows
  • Planned to add kaniko for image building

ungleich kubernetes infrastructure v1 (2021-01)

We are using the following components:

  • Calico as a CNI with BGP, IPv6 only, no encapsulation
    • Needed for basic networking
  • kubernetes-secret-generator for creating secrets
    • Needed so that secrets are not stored in the git repository, but only in the cluster
  • ungleich-certbot
    • Needed to get letsencrypt certificates for services
  • rook with ceph rbd + cephfs for storage
    • rbd for almost everything, ReadWriteOnce
    • cephfs for smaller things, multi access ReadWriteMany
    • Needed for providing persistent storage
  • flux v2
    • Needed to manage resources automatically
