The ungleich kubernetes infrastructure » History » Revision 47

« Previous | Revision 47/51 (diff) | Next »
Nico Schottelius, 10/12/2021 08:07 AM

The ungleich kubernetes infrastructure and ungleich kubernetes manual


This document is pre-production.
This document is to become the ungleich kubernetes infrastructure overview as well as the ungleich kubernetes manual.

k8s clusters

Cluster Purpose/Setup Maintainer Master(s) last verified Dev - UNUSED 2021-10-05 Dev p6 VM Nico 2021-10-05 Dev p7 HW Nico server47 server53 server54 2021-10-05 Test p7 PI - UNUSED 2021-10-05 Dev2 p7 HW Fran/Jin-Guk server52 server53 server54 - Dev p6 VM Amal Nico/Amal Dev p6 VM Jin-Guk Jin-Guk production server67 server69 server71 2021-10-05 production server63 server65 server83 2021-10-05

General architecture and components overview

  • All k8s clusters are IPv6 only
  • We use BGP peering to propagate podcidr and serviceCidr networks to our infrastructure
  • The main public testing repository is ungleich-k8s
    • Private configurations are found in the k8s-config repository

Cluster types

Type/Feature Development Production
Min No. nodes 3 (1 master, 3 worker) 5 (3 master, 3 worker)
Recommended minimum 4 (dedicated master, 3 worker) 8 (3 master, 5 worker)
Separation of control plane optional recommended
Persistent storage required required
Number of storage monitors 3 5

General k8s operations

Cheat sheet / external great references

Get the argocd credentials

kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d; echo "" 

Get the cluster admin.conf

  • On the masters of each cluster you can find the file /etc/kubernetes/admin.conf
  • To be able to administrate the cluster you can copy the admin.conf to your local machine
  • Multi cluster debugging can very easy if you name the config ~/cX-admin.conf (see example below)
% scp ~/c2-admin.conf
% export KUBECONFIG=~/c2-admin.conf    
% kubectl get nodes
NAME       STATUS                     ROLES                  AGE   VERSION
server47   Ready                      control-plane,master   82d   v1.22.0
server48   Ready                      control-plane,master   82d   v1.22.0
server49   Ready                      <none>                 82d   v1.22.0
server50   Ready                      <none>                 82d   v1.22.0
server59   Ready                      control-plane,master   82d   v1.22.0
server60   Ready,SchedulingDisabled   <none>                 82d   v1.22.0
server61   Ready                      <none>                 82d   v1.22.0
server62   Ready                      <none>                 82d   v1.22.0               

Installing a new k8s cluster

  • Decide on the cluster name (usually, X counting upwards
    • Using for production clusters of placeXX
  • Use cdist to configure the nodes with requirements like crio
  • Decide between single or multi node control plane setups (see below)
    • Single control plane suitable for development clusters

Typical init procedure:

  • Single control plane: kubeadm init --config bootstrap/XXX/kubeadm.yaml
  • Multi control plane (HA): kubeadm init --config bootstrap/XXX/kubeadm.yaml --upload-certs

Deleting a pod that is hanging in terminating state

kubectl delete pod <PODNAME> --grace-period=0 --force --namespace <NAMESPACE>


Listing nodes of a cluster

[15:05] bridge:~% kubectl get nodes
NAME       STATUS   ROLES                  AGE   VERSION
server22   Ready    <none>                 52d   v1.22.0
server23   Ready    <none>                 52d   v1.22.2
server24   Ready    <none>                 52d   v1.22.0
server25   Ready    <none>                 52d   v1.22.0
server26   Ready    <none>                 52d   v1.22.0
server27   Ready    <none>                 52d   v1.22.0
server63   Ready    control-plane,master   52d   v1.22.0
server64   Ready    <none>                 52d   v1.22.0
server65   Ready    control-plane,master   52d   v1.22.0
server66   Ready    <none>                 52d   v1.22.0
server83   Ready    control-plane,master   52d   v1.22.0
server84   Ready    <none>                 52d   v1.22.0
server85   Ready    <none>                 52d   v1.22.0
server86   Ready    <none>                 52d   v1.22.0

Removing / draining a node

Usually kubectl drain server should do the job, but sometimes we need to be more aggressive:

kubectl drain --delete-emptydir-data --ignore-daemonsets server23

Readding a node after draining

kubectl uncordon serverXX

Rook / Ceph Related Operations

Inspecting the logs of a specific server

# Get the related pods
kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare 

# Inspect the logs of a specific pod
kubectl -n rook-ceph logs -f rook-ceph-osd-prepare-server23--1-444qx

Triggering server prepare / adding new osds

The rook-ceph-operator triggers/watches/creates pods to maintain hosts. To trigger a full "re scan", simply delete that pod:

kubectl -n rook-ceph delete pods -l app=rook-ceph-operator

This will cause all the rook-ceph-osd-prepare-.. jobs to be recreated and thus OSDs to be created, if new disks have been added.

Removing an OSD

Infrastructure versions

ungleich kubernetes infrastructure v3

  • rook is now installed via helm via argocd instead of directly via manifests

ungleich kubernetes infrastructure v2

  • Replaced fluxv2 from ungleich k8s v1 with argocd
    • argocd can apply helm templates directly without needing to go through Chart releases
  • We are also using argoflow for build flows
  • Planned to add kaniko for image building

ungleich kubernetes infrastructure v1

We are using the following components:

  • Calico as a CNI with BGP, IPv6 only, no encapsulation
    • Needed for basic networking
  • kubernetes-secret-generator for creating secrets
    • Needed so that secrets are not stored in the git repository, but only in the cluster
  • ungleich-certbot
    • Needed to get letsencrypt certificates for services
  • rook with ceph rbd + cephfs for storage
    • rbd for almost everything, ReadWriteOnce
    • cephfs for smaller things, multi access ReadWriteMany
    • Needed for providing persistent storage
  • flux v2
    • Needed to manage resources automatically

Updated by Nico Schottelius 5 days ago · 47 revisions