Nico Schottelius, 10/14/2021 05:20 AM


The ungleich kubernetes infrastructure and ungleich kubernetes manual

Status

This document is in pre-production status.
It is intended to become both the ungleich kubernetes infrastructure overview and the ungleich kubernetes manual.

k8s clusters

Cluster | Purpose/Setup | Maintainer | Master(s) | argo | rook | last verified
c0.k8s.ooo | Dev | - | UNUSED | | | 2021-10-05
c1.k8s.ooo | Dev p6 VM | Nico | 2a0a-e5c0-2-11-0-62ff-fe0b-1a3d.k8s-1.place6.ungleich.ch | | | 2021-10-05
c2.k8s.ooo | Dev p7 HW | Nico | server47 server53 server54 | x | x | 2021-10-05
c3.k8s.ooo | Test p7 PI | - | UNUSED | | | 2021-10-05
c4.k8s.ooo | Dev2 p7 HW | Fran/Jin-Guk | server52 server53 server54 | | | -
c5.k8s.ooo | Dev p6 VM Amal | Nico/Amal | 2a0a-e5c0-2-11-0-62ff-fe0b-1a46.k8s-1.place6.ungleich.ch | | |
c6.k8s.ooo | Dev p6 VM Jin-Guk | Jin-Guk | | | |
p6.k8s.ooo | production | | server67 server69 server71 | x | x | 2021-10-05
p10.k8s.ooo | production | | server63 server65 server83 | x | x | 2021-10-05

General architecture and components overview

  • All k8s clusters are IPv6 only
  • We use BGP peering to propagate the pod CIDR and service CIDR networks to our infrastructure
  • The main public testing repository is ungleich-k8s
    • Private configurations are found in the k8s-config repository

Cluster types

Type/Feature | Development | Production
Minimum number of nodes | 3 (1 master, 3 worker) | 5 (3 master, 3 worker)
Recommended minimum | 4 (dedicated master, 3 worker) | 8 (3 master, 5 worker)
Separation of control plane | optional | recommended
Persistent storage | required | required
Number of storage monitors | 3 | 5

General k8s operations

Cheat sheet / great external references

Get the argocd credentials

kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d; echo "" 

Get the cluster admin.conf

  • On the masters of each cluster you can find the file /etc/kubernetes/admin.conf
  • To be able to administrate the cluster you can copy the admin.conf to your local machine
  • Multi cluster debugging can be very easy if you name the config ~/cX-admin.conf (see example below)
% scp root@server47.place7.ungleich.ch:/etc/kubernetes/admin.conf ~/c2-admin.conf
% export KUBECONFIG=~/c2-admin.conf    
% kubectl get nodes
NAME       STATUS                     ROLES                  AGE   VERSION
server47   Ready                      control-plane,master   82d   v1.22.0
server48   Ready                      control-plane,master   82d   v1.22.0
server49   Ready                      <none>                 82d   v1.22.0
server50   Ready                      <none>                 82d   v1.22.0
server59   Ready                      control-plane,master   82d   v1.22.0
server60   Ready,SchedulingDisabled   <none>                 82d   v1.22.0
server61   Ready                      <none>                 82d   v1.22.0
server62   Ready                      <none>                 82d   v1.22.0               
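
When working with several clusters at once, a small wrapper can save the repeated export of KUBECONFIG. The following is a minimal sketch, assuming every admin.conf was copied using the ~/cX-admin.conf naming from above; the function names kubeconfig_path and kc are hypothetical and not part of any tool.

```shell
# Print the kubeconfig path for a cluster short name (for example c2 or p10).
# Assumes the ~/cX-admin.conf naming convention.
kubeconfig_path() {
    printf '%s/%s-admin.conf\n' "$HOME" "$1"
}

# Run one kubectl command against the given cluster.
# The function body runs in a subshell, so KUBECONFIG of the
# calling shell stays untouched.
kc() (
    cluster=$1
    shift
    KUBECONFIG=$(kubeconfig_path "$cluster") kubectl "$@"
)
```

With this in place, `kc c2 get nodes` and `kc p10 get nodes` can be used side by side in the same shell.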

Installing a new k8s cluster

  • Decide on the cluster name (usually cX.k8s.ooo, with X counting upwards)
    • Use pXX.k8s.ooo for production clusters of placeXX
  • Use cdist to configure the nodes with requirements like crio
  • Decide between a single node and a multi node control plane setup (see below)
    • A single control plane is suitable for development clusters

Typical init procedure:

  • Single control plane: kubeadm init --config bootstrap/XXX/kubeadm.yaml
  • Multi control plane (HA): kubeadm init --config bootstrap/XXX/kubeadm.yaml --upload-certs

Deleting a pod that is hanging in terminating state

kubectl delete pod <PODNAME> --grace-period=0 --force --namespace <NAMESPACE>

(from https://stackoverflow.com/questions/35453792/pods-stuck-in-terminating-status)
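
Before force-deleting, it can help to see which pods are actually stuck. The sketch below filters the output of kubectl get pods --all-namespaces and assumes its default column layout (NAMESPACE NAME READY STATUS ...); the function name terminating_pods is hypothetical. It only prints, it does not delete anything.

```shell
# Read "kubectl get pods --all-namespaces" output from stdin and print
# "<namespace> <pod>" for every pod whose STATUS column is Terminating.
# NR > 1 skips the header line.
terminating_pods() {
    awk 'NR > 1 && $4 == "Terminating" { print $1, $2 }'
}

# Usage:
#   kubectl get pods --all-namespaces | terminating_pods
```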

Listing nodes of a cluster

[15:05] bridge:~% kubectl get nodes
NAME       STATUS   ROLES                  AGE   VERSION
server22   Ready    <none>                 52d   v1.22.0
server23   Ready    <none>                 52d   v1.22.2
server24   Ready    <none>                 52d   v1.22.0
server25   Ready    <none>                 52d   v1.22.0
server26   Ready    <none>                 52d   v1.22.0
server27   Ready    <none>                 52d   v1.22.0
server63   Ready    control-plane,master   52d   v1.22.0
server64   Ready    <none>                 52d   v1.22.0
server65   Ready    control-plane,master   52d   v1.22.0
server66   Ready    <none>                 52d   v1.22.0
server83   Ready    control-plane,master   52d   v1.22.0
server84   Ready    <none>                 52d   v1.22.0
server85   Ready    <none>                 52d   v1.22.0
server86   Ready    <none>                 52d   v1.22.0
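
In the listing above server23 runs v1.22.2 while the other nodes run v1.22.0. To spot such differences quickly on larger clusters, the output can be summarised per version. This is a minimal sketch that assumes VERSION is the last column of kubectl get nodes; the function name node_versions is hypothetical.

```shell
# Read "kubectl get nodes" output from stdin and print one line per
# kubelet version together with the number of nodes running it.
# NR > 1 skips the header, NF > 0 skips empty lines.
node_versions() {
    awk 'NR > 1 && NF > 0 { count[$NF]++ }
         END { for (v in count) print v, count[v] }' | sort
}

# Usage:
#   kubectl get nodes | node_versions
```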

Removing / draining a node

Usually kubectl drain server should do the job, but sometimes we need to be more aggressive:

kubectl drain --delete-emptydir-data --ignore-daemonsets server23

Re-adding a node after draining

kubectl uncordon serverXX
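
A drained node stays cordoned and shows SchedulingDisabled in the STATUS column until it is uncordoned. To find nodes that were drained and forgotten, the node list can be filtered as sketched below; the function name cordoned_nodes is hypothetical and the sketch assumes the default column layout of kubectl get nodes.

```shell
# Read "kubectl get nodes" output from stdin and print the names of all
# nodes whose STATUS contains SchedulingDisabled (cordoned or drained).
cordoned_nodes() {
    awk 'NR > 1 && $2 ~ /SchedulingDisabled/ { print $1 }'
}

# Usage:
#   kubectl get nodes | cordoned_nodes
```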

(Re-)joining worker nodes after creating the cluster

  • We need to have an up-to-date token
  • We use different join commands for the workers and control plane nodes

Generating the join command on an existing control plane node:

kubeadm token create --print-join-command

(Re-)joining control plane nodes after creating the cluster

  • We generate the token again
  • We upload the certificates
  • We need to combine/create the join command for the control plane node

Example session:

% kubeadm token create --print-join-command
kubeadm join p10-api.k8s.ooo:6443 --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash 

% kubeadm init phase upload-certs --upload-certs
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:
CERTKEY

# Then we use these two outputs on the joining node:

kubeadm join p10-api.k8s.ooo:6443 --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash --control-plane --certificate-key CERTKEY

Commands to be used on a control plane node:

kubeadm token create --print-join-command
kubeadm init phase upload-certs --upload-certs

Commands to be used on the joining node:

JOINCOMMAND --control-plane --certificate-key CERTKEY
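
The combination of the two outputs can also be scripted. The following is a minimal sketch: the function name control_plane_join_cmd is hypothetical, and the commented usage assumes that the certificate key is the last line printed by upload-certs, as in the example session above.

```shell
# Build the control plane join command from the two outputs.
# $1: output of "kubeadm token create --print-join-command"
# $2: certificate key printed by "kubeadm init phase upload-certs --upload-certs"
control_plane_join_cmd() {
    printf '%s --control-plane --certificate-key %s\n' "$1" "$2"
}

# Usage on an existing control plane node:
#   join=$(kubeadm token create --print-join-command)
#   certkey=$(kubeadm init phase upload-certs --upload-certs | tail -n 1)
#   control_plane_join_cmd "$join" "$certkey"
```

The printed command is then executed on the joining node.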

Rook / Ceph Related Operations

Inspecting the logs of a specific server

# Get the related pods
kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare 
...

# Inspect the logs of a specific pod
kubectl -n rook-ceph logs -f rook-ceph-osd-prepare-server23--1-444qx
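
To find the prepare pod of one server without reading the whole list, the pod names can be filtered by hostname. This sketch assumes the pod names follow the rook-ceph-osd-prepare-<server>-... pattern seen in the example above; the function name osd_prepare_pods_for is hypothetical.

```shell
# Read pod names from stdin (one per line, as printed by "-o name") and
# print only the osd-prepare pods of the given server.
# grep -F matches the string literally; the trailing "-" prevents
# server2 from also matching server23.
osd_prepare_pods_for() {
    grep -F -- "rook-ceph-osd-prepare-$1-"
}

# Usage:
#   kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare -o name \
#       | osd_prepare_pods_for server23
```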

Triggering server prepare / adding new osds

The rook-ceph-operator triggers, watches and creates the pods that maintain the hosts. To trigger a full rescan, simply delete the operator pod:

kubectl -n rook-ceph delete pods -l app=rook-ceph-operator

This causes all the rook-ceph-osd-prepare-.. jobs to be recreated, so that OSDs are created on any disks that have been added.
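
Deleting the operator pod and observing the result can be combined into one helper. The following is a minimal sketch; the function name rook_rescan is hypothetical, and the second command only watches the prepare pods and has to be stopped with Ctrl-C.

```shell
# Delete the operator pod to trigger a full rescan, then watch the
# osd-prepare pods being recreated. The watch only starts if the
# delete succeeded.
rook_rescan() {
    kubectl -n rook-ceph delete pods -l app=rook-ceph-operator &&
    kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare --watch
}
```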

Removing an OSD

Infrastructure versions

ungleich kubernetes infrastructure v3

  • rook is now installed with helm through argocd instead of directly via manifests

ungleich kubernetes infrastructure v2

  • Replaced fluxv2 from ungleich k8s v1 with argocd
    • argocd can apply helm templates directly without needing to go through Chart releases
  • We are also using argoflow for build flows
  • We plan to add kaniko for image building

ungleich kubernetes infrastructure v1

We are using the following components:

  • Calico as a CNI with BGP, IPv6 only, no encapsulation
    • Needed for basic networking
  • kubernetes-secret-generator for creating secrets
    • Needed so that secrets are not stored in the git repository, but only in the cluster
  • ungleich-certbot
    • Needed to get letsencrypt certificates for services
  • rook with ceph rbd + cephfs for storage
    • rbd for almost everything, ReadWriteOnce
    • cephfs for smaller things, multi access ReadWriteMany
    • Needed for providing persistent storage
  • flux v2
    • Needed to manage resources automatically

Updated by Nico Schottelius about 3 years ago · 50 revisions