Nico Schottelius, 10/08/2021 03:16 AM

The ungleich kubernetes infrastructure and ungleich kubernetes manual¶

Status¶

This document is pre-production.
This document is to become the ungleich kubernetes infrastructure overview as well as the ungleich kubernetes manual.

k8s clusters¶

Cluster	Purpose/Setup	Maintainer	Master(s)	last verified
c0.k8s.ooo	Dev	-	UNUSED	2021-10-05
c1.k8s.ooo	Dev p6 VM	Nico	2a0a-e5c0-2-11-0-62ff-fe0b-1a3d.k8s-1.place6.ungleich.ch	2021-10-05
c2.k8s.ooo	Dev p7 HW	Nico	server47 server53 server54	2021-10-05
c3.k8s.ooo	Test p7 PI	-	UNUSED	2021-10-05
c4.k8s.ooo	Dev2 p7 HW	Fran/Jin-Guk	server52 server53 server54	-
c5.k8s.ooo	Dev p6 VM Amal	Nico/Amal	2a0a-e5c0-2-11-0-62ff-fe0b-1a46.k8s-1.place6.ungleich.ch	-
c6.k8s.ooo	Dev p6 VM Jin-Guk	Jin-Guk	-	-
p6.k8s.ooo	production		server67 server69 server71	2021-10-05
p10.k8s.ooo	production		server63 server65 server83	2021-10-05

General architecture and components overview¶

All k8s clusters are IPv6 only
We use BGP peering to propagate the pod CIDR and service CIDR networks to our infrastructure
The main public testing repository is ungleich-k8s
- Private configurations are found in the k8s-config repository

Cluster types¶

Type/Feature	Development	Production
Min No. nodes	3 (1 master, 3 worker)	5 (3 master, 3 worker)
Recommended minimum	4 (dedicated master, 3 worker)	8 (3 master, 5 worker)
Separation of control plane	optional	recommended
Persistent storage	required	required
Number of storage monitors	3	5

General k8s operations¶

Get the argocd credentials¶

kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d; echo ""
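Kubernetes stores Secret values base64-encoded under .data, which is why the command above pipes its output through base64 -d. The following is a minimal sketch of only that decoding step. It uses a made-up value instead of a real password, and decode_secret_value is a hypothetical helper name, not something defined in our repositories.

```shell
# Hypothetical helper: decode one base64-encoded Secret value from stdin
# and terminate it with a newline, like the trailing `echo ""` above does.
decode_secret_value() {
    base64 -d
    echo
}

# Made-up example value. A real one would come from
# `kubectl -n argocd get secret ... -o jsonpath="{.data.password}"`.
encoded=$(printf '%s' 'secret' | base64)
printf '%s' "$encoded" | decode_secret_value
```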

Get the cluster admin.conf¶

On the masters of each cluster you can find the file /etc/kubernetes/admin.conf.
To administer the cluster, copy admin.conf to your local machine.
Multi-cluster debugging becomes very easy if you name the config ~/cX-admin.conf (see the example below).

% scp root@server47.place7.ungleich.ch:/etc/kubernetes/admin.conf ~/c2-admin.conf
% export KUBECONFIG=~/c2-admin.conf    
% kubectl get nodes
NAME       STATUS                     ROLES                  AGE   VERSION
server47   Ready                      control-plane,master   82d   v1.22.0
server48   Ready                      control-plane,master   82d   v1.22.0
server49   Ready                      <none>                 82d   v1.22.0
server50   Ready                      <none>                 82d   v1.22.0
server59   Ready                      control-plane,master   82d   v1.22.0
server60   Ready,SchedulingDisabled   <none>                 82d   v1.22.0
server61   Ready                      <none>                 82d   v1.22.0
server62   Ready                      <none>                 82d   v1.22.0
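With the ~/cX-admin.conf naming convention above, switching between clusters can be wrapped in a small shell function. This is a sketch only: use_cluster is a hypothetical name, and it assumes the configs were already copied to the home directory as shown.

```shell
# Hypothetical helper: point kubectl at the admin.conf of one cluster.
# $1: short cluster name, e.g. c2 for ~/c2-admin.conf
use_cluster() {
    conf="$HOME/$1-admin.conf"
    if [ ! -f "$conf" ]; then
        echo "no such config: $conf" >&2
        return 1
    fi
    export KUBECONFIG="$conf"
}
```

After sourcing this in the shell, `use_cluster c2` followed by `kubectl get nodes` would talk to c2, and `use_cluster c5` would switch to c5 without retyping the path.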

Installing a new k8s cluster¶

Decide on the cluster name (usually cX.k8s.ooo), with X counting upwards
- Use pXX.k8s.ooo for production clusters of placeXX
Use cdist to configure the nodes with requirements such as crio
Decide between a single or multi node control plane setup (see below)
- A single control plane is suitable for development clusters

Typical init procedure:

Single control plane: kubeadm init --config bootstrap/XXX/kubeadm.yaml
Multi control plane (HA): kubeadm init --config bootstrap/XXX/kubeadm.yaml --upload-certs
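The two init variants above differ only in the --upload-certs flag. A possible wrapper that makes the choice explicit is sketched below. init_cluster and its "single"/"ha" mode names are hypothetical, and the bootstrap/XXX/kubeadm.yaml layout is taken as-is from the commands above.

```shell
# Hypothetical wrapper around the two init variants described above.
# $1: bootstrap directory name (the XXX in bootstrap/XXX/kubeadm.yaml)
# $2: "single" (default) or "ha"
init_cluster() {
    bootstrap_name="$1"
    mode="${2:-single}"
    case "$mode" in
        single)
            kubeadm init --config "bootstrap/$bootstrap_name/kubeadm.yaml"
            ;;
        ha)
            # --upload-certs lets further control plane nodes fetch the certificates
            kubeadm init --config "bootstrap/$bootstrap_name/kubeadm.yaml" --upload-certs
            ;;
        *)
            echo "unknown mode: $mode (expected single or ha)" >&2
            return 2
            ;;
    esac
}
```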

Deleting a pod that is hanging in terminating state¶

kubectl delete pod <PODNAME> --grace-period=0 --force --namespace <NAMESPACE>

(from https://stackoverflow.com/questions/35453792/pods-stuck-in-terminating-status)
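Before force-deleting, it can help to see which pods are actually stuck. The sketch below lists namespace and name of every pod whose STATUS column reads Terminating. list_terminating_pods is a hypothetical helper name, and the column position assumes the default `kubectl get pods --all-namespaces` layout (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: print "<namespace> <pod>" for every pod in Terminating state.
list_terminating_pods() {
    kubectl get pods --all-namespaces --no-headers |
        awk '$4 == "Terminating" { print $1, $2 }'
}
```

Each printed pair can then be passed to the delete command above as <NAMESPACE> and <PODNAME>.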

Listing nodes of a cluster¶

[15:05] bridge:~% kubectl get nodes
NAME       STATUS   ROLES                  AGE   VERSION
server22   Ready    <none>                 52d   v1.22.0
server23   Ready    <none>                 52d   v1.22.2
server24   Ready    <none>                 52d   v1.22.0
server25   Ready    <none>                 52d   v1.22.0
server26   Ready    <none>                 52d   v1.22.0
server27   Ready    <none>                 52d   v1.22.0
server63   Ready    control-plane,master   52d   v1.22.0
server64   Ready    <none>                 52d   v1.22.0
server65   Ready    control-plane,master   52d   v1.22.0
server66   Ready    <none>                 52d   v1.22.0
server83   Ready    control-plane,master   52d   v1.22.0
server84   Ready    <none>                 52d   v1.22.0
server85   Ready    <none>                 52d   v1.22.0
server86   Ready    <none>                 52d   v1.22.0
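In the listing above, server23 reports v1.22.2 while the other nodes report v1.22.0. To spot such version differences quickly on larger clusters, the VERSION column can be summarised. This is a sketch with a hypothetical function name, assuming the default five-column output of `kubectl get nodes`.

```shell
# Hypothetical helper: count nodes per kubelet version (VERSION is column 5).
kubelet_versions() {
    kubectl get nodes --no-headers | awk '{ print $5 }' | sort | uniq -c
}
```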

Removing / draining a node¶

Usually kubectl drain <server> should do the job, but sometimes we need to be more aggressive:

kubectl drain --delete-emptydir-data --ignore-daemonsets server23

Re-adding a node after draining¶

kubectl uncordon serverXX
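Draining and uncordoning usually belong together around a maintenance task. The following sketch combines the two commands above so that the node is uncordoned again even if the task fails. with_drained_node is a hypothetical helper, not part of our tooling.

```shell
# Hypothetical helper: drain a node, run a command, then uncordon the node.
# $1: node name; remaining arguments: the maintenance command to run
with_drained_node() {
    node="$1"
    shift
    kubectl drain --delete-emptydir-data --ignore-daemonsets "$node" || return 1
    status=0
    "$@" || status=$?
    # Uncordon regardless of the outcome so the node does not stay unschedulable
    kubectl uncordon "$node"
    return "$status"
}
```

A call could look like `with_drained_node server23 ssh root@server23 reboot`, although a reboot would need an additional wait before the uncordon is meaningful.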

Rook / Ceph Related Operations¶

Inspecting the logs of a specific server¶

# Get the related pods
kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare 
...

# Inspect the logs of a specific pod
kubectl -n rook-ceph logs -f rook-ceph-osd-prepare-server23--1-444qx
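The two steps above can be combined so that only the server name has to be given. This is a sketch under the assumption that prepare pods are named rook-ceph-osd-prepare-<server>-..., as in the example above. osd_prepare_logs is a hypothetical name, and unlike the command above it prints the logs once instead of following them with -f.

```shell
# Hypothetical helper: show the logs of the osd-prepare pod of one server.
# $1: server name, e.g. server23
osd_prepare_logs() {
    server="$1"
    # -o name prints entries like pod/rook-ceph-osd-prepare-server23--1-444qx
    pod=$(kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare -o name |
          awk -v s="-osd-prepare-$server-" 'index($0, s) { print; exit }')
    if [ -z "$pod" ]; then
        echo "no osd-prepare pod found for $server" >&2
        return 1
    fi
    kubectl -n rook-ceph logs "$pod"
}
```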

Triggering server prepare / adding new osds¶

The rook-ceph-operator triggers, watches and creates pods to maintain hosts. To trigger a full rescan, simply delete the operator pod:

kubectl -n rook-ceph delete pods -l app=rook-ceph-operator

This causes all the rook-ceph-osd-prepare-.. jobs to be recreated, so new OSDs are created if new disks have been added.
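A rescan as described above could be scripted roughly as follows. This sketch assumes that the operator runs as a Deployment named rook-ceph-operator in the rook-ceph namespace, which is not stated in this document, and rescan_disks is a hypothetical helper name.

```shell
# Hypothetical helper: restart the operator and list the resulting prepare pods.
rescan_disks() {
    # Deleting the operator pod triggers the rescan
    kubectl -n rook-ceph delete pods -l app=rook-ceph-operator || return 1
    # Assumption: the operator is managed by a Deployment named rook-ceph-operator
    kubectl -n rook-ceph rollout status deployment/rook-ceph-operator --timeout=120s || return 1
    # The osd-prepare pods may take a moment to appear after the operator is back
    kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare
}
```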

Removing an OSD¶

See Ceph OSD Management

Infrastructure versions¶

ungleich kubernetes infrastructure v3¶

rook is now installed via helm through argocd instead of directly via manifests

ungleich kubernetes infrastructure v2¶

Replaced flux v2 from ungleich k8s v1 with argocd
- argocd can apply helm templates directly without needing to go through Chart releases
We are also using argoflow for build flows
We plan to add kaniko for image building

ungleich kubernetes infrastructure v1¶

We are using the following components:

Calico as a CNI with BGP, IPv6 only, no encapsulation
- Needed for basic networking
kubernetes-secret-generator for creating secrets
- Needed so that secrets are not stored in the git repository, but only in the cluster
ungleich-certbot
- Needed to get letsencrypt certificates for services
rook with ceph rbd + cephfs for storage
- rbd for almost everything, ReadWriteOnce
- cephfs for smaller things, multi access ReadWriteMany
- Needed for providing persistent storage
flux v2
- Needed to manage resources automatically

Updated by Nico Schottelius almost 4 years ago · 45 revisions
