Jin-Guk Kwon, 07/08/2022 09:47 AM

The ungleich kubernetes infrastructure and ungleich kubernetes manual¶

Status¶

This document is pre-production.
It is intended to become both the ungleich kubernetes infrastructure overview and the ungleich kubernetes manual.

k8s clusters¶

Cluster	Purpose/Setup	Maintainer	Master(s)	argo	v4 http proxy	last verified
c0.k8s.ooo	Dev	-	UNUSED			2021-10-05
c1.k8s.ooo	retired		-			2022-03-15
c2.k8s.ooo	Dev p7 HW	Nico	server47 server53 server54	argo		2021-10-05
c3.k8s.ooo	retired	-	-			2021-10-05
c4.k8s.ooo	Dev2 p7 HW	Jin-Guk	server52 server53 server54			-
c5.k8s.ooo	retired		-			2022-03-15
c6.k8s.ooo	Dev p6 VM Jin-Guk	Jin-Guk
p5.k8s.ooo	production		server34 server36 server38	argo	-
p6.k8s.ooo	production		server67 server69 server71	argo	147.78.194.13	2021-10-05
p10.k8s.ooo	production		server63 server65 server83	argo	147.78.194.12	2021-10-05
k8s.ge.nau.so	development		server107 server108 server109	argo

General architecture and components overview¶

All k8s clusters are IPv6 only
We use BGP peering to propagate podcidr and serviceCidr networks to our infrastructure
The main public testing repository is ungleich-k8s
- Private configurations are found in the k8s-config repository

Cluster types¶

Type/Feature	Development	Production
Minimum number of nodes	3 (1 master, 3 worker)	5 (3 master, 3 worker)
Recommended minimum	4 (dedicated master, 3 worker)	8 (3 master, 5 worker)
Separation of control plane	optional	recommended
Persistent storage	required	required
Number of storage monitors	3	5

General k8s operations¶

Cheat sheet / external great references¶

kubectl cheatsheet

Allowing to schedule work on the control plane¶

Mostly for single node / test / development clusters
Just remove the master taint as follows

kubectl taint nodes --all node-role.kubernetes.io/master-
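
The taint key above matches the v1.22 clusters shown in this document. Newer kubeadm releases (1.24 and later) use the key node-role.kubernetes.io/control-plane instead. A minimal sketch that tries both keys, assuming kubectl is already configured for the cluster; the function name is hypothetical:

```shell
# Hypothetical helper: remove both the old and the new control plane taint.
# "|| true" is used because kubectl exits non-zero for nodes that do not
# carry the given taint.
untaint_control_plane() {
    for key in node-role.kubernetes.io/master node-role.kubernetes.io/control-plane; do
        kubectl taint nodes --all "${key}-" || true
    done
}

# Usage:
#   untaint_control_plane
```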

Get the cluster admin.conf¶

On the masters of each cluster you can find the file /etc/kubernetes/admin.conf
To be able to administer the cluster you can copy the admin.conf to your local machine
Multi-cluster debugging becomes very easy if you name the config ~/cX-admin.conf (see example below)

% scp root@server47.place7.ungleich.ch:/etc/kubernetes/admin.conf ~/c2-admin.conf
% export KUBECONFIG=~/c2-admin.conf    
% kubectl get nodes
NAME       STATUS                     ROLES                  AGE   VERSION
server47   Ready                      control-plane,master   82d   v1.22.0
server48   Ready                      control-plane,master   82d   v1.22.0
server49   Ready                      <none>                 82d   v1.22.0
server50   Ready                      <none>                 82d   v1.22.0
server59   Ready                      control-plane,master   82d   v1.22.0
server60   Ready,SchedulingDisabled   <none>                 82d   v1.22.0
server61   Ready                      <none>                 82d   v1.22.0
server62   Ready                      <none>                 82d   v1.22.0
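
With several configs named this way, switching clusters can be wrapped in a small shell function. This is only a sketch: the function name kc is hypothetical and it assumes the files are stored as ~/<cluster>-admin.conf as described above.

```shell
# Hypothetical helper: run kubectl against a named cluster.
# Assumes the admin.conf was saved as ~/<cluster>-admin.conf.
kc() {
    cluster="$1"
    shift
    KUBECONFIG="$HOME/${cluster}-admin.conf" kubectl "$@"
}

# Usage:
#   kc c2 get nodes
#   kc c2 -n kube-system get pods
```

Compared to exporting KUBECONFIG, this keeps the cluster choice visible in every command, which helps when comparing two clusters side by side.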

Installing a new k8s cluster¶

Decide on the cluster name (usually cX.k8s.ooo), X counting upwards
- Using pXX.k8s.ooo for production clusters of placeXX
Use cdist to configure the nodes with requirements like crio
Decide between single or multi node control plane setups (see below)
- Single control plane suitable for development clusters

Typical init procedure:

Single control plane: kubeadm init --config bootstrap/XXX/kubeadm.yaml
Multi control plane (HA): kubeadm init --config bootstrap/XXX/kubeadm.yaml --upload-certs

Deleting a pod that is hanging in terminating state¶

kubectl delete pod <PODNAME> --grace-period=0 --force --namespace <NAMESPACE>

(from https://stackoverflow.com/questions/35453792/pods-stuck-in-terminating-status)
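
Before force deleting, it can help to list which pods are actually stuck. The following sketch filters the pod list by its STATUS column; it assumes the default column layout of kubectl get pods --all-namespaces --no-headers (NAMESPACE, NAME, READY, STATUS, ...), and the function name is hypothetical.

```shell
# Hypothetical filter: reads "kubectl get pods --all-namespaces --no-headers"
# output on stdin and prints "<namespace> <pod>" for pods in Terminating state.
terminating_pods() {
    awk '$4 == "Terminating" { print $1, $2 }'
}

# Usage:
#   kubectl get pods --all-namespaces --no-headers | terminating_pods
```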

Listing nodes of a cluster¶

[15:05] bridge:~% kubectl get nodes
NAME       STATUS   ROLES                  AGE   VERSION
server22   Ready    <none>                 52d   v1.22.0
server23   Ready    <none>                 52d   v1.22.2
server24   Ready    <none>                 52d   v1.22.0
server25   Ready    <none>                 52d   v1.22.0
server26   Ready    <none>                 52d   v1.22.0
server27   Ready    <none>                 52d   v1.22.0
server63   Ready    control-plane,master   52d   v1.22.0
server64   Ready    <none>                 52d   v1.22.0
server65   Ready    control-plane,master   52d   v1.22.0
server66   Ready    <none>                 52d   v1.22.0
server83   Ready    control-plane,master   52d   v1.22.0
server84   Ready    <none>                 52d   v1.22.0
server85   Ready    <none>                 52d   v1.22.0
server86   Ready    <none>                 52d   v1.22.0

Removing / draining a node¶

Usually kubectl drain serverXX should do the job, but sometimes we need to be more aggressive:

kubectl drain --delete-emptydir-data --ignore-daemonsets serverXX

Re-adding a node after draining¶

kubectl uncordon serverXX

(Re-)joining worker nodes after creating the cluster¶

We need to have an up-to-date token
We use different join commands for the workers and control plane nodes

Generating the join command on an existing control plane node:

kubeadm token create --print-join-command

(Re-)joining control plane nodes after creating the cluster¶

We generate the token again
We upload the certificates
We need to combine/create the join command for the control plane node

Example session:

% kubeadm token create --print-join-command
kubeadm join p10-api.k8s.ooo:6443 --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash 

% kubeadm init phase upload-certs --upload-certs
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:
CERTKEY

# Then we use these two outputs on the joining node:

kubeadm join p10-api.k8s.ooo:6443 --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash --control-plane --certificate-key CERTKEY

Commands to be used on a control plane node:

kubeadm token create --print-join-command
kubeadm init phase upload-certs --upload-certs

Commands to be used on the joining node:

JOINCOMMAND --control-plane --certificate-key CERTKEY
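
Combining the two outputs by hand is error prone, so the step can be sketched as a small function. The function name is hypothetical; it only builds the command line and does not run kubeadm itself.

```shell
# Hypothetical helper: build the control plane join command from the two outputs.
#   $1: the complete output of "kubeadm token create --print-join-command"
#   $2: the certificate key printed by "kubeadm init phase upload-certs --upload-certs"
control_plane_join_command() {
    # Strip trailing whitespace from the printed join command
    join=$(printf '%s' "$1" | sed 's/[[:space:]]*$//')
    printf '%s --control-plane --certificate-key %s\n' "$join" "$2"
}

# Usage on an existing control plane node:
#   control_plane_join_command "$(kubeadm token create --print-join-command)" CERTKEY
```

Run the printed command on the joining node soon after generating it: according to the kubeadm documentation, the uploaded certificates expire after two hours and tokens after 24 hours by default.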

How to fix etcd not starting when rejoining a kubernetes cluster as a control plane node¶

If during the above step etcd does not come up, kubeadm join can hang as follows:

[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[check-etcd] Checking that the etcd cluster is healthy
error execution phase check-etcd: etcd cluster is not healthy: failed to dial endpoint https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2379 with maintenance client: context deadline exceeded
To see the stack trace of this error execute with --v=5 or higher

Then the problem is likely that the etcd server is still a member of the cluster. We first need to remove it from the etcd cluster and then the join works.

To fix this we do:

Find a working etcd pod
Find the etcd members / member list
Remove the etcd member that we want to re-join the cluster

# Find the etcd pods
kubectl -n kube-system get pods -l component=etcd,tier=control-plane

# Get the list of etcd servers with the member id 
kubectl exec -n kube-system -ti ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list

# Remove the member
kubectl exec -n kube-system -ti ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove MEMBERID

Sample session:

[10:48] line:~% kubectl -n kube-system get pods -l component=etcd,tier=control-plane
NAME            READY   STATUS    RESTARTS     AGE
etcd-server63   1/1     Running   0            3m11s
etcd-server65   1/1     Running   3            7d2h
etcd-server83   1/1     Running   8 (6d ago)   7d2h
[10:48] line:~% kubectl exec -n kube-system -ti etcd-server65 -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
356891cd676df6e4, started, server65, https://[2a0a:e5c0:10:1:225:b3ff:fe20:375c]:2380, https://[2a0a:e5c0:10:1:225:b3ff:fe20:375c]:2379, false
371b8a07185dee7e, started, server63, https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2380, https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2379, false
5942bc58307f8af9, started, server83, https://[2a0a:e5c0:10:1:3e4a:92ff:fe79:bb98]:2380, https://[2a0a:e5c0:10:1:3e4a:92ff:fe79:bb98]:2379, false

[10:48] line:~% kubectl exec -n kube-system -ti etcd-server65 -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove 371b8a07185dee7e
Member 371b8a07185dee7e removed from cluster e3c0805f592a8f77
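
Looking up the member ID can be sketched as a filter over the member list output. It assumes the comma-separated format shown in the sample session above (ID, status, name, peer URL, client URL, learner flag); the function name is hypothetical.

```shell
# Hypothetical filter: reads "etcdctl member list" output on stdin and prints
# the member ID of the given node name (third comma-separated field).
etcd_member_id() {
    awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Usage (without -t, so that the output can be piped cleanly):
#   kubectl exec -n kube-system ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' \
#       --cacert /etc/kubernetes/pki/etcd/ca.crt \
#       --cert /etc/kubernetes/pki/etcd/server.crt \
#       --key /etc/kubernetes/pki/etcd/server.key member list | etcd_member_id server63
```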

Hardware Maintenance using ungleich-hardware¶

Use the following manifest and replace the HOST with the actual host:

apiVersion: v1
kind: Pod
metadata:
  name: ungleich-hardware-HOST
spec:
  containers:
  - name: ungleich-hardware
    image: ungleich/ungleich-hardware:0.0.5
    args:
    - sleep
    - "1000000" 
    volumeMounts:
      - mountPath: /dev
        name: dev
    securityContext:
      privileged: true
  nodeSelector:
    kubernetes.io/hostname: "HOST" 

  volumes:
    - name: dev
      hostPath:
        path: /dev

Also see: The_ungleich_hardware_maintenance_guide
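
Replacing HOST by hand can be avoided with a small filter. This is a sketch under assumptions: the manifest above is stored in a file called ungleich-hardware.yaml (a hypothetical name), and HOST appears in it only as the placeholder.

```shell
# Hypothetical helper: reads the manifest template on stdin and replaces the
# HOST placeholder with the given node name.
render_hardware_manifest() {
    sed "s/HOST/$1/g"
}

# Usage:
#   render_hardware_manifest server47 < ungleich-hardware.yaml | kubectl apply -f -
#   kubectl delete pod ungleich-hardware-server47    # clean up when done
```

The replacement is case sensitive, so keys such as hostPath and kubernetes.io/hostname are left untouched.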

Triggering a cronjob / creating a job from a cronjob¶

To test a cronjob, we can create a job from a cronjob:

kubectl create job --from=cronjob/volume2-daily-backup volume2-manual

This creates a job volume2-manual based on the cronjob volume2-daily-backup.
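
Job names must be unique within a namespace, so repeated manual runs need a new name each time. A minimal sketch that appends a timestamp; the function name and the naming scheme are assumptions, not an established convention:

```shell
# Hypothetical helper: create a one-off job from a cronjob with a unique name.
#   $1: cronjob name, $2 (optional): namespace, defaults to "default"
run_cronjob_now() {
    job="$1-manual-$(date +%Y%m%d%H%M%S)"
    kubectl -n "${2:-default}" create job --from="cronjob/$1" "$job"
}

# Usage:
#   run_cronjob_now volume2-daily-backup
#   kubectl logs -f job/JOBNAME    # JOBNAME as printed by the command above
```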

su-ing into a user that has nologin shell set¶

Users inside a container often have nologin set as their shell. To execute maintenance commands as such a user within the container, we can use su -s /bin/sh like this:

su -s /bin/sh -c '/path/to/your/script' testuser

Found on https://serverfault.com/questions/351046/how-to-run-command-as-user-who-has-usr-sbin-nologin-as-shell

How to print a secret value¶

Assuming you want the "password" item from a secret, use:

kubectl get secret SECRETNAME -o jsonpath="{.data.password}" | base64 -d; echo ""
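
The same pattern can be wrapped for arbitrary items. A sketch with a hypothetical function name; it assumes the key contains no dots (those would need escaping in the jsonpath expression) and a base64 that accepts -d.

```shell
# Hypothetical helper: print one decoded item of a secret.
#   $1: secret name, $2: key inside .data, $3 (optional): namespace
secret_value() {
    kubectl -n "${3:-default}" get secret "$1" -o jsonpath="{.data.$2}" | base64 -d
    echo ""
}

# Usage:
#   secret_value SECRETNAME password
#   secret_value argocd-initial-admin-secret password argocd
```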

Calico CNI¶

Calico Installation¶

We install calico using helm
This has the following advantages:
- Easy to upgrade
- Does not require us to configure IPv6/dual-stack settings, as the tigera operator figures these out on its own

Usually plain calico can be installed directly using:

helm repo add projectcalico https://docs.projectcalico.org/charts
helm install --namespace tigera calico projectcalico/tigera-operator --version v3.23.2 --create-namespace

Check the tags on https://github.com/projectcalico/calico/tags for the latest release

Installing calicoctl¶

To be able to manage and configure calico, we need to install calicoctl:

kubectl apply -f https://docs.projectcalico.org/manifests/calicoctl.yaml

Or version specific:

# Use the raw file URL; the GitHub "blob" URL returns an HTML page, not YAML
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.20.4/manifests/calicoctl.yaml

# For 3.22
kubectl apply -f https://projectcalico.docs.tigera.io/archive/v3.22/manifests/calicoctl.yaml

And make it more easily accessible via an alias:

alias calicoctl="kubectl exec -i -n kube-system calicoctl -- /calicoctl"

Calico configuration¶

By default our k8s clusters BGP peer with an upstream router to propagate podcidr and servicecidr.

Default settings in our infrastructure:

We use a full-mesh using the nodeToNodeMeshEnabled: true option
We keep the original next hop so that only the server with the pod is announcing it (instead of ecmp)
We use private ASNs for k8s clusters
We do not use any overlay

After installing calico and calicoctl the last step of the installation is usually:

calicoctl create -f - < calico-bgp.yaml

A sample BGP configuration:

---
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: true
  asNumber: 65534
  serviceClusterIPs:
  - cidr: 2a0a:e5c0:10:3::/108
  serviceExternalIPs:
  - cidr: 2a0a:e5c0:10:3::/108
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: router1-place10
spec:
  peerIP: 2a0a:e5c0:10:1::50
  asNumber: 213081
  keepOriginalNextHop: true
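
Further peers follow the same shape as the BGPPeer above, so the manifest can be generated from its three varying values. This is only a sketch: the function name is hypothetical and the output mirrors the sample, including keepOriginalNextHop: true.

```shell
# Hypothetical helper: print a BGPPeer manifest in the style of the sample above.
#   $1: peer name, $2: peer IP, $3: peer AS number
bgp_peer_yaml() {
    cat <<EOF
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: $1
spec:
  peerIP: $2
  asNumber: $3
  keepOriginalNextHop: true
EOF
}

# Usage, with the calicoctl alias defined above:
#   bgp_peer_yaml router1-place10 2a0a:e5c0:10:1::50 213081 | calicoctl create -f -
#   calicoctl get bgppeer
```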

ArgoCD / ArgoWorkFlow¶

Argocd Installation¶

As there is no configuration management present yet, argocd is installed using

kubectl create namespace argocd

# Specific Version
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/v2.3.2/manifests/install.yaml

# OR: latest stable
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

See https://argo-cd.readthedocs.io/en/stable/

Get the argocd credentials¶

kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d; echo ""

Accessing argocd¶

In regular IPv6 clusters:

Navigate to https://argocd-server.argocd.CLUSTERDOMAIN

In legacy IPv4 clusters

kubectl --namespace argocd port-forward svc/argocd-server 8080:80

Navigate to https://localhost:8080

Using the argocd webhook to trigger changes¶

To trigger changes, POST JSON to https://argocd.example.com/api/webhook

Deploying an application¶

Applications are deployed by pushing to git on gitea (code.ungleich.ch), from where argo pulls them
Always include the redmine-url pointing to the (customer) ticket
- Also add the support-url if it exists

Application sample

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gitea-CUSTOMER
  namespace: argocd
spec:
  destination:
    namespace: default
    server: 'https://kubernetes.default.svc'
  source:
    path: apps/prod/gitea
    repoURL: 'https://code.ungleich.ch/ungleich-intern/k8s-config.git'
    targetRevision: HEAD
    helm:
      parameters:
        - name: storage.data.storageClass
          value: rook-ceph-block-hdd
        - name: storage.data.size
          value: 200Gi
        - name: storage.db.storageClass
          value: rook-ceph-block-ssd
        - name: storage.db.size
          value: 10Gi
        - name: storage.letsencrypt.storageClass
          value: rook-ceph-block-hdd
        - name: storage.letsencrypt.size
          value: 50Mi
        - name: letsencryptStaging
          value: 'no'
        - name: fqdn
          value: 'code.verua.online'
  project: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
  info:
    - name: 'redmine-url'
      value: 'https://redmine.ungleich.ch/issues/ISSUEID'
    - name: 'support-url'
      value: 'https://support.ungleich.ch/Ticket/Display.html?id=TICKETID'

Helm related operations and conventions¶

We use helm charts extensively.

In production, they are managed via argocd
In development, helm charts can be developed and deployed manually using the helm utility.

Installing a helm chart¶

One can use the usual pattern of

helm install <releasename> <chartdirectory>

However, when testing helm charts you often want to reinstall or update a release. The following pattern is "better", because it installs the release if it is missing and upgrades it if it is already installed:

helm upgrade --install <releasename> <chartdirectory>
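
For a development loop, the chart can be linted before each deploy. A minimal sketch, assuming the helm utility is installed; the function name and the default namespace are assumptions.

```shell
# Hypothetical helper: lint a chart and then install or upgrade it.
#   $1: release name, $2: chart directory, $3 (optional): namespace
helm_dev_deploy() {
    helm lint "$2" &&
    helm upgrade --install "$1" "$2" --namespace "${3:-default}" --create-namespace
}

# Usage:
#   helm_dev_deploy <releasename> <chartdirectory>
#   helm template <releasename> <chartdirectory>    # only render, do not deploy
```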

Naming services and deployments in helm charts [Application labels]¶

We always have {{ .Release.Name }} to identify the current "instance"
Deployments:
- use app: <what it is>, f.i. app: nginx, app: postgres, ...
See more about standard labels on
- https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/
- https://helm.sh/docs/chart_best_practices/labels/

Rook / Ceph Related Operations¶

Executing ceph commands¶

Using the ceph-tools pod as follows:

kubectl exec -n rook-ceph -ti $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o jsonpath='{.items[*].metadata.name}') -- ceph -s
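
For repeated use, the lookup of the tools pod can be wrapped in a function. This is a sketch with a hypothetical function name; unlike the one-liner above it selects only the first matching pod, so it keeps working if more than one tools pod exists.

```shell
# Hypothetical helper: run a command inside the rook-ceph-tools pod.
rook_ceph() {
    pod=$(kubectl -n rook-ceph get pods -l app=rook-ceph-tools \
        -o jsonpath='{.items[0].metadata.name}') || return 1
    kubectl -n rook-ceph exec -ti "$pod" -- "$@"
}

# Usage:
#   rook_ceph ceph -s
#   rook_ceph ceph osd tree
```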

Inspecting the logs of a specific server¶

# Get the related pods
kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare 
...

# Inspect the logs of a specific pod
kubectl -n rook-ceph logs -f rook-ceph-osd-prepare-server23--1-444qx

Inspecting the logs of the rook-ceph-operator¶

kubectl -n rook-ceph logs -f -l app=rook-ceph-operator

Triggering server prepare / adding new osds¶

The rook-ceph-operator triggers/watches/creates pods to maintain hosts. To trigger a full rescan, simply delete the operator pod:

kubectl -n rook-ceph delete pods -l app=rook-ceph-operator

This will cause all the rook-ceph-osd-prepare-.. jobs to be recreated and thus OSDs to be created, if new disks have been added.

Removing an OSD¶

See Ceph OSD Management
More specifically: https://github.com/rook/rook/blob/release-1.7/cluster/examples/kubernetes/ceph/osd-purge.yaml
The OSD should be down before it is purged
Set the OSD ID in osd-purge.yaml and apply it
Then delete the related deployment

apiVersion: batch/v1
kind: Job
metadata:
  name: rook-ceph-purge-osd
  namespace: rook-ceph # namespace:cluster
  labels:
    app: rook-ceph-purge-osd
spec:
  template:
    metadata:
      labels:
        app: rook-ceph-purge-osd
    spec:
      serviceAccountName: rook-ceph-purge-osd
      containers:
        - name: osd-removal
          image: rook/ceph:master
          # TODO: Insert the OSD ID in the last parameter that is to be removed
          # The OSD IDs are a comma-separated list. For example: "0" or "0,2".
          # If you want to preserve the OSD PVCs, set `--preserve-pvc true`.
          #
          # A --force-osd-removal option is available if the OSD should be destroyed even though the
          # removal could lead to data loss.
          args:
            - "ceph" 
            - "osd" 
            - "remove" 
            - "--preserve-pvc" 
            - "false" 
            - "--force-osd-removal" 
            - "false" 
            - "--osd-ids" 
            - "SETTHEOSDIDHERE" 
          env:
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: ROOK_MON_ENDPOINTS
              valueFrom:
                configMapKeyRef:
                  key: data
                  name: rook-ceph-mon-endpoints
            - name: ROOK_CEPH_USERNAME
              valueFrom:
                secretKeyRef:
                  key: ceph-username
                  name: rook-ceph-mon
            - name: ROOK_CEPH_SECRET
              valueFrom:
                secretKeyRef:
                  key: ceph-secret
                  name: rook-ceph-mon
            - name: ROOK_CONFIG_DIR
              value: /var/lib/rook
            - name: ROOK_CEPH_CONFIG_OVERRIDE
              value: /etc/rook/config/override.conf
            - name: ROOK_FSID
              valueFrom:
                secretKeyRef:
                  key: fsid
                  name: rook-ceph-mon
            - name: ROOK_LOG_LEVEL
              value: DEBUG
          volumeMounts:
            - mountPath: /etc/ceph
              name: ceph-conf-emptydir
            - mountPath: /var/lib/rook
              name: rook-config
      volumes:
        - emptyDir: {}
          name: ceph-conf-emptydir
        - emptyDir: {}
          name: rook-config
      restartPolicy: Never

Deleting the deployment:

[18:05] bridge:~% kubectl -n rook-ceph delete deployment rook-ceph-osd-6
deployment.apps "rook-ceph-osd-6" deleted

Harbor¶

We use Harbor for caching and as an image registry. Internal app reference: apps/prod/harbor.
The admin password is in the password store, auto-generated per cluster
At the moment Harbor only authenticates against the internal LDAP tree

LDAP configuration¶

The URL needs to be ldaps://...
uid = uid
All other settings stay at their defaults

Monitoring / Prometheus¶

Via kube-prometheus

Access via ...

Prometheus Options¶

helm/kube-prometheus-stack
- Includes dashboards and co.
manifest based kube-prometheus
- Includes dashboards and co.
Prometheus Operator (mainly CRD manifests)

Nextcloud¶

How to get the nextcloud credentials¶

The initial username is set to "nextcloud"
The password is autogenerated and saved in a kubernetes secret

kubectl get secret RELEASENAME-nextcloud -o jsonpath="{.data.PASSWORD}" | base64 -d; echo ""

How to fix "Access through untrusted domain"¶

Nextcloud stores the initial domain configuration
If the FQDN is changed, it will show the error message "Access through untrusted domain"
To fix, edit /var/www/html/config/config.php and correct the domain
Then delete the pods

Infrastructure versions¶

ungleich kubernetes infrastructure v5 (2021-10)¶

Clusters are configured / setup in this order:

Bootstrap via kubeadm
Networking via calico + BGP (non ECMP) using helm
ArgoCD for CD
- rook for storage via argocd
- haproxy as the in-cluster IPv4-to-IPv6 proxy for IPv6 clusters via argocd
- kubernetes-secret-generator for in-cluster secrets
- ungleich-certbot managing certs and nginx

ungleich kubernetes infrastructure v4 (2021-09)¶

rook is configured via manifests instead of using the rook-ceph-cluster helm chart
The rook operator is still being installed via helm

ungleich kubernetes infrastructure v3 (2021-07)¶

rook is now installed via helm via argocd instead of directly via manifests

ungleich kubernetes infrastructure v2 (2021-05)¶

Replaced fluxv2 from ungleich k8s v1 with argocd
- argocd can apply helm templates directly without needing to go through Chart releases
We are also using argoflow for build flows
Planned to add kaniko for image building

ungleich kubernetes infrastructure v1 (2021-01)¶

We are using the following components:

Calico as a CNI with BGP, IPv6 only, no encapsulation
- Needed for basic networking
kubernetes-secret-generator for creating secrets
- Needed so that secrets are not stored in the git repository, but only in the cluster
ungleich-certbot
- Needed to get letsencrypt certificates for services
rook with ceph rbd + cephfs for storage
- rbd for almost everything, ReadWriteOnce
- cephfs for smaller things, multi access ReadWriteMany
- Needed for providing persistent storage
flux v2
- Needed to manage resources automatically
