h1. The ungleich kubernetes infrastructure and ungleich kubernetes manual

{{toc}}

h2. Status

This document is **pre-production**.
This document is to become the ungleich kubernetes infrastructure overview as well as the ungleich kubernetes manual.

h2. k8s clusters

| Cluster           | Purpose/Setup     | Maintainer | Master(s)                     | argo                                                  | v4 http proxy | last verified |
| c0.k8s.ooo        | Dev               | -          | UNUSED                        |                                                       |               |    2021-10-05 |
| c1.k8s.ooo        | retired           |            | -                             |                                                       |               |    2022-03-15 |
| c2.k8s.ooo        | Dev p7 HW         | Nico       | server47 server53 server54    | "argo":https://argocd-server.argocd.svc.c2.k8s.ooo    |               |    2021-10-05 |
| c3.k8s.ooo        | retired           | -          | -                             |                                                       |               |    2021-10-05 |
| c4.k8s.ooo        | Dev2 p7 HW        | Jin-Guk    | server52 server53 server54    |                                                       |               |             - |
| c5.k8s.ooo        | retired           |            | -                             |                                                       |               |    2022-03-15 |
| c6.k8s.ooo        | Dev p6 VM Jin-Guk | Jin-Guk    |                               |                                                       |               |               |
| [[p5.k8s.ooo]]    | production        |            | server34 server36 server38    | "argo":https://argocd-server.argocd.svc.p5.k8s.ooo    |             - |               |
| [[p6.k8s.ooo]]    | production        |            | server67 server69 server71    | "argo":https://argocd-server.argocd.svc.p6.k8s.ooo    | 147.78.194.13 |    2021-10-05 |
| [[p10.k8s.ooo]]   | production        |            | server63 server65 server83    | "argo":https://argocd-server.argocd.svc.p10.k8s.ooo   | 147.78.194.12 |    2021-10-05 |
| [[k8s.ge.nau.so]] | development       |            | server107 server108 server109 | "argo":https://argocd-server.argocd.svc.k8s.ge.nau.so |               |               |
| [[dev.k8s.ooo]]   | development       |            | server110 server111 server112 | "argo":https://argocd-server.argocd.svc.dev.k8s.ooo   | -             |    2022-07-08 |

h2. General architecture and components overview

* All k8s clusters are IPv6 only
* We use BGP peering to propagate podcidr and serviceCidr networks to our infrastructure
* The main public testing repository is "ungleich-k8s":https://code.ungleich.ch/ungleich-public/ungleich-k8s
** Private configurations are found in the **k8s-config** repository

h3. Cluster types

| **Type/Feature**            | **Development**                | **Production**         |
| Min No. nodes               | 3 (1 master, 3 worker)         | 5 (3 master, 3 worker) |
| Recommended minimum         | 4 (dedicated master, 3 worker) | 8 (3 master, 5 worker) |
| Separation of control plane | optional                       | recommended            |
| Persistent storage          | required                       | required               |
| Number of storage monitors  | 3                              | 5                      |

h2. General k8s operations

h3. Cheat sheet / great external references

* "kubectl cheatsheet":https://kubernetes.io/docs/reference/kubectl/cheatsheet/

h3. Allowing to schedule work on the control plane / removing node taints

* Mostly for single node / test / development clusters
* Just remove the master/control-plane taints as follows

<pre>
kubectl taint nodes --all node-role.kubernetes.io/master-
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
</pre>

You can check the node taints using @kubectl describe node ...@
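
For example, to verify that no taints are left (serverXX being a placeholder node name):

<pre>
kubectl describe node serverXX | grep -i taints
</pre>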

h3. Get the cluster admin.conf

* On the masters of each cluster you can find the file @/etc/kubernetes/admin.conf@
* To be able to administer the cluster, copy the admin.conf to your local machine
* Multi-cluster debugging is much easier if you name the config ~/cX-admin.conf (see the example below)

<pre>
% scp root@server47.place7.ungleich.ch:/etc/kubernetes/admin.conf ~/c2-admin.conf
% export KUBECONFIG=~/c2-admin.conf
% kubectl get nodes
NAME       STATUS                     ROLES                  AGE   VERSION
server47   Ready                      control-plane,master   82d   v1.22.0
server48   Ready                      control-plane,master   82d   v1.22.0
server49   Ready                      <none>                 82d   v1.22.0
server50   Ready                      <none>                 82d   v1.22.0
server59   Ready                      control-plane,master   82d   v1.22.0
server60   Ready,SchedulingDisabled   <none>                 82d   v1.22.0
server61   Ready                      <none>                 82d   v1.22.0
server62   Ready                      <none>                 82d   v1.22.0
</pre>

h3. Installing a new k8s cluster

* Decide on the cluster name (usually *cX.k8s.ooo*), X counting upwards
** Use pXX.k8s.ooo for production clusters of placeXX
* Use cdist to configure the nodes with requirements like crio
* Decide between single or multi-node control plane setups (see below)
** A single control plane is suitable for development clusters

Typical init procedure:

* Single control plane: @kubeadm init --config bootstrap/XXX/kubeadm.yaml@
* Multi control plane (HA): @kubeadm init --config bootstrap/XXX/kubeadm.yaml --upload-certs@
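
The referenced @bootstrap/XXX/kubeadm.yaml@ is cluster specific. As a rough, hypothetical sketch of what such a config can look like for an IPv6-only cluster (cluster name, API endpoint and CIDRs below are placeholders, not our actual values):

<pre>
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.22.0
clusterName: cX.k8s.ooo
controlPlaneEndpoint: "cX-api.k8s.ooo:6443"
networking:
  podSubnet: 2a0a:e5c0:XXXX::/64
  serviceSubnet: 2a0a:e5c0:XXXX:1::/108
</pre>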

h3. Deleting a pod that is hanging in terminating state

<pre>
kubectl delete pod <PODNAME> --grace-period=0 --force --namespace <NAMESPACE>
</pre>

(from https://stackoverflow.com/questions/35453792/pods-stuck-in-terminating-status)

h3. Listing nodes of a cluster

<pre>
[15:05] bridge:~% kubectl get nodes
NAME       STATUS   ROLES                  AGE   VERSION
server22   Ready    <none>                 52d   v1.22.0
server23   Ready    <none>                 52d   v1.22.2
server24   Ready    <none>                 52d   v1.22.0
server25   Ready    <none>                 52d   v1.22.0
server26   Ready    <none>                 52d   v1.22.0
server27   Ready    <none>                 52d   v1.22.0
server63   Ready    control-plane,master   52d   v1.22.0
server64   Ready    <none>                 52d   v1.22.0
server65   Ready    control-plane,master   52d   v1.22.0
server66   Ready    <none>                 52d   v1.22.0
server83   Ready    control-plane,master   52d   v1.22.0
server84   Ready    <none>                 52d   v1.22.0
server85   Ready    <none>                 52d   v1.22.0
server86   Ready    <none>                 52d   v1.22.0
</pre>

h3. Removing / draining a node

Usually @kubectl drain server@ should do the job, but sometimes we need to be more aggressive:

<pre>
kubectl drain --delete-emptydir-data --ignore-daemonsets serverXX
</pre>

h3. Re-adding a node after draining

<pre>
kubectl uncordon serverXX
</pre>

h3. (Re-)joining worker nodes after creating the cluster

* We need to have an up-to-date token
* We use different join commands for the workers and control plane nodes

Generating the join command on an existing control plane node:

<pre>
kubeadm token create --print-join-command
</pre>
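
The printed command is then run on the worker node that should (re-)join; schematically (endpoint, token and hash are placeholders):

<pre>
# on the worker node
kubeadm join cX-api.k8s.ooo:6443 --token TOKEN --discovery-token-ca-cert-hash sha256:HASH
</pre>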

h3. (Re-)joining control plane nodes after creating the cluster

* We generate the token again
* We upload the certificates
* We need to combine/create the join command for the control plane node

Example session:

<pre>
% kubeadm token create --print-join-command
kubeadm join p10-api.k8s.ooo:6443 --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash

% kubeadm init phase upload-certs --upload-certs
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:
CERTKEY

# Then we use these two outputs on the joining node:

kubeadm join p10-api.k8s.ooo:6443 --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash --control-plane --certificate-key CERTKEY
</pre>

Commands to be used on a control plane node:

<pre>
kubeadm token create --print-join-command
kubeadm init phase upload-certs --upload-certs
</pre>

Commands to be used on the joining node:

<pre>
JOINCOMMAND --control-plane --certificate-key CERTKEY
</pre>

SEE ALSO

* https://stackoverflow.com/questions/63936268/how-to-generate-kubeadm-token-for-secondary-control-plane-nodes
* https://blog.scottlowe.org/2019/08/15/reconstructing-the-join-command-for-kubeadm/

h3. How to fix etcd not starting when rejoining a kubernetes cluster as a control plane

If during the above step etcd does not come up, @kubeadm join@ can hang as follows:

<pre>
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[check-etcd] Checking that the etcd cluster is healthy
error execution phase check-etcd: etcd cluster is not healthy: failed to dial endpoint https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2379 with maintenance client: context deadline exceeded
To see the stack trace of this error execute with --v=5 or higher
</pre>

Then the problem is likely that the etcd server is still a member of the cluster. We first need to remove it from the etcd cluster, after which the join works.

To fix this we do:

* Find a working etcd pod
* Find the etcd members / member list
* Remove the etcd member that we want to re-join the cluster


<pre>
# Find the etcd pods
kubectl -n kube-system get pods -l component=etcd,tier=control-plane

# Get the list of etcd servers with the member id
kubectl exec -n kube-system -ti ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list

# Remove the member
kubectl exec -n kube-system -ti ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove MEMBERID
</pre>

Sample session:

<pre>
[10:48] line:~% kubectl -n kube-system get pods -l component=etcd,tier=control-plane
NAME            READY   STATUS    RESTARTS     AGE
etcd-server63   1/1     Running   0            3m11s
etcd-server65   1/1     Running   3            7d2h
etcd-server83   1/1     Running   8 (6d ago)   7d2h
[10:48] line:~% kubectl exec -n kube-system -ti etcd-server65 -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
356891cd676df6e4, started, server65, https://[2a0a:e5c0:10:1:225:b3ff:fe20:375c]:2380, https://[2a0a:e5c0:10:1:225:b3ff:fe20:375c]:2379, false
371b8a07185dee7e, started, server63, https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2380, https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2379, false
5942bc58307f8af9, started, server83, https://[2a0a:e5c0:10:1:3e4a:92ff:fe79:bb98]:2380, https://[2a0a:e5c0:10:1:3e4a:92ff:fe79:bb98]:2379, false

[10:48] line:~% kubectl exec -n kube-system -ti etcd-server65 -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove 371b8a07185dee7e
Member 371b8a07185dee7e removed from cluster e3c0805f592a8f77

</pre>

SEE ALSO

* We found the solution using https://stackoverflow.com/questions/67921552/re-installed-node-cannot-join-kubernetes-cluster

h3. Hardware Maintenance using ungleich-hardware

Use the following manifest and replace the HOST with the actual host:

<pre>
apiVersion: v1
kind: Pod
metadata:
  name: ungleich-hardware-HOST
spec:
  containers:
  - name: ungleich-hardware
    image: ungleich/ungleich-hardware:0.0.5
    args:
    - sleep
    - "1000000"
    volumeMounts:
      - mountPath: /dev
        name: dev
    securityContext:
      privileged: true
  nodeSelector:
    kubernetes.io/hostname: "HOST"

  volumes:
    - name: dev
      hostPath:
        path: /dev
</pre>
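
A typical session then looks roughly like this (assuming HOST was replaced with server1 and the manifest was saved as ungleich-hardware-server1.yaml; the file name and shell are assumptions):

<pre>
kubectl apply -f ungleich-hardware-server1.yaml
kubectl exec -ti ungleich-hardware-server1 -- /bin/sh
# ... run the hardware tools inside the pod, then clean up:
kubectl delete pod ungleich-hardware-server1
</pre>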

Also see: [[The_ungleich_hardware_maintenance_guide]]

h3. Triggering a cronjob / creating a job from a cronjob

To test a cronjob, we can create a job from a cronjob:

<pre>
kubectl create job --from=cronjob/volume2-daily-backup volume2-manual
</pre>

This creates a job volume2-manual based on the cronjob volume2-daily-backup.

h3. su-ing into a user that has nologin shell set

Often users have nologin set as their shell inside the container. To be able to execute maintenance commands within the
container, we can use @su -s /bin/sh@ like this:

<pre>
su -s /bin/sh -c '/path/to/your/script' testuser
</pre>

Found on https://serverfault.com/questions/351046/how-to-run-command-as-user-who-has-usr-sbin-nologin-as-shell

h3. How to print a secret value

Assuming you want the "password" item from a secret, use:

<pre>
kubectl get secret SECRETNAME -o jsonpath="{.data.password}" | base64 -d; echo ""
</pre>

h2. Calico CNI

h3. Calico Installation

* We install "calico using helm":https://docs.projectcalico.org/getting-started/kubernetes/helm
* This has the following advantages:
** Easy to upgrade
** Does not require the OS to configure IPv6/dual-stack settings, as the tigera operator figures things out on its own

Usually plain calico can be installed directly using:

<pre>
helm repo add projectcalico https://docs.projectcalico.org/charts
helm upgrade --install --namespace tigera calico projectcalico/tigera-operator --version v3.23.2 --create-namespace
# helm install --namespace tigera calico projectcalico/tigera-operator --version v3.23.2 --create-namespace
</pre>

* Check the tags on https://github.com/projectcalico/calico/tags for the latest release

h3. Installing calicoctl

* General installation instructions, including binary download: https://projectcalico.docs.tigera.io/maintenance/clis/calicoctl/install

To be able to manage and configure calico, we need to "install calicoctl (we choose to run it as a pod)":https://docs.projectcalico.org/getting-started/clis/calicoctl/install#install-calicoctl-as-a-kubernetes-pod

<pre>
kubectl apply -f https://docs.projectcalico.org/manifests/calicoctl.yaml
</pre>

Or version specific:

<pre>
kubectl apply -f https://github.com/projectcalico/calico/blob/v3.20.4/manifests/calicoctl.yaml

# For 3.22
kubectl apply -f https://projectcalico.docs.tigera.io/archive/v3.22/manifests/calicoctl.yaml
</pre>

And to make it easier to access, set an alias:

<pre>
alias calicoctl="kubectl exec -i -n kube-system calicoctl -- /calicoctl"
</pre>

h3. Calico configuration

By default our k8s clusters "BGP peer":https://docs.projectcalico.org/networking/bgp
with an upstream router to propagate podcidr and servicecidr.

Default settings in our infrastructure:

* We use a full-mesh using the @nodeToNodeMeshEnabled: true@ option
* We keep the original next hop so that *only* the server with the pod is announcing it (instead of ecmp)
* We use private ASNs for k8s clusters
* We do *not* use any overlay

After installing calico and calicoctl the last step of the installation is usually:

<pre>
calicoctl create -f - < calico-bgp.yaml
</pre>


A sample BGP configuration:

<pre>
---
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: true
  asNumber: 65534
  serviceClusterIPs:
  - cidr: 2a0a:e5c0:10:3::/108
  serviceExternalIPs:
  - cidr: 2a0a:e5c0:10:3::/108
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: router1-place10
spec:
  peerIP: 2a0a:e5c0:10:1::50
  asNumber: 213081
  keepOriginalNextHop: true
</pre>

h2. ArgoCD / ArgoWorkFlow

h3. Argocd Installation

* See https://argo-cd.readthedocs.io/en/stable/

As there is no configuration management present yet, argocd is installed using:

<pre>
kubectl create namespace argocd

# Specific Version
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/v2.3.2/manifests/install.yaml

# OR: latest stable
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
</pre>

h3. Get the argocd credentials

<pre>
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d; echo ""
</pre>

h3. Accessing argocd

In regular IPv6 clusters:

* Navigate to https://argocd-server.argocd.CLUSTERDOMAIN

In legacy IPv4 clusters:

<pre>
kubectl --namespace argocd port-forward svc/argocd-server 8080:80
</pre>

* Navigate to https://localhost:8080

h3. Using the argocd webhook to trigger changes

* To trigger changes, POST JSON to https://argocd.example.com/api/webhook (see the sketch below)
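
A minimal sketch of such a trigger, assuming a GitHub-style push payload (the hostname and repository URL are placeholders):

<pre>
curl -X POST https://argocd.example.com/api/webhook \
  -H "Content-Type: application/json" \
  -H "X-GitHub-Event: push" \
  -d '{"ref": "refs/heads/master", "repository": {"html_url": "https://code.ungleich.ch/ungleich-intern/k8s-config"}}'
</pre>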

h3. Deploying an application

* Applications are deployed via git towards gitea (code.ungleich.ch) and then pulled by argo
* Always include the *redmine-url* pointing to the (customer) ticket
** Also add the support-url if it exists

Application sample

<pre>
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gitea-CUSTOMER
  namespace: argocd
spec:
  destination:
    namespace: default
    server: 'https://kubernetes.default.svc'
  source:
    path: apps/prod/gitea
    repoURL: 'https://code.ungleich.ch/ungleich-intern/k8s-config.git'
    targetRevision: HEAD
    helm:
      parameters:
        - name: storage.data.storageClass
          value: rook-ceph-block-hdd
        - name: storage.data.size
          value: 200Gi
        - name: storage.db.storageClass
          value: rook-ceph-block-ssd
        - name: storage.db.size
          value: 10Gi
        - name: storage.letsencrypt.storageClass
          value: rook-ceph-block-hdd
        - name: storage.letsencrypt.size
          value: 50Mi
        - name: letsencryptStaging
          value: 'no'
        - name: fqdn
          value: 'code.verua.online'
  project: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
  info:
    - name: 'redmine-url'
      value: 'https://redmine.ungleich.ch/issues/ISSUEID'
    - name: 'support-url'
      value: 'https://support.ungleich.ch/Ticket/Display.html?id=TICKETID'
</pre>

h2. Helm related operations and conventions

We use helm charts extensively.

* In production, they are managed via argocd
* In development, helm charts can be developed and deployed manually using the helm utility.

h3. Installing a helm chart

One can use the usual pattern of

<pre>
helm install <releasename> <chartdirectory>
</pre>

However, when testing helm charts you often want to reinstall/update. The following pattern is "better", because it also works if the release is already installed:

<pre>
helm upgrade --install <releasename> <chartdirectory>
</pre>

h3. Naming services and deployments in helm charts [Application labels]

* We always have {{ .Release.Name }} to identify the current "instance"
* Deployments:
** use @app: <what it is>@, f.i. @app: nginx@, @app: postgres@, ... (see the sketch after this list)
* See more about standard labels on
** https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/
** https://helm.sh/docs/chart_best_practices/labels/
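
A minimal sketch of how this can look in a deployment template (names and image are illustrative only):

<pre>
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-nginx
  labels:
    app: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.21
</pre>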
h2. Rook / Ceph Related Operations

h3. Executing ceph commands

Using the ceph-tools pod as follows:

<pre>
kubectl exec -n rook-ceph -ti $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o jsonpath='{.items[*].metadata.name}') -- ceph -s
</pre>

h3. Inspecting the logs of a specific server

<pre>
# Get the related pods
kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare
...

# Inspect the logs of a specific pod
kubectl -n rook-ceph logs -f rook-ceph-osd-prepare-server23--1-444qx

</pre>

h3. Inspecting the logs of the rook-ceph-operator

<pre>
kubectl -n rook-ceph logs -f -l app=rook-ceph-operator
</pre>

h3. Triggering server prepare / adding new osds

The rook-ceph-operator triggers/watches/creates pods to maintain hosts. To trigger a full "rescan", simply delete that pod:

<pre>
kubectl -n rook-ceph delete pods -l app=rook-ceph-operator
</pre>

This will cause all the @rook-ceph-osd-prepare-..@ jobs to be recreated and thus OSDs to be created, if new disks have been added.

h3. Removing an OSD

* See "Ceph OSD Management":https://rook.io/docs/rook/v1.7/ceph-osd-mgmt.html
* More specifically: https://github.com/rook/rook/blob/release-1.7/cluster/examples/kubernetes/ceph/osd-purge.yaml
* Then delete the related deployment

Set the OSD id in the osd-purge.yaml below and apply it. The OSD should be down before purging; a quick check is sketched right after this paragraph.
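
The down/out state can be verified via the ceph-tools pod (see "Executing ceph commands" above), for example:

<pre>
# shows the up/down and in/out state of all OSDs
kubectl exec -n rook-ceph -ti $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o jsonpath='{.items[*].metadata.name}') -- ceph osd tree
</pre>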

<pre>
apiVersion: batch/v1
kind: Job
metadata:
  name: rook-ceph-purge-osd
  namespace: rook-ceph # namespace:cluster
  labels:
    app: rook-ceph-purge-osd
spec:
  template:
    metadata:
      labels:
        app: rook-ceph-purge-osd
    spec:
      serviceAccountName: rook-ceph-purge-osd
      containers:
        - name: osd-removal
          image: rook/ceph:master
          # TODO: Insert the OSD ID in the last parameter that is to be removed
          # The OSD IDs are a comma-separated list. For example: "0" or "0,2".
          # If you want to preserve the OSD PVCs, set `--preserve-pvc true`.
          #
          # A --force-osd-removal option is available if the OSD should be destroyed even though the
          # removal could lead to data loss.
          args:
            - "ceph"
            - "osd"
            - "remove"
            - "--preserve-pvc"
            - "false"
            - "--force-osd-removal"
            - "false"
            - "--osd-ids"
            - "SETTHEOSDIDHERE"
          env:
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: ROOK_MON_ENDPOINTS
              valueFrom:
                configMapKeyRef:
                  key: data
                  name: rook-ceph-mon-endpoints
            - name: ROOK_CEPH_USERNAME
              valueFrom:
                secretKeyRef:
                  key: ceph-username
                  name: rook-ceph-mon
            - name: ROOK_CEPH_SECRET
              valueFrom:
                secretKeyRef:
                  key: ceph-secret
                  name: rook-ceph-mon
            - name: ROOK_CONFIG_DIR
              value: /var/lib/rook
            - name: ROOK_CEPH_CONFIG_OVERRIDE
              value: /etc/rook/config/override.conf
            - name: ROOK_FSID
              valueFrom:
                secretKeyRef:
                  key: fsid
                  name: rook-ceph-mon
            - name: ROOK_LOG_LEVEL
              value: DEBUG
          volumeMounts:
            - mountPath: /etc/ceph
              name: ceph-conf-emptydir
            - mountPath: /var/lib/rook
              name: rook-config
      volumes:
        - emptyDir: {}
          name: ceph-conf-emptydir
        - emptyDir: {}
          name: rook-config
      restartPolicy: Never

</pre>

Deleting the deployment:

<pre>
[18:05] bridge:~% kubectl -n rook-ceph delete deployment rook-ceph-osd-6
deployment.apps "rook-ceph-osd-6" deleted
</pre>

h2. Harbor

* We use "Harbor":https://goharbor.io/ for caching and as an image registry. Internal app reference: apps/prod/harbor.
* The admin password is in the password store, auto-generated per cluster
* At the moment Harbor only authenticates against the internal LDAP tree

h3. LDAP configuration

* The URL needs to be ldaps://...
* uid = uid
* Leave the rest at the standard settings

h2. Monitoring / Prometheus

* Via "kube-prometheus":https://github.com/prometheus-operator/kube-prometheus/

Access via the following in-cluster URLs (see the port-forward sketch below for access from outside):

* http://prometheus-k8s.monitoring.svc:9090
* http://grafana.monitoring.svc:3000
* http://alertmanager.monitoring.svc:9093
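
From outside the cluster, port-forwarding is one option; a sketch for grafana (service name, namespace and port taken from the URL above):

<pre>
kubectl -n monitoring port-forward svc/grafana 3000:3000
</pre>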

h3. Prometheus Options

* "helm/kube-prometheus-stack":https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
** Includes dashboards and co.
* "manifest based kube-prometheus":https://github.com/prometheus-operator/kube-prometheus
** Includes dashboards and co.
* "Prometheus Operator (mainly CRD manifests)":https://github.com/prometheus-operator/prometheus-operator

h2. Nextcloud

h3. How to get the nextcloud credentials

* The initial username is set to "nextcloud"
* The password is autogenerated and saved in a kubernetes secret

<pre>
kubectl get secret RELEASENAME-nextcloud -o jsonpath="{.data.PASSWORD}" | base64 -d; echo ""
</pre>

h3. How to fix "Access through untrusted domain"

* Nextcloud stores the initial domain configuration
* If the FQDN is changed, it will show the error message "Access through untrusted domain"
* To fix, edit /var/www/html/config/config.php and correct the domain (see the sketch below)
* Then delete the pods
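
The relevant part of config.php is the trusted_domains array; roughly (the FQDN is an example):

<pre>
'trusted_domains' =>
  array (
    0 => 'nextcloud.example.com',
  ),
</pre>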

h2. Infrastructure versions

h3. ungleich kubernetes infrastructure v5 (2021-10)

Clusters are configured / set up in this order:

* Bootstrap via kubeadm
* "Networking via calico + BGP (non ECMP) using helm":https://docs.projectcalico.org/getting-started/kubernetes/helm
* "ArgoCD for CD":https://argo-cd.readthedocs.io/en/stable/
** "rook for storage via argocd":https://rook.io/
** haproxy as an IPv4-to-IPv6 proxy into the IPv6-only cluster, via argocd
** "kubernetes-secret-generator for in cluster secrets":https://github.com/mittwald/kubernetes-secret-generator
** "ungleich-certbot managing certs and nginx":https://hub.docker.com/repository/docker/ungleich/ungleich-certbot

h3. ungleich kubernetes infrastructure v4 (2021-09)

* rook is configured via manifests instead of using the rook-ceph-cluster helm chart
* The rook operator is still being installed via helm

h3. ungleich kubernetes infrastructure v3 (2021-07)

* rook is now installed via helm via argocd instead of directly via manifests

h3. ungleich kubernetes infrastructure v2 (2021-05)

* Replaced fluxv2 from ungleich k8s v1 with argocd
** argocd can apply helm templates directly without needing to go through Chart releases
* We are also using argoflow for build flows
* Planned to add "kaniko":https://github.com/GoogleContainerTools/kaniko for image building

h3. ungleich kubernetes infrastructure v1 (2021-01)

We are using the following components:

* "Calico as a CNI":https://www.projectcalico.org/ with BGP, IPv6 only, no encapsulation
** Needed for basic networking
* "kubernetes-secret-generator":https://github.com/mittwald/kubernetes-secret-generator for creating secrets
** Needed so that secrets are not stored in the git repository, but only in the cluster
* "ungleich-certbot":https://hub.docker.com/repository/docker/ungleich/ungleich-certbot
** Needed to get letsencrypt certificates for services
* "rook with ceph rbd + cephfs":https://rook.io/ for storage
** rbd for almost everything, *ReadWriteOnce*
** cephfs for smaller things, multi access *ReadWriteMany*
** Needed for providing persistent storage
* "flux v2":https://fluxcd.io/
** Needed to manage resources automatically