h1. The ungleich kubernetes infrastructure and ungleich kubernetes manual
2 1 Nico Schottelius
3 3 Nico Schottelius
{{toc}}
4
5 1 Nico Schottelius
h2. Status
6
7 28 Nico Schottelius
This document is **pre-production**.
8
This document is to become the ungleich kubernetes infrastructure overview as well as the ungleich kubernetes manual.
9 1 Nico Schottelius
10 10 Nico Schottelius
h2. k8s clusters
11
12 123 Nico Schottelius
| Cluster            | Purpose/Setup     | Maintainer | Master(s)                     | argo                                                   | v4 http proxy | last verified |
13
| c0.k8s.ooo         | Dev               | -          | UNUSED                        |                                                        |               |    2021-10-05 |
14
| c1.k8s.ooo         | retired           |            | -                             |                                                        |               |    2022-03-15 |
15
| c2.k8s.ooo         | Dev p7 HW         | Nico       | server47 server53 server54    | "argo":https://argocd-server.argocd.svc.c2.k8s.ooo     |               |    2021-10-05 |
16
| c3.k8s.ooo         | retired           | -          | -                             |                                                        |               |    2021-10-05 |
17
| c4.k8s.ooo         | Dev2 p7 HW        | Jin-Guk    | server52 server53 server54    |                                                        |               |             - |
18
| c5.k8s.ooo         | retired           |            | -                             |                                                        |               |    2022-03-15 |
19
| c6.k8s.ooo         | Dev p6 VM Jin-Guk | Jin-Guk    |                               |                                                        |               |               |
20
| [[p5.k8s.ooo]]     | production        |            | server34 server36 server38    | "argo":https://argocd-server.argocd.svc.p5.k8s.ooo     | -             |               |
21
| [[p5-cow.k8s.ooo]] | production        | Nico       | server47 server51 server55    | "argo":https://argocd-server.argocd.svc.p5-cow.k8s.ooo |               |    2022-08-27 |
22
| [[p6.k8s.ooo]]     | production        |            | server67 server69 server71    | "argo":https://argocd-server.argocd.svc.p6.k8s.ooo     | 147.78.194.13 |    2021-10-05 |
23 184 Nico Schottelius
| [[p6-cow.k8s.ooo]] | production        |            | server134 server135 server136 | "argo":https://argocd-server.argocd.svc.p6in10.k8s.ooo | ?             |    2023-05-17 |
24 177 Nico Schottelius
| [[p10.k8s.ooo]]    | production        |            | server131 server132 server133 | "argo":https://argocd-server.argocd.svc.p10.k8s.ooo    | 147.78.194.12 |    2021-10-05 |
25 123 Nico Schottelius
| [[k8s.ge.nau.so]]  | development       |            | server107 server108 server109 | "argo":https://argocd-server.argocd.svc.k8s.ge.nau.so  |               |               |
26
| [[dev.k8s.ooo]]    | development       |            | server110 server111 server112 | "argo":https://argocd-server.argocd.svc.dev.k8s.ooo    | -             |    2022-07-08 |
27 164 Nico Schottelius
| [[r1r2p15k8sooo|r1.p15.k8s.ooo]] | production | Nico | server120 | | | 2022-10-30 |
28
| [[r1r2p15k8sooo|r2.p15.k8s.ooo]] | production | Nico | server121 | | | 2022-09-06 |
29 162 Nico Schottelius
| [[r1r2p10k8sooo|r1.p10.k8s.ooo]] | production | Nico | server122 | | | 2022-10-30 |
30
| [[r1r2p10k8sooo|r2.p10.k8s.ooo]] | production | Nico | server123 | | | 2022-10-15 |
31
| [[r1r2p5k8sooo|r1.p5.k8s.ooo]] | production | Nico | server137 | | | 2022-10-30 |
32
| [[r1r2p5k8sooo|r2.p5.k8s.ooo]] | production | Nico | server138 | | | 2022-10-30 |
33
| [[r1r2p6k8sooo|r1.p6.k8s.ooo]] | production | Nico | server139 | | | 2022-10-30 |
34
| [[r1r2p6k8sooo|r2.p6.k8s.ooo]] | production | Nico | server140 | | | 2022-10-30 |
35 21 Nico Schottelius
36 1 Nico Schottelius
h2. General architecture and components overview
37
38
* All k8s clusters are IPv6 only
39
* We use BGP peering to propagate podcidr and serviceCidr networks to our infrastructure
40
* The main public testing repository is "ungleich-k8s":https://code.ungleich.ch/ungleich-public/ungleich-k8s
41 18 Nico Schottelius
** Private configurations are found in the **k8s-config** repository
42 1 Nico Schottelius
43
h3. Cluster types
44
45 28 Nico Schottelius
| **Type/Feature**            | **Development**                | **Production**         |
46
| Min No. nodes               | 3 (1 master, 3 worker)         | 5 (3 master, 3 worker) |
47
| Recommended minimum         | 4 (dedicated master, 3 worker) | 8 (3 master, 5 worker) |
48
| Separation of control plane | optional                       | recommended            |
49
| Persistent storage          | required                       | required               |
50
| Number of storage monitors  | 3                              | 5                      |
51 1 Nico Schottelius
52 43 Nico Schottelius
h2. General k8s operations
53 1 Nico Schottelius
54 46 Nico Schottelius
h3. Cheat sheet / external great references
55
56
* "kubectl cheatsheet":https://kubernetes.io/docs/reference/kubectl/cheatsheet/
57
58 117 Nico Schottelius
h3. Allowing to schedule work on the control plane / removing node taints
59 69 Nico Schottelius
60
* Mostly for single node / test / development clusters
61
* Just remove the master taint as follows
62
63
<pre>
64
kubectl taint nodes --all node-role.kubernetes.io/master-
65 118 Nico Schottelius
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
66 69 Nico Schottelius
</pre>
67 1 Nico Schottelius
68 117 Nico Schottelius
You can check the node taints using @kubectl describe node ...@
69 69 Nico Schottelius
70 208 Nico Schottelius
h3. Adding taints
71
72
* For instance to limit nodes to specific customers
73
74
<pre>
75
kubectl taint nodes serverXX customer=CUSTOMERNAME:NoSchedule
76
</pre>
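Pods that should still be scheduled on such a node need a matching toleration (and typically a nodeSelector to pin them there). A minimal sketch, assuming the taint from above; the @customer@ node label is an illustration only:

<pre>
apiVersion: v1
kind: Pod
...
spec:
  # assumed label; only works if the nodes are labelled accordingly
  nodeSelector:
    customer: CUSTOMERNAME
  tolerations:
    - key: "customer"
      operator: "Equal"
      value: "CUSTOMERNAME"
      effect: "NoSchedule"
</pre>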
77
78 44 Nico Schottelius
h3. Get the cluster admin.conf
79
80
* On the masters of each cluster you can find the file @/etc/kubernetes/admin.conf@
81
* To be able to administer the cluster, copy the admin.conf to your local machine
* Multi cluster debugging becomes much easier if you name the config ~/cX-admin.conf (see example below)
83
84
<pre>
85
% scp root@server47.place7.ungleich.ch:/etc/kubernetes/admin.conf ~/c2-admin.conf
86
% export KUBECONFIG=~/c2-admin.conf    
87
% kubectl get nodes
88
NAME       STATUS                     ROLES                  AGE   VERSION
89
server47   Ready                      control-plane,master   82d   v1.22.0
90
server48   Ready                      control-plane,master   82d   v1.22.0
91
server49   Ready                      <none>                 82d   v1.22.0
92
server50   Ready                      <none>                 82d   v1.22.0
93
server59   Ready                      control-plane,master   82d   v1.22.0
94
server60   Ready,SchedulingDisabled   <none>                 82d   v1.22.0
95
server61   Ready                      <none>                 82d   v1.22.0
96
server62   Ready                      <none>                 82d   v1.22.0               
97
</pre>
98
99 18 Nico Schottelius
h3. Installing a new k8s cluster
100 8 Nico Schottelius
101 9 Nico Schottelius
* Decide on the cluster name (usually *cX.k8s.ooo*), X counting upwards
102 28 Nico Schottelius
** Using pXX.k8s.ooo for production clusters of placeXX
103 9 Nico Schottelius
* Use cdist to configure the nodes with requirements like crio
104
* Decide between single or multi node control plane setups (see below)
105 28 Nico Schottelius
** Single control plane suitable for development clusters
106 9 Nico Schottelius
107 28 Nico Schottelius
Typical init procedure:
108 9 Nico Schottelius
109 206 Nico Schottelius
h4. Single control plane:
110
111
<pre>
112
kubeadm init --config bootstrap/XXX/kubeadm.yaml
113
</pre>
114
115
h4. Multi control plane (HA):
116
117
<pre>
118
kubeadm init --config bootstrap/XXX/kubeadm.yaml --upload-certs
119
</pre>
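A minimal sketch of what such a @kubeadm.yaml@ can look like for our IPv6 only clusters; all names, endpoints and prefixes below are placeholders, the real files live in the private configuration repository:

<pre>
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
clusterName: pXX.k8s.ooo
controlPlaneEndpoint: "pXX-api.k8s.ooo:6443"
networking:
  # IPv6 only: both CIDRs are IPv6 prefixes announced via BGP
  podSubnet: 2a0a:e5c0:XXXX::/64
  serviceSubnet: 2a0a:e5c0:XXXX::/108
</pre>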
120
121 10 Nico Schottelius
122 29 Nico Schottelius
h3. Deleting a pod that is hanging in terminating state
123
124
<pre>
125
kubectl delete pod <PODNAME> --grace-period=0 --force --namespace <NAMESPACE>
126
</pre>
127
128
(from https://stackoverflow.com/questions/35453792/pods-stuck-in-terminating-status)
129
130 42 Nico Schottelius
h3. Listing nodes of a cluster
131
132
<pre>
133
[15:05] bridge:~% kubectl get nodes
134
NAME       STATUS   ROLES                  AGE   VERSION
135
server22   Ready    <none>                 52d   v1.22.0
136
server23   Ready    <none>                 52d   v1.22.2
137
server24   Ready    <none>                 52d   v1.22.0
138
server25   Ready    <none>                 52d   v1.22.0
139
server26   Ready    <none>                 52d   v1.22.0
140
server27   Ready    <none>                 52d   v1.22.0
141
server63   Ready    control-plane,master   52d   v1.22.0
142
server64   Ready    <none>                 52d   v1.22.0
143
server65   Ready    control-plane,master   52d   v1.22.0
144
server66   Ready    <none>                 52d   v1.22.0
145
server83   Ready    control-plane,master   52d   v1.22.0
146
server84   Ready    <none>                 52d   v1.22.0
147
server85   Ready    <none>                 52d   v1.22.0
148
server86   Ready    <none>                 52d   v1.22.0
149
</pre>
150
151 41 Nico Schottelius
h3. Removing / draining a node
152
153
Usually @kubectl drain server@ should do the job, but sometimes we need to be more aggressive:
154
155 1 Nico Schottelius
<pre>
156 103 Nico Schottelius
kubectl drain --delete-emptydir-data --ignore-daemonsets serverXX
157 42 Nico Schottelius
</pre>
158
159
h3. Re-adding a node after draining
160
161
<pre>
162
kubectl uncordon serverXX
163 1 Nico Schottelius
</pre>
164 43 Nico Schottelius
165 50 Nico Schottelius
h3. (Re-)joining worker nodes after creating the cluster
166 49 Nico Schottelius
167
* We need to have an up-to-date token
168
* We use different join commands for the workers and control plane nodes
169
170
Generating the join command on an existing control plane node:
171
172
<pre>
173
kubeadm token create --print-join-command
174
</pre>
175
176 50 Nico Schottelius
h3. (Re-)joining control plane nodes after creating the cluster
177 1 Nico Schottelius
178 50 Nico Schottelius
* We generate the token again
179
* We upload the certificates
180
* We need to combine/create the join command for the control plane node
181
182
Example session:
183
184
<pre>
185
% kubeadm token create --print-join-command
186
kubeadm join p10-api.k8s.ooo:6443 --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash 
187
188
% kubeadm init phase upload-certs --upload-certs
189
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
190
[upload-certs] Using certificate key:
191
CERTKEY
192
193
# Then we use these two outputs on the joining node:
194
195
kubeadm join p10-api.k8s.ooo:6443 --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash --control-plane --certificate-key CERTKEY
196
</pre>
197
198
Commands to be used on a control plane node:
199
200
<pre>
201
kubeadm token create --print-join-command
202
kubeadm init phase upload-certs --upload-certs
203
</pre>
204
205
Commands to be used on the joining node:
206
207
<pre>
208
JOINCOMMAND --control-plane --certificate-key CERTKEY
209
</pre>
210 49 Nico Schottelius
211 51 Nico Schottelius
SEE ALSO
212
213
* https://stackoverflow.com/questions/63936268/how-to-generate-kubeadm-token-for-secondary-control-plane-nodes
214
* https://blog.scottlowe.org/2019/08/15/reconstructing-the-join-command-for-kubeadm/
215
216 53 Nico Schottelius
h3. How to fix etcd does not start when rejoining a kubernetes cluster as a control plane
217 52 Nico Schottelius
218
If during the above step etcd does not come up, @kubeadm join@ can hang as follows:
219
220
<pre>
221
[control-plane] Creating static Pod manifest for "kube-apiserver"                                                              
222
[control-plane] Creating static Pod manifest for "kube-controller-manager"                                                     
223
[control-plane] Creating static Pod manifest for "kube-scheduler"                                                              
224
[check-etcd] Checking that the etcd cluster is healthy                                                                         
225
error execution phase check-etcd: etcd cluster is not healthy: failed to dial endpoint https://[2a0a:e5c0:10:1:225:b3ff:fe20:37
226
8a]:2379 with maintenance client: context deadline exceeded                                                                    
227
To see the stack trace of this error execute with --v=5 or higher         
228
</pre>
229
230
Then the problem is likely that the old etcd member entry for this server still exists in the cluster. We first need to remove it from the etcd cluster; after that the join works.
231
232
To fix this we do:
233
234
* Find a working etcd pod
235
* Find the etcd members / member list
236
* Remove the etcd member that we want to re-join the cluster
237
238
239
<pre>
240
# Find the etcd pods
241
kubectl -n kube-system get pods -l component=etcd,tier=control-plane
242
243
# Get the list of etcd servers with the member id 
244
kubectl exec -n kube-system -ti ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
245
246
# Remove the member
247
kubectl exec -n kube-system -ti ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove MEMBERID
248
</pre>
249
250
Sample session:
251
252
<pre>
253
[10:48] line:~% kubectl -n kube-system get pods -l component=etcd,tier=control-plane
254
NAME            READY   STATUS    RESTARTS     AGE
255
etcd-server63   1/1     Running   0            3m11s
256
etcd-server65   1/1     Running   3            7d2h
257
etcd-server83   1/1     Running   8 (6d ago)   7d2h
258
[10:48] line:~% kubectl exec -n kube-system -ti etcd-server65 -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
259
356891cd676df6e4, started, server65, https://[2a0a:e5c0:10:1:225:b3ff:fe20:375c]:2380, https://[2a0a:e5c0:10:1:225:b3ff:fe20:375c]:2379, false
260
371b8a07185dee7e, started, server63, https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2380, https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2379, false
261
5942bc58307f8af9, started, server83, https://[2a0a:e5c0:10:1:3e4a:92ff:fe79:bb98]:2380, https://[2a0a:e5c0:10:1:3e4a:92ff:fe79:bb98]:2379, false
262
263
[10:48] line:~% kubectl exec -n kube-system -ti etcd-server65 -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove 371b8a07185dee7e
264
Member 371b8a07185dee7e removed from cluster e3c0805f592a8f77
265 1 Nico Schottelius
266
</pre>
267
268
SEE ALSO
269
270
* We found the solution using https://stackoverflow.com/questions/67921552/re-installed-node-cannot-join-kubernetes-cluster
271 56 Nico Schottelius
272 147 Nico Schottelius
h3. Node labels (adding, showing, removing)
273
274
Listing the labels:
275
276
<pre>
277
kubectl get nodes --show-labels
278
</pre>
279
280
Adding labels:
281
282
<pre>
283
kubectl label nodes LIST-OF-NODES label1=value1 
284
285
</pre>
286
287
For instance:
288
289
<pre>
290
kubectl label nodes router2 router3 hosttype=router 
291
</pre>
292
293
Selecting nodes in pods:
294
295
<pre>
296
apiVersion: v1
297
kind: Pod
298
...
299
spec:
300
  nodeSelector:
301
    hosttype: router
302
</pre>
303
304 148 Nico Schottelius
Removing labels by adding a minus at the end of the label name:
305
306
<pre>
307
kubectl label node <nodename> <labelname>-
308
</pre>
309
310
For instance:
311
312
<pre>
313
kubectl label nodes router2 router3 hosttype- 
314
</pre>
315
316 147 Nico Schottelius
SEE ALSO
317 1 Nico Schottelius
318 148 Nico Schottelius
* https://kubernetes.io/docs/tasks/configure-pod-container/assign-pods-nodes/
319
* https://stackoverflow.com/questions/34067979/how-to-delete-a-node-label-by-command-and-api
320 147 Nico Schottelius
321 199 Nico Schottelius
h3. Listing all pods on a node
322
323
<pre>
324
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=serverXX
325
</pre>
326
327
Found on https://stackoverflow.com/questions/62000559/how-to-list-all-the-pods-running-in-a-particular-worker-node-by-executing-a-comm
328
329 101 Nico Schottelius
h3. Hardware Maintenance using ungleich-hardware
330
331
Use the following manifest and replace the HOST with the actual host:
332
333
<pre>
334
apiVersion: v1
335
kind: Pod
336
metadata:
337
  name: ungleich-hardware-HOST
338
spec:
339
  containers:
340
  - name: ungleich-hardware
341
    image: ungleich/ungleich-hardware:0.0.5
342
    args:
343
    - sleep
344
    - "1000000"
345
    volumeMounts:
346
      - mountPath: /dev
347
        name: dev
348
    securityContext:
349
      privileged: true
350
  nodeSelector:
351
    kubernetes.io/hostname: "HOST"
352
353
  volumes:
354
    - name: dev
355
      hostPath:
356
        path: /dev
357
</pre>
358
359 102 Nico Schottelius
Also see: [[The_ungleich_hardware_maintenance_guide]]
360
361 105 Nico Schottelius
h3. Triggering a cronjob / creating a job from a cronjob
362 104 Nico Schottelius
363
To test a cronjob, we can create a job from a cronjob:
364
365
<pre>
366
kubectl create job --from=cronjob/volume2-daily-backup volume2-manual
367
</pre>
368
369
This creates a job volume2-manual based on the cronjob volume2-daily-backup.
370
371 112 Nico Schottelius
h3. su-ing into a user that has nologin shell set
372
373
Users often have nologin set as their shell inside the container. To be able to execute maintenance commands within the container, we can use @su -s /bin/sh@ like this:
375
376
<pre>
377
su -s /bin/sh -c '/path/to/your/script' testuser
378
</pre>
379
380
Found on https://serverfault.com/questions/351046/how-to-run-command-as-user-who-has-usr-sbin-nologin-as-shell
381
382 113 Nico Schottelius
h3. How to print a secret value
383
384
Assuming you want the "password" item from a secret, use:
385
386
<pre>
387
kubectl get secret SECRETNAME -o jsonpath="{.data.password}" | base64 -d; echo "" 
388
</pre>
389
390 209 Nico Schottelius
h3. Fixing the "ImageInspectError"
391
392
If you see this problem:
393
394
<pre>
395
# kubectl get pods
396
NAME                                                       READY   STATUS                   RESTARTS   AGE
397
bird-router-server137-bird-767f65bb47-g4xsh                0/1     Init:ImageInspectError   0          77d
398
bird-router-server137-openvpn-server120-5c987b7ffb-cn9xf   0/1     ImageInspectError        1          159d
399
bird-router-server137-unbound-5c6f5d4bb6-cxbpr             0/1     ImageInspectError        1          159d
400
</pre>
401
402
Fixes so far:
403
404
* correct registries.conf
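For crio the relevant file is usually @/etc/containers/registries.conf@. A minimal sketch of a working default (the registry list is an assumption, adjust to the mirrors actually in use):

<pre>
# /etc/containers/registries.conf
unqualified-search-registries = ["docker.io"]
</pre>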
405
406
407 173 Nico Schottelius
h3. How to upgrade a kubernetes cluster
408 172 Nico Schottelius
409
h4. General
410
411
* Should be done every X months to stay up-to-date
412
** X probably something like 3-6
413
* kubeadm based clusters
414
* Needs specific kubeadm versions for upgrade
415
* Follow instructions on https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
416 190 Nico Schottelius
* Finding releases: https://github.com/kubernetes/kubernetes/tree/master/CHANGELOG
417 172 Nico Schottelius
418
h4. Getting a specific kubeadm or kubelet version
419
420
<pre>
421 190 Nico Schottelius
RELEASE=v1.22.17
422
RELEASE=v1.23.17
423 181 Nico Schottelius
RELEASE=v1.24.9
424 1 Nico Schottelius
RELEASE=v1.25.9
425
RELEASE=v1.26.6
426 190 Nico Schottelius
RELEASE=v1.27.2
427
428 187 Nico Schottelius
ARCH=amd64
429 172 Nico Schottelius
430
curl -L --remote-name-all https://dl.k8s.io/release/${RELEASE}/bin/linux/${ARCH}/{kubeadm,kubelet}
431 182 Nico Schottelius
chmod u+x kubeadm kubelet
432 172 Nico Schottelius
</pre>
433
434
h4. Steps
435
436
* kubeadm upgrade plan
437
** On one control plane node
438
* kubeadm upgrade apply vXX.YY.ZZ
439
** On one control plane node
440 189 Nico Schottelius
* kubeadm upgrade node
441
** On all other control plane nodes
442
** On all worker nodes afterwards
443
444 172 Nico Schottelius
445 173 Nico Schottelius
Repeat for all control plane nodes. Then upgrade kubelet on all other nodes via the package manager.
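If the package manager does not ship the required version, a sketch using the binaries downloaded above; the kubelet path and OpenRC service name are assumptions (Alpine hosts), adjust for systemd:

<pre>
# on the node to upgrade, after draining it
rc-service kubelet stop
cp kubelet "$(which kubelet)"    # overwrite the existing binary
rc-service kubelet start
kubectl uncordon serverXX
</pre>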
446 172 Nico Schottelius
447 193 Nico Schottelius
h4. Upgrading to 1.22.17
448 1 Nico Schottelius
449 193 Nico Schottelius
* https://v1-22.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
450 194 Nico Schottelius
* Need to create a kubeadm config map
451 198 Nico Schottelius
** f.i. using the following
452
** @/usr/local/bin/kubeadm-v1.22.17   upgrade --config kubeadm.yaml --ignore-preflight-errors=CoreDNSUnsupportedPlugins,CoreDNSMigration apply -y v1.22.17@
453 193 Nico Schottelius
* Done for p6 on 2023-10-04
454
455
h4. Upgrading to 1.23.17
456
457
* https://v1-23.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
458
* No special notes
459
* Done for p6 on 2023-10-04
460
461
h4. Upgrading to 1.24.17
462
463
* https://v1-24.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
464
* No special notes
465
* Done for p6 on 2023-10-04
466
467
h4. Upgrading to 1.25.14
468
469
* https://v1-25.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
470
* No special notes
471
* Done for p6 on 2023-10-04
472
473
h4. Upgrading to 1.26.9
474
475 1 Nico Schottelius
* https://v1-26.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
476 193 Nico Schottelius
* No special notes
477
* Done for p6 on 2023-10-04
478 188 Nico Schottelius
479 196 Nico Schottelius
h4. Upgrading to 1.27
480 186 Nico Schottelius
481 192 Nico Schottelius
* https://v1-27.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
482 186 Nico Schottelius
* kubelet will not start anymore
483
* reason: @"command failed" err="failed to parse kubelet flag: unknown flag: --container-runtime"@
484
* /var/lib/kubelet/kubeadm-flags.env contains that parameter
485
* remove it, start kubelet
486 192 Nico Schottelius
487 197 Nico Schottelius
h4. Upgrading to 1.28
488 192 Nico Schottelius
489
* https://v1-28.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
490 186 Nico Schottelius
491
h4. Upgrade to crio 1.27: missing crun
492
493
Error message
494
495
<pre>
496
level=fatal msg="validating runtime config: runtime validation: \"crun\" not found in $PATH: exec: \"crun\": executable file not found in $PATH"
497
</pre>
498
499
Fix:
500
501
<pre>
502
apk add crun
503
</pre>
504
505 157 Nico Schottelius
h2. Reference CNI
506
507
* Mainly "stupid", but effective plugins
508
* Main documentation on https://www.cni.dev/plugins/current/
509 158 Nico Schottelius
* Plugins
510
** bridge
511
*** Can create the bridge on the host
512
*** But seems not to be able to add host interfaces to it as well
513
*** Has support for vlan tags
514
** vlan
515
*** creates vlan tagged sub interface on the host
516 160 Nico Schottelius
*** "It's a 1:1 mapping (i.e. no bridge in between)":https://github.com/k8snetworkplumbingwg/multus-cni/issues/569
517 158 Nico Schottelius
** host-device
518
*** moves the interface from the host into the container
519
*** very easy for physical connections to containers
520 159 Nico Schottelius
** ipvlan
521
*** "virtualisation" of a host device
522
*** routing based on IP
523
*** Same MAC for everyone
524
*** Cannot reach the master interface
525
** macvlan
526
*** With mac addresses
527
*** Supports various modes (to be checked)
528
** ptp ("point to point")
529
*** Creates a host device and connects it to the container
530
** win*
531 158 Nico Schottelius
*** Windows implementations
532 157 Nico Schottelius
533 62 Nico Schottelius
h2. Calico CNI
534
535
h3. Calico Installation
536
537
* We install "calico using helm":https://docs.projectcalico.org/getting-started/kubernetes/helm
538
* This has the following advantages:
539
** Easy to upgrade
540
** Does not require us to configure IPv6/dual stack settings, as the tigera operator figures things out on its own
541
542
Usually plain calico can be installed directly using:
543
544
<pre>
545 174 Nico Schottelius
VERSION=v3.25.0
546 149 Nico Schottelius
547 1 Nico Schottelius
helm repo add projectcalico https://docs.projectcalico.org/charts
548 167 Nico Schottelius
helm repo update
549 124 Nico Schottelius
helm upgrade --install --namespace tigera calico projectcalico/tigera-operator --version $VERSION --create-namespace
550 1 Nico Schottelius
</pre>
551 92 Nico Schottelius
552
* Check the tags on https://github.com/projectcalico/calico/tags for the latest release
553 62 Nico Schottelius
554
h3. Installing calicoctl
555
556 115 Nico Schottelius
* General installation instructions, including binary download: https://projectcalico.docs.tigera.io/maintenance/clis/calicoctl/install
557
558 62 Nico Schottelius
To be able to manage and configure calico, we need to 
559
"install calicoctl (we choose the version as a pod)":https://docs.projectcalico.org/getting-started/clis/calicoctl/install#install-calicoctl-as-a-kubernetes-pod
560
561
<pre>
562
kubectl apply -f https://docs.projectcalico.org/manifests/calicoctl.yaml
563
</pre>
564
565 93 Nico Schottelius
Or version specific:
566
567
<pre>
568
kubectl apply -f https://github.com/projectcalico/calico/blob/v3.20.4/manifests/calicoctl.yaml
569 97 Nico Schottelius
570
# For 3.22
571
kubectl apply -f https://projectcalico.docs.tigera.io/archive/v3.22/manifests/calicoctl.yaml
572 93 Nico Schottelius
</pre>
573
574 70 Nico Schottelius
And making it more easily accessible via an alias:
575
576
<pre>
577
alias calicoctl="kubectl exec -i -n kube-system calicoctl -- /calicoctl"
578
</pre>
579
580 62 Nico Schottelius
h3. Calico configuration
581
582 63 Nico Schottelius
By default our k8s clusters "BGP peer":https://docs.projectcalico.org/networking/bgp
583
with an upstream router to propagate podcidr and servicecidr.
584 62 Nico Schottelius
585
Default settings in our infrastructure:
586
587
* We use a full-mesh using the @nodeToNodeMeshEnabled: true@ option
588
* We keep the original next hop so that *only* the server with the pod is announcing it (instead of ecmp)
589 1 Nico Schottelius
* We use private ASNs for k8s clusters
590 63 Nico Schottelius
* We do *not* use any overlay
591 62 Nico Schottelius
592
After installing calico and calicoctl the last step of the installation is usually:
593
594 1 Nico Schottelius
<pre>
595 79 Nico Schottelius
calicoctl create -f - < calico-bgp.yaml
596 62 Nico Schottelius
</pre>
597
598
599
A sample BGP configuration:
600
601
<pre>
602
---
603
apiVersion: projectcalico.org/v3
604
kind: BGPConfiguration
605
metadata:
606
  name: default
607
spec:
608
  logSeverityScreen: Info
609
  nodeToNodeMeshEnabled: true
610
  asNumber: 65534
611
  serviceClusterIPs:
612
  - cidr: 2a0a:e5c0:10:3::/108
613
  serviceExternalIPs:
614
  - cidr: 2a0a:e5c0:10:3::/108
615
---
616
apiVersion: projectcalico.org/v3
617
kind: BGPPeer
618
metadata:
619
  name: router1-place10
620
spec:
621
  peerIP: 2a0a:e5c0:10:1::50
622
  asNumber: 213081
623
  keepOriginalNextHop: true
624
</pre>
625
626 126 Nico Schottelius
h2. Cilium CNI (experimental)
627
628 137 Nico Schottelius
h3. Status
629
630 138 Nico Schottelius
*NO WORKING CILIUM CONFIGURATION FOR IPV6 only modes*
631 137 Nico Schottelius
632 146 Nico Schottelius
h3. Latest error
633
634
It seems cilium does not run on IPv6 only hosts:
635
636
<pre>
637
level=info msg="Validating configured node address ranges" subsys=daemon
638
level=fatal msg="postinit failed" error="external IPv4 node address could not be derived, please configure via --ipv4-node" subsys=daemon
639
level=info msg="Starting IP identity watcher" subsys=ipcache
640
</pre>
641
642
It crashes after that log entry
643
644 128 Nico Schottelius
h3. BGP configuration
645
646
* The cilium-operator will not start without a correct configmap being present beforehand (see error message below)
647
* Creating the bgp config beforehand as a configmap is thus required.
648
649
The error one gets without the configmap present:
650
651
Pods are hanging with:
652
653
<pre>
654
cilium-bpqm6                       0/1     Init:0/4            0             9s
655
cilium-operator-5947d94f7f-5bmh2   0/1     ContainerCreating   0             9s
656
</pre>
657
658
The error message in the cilium-operator is:
659
660
<pre>
661
Events:
662
  Type     Reason       Age                From               Message
663
  ----     ------       ----               ----               -------
664
  Normal   Scheduled    80s                default-scheduler  Successfully assigned kube-system/cilium-operator-5947d94f7f-lqcsp to server56
665
  Warning  FailedMount  16s (x8 over 80s)  kubelet            MountVolume.SetUp failed for volume "bgp-config-path" : configmap "bgp-config" not found
666
</pre>
667
668
A correct bgp config looks like this:
669
670
<pre>
671
apiVersion: v1
672
kind: ConfigMap
673
metadata:
674
  name: bgp-config
675
  namespace: kube-system
676
data:
677
  config.yaml: |
678
    peers:
679
      - peer-address: 2a0a:e5c0::46
680
        peer-asn: 209898
681
        my-asn: 65533
682
      - peer-address: 2a0a:e5c0::47
683
        peer-asn: 209898
684
        my-asn: 65533
685
    address-pools:
686
      - name: default
687
        protocol: bgp
688
        addresses:
689
          - 2a0a:e5c0:0:14::/64
690
</pre>
691 127 Nico Schottelius
692
h3. Installation
693 130 Nico Schottelius
694 127 Nico Schottelius
Adding the repo
695 1 Nico Schottelius
<pre>
696 127 Nico Schottelius
697 129 Nico Schottelius
helm repo add cilium https://helm.cilium.io/
698 130 Nico Schottelius
helm repo update
699
</pre>
700 129 Nico Schottelius
701 135 Nico Schottelius
Installing + configuring cilium
702 129 Nico Schottelius
<pre>
703 130 Nico Schottelius
ipv6pool=2a0a:e5c0:0:14::/112
704 1 Nico Schottelius
705 146 Nico Schottelius
version=1.12.2
706 129 Nico Schottelius
707
helm upgrade --install cilium cilium/cilium --version $version \
708 1 Nico Schottelius
  --namespace kube-system \
709
  --set ipv4.enabled=false \
710
  --set ipv6.enabled=true \
711 146 Nico Schottelius
  --set enableIPv6Masquerade=false \
712
  --set bgpControlPlane.enabled=true 
713 1 Nico Schottelius
714 146 Nico Schottelius
#  --set ipam.operator.clusterPoolIPv6PodCIDRList=$ipv6pool
715
716
# Old style bgp?
717 136 Nico Schottelius
#   --set bgp.enabled=true --set bgp.announce.podCIDR=true \
718 127 Nico Schottelius
719
# Show possible configuration options
720
helm show values cilium/cilium
721
722 1 Nico Schottelius
</pre>
723 132 Nico Schottelius
724
Using a /64 for ipam.operator.clusterPoolIPv6PodCIDRList fails with:
725
726
<pre>
727
level=fatal msg="Unable to init cluster-pool allocator" error="unable to initialize IPv6 allocator New CIDR set failed; the node CIDR size is too big" subsys=cilium-operator-generic
728
</pre>
729
730 126 Nico Schottelius
731 1 Nico Schottelius
See also https://github.com/cilium/cilium/issues/20756
732 135 Nico Schottelius
733
Seems a /112 is actually working.
734
735
h3. Kernel modules
736
737
Cilium requires the following modules to be loaded on the host (not loaded by default):
738
739
<pre>
740 1 Nico Schottelius
modprobe  ip6table_raw
741
modprobe  ip6table_filter
742
</pre>
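To make the modules persistent across reboots (assuming Alpine/OpenRC hosts, as used elsewhere in this document), they can be added to @/etc/modules@:

<pre>
cat >> /etc/modules <<EOF
ip6table_raw
ip6table_filter
EOF
</pre>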
743 146 Nico Schottelius
744
h3. Interesting helm flags
745
746
* autoDirectNodeRoutes
747
* bgpControlPlane.enabled = true
748
749
h3. SEE ALSO
750
751
* https://docs.cilium.io/en/v1.12/helm-reference/
752 133 Nico Schottelius
753 179 Nico Schottelius
h2. Multus
754 168 Nico Schottelius
755
* https://github.com/k8snetworkplumbingwg/multus-cni
756
* Installing a deployment w/ CRDs
757 150 Nico Schottelius
758 169 Nico Schottelius
<pre>
759 176 Nico Schottelius
VERSION=v4.0.1
760 169 Nico Schottelius
761 170 Nico Schottelius
kubectl apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg/multus-cni/${VERSION}/deployments/multus-daemonset-crio.yml
762
</pre>
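Once multus is running, additional pod interfaces are defined as @NetworkAttachmentDefinition@ objects that reference one of the CNI plugins listed above. A minimal sketch using the vlan plugin; master interface, vlan id and prefix are assumptions for illustration:

<pre>
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: vlan-example
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "vlan",
      "master": "eth0",
      "vlanId": 10,
      "ipam": {
        "type": "host-local",
        "ranges": [ [ { "subnet": "2a0a:e5c0:XXXX::/64" } ] ]
      }
    }
</pre>

A pod then requests the extra interface via the annotation @k8s.v1.cni.cncf.io/networks: vlan-example@.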
763 169 Nico Schottelius
764 191 Nico Schottelius
h2. ArgoCD
765 56 Nico Schottelius
766 60 Nico Schottelius
h3. Argocd Installation
767 1 Nico Schottelius
768 116 Nico Schottelius
* See https://argo-cd.readthedocs.io/en/stable/
769
770 60 Nico Schottelius
As there is no configuration management present yet, argocd is installed using
771
772 1 Nico Schottelius
<pre>
773 60 Nico Schottelius
kubectl create namespace argocd
774 1 Nico Schottelius
775
# OR: latest stable
776
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
777
778 191 Nico Schottelius
# OR Specific Version
779
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/v2.3.2/manifests/install.yaml
780 56 Nico Schottelius
781 191 Nico Schottelius
782
</pre>
783 1 Nico Schottelius
784 60 Nico Schottelius
h3. Get the argocd credentials
785
786
<pre>
787
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d; echo ""
788
</pre>
789 52 Nico Schottelius
790 87 Nico Schottelius
h3. Accessing argocd
791
792
In regular IPv6 clusters:
793
794
* Navigate to https://argocd-server.argocd.CLUSTERDOMAIN
795
796
In legacy IPv4 clusters
797
798
<pre>
799
kubectl --namespace argocd port-forward svc/argocd-server 8080:80
800
</pre>
801
802 88 Nico Schottelius
* Navigate to https://localhost:8080
803
804 68 Nico Schottelius
h3. Using the argocd webhook to trigger changes
805 67 Nico Schottelius
806
* To trigger changes, POST json to https://argocd.example.com/api/webhook
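Normally the git server (gitea) sends this automatically once a webhook is configured. A hedged sketch of triggering it by hand, imitating a push event; the payload shape, headers and repository URL are assumptions:

<pre>
curl -X POST https://argocd.example.com/api/webhook \
  -H "Content-Type: application/json" \
  -H "X-GitHub-Event: push" \
  -d '{ "ref": "refs/heads/master",
        "repository": { "html_url": "https://code.ungleich.ch/ungleich-intern/k8s-config",
                        "default_branch": "master" } }'
</pre>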
807
808 72 Nico Schottelius
h3. Deploying an application
809
810
* Applications are deployed via git towards gitea (code.ungleich.ch) and then pulled by argo
811 73 Nico Schottelius
* Always include the *redmine-url* pointing to the (customer) ticket
812
** Also add the support-url if it exists
813 72 Nico Schottelius
814
Application sample
815
816
<pre>
817
apiVersion: argoproj.io/v1alpha1
818
kind: Application
819
metadata:
820
  name: gitea-CUSTOMER
821
  namespace: argocd
822
spec:
823
  destination:
824
    namespace: default
825
    server: 'https://kubernetes.default.svc'
826
  source:
827
    path: apps/prod/gitea
828
    repoURL: 'https://code.ungleich.ch/ungleich-intern/k8s-config.git'
829
    targetRevision: HEAD
830
    helm:
831
      parameters:
832
        - name: storage.data.storageClass
833
          value: rook-ceph-block-hdd
834
        - name: storage.data.size
835
          value: 200Gi
836
        - name: storage.db.storageClass
837
          value: rook-ceph-block-ssd
838
        - name: storage.db.size
839
          value: 10Gi
840
        - name: storage.letsencrypt.storageClass
841
          value: rook-ceph-block-hdd
842
        - name: storage.letsencrypt.size
843
          value: 50Mi
844
        - name: letsencryptStaging
845
          value: 'no'
846
        - name: fqdn
847
          value: 'code.verua.online'
848
  project: default
849
  syncPolicy:
850
    automated:
851
      prune: true
852
      selfHeal: true
853
  info:
854
    - name: 'redmine-url'
855
      value: 'https://redmine.ungleich.ch/issues/ISSUEID'
856
    - name: 'support-url'
857
      value: 'https://support.ungleich.ch/Ticket/Display.html?id=TICKETID'
858
</pre>
859
860 80 Nico Schottelius
h2. Helm related operations and conventions
861 55 Nico Schottelius
862 61 Nico Schottelius
We use helm charts extensively.
863
864
* In production, they are managed via argocd
865
* In development, helm charts can be developed and deployed manually using the helm utility.
866
867 55 Nico Schottelius
h3. Installing a helm chart
868
869
One can use the usual pattern of
870
871
<pre>
872
helm install <releasename> <chartdirectory>
873
</pre>
874
875
However, when testing helm charts you often want to reinstall/update. The following pattern is "better", because it also works if the release is already installed:
876
877
<pre>
878
helm upgrade --install <releasename> <chartdirectory>
879 1 Nico Schottelius
</pre>
880 80 Nico Schottelius
881
h3. Naming services and deployments in helm charts [Application labels]
882
883
* We always have {{ .Release.Name }} to identify the current "instance"
884
* Deployments:
885
** use @app: <what it is>@, f.i. @app: nginx@, @app: postgres@, ... (see the sketch after this list)
886 81 Nico Schottelius
* See more about standard labels on
887
** https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/
888
** https://helm.sh/docs/chart_best_practices/labels/
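A minimal sketch of how these conventions look in a chart template (names and image are placeholders):

<pre>
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-nginx
  labels:
    app: nginx
    app.kubernetes.io/instance: {{ .Release.Name }}
spec:
  selector:
    matchLabels:
      app: nginx
      app.kubernetes.io/instance: {{ .Release.Name }}
  template:
    metadata:
      labels:
        app: nginx
        app.kubernetes.io/instance: {{ .Release.Name }}
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
</pre>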
889 55 Nico Schottelius
890 151 Nico Schottelius
h3. Show all versions of a helm chart
891
892
<pre>
893
helm search repo -l repo/chart
894
</pre>
895
896
For example:
897
898
<pre>
899
% helm search repo -l projectcalico/tigera-operator 
900
NAME                         	CHART VERSION	APP VERSION	DESCRIPTION                            
901
projectcalico/tigera-operator	v3.23.3      	v3.23.3    	Installs the Tigera operator for Calico
902
projectcalico/tigera-operator	v3.23.2      	v3.23.2    	Installs the Tigera operator for Calico
903
....
904
</pre>
905
906 152 Nico Schottelius
h3. Show possible values of a chart
907
908
<pre>
909
helm show values <repo/chart>
910
</pre>
911
912
Example:
913
914
<pre>
915
helm show values ingress-nginx/ingress-nginx
916
</pre>
917
918 207 Nico Schottelius
h3. Show all possible charts in a repo
919
920
<pre>
921
helm search repo REPO
922
</pre>
923
924 178 Nico Schottelius
h3. Download a chart
925
926
For instance for checking it out locally. Use:
927
928
<pre>
929
helm pull <repo/chart>
930
</pre>
931 152 Nico Schottelius
932 139 Nico Schottelius
h2. Rook + Ceph
933
934
h3. Installation
935
936
* Usually directly via argocd
937
938 71 Nico Schottelius
h3. Executing ceph commands
939
940
Using the ceph-tools pod as follows:
941
942
<pre>
943
kubectl exec -n rook-ceph -ti $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o jsonpath='{.items[*].metadata.name}') -- ceph -s
944
</pre>
945
946 43 Nico Schottelius
h3. Inspecting the logs of a specific server
947
948
<pre>
949
# Get the related pods
950
kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare 
951
...
952
953
# Inspect the logs of a specific pod
954
kubectl -n rook-ceph logs -f rook-ceph-osd-prepare-server23--1-444qx
955
956 71 Nico Schottelius
</pre>
957
958
h3. Inspecting the logs of the rook-ceph-operator
959
960
<pre>
961
kubectl -n rook-ceph logs -f -l app=rook-ceph-operator
962 43 Nico Schottelius
</pre>
963
964 200 Nico Schottelius
h3. (Temporarily) Disabling the rook-operator
965
966
* First disable the sync in argocd
* Then scale it down
968
969
<pre>
970
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0
971
</pre>
972
973
When done with the work/maintenance, re-enable sync in argocd.
974
The following command is thus strictly speaking not required, as argocd will fix it on its own:
975
976
<pre>
977
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=1
978
</pre>
979
980 121 Nico Schottelius
h3. Restarting the rook operator
981
982
<pre>
983
kubectl -n rook-ceph delete pods  -l app=rook-ceph-operator
984
</pre>
985
986 43 Nico Schottelius
h3. Triggering server prepare / adding new osds
987
988
The rook-ceph-operator triggers/watches/creates pods to maintain hosts. To trigger a full "re scan", simply delete that pod:
989
990
<pre>
991
kubectl -n rook-ceph delete pods -l app=rook-ceph-operator
992
</pre>
993
994
This will cause all the @rook-ceph-osd-prepare-..@ jobs to be recreated and thus OSDs to be created, if new disks have been added.
995
996
h3. Removing an OSD
997
998
* See "Ceph OSD Management":https://rook.io/docs/rook/v1.7/ceph-osd-mgmt.html
999 77 Nico Schottelius
* More specifically: https://github.com/rook/rook/blob/release-1.7/cluster/examples/kubernetes/ceph/osd-purge.yaml
1000 99 Nico Schottelius
* Then delete the related deployment
1001 41 Nico Schottelius
1002 98 Nico Schottelius
Set the OSD id in osd-purge.yaml and apply it. The OSD should be down before doing so.
1003
1004
<pre>
1005
apiVersion: batch/v1
1006
kind: Job
1007
metadata:
1008
  name: rook-ceph-purge-osd
1009
  namespace: rook-ceph # namespace:cluster
1010
  labels:
1011
    app: rook-ceph-purge-osd
1012
spec:
1013
  template:
1014
    metadata:
1015
      labels:
1016
        app: rook-ceph-purge-osd
1017
    spec:
1018
      serviceAccountName: rook-ceph-purge-osd
1019
      containers:
1020
        - name: osd-removal
1021
          image: rook/ceph:master
1022
          # TODO: Insert the OSD ID in the last parameter that is to be removed
1023
          # The OSD IDs are a comma-separated list. For example: "0" or "0,2".
1024
          # If you want to preserve the OSD PVCs, set `--preserve-pvc true`.
1025
          #
1026
          # A --force-osd-removal option is available if the OSD should be destroyed even though the
1027
          # removal could lead to data loss.
1028
          args:
1029
            - "ceph"
1030
            - "osd"
1031
            - "remove"
1032
            - "--preserve-pvc"
1033
            - "false"
1034
            - "--force-osd-removal"
1035
            - "false"
1036
            - "--osd-ids"
1037
            - "SETTHEOSDIDHERE"
1038
          env:
1039
            - name: POD_NAMESPACE
1040
              valueFrom:
1041
                fieldRef:
1042
                  fieldPath: metadata.namespace
1043
            - name: ROOK_MON_ENDPOINTS
1044
              valueFrom:
1045
                configMapKeyRef:
1046
                  key: data
1047
                  name: rook-ceph-mon-endpoints
1048
            - name: ROOK_CEPH_USERNAME
1049
              valueFrom:
1050
                secretKeyRef:
1051
                  key: ceph-username
1052
                  name: rook-ceph-mon
1053
            - name: ROOK_CEPH_SECRET
1054
              valueFrom:
1055
                secretKeyRef:
1056
                  key: ceph-secret
1057
                  name: rook-ceph-mon
1058
            - name: ROOK_CONFIG_DIR
1059
              value: /var/lib/rook
1060
            - name: ROOK_CEPH_CONFIG_OVERRIDE
1061
              value: /etc/rook/config/override.conf
1062
            - name: ROOK_FSID
1063
              valueFrom:
1064
                secretKeyRef:
1065
                  key: fsid
1066
                  name: rook-ceph-mon
1067
            - name: ROOK_LOG_LEVEL
1068
              value: DEBUG
1069
          volumeMounts:
1070
            - mountPath: /etc/ceph
1071
              name: ceph-conf-emptydir
1072
            - mountPath: /var/lib/rook
1073
              name: rook-config
1074
      volumes:
1075
        - emptyDir: {}
1076
          name: ceph-conf-emptydir
1077
        - emptyDir: {}
1078
          name: rook-config
1079
      restartPolicy: Never
1080
1081
1082 99 Nico Schottelius
</pre>
1083
1084 1 Nico Schottelius
Deleting the deployment:
1085
1086
<pre>
1087
[18:05] bridge:~% kubectl -n rook-ceph delete deployment rook-ceph-osd-6
1088 99 Nico Schottelius
deployment.apps "rook-ceph-osd-6" deleted
1089
</pre>
1090 185 Nico Schottelius
1091
h3. Placement of mons/osds/etc.
1092
1093
See https://rook.io/docs/rook/v1.11/CRDs/Cluster/ceph-cluster-crd/#placement-configuration-settings
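A minimal sketch of such a placement block in the CephCluster resource; the @hosttype: storage@ label is an assumption for illustration (see the node label section above):

<pre>
apiVersion: ceph.rook.io/v1
kind: CephCluster
...
spec:
  placement:
    mon:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: hosttype
                  operator: In
                  values:
                    - storage
</pre>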
1094 98 Nico Schottelius
1095 145 Nico Schottelius
h2. Ingress + Cert Manager
1096
1097
* We deploy "nginx-ingress":https://docs.nginx.com/nginx-ingress-controller/ to get an ingress
1098
* we deploy "cert-manager":https://cert-manager.io/ to handle certificates
1099
* We deploy the @ClusterIssuer@ independently, so that the cert-manager app can deploy first and the issuer is created once the cert-manager CRDs are in place
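A minimal sketch of such a @ClusterIssuer@ using letsencrypt with an http01 solver through the nginx ingress; name and email are placeholders:

<pre>
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-production
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: mail@example.com
    privateKeySecretRef:
      name: letsencrypt-production
    solvers:
      - http01:
          ingress:
            class: nginx
</pre>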
1100
1101
h3. IPv4 reachability 
1102
1103
The ingress is by default IPv6 only. To make it reachable from the IPv4 world, get its IPv6 address and configure a NAT64 mapping in Jool.
1104
1105
Steps:
1106
1107
h4. Get the ingress IPv6 address
1108
1109
Use @kubectl -n ingress-nginx get svc ingress-nginx-controller -o jsonpath='{.spec.clusterIP}'; echo ''@
1110
1111
Example:
1112
1113
<pre>
1114
kubectl -n ingress-nginx get svc ingress-nginx-controller -o jsonpath='{.spec.clusterIP}'; echo ''
1115
2a0a:e5c0:10:1b::ce11
1116
</pre>
1117
1118
h4. Add NAT64 mapping
1119
1120
* Update the __dcl_jool_siit cdist type
1121
* Record the two IPs (IPv6 and IPv4)
1122
* Configure all routers
1123
1124
1125
h4. Add DNS record
1126
1127
To make the ingress usable as a CNAME destination, create an "ingress" DNS record, such as:
1128
1129
<pre>
1130
; k8s ingress for dev
1131
dev-ingress                 AAAA 2a0a:e5c0:10:1b::ce11
1132
dev-ingress                 A 147.78.194.23
1133
1134
</pre> 
1135
1136
h4. Add supporting wildcard DNS
1137
1138
If you plan to add various sites under a specific domain, we can add a wildcard DNS entry, such as *.k8s-dev.django-hosting.ch:
1139
1140
<pre>
1141
*.k8s-dev         CNAME dev-ingress.ungleich.ch.
1142
</pre>
1143
1144 76 Nico Schottelius
h2. Harbor
1145
1146 175 Nico Schottelius
* We use "Harbor":https://goharbor.io/ as an image registry for our own images. Internal app reference: apps/prod/harbor.
1147
* The admin password is in the password store, it is Harbor12345 by default
1148 76 Nico Schottelius
* At the moment harbor only authenticates against the internal ldap tree
1149
1150
h3. LDAP configuration
1151
1152
* The url needs to be ldaps://...
1153
* uid = uid
1154
* rest standard
1155 75 Nico Schottelius
1156 89 Nico Schottelius
h2. Monitoring / Prometheus
1157
1158 90 Nico Schottelius
* Via "kube-prometheus":https://github.com/prometheus-operator/kube-prometheus/
1159 89 Nico Schottelius
1160 91 Nico Schottelius
Access via ...
1161
1162
* http://prometheus-k8s.monitoring.svc:9090
1163
* http://grafana.monitoring.svc:3000
1164
* http://alertmanager.monitoring.svc:9093
1165
1166
1167 100 Nico Schottelius
h3. Prometheus Options
1168
1169
* "helm/kube-prometheus-stack":https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
1170
** Includes dashboards and co.
1171
* "manifest based kube-prometheus":https://github.com/prometheus-operator/kube-prometheus
1172
** Includes dashboards and co.
1173
* "Prometheus Operator (mainly CRD manifest":https://github.com/prometheus-operator/prometheus-operator
1174
1175 171 Nico Schottelius
h3. Grafana default password
1176
1177
* If not changed: @prom-operator@
1178
1179 82 Nico Schottelius
h2. Nextcloud
1180
1181 85 Nico Schottelius
h3. How to get the nextcloud credentials 
1182 84 Nico Schottelius
1183
* The initial username is set to "nextcloud"
1184
* The password is autogenerated and saved in a kubernetes secret
1185
1186
<pre>
1187 85 Nico Schottelius
kubectl get secret RELEASENAME-nextcloud -o jsonpath="{.data.PASSWORD}" | base64 -d; echo "" 
1188 84 Nico Schottelius
</pre>
1189
1190 83 Nico Schottelius
h3. How to fix "Access through untrusted domain"
1191
1192 82 Nico Schottelius
* Nextcloud stores the initial domain configuration
1193 1 Nico Schottelius
* If the FQDN is changed, it will show the error message "Access through untrusted domain"
1194 82 Nico Schottelius
* To fix, edit /var/www/html/config/config.php and correct the domain
1195 1 Nico Schottelius
* Then delete the pods
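The relevant part of @config.php@ is the @trusted_domains@ array; a sketch with a placeholder FQDN:

<pre>
'trusted_domains' =>
  array (
    0 => 'nextcloud.example.com',
  ),
</pre>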
1196 165 Nico Schottelius
1197
h3. Running occ commands inside the nextcloud container
1198
1199
* Find the pod in the right namespace
1200
1201
Exec:
1202
1203
<pre>
1204
su www-data -s /bin/sh -c ./occ
1205
</pre>
1206
1207
* -s /bin/sh is needed as the default shell is set to /bin/false
1208
1209 166 Nico Schottelius
h4. Rescanning files
1210 165 Nico Schottelius
1211 166 Nico Schottelius
* If files have been added without nextcloud's knowledge
1212
1213
<pre>
1214
su www-data -s /bin/sh -c "./occ files:scan --all"
1215
</pre>
1216 82 Nico Schottelius
1217 201 Nico Schottelius
h2. Sealed Secrets
1218
1219 202 Jin-Guk Kwon
* install kubeseal
1220 1 Nico Schottelius
1221 202 Jin-Guk Kwon
<pre>
1222
KUBESEAL_VERSION='0.23.0'
1223
wget "https://github.com/bitnami-labs/sealed-secrets/releases/download/v${KUBESEAL_VERSION:?}/kubeseal-${KUBESEAL_VERSION:?}-linux-amd64.tar.gz" 
1224
tar -xvzf kubeseal-${KUBESEAL_VERSION:?}-linux-amd64.tar.gz kubeseal
1225
sudo install -m 755 kubeseal /usr/local/bin/kubeseal
1226
</pre>
1227
1228
* create key for sealed-secret
1229
1230
<pre>
1231
kubeseal --fetch-cert > /tmp/public-key-cert.pem
1232
</pre>
1233
1234
* create the secret
1235
1236
<pre>
1237 203 Jin-Guk Kwon
ex)
1238 202 Jin-Guk Kwon
apiVersion: v1
1239
kind: Secret
1240
metadata:
1241
  name: Release.Name-postgres-config
1242
  annotations:
1243
    secret-generator.v1.mittwald.de/autogenerate: POSTGRES_PASSWORD
1244
    hosting: Release.Name
1245
  labels:
1246
    app.kubernetes.io/instance: Release.Name
1247
    app.kubernetes.io/component: postgres
1248
stringData:
1249
  POSTGRES_USER: postgresUser
1250
  POSTGRES_DB: postgresDBName
1251
  POSTGRES_INITDB_ARGS: "--no-locale --encoding=UTF8"
1252
</pre>
1253
1254
* convert secret.yaml to sealed-secret.yaml
1255
1256
<pre>
1257
kubeseal -n <namespace> --cert=/tmp/public-key-cert.pem --format=yaml < ./secret.yaml  > ./sealed-secret.yaml
1258
</pre>
1259
1260
* use sealed-secret.yaml on helm-chart directory
1261 201 Nico Schottelius
1262 205 Jin-Guk Kwon
* See tickets #11989 and #12120
1263 204 Jin-Guk Kwon
1264 1 Nico Schottelius
h2. Infrastructure versions
1265 35 Nico Schottelius
1266 57 Nico Schottelius
h3. ungleich kubernetes infrastructure v5 (2021-10)
1267 1 Nico Schottelius
1268 57 Nico Schottelius
Clusters are configured / setup in this order:
1269
1270
* Bootstrap via kubeadm
1271 59 Nico Schottelius
* "Networking via calico + BGP (non ECMP) using helm":https://docs.projectcalico.org/getting-started/kubernetes/helm
1272
* "ArgoCD for CD":https://argo-cd.readthedocs.io/en/stable/
1273
** "rook for storage via argocd":https://rook.io/
1274 58 Nico Schottelius
** haproxy for in IPv6-cluster-IPv4-to-IPv6 proxy via argocd
1275
** "kubernetes-secret-generator for in cluster secrets":https://github.com/mittwald/kubernetes-secret-generator
1276
** "ungleich-certbot managing certs and nginx":https://hub.docker.com/repository/docker/ungleich/ungleich-certbot
1277
1278 57 Nico Schottelius
1279
h3. ungleich kubernetes infrastructure v4 (2021-09)
1280
1281 54 Nico Schottelius
* rook is configured via manifests instead of using the rook-ceph-cluster helm chart
1282 1 Nico Schottelius
* The rook operator is still being installed via helm
1283 35 Nico Schottelius
1284 57 Nico Schottelius
h3. ungleich kubernetes infrastructure v3 (2021-07)
1285 1 Nico Schottelius
1286 10 Nico Schottelius
* rook is now installed via helm via argocd instead of directly via manifests
1287 28 Nico Schottelius
1288 57 Nico Schottelius
h3. ungleich kubernetes infrastructure v2 (2021-05)
1289 28 Nico Schottelius
1290
* Replaced fluxv2 from ungleich k8s v1 with argocd
1291 1 Nico Schottelius
** argocd can apply helm templates directly without needing to go through Chart releases
1292 28 Nico Schottelius
* We are also using argoflow for build flows
1293
* Planned to add "kaniko":https://github.com/GoogleContainerTools/kaniko for image building
1294
1295 57 Nico Schottelius
h3. ungleich kubernetes infrastructure v1 (2021-01)
1296 28 Nico Schottelius
1297
We are using the following components:
1298
1299
* "Calico as a CNI":https://www.projectcalico.org/ with BGP, IPv6 only, no encapsulation
1300
** Needed for basic networking
1301
* "kubernetes-secret-generator":https://github.com/mittwald/kubernetes-secret-generator for creating secrets
1302
** Needed so that secrets are not stored in the git repository, but only in the cluster
1303
* "ungleich-certbot":https://hub.docker.com/repository/docker/ungleich/ungleich-certbot
1304
** Needed to get letsencrypt certificates for services
1305
* "rook with ceph rbd + cephfs":https://rook.io/ for storage
1306
** rbd for almost everything, *ReadWriteOnce*
1307
** cephfs for smaller things, multi access *ReadWriteMany*
1308
** Needed for providing persistent storage
1309
* "flux v2":https://fluxcd.io/
1310
** Needed to manage resources automatically