h1. The ungleich kubernetes infrastructure and ungleich kubernetes manual

{{toc}}

h2. Status

This document is **pre-production**.
This document is to become the ungleich kubernetes infrastructure overview as well as the ungleich kubernetes manual.

h2. k8s clusters

| Cluster            | Purpose/Setup     | Maintainer | Master(s)                     | argo                                                   | v4 http proxy | last verified |
| c0.k8s.ooo         | Dev               | -          | UNUSED                        |                                                        |               |    2021-10-05 |
| c1.k8s.ooo         | retired           |            | -                             |                                                        |               |    2022-03-15 |
| c2.k8s.ooo         | Dev p7 HW         | Nico       | server47 server53 server54    | "argo":https://argocd-server.argocd.svc.c2.k8s.ooo     |               |    2021-10-05 |
| c3.k8s.ooo         | retired           | -          | -                             |                                                        |               |    2021-10-05 |
| c4.k8s.ooo         | Dev2 p7 HW        | Jin-Guk    | server52 server53 server54    |                                                        |               |             - |
| c5.k8s.ooo         | retired           |            | -                             |                                                        |               |    2022-03-15 |
| c6.k8s.ooo         | Dev p6 VM Jin-Guk | Jin-Guk    |                               |                                                        |               |               |
| [[p5.k8s.ooo]]     | production        |            | server34 server36 server38    | "argo":https://argocd-server.argocd.svc.p5.k8s.ooo     | -             |               |
| [[p5-cow.k8s.ooo]] | production        | Nico       | server47 server51 server55    | "argo":https://argocd-server.argocd.svc.p5-cow.k8s.ooo |               |    2022-08-27 |
| [[p6.k8s.ooo]]     | production        |            | server67 server69 server71    | "argo":https://argocd-server.argocd.svc.p6.k8s.ooo     | 147.78.194.13 |    2021-10-05 |
| [[p6-cow.k8s.ooo]] | production        |            | server134 server135 server136 | "argo":https://argocd-server.argocd.svc.p6in10.k8s.ooo | ?             |    2023-05-17 |
| [[p10.k8s.ooo]]    | production        |            | server131 server132 server133 | "argo":https://argocd-server.argocd.svc.p10.k8s.ooo    | 147.78.194.12 |    2021-10-05 |
| [[k8s.ge.nau.so]]  | development       |            | server107 server108 server109 | "argo":https://argocd-server.argocd.svc.k8s.ge.nau.so  |               |               |
| [[dev.k8s.ooo]]    | development       |            | server110 server111 server112 | "argo":https://argocd-server.argocd.svc.dev.k8s.ooo    | -             |    2022-07-08 |
| [[r1r2p15k8sooo|r1.p15.k8s.ooo]] | production | Nico | server120 | | | 2022-10-30 |
| [[r1r2p15k8sooo|r2.p15.k8s.ooo]] | production | Nico | server121 | | | 2022-09-06 |
| [[r1r2p10k8sooo|r1.p10.k8s.ooo]] | production | Nico | server122 | | | 2022-10-30 |
| [[r1r2p10k8sooo|r2.p10.k8s.ooo]] | production | Nico | server123 | | | 2022-10-15 |
| [[r1r2p5k8sooo|r1.p5.k8s.ooo]] | production | Nico | server137 | | | 2022-10-30 |
| [[r1r2p5k8sooo|r2.p5.k8s.ooo]] | production | Nico | server138 | | | 2022-10-30 |
| [[r1r2p6k8sooo|r1.p6.k8s.ooo]] | production | Nico | server139 | | | 2022-10-30 |
| [[r1r2p6k8sooo|r2.p6.k8s.ooo]] | production | Nico | server140 | | | 2022-10-30 |

h2. General architecture and components overview

* All k8s clusters are IPv6 only
* We use BGP peering to propagate podcidr and serviceCidr networks to our infrastructure
* The main public testing repository is "ungleich-k8s":https://code.ungleich.ch/ungleich-public/ungleich-k8s
** Private configurations are found in the **k8s-config** repository

h3. Cluster types

| **Type/Feature**            | **Development**                | **Production**         |
| Min No. nodes               | 3 (1 master, 3 worker)         | 5 (3 master, 3 worker) |
| Recommended minimum         | 4 (dedicated master, 3 worker) | 8 (3 master, 5 worker) |
| Separation of control plane | optional                       | recommended            |
| Persistent storage          | required                       | required               |
| Number of storage monitors  | 3                              | 5                      |

h2. General k8s operations

h3. Cheat sheet / external great references

* "kubectl cheatsheet":https://kubernetes.io/docs/reference/kubectl/cheatsheet/

h3. Allowing to schedule work on the control plane / removing node taints

* Mostly for single node / test / development clusters
* Just remove the master taint as follows

<pre>
kubectl taint nodes --all node-role.kubernetes.io/master-
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
</pre>

You can check the node taints using @kubectl describe node ...@

h3. Adding taints

* For instance to limit nodes to specific customers

<pre>
kubectl taint nodes serverXX customer=CUSTOMERNAME:NoSchedule
</pre>
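
A workload that should run on such a tainted node then needs a matching toleration, and typically a nodeSelector or affinity to actually land there. A minimal sketch only, assuming the taint from above and a hypothetical @customer@ node label:

<pre>
apiVersion: v1
kind: Pod
metadata:
  name: customer-workload
spec:
  # Tolerate the customer taint set above (key/value/effect must match)
  tolerations:
    - key: "customer"
      operator: "Equal"
      value: "CUSTOMERNAME"
      effect: "NoSchedule"
  # Optionally pin the pod to the customer's nodes via a label (hypothetical label)
  nodeSelector:
    customer: CUSTOMERNAME
  containers:
    - name: app
      image: nginx
</pre>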

h3. Get the cluster admin.conf

* On the masters of each cluster you can find the file @/etc/kubernetes/admin.conf@
* To be able to administrate the cluster you can copy the admin.conf to your local machine
* Multi cluster debugging becomes very easy if you name the config ~/cX-admin.conf (see example below)

<pre>
% scp root@server47.place7.ungleich.ch:/etc/kubernetes/admin.conf ~/c2-admin.conf
% export KUBECONFIG=~/c2-admin.conf
% kubectl get nodes
NAME       STATUS                     ROLES                  AGE   VERSION
server47   Ready                      control-plane,master   82d   v1.22.0
server48   Ready                      control-plane,master   82d   v1.22.0
server49   Ready                      <none>                 82d   v1.22.0
server50   Ready                      <none>                 82d   v1.22.0
server59   Ready                      control-plane,master   82d   v1.22.0
server60   Ready,SchedulingDisabled   <none>                 82d   v1.22.0
server61   Ready                      <none>                 82d   v1.22.0
server62   Ready                      <none>                 82d   v1.22.0
</pre>

h3. Installing a new k8s cluster

* Decide on the cluster name (usually *cX.k8s.ooo*), X counting upwards
** Using pXX.k8s.ooo for production clusters of placeXX
* Use cdist to configure the nodes with requirements like crio
* Decide between single or multi node control plane setups (see below)
** Single control plane suitable for development clusters

Typical init procedure:

h4. Single control plane:

<pre>
kubeadm init --config bootstrap/XXX/kubeadm.yaml
</pre>

h4. Multi control plane (HA):

<pre>
kubeadm init --config bootstrap/XXX/kubeadm.yaml --upload-certs
</pre>
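
The actual contents of @bootstrap/XXX/kubeadm.yaml@ live in the private k8s-config repository. As a rough sketch only, a minimal kubeadm configuration for an IPv6-only cluster could look like the following; the version, endpoint and CIDRs here are placeholders, not the values of any real cluster:

<pre>
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.27.2
# Placeholder: shared API endpoint used by all control plane nodes
controlPlaneEndpoint: "cX-api.k8s.ooo:6443"
networking:
  # Placeholder CIDRs; real values are cluster specific
  podSubnet: 2001:db8:0:1::/64
  serviceSubnet: 2001:db8:0:2::/108
</pre>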

h3. Deleting a pod that is hanging in terminating state

<pre>
kubectl delete pod <PODNAME> --grace-period=0 --force --namespace <NAMESPACE>
</pre>

(from https://stackoverflow.com/questions/35453792/pods-stuck-in-terminating-status)

h3. Listing nodes of a cluster

<pre>
[15:05] bridge:~% kubectl get nodes
NAME       STATUS   ROLES                  AGE   VERSION
server22   Ready    <none>                 52d   v1.22.0
server23   Ready    <none>                 52d   v1.22.2
server24   Ready    <none>                 52d   v1.22.0
server25   Ready    <none>                 52d   v1.22.0
server26   Ready    <none>                 52d   v1.22.0
server27   Ready    <none>                 52d   v1.22.0
server63   Ready    control-plane,master   52d   v1.22.0
server64   Ready    <none>                 52d   v1.22.0
server65   Ready    control-plane,master   52d   v1.22.0
server66   Ready    <none>                 52d   v1.22.0
server83   Ready    control-plane,master   52d   v1.22.0
server84   Ready    <none>                 52d   v1.22.0
server85   Ready    <none>                 52d   v1.22.0
server86   Ready    <none>                 52d   v1.22.0
</pre>

h3. Removing / draining a node

Usually @kubectl drain server@ should do the job, but sometimes we need to be more aggressive:

<pre>
kubectl drain --delete-emptydir-data --ignore-daemonsets serverXX
</pre>

h3. Readding a node after draining

<pre>
kubectl uncordon serverXX
</pre>

h3. (Re-)joining worker nodes after creating the cluster

* We need to have an up-to-date token
* We use different join commands for the workers and control plane nodes

Generating the join command on an existing control plane node:

<pre>
kubeadm token create --print-join-command
</pre>
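
The printed command is then run as-is on the worker node that should (re-)join. For illustration only (endpoint, token and hash are placeholders, compare the control plane example in the next section):

<pre>
# Run on the joining worker node
kubeadm join cX-api.k8s.ooo:6443 --token TOKEN --discovery-token-ca-cert-hash sha256:HASH
</pre>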

h3. (Re-)joining control plane nodes after creating the cluster

* We generate the token again
* We upload the certificates
* We need to combine/create the join command for the control plane node

Example session:

<pre>
% kubeadm token create --print-join-command
kubeadm join p10-api.k8s.ooo:6443 --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash

% kubeadm init phase upload-certs --upload-certs
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:
CERTKEY

# Then we use these two outputs on the joining node:

kubeadm join p10-api.k8s.ooo:6443 --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash --control-plane --certificate-key CERTKEY
</pre>

Commands to be used on a control plane node:

<pre>
kubeadm token create --print-join-command
kubeadm init phase upload-certs --upload-certs
</pre>

Commands to be used on the joining node:

<pre>
JOINCOMMAND --control-plane --certificate-key CERTKEY
</pre>

SEE ALSO

* https://stackoverflow.com/questions/63936268/how-to-generate-kubeadm-token-for-secondary-control-plane-nodes
* https://blog.scottlowe.org/2019/08/15/reconstructing-the-join-command-for-kubeadm/
h3. How to fix etcd does not start when rejoining a kubernetes cluster as a control plane

If during the above step etcd does not come up, @kubeadm join@ can hang as follows:

<pre>
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[check-etcd] Checking that the etcd cluster is healthy
error execution phase check-etcd: etcd cluster is not healthy: failed to dial endpoint https://[2a0a:e5c0:10:1:225:b3ff:fe20:37
8a]:2379 with maintenance client: context deadline exceeded
To see the stack trace of this error execute with --v=5 or higher
</pre>

Then the problem is likely that the etcd server is still a member of the cluster. We first need to remove it from the etcd cluster and then the join works.

To fix this we do:

* Find a working etcd pod
* Find the etcd members / member list
* Remove the etcd member that we want to re-join to the cluster

<pre>
# Find the etcd pods
kubectl -n kube-system get pods -l component=etcd,tier=control-plane

# Get the list of etcd servers with the member id
kubectl exec -n kube-system -ti ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list

# Remove the member
kubectl exec -n kube-system -ti ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove MEMBERID
</pre>

Sample session:

<pre>
[10:48] line:~% kubectl -n kube-system get pods -l component=etcd,tier=control-plane
NAME            READY   STATUS    RESTARTS     AGE
etcd-server63   1/1     Running   0            3m11s
etcd-server65   1/1     Running   3            7d2h
etcd-server83   1/1     Running   8 (6d ago)   7d2h
[10:48] line:~% kubectl exec -n kube-system -ti etcd-server65 -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
356891cd676df6e4, started, server65, https://[2a0a:e5c0:10:1:225:b3ff:fe20:375c]:2380, https://[2a0a:e5c0:10:1:225:b3ff:fe20:375c]:2379, false
371b8a07185dee7e, started, server63, https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2380, https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2379, false
5942bc58307f8af9, started, server83, https://[2a0a:e5c0:10:1:3e4a:92ff:fe79:bb98]:2380, https://[2a0a:e5c0:10:1:3e4a:92ff:fe79:bb98]:2379, false

[10:48] line:~% kubectl exec -n kube-system -ti etcd-server65 -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove 371b8a07185dee7e
Member 371b8a07185dee7e removed from cluster e3c0805f592a8f77
</pre>

SEE ALSO

* We found the solution using https://stackoverflow.com/questions/67921552/re-installed-node-cannot-join-kubernetes-cluster

h3. Node labels (adding, showing, removing)

Listing the labels:

<pre>
kubectl get nodes --show-labels
</pre>

Adding labels:

<pre>
kubectl label nodes LIST-OF-NODES label1=value1
</pre>

For instance:

<pre>
kubectl label nodes router2 router3 hosttype=router
</pre>

Selecting nodes in pods:

<pre>
apiVersion: v1
kind: Pod
...
spec:
  nodeSelector:
    hosttype: router
</pre>

Removing labels by adding a minus at the end of the label name:

<pre>
kubectl label node <nodename> <labelname>-
</pre>

For instance:

<pre>
kubectl label nodes router2 router3 hosttype-
</pre>

SEE ALSO

* https://kubernetes.io/docs/tasks/configure-pod-container/assign-pods-nodes/
* https://stackoverflow.com/questions/34067979/how-to-delete-a-node-label-by-command-and-api

h3. Listing all pods on a node

<pre>
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=serverXX
</pre>

Found on https://stackoverflow.com/questions/62000559/how-to-list-all-the-pods-running-in-a-particular-worker-node-by-executing-a-comm

h3. Hardware Maintenance using ungleich-hardware

Use the following manifest and replace the HOST with the actual host:

<pre>
apiVersion: v1
kind: Pod
metadata:
  name: ungleich-hardware-HOST
spec:
  containers:
  - name: ungleich-hardware
    image: ungleich/ungleich-hardware:0.0.5
    args:
    - sleep
    - "1000000"
    volumeMounts:
      - mountPath: /dev
        name: dev
    securityContext:
      privileged: true
  nodeSelector:
    kubernetes.io/hostname: "HOST"

  volumes:
    - name: dev
      hostPath:
        path: /dev
</pre>
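
To use it, apply the manifest and exec into the pod; the pod name follows the @name:@ field above (with HOST replaced). The file name here is just whatever you saved the manifest as:

<pre>
kubectl apply -f ungleich-hardware.yaml
kubectl exec -ti ungleich-hardware-serverXX -- /bin/sh   # or bash, depending on the image

# When the maintenance is done, remove the pod again
kubectl delete pod ungleich-hardware-serverXX
</pre>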

Also see: [[The_ungleich_hardware_maintenance_guide]]

h3. Triggering a cronjob / creating a job from a cronjob

To test a cronjob, we can create a job from a cronjob:

<pre>
kubectl create job --from=cronjob/volume2-daily-backup volume2-manual
</pre>

This creates a job volume2-manual based on the cronjob volume2-daily-backup.

h3. su-ing into a user that has nologin shell set

Often users have nologin set as their shell inside the container. To be able to execute maintenance commands within the container, we can use @su -s /bin/sh@ like this:

<pre>
su -s /bin/sh -c '/path/to/your/script' testuser
</pre>

Found on https://serverfault.com/questions/351046/how-to-run-command-as-user-who-has-usr-sbin-nologin-as-shell

h3. How to print a secret value

Assuming you want the "password" item from a secret, use:

<pre>
kubectl get secret SECRETNAME -o jsonpath="{.data.password}" | base64 -d; echo ""
</pre>

h3. How to upgrade a kubernetes cluster

h4. General

* Should be done every X months to stay up-to-date
** X probably something like 3-6
* Applies to kubeadm based clusters
* Needs specific kubeadm versions for upgrade
* Follow instructions on https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
* Finding releases: https://github.com/kubernetes/kubernetes/tree/master/CHANGELOG

h4. Getting a specific kubeadm or kubelet version

<pre>
RELEASE=v1.22.17
RELEASE=v1.23.17
RELEASE=v1.24.9
RELEASE=v1.25.9
RELEASE=v1.26.6
RELEASE=v1.27.2

ARCH=amd64

curl -L --remote-name-all https://dl.k8s.io/release/${RELEASE}/bin/linux/${ARCH}/{kubeadm,kubelet}
chmod u+x kubeadm kubelet
</pre>

h4. Steps

* kubeadm upgrade plan
** On one control plane node
* kubeadm upgrade apply vXX.YY.ZZ
** On one control plane node
* kubeadm upgrade node
** On all other control plane nodes
** On all worker nodes afterwards

Repeat for all control plane nodes. Then upgrade the kubelet on all other nodes via the package manager.
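
As a rough per-node sketch (assuming the versioned binaries downloaded above were placed as @/usr/local/bin/kubeadm-vX.YY.ZZ@, following the 1.22.17 notes below):

<pre>
# On the first control plane node
/usr/local/bin/kubeadm-vX.YY.ZZ upgrade plan
/usr/local/bin/kubeadm-vX.YY.ZZ upgrade apply vX.YY.ZZ

# On every other control plane node, and afterwards on the workers
kubectl drain serverXX --ignore-daemonsets --delete-emptydir-data
/usr/local/bin/kubeadm-vX.YY.ZZ upgrade node
# replace the kubelet binary / upgrade the package, then restart it
# (systemctl restart kubelet, or rc-service kubelet restart on Alpine)
kubectl uncordon serverXX
</pre>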

h4. Upgrading to 1.22.17

* https://v1-22.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
* Need to create a kubeadm config map
** f.i. using the following
** @/usr/local/bin/kubeadm-v1.22.17   upgrade --config kubeadm.yaml --ignore-preflight-errors=CoreDNSUnsupportedPlugins,CoreDNSMigration apply -y v1.22.17@
* Done for p6 on 2023-10-04

h4. Upgrading to 1.23.17

* https://v1-23.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
* No special notes
* Done for p6 on 2023-10-04

h4. Upgrading to 1.24.17

* https://v1-24.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
* No special notes
* Done for p6 on 2023-10-04

h4. Upgrading to 1.25.14

* https://v1-25.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
* No special notes
* Done for p6 on 2023-10-04

h4. Upgrading to 1.26.9

* https://v1-26.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
* No special notes
* Done for p6 on 2023-10-04

h4. Upgrading to 1.27

* https://v1-27.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
* kubelet will not start anymore
* reason: @"command failed" err="failed to parse kubelet flag: unknown flag: --container-runtime"@
* /var/lib/kubelet/kubeadm-flags.env contains that parameter
* remove it, start kubelet

h4. Upgrading to 1.28

* https://v1-28.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/

h4. Upgrade to crio 1.27: missing crun

Error message:

<pre>
level=fatal msg="validating runtime config: runtime validation: \"crun\" not found in $PATH: exec: \"crun\": executable file not found in $PATH"
</pre>

Fix:

<pre>
apk add crun
</pre>

h2. Reference CNI

* Mainly "stupid", but effective plugins
* Main documentation on https://www.cni.dev/plugins/current/
* Plugins
** bridge
*** Can create the bridge on the host
*** But seems not to be able to add host interfaces to it as well
*** Has support for vlan tags
** vlan
*** creates vlan tagged sub interface on the host
*** "It's a 1:1 mapping (i.e. no bridge in between)":https://github.com/k8snetworkplumbingwg/multus-cni/issues/569
** host-device
*** moves the interface from the host into the container
*** very easy for physical connections to containers
** ipvlan
*** "virtualisation" of a host device
*** routing based on IP
*** Same MAC for everyone
*** Cannot reach the master interface
** macvlan
*** With mac addresses
*** Supports various modes (to be checked)
** ptp ("point to point")
*** Creates a host device and connects it to the container
** win*
*** Windows implementations

h2. Calico CNI

h3. Calico Installation

* We install "calico using helm":https://docs.projectcalico.org/getting-started/kubernetes/helm
* This has the following advantages:
** Easy to upgrade
** Does not require the OS to configure IPv6/dual stack settings, as the tigera operator figures things out on its own

Usually plain calico can be installed directly using:

<pre>
VERSION=v3.25.0

helm repo add projectcalico https://docs.projectcalico.org/charts
helm repo update
helm upgrade --install --namespace tigera calico projectcalico/tigera-operator --version $VERSION --create-namespace
</pre>

* Check the tags on https://github.com/projectcalico/calico/tags for the latest release

h3. Installing calicoctl

* General installation instructions, including binary download: https://projectcalico.docs.tigera.io/maintenance/clis/calicoctl/install

To be able to manage and configure calico, we need to
"install calicoctl (we choose the version as a pod)":https://docs.projectcalico.org/getting-started/clis/calicoctl/install#install-calicoctl-as-a-kubernetes-pod

<pre>
kubectl apply -f https://docs.projectcalico.org/manifests/calicoctl.yaml
</pre>

Or version specific:

<pre>
kubectl apply -f https://github.com/projectcalico/calico/blob/v3.20.4/manifests/calicoctl.yaml

# For 3.22
kubectl apply -f https://projectcalico.docs.tigera.io/archive/v3.22/manifests/calicoctl.yaml
</pre>

And making it more easily accessible via an alias:

<pre>
alias calicoctl="kubectl exec -i -n kube-system calicoctl -- /calicoctl"
</pre>

h3. Calico configuration

By default our k8s clusters "BGP peer":https://docs.projectcalico.org/networking/bgp
with an upstream router to propagate podcidr and servicecidr.

Default settings in our infrastructure:

* We use a full-mesh using the @nodeToNodeMeshEnabled: true@ option
* We keep the original next hop so that *only* the server with the pod is announcing it (instead of ecmp)
* We use private ASNs for k8s clusters
* We do *not* use any overlay

After installing calico and calicoctl the last step of the installation is usually:

<pre>
calicoctl create -f - < calico-bgp.yaml
</pre>

A sample BGP configuration:

<pre>
---
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: true
  asNumber: 65534
  serviceClusterIPs:
  - cidr: 2a0a:e5c0:10:3::/108
  serviceExternalIPs:
  - cidr: 2a0a:e5c0:10:3::/108
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: router1-place10
spec:
  peerIP: 2a0a:e5c0:10:1::50
  asNumber: 213081
  keepOriginalNextHop: true
</pre>
h2. Cilium CNI (experimental)

h3. Status

*NO WORKING CILIUM CONFIGURATION FOR IPv6-ONLY MODE*

h3. Latest error

It seems cilium does not run on IPv6 only hosts:

<pre>
level=info msg="Validating configured node address ranges" subsys=daemon
level=fatal msg="postinit failed" error="external IPv4 node address could not be derived, please configure via --ipv4-node" subsys=daemon
level=info msg="Starting IP identity watcher" subsys=ipcache
</pre>

It crashes after that log entry.

h3. BGP configuration

* The cilium-operator will not start without a correct configmap being present beforehand (see error message below)
* Creating the bgp config beforehand as a configmap is thus required.

The error one gets without the configmap present:

Pods are hanging with:

<pre>
cilium-bpqm6                       0/1     Init:0/4            0             9s
cilium-operator-5947d94f7f-5bmh2   0/1     ContainerCreating   0             9s
</pre>

The error message in the cilium-operator is:

<pre>
Events:
  Type     Reason       Age                From               Message
  ----     ------       ----               ----               -------
  Normal   Scheduled    80s                default-scheduler  Successfully assigned kube-system/cilium-operator-5947d94f7f-lqcsp to server56
  Warning  FailedMount  16s (x8 over 80s)  kubelet            MountVolume.SetUp failed for volume "bgp-config-path" : configmap "bgp-config" not found
</pre>

A correct bgp config looks like this:

<pre>
apiVersion: v1
kind: ConfigMap
metadata:
  name: bgp-config
  namespace: kube-system
data:
  config.yaml: |
    peers:
      - peer-address: 2a0a:e5c0::46
        peer-asn: 209898
        my-asn: 65533
      - peer-address: 2a0a:e5c0::47
        peer-asn: 209898
        my-asn: 65533
    address-pools:
      - name: default
        protocol: bgp
        addresses:
          - 2a0a:e5c0:0:14::/64
</pre>
h3. Installation

Adding the repo:

<pre>
helm repo add cilium https://helm.cilium.io/
helm repo update
</pre>

Installing + configuring cilium:

<pre>
ipv6pool=2a0a:e5c0:0:14::/112

version=1.12.2

helm upgrade --install cilium cilium/cilium --version $version \
  --namespace kube-system \
  --set ipv4.enabled=false \
  --set ipv6.enabled=true \
  --set enableIPv6Masquerade=false \
  --set bgpControlPlane.enabled=true

#  --set ipam.operator.clusterPoolIPv6PodCIDRList=$ipv6pool

# Old style bgp?
#   --set bgp.enabled=true --set bgp.announce.podCIDR=true \

# Show possible configuration options
helm show values cilium/cilium
</pre>

Using a /64 for ipam.operator.clusterPoolIPv6PodCIDRList fails with:

<pre>
level=fatal msg="Unable to init cluster-pool allocator" error="unable to initialize IPv6 allocator New CIDR set failed; the node CIDR size is too big" subsys=cilium-operator-generic
</pre>

See also https://github.com/cilium/cilium/issues/20756

A /112 seems to actually work.

h3. Kernel modules

Cilium requires the following modules to be loaded on the host (not loaded by default):

<pre>
modprobe ip6table_raw
modprobe ip6table_filter
</pre>

h3. Interesting helm flags

* autoDirectNodeRoutes
* bgpControlPlane.enabled = true

h3. SEE ALSO

* https://docs.cilium.io/en/v1.12/helm-reference/
h2. Multus

* https://github.com/k8snetworkplumbingwg/multus-cni
* Installing a deployment w/ CRDs

<pre>
VERSION=v4.0.1

kubectl apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg/multus-cni/${VERSION}/deployments/multus-daemonset-crio.yml
</pre>
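
Once multus runs, additional networks are defined via @NetworkAttachmentDefinition@ objects and referenced from pods. A minimal sketch only; the interface name, vlan id and address below are made-up examples, not from our configuration:

<pre>
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: example-vlan
spec:
  # Standard CNI config, here using the vlan reference plugin (see "Reference CNI" above)
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "vlan",
      "master": "eth0",
      "vlanId": 100,
      "ipam": {
        "type": "static",
        "addresses": [ { "address": "2001:db8::2/64" } ]
      }
    }
</pre>

A pod then attaches to it via the annotation @k8s.v1.cni.cncf.io/networks: example-vlan@.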

h2. ArgoCD

h3. Argocd Installation

* See https://argo-cd.readthedocs.io/en/stable/

As there is no configuration management present yet, argocd is installed using:

<pre>
kubectl create namespace argocd

# OR: latest stable
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# OR Specific Version
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/v2.3.2/manifests/install.yaml
</pre>

h3. Get the argocd credentials

<pre>
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d; echo ""
</pre>

h3. Accessing argocd

In regular IPv6 clusters:

* Navigate to https://argocd-server.argocd.CLUSTERDOMAIN

In legacy IPv4 clusters:

<pre>
kubectl --namespace argocd port-forward svc/argocd-server 8080:80
</pre>

* Navigate to https://localhost:8080

h3. Using the argocd webhook to trigger changes

* To trigger changes, POST json to https://argocd.example.com/api/webhook (see the example below)
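
The payload has to look like a push event of the configured git provider; argocd then refreshes the applications tracking that repository. A rough sketch only, faking a github-style push event (repository URL is an example):

<pre>
curl -X POST https://argocd.example.com/api/webhook \
  -H "Content-Type: application/json" \
  -H "X-GitHub-Event: push" \
  -d '{"ref": "refs/heads/master", "repository": {"html_url": "https://code.ungleich.ch/ungleich-intern/k8s-config", "default_branch": "master"}}'
</pre>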

h3. Deploying an application

* Applications are deployed via git towards gitea (code.ungleich.ch) and then pulled by argo
* Always include the *redmine-url* pointing to the (customer) ticket
** Also add the support-url if it exists

Application sample:

<pre>
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gitea-CUSTOMER
  namespace: argocd
spec:
  destination:
    namespace: default
    server: 'https://kubernetes.default.svc'
  source:
    path: apps/prod/gitea
    repoURL: 'https://code.ungleich.ch/ungleich-intern/k8s-config.git'
    targetRevision: HEAD
    helm:
      parameters:
        - name: storage.data.storageClass
          value: rook-ceph-block-hdd
        - name: storage.data.size
          value: 200Gi
        - name: storage.db.storageClass
          value: rook-ceph-block-ssd
        - name: storage.db.size
          value: 10Gi
        - name: storage.letsencrypt.storageClass
          value: rook-ceph-block-hdd
        - name: storage.letsencrypt.size
          value: 50Mi
        - name: letsencryptStaging
          value: 'no'
        - name: fqdn
          value: 'code.verua.online'
  project: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
  info:
    - name: 'redmine-url'
      value: 'https://redmine.ungleich.ch/issues/ISSUEID'
    - name: 'support-url'
      value: 'https://support.ungleich.ch/Ticket/Display.html?id=TICKETID'
</pre>

h2. Helm related operations and conventions

We use helm charts extensively.

* In production, they are managed via argocd
* In development, helm charts can be developed and deployed manually using the helm utility

h3. Installing a helm chart

One can use the usual pattern of

<pre>
helm install <releasename> <chartdirectory>
</pre>

However, often you want to reinstall/update when testing helm charts. The following pattern is "better", because it allows you to reinstall if the release is already installed:

<pre>
helm upgrade --install <releasename> <chartdirectory>
</pre>

h3. Naming services and deployments in helm charts [Application labels]

* We always have {{ .Release.Name }} to identify the current "instance"
* Deployments:
** use @app: <what it is>@, f.i. @app: nginx@, @app: postgres@, ... (see the template snippet below)
* See more about standard labels on
** https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/
** https://helm.sh/docs/chart_best_practices/labels/
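
As an illustration of this convention, a deployment template in a chart typically looks roughly like the following sketch (not copied from a real chart):

<pre>
apiVersion: apps/v1
kind: Deployment
metadata:
  # The release name identifies the instance, the app label identifies the component
  name: {{ .Release.Name }}-nginx
spec:
  selector:
    matchLabels:
      app: nginx
      release: {{ .Release.Name }}
  template:
    metadata:
      labels:
        app: nginx
        release: {{ .Release.Name }}
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
</pre>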

h3. Show all versions of a helm chart

<pre>
helm search repo -l repo/chart
</pre>

For example:

<pre>
% helm search repo -l projectcalico/tigera-operator
NAME                         	CHART VERSION	APP VERSION	DESCRIPTION
projectcalico/tigera-operator	v3.23.3      	v3.23.3    	Installs the Tigera operator for Calico
projectcalico/tigera-operator	v3.23.2      	v3.23.2    	Installs the Tigera operator for Calico
....
</pre>

h3. Show possible values of a chart

<pre>
helm show values <repo/chart>
</pre>

Example:

<pre>
helm show values ingress-nginx/ingress-nginx
</pre>

h3. Show all possible charts in a repo

<pre>
helm search repo REPO
</pre>

h3. Download a chart

For instance for checking it out locally. Use:

<pre>
helm pull <repo/chart>
</pre>

h2. Rook + Ceph

h3. Installation

* Usually directly via argocd

h3. Executing ceph commands

Using the ceph-tools pod as follows:

<pre>
kubectl exec -n rook-ceph -ti $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o jsonpath='{.items[*].metadata.name}') -- ceph -s
</pre>

h3. Inspecting the logs of a specific server

<pre>
# Get the related pods
kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare
...

# Inspect the logs of a specific pod
kubectl -n rook-ceph logs -f rook-ceph-osd-prepare-server23--1-444qx
</pre>

h3. Inspecting the logs of the rook-ceph-operator

<pre>
kubectl -n rook-ceph logs -f -l app=rook-ceph-operator
</pre>

h3. (Temporarily) Disabling the rook-operator

* first disable the sync in argocd
* then scale it down

<pre>
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0
</pre>

When done with the work/maintenance, re-enable sync in argocd.
The following command is thus strictly speaking not required, as argocd will fix it on its own:

<pre>
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=1
</pre>

h3. Restarting the rook operator

<pre>
kubectl -n rook-ceph delete pods -l app=rook-ceph-operator
</pre>

h3. Triggering server prepare / adding new osds

The rook-ceph-operator triggers/watches/creates pods to maintain hosts. To trigger a full "re scan", simply delete that pod:

<pre>
kubectl -n rook-ceph delete pods -l app=rook-ceph-operator
</pre>

This will cause all the @rook-ceph-osd-prepare-..@ jobs to be recreated and thus OSDs to be created, if new disks have been added.

h3. Removing an OSD

* See "Ceph OSD Management":https://rook.io/docs/rook/v1.7/ceph-osd-mgmt.html
* More specifically: https://github.com/rook/rook/blob/release-1.7/cluster/examples/kubernetes/ceph/osd-purge.yaml
* Then delete the related deployment

Set the OSD id in the osd-purge.yaml and apply it. The OSD should be down beforehand.

<pre>
apiVersion: batch/v1
kind: Job
metadata:
  name: rook-ceph-purge-osd
  namespace: rook-ceph # namespace:cluster
  labels:
    app: rook-ceph-purge-osd
spec:
  template:
    metadata:
      labels:
        app: rook-ceph-purge-osd
    spec:
      serviceAccountName: rook-ceph-purge-osd
      containers:
        - name: osd-removal
          image: rook/ceph:master
          # TODO: Insert the OSD ID in the last parameter that is to be removed
          # The OSD IDs are a comma-separated list. For example: "0" or "0,2".
          # If you want to preserve the OSD PVCs, set `--preserve-pvc true`.
          #
          # A --force-osd-removal option is available if the OSD should be destroyed even though the
          # removal could lead to data loss.
          args:
            - "ceph"
            - "osd"
            - "remove"
            - "--preserve-pvc"
            - "false"
            - "--force-osd-removal"
            - "false"
            - "--osd-ids"
            - "SETTHEOSDIDHERE"
          env:
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: ROOK_MON_ENDPOINTS
              valueFrom:
                configMapKeyRef:
                  key: data
                  name: rook-ceph-mon-endpoints
            - name: ROOK_CEPH_USERNAME
              valueFrom:
                secretKeyRef:
                  key: ceph-username
                  name: rook-ceph-mon
            - name: ROOK_CEPH_SECRET
              valueFrom:
                secretKeyRef:
                  key: ceph-secret
                  name: rook-ceph-mon
            - name: ROOK_CONFIG_DIR
              value: /var/lib/rook
            - name: ROOK_CEPH_CONFIG_OVERRIDE
              value: /etc/rook/config/override.conf
            - name: ROOK_FSID
              valueFrom:
                secretKeyRef:
                  key: fsid
                  name: rook-ceph-mon
            - name: ROOK_LOG_LEVEL
              value: DEBUG
          volumeMounts:
            - mountPath: /etc/ceph
              name: ceph-conf-emptydir
            - mountPath: /var/lib/rook
              name: rook-config
      volumes:
        - emptyDir: {}
          name: ceph-conf-emptydir
        - emptyDir: {}
          name: rook-config
      restartPolicy: Never
</pre>

Deleting the deployment:

<pre>
[18:05] bridge:~% kubectl -n rook-ceph delete deployment rook-ceph-osd-6
deployment.apps "rook-ceph-osd-6" deleted
</pre>

h3. Placement of mons/osds/etc.

See https://rook.io/docs/rook/v1.11/CRDs/Cluster/ceph-cluster-crd/#placement-configuration-settings

h2. Ingress + Cert Manager

* We deploy "nginx-ingress":https://docs.nginx.com/nginx-ingress-controller/ to get an ingress
* We deploy "cert-manager":https://cert-manager.io/ to handle certificates
* We independently deploy @ClusterIssuer@ to allow the cert-manager app to deploy and the issuer to be created once the CRDs from cert manager are in place (see the sketch below)
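
For illustration, a letsencrypt @ClusterIssuer@ generally looks like the following; the name, email and ingress class here are placeholders, not necessarily what our k8s-config repository uses:

<pre>
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-production
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: support@example.com
    privateKeySecretRef:
      # Secret that stores the ACME account key
      name: letsencrypt-production-account-key
    solvers:
      - http01:
          ingress:
            class: nginx
</pre>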

h3. IPv4 reachability

The ingress is by default IPv6 only. To make it reachable from the IPv4 world, get its IPv6 address and configure a NAT64 mapping in Jool.

Steps:

h4. Get the ingress IPv6 address

Use @kubectl -n ingress-nginx get svc ingress-nginx-controller -o jsonpath='{.spec.clusterIP}'; echo ''@

Example:

<pre>
kubectl -n ingress-nginx get svc ingress-nginx-controller -o jsonpath='{.spec.clusterIP}'; echo ''
2a0a:e5c0:10:1b::ce11
</pre>

h4. Add NAT64 mapping

* Update the __dcl_jool_siit cdist type
* Record the two IPs (IPv6 and IPv4)
* Configure all routers

h4. Add DNS record

To make the ingress usable as a CNAME destination, create an "ingress" DNS record, such as:

<pre>
; k8s ingress for dev
dev-ingress                 AAAA 2a0a:e5c0:10:1b::ce11
dev-ingress                 A 147.78.194.23
</pre>

h4. Add supporting wildcard DNS

If you plan to add various sites under a specific domain, you can add a wildcard DNS entry, such as *.k8s-dev.django-hosting.ch:

<pre>
*.k8s-dev         CNAME dev-ingress.ungleich.ch.
</pre>
h2. Harbor

* We use "Harbor":https://goharbor.io/ as an image registry for our own images. Internal app reference: apps/prod/harbor.
* The admin password is in the password store, it is Harbor12345 by default
* At the moment harbor only authenticates against the internal ldap tree

h3. LDAP configuration

* The url needs to be ldaps://...
* uid = uid
* rest standard

h2. Monitoring / Prometheus

* Via "kube-prometheus":https://github.com/prometheus-operator/kube-prometheus/

Access via ...

* http://prometheus-k8s.monitoring.svc:9090
* http://grafana.monitoring.svc:3000
* http://alertmanager.monitoring.svc:9093
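
These are in-cluster service URLs (namespace @monitoring@). From a workstation without cluster DNS they can be reached via port-forwarding, for instance (service names assumed from the URLs above):

<pre>
kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090
kubectl -n monitoring port-forward svc/grafana 3000:3000
kubectl -n monitoring port-forward svc/alertmanager 9093:9093
</pre>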

h3. Prometheus Options

* "helm/kube-prometheus-stack":https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
** Includes dashboards and co.
* "manifest based kube-prometheus":https://github.com/prometheus-operator/kube-prometheus
** Includes dashboards and co.
* "Prometheus Operator (mainly CRD manifests)":https://github.com/prometheus-operator/prometheus-operator

h3. Grafana default password

* If not changed: @prom-operator@

h2. Nextcloud

h3. How to get the nextcloud credentials

* The initial username is set to "nextcloud"
* The password is autogenerated and saved in a kubernetes secret

<pre>
kubectl get secret RELEASENAME-nextcloud -o jsonpath="{.data.PASSWORD}" | base64 -d; echo ""
</pre>

h3. How to fix "Access through untrusted domain"

* Nextcloud stores the initial domain configuration
* If the FQDN is changed, it will show the error message "Access through untrusted domain"
* To fix, edit /var/www/html/config/config.php and correct the domain (see the snippet below)
* Then delete the pods
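
The relevant part of @config.php@ looks roughly like this; the FQDN is an example:

<pre>
  'trusted_domains' =>
  array (
    0 => 'nextcloud.example.com',
  ),
</pre>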

h3. Running occ commands inside the nextcloud container

* Find the pod in the right namespace

Exec:

<pre>
su www-data -s /bin/sh -c ./occ
</pre>

* -s /bin/sh is needed as the default shell is set to /bin/false

h4. Rescanning files

* If files have been added without nextcloud's knowledge

<pre>
su www-data -s /bin/sh -c "./occ files:scan --all"
</pre>
h2. Sealed Secrets

* Install kubeseal

<pre>
KUBESEAL_VERSION='0.23.0'
wget "https://github.com/bitnami-labs/sealed-secrets/releases/download/v${KUBESEAL_VERSION:?}/kubeseal-${KUBESEAL_VERSION:?}-linux-amd64.tar.gz"
tar -xvzf kubeseal-${KUBESEAL_VERSION:?}-linux-amd64.tar.gz kubeseal
sudo install -m 755 kubeseal /usr/local/bin/kubeseal
</pre>

* Fetch the public cert of the sealed-secrets controller

<pre>
kubeseal --fetch-cert > /tmp/public-key-cert.pem
</pre>

* Create the secret, for example:

<pre>
apiVersion: v1
kind: Secret
metadata:
  name: Release.Name-postgres-config
  annotations:
    secret-generator.v1.mittwald.de/autogenerate: POSTGRES_PASSWORD
    hosting: Release.Name
  labels:
    app.kubernetes.io/instance: Release.Name
    app.kubernetes.io/component: postgres
stringData:
  POSTGRES_USER: postgresUser
  POSTGRES_DB: postgresDBName
  POSTGRES_INITDB_ARGS: "--no-locale --encoding=UTF8"
</pre>

* Convert secret.yaml to sealed-secret.yaml

<pre>
kubeseal -n <namespace> --cert=/tmp/public-key-cert.pem --format=yaml < ./secret.yaml > ./sealed-secret.yaml
</pre>

* Use the sealed-secret.yaml in the helm chart directory
* See tickets #11989 and #12120
h2. Infrastructure versions

h3. ungleich kubernetes infrastructure v5 (2021-10)

Clusters are configured / setup in this order:

* Bootstrap via kubeadm
* "Networking via calico + BGP (non ECMP) using helm":https://docs.projectcalico.org/getting-started/kubernetes/helm
* "ArgoCD for CD":https://argo-cd.readthedocs.io/en/stable/
** "rook for storage via argocd":https://rook.io/
** haproxy for IPv4-to-IPv6 proxying in the IPv6-only cluster, via argocd
** "kubernetes-secret-generator for in cluster secrets":https://github.com/mittwald/kubernetes-secret-generator
** "ungleich-certbot managing certs and nginx":https://hub.docker.com/repository/docker/ungleich/ungleich-certbot

h3. ungleich kubernetes infrastructure v4 (2021-09)

* rook is configured via manifests instead of using the rook-ceph-cluster helm chart
* The rook operator is still being installed via helm

h3. ungleich kubernetes infrastructure v3 (2021-07)

* rook is now installed via helm via argocd instead of directly via manifests

h3. ungleich kubernetes infrastructure v2 (2021-05)

* Replaced fluxv2 from ungleich k8s v1 with argocd
** argocd can apply helm templates directly without needing to go through Chart releases
* We are also using argoflow for build flows
* Planned to add "kaniko":https://github.com/GoogleContainerTools/kaniko for image building

h3. ungleich kubernetes infrastructure v1 (2021-01)

We are using the following components:

* "Calico as a CNI":https://www.projectcalico.org/ with BGP, IPv6 only, no encapsulation
** Needed for basic networking
* "kubernetes-secret-generator":https://github.com/mittwald/kubernetes-secret-generator for creating secrets
** Needed so that secrets are not stored in the git repository, but only in the cluster
* "ungleich-certbot":https://hub.docker.com/repository/docker/ungleich/ungleich-certbot
** Needed to get letsencrypt certificates for services
* "rook with ceph rbd + cephfs":https://rook.io/ for storage
** rbd for almost everything, *ReadWriteOnce*
** cephfs for smaller things, multi access *ReadWriteMany*
** Needed for providing persistent storage
* "flux v2":https://fluxcd.io/
** Needed to manage resources automatically