h1. The ungleich kubernetes infrastructure and ungleich kubernetes manual

{{toc}}

h2. Status

This document is **pre-production**.
This document is to become the ungleich kubernetes infrastructure overview as well as the ungleich kubernetes manual.

h2. k8s clusters

| Cluster            | Purpose/Setup     | Maintainer | Master(s)                     | argo                                                   | v4 http proxy | last verified |
| c0.k8s.ooo         | Dev               | -          | UNUSED                        |                                                        |               |    2021-10-05 |
| c1.k8s.ooo         | retired           |            | -                             |                                                        |               |    2022-03-15 |
| c2.k8s.ooo         | Dev p7 HW         | Nico       | server47 server53 server54    | "argo":https://argocd-server.argocd.svc.c2.k8s.ooo     |               |    2021-10-05 |
| c3.k8s.ooo         | retired           | -          | -                             |                                                        |               |    2021-10-05 |
| c4.k8s.ooo         | Dev2 p7 HW        | Jin-Guk    | server52 server53 server54    |                                                        |               |             - |
| c5.k8s.ooo         | retired           |            | -                             |                                                        |               |    2022-03-15 |
| c6.k8s.ooo         | Dev p6 VM Jin-Guk | Jin-Guk    |                               |                                                        |               |               |
| [[p5.k8s.ooo]]     | production        |            | server34 server36 server38    | "argo":https://argocd-server.argocd.svc.p5.k8s.ooo     | -             |               |
| [[p5-cow.k8s.ooo]] | production        | Nico       | server47 server51 server55    | "argo":https://argocd-server.argocd.svc.p5-cow.k8s.ooo |               |    2022-08-27 |
| [[p6.k8s.ooo]]     | production        |            | server67 server69 server71    | "argo":https://argocd-server.argocd.svc.p6.k8s.ooo     | 147.78.194.13 |    2021-10-05 |
| [[p6-cow.k8s.ooo]] | production        |            | server134 server135 server136 | "argo":https://argocd-server.argocd.svc.p6in10.k8s.ooo | ?             |    2023-05-17 |
| [[p10.k8s.ooo]]    | production        |            | server131 server132 server133 | "argo":https://argocd-server.argocd.svc.p10.k8s.ooo    | 147.78.194.12 |    2021-10-05 |
| [[k8s.ge.nau.so]]  | development       |            | server107 server108 server109 | "argo":https://argocd-server.argocd.svc.k8s.ge.nau.so  |               |               |
| [[dev.k8s.ooo]]    | development       |            | server110 server111 server112 | "argo":https://argocd-server.argocd.svc.dev.k8s.ooo    | -             |    2022-07-08 |
| [[r1r2p15k8sooo|r1.p15.k8s.ooo]] | production | Nico | server120 | | | 2022-10-30 |
| [[r1r2p15k8sooo|r2.p15.k8s.ooo]] | production | Nico | server121 | | | 2022-09-06 |
| [[r1r2p10k8sooo|r1.p10.k8s.ooo]] | production | Nico | server122 | | | 2022-10-30 |
| [[r1r2p10k8sooo|r2.p10.k8s.ooo]] | production | Nico | server123 | | | 2022-10-15 |
| [[r1r2p5k8sooo|r1.p5.k8s.ooo]] | production | Nico | server137 | | | 2022-10-30 |
| [[r1r2p5k8sooo|r2.p5.k8s.ooo]] | production | Nico | server138 | | | 2022-10-30 |
| [[r1r2p6k8sooo|r1.p6.k8s.ooo]] | production | Nico | server139 | | | 2022-10-30 |
| [[r1r2p6k8sooo|r2.p6.k8s.ooo]] | production | Nico | server140 | | | 2022-10-30 |

h2. General architecture and components overview

* All k8s clusters are IPv6 only
* We use BGP peering to propagate podcidr and serviceCidr networks to our infrastructure
* The main public testing repository is "ungleich-k8s":https://code.ungleich.ch/ungleich-public/ungleich-k8s
** Private configurations are found in the **k8s-config** repository

h3. Cluster types

| **Type/Feature**            | **Development**                | **Production**         |
| Min No. nodes               | 3 (1 master, 3 worker)         | 5 (3 master, 3 worker) |
| Recommended minimum         | 4 (dedicated master, 3 worker) | 8 (3 master, 5 worker) |
| Separation of control plane | optional                       | recommended            |
| Persistent storage          | required                       | required               |
| Number of storage monitors  | 3                              | 5                      |

h2. General k8s operations

h3. Cheat sheet / external great references

* "kubectl cheatsheet":https://kubernetes.io/docs/reference/kubectl/cheatsheet/

h3. Allowing to schedule work on the control plane / removing node taints

* Mostly for single node / test / development clusters
* Just remove the master taint as follows

<pre>
kubectl taint nodes --all node-role.kubernetes.io/master-
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
</pre>

You can check the node taints using @kubectl describe node ...@

h3. Get the cluster admin.conf

* On the masters of each cluster you can find the file @/etc/kubernetes/admin.conf@
* To be able to administrate the cluster you can copy the admin.conf to your local machine
* Multi cluster debugging becomes very easy if you name the config ~/cX-admin.conf (see example below)

<pre>
% scp root@server47.place7.ungleich.ch:/etc/kubernetes/admin.conf ~/c2-admin.conf
% export KUBECONFIG=~/c2-admin.conf
% kubectl get nodes
NAME       STATUS                     ROLES                  AGE   VERSION
server47   Ready                      control-plane,master   82d   v1.22.0
server48   Ready                      control-plane,master   82d   v1.22.0
server49   Ready                      <none>                 82d   v1.22.0
server50   Ready                      <none>                 82d   v1.22.0
server59   Ready                      control-plane,master   82d   v1.22.0
server60   Ready,SchedulingDisabled   <none>                 82d   v1.22.0
server61   Ready                      <none>                 82d   v1.22.0
server62   Ready                      <none>                 82d   v1.22.0
</pre>

h3. Installing a new k8s cluster

* Decide on the cluster name (usually *cX.k8s.ooo*), X counting upwards
** Using pXX.k8s.ooo for production clusters of placeXX
* Use cdist to configure the nodes with requirements like crio
* Decide between single or multi node control plane setups (see below)
** Single control plane suitable for development clusters

Typical init procedure:

h4. Single control plane:

<pre>
kubeadm init --config bootstrap/XXX/kubeadm.yaml
</pre>

h4. Multi control plane (HA):

<pre>
kubeadm init --config bootstrap/XXX/kubeadm.yaml --upload-certs
</pre>
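
For orientation, a minimal sketch of what such a @kubeadm.yaml@ could contain for an IPv6-only cluster (all values here are placeholders/assumptions; the real files live in the respective bootstrap directories):

<pre>
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.27.2
# DNS name / VIP in front of the control plane nodes (assumption)
controlPlaneEndpoint: "pXX-api.k8s.ooo:6443"
networking:
  # IPv6-only pod and service CIDRs (example values, not the real ones)
  podSubnet: "2001:db8:0:1::/64"
  serviceSubnet: "2001:db8:0:2::/108"
</pre>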

h3. Deleting a pod that is hanging in terminating state

<pre>
kubectl delete pod <PODNAME> --grace-period=0 --force --namespace <NAMESPACE>
</pre>

(from https://stackoverflow.com/questions/35453792/pods-stuck-in-terminating-status)

h3. Listing nodes of a cluster

<pre>
[15:05] bridge:~% kubectl get nodes
NAME       STATUS   ROLES                  AGE   VERSION
server22   Ready    <none>                 52d   v1.22.0
server23   Ready    <none>                 52d   v1.22.2
server24   Ready    <none>                 52d   v1.22.0
server25   Ready    <none>                 52d   v1.22.0
server26   Ready    <none>                 52d   v1.22.0
server27   Ready    <none>                 52d   v1.22.0
server63   Ready    control-plane,master   52d   v1.22.0
server64   Ready    <none>                 52d   v1.22.0
server65   Ready    control-plane,master   52d   v1.22.0
server66   Ready    <none>                 52d   v1.22.0
server83   Ready    control-plane,master   52d   v1.22.0
server84   Ready    <none>                 52d   v1.22.0
server85   Ready    <none>                 52d   v1.22.0
server86   Ready    <none>                 52d   v1.22.0
</pre>

h3. Removing / draining a node

Usually @kubectl drain server@ should do the job, but sometimes we need to be more aggressive:

<pre>
kubectl drain --delete-emptydir-data --ignore-daemonsets serverXX
</pre>

h3. Readding a node after draining

<pre>
kubectl uncordon serverXX
</pre>

h3. (Re-)joining worker nodes after creating the cluster

* We need to have an up-to-date token
* We use different join commands for the workers and control plane nodes

Generating the join command on an existing control plane node:

<pre>
kubeadm token create --print-join-command
</pre>

h3. (Re-)joining control plane nodes after creating the cluster

* We generate the token again
* We upload the certificates
* We need to combine/create the join command for the control plane node

Example session:

<pre>
% kubeadm token create --print-join-command
kubeadm join p10-api.k8s.ooo:6443 --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash

% kubeadm init phase upload-certs --upload-certs
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:
CERTKEY

# Then we use these two outputs on the joining node:

kubeadm join p10-api.k8s.ooo:6443 --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash --control-plane --certificate-key CERTKEY
</pre>

Commands to be used on a control plane node:

<pre>
kubeadm token create --print-join-command
kubeadm init phase upload-certs --upload-certs
</pre>

Commands to be used on the joining node:

<pre>
JOINCOMMAND --control-plane --certificate-key CERTKEY
</pre>

SEE ALSO

* https://stackoverflow.com/questions/63936268/how-to-generate-kubeadm-token-for-secondary-control-plane-nodes
* https://blog.scottlowe.org/2019/08/15/reconstructing-the-join-command-for-kubeadm/

h3. How to fix etcd does not start when rejoining a kubernetes cluster as a control plane

If during the above step etcd does not come up, @kubeadm join@ can hang as follows:

<pre>
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[check-etcd] Checking that the etcd cluster is healthy
error execution phase check-etcd: etcd cluster is not healthy: failed to dial endpoint https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2379 with maintenance client: context deadline exceeded
To see the stack trace of this error execute with --v=5 or higher
</pre>

Then the problem is likely that the etcd server is still a member of the cluster. We first need to remove it from the etcd cluster and then the join works.

To fix this we do:

* Find a working etcd pod
* Find the etcd members / member list
* Remove the etcd member that we want to re-join the cluster

<pre>
# Find the etcd pods
kubectl -n kube-system get pods -l component=etcd,tier=control-plane

# Get the list of etcd servers with the member id
kubectl exec -n kube-system -ti ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list

# Remove the member
kubectl exec -n kube-system -ti ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove MEMBERID
</pre>

Sample session:

<pre>
[10:48] line:~% kubectl -n kube-system get pods -l component=etcd,tier=control-plane
NAME            READY   STATUS    RESTARTS     AGE
etcd-server63   1/1     Running   0            3m11s
etcd-server65   1/1     Running   3            7d2h
etcd-server83   1/1     Running   8 (6d ago)   7d2h
[10:48] line:~% kubectl exec -n kube-system -ti etcd-server65 -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
356891cd676df6e4, started, server65, https://[2a0a:e5c0:10:1:225:b3ff:fe20:375c]:2380, https://[2a0a:e5c0:10:1:225:b3ff:fe20:375c]:2379, false
371b8a07185dee7e, started, server63, https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2380, https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2379, false
5942bc58307f8af9, started, server83, https://[2a0a:e5c0:10:1:3e4a:92ff:fe79:bb98]:2380, https://[2a0a:e5c0:10:1:3e4a:92ff:fe79:bb98]:2379, false

[10:48] line:~% kubectl exec -n kube-system -ti etcd-server65 -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove 371b8a07185dee7e
Member 371b8a07185dee7e removed from cluster e3c0805f592a8f77
</pre>

SEE ALSO

* We found the solution using https://stackoverflow.com/questions/67921552/re-installed-node-cannot-join-kubernetes-cluster

h3. Node labels (adding, showing, removing)

Listing the labels:

<pre>
kubectl get nodes --show-labels
</pre>

Adding labels:

<pre>
kubectl label nodes LIST-OF-NODES label1=value1
</pre>

For instance:

<pre>
kubectl label nodes router2 router3 hosttype=router
</pre>

Selecting nodes in pods:

<pre>
apiVersion: v1
kind: Pod
...
spec:
  nodeSelector:
    hosttype: router
</pre>

Removing labels by adding a minus at the end of the label name:

<pre>
kubectl label node <nodename> <labelname>-
</pre>

For instance:

<pre>
kubectl label nodes router2 router3 hosttype-
</pre>

SEE ALSO

* https://kubernetes.io/docs/tasks/configure-pod-container/assign-pods-nodes/
* https://stackoverflow.com/questions/34067979/how-to-delete-a-node-label-by-command-and-api

h3. Listing all pods on a node

<pre>
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=serverXX
</pre>

Found on https://stackoverflow.com/questions/62000559/how-to-list-all-the-pods-running-in-a-particular-worker-node-by-executing-a-comm

h3. Hardware Maintenance using ungleich-hardware

Use the following manifest and replace the HOST with the actual host:

<pre>
apiVersion: v1
kind: Pod
metadata:
  name: ungleich-hardware-HOST
spec:
  containers:
  - name: ungleich-hardware
    image: ungleich/ungleich-hardware:0.0.5
    args:
    - sleep
    - "1000000"
    volumeMounts:
      - mountPath: /dev
        name: dev
    securityContext:
      privileged: true
  nodeSelector:
    kubernetes.io/hostname: "HOST"

  volumes:
    - name: dev
      hostPath:
        path: /dev
</pre>
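
After applying the manifest, one typically enters the pod to run the hardware tools interactively. A short sketch (the file name is an assumption; the pod name follows the manifest above):

<pre>
# Save the manifest above as ungleich-hardware-HOST.yaml, then:
kubectl apply -f ungleich-hardware-HOST.yaml
kubectl exec -ti ungleich-hardware-HOST -- /bin/sh

# When finished, remove the pod again
kubectl delete pod ungleich-hardware-HOST
</pre>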

Also see: [[The_ungleich_hardware_maintenance_guide]]

h3. Triggering a cronjob / creating a job from a cronjob

To test a cronjob, we can create a job from a cronjob:

<pre>
kubectl create job --from=cronjob/volume2-daily-backup volume2-manual
</pre>

This creates a job volume2-manual based on the cronjob volume2-daily-backup.
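
To check that the manually created job actually runs, one can watch it and read its logs (a sketch, using the names from the example above):

<pre>
kubectl get jobs
kubectl logs -f job/volume2-manual
</pre>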

h3. su-ing into a user that has nologin shell set

Often users have nologin set as their shell inside the container. To be able to execute maintenance commands within the container, we can use @su -s /bin/sh@ like this:

<pre>
su -s /bin/sh -c '/path/to/your/script' testuser
</pre>

Found on https://serverfault.com/questions/351046/how-to-run-command-as-user-who-has-usr-sbin-nologin-as-shell

h3. How to print a secret value

Assuming you want the "password" item from a secret, use:

<pre>
kubectl get secret SECRETNAME -o jsonpath="{.data.password}" | base64 -d; echo ""
</pre>

h3. How to upgrade a kubernetes cluster

h4. General

* Should be done every X months to stay up-to-date
** X probably something like 3-6
* kubeadm based clusters
* Needs specific kubeadm versions for upgrade
* Follow instructions on https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
* Finding releases: https://github.com/kubernetes/kubernetes/tree/master/CHANGELOG

h4. Getting a specific kubeadm or kubelet version

<pre>
RELEASE=v1.22.17
RELEASE=v1.23.17
RELEASE=v1.24.9
RELEASE=v1.25.9
RELEASE=v1.26.6
RELEASE=v1.27.2

ARCH=amd64

curl -L --remote-name-all https://dl.k8s.io/release/${RELEASE}/bin/linux/${ARCH}/{kubeadm,kubelet}
chmod u+x kubeadm kubelet
</pre>

h4. Steps

* kubeadm upgrade plan
** On one control plane node
* kubeadm upgrade apply vXX.YY.ZZ
** On one control plane node
* kubeadm upgrade node
** On all other control plane nodes
** On all worker nodes afterwards

Repeat for all control plane nodes. Then upgrade the kubelet on all other nodes via the package manager.
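
A condensed sketch of the whole sequence (the version and the way the kubelet is restarted are assumptions; adapt to the local init system / package manager):

<pre>
# On the first control plane node
./kubeadm-v1.27.2 upgrade plan
./kubeadm-v1.27.2 upgrade apply v1.27.2

# On every other control plane node and afterwards on every worker
./kubeadm-v1.27.2 upgrade node

# Per node: drain, update kubelet, restart it, uncordon
kubectl drain --ignore-daemonsets --delete-emptydir-data serverXX
# install the matching kubelet binary/package and restart the service
# (e.g. rc-service kubelet restart on Alpine based hosts)
kubectl uncordon serverXX
</pre>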

h4. Upgrading to 1.22.17

* https://v1-22.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
* Need to create a kubeadm config map
** f.i. using the following
** @/usr/local/bin/kubeadm-v1.22.17   upgrade --config kubeadm.yaml --ignore-preflight-errors=CoreDNSUnsupportedPlugins,CoreDNSMigration apply -y v1.22.17@
* Done for p6 on 2023-10-04

h4. Upgrading to 1.23.17

* https://v1-23.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
* No special notes
* Done for p6 on 2023-10-04

h4. Upgrading to 1.24.17

* https://v1-24.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
* No special notes
* Done for p6 on 2023-10-04

h4. Upgrading to 1.25.14

* https://v1-25.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
* No special notes
* Done for p6 on 2023-10-04

h4. Upgrading to 1.26.9

* https://v1-26.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
* No special notes
* Done for p6 on 2023-10-04

h4. Upgrading to 1.27

* https://v1-27.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
* kubelet will not start anymore
* reason: @"command failed" err="failed to parse kubelet flag: unknown flag: --container-runtime"@
* /var/lib/kubelet/kubeadm-flags.env contains that parameter
* remove it, start kubelet

h4. Upgrading to 1.28

* https://v1-28.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/

h4. Upgrade to crio 1.27: missing crun

Error message

<pre>
level=fatal msg="validating runtime config: runtime validation: \"crun\" not found in $PATH: exec: \"crun\": executable file not found in $PATH"
</pre>

Fix:

<pre>
apk add crun
</pre>

h2. Reference CNI

* Mainly "stupid", but effective plugins
* Main documentation on https://www.cni.dev/plugins/current/
* Plugins
** bridge
*** Can create the bridge on the host
*** But seems not to be able to add host interfaces to it as well
*** Has support for vlan tags
** vlan
*** creates vlan tagged sub interface on the host
*** "It's a 1:1 mapping (i.e. no bridge in between)":https://github.com/k8snetworkplumbingwg/multus-cni/issues/569
** host-device
*** moves the interface from the host into the container
*** very easy for physical connections to containers
** ipvlan
*** "virtualisation" of a host device
*** routing based on IP
*** Same MAC for everyone
*** Cannot reach the master interface
** macvlan
*** With mac addresses
*** Supports various modes (to be checked)
** ptp ("point to point")
*** Creates a host device and connects it to the container
** win*
*** Windows implementations

h2. Calico CNI

h3. Calico Installation

* We install "calico using helm":https://docs.projectcalico.org/getting-started/kubernetes/helm
* This has the following advantages:
** Easy to upgrade
** Does not require us to configure IPv6/dual stack settings as the tigera operator figures out things on its own

Usually plain calico can be installed directly using:

<pre>
VERSION=v3.25.0

helm repo add projectcalico https://docs.projectcalico.org/charts
helm repo update
helm upgrade --install --namespace tigera calico projectcalico/tigera-operator --version $VERSION --create-namespace
</pre>

* Check the tags on https://github.com/projectcalico/calico/tags for the latest release

h3. Installing calicoctl

* General installation instructions, including binary download: https://projectcalico.docs.tigera.io/maintenance/clis/calicoctl/install

To be able to manage and configure calico, we need to
"install calicoctl (we choose the version as a pod)":https://docs.projectcalico.org/getting-started/clis/calicoctl/install#install-calicoctl-as-a-kubernetes-pod

<pre>
kubectl apply -f https://docs.projectcalico.org/manifests/calicoctl.yaml
</pre>

Or version specific:

<pre>
kubectl apply -f https://github.com/projectcalico/calico/blob/v3.20.4/manifests/calicoctl.yaml

# For 3.22
kubectl apply -f https://projectcalico.docs.tigera.io/archive/v3.22/manifests/calicoctl.yaml
</pre>

And making it more easily accessible via an alias:

<pre>
alias calicoctl="kubectl exec -i -n kube-system calicoctl -- /calicoctl"
</pre>

h3. Calico configuration

By default our k8s clusters "BGP peer":https://docs.projectcalico.org/networking/bgp
with an upstream router to propagate podcidr and servicecidr.

Default settings in our infrastructure:

* We use a full-mesh using the @nodeToNodeMeshEnabled: true@ option
* We keep the original next hop so that *only* the server with the pod is announcing it (instead of ecmp)
* We use private ASNs for k8s clusters
* We do *not* use any overlay

After installing calico and calicoctl the last step of the installation is usually:

<pre>
calicoctl create -f - < calico-bgp.yaml
</pre>

A sample BGP configuration:

<pre>
---
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: true
  asNumber: 65534
  serviceClusterIPs:
  - cidr: 2a0a:e5c0:10:3::/108
  serviceExternalIPs:
  - cidr: 2a0a:e5c0:10:3::/108
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: router1-place10
spec:
  peerIP: 2a0a:e5c0:10:1::50
  asNumber: 213081
  keepOriginalNextHop: true
</pre>
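
To verify what was applied, the resources can be read back through calicoctl (using the alias defined above):

<pre>
calicoctl get bgpConfiguration -o yaml
calicoctl get bgpPeer -o yaml
</pre>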

h2. Cilium CNI (experimental)

h3. Status

*NO WORKING CILIUM CONFIGURATION FOR IPV6 only modes*

h3. Latest error

It seems cilium does not run on IPv6 only hosts:

<pre>
level=info msg="Validating configured node address ranges" subsys=daemon
level=fatal msg="postinit failed" error="external IPv4 node address could not be derived, please configure via --ipv4-node" subsys=daemon
level=info msg="Starting IP identity watcher" subsys=ipcache
</pre>

It crashes after that log entry.

h3. BGP configuration

* The cilium-operator will not start without a correct configmap being present beforehand (see error message below)
* Creating the bgp config beforehand as a configmap is thus required.

The error one gets without the configmap present:

Pods are hanging with:

<pre>
cilium-bpqm6                       0/1     Init:0/4            0             9s
cilium-operator-5947d94f7f-5bmh2   0/1     ContainerCreating   0             9s
</pre>

The error message in the cilium-operator is:

<pre>
Events:
  Type     Reason       Age                From               Message
  ----     ------       ----               ----               -------
  Normal   Scheduled    80s                default-scheduler  Successfully assigned kube-system/cilium-operator-5947d94f7f-lqcsp to server56
  Warning  FailedMount  16s (x8 over 80s)  kubelet            MountVolume.SetUp failed for volume "bgp-config-path" : configmap "bgp-config" not found
</pre>

A correct bgp config looks like this:

<pre>
apiVersion: v1
kind: ConfigMap
metadata:
  name: bgp-config
  namespace: kube-system
data:
  config.yaml: |
    peers:
      - peer-address: 2a0a:e5c0::46
        peer-asn: 209898
        my-asn: 65533
      - peer-address: 2a0a:e5c0::47
        peer-asn: 209898
        my-asn: 65533
    address-pools:
      - name: default
        protocol: bgp
        addresses:
          - 2a0a:e5c0:0:14::/64
</pre>

h3. Installation

Adding the repo

<pre>
helm repo add cilium https://helm.cilium.io/
helm repo update
</pre>

Installing + configuring cilium

<pre>
ipv6pool=2a0a:e5c0:0:14::/112

version=1.12.2

helm upgrade --install cilium cilium/cilium --version $version \
  --namespace kube-system \
  --set ipv4.enabled=false \
  --set ipv6.enabled=true \
  --set enableIPv6Masquerade=false \
  --set bgpControlPlane.enabled=true

#  --set ipam.operator.clusterPoolIPv6PodCIDRList=$ipv6pool

# Old style bgp?
#   --set bgp.enabled=true --set bgp.announce.podCIDR=true \

# Show possible configuration options
helm show values cilium/cilium
</pre>

Using a /64 for ipam.operator.clusterPoolIPv6PodCIDRList fails with:

<pre>
level=fatal msg="Unable to init cluster-pool allocator" error="unable to initialize IPv6 allocator New CIDR set failed; the node CIDR size is too big" subsys=cilium-operator-generic
</pre>

See also https://github.com/cilium/cilium/issues/20756

Seems a /112 is actually working.

h3. Kernel modules

Cilium requires the following modules to be loaded on the host (not loaded by default):

<pre>
modprobe ip6table_raw
modprobe ip6table_filter
</pre>

h3. Interesting helm flags

* autoDirectNodeRoutes
* bgpControlPlane.enabled = true

h3. SEE ALSO

* https://docs.cilium.io/en/v1.12/helm-reference/

h2. Multus

* https://github.com/k8snetworkplumbingwg/multus-cni
* Installing a deployment w/ CRDs

<pre>
VERSION=v4.0.1

kubectl apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg/multus-cni/${VERSION}/deployments/multus-daemonset-crio.yml
</pre>
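
Once the CRDs are installed, additional networks are described as @NetworkAttachmentDefinition@ objects and referenced from pods via an annotation. A minimal sketch (the name, the master interface @eth0@, the address and the use of the macvlan plugin are assumptions for illustration):

<pre>
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-example
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth0",
      "mode": "bridge",
      "ipam": {
        "type": "static",
        "addresses": [ { "address": "2001:db8::10/64" } ]
      }
    }'
</pre>

A pod then requests the extra interface with the annotation @k8s.v1.cni.cncf.io/networks: macvlan-example@.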

h2. ArgoCD

h3. Argocd Installation

* See https://argo-cd.readthedocs.io/en/stable/

As there is no configuration management present yet, argocd is installed using

<pre>
kubectl create namespace argocd

# OR: latest stable
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# OR Specific Version
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/v2.3.2/manifests/install.yaml
</pre>

h3. Get the argocd credentials

<pre>
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d; echo ""
</pre>

h3. Accessing argocd

In regular IPv6 clusters:

* Navigate to https://argocd-server.argocd.CLUSTERDOMAIN

In legacy IPv4 clusters

<pre>
kubectl --namespace argocd port-forward svc/argocd-server 8080:80
</pre>

* Navigate to https://localhost:8080

h3. Using the argocd webhook to trigger changes

* To trigger changes post json https://argocd.example.com/api/webhook

h3. Deploying an application

* Applications are deployed via git towards gitea (code.ungleich.ch) and then pulled by argo
* Always include the *redmine-url* pointing to the (customer) ticket
** Also add the support-url if it exists

Application sample

<pre>
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gitea-CUSTOMER
  namespace: argocd
spec:
  destination:
    namespace: default
    server: 'https://kubernetes.default.svc'
  source:
    path: apps/prod/gitea
    repoURL: 'https://code.ungleich.ch/ungleich-intern/k8s-config.git'
    targetRevision: HEAD
    helm:
      parameters:
        - name: storage.data.storageClass
          value: rook-ceph-block-hdd
        - name: storage.data.size
          value: 200Gi
        - name: storage.db.storageClass
          value: rook-ceph-block-ssd
        - name: storage.db.size
          value: 10Gi
        - name: storage.letsencrypt.storageClass
          value: rook-ceph-block-hdd
        - name: storage.letsencrypt.size
          value: 50Mi
        - name: letsencryptStaging
          value: 'no'
        - name: fqdn
          value: 'code.verua.online'
  project: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
  info:
    - name: 'redmine-url'
      value: 'https://redmine.ungleich.ch/issues/ISSUEID'
    - name: 'support-url'
      value: 'https://support.ungleich.ch/Ticket/Display.html?id=TICKETID'
</pre>

h2. Helm related operations and conventions

We use helm charts extensively.

* In production, they are managed via argocd
* In development, helm charts can be developed and deployed manually using the helm utility.

h3. Installing a helm chart

One can use the usual pattern of

<pre>
helm install <releasename> <chartdirectory>
</pre>

However, you often want to reinstall/update when testing a helm chart. The following pattern is "better", because it also works if the release is already installed:

<pre>
helm upgrade --install <releasename> <chartdirectory>
</pre>

h3. Naming services and deployments in helm charts [Application labels]

* We always have {{ .Release.Name }} to identify the current "instance"
* Deployments:
** use @app: <what it is>@, f.i. @app: nginx@, @app: postgres@, ...
* See more about standard labels on
** https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/
** https://helm.sh/docs/chart_best_practices/labels/
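
A minimal sketch of how these conventions could look inside a chart template (the nginx name and image are assumptions, not one of our actual charts):

<pre>
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-nginx
  labels:
    app: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
</pre>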

h3. Show all versions of a helm chart

<pre>
helm search repo -l repo/chart
</pre>

For example:

<pre>
% helm search repo -l projectcalico/tigera-operator
NAME                         	CHART VERSION	APP VERSION	DESCRIPTION
projectcalico/tigera-operator	v3.23.3      	v3.23.3    	Installs the Tigera operator for Calico
projectcalico/tigera-operator	v3.23.2      	v3.23.2    	Installs the Tigera operator for Calico
....
</pre>

h3. Show possible values of a chart

<pre>
helm show values <repo/chart>
</pre>

Example:

<pre>
helm show values ingress-nginx/ingress-nginx
</pre>

h3. Download a chart

For instance for checking it out locally. Use:

<pre>
helm pull <repo/chart>
</pre>

h2. Rook + Ceph

h3. Installation

* Usually directly via argocd

h3. Executing ceph commands

Using the ceph-tools pod as follows:

<pre>
kubectl exec -n rook-ceph -ti $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o jsonpath='{.items[*].metadata.name}') -- ceph -s
</pre>

h3. Inspecting the logs of a specific server

<pre>
# Get the related pods
kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare
...

# Inspect the logs of a specific pod
kubectl -n rook-ceph logs -f rook-ceph-osd-prepare-server23--1-444qx
</pre>

h3. Inspecting the logs of the rook-ceph-operator

<pre>
kubectl -n rook-ceph logs -f -l app=rook-ceph-operator
</pre>

h3. (Temporarily) Disabling the rook-operator

* First disable the sync in argocd
* Then scale it down

<pre>
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0
</pre>

When done with the work/maintenance, re-enable sync in argocd.
The following command is thus strictly speaking not required, as argocd will fix it on its own:

<pre>
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=1
</pre>

h3. Restarting the rook operator

<pre>
kubectl -n rook-ceph delete pods -l app=rook-ceph-operator
</pre>

h3. Triggering server prepare / adding new osds

The rook-ceph-operator triggers/watches/creates pods to maintain hosts. To trigger a full "re-scan", simply delete that pod:

<pre>
kubectl -n rook-ceph delete pods -l app=rook-ceph-operator
</pre>

This will cause all the @rook-ceph-osd-prepare-..@ jobs to be recreated and thus OSDs to be created, if new disks have been added.

h3. Removing an OSD

* See "Ceph OSD Management":https://rook.io/docs/rook/v1.7/ceph-osd-mgmt.html
* More specifically: https://github.com/rook/rook/blob/release-1.7/cluster/examples/kubernetes/ceph/osd-purge.yaml
* Then delete the related deployment

Set the OSD id in the osd-purge.yaml below and apply it. The OSD should be down before.
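
To check that the OSD is actually down first, the ceph-tools pod from above can be used, for example:

<pre>
kubectl exec -n rook-ceph -ti $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o jsonpath='{.items[*].metadata.name}') -- ceph osd tree
</pre>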

<pre>
apiVersion: batch/v1
kind: Job
metadata:
  name: rook-ceph-purge-osd
  namespace: rook-ceph # namespace:cluster
  labels:
    app: rook-ceph-purge-osd
spec:
  template:
    metadata:
      labels:
        app: rook-ceph-purge-osd
    spec:
      serviceAccountName: rook-ceph-purge-osd
      containers:
        - name: osd-removal
          image: rook/ceph:master
          # TODO: Insert the OSD ID in the last parameter that is to be removed
          # The OSD IDs are a comma-separated list. For example: "0" or "0,2".
          # If you want to preserve the OSD PVCs, set `--preserve-pvc true`.
          #
          # A --force-osd-removal option is available if the OSD should be destroyed even though the
          # removal could lead to data loss.
          args:
            - "ceph"
            - "osd"
            - "remove"
            - "--preserve-pvc"
            - "false"
            - "--force-osd-removal"
            - "false"
            - "--osd-ids"
            - "SETTHEOSDIDHERE"
          env:
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: ROOK_MON_ENDPOINTS
              valueFrom:
                configMapKeyRef:
                  key: data
                  name: rook-ceph-mon-endpoints
            - name: ROOK_CEPH_USERNAME
              valueFrom:
                secretKeyRef:
                  key: ceph-username
                  name: rook-ceph-mon
            - name: ROOK_CEPH_SECRET
              valueFrom:
                secretKeyRef:
                  key: ceph-secret
                  name: rook-ceph-mon
            - name: ROOK_CONFIG_DIR
              value: /var/lib/rook
            - name: ROOK_CEPH_CONFIG_OVERRIDE
              value: /etc/rook/config/override.conf
            - name: ROOK_FSID
              valueFrom:
                secretKeyRef:
                  key: fsid
                  name: rook-ceph-mon
            - name: ROOK_LOG_LEVEL
              value: DEBUG
          volumeMounts:
            - mountPath: /etc/ceph
              name: ceph-conf-emptydir
            - mountPath: /var/lib/rook
              name: rook-config
      volumes:
        - emptyDir: {}
          name: ceph-conf-emptydir
        - emptyDir: {}
          name: rook-config
      restartPolicy: Never
</pre>

Deleting the deployment:

<pre>
[18:05] bridge:~% kubectl -n rook-ceph delete deployment rook-ceph-osd-6
deployment.apps "rook-ceph-osd-6" deleted
</pre>

h3. Placement of mons/osds/etc.

See https://rook.io/docs/rook/v1.11/CRDs/Cluster/ceph-cluster-crd/#placement-configuration-settings

h2. Ingress + Cert Manager

* We deploy "nginx-ingress":https://docs.nginx.com/nginx-ingress-controller/ to get an ingress
* we deploy "cert-manager":https://cert-manager.io/ to handle certificates
* We independently deploy @ClusterIssuer@ to allow the cert-manager app to deploy and the issuer to be created once the CRDs from cert manager are in place
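
For reference, a @ClusterIssuer@ typically looks roughly like this (a sketch only; the name, the e-mail and the use of the letsencrypt production endpoint are assumptions, not our actual manifest):

<pre>
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-production
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-production-account-key
    solvers:
      - http01:
          ingress:
            class: nginx
</pre>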

h3. IPv4 reachability

The ingress is by default IPv6 only. To make it reachable from the IPv4 world, get its IPv6 address and configure a NAT64 mapping in Jool.

Steps:

h4. Get the ingress IPv6 address

Use @kubectl -n ingress-nginx get svc ingress-nginx-controller -o jsonpath='{.spec.clusterIP}'; echo ''@

Example:

<pre>
kubectl -n ingress-nginx get svc ingress-nginx-controller -o jsonpath='{.spec.clusterIP}'; echo ''
2a0a:e5c0:10:1b::ce11
</pre>

h4. Add NAT64 mapping

* Update the __dcl_jool_siit cdist type
* Record the two IPs (IPv6 and IPv4)
* Configure all routers

h4. Add DNS record

To use the ingress as a CNAME destination, create an "ingress" DNS record, such as:

<pre>
; k8s ingress for dev
dev-ingress                 AAAA 2a0a:e5c0:10:1b::ce11
dev-ingress                 A 147.78.194.23
</pre>

h4. Add supporting wildcard DNS

If you plan to add various sites under a specific domain, you can add a wildcard DNS entry, such as *.k8s-dev.django-hosting.ch:

<pre>
*.k8s-dev         CNAME dev-ingress.ungleich.ch.
</pre>

h2. Harbor

* We use "Harbor":https://goharbor.io/ as an image registry for our own images. Internal app reference: apps/prod/harbor.
* The admin password is in the password store; it is Harbor12345 by default
* At the moment harbor only authenticates against the internal ldap tree

h3. LDAP configuration

* The url needs to be ldaps://...
* uid = uid
* rest standard

h2. Monitoring / Prometheus

* Via "kube-prometheus":https://github.com/prometheus-operator/kube-prometheus/

Access via ...

* http://prometheus-k8s.monitoring.svc:9090
* http://grafana.monitoring.svc:3000
* http://alertmanager.monitoring.svc:9093
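
If these URLs are not directly reachable from your machine, a port-forward can be used instead (service names as listed above):

<pre>
kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090
kubectl -n monitoring port-forward svc/grafana 3000:3000
kubectl -n monitoring port-forward svc/alertmanager 9093:9093
</pre>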

h3. Prometheus Options

* "helm/kube-prometheus-stack":https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
** Includes dashboards and co.
* "manifest based kube-prometheus":https://github.com/prometheus-operator/kube-prometheus
** Includes dashboards and co.
* "Prometheus Operator (mainly CRD manifests)":https://github.com/prometheus-operator/prometheus-operator

h3. Grafana default password

* If not changed: @prom-operator@

h2. Nextcloud

h3. How to get the nextcloud credentials

* The initial username is set to "nextcloud"
* The password is autogenerated and saved in a kubernetes secret

<pre>
kubectl get secret RELEASENAME-nextcloud -o jsonpath="{.data.PASSWORD}" | base64 -d; echo ""
</pre>

h3. How to fix "Access through untrusted domain"

* Nextcloud stores the initial domain configuration
* If the FQDN is changed, it will show the error message "Access through untrusted domain"
* To fix, edit /var/www/html/config/config.php and correct the domain
* Then delete the pods
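
Alternatively, the trusted domain can usually be adjusted with @occ@ instead of editing config.php by hand (a sketch; the array index and FQDN are assumptions, see the occ section below for how to run it inside the container):

<pre>
su www-data -s /bin/sh -c "./occ config:system:set trusted_domains 1 --value=nc.example.com"
</pre>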

h3. Running occ commands inside the nextcloud container

* Find the pod in the right namespace

Exec:

<pre>
su www-data -s /bin/sh -c ./occ
</pre>

* -s /bin/sh is needed as the default shell is set to /bin/false

h4. Rescanning files

* If files have been added without nextcloud's knowledge

<pre>
su www-data -s /bin/sh -c "./occ files:scan --all"
</pre>

h2. Sealed Secrets

* Install kubeseal

<pre>
KUBESEAL_VERSION='0.23.0'
wget "https://github.com/bitnami-labs/sealed-secrets/releases/download/v${KUBESEAL_VERSION:?}/kubeseal-${KUBESEAL_VERSION:?}-linux-amd64.tar.gz"
tar -xvzf kubeseal-${KUBESEAL_VERSION:?}-linux-amd64.tar.gz kubeseal
sudo install -m 755 kubeseal /usr/local/bin/kubeseal
</pre>

* Fetch the public key/certificate for sealed-secrets

<pre>
kubeseal --fetch-cert > /tmp/public-key-cert.pem
</pre>

* Create the secret

<pre>
# Example:
apiVersion: v1
kind: Secret
metadata:
  name: Release.Name-postgres-config
  annotations:
    secret-generator.v1.mittwald.de/autogenerate: POSTGRES_PASSWORD
    hosting: Release.Name
  labels:
    app.kubernetes.io/instance: Release.Name
    app.kubernetes.io/component: postgres
stringData:
  POSTGRES_USER: postgresUser
  POSTGRES_DB: postgresDBName
  POSTGRES_INITDB_ARGS: "--no-locale --encoding=UTF8"
</pre>

* Convert secret.yaml to sealed-secret.yaml

<pre>
kubeseal -n <namespace> --cert=/tmp/public-key-cert.pem --format=yaml < ./secret.yaml  > ./sealed-secret.yaml
</pre>

* Use the sealed-secret.yaml in the helm chart directory
* Refer to tickets #11989 and #12120

h2. Infrastructure versions

h3. ungleich kubernetes infrastructure v5 (2021-10)

Clusters are configured / setup in this order:

* Bootstrap via kubeadm
* "Networking via calico + BGP (non ECMP) using helm":https://docs.projectcalico.org/getting-started/kubernetes/helm
* "ArgoCD for CD":https://argo-cd.readthedocs.io/en/stable/
** "rook for storage via argocd":https://rook.io/
** haproxy as an in-cluster IPv4-to-IPv6 proxy for IPv6-only clusters, via argocd
** "kubernetes-secret-generator for in cluster secrets":https://github.com/mittwald/kubernetes-secret-generator
** "ungleich-certbot managing certs and nginx":https://hub.docker.com/repository/docker/ungleich/ungleich-certbot

h3. ungleich kubernetes infrastructure v4 (2021-09)

* rook is configured via manifests instead of using the rook-ceph-cluster helm chart
* The rook operator is still being installed via helm

h3. ungleich kubernetes infrastructure v3 (2021-07)

* rook is now installed via helm via argocd instead of directly via manifests

h3. ungleich kubernetes infrastructure v2 (2021-05)

* Replaced fluxv2 from ungleich k8s v1 with argocd
** argocd can apply helm templates directly without needing to go through Chart releases
* We are also using argoflow for build flows
* Planned to add "kaniko":https://github.com/GoogleContainerTools/kaniko for image building

h3. ungleich kubernetes infrastructure v1 (2021-01)

We are using the following components:

* "Calico as a CNI":https://www.projectcalico.org/ with BGP, IPv6 only, no encapsulation
** Needed for basic networking
* "kubernetes-secret-generator":https://github.com/mittwald/kubernetes-secret-generator for creating secrets
** Needed so that secrets are not stored in the git repository, but only in the cluster
* "ungleich-certbot":https://hub.docker.com/repository/docker/ungleich/ungleich-certbot
** Needed to get letsencrypt certificates for services
* "rook with ceph rbd + cephfs":https://rook.io/ for storage
** rbd for almost everything, *ReadWriteOnce*
** cephfs for smaller things, multi access *ReadWriteMany*
** Needed for providing persistent storage
* "flux v2":https://fluxcd.io/
** Needed to manage resources automatically