h1. The ungleich kubernetes infrastructure and ungleich kubernetes manual
{{toc}}

h2. Status

This document is **pre-production**.

This document is to become the ungleich kubernetes infrastructure overview as well as the ungleich kubernetes manual.

h2. k8s clusters
| Cluster         | Purpose/Setup     | Maintainer | Master(s)                     | argo                                                | v4 http proxy | last verified |
| c0.k8s.ooo      | Dev               | -          | UNUSED                        |                                                     |               |    2021-10-05 |
| c1.k8s.ooo      | retired           |            | -                             |                                                     |               |    2022-03-15 |
| c2.k8s.ooo      | Dev p7 HW         | Nico       | server47 server53 server54    | "argo":https://argocd-server.argocd.svc.c2.k8s.ooo  |               |    2021-10-05 |
| c3.k8s.ooo      | retired           | -          | -                             |                                                     |               |    2021-10-05 |
| c4.k8s.ooo      | Dev2 p7 HW        | Jin-Guk    | server52 server53 server54    |                                                     |               |             - |
| c5.k8s.ooo      | retired           |            | -                             |                                                     |               |    2022-03-15 |
| c6.k8s.ooo      | Dev p6 VM Jin-Guk | Jin-Guk    |                               |                                                     |               |               |
| [[p5.k8s.ooo]]  | production        |            | server34 server36 server38    | "argo":https://argocd-server.argocd.svc.p5.k8s.ooo  |             - |               |
| [[p7.k8s.ooo]]  | production        |            | server107 server108 server109 | "argo":https://argocd-server.argocd.svc.nau.k8s.ooo |               |               |
| [[p6.k8s.ooo]]  | production        |            | server67 server69 server71    | "argo":https://argocd-server.argocd.svc.p6.k8s.ooo  | 147.78.194.13 |    2021-10-05 |
| [[p10.k8s.ooo]] | production        |            | server63 server65 server83    | "argo":https://argocd-server.argocd.svc.p10.k8s.ooo | 147.78.194.12 |    2021-10-05 |
h2. General architecture and components overview
* All k8s clusters are IPv6 only
* We use BGP peering to propagate podcidr and serviceCidr networks to our infrastructure
* The main public testing repository is "ungleich-k8s":https://code.ungleich.ch/ungleich-public/ungleich-k8s
** Private configurations are found in the **k8s-config** repository

h3. Cluster types
| **Type/Feature**            | **Development**                | **Production**         |
| Min No. nodes               | 3 (1 master, 3 worker)         | 5 (3 master, 3 worker) |
| Recommended minimum         | 4 (dedicated master, 3 worker) | 8 (3 master, 5 worker) |
| Separation of control plane | optional                       | recommended            |
| Persistent storage          | required                       | required               |
| Number of storage monitors  | 3                              | 5                      |

h2. General k8s operations
h3. Cheat sheet / useful external references

* "kubectl cheatsheet":https://kubernetes.io/docs/reference/kubectl/cheatsheet/
h3. Allowing to schedule work on the control plane
* Mostly for single node / test / development clusters
* Just remove the master taint as follows:

<pre>
kubectl taint nodes --all node-role.kubernetes.io/master-
</pre>
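
On newer Kubernetes releases (v1.24 and later) the control plane nodes carry the @node-role.kubernetes.io/control-plane@ taint instead of the master taint. If the command above reports nothing to untaint, this variant (an addition not yet verified against our clusters) should cover it:

<pre>
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
</pre>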
h3. Get the cluster admin.conf
* On the masters of each cluster you can find the file @/etc/kubernetes/admin.conf@
* To be able to administer the cluster you can copy the admin.conf to your local machine
* Multi cluster debugging can be very easy if you name the config ~/cX-admin.conf (see example below)

<pre>
% scp root@server47.place7.ungleich.ch:/etc/kubernetes/admin.conf ~/c2-admin.conf
% export KUBECONFIG=~/c2-admin.conf
% kubectl get nodes
NAME       STATUS                     ROLES                  AGE   VERSION
server47   Ready                      control-plane,master   82d   v1.22.0
server48   Ready                      control-plane,master   82d   v1.22.0
server49   Ready                      <none>                 82d   v1.22.0
server50   Ready                      <none>                 82d   v1.22.0
server59   Ready                      control-plane,master   82d   v1.22.0
server60   Ready,SchedulingDisabled   <none>                 82d   v1.22.0
server61   Ready                      <none>                 82d   v1.22.0
server62   Ready                      <none>                 82d   v1.22.0
</pre>
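
A small, hypothetical shell helper for switching between the per-cluster config files, assuming the ~/cX-admin.conf naming convention from above (the @kc@ function name is just an example, not part of our tooling):

<pre>
# Point kubectl at cluster cX for the current shell
kc() { export KUBECONFIG=~/c$1-admin.conf; }

kc 2 && kubectl get nodes    # talk to the cluster whose config is ~/c2-admin.conf
kc 4 && kubectl get nodes    # talk to the cluster whose config is ~/c4-admin.conf
</pre>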
h3. Installing a new k8s cluster
* Decide on the cluster name (usually *cX.k8s.ooo*), X counting upwards
** Use pXX.k8s.ooo for production clusters of placeXX
* Use cdist to configure the nodes with requirements like crio
* Decide between single or multi node control plane setups (see below)
** A single control plane is suitable for development clusters

Typical init procedure:

* Single control plane: @kubeadm init --config bootstrap/XXX/kubeadm.yaml@
* Multi control plane (HA): @kubeadm init --config bootstrap/XXX/kubeadm.yaml --upload-certs@
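
The actual kubeadm.yaml files live in the (private) bootstrap/XXX directories. For illustration only, a minimal sketch of what such a config can look like for an IPv6-only cluster on crio; every value below (endpoint, CIDRs, cluster name) is a placeholder, not our real configuration:

<pre>
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  criSocket: unix:///var/run/crio/crio.sock
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
clusterName: cX.k8s.ooo
controlPlaneEndpoint: "cX-api.k8s.ooo:6443"
networking:
  podSubnet: 2001:db8:0:1::/64
  serviceSubnet: 2001:db8:0:2::/108
</pre>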
h3. Deleting a pod that is hanging in terminating state
<pre>
kubectl delete pod <PODNAME> --grace-period=0 --force --namespace <NAMESPACE>
</pre>

(from https://stackoverflow.com/questions/35453792/pods-stuck-in-terminating-status)
h3. Listing nodes of a cluster
<pre>
[15:05] bridge:~% kubectl get nodes
NAME       STATUS   ROLES                  AGE   VERSION
server22   Ready    <none>                 52d   v1.22.0
server23   Ready    <none>                 52d   v1.22.2
server24   Ready    <none>                 52d   v1.22.0
server25   Ready    <none>                 52d   v1.22.0
server26   Ready    <none>                 52d   v1.22.0
server27   Ready    <none>                 52d   v1.22.0
server63   Ready    control-plane,master   52d   v1.22.0
server64   Ready    <none>                 52d   v1.22.0
server65   Ready    control-plane,master   52d   v1.22.0
server66   Ready    <none>                 52d   v1.22.0
server83   Ready    control-plane,master   52d   v1.22.0
server84   Ready    <none>                 52d   v1.22.0
server85   Ready    <none>                 52d   v1.22.0
server86   Ready    <none>                 52d   v1.22.0
</pre>
h3. Removing / draining a node
Usually @kubectl drain serverXX@ should do the job, but sometimes we need to be more aggressive:

<pre>
kubectl drain --delete-emptydir-data --ignore-daemonsets serverXX
</pre>

h3. Re-adding a node after draining

<pre>
kubectl uncordon serverXX
</pre>
h3. (Re-)joining worker nodes after creating the cluster
* We need to have an up-to-date token
* We use different join commands for the workers and control plane nodes

Generating the join command on an existing control plane node:

<pre>
kubeadm token create --print-join-command
</pre>
h3. (Re-)joining control plane nodes after creating the cluster
* We generate the token again
* We upload the certificates
* We need to combine/create the join command for the control plane node

Example session:

<pre>
% kubeadm token create --print-join-command
kubeadm join p10-api.k8s.ooo:6443 --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash

% kubeadm init phase upload-certs --upload-certs
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:
CERTKEY

# Then we use these two outputs on the joining node:

kubeadm join p10-api.k8s.ooo:6443 --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash --control-plane --certificate-key CERTKEY
</pre>

Commands to be used on a control plane node:

<pre>
kubeadm token create --print-join-command
kubeadm init phase upload-certs --upload-certs
</pre>

Commands to be used on the joining node:

<pre>
JOINCOMMAND --control-plane --certificate-key CERTKEY
</pre>

SEE ALSO

* https://stackoverflow.com/questions/63936268/how-to-generate-kubeadm-token-for-secondary-control-plane-nodes
* https://blog.scottlowe.org/2019/08/15/reconstructing-the-join-command-for-kubeadm/
h3. How to fix etcd not starting when rejoining a kubernetes cluster as a control plane

If during the above step etcd does not come up, @kubeadm join@ can hang as follows:

<pre>
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[check-etcd] Checking that the etcd cluster is healthy
error execution phase check-etcd: etcd cluster is not healthy: failed to dial endpoint https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2379 with maintenance client: context deadline exceeded
To see the stack trace of this error execute with --v=5 or higher
</pre>

The likely cause is that the node's old etcd member is still registered in the etcd cluster. We first need to remove it from the etcd cluster, then the join works.

To fix this we do:

* Find a working etcd pod
* Find the etcd members / member list
* Remove the etcd member that we want to re-join the cluster

<pre>
# Find the etcd pods
kubectl -n kube-system get pods -l component=etcd,tier=control-plane

# Get the list of etcd servers with the member id
kubectl exec -n kube-system -ti ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list

# Remove the member
kubectl exec -n kube-system -ti ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove MEMBERID
</pre>

Sample session:

<pre>
[10:48] line:~% kubectl -n kube-system get pods -l component=etcd,tier=control-plane
NAME            READY   STATUS    RESTARTS     AGE
etcd-server63   1/1     Running   0            3m11s
etcd-server65   1/1     Running   3            7d2h
etcd-server83   1/1     Running   8 (6d ago)   7d2h
[10:48] line:~% kubectl exec -n kube-system -ti etcd-server65 -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
356891cd676df6e4, started, server65, https://[2a0a:e5c0:10:1:225:b3ff:fe20:375c]:2380, https://[2a0a:e5c0:10:1:225:b3ff:fe20:375c]:2379, false
371b8a07185dee7e, started, server63, https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2380, https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2379, false
5942bc58307f8af9, started, server83, https://[2a0a:e5c0:10:1:3e4a:92ff:fe79:bb98]:2380, https://[2a0a:e5c0:10:1:3e4a:92ff:fe79:bb98]:2379, false

[10:48] line:~% kubectl exec -n kube-system -ti etcd-server65 -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove 371b8a07185dee7e
Member 371b8a07185dee7e removed from cluster e3c0805f592a8f77
</pre>

SEE ALSO

* We found the solution using https://stackoverflow.com/questions/67921552/re-installed-node-cannot-join-kubernetes-cluster
h3. Hardware Maintenance using ungleich-hardware
Use the following manifest and replace HOST with the actual host:

<pre>
apiVersion: v1
kind: Pod
metadata:
  name: ungleich-hardware-HOST
spec:
  containers:
  - name: ungleich-hardware
    image: ungleich/ungleich-hardware:0.0.5
    args:
    - sleep
    - "1000000"
    volumeMounts:
      - mountPath: /dev
        name: dev
    securityContext:
      privileged: true
  nodeSelector:
    kubernetes.io/hostname: "HOST"
  volumes:
    - name: dev
      hostPath:
        path: /dev
</pre>
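
Once the pod is running, the hardware tools can be used interactively inside it. A small usage sketch (the file name is arbitrary and the exec assumes the image ships a shell):

<pre>
kubectl apply -f ungleich-hardware-HOST.yaml
kubectl exec -ti ungleich-hardware-HOST -- /bin/sh
# then inspect disks / controllers from inside the container
</pre>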
Also see: [[The_ungleich_hardware_maintenance_guide]]
h3. Triggering a cronjob / creating a job from a cronjob
To test a cronjob, we can create a job from it:

<pre>
kubectl create job --from=cronjob/volume2-daily-backup volume2-manual
</pre>

This creates the job @volume2-manual@ based on the cronjob @volume2-daily-backup@.
h2. Calico CNI
h3. Calico Installation

* We install "calico using helm":https://docs.projectcalico.org/getting-started/kubernetes/helm
* This has the following advantages:
** Easy to upgrade
** Does not require us to configure IPv6/dual-stack settings, as the Tigera operator figures this out on its own

Usually plain calico can be installed directly using:

<pre>
helm repo add projectcalico https://docs.projectcalico.org/charts
helm install --namespace tigera calico projectcalico/tigera-operator --version v3.23.1
</pre>

* Check the tags on https://github.com/projectcalico/calico/tags for the latest release
h3. Installing calicoctl
To be able to manage and configure calico, we need to
"install calicoctl (we choose the variant running as a pod)":https://docs.projectcalico.org/getting-started/clis/calicoctl/install#install-calicoctl-as-a-kubernetes-pod

<pre>
kubectl apply -f https://docs.projectcalico.org/manifests/calicoctl.yaml
</pre>

Or version specific:

<pre>
kubectl apply -f https://github.com/projectcalico/calico/blob/v3.20.4/manifests/calicoctl.yaml

# For 3.22
kubectl apply -f https://projectcalico.docs.tigera.io/archive/v3.22/manifests/calicoctl.yaml
</pre>

And make it easier to access via an alias:

<pre>
alias calicoctl="kubectl exec -i -n kube-system calicoctl -- /calicoctl"
</pre>
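
With the alias in place, calico resources can be inspected directly; for example (assuming the BGP configuration from the next section has been applied):

<pre>
calicoctl get nodes
calicoctl get bgpconfiguration default -o yaml
calicoctl get bgppeers
</pre>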
h3. Calico configuration
By default our k8s clusters "BGP peer":https://docs.projectcalico.org/networking/bgp
with an upstream router to propagate podcidr and servicecidr.

Default settings in our infrastructure:

* We use a full mesh using the @nodeToNodeMeshEnabled: true@ option
* We keep the original next hop so that *only* the server with the pod is announcing it (instead of ECMP)
* We use private ASNs for k8s clusters
* We do *not* use any overlay

After installing calico and calicoctl the last step of the installation is usually:

<pre>
calicoctl create -f - < calico-bgp.yaml
</pre>

A sample BGP configuration:

<pre>
---
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: true
  asNumber: 65534
  serviceClusterIPs:
  - cidr: 2a0a:e5c0:10:3::/108
  serviceExternalIPs:
  - cidr: 2a0a:e5c0:10:3::/108
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: router1-place10
spec:
  peerIP: 2a0a:e5c0:10:1::50
  asNumber: 213081
  keepOriginalNextHop: true
</pre>
h2. ArgoCD / ArgoWorkFlow
h3. Argocd Installation

As there is no configuration management present yet, argocd is installed using:

<pre>
kubectl create namespace argocd

# Specific version
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/v2.3.2/manifests/install.yaml

# OR: latest stable
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
</pre>

* See https://argo-cd.readthedocs.io/en/stable/

h3. Get the argocd credentials

<pre>
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d; echo ""
</pre>

h3. Accessing argocd

In regular IPv6 clusters:

* Navigate to https://argocd-server.argocd.CLUSTERDOMAIN

In legacy IPv4 clusters:

<pre>
kubectl --namespace argocd port-forward svc/argocd-server 8080:80
</pre>

* Navigate to https://localhost:8080
h3. Using the argocd webhook to trigger changes
* To trigger changes, POST a JSON payload to https://argocd.example.com/api/webhook (see the sketch below)
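
Normally the webhook is configured in gitea itself, pointing at the URL above. For manually emulating such a trigger, a rough sketch (URL, event header and repository values are placeholders; argocd expects the webhook format of the configured git provider, so the exact headers may differ):

<pre>
curl -X POST https://argocd.example.com/api/webhook \
  -H "Content-Type: application/json" \
  -H "X-Gitea-Event: push" \
  -d '{"ref": "refs/heads/master", "repository": {"html_url": "https://code.ungleich.ch/ungleich-intern/k8s-config"}}'
</pre>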
h3. Deploying an application
* Applications are deployed via git to gitea (code.ungleich.ch) and then pulled by argo
* Always include the *redmine-url* pointing to the (customer) ticket
** Also add the support-url if it exists

Application sample:

<pre>
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gitea-CUSTOMER
  namespace: argocd
spec:
  destination:
    namespace: default
    server: 'https://kubernetes.default.svc'
  source:
    path: apps/prod/gitea
    repoURL: 'https://code.ungleich.ch/ungleich-intern/k8s-config.git'
    targetRevision: HEAD
    helm:
      parameters:
        - name: storage.data.storageClass
          value: rook-ceph-block-hdd
        - name: storage.data.size
          value: 200Gi
        - name: storage.db.storageClass
          value: rook-ceph-block-ssd
        - name: storage.db.size
          value: 10Gi
        - name: storage.letsencrypt.storageClass
          value: rook-ceph-block-hdd
        - name: storage.letsencrypt.size
          value: 50Mi
        - name: letsencryptStaging
          value: 'no'
        - name: fqdn
          value: 'code.verua.online'
  project: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
  info:
    - name: 'redmine-url'
      value: 'https://redmine.ungleich.ch/issues/ISSUEID'
    - name: 'support-url'
      value: 'https://support.ungleich.ch/Ticket/Display.html?id=TICKETID'
</pre>
h2. Helm related operations and conventions
We use helm charts extensively.

* In production, they are managed via argocd
* In development, helm charts can be developed and deployed manually using the helm utility

h3. Installing a helm chart

One can use the usual pattern of

<pre>
helm install <releasename> <chartdirectory>
</pre>

However, when testing helm charts you often want to reinstall/update. The following pattern is better, because it also works if the release is already installed:

<pre>
helm upgrade --install <releasename> <chartdirectory>
</pre>
h3. Naming services and deployments in helm charts [Application labels]
* We always have @{{ .Release.Name }}@ to identify the current "instance"
* Deployments:
** use @app: <what it is>@, e.g. @app: nginx@, @app: postgres@, ... (see the sketch below)
* See more about standard labels at:
** https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/
** https://helm.sh/docs/chart_best_practices/labels/
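
A minimal sketch of how these conventions could look in a chart template; chart, container and image names are placeholders, not an existing chart of ours:

<pre>
# templates/deployment.yaml (sketch)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-nginx
  labels:
    app: nginx
spec:
  selector:
    matchLabels:
      app: nginx
      release: {{ .Release.Name }}
  template:
    metadata:
      labels:
        app: nginx
        release: {{ .Release.Name }}
    spec:
      containers:
        - name: nginx
          image: nginx:1.21
</pre>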
h2. Rook / Ceph Related Operations
h3. Executing ceph commands

Using the ceph-tools pod as follows:

<pre>
kubectl exec -n rook-ceph -ti $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o jsonpath='{.items[*].metadata.name}') -- ceph -s
</pre>
h3. Inspecting the logs of a specific server
<pre>
# Get the related pods
kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare
...

# Inspect the logs of a specific pod
kubectl -n rook-ceph logs -f rook-ceph-osd-prepare-server23--1-444qx
</pre>

h3. Inspecting the logs of the rook-ceph-operator

<pre>
kubectl -n rook-ceph logs -f -l app=rook-ceph-operator
</pre>

h3. Triggering server prepare / adding new OSDs

The rook-ceph-operator triggers/watches/creates pods to maintain hosts. To trigger a full re-scan, simply delete that pod:

<pre>
kubectl -n rook-ceph delete pods -l app=rook-ceph-operator
</pre>

This will cause all the @rook-ceph-osd-prepare-..@ jobs to be recreated and thus OSDs to be created, if new disks have been added.
h3. Removing an OSD
* See "Ceph OSD Management":https://rook.io/docs/rook/v1.7/ceph-osd-mgmt.html
* More specifically: https://github.com/rook/rook/blob/release-1.7/cluster/examples/kubernetes/ceph/osd-purge.yaml
* Then delete the related deployment

Set the OSD id in the osd-purge.yaml below and apply it (see the example after the manifest); the OSD should be down before.
<pre>
apiVersion: batch/v1
kind: Job
metadata:
  name: rook-ceph-purge-osd
  namespace: rook-ceph # namespace:cluster
  labels:
    app: rook-ceph-purge-osd
spec:
  template:
    metadata:
      labels:
        app: rook-ceph-purge-osd
    spec:
      serviceAccountName: rook-ceph-purge-osd
      containers:
        - name: osd-removal
          image: rook/ceph:master
          # TODO: Insert the OSD ID in the last parameter that is to be removed
          # The OSD IDs are a comma-separated list. For example: "0" or "0,2".
          # If you want to preserve the OSD PVCs, set `--preserve-pvc true`.
          #
          # A --force-osd-removal option is available if the OSD should be destroyed even though the
          # removal could lead to data loss.
          args:
            - "ceph"
            - "osd"
            - "remove"
            - "--preserve-pvc"
            - "false"
            - "--force-osd-removal"
            - "false"
            - "--osd-ids"
            - "SETTHEOSDIDHERE"
          env:
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: ROOK_MON_ENDPOINTS
              valueFrom:
                configMapKeyRef:
                  key: data
                  name: rook-ceph-mon-endpoints
            - name: ROOK_CEPH_USERNAME
              valueFrom:
                secretKeyRef:
                  key: ceph-username
                  name: rook-ceph-mon
            - name: ROOK_CEPH_SECRET
              valueFrom:
                secretKeyRef:
                  key: ceph-secret
                  name: rook-ceph-mon
            - name: ROOK_CONFIG_DIR
              value: /var/lib/rook
            - name: ROOK_CEPH_CONFIG_OVERRIDE
              value: /etc/rook/config/override.conf
            - name: ROOK_FSID
              valueFrom:
                secretKeyRef:
                  key: fsid
                  name: rook-ceph-mon
            - name: ROOK_LOG_LEVEL
              value: DEBUG
          volumeMounts:
            - mountPath: /etc/ceph
              name: ceph-conf-emptydir
            - mountPath: /var/lib/rook
              name: rook-config
      volumes:
        - emptyDir: {}
          name: ceph-conf-emptydir
        - emptyDir: {}
          name: rook-config
      restartPolicy: Never
</pre>
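
Applying the purge job and watching its output could look roughly like this (a sketch; the file name is arbitrary and the OSD should already be out/down, e.g. via @ceph osd out@ from the ceph-tools pod):

<pre>
kubectl apply -f osd-purge.yaml
kubectl -n rook-ceph logs -f job/rook-ceph-purge-osd
</pre>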
Deleting the deployment:
<pre>
[18:05] bridge:~% kubectl -n rook-ceph delete deployment rook-ceph-osd-6
deployment.apps "rook-ceph-osd-6" deleted
</pre>
h2. Harbor
* We use "Harbor":https://goharbor.io/ for caching and as an image registry. Internal app reference: apps/prod/harbor.
* The admin password is in the password store, auto-generated per cluster
* At the moment Harbor only authenticates against the internal LDAP tree

h3. LDAP configuration

* The URL needs to be ldaps://...
* uid = uid
* The rest is standard
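
For reference, the same settings can also be pushed via Harbor's configuration API instead of the UI; a sketch with placeholder values only (our actual LDAP base DN and credentials live in the password store / k8s-config):

<pre>
curl -u "admin:ADMINPASSWORD" -X PUT "https://harbor.example.com/api/v2.0/configurations" \
  -H "Content-Type: application/json" \
  -d '{
        "auth_mode": "ldap_auth",
        "ldap_url": "ldaps://ldap.example.com",
        "ldap_base_dn": "ou=users,dc=example,dc=com",
        "ldap_uid": "uid"
      }'
</pre>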
h2. Monitoring / Prometheus
* Via "kube-prometheus":https://github.com/prometheus-operator/kube-prometheus/

Access via (see also the port-forward sketch below):

* http://prometheus-k8s.monitoring.svc:9090
* http://grafana.monitoring.svc:3000
* http://alertmanager.monitoring.svc:9093
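
From outside the cluster, a quick way to reach these UIs is a port-forward against the kube-prometheus services (service names as shipped by kube-prometheus; alertmanager's service is usually called @alertmanager-main@):

<pre>
kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090
kubectl -n monitoring port-forward svc/grafana 3000:3000
kubectl -n monitoring port-forward svc/alertmanager-main 9093:9093
</pre>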
h3. Prometheus Options
* "helm/kube-prometheus-stack":https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
** Includes dashboards and co.
* "manifest based kube-prometheus":https://github.com/prometheus-operator/kube-prometheus
** Includes dashboards and co.
* "Prometheus Operator (mainly CRD manifests)":https://github.com/prometheus-operator/prometheus-operator
h2. Nextcloud
h3. How to get the nextcloud credentials

* The initial username is set to "nextcloud"
* The password is auto-generated and saved in a kubernetes secret

<pre>
kubectl get secret RELEASENAME-nextcloud -o jsonpath="{.data.PASSWORD}" | base64 -d; echo ""
</pre>
h3. How to fix "Access through untrusted domain"
* Nextcloud stores the initial domain configuration
* If the FQDN is changed, it will show the error message "Access through untrusted domain"
* To fix it, edit /var/www/html/config/config.php and correct the domain
* Then delete the pods (see the sketch below)
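
A rough sketch of the procedure, assuming the release is named @nextcloud@, the pods carry the default chart label @app.kubernetes.io/name=nextcloud@ and the image ships an editor; adjust names and the label selector to the actual deployment:

<pre>
# Edit the trusted_domains entry inside the running pod
kubectl exec -ti deployment/nextcloud -- vi /var/www/html/config/config.php

# Then recreate the pods so the change is picked up everywhere
kubectl delete pods -l app.kubernetes.io/name=nextcloud
</pre>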
h2. Infrastructure versions
h3. ungleich kubernetes infrastructure v5 (2021-10)

Clusters are configured / set up in this order:

* Bootstrap via kubeadm
* "Networking via calico + BGP (non ECMP) using helm":https://docs.projectcalico.org/getting-started/kubernetes/helm
* "ArgoCD for CD":https://argo-cd.readthedocs.io/en/stable/
** "rook for storage via argocd":https://rook.io/
** haproxy as an in-cluster IPv4-to-IPv6 proxy for the IPv6-only clusters, via argocd
** "kubernetes-secret-generator for in cluster secrets":https://github.com/mittwald/kubernetes-secret-generator
** "ungleich-certbot managing certs and nginx":https://hub.docker.com/repository/docker/ungleich/ungleich-certbot

h3. ungleich kubernetes infrastructure v4 (2021-09)

* rook is configured via manifests instead of using the rook-ceph-cluster helm chart
* The rook operator is still being installed via helm

h3. ungleich kubernetes infrastructure v3 (2021-07)

* rook is now installed via helm via argocd instead of directly via manifests

h3. ungleich kubernetes infrastructure v2 (2021-05)

* Replaced fluxv2 from ungleich k8s v1 with argocd
** argocd can apply helm templates directly without needing to go through Chart releases
* We are also using argoflow for build flows
* Planned to add "kaniko":https://github.com/GoogleContainerTools/kaniko for image building

h3. ungleich kubernetes infrastructure v1 (2021-01)

We are using the following components:

* "Calico as a CNI":https://www.projectcalico.org/ with BGP, IPv6 only, no encapsulation
** Needed for basic networking
* "kubernetes-secret-generator":https://github.com/mittwald/kubernetes-secret-generator for creating secrets
** Needed so that secrets are not stored in the git repository, but only in the cluster
* "ungleich-certbot":https://hub.docker.com/repository/docker/ungleich/ungleich-certbot
** Needed to get letsencrypt certificates for services
* "rook with ceph rbd + cephfs":https://rook.io/ for storage
** rbd for almost everything, *ReadWriteOnce*
** cephfs for smaller things, multi access *ReadWriteMany*
** Needed for providing persistent storage
* "flux v2":https://fluxcd.io/
** Needed to manage resources automatically