h1. The ungleich kubernetes infrastructure and ungleich kubernetes manual

{{toc}}

h2. Status

This document is **pre-production**.
This document is to become the ungleich kubernetes infrastructure overview as well as the ungleich kubernetes manual.

h2. k8s clusters

| Cluster         | Purpose/Setup     | Maintainer | Master(s)                  | argo                                                | v4 http proxy | last verified |
| c0.k8s.ooo      | Dev               | -          | UNUSED                     |                                                     |               |    2021-10-05 |
| c1.k8s.ooo      | retired           |            | -                          |                                                     |               |    2022-03-15 |
| c2.k8s.ooo      | Dev p7 HW         | Nico       | server47 server53 server54 | "argo":https://argocd-server.argocd.svc.c2.k8s.ooo  |               |    2021-10-05 |
| c3.k8s.ooo      | retired           | -          | -                          |                                                     |               |    2021-10-05 |
| c4.k8s.ooo      | Dev2 p7 HW        | Jin-Guk    | server52 server53 server54 |                                                     |               |             - |
| c5.k8s.ooo      | retired           |            | -                          |                                                     |               |    2022-03-15 |
| c6.k8s.ooo      | Dev p6 VM Jin-Guk | Jin-Guk    |                            |                                                     |               |               |
| [[p5.k8s.ooo]]  | production        |            | server34 server36 server38 | "argo":https://argocd-server.argocd.svc.p5.k8s.ooo  |             - |               |
| [[p6.k8s.ooo]]  | production        |            | server67 server69 server71 | "argo":https://argocd-server.argocd.svc.p6.k8s.ooo  | 147.78.194.13 |    2021-10-05 |
| [[p10.k8s.ooo]] | production        |            | server63 server65 server83 | "argo":https://argocd-server.argocd.svc.p10.k8s.ooo | 147.78.194.12 |    2021-10-05 |
| fnnf            | development       | Nico       | server75                   |                                                     |               |               |

h2. General architecture and components overview

* All k8s clusters are IPv6 only
* We use BGP peering to propagate podcidr and serviceCidr networks to our infrastructure
* The main public testing repository is "ungleich-k8s":https://code.ungleich.ch/ungleich-public/ungleich-k8s
** Private configurations are found in the **k8s-config** repository

h3. Cluster types

| **Type/Feature**            | **Development**                | **Production**         |
| Min No. nodes               | 3 (1 master, 3 worker)         | 5 (3 master, 3 worker) |
| Recommended minimum         | 4 (dedicated master, 3 worker) | 8 (3 master, 5 worker) |
| Separation of control plane | optional                       | recommended            |
| Persistent storage          | required                       | required               |
| Number of storage monitors  | 3                              | 5                      |

h2. General k8s operations

h3. Cheat sheet / great external references

* "kubectl cheatsheet":https://kubernetes.io/docs/reference/kubectl/cheatsheet/

h3. Allowing to schedule work on the control plane

* Mostly for single node / test / development clusters
* Just remove the master taint as follows

<pre>
kubectl taint nodes --all node-role.kubernetes.io/master-
</pre>

h3. Get the cluster admin.conf

* On the masters of each cluster you can find the file @/etc/kubernetes/admin.conf@
* To be able to administrate the cluster you can copy the admin.conf to your local machine
* Multi-cluster debugging becomes very easy if you name the config ~/cX-admin.conf (see the example below)

<pre>
% scp root@server47.place7.ungleich.ch:/etc/kubernetes/admin.conf ~/c2-admin.conf
% export KUBECONFIG=~/c2-admin.conf
% kubectl get nodes
NAME       STATUS                     ROLES                  AGE   VERSION
server47   Ready                      control-plane,master   82d   v1.22.0
server48   Ready                      control-plane,master   82d   v1.22.0
server49   Ready                      <none>                 82d   v1.22.0
server50   Ready                      <none>                 82d   v1.22.0
server59   Ready                      control-plane,master   82d   v1.22.0
server60   Ready,SchedulingDisabled   <none>                 82d   v1.22.0
server61   Ready                      <none>                 82d   v1.22.0
server62   Ready                      <none>                 82d   v1.22.0
</pre>

h3. Installing a new k8s cluster

* Decide on the cluster name (usually *cX.k8s.ooo*), X counting upwards
** Using pXX.k8s.ooo for production clusters of placeXX
* Use cdist to configure the nodes with requirements like crio
* Decide between single or multi node control plane setups (see below)
** Single control plane suitable for development clusters

Typical init procedure (a sketch of a possible kubeadm.yaml follows after this list):

* Single control plane: @kubeadm init --config bootstrap/XXX/kubeadm.yaml@
* Multi control plane (HA): @kubeadm init --config bootstrap/XXX/kubeadm.yaml --upload-certs@

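A minimal sketch of what such a @bootstrap/XXX/kubeadm.yaml@ could look like; the concrete values (cluster, Kubernetes version, pod subnet) are illustrative assumptions and need to be adapted per cluster:

<pre>
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.22.0
controlPlaneEndpoint: p10-api.k8s.ooo:6443
networking:
  dnsDomain: p10.k8s.ooo
  podSubnet: 2a0a:e5c0:10:4::/64      # assumption, cluster specific
  serviceSubnet: 2a0a:e5c0:10:3::/108 # see the calico BGP configuration below
</pre>
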
h3. Deleting a pod that is hanging in terminating state

<pre>
kubectl delete pod <PODNAME> --grace-period=0 --force --namespace <NAMESPACE>
</pre>

(from https://stackoverflow.com/questions/35453792/pods-stuck-in-terminating-status)

h3. Listing nodes of a cluster

<pre>
[15:05] bridge:~% kubectl get nodes
NAME       STATUS   ROLES                  AGE   VERSION
server22   Ready    <none>                 52d   v1.22.0
server23   Ready    <none>                 52d   v1.22.2
server24   Ready    <none>                 52d   v1.22.0
server25   Ready    <none>                 52d   v1.22.0
server26   Ready    <none>                 52d   v1.22.0
server27   Ready    <none>                 52d   v1.22.0
server63   Ready    control-plane,master   52d   v1.22.0
server64   Ready    <none>                 52d   v1.22.0
server65   Ready    control-plane,master   52d   v1.22.0
server66   Ready    <none>                 52d   v1.22.0
server83   Ready    control-plane,master   52d   v1.22.0
server84   Ready    <none>                 52d   v1.22.0
server85   Ready    <none>                 52d   v1.22.0
server86   Ready    <none>                 52d   v1.22.0
</pre>

h3. Removing / draining a node

Usually @kubectl drain server@ should do the job, but sometimes we need to be more aggressive:

<pre>
kubectl drain --delete-emptydir-data --ignore-daemonsets server23
</pre>

h3. Re-adding a node after draining

<pre>
kubectl uncordon serverXX
</pre>

h3. (Re-)joining worker nodes after creating the cluster

* We need to have an up-to-date token
* We use different join commands for the workers and control plane nodes

Generating the join command on an existing control plane node:

<pre>
kubeadm token create --print-join-command
</pre>

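The printed command is then run on the worker node that should (re-)join; a sketch of what it looks like (endpoint, token and hash are placeholders):

<pre>
kubeadm join p10-api.k8s.ooo:6443 --token TOKEN --discovery-token-ca-cert-hash sha256:HASH
</pre>
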
h3. (Re-)joining control plane nodes after creating the cluster

* We generate the token again
* We upload the certificates
* We need to combine/create the join command for the control plane node

Example session:

<pre>
% kubeadm token create --print-join-command
kubeadm join p10-api.k8s.ooo:6443 --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash

% kubeadm init phase upload-certs --upload-certs
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:
CERTKEY

# Then we use these two outputs on the joining node:

kubeadm join p10-api.k8s.ooo:6443 --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash --control-plane --certificate-key CERTKEY
</pre>

Commands to be used on a control plane node:

<pre>
kubeadm token create --print-join-command
kubeadm init phase upload-certs --upload-certs
</pre>

Commands to be used on the joining node:

<pre>
JOINCOMMAND --control-plane --certificate-key CERTKEY
</pre>

SEE ALSO

* https://stackoverflow.com/questions/63936268/how-to-generate-kubeadm-token-for-secondary-control-plane-nodes
* https://blog.scottlowe.org/2019/08/15/reconstructing-the-join-command-for-kubeadm/

h3. How to fix etcd does not start when rejoining a kubernetes cluster as a control plane

If during the above step etcd does not come up, @kubeadm join@ can hang as follows:

<pre>
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[check-etcd] Checking that the etcd cluster is healthy
error execution phase check-etcd: etcd cluster is not healthy: failed to dial endpoint https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2379 with maintenance client: context deadline exceeded
To see the stack trace of this error execute with --v=5 or higher
</pre>

Then the problem is likely that the etcd server is still a member of the cluster. We first need to remove it from the etcd cluster and then the join works.

To fix this we do:

* Find a working etcd pod
* Find the etcd members / member list
* Remove the etcd member that we want to re-join the cluster

<pre>
# Find the etcd pods
kubectl -n kube-system get pods -l component=etcd,tier=control-plane

# Get the list of etcd servers with the member id
kubectl exec -n kube-system -ti ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list

# Remove the member
kubectl exec -n kube-system -ti ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove MEMBERID
</pre>

Sample session:

<pre>
[10:48] line:~% kubectl -n kube-system get pods -l component=etcd,tier=control-plane
NAME            READY   STATUS    RESTARTS     AGE
etcd-server63   1/1     Running   0            3m11s
etcd-server65   1/1     Running   3            7d2h
etcd-server83   1/1     Running   8 (6d ago)   7d2h
[10:48] line:~% kubectl exec -n kube-system -ti etcd-server65 -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
356891cd676df6e4, started, server65, https://[2a0a:e5c0:10:1:225:b3ff:fe20:375c]:2380, https://[2a0a:e5c0:10:1:225:b3ff:fe20:375c]:2379, false
371b8a07185dee7e, started, server63, https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2380, https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2379, false
5942bc58307f8af9, started, server83, https://[2a0a:e5c0:10:1:3e4a:92ff:fe79:bb98]:2380, https://[2a0a:e5c0:10:1:3e4a:92ff:fe79:bb98]:2379, false

[10:48] line:~% kubectl exec -n kube-system -ti etcd-server65 -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove 371b8a07185dee7e
Member 371b8a07185dee7e removed from cluster e3c0805f592a8f77
</pre>

SEE ALSO

* We found the solution using https://stackoverflow.com/questions/67921552/re-installed-node-cannot-join-kubernetes-cluster

h3. Hardware Maintenance using ungleich-hardware

Use the following manifest and replace the HOST with the actual host:

<pre>
apiVersion: v1
kind: Pod
metadata:
  name: ungleich-hardware-HOST
spec:
  containers:
  - name: ungleich-hardware
    image: ungleich/ungleich-hardware:0.0.5
    args:
    - sleep
    - "1000000"
    volumeMounts:
      - mountPath: /dev
        name: dev
    securityContext:
      privileged: true
  nodeSelector:
    kubernetes.io/hostname: "HOST"

  volumes:
    - name: dev
      hostPath:
        path: /dev
</pre>

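A possible way to use it, assuming the manifest above is saved as @ungleich-hardware.yaml@ and HOST has been replaced (the pod name and shell are illustrative):

<pre>
kubectl apply -f ungleich-hardware.yaml
kubectl exec -ti ungleich-hardware-HOST -- /bin/sh
</pre>
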
Also see: [[The_ungleich_hardware_maintenance_guide]]

h2. Calico CNI

h3. Calico Installation

* We install "calico using helm":https://docs.projectcalico.org/getting-started/kubernetes/helm
* This has the following advantages:
** Easy to upgrade
** Does not require the OS to configure IPv6/dual-stack settings, as the tigera operator figures things out on its own

Usually plain calico can be installed directly using:

<pre>
helm repo add projectcalico https://docs.projectcalico.org/charts
helm install calico projectcalico/tigera-operator --version v3.20.4
</pre>

* Check the tags on https://github.com/projectcalico/calico/tags for the latest release

h3. Installing calicoctl

To be able to manage and configure calico, we need to
"install calicoctl (we choose the version as a pod)":https://docs.projectcalico.org/getting-started/clis/calicoctl/install#install-calicoctl-as-a-kubernetes-pod

<pre>
kubectl apply -f https://docs.projectcalico.org/manifests/calicoctl.yaml
</pre>

Or version specific:

<pre>
kubectl apply -f https://github.com/projectcalico/calico/blob/v3.20.4/manifests/calicoctl.yaml

# For 3.22
kubectl apply -f https://projectcalico.docs.tigera.io/archive/v3.22/manifests/calicoctl.yaml
</pre>

And making it more easily accessible via an alias:

<pre>
alias calicoctl="kubectl exec -i -n kube-system calicoctl -- /calicoctl"
</pre>

h3. Calico configuration

By default our k8s clusters "BGP peer":https://docs.projectcalico.org/networking/bgp
with an upstream router to propagate podcidr and servicecidr.

Default settings in our infrastructure:

* We use a full mesh using the @nodeToNodeMeshEnabled: true@ option
* We keep the original next hop so that *only* the server with the pod is announcing it (instead of ECMP)
* We use private ASNs for k8s clusters
* We do *not* use any overlay

After installing calico and calicoctl, the last step of the installation is usually:

<pre>
calicoctl create -f - < calico-bgp.yaml
</pre>

A sample BGP configuration:

<pre>
---
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: true
  asNumber: 65534
  serviceClusterIPs:
  - cidr: 2a0a:e5c0:10:3::/108
  serviceExternalIPs:
  - cidr: 2a0a:e5c0:10:3::/108
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: router1-place10
spec:
  peerIP: 2a0a:e5c0:10:1::50
  asNumber: 213081
  keepOriginalNextHop: true
</pre>

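To check that the configuration has been applied, the BGP resources can be listed via calicoctl (a quick sanity check, using the alias defined above):

<pre>
calicoctl get bgpconfig default -o yaml
calicoctl get bgppeer
</pre>
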
h2. ArgoCD / ArgoWorkFlow

h3. Argocd Installation

As there is no configuration management present yet, argocd is installed using

<pre>
kubectl create namespace argocd

# Specific Version
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/v2.3.2/manifests/install.yaml

# OR: latest stable
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
</pre>

* See https://argo-cd.readthedocs.io/en/stable/

h3. Get the argocd credentials

<pre>
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d; echo ""
</pre>

h3. Accessing argocd

In regular IPv6 clusters:

* Navigate to https://argocd-server.argocd.CLUSTERDOMAIN

In legacy IPv4 clusters:

<pre>
kubectl --namespace argocd port-forward svc/argocd-server 8080:80
</pre>

* Navigate to https://localhost:8080

h3. Using the argocd webhook to trigger changes

* To trigger changes, post JSON to https://argocd.example.com/api/webhook (see the sketch below)

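A minimal sketch of triggering the webhook by hand with a GitHub-style push payload; normally the git server sends this automatically, and the exact header/payload shape shown here is an assumption:

<pre>
curl -X POST https://argocd.example.com/api/webhook \
  -H "Content-Type: application/json" \
  -H "X-GitHub-Event: push" \
  -d '{"ref": "refs/heads/master", "repository": {"html_url": "https://code.ungleich.ch/ungleich-intern/k8s-config"}}'
</pre>
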
h3. Deploying an application

* Applications are pushed via git to gitea (code.ungleich.ch) and then pulled by argo
* Always include the *redmine-url* pointing to the (customer) ticket
** Also add the support-url if it exists

Application sample:

<pre>
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gitea-CUSTOMER
  namespace: argocd
spec:
  destination:
    namespace: default
    server: 'https://kubernetes.default.svc'
  source:
    path: apps/prod/gitea
    repoURL: 'https://code.ungleich.ch/ungleich-intern/k8s-config.git'
    targetRevision: HEAD
    helm:
      parameters:
        - name: storage.data.storageClass
          value: rook-ceph-block-hdd
        - name: storage.data.size
          value: 200Gi
        - name: storage.db.storageClass
          value: rook-ceph-block-ssd
        - name: storage.db.size
          value: 10Gi
        - name: storage.letsencrypt.storageClass
          value: rook-ceph-block-hdd
        - name: storage.letsencrypt.size
          value: 50Mi
        - name: letsencryptStaging
          value: 'no'
        - name: fqdn
          value: 'code.verua.online'
  project: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
  info:
    - name: 'redmine-url'
      value: 'https://redmine.ungleich.ch/issues/ISSUEID'
    - name: 'support-url'
      value: 'https://support.ungleich.ch/Ticket/Display.html?id=TICKETID'
</pre>

h2. Helm related operations and conventions

We use helm charts extensively.

* In production, they are managed via argocd
* In development, helm charts can be developed and deployed manually using the helm utility.

h3. Installing a helm chart

One can use the usual pattern of

<pre>
helm install <releasename> <chartdirectory>
</pre>

However, when testing helm charts you often want to reinstall/update. The following pattern is "better", because it also works if the chart is already installed:

<pre>
helm upgrade --install <releasename> <chartdirectory>
</pre>

h3. Naming services and deployments in helm charts [Application labels]

* We always have {{ .Release.Name }} to identify the current "instance"
* Deployments:
** use @app: <what it is>@, f.i. @app: nginx@, @app: postgres@, ... (see the sketch below)
* See more about standard labels on
** https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/
** https://helm.sh/docs/chart_best_practices/labels/

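A minimal sketch of how these conventions could look in a deployment template; the extra @release@ label used for the selector is an illustration, not a prescribed scheme:

<pre>
apiVersion: apps/v1
kind: Deployment
metadata:
  # the release name identifies the instance, the app label says what it is
  name: {{ .Release.Name }}-nginx
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
      release: {{ .Release.Name }}
  template:
    metadata:
      labels:
        app: nginx
        release: {{ .Release.Name }}
    spec:
      containers:
        - name: nginx
          image: nginx:1.21
</pre>
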
h2. Rook / Ceph Related Operations

h3. Executing ceph commands

Using the ceph-tools pod as follows:

<pre>
kubectl exec -n rook-ceph -ti $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o jsonpath='{.items[*].metadata.name}') -- ceph -s
</pre>

h3. Inspecting the logs of a specific server

<pre>
# Get the related pods
kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare
...

# Inspect the logs of a specific pod
kubectl -n rook-ceph logs -f rook-ceph-osd-prepare-server23--1-444qx
</pre>

h3. Inspecting the logs of the rook-ceph-operator

<pre>
kubectl -n rook-ceph logs -f -l app=rook-ceph-operator
</pre>

h3. Triggering server prepare / adding new osds

The rook-ceph-operator triggers/watches/creates pods to maintain hosts. To trigger a full "re-scan", simply delete that pod:

<pre>
kubectl -n rook-ceph delete pods -l app=rook-ceph-operator
</pre>

This will cause all the @rook-ceph-osd-prepare-..@ jobs to be recreated and thus OSDs to be created, if new disks have been added.

h3. Removing an OSD

* See "Ceph OSD Management":https://rook.io/docs/rook/v1.7/ceph-osd-mgmt.html
* More specifically: https://github.com/rook/rook/blob/release-1.7/cluster/examples/kubernetes/ceph/osd-purge.yaml
* Then delete the related deployment

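Before purging, it is worth checking that the OSD is actually down, for example via the ceph-tools pod from the section above:

<pre>
kubectl exec -n rook-ceph -ti $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o jsonpath='{.items[*].metadata.name}') -- ceph osd tree
</pre>
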
Set the OSD id in the osd-purge.yaml and apply it. The OSD should be down before it is purged.

<pre>
apiVersion: batch/v1
kind: Job
metadata:
  name: rook-ceph-purge-osd
  namespace: rook-ceph # namespace:cluster
  labels:
    app: rook-ceph-purge-osd
spec:
  template:
    metadata:
      labels:
        app: rook-ceph-purge-osd
    spec:
      serviceAccountName: rook-ceph-purge-osd
      containers:
        - name: osd-removal
          image: rook/ceph:master
          # TODO: Insert the OSD ID in the last parameter that is to be removed
          # The OSD IDs are a comma-separated list. For example: "0" or "0,2".
          # If you want to preserve the OSD PVCs, set `--preserve-pvc true`.
          #
          # A --force-osd-removal option is available if the OSD should be destroyed even though the
          # removal could lead to data loss.
          args:
            - "ceph"
            - "osd"
            - "remove"
            - "--preserve-pvc"
            - "false"
            - "--force-osd-removal"
            - "false"
            - "--osd-ids"
            - "SETTHEOSDIDHERE"
          env:
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: ROOK_MON_ENDPOINTS
              valueFrom:
                configMapKeyRef:
                  key: data
                  name: rook-ceph-mon-endpoints
            - name: ROOK_CEPH_USERNAME
              valueFrom:
                secretKeyRef:
                  key: ceph-username
                  name: rook-ceph-mon
            - name: ROOK_CEPH_SECRET
              valueFrom:
                secretKeyRef:
                  key: ceph-secret
                  name: rook-ceph-mon
            - name: ROOK_CONFIG_DIR
              value: /var/lib/rook
            - name: ROOK_CEPH_CONFIG_OVERRIDE
              value: /etc/rook/config/override.conf
            - name: ROOK_FSID
              valueFrom:
                secretKeyRef:
                  key: fsid
                  name: rook-ceph-mon
            - name: ROOK_LOG_LEVEL
              value: DEBUG
          volumeMounts:
            - mountPath: /etc/ceph
              name: ceph-conf-emptydir
            - mountPath: /var/lib/rook
              name: rook-config
      volumes:
        - emptyDir: {}
          name: ceph-conf-emptydir
        - emptyDir: {}
          name: rook-config
      restartPolicy: Never
</pre>

Deleting the deployment:

<pre>
[18:05] bridge:~% kubectl -n rook-ceph delete deployment rook-ceph-osd-6
deployment.apps "rook-ceph-osd-6" deleted
</pre>

h2. Harbor

* We use "Harbor":https://goharbor.io/ for caching and as an image registry. Internal app reference: apps/prod/harbor.
* The admin password is in the password store, auto-generated per cluster
* At the moment harbor only authenticates against the internal ldap tree

h3. LDAP configuration

* The url needs to be ldaps://...
* uid = uid
* The rest is standard

h2. Monitoring / Prometheus

* Via "kube-prometheus":https://github.com/prometheus-operator/kube-prometheus/

Access via ...

* http://prometheus-k8s.monitoring.svc:9090
* http://grafana.monitoring.svc:3000
* http://alertmanager.monitoring.svc:9093

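These are in-cluster addresses; from outside the cluster one possible way to reach them is a port-forward (a sketch using the grafana service named above):

<pre>
kubectl -n monitoring port-forward svc/grafana 3000:3000
</pre>
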
h3. Prometheus Options

* "helm/kube-prometheus-stack":https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
** Includes dashboards and co.
* "manifest based kube-prometheus":https://github.com/prometheus-operator/kube-prometheus
** Includes dashboards and co.
* "Prometheus Operator (mainly CRD manifests)":https://github.com/prometheus-operator/prometheus-operator

h2. Nextcloud

h3. How to get the nextcloud credentials

* The initial username is set to "nextcloud"
* The password is autogenerated and saved in a kubernetes secret

<pre>
kubectl get secret RELEASENAME-nextcloud -o jsonpath="{.data.PASSWORD}" | base64 -d; echo ""
</pre>

h3. How to fix "Access through untrusted domain"

* Nextcloud stores the initial domain configuration
* If the FQDN is changed, it will show the error message "Access through untrusted domain"
* To fix it, edit /var/www/html/config/config.php and correct the domain
* Then delete the pods

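A possible sequence for this (the pod name and label are illustrative and depend on the release):

<pre>
# open a shell in one of the nextcloud pods and fix the 'trusted_domains' entry
kubectl exec -ti PODNAME -- vi /var/www/html/config/config.php

# then let the pods be recreated
kubectl delete pods -l app.kubernetes.io/name=nextcloud
</pre>
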
h2. Infrastructure versions

h3. ungleich kubernetes infrastructure v5 (2021-10)

Clusters are configured / set up in this order:

* Bootstrap via kubeadm
* "Networking via calico + BGP (non ECMP) using helm":https://docs.projectcalico.org/getting-started/kubernetes/helm
* "ArgoCD for CD":https://argo-cd.readthedocs.io/en/stable/
** "rook for storage via argocd":https://rook.io/
** haproxy as IPv4-to-IPv6 proxy for the IPv6-only clusters, via argocd
** "kubernetes-secret-generator for in-cluster secrets":https://github.com/mittwald/kubernetes-secret-generator
** "ungleich-certbot managing certs and nginx":https://hub.docker.com/repository/docker/ungleich/ungleich-certbot

h3. ungleich kubernetes infrastructure v4 (2021-09)

* rook is configured via manifests instead of using the rook-ceph-cluster helm chart
* The rook operator is still being installed via helm

h3. ungleich kubernetes infrastructure v3 (2021-07)

* rook is now installed via helm through argocd instead of directly via manifests

h3. ungleich kubernetes infrastructure v2 (2021-05)

* Replaced fluxv2 from ungleich k8s v1 with argocd
** argocd can apply helm templates directly without needing to go through Chart releases
* We are also using argoflow for build flows
* Planned to add "kaniko":https://github.com/GoogleContainerTools/kaniko for image building

h3. ungleich kubernetes infrastructure v1 (2021-01)

We are using the following components:

* "Calico as a CNI":https://www.projectcalico.org/ with BGP, IPv6 only, no encapsulation
** Needed for basic networking
* "kubernetes-secret-generator":https://github.com/mittwald/kubernetes-secret-generator for creating secrets
** Needed so that secrets are not stored in the git repository, but only in the cluster
* "ungleich-certbot":https://hub.docker.com/repository/docker/ungleich/ungleich-certbot
** Needed to get letsencrypt certificates for services
* "rook with ceph rbd + cephfs":https://rook.io/ for storage
** rbd for almost everything, *ReadWriteOnce*
** cephfs for smaller things, multi-access *ReadWriteMany*
** Needed for providing persistent storage
* "flux v2":https://fluxcd.io/
** Needed to manage resources automatically