
h1. The ungleich kubernetes infrastructure and ungleich kubernetes manual

{{toc}}

h2. Status

This document is **pre-production**.

This document is to become the ungleich kubernetes infrastructure overview as well as the ungleich kubernetes manual.

h2. k8s clusters

| Cluster         | Purpose/Setup     | Maintainer | Master(s)                  | argo                                                | v4 http proxy | last verified |
| c0.k8s.ooo      | Dev               | -          | UNUSED                     |                                                     |               |    2021-10-05 |
| c1.k8s.ooo      | retired           |            | -                          |                                                     |               |    2022-03-15 |
| c2.k8s.ooo      | Dev p7 HW         | Nico       | server47 server53 server54 | "argo":https://argocd-server.argocd.svc.c2.k8s.ooo  |               |    2021-10-05 |
| c3.k8s.ooo      | retired           | -          | -                          |                                                     |               |    2021-10-05 |
| c4.k8s.ooo      | Dev2 p7 HW        | Jin-Guk    | server52 server53 server54 |                                                     |               |             - |
| c5.k8s.ooo      | retired           |            | -                          |                                                     |               |    2022-03-15 |
| c6.k8s.ooo      | Dev p6 VM Jin-Guk | Jin-Guk    |                            |                                                     |               |               |
| [[p5.k8s.ooo]]  | production        |            | server34 server36 server38 | "argo":https://argocd-server.argocd.svc.p5.k8s.ooo  |             - |               |
| [[p6.k8s.ooo]]  | production        |            | server67 server69 server71 | "argo":https://argocd-server.argocd.svc.p6.k8s.ooo  | 147.78.194.13 |    2021-10-05 |
| [[p10.k8s.ooo]] | production        |            | server63 server65 server83 | "argo":https://argocd-server.argocd.svc.p10.k8s.ooo | 147.78.194.12 |    2021-10-05 |
| nau             | development       | Nico       | server75                   |                                                     |               |               |

h2. General architecture and components overview

* All k8s clusters are IPv6 only
* We use BGP peering to propagate podcidr and serviceCidr networks to our infrastructure
* The main public testing repository is "ungleich-k8s":https://code.ungleich.ch/ungleich-public/ungleich-k8s
** Private configurations are found in the **k8s-config** repository

h3. Cluster types

| **Type/Feature**            | **Development**                | **Production**         |
| Min No. nodes               | 3 (1 master, 3 worker)         | 5 (3 master, 3 worker) |
| Recommended minimum         | 4 (dedicated master, 3 worker) | 8 (3 master, 5 worker) |
| Separation of control plane | optional                       | recommended            |
| Persistent storage          | required                       | required               |
| Number of storage monitors  | 3                              | 5                      |

h2. General k8s operations

h3. Cheat sheet / external great references

* "kubectl cheatsheet":https://kubernetes.io/docs/reference/kubectl/cheatsheet/

h3. Allowing to schedule work on the control plane

* Mostly for single node / test / development clusters
* Just remove the master taint as follows

<pre>
kubectl taint nodes --all node-role.kubernetes.io/master-
</pre>

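On newer kubernetes/kubeadm releases the control plane is tainted with @node-role.kubernetes.io/control-plane@ in addition to (and eventually instead of) the master taint; a hedged variant for such clusters:

<pre>
# Newer clusters use the control-plane taint; removing a taint that
# does not exist only produces a harmless "not found" error
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
</pre>
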
h3. Get the cluster admin.conf

* On the masters of each cluster you can find the file @/etc/kubernetes/admin.conf@
* To be able to administrate the cluster you can copy the admin.conf to your local machine
* Multi cluster debugging becomes very easy if you name the config ~/cX-admin.conf (see example below)

<pre>
% scp root@server47.place7.ungleich.ch:/etc/kubernetes/admin.conf ~/c2-admin.conf
% export KUBECONFIG=~/c2-admin.conf
% kubectl get nodes
NAME       STATUS                     ROLES                  AGE   VERSION
server47   Ready                      control-plane,master   82d   v1.22.0
server48   Ready                      control-plane,master   82d   v1.22.0
server49   Ready                      <none>                 82d   v1.22.0
server50   Ready                      <none>                 82d   v1.22.0
server59   Ready                      control-plane,master   82d   v1.22.0
server60   Ready,SchedulingDisabled   <none>                 82d   v1.22.0
server61   Ready                      <none>                 82d   v1.22.0
server62   Ready                      <none>                 82d   v1.22.0
</pre>

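A sketch for working with several clusters at once; the file names and the @c2@ context name are only examples:

<pre>
# Query a cluster without changing the environment
kubectl --kubeconfig ~/c2-admin.conf get nodes

# Or keep several configs loaded and switch via contexts.
# kubeadm names every context "kubernetes-admin@kubernetes",
# so rename them first to avoid clashes.
kubectl --kubeconfig ~/c2-admin.conf config rename-context kubernetes-admin@kubernetes c2
export KUBECONFIG=~/c2-admin.conf:~/p10-admin.conf
kubectl config use-context c2
</pre>
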
h3. Installing a new k8s cluster

* Decide on the cluster name (usually *cX.k8s.ooo*), X counting upwards
** Using pXX.k8s.ooo for production clusters of placeXX
* Use cdist to configure the nodes with requirements like crio
* Decide between single or multi node control plane setups (see below)
** Single control plane suitable for development clusters

Typical init procedure (a hedged sketch of such a kubeadm.yaml follows the list):

* Single control plane: @kubeadm init --config bootstrap/XXX/kubeadm.yaml@
* Multi control plane (HA): @kubeadm init --config bootstrap/XXX/kubeadm.yaml --upload-certs@

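The kubeadm.yaml files themselves live in the (private) bootstrap directories; the following is only a rough sketch of what such a file can look like for an IPv6-only cluster. The endpoint, subnets and version are placeholders, not our real values.

<pre>
# Sketch of bootstrap/XXX/kubeadm.yaml (placeholder values)
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.22.0
controlPlaneEndpoint: "pX-api.k8s.ooo:6443"
networking:
  podSubnet: 2a0a:e5c0:XXXX:1::/64
  serviceSubnet: 2a0a:e5c0:XXXX:2::/108
</pre>
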
h3. Deleting a pod that is hanging in terminating state

<pre>
kubectl delete pod <PODNAME> --grace-period=0 --force --namespace <NAMESPACE>
</pre>

(from https://stackoverflow.com/questions/35453792/pods-stuck-in-terminating-status)

h3. Listing nodes of a cluster

<pre>
[15:05] bridge:~% kubectl get nodes
NAME       STATUS   ROLES                  AGE   VERSION
server22   Ready    <none>                 52d   v1.22.0
server23   Ready    <none>                 52d   v1.22.2
server24   Ready    <none>                 52d   v1.22.0
server25   Ready    <none>                 52d   v1.22.0
server26   Ready    <none>                 52d   v1.22.0
server27   Ready    <none>                 52d   v1.22.0
server63   Ready    control-plane,master   52d   v1.22.0
server64   Ready    <none>                 52d   v1.22.0
server65   Ready    control-plane,master   52d   v1.22.0
server66   Ready    <none>                 52d   v1.22.0
server83   Ready    control-plane,master   52d   v1.22.0
server84   Ready    <none>                 52d   v1.22.0
server85   Ready    <none>                 52d   v1.22.0
server86   Ready    <none>                 52d   v1.22.0
</pre>

h3. Removing / draining a node

Usually @kubectl drain server@ should do the job, but sometimes we need to be more aggressive:

<pre>
kubectl drain --delete-emptydir-data --ignore-daemonsets serverXX
</pre>

h3. Re-adding a node after draining

<pre>
kubectl uncordon serverXX
</pre>

h3. (Re-)joining worker nodes after creating the cluster

* We need to have an up-to-date token
* We use different join commands for the workers and control plane nodes

Generating the join command on an existing control plane node:

<pre>
kubeadm token create --print-join-command
</pre>

h3. (Re-)joining control plane nodes after creating the cluster

* We generate the token again
* We upload the certificates
* We need to combine/create the join command for the control plane node

Example session:

<pre>
% kubeadm token create --print-join-command
kubeadm join p10-api.k8s.ooo:6443 --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash

% kubeadm init phase upload-certs --upload-certs
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:
CERTKEY

# Then we use these two outputs on the joining node:

kubeadm join p10-api.k8s.ooo:6443 --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash --control-plane --certificate-key CERTKEY
</pre>

Commands to be used on a control plane node:

<pre>
kubeadm token create --print-join-command
kubeadm init phase upload-certs --upload-certs
</pre>

Commands to be used on the joining node:

<pre>
JOINCOMMAND --control-plane --certificate-key CERTKEY
</pre>

SEE ALSO

* https://stackoverflow.com/questions/63936268/how-to-generate-kubeadm-token-for-secondary-control-plane-nodes
* https://blog.scottlowe.org/2019/08/15/reconstructing-the-join-command-for-kubeadm/

h3. How to fix: etcd does not start when rejoining a kubernetes cluster as a control plane

If during the above step etcd does not come up, @kubeadm join@ can hang as follows:

<pre>
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[check-etcd] Checking that the etcd cluster is healthy
error execution phase check-etcd: etcd cluster is not healthy: failed to dial endpoint https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2379 with maintenance client: context deadline exceeded
To see the stack trace of this error execute with --v=5 or higher
</pre>

Then the problem is likely that the etcd server is still a member of the cluster. We first need to remove it from the etcd cluster; afterwards the join works.

To fix this we do:

* Find a working etcd pod
* Find the etcd members / member list
* Remove the etcd member that we want to re-join the cluster

<pre>
# Find the etcd pods
kubectl -n kube-system get pods -l component=etcd,tier=control-plane

# Get the list of etcd servers with the member id
kubectl exec -n kube-system -ti ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list

# Remove the member
kubectl exec -n kube-system -ti ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove MEMBERID
</pre>

Sample session:

<pre>
[10:48] line:~% kubectl -n kube-system get pods -l component=etcd,tier=control-plane
NAME            READY   STATUS    RESTARTS     AGE
etcd-server63   1/1     Running   0            3m11s
etcd-server65   1/1     Running   3            7d2h
etcd-server83   1/1     Running   8 (6d ago)   7d2h

[10:48] line:~% kubectl exec -n kube-system -ti etcd-server65 -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
356891cd676df6e4, started, server65, https://[2a0a:e5c0:10:1:225:b3ff:fe20:375c]:2380, https://[2a0a:e5c0:10:1:225:b3ff:fe20:375c]:2379, false
371b8a07185dee7e, started, server63, https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2380, https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2379, false
5942bc58307f8af9, started, server83, https://[2a0a:e5c0:10:1:3e4a:92ff:fe79:bb98]:2380, https://[2a0a:e5c0:10:1:3e4a:92ff:fe79:bb98]:2379, false

[10:48] line:~% kubectl exec -n kube-system -ti etcd-server65 -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove 371b8a07185dee7e
Member 371b8a07185dee7e removed from cluster e3c0805f592a8f77
</pre>

SEE ALSO

* We found the solution using https://stackoverflow.com/questions/67921552/re-installed-node-cannot-join-kubernetes-cluster

h3. Hardware Maintenance using ungleich-hardware

Use the following manifest and replace the HOST with the actual host:

<pre>
apiVersion: v1
kind: Pod
metadata:
  name: ungleich-hardware-HOST
spec:
  containers:
  - name: ungleich-hardware
    image: ungleich/ungleich-hardware:0.0.5
    args:
    - sleep
    - "1000000"
    volumeMounts:
      - mountPath: /dev
        name: dev
    securityContext:
      privileged: true
  nodeSelector:
    kubernetes.io/hostname: "HOST"

  volumes:
    - name: dev
      hostPath:
        path: /dev
</pre>

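A usage sketch; the file name is arbitrary and @/bin/sh@ assumes the image ships a shell:

<pre>
# Apply the manifest and enter the pod (replace serverXX accordingly)
kubectl apply -f ungleich-hardware-serverXX.yaml
kubectl exec -ti ungleich-hardware-serverXX -- /bin/sh
</pre>
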
Also see: [[The_ungleich_hardware_maintenance_guide]]

h3. Triggering a cronjob / creating a job from a cronjob

To test a cronjob, we can create a job from a cronjob:

<pre>
kubectl create job --from=cronjob/volume2-daily-backup volume2-manual
</pre>

This creates a job volume2-manual based on the cronjob volume2-daily-backup.

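To verify the manually triggered run afterwards (plain kubectl, nothing cluster specific):

<pre>
# Watch the job and read its logs
kubectl get job volume2-manual
kubectl logs -f job/volume2-manual
</pre>
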
h2. Calico CNI

h3. Calico Installation

* We install "calico using helm":https://docs.projectcalico.org/getting-started/kubernetes/helm
* This has the following advantages:
** Easy to upgrade
** Does not require us to configure IPv6/dual stack settings, as the tigera operator figures things out on its own

Usually plain calico can be installed directly using:

<pre>
helm repo add projectcalico https://docs.projectcalico.org/charts
helm install calico projectcalico/tigera-operator --version v3.20.4
</pre>

* Check the tags on https://github.com/projectcalico/calico/tags for the latest release

h3. Installing calicoctl

To be able to manage and configure calico, we need to "install calicoctl (we choose the version as a pod)":https://docs.projectcalico.org/getting-started/clis/calicoctl/install#install-calicoctl-as-a-kubernetes-pod

<pre>
kubectl apply -f https://docs.projectcalico.org/manifests/calicoctl.yaml
</pre>

Or version specific:

<pre>
kubectl apply -f https://github.com/projectcalico/calico/blob/v3.20.4/manifests/calicoctl.yaml

# For 3.22
kubectl apply -f https://projectcalico.docs.tigera.io/archive/v3.22/manifests/calicoctl.yaml
</pre>

And making it more easily accessible via an alias:

<pre>
alias calicoctl="kubectl exec -i -n kube-system calicoctl -- /calicoctl"
</pre>

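A quick sanity check that the alias and the calicoctl pod work (standard read-only calicoctl commands):

<pre>
calicoctl version
calicoctl get nodes -o wide
</pre>
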
h3. Calico configuration

By default our k8s clusters "BGP peer":https://docs.projectcalico.org/networking/bgp with an upstream router to propagate podcidr and servicecidr.

Default settings in our infrastructure:

* We use a full-mesh using the @nodeToNodeMeshEnabled: true@ option
* We keep the original next hop so that *only* the server with the pod is announcing it (instead of ecmp)
* We use private ASNs for k8s clusters
* We do *not* use any overlay

After installing calico and calicoctl the last step of the installation is usually:

<pre>
calicoctl create -f - < calico-bgp.yaml
</pre>

A sample BGP configuration:

<pre>
---
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: true
  asNumber: 65534
  serviceClusterIPs:
  - cidr: 2a0a:e5c0:10:3::/108
  serviceExternalIPs:
  - cidr: 2a0a:e5c0:10:3::/108
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: router1-place10
spec:
  peerIP: 2a0a:e5c0:10:1::50
  asNumber: 213081
  keepOriginalNextHop: true
</pre>

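To read back what was applied, the objects can be inspected via calicoctl (using the alias from above):

<pre>
# Inspect the applied BGP configuration and peers
calicoctl get bgpConfiguration default -o yaml
calicoctl get bgpPeer -o yaml
</pre>
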
h2. ArgoCD / ArgoWorkFlow

h3. Argocd Installation

As there is no configuration management present yet, argocd is installed using:

<pre>
kubectl create namespace argocd

# Specific Version
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/v2.3.2/manifests/install.yaml

# OR: latest stable
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
</pre>

* See https://argo-cd.readthedocs.io/en/stable/

h3. Get the argocd credentials

<pre>
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d; echo ""
</pre>

h3. Accessing argocd

In regular IPv6 clusters:

* Navigate to https://argocd-server.argocd.CLUSTERDOMAIN

In legacy IPv4 clusters:

<pre>
kubectl --namespace argocd port-forward svc/argocd-server 8080:80
</pre>

* Navigate to https://localhost:8080 (a CLI alternative is sketched below)

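The argocd CLI can be used as well; a sketch, assuming the CLI is installed locally and using the admin password from the secret above:

<pre>
argocd login argocd-server.argocd.CLUSTERDOMAIN --username admin
argocd app list
</pre>
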
h3. Using the argocd webhook to trigger changes

* To trigger changes, POST JSON to https://argocd.example.com/api/webhook (a curl sketch follows below)

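A rough curl sketch; argocd expects a git-provider style push payload (the header and repository URL below follow the GitHub format and are placeholders), and normally the webhook is configured in gitea rather than being called by hand:

<pre>
curl -X POST https://argocd.example.com/api/webhook \
  -H "Content-Type: application/json" \
  -H "X-GitHub-Event: push" \
  -d '{"ref": "refs/heads/master", "repository": {"html_url": "https://code.ungleich.ch/ungleich-intern/k8s-config", "default_branch": "master"}}'
</pre>
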
h3. Deploying an application

* Applications are deployed via git towards gitea (code.ungleich.ch) and then pulled by argo
* Always include the *redmine-url* pointing to the (customer) ticket
** Also add the support-url if it exists

Application sample:

<pre>
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gitea-CUSTOMER
  namespace: argocd
spec:
  destination:
    namespace: default
    server: 'https://kubernetes.default.svc'
  source:
    path: apps/prod/gitea
    repoURL: 'https://code.ungleich.ch/ungleich-intern/k8s-config.git'
    targetRevision: HEAD
    helm:
      parameters:
        - name: storage.data.storageClass
          value: rook-ceph-block-hdd
        - name: storage.data.size
          value: 200Gi
        - name: storage.db.storageClass
          value: rook-ceph-block-ssd
        - name: storage.db.size
          value: 10Gi
        - name: storage.letsencrypt.storageClass
          value: rook-ceph-block-hdd
        - name: storage.letsencrypt.size
          value: 50Mi
        - name: letsencryptStaging
          value: 'no'
        - name: fqdn
          value: 'code.verua.online'
  project: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
  info:
    - name: 'redmine-url'
      value: 'https://redmine.ungleich.ch/issues/ISSUEID'
    - name: 'support-url'
      value: 'https://support.ungleich.ch/Ticket/Display.html?id=TICKETID'
</pre>

h2. Helm related operations and conventions

We use helm charts extensively.

* In production, they are managed via argocd
* In development, helm charts can be developed and deployed manually using the helm utility.

h3. Installing a helm chart

One can use the usual pattern of

<pre>
helm install <releasename> <chartdirectory>
</pre>

However, when testing helm charts you often want to reinstall/update. The following pattern is "better", because it also works if the release is already installed:

<pre>
helm upgrade --install <releasename> <chartdirectory>
</pre>

h3. Naming services and deployments in helm charts [Application labels]

* We always have {{ .Release.Name }} to identify the current "instance" (see the sketch below)
* Deployments:
** use @app: <what it is>@, f.i. @app: nginx@, @app: postgres@, ...
* See more about standard labels on
** https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/
** https://helm.sh/docs/chart_best_practices/labels/

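A minimal sketch of how these conventions can be combined in a deployment template; the @nginx@ image, the extra @release@ label and the file name are illustrative choices, not a prescribed standard:

<pre>
# templates/deployment.yaml (sketch)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-nginx
  labels:
    app: nginx
spec:
  selector:
    matchLabels:
      app: nginx
      release: {{ .Release.Name }}
  template:
    metadata:
      labels:
        app: nginx
        release: {{ .Release.Name }}
    spec:
      containers:
        - name: nginx
          image: nginx:1.21
</pre>
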
h2. Rook / Ceph Related Operations

h3. Executing ceph commands

Using the ceph-tools pod as follows:

<pre>
kubectl exec -n rook-ceph -ti $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o jsonpath='{.items[*].metadata.name}') -- ceph -s
</pre>

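Analogous to the calicoctl alias above, a small shell alias saves typing; a sketch:

<pre>
# Convenience alias for running ceph commands in the tools pod
alias rookceph='kubectl exec -n rook-ceph -ti $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o jsonpath="{.items[*].metadata.name}") -- ceph'

# Example usage
rookceph osd tree
rookceph health detail
</pre>
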
h3. Inspecting the logs of a specific server

<pre>
# Get the related pods
kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare
...

# Inspect the logs of a specific pod
kubectl -n rook-ceph logs -f rook-ceph-osd-prepare-server23--1-444qx
</pre>

h3. Inspecting the logs of the rook-ceph-operator

<pre>
kubectl -n rook-ceph logs -f -l app=rook-ceph-operator
</pre>

h3. Triggering server prepare / adding new osds

The rook-ceph-operator triggers/watches/creates pods to maintain hosts. To trigger a full "re-scan", simply delete that pod:

<pre>
kubectl -n rook-ceph delete pods -l app=rook-ceph-operator
</pre>

This will cause all the @rook-ceph-osd-prepare-..@ jobs to be recreated and thus OSDs to be created, if new disks have been added.

h3. Removing an OSD

* See "Ceph OSD Management":https://rook.io/docs/rook/v1.7/ceph-osd-mgmt.html
* More specifically: https://github.com/rook/rook/blob/release-1.7/cluster/examples/kubernetes/ceph/osd-purge.yaml
* Then delete the related deployment

Set the OSD id in the osd-purge.yaml and apply it. The OSD should be down before the job runs (see the sketch below).

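A hedged sketch for taking the OSD down first, via its deployment and the rook tools pod; @<ID>@ is a placeholder:

<pre>
# Stop the OSD pod and mark the OSD out (replace <ID>)
kubectl -n rook-ceph scale deployment rook-ceph-osd-<ID> --replicas=0
kubectl exec -n rook-ceph -ti $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o jsonpath='{.items[*].metadata.name}') -- ceph osd out osd.<ID>
</pre>

Afterwards apply the osd-purge.yaml below:
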
<pre>
apiVersion: batch/v1
kind: Job
metadata:
  name: rook-ceph-purge-osd
  namespace: rook-ceph # namespace:cluster
  labels:
    app: rook-ceph-purge-osd
spec:
  template:
    metadata:
      labels:
        app: rook-ceph-purge-osd
    spec:
      serviceAccountName: rook-ceph-purge-osd
      containers:
        - name: osd-removal
          image: rook/ceph:master
          # TODO: Insert the OSD ID in the last parameter that is to be removed
          # The OSD IDs are a comma-separated list. For example: "0" or "0,2".
          # If you want to preserve the OSD PVCs, set `--preserve-pvc true`.
          #
          # A --force-osd-removal option is available if the OSD should be destroyed even though the
          # removal could lead to data loss.
          args:
            - "ceph"
            - "osd"
            - "remove"
            - "--preserve-pvc"
            - "false"
            - "--force-osd-removal"
            - "false"
            - "--osd-ids"
            - "SETTHEOSDIDHERE"
          env:
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: ROOK_MON_ENDPOINTS
              valueFrom:
                configMapKeyRef:
                  key: data
                  name: rook-ceph-mon-endpoints
            - name: ROOK_CEPH_USERNAME
              valueFrom:
                secretKeyRef:
                  key: ceph-username
                  name: rook-ceph-mon
            - name: ROOK_CEPH_SECRET
              valueFrom:
                secretKeyRef:
                  key: ceph-secret
                  name: rook-ceph-mon
            - name: ROOK_CONFIG_DIR
              value: /var/lib/rook
            - name: ROOK_CEPH_CONFIG_OVERRIDE
              value: /etc/rook/config/override.conf
            - name: ROOK_FSID
              valueFrom:
                secretKeyRef:
                  key: fsid
                  name: rook-ceph-mon
            - name: ROOK_LOG_LEVEL
              value: DEBUG
          volumeMounts:
            - mountPath: /etc/ceph
              name: ceph-conf-emptydir
            - mountPath: /var/lib/rook
              name: rook-config
      volumes:
        - emptyDir: {}
          name: ceph-conf-emptydir
        - emptyDir: {}
          name: rook-config
      restartPolicy: Never
</pre>

Deleting the deployment:

<pre>
[18:05] bridge:~% kubectl -n rook-ceph delete deployment rook-ceph-osd-6
deployment.apps "rook-ceph-osd-6" deleted
</pre>

h2. Harbor

* We use "Harbor":https://goharbor.io/ for caching and as an image registry. Internal app reference: apps/prod/harbor.
* The admin password is in the password store, auto generated per cluster
* At the moment harbor only authenticates against the internal ldap tree

h3. LDAP configuration

* The URL needs to be ldaps://...
* The UID attribute is @uid@
* The rest stays at the defaults

h2. Monitoring / Prometheus

* Via "kube-prometheus":https://github.com/prometheus-operator/kube-prometheus/

Access via the following cluster-internal URLs (for access from outside, see the sketch below):

* http://prometheus-k8s.monitoring.svc:9090
* http://grafana.monitoring.svc:3000
* http://alertmanager.monitoring.svc:9093

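From outside the cluster, a port-forward is one way in (service names as in the kube-prometheus defaults above):

<pre>
# Forward grafana and prometheus to localhost
kubectl -n monitoring port-forward svc/grafana 3000:3000
kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090
</pre>
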
h3. Prometheus Options

* "helm/kube-prometheus-stack":https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
** Includes dashboards and co.
* "manifest based kube-prometheus":https://github.com/prometheus-operator/kube-prometheus
** Includes dashboards and co.
* "Prometheus Operator (mainly CRD manifests)":https://github.com/prometheus-operator/prometheus-operator

h2. Nextcloud

h3. How to get the nextcloud credentials

* The initial username is set to "nextcloud"
* The password is autogenerated and saved in a kubernetes secret

<pre>
kubectl get secret RELEASENAME-nextcloud -o jsonpath="{.data.PASSWORD}" | base64 -d; echo ""
</pre>

h3. How to fix "Access through untrusted domain"

* Nextcloud stores the initial domain configuration
* If the FQDN is changed, it will show the error message "Access through untrusted domain"
* To fix, edit /var/www/html/config/config.php and correct the domain
* Then delete the pods (a sketch of both steps follows below)

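A hedged sketch of those two steps; the deployment name, the @app.kubernetes.io/name=nextcloud@ label and the availability of an editor in the image depend on the chart and image in use:

<pre>
# Edit the config inside a running pod (adjust the trusted domains)
kubectl exec -ti deploy/RELEASENAME-nextcloud -- vi /var/www/html/config/config.php

# Then recreate the pods so all replicas pick up the change
kubectl delete pods -l app.kubernetes.io/name=nextcloud
</pre>
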
h2. Infrastructure versions

h3. ungleich kubernetes infrastructure v5 (2021-10)

Clusters are configured / setup in this order:

* Bootstrap via kubeadm
* "Networking via calico + BGP (non ECMP) using helm":https://docs.projectcalico.org/getting-started/kubernetes/helm
* "ArgoCD for CD":https://argo-cd.readthedocs.io/en/stable/
** "rook for storage via argocd":https://rook.io/
** haproxy as the in-cluster IPv4-to-IPv6 proxy for IPv6-only clusters, via argocd
** "kubernetes-secret-generator for in cluster secrets":https://github.com/mittwald/kubernetes-secret-generator
** "ungleich-certbot managing certs and nginx":https://hub.docker.com/repository/docker/ungleich/ungleich-certbot

h3. ungleich kubernetes infrastructure v4 (2021-09)

* rook is configured via manifests instead of using the rook-ceph-cluster helm chart
* The rook operator is still being installed via helm

h3. ungleich kubernetes infrastructure v3 (2021-07)

* rook is now installed via helm via argocd instead of directly via manifests

h3. ungleich kubernetes infrastructure v2 (2021-05)

* Replaced fluxv2 from ungleich k8s v1 with argocd
** argocd can apply helm templates directly without needing to go through Chart releases
* We are also using argoflow for build flows
* Planned to add "kaniko":https://github.com/GoogleContainerTools/kaniko for image building

h3. ungleich kubernetes infrastructure v1 (2021-01)

We are using the following components:

* "Calico as a CNI":https://www.projectcalico.org/ with BGP, IPv6 only, no encapsulation
** Needed for basic networking
* "kubernetes-secret-generator":https://github.com/mittwald/kubernetes-secret-generator for creating secrets
** Needed so that secrets are not stored in the git repository, but only in the cluster
* "ungleich-certbot":https://hub.docker.com/repository/docker/ungleich/ungleich-certbot
** Needed to get letsencrypt certificates for services
* "rook with ceph rbd + cephfs":https://rook.io/ for storage
** rbd for almost everything, *ReadWriteOnce*
** cephfs for smaller things, multi access *ReadWriteMany*
** Needed for providing persistent storage
* "flux v2":https://fluxcd.io/
** Needed to manage resources automatically