h1. The ungleich kubernetes infrastructure and ungleich kubernetes manual

{{toc}}

h2. Status

This document is **pre-production**.
This document is to become the ungleich kubernetes infrastructure overview as well as the ungleich kubernetes manual.

h2. k8s clusters

| Cluster           | Purpose/Setup     | Maintainer | Master(s)                     | argo                                                  | v4 http proxy | last verified |
| c0.k8s.ooo        | Dev               | -          | UNUSED                        |                                                       |               |    2021-10-05 |
| c1.k8s.ooo        | retired           |            | -                             |                                                       |               |    2022-03-15 |
| c2.k8s.ooo        | Dev p7 HW         | Nico       | server47 server53 server54    | "argo":https://argocd-server.argocd.svc.c2.k8s.ooo    |               |    2021-10-05 |
| c3.k8s.ooo        | retired           | -          | -                             |                                                       |               |    2021-10-05 |
| c4.k8s.ooo        | Dev2 p7 HW        | Jin-Guk    | server52 server53 server54    |                                                       |               |             - |
| c5.k8s.ooo        | retired           |            | -                             |                                                       |               |    2022-03-15 |
| c6.k8s.ooo        | Dev p6 VM Jin-Guk | Jin-Guk    |                               |                                                       |               |               |
| [[p5.k8s.ooo]]    | production        |            | server34 server36 server38    | "argo":https://argocd-server.argocd.svc.p5.k8s.ooo    |             - |               |
| [[p6.k8s.ooo]]    | production        |            | server67 server69 server71    | "argo":https://argocd-server.argocd.svc.p6.k8s.ooo    | 147.78.194.13 |    2021-10-05 |
| [[p10.k8s.ooo]]   | production        |            | server63 server65 server83    | "argo":https://argocd-server.argocd.svc.p10.k8s.ooo   | 147.78.194.12 |    2021-10-05 |
| [[k8s.ge.nau.so]] | development       |            | server107 server108 server109 | "argo":https://argocd-server.argocd.svc.k8s.ge.nau.so |               |               |

h2. General architecture and components overview

* All k8s clusters are IPv6 only
* We use BGP peering to propagate podcidr and serviceCidr networks to our infrastructure
* The main public testing repository is "ungleich-k8s":https://code.ungleich.ch/ungleich-public/ungleich-k8s
** Private configurations are found in the **k8s-config** repository

h3. Cluster types

| **Type/Feature**            | **Development**                | **Production**         |
| Min No. nodes               | 3 (1 master, 3 worker)         | 5 (3 master, 3 worker) |
| Recommended minimum         | 4 (dedicated master, 3 worker) | 8 (3 master, 5 worker) |
| Separation of control plane | optional                       | recommended            |
| Persistent storage          | required                       | required               |
| Number of storage monitors  | 3                              | 5                      |

h2. General k8s operations

h3. Cheat sheet / external great references

* "kubectl cheatsheet":https://kubernetes.io/docs/reference/kubectl/cheatsheet/

h3. Allowing to schedule work on the control plane

* Mostly for single node / test / development clusters
* Just remove the master taint as follows

<pre>
kubectl taint nodes --all node-role.kubernetes.io/master-
</pre>
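
Note: on newer Kubernetes releases (1.24 and later) the taint is named @node-role.kubernetes.io/control-plane@ instead; on such a cluster the equivalent command is:

<pre>
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
</pre>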

h3. Get the cluster admin.conf

* On the masters of each cluster you can find the file @/etc/kubernetes/admin.conf@
* To be able to administrate the cluster you can copy the admin.conf to your local machine
* Multi-cluster debugging becomes very easy if you name the config ~/cX-admin.conf (see the example below)

<pre>
% scp root@server47.place7.ungleich.ch:/etc/kubernetes/admin.conf ~/c2-admin.conf
% export KUBECONFIG=~/c2-admin.conf
% kubectl get nodes
NAME       STATUS                     ROLES                  AGE   VERSION
server47   Ready                      control-plane,master   82d   v1.22.0
server48   Ready                      control-plane,master   82d   v1.22.0
server49   Ready                      <none>                 82d   v1.22.0
server50   Ready                      <none>                 82d   v1.22.0
server59   Ready                      control-plane,master   82d   v1.22.0
server60   Ready,SchedulingDisabled   <none>                 82d   v1.22.0
server61   Ready                      <none>                 82d   v1.22.0
server62   Ready                      <none>                 82d   v1.22.0
</pre>

h3. Installing a new k8s cluster

* Decide on the cluster name (usually *cX.k8s.ooo*), X counting upwards
** Using pXX.k8s.ooo for production clusters of placeXX
* Use cdist to configure the nodes with requirements like crio
* Decide between single or multi node control plane setups (see below)
** A single control plane is suitable for development clusters

Typical init procedure (a sketch of the kubeadm.yaml follows after this list):

* Single control plane: @kubeadm init --config bootstrap/XXX/kubeadm.yaml@
* Multi control plane (HA): @kubeadm init --config bootstrap/XXX/kubeadm.yaml --upload-certs@
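
A minimal sketch of what such a kubeadm.yaml can contain; the real files live in the (private) bootstrap directories, and the cluster name, endpoint and IPv6 subnets below are placeholders, not our production values:

<pre>
# Sketch only: adjust clusterName, controlPlaneEndpoint and the subnets
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  criSocket: unix:///var/run/crio/crio.sock
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
clusterName: cX.k8s.ooo
controlPlaneEndpoint: "cX-api.k8s.ooo:6443"
networking:
  podSubnet: "2001:db8:1::/64"
  serviceSubnet: "2001:db8:2::/108"
</pre>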

h3. Deleting a pod that is hanging in terminating state

<pre>
kubectl delete pod <PODNAME> --grace-period=0 --force --namespace <NAMESPACE>
</pre>

(from https://stackoverflow.com/questions/35453792/pods-stuck-in-terminating-status)

h3. Listing nodes of a cluster

<pre>
[15:05] bridge:~% kubectl get nodes
NAME       STATUS   ROLES                  AGE   VERSION
server22   Ready    <none>                 52d   v1.22.0
server23   Ready    <none>                 52d   v1.22.2
server24   Ready    <none>                 52d   v1.22.0
server25   Ready    <none>                 52d   v1.22.0
server26   Ready    <none>                 52d   v1.22.0
server27   Ready    <none>                 52d   v1.22.0
server63   Ready    control-plane,master   52d   v1.22.0
server64   Ready    <none>                 52d   v1.22.0
server65   Ready    control-plane,master   52d   v1.22.0
server66   Ready    <none>                 52d   v1.22.0
server83   Ready    control-plane,master   52d   v1.22.0
server84   Ready    <none>                 52d   v1.22.0
server85   Ready    <none>                 52d   v1.22.0
server86   Ready    <none>                 52d   v1.22.0
</pre>

h3. Removing / draining a node

Usually @kubectl drain server@ should do the job, but sometimes we need to be more aggressive:

<pre>
kubectl drain --delete-emptydir-data --ignore-daemonsets serverXX
</pre>

h3. Readding a node after draining

<pre>
kubectl uncordon serverXX
</pre>

h3. (Re-)joining worker nodes after creating the cluster

* We need to have an up-to-date token
* We use different join commands for the workers and control plane nodes

Generating the join command on an existing control plane node:

<pre>
kubeadm token create --print-join-command
</pre>

The printed @kubeadm join ...@ command is then run on the worker node that should (re-)join.

h3. (Re-)joining control plane nodes after creating the cluster

* We generate the token again
* We upload the certificates
* We need to combine/create the join command for the control plane node

Example session:

<pre>
% kubeadm token create --print-join-command
kubeadm join p10-api.k8s.ooo:6443 --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash

% kubeadm init phase upload-certs --upload-certs
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:
CERTKEY

# Then we use these two outputs on the joining node:

kubeadm join p10-api.k8s.ooo:6443 --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash --control-plane --certificate-key CERTKEY
</pre>

Commands to be used on a control plane node:

<pre>
kubeadm token create --print-join-command
kubeadm init phase upload-certs --upload-certs
</pre>

Commands to be used on the joining node:

<pre>
JOINCOMMAND --control-plane --certificate-key CERTKEY
</pre>

SEE ALSO

* https://stackoverflow.com/questions/63936268/how-to-generate-kubeadm-token-for-secondary-control-plane-nodes
* https://blog.scottlowe.org/2019/08/15/reconstructing-the-join-command-for-kubeadm/

h3. How to fix etcd does not start when rejoining a kubernetes cluster as a control plane

If during the above step etcd does not come up, @kubeadm join@ can hang as follows:

<pre>
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[check-etcd] Checking that the etcd cluster is healthy
error execution phase check-etcd: etcd cluster is not healthy: failed to dial endpoint https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2379 with maintenance client: context deadline exceeded
To see the stack trace of this error execute with --v=5 or higher
</pre>

Then the problem is likely that the etcd server is still a member of the cluster. We first need to remove it from the etcd cluster and then the join works.

To fix this we do:

* Find a working etcd pod
* Find the etcd members / member list
* Remove the etcd member that we want to re-join the cluster

<pre>
# Find the etcd pods
kubectl -n kube-system get pods -l component=etcd,tier=control-plane

# Get the list of etcd servers with the member id
kubectl exec -n kube-system -ti ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list

# Remove the member
kubectl exec -n kube-system -ti ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove MEMBERID
</pre>

Sample session:

<pre>
[10:48] line:~% kubectl -n kube-system get pods -l component=etcd,tier=control-plane
NAME            READY   STATUS    RESTARTS     AGE
etcd-server63   1/1     Running   0            3m11s
etcd-server65   1/1     Running   3            7d2h
etcd-server83   1/1     Running   8 (6d ago)   7d2h
[10:48] line:~% kubectl exec -n kube-system -ti etcd-server65 -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
356891cd676df6e4, started, server65, https://[2a0a:e5c0:10:1:225:b3ff:fe20:375c]:2380, https://[2a0a:e5c0:10:1:225:b3ff:fe20:375c]:2379, false
371b8a07185dee7e, started, server63, https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2380, https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2379, false
5942bc58307f8af9, started, server83, https://[2a0a:e5c0:10:1:3e4a:92ff:fe79:bb98]:2380, https://[2a0a:e5c0:10:1:3e4a:92ff:fe79:bb98]:2379, false

[10:48] line:~% kubectl exec -n kube-system -ti etcd-server65 -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove 371b8a07185dee7e
Member 371b8a07185dee7e removed from cluster e3c0805f592a8f77
</pre>

SEE ALSO

* We found the solution using https://stackoverflow.com/questions/67921552/re-installed-node-cannot-join-kubernetes-cluster

h3. Hardware Maintenance using ungleich-hardware

Use the following manifest and replace the HOST with the actual host:

<pre>
apiVersion: v1
kind: Pod
metadata:
  name: ungleich-hardware-HOST
spec:
  containers:
  - name: ungleich-hardware
    image: ungleich/ungleich-hardware:0.0.5
    args:
    - sleep
    - "1000000"
    volumeMounts:
      - mountPath: /dev
        name: dev
    securityContext:
      privileged: true
  nodeSelector:
    kubernetes.io/hostname: "HOST"

  volumes:
    - name: dev
      hostPath:
        path: /dev
</pre>
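
After applying the manifest, one can enter the pod to run the hardware tools (a sketch; HOST is the placeholder from above and we assume the image ships a shell):

<pre>
kubectl apply -f ungleich-hardware-HOST.yaml
kubectl exec -ti ungleich-hardware-HOST -- /bin/sh
</pre>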

Also see: [[The_ungleich_hardware_maintenance_guide]]

h3. Triggering a cronjob / creating a job from a cronjob

To test a cronjob, we can create a job from a cronjob:

<pre>
kubectl create job --from=cronjob/volume2-daily-backup volume2-manual
</pre>

This creates a job volume2-manual based on the cronjob volume2-daily-backup.

h3. su-ing into a user that has nologin shell set

Often users have nologin set as their shell inside the container. To be able to execute maintenance commands within the container, we can use @su -s /bin/sh@ like this:

<pre>
su -s /bin/sh -c '/path/to/your/script' testuser
</pre>

Found on https://serverfault.com/questions/351046/how-to-run-command-as-user-who-has-usr-sbin-nologin-as-shell

h3. How to print a secret value

Assuming you want the "password" item from a secret, use:

<pre>
kubectl get secret SECRETNAME -o jsonpath="{.data.password}" | base64 -d; echo ""
</pre>

h2. Calico CNI

h3. Calico Installation

* We install "calico using helm":https://docs.projectcalico.org/getting-started/kubernetes/helm
* This has the following advantages:
** Easy to upgrade
** Does not require us to configure IPv6/dual stack settings, as the tigera operator figures things out on its own

Usually plain calico can be installed directly using:

<pre>
helm repo add projectcalico https://docs.projectcalico.org/charts
helm install --namespace tigera calico projectcalico/tigera-operator --version v3.23.2 --create-namespace
</pre>

* Check the tags on https://github.com/projectcalico/calico/tags for the latest release

h3. Installing calicoctl

* General installation instructions, including binary download: https://projectcalico.docs.tigera.io/maintenance/clis/calicoctl/install

To be able to manage and configure calico, we need to "install calicoctl (we choose the variant running as a pod)":https://docs.projectcalico.org/getting-started/clis/calicoctl/install#install-calicoctl-as-a-kubernetes-pod

<pre>
kubectl apply -f https://docs.projectcalico.org/manifests/calicoctl.yaml
</pre>

Or version specific:

<pre>
kubectl apply -f https://github.com/projectcalico/calico/blob/v3.20.4/manifests/calicoctl.yaml

# For 3.22
kubectl apply -f https://projectcalico.docs.tigera.io/archive/v3.22/manifests/calicoctl.yaml
</pre>

And making it easily accessible via an alias:

<pre>
alias calicoctl="kubectl exec -i -n kube-system calicoctl -- /calicoctl"
</pre>

h3. Calico configuration

By default our k8s clusters "BGP peer":https://docs.projectcalico.org/networking/bgp
with an upstream router to propagate podcidr and servicecidr.

Default settings in our infrastructure:

* We use a full-mesh using the @nodeToNodeMeshEnabled: true@ option
* We keep the original next hop so that *only* the server with the pod is announcing it (instead of ecmp)
* We use private ASNs for k8s clusters
* We do *not* use any overlay

After installing calico and calicoctl the last step of the installation is usually:

<pre>
calicoctl create -f - < calico-bgp.yaml
</pre>

A sample BGP configuration:

<pre>
---
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: true
  asNumber: 65534
  serviceClusterIPs:
  - cidr: 2a0a:e5c0:10:3::/108
  serviceExternalIPs:
  - cidr: 2a0a:e5c0:10:3::/108
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: router1-place10
spec:
  peerIP: 2a0a:e5c0:10:1::50
  asNumber: 213081
  keepOriginalNextHop: true
</pre>

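To check that the BGP configuration was applied, the resources can be listed again with the calicoctl alias from above (a quick verification sketch):

<pre>
calicoctl get bgpconfig -o yaml
calicoctl get bgppeer
</pre>
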
h2. ArgoCD / ArgoWorkFlow

h3. Argocd Installation

* See https://argo-cd.readthedocs.io/en/stable/

As there is no configuration management present yet, argocd is installed using:

<pre>
kubectl create namespace argocd

# Specific Version
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/v2.3.2/manifests/install.yaml

# OR: latest stable
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
</pre>

h3. Get the argocd credentials

<pre>
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d; echo ""
</pre>

h3. Accessing argocd

In regular IPv6 clusters:

* Navigate to https://argocd-server.argocd.CLUSTERDOMAIN

In legacy IPv4 clusters:

<pre>
kubectl --namespace argocd port-forward svc/argocd-server 8080:80
</pre>

* Navigate to https://localhost:8080

h3. Using the argocd webhook to trigger changes

* To trigger changes, POST JSON to https://argocd.example.com/api/webhook, e.g. with curl as sketched below

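A rough sketch of such a trigger, assuming a Gogs/Gitea-style push event; the exact payload and headers follow the upstream argocd webhook documentation, and the repository URL and branch here are placeholders:

<pre>
curl -X POST https://argocd.example.com/api/webhook \
  -H "Content-Type: application/json" \
  -H "X-Gogs-Event: push" \
  -d '{"ref": "refs/heads/master", "repository": {"html_url": "https://code.ungleich.ch/ungleich-intern/k8s-config"}}'
</pre>
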
h3. Deploying an application

* Applications are deployed via git towards gitea (code.ungleich.ch) and then pulled by argo
* Always include the *redmine-url* pointing to the (customer) ticket
** Also add the support-url if it exists

Application sample

<pre>
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gitea-CUSTOMER
  namespace: argocd
spec:
  destination:
    namespace: default
    server: 'https://kubernetes.default.svc'
  source:
    path: apps/prod/gitea
    repoURL: 'https://code.ungleich.ch/ungleich-intern/k8s-config.git'
    targetRevision: HEAD
    helm:
      parameters:
        - name: storage.data.storageClass
          value: rook-ceph-block-hdd
        - name: storage.data.size
          value: 200Gi
        - name: storage.db.storageClass
          value: rook-ceph-block-ssd
        - name: storage.db.size
          value: 10Gi
        - name: storage.letsencrypt.storageClass
          value: rook-ceph-block-hdd
        - name: storage.letsencrypt.size
          value: 50Mi
        - name: letsencryptStaging
          value: 'no'
        - name: fqdn
          value: 'code.verua.online'
  project: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
  info:
    - name: 'redmine-url'
      value: 'https://redmine.ungleich.ch/issues/ISSUEID'
    - name: 'support-url'
      value: 'https://support.ungleich.ch/Ticket/Display.html?id=TICKETID'
</pre>

h2. Helm related operations and conventions

We use helm charts extensively.

* In production, they are managed via argocd
* In development, helm charts can be developed and deployed manually using the helm utility

h3. Installing a helm chart

One can use the usual pattern of

<pre>
helm install <releasename> <chartdirectory>
</pre>

However, when testing helm charts you often want to reinstall/update. The following pattern is "better", because it also works if the chart is already installed:

<pre>
helm upgrade --install <releasename> <chartdirectory>
</pre>

h3. Naming services and deployments in helm charts [Application labels]

* We always have {{ .Release.Name }} to identify the current "instance"
* Deployments:
** use @app: <what it is>@, f.i. @app: nginx@, @app: postgres@, ... (see the sketch below)
* See more about standard labels on
** https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/
** https://helm.sh/docs/chart_best_practices/labels/

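A minimal sketch of how these conventions look in a deployment template; the nginx name and image are only for illustration:

<pre>
apiVersion: apps/v1
kind: Deployment
metadata:
  # {{ .Release.Name }} identifies the instance, the app label says what it is
  name: {{ .Release.Name }}-nginx
  labels:
    app: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.21
</pre>
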
h2. Rook / Ceph Related Operations

h3. Executing ceph commands

Using the ceph-tools pod as follows:

<pre>
kubectl exec -n rook-ceph -ti $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o jsonpath='{.items[*].metadata.name}') -- ceph -s
</pre>

h3. Inspecting the logs of a specific server

<pre>
# Get the related pods
kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare
...

# Inspect the logs of a specific pod
kubectl -n rook-ceph logs -f rook-ceph-osd-prepare-server23--1-444qx
</pre>

h3. Inspecting the logs of the rook-ceph-operator

<pre>
kubectl -n rook-ceph logs -f -l app=rook-ceph-operator
</pre>

h3. Triggering server prepare / adding new osds

The rook-ceph-operator triggers/watches/creates pods to maintain hosts. To trigger a full "re scan", simply delete that pod:

<pre>
kubectl -n rook-ceph delete pods -l app=rook-ceph-operator
</pre>

This will cause all the @rook-ceph-osd-prepare-..@ jobs to be recreated and thus OSDs to be created, if new disks have been added.

h3. Removing an OSD

* See "Ceph OSD Management":https://rook.io/docs/rook/v1.7/ceph-osd-mgmt.html
* More specifically: https://github.com/rook/rook/blob/release-1.7/cluster/examples/kubernetes/ceph/osd-purge.yaml
* Then delete the related deployment

Set the OSD id in the osd-purge.yaml and apply it. The OSD should be down before doing this.

<pre>
apiVersion: batch/v1
kind: Job
metadata:
  name: rook-ceph-purge-osd
  namespace: rook-ceph # namespace:cluster
  labels:
    app: rook-ceph-purge-osd
spec:
  template:
    metadata:
      labels:
        app: rook-ceph-purge-osd
    spec:
      serviceAccountName: rook-ceph-purge-osd
      containers:
        - name: osd-removal
          image: rook/ceph:master
          # TODO: Insert the OSD ID in the last parameter that is to be removed
          # The OSD IDs are a comma-separated list. For example: "0" or "0,2".
          # If you want to preserve the OSD PVCs, set `--preserve-pvc true`.
          #
          # A --force-osd-removal option is available if the OSD should be destroyed even though the
          # removal could lead to data loss.
          args:
            - "ceph"
            - "osd"
            - "remove"
            - "--preserve-pvc"
            - "false"
            - "--force-osd-removal"
            - "false"
            - "--osd-ids"
            - "SETTHEOSDIDHERE"
          env:
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: ROOK_MON_ENDPOINTS
              valueFrom:
                configMapKeyRef:
                  key: data
                  name: rook-ceph-mon-endpoints
            - name: ROOK_CEPH_USERNAME
              valueFrom:
                secretKeyRef:
                  key: ceph-username
                  name: rook-ceph-mon
            - name: ROOK_CEPH_SECRET
              valueFrom:
                secretKeyRef:
                  key: ceph-secret
                  name: rook-ceph-mon
            - name: ROOK_CONFIG_DIR
              value: /var/lib/rook
            - name: ROOK_CEPH_CONFIG_OVERRIDE
              value: /etc/rook/config/override.conf
            - name: ROOK_FSID
              valueFrom:
                secretKeyRef:
                  key: fsid
                  name: rook-ceph-mon
            - name: ROOK_LOG_LEVEL
              value: DEBUG
          volumeMounts:
            - mountPath: /etc/ceph
              name: ceph-conf-emptydir
            - mountPath: /var/lib/rook
              name: rook-config
      volumes:
        - emptyDir: {}
          name: ceph-conf-emptydir
        - emptyDir: {}
          name: rook-config
      restartPolicy: Never
</pre>

Deleting the deployment:

<pre>
[18:05] bridge:~% kubectl -n rook-ceph delete deployment rook-ceph-osd-6
deployment.apps "rook-ceph-osd-6" deleted
</pre>

h2. Harbor

* We use "Harbor":https://goharbor.io/ for caching and as an image registry. Internal app reference: apps/prod/harbor.
* The admin password is in the password store, auto generated per cluster
* At the moment harbor only authenticates against the internal ldap tree

h3. LDAP configuration

* The url needs to be ldaps://...
* uid = uid
* The rest of the settings are standard

h2. Monitoring / Prometheus

* Via "kube-prometheus":https://github.com/prometheus-operator/kube-prometheus/

Access via the cluster-internal service URLs (from a workstation e.g. via port-forwarding, see the sketch below):

* http://prometheus-k8s.monitoring.svc:9090
* http://grafana.monitoring.svc:3000
* http://alertmanager.monitoring.svc:9093
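
A port-forwarding sketch for reaching these from a local machine; the service names are assumptions based on a standard kube-prometheus install, adjust them to what @kubectl -n monitoring get svc@ shows:

<pre>
kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090
kubectl -n monitoring port-forward svc/grafana 3000:3000
kubectl -n monitoring port-forward svc/alertmanager-main 9093:9093
</pre>
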
h3. Prometheus Options

* "helm/kube-prometheus-stack":https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
** Includes dashboards and co.
* "manifest based kube-prometheus":https://github.com/prometheus-operator/kube-prometheus
** Includes dashboards and co.
* "Prometheus Operator (mainly CRD manifests)":https://github.com/prometheus-operator/prometheus-operator

h2. Nextcloud

h3. How to get the nextcloud credentials

* The initial username is set to "nextcloud"
* The password is autogenerated and saved in a kubernetes secret

<pre>
kubectl get secret RELEASENAME-nextcloud -o jsonpath="{.data.PASSWORD}" | base64 -d; echo ""
</pre>

h3. How to fix "Access through untrusted domain"

* Nextcloud stores the initial domain configuration
* If the FQDN is changed, it will show the error message "Access through untrusted domain"
* To fix, edit /var/www/html/config/config.php and correct the domain
* Then delete the pods (see the sketch below)
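
A rough sketch of these two steps with kubectl; the pod name and label are placeholders that depend on the actual release:

<pre>
# Fix the trusted domain inside the running pod (NEXTCLOUDPOD is a placeholder)
kubectl exec -ti NEXTCLOUDPOD -- sed -i "s/old.example.com/new.example.com/" /var/www/html/config/config.php

# Then restart by deleting the pods of the release
kubectl delete pods -l app.kubernetes.io/name=nextcloud
</pre>
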
h2. Infrastructure versions

h3. ungleich kubernetes infrastructure v5 (2021-10)

Clusters are configured / set up in this order:

* Bootstrap via kubeadm
* "Networking via calico + BGP (non ECMP) using helm":https://docs.projectcalico.org/getting-started/kubernetes/helm
* "ArgoCD for CD":https://argo-cd.readthedocs.io/en/stable/
** "rook for storage via argocd":https://rook.io/
** haproxy as IPv4-to-IPv6 proxy into the IPv6-only cluster, via argocd
** "kubernetes-secret-generator for in cluster secrets":https://github.com/mittwald/kubernetes-secret-generator
** "ungleich-certbot managing certs and nginx":https://hub.docker.com/repository/docker/ungleich/ungleich-certbot

h3. ungleich kubernetes infrastructure v4 (2021-09)

* rook is configured via manifests instead of using the rook-ceph-cluster helm chart
* The rook operator is still being installed via helm

h3. ungleich kubernetes infrastructure v3 (2021-07)

* rook is now installed via helm via argocd instead of directly via manifests

h3. ungleich kubernetes infrastructure v2 (2021-05)

* Replaced fluxv2 from ungleich k8s v1 with argocd
** argocd can apply helm templates directly without needing to go through Chart releases
* We are also using argoflow for build flows
* Planned to add "kaniko":https://github.com/GoogleContainerTools/kaniko for image building

h3. ungleich kubernetes infrastructure v1 (2021-01)

We are using the following components:

* "Calico as a CNI":https://www.projectcalico.org/ with BGP, IPv6 only, no encapsulation
** Needed for basic networking
* "kubernetes-secret-generator":https://github.com/mittwald/kubernetes-secret-generator for creating secrets
** Needed so that secrets are not stored in the git repository, but only in the cluster
* "ungleich-certbot":https://hub.docker.com/repository/docker/ungleich/ungleich-certbot
** Needed to get letsencrypt certificates for services
* "rook with ceph rbd + cephfs":https://rook.io/ for storage
** rbd for almost everything, *ReadWriteOnce*
** cephfs for smaller things, multi access *ReadWriteMany*
** Needed for providing persistent storage
* "flux v2":https://fluxcd.io/
** Needed to manage resources automatically