h1. The ungleich kubernetes infrastructure and ungleich kubernetes manual

{{toc}}

h2. Status

This document is **pre-production**.
This document is to become the ungleich kubernetes infrastructure overview as well as the ungleich kubernetes manual.

h2. k8s clusters

| Cluster           | Purpose/Setup     | Maintainer | Master(s)                     | argo                                                  | v4 http proxy | last verified |
| c0.k8s.ooo        | Dev               | -          | UNUSED                        |                                                       |               |    2021-10-05 |
| c1.k8s.ooo        | retired           |            | -                             |                                                       |               |    2022-03-15 |
| c2.k8s.ooo        | Dev p7 HW         | Nico       | server47 server53 server54    | "argo":https://argocd-server.argocd.svc.c2.k8s.ooo    |               |    2021-10-05 |
| c3.k8s.ooo        | retired           | -          | -                             |                                                       |               |    2021-10-05 |
| c4.k8s.ooo        | Dev2 p7 HW        | Jin-Guk    | server52 server53 server54    |                                                       |               |             - |
| c5.k8s.ooo        | retired           |            | -                             |                                                       |               |    2022-03-15 |
| c6.k8s.ooo        | Dev p6 VM Jin-Guk | Jin-Guk    |                               |                                                       |               |               |
| [[p5.k8s.ooo]]    | production        |            | server34 server36 server38    | "argo":https://argocd-server.argocd.svc.p5.k8s.ooo    |             - |               |
| [[p6.k8s.ooo]]    | production        |            | server67 server69 server71    | "argo":https://argocd-server.argocd.svc.p6.k8s.ooo    | 147.78.194.13 |    2021-10-05 |
| [[p10.k8s.ooo]]   | production        |            | server63 server65 server83    | "argo":https://argocd-server.argocd.svc.p10.k8s.ooo   | 147.78.194.12 |    2021-10-05 |
| [[k8s.ge.nau.so]] | development       |            | server107 server108 server109 | "argo":https://argocd-server.argocd.svc.k8s.ge.nau.so |               |               |

h2. General architecture and components overview

* All k8s clusters are IPv6 only
* We use BGP peering to propagate podcidr and serviceCidr networks to our infrastructure
* The main public testing repository is "ungleich-k8s":https://code.ungleich.ch/ungleich-public/ungleich-k8s
** Private configurations are found in the **k8s-config** repository

h3. Cluster types

| **Type/Feature**            | **Development**                | **Production**         |
| Min No. nodes               | 3 (1 master, 3 worker)         | 5 (3 master, 3 worker) |
| Recommended minimum         | 4 (dedicated master, 3 worker) | 8 (3 master, 5 worker) |
| Separation of control plane | optional                       | recommended            |
| Persistent storage          | required                       | required               |
| Number of storage monitors  | 3                              | 5                      |

h2. General k8s operations

h3. Cheat sheet / great external references

* "kubectl cheatsheet":https://kubernetes.io/docs/reference/kubectl/cheatsheet/

h3. Allowing to schedule work on the control plane / removing node taints

* Mostly for single node / test / development clusters
* Just remove the master taint as follows

<pre>
kubectl taint nodes --all node-role.kubernetes.io/master-
</pre>

You can check the node taints using @kubectl describe node ...@
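
On newer Kubernetes releases (v1.24 and later) the control plane taint is named @node-role.kubernetes.io/control-plane@ instead of the master variant; a sketch that covers both (the second command simply reports an error if that taint is absent):

<pre>
kubectl taint nodes --all node-role.kubernetes.io/master-
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
</pre>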

h3. Get the cluster admin.conf

* On the masters of each cluster you can find the file @/etc/kubernetes/admin.conf@
* To be able to administrate the cluster you can copy the admin.conf to your local machine
* Multi-cluster debugging becomes very easy if you name the config ~/cX-admin.conf (see example below)

<pre>
% scp root@server47.place7.ungleich.ch:/etc/kubernetes/admin.conf ~/c2-admin.conf
% export KUBECONFIG=~/c2-admin.conf
% kubectl get nodes
NAME       STATUS                     ROLES                  AGE   VERSION
server47   Ready                      control-plane,master   82d   v1.22.0
server48   Ready                      control-plane,master   82d   v1.22.0
server49   Ready                      <none>                 82d   v1.22.0
server50   Ready                      <none>                 82d   v1.22.0
server59   Ready                      control-plane,master   82d   v1.22.0
server60   Ready,SchedulingDisabled   <none>                 82d   v1.22.0
server61   Ready                      <none>                 82d   v1.22.0
server62   Ready                      <none>                 82d   v1.22.0
</pre>

h3. Installing a new k8s cluster

* Decide on the cluster name (usually *cX.k8s.ooo*), X counting upwards
** Use pXX.k8s.ooo for production clusters of placeXX
* Use cdist to configure the nodes with requirements like crio
* Decide between single or multi node control plane setups (see below)
** A single control plane is suitable for development clusters

Typical init procedure (see the config sketch below):

* Single control plane: @kubeadm init --config bootstrap/XXX/kubeadm.yaml@
* Multi control plane (HA): @kubeadm init --config bootstrap/XXX/kubeadm.yaml --upload-certs@
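
The @bootstrap/XXX/kubeadm.yaml@ files are maintained per cluster; a minimal sketch of what such a config could look like for an IPv6-only cluster (endpoint, version, domain and CIDRs below are placeholders, not the real values):

<pre>
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.22.0
controlPlaneEndpoint: "cX-api.k8s.ooo:6443"
networking:
  dnsDomain: cX.k8s.ooo
  podSubnet: 2a0a:e5c0:XXXX::/48
  serviceSubnet: 2a0a:e5c0:XXXX:3::/108
</pre>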

h3. Deleting a pod that is hanging in terminating state

<pre>
kubectl delete pod <PODNAME> --grace-period=0 --force --namespace <NAMESPACE>
</pre>

(from https://stackoverflow.com/questions/35453792/pods-stuck-in-terminating-status)

h3. Listing nodes of a cluster

<pre>
[15:05] bridge:~% kubectl get nodes
NAME       STATUS   ROLES                  AGE   VERSION
server22   Ready    <none>                 52d   v1.22.0
server23   Ready    <none>                 52d   v1.22.2
server24   Ready    <none>                 52d   v1.22.0
server25   Ready    <none>                 52d   v1.22.0
server26   Ready    <none>                 52d   v1.22.0
server27   Ready    <none>                 52d   v1.22.0
server63   Ready    control-plane,master   52d   v1.22.0
server64   Ready    <none>                 52d   v1.22.0
server65   Ready    control-plane,master   52d   v1.22.0
server66   Ready    <none>                 52d   v1.22.0
server83   Ready    control-plane,master   52d   v1.22.0
server84   Ready    <none>                 52d   v1.22.0
server85   Ready    <none>                 52d   v1.22.0
server86   Ready    <none>                 52d   v1.22.0
</pre>

h3. Removing / draining a node

Usually @kubectl drain server@ should do the job, but sometimes we need to be more aggressive:

<pre>
kubectl drain --delete-emptydir-data --ignore-daemonsets serverXX
</pre>

h3. Re-adding a node after draining

<pre>
kubectl uncordon serverXX
</pre>

h3. (Re-)joining worker nodes after creating the cluster

* We need to have an up-to-date token
* We use different join commands for the workers and control plane nodes

Generating the join command on an existing control plane node:

<pre>
kubeadm token create --print-join-command
</pre>
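
The printed command is then run on the worker node that should (re-)join; a sketch of what it looks like (token and hash are placeholders):

<pre>
kubeadm join p10-api.k8s.ooo:6443 --token TOKEN --discovery-token-ca-cert-hash sha256:HASH
</pre>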

h3. (Re-)joining control plane nodes after creating the cluster

* We generate the token again
* We upload the certificates
* We need to combine/create the join command for the control plane node

Example session:

<pre>
% kubeadm token create --print-join-command
kubeadm join p10-api.k8s.ooo:6443 --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash

% kubeadm init phase upload-certs --upload-certs
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:
CERTKEY

# Then we use these two outputs on the joining node:

kubeadm join p10-api.k8s.ooo:6443 --token xmff4i.ABC --discovery-token-ca-cert-hash sha256:longhash --control-plane --certificate-key CERTKEY
</pre>

Commands to be used on a control plane node:

<pre>
kubeadm token create --print-join-command
kubeadm init phase upload-certs --upload-certs
</pre>

Commands to be used on the joining node:

<pre>
JOINCOMMAND --control-plane --certificate-key CERTKEY
</pre>

SEE ALSO

* https://stackoverflow.com/questions/63936268/how-to-generate-kubeadm-token-for-secondary-control-plane-nodes
* https://blog.scottlowe.org/2019/08/15/reconstructing-the-join-command-for-kubeadm/

h3. How to fix etcd does not start when rejoining a kubernetes cluster as a control plane

If during the above step etcd does not come up, @kubeadm join@ can hang as follows:

<pre>
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[check-etcd] Checking that the etcd cluster is healthy
error execution phase check-etcd: etcd cluster is not healthy: failed to dial endpoint https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2379 with maintenance client: context deadline exceeded
To see the stack trace of this error execute with --v=5 or higher
</pre>

Then the problem is likely that the etcd server is still a member of the cluster. We first need to remove it from the etcd cluster and then the join works.

To fix this we do:

* Find a working etcd pod
* Find the etcd members / member list
* Remove the etcd member that we want to re-join the cluster

<pre>
# Find the etcd pods
kubectl -n kube-system get pods -l component=etcd,tier=control-plane

# Get the list of etcd servers with the member id
kubectl exec -n kube-system -ti ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list

# Remove the member
kubectl exec -n kube-system -ti ETCDPODNAME -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove MEMBERID
</pre>

Sample session:

<pre>
[10:48] line:~% kubectl -n kube-system get pods -l component=etcd,tier=control-plane
NAME            READY   STATUS    RESTARTS     AGE
etcd-server63   1/1     Running   0            3m11s
etcd-server65   1/1     Running   3            7d2h
etcd-server83   1/1     Running   8 (6d ago)   7d2h
[10:48] line:~% kubectl exec -n kube-system -ti etcd-server65 -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
356891cd676df6e4, started, server65, https://[2a0a:e5c0:10:1:225:b3ff:fe20:375c]:2380, https://[2a0a:e5c0:10:1:225:b3ff:fe20:375c]:2379, false
371b8a07185dee7e, started, server63, https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2380, https://[2a0a:e5c0:10:1:225:b3ff:fe20:378a]:2379, false
5942bc58307f8af9, started, server83, https://[2a0a:e5c0:10:1:3e4a:92ff:fe79:bb98]:2380, https://[2a0a:e5c0:10:1:3e4a:92ff:fe79:bb98]:2379, false

[10:48] line:~% kubectl exec -n kube-system -ti etcd-server65 -- etcdctl --endpoints '[::1]:2379' --cacert /etc/kubernetes/pki/etcd/ca.crt --cert  /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove 371b8a07185dee7e
Member 371b8a07185dee7e removed from cluster e3c0805f592a8f77
</pre>

SEE ALSO

* We found the solution using https://stackoverflow.com/questions/67921552/re-installed-node-cannot-join-kubernetes-cluster

h3. Hardware Maintenance using ungleich-hardware

Use the following manifest and replace the HOST with the actual host:

<pre>
apiVersion: v1
kind: Pod
metadata:
  name: ungleich-hardware-HOST
spec:
  containers:
  - name: ungleich-hardware
    image: ungleich/ungleich-hardware:0.0.5
    args:
    - sleep
    - "1000000"
    volumeMounts:
      - mountPath: /dev
        name: dev
    securityContext:
      privileged: true
  nodeSelector:
    kubernetes.io/hostname: "HOST"

  volumes:
    - name: dev
      hostPath:
        path: /dev
</pre>

Also see: [[The_ungleich_hardware_maintenance_guide]]
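
A sketch of the typical workflow around this manifest (the file name is illustrative, the pod name follows from the HOST substitution above, and it assumes the image ships a shell):

<pre>
# Create the privileged pod on the node in question
kubectl apply -f ungleich-hardware-HOST.yaml

# Enter it to run hardware related commands
kubectl exec -ti ungleich-hardware-HOST -- /bin/sh

# Clean up afterwards
kubectl delete pod ungleich-hardware-HOST
</pre>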

h3. Triggering a cronjob / creating a job from a cronjob

To test a cronjob, we can create a job from a cronjob:

<pre>
kubectl create job --from=cronjob/volume2-daily-backup volume2-manual
</pre>

This creates a job @volume2-manual@ based on the cronjob @volume2-daily-backup@.

h3. su-ing into a user that has nologin shell set

Often users have nologin set as their shell inside the container. To be able to execute maintenance commands within the container, we can use @su -s /bin/sh@ like this:

<pre>
su -s /bin/sh -c '/path/to/your/script' testuser
</pre>

Found on https://serverfault.com/questions/351046/how-to-run-command-as-user-who-has-usr-sbin-nologin-as-shell

h3. How to print a secret value

Assuming you want the "password" item from a secret, use:

<pre>
kubectl get secret SECRETNAME -o jsonpath="{.data.password}" | base64 -d; echo ""
</pre>

h2. Calico CNI

h3. Calico Installation

* We install "calico using helm":https://docs.projectcalico.org/getting-started/kubernetes/helm
* This has the following advantages:
** Easy to upgrade
** Does not require us to configure IPv6/dual stack settings, as the tigera operator figures things out on its own

Usually plain calico can be installed directly using:

<pre>
helm repo add projectcalico https://docs.projectcalico.org/charts
helm install --namespace tigera calico projectcalico/tigera-operator --version v3.23.2 --create-namespace
</pre>

* Check the tags on https://github.com/projectcalico/calico/tags for the latest release

h3. Installing calicoctl

* General installation instructions, including binary download: https://projectcalico.docs.tigera.io/maintenance/clis/calicoctl/install

To be able to manage and configure calico, we need to "install calicoctl (we choose to run it as a pod)":https://docs.projectcalico.org/getting-started/clis/calicoctl/install#install-calicoctl-as-a-kubernetes-pod

<pre>
kubectl apply -f https://docs.projectcalico.org/manifests/calicoctl.yaml
</pre>

Or version specific:

<pre>
kubectl apply -f https://github.com/projectcalico/calico/blob/v3.20.4/manifests/calicoctl.yaml

# For 3.22
kubectl apply -f https://projectcalico.docs.tigera.io/archive/v3.22/manifests/calicoctl.yaml
</pre>

And making it more easily accessible via an alias:

<pre>
alias calicoctl="kubectl exec -i -n kube-system calicoctl -- /calicoctl"
</pre>
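
With the alias in place, a quick sanity check could look like this (a sketch; assumes the calicoctl pod from above is running in kube-system):

<pre>
calicoctl get nodes
calicoctl get bgppeers
calicoctl get bgpconfiguration
</pre>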

h3. Calico configuration

By default our k8s clusters "BGP peer":https://docs.projectcalico.org/networking/bgp
with an upstream router to propagate podcidr and servicecidr.

Default settings in our infrastructure:

* We use a full-mesh using the @nodeToNodeMeshEnabled: true@ option
* We keep the original next hop so that *only* the server with the pod is announcing it (instead of ecmp)
* We use private ASNs for k8s clusters
* We do *not* use any overlay

After installing calico and calicoctl the last step of the installation is usually:

<pre>
calicoctl create -f - < calico-bgp.yaml
</pre>

A sample BGP configuration:

<pre>
---
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: true
  asNumber: 65534
  serviceClusterIPs:
  - cidr: 2a0a:e5c0:10:3::/108
  serviceExternalIPs:
  - cidr: 2a0a:e5c0:10:3::/108
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: router1-place10
spec:
  peerIP: 2a0a:e5c0:10:1::50
  asNumber: 213081
  keepOriginalNextHop: true
</pre>

h2. ArgoCD / ArgoWorkFlow

h3. Argocd Installation

* See https://argo-cd.readthedocs.io/en/stable/

As there is no configuration management present yet, argocd is installed using:

<pre>
kubectl create namespace argocd

# Specific Version
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/v2.3.2/manifests/install.yaml

# OR: latest stable
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
</pre>

h3. Get the argocd credentials

<pre>
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d; echo ""
</pre>

h3. Accessing argocd

In regular IPv6 clusters:

* Navigate to https://argocd-server.argocd.CLUSTERDOMAIN

In legacy IPv4 clusters:

<pre>
kubectl --namespace argocd port-forward svc/argocd-server 8080:80
</pre>

* Navigate to https://localhost:8080

h3. Using the argocd webhook to trigger changes

* To trigger changes, POST JSON to https://argocd.example.com/api/webhook
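
A minimal sketch of calling the webhook manually with curl, assuming a GitHub-style push payload (URL, branch and repository below are placeholders):

<pre>
curl -X POST https://argocd.example.com/api/webhook \
  -H "Content-Type: application/json" \
  -H "X-GitHub-Event: push" \
  -d '{"ref": "refs/heads/master", "repository": {"html_url": "https://code.ungleich.ch/ungleich-intern/k8s-config", "default_branch": "master"}}'
</pre>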

h3. Deploying an application

* Applications are deployed via git towards gitea (code.ungleich.ch) and then pulled by argo
* Always include the *redmine-url* pointing to the (customer) ticket
** Also add the support-url if it exists

Application sample

<pre>
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gitea-CUSTOMER
  namespace: argocd
spec:
  destination:
    namespace: default
    server: 'https://kubernetes.default.svc'
  source:
    path: apps/prod/gitea
    repoURL: 'https://code.ungleich.ch/ungleich-intern/k8s-config.git'
    targetRevision: HEAD
    helm:
      parameters:
        - name: storage.data.storageClass
          value: rook-ceph-block-hdd
        - name: storage.data.size
          value: 200Gi
        - name: storage.db.storageClass
          value: rook-ceph-block-ssd
        - name: storage.db.size
          value: 10Gi
        - name: storage.letsencrypt.storageClass
          value: rook-ceph-block-hdd
        - name: storage.letsencrypt.size
          value: 50Mi
        - name: letsencryptStaging
          value: 'no'
        - name: fqdn
          value: 'code.verua.online'
  project: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
  info:
    - name: 'redmine-url'
      value: 'https://redmine.ungleich.ch/issues/ISSUEID'
    - name: 'support-url'
      value: 'https://support.ungleich.ch/Ticket/Display.html?id=TICKETID'
</pre>

h2. Helm related operations and conventions

We use helm charts extensively.

* In production, they are managed via argocd
* In development, helm charts can be developed and deployed manually using the helm utility.

h3. Installing a helm chart

One can use the usual pattern of

<pre>
helm install <releasename> <chartdirectory>
</pre>

However, often you want to reinstall/update when testing helm charts. The following pattern is "better", because it allows you to reinstall even if the release is already installed:

<pre>
helm upgrade --install <releasename> <chartdirectory>
</pre>

h3. Naming services and deployments in helm charts [Application labels]

* We always have {{ .Release.Name }} to identify the current "instance"
* Deployments:
** use @app: <what it is>@, e.g. @app: nginx@, @app: postgres@, ...
* See more about standard labels on
** https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/
** https://helm.sh/docs/chart_best_practices/labels/
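
A sketch of how these conventions typically look in a deployment template (names and image are illustrative):

<pre>
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-nginx
  labels:
    app: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.21
</pre>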

h2. Rook / Ceph Related Operations

h3. Executing ceph commands

Using the ceph-tools pod as follows:

<pre>
kubectl exec -n rook-ceph -ti $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o jsonpath='{.items[*].metadata.name}') -- ceph -s
</pre>

h3. Inspecting the logs of a specific server

<pre>
# Get the related pods
kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare
...

# Inspect the logs of a specific pod
kubectl -n rook-ceph logs -f rook-ceph-osd-prepare-server23--1-444qx
</pre>

h3. Inspecting the logs of the rook-ceph-operator

<pre>
kubectl -n rook-ceph logs -f -l app=rook-ceph-operator
</pre>

h3. Triggering server prepare / adding new osds

The rook-ceph-operator triggers/watches/creates pods to maintain hosts. To trigger a full "re-scan", simply delete that pod:

<pre>
kubectl -n rook-ceph delete pods -l app=rook-ceph-operator
</pre>

This will cause all the @rook-ceph-osd-prepare-..@ jobs to be recreated and thus OSDs to be created, if new disks have been added.

h3. Removing an OSD

* See "Ceph OSD Management":https://rook.io/docs/rook/v1.7/ceph-osd-mgmt.html
* More specifically: https://github.com/rook/rook/blob/release-1.7/cluster/examples/kubernetes/ceph/osd-purge.yaml
* Then delete the related deployment

Set the OSD id in the osd-purge.yaml and apply it. The OSD should be down before.

<pre>
apiVersion: batch/v1
kind: Job
metadata:
  name: rook-ceph-purge-osd
  namespace: rook-ceph # namespace:cluster
  labels:
    app: rook-ceph-purge-osd
spec:
  template:
    metadata:
      labels:
        app: rook-ceph-purge-osd
    spec:
      serviceAccountName: rook-ceph-purge-osd
      containers:
        - name: osd-removal
          image: rook/ceph:master
          # TODO: Insert the OSD ID in the last parameter that is to be removed
          # The OSD IDs are a comma-separated list. For example: "0" or "0,2".
          # If you want to preserve the OSD PVCs, set `--preserve-pvc true`.
          #
          # A --force-osd-removal option is available if the OSD should be destroyed even though the
          # removal could lead to data loss.
          args:
            - "ceph"
            - "osd"
            - "remove"
            - "--preserve-pvc"
            - "false"
            - "--force-osd-removal"
            - "false"
            - "--osd-ids"
            - "SETTHEOSDIDHERE"
          env:
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: ROOK_MON_ENDPOINTS
              valueFrom:
                configMapKeyRef:
                  key: data
                  name: rook-ceph-mon-endpoints
            - name: ROOK_CEPH_USERNAME
              valueFrom:
                secretKeyRef:
                  key: ceph-username
                  name: rook-ceph-mon
            - name: ROOK_CEPH_SECRET
              valueFrom:
                secretKeyRef:
                  key: ceph-secret
                  name: rook-ceph-mon
            - name: ROOK_CONFIG_DIR
              value: /var/lib/rook
            - name: ROOK_CEPH_CONFIG_OVERRIDE
              value: /etc/rook/config/override.conf
            - name: ROOK_FSID
              valueFrom:
                secretKeyRef:
                  key: fsid
                  name: rook-ceph-mon
            - name: ROOK_LOG_LEVEL
              value: DEBUG
          volumeMounts:
            - mountPath: /etc/ceph
              name: ceph-conf-emptydir
            - mountPath: /var/lib/rook
              name: rook-config
      volumes:
        - emptyDir: {}
          name: ceph-conf-emptydir
        - emptyDir: {}
          name: rook-config
      restartPolicy: Never
</pre>

Deleting the deployment:

<pre>
[18:05] bridge:~% kubectl -n rook-ceph delete deployment rook-ceph-osd-6
deployment.apps "rook-ceph-osd-6" deleted
</pre>

h2. Harbor

* We use "Harbor":https://goharbor.io/ for caching and as an image registry. Internal app reference: apps/prod/harbor.
* The admin password is in the password store, auto generated per cluster
* At the moment harbor only authenticates against the internal ldap tree

h3. LDAP configuration

* The url needs to be ldaps://...
* uid = uid
* The rest is standard

h2. Monitoring / Prometheus

* Via "kube-prometheus":https://github.com/prometheus-operator/kube-prometheus/

Access via ...

* http://prometheus-k8s.monitoring.svc:9090
* http://grafana.monitoring.svc:3000
* http://alertmanager.monitoring.svc:9093
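
From outside the cluster network, a port-forward sketch can be used instead (service name and port taken from the list above):

<pre>
kubectl -n monitoring port-forward svc/prometheus-k8s 9090
</pre>

The same pattern applies to the grafana and alertmanager services.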

h3. Prometheus Options

* "helm/kube-prometheus-stack":https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
** Includes dashboards and co.
* "manifest based kube-prometheus":https://github.com/prometheus-operator/kube-prometheus
** Includes dashboards and co.
* "Prometheus Operator (mainly CRD manifests)":https://github.com/prometheus-operator/prometheus-operator

h2. Nextcloud

h3. How to get the nextcloud credentials

* The initial username is set to "nextcloud"
* The password is autogenerated and saved in a kubernetes secret

<pre>
kubectl get secret RELEASENAME-nextcloud -o jsonpath="{.data.PASSWORD}" | base64 -d; echo ""
</pre>

h3. How to fix "Access through untrusted domain"

* Nextcloud stores the initial domain configuration
* If the FQDN is changed, it will show the error message "Access through untrusted domain"
* To fix, edit /var/www/html/config/config.php and correct the domain
* Then delete the pods
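
For reference, the entry to adjust in config.php is the @trusted_domains@ array; a sketch with a placeholder FQDN:

<pre>
'trusted_domains' =>
  array (
    0 => 'nextcloud.example.com',
  ),
</pre>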

h2. Infrastructure versions

h3. ungleich kubernetes infrastructure v5 (2021-10)

Clusters are configured / set up in this order:

* Bootstrap via kubeadm
* "Networking via calico + BGP (non ECMP) using helm":https://docs.projectcalico.org/getting-started/kubernetes/helm
* "ArgoCD for CD":https://argo-cd.readthedocs.io/en/stable/
** "rook for storage via argocd":https://rook.io/
** haproxy as an in-cluster IPv4-to-IPv6 proxy for the IPv6-only clusters, via argocd
** "kubernetes-secret-generator for in-cluster secrets":https://github.com/mittwald/kubernetes-secret-generator
** "ungleich-certbot managing certs and nginx":https://hub.docker.com/repository/docker/ungleich/ungleich-certbot

h3. ungleich kubernetes infrastructure v4 (2021-09)

* rook is configured via manifests instead of using the rook-ceph-cluster helm chart
* The rook operator is still being installed via helm

h3. ungleich kubernetes infrastructure v3 (2021-07)

* rook is now installed via helm via argocd instead of directly via manifests

h3. ungleich kubernetes infrastructure v2 (2021-05)

* Replaced fluxv2 from ungleich k8s v1 with argocd
** argocd can apply helm templates directly without needing to go through Chart releases
* We are also using argoflow for build flows
* Planned to add "kaniko":https://github.com/GoogleContainerTools/kaniko for image building

h3. ungleich kubernetes infrastructure v1 (2021-01)

We are using the following components:

* "Calico as a CNI":https://www.projectcalico.org/ with BGP, IPv6 only, no encapsulation
** Needed for basic networking
* "kubernetes-secret-generator":https://github.com/mittwald/kubernetes-secret-generator for creating secrets
** Needed so that secrets are not stored in the git repository, but only in the cluster
* "ungleich-certbot":https://hub.docker.com/repository/docker/ungleich/ungleich-certbot
** Needed to get letsencrypt certificates for services
* "rook with ceph rbd + cephfs":https://rook.io/ for storage
** rbd for almost everything, *ReadWriteOnce*
** cephfs for smaller things, multi access *ReadWriteMany*
** Needed for providing persistent storage
* "flux v2":https://fluxcd.io/
** Needed to manage resources automatically