h1. The ungleich ceph handbook

{{toc}}

h2. Status

This document is **IN PRODUCTION**.

h2. Introduction

This article describes the ungleich storage architecture that is based on ceph. It covers our architecture as well as the maintenance commands required to operate it.

h2. Processes

h3. Usage monitoring

* Usage should be kept in the 70-75% range
* If usage reaches 72.5%, we start reducing usage by adding disks
* We stop when usage is below 70%
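
The current raw usage can be checked with @ceph df@. A minimal sketch for checking it against the thresholds above (the JSON field name may vary between ceph releases; jq is assumed to be installed):

<pre>
# print the global raw usage in percent
ceph df -f json | jq '.stats.total_used_raw_ratio * 100'
</pre>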

h3. Phasing in new disks

* 24h performance test prior to using it
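
One possible way to run such a test is fio; this is only a sketch, not our canonical procedure (replace /dev/sdX with the new disk; this destroys all data on it):

<pre>
fio --name=burnin --filename=/dev/sdX --rw=write --bs=4M \
    --ioengine=libaio --iodepth=16 --direct=1 \
    --time_based --runtime=86400
</pre>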

h3. Phasing in new servers

* 24h performance test with 1 ssd or 1 hdd (whatever is applicable)

h2. Communication guide

Usually when a disk fails, no customer communication is necessary, as it is automatically compensated/rebalanced by ceph. However, if multiple disk failures happen at the same time, I/O speed might be reduced and thus customer experience impacted.

For this reason, communicate whenever I/O recovery settings are temporarily tuned.

h2. Analysing

h3. ceph osd df tree

Using @ceph osd df tree@ you can see not only the disk usage per OSD, but also the number of PGs on an OSD. This is especially useful to see how well the OSDs are balanced.
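
Run it on any node with the admin keyring:

<pre>
ceph osd df tree
</pre>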

h3. Find out the device of an OSD

Use @mount | grep /var/lib/ceph/osd/ceph-OSDID@ on the server on which the OSD is located:

<pre>
[16:01:23] server2.place6:~# mount | grep /var/lib/ceph/osd/ceph-31
/dev/sdk1 on /var/lib/ceph/osd/ceph-31 type xfs (rw,relatime,attr2,inode64,noquota)
</pre>

h3. Show config

<pre>
ceph config dump
</pre>

h3. Show backfill and recovery config

<pre>
ceph-conf --show-config | egrep "osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills"
</pre>

* See also: https://www.suse.com/support/kb/doc/?id=000019693

h3. Checking and clearing crash reports

If the cluster is reporting HEALTH_WARN and a recent crash such as:

<pre>
[rook@rook-ceph-tools-f569797b4-z4542 /]$ ceph -s
  cluster:
    id:     ...
    health: HEALTH_WARN
            1 daemons have recently crashed
</pre>

one can analyse it using:

* List the crashes: @ceph crash ls@
* Check out the details: @ceph crash info <id>@

To archive the error:

* To archive a specific report: @ceph crash archive <id>@
* To archive all: @ceph crash archive-all@

After archiving, the cluster health should return to HEALTH_OK:

<pre>
[rook@rook-ceph-tools-f569797b4-z4542 /]$ ceph crash ls
ID                                                                ENTITY  NEW
2022-09-19T04:33:19.378500Z_b2e26755-0712-41de-bf2b-b370dbe94ebc  mon.c    *
[rook@rook-ceph-tools-f569797b4-z4542 /]$ ceph crash archive 2022-09-19T04:33:19.378500Z_b2e26755-0712-41de-bf2b-b370dbe94ebc
[rook@rook-ceph-tools-f569797b4-z4542 /]$ ceph -s
  cluster:
    id:     ..
    health: HEALTH_OK
</pre>

h3. Low monitor space warning

If you see

<pre>
[rook@rook-ceph-tools-6bdf996-8g792 /]$ ceph health detail
HEALTH_WARN mon q is low on available space
[WRN] MON_DISK_LOW: mon q is low on available space
    mon.q has 29% avail
</pre>

there are two options to fix it:

* a) free up space
* b) adjust the warning threshold specified in @mon_data_avail_warn@
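
For option b, something like the following should work (the default threshold is 30%; 20 is just an example value):

<pre>
ceph config set mon mon_data_avail_warn 20
</pre>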

h2. Adding a new disk/ssd to the ceph cluster

Write the purchase order / date of purchase on the disk with a permanent marker.

h3. Checking the shadow trees

To be able to spot differences in the weights of hosts, it can be very helpful to look at the crush shadow tree
using @ceph osd crush tree --show-shadow@:

<pre>
-16   hdd-big 653.03418           root default~hdd-big
-34   hdd-big         0         0     host server14~hdd-big
-38   hdd-big         0         0     host server15~hdd-big
-42   hdd-big  81.86153  78.28352     host server17~hdd-big
 36   hdd-big   9.09560   9.09560         osd.36
 59   hdd-big   9.09499   9.09499         osd.59
 60   hdd-big   9.09499   9.09499         osd.60
 68   hdd-big   9.09599   8.93999         osd.68
 69   hdd-big   9.09599   7.65999         osd.69
 70   hdd-big   9.09599   8.35899         osd.70
 71   hdd-big   9.09599   8.56000         osd.71
 72   hdd-big   9.09599   8.93700         osd.72
 73   hdd-big   9.09599   8.54199         osd.73
-46   hdd-big  90.94986  90.94986     host server18~hdd-big
...
</pre>

Here we can see that the weight of server17 for the class hdd-big is about 81, while that of server18 is about 90.
SSDs and other classes have their own shadow trees, too.

h3. For Dell servers

First find the disk, then add it to the operating system:

<pre>
megacli -PDList -aALL  | grep -B16 -i unconfigur

# Sample output:
[19:46:50] server7.place6:~#  megacli -PDList -aALL  | grep -B16 -i unconfigur
Enclosure Device ID: N/A
Slot Number: 0
Enclosure position: N/A
Device Id: 0
WWN: 0000000000000000
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 894.252 GB [0x6fc81ab0 Sectors]
Non Coerced Size: 893.752 GB [0x6fb81ab0 Sectors]
Coerced Size: 893.75 GB [0x6fb80000 Sectors]
Sector Size:  0
Firmware state: Unconfigured(good), Spun Up
</pre>

Then add the disk to the OS:

<pre>
megacli -CfgLdAdd -r0 [Enclosure Device ID:slot] -aX  # X: adapter number (host = 0, md-array = 1)

# Sample call, if enclosure and slot are KNOWN (aka not N/A)
megacli -CfgLdAdd -r0 [32:0] -a0

# Sample call, if enclosure is N/A
megacli -CfgLdAdd -r0 [:0] -a0
</pre>

Then check the disk:

<pre>
fdisk -l
[11:26:23] server2.place6:~# fdisk -l
......
Disk /dev/sdh: 7.3 TiB, 8000987201536 bytes, 15626928128 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
[11:27:24] server2.place6:~#
</pre>

Then create a fresh GPT:

<pre>
/opt/ungleich-tools/disk-create-fresh-gpt /dev/XXX
[11:31:10] server2.place6:~# /opt/ungleich-tools/disk-create-fresh-gpt /dev/sdh
......
Created a new DOS disklabel with disk identifier 0x9c4a0355.
Command (m for help): Created a new GPT disklabel (GUID: 374E31AD-7B96-4837-B5ED-7B22C452899E).
......
</pre>

Then create the osd with the appropriate class (ssd or hdd-big):

<pre>
/opt/ungleich-tools/ceph-osd-create-start /dev/XXX CLASS   # CLASS: ssd or hdd-big
[11:33:58] server2.place6:~# /opt/ungleich-tools/ceph-osd-create-start /dev/sdh hdd-big
+ set -e
+ [ 2 -lt 2 ]
......
+ /opt/ungleich-tools/monit-ceph-create-start osd.14
osd.14
[ ok ] Restarting daemon monitor: monit.
[11:36:14] server2.place6:~#
</pre>

Then check the rebalancing status (if you want to add another disk, do so only after rebalancing has finished):

<pre>
ceph -s
[12:37:57] server2.place6:~# ceph -s
  cluster:
    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
    health: HEALTH_WARN
            2248811/49628409 objects misplaced (4.531%)
......
  io:
    client:   170KiB/s rd, 35.0MiB/s wr, 463op/s rd, 728op/s wr
    recovery: 27.1MiB/s, 6objects/s
[12:49:41] server2.place6:~#
</pre>

h3. For HP servers (hpacucli)

* Ensure the kernel module "sg" has been loaded

Use the following to verify that the controller is detected:

<pre>
# hpacucli controller all show

Smart Array P420i in Slot 0 (Embedded)    (sn: 001438033ECEF60)
</pre>

h4. Show all disks from the controller on slot 0

<pre>
hpacucli controller slot=0 physicaldrive all show
</pre>

Example:

<pre>
# hpacucli controller slot=0 physicaldrive all show

Smart Array P420i in Slot 0 (Embedded)

   array A

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 12000.1 GB, OK)

   array B

      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 12000.1 GB, OK)

   unassigned

      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA, 10000.8 GB, OK)

root@ungleich-hardware-server97:/#
</pre>

In this example the last disk has not been assigned yet.

h4. Create RAID 0 for ceph

For ceph we want a RAID 0 over a single disk to expose the disk to the OS.

This can be done using the following command:

<pre>
hpacucli controller slot=0 create type=ld drives=$DRIVEID raid=0
</pre>

For example:

<pre>
hpacucli controller slot=0 create type=ld drives=1I:1:3 raid=0
</pre>

h4. Show the controller configuration

<pre>
hpacucli controller slot=0 show config
</pre>

For example:

<pre>
# hpacucli controller slot=0 show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 001438033ECEF60)

   array A (SATA, Unused Space: 0  MB)

      logicaldrive 1 (10.9 TB, RAID 0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 12000.1 GB, OK)

   array B (SATA, Unused Space: 0  MB)

      logicaldrive 2 (10.9 TB, RAID 0, OK)

      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 12000.1 GB, OK)

   array C (SATA, Unused Space: 0  MB)

      logicaldrive 3 (9.1 TB, RAID 0, OK)

      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA, 10000.8 GB, OK)

   Expander 380 (WWID: 50014380324EBFE0, Port: 1I, Box: 1)

   Enclosure SEP (Vendor ID HP, Model Gen8 ServBP 12+2) 378 (WWID: 50014380324EBFF9, Port: 1I, Box: 1)

   SEP (Vendor ID PMCSIERA, Model SRCv8x6G) 379 (WWID: 5001438033ECEF6F)
</pre>

h3. Removing signatures that prevent a disk from being used by ceph

If you see

<pre>
cephosd: skipping device "sdX" because it contains a filesystem "ddf_raid_member"
</pre>

you can clean it with wipefs:

<pre>
[20:47] server98.place10:~# wipefs /dev/sde
DEVICE OFFSET        TYPE            UUID         LABEL
sde    0xae9fffffe00 ddf_raid_member Dell    \x10
[20:48] server98.place10:~# wipefs -a /dev/sde
/dev/sde: 4 bytes were erased at offset 0xae9fffffe00 (ddf_raid_member): de 11 de 11
[20:48] server98.place10:~#
</pre>

h2. Moving a disk/ssd to another server

(needs to be described better)

Generally speaking:

* //needs to be tested: disable recovery so data won't start moving while the osd is down
* /opt/ungleich-tools/ceph-osd-stop-disable does the following:
** Stop the osd and remove monit on the server you want to take it out of
** umount the disk
* Take the disk out
* Discard the preserved cache on the server you took it out of
** using megacli: @megacli -DiscardPreservedCache -Lall -aAll@
* Insert it into the new server
* Clear the foreign configuration
** using megacli: @megacli -CfgForeign -Clear -aAll@
* The disk will now appear in the OS; ceph/udev will automatically start the OSD (!)
** No creating of the osd required!
* Verify that the disk exists and that the osd is started
** using *ps aux*
** using *ceph osd tree*
* */opt/ungleich-tools/monit-ceph-create-start osd.XX* # where osd.XX is the osd + number
** Creates the monit configuration file so that monit watches the OSD
** Reload monit
* Verify monit using *monit status*

<pre>
megacli -DiscardPreservedCache -Lall -aAll
megacli -CfgForeign -Clear -aAll
</pre>

h2. OSD related processes

h3. Removing a disk/ssd

To permanently remove a failed disk from a cluster, use ***ceph-osd-stop-remove-permanently*** from the ungleich-tools repo. Warning: if the disk is still active, the OSD will be shut down AND removed from the cluster -> all data of that disk will need to be rebalanced.

h3. Handling DOWN osds with filesystem errors

If an email arrives with the subject "monit alert -- Does not exist osd.XX-whoami", the filesystem of an OSD cannot be read anymore. It is very likely that the disk / ssd is broken. Steps that need to be done:

* Log in to any ceph monitor (cephX.placeY.ungleich.ch)
* Check **ceph -s**, find the host using **ceph osd tree**
* Log in to the affected host
* Run the following commands:
** ls /var/lib/ceph/osd/ceph-XX
** dmesg
<pre>
# Example: dmesg output like the following confirms a broken filesystem,
# after which you can proceed with the next step
[204696.406756] XFS (sdl1): metadata I/O error: block 0x19100 ("xlog_iodone") error 5 numblks 64
[204696.408094] XFS (sdl1): xfs_do_force_shutdown(0x2) called from line 1233 of file /build/linux-BsFdsw/linux-4.9.65/fs/xfs/xfs_log.c.  Return address = 0xffffffffc08eb612
[204696.410702] XFS (sdl1): Log I/O Error Detected.  Shutting down filesystem
[204696.411977] XFS (sdl1): Please umount the filesystem and rectify the problem(
</pre>

* Create a new ticket in the datacenter light project
** Subject: "Replace broken OSD.XX on serverX.placeY.ungleich.ch"
** Add (partial) output of the above commands
** Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently XX, where XX is the osd id, to remove the disk from the cluster
** Remove the physical disk from the host and check whether it is still under warranty; if yes:
*** Write a short letter to the vendor, including the technical details from above
*** Record when you sent it in
*** Put the ticket into status waiting
** If there is no warranty, dispose of the disk

h3. [[Create new pool and place new osd]]

h3. Configuring auto repair on pgs

<pre>
ceph config set osd osd_scrub_auto_repair true
</pre>

Verify using:

<pre>
ceph config dump
</pre>

h3. Change the device class of an OSD

<pre>
OSD=XX
NEWCLASS=ZZ

# Remove the old device class, then set the new one
ceph osd crush rm-device-class osd.$OSD
ceph osd crush set-device-class $NEWCLASS osd.$OSD
</pre>

* Found on https://arpnetworks.com/blog/2019/06/28/how-to-update-the-device-class-on-a-ceph-osd.html

h2. Managing ceph daemon crashes

If there is a warning about crashed daemons, they can be displayed and archived as follows:

* @ceph crash ls@
* @ceph crash info <id>@
* @ceph crash archive <id>@
* @ceph crash archive-all@

Summary originally found on https://forum.proxmox.com/threads/health_warn-1-daemons-have-recently-crashed.63105/

h2. Change ceph speed for i/o recovery

By default we want to keep I/O recovery traffic low to not impact customer experience. However, when multiple disks fail at the same time, we might want to prioritise recovery for data safety over performance.

The default configuration on our servers contains:

<pre>
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

The important settings are *osd max backfills* and *osd recovery max active*; the priority is always kept low so that regular I/O has priority.

To adjust the number of backfills *per osd* and to change the *number of threads* used for recovery, we can use the following on any node with the admin keyring:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills Y'
ceph tell osd.* injectargs '--osd-recovery-max-active X'
</pre>

where Y and X are the values that we want to use. Experience shows that Y=5 and X=5 doubles to triples the recovery performance, whereas X=10 and Y=10 increases recovery performance about 5 times.

This can also be combined in one command:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills Y' '--osd-recovery-max-active X'

# f.i.: reset to 1
ceph tell osd.* injectargs '--osd-max-backfills 1' '--osd-recovery-max-active 1'

# f.i.: set to 4
ceph tell osd.* injectargs '--osd-max-backfills 4' '--osd-recovery-max-active 4'
</pre>

h2. Debug scrub errors / inconsistent pg message

From time to time disks don't save what they are told to save. Ceph scrubbing detects these errors and switches the cluster to HEALTH_ERR. Use *ceph health detail* to find out which placement groups (*pgs*) are affected. Usually a ***ceph pg repair <number>*** fixes the problem.
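
A typical session might look like this (the pg id is hypothetical):

<pre>
ceph health detail
# ... reports, e.g.: pg 2.6 is active+clean+inconsistent, acting [0,1,2]
ceph pg repair 2.6
</pre>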

If this does not help, consult https://ceph.com/geen-categorie/ceph-manually-repair-object/.

h2. Move servers into the osd tree

New servers have their buckets placed outside the **default root** and thus need to be moved inside.
Output might look as follows:

<pre>
[11:19:27] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF
 -3           0.87270 host server5
 41     ssd   0.87270     osd.41           up  1.00000 1.00000
 -1         251.85580 root default
 -7          81.56271     host server2
  0 hdd-big   9.09511         osd.0        up  1.00000 1.00000
  5 hdd-big   9.09511         osd.5        up  1.00000 1.00000
...
</pre>

Use **ceph osd crush move serverX root=default** (where serverX is the new server),
which will move the bucket into the right place:

<pre>
[11:21:17] server5.place6:~# ceph osd crush move server5 root=default
moved item id -3 name 'server5' to location {root=default} in crush map
[11:32:12] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF
 -1         252.72850 root default
...
 -3           0.87270     host server5
 41     ssd   0.87270         osd.41       up  1.00000 1.00000
</pre>

h2. How to fix existing osds with wrong partition layout

In the first version of DCL we used a filestore/3-partition based layout.
In the second version of DCL, which includes OSD autodetection, we use a bluestore/2-partition based layout.

To convert, we delete the old OSD, clean the partitions and create a new osd:

h3. Inactive OSD

If the OSD is *not active*, we can do the following:

* Find the OSD number: mount the partition and find the whoami file

<pre>
root@server2:/opt/ungleich-tools# mount /dev/sda2 /mnt/
root@server2:/opt/ungleich-tools# cat /mnt/whoami 
0
root@server2:/opt/ungleich-tools# umount  /mnt/
</pre>

* Verify in the *ceph osd tree* output that the OSD is on that server
* Delete the OSD:
** ceph osd crush remove $osd_name
** ceph osd rm $osd_name
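
A concrete sketch, assuming the whoami output above identified osd.0:

<pre>
ceph osd crush remove osd.0
ceph osd rm osd.0
</pre>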

Then continue below as described in "Recreating the OSD".

h3. Remove Active OSD

* Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently OSDID to stop and remove the OSD
* Then continue below as described in "Recreating the OSD".

h3. Recreating the OSD

* Create an empty partition table
** fdisk /dev/sdX
** g
** w
* Create a new OSD
** /opt/ungleich-tools/ceph-osd-create-start /dev/sdX CLASS # use hdd, ssd, ... for the CLASS

h2. How to fix unfound pg

Refer to https://redmine.ungleich.ch/issues/6388

* Check the health state
** ceph health detail
* Check which server has that osd
** ceph osd tree
* Check which VM is running on that server
** virsh list
* Check the pg map
** ceph osd map [osd pool] [VMID]
* Revert the pg
** ceph pg [PGID] mark_unfound_lost revert
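
A hypothetical walk-through, assuming the pool is called @hdd@, the VM disk image is @one-1234@ and pg @2.4@ is the one reported as unfound:

<pre>
ceph osd map hdd one-1234
ceph pg 2.4 mark_unfound_lost revert
</pre>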

h2. Phasing out OSDs

* Either directly via /opt/ungleich-tools/ceph/ceph-osd-stop-remove-permanently
* Or first drain it using @ceph osd crush reweight osd.XX 0@
** Wait until the rebalance is done
** Then remove it (see the sketch below)
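
A sketch of the draining variant, assuming osd.14 is being phased out:

<pre>
ceph osd crush reweight osd.14 0

# wait until the rebalance is done
ceph -s

# then remove it permanently
/opt/ungleich-tools/ceph/ceph-osd-stop-remove-permanently 14
</pre>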

h2. Enabling per image RBD statistics for prometheus

<pre>
[20:26:57] red2.place5:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "one,hdd"
[20:27:57] black2.place6:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "hdd,ssd"
</pre>

h2. S3 Object Storage

This section is **UNDER CONSTRUCTION**

h3. Introduction

* See the "Red Hat manual":https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/object_gateway_guide_for_red_hat_enterprise_linux/overview-rgw
* The "ceph docs about object storage":https://docs.ceph.com/docs/mimic/radosgw/

h3. Architecture

* S3 requests are handled by a publicly accessible gateway, which also has access to the ceph cluster.
* s3 buckets are usually

h3. Authentication / Users

* Ceph *can* make use of LDAP as a backend
** However it uses the clear text username+password as a token
** See https://docs.ceph.com/docs/mimic/radosgw/ldap-auth/
* We do not want users to store their regular account credentials on machines
* For this reason we use independent users / tokens, but with the same username as in LDAP

Creating a user:

<pre>
radosgw-admin user create --uid=USERNAME --display-name="Name of user"
</pre>

Listing users:

<pre>
radosgw-admin user list
</pre>

Deleting users and their storage:

<pre>
radosgw-admin user rm --uid=USERNAME --purge-data
</pre>

h3. Setting up S3 object storage on Ceph

* Set up a gateway node with Alpine Linux (see the repository sketch below)
** Change the repositories to edge
** Enable testing
* Update the firewall to allow access from this node to the ceph monitors
* Set up the wildcard DNS certificate (see the next section)

<pre>
apk add ceph-radosgw
</pre>
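
A sketch of the repository change on the Alpine node (the mirror URL is an example, adjust as needed):

<pre>
# point apk at edge and enable the testing repository
cat > /etc/apk/repositories <<EOF
https://dl-cdn.alpinelinux.org/alpine/edge/main
https://dl-cdn.alpinelinux.org/alpine/edge/community
https://dl-cdn.alpinelinux.org/alpine/edge/testing
EOF
apk update
apk add ceph-radosgw
</pre>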

h3. Wildcard DNS certificate from letsencrypt

Acquiring and renewing this certificate is currently a manual process, as it requires changing DNS settings.

* run certbot
* update DNS with the first token
* update DNS with the second token

Sample session:

<pre>
s3:/etc/ceph# certbot certonly --manual --preferred-challenges=dns --email sre@ungleich.ch --server https://acme-v02.api.letsencrypt.org/directory --agree-tos 
-d *.s3.ungleich.ch -d s3.ungleich.ch
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Cert is due for renewal, auto-renewing...
Renewing an existing certificate
Performing the following challenges:
dns-01 challenge for s3.ungleich.ch
dns-01 challenge for s3.ungleich.ch

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
NOTE: The IP of this machine will be publicly logged as having requested this
certificate. If you're running certbot in manual mode on a machine that is not
your server, please ensure you're okay with that.

Are you OK with your IP being logged?
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(Y)es/(N)o: y

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.s3.ungleich.ch with the following value:

KxGLZNiVjFwz1ifNheoR_KQoPVpkvRUV1oT2pOvJlU0

Before continuing, verify the record is deployed.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.s3.ungleich.ch with the following value:

bkrhtxWZUipCAL5cBfvrjDuftqsZdQ2JjisiKmXBbaI

Before continuing, verify the record is deployed.
(This must be set up in addition to the previous challenges; do not remove,
replace, or undo the previous challenge tasks yet. Note that you might be
asked to create multiple distinct TXT records with the same name. This is
permitted by DNS standards.)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue
Waiting for verification...
Cleaning up challenges

IMPORTANT NOTES:
 - Congratulations! Your certificate and chain have been saved at:
   /etc/letsencrypt/live/s3.ungleich.ch/fullchain.pem
   Your key file has been saved at:
   /etc/letsencrypt/live/s3.ungleich.ch/privkey.pem
   Your cert will expire on 2020-12-09. To obtain a new or tweaked
   version of this certificate in the future, simply run certbot
   again. To non-interactively renew *all* of your certificates, run
   "certbot renew"
 - If you like Certbot, please consider supporting our work by:

   Donating to ISRG / Let's Encrypt:   https://letsencrypt.org/donate
   Donating to EFF:                    https://eff.org/donate-le

</pre>

h2. Debugging ceph

<pre>
    ceph status
    ceph osd status
    ceph osd df
    ceph osd utilization
    ceph osd pool stats
    ceph osd tree
    ceph pg stat
</pre>

h3. How to list the version overview

This lists the versions of osds, mgrs and mons:

<pre>
ceph versions
</pre>

Listing the "features" of clients, osds, mgrs and mons can be done using @ceph features@:

<pre>
[15:32:20] red1.place5:~# ceph features
{
    "mon": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 5
        }
    ],
    "osd": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 44
        }
    ],
    "client": [
        {
            "features": "0x3ffddff8eea4fffb",
            "release": "luminous",
            "num": 4
        },
        {
            "features": "0x3ffddff8ffacffff",
            "release": "luminous",
            "num": 18
        },
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 31
        }
    ],
    "mgr": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 4
        }
    ]
}
</pre>

h3. How to list the version of every OSD and every monitor

To list the version of each ceph OSD:

<pre>
ceph tell osd.* version
</pre>

To list the version of each ceph mon:

<pre>
ceph tell mon.* version
</pre>

The mgrs do not seem to support this command as of 14.2.21.

h2. Performance Tuning

* Ensure that the basic options for reducing rebalancing workload are set:

<pre>
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

* Ensure that "osd_op_queue_cut_off":https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_op_queue_cut_off is set to **high**
** Requires OSD restart on change

<pre>
ceph config set global osd_op_queue_cut_off high
</pre>

<pre>
be sure to check your osd recovery sleep settings, there are several
depending on your underlying drives:

    "osd_recovery_sleep": "0.000000",
    "osd_recovery_sleep_hdd": "0.050000",
    "osd_recovery_sleep_hybrid": "0.050000",
    "osd_recovery_sleep_ssd": "0.050000",

Adjusting these upwards will dramatically reduce IO, and take effect
immediately at the cost of slowing rebalance/recovery.
</pre>

Reference settings from Frank Schilder:

<pre>
  osd       class:hdd      advanced osd_recovery_sleep                0.050000
  osd       class:rbd_data advanced osd_recovery_sleep                0.025000
  osd       class:rbd_meta advanced osd_recovery_sleep                0.002500
  osd       class:ssd      advanced osd_recovery_sleep                0.002500
  osd                      advanced osd_recovery_sleep                0.050000

  osd       class:hdd      advanced osd_max_backfills                 3
  osd       class:rbd_data advanced osd_max_backfills                 6
  osd       class:rbd_meta advanced osd_max_backfills                 12
  osd       class:ssd      advanced osd_max_backfills                 12
  osd                      advanced osd_max_backfills                 3

  osd       class:hdd      advanced osd_recovery_max_active           8
  osd       class:rbd_data advanced osd_recovery_max_active           16
  osd       class:rbd_meta advanced osd_recovery_max_active           32
  osd       class:ssd      advanced osd_recovery_max_active           32
  osd                      advanced osd_recovery_max_active           8
</pre>

(these have not yet been tested in our clusters)
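
Such per-class values can be applied with @ceph config set@ using a device-class mask; a sketch with values from the table above (the classes must exist in your cluster):

<pre>
ceph config set osd/class:hdd osd_recovery_sleep 0.05
ceph config set osd/class:ssd osd_recovery_sleep 0.0025
ceph config set osd/class:hdd osd_max_backfills 3
ceph config set osd/class:ssd osd_max_backfills 12
</pre>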

h2. Ceph theory

h3. How much data per Server?

Q: How much data should we add into one server?
A: Not more than it can handle.

How much data can a server handle? For this let's have a look at 2 scenarios:

* How long does it take to compensate the loss of the server?

* Assume a server has X TiB of storage in Y attached disks and a network speed of Z GiB/s.
* And our estimated rebuild goal is to compensate the loss of a server within U hours.

h4. Approach 1

Let's take an example:

* A server with @10 disks * 10 TiB@ = 100 TiB = 100 000 GiB of data. It is network connected with 10 Gbit = 1.25 GiB/s.
* 100000 GiB / 1.25 GiB/s = 80000 s = 22.22 h

However, our logic assumes that we actually rebuild from the failed server, which... is failed.

h4. Approach 2: calculating with the remaining servers

However, we can also apply our logic to distribute
the rebuild over several servers that now pull in data from each other for rebuilding.
We need to *read* the data (100 TiB) from other servers and distribute it to new OSDs, assuming each server has a 10 Gbit/s
network connection.

Now the servers might need to *read* (get data from other osds) and *write* (send data to other osds). Luckily, networking is 10 Gbit/s duplex - i.e. in both directions.

However, how fast can we actually read data from the disks?

* SSDs are in the range of hundreds of MB/s (best case, not necessarily true for random reads)
* HDDs are in the range of tens of MB/s (depending on the workload, but 30-40 MB/s for random reads seems realistic)

Further assumptions:

* Each disk should be dedicated at least one CPU core.

h3. Disk/SSD speeds

* Tuning for #8473 showed that a 10TB HDD can write up to 180-200MB/s when backfilling (at about 70% cpu usage and 20% disk usage), max backfills = 8
* Debugging SSD usage in #8461 showed SSDs can read about 470-520MB/s sequentially
* Debugging SSD usage in #8461 showed SSDs can write about 170-280MB/s sequentially
* Debugging SSD usage in #8461 showed SSDs can write only about 4MB/s RANDOM (need to verify this even though 3 runs showed these numbers)

h3. Ceph theoretical foundations

If you are interested in the theoretical foundations of Ceph, check out their "paper":https://www3.nd.edu/~dthain/courses/cse40771/spring2007/papers/ceph.pdf