h1. The ungleich ceph handbook

{{toc}}

h2. Status

This document is **IN PRODUCTION**.

h2. Introduction

This article describes the ungleich storage architecture, which is based on ceph, as well as the maintenance commands required to operate it.

h2. Processes

h3. Usage monitoring

* Usage should be kept in the 70-75% range (see the example check below)
* If usage reaches 72.5%, we start reducing usage by adding disks
* We stop when usage is below 70%
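
A quick way to check the current usage is @ceph df@ for the cluster and pool totals and @ceph osd df tree@ for the per-OSD view (a minimal check; see also the "Analysing" chapter below):

<pre>
# overall and per-pool usage
ceph df

# per-OSD usage and PG distribution
ceph osd df tree
</pre>
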
h3. Phasing in new disks

* 24h performance test prior to using it (one possible test is sketched below)
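
A sketch of such a test using @fio@ (this is an illustration, not a fixed procedure; adjust device, block size and runtime, and note that it **overwrites the whole disk**):

<pre>
# 24h sequential write burn-in - THIS DESTROYS ALL DATA ON /dev/sdX
fio --name=burnin --filename=/dev/sdX --rw=write --bs=4M \
    --ioengine=libaio --iodepth=16 --direct=1 --time_based --runtime=86400
</pre>
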
h3. Phasing in new servers

* 24h performance test with 1 ssd or 1 hdd (whatever is applicable)

h2. Communication guide

Usually when a disk fails, no customer communication is necessary, as the failure is automatically compensated/rebalanced by ceph. However, if multiple disk failures happen at the same time, I/O speed might be reduced and thus customer experience impacted.

For this reason, communicate whenever I/O recovery settings are temporarily tuned.

h2. Analysing

h3. ceph osd df tree

Using @ceph osd df tree@ you can see not only the disk usage per OSD, but also the number of PGs on an OSD. This is especially useful to see how the OSDs are balanced.

h3. Find out the device of an OSD

Use @mount | grep /var/lib/ceph/osd/ceph-OSDID@ on the server on which the OSD is located:

<pre>
[16:01:23] server2.place6:~# mount | grep /var/lib/ceph/osd/ceph-31
/dev/sdk1 on /var/lib/ceph/osd/ceph-31 type xfs (rw,relatime,attr2,inode64,noquota)
</pre>
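
For bluestore OSDs created with @ceph-volume@ there may be no separate data partition to grep for; in that case @ceph-volume lvm list@ on the OSD host shows which device backs which OSD (an alternative check, not part of the original workflow above):

<pre>
ceph-volume lvm list
</pre>
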
h3. Show config

<pre>
ceph config dump
</pre>

h3. Show backfill and recovery config

<pre>
ceph-conf --show-config | egrep "osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills"
</pre>

* See also: https://www.suse.com/support/kb/doc/?id=000019693
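
Note that @ceph-conf --show-config@ reads the local configuration and defaults and typically does not reflect values stored centrally via @ceph config set@; on newer releases those can be queried directly (a minimal sketch):

<pre>
ceph config get osd osd_max_backfills
ceph config get osd osd_recovery_max_active
ceph config get osd osd_recovery_op_priority
</pre>
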
h3. Checking and clearing crash reports

If the cluster reports HEALTH_WARN and a recent crash, such as:

<pre>
[rook@rook-ceph-tools-f569797b4-z4542 /]$ ceph -s
  cluster:
    id:     ...
    health: HEALTH_WARN
            1 daemons have recently crashed
</pre>

it can be analysed using:

* List the crashes: @ceph crash ls@
* Check out the details: @ceph crash info <id>@

To archive the error:

* To archive a specific report: @ceph crash archive <id>@
* To archive all: @ceph crash archive-all@

After archiving, the cluster health should return to HEALTH_OK:

<pre>
[rook@rook-ceph-tools-f569797b4-z4542 /]$  ceph crash ls
ID                                                                ENTITY  NEW  
2022-09-19T04:33:19.378500Z_b2e26755-0712-41de-bf2b-b370dbe94ebc  mon.c    *   
[rook@rook-ceph-tools-f569797b4-z4542 /]$ ceph crash archive 2022-09-19T04:33:19.378500Z_b2e26755-0712-41de-bf2b-b370dbe94ebc 
[rook@rook-ceph-tools-f569797b4-z4542 /]$ ceph -s
  cluster:
    id:     ..
    health: HEALTH_OK

</pre>

h3. Low monitor space warning

If you see

<pre>
[rook@rook-ceph-tools-6bdf996-8g792 /]$ ceph health detail
HEALTH_WARN mon q is low on available space
[WRN] MON_DISK_LOW: mon q is low on available space
    mon.q has 29% avail
</pre>

there are two options to fix it:

* a) free up space on the monitor's disk
* b) adjust the threshold configured via @mon_data_avail_warn@ (see the example below)
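
For example, to only warn once a monitor has less than 25% free space (the default threshold is 30%; pick a value that matches your monitoring policy):

<pre>
ceph config set mon mon_data_avail_warn 25
</pre>
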
h2. Adding a new disk/ssd to the ceph cluster

Write on the disk with a permanent marker in which order / on which date we bought it.

h3. Checking the shadow trees

To be able to spot differences in the weights of hosts, it can be very helpful to look at the crush shadow tree using @ceph osd crush tree --show-shadow@:

<pre>
-16   hdd-big 653.03418           root default~hdd-big        
-34   hdd-big         0         0     host server14~hdd-big   
-38   hdd-big         0         0     host server15~hdd-big   
-42   hdd-big  81.86153  78.28352     host server17~hdd-big   
 36   hdd-big   9.09560   9.09560         osd.36              
 59   hdd-big   9.09499   9.09499         osd.59              
 60   hdd-big   9.09499   9.09499         osd.60              
 68   hdd-big   9.09599   8.93999         osd.68              
 69   hdd-big   9.09599   7.65999         osd.69              
 70   hdd-big   9.09599   8.35899         osd.70              
 71   hdd-big   9.09599   8.56000         osd.71              
 72   hdd-big   9.09599   8.93700         osd.72              
 73   hdd-big   9.09599   8.54199         osd.73              
-46   hdd-big  90.94986  90.94986     host server18~hdd-big   
...
</pre>

Here we can see that the weight of server17 for the class hdd-big is about 81, while that of server18 is about 90.
SSDs and other classes have their own shadow trees, too.

h3. For Dell servers

First find the disk, then add it to the operating system:

<pre>
megacli -PDList -aALL  | grep -B16 -i unconfigur

# Sample output:
[19:46:50] server7.place6:~#  megacli -PDList -aALL  | grep -B16 -i unconfigur
Enclosure Device ID: N/A
Slot Number: 0
Enclosure position: N/A
Device Id: 0
WWN: 0000000000000000
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 894.252 GB [0x6fc81ab0 Sectors]
Non Coerced Size: 893.752 GB [0x6fb81ab0 Sectors]
Coerced Size: 893.75 GB [0x6fb80000 Sectors]
Sector Size:  0
Firmware state: Unconfigured(good), Spun Up
</pre>

Then add the disk to the OS:

<pre>
megacli -CfgLdAdd -r0 [Enclosure Device ID:slot] -aX (X : host is 0. md-array is 1)

# Sample call, if enclosure and slot are KNOWN (aka not N/A)
megacli -CfgLdAdd -r0 [32:0] -a0

# Sample call, if enclosure is N/A
megacli -CfgLdAdd -r0 [:0] -a0
</pre>

Then check the disk:

<pre>
fdisk -l
[11:26:23] server2.place6:~# fdisk -l
......
Disk /dev/sdh: 7.3 TiB, 8000987201536 bytes, 15626928128 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
[11:27:24] server2.place6:~#
</pre>

Then create a fresh GPT partition table:

<pre>
/opt/ungleich-tools/disk-create-fresh-gpt /dev/XXX
[11:31:10] server2.place6:~# /opt/ungleich-tools/disk-create-fresh-gpt /dev/sdh
......
Created a new DOS disklabel with disk identifier 0x9c4a0355.
Command (m for help): Created a new GPT disklabel (GUID: 374E31AD-7B96-4837-B5ED-7B22C452899E).
......
</pre>

Then create the OSD for class ssd or hdd-big:

<pre>
/opt/ungleich-tools/ceph-osd-create-start /dev/XXX CLASS   # CLASS: ssd or hdd-big
[11:33:58] server2.place6:~# /opt/ungleich-tools/ceph-osd-create-start /dev/sdh hdd-big
+ set -e
+ [ 2 -lt 2 ]
......
+ /opt/ungleich-tools/monit-ceph-create-start osd.14
osd.14
[ ok ] Restarting daemon monitor: monit.
[11:36:14] server2.place6:~#
</pre>

Then check rebalancing (if you want to add another disk, do so only after rebalancing has finished):

<pre>
ceph -s
[12:37:57] server2.place6:~# ceph -s
  cluster:
    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
    health: HEALTH_WARN
            2248811/49628409 objects misplaced (4.531%)
......
  io:
    client:   170KiB/s rd, 35.0MiB/s wr, 463op/s rd, 728op/s wr
    recovery: 27.1MiB/s, 6objects/s
[12:49:41] server2.place6:~#
</pre>

h3. For HP servers (hpacucli)

* Ensure the kernel module "sg" has been loaded

Use the following to verify that the controller is detected:

<pre>
# hpacucli controller all show

Smart Array P420i in Slot 0 (Embedded)    (sn: 001438033ECEF60)
</pre>

h4. Show all disks from controller on slot 0

<pre>
hpacucli controller slot=0 physicaldrive all show
</pre>

Example:

<pre>
# hpacucli controller slot=0 physicaldrive all show

Smart Array P420i in Slot 0 (Embedded)

   array A

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 12000.1 GB, OK)

   array B

      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 12000.1 GB, OK)

   unassigned

      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA, 10000.8 GB, OK)

root@ungleich-hardware-server97:/# 
</pre>

In this example the last disk has not been assigned yet.

h4. Create RAID 0 for ceph

For ceph we want a RAID 0 over a single disk to expose the disk to the OS.

This can be done using the following command:

<pre>
hpacucli controller slot=0 create type=ld drives=$DRIVEID raid=0
</pre>

For example:

<pre>
hpacucli controller slot=0 create type=ld drives=1I:1:3 raid=0
</pre>

h4. Show the controller configuration

<pre>
hpacucli controller slot=0 show config
</pre>

For example:

<pre>
# hpacucli controller slot=0 show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 001438033ECEF60)

   array A (SATA, Unused Space: 0  MB)

      logicaldrive 1 (10.9 TB, RAID 0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 12000.1 GB, OK)

   array B (SATA, Unused Space: 0  MB)

      logicaldrive 2 (10.9 TB, RAID 0, OK)

      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 12000.1 GB, OK)

   array C (SATA, Unused Space: 0  MB)

      logicaldrive 3 (9.1 TB, RAID 0, OK)

      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA, 10000.8 GB, OK)

   Expander 380 (WWID: 50014380324EBFE0, Port: 1I, Box: 1)

   Enclosure SEP (Vendor ID HP, Model Gen8 ServBP 12+2) 378 (WWID: 50014380324EBFF9, Port: 1I, Box: 1)

   SEP (Vendor ID PMCSIERA, Model SRCv8x6G) 379 (WWID: 5001438033ECEF6F)
</pre>

h2. Moving a disk/ssd to another server

(needs to be described better)

Generally speaking:

* //Needs to be tested: disable recovery so data won't start to move while you have the osd down//
* /opt/ungleich-tools/ceph-osd-stop-disable does the following:
** Stops the osd and removes the monit configuration on the server you want to take it out of
** Unmounts the disk
* Take the disk out
* Discard the preserved cache on the server you took it out of
** using megacli: @megacli -DiscardPreservedCache -Lall -aAll@
* Insert it into the new server
* Clear the foreign configuration
** using megacli: @megacli -CfgForeign -Clear -aAll@
* The disk will now appear in the OS and ceph/udev will automatically start the OSD (!)
** No creation of the osd required!
* Verify that the disk exists and that the osd is started
** using *ps aux*
** using *ceph osd tree*
* */opt/ungleich-tools/monit-ceph-create-start osd.XX* # where osd.XX is the osd + number
** Creates the monit configuration file so that monit watches the OSD
** Reload monit
* Verify monit using *monit status*

h2. OSD related processes

h3. Removing a disk/ssd

To permanently remove a failed disk from a cluster, use ***ceph-osd-stop-remove-permanently*** from the ungleich-tools repo. Warning: if the disk is still active, the OSD will be shut down AND removed from the cluster -> all data of that disk will need to be rebalanced.

h3. Handling DOWN osds with filesystem errors

If an email arrives with the subject "monit alert -- Does not exist osd.XX-whoami", the filesystem of an OSD cannot be read anymore. It is very likely that the disk / ssd is broken. Steps that need to be done:

* Login to any ceph monitor (cephX.placeY.ungleich.ch)
* Check **ceph -s**, find the host using **ceph osd tree**
* Login to the affected host
* Run the following commands:
** ls /var/lib/ceph/osd/ceph-XX
** dmesg

<pre>
# Example dmesg output; after seeing messages like these, continue with the next step
[204696.406756] XFS (sdl1): metadata I/O error: block 0x19100 ("xlog_iodone") error 5 numblks 64
[204696.408094] XFS (sdl1): xfs_do_force_shutdown(0x2) called from line 1233 of file /build/linux-BsFdsw/linux-4.9.65/fs/xfs/xfs_log.c.  Return address = 0xffffffffc08eb612
[204696.410702] XFS (sdl1): Log I/O Error Detected.  Shutting down filesystem
[204696.411977] XFS (sdl1): Please umount the filesystem and rectify the problem(
</pre>

* Create a new ticket in the datacenter light project
** Subject: "Replace broken OSD.XX on serverX.placeY.ungleich.ch"
** Add (partial) output of the above commands
** Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently XX, where XX is the osd id, to remove the disk from the cluster
** Remove the physical disk from the host, check whether there is warranty on it and if yes:
*** Create a short letter to the vendor, including the technical details from above
*** Record when you sent it in
*** Put the ticket into status waiting
** If there is no warranty, dispose of it

h3. [[Create new pool and place new osd]]

h3. Configuring auto repair on pgs

<pre>
ceph config set osd osd_scrub_auto_repair true
</pre>

Verify using:

<pre>
ceph config dump
</pre>

h3. Change the device class of an OSD

<pre>
OSD=XX
NEWCLASS=ZZ

# Remove the old device class and set the new one
ceph osd crush rm-device-class osd.$OSD
ceph osd crush set-device-class $NEWCLASS osd.$OSD
</pre>

* Found on https://arpnetworks.com/blog/2019/06/28/how-to-update-the-device-class-on-a-ceph-osd.html

h2. Managing ceph Daemon crashes

If there is a warning about crashed daemons, they can be displayed and deleted as follows:

* @ceph crash ls@
* @ceph crash info <id>@
* @ceph crash archive <id>@
* @ceph crash archive-all@

Summary originally found on https://forum.proxmox.com/threads/health_warn-1-daemons-have-recently-crashed.63105/

h2. Change ceph speed for i/o recovery

By default we want to keep I/O recovery traffic low to not impact customer experience. However, when multiple disks fail at the same time, we might want to prioritise recovery for data safety over performance.

The default configuration on our servers contains:

<pre>
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

The important settings are *osd max backfills* and *osd recovery max active*; the priority is always kept low so that regular I/O has priority.

To adjust the number of backfills *per osd* and to change the *number of threads* used for recovery, we can run the following on any node with the admin keyring:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills Y'
ceph tell osd.* injectargs '--osd-recovery-max-active X'
</pre>

where Y and X are the values that we want to use. Experience shows that Y=5 and X=5 doubles to triples the recovery performance, whereas X=10 and Y=10 increases recovery performance 5 times.

This can also be combined in one command:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills Y' '--osd-recovery-max-active X'

# f.i.: reset to 1
ceph tell osd.* injectargs '--osd-max-backfills 1' '--osd-recovery-max-active 1'

# f.i.: set to 4
ceph tell osd.* injectargs '--osd-max-backfills 4' '--osd-recovery-max-active 4'
</pre>

h2. Debug scrub errors / inconsistent pg message

From time to time disks don't save what they are told to save. Ceph scrubbing detects these errors and switches to HEALTH_ERR. Use *ceph health detail* to find out which placement groups (*pgs*) are affected. Usually a ***ceph pg repair <number>*** fixes the problem.
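
A typical session looks roughly like this (a sketch; the pg id is an example, and the affected objects can additionally be inspected with @rados list-inconsistent-obj@):

<pre>
ceph health detail                         # shows e.g. "pg 2.33 is active+clean+inconsistent"
rados list-inconsistent-obj 2.33 --format=json-pretty
ceph pg repair 2.33
</pre>
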
If this does not help, consult https://ceph.com/geen-categorie/ceph-manually-repair-object/.

h2. Move servers into the osd tree

New servers have their buckets placed outside the **default root** and thus need to be moved inside.
Output might look as follows:

<pre>
[11:19:27] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF 
 -3           0.87270 host server5                             
 41     ssd   0.87270     osd.41           up  1.00000 1.00000 
 -1         251.85580 root default                             
 -7          81.56271     host server2                         
  0 hdd-big   9.09511         osd.0        up  1.00000 1.00000 
  5 hdd-big   9.09511         osd.5        up  1.00000 1.00000 
...
</pre>

Use **ceph osd crush move serverX root=default** (where serverX is the new server),
which will move the bucket into the right place:

<pre>
[11:21:17] server5.place6:~# ceph osd crush move server5 root=default
moved item id -3 name 'server5' to location {root=default} in crush map
[11:32:12] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF 
 -1         252.72850 root default                             
...
 -3           0.87270     host server5                         
 41     ssd   0.87270         osd.41       up  1.00000 1.00000 

</pre>

h2. How to fix existing osds with wrong partition layout

In the first version of DCL we used a filestore / 3-partition based layout.
In the second version of DCL, including OSD autodetection, we use a bluestore / 2-partition based layout.

To convert, we delete the old OSD, clean the partitions and create a new osd:

h3. Inactive OSD

If the OSD is *not active*, we can do the following:

* Find the OSD number: mount the partition and find the whoami file

<pre>
root@server2:/opt/ungleich-tools# mount /dev/sda2 /mnt/
root@server2:/opt/ungleich-tools# cat /mnt/whoami 
0
root@server2:/opt/ungleich-tools# umount  /mnt/
</pre>

* Verify in the *ceph osd tree* output that the OSD is on that server
* Delete the OSD:
** ceph osd crush remove $osd_name
** ceph osd rm $osd_name

Then continue below as described in "Recreating the OSD".

h3. Remove Active OSD

* Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently OSDID to stop and remove the OSD
* Then continue below as described in "Recreating the OSD".

h3. Recreating the OSD

* Create an empty partition table (see the combined sketch below)
** fdisk /dev/sdX
** g
** w
* Create a new OSD
** /opt/ungleich-tools/ceph-osd-create-start /dev/sdX CLASS # use hdd, ssd, ... for the CLASS
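
Put together, a minimal sketch of the recreation (device name and class are examples; this **destroys all data** on /dev/sdX):

<pre>
# create an empty GPT partition table non-interactively (same as fdisk: g, then w)
printf 'g\nw\n' | fdisk /dev/sdX

# recreate the OSD, e.g. with device class hdd
/opt/ungleich-tools/ceph-osd-create-start /dev/sdX hdd
</pre>
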

h2. How to fix unfound pg

Refer to https://redmine.ungleich.ch/issues/6388 (an example sequence is sketched below).

* Check the health state
** ceph health detail
* Check which server has that osd
** ceph osd tree
* Check which VMs are running on that server
** virsh list
* Check the pg map
** ceph osd map [osd pool] [VMID]
* Revert the pg
** ceph pg [PGID] mark_unfound_lost revert
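
A sketch of the sequence with placeholder names (pool, object and pg id are examples only):

<pre>
ceph health detail                 # find the pg with unfound objects
ceph osd tree                      # locate the affected OSD / server
ceph osd map hdd one-1234          # map a pool/object to its pg
ceph pg 2.4 mark_unfound_lost revert
</pre>
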

h2. Phasing out OSDs

* Either directly via /opt/ungleich-tools/ceph/ceph-osd-stop-remove-permanently
* Or first drain it using @ceph osd crush reweight osd.XX 0@ (see the sketch below)
** Wait until the rebalance is done
** Then remove it
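
A minimal sketch of the draining variant (osd.XX / XX is a placeholder):

<pre>
ceph osd crush reweight osd.XX 0
# wait until "ceph -s" no longer shows misplaced objects / backfilling
/opt/ungleich-tools/ceph/ceph-osd-stop-remove-permanently XX
</pre>
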

h2. Enabling per image RBD statistics for prometheus

<pre>
[20:26:57] red2.place5:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "one,hdd"
[20:27:57] black2.place6:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "hdd,ssd"
</pre>
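
The currently configured pools can be verified with (assuming a release that supports @ceph config get@):

<pre>
ceph config get mgr mgr/prometheus/rbd_stats_pools
</pre>
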

h2. S3 Object Storage

This section is **UNDER CONSTRUCTION**.

h3. Introduction

* See the "Red Hat manual":https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/object_gateway_guide_for_red_hat_enterprise_linux/overview-rgw
* The "ceph docs about object storage":https://docs.ceph.com/docs/mimic/radosgw/

h3. Architecture

* S3 requests are handled by a publicly accessible gateway, which also has access to the ceph cluster.
* s3 buckets are usually

h3. Authentication / Users

* Ceph *can* make use of LDAP as a backend
** However, it uses the clear text username+password as a token
** See https://docs.ceph.com/docs/mimic/radosgw/ldap-auth/
* We do not want users to store their regular account credentials on machines
* For this reason we use independent users / tokens, but with the same username as in LDAP

Creating a user:

<pre>
radosgw-admin user create --uid=USERNAME --display-name="Name of user"
</pre>

Listing users:

<pre>
radosgw-admin user list
</pre>

Deleting users and their storage:

<pre>
radosgw-admin user rm --uid=USERNAME --purge-data
</pre>

h3. Setting up S3 object storage on Ceph

* Set up a gateway node with Alpine Linux
** Change to edge
** Enable testing
* Update the firewall to allow access from this node to the ceph monitors
* Set up the wildcard DNS certificate (see the next section)

<pre>
apk add ceph-radosgw
</pre>

h3. Wildcard DNS certificate from letsencrypt

Acquiring and renewing this certificate is currently a manual process, as it requires changing DNS settings.

* run certbot
* update DNS with the first token
* update DNS with the second token

Sample session:

<pre>
s3:/etc/ceph# certbot certonly --manual --preferred-challenges=dns --email sre@ungleich.ch --server https://acme-v02.api.letsencrypt.org/directory --agree-tos 
-d *.s3.ungleich.ch -d s3.ungleich.ch
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Cert is due for renewal, auto-renewing...
Renewing an existing certificate
Performing the following challenges:
dns-01 challenge for s3.ungleich.ch
dns-01 challenge for s3.ungleich.ch

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
NOTE: The IP of this machine will be publicly logged as having requested this
certificate. If you're running certbot in manual mode on a machine that is not
your server, please ensure you're okay with that.

Are you OK with your IP being logged?
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(Y)es/(N)o: y

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.s3.ungleich.ch with the following value:

KxGLZNiVjFwz1ifNheoR_KQoPVpkvRUV1oT2pOvJlU0

Before continuing, verify the record is deployed.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.s3.ungleich.ch with the following value:

bkrhtxWZUipCAL5cBfvrjDuftqsZdQ2JjisiKmXBbaI

Before continuing, verify the record is deployed.
(This must be set up in addition to the previous challenges; do not remove,
replace, or undo the previous challenge tasks yet. Note that you might be
asked to create multiple distinct TXT records with the same name. This is
permitted by DNS standards.)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue
Waiting for verification...
Cleaning up challenges

IMPORTANT NOTES:
 - Congratulations! Your certificate and chain have been saved at:
   /etc/letsencrypt/live/s3.ungleich.ch/fullchain.pem
   Your key file has been saved at:
   /etc/letsencrypt/live/s3.ungleich.ch/privkey.pem
   Your cert will expire on 2020-12-09. To obtain a new or tweaked
   version of this certificate in the future, simply run certbot
   again. To non-interactively renew *all* of your certificates, run
   "certbot renew"
 - If you like Certbot, please consider supporting our work by:

   Donating to ISRG / Let's Encrypt:   https://letsencrypt.org/donate
   Donating to EFF:                    https://eff.org/donate-le

</pre>

h2. Debugging ceph

<pre>
    ceph status
    ceph osd status
    ceph osd df
    ceph osd utilization
    ceph osd pool stats
    ceph osd tree
    ceph pg stat
</pre>

h3. How to list the version overview

This lists the versions of osds, mgrs and mons:

<pre>
ceph versions
</pre>

Listing the "features" of clients, osds, mgrs and mons can be done using @ceph features@:

<pre>
[15:32:20] red1.place5:~# ceph features
{
    "mon": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 5
        }
    ],
    "osd": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 44
        }
    ],
    "client": [
        {
            "features": "0x3ffddff8eea4fffb",
            "release": "luminous",
            "num": 4
        },
        {
            "features": "0x3ffddff8ffacffff",
            "release": "luminous",
            "num": 18
        },
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 31
        }
    ],
    "mgr": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 4
        }
    ]
}
</pre>

h3. How to list the version of every OSD and every monitor

To list the version of each ceph OSD:

<pre>
ceph tell osd.* version
</pre>

To list the version of each ceph mon:

<pre>
ceph tell mon.* version
</pre>

The mgrs do not seem to support this command as of 14.2.21.

h2. Performance Tuning

* Ensure that the basic options for reducing rebalancing workload are set:

<pre>
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

* Ensure that "osd_op_queue_cut_off":https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_op_queue_cut_off is set to **high**
** Requires OSD restart on change

<pre>
ceph config set global osd_op_queue_cut_off high
</pre>

<pre>
be sure to check your osd recovery sleep settings, there are several
depending on your underlying drives:

    "osd_recovery_sleep": "0.000000",
    "osd_recovery_sleep_hdd": "0.050000",
    "osd_recovery_sleep_hybrid": "0.050000",
    "osd_recovery_sleep_ssd": "0.050000",

Adjusting these upwards will dramatically reduce IO, and take effect
immediately at the cost of slowing rebalance/recovery.
</pre>
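
For example, the hdd-specific sleep could be raised like this during heavy recovery (a sketch; the value is only an example, and higher values throttle recovery more):

<pre>
ceph config set osd osd_recovery_sleep_hdd 0.1
</pre>
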

Reference settings from Frank Schilder:

<pre>
  osd       class:hdd      advanced osd_recovery_sleep                0.050000
  osd       class:rbd_data advanced osd_recovery_sleep                0.025000
  osd       class:rbd_meta advanced osd_recovery_sleep                0.002500
  osd       class:ssd      advanced osd_recovery_sleep                0.002500
  osd                      advanced osd_recovery_sleep                0.050000

  osd       class:hdd      advanced osd_max_backfills                 3
  osd       class:rbd_data advanced osd_max_backfills                 6
  osd       class:rbd_meta advanced osd_max_backfills                 12
  osd       class:ssd      advanced osd_max_backfills                 12
  osd                      advanced osd_max_backfills                 3

  osd       class:hdd      advanced osd_recovery_max_active           8
  osd       class:rbd_data advanced osd_recovery_max_active           16
  osd       class:rbd_meta advanced osd_recovery_max_active           32
  osd       class:ssd      advanced osd_recovery_max_active           32
  osd                      advanced osd_recovery_max_active           8
</pre>

(these have not yet been tested in our clusters)

h2. Ceph theory

h3. How much data per Server?

Q: How much data should we add into one server?
A: Not more than it can handle.

How much data can a server handle? For this let's have a look at 2 scenarios:

* How long does it take to compensate the loss of the server?

* Assuming a server has X TiB storage in Y disks attached and a network speed of Z GiB/s.
* And our estimated rebuild goal is to compensate the loss of a server within U hours.

h4. Approach 1

Let's take an example:

* A server with @10 disks * 10 TiB@ = 100 TiB ≈ 100 000 GiB data. It is network connected with 10 Gbit = 1.25 GiB/s.
* 100000/1.25 = 80000s = 22.22h

However, this logic assumes that we actually rebuild from the failed server, which... is failed.

h4. Approach 2: calculating with the remaining servers

However, we can also apply our logic to distribute the rebuild over several servers that now pull in data from each other for rebuilding.
We need to *read* the data (100 TiB) from other servers and distribute it to new OSDs, assuming each server has a 10 Gbit/s network connection.

Now the servers might need to *read* (get data from other osds) and *write* (send data to other osds). Luckily, networking is 10 Gbit/s duplex - i.e. in both directions.

However, how fast can we actually read data from the disks?

* SSDs are in the range of hundreds of MB/s (best case, not necessarily true for random reads)
* HDDs are in the range of tens of MB/s (depending on the work load, but 30-40 MB/s random reads seems realistic)

Further assumptions:

* Assuming further that each disk should be dedicated at least one CPU core.

h3. Disk/SSD speeds

* Tuning for #8473 showed that a 10TB HDD can write up to 180-200MB/s when backfilling (at about 70% cpu usage and 20% disk usage), max backfills = 8
* Debugging SSD usage in #8461 showed SSDs can read about 470-520MB/s sequential
* Debugging SSD usage in #8461 showed SSDs can write about 170-280MB/s sequential
* Debugging SSD usage in #8461 showed SSDs can write about 4MB/s RANDOM (need to verify this even though 3 runs showed these numbers)

h3. Ceph theoretical fundament

If you are very much into the theoretical fundament of Ceph, check out their "paper":https://www3.nd.edu/~dthain/courses/cse40771/spring2007/papers/ceph.pdf