h1. The ungleich ceph handbook

{{toc}}

h2. Status

This document is **IN PRODUCTION**.

h2. Introduction

This article describes the ungleich storage architecture, which is based on ceph, as well as the maintenance commands required to operate it.

h2. Processes

h3. Usage monitoring

* Usage should be kept somewhere in the 70-75% area (see below for how to check)
* If usage reaches 72.5%, we start reducing usage by adding disks
* We stop when usage is below 70%
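
Current usage can be checked with the standard ceph commands; a minimal sketch:

<pre>
# overall raw usage and per pool statistics
ceph df

# usage and number of PGs per OSD
ceph osd df
</pre>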

h3. Phasing in new disks

* 24h performance test prior to using them (a possible burn-in is sketched below)
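
The handbook does not prescribe a specific test tool; one possible 24h burn-in using fio (hypothetical parameters, destroys all data on /dev/sdX) could be:

<pre>
# 24h of mixed random read/write against the raw device
fio --name=burnin --filename=/dev/sdX --direct=1 --ioengine=libaio \
    --rw=randrw --bs=4k --iodepth=32 --runtime=86400 --time_based
</pre>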

h3. Phasing in new servers

* 24h performance test with 1 ssd or 1 hdd (whichever is applicable)

h3. Find all ceph osd hosts

Use awk to filter the host entries out of the osd tree:

<pre>
ceph osd tree | awk '/host/ { print $4 }'
</pre>

h2. Communication guide

Usually when disks fail, no customer communication is necessary, as the loss is automatically compensated/rebalanced by ceph. However, in case multiple disk failures happen at the same time, I/O speed might be reduced and thus customer experience impacted.

For this reason, communicate whenever I/O recovery settings are temporarily tuned.

h2. Analysing

h3. ceph osd df tree

Using @ceph osd df tree@ you can see not only the disk usage per OSD, but also the number of PGs on an OSD. This is especially useful to see how the OSDs are balanced.

h3. Find out the device of an OSD

Use @mount | grep /var/lib/ceph/osd/ceph-OSDID@ on the server on which the OSD is located:

<pre>
[16:01:23] server2.place6:~# mount | grep /var/lib/ceph/osd/ceph-31
/dev/sdk1 on /var/lib/ceph/osd/ceph-31 type xfs (rw,relatime,attr2,inode64,noquota)
</pre>

h3. Show config

<pre>
ceph config dump
</pre>

h3. Show backfill and recovery config

<pre>
ceph-conf --show-config | egrep "osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills"
</pre>

* See also: https://www.suse.com/support/kb/doc/?id=000019693
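
Note that @ceph-conf --show-config@ shows compiled-in/config-file defaults rather than what a running daemon currently uses. A sketch for querying a live daemon instead (osd.0 is a placeholder):

<pre>
ceph tell osd.0 config get osd_max_backfills
</pre>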

h3. Checking and clearing crash reports

If the cluster is reporting HEALTH_WARN and a recent crash such as:

<pre>
[rook@rook-ceph-tools-f569797b4-z4542 /]$ ceph -s
  cluster:
    id:     ...
    health: HEALTH_WARN
            1 daemons have recently crashed
</pre>

one can analyse it using:

* List the crashes: @ceph crash ls@
* Check the details: @ceph crash info <id>@

To archive the error:

* To archive a specific report: @ceph crash archive <id>@
* To archive all: @ceph crash archive-all@

After archiving, the cluster health should return to HEALTH_OK:

<pre>
[rook@rook-ceph-tools-f569797b4-z4542 /]$  ceph crash ls
ID                                                                ENTITY  NEW  
2022-09-19T04:33:19.378500Z_b2e26755-0712-41de-bf2b-b370dbe94ebc  mon.c    *   
[rook@rook-ceph-tools-f569797b4-z4542 /]$ ceph crash archive 2022-09-19T04:33:19.378500Z_b2e26755-0712-41de-bf2b-b370dbe94ebc 
[rook@rook-ceph-tools-f569797b4-z4542 /]$ ceph -s
  cluster:
    id:     ..
    health: HEALTH_OK
</pre>

h3. Low monitor space warning

If you see

<pre>
[rook@rook-ceph-tools-6bdf996-8g792 /]$ ceph health detail
HEALTH_WARN mon q is low on available space
[WRN] MON_DISK_LOW: mon q is low on available space
    mon.q has 29% avail
</pre>

there are two options to fix it:

* a) free up space
* b) adjust the threshold defined in @mon_data_avail_warn@ (see below)
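
The warning fires when a monitor's available disk space drops below @mon_data_avail_warn@ percent (default 30). A sketch for option b), lowering the threshold to 25%:

<pre>
ceph config set mon mon_data_avail_warn 25
</pre>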

h2. Adding a new disk/ssd to the ceph cluster

Write on the disk, with a permanent marker, in which order / on which date we bought it.

h3. Checking the shadow trees

To be able to spot differences in the weights of hosts, it can be very helpful to look at the crush shadow tree
using @ceph osd crush tree --show-shadow@:

<pre>
-16   hdd-big 653.03418           root default~hdd-big
-34   hdd-big         0         0     host server14~hdd-big
-38   hdd-big         0         0     host server15~hdd-big
-42   hdd-big  81.86153  78.28352     host server17~hdd-big
 36   hdd-big   9.09560   9.09560         osd.36
 59   hdd-big   9.09499   9.09499         osd.59
 60   hdd-big   9.09499   9.09499         osd.60
 68   hdd-big   9.09599   8.93999         osd.68
 69   hdd-big   9.09599   7.65999         osd.69
 70   hdd-big   9.09599   8.35899         osd.70
 71   hdd-big   9.09599   8.56000         osd.71
 72   hdd-big   9.09599   8.93700         osd.72
 73   hdd-big   9.09599   8.54199         osd.73
-46   hdd-big  90.94986  90.94986     host server18~hdd-big
...
</pre>

Here we can see that the weight of server17 for the class hdd-big is about 81, the one of server18 about 90.
SSDs and other classes have their own shadow trees, too.

h3. For Dell servers

First find the disk and then add it to the operating system:

<pre>
megacli -PDList -aALL  | grep -B16 -i unconfigur

# Sample output:
[19:46:50] server7.place6:~#  megacli -PDList -aALL  | grep -B16 -i unconfigur
Enclosure Device ID: N/A
Slot Number: 0
Enclosure position: N/A
Device Id: 0
WWN: 0000000000000000
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 894.252 GB [0x6fc81ab0 Sectors]
Non Coerced Size: 893.752 GB [0x6fb81ab0 Sectors]
Coerced Size: 893.75 GB [0x6fb80000 Sectors]
Sector Size:  0
Firmware state: Unconfigured(good), Spun Up
</pre>

Then add the disk to the OS:

<pre>
megacli -CfgLdAdd -r0 [Enclosure Device ID:slot] -aX   (X: adapter number; host is 0, md-array is 1)

# Sample call, if enclosure and slot are KNOWN (aka not N/A)
megacli -CfgLdAdd -r0 [32:0] -a0

# Sample call, if enclosure is N/A
megacli -CfgLdAdd -r0 [:0] -a0
</pre>

Then check the disk:

<pre>
fdisk -l
[11:26:23] server2.place6:~# fdisk -l
......
Disk /dev/sdh: 7.3 TiB, 8000987201536 bytes, 15626928128 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
[11:27:24] server2.place6:~#
</pre>

Then create a fresh GPT:

<pre>
/opt/ungleich-tools/disk-create-fresh-gpt /dev/XXX
[11:31:10] server2.place6:~# /opt/ungleich-tools/disk-create-fresh-gpt /dev/sdh
......
Created a new DOS disklabel with disk identifier 0x9c4a0355.
Command (m for help): Created a new GPT disklabel (GUID: 374E31AD-7B96-4837-B5ED-7B22C452899E).
......
</pre>

Then create the osd, passing the device class (ssd or hdd-big):

<pre>
/opt/ungleich-tools/ceph-osd-create-start /dev/XXX CLASS   # CLASS is ssd or hdd-big
[11:33:58] server2.place6:~# /opt/ungleich-tools/ceph-osd-create-start /dev/sdh hdd-big
+ set -e
+ [ 2 -lt 2 ]
......
+ /opt/ungleich-tools/monit-ceph-create-start osd.14
osd.14
[ ok ] Restarting daemon monitor: monit.
[11:36:14] server2.place6:~#
</pre>

Then check the rebalancing (if you want to add another disk, do so only after rebalancing has finished):

<pre>
ceph -s
[12:37:57] server2.place6:~# ceph -s
  cluster:
    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
    health: HEALTH_WARN
            2248811/49628409 objects misplaced (4.531%)
......
  io:
    client:   170KiB/s rd, 35.0MiB/s wr, 463op/s rd, 728op/s wr
    recovery: 27.1MiB/s, 6objects/s
[12:49:41] server2.place6:~#
</pre>

h3. For HP servers (hpacucli)

* Ensure the kernel module "sg" has been loaded

Use the following to verify that the controller is detected:

<pre>
# hpacucli controller all show

Smart Array P420i in Slot 0 (Embedded)    (sn: 001438033ECEF60)
</pre>

h4. Show all disks from controller on slot 0

<pre>
hpacucli controller slot=0 physicaldrive all show
</pre>

Example:

<pre>
# hpacucli controller slot=0 physicaldrive all show

Smart Array P420i in Slot 0 (Embedded)

   array A

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 12000.1 GB, OK)

   array B

      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 12000.1 GB, OK)

   unassigned

      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA, 10000.8 GB, OK)

root@ungleich-hardware-server97:/#
</pre>

In this example the last disk has not been assigned yet.

h4. Create RAID 0 for ceph

For ceph we want a RAID 0 over a single disk to expose the disk to the OS.

This can be done using the following command:

<pre>
hpacucli controller slot=0 create type=ld drives=$DRIVEID raid=0
</pre>

For example:

<pre>
hpacucli controller slot=0 create type=ld drives=1I:1:3 raid=0
</pre>

h4. Show the controller configuration

<pre>
hpacucli controller slot=0 show config
</pre>

For example:

<pre>
# hpacucli controller slot=0 show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 001438033ECEF60)

   array A (SATA, Unused Space: 0  MB)

      logicaldrive 1 (10.9 TB, RAID 0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 12000.1 GB, OK)

   array B (SATA, Unused Space: 0  MB)

      logicaldrive 2 (10.9 TB, RAID 0, OK)

      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 12000.1 GB, OK)

   array C (SATA, Unused Space: 0  MB)

      logicaldrive 3 (9.1 TB, RAID 0, OK)

      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA, 10000.8 GB, OK)

   Expander 380 (WWID: 50014380324EBFE0, Port: 1I, Box: 1)

   Enclosure SEP (Vendor ID HP, Model Gen8 ServBP 12+2) 378 (WWID: 50014380324EBFF9, Port: 1I, Box: 1)

   SEP (Vendor ID PMCSIERA, Model SRCv8x6G) 379 (WWID: 5001438033ECEF6F)
</pre>

h3. Removing signatures preventing a disk from being used by ceph

If you see

<pre>
cephosd: skipping device "sdX" because it contains a filesystem "ddf_raid_member"
</pre>

you can clean it with wipefs:

<pre>
[20:47] server98.place10:~# wipefs /dev/sde
DEVICE OFFSET        TYPE            UUID         LABEL
sde    0xae9fffffe00 ddf_raid_member Dell    \x10
[20:48] server98.place10:~# wipefs -a /dev/sde
/dev/sde: 4 bytes were erased at offset 0xae9fffffe00 (ddf_raid_member): de 11 de 11
[20:48] server98.place10:~#
</pre>

h2. Moving a disk/ssd to another server

(needs to be described better)

Generally speaking:

* (needs to be tested: disable recovery so data won't start to move while you have the osd down)
* /opt/ungleich-tools/ceph-osd-stop-disable does the following:
** Stops the osd and removes monit on the server you want to take it out of
** umounts the disk
* Take the disk out
* Discard the preserved cache on the server you took it out of
** using megacli: @megacli -DiscardPreservedCache -Lall -aAll@
* Insert the disk into the new server
* Clear the foreign configuration
** using megacli: @megacli -CfgForeign -Clear -aAll@
* The disk will now appear in the OS; ceph/udev will automatically start the OSD (!)
** No creation of the osd is required!
* Verify that the disk exists and that the osd is started
** using *ps aux*
** using *ceph osd tree*
* */opt/ungleich-tools/monit-ceph-create-start osd.XX* # where XX is the osd number
** Creates the monit configuration file so that monit watches the OSD
** Reloads monit
* Verify monit using *monit status*

<pre>
megacli -DiscardPreservedCache -Lall -aAll
megacli -CfgForeign -Clear -aAll
</pre>

h2. OSD related processes

h3. Removing a disk/ssd

To permanently remove a failed disk from a cluster, use ***ceph-osd-stop-remove-permanently*** from the ungleich-tools repo. Warning: if the disk is still active, the OSD will be shut down AND removed from the cluster -> all data on that disk will need to be rebalanced.

h3. Handling DOWN osds with filesystem errors

If an email arrives with the subject "monit alert -- Does not exist osd.XX-whoami", the filesystem of an OSD cannot be read anymore. It is very likely that the disk / ssd is broken. Steps that need to be done:

* Log in to any ceph monitor (cephX.placeY.ungleich.ch)
* Check **ceph -s**, find the host using **ceph osd tree**
* Log in to the affected host
* Run the following commands:
** ls /var/lib/ceph/osd/ceph-XX
** dmesg

<pre>
# Example dmesg output; after seeing messages like these, you can do the next step
[204696.406756] XFS (sdl1): metadata I/O error: block 0x19100 ("xlog_iodone") error 5 numblks 64
[204696.408094] XFS (sdl1): xfs_do_force_shutdown(0x2) called from line 1233 of file /build/linux-BsFdsw/linux-4.9.65/fs/xfs/xfs_log.c.  Return address = 0xffffffffc08eb612
[204696.410702] XFS (sdl1): Log I/O Error Detected.  Shutting down filesystem
[204696.411977] XFS (sdl1): Please umount the filesystem and rectify the problem(
</pre>

* Create a new ticket in the datacenter light project
** Subject: "Replace broken OSD.XX on serverX.placeY.ungleich.ch"
** Add the (partial) output of the above commands
** Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently XX, where XX is the osd id, to remove the disk from the cluster
** Remove the physical disk from the host, check whether there is warranty on it, and if yes:
*** Write a short letter to the vendor, including the technical details from above
*** Record when you sent it in
*** Put the ticket into status waiting
** If there is no warranty, dispose of it

h3. [[Create new pool and place new osd]]

h3. Configuring auto repair on pgs

<pre>
ceph config set osd osd_scrub_auto_repair true
</pre>

Verify using:

<pre>
ceph config dump
</pre>

h3. Change the device class of an OSD

<pre>
OSD=XX
NEWCLASS=ZZ

# Remove the old device class, then set the new one
ceph osd crush rm-device-class osd.$OSD
ceph osd crush set-device-class $NEWCLASS osd.$OSD
</pre>

* Found on https://arpnetworks.com/blog/2019/06/28/how-to-update-the-device-class-on-a-ceph-osd.html

h2. Managing ceph daemon crashes

If there is a warning about crashed daemons, they can be displayed and archived as follows:

* @ceph crash ls@
* @ceph crash info <id>@
* @ceph crash archive <id>@
* @ceph crash archive-all@

Summary originally found on https://forum.proxmox.com/threads/health_warn-1-daemons-have-recently-crashed.63105/

h2. Change ceph speed for i/o recovery

By default we want to keep I/O recovery traffic low to not impact customer experience. However, when multiple disks fail at the same time, we might want to prioritise recovery for data safety over performance.

The default configuration on our servers contains:

<pre>
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

The important settings are *osd max backfills* and *osd recovery max active*; the op priority is always kept low so that regular I/O has priority.

To adjust the number of backfills *per osd* and to change the *number of threads* used for recovery, we can use the following on any node with the admin keyring:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills Y'
ceph tell osd.* injectargs '--osd-recovery-max-active X'
</pre>

where Y and X are the values that we want to use. Experience shows that Y=5 and X=5 doubles to triples the recovery performance, whereas X=10 and Y=10 increases recovery performance 5 times.

This can also be combined in one command:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills Y' '--osd-recovery-max-active X'

# e.g. reset to 1
ceph tell osd.* injectargs '--osd-max-backfills 1' '--osd-recovery-max-active 1'

# e.g. set to 4
ceph tell osd.* injectargs '--osd-max-backfills 4' '--osd-recovery-max-active 4'
</pre>

h2. Debug scrub errors / inconsistent pg message

From time to time disks don't save what they are told to save. Ceph scrubbing detects these errors and switches to HEALTH_ERR. Use *ceph health detail* to find out which placement groups (*pgs*) are affected. Usually a ***ceph pg repair <number>*** fixes the problem; a sketch follows below.
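
A sketch of the flow (the pg id 2.5f is hypothetical):

<pre>
ceph health detail
# HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
#     pg 2.5f is active+clean+inconsistent, acting [31,7,12]

ceph pg repair 2.5f
</pre>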

If this does not help, consult https://ceph.com/geen-categorie/ceph-manually-repair-object/.

h2. Move servers into the osd tree

New servers have their buckets placed outside the **default root** and thus need to be moved inside.
Output might look as follows:

<pre>
[11:19:27] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF 
 -3           0.87270 host server5                             
 41     ssd   0.87270     osd.41           up  1.00000 1.00000 
 -1         251.85580 root default                             
 -7          81.56271     host server2                         
  0 hdd-big   9.09511         osd.0        up  1.00000 1.00000 
  5 hdd-big   9.09511         osd.5        up  1.00000 1.00000 
...
</pre>

Use **ceph osd crush move serverX root=default** (where serverX is the new server),
which will move the bucket into the right place:

<pre>
[11:21:17] server5.place6:~# ceph osd crush move server5 root=default
moved item id -3 name 'server5' to location {root=default} in crush map
[11:32:12] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF 
 -1         252.72850 root default                             
...
 -3           0.87270     host server5                         
 41     ssd   0.87270         osd.41       up  1.00000 1.00000 
</pre>

h2. How to fix existing osds with wrong partition layout

In the first version of DCL we used a filestore layout with 3 partitions.
In the second version of DCL, including OSD autodetection, we use a bluestore layout with 2 partitions.

To convert, we delete the old OSD, clean the partitions and create a new osd:

h3. Inactive OSD

If the OSD is *not active*, we can do the following:

* Find the OSD number: mount the partition and find the whoami file

<pre>
root@server2:/opt/ungleich-tools# mount /dev/sda2 /mnt/
root@server2:/opt/ungleich-tools# cat /mnt/whoami 
0
root@server2:/opt/ungleich-tools# umount /mnt/
</pre>

* Verify in the *ceph osd tree* output that the OSD is on that server
* Delete the OSD
** ceph osd crush remove $osd_name
** ceph osd rm $osd_name

Then continue below as described in "Recreating the OSD".

h3. Remove Active OSD

* Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently OSDID to stop and remove the OSD
* Then continue below as described in "Recreating the OSD".

h3. Recreating the OSD

* Create an empty partition table (a non-interactive sketch follows below)
** fdisk /dev/sdX
** g
** w
* Create a new OSD
** /opt/ungleich-tools/ceph-osd-create-start /dev/sdX CLASS # use hdd, ssd, ... for the CLASS
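
Assuming sgdisk is installed, a non-interactive equivalent of the fdisk steps could be:

<pre>
# wipe existing partition data and write a fresh, empty GPT -- destructive for /dev/sdX
sgdisk -o /dev/sdX
</pre>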

h2. How to fix unfound pg

Refer to https://redmine.ungleich.ch/issues/6388

* Check the health state
** ceph health detail
* Check which server has that osd
** ceph osd tree
* Check which VMs are running on that server
** virsh list
* Check the pg map
** ceph osd map [osd pool] [VMID]
* Revert the pg (a full sketch follows below)
** ceph pg [PGID] mark_unfound_lost revert
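
A sketch of the sequence with hypothetical names (pool "hdd", image "one-1234", pg 2.4):

<pre>
ceph health detail                     # shows e.g. "pg 2.4 has 1 unfound objects"
ceph osd tree                          # locate the host of the affected osd
virsh list                             # on that host: find impacted VMs
ceph osd map hdd one-1234              # map an object/VM image to its pg
ceph pg 2.4 mark_unfound_lost revert   # revert the unfound objects
</pre>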

h2. Phasing out OSDs

* Either directly via /opt/ungleich-tools/ceph/ceph-osd-stop-remove-permanently
* Or first drain it using @ceph osd crush reweight osd.XX 0@
** Wait until the rebalance is done
** Then remove (a sketch of the draining variant follows below)
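
A sketch of the draining variant, with osd.14 as a placeholder:

<pre>
ceph osd crush reweight osd.14 0   # start draining the osd
watch ceph -s                      # wait until rebalancing has finished
ceph osd safe-to-destroy 14        # double check before removing it
</pre>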

h2. Enabling per image RBD statistics for prometheus

<pre>
[20:26:57] red2.place5:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "one,hdd"
[20:27:57] black2.place6:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "hdd,ssd"
</pre>
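
To verify that per image metrics are exported, one can query the mgr prometheus endpoint (assuming the prometheus mgr module is enabled and listening on its default port 9283):

<pre>
curl -s http://localhost:9283/metrics | grep '^ceph_rbd_'
</pre>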

h2. S3 Object Storage

This section is **UNDER CONSTRUCTION**

h3. Introduction

* See the "Red Hat manual":https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/object_gateway_guide_for_red_hat_enterprise_linux/overview-rgw
* The "ceph docs about object storage":https://docs.ceph.com/docs/mimic/radosgw/

h3. Architecture

* S3 requests are handled by a publicly accessible gateway, which also has access to the ceph cluster.
* s3 buckets are usually

h3. Authentication / Users

* Ceph *can* make use of LDAP as a backend
** However it uses the clear text username+password as a token
** See https://docs.ceph.com/docs/mimic/radosgw/ldap-auth/
* We do not want users to store their regular account credentials on machines
* For this reason we use independent users / tokens, but with the same username as in LDAP

Creating a user:

<pre>
radosgw-admin user create --uid=USERNAME --display-name="Name of user"
</pre>

Listing users:

<pre>
radosgw-admin user list
</pre>

Deleting users and their storage:

<pre>
radosgw-admin user rm --uid=USERNAME --purge-data
</pre>

h3. Setting up S3 object storage on Ceph

* Set up a gateway node with Alpine Linux
** Change to edge
** Enable testing
* Update the firewall to allow access from this node to the ceph monitors
* Set up the wildcard DNS certificate (see the next section)

<pre>
apk add ceph-radosgw
</pre>

h3. Wildcard DNS certificate from letsencrypt

Acquiring and renewing this certificate is currently a manual process, as it requires changing DNS settings.

* run certbot
* update DNS with the first token
* update DNS with the second token

Sample session:

<pre>
s3:/etc/ceph# certbot certonly --manual --preferred-challenges=dns --email sre@ungleich.ch --server https://acme-v02.api.letsencrypt.org/directory --agree-tos 
-d *.s3.ungleich.ch -d s3.ungleich.ch
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Cert is due for renewal, auto-renewing...
Renewing an existing certificate
Performing the following challenges:
dns-01 challenge for s3.ungleich.ch
dns-01 challenge for s3.ungleich.ch

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
NOTE: The IP of this machine will be publicly logged as having requested this
certificate. If you're running certbot in manual mode on a machine that is not
your server, please ensure you're okay with that.

Are you OK with your IP being logged?
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(Y)es/(N)o: y

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.s3.ungleich.ch with the following value:

KxGLZNiVjFwz1ifNheoR_KQoPVpkvRUV1oT2pOvJlU0

Before continuing, verify the record is deployed.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.s3.ungleich.ch with the following value:

bkrhtxWZUipCAL5cBfvrjDuftqsZdQ2JjisiKmXBbaI

Before continuing, verify the record is deployed.
(This must be set up in addition to the previous challenges; do not remove,
replace, or undo the previous challenge tasks yet. Note that you might be
asked to create multiple distinct TXT records with the same name. This is
permitted by DNS standards.)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue
Waiting for verification...
Cleaning up challenges

IMPORTANT NOTES:
 - Congratulations! Your certificate and chain have been saved at:
   /etc/letsencrypt/live/s3.ungleich.ch/fullchain.pem
   Your key file has been saved at:
   /etc/letsencrypt/live/s3.ungleich.ch/privkey.pem
   Your cert will expire on 2020-12-09. To obtain a new or tweaked
   version of this certificate in the future, simply run certbot
   again. To non-interactively renew *all* of your certificates, run
   "certbot renew"
 - If you like Certbot, please consider supporting our work by:

   Donating to ISRG / Let's Encrypt:   https://letsencrypt.org/donate
   Donating to EFF:                    https://eff.org/donate-le
</pre>
753
h2. Debugging ceph
754
755
756
<pre>
757
    ceph status
758
    ceph osd status
759
    ceph osd df
760
    ceph osd utilization
761
    ceph osd pool stats
762
    ceph osd tree
763
    ceph pg stat
764
</pre>
765 42 Nico Schottelius
766 53 Nico Schottelius
h3. How to list the version overview
767
768 55 Nico Schottelius
This lists the versions of osds, mgrs and mons:
769
770 53 Nico Schottelius
<pre>
771
ceph versions
772
</pre>
773 55 Nico Schottelius
774
Listing the "features" of clients, osds, mgrs and mons can be done using @ceph features@:
775
776
<pre>
777
[15:32:20] red1.place5:~# ceph features
778
{
779
    "mon": [
780
        {
781
            "features": "0x3ffddff8ffecffff",
782
            "release": "luminous",
783
            "num": 5
784
        }
785
    ],
786
    "osd": [
787
        {
788
            "features": "0x3ffddff8ffecffff",
789
            "release": "luminous",
790
            "num": 44
791
        }
792
    ],
793
    "client": [
794
        {
795
            "features": "0x3ffddff8eea4fffb",
796
            "release": "luminous",
797
            "num": 4
798
        },
799
        {
800
            "features": "0x3ffddff8ffacffff",
801
            "release": "luminous",
802
            "num": 18
803
        },
804
        {
805
            "features": "0x3ffddff8ffecffff",
806
            "release": "luminous",
807
            "num": 31
808
        }
809
    ],
810
    "mgr": [
811
        {
812
            "features": "0x3ffddff8ffecffff",
813
            "release": "luminous",
814
            "num": 4
815
        }
816
    ]
817
}
818
819
</pre>
820
 
821 53 Nico Schottelius
822
h3. How to list the version of every OSD and every monitor
823
824
To list the version of each ceph OSD:
825
826
<pre>
827
ceph tell osd.* version
828
</pre>
829
830
To list the version of each ceph mon:
831
2
832
<pre>
833
ceph tell mon.* version
834
</pre>
835
836
The mgr do not seem to support this command as of 14.2.21.
837
838 49 Nico Schottelius
h2. Performance Tuning
839
840
* Ensure that the basic options for reducing rebalancing workload are set:
841
842
<pre>
843
osd max backfills = 1
844
osd recovery max active = 1
845
osd recovery op priority = 2
846
</pre>
847
848
* Ensure that "osd_op_queue_cut_off":https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_op_queue_cut_off is set to **high**
849
** Requires OSD restart on change
850
851 50 Nico Schottelius
<pre>
852
ceph config set global osd_op_queue_cut_off high
853
</pre>
854
855 51 Nico Schottelius
<pre>
856
be sure to check your osd recovery sleep settings, there are several
857
depending on your underlying drives:
858
859
    "osd_recovery_sleep": "0.000000",
860
    "osd_recovery_sleep_hdd": "0.050000",
861
    "osd_recovery_sleep_hybrid": "0.050000",
862
    "osd_recovery_sleep_ssd": "0.050000",
863
864
Adjusting these will upwards will dramatically reduce IO, and take effect
865
immediately at the cost of slowing rebalance/recovery.
866
</pre>
867
868 52 Nico Schottelius
Reference settings from Frank Schilder:
869
870
<pre>
871
  osd       class:hdd      advanced osd_recovery_sleep                0.050000
872
  osd       class:rbd_data advanced osd_recovery_sleep                0.025000
873
  osd       class:rbd_meta advanced osd_recovery_sleep                0.002500
874
  osd       class:ssd      advanced osd_recovery_sleep                0.002500
875
  osd                      advanced osd_recovery_sleep                0.050000
876
877
  osd       class:hdd      advanced osd_max_backfills                 3
878
  osd       class:rbd_data advanced osd_max_backfills                 6
879
  osd       class:rbd_meta advanced osd_max_backfills                 12
880
  osd       class:ssd      advanced osd_max_backfills                 12
881
  osd                      advanced osd_max_backfills                 3
882
883
  osd       class:hdd      advanced osd_recovery_max_active           8
884
  osd       class:rbd_data advanced osd_recovery_max_active           16
885
  osd       class:rbd_meta advanced osd_recovery_max_active           32
886
  osd       class:ssd      advanced osd_recovery_max_active           32
887
  osd                      advanced osd_recovery_max_active           8
888
</pre>
889
890
(have not yet been tested in our clusters)
891 51 Nico Schottelius
892 42 Nico Schottelius
h2. Ceph theory
893
894
h3. How much data per Server?
895
896
Q: How much data should we add into one server?
897
A: Not more than it can handle.
898
899
How much data can a server handle? For this let's have a look at 2 scenarios:
900
901
* How long does it take to compensate the loss of the server?
902
903
* Assuming a server has X TiB storage in Y disks attached and a network speed of Z GiB/s.
904
* And our estimated rebuild goal is to compensate the loss of a server within U hours.
905
906
907
h4. Approach 1
908
909
Then
910
911
Let's take an example: 
912
913
* A server with @10 disks * 10 TiB@ = 100 TiB = 100 000 GiB data. It is network connected with 10 Gbit = 1.25 GiB/s.
914
* 100000/1.25 = 80000s = 22.22h
915
916
However, our logic assumes that we actually rebuild from the failed server, which... is failed. 
917
918
h4. Approach 2: calculating with left servers
919
920
However we can apply our logic also to distribute
921
the rebuild over several servers that now pull in data from each other for rebuilding.
922
We need to *read* the data (100TiB) from other servers and distribute it to new OSDs. Assuming each server has a 10 Gbit/s
923
network connection.
924
925
Now the servers might need to *read* (get data from other osds) and *write) (send data to other osds). Luckily, networking is 10 Gbit/s duplex - i.e. in both directions.
926
927
However how fast can we actually read data from the disks? 
928
929
* SSDs are in the range of hundreds of MB/s (best case, not necessarily true for random reads) - let's assume
930
* HDDs are in the range of tenths of MB/s (depending on the work load, but 30-40 MB/s random reads seems realistic)
931
932
 
933
934
935
Further assumptions:
936
937
* Assuming further that each disk should be dedicated at least one CPU core.
938 43 Nico Schottelius
939
h3. Disk/SSD speeds
940
941 44 Nico Schottelius
* Tuning for #8473 showed that a 10TB HDD can write up to 180-200MB/s when backfilling (at about 70% cpu usage and 20% disk usage), max backfills = 8
942 43 Nico Schottelius
* Debugging SSD usage in #8461 showed SSDs can read about 470-520MB/s sequential
943
* Debugging SSD usage in #8461 showed SSDs can write about 170-280MB/s sequential
944
* Debugging SSD usage in #8461 showed SSDs can write about 4MB/s RANDOM (need to verify this even though 3 runs showed these numbers)
945 47 Dominique Roux
946 48 Dominique Roux
h3. Ceph theoretical fundament
947 47 Dominique Roux
948
If you are very much into the theoretical fundament of Ceph check out their "paper":https://www3.nd.edu/~dthain/courses/cse40771/spring2007/papers/ceph.pdf