h1. The ungleich ceph handbook

{{toc}}

h2. Status

This document is **IN PRODUCTION**.

h2. Introduction

This article describes the ungleich storage architecture that is based on ceph. It describes our architecture as well as maintenance commands. Required for 

h2. Processes

h3. Usage monitoring

* Usage should be kept somewhere in the 70-75% area
* If usage reaches 72.5%, we start reducing usage by adding disks
* We stop when usage is below 70%

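To check the current usage against these thresholds, the standard ceph commands are sufficient (a minimal sketch):

<pre>
# Cluster-wide raw usage in percent
ceph df

# Per-OSD usage, to spot OSDs that are fuller than the average
ceph osd df
</pre>
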
h3. Phasing in new disks

* 24h performance test prior to using it (see the sketch below)

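One possible way to run such a 24h test (a sketch only, assuming @fio@ is installed; the exact test procedure is not fixed here). Warning: the write test destroys the data on the device, so only use it on a disk that is not yet part of the cluster, and replace /dev/sdX accordingly:

<pre>
# 24h sequential write test on the new, still empty disk (DESTROYS its contents)
fio --name=burnin-write --filename=/dev/sdX --rw=write --bs=4M \
    --ioengine=libaio --direct=1 --iodepth=16 \
    --time_based --runtime=86400

# shorter random read check afterwards
fio --name=burnin-randread --filename=/dev/sdX --rw=randread --bs=4k \
    --ioengine=libaio --direct=1 --iodepth=32 \
    --time_based --runtime=3600
</pre>
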
h3. Phasing in new servers

* 24h performance test with 1 ssd or 1 hdd (whatever is applicable)

h2. Communication guide

Usually, when a disk fails, no customer communication is necessary, as the loss is automatically compensated/rebalanced by ceph. However, if multiple disks fail at the same time, I/O speed might be reduced and thus the customer experience impacted.

For this reason, communicate whenever I/O recovery settings are temporarily tuned.

h2. Analysing

h3. ceph osd df tree

Using @ceph osd df tree@ you can see not only the disk usage per OSD, but also the number of PGs on an OSD. This is especially useful to see how well the OSDs are balanced.

h3. Find out the device of an OSD

Use @mount | grep /var/lib/ceph/osd/ceph-OSDID@ on the server on which the OSD is located:

<pre>
[16:01:23] server2.place6:~# mount | grep /var/lib/ceph/osd/ceph-31
/dev/sdk1 on /var/lib/ceph/osd/ceph-31 type xfs (rw,relatime,attr2,inode64,noquota)
</pre>

h3. Show config

<pre>
ceph config dump
</pre>

h3. Show backfill and recovery config

<pre>
ceph-conf --show-config | egrep "osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills"
</pre>

* See also: https://www.suse.com/support/kb/doc/?id=000019693

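On newer releases the values a running daemon actually uses can also be queried via the mgr (a sketch; osd.0 is just an example daemon id):

<pre>
ceph config show osd.0 | egrep "osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills"
</pre>
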
h3. Checking and clearing crash reports

If the cluster is reporting HEALTH_WARN and a recent crash such as:

<pre>
[rook@rook-ceph-tools-f569797b4-z4542 /]$ ceph -s
  cluster:
    id:     ...
    health: HEALTH_WARN
            1 daemons have recently crashed
</pre>

one can analyse it using:

* List the crashes: @ceph crash ls@
* Check the details: @ceph crash info <id>@

To archive the error:

* To archive a specific report: @ceph crash archive <id>@
* To archive all: @ceph crash archive-all@

After archiving, the cluster health should return to HEALTH_OK:

<pre>
[rook@rook-ceph-tools-f569797b4-z4542 /]$  ceph crash ls
ID                                                                ENTITY  NEW  
2022-09-19T04:33:19.378500Z_b2e26755-0712-41de-bf2b-b370dbe94ebc  mon.c    *   
[rook@rook-ceph-tools-f569797b4-z4542 /]$ ceph crash archive 2022-09-19T04:33:19.378500Z_b2e26755-0712-41de-bf2b-b370dbe94ebc 
[rook@rook-ceph-tools-f569797b4-z4542 /]$ ceph -s
  cluster:
    id:     ..
    health: HEALTH_OK
</pre>

h2. Adding a new disk/ssd to the ceph cluster

Write the order / date we bought the disk on it with a permanent marker.

h3. Checking the shadow trees

To be able to spot differences / weights of hosts, it can be very helpful to look at the crush shadow tree using @ceph osd crush tree --show-shadow@:

<pre>
-16   hdd-big 653.03418           root default~hdd-big        
-34   hdd-big         0         0     host server14~hdd-big   
-38   hdd-big         0         0     host server15~hdd-big   
-42   hdd-big  81.86153  78.28352     host server17~hdd-big   
 36   hdd-big   9.09560   9.09560         osd.36              
 59   hdd-big   9.09499   9.09499         osd.59              
 60   hdd-big   9.09499   9.09499         osd.60              
 68   hdd-big   9.09599   8.93999         osd.68              
 69   hdd-big   9.09599   7.65999         osd.69              
 70   hdd-big   9.09599   8.35899         osd.70              
 71   hdd-big   9.09599   8.56000         osd.71              
 72   hdd-big   9.09599   8.93700         osd.72              
 73   hdd-big   9.09599   8.54199         osd.73              
-46   hdd-big  90.94986  90.94986     host server18~hdd-big   
...
</pre>

Here we can see that the weight of server17 for the class hdd-big is about 81, while that of server18 is about 90.
SSDs and other classes have their own shadow trees, too.

h3. For Dell servers

First find the disk and then add it to the operating system:

<pre>
megacli -PDList -aALL  | grep -B16 -i unconfigur

# Sample output:
[19:46:50] server7.place6:~#  megacli -PDList -aALL  | grep -B16 -i unconfigur
Enclosure Device ID: N/A
Slot Number: 0
Enclosure position: N/A
Device Id: 0
WWN: 0000000000000000
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 894.252 GB [0x6fc81ab0 Sectors]
Non Coerced Size: 893.752 GB [0x6fb81ab0 Sectors]
Coerced Size: 893.75 GB [0x6fb80000 Sectors]
Sector Size:  0
Firmware state: Unconfigured(good), Spun Up
</pre>

Then add the disk to the OS:

<pre>
megacli -CfgLdAdd -r0 [Enclosure Device ID:slot] -aX   # X: host is 0, md-array is 1

# Sample call, if enclosure and slot are KNOWN (aka not N/A)
megacli -CfgLdAdd -r0 [32:0] -a0

# Sample call, if enclosure is N/A
megacli -CfgLdAdd -r0 [:0] -a0
</pre>

Then check the disk:

<pre>
fdisk -l
[11:26:23] server2.place6:~# fdisk -l
......
Disk /dev/sdh: 7.3 TiB, 8000987201536 bytes, 15626928128 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
[11:27:24] server2.place6:~#
</pre>

Then create a GPT partition table:

<pre>
/opt/ungleich-tools/disk-create-fresh-gpt /dev/XXX
[11:31:10] server2.place6:~# /opt/ungleich-tools/disk-create-fresh-gpt /dev/sdh
......
Created a new DOS disklabel with disk identifier 0x9c4a0355.
Command (m for help): Created a new GPT disklabel (GUID: 374E31AD-7B96-4837-B5ED-7B22C452899E).
......
</pre>

Then create the OSD for ssd/hdd-big:

<pre>
/opt/ungleich-tools/ceph-osd-create-start /dev/XXX XXX   # second argument is the class (ssd or hdd-big)
[11:33:58] server2.place6:~# /opt/ungleich-tools/ceph-osd-create-start /dev/sdh hdd-big
+ set -e
+ [ 2 -lt 2 ]
......
+ /opt/ungleich-tools/monit-ceph-create-start osd.14
osd.14
[ ok ] Restarting daemon monitor: monit.
[11:36:14] server2.place6:~#
</pre>

Then check the rebalancing (if you want to add another disk, do so only after the rebalancing has finished):

<pre>
ceph -s
[12:37:57] server2.place6:~# ceph -s
  cluster:
    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
    health: HEALTH_WARN
            2248811/49628409 objects misplaced (4.531%)
......
  io:
    client:   170KiB/s rd, 35.0MiB/s wr, 463op/s rd, 728op/s wr
    recovery: 27.1MiB/s, 6objects/s
[12:49:41] server2.place6:~#
</pre>

h3. For HP servers (hpacucli)

* Ensure the module "sg" has been loaded

Use the following to verify that the controller is detected:

<pre>
# hpacucli controller all show

Smart Array P420i in Slot 0 (Embedded)    (sn: 001438033ECEF60)
</pre>

h4. Show all disks from controller on slot 0

<pre>
hpacucli controller slot=0 physicaldrive all show
</pre>

Example:

<pre>
# hpacucli controller slot=0 physicaldrive all show

Smart Array P420i in Slot 0 (Embedded)

   array A

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 12000.1 GB, OK)

   array B

      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 12000.1 GB, OK)

   unassigned

      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA, 10000.8 GB, OK)

root@ungleich-hardware-server97:/# 
</pre>

In this example the last disk has not been assigned yet.

h4. Create RAID 0 for ceph

For ceph we want a RAID 0 over 1 disk to expose the disk to the OS.

This can be done using the following command:

<pre>
hpacucli controller slot=0 create type=ld drives=$DRIVEID raid=0
</pre>

For example:

<pre>
hpacucli controller slot=0 create type=ld drives=1I:1:3 raid=0
</pre>

h4. Show the controller configuration

<pre>
hpacucli controller slot=0 show config
</pre>

For example:

<pre>
# hpacucli controller slot=0 show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 001438033ECEF60)

   array A (SATA, Unused Space: 0  MB)

      logicaldrive 1 (10.9 TB, RAID 0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 12000.1 GB, OK)

   array B (SATA, Unused Space: 0  MB)

      logicaldrive 2 (10.9 TB, RAID 0, OK)

      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 12000.1 GB, OK)

   array C (SATA, Unused Space: 0  MB)

      logicaldrive 3 (9.1 TB, RAID 0, OK)

      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA, 10000.8 GB, OK)

   Expander 380 (WWID: 50014380324EBFE0, Port: 1I, Box: 1)

   Enclosure SEP (Vendor ID HP, Model Gen8 ServBP 12+2) 378 (WWID: 50014380324EBFF9, Port: 1I, Box: 1)

   SEP (Vendor ID PMCSIERA, Model SRCv8x6G) 379 (WWID: 5001438033ECEF6F)
</pre>

h2. Moving a disk/ssd to another server

(needs to be described better)

Generally speaking:

* //needs to be tested: disable recovery so data won't start moving while you have the OSD down (see the sketch after this list)
* /opt/ungleich-tools/ceph-osd-stop-disable does the following:
** Stop the osd, remove monit on the server you want to take it out
** umount the disk
* Take the disk out
* Discard the preserved cache on the server you took it out of
** using megacli: @megacli -DiscardPreservedCache -Lall -aAll@
* Insert it into the new server
* Clear the foreign configuration
** using megacli: @megacli -CfgForeign -Clear -aAll@
* The disk will now appear in the OS, ceph/udev will automatically start the OSD (!)
** No creating of the osd required!
* Verify that the disk exists and that the osd is started
** using *ps aux*
** using *ceph osd tree*
* */opt/ungleich-tools/monit-ceph-create-start osd.XX* # where osd.XX is the osd + number
** Creates the monit configuration file so that monit watches the OSD
** Reload monit
* Verify monit using *monit status*

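One way to temporarily disable recovery/rebalancing for this procedure could be the following (a sketch only; as noted above this is untested here, the flags themselves are standard ceph cluster flags):

<pre>
# before taking the OSD down
ceph osd set noout
ceph osd set norebalance
ceph osd set norecover

# ... move the disk, wait until the OSD is up again on the new server ...

ceph osd unset norecover
ceph osd unset norebalance
ceph osd unset noout
</pre>
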
h2. OSD related processes

h3. Removing a disk/ssd

To permanently remove a failed disk from a cluster, use ***ceph-osd-stop-remove-permanently*** from the ungleich-tools repo. Warning: if the disk is still active, the OSD will be shut down AND removed from the cluster -> all data of that disk will need to be rebalanced.

h3. Handling DOWN osds with filesystem errors

If an email arrives with the subject "monit alert -- Does not exist osd.XX-whoami", the filesystem of an OSD cannot be read anymore. It is very likely that the disk / ssd is broken. Steps that need to be done:

* Login to any ceph monitor (cephX.placeY.ungleich.ch)
* Check **ceph -s**, find the host using **ceph osd tree**
* Login to the affected host
* Run the following commands:
** ls /var/lib/ceph/osd/ceph-XX
** dmesg

<pre>
# Example: after checking the dmesg output like the following, you can proceed with the next step
[204696.406756] XFS (sdl1): metadata I/O error: block 0x19100 ("xlog_iodone") error 5 numblks 64
[204696.408094] XFS (sdl1): xfs_do_force_shutdown(0x2) called from line 1233 of file /build/linux-BsFdsw/linux-4.9.65/fs/xfs/xfs_log.c.  Return address = 0xffffffffc08eb612
[204696.410702] XFS (sdl1): Log I/O Error Detected.  Shutting down filesystem
[204696.411977] XFS (sdl1): Please umount the filesystem and rectify the problem(
</pre>

* Create a new ticket in the datacenter light project
** Subject: "Replace broken OSD.XX on serverX.placeY.ungleich.ch"
** Add (partial) output of the above commands
** Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently XX, where XX is the osd id, to remove the disk from the cluster
** Remove the physical disk from the host, check whether there is warranty on it and if yes:
*** Write a short letter to the vendor, including the technical details from above
*** Record when you sent it in
*** Put the ticket into status waiting
** If there is no warranty, dispose of it

h3. [[Create new pool and place new osd]]

h3. Configuring auto repair on pgs

<pre>
ceph config set osd osd_scrub_auto_repair true
</pre>

Verify using:

<pre>
ceph config dump
</pre>

h3. Change the device class of an OSD

<pre>
OSD=XX
NEWCLASS=ZZ

# Remove the old device class and set the new one
ceph osd crush rm-device-class osd.$OSD
ceph osd crush set-device-class $NEWCLASS osd.$OSD
</pre>

* Found on https://arpnetworks.com/blog/2019/06/28/how-to-update-the-device-class-on-a-ceph-osd.html

h2. Change ceph speed for i/o recovery

By default we want to keep I/O recovery traffic low to not impact customer experience. However, when multiple disks fail at the same time, we might want to prioritise recovery for data safety over performance.

The default configuration on our servers contains:

<pre>
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

The important settings are *osd max backfills* and *osd recovery max active*; the priority is always kept low so that regular I/O has priority.

To adjust the number of backfills *per osd* and to change the *number of threads* used for recovery, we can use the following on any node with the admin keyring:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills Y'
ceph tell osd.* injectargs '--osd-recovery-max-active X'
</pre>

where Y and X are the values that we want to use. Experience shows that Y=5 and X=5 doubles to triples the recovery performance, whereas X=10 and Y=10 increases recovery performance 5 times.

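For example, to switch to the Y=5 / X=5 level mentioned above and back to our defaults afterwards (same commands, just with concrete values):

<pre>
# speed up recovery
ceph tell osd.* injectargs '--osd-max-backfills 5'
ceph tell osd.* injectargs '--osd-recovery-max-active 5'

# back to the defaults from our ceph.conf
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
</pre>
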
h2. Debug scrub errors / inconsistent pg message

From time to time disks don't save what they are told to save. Ceph scrubbing detects these errors and switches to HEALTH_ERR. Use *ceph health detail* to find out which placement groups (*pgs*) are affected. Usually a ***ceph pg repair <number>*** fixes the problem.

If this does not help, consult https://ceph.com/geen-categorie/ceph-manually-repair-object/.

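A typical repair session might look like this (a sketch only; the pg id 2.1d is made up):

<pre>
# find the affected pg(s), e.g. "pg 2.1d is active+clean+inconsistent, acting [...]"
ceph health detail

# trigger the repair for that pg
ceph pg repair 2.1d

# watch until the cluster returns to HEALTH_OK
ceph -w
</pre>
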
h2. Move servers into the osd tree

New servers have their buckets placed outside the **default root** and thus need to be moved inside.
Output might look as follows:

<pre>
[11:19:27] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF 
 -3           0.87270 host server5                             
 41     ssd   0.87270     osd.41           up  1.00000 1.00000 
 -1         251.85580 root default                             
 -7          81.56271     host server2                         
  0 hdd-big   9.09511         osd.0        up  1.00000 1.00000 
  5 hdd-big   9.09511         osd.5        up  1.00000 1.00000 
...
</pre>

Use **ceph osd crush move serverX root=default** (where serverX is the new server),
which will move the bucket into the right place:

<pre>
[11:21:17] server5.place6:~# ceph osd crush move server5 root=default
moved item id -3 name 'server5' to location {root=default} in crush map
[11:32:12] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF 
 -1         252.72850 root default                             
...
 -3           0.87270     host server5                         
 41     ssd   0.87270         osd.41       up  1.00000 1.00000 
</pre>

h2. How to fix existing osds with wrong partition layout

In the first version of DCL we used a filestore/3-partition based layout.
In the second version of DCL, including OSD autodetection, we use a bluestore/2-partition based layout.

To convert, we delete the old OSD, clean the partitions and create a new osd:

h3. Inactive OSD

If the OSD is *not active*, we can do the following:

* Find the OSD number: mount the partition and find the whoami file

<pre>
root@server2:/opt/ungleich-tools# mount /dev/sda2 /mnt/
root@server2:/opt/ungleich-tools# cat /mnt/whoami 
0
root@server2:/opt/ungleich-tools# umount  /mnt/
</pre>

* Verify in the *ceph osd tree* output that the OSD is on that server
* Delete the OSD
** ceph osd crush remove $osd_name
** ceph osd rm $osd_name

Then continue below as described in "Recreating the OSD".

h3. Remove Active OSD

* Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently OSDID to stop and remove the OSD
* Then continue below as described in "Recreating the OSD".

h3. Recreating the OSD

* Create an empty partition table
** fdisk /dev/sdX
** g
** w
* Create a new OSD
** /opt/ungleich-tools/ceph-osd-create-start /dev/sdX CLASS # use hdd, ssd, ... for the CLASS

h2. How to fix unfound pg

Refer to https://redmine.ungleich.ch/issues/6388

* Check the health state
** ceph health detail
* Check which server has that osd
** ceph osd tree
* Check which VMs are running on that server
** virsh list
* Check the pg map
** ceph osd map [osd pool] [VMID]
* Revert the pg
** ceph pg [PGID] mark_unfound_lost revert

h2. Phasing out OSDs

* Either directly via /opt/ungleich-tools/ceph/ceph-osd-stop-remove-permanently
* Or first drain it using @ceph osd crush reweight osd.XX 0@ (see the sketch below)
** Wait until the rebalance is done
** Then remove it

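A sketch of the draining variant (XX stands for the osd id, as above):

<pre>
# start draining the OSD
ceph osd crush reweight osd.XX 0

# wait until the rebalance is done, i.e. no more misplaced objects reported
ceph -s

# then stop and remove it permanently
/opt/ungleich-tools/ceph/ceph-osd-stop-remove-permanently XX
</pre>
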
h2. Enabling per image RBD statistics for prometheus

<pre>
[20:26:57] red2.place5:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "one,hdd"
[20:27:57] black2.place6:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "hdd,ssd"
</pre>

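To verify the setting afterwards (a sketch; assuming @ceph config get@ returns mgr module options the same way @ceph config set@ stores them):

<pre>
ceph config get mgr mgr/prometheus/rbd_stats_pools
</pre>
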
h2. S3 Object Storage

This section is **UNDER CONSTRUCTION**.

h3. Introduction

* See the "Red Hat manual":https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/object_gateway_guide_for_red_hat_enterprise_linux/overview-rgw
* The "ceph docs about object storage":https://docs.ceph.com/docs/mimic/radosgw/

h3. Architecture

* S3 requests are handled by a publicly accessible gateway, which also has access to the ceph cluster.
* s3 buckets are usually

h3. Authentication / Users

* Ceph *can* make use of LDAP as a backend
** However it uses the clear text username+password as a token
** See https://docs.ceph.com/docs/mimic/radosgw/ldap-auth/
* We do not want users to store their regular account credentials on machines
* For this reason we use independent users / tokens, but with the same username as in LDAP

Creating a user:

<pre>
radosgw-admin user create --uid=USERNAME --display-name="Name of user"
</pre>

Listing users:

<pre>
radosgw-admin user list
</pre>

Deleting users and their storage:

<pre>
radosgw-admin user rm --uid=USERNAME --purge-data
</pre>

h3. Setting up S3 object storage on Ceph

* Setup a gateway node with Alpine Linux
** Change to edge
** Enable testing
* Update the firewall to allow access from this node to the ceph monitors
* Install the rados gateway:

<pre>
apk add ceph-radosgw
</pre>

* Set up the wildcard DNS certificate (see the next section)

h3. Wildcard DNS certificate from letsencrypt

Acquiring and renewing this certificate is currently a manual process, as it requires changing DNS settings.

* run certbot
* update DNS with the first token
* update DNS with the second token

Sample session:

<pre>
s3:/etc/ceph# certbot certonly --manual --preferred-challenges=dns --email sre@ungleich.ch --server https://acme-v02.api.letsencrypt.org/directory --agree-tos 
-d *.s3.ungleich.ch -d s3.ungleich.ch
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Cert is due for renewal, auto-renewing...
Renewing an existing certificate
Performing the following challenges:
dns-01 challenge for s3.ungleich.ch
dns-01 challenge for s3.ungleich.ch

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
NOTE: The IP of this machine will be publicly logged as having requested this
certificate. If you're running certbot in manual mode on a machine that is not
your server, please ensure you're okay with that.

Are you OK with your IP being logged?
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(Y)es/(N)o: y

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.s3.ungleich.ch with the following value:

KxGLZNiVjFwz1ifNheoR_KQoPVpkvRUV1oT2pOvJlU0

Before continuing, verify the record is deployed.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.s3.ungleich.ch with the following value:

bkrhtxWZUipCAL5cBfvrjDuftqsZdQ2JjisiKmXBbaI

Before continuing, verify the record is deployed.
(This must be set up in addition to the previous challenges; do not remove,
replace, or undo the previous challenge tasks yet. Note that you might be
asked to create multiple distinct TXT records with the same name. This is
permitted by DNS standards.)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue
Waiting for verification...
Cleaning up challenges

IMPORTANT NOTES:
 - Congratulations! Your certificate and chain have been saved at:
   /etc/letsencrypt/live/s3.ungleich.ch/fullchain.pem
   Your key file has been saved at:
   /etc/letsencrypt/live/s3.ungleich.ch/privkey.pem
   Your cert will expire on 2020-12-09. To obtain a new or tweaked
   version of this certificate in the future, simply run certbot
   again. To non-interactively renew *all* of your certificates, run
   "certbot renew"
 - If you like Certbot, please consider supporting our work by:

   Donating to ISRG / Let's Encrypt:   https://letsencrypt.org/donate
   Donating to EFF:                    https://eff.org/donate-le
</pre>

h2. Debugging ceph

<pre>
    ceph status
    ceph osd status
    ceph osd df
    ceph osd utilization
    ceph osd pool stats
    ceph osd tree
    ceph pg stat
</pre>

h3. How to list the version overview

This lists the versions of osds, mgrs and mons:

<pre>
ceph versions
</pre>

Listing the "features" of clients, osds, mgrs and mons can be done using @ceph features@:

<pre>
[15:32:20] red1.place5:~# ceph features
{
    "mon": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 5
        }
    ],
    "osd": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 44
        }
    ],
    "client": [
        {
            "features": "0x3ffddff8eea4fffb",
            "release": "luminous",
            "num": 4
        },
        {
            "features": "0x3ffddff8ffacffff",
            "release": "luminous",
            "num": 18
        },
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 31
        }
    ],
    "mgr": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 4
        }
    ]
}
</pre>

h3. How to list the version of every OSD and every monitor

To list the version of each ceph OSD:

<pre>
ceph tell osd.* version
</pre>

To list the version of each ceph mon:

<pre>
ceph tell mon.* version
</pre>

The mgrs do not seem to support this command as of 14.2.21.

h2. Performance Tuning

* Ensure that the basic options for reducing rebalancing workload are set:

<pre>
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

* Ensure that "osd_op_queue_cut_off":https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_op_queue_cut_off is set to **high**
** Requires OSD restart on change

<pre>
ceph config set global osd_op_queue_cut_off high
</pre>

<pre>
be sure to check your osd recovery sleep settings, there are several
depending on your underlying drives:

    "osd_recovery_sleep": "0.000000",
    "osd_recovery_sleep_hdd": "0.050000",
    "osd_recovery_sleep_hybrid": "0.050000",
    "osd_recovery_sleep_ssd": "0.050000",

Adjusting these upwards will dramatically reduce IO, and take effect
immediately at the cost of slowing rebalance/recovery.
</pre>

Reference settings from Frank Schilder:

<pre>
  osd       class:hdd      advanced osd_recovery_sleep                0.050000
  osd       class:rbd_data advanced osd_recovery_sleep                0.025000
  osd       class:rbd_meta advanced osd_recovery_sleep                0.002500
  osd       class:ssd      advanced osd_recovery_sleep                0.002500
  osd                      advanced osd_recovery_sleep                0.050000

  osd       class:hdd      advanced osd_max_backfills                 3
  osd       class:rbd_data advanced osd_max_backfills                 6
  osd       class:rbd_meta advanced osd_max_backfills                 12
  osd       class:ssd      advanced osd_max_backfills                 12
  osd                      advanced osd_max_backfills                 3

  osd       class:hdd      advanced osd_recovery_max_active           8
  osd       class:rbd_data advanced osd_recovery_max_active           16
  osd       class:rbd_meta advanced osd_recovery_max_active           32
  osd       class:ssd      advanced osd_recovery_max_active           32
  osd                      advanced osd_recovery_max_active           8
</pre>

(These have not yet been tested in our clusters.)

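Such class-specific values can be applied with config masks (a sketch only; the values are Frank Schilder's reference values from above, not something we have validated in our clusters):

<pre>
ceph config set osd/class:hdd osd_recovery_sleep 0.050000
ceph config set osd/class:ssd osd_recovery_sleep 0.002500
ceph config set osd/class:hdd osd_max_backfills  3
ceph config set osd/class:ssd osd_max_backfills  12
</pre>
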
h2. Ceph theory

h3. How much data per Server?

Q: How much data should we add into one server?
A: Not more than it can handle.

How much data can a server handle? For this let's have a look at 2 scenarios:

* How long does it take to compensate the loss of the server?

* Assume a server has X TiB of storage in Y attached disks and a network speed of Z GiB/s.
* And our estimated rebuild goal is to compensate the loss of a server within U hours.

h4. Approach 1

Then the rebuild time is roughly the amount of data divided by the network speed, i.e. @X / Z@ seconds (with X in GiB and Z in GiB/s).

Let's take an example:

* A server with @10 disks * 10 TiB@ = 100 TiB = 100 000 GiB data. It is network connected with 10 Gbit = 1.25 GiB/s.
* 100000/1.25 = 80000s = 22.22h

However, our logic assumes that we actually rebuild from the failed server, which... is failed.

h4. Approach 2: calculating with the remaining servers

However, we can also apply our logic to distribute the rebuild over several servers that now pull in data from each other for rebuilding.
We need to *read* the data (100 TiB) from other servers and distribute it to new OSDs. Assume each server has a 10 Gbit/s network connection.

Now the servers might need to *read* (get data from other osds) and *write* (send data to other osds). Luckily, networking is 10 Gbit/s duplex - i.e. in both directions.

However, how fast can we actually read data from the disks?

* SSDs are in the range of hundreds of MB/s (best case, not necessarily true for random reads) - let's assume
* HDDs are in the range of tens of MB/s (depending on the work load, but 30-40 MB/s random reads seems realistic)

Further assumptions (a back-of-envelope sketch follows below):

* Assuming further that each disk should be dedicated at least one CPU core.

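A back-of-envelope sketch combining the assumptions above (10 servers with 100 TiB each, one of them fails, HDDs reading at ~40 MB/s; illustrative numbers only):

<pre>
per-server read speed  = 10 disks * 40 MB/s        ~= 0.4 GiB/s  (disk-bound, below the 1.25 GiB/s network limit)
aggregate of 9 servers = 9 * 0.4 GiB/s             ~= 3.6 GiB/s
rebuild time           = 100 000 GiB / 3.6 GiB/s   ~= 28 000 s   ~= 8 h
</pre>
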
h3. Disk/SSD speeds

* Tuning for #8473 showed that a 10TB HDD can write up to 180-200MB/s when backfilling (at about 70% cpu usage and 20% disk usage), max backfills = 8
* Debugging SSD usage in #8461 showed SSDs can read about 470-520MB/s sequential
* Debugging SSD usage in #8461 showed SSDs can write about 170-280MB/s sequential
* Debugging SSD usage in #8461 showed SSDs can write about 4MB/s RANDOM (need to verify this even though 3 runs showed these numbers)

h3. Ceph theoretical fundament

If you are very much into the theoretical fundament of Ceph, check out their "paper":https://www3.nd.edu/~dthain/courses/cse40771/spring2007/papers/ceph.pdf