h1. The ungleich ceph handbook

{{toc}}

h2. Status

This document is **IN PRODUCTION**.

h2. Introduction

This article describes the ungleich storage architecture, which is based on ceph, as well as the maintenance commands required to operate it.

h2. Processes

h3. Usage monitoring

* Usage should be kept in the 70-75% range (see the commands below for checking it)
* If usage reaches 72.5%, we start reducing usage by adding disks
* We stop adding disks when usage is below 70%

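The current usage can be checked on any node with the admin keyring, for example with:

<pre>
ceph df
ceph osd df tree
</pre>
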
h3. Phasing in new disks

* 24h performance test prior to using it (see the sketch below)

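The handbook does not prescribe a tool for this test; a possible sketch using fio is shown below (assumptions: fio is installed, /dev/sdX is the new and still unused disk, and the test destroys any data on it):

<pre>
# 24h mixed random read/write load on the new disk (destructive for /dev/sdX!)
fio --name=burnin --filename=/dev/sdX --direct=1 --ioengine=libaio \
    --rw=randrw --bs=4k --iodepth=32 --time_based --runtime=86400
</pre>
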
h3. Phasing in new servers

* 24h performance test with 1 ssd or 1 hdd (whatever is applicable)

h2. Communication guide

Usually no customer communication is necessary when a disk fails, as the failure is automatically compensated/rebalanced by ceph. However, if multiple disks fail at the same time, I/O speed might be reduced and thus customer experience impacted.

For this reason, communicate whenever I/O recovery settings are temporarily tuned.

h2. Analysing

h3. ceph osd df tree

Using @ceph osd df tree@ you can see not only the disk usage per OSD, but also the number of PGs on an OSD. This is especially useful to see how the OSDs are balanced.

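For example, on any node with the admin keyring:

<pre>
ceph osd df tree
</pre>
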
h3. Find out the device of an OSD

Use @mount | grep /var/lib/ceph/osd/ceph-OSDID@ on the server on which the OSD is located:

<pre>
[16:01:23] server2.place6:~# mount | grep /var/lib/ceph/osd/ceph-31
/dev/sdk1 on /var/lib/ceph/osd/ceph-31 type xfs (rw,relatime,attr2,inode64,noquota)
</pre>

h3. Show config

<pre>
ceph config dump
</pre>

h2. Adding a new disk/ssd to the ceph cluster

Write the purchase order / date on the disk with a permanent marker.

h3. Checking the shadow trees

To be able to spot differences in the weights of hosts, it can be very helpful to look at the crush shadow tree
using @ceph osd crush tree --show-shadow@:

<pre>
-16   hdd-big 653.03418           root default~hdd-big        
-34   hdd-big         0         0     host server14~hdd-big   
-38   hdd-big         0         0     host server15~hdd-big   
-42   hdd-big  81.86153  78.28352     host server17~hdd-big   
 36   hdd-big   9.09560   9.09560         osd.36              
 59   hdd-big   9.09499   9.09499         osd.59              
 60   hdd-big   9.09499   9.09499         osd.60              
 68   hdd-big   9.09599   8.93999         osd.68              
 69   hdd-big   9.09599   7.65999         osd.69              
 70   hdd-big   9.09599   8.35899         osd.70              
 71   hdd-big   9.09599   8.56000         osd.71              
 72   hdd-big   9.09599   8.93700         osd.72              
 73   hdd-big   9.09599   8.54199         osd.73              
-46   hdd-big  90.94986  90.94986     host server18~hdd-big   
...
</pre>

Here we can see that the weight of server17 for the class hdd-big is about 81, the one of server18 about 90.
SSDs and other classes have their own shadow trees, too.

h3. For Dell servers

First find the disk and then add it to the operating system

<pre>
megacli -PDList -aALL  | grep -B16 -i unconfigur

# Sample output:
[19:46:50] server7.place6:~#  megacli -PDList -aALL  | grep -B16 -i unconfigur
Enclosure Device ID: N/A
Slot Number: 0
Enclosure position: N/A
Device Id: 0
WWN: 0000000000000000
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 894.252 GB [0x6fc81ab0 Sectors]
Non Coerced Size: 893.752 GB [0x6fb81ab0 Sectors]
Coerced Size: 893.75 GB [0x6fb80000 Sectors]
Sector Size:  0
Firmware state: Unconfigured(good), Spun Up
</pre>

Then add the disk to the OS:

<pre>
megacli -CfgLdAdd -r0 [Enclosure Device ID:slot] -aX (X : host is 0. md-array is 1)

# Sample call, if enclosure and slot are KNOWN (aka not N/A)
megacli -CfgLdAdd -r0 [32:0] -a0

# Sample call, if enclosure is N/A
megacli -CfgLdAdd -r0 [:0] -a0
</pre>

Then check the disk:

<pre>
fdisk -l
[11:26:23] server2.place6:~# fdisk -l
......
Disk /dev/sdh: 7.3 TiB, 8000987201536 bytes, 15626928128 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
[11:27:24] server2.place6:~#
</pre>

Then create a fresh GPT partition table:

<pre>
/opt/ungleich-tools/disk-create-fresh-gpt /dev/XXX
[11:31:10] server2.place6:~# /opt/ungleich-tools/disk-create-fresh-gpt /dev/sdh
......
Created a new DOS disklabel with disk identifier 0x9c4a0355.
Command (m for help): Created a new GPT disklabel (GUID: 374E31AD-7B96-4837-B5ED-7B22C452899E).
......
</pre>

Then create the OSD for the ssd or hdd-big class:

<pre>
/opt/ungleich-tools/ceph-osd-create-start /dev/XXX XXX (ssd or hdd-big)
[11:33:58] server2.place6:~# /opt/ungleich-tools/ceph-osd-create-start /dev/sdh hdd-big
+ set -e
+ [ 2 -lt 2 ]
......
+ /opt/ungleich-tools/monit-ceph-create-start osd.14
osd.14
[ ok ] Restarting daemon monitor: monit.
[11:36:14] server2.place6:~#
</pre>

Then check the rebalancing (if you want to add another disk, you should do so after the rebalancing has finished):

<pre>
ceph -s
[12:37:57] server2.place6:~# ceph -s
  cluster:
    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
    health: HEALTH_WARN
            2248811/49628409 objects misplaced (4.531%)
......
  io:
    client:   170KiB/s rd, 35.0MiB/s wr, 463op/s rd, 728op/s wr
    recovery: 27.1MiB/s, 6objects/s
[12:49:41] server2.place6:~#
</pre>

h2. Moving a disk/ssd to another server

(needs to be described better)

Generally speaking:

* //needs to be tested:// disable recovery so that data won't start to move while you have the osd down (see the sketch after this list)
* /opt/ungleich-tools/ceph-osd-stop-disable does the following:
** Stops the osd and removes it from monit on the server you want to take it out of
** Unmounts the disk
* Take the disk out
* Discard the preserved cache on the server you took it out of
** using megacli: @megacli -DiscardPreservedCache -Lall -aAll@
* Insert it into the new server
* Clear the foreign configuration
** using megacli: @megacli -CfgForeign -Clear -aAll@
* The disk will now appear in the OS and ceph/udev will automatically start the OSD (!)
** No creation of the osd is required!
* Verify that the disk exists and that the osd is started
** using *ps aux*
** using *ceph osd tree*
* */opt/ungleich-tools/monit-ceph-create-start osd.XX* # where osd.XX is the osd + number
** Creates the monit configuration file so that monit watches the OSD
** Reload monit
* Verify monit using *monit status*

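A possible way to temporarily prevent data movement while the OSD is down (untested in this workflow, as noted in the first item above) is to set the corresponding cluster flags before pulling the disk and to unset them afterwards:

<pre>
# before taking the osd down
ceph osd set noout
ceph osd set norebalance

# ... move the disk to the new server ...

# after the osd is up again on the new server
ceph osd unset norebalance
ceph osd unset noout
</pre>
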
h2. OSD related processes

h3. Removing a disk/ssd

To permanently remove a failed disk from a cluster, use ***ceph-osd-stop-remove-permanently*** from the ungleich-tools repo. Warning: if the disk is still active, the OSD will be shut down AND removed from the cluster -> all data of that disk will need to be rebalanced.

h3. Handling DOWN osds with filesystem errors

If an email arrives with the subject "monit alert -- Does not exist osd.XX-whoami", the filesystem of an OSD cannot be read anymore. It is very likely that the disk / ssd is broken. Steps that need to be done:

* Login to any ceph monitor (cephX.placeY.ungleich.ch)
* Check **ceph -s**, find the host using **ceph osd tree**
* Login to the affected host
* Run the following commands:
** ls /var/lib/ceph/osd/ceph-XX
** dmesg
<pre>
Example: after checking the dmesg output below, you can proceed with the next steps
[204696.406756] XFS (sdl1): metadata I/O error: block 0x19100 ("xlog_iodone") error 5 numblks 64
[204696.408094] XFS (sdl1): xfs_do_force_shutdown(0x2) called from line 1233 of file /build/linux-BsFdsw/linux-4.9.65/fs/xfs/xfs_log.c.  Return address = 0xffffffffc08eb612
[204696.410702] XFS (sdl1): Log I/O Error Detected.  Shutting down filesystem
[204696.411977] XFS (sdl1): Please umount the filesystem and rectify the problem(
</pre>

* Create a new ticket in the datacenter light project
** Subject: "Replace broken OSD.XX on serverX.placeY.ungleich.ch"
** Add (partial) output of the above commands
** Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently XX, where XX is the osd id, to remove the disk from the cluster
** Remove the physical disk from the host, check whether there is warranty on it and if yes:
*** Create a short letter to the vendor, including the technical details from above
*** Record when you sent it in
*** Put the ticket into status waiting
** If there is no warranty, dispose of it

h3. [[Create new pool and place new osd]]

h3. Configuring auto repair on pgs

<pre>
ceph config set osd osd_scrub_auto_repair true
</pre>

Verify using:

<pre>
ceph config dump
</pre>

h2. Change ceph speed for i/o recovery

By default we want to keep I/O recovery traffic low to not impact customer experience. However, when multiple disks fail at the same time, we might want to prioritise recovery for data safety over performance.

The default configuration on our servers contains:

<pre>
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

The important settings are *osd max backfills* and *osd recovery max active*; the priority is always kept low so that regular I/O has priority.

To adjust the number of backfills *per osd* and to change the *number of threads* used for recovery, we can run the following on any node with the admin keyring:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills Y'
ceph tell osd.* injectargs '--osd-recovery-max-active X'
</pre>

where Y and X are the values that we want to use. Experience shows that Y=5 and X=5 doubles to triples the recovery performance, whereas X=10 and Y=10 increases recovery performance 5 times.

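For example, to apply the Y=5 / X=5 values mentioned above and to go back to the defaults afterwards:

<pre>
# speed up recovery
ceph tell osd.* injectargs '--osd-max-backfills 5'
ceph tell osd.* injectargs '--osd-recovery-max-active 5'

# revert to the defaults from the configuration above
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
</pre>
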
h2. Debug scrub errors / inconsistent pg message

From time to time disks don't save what they are told to save. Ceph scrubbing detects these errors and switches to HEALTH_ERR. Use *ceph health detail* to find out which placement groups (*pgs*) are affected. Usually a ***ceph pg repair <number>*** fixes the problem.

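A typical session (the pg id below is just a placeholder):

<pre>
# find the affected pgs
ceph health detail

# repair one of the reported pgs
ceph pg repair 2.33
</pre>
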
If this does not help, consult https://ceph.com/geen-categorie/ceph-manually-repair-object/.

h2. Move servers into the osd tree

New servers have their buckets placed outside the **default root** and thus need to be moved inside.
Output might look as follows:

<pre>
[11:19:27] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF 
 -3           0.87270 host server5                             
 41     ssd   0.87270     osd.41           up  1.00000 1.00000 
 -1         251.85580 root default                             
 -7          81.56271     host server2                         
  0 hdd-big   9.09511         osd.0        up  1.00000 1.00000 
  5 hdd-big   9.09511         osd.5        up  1.00000 1.00000 
...
</pre>

Use **ceph osd crush move serverX root=default** (where serverX is the new server),
which will move the bucket into the right place:

<pre>
[11:21:17] server5.place6:~# ceph osd crush move server5 root=default
moved item id -3 name 'server5' to location {root=default} in crush map
[11:32:12] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF 
 -1         252.72850 root default                             
...
 -3           0.87270     host server5                         
 41     ssd   0.87270         osd.41       up  1.00000 1.00000 
</pre>

h2. How to fix existing osds with wrong partition layout

In the first version of DCL we used a filestore / 3 partition based layout.
In the second version of DCL, including OSD autodetection, we use a bluestore / 2 partition based layout.

To convert, we delete the old OSD, clean the partitions and create a new osd:

h3. Inactive OSD

If the OSD is *not active*, we can do the following:

* Find the OSD number: mount the partition and find the whoami file

<pre>
root@server2:/opt/ungleich-tools# mount /dev/sda2 /mnt/
root@server2:/opt/ungleich-tools# cat /mnt/whoami 
0
root@server2:/opt/ungleich-tools# umount  /mnt/
</pre>

* Verify in the *ceph osd tree* output that the OSD is on that server
* Delete the OSD
** ceph osd crush remove $osd_name
** ceph osd rm $osd_name

Then continue below as described in "Recreating the OSD".

h3. Remove Active OSD

* Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently OSDID to stop and remove the OSD
* Then continue below as described in "Recreating the OSD".

h3. Recreating the OSD

* Create an empty partition table (see the sketch below)
** fdisk /dev/sdX
** g
** w
* Create a new OSD
** /opt/ungleich-tools/ceph-osd-create-start /dev/sdX CLASS # use hdd, ssd, ... for the CLASS

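The same fdisk sequence as a non-interactive sketch (assumption: /dev/sdX is the disk to re-initialise; this wipes its partition table):

<pre>
# create a new empty GPT (g) and write it to disk (w)
printf 'g\nw\n' | fdisk /dev/sdX
</pre>
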
h2. How to fix unfound pg

refer to https://redmine.ungleich.ch/issues/6388

* Check the health state
** ceph health detail
* Check which server has that osd
** ceph osd tree
* Check which VM is running on that server
** virsh list
* Check the pg map
** ceph osd map [osd pool] [VMID]
* Revert the pg
** ceph pg [PGID] mark_unfound_lost revert

h2. Enabling per image RBD statistics for prometheus

<pre>
[20:26:57] red2.place5:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "one,hdd"
[20:27:57] black2.place6:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "hdd,ssd"
</pre>

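To verify which pools currently have per-image statistics enabled:

<pre>
ceph config dump | grep rbd_stats_pools
</pre>
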
h2. S3 Object Storage

This section is **UNDER CONSTRUCTION**.

h3. Introduction

* See the "Red Hat manual":https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/object_gateway_guide_for_red_hat_enterprise_linux/overview-rgw
* The "ceph docs about object storage":https://docs.ceph.com/docs/mimic/radosgw/

h3. Architecture

* S3 requests are handled by a publicly accessible gateway, which also has access to the ceph cluster.
* s3 buckets are usually

h3. Authentication / Users

* Ceph *can* make use of LDAP as a backend
** However, it uses the cleartext username+password as a token
** See https://docs.ceph.com/docs/mimic/radosgw/ldap-auth/
* We do not want users to store their regular account credentials on machines
* For this reason we use independent users / tokens, but with the same username as in LDAP

Creating a user:

<pre>
radosgw-admin user create --uid=USERNAME --display-name="Name of user"
</pre>

Listing users:

<pre>
radosgw-admin user list
</pre>

Deleting users and their storage:

<pre>
radosgw-admin user rm --uid=USERNAME --purge-data
</pre>

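Showing the details (including the S3 keys) of an existing user:

<pre>
radosgw-admin user info --uid=USERNAME
</pre>
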
h3. Setting up S3 object storage on Ceph

* Setup a gateway node with Alpine Linux
** Change to edge
** Enable testing
* Update the firewall to allow access from this node to the ceph monitors
* Set up the wildcard DNS certificate (see below)

<pre>
apk add ceph-radosgw
</pre>

h3. Wildcard DNS certificate from letsencrypt

Acquiring and renewing this certificate is currently a manual process, as it requires changing DNS settings.

* run certbot
* update DNS with the first token
* update DNS with the second token

Sample session:

<pre>
s3:/etc/ceph# certbot certonly --manual --preferred-challenges=dns --email sre@ungleich.ch --server https://acme-v02.api.letsencrypt.org/directory --agree-tos 
-d *.s3.ungleich.ch -d s3.ungleich.ch
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Cert is due for renewal, auto-renewing...
Renewing an existing certificate
Performing the following challenges:
dns-01 challenge for s3.ungleich.ch
dns-01 challenge for s3.ungleich.ch

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
NOTE: The IP of this machine will be publicly logged as having requested this
certificate. If you're running certbot in manual mode on a machine that is not
your server, please ensure you're okay with that.

Are you OK with your IP being logged?
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(Y)es/(N)o: y

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.s3.ungleich.ch with the following value:

KxGLZNiVjFwz1ifNheoR_KQoPVpkvRUV1oT2pOvJlU0

Before continuing, verify the record is deployed.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.s3.ungleich.ch with the following value:

bkrhtxWZUipCAL5cBfvrjDuftqsZdQ2JjisiKmXBbaI

Before continuing, verify the record is deployed.
(This must be set up in addition to the previous challenges; do not remove,
replace, or undo the previous challenge tasks yet. Note that you might be
asked to create multiple distinct TXT records with the same name. This is
permitted by DNS standards.)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue
Waiting for verification...
Cleaning up challenges

IMPORTANT NOTES:
 - Congratulations! Your certificate and chain have been saved at:
   /etc/letsencrypt/live/s3.ungleich.ch/fullchain.pem
   Your key file has been saved at:
   /etc/letsencrypt/live/s3.ungleich.ch/privkey.pem
   Your cert will expire on 2020-12-09. To obtain a new or tweaked
   version of this certificate in the future, simply run certbot
   again. To non-interactively renew *all* of your certificates, run
   "certbot renew"
 - If you like Certbot, please consider supporting our work by:

   Donating to ISRG / Let's Encrypt:   https://letsencrypt.org/donate
   Donating to EFF:                    https://eff.org/donate-le
</pre>

h2. Debugging ceph

<pre>
    ceph status
    ceph osd status
    ceph osd df
    ceph osd utilization
    ceph osd pool stats
    ceph osd tree
    ceph pg stat
</pre>

h3. How to list the version overview

This lists the versions of osds, mgrs and mons:

<pre>
ceph versions
</pre>

Listing the "features" of clients, osds, mgrs and mons can be done using @ceph features@:

<pre>
[15:32:20] red1.place5:~# ceph features
{
    "mon": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 5
        }
    ],
    "osd": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 44
        }
    ],
    "client": [
        {
            "features": "0x3ffddff8eea4fffb",
            "release": "luminous",
            "num": 4
        },
        {
            "features": "0x3ffddff8ffacffff",
            "release": "luminous",
            "num": 18
        },
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 31
        }
    ],
    "mgr": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 4
        }
    ]
}
</pre>

h3. How to list the version of every OSD and every monitor

To list the version of each ceph OSD:

<pre>
ceph tell osd.* version
</pre>

To list the version of each ceph mon:

<pre>
ceph tell mon.* version
</pre>

The mgrs do not seem to support this command as of 14.2.21.

h2. Performance Tuning

* Ensure that the basic options for reducing rebalancing workload are set:

<pre>
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

* Ensure that "osd_op_queue_cut_off":https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_op_queue_cut_off is set to **high**
** Requires OSD restart on change

<pre>
ceph config set global osd_op_queue_cut_off high
</pre>

<pre>
be sure to check your osd recovery sleep settings, there are several
depending on your underlying drives:

    "osd_recovery_sleep": "0.000000",
    "osd_recovery_sleep_hdd": "0.050000",
    "osd_recovery_sleep_hybrid": "0.050000",
    "osd_recovery_sleep_ssd": "0.050000",

Adjusting these upwards will dramatically reduce IO, and take effect
immediately at the cost of slowing rebalance/recovery.
</pre>

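The currently active values can be checked, for example, with:

<pre>
ceph config get osd osd_recovery_sleep_hdd
ceph config dump | grep osd_recovery_sleep
</pre>
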
Reference settings from Frank Schilder:

<pre>
  osd       class:hdd      advanced osd_recovery_sleep                0.050000
  osd       class:rbd_data advanced osd_recovery_sleep                0.025000
  osd       class:rbd_meta advanced osd_recovery_sleep                0.002500
  osd       class:ssd      advanced osd_recovery_sleep                0.002500
  osd                      advanced osd_recovery_sleep                0.050000

  osd       class:hdd      advanced osd_max_backfills                 3
  osd       class:rbd_data advanced osd_max_backfills                 6
  osd       class:rbd_meta advanced osd_max_backfills                 12
  osd       class:ssd      advanced osd_max_backfills                 12
  osd                      advanced osd_max_backfills                 3

  osd       class:hdd      advanced osd_recovery_max_active           8
  osd       class:rbd_data advanced osd_recovery_max_active           16
  osd       class:rbd_meta advanced osd_recovery_max_active           32
  osd       class:ssd      advanced osd_recovery_max_active           32
  osd                      advanced osd_recovery_max_active           8
</pre>

(have not yet been tested in our clusters)

h2. Ceph theory

h3. How much data per Server?

Q: How much data should we add into one server?
A: Not more than it can handle.

How much data can a server handle? For this let's have a look at 2 scenarios:

* How long does it take to compensate the loss of the server?

* Assuming a server has X TiB storage in Y disks attached and a network speed of Z GiB/s.
* And our estimated rebuild goal is to compensate the loss of a server within U hours.

h4. Approach 1

Then the rebuild time is roughly the amount of data divided by the network speed.

Let's take an example:

* A server with @10 disks * 10 TiB@ = 100 TiB = 100 000 GiB data. It is network connected with 10 Gbit = 1.25 GiB/s.
* 100000/1.25 = 80000s = 22.22h

However, our logic assumes that we actually rebuild from the failed server, which... is failed.

h4. Approach 2: calculating with the remaining servers

However, we can also apply our logic to distribute
the rebuild over several servers that now pull in data from each other for rebuilding.
We need to *read* the data (100TiB) from other servers and distribute it to new OSDs. Assuming each server has a 10 Gbit/s
network connection.

Now the servers might need to *read* (get data from other osds) and *write* (send data to other osds). Luckily, networking is 10 Gbit/s duplex - i.e. in both directions.

However, how fast can we actually read data from the disks?

* SSDs are in the range of hundreds of MB/s (best case, not necessarily true for random reads) - let's assume
* HDDs are in the range of tens of MB/s (depending on the work load, but 30-40 MB/s random reads seems realistic)

Further assumptions:

* Assuming further that each disk should be dedicated at least one CPU core.

h3. Disk/SSD speeds

* Tuning for #8473 showed that a 10TB HDD can write up to 180-200MB/s when backfilling (at about 70% cpu usage and 20% disk usage), max backfills = 8
* Debugging SSD usage in #8461 showed SSDs can read about 470-520MB/s sequential
* Debugging SSD usage in #8461 showed SSDs can write about 170-280MB/s sequential
* Debugging SSD usage in #8461 showed SSDs can write about 4MB/s RANDOM (need to verify this even though 3 runs showed these numbers)

h3. Ceph theoretical foundations

If you are very much into the theoretical foundations of Ceph, check out their "paper":https://www3.nd.edu/~dthain/courses/cse40771/spring2007/papers/ceph.pdf