h1. The ungleich ceph handbook

{{toc}}

h2. Status

This document is **IN PRODUCTION**.

h2. Introduction

This article describes the ungleich storage architecture that is based on ceph. It describes our architecture as well as maintenance commands. Required for 

h2. Processes

h3. Usage monitoring

* Usage should be kept in the 70-75% range
* If usage reaches 72.5%, we start reducing usage by adding disks
* We stop when usage is below 70%
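
The current usage can be checked with @ceph df@ (cluster wide) and @ceph osd df@ (per OSD); the percentages above correspond to the raw usage reported there:

<pre>
ceph df
ceph osd df
</pre>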

h3. Phasing in new disks

* 24h performance test prior to using it
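
The handbook does not prescribe a tool for this test; one possible 24h burn-in using fio could look like the following (an example only - device name and parameters need to be adapted, and a write test destroys any data on the disk):

<pre>
fio --name=burnin --filename=/dev/sdX --direct=1 --ioengine=libaio \
    --rw=randrw --bs=4k --iodepth=32 --runtime=86400 --time_based
</pre>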

h3. Phasing in new servers

* 24h performance test with 1 ssd or 1 hdd (whatever is applicable)


h2. Communication guide

Usually when a disk fails, no customer communication is necessary, as the failure is automatically compensated/rebalanced by ceph. However, if multiple disk failures happen at the same time, I/O speed might be reduced and thus customer experience impacted.

For this reason, communicate whenever I/O recovery settings are temporarily tuned.

h2. Analysing

h3. ceph osd df tree

Using @ceph osd df tree@ you can see not only the disk usage per OSD, but also the number of PGs on an OSD. This is especially useful to see how the OSDs are balanced.

h3. Find out the device of an OSD

Use @mount | grep /var/lib/ceph/osd/ceph-OSDID@ on the server on which the OSD is located:

<pre>

[16:01:23] server2.place6:~# mount | grep /var/lib/ceph/osd/ceph-31
/dev/sdk1 on /var/lib/ceph/osd/ceph-31 type xfs (rw,relatime,attr2,inode64,noquota)
</pre>

h2. Adding a new disk/ssd to the ceph cluster

Write on the disk, with a permanent marker, in which order / on which date we bought it.

h3. Checking the shadow trees

To be able to spot differences in the weights of hosts, it can be very helpful to look at the crush shadow tree
using @ceph osd crush tree --show-shadow@:

<pre>
-16   hdd-big 653.03418           root default~hdd-big        
-34   hdd-big         0         0     host server14~hdd-big   
-38   hdd-big         0         0     host server15~hdd-big   
-42   hdd-big  81.86153  78.28352     host server17~hdd-big   
 36   hdd-big   9.09560   9.09560         osd.36              
 59   hdd-big   9.09499   9.09499         osd.59              
 60   hdd-big   9.09499   9.09499         osd.60              
 68   hdd-big   9.09599   8.93999         osd.68              
 69   hdd-big   9.09599   7.65999         osd.69              
 70   hdd-big   9.09599   8.35899         osd.70              
 71   hdd-big   9.09599   8.56000         osd.71              
 72   hdd-big   9.09599   8.93700         osd.72              
 73   hdd-big   9.09599   8.54199         osd.73              
-46   hdd-big  90.94986  90.94986     host server18~hdd-big   
...
</pre>


Here we can see that the weight of server17 for the class hdd-big is about 81, while that of server18 is about 90.
SSDs and other classes have their own shadow trees, too.


h3. For Dell servers

First find the disk and then add it to the operating system

<pre>
megacli -PDList -aALL  | grep -B16 -i unconfigur

# Sample output:
[19:46:50] server7.place6:~#  megacli -PDList -aALL  | grep -B16 -i unconfigur
Enclosure Device ID: N/A
Slot Number: 0
Enclosure position: N/A
Device Id: 0
WWN: 0000000000000000
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 894.252 GB [0x6fc81ab0 Sectors]
Non Coerced Size: 893.752 GB [0x6fb81ab0 Sectors]
Coerced Size: 893.75 GB [0x6fb80000 Sectors]
Sector Size:  0
Firmware state: Unconfigured(good), Spun Up
</pre>

Then add the disk to the OS:

<pre>
megacli -CfgLdAdd -r0 [Enclosure Device ID:slot] -aX (X : host is 0. md-array is 1)

# Sample call, if enclosure and slot are KNOWN (aka not N/A)
megacli -CfgLdAdd -r0 [32:0] -a0

# Sample call, if enclosure is N/A
megacli -CfgLdAdd -r0 [:0] -a0
</pre>

Then check the disk:

<pre>
fdisk -l
[11:26:23] server2.place6:~# fdisk -l
......
Disk /dev/sdh: 7.3 TiB, 8000987201536 bytes, 15626928128 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
[11:27:24] server2.place6:~#
</pre>

Then create a fresh GPT partition table:

<pre>
/opt/ungleich-tools/disk-create-fresh-gpt /dev/XXX
[11:31:10] server2.place6:~# /opt/ungleich-tools/disk-create-fresh-gpt /dev/sdh
......
Created a new DOS disklabel with disk identifier 0x9c4a0355.
Command (m for help): Created a new GPT disklabel (GUID: 374E31AD-7B96-4837-B5ED-7B22C452899E).
......
</pre>

Then create the OSD with the appropriate class (ssd or hdd-big):

<pre>
/opt/ungleich-tools/ceph-osd-create-start /dev/XXX XXX(ssd or hdd-big)
[11:33:58] server2.place6:~# /opt/ungleich-tools/ceph-osd-create-start /dev/sdh hdd-big
+ set -e
+ [ 2 -lt 2 ]
......
+ /opt/ungleich-tools/monit-ceph-create-start osd.14
osd.14
[ ok ] Restarting daemon monitor: monit.
[11:36:14] server2.place6:~#
</pre>

Then check the rebalancing (if you want to add another disk, you should do so only after rebalancing has finished):

<pre>
ceph -s
[12:37:57] server2.place6:~# ceph -s
  cluster:
    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
    health: HEALTH_WARN
            2248811/49628409 objects misplaced (4.531%)
......
  io:
    client:   170KiB/s rd, 35.0MiB/s wr, 463op/s rd, 728op/s wr
    recovery: 27.1MiB/s, 6objects/s
[12:49:41] server2.place6:~#
</pre>

h2. Moving a disk/ssd to another server

(needs to be described better)

Generally speaking (a condensed command sketch follows after this list):

* //needs to be tested: disable recovery so data won't start to move while you have the osd down
* /opt/ungleich-tools/ceph-osd-stop-disable does the following:
** Stop the osd, remove monit on the server you want to take it out of
** umount the disk
* Take the disk out
* Discard the preserved cache on the server you took it out of
** using megacli: @megacli -DiscardPreservedCache -Lall -aAll@
* Insert it into the new server
* Clear the foreign configuration
** using megacli: @megacli -CfgForeign -Clear -aAll@
* The disk will now appear in the OS and ceph/udev will automatically start the OSD (!)
** No creation of the osd is required!
* Verify that the disk exists and that the osd is started
** using *ps aux*
** using *ceph osd tree*
* */opt/ungleich-tools/monit-ceph-create-start osd.XX* # where osd.XX is the osd with its number
** Creates the monit configuration file so that monit watches the OSD
** Reload monit
* Verify monit using *monit status*
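
A condensed sketch of the sequence; osd.XX is a placeholder, the exact arguments of the ungleich-tools scripts are assumptions here (check the scripts themselves), and the megacli steps only apply to hosts with a RAID controller:

<pre>
# On the old server
/opt/ungleich-tools/ceph-osd-stop-disable osd.XX    # stops the osd, removes the monit entry, umounts the disk
# physically remove the disk, then:
megacli -DiscardPreservedCache -Lall -aAll

# On the new server, after inserting the disk
megacli -CfgForeign -Clear -aAll
ceph osd tree                                       # verify the osd shows up and is started
/opt/ungleich-tools/monit-ceph-create-start osd.XX  # re-create the monit configuration
monit status
</pre>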

h2. OSD related processes

h3. Removing a disk/ssd

To permanently remove a failed disk from a cluster, use ***ceph-osd-stop-remove-permanently*** from the ungleich-tools repo. Warning: if the disk is still active, the OSD will be shut down AND removed from the cluster -> all data of that disk will need to be rebalanced.
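
A sample invocation, with 31 as an example osd id:

<pre>
/opt/ungleich-tools/ceph-osd-stop-remove-permanently 31
</pre>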

h3. Handling DOWN osds with filesystem errors

If an email arrives with the subject "monit alert -- Does not exist osd.XX-whoami", the filesystem of an OSD cannot be read anymore. It is very likely that the disk / ssd is broken. Steps that need to be done:

* Login to any ceph monitor (cephX.placeY.ungleich.ch)
* Check **ceph -s**, find the host using **ceph osd tree**
* Login to the affected host
* Run the following commands:
** ls /var/lib/ceph/osd/ceph-XX
** dmesg
<pre>
# Example: dmesg output like the following confirms a broken filesystem, after which you can continue with the next step
[204696.406756] XFS (sdl1): metadata I/O error: block 0x19100 ("xlog_iodone") error 5 numblks 64
[204696.408094] XFS (sdl1): xfs_do_force_shutdown(0x2) called from line 1233 of file /build/linux-BsFdsw/linux-4.9.65/fs/xfs/xfs_log.c.  Return address = 0xffffffffc08eb612
[204696.410702] XFS (sdl1): Log I/O Error Detected.  Shutting down filesystem
[204696.411977] XFS (sdl1): Please umount the filesystem and rectify the problem(
</pre>

* Create a new ticket in the datacenter light project
** Subject: "Replace broken OSD.XX on serverX.placeY.ungleich.ch"
** Add (partial) output of the above commands
** Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently XX, where XX is the osd id, to remove the disk from the cluster
** Remove the physical disk from the host, check whether there is warranty on it and if yes
*** Create a short letter to the vendor, including the technical details from above
*** Record when you sent it in
*** Put the ticket into status waiting
** If there is no warranty, dispose of it

h3. [[Create new pool and place new osd]]

h3. Configuring auto repair on pgs

<pre>
ceph config set osd osd_scrub_auto_repair true
</pre>

Verify using:

<pre>
ceph config dump
</pre>

h2. Change ceph speed for i/o recovery

By default we want to keep I/O recovery traffic low to not impact customer experience. However, when multiple disks fail at the same time, we might want to prioritise recovery for data safety over performance.

The default configuration on our servers contains:

<pre>
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

The important settings are *osd max backfills* and *osd recovery max active*; the priority is always kept low so that regular I/O has priority.

To adjust the number of backfills *per osd* and to change the *number of threads* used for recovery, we can run the following on any node with the admin keyring:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills Y'
ceph tell osd.* injectargs '--osd-recovery-max-active X'
</pre>

where Y and X are the values that we want to use. Experience shows that Y=5 and X=5 doubles to triples the recovery performance, whereas X=10 and Y=10 increases recovery performance 5 times.
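
Once recovery has finished, the same mechanism can be used to return to the defaults listed above:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
</pre>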

h2. Debug scrub errors / inconsistent pg message

From time to time disks don't save what they are told to save. Ceph scrubbing detects these errors and switches to HEALTH_ERR. Use *ceph health detail* to find out which placement groups (*pgs*) are affected. Usually a ***ceph pg repair <number>*** fixes the problem.
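
A typical session might look like this (the pg id 2.5 is only an example, use the ids reported by @ceph health detail@):

<pre>
ceph health detail        # reports e.g. an inconsistent pg such as 2.5
ceph pg repair 2.5
ceph -s                   # wait until the repair has finished and health is OK again
</pre>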

If this does not help, consult https://ceph.com/geen-categorie/ceph-manually-repair-object/.

h2. Move servers into the osd tree

New servers have their buckets placed outside the **default root** and thus need to be moved inside.
The output might look as follows:

<pre>
[11:19:27] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF 
 -3           0.87270 host server5                             
 41     ssd   0.87270     osd.41           up  1.00000 1.00000 
 -1         251.85580 root default                             
 -7          81.56271     host server2                         
  0 hdd-big   9.09511         osd.0        up  1.00000 1.00000 
  5 hdd-big   9.09511         osd.5        up  1.00000 1.00000 
...
</pre>


Use **ceph osd crush move serverX root=default** (where serverX is the new server),
which will move the bucket into the right place:

<pre>
[11:21:17] server5.place6:~# ceph osd crush move server5 root=default
moved item id -3 name 'server5' to location {root=default} in crush map
[11:32:12] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF 
 -1         252.72850 root default                             
...
 -3           0.87270     host server5                         
 41     ssd   0.87270         osd.41       up  1.00000 1.00000 

</pre>

h2. How to fix existing osds with wrong partition layout

In the first version of DCL we used a filestore / 3 partition based layout.
In the second version of DCL, including OSD autodetection, we use a bluestore / 2 partition based layout.

To convert, we delete the old OSD, clean the partitions and create a new osd:

h3. Inactive OSD

If the OSD is *not active*, we can do the following:

* Find the OSD number: mount the partition and find the whoami file

<pre>
root@server2:/opt/ungleich-tools# mount /dev/sda2 /mnt/
root@server2:/opt/ungleich-tools# cat /mnt/whoami 
0
root@server2:/opt/ungleich-tools# umount  /mnt/

</pre>

* Verify in the *ceph osd tree* that the OSD is on that server
* Delete the OSD
** ceph osd crush remove $osd_name
** ceph osd rm $osd_name

Then continue below as described in "Recreating the OSD".

h3. Remove Active OSD

* Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently OSDID to stop and remove the OSD
* Then continue below as described in "Recreating the OSD".


h3. Recreating the OSD

* Create an empty partition table
** fdisk /dev/sdX
** g
** w
* Create a new OSD
** /opt/ungleich-tools/ceph-osd-create-start /dev/sdX CLASS # use hdd, ssd, ... for the CLASS
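
Condensed, the two steps might look like this (destroys the partition table on /dev/sdX; the class @hdd@ is only an example):

<pre>
fdisk /dev/sdX        # at the interactive prompt enter: g (new empty GPT), then w (write and quit)
/opt/ungleich-tools/ceph-osd-create-start /dev/sdX hdd
</pre>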

h2. How to fix unfound pg

Refer to https://redmine.ungleich.ch/issues/6388; an example command sequence is sketched after this list.

* Check the health state
** ceph health detail
* Check which server has that osd
** ceph osd tree
* Check which VMs are running on that server
** virsh list
* Check the pg map
** ceph osd map [osd pool] [VMID]
* Revert the pg
** ceph pg [PGID] mark_unfound_lost revert
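
An example sequence (pool name, object/VM id and pg id are placeholders for illustration only):

<pre>
ceph health detail                        # e.g. reports unfound objects in pg 2.19
ceph osd tree                             # find the server holding the affected osd
virsh list                                # on that server: check which VMs are running
ceph osd map hdd one-12345                # map a pool/object to its pg and osds
ceph pg 2.19 mark_unfound_lost revert     # revert the unfound objects to their previous version
</pre>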

h2. Enabling per image RBD statistics for prometheus


<pre>
[20:26:57] red2.place5:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "one,hdd"
[20:27:57] black2.place6:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "hdd,ssd"
</pre>

h2. S3 Object Storage

This section is ** UNDER CONSTRUCTION ** 

h3. Introduction

* See the "Red Hat manual":https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/object_gateway_guide_for_red_hat_enterprise_linux/overview-rgw
* The "ceph docs about object storage":https://docs.ceph.com/docs/mimic/radosgw/

h3. Architecture

* S3 requests are handled by a publicly accessible gateway, which also has access to the ceph cluster.
* s3 buckets are usually

h3. Authentication / Users

* Ceph *can* make use of LDAP as a backend
** However it uses the clear text username+password as a token
** See https://docs.ceph.com/docs/mimic/radosgw/ldap-auth/
* We do not want users to store their regular account credentials on machines
* For this reason we use independent users / tokens, but with the same username as in LDAP

Creating a user:

<pre>
radosgw-admin user create --uid=USERNAME --display-name="Name of user"
</pre>


Listing users:

<pre>
radosgw-admin user list
</pre>


Deleting users and their storage:

<pre>
radosgw-admin user rm --uid=USERNAME --purge-data
</pre>

h3. Setting up S3 object storage on Ceph

* Setup a gateway node with Alpine Linux
** Change to edge
** Enable testing
* Update the firewall to allow access from this node to the ceph monitors
* Set up the wildcard DNS certificate (see the next section)
* Install the rados gateway:

<pre>
apk add ceph-radosgw
</pre>
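
After installing the package, the gateway needs a client section in @/etc/ceph/ceph.conf@ plus a keyring for the rgw user. The following is only a sketch with assumed values - the client name, the port and the use of the letsencrypt certificate paths below are examples, not our actual configuration:

<pre>
[client.rgw.s3]
rgw_frontends = "beast ssl_port=443 ssl_certificate=/etc/letsencrypt/live/s3.ungleich.ch/fullchain.pem ssl_private_key=/etc/letsencrypt/live/s3.ungleich.ch/privkey.pem"
rgw_dns_name = s3.ungleich.ch
</pre>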

h3. Wildcard DNS certificate from letsencrypt

Acquiring and renewing this certificate is currently a manual process, as it requires changing DNS settings.

* run certbot
* update DNS with the first token
* update DNS with the second token

Sample session:

<pre>
s3:/etc/ceph# certbot certonly --manual --preferred-challenges=dns --email sre@ungleich.ch --server https://acme-v02.api.letsencrypt.org/directory --agree-tos 
-d *.s3.ungleich.ch -d s3.ungleich.ch
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Cert is due for renewal, auto-renewing...
Renewing an existing certificate
Performing the following challenges:
dns-01 challenge for s3.ungleich.ch
dns-01 challenge for s3.ungleich.ch

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
NOTE: The IP of this machine will be publicly logged as having requested this
certificate. If you're running certbot in manual mode on a machine that is not
your server, please ensure you're okay with that.

Are you OK with your IP being logged?
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(Y)es/(N)o: y

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.s3.ungleich.ch with the following value:

KxGLZNiVjFwz1ifNheoR_KQoPVpkvRUV1oT2pOvJlU0

Before continuing, verify the record is deployed.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.s3.ungleich.ch with the following value:

bkrhtxWZUipCAL5cBfvrjDuftqsZdQ2JjisiKmXBbaI

Before continuing, verify the record is deployed.
(This must be set up in addition to the previous challenges; do not remove,
replace, or undo the previous challenge tasks yet. Note that you might be
asked to create multiple distinct TXT records with the same name. This is
permitted by DNS standards.)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue
Waiting for verification...
Cleaning up challenges

IMPORTANT NOTES:
 - Congratulations! Your certificate and chain have been saved at:
   /etc/letsencrypt/live/s3.ungleich.ch/fullchain.pem
   Your key file has been saved at:
   /etc/letsencrypt/live/s3.ungleich.ch/privkey.pem
   Your cert will expire on 2020-12-09. To obtain a new or tweaked
   version of this certificate in the future, simply run certbot
   again. To non-interactively renew *all* of your certificates, run
   "certbot renew"
 - If you like Certbot, please consider supporting our work by:

   Donating to ISRG / Let's Encrypt:   https://letsencrypt.org/donate
   Donating to EFF:                    https://eff.org/donate-le

</pre>

h2. Debugging ceph


<pre>
    ceph status
    ceph osd status
    ceph osd df
    ceph osd utilization
    ceph osd pool stats
    ceph osd tree
    ceph pg stat
</pre>

h3. How to list the version overview

This lists the versions of osds, mgrs and mons:

<pre>
ceph versions
</pre>

Listing the "features" of clients, osds, mgrs and mons can be done using @ceph features@:

<pre>
[15:32:20] red1.place5:~# ceph features
{
    "mon": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 5
        }
    ],
    "osd": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 44
        }
    ],
    "client": [
        {
            "features": "0x3ffddff8eea4fffb",
            "release": "luminous",
            "num": 4
        },
        {
            "features": "0x3ffddff8ffacffff",
            "release": "luminous",
            "num": 18
        },
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 31
        }
    ],
    "mgr": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 4
        }
    ]
}
</pre>

h3. How to list the version of every OSD and every monitor

To list the version of each ceph OSD:

<pre>
ceph tell osd.* version
</pre>

To list the version of each ceph mon:

<pre>
ceph tell mon.* version
</pre>

The mgrs do not seem to support this command as of 14.2.21.

h2. Performance Tuning

* Ensure that the basic options for reducing rebalancing workload are set:

<pre>
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

* Ensure that "osd_op_queue_cut_off":https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_op_queue_cut_off is set to **high**
** Requires OSD restart on change

<pre>
ceph config set global osd_op_queue_cut_off high
</pre>

<pre>
be sure to check your osd recovery sleep settings, there are several
depending on your underlying drives:

    "osd_recovery_sleep": "0.000000",
    "osd_recovery_sleep_hdd": "0.050000",
    "osd_recovery_sleep_hybrid": "0.050000",
    "osd_recovery_sleep_ssd": "0.050000",

Adjusting these upwards will dramatically reduce IO, and take effect
immediately at the cost of slowing rebalance/recovery.
</pre>

Reference settings from Frank Schilder:

<pre>
  osd       class:hdd      advanced osd_recovery_sleep                0.050000
  osd       class:rbd_data advanced osd_recovery_sleep                0.025000
  osd       class:rbd_meta advanced osd_recovery_sleep                0.002500
  osd       class:ssd      advanced osd_recovery_sleep                0.002500
  osd                      advanced osd_recovery_sleep                0.050000

  osd       class:hdd      advanced osd_max_backfills                 3
  osd       class:rbd_data advanced osd_max_backfills                 6
  osd       class:rbd_meta advanced osd_max_backfills                 12
  osd       class:ssd      advanced osd_max_backfills                 12
  osd                      advanced osd_max_backfills                 3

  osd       class:hdd      advanced osd_recovery_max_active           8
  osd       class:rbd_data advanced osd_recovery_max_active           16
  osd       class:rbd_meta advanced osd_recovery_max_active           32
  osd       class:ssd      advanced osd_recovery_max_active           32
  osd                      advanced osd_recovery_max_active           8
</pre>

(have not yet been tested in our clusters)
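
If one wanted to try such per-class values, they could presumably be applied with @ceph config set@ and a device class mask; a sketch using the hdd values from above (not something we have applied yet, verify against your ceph version first):

<pre>
ceph config set osd/class:hdd osd_recovery_sleep 0.05
ceph config set osd/class:hdd osd_max_backfills 3
ceph config set osd/class:hdd osd_recovery_max_active 8
</pre>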

h2. Ceph theory

h3. How much data per Server?

Q: How much data should we add into one server?
A: Not more than it can handle.

How much data can a server handle? For this let's have a look at 2 scenarios:

* How long does it take to compensate the loss of the server?

* Assuming a server has X TiB storage in Y disks attached and a network speed of Z GiB/s.
* And our estimated rebuild goal is to compensate the loss of a server within U hours.


h4. Approach 1

Then the rebuild time is roughly the amount of data divided by the network speed, i.e. X TiB / Z GiB/s.

Let's take an example: 

* A server with @10 disks * 10 TiB@ = 100 TiB = 100 000 GiB data. It is network connected with 10 Gbit = 1.25 GiB/s.
* 100000/1.25 = 80000s = 22.22h

However, our logic assumes that we actually rebuild from the failed server, which... is failed. 

h4. Approach 2: calculating with the remaining servers

However, we can also apply our logic to distribute
the rebuild over the several servers that now pull in data from each other for rebuilding.
We need to *read* the data (100TiB) from other servers and distribute it to new OSDs. Assuming each server has a 10 Gbit/s
network connection.

Now the servers might need to *read* (get data from other osds) and *write* (send data to other osds). Luckily, networking is 10 Gbit/s duplex - i.e. in both directions.

However, how fast can we actually read data from the disks? 

* SSDs are in the range of hundreds of MB/s (best case, not necessarily true for random reads) - let's assume
* HDDs are in the range of tens of MB/s (depending on the work load, but 30-40 MB/s random reads seems realistic)
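
A back-of-the-envelope sketch combining the numbers above (the 9 remaining servers and the flat 40 MB/s per HDD are assumptions for illustration, not measured values):

<pre>
# Data to re-replicate after losing one server: 100 TiB = 100 000 GiB
# Each remaining server can read from its 10 HDDs at roughly 10 * 40 MB/s = 400 MB/s = ~0.4 GiB/s,
# which is below the 1.25 GiB/s its 10 Gbit/s link could carry - so the disks are the bottleneck.
# With e.g. 9 remaining servers reading in parallel:
#   100 000 GiB / (9 * 0.4 GiB/s) = ~27 800 s = ~7.7 h
</pre>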

Further assumptions:

* Assuming further that each disk should be dedicated at least one CPU core.

h3. Disk/SSD speeds

* Tuning for #8473 showed that a 10TB HDD can write up to 180-200MB/s when backfilling (at about 70% cpu usage and 20% disk usage), max backfills = 8
* Debugging SSD usage in #8461 showed SSDs can read about 470-520MB/s sequential
* Debugging SSD usage in #8461 showed SSDs can write about 170-280MB/s sequential
* Debugging SSD usage in #8461 showed SSDs can write about 4MB/s RANDOM (need to verify this even though 3 runs showed these numbers)

h3. Ceph theoretical foundations

If you are very much into the theoretical foundations of Ceph, check out their "paper":https://www3.nd.edu/~dthain/courses/cse40771/spring2007/papers/ceph.pdf