h1. The ungleich ceph handbook

{{toc}}

h2. Status

This document is **IN PRODUCTION**.

h2. Introduction

This article describes the ungleich storage architecture that is based on ceph. It describes our architecture as well as maintenance commands.

h2. Processes

h3. Usage monitoring

* Usage should be kept in the 70-75% range (see below for how to check)
* If usage reaches 72.5%, we start reducing usage by adding disks
* We stop when usage is below 70%
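
A quick way to check the current utilization is @ceph df@ together with @ceph osd df tree@ (both are standard ceph commands; the output format varies slightly between releases):

<pre>
# overall raw usage and per-pool usage
ceph df

# per-OSD usage, PG count and balance
ceph osd df tree
</pre>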

h3. Phasing in new disks

* Run a 24h performance test prior to using it

h3. Phasing in new servers

* Run a 24h performance test with 1 ssd or 1 hdd (whatever is applicable); a possible test is sketched below
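
For both cases, one possible way to run such a 24h test is fio. This is only a sketch: the parameters are illustrative, fio needs to be installed, and writing to the raw device destroys any data on it, so only run this against an empty, not-yet-phased-in disk:

<pre>
# 24h mixed random read/write test against the raw device (DESTROYS data on /dev/sdX)
fio --name=phase-in-test --filename=/dev/sdX --direct=1 --ioengine=libaio \
    --rw=randrw --bs=4k --iodepth=32 --time_based --runtime=86400
</pre>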

h2. Communication guide

Usually when a disk fails, no customer communication is necessary, as it is automatically compensated/rebalanced by ceph. However, if multiple disk failures happen at the same time, I/O speed might be reduced and thus customer experience impacted.

For this reason, communicate whenever I/O recovery settings are temporarily tuned.

h2. Analysing

h3. ceph osd df tree

Using @ceph osd df tree@ you can see not only the disk usage per OSD, but also the number of PGs on an OSD. This is especially useful to see how the OSDs are balanced.

h3. Find out the device of an OSD

Use @mount | grep /var/lib/ceph/osd/ceph-OSDID@ on the server on which the OSD is located:

<pre>
[16:01:23] server2.place6:~# mount | grep /var/lib/ceph/osd/ceph-31
/dev/sdk1 on /var/lib/ceph/osd/ceph-31 type xfs (rw,relatime,attr2,inode64,noquota)
</pre>
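
Note that this works for filestore OSDs mounted from a data partition. For bluestore OSDs deployed with ceph-volume, the OSD directory is a tmpfs; in that case @ceph-volume lvm list@ (run on the OSD host) is one way to map OSD ids to devices. This is based on upstream ceph tooling, not something specific to our setup:

<pre>
ceph-volume lvm list
</pre>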

h3. Show config

<pre>
ceph config dump
</pre>

h3. Show backfill and recovery config

<pre>
ceph-conf --show-config | egrep "osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills"
</pre>
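
On clusters using the centralised config database, the values a specific OSD is actually running with can also be inspected via the mgr (osd.0 is just an example id):

<pre>
ceph config show osd.0 | egrep "osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills"
</pre>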

h2. Adding a new disk/ssd to the ceph cluster

Write on the disk, with a permanent marker, the order / date on which we bought it.

h3. Checking the shadow trees

To be able to spot differences / weights of hosts, it can be very helpful to look at the crush shadow tree
using @ceph osd crush tree --show-shadow@:

<pre>
-16   hdd-big 653.03418           root default~hdd-big
-34   hdd-big         0         0     host server14~hdd-big
-38   hdd-big         0         0     host server15~hdd-big
-42   hdd-big  81.86153  78.28352     host server17~hdd-big
 36   hdd-big   9.09560   9.09560         osd.36
 59   hdd-big   9.09499   9.09499         osd.59
 60   hdd-big   9.09499   9.09499         osd.60
 68   hdd-big   9.09599   8.93999         osd.68
 69   hdd-big   9.09599   7.65999         osd.69
 70   hdd-big   9.09599   8.35899         osd.70
 71   hdd-big   9.09599   8.56000         osd.71
 72   hdd-big   9.09599   8.93700         osd.72
 73   hdd-big   9.09599   8.54199         osd.73
-46   hdd-big  90.94986  90.94986     host server18~hdd-big
...
</pre>

Here we can see that the weight of server17 for the class hdd-big is about 81, while that of server18 is about 90.
SSDs and other classes have their own shadow trees, too.

h3. For Dell servers

First find the disk and then add it to the operating system:

<pre>
megacli -PDList -aALL  | grep -B16 -i unconfigur

# Sample output:
[19:46:50] server7.place6:~#  megacli -PDList -aALL  | grep -B16 -i unconfigur
Enclosure Device ID: N/A
Slot Number: 0
Enclosure position: N/A
Device Id: 0
WWN: 0000000000000000
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 894.252 GB [0x6fc81ab0 Sectors]
Non Coerced Size: 893.752 GB [0x6fb81ab0 Sectors]
Coerced Size: 893.75 GB [0x6fb80000 Sectors]
Sector Size:  0
Firmware state: Unconfigured(good), Spun Up
</pre>

Then add the disk to the OS:

<pre>
megacli -CfgLdAdd -r0 [Enclosure Device ID:slot] -aX (X : host is 0. md-array is 1)

# Sample call, if enclosure and slot are KNOWN (aka not N/A)
megacli -CfgLdAdd -r0 [32:0] -a0

# Sample call, if enclosure is N/A
megacli -CfgLdAdd -r0 [:0] -a0
</pre>

Then check the disk:

<pre>
fdisk -l
[11:26:23] server2.place6:~# fdisk -l
......
Disk /dev/sdh: 7.3 TiB, 8000987201536 bytes, 15626928128 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
[11:27:24] server2.place6:~#
</pre>

Then create a fresh GPT partition table:

<pre>
/opt/ungleich-tools/disk-create-fresh-gpt /dev/XXX
[11:31:10] server2.place6:~# /opt/ungleich-tools/disk-create-fresh-gpt /dev/sdh
......
Created a new DOS disklabel with disk identifier 0x9c4a0355.
Command (m for help): Created a new GPT disklabel (GUID: 374E31AD-7B96-4837-B5ED-7B22C452899E).
......
</pre>

Then create the OSD for ssd/hdd-big:

<pre>
/opt/ungleich-tools/ceph-osd-create-start /dev/XXX XXX (ssd or hdd-big)
[11:33:58] server2.place6:~# /opt/ungleich-tools/ceph-osd-create-start /dev/sdh hdd-big
+ set -e
+ [ 2 -lt 2 ]
......
+ /opt/ungleich-tools/monit-ceph-create-start osd.14
osd.14
[ ok ] Restarting daemon monitor: monit.
[11:36:14] server2.place6:~#
</pre>
173
174
Then check rebalancing(if you want to add another disk, you should do after rebalancing)
175
176
<pre>
177
ceph -s
178
[12:37:57] server2.place6:~# ceph -s
179
  cluster:
180
    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
181
    health: HEALTH_WARN
182
            2248811/49628409 objects misplaced (4.531%)
183
......
184
  io:
185
    client:   170KiB/s rd, 35.0MiB/s wr, 463op/s rd, 728op/s wr
186
    recovery: 27.1MiB/s, 6objects/s
187
[12:49:41] server2.place6:~#
188 2 Nico Schottelius
</pre>
189
190 1 Nico Schottelius
h2. Moving a disk/ssd to another server
191 4 Nico Schottelius
192
(needs to be described better)
193
194
Generally speaking:
195
196 27 ll nu
* //needs to be tested: disable recovery so data wont start move while you have the osd down
197 9 Nico Schottelius
* /opt/ungleich-tools/ceph-osd-stop-disable does the following:
198
** Stop the osd, remove monit on the server you want to take it out
199
** umount the disk
200 1 Nico Schottelius
* Take disk out
201
* Discard preserved cache on the server you took it out 
202 54 Nico Schottelius
** using megacli:  @megacli -DiscardPreservedCache -Lall -aAll@
203 1 Nico Schottelius
* Insert into new server
204 9 Nico Schottelius
* Clear foreign configuration
205 54 Nico Schottelius
** using megacli: @megacli -CfgForeign -Clear -aAll@
206 9 Nico Schottelius
* Disk will now appear in the OS, ceph/udev will automatically start the OSD (!)
207
** No creating of the osd required!
208
* Verify that the disk exists and that the osd is started
209
** using *ps aux*
210
** using *ceph osd tree*
211 10 Nico Schottelius
* */opt/ungleich-tools/monit-ceph-create-start osd.XX* # where osd.XX is the osd + number
212 9 Nico Schottelius
** Creates the monit configuration file so that monit watches the OSD
213
** Reload monit
214 11 Nico Schottelius
* Verify monit using *monit status*
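
A possible way to temporarily prevent data movement while the osd is down (untested in this procedure, matching the note above) is to set the corresponding cluster flags before stopping the osd and to remove them once the disk is running in the new server:

<pre>
# before taking the osd down
ceph osd set noout
ceph osd set norebalance

# after the osd is up again in the new server
ceph osd unset norebalance
ceph osd unset noout
</pre>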

h2. OSD related processes

h3. Removing a disk/ssd

To permanently remove a failed disk from a cluster, use ***ceph-osd-stop-remove-permanently*** from the ungleich-tools repo. Warning: if the disk is still active, the OSD will be shut down AND removed from the cluster -> all data on that disk will need to be rebalanced.

h3. Handling DOWN osds with filesystem errors

If an email arrives with the subject "monit alert -- Does not exist osd.XX-whoami", the filesystem of an OSD cannot be read anymore. It is highly likely that the disk / ssd is broken. Steps that need to be done:

* Log in to any ceph monitor (cephX.placeY.ungleich.ch)
* Check **ceph -s**, find the host using **ceph osd tree**
* Log in to the affected host
* Run the following commands:
** ls /var/lib/ceph/osd/ceph-XX
** dmesg
<pre>
Example: after seeing dmesg messages like the following, you can continue with the next step
[204696.406756] XFS (sdl1): metadata I/O error: block 0x19100 ("xlog_iodone") error 5 numblks 64
[204696.408094] XFS (sdl1): xfs_do_force_shutdown(0x2) called from line 1233 of file /build/linux-BsFdsw/linux-4.9.65/fs/xfs/xfs_log.c.  Return address = 0xffffffffc08eb612
[204696.410702] XFS (sdl1): Log I/O Error Detected.  Shutting down filesystem
[204696.411977] XFS (sdl1): Please umount the filesystem and rectify the problem(
</pre>

* Create a new ticket in the datacenter light project
** Subject: "Replace broken OSD.XX on serverX.placeY.ungleich.ch"
** Add the (partial) output of the above commands
** Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently XX, where XX is the osd id, to remove the disk from the cluster
** Remove the physical disk from the host, check whether there is warranty on it, and if yes
*** Write a short letter to the vendor, including the technical details from above
*** Record when you sent it in
*** Put the ticket into status waiting
** If there is no warranty, dispose of it

h3. [[Create new pool and place new osd]]

h3. Configuring auto repair on pgs

<pre>
ceph config set osd osd_scrub_auto_repair true
</pre>

Verify using:

<pre>
ceph config dump
</pre>
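
Alternatively, the single option can be queried directly (this uses the same centralised config database as the set command above):

<pre>
ceph config get osd osd_scrub_auto_repair
</pre>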

h2. Change ceph speed for i/o recovery

By default we want to keep I/O recovery traffic low to not impact customer experience. However, when multiple disks fail at the same time, we might want to prioritise recovery for data safety over performance.

The default configuration on our servers contains:

<pre>
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

The important settings are *osd max backfills* and *osd recovery max active*; the priority is always kept low so that regular I/O has priority.

To adjust the number of backfills *per osd* and to change the *number of threads* used for recovery, we can run the following on any node with the admin keyring:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills Y'
ceph tell osd.* injectargs '--osd-recovery-max-active X'
</pre>

where Y and X are the values that we want to use. Experience shows that Y=5 and X=5 doubles to triples the recovery performance, whereas X=10 and Y=10 increases recovery performance about 5 times.
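
Once recovery has finished, the values can be brought back to the defaults shown above using the same mechanism, for example:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
</pre>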
287
288
h2. Debug scrub errors / inconsistent pg message
289 6 Nico Schottelius
290 1 Nico Schottelius
From time to time disks don't save what they are told to save. Ceph scrubbing detects these errors and switches to HEALTH_ERR. Use *ceph health detail* to find out which placement groups (*pgs*) are affected. Usually a ***ceph pg repair <number>*** fixes the problem.
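
For reference, the typical sequence is (the pg id is a placeholder taken from the health output):

<pre>
ceph health detail
ceph pg repair <pgid>
</pre>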

If this does not help, consult https://ceph.com/geen-categorie/ceph-manually-repair-object/.

h2. Move servers into the osd tree

New servers have their buckets placed outside the **default root** and thus need to be moved inside.
Output might look as follows:

<pre>
[11:19:27] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF 
 -3           0.87270 host server5
 41     ssd   0.87270     osd.41           up  1.00000 1.00000 
 -1         251.85580 root default
 -7          81.56271     host server2
  0 hdd-big   9.09511         osd.0        up  1.00000 1.00000 
  5 hdd-big   9.09511         osd.5        up  1.00000 1.00000 
...
</pre>

Use **ceph osd crush move serverX root=default** (where serverX is the new server),
which will move the bucket into the right place:

<pre>
[11:21:17] server5.place6:~# ceph osd crush move server5 root=default
moved item id -3 name 'server5' to location {root=default} in crush map
[11:32:12] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF 
 -1         252.72850 root default
...
 -3           0.87270     host server5
 41     ssd   0.87270         osd.41       up  1.00000 1.00000 

</pre>
327 13 Nico Schottelius
328
h2. How to fix existing osds with wrong partition layout
329
330
In the first version of DCL we used filestore/3 partition based layout.
331
In the second version of DCL, including OSD autodection, we use bluestore/2 partition based layout.
332
333
To convert, we delete the old OSD, clean the partitions and create a new osd:
334
335 14 Nico Schottelius
h3. Inactive OSD
336 1 Nico Schottelius
337 14 Nico Schottelius
If the OSD is *not active*, we can do the following:
338
339 13 Nico Schottelius
* Find the OSD number: mount the partition and find the whoami file
340
341
<pre>
342
root@server2:/opt/ungleich-tools# mount /dev/sda2 /mnt/
343
root@server2:/opt/ungleich-tools# cat /mnt/whoami 
344
0
345
root@server2:/opt/ungleich-tools# umount  /mnt/
346
347
</pre>
348
349
* Verify in the *ceph osd tree* that the OSD is on that server
350
* Deleting the OSD
351
** ceph osd crush remove $osd_name
352 1 Nico Schottelius
** ceph osd rm $osd_name

Then continue below as described in "Recreating the OSD".

h3. Remove Active OSD

* Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently OSDID to stop and remove the OSD
* Then continue below as described in "Recreating the OSD".

h3. Recreating the OSD

* Create an empty partition table
** fdisk /dev/sdX
** g
** w
* Create a new OSD
** /opt/ungleich-tools/ceph-osd-create-start /dev/sdX CLASS # use hdd, ssd, ... for the CLASS
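
As a non-interactive alternative to the fdisk steps, the old partition table can also be wiped with sgdisk (assuming the gdisk package is installed) and then recreated with our existing tooling:

<pre>
# remove all partition table structures (destructive!)
sgdisk --zap-all /dev/sdX

# recreate a fresh GPT using the in-house helper
/opt/ungleich-tools/disk-create-fresh-gpt /dev/sdX
</pre>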
370 15 Jin-Guk Kwon
371
h2. How to fix unfound pg
372
373
refer to https://redmine.ungleich.ch/issues/6388
374 16 Jin-Guk Kwon
375
* Check health state 
376
** ceph health detail
377
* Check which server has that osd
378
** ceph osd tree
379
* Check which VM is running in server place
380 17 Jin-Guk Kwon
** virsh list  
381 16 Jin-Guk Kwon
* Check pg map
382 17 Jin-Guk Kwon
** ceph osd map [osd pool] [VMID]
383 18 Jin-Guk Kwon
* revert pg
384
** ceph pg [PGID] mark_unfound_lost revert
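
Put together, a session might look like this (placeholders as above; @ceph pg [PGID] query@ is an additional standard command to inspect the pg before reverting it):

<pre>
ceph health detail
ceph osd tree
ceph osd map [osd pool] [VMID]
ceph pg [PGID] query
ceph pg [PGID] mark_unfound_lost revert
</pre>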

h2. Enabling per image RBD statistics for prometheus

<pre>
[20:26:57] red2.place5:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "one,hdd"
[20:27:57] black2.place6:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "hdd,ssd"
</pre>
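
To check that the per-image metrics are exported, the mgr prometheus endpoint can be queried on the active mgr host (port 9283 is the module default; exact metric names depend on the ceph release):

<pre>
curl -s http://localhost:9283/metrics | grep rbd_
</pre>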

h2. S3 Object Storage

This section is **UNDER CONSTRUCTION**

h3. Introduction

* See the "Red Hat manual":https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/object_gateway_guide_for_red_hat_enterprise_linux/overview-rgw
* The "ceph docs about object storage":https://docs.ceph.com/docs/mimic/radosgw/

h3. Architecture

* S3 requests are handled by a publicly accessible gateway, which also has access to the ceph cluster.
* s3 buckets are usually

h3. Authentication / Users

* Ceph *can* make use of LDAP as a backend
** However, it uses the clear text username+password as a token
** See https://docs.ceph.com/docs/mimic/radosgw/ldap-auth/
* We do not want users to store their regular account credentials on machines
* For this reason we use independent users / tokens, but with the same username as in LDAP

Creating a user:

<pre>
radosgw-admin user create --uid=USERNAME --display-name="Name of user"
</pre>

Listing users:

<pre>
radosgw-admin user list
</pre>

Deleting users and their storage:

<pre>
radosgw-admin user rm --uid=USERNAME --purge-data
</pre>
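
Looking up the access/secret keys of an existing user (standard radosgw-admin subcommand):

<pre>
radosgw-admin user info --uid=USERNAME
</pre>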

h3. Setting up S3 object storage on Ceph

* Set up a gateway node with Alpine Linux
** Change to the edge branch
** Enable the testing repository
* Update the firewall to allow access from this node to the ceph monitors
* Set up the wildcard DNS certificate (see the next section)

<pre>
apk add ceph-radosgw
</pre>

h3. Wildcard DNS certificate from letsencrypt

Acquiring and renewing this certificate is currently a manual process, as it requires changing DNS settings.

* run certbot
* update DNS with the first token
* update DNS with the second token

Sample session:

<pre>
s3:/etc/ceph# certbot certonly --manual --preferred-challenges=dns --email sre@ungleich.ch --server https://acme-v02.api.letsencrypt.org/directory --agree-tos 
-d *.s3.ungleich.ch -d s3.ungleich.ch
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Cert is due for renewal, auto-renewing...
Renewing an existing certificate
Performing the following challenges:
dns-01 challenge for s3.ungleich.ch
dns-01 challenge for s3.ungleich.ch

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
NOTE: The IP of this machine will be publicly logged as having requested this
certificate. If you're running certbot in manual mode on a machine that is not
your server, please ensure you're okay with that.

Are you OK with your IP being logged?
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(Y)es/(N)o: y

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.s3.ungleich.ch with the following value:

KxGLZNiVjFwz1ifNheoR_KQoPVpkvRUV1oT2pOvJlU0

Before continuing, verify the record is deployed.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.s3.ungleich.ch with the following value:

bkrhtxWZUipCAL5cBfvrjDuftqsZdQ2JjisiKmXBbaI

Before continuing, verify the record is deployed.
(This must be set up in addition to the previous challenges; do not remove,
replace, or undo the previous challenge tasks yet. Note that you might be
asked to create multiple distinct TXT records with the same name. This is
permitted by DNS standards.)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue
Waiting for verification...
Cleaning up challenges

IMPORTANT NOTES:
 - Congratulations! Your certificate and chain have been saved at:
   /etc/letsencrypt/live/s3.ungleich.ch/fullchain.pem
   Your key file has been saved at:
   /etc/letsencrypt/live/s3.ungleich.ch/privkey.pem
   Your cert will expire on 2020-12-09. To obtain a new or tweaked
   version of this certificate in the future, simply run certbot
   again. To non-interactively renew *all* of your certificates, run
   "certbot renew"
 - If you like Certbot, please consider supporting our work by:

   Donating to ISRG / Let's Encrypt:   https://letsencrypt.org/donate
   Donating to EFF:                    https://eff.org/donate-le

</pre>
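
How the certificate is then consumed by radosgw depends on the chosen frontend. As a sketch only (assuming the beast frontend; the section name client.rgw.s3 is a placeholder and the option names should be double-checked against the running ceph release), the paths issued above could be referenced in ceph.conf:

<pre>
[client.rgw.s3]
rgw_frontends = "beast ssl_port=443 ssl_certificate=/etc/letsencrypt/live/s3.ungleich.ch/fullchain.pem ssl_private_key=/etc/letsencrypt/live/s3.ungleich.ch/privkey.pem"
</pre>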

h2. Debugging ceph

<pre>
    ceph status
    ceph osd status
    ceph osd df
    ceph osd utilization
    ceph osd pool stats
    ceph osd tree
    ceph pg stat
</pre>

h3. How to list the version overview

This lists the versions of osds, mgrs and mons:

<pre>
ceph versions
</pre>

Listing the "features" of clients, osds, mgrs and mons can be done using @ceph features@:
543
544
<pre>
545
[15:32:20] red1.place5:~# ceph features
546
{
547
    "mon": [
548
        {
549
            "features": "0x3ffddff8ffecffff",
550
            "release": "luminous",
551
            "num": 5
552
        }
553
    ],
554
    "osd": [
555
        {
556
            "features": "0x3ffddff8ffecffff",
557
            "release": "luminous",
558
            "num": 44
559
        }
560
    ],
561
    "client": [
562
        {
563
            "features": "0x3ffddff8eea4fffb",
564
            "release": "luminous",
565
            "num": 4
566
        },
567
        {
568
            "features": "0x3ffddff8ffacffff",
569
            "release": "luminous",
570
            "num": 18
571
        },
572
        {
573
            "features": "0x3ffddff8ffecffff",
574
            "release": "luminous",
575
            "num": 31
576
        }
577
    ],
578
    "mgr": [
579
        {
580
            "features": "0x3ffddff8ffecffff",
581
            "release": "luminous",
582
            "num": 4
583
        }
584
    ]
585
}
586
587
</pre>
588
 
589 53 Nico Schottelius
590

h3. How to list the version of every OSD and every monitor

To list the version of each ceph OSD:

<pre>
ceph tell osd.* version
</pre>

To list the version of each ceph mon:

<pre>
ceph tell mon.* version
</pre>

The mgrs do not seem to support this command as of 14.2.21.

h2. Performance Tuning

* Ensure that the basic options for reducing rebalancing workload are set:

<pre>
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

* Ensure that "osd_op_queue_cut_off":https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_op_queue_cut_off is set to **high**
** Requires OSD restart on change

<pre>
ceph config set global osd_op_queue_cut_off high
</pre>
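
The currently configured value can be read back from the config database (the osd section inherits the global setting):

<pre>
ceph config get osd osd_op_queue_cut_off
</pre>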

<pre>
be sure to check your osd recovery sleep settings, there are several
depending on your underlying drives:

    "osd_recovery_sleep": "0.000000",
    "osd_recovery_sleep_hdd": "0.050000",
    "osd_recovery_sleep_hybrid": "0.050000",
    "osd_recovery_sleep_ssd": "0.050000",

Adjusting these upwards will dramatically reduce IO, and take effect
immediately at the cost of slowing rebalance/recovery.
</pre>

Reference settings from Frank Schilder:

<pre>
  osd       class:hdd      advanced osd_recovery_sleep                0.050000
  osd       class:rbd_data advanced osd_recovery_sleep                0.025000
  osd       class:rbd_meta advanced osd_recovery_sleep                0.002500
  osd       class:ssd      advanced osd_recovery_sleep                0.002500
  osd                      advanced osd_recovery_sleep                0.050000

  osd       class:hdd      advanced osd_max_backfills                 3
  osd       class:rbd_data advanced osd_max_backfills                 6
  osd       class:rbd_meta advanced osd_max_backfills                 12
  osd       class:ssd      advanced osd_max_backfills                 12
  osd                      advanced osd_max_backfills                 3

  osd       class:hdd      advanced osd_recovery_max_active           8
  osd       class:rbd_data advanced osd_recovery_max_active           16
  osd       class:rbd_meta advanced osd_recovery_max_active           32
  osd       class:ssd      advanced osd_recovery_max_active           32
  osd                      advanced osd_recovery_max_active           8
</pre>

(have not yet been tested in our clusters)
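
If we ever want to try them, the per-class values above map to config database masks of the form @osd/class:NAME@, e.g. (a sketch, not yet applied on our clusters):

<pre>
ceph config set osd/class:hdd osd_recovery_sleep 0.05
ceph config set osd/class:ssd osd_max_backfills 12
</pre>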

h2. Ceph theory

h3. How much data per Server?

Q: How much data should we add into one server?
A: Not more than it can handle.

How much data can a server handle? For this let's have a look at 2 scenarios:

* How long does it take to compensate the loss of the server?

* Assuming a server has X TiB storage in Y disks attached and a network speed of Z GiB/s.
* And our estimated rebuild goal is to compensate the loss of a server within U hours.

h4. Approach 1

Then the rebuild time is roughly the amount of data on the server divided by the network speed; this should stay below U hours.

Let's take an example:

* A server with @10 disks * 10 TiB@ = 100 TiB ≈ 100 000 GiB of data. It is network connected with 10 Gbit = 1.25 GiB/s.
* 100000/1.25 = 80000s = 22.22h

However, our logic assumes that we actually rebuild from the failed server, which... is failed.

h4. Approach 2: calculating with the remaining servers

However, we can also apply our logic to distribute
the rebuild over several servers that now pull in data from each other for rebuilding.
We need to *read* the data (100 TiB) from other servers and distribute it to new OSDs, assuming each server has a 10 Gbit/s
network connection.

Now the servers might need to *read* (get data from other osds) and *write* (send data to other osds). Luckily, networking is 10 Gbit/s duplex - i.e. in both directions.

However, how fast can we actually read data from the disks?

* SSDs are in the range of hundreds of MB/s (best case, not necessarily true for random reads)
* HDDs are in the range of tens of MB/s (depending on the work load, but 30-40 MB/s random reads seems realistic)

Further assumptions:

* Each disk should be dedicated at least one CPU core.

h3. Disk/SSD speeds

* Tuning for #8473 showed that a 10TB HDD can write up to 180-200MB/s when backfilling (at about 70% cpu usage and 20% disk usage), max backfills = 8
* Debugging SSD usage in #8461 showed SSDs can read about 470-520MB/s sequentially
* Debugging SSD usage in #8461 showed SSDs can write about 170-280MB/s sequentially
* Debugging SSD usage in #8461 showed SSDs can write about 4MB/s RANDOM (need to verify this even though 3 runs showed these numbers)

h3. Ceph theoretical foundations

If you are very much into the theoretical foundations of Ceph, check out their "paper":https://www3.nd.edu/~dthain/courses/cse40771/spring2007/papers/ceph.pdf