The ungleich ceph handbook » History » Version 72
Nico Schottelius, 01/30/2024 10:16 AM
| 1 | 1 | Nico Schottelius | h1. The ungleich ceph handbook |
|---|---|---|---|
| 2 | |||
| 3 | 3 | Nico Schottelius | {{toc}} |
| 4 | |||
| 5 | 1 | Nico Schottelius | h2. Status |
| 6 | |||
| 7 | 7 | Nico Schottelius | This document is **IN PRODUCTION**. |
| 8 | 1 | Nico Schottelius | |
| 9 | h2. Introduction |
||
| 10 | |||
| 11 | This article describes the ungleich storage architecture, which is based on ceph, together with the maintenance commands required to operate it. |
||
| 12 | |||
| 13 | 45 | Nico Schottelius | h2. Processes |
| 14 | |||
| 15 | h3. Usage monitoring |
||
| 16 | |||
| 17 | * Usage should be kept in the 70-75% range (see the check commands below) |
||
| 18 | * If usage reaches 72.5%, we start reducing usage by adding disks |
||
| 19 | * We stop when usage is below 70% |
||
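To check the current usage, the following can be run on any node with the admin keyring (a quick sketch):

<pre>
ceph df        # cluster-wide raw and per-pool usage
ceph osd df    # per-OSD usage and PG count
</pre>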
| 20 | |||
| 21 | h3. Phasing in new disks |
||
| 22 | |||
| 23 | * 24h performance test prior to using it (a possible test sketch below) |
||
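The handbook does not prescribe a test tool; one possible 24h random-I/O test uses fio (an assumption, and destructive for any data on the disk, so only for disks not yet in use; /dev/sdX is a placeholder):

<pre>
fio --name=burnin --filename=/dev/sdX --direct=1 --ioengine=libaio \
    --rw=randrw --bs=4k --iodepth=32 --time_based --runtime=86400
</pre>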
| 24 | |||
| 25 | h3. Phasing in new servers |
||
| 26 | |||
| 27 | * 24h performance test with 1 ssd or 1 hdd (whatever is applicable) |
||
| 28 | |||
| 29 | |||
| 30 | 1 | Nico Schottelius | h2. Communication guide |
| 31 | |||
| 32 | Usually, when a single disk fails, no customer communication is necessary, as the loss is automatically compensated/rebalanced by ceph. However, if multiple disks fail at the same time, I/O speed might be reduced and thus customer experience impacted. |
||
| 33 | |||
| 34 | For this reason, communicate with customers whenever I/O recovery settings are temporarily tuned. |
||
| 35 | |||
| 36 | 20 | Nico Schottelius | h2. Analysing |
| 37 | |||
| 38 | 21 | Nico Schottelius | h3. ceph osd df tree |
| 39 | 20 | Nico Schottelius | |
| 40 | Using @ceph osd df tree@ you can see not only the disk usage per OSD, but also the number of PGs on an OSD. This is especially useful to see how the OSDs are balanced. |
||
| 41 | |||
| 42 | 22 | Nico Schottelius | h3. Find out the device of an OSD |
| 43 | |||
| 44 | Use @mount | grep /var/lib/ceph/osd/ceph-OSDID@ on the server on which the OSD is located: |
||
| 45 | |||
| 46 | <pre> |
||
| 47 | |||
| 48 | [16:01:23] server2.place6:~# mount | grep /var/lib/ceph/osd/ceph-31 |
||
| 49 | /dev/sdk1 on /var/lib/ceph/osd/ceph-31 type xfs (rw,relatime,attr2,inode64,noquota) |
||
| 50 | </pre> |
||
| 51 | |||
| 52 | 57 | Nico Schottelius | h3. Show config |
| 53 | |||
| 54 | <pre> |
||
| 55 | ceph config dump |
||
| 56 | </pre> |
||
| 57 | |||
| 58 | 58 | Nico Schottelius | h3. Show backfill and recovery config |
| 59 | |||
| 60 | <pre> |
||
| 61 | ceph-conf --show-config | egrep "osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills" |
||
| 62 | </pre> |
||
| 63 | |||
| 64 | 59 | Nico Schottelius | * See also: https://www.suse.com/support/kb/doc/?id=000019693 |
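
On clusters with the config database, the currently active values can also be queried from a running daemon (a sketch; osd.0 is only an example id):

<pre>
ceph config show osd.0 osd_max_backfills
ceph config show osd.0 osd_recovery_max_active
</pre>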
| 65 | |||
| 66 | 63 | Nico Schottelius | h3. Checking and clearing crash reports |
| 67 | |||
| 68 | If the cluster reports HEALTH_WARN together with a recent crash, such as: |
||
| 69 | |||
| 70 | <pre> |
||
| 71 | [rook@rook-ceph-tools-f569797b4-z4542 /]$ ceph -s |
||
| 72 | cluster: |
||
| 73 | id: ... |
||
| 74 | health: HEALTH_WARN |
||
| 75 | 1 daemons have recently crashed |
||
| 76 | </pre> |
||
| 77 | |||
| 78 | One can analyse it as follows: |
||
| 79 | |||
| 80 | * List the crashes: @ceph crash ls@ |
||
| 81 | * Checkout the details: @ceph crash info <id>@ |
||
| 82 | |||
| 83 | To archive the error: |
||
| 84 | |||
| 85 | * To archive a specific report: @ceph crash archive <id>@ |
||
| 86 | * To archive all: @ceph crash archive-all@ |
||
| 87 | |||
| 88 | After archiving, the cluster health should return to HEALTH_OK: |
||
| 89 | |||
| 90 | <pre> |
||
| 91 | [rook@rook-ceph-tools-f569797b4-z4542 /]$ ceph crash ls |
||
| 92 | ID ENTITY NEW |
||
| 93 | 2022-09-19T04:33:19.378500Z_b2e26755-0712-41de-bf2b-b370dbe94ebc mon.c * |
||
| 94 | [rook@rook-ceph-tools-f569797b4-z4542 /]$ ceph crash archive 2022-09-19T04:33:19.378500Z_b2e26755-0712-41de-bf2b-b370dbe94ebc |
||
| 95 | [rook@rook-ceph-tools-f569797b4-z4542 /]$ ceph -s |
||
| 96 | cluster: |
||
| 97 | id: .. |
||
| 98 | health: HEALTH_OK |
||
| 99 | |||
| 100 | </pre> |
||
| 101 | |||
| 102 | 68 | Nico Schottelius | h3. Low monitor space warning |
| 103 | |||
| 104 | If you see |
||
| 105 | |||
| 106 | <pre> |
||
| 107 | [rook@rook-ceph-tools-6bdf996-8g792 /]$ ceph health detail |
||
| 108 | HEALTH_WARN mon q is low on available space |
||
| 109 | [WRN] MON_DISK_LOW: mon q is low on available space |
||
| 110 | mon.q has 29% avail |
||
| 111 | |||
| 112 | </pre> |
||
| 113 | |||
| 114 | there are two options to fix it: |
||
| 115 | |||
| 116 | * a) free up space |
||
| 117 | * b) adjust the warning threshold @mon_data_avail_warn@ (see the sketch below) |
||
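
For option b) the warning threshold can be changed via the config database (a sketch; the value is a percentage of free space, the default is 30):

<pre>
ceph config set mon mon_data_avail_warn 15
</pre>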
| 118 | |||
| 119 | 2 | Nico Schottelius | h2. Adding a new disk/ssd to the ceph cluster |
| 120 | 1 | Nico Schottelius | |
| 121 | 25 | Jin-Guk Kwon | Write the order / purchase date on each disk with a permanent marker. |
| 122 | |||
| 123 | 46 | Nico Schottelius | h3. Checking the shadow trees |
| 124 | |||
| 125 | To spot differences in host weights, it can be very helpful to look at the crush shadow tree |
||
| 126 | using @ceph osd crush tree --show-shadow@: |
||
| 127 | |||
| 128 | <pre> |
||
| 129 | -16 hdd-big 653.03418 root default~hdd-big |
||
| 130 | -34 hdd-big 0 0 host server14~hdd-big |
||
| 131 | -38 hdd-big 0 0 host server15~hdd-big |
||
| 132 | -42 hdd-big 81.86153 78.28352 host server17~hdd-big |
||
| 133 | 36 hdd-big 9.09560 9.09560 osd.36 |
||
| 134 | 59 hdd-big 9.09499 9.09499 osd.59 |
||
| 135 | 60 hdd-big 9.09499 9.09499 osd.60 |
||
| 136 | 68 hdd-big 9.09599 8.93999 osd.68 |
||
| 137 | 69 hdd-big 9.09599 7.65999 osd.69 |
||
| 138 | 70 hdd-big 9.09599 8.35899 osd.70 |
||
| 139 | 71 hdd-big 9.09599 8.56000 osd.71 |
||
| 140 | 72 hdd-big 9.09599 8.93700 osd.72 |
||
| 141 | 73 hdd-big 9.09599 8.54199 osd.73 |
||
| 142 | -46 hdd-big 90.94986 90.94986 host server18~hdd-big |
||
| 143 | ... |
||
| 144 | </pre> |
||
| 145 | |||
| 146 | |||
| 147 | Here we can see that the weight of server17 for the class hdd-big is about 81, while that of server18 is about 90. |
||
| 148 | SSDs and other classes have their own shadow trees, too. |
||
| 149 | |||
| 150 | 2 | Nico Schottelius | h3. For Dell servers |
| 151 | |||
| 152 | First find the disk and then add it to the operating system |
||
| 153 | |||
| 154 | <pre> |
||
| 155 | megacli -PDList -aALL | grep -B16 -i unconfigur |
||
| 156 | |||
| 157 | # Sample output: |
||
| 158 | [19:46:50] server7.place6:~# megacli -PDList -aALL | grep -B16 -i unconfigur |
||
| 159 | Enclosure Device ID: N/A |
||
| 160 | Slot Number: 0 |
||
| 161 | Enclosure position: N/A |
||
| 162 | Device Id: 0 |
||
| 163 | WWN: 0000000000000000 |
||
| 164 | Sequence Number: 1 |
||
| 165 | Media Error Count: 0 |
||
| 166 | Other Error Count: 0 |
||
| 167 | Predictive Failure Count: 0 |
||
| 168 | Last Predictive Failure Event Seq Number: 0 |
||
| 169 | PD Type: SATA |
||
| 170 | |||
| 171 | Raw Size: 894.252 GB [0x6fc81ab0 Sectors] |
||
| 172 | Non Coerced Size: 893.752 GB [0x6fb81ab0 Sectors] |
||
| 173 | Coerced Size: 893.75 GB [0x6fb80000 Sectors] |
||
| 174 | Sector Size: 0 |
||
| 175 | Firmware state: Unconfigured(good), Spun Up |
||
| 176 | </pre> |
||
| 177 | |||
| 178 | Then add the disk to the OS: |
||
| 179 | |||
| 180 | <pre> |
||
| 181 | 26 | ll nu | megacli -CfgLdAdd -r0 [Enclosure Device ID:slot] -aX (X : host is 0. md-array is 1) |
| 182 | 2 | Nico Schottelius | |
| 183 | # Sample call, if enclosure and slot are KNOWN (aka not N/A) |
||
| 184 | megacli -CfgLdAdd -r0 [32:0] -a0 |
||
| 185 | |||
| 186 | # Sample call, if enclosure is N/A |
||
| 187 | 1 | Nico Schottelius | megacli -CfgLdAdd -r0 [:0] -a0 |
| 188 | 25 | Jin-Guk Kwon | </pre> |
| 189 | |||
| 190 | Then check the disk: |
||
| 191 | |||
| 192 | <pre> |
||
| 193 | fdisk -l |
||
| 194 | [11:26:23] server2.place6:~# fdisk -l |
||
| 195 | ...... |
||
| 196 | Disk /dev/sdh: 7.3 TiB, 8000987201536 bytes, 15626928128 sectors |
||
| 197 | Units: sectors of 1 * 512 = 512 bytes |
||
| 198 | Sector size (logical/physical): 512 bytes / 512 bytes |
||
| 199 | I/O size (minimum/optimal): 512 bytes / 512 bytes |
||
| 200 | [11:27:24] server2.place6:~# |
||
| 201 | </pre> |
||
| 202 | |||
| 203 | Then create a GPT partition table: |
||
| 204 | |||
| 205 | <pre> |
||
| 206 | /opt/ungleich-tools/disk-create-fresh-gpt /dev/XXX |
||
| 207 | [11:31:10] server2.place6:~# /opt/ungleich-tools/disk-create-fresh-gpt /dev/sdh |
||
| 208 | ...... |
||
| 209 | Created a new DOS disklabel with disk identifier 0x9c4a0355. |
||
| 210 | Command (m for help): Created a new GPT disklabel (GUID: 374E31AD-7B96-4837-B5ED-7B22C452899E). |
||
| 211 | ...... |
||
| 212 | </pre> |
||
| 213 | |||
| 214 | Then create the OSD for ssd/hdd-big: |
||
| 215 | |||
| 216 | <pre> |
||
| 217 | /opt/ungleich-tools/ceph-osd-create-start /dev/XXX XXX (ssd or hdd-big) |
||
| 218 | [11:33:58] server2.place6:~# /opt/ungleich-tools/ceph-osd-create-start /dev/sdh hdd-big |
||
| 219 | + set -e |
||
| 220 | + [ 2 -lt 2 ] |
||
| 221 | ...... |
||
| 222 | + /opt/ungleich-tools/monit-ceph-create-start osd.14 |
||
| 223 | osd.14 |
||
| 224 | [ ok ] Restarting daemon monitor: monit. |
||
| 225 | [11:36:14] server2.place6:~# |
||
| 226 | </pre> |
||
| 227 | |||
| 228 | Then check the rebalancing (if you want to add another disk, wait until rebalancing has finished): |
||
| 229 | |||
| 230 | <pre> |
||
| 231 | ceph -s |
||
| 232 | [12:37:57] server2.place6:~# ceph -s |
||
| 233 | cluster: |
||
| 234 | id: 1ccd84f6-e362-4c50-9ffe-59436745e445 |
||
| 235 | health: HEALTH_WARN |
||
| 236 | 2248811/49628409 objects misplaced (4.531%) |
||
| 237 | ...... |
||
| 238 | io: |
||
| 239 | client: 170KiB/s rd, 35.0MiB/s wr, 463op/s rd, 728op/s wr |
||
| 240 | recovery: 27.1MiB/s, 6objects/s |
||
| 241 | 1 | Nico Schottelius | [12:49:41] server2.place6:~# |
| 242 | 64 | Nico Schottelius | </pre> |
| 243 | |||
| 244 | 66 | Nico Schottelius | h3. For HP servers (hpacucli) |
| 245 | 64 | Nico Schottelius | |
| 246 | * Ensure the module "sg" has been loaded |
||
| 247 | |||
| 248 | Use the following to verify that the controller is detected: |
||
| 249 | |||
| 250 | <pre> |
||
| 251 | # hpacucli controller all show |
||
| 252 | |||
| 253 | Smart Array P420i in Slot 0 (Embedded) (sn: 001438033ECEF60) |
||
| 254 | 2 | Nico Schottelius | </pre> |
| 255 | |||
| 256 | 65 | Nico Schottelius | |
| 257 | h4. Show all disks from controller on slot 0 |
||
| 258 | |||
| 259 | <pre> |
||
| 260 | hpacucli controller slot=0 physicaldrive all show |
||
| 261 | </pre> |
||
| 262 | |||
| 263 | Example |
||
| 264 | |||
| 265 | <pre> |
||
| 266 | # hpacucli controller slot=0 physicaldrive all show |
||
| 267 | |||
| 268 | Smart Array P420i in Slot 0 (Embedded) |
||
| 269 | |||
| 270 | array A |
||
| 271 | |||
| 272 | physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 12000.1 GB, OK) |
||
| 273 | |||
| 274 | array B |
||
| 275 | |||
| 276 | physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 12000.1 GB, OK) |
||
| 277 | |||
| 278 | unassigned |
||
| 279 | |||
| 280 | physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA, 10000.8 GB, OK) |
||
| 281 | |||
| 282 | root@ungleich-hardware-server97:/# |
||
| 283 | |||
| 284 | </pre> |
||
| 285 | |||
| 286 | In this example the last disk has not been assigned yet. |
||
| 287 | |||
| 288 | h4. Create RAID 0 for ceph |
||
| 289 | |||
| 290 | For ceph we want a raid 0 over 1 disk to expose the disk to the OS. |
||
| 291 | |||
| 292 | This can be done using the following command: |
||
| 293 | |||
| 294 | <pre> |
||
| 295 | hpacucli controller slot=0 create type=ld drives=$DRIVEID raid=0 |
||
| 296 | </pre> |
||
| 297 | |||
| 298 | For example: |
||
| 299 | |||
| 300 | <pre> |
||
| 301 | hpacucli controller slot=0 create type=ld drives=1I:1:3 raid=0 |
||
| 302 | </pre> |
||
| 303 | |||
| 304 | h4. Show the controller configuration |
||
| 305 | |||
| 306 | <pre> |
||
| 307 | hpacucli controller slot=0 show config |
||
| 308 | </pre> |
||
| 309 | |||
| 310 | For example: |
||
| 311 | |||
| 312 | <pre> |
||
| 313 | # hpacucli controller slot=0 show config |
||
| 314 | |||
| 315 | Smart Array P420i in Slot 0 (Embedded) (sn: 001438033ECEF60) |
||
| 316 | |||
| 317 | array A (SATA, Unused Space: 0 MB) |
||
| 318 | |||
| 319 | |||
| 320 | logicaldrive 1 (10.9 TB, RAID 0, OK) |
||
| 321 | |||
| 322 | physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 12000.1 GB, OK) |
||
| 323 | |||
| 324 | array B (SATA, Unused Space: 0 MB) |
||
| 325 | |||
| 326 | |||
| 327 | logicaldrive 2 (10.9 TB, RAID 0, OK) |
||
| 328 | |||
| 329 | physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 12000.1 GB, OK) |
||
| 330 | |||
| 331 | array C (SATA, Unused Space: 0 MB) |
||
| 332 | |||
| 333 | |||
| 334 | logicaldrive 3 (9.1 TB, RAID 0, OK) |
||
| 335 | |||
| 336 | physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA, 10000.8 GB, OK) |
||
| 337 | |||
| 338 | Expander 380 (WWID: 50014380324EBFE0, Port: 1I, Box: 1) |
||
| 339 | |||
| 340 | Enclosure SEP (Vendor ID HP, Model Gen8 ServBP 12+2) 378 (WWID: 50014380324EBFF9, Port: 1I, Box: 1) |
||
| 341 | 1 | Nico Schottelius | |
| 342 | SEP (Vendor ID PMCSIERA, Model SRCv8x6G) 379 (WWID: 5001438033ECEF6F) |
||
| 343 | 71 | Nico Schottelius | </pre> |
| 344 | |||
| 345 | h3. Removing signatures preventing disk being used by ceph |
||
| 346 | |||
| 347 | If you see |
||
| 348 | |||
| 349 | <pre> |
||
| 350 | cephosd: skipping device "sdX" because it contains a filesystem "ddf_raid_member" |
||
| 351 | </pre> |
||
| 352 | |||
| 353 | you can clean it with wipefs: |
||
| 354 | |||
| 355 | <pre> |
||
| 356 | [20:47] server98.place10:~# wipefs /dev/sde |
||
| 357 | DEVICE OFFSET TYPE UUID LABEL |
||
| 358 | sde 0xae9fffffe00 ddf_raid_member Dell \x10 |
||
| 359 | [20:48] server98.place10:~# wipefs -a /dev/sde |
||
| 360 | /dev/sde: 4 bytes were erased at offset 0xae9fffffe00 (ddf_raid_member): de 11 de 11 |
||
| 361 | [20:48] server98.place10:~# |
||
| 362 | |||
| 363 | 65 | Nico Schottelius | </pre> |
| 364 | |||
| 365 | 1 | Nico Schottelius | h2. Moving a disk/ssd to another server |
| 366 | 4 | Nico Schottelius | |
| 367 | (needs to be described better) |
||
| 368 | |||
| 369 | Generally speaking: |
||
| 370 | |||
| 371 | 27 | ll nu | * (needs to be tested) disable recovery so data won't start to move while you have the osd down (see the flag sketch below) |
| 372 | 9 | Nico Schottelius | * /opt/ungleich-tools/ceph-osd-stop-disable does the following: |
| 373 | ** Stop the osd, remove monit on the server you want to take it out |
||
| 374 | ** umount the disk |
||
| 375 | 1 | Nico Schottelius | * Take disk out |
| 376 | * Discard the preserved cache on the server you took it out of |
||
| 377 | 54 | Nico Schottelius | ** using megacli: @megacli -DiscardPreservedCache -Lall -aAll@ |
| 378 | 1 | Nico Schottelius | * Insert into new server |
| 379 | 9 | Nico Schottelius | * Clear foreign configuration |
| 380 | 54 | Nico Schottelius | ** using megacli: @megacli -CfgForeign -Clear -aAll@ |
| 381 | 9 | Nico Schottelius | * Disk will now appear in the OS, ceph/udev will automatically start the OSD (!) |
| 382 | ** No creating of the osd required! |
||
| 383 | * Verify that the disk exists and that the osd is started |
||
| 384 | ** using *ps aux* |
||
| 385 | ** using *ceph osd tree* |
||
| 386 | 10 | Nico Schottelius | * */opt/ungleich-tools/monit-ceph-create-start osd.XX* # where XX is the osd number |
| 387 | 9 | Nico Schottelius | ** Creates the monit configuration file so that monit watches the OSD |
| 388 | ** Reload monit |
||
| 389 | 11 | Nico Schottelius | * Verify monit using *monit status* |
| 390 | 1 | Nico Schottelius | |
| 391 | 72 | Nico Schottelius | <pre> |
| 392 | megacli -DiscardPreservedCache -Lall -aAll |
||
| 393 | megacli -CfgForeign -Clear -aAll |
||
| 394 | </pre> |
||
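
If recovery should indeed be held back while the OSD is offline (untested here, as noted above), the standard cluster flags can be set temporarily (a sketch):

<pre>
ceph osd set noout
ceph osd set norebalance
# ... move the disk and bring the OSD back up ...
ceph osd unset norebalance
ceph osd unset noout
</pre>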
| 395 | |||
| 396 | 56 | Nico Schottelius | h2. OSD related processes |
| 397 | 1 | Nico Schottelius | |
| 398 | 56 | Nico Schottelius | h3. Removing a disk/ssd |
| 399 | |||
| 400 | 1 | Nico Schottelius | To permanently remove a failed disk from a cluster, use ***ceph-osd-stop-remove-permanently*** from the ungleich-tools repo. Warning: if the disk is still active, the OSD will be shut down AND removed from the cluster -> all data of that disk will need to be rebalanced. |
| 401 | |||
| 402 | 56 | Nico Schottelius | h3. Handling DOWN osds with filesystem errors |
| 403 | 1 | Nico Schottelius | |
| 404 | If an email arrives with the subject "monit alert -- Does not exist osd.XX-whoami", the filesystem of an OSD cannot be read anymore. It is very likely that the disk / ssd is broken. Steps that need to be done: |
||
| 405 | |||
| 406 | * Login to any ceph monitor (cephX.placeY.ungleich.ch) |
||
| 407 | * Check **ceph -s**, find host using **ceph osd tree** |
||
| 408 | * Login to the affected host |
||
| 409 | * Run the following commands: |
||
| 410 | ** ls /var/lib/ceph/osd/ceph-XX |
||
| 411 | ** dmesg |
||
| 412 | 24 | Jin-Guk Kwon | <pre> |
| 413 | Example: after checking the dmesg output, you can proceed to the next step |
||
| 414 | [204696.406756] XFS (sdl1): metadata I/O error: block 0x19100 ("xlog_iodone") error 5 numblks 64 |
||
| 415 | [204696.408094] XFS (sdl1): xfs_do_force_shutdown(0x2) called from line 1233 of file /build/linux-BsFdsw/linux-4.9.65/fs/xfs/xfs_log.c. Return address = 0xffffffffc08eb612 |
||
| 416 | [204696.410702] XFS (sdl1): Log I/O Error Detected. Shutting down filesystem |
||
| 417 | [204696.411977] XFS (sdl1): Please umount the filesystem and rectify the problem( |
||
| 418 | </pre> |
||
| 419 | |||
| 420 | 1 | Nico Schottelius | * Create a new ticket in the datacenter light project |
| 421 | ** Subject: "Replace broken OSD.XX on serverX.placeY.ungleich.ch" |
||
| 422 | ** Add (partial) output of above commands |
||
| 423 | ** Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently XX, where XX is the osd id, to remove the disk from the cluster |
||
| 424 | ** Remove the physical disk from the host, check whether it is still under warranty, and if so |
||
| 425 | *** Create a short letter to the vendor, including the technical details from above |
||
| 426 | *** Record when you sent it in |
||
| 427 | *** Put ticket into status waiting |
||
| 428 | ** If there is no warranty, dispose of it |
||
| 429 | |||
| 430 | 56 | Nico Schottelius | h3. [[Create new pool and place new osd]] |
| 431 | |||
| 432 | h3. Configuring auto repair on pgs |
||
| 433 | |||
| 434 | <pre> |
||
| 435 | ceph config set osd osd_scrub_auto_repair true |
||
| 436 | </pre> |
||
| 437 | |||
| 438 | Verify using: |
||
| 439 | |||
| 440 | <pre> |
||
| 441 | ceph config dump |
||
| 442 | </pre> |
||
| 443 | 39 | Jin-Guk Kwon | |
| 444 | 67 | Nico Schottelius | h3. Change the device class of an OSD |
| 445 | |||
| 446 | <pre> |
||
| 447 | OSD=XX |
||
| 448 | NEWCLASS=ZZ |
||
| 449 | |||
| 450 | # Set new device class to "ssd" |
||
| 451 | ceph osd crush rm-device-class osd.$OSD |
||
| 452 | ceph osd crush set-device-class $NEWCLASS osd.$OSD |
||
| 453 | </pre> |
||
| 454 | |||
| 455 | * Found on https://arpnetworks.com/blog/2019/06/28/how-to-update-the-device-class-on-a-ceph-osd.html |
||
| 456 | |||
| 457 | 70 | Nico Schottelius | h2. Managing ceph Daemon crashes |
| 458 | |||
| 459 | If there is a warning about crashed daemons, they can be displayed and archived as follows: |
||
| 460 | |||
| 461 | * @ceph crash ls@ |
||
| 462 | * @ceph crash info <id>@ |
||
| 463 | * @ceph crash archive <id>@ |
||
| 464 | * @ceph crash archive-all@ |
||
| 465 | |||
| 466 | Summary originally found on https://forum.proxmox.com/threads/health_warn-1-daemons-have-recently-crashed.63105/ |
||
| 467 | |||
| 468 | 1 | Nico Schottelius | h2. Change ceph speed for i/o recovery |
| 469 | |||
| 470 | By default we want to keep I/O recovery traffic low to not impact customer experience. However, when multiple disks fail at the same time, we might want to prioritise recovery for data safety over performance. |
||
| 471 | |||
| 472 | The default configuration on our servers contains: |
||
| 473 | |||
| 474 | <pre> |
||
| 475 | [osd] |
||
| 476 | osd max backfills = 1 |
||
| 477 | osd recovery max active = 1 |
||
| 478 | osd recovery op priority = 2 |
||
| 479 | </pre> |
||
| 480 | |||
| 481 | The important settings are *osd max backfills* and *osd recovery max active*, the priority is always kept low so that regular I/O has priority. |
||
| 482 | |||
| 483 | To adjust the number of backfills *per osd* and to change the *number of threads* used for recovery, we can run the following on any node with the admin keyring: |
||
| 484 | |||
| 485 | <pre> |
||
| 486 | ceph tell osd.* injectargs '--osd-max-backfills Y' |
||
| 487 | ceph tell osd.* injectargs '--osd-recovery-max-active X' |
||
| 488 | </pre> |
||
| 489 | |||
| 490 | where Y and X are the values that we want to use. Experience shows that Y=5 and X=5 doubles to triples the recovery performance, whereas X=10 and Y=10 increases recovery performance 5 times. |
||
| 491 | |||
| 492 | 69 | Nico Schottelius | This can also be combined in one command: |
| 493 | |||
| 494 | <pre> |
||
| 495 | ceph tell osd.* injectargs '--osd-max-backfills Y' '--osd-recovery-max-active X' |
||
| 496 | |||
| 497 | # f.i.: reset to 1 |
||
| 498 | ceph tell osd.* injectargs '--osd-max-backfills 1' '--osd-recovery-max-active 1' |
||
| 499 | |||
| 500 | # f.i.: set to 4 |
||
| 501 | ceph tell osd.* injectargs '--osd-max-backfills 4' '--osd-recovery-max-active 4' |
||
| 502 | |||
| 503 | </pre> |
||
| 504 | |||
| 505 | 1 | Nico Schottelius | h2. Debug scrub errors / inconsistent pg message |
| 506 | 6 | Nico Schottelius | |
| 507 | 1 | Nico Schottelius | From time to time disks don't save what they are told to save. Ceph scrubbing detects these errors and switches to HEALTH_ERR. Use *ceph health detail* to find out which placement groups (*pgs*) are affected. Usually a ***ceph pg repair <number>*** fixes the problem. |
| 508 | |||
| 509 | If this does not help, consult https://ceph.com/geen-categorie/ceph-manually-repair-object/. |
||
| 510 | 12 | Nico Schottelius | |
| 511 | h2. Move servers into the osd tree |
||
| 512 | |||
| 513 | New servers have their buckets placed outside the **default root** and thus need to be moved inside. |
||
| 514 | Output might look as follows: |
||
| 515 | |||
| 516 | <pre> |
||
| 517 | [11:19:27] server5.place6:~# ceph osd tree |
||
| 518 | ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF |
||
| 519 | -3 0.87270 host server5 |
||
| 520 | 41 ssd 0.87270 osd.41 up 1.00000 1.00000 |
||
| 521 | -1 251.85580 root default |
||
| 522 | -7 81.56271 host server2 |
||
| 523 | 0 hdd-big 9.09511 osd.0 up 1.00000 1.00000 |
||
| 524 | 5 hdd-big 9.09511 osd.5 up 1.00000 1.00000 |
||
| 525 | ... |
||
| 526 | </pre> |
||
| 527 | |||
| 528 | |||
| 529 | Use **ceph osd crush move serverX root=default** (where serverX is the new server), |
||
| 530 | which will move the bucket in the right place: |
||
| 531 | |||
| 532 | <pre> |
||
| 533 | [11:21:17] server5.place6:~# ceph osd crush move server5 root=default |
||
| 534 | moved item id -3 name 'server5' to location {root=default} in crush map |
||
| 535 | [11:32:12] server5.place6:~# ceph osd tree |
||
| 536 | ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF |
||
| 537 | -1 252.72850 root default |
||
| 538 | ... |
||
| 539 | -3 0.87270 host server5 |
||
| 540 | 41 ssd 0.87270 osd.41 up 1.00000 1.00000 |
||
| 541 | |||
| 542 | |||
| 543 | </pre> |
||
| 544 | 13 | Nico Schottelius | |
| 545 | h2. How to fix existing osds with wrong partition layout |
||
| 546 | |||
| 547 | In the first version of DCL we used a filestore-based layout with 3 partitions. |
||
| 548 | In the second version of DCL, including OSD autodetection, we use a bluestore-based layout with 2 partitions. |
||
| 549 | |||
| 550 | To convert, we delete the old OSD, clean the partitions and create a new osd: |
||
| 551 | |||
| 552 | 14 | Nico Schottelius | h3. Inactive OSD |
| 553 | 1 | Nico Schottelius | |
| 554 | 14 | Nico Schottelius | If the OSD is *not active*, we can do the following: |
| 555 | |||
| 556 | 13 | Nico Schottelius | * Find the OSD number: mount the partition and find the whoami file |
| 557 | |||
| 558 | <pre> |
||
| 559 | root@server2:/opt/ungleich-tools# mount /dev/sda2 /mnt/ |
||
| 560 | root@server2:/opt/ungleich-tools# cat /mnt/whoami |
||
| 561 | 0 |
||
| 562 | root@server2:/opt/ungleich-tools# umount /mnt/ |
||
| 563 | |||
| 564 | </pre> |
||
| 565 | |||
| 566 | * Verify in the *ceph osd tree* that the OSD is on that server |
||
| 567 | * Deleting the OSD |
||
| 568 | ** ceph osd crush remove $osd_name |
||
| 569 | 1 | Nico Schottelius | ** ceph osd rm $osd_name |
| 570 | 14 | Nico Schottelius | |
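Consolidated as a sketch (assuming the whoami file contained 0):

<pre>
ceph osd crush remove osd.0
ceph osd rm osd.0
</pre>
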
| 571 | Then continue below as described in "Recreating the OSD". |
||
| 572 | |||
| 573 | h3. Remove Active OSD |
||
| 574 | |||
| 575 | * Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently OSDID to stop and remove the OSD |
||
| 576 | * Then continue below as described in "Recreating the OSD". |
||
| 577 | |||
| 578 | |||
| 579 | h3. Recreating the OSD |
||
| 580 | |||
| 581 | 13 | Nico Schottelius | * Create an empty partition table |
| 582 | ** fdisk /dev/sdX |
||
| 583 | ** g |
||
| 584 | ** w |
||
| 585 | * Create a new OSD |
||
| 586 | ** /opt/ungleich-tools/ceph-osd-create-start /dev/sdX CLASS # use hdd, ssd, ... for the CLASS |
||
| 587 | 15 | Jin-Guk Kwon | |
| 588 | h2. How to fix unfound pg |
||
| 589 | |||
| 590 | refer to https://redmine.ungleich.ch/issues/6388 |
||
| 591 | 16 | Jin-Guk Kwon | |
| 592 | * Check health state |
||
| 593 | ** ceph health detail |
||
| 594 | * Check which server has that osd |
||
| 595 | ** ceph osd tree |
||
| 596 | * Check which VMs are running on that server |
||
| 597 | 17 | Jin-Guk Kwon | ** virsh list |
| 598 | 16 | Jin-Guk Kwon | * Check pg map |
| 599 | 17 | Jin-Guk Kwon | ** ceph osd map [osd pool] [VMID] |
| 600 | 18 | Jin-Guk Kwon | * revert pg |
| 601 | ** ceph pg [PGID] mark_unfound_lost revert |
||
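
The same steps as a command sketch (pool, object and pg id are placeholders):

<pre>
ceph health detail                          # find the affected pg
ceph osd tree                               # find the server holding the osd
virsh list                                  # on that server: which VMs run there
ceph osd map <pool> <object>                # check the pg mapping
ceph pg <PGID> mark_unfound_lost revert     # revert the unfound objects
</pre>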
| 602 | 28 | Nico Schottelius | |
| 603 | 60 | Nico Schottelius | h2. Phasing out OSDs |
| 604 | |||
| 605 | 61 | Nico Schottelius | * Either directly via /opt/ungleich-tools/ceph/ceph-osd-stop-remove-permanently |
| 606 | 62 | Nico Schottelius | * Or first draining it using @ceph osd crush reweight osd.XX 0@ |
| 607 | 60 | Nico Schottelius | ** Wait until rebalance done |
| 608 | 61 | Nico Schottelius | ** Then remove it (see the sketch below) |
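
The draining variant as a sketch (XX is the numeric osd id):

<pre>
ceph osd crush reweight osd.XX 0
# wait until "ceph -s" shows the rebalance as finished, then:
/opt/ungleich-tools/ceph/ceph-osd-stop-remove-permanently XX
</pre>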
| 609 | 60 | Nico Schottelius | |
| 610 | 28 | Nico Schottelius | h2. Enabling per image RBD statistics for prometheus |
| 611 | |||
| 612 | |||
| 613 | <pre> |
||
| 614 | [20:26:57] red2.place5:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "one,hdd" |
||
| 615 | [20:27:57] black2.place6:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "hdd,ssd" |
||
| 616 | </pre> |
||
| 617 | 29 | Nico Schottelius | |
| 618 | h2. S3 Object Storage |
||
| 619 | |||
| 620 | 36 | Nico Schottelius | This section is ** UNDER CONSTRUCTION ** |
| 621 | |||
| 622 | 29 | Nico Schottelius | h3. Introduction |
| 623 | 1 | Nico Schottelius | |
| 624 | 30 | Nico Schottelius | * See the "Red Hat manual":https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/object_gateway_guide_for_red_hat_enterprise_linux/overview-rgw |
| 625 | * The "ceph docs about object storage":https://docs.ceph.com/docs/mimic/radosgw/ |
||
| 626 | 29 | Nico Schottelius | |
| 627 | h3. Architecture |
||
| 628 | |||
| 629 | * S3 requests are handled by a publicly accessible gateway, which also has access to the ceph cluster. |
||
| 630 | 34 | Nico Schottelius | * s3 buckets are usually |
| 631 | 29 | Nico Schottelius | |
| 632 | 32 | Nico Schottelius | h3. Authentication / Users |
| 633 | |||
| 634 | * Ceph *can* make use of LDAP as a backend |
||
| 635 | 1 | Nico Schottelius | ** However it uses the clear text username+password as a token |
| 636 | 34 | Nico Schottelius | ** See https://docs.ceph.com/docs/mimic/radosgw/ldap-auth/ |
| 637 | 32 | Nico Schottelius | * We do not want users to store their regular account credentials on machines |
| 638 | * For this reason we use independent users / tokens, but with the same username as in LDAP |
||
| 639 | |||
| 640 | 38 | Nico Schottelius | Creating a user: |
| 641 | |||
| 642 | <pre> |
||
| 643 | radosgw-admin user create --uid=USERNAME --display-name="Name of user" |
||
| 644 | </pre> |
||
| 645 | |||
| 646 | |||
| 647 | Listing users: |
||
| 648 | |||
| 649 | <pre> |
||
| 650 | radosgw-admin user list |
||
| 651 | </pre> |
||
| 652 | |||
| 653 | |||
| 654 | Deleting users and their storage: |
||
| 655 | |||
| 656 | <pre> |
||
| 657 | radosgw-admin user rm --uid=USERNAME --purge-data |
||
| 658 | </pre> |
||
| 659 | |||
| 660 | 1 | Nico Schottelius | h3. Setting up S3 object storage on Ceph |
| 661 | 33 | Nico Schottelius | |
| 662 | * Setup a gateway node with Alpine Linux |
||
| 663 | ** Change to the edge branch |
||
| 664 | ** Enable the testing repository |
||
| 665 | * Update the firewall to allow access from this node to the ceph monitors |
||
| 666 | 35 | Nico Schottelius | * Setting up the wildcard DNS certificate |
| 667 | |||
| 668 | <pre> |
||
| 669 | apk add ceph-radosgw |
||
| 670 | </pre> |
||
| 671 | 37 | Nico Schottelius | |
| 672 | h3. Wildcard DNS certificate from letsencrypt |
||
| 673 | |||
| 674 | Acquiring and renewing this certificate is currently a manual process, as it requires changing DNS settings. |
||
| 675 | |||
| 676 | * run certbot |
||
| 677 | * update DNS with the first token |
||
| 678 | * update DNS with the second token |
||
| 679 | |||
| 680 | Sample session: |
||
| 681 | |||
| 682 | <pre> |
||
| 683 | s3:/etc/ceph# certbot certonly --manual --preferred-challenges=dns --email sre@ungleich.ch --server https://acme-v02.api.letsencrypt.org/directory --agree-tos |
||
| 684 | -d *.s3.ungleich.ch -d s3.ungleich.ch |
||
| 685 | Saving debug log to /var/log/letsencrypt/letsencrypt.log |
||
| 686 | Plugins selected: Authenticator manual, Installer None |
||
| 687 | Cert is due for renewal, auto-renewing... |
||
| 688 | Renewing an existing certificate |
||
| 689 | Performing the following challenges: |
||
| 690 | dns-01 challenge for s3.ungleich.ch |
||
| 691 | dns-01 challenge for s3.ungleich.ch |
||
| 692 | |||
| 693 | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - |
||
| 694 | NOTE: The IP of this machine will be publicly logged as having requested this |
||
| 695 | certificate. If you're running certbot in manual mode on a machine that is not |
||
| 696 | your server, please ensure you're okay with that. |
||
| 697 | |||
| 698 | Are you OK with your IP being logged? |
||
| 699 | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - |
||
| 700 | (Y)es/(N)o: y |
||
| 701 | |||
| 702 | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - |
||
| 703 | Please deploy a DNS TXT record under the name |
||
| 704 | _acme-challenge.s3.ungleich.ch with the following value: |
||
| 705 | |||
| 706 | KxGLZNiVjFwz1ifNheoR_KQoPVpkvRUV1oT2pOvJlU0 |
||
| 707 | |||
| 708 | Before continuing, verify the record is deployed. |
||
| 709 | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - |
||
| 710 | Press Enter to Continue |
||
| 711 | |||
| 712 | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - |
||
| 713 | Please deploy a DNS TXT record under the name |
||
| 714 | _acme-challenge.s3.ungleich.ch with the following value: |
||
| 715 | |||
| 716 | bkrhtxWZUipCAL5cBfvrjDuftqsZdQ2JjisiKmXBbaI |
||
| 717 | |||
| 718 | Before continuing, verify the record is deployed. |
||
| 719 | (This must be set up in addition to the previous challenges; do not remove, |
||
| 720 | replace, or undo the previous challenge tasks yet. Note that you might be |
||
| 721 | asked to create multiple distinct TXT records with the same name. This is |
||
| 722 | permitted by DNS standards.) |
||
| 723 | |||
| 724 | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - |
||
| 725 | Press Enter to Continue |
||
| 726 | Waiting for verification... |
||
| 727 | Cleaning up challenges |
||
| 728 | |||
| 729 | IMPORTANT NOTES: |
||
| 730 | - Congratulations! Your certificate and chain have been saved at: |
||
| 731 | /etc/letsencrypt/live/s3.ungleich.ch/fullchain.pem |
||
| 732 | Your key file has been saved at: |
||
| 733 | /etc/letsencrypt/live/s3.ungleich.ch/privkey.pem |
||
| 734 | Your cert will expire on 2020-12-09. To obtain a new or tweaked |
||
| 735 | version of this certificate in the future, simply run certbot |
||
| 736 | again. To non-interactively renew *all* of your certificates, run |
||
| 737 | "certbot renew" |
||
| 738 | - If you like Certbot, please consider supporting our work by: |
||
| 739 | |||
| 740 | Donating to ISRG / Let's Encrypt: https://letsencrypt.org/donate |
||
| 741 | Donating to EFF: https://eff.org/donate-le |
||
| 742 | |||
| 743 | </pre> |
||
| 744 | 41 | Nico Schottelius | |
| 745 | h2. Debugging ceph |
||
| 746 | |||
| 747 | |||
| 748 | <pre> |
||
| 749 | ceph status |
||
| 750 | ceph osd status |
||
| 751 | ceph osd df |
||
| 752 | ceph osd utilization |
||
| 753 | ceph osd pool stats |
||
| 754 | ceph osd tree |
||
| 755 | ceph pg stat |
||
| 756 | </pre> |
||
| 757 | 42 | Nico Schottelius | |
| 758 | 53 | Nico Schottelius | h3. How to list the version overview |
| 759 | |||
| 760 | 55 | Nico Schottelius | This lists the versions of osds, mgrs and mons: |
| 761 | |||
| 762 | 53 | Nico Schottelius | <pre> |
| 763 | ceph versions |
||
| 764 | </pre> |
||
| 765 | 55 | Nico Schottelius | |
| 766 | Listing the "features" of clients, osds, mgrs and mons can be done using @ceph features@: |
||
| 767 | |||
| 768 | <pre> |
||
| 769 | [15:32:20] red1.place5:~# ceph features |
||
| 770 | { |
||
| 771 | "mon": [ |
||
| 772 | { |
||
| 773 | "features": "0x3ffddff8ffecffff", |
||
| 774 | "release": "luminous", |
||
| 775 | "num": 5 |
||
| 776 | } |
||
| 777 | ], |
||
| 778 | "osd": [ |
||
| 779 | { |
||
| 780 | "features": "0x3ffddff8ffecffff", |
||
| 781 | "release": "luminous", |
||
| 782 | "num": 44 |
||
| 783 | } |
||
| 784 | ], |
||
| 785 | "client": [ |
||
| 786 | { |
||
| 787 | "features": "0x3ffddff8eea4fffb", |
||
| 788 | "release": "luminous", |
||
| 789 | "num": 4 |
||
| 790 | }, |
||
| 791 | { |
||
| 792 | "features": "0x3ffddff8ffacffff", |
||
| 793 | "release": "luminous", |
||
| 794 | "num": 18 |
||
| 795 | }, |
||
| 796 | { |
||
| 797 | "features": "0x3ffddff8ffecffff", |
||
| 798 | "release": "luminous", |
||
| 799 | "num": 31 |
||
| 800 | } |
||
| 801 | ], |
||
| 802 | "mgr": [ |
||
| 803 | { |
||
| 804 | "features": "0x3ffddff8ffecffff", |
||
| 805 | "release": "luminous", |
||
| 806 | "num": 4 |
||
| 807 | } |
||
| 808 | ] |
||
| 809 | } |
||
| 810 | |||
| 811 | </pre> |
||
| 812 | |||
| 813 | 53 | Nico Schottelius | |
| 814 | h3. How to list the version of every OSD and every monitor |
||
| 815 | |||
| 816 | To list the version of each ceph OSD: |
||
| 817 | |||
| 818 | <pre> |
||
| 819 | ceph tell osd.* version |
||
| 820 | </pre> |
||
| 821 | |||
| 822 | To list the version of each ceph mon: |
||
| 823 | 2 |
||
| 824 | <pre> |
||
| 825 | ceph tell mon.* version |
||
| 826 | </pre> |
||
| 827 | |||
| 828 | The mgrs do not seem to support this command as of 14.2.21. |
||
| 829 | |||
| 830 | 49 | Nico Schottelius | h2. Performance Tuning |
| 831 | |||
| 832 | * Ensure that the basic options for reducing rebalancing workload are set: |
||
| 833 | |||
| 834 | <pre> |
||
| 835 | osd max backfills = 1 |
||
| 836 | osd recovery max active = 1 |
||
| 837 | osd recovery op priority = 2 |
||
| 838 | </pre> |
||
| 839 | |||
| 840 | * Ensure that "osd_op_queue_cut_off":https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_op_queue_cut_off is set to **high** |
||
| 841 | ** Requires OSD restart on change |
||
| 842 | |||
| 843 | 50 | Nico Schottelius | <pre> |
| 844 | ceph config set global osd_op_queue_cut_off high |
||
| 845 | </pre> |
||
| 846 | |||
| 847 | 51 | Nico Schottelius | <pre> |
| 848 | be sure to check your osd recovery sleep settings, there are several |
||
| 849 | depending on your underlying drives: |
||
| 850 | |||
| 851 | "osd_recovery_sleep": "0.000000", |
||
| 852 | "osd_recovery_sleep_hdd": "0.050000", |
||
| 853 | "osd_recovery_sleep_hybrid": "0.050000", |
||
| 854 | "osd_recovery_sleep_ssd": "0.050000", |
||
| 855 | |||
| 856 | Adjusting these upwards will dramatically reduce IO, and take effect |
||
| 857 | immediately at the cost of slowing rebalance/recovery. |
||
| 858 | </pre> |
||
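For example, the HDD recovery sleep can be raised at runtime via the config database (a sketch; the value is in seconds and only an illustration):

<pre>
ceph config set osd osd_recovery_sleep_hdd 0.1
</pre>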
| 859 | |||
| 860 | 52 | Nico Schottelius | Reference settings from Frank Schilder: |
| 861 | |||
| 862 | <pre> |
||
| 863 | osd class:hdd advanced osd_recovery_sleep 0.050000 |
||
| 864 | osd class:rbd_data advanced osd_recovery_sleep 0.025000 |
||
| 865 | osd class:rbd_meta advanced osd_recovery_sleep 0.002500 |
||
| 866 | osd class:ssd advanced osd_recovery_sleep 0.002500 |
||
| 867 | osd advanced osd_recovery_sleep 0.050000 |
||
| 868 | |||
| 869 | osd class:hdd advanced osd_max_backfills 3 |
||
| 870 | osd class:rbd_data advanced osd_max_backfills 6 |
||
| 871 | osd class:rbd_meta advanced osd_max_backfills 12 |
||
| 872 | osd class:ssd advanced osd_max_backfills 12 |
||
| 873 | osd advanced osd_max_backfills 3 |
||
| 874 | |||
| 875 | osd class:hdd advanced osd_recovery_max_active 8 |
||
| 876 | osd class:rbd_data advanced osd_recovery_max_active 16 |
||
| 877 | osd class:rbd_meta advanced osd_recovery_max_active 32 |
||
| 878 | osd class:ssd advanced osd_recovery_max_active 32 |
||
| 879 | osd advanced osd_recovery_max_active 8 |
||
| 880 | </pre> |
||
| 881 | |||
| 882 | (have not yet been tested in our clusters) |
||
| 883 | 51 | Nico Schottelius | |
| 884 | 42 | Nico Schottelius | h2. Ceph theory |
| 885 | |||
| 886 | h3. How much data per Server? |
||
| 887 | |||
| 888 | Q: How much data should we add into one server? |
||
| 889 | A: Not more than it can handle. |
||
| 890 | |||
| 891 | How much data can a server handle? For this let's have a look at 2 scenarios: |
||
| 892 | |||
| 893 | * How long does it take to compensate the loss of the server? |
||
| 894 | |||
| 895 | * Assuming a server has X TiB storage in Y disks attached and a network speed of Z GiB/s. |
||
| 896 | * And our estimated rebuild goal is to compensate the loss of a server within U hours. |
||
| 897 | |||
| 898 | |||
| 899 | h4. Approach 1 |
||
| 900 | |||
| 901 | Then the rebuild time is roughly the amount of data to restore divided by the available network speed. |
||
| 902 | |||
| 903 | Let's take an example: |
||
| 904 | |||
| 905 | * A server with @10 disks * 10 TiB@ = 100 TiB ≈ 100 000 GiB of data. It is network connected with 10 Gbit/s ≈ 1.25 GiB/s. |
||
| 906 | * 100000/1.25 = 80000s = 22.22h |
||
| 907 | |||
| 908 | However, our logic assumes that we actually rebuild from the failed server, which... is failed. |
||
| 909 | |||
| 910 | h4. Approach 2: calculating with left servers |
||
| 911 | |||
| 912 | However, we can also apply our logic to distributing |
||
| 913 | the rebuild over several servers that now pull in data from each other for rebuilding. |
||
| 914 | We need to *read* the data (100TiB) from other servers and distribute it to new OSDs. Assuming each server has a 10 Gbit/s |
||
| 915 | network connection. |
||
| 916 | |||
| 917 | Now the servers might need to *read* (get data from other osds) and *write* (send data to other osds). Luckily, networking is 10 Gbit/s duplex - i.e. in both directions. |
||
| 918 | |||
| 919 | However how fast can we actually read data from the disks? |
||
| 920 | |||
| 921 | * SSDs are in the range of hundreds of MB/s (best case, not necessarily true for random reads) - let's assume |
||
| 922 | * HDDs are in the range of tens of MB/s (depending on the workload, but 30-40 MB/s random reads seems realistic) |
||
| 923 | |||
| 924 | |||
| 925 | |||
| 926 | |||
| 927 | Further assumptions: |
||
| 928 | |||
| 929 | * Assuming further that each disk should be dedicated at least one CPU core. |
||
| 930 | 43 | Nico Schottelius | |
| 931 | h3. Disk/SSD speeds |
||
| 932 | |||
| 933 | 44 | Nico Schottelius | * Tuning for #8473 showed that a 10TB HDD can write up to 180-200MB/s when backfilling (at about 70% cpu usage and 20% disk usage), max backfills = 8 |
| 934 | 43 | Nico Schottelius | * Debugging SSD usage in #8461 showed SSDs can read about 470-520MB/s sequential |
| 935 | * Debugging SSD usage in #8461 showed SSDs can write about 170-280MB/s sequential |
||
| 936 | * Debugging SSD usage in #8461 showed SSDs can write about 4MB/s RANDOM (need to verify this even though 3 runs showed these numbers) |
||
| 937 | 47 | Dominique Roux | |
| 938 | 48 | Dominique Roux | h3. Ceph theoretical foundations |
| 939 | 47 | Dominique Roux | |
| 940 | If you are very much into the theoretical foundations of Ceph, check out their "paper":https://www3.nd.edu/~dthain/courses/cse40771/spring2007/papers/ceph.pdf |