h1. The ungleich ceph handbook

{{toc}}

h2. Status

This document is **IN PRODUCTION**.

h2. Introduction

This article describes the ungleich storage architecture, which is based on ceph, as well as the maintenance commands required for operating it.

h2. Communication guide

Usually, when a disk fails, no customer communication is necessary, as the failure is automatically compensated/rebalanced by ceph. However, if multiple disk failures happen at the same time, I/O speed might be reduced and thus customer experience impacted.

For this reason, communicate with customers whenever I/O recovery settings are temporarily tuned.

h2. Analysing

h3. ceph osd df tree

Using @ceph osd df tree@ you can see not only the disk usage per OSD, but also the number of PGs on an OSD. This is especially useful to see how well the OSDs are balanced.
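
A minimal invocation (run on any node with the admin keyring); the PGS column is the one to compare across OSDs:

<pre>
# disk usage and placement group count per OSD, grouped by the crush tree
ceph osd df tree
</pre>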

h3. Find out the device of an OSD

Use @mount | grep /var/lib/ceph/osd/ceph-OSDID@ on the server on which the OSD is located:

<pre>
[16:01:23] server2.place6:~# mount | grep /var/lib/ceph/osd/ceph-31
/dev/sdk1 on /var/lib/ceph/osd/ceph-31 type xfs (rw,relatime,attr2,inode64,noquota)
</pre>

h2. Adding a new disk/ssd to the ceph cluster

Write on the disk, with a permanent marker, in which order / on which date we bought it.

h3. For Dell servers

First find the disk and then add it to the operating system:

<pre>
megacli -PDList -aALL | grep -B16 -i unconfigur

# Sample output:
[19:46:50] server7.place6:~# megacli -PDList -aALL | grep -B16 -i unconfigur
Enclosure Device ID: N/A
Slot Number: 0
Enclosure position: N/A
Device Id: 0
WWN: 0000000000000000
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 894.252 GB [0x6fc81ab0 Sectors]
Non Coerced Size: 893.752 GB [0x6fb81ab0 Sectors]
Coerced Size: 893.75 GB [0x6fb80000 Sectors]
Sector Size: 0
Firmware state: Unconfigured(good), Spun Up
</pre>

Then add the disk to the OS:

<pre>
# Template: megacli -CfgLdAdd -r0 [Enclosure Device ID:slot] -aX   (X is the adapter number: host is 0, md-array is 1)

# Sample call, if enclosure and slot are KNOWN (aka not N/A)
megacli -CfgLdAdd -r0 [32:0] -a0

# Sample call, if enclosure is N/A
megacli -CfgLdAdd -r0 [:0] -a0
</pre>

Then check the disk:

<pre>
fdisk -l
[11:26:23] server2.place6:~# fdisk -l
......
Disk /dev/sdh: 7.3 TiB, 8000987201536 bytes, 15626928128 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
[11:27:24] server2.place6:~#
</pre>

Then create a fresh GPT partition table:

<pre>
/opt/ungleich-tools/disk-create-fresh-gpt /dev/XXX
[11:31:10] server2.place6:~# /opt/ungleich-tools/disk-create-fresh-gpt /dev/sdh
......
Created a new DOS disklabel with disk identifier 0x9c4a0355.
Command (m for help): Created a new GPT disklabel (GUID: 374E31AD-7B96-4837-B5ED-7B22C452899E).
......
</pre>

Then create the osd with the matching class (ssd or hdd-big):

<pre>
/opt/ungleich-tools/ceph-osd-create-start /dev/XXX CLASS   # CLASS is ssd or hdd-big
[11:33:58] server2.place6:~# /opt/ungleich-tools/ceph-osd-create-start /dev/sdh hdd-big
+ set -e
+ [ 2 -lt 2 ]
......
+ /opt/ungleich-tools/monit-ceph-create-start osd.14
osd.14
[ ok ] Restarting daemon monitor: monit.
[11:36:14] server2.place6:~#
</pre>

Then check the rebalancing (if you want to add another disk, do so only after rebalancing has finished):

<pre>
ceph -s
[12:37:57] server2.place6:~# ceph -s
  cluster:
    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
    health: HEALTH_WARN
            2248811/49628409 objects misplaced (4.531%)
......
  io:
    client:   170KiB/s rd, 35.0MiB/s wr, 463op/s rd, 728op/s wr
    recovery: 27.1MiB/s, 6objects/s
[12:49:41] server2.place6:~#
</pre>

h2. Moving a disk/ssd to another server

(needs to be described better)

Generally speaking:

* (needs to be tested) disable recovery so that data does not start to move while the osd is down (see the sketch after this list)
* /opt/ungleich-tools/ceph-osd-stop-disable does the following:
** Stop the osd, remove monit on the server you want to take it out of
** umount the disk
* Take the disk out
* Discard the preserved cache on the server you took it out of
** using megacli: @megacli -DiscardPreservedCache -Lall -a0@
* Insert it into the new server
* Clear the foreign configuration
** using megacli: @megacli -CfgForeign -Clear -a0@
* The disk will now appear in the OS, and ceph/udev will automatically start the OSD (!)
** No creating of the osd required!
* Verify that the disk exists and that the osd is started
** using *ps aux*
** using *ceph osd tree*
* */opt/ungleich-tools/monit-ceph-create-start osd.XX* # where XX is the osd number
** Creates the monit configuration file so that monit watches the OSD
** Reload monit
* Verify monit using *monit status*
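
A minimal sketch of how recovery could be paused around the move, assuming the standard ceph cluster flags are the right tool here (untested in our setup, as noted in the list above):

<pre>
# before stopping the osd: prevent ceph from marking it out and from moving data
ceph osd set noout
ceph osd set norebalance
ceph osd set norecover

# ... move the disk to the new server, wait until the osd is up again ...

# allow normal operation again
ceph osd unset norecover
ceph osd unset norebalance
ceph osd unset noout
</pre>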

h2. Removing a disk/ssd

To permanently remove a failed disk from a cluster, use ***ceph-osd-stop-remove-permanently*** from the ungleich-tools repo. Warning: if the disk is still active, the OSD will be shut down AND removed from the cluster -> all data of that disk will need to be rebalanced.
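
A hypothetical invocation; the osd id 31 is only an example:

<pre>
# stops the osd and removes it from the cluster; its data will be rebalanced
/opt/ungleich-tools/ceph-osd-stop-remove-permanently 31
</pre>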

h2. Handling DOWN osds with filesystem errors

If an email arrives with the subject "monit alert -- Does not exist osd.XX-whoami", the filesystem of an OSD cannot be read anymore. It is very likely that the disk / ssd is broken. Steps that need to be done:

* Login to any ceph monitor (cephX.placeY.ungleich.ch)
* Check **ceph -s**, find the host using **ceph osd tree**
* Login to the affected host
* Run the following commands:
** ls /var/lib/ceph/osd/ceph-XX
** dmesg

Example dmesg output of a broken filesystem; if you see messages like these, continue with the next step:

<pre>
[204696.406756] XFS (sdl1): metadata I/O error: block 0x19100 ("xlog_iodone") error 5 numblks 64
[204696.408094] XFS (sdl1): xfs_do_force_shutdown(0x2) called from line 1233 of file /build/linux-BsFdsw/linux-4.9.65/fs/xfs/xfs_log.c. Return address = 0xffffffffc08eb612
[204696.410702] XFS (sdl1): Log I/O Error Detected. Shutting down filesystem
[204696.411977] XFS (sdl1): Please umount the filesystem and rectify the problem(
</pre>

* Create a new ticket in the datacenter light project
** Subject: "Replace broken OSD.XX on serverX.placeY.ungleich.ch"
** Add (partial) output of the above commands
** Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently XX, where XX is the osd id, to remove the disk from the cluster
** Remove the physical disk from the host and check whether it is still under warranty; if yes:
*** Create a short letter to the vendor, including the technical details from above
*** Record when you sent it in
*** Put the ticket into status waiting
** If there is no warranty, dispose of the disk

h2. Change ceph speed for i/o recovery

By default we want to keep I/O recovery traffic low so that it does not impact customer experience. However, when multiple disks fail at the same time, we might want to prioritise recovery for data safety over performance.

The default configuration on our servers contains:

<pre>
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

The important settings are *osd max backfills* and *osd recovery max active*; the priority is always kept low so that regular I/O has priority.

To adjust the number of backfills *per osd* and to change the *number of threads* used for recovery, we can run the following on any node with the admin keyring:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills Y'
ceph tell osd.* injectargs '--osd-recovery-max-active X'
</pre>

where Y and X are the values that we want to use. Experience shows that Y=5 and X=5 doubles to triples the recovery performance, whereas Y=10 and X=10 increases recovery performance about 5 times.
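
For example, to moderately speed up recovery and return to the defaults from above afterwards:

<pre>
# speed up recovery (roughly 2-3x according to the experience above)
ceph tell osd.* injectargs '--osd-max-backfills 5'
ceph tell osd.* injectargs '--osd-recovery-max-active 5'

# once recovery has caught up, go back to the defaults
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
</pre>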

h2. Debug scrub errors / inconsistent pg message

From time to time disks don't save what they are told to save. Ceph scrubbing detects these errors and switches to HEALTH_ERR. Use *ceph health detail* to find out which placement groups (*pgs*) are affected. Usually a ***ceph pg repair <number>*** fixes the problem.

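An illustrative repair sequence; the pg id 2.1ab is a placeholder:

<pre>
ceph health detail        # lists the inconsistent pg(s)
ceph pg repair 2.1ab      # ask ceph to repair the affected pg
</pre>
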
If this does not help, consult https://ceph.com/geen-categorie/ceph-manually-repair-object/.

h2. Move servers into the osd tree

New servers have their buckets placed outside the **default root** and thus need to be moved inside.
Output might look as follows:

<pre>
[11:19:27] server5.place6:~# ceph osd tree
ID  CLASS    WEIGHT     TYPE NAME          STATUS REWEIGHT PRI-AFF
-3             0.87270  host server5
41      ssd    0.87270      osd.41             up  1.00000 1.00000
-1           251.85580  root default
-7            81.56271      host server2
 0  hdd-big    9.09511          osd.0          up  1.00000 1.00000
 5  hdd-big    9.09511          osd.5          up  1.00000 1.00000
...
</pre>

Use **ceph osd crush move serverX root=default** (where serverX is the new server),
which will move the bucket in the right place:

<pre>
[11:21:17] server5.place6:~# ceph osd crush move server5 root=default
moved item id -3 name 'server5' to location {root=default} in crush map
[11:32:12] server5.place6:~# ceph osd tree
ID  CLASS    WEIGHT     TYPE NAME          STATUS REWEIGHT PRI-AFF
-1           252.72850  root default
...
-3             0.87270      host server5
41      ssd    0.87270          osd.41         up  1.00000 1.00000
</pre>

h2. How to fix existing osds with wrong partition layout

In the first version of DCL we used a filestore / 3-partition based layout.
In the second version of DCL, including OSD autodetection, we use a bluestore / 2-partition based layout.

To convert, we delete the old OSD, clean the partitions and create a new osd:

h3. Inactive OSD

If the OSD is *not active*, we can do the following:

* Find the OSD number: mount the partition and find the whoami file

<pre>
root@server2:/opt/ungleich-tools# mount /dev/sda2 /mnt/
root@server2:/opt/ungleich-tools# cat /mnt/whoami
0
root@server2:/opt/ungleich-tools# umount /mnt/
</pre>

* Verify in the *ceph osd tree* output that the OSD is on that server
* Delete the OSD (see the example after this list)
** ceph osd crush remove $osd_name
** ceph osd rm $osd_name
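
For example, assuming the whoami file above returned 0, the OSD would be removed like this:

<pre>
ceph osd crush remove osd.0   # remove the osd from the crush map
ceph osd rm osd.0             # remove the osd from the cluster
</pre>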

Then continue below as described in "Recreating the OSD".

h3. Remove Active OSD

* Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently OSDID to stop and remove the OSD
* Then continue below as described in "Recreating the OSD".

h3. Recreating the OSD

* Create an empty partition table (see the sketch after this list)
** fdisk /dev/sdX
** g
** w
* Create a new OSD
** /opt/ungleich-tools/ceph-osd-create-start /dev/sdX CLASS # use hdd, ssd, ... for the CLASS
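
A non-interactive sketch of the same two steps; /dev/sdX and the class ssd are placeholders:

<pre>
# wipe to an empty GPT partition table: g = new GPT label, w = write and quit
printf 'g\nw\n' | fdisk /dev/sdX

# recreate the osd with the desired class
/opt/ungleich-tools/ceph-osd-create-start /dev/sdX ssd
</pre>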

h2. How to fix unfound pg

Refer to https://redmine.ungleich.ch/issues/6388

* Check the health state (see the illustrative sequence after this list)
** ceph health detail
* Check which server has that osd
** ceph osd tree
* Check which VM is running on that server
** virsh list
* Check the pg map
** ceph osd map [osd pool] [VMID]
* Revert the pg
** ceph pg [PGID] mark_unfound_lost revert
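
An illustrative run with placeholder values (the pool name, VM image name and pg id are examples only):

<pre>
ceph health detail                      # shows which pg has unfound objects
ceph osd tree                           # find the server that hosts the affected osd
ceph osd map hdd one-1234               # check the pg map for the affected VM image
ceph pg 2.4 mark_unfound_lost revert    # revert the unfound objects of that pg
</pre>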

h2. Enabling per image RBD statistics for prometheus

<pre>
[20:26:57] red2.place5:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "one,hdd"
[20:27:57] black2.place6:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "hdd,ssd"
</pre>
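
This assumes the prometheus mgr module is already enabled; if it is not, enable it first with the standard ceph command:

<pre>
ceph mgr module enable prometheus   # exposes metrics via the mgr prometheus endpoint
</pre>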

h2. S3 Object Storage

h3. Introduction

* See the "Red Hat manual":https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/object_gateway_guide_for_red_hat_enterprise_linux/overview-rgw
* The "ceph docs about object storage":https://docs.ceph.com/docs/mimic/radosgw/

h3. Architecture

* S3 requests are handled by a publicly accessible gateway, which also has access to the ceph cluster.
* s3 buckets are usually

h3. Setting up S3 object storage on Ceph
| 333 | h3. Setting up S3 object storage on Ceph |