Version 49 - History - The ungleich ceph handbook - Open Infrastructure - ungleich redmine

The ungleich ceph handbook » History » Version 49

Nico Schottelius, 08/12/2021 07:16 PM

h1. The ungleich ceph handbook
{{toc}}
h2. Status
This document is **IN PRODUCTION**.
h2. Introduction
This article describes the ungleich storage architecture that is based on ceph. It describes our architecture as well as maintenance commands. Required for
h2. Processes
 h3. Usage monitoring
* Usage should be kept in the 70-75% range
 * If usage reaches 72.5%, we start reducing usage by adding disks
 * We stop when usage is below 70%
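The thresholds above form a small hysteresis: we start adding disks at 72.5% and only stop once usage is back below 70%. A minimal sketch of that decision, assuming the current usage percentage is obtained elsewhere; the function name @usage_next_state@ and the state names @idle@ / @adding@ are made up for illustration and are not part of ungleich-tools:

```shell
#!/bin/sh
# Decide whether we should be adding disks.
# $1 = previous state ("idle" or "adding"), $2 = cluster usage in percent.
# The thresholds come from the list above; everything else is hypothetical.
usage_next_state() {
    prev_state=$1
    usage_pct=$2
    awk -v prev="$prev_state" -v u="$usage_pct" 'BEGIN {
        if (u >= 72.5)   print "adding"   # start (or keep) adding disks
        else if (u < 70) print "idle"     # goal reached, stop adding
        else             print prev       # between 70 and 72.5: keep the previous state
    }'
}

usage_next_state idle 73
```

Between 70% and 72.5% the previous state is kept, so we do not flip between adding and not adding disks on small fluctuations.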
 h3. Phasing in new disks
 * 24h performance test prior to using it
 h3. Phasing in new servers
 * 24h performance test with 1 ssd or 1 hdd (whatever is applicable)
h2. Communication guide
Usually, when a disk fails, no customer communication is necessary, as ceph compensates for it automatically by rebalancing. However, if multiple disks fail at the same time, I/O speed might be reduced and customer experience impacted.
For this reason, communicate with customers whenever I/O recovery settings are temporarily tuned.
h2. Analysing
h3. ceph osd df tree
Using @ceph osd df tree@ you can see not only the disk usage per OSD, but also the number of PGs on an OSD. This is especially useful to see how the OSDs are balanced.
h3. Find out the device of an OSD
 Use @mount | grep /var/lib/ceph/osd/ceph-OSDID@ on the server on which the OSD is located:
 <pre>
 [16:01:23] server2.place6:~# mount | grep /var/lib/ceph/osd/ceph-31
 /dev/sdk1 on /var/lib/ceph/osd/ceph-31 type xfs (rw,relatime,attr2,inode64,noquota)
 </pre>
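Note that a plain grep for @ceph-31@ would also match @ceph-310@. A hedged sketch that compares the mount point exactly; @osd_device@ is a hypothetical helper (not part of ungleich-tools) and reads the @mount@ output from stdin so that it can be tried without a cluster:

```shell
#!/bin/sh
# Print the device that backs OSD number $1.
# Expects `mount` output on stdin, e.g.:
#   /dev/sdk1 on /var/lib/ceph/osd/ceph-31 type xfs (rw,...)
osd_device() {
    awk -v path="/var/lib/ceph/osd/ceph-$1" \
        '$2 == "on" && $3 == path { print $1 }'
}

# On the server that hosts the OSD one would run: mount | osd_device 31
```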
h2. Adding a new disk/ssd to the ceph cluster
Using a permanent marker, write on each disk which order it belongs to and the date we bought it.
h3. Checking the shadow trees
 To be able to spot differences / weights of hosts, it can be very helpful to look at the crush shadow tree
 using @ceph osd crush tree --show-shadow@:
 <pre>
 -16   hdd-big 653.03418           root default~hdd-big
 -34   hdd-big         0         0     host server14~hdd-big
 -38   hdd-big         0         0     host server15~hdd-big
 -42   hdd-big  81.86153  78.28352     host server17~hdd-big
 hdd-big   9.09560   9.09560         osd.36
 hdd-big   9.09499   9.09499         osd.59
 hdd-big   9.09499   9.09499         osd.60
 hdd-big   9.09599   8.93999         osd.68
 hdd-big   9.09599   7.65999         osd.69
 hdd-big   9.09599   8.35899         osd.70
 hdd-big   9.09599   8.56000         osd.71
 hdd-big   9.09599   8.93700         osd.72
 hdd-big   9.09599   8.54199         osd.73
 -46   hdd-big  90.94986  90.94986     host server18~hdd-big
 ...
 </pre>
Here we can see that the weight of server17 for the class hdd-big is about 81.9, while that of server18 is about 90.9.
 SSDs and other classes have their own shadow trees, too.
h3. For Dell servers
 First find the disk and then add it to the operating system
 <pre>
 megacli -PDList -aALL  | grep -B16 -i unconfigur
 # Sample output:
 [19:46:50] server7.place6:~#  megacli -PDList -aALL  | grep -B16 -i unconfigur
 Enclosure Device ID: N/A
 Slot Number: 0
 Enclosure position: N/A
 Device Id: 0
 WWN: 0000000000000000
 Sequence Number: 1
 Media Error Count: 0
 Other Error Count: 0
 Predictive Failure Count: 0
 Last Predictive Failure Event Seq Number: 0
 PD Type: SATA
 Raw Size: 894.252 GB [0x6fc81ab0 Sectors]
 Non Coerced Size: 893.752 GB [0x6fb81ab0 Sectors]
 Coerced Size: 893.75 GB [0x6fb80000 Sectors]
 Sector Size:  0
 Firmware state: Unconfigured(good), Spun Up
 </pre>
 Then add the disk to the OS:
 <pre>
# Generic form; X is the adapter number (0 for host, 1 for md-array)
megacli -CfgLdAdd -r0 [Enclosure Device ID:slot] -aX
# Sample call, if enclosure and slot are KNOWN (aka not N/A)
megacli -CfgLdAdd -r0 [32:0] -a0
# Sample call, if enclosure is N/A
megacli -CfgLdAdd -r0 [:0] -a0
+</pre>
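The two sample calls differ only in how the enclosure is written. A small sketch of building that argument, assuming the values are copied by hand from the @-PDList@ output above; @megacli_pd_arg@ is a hypothetical helper, and the last line only prints the resulting command instead of running it:

```shell
#!/bin/sh
# Build the [enclosure:slot] argument for megacli -CfgLdAdd.
# $1 = "Enclosure Device ID" from -PDList (may be "N/A"), $2 = "Slot Number".
megacli_pd_arg() {
    enclosure=$1
    slot=$2
    if [ "$enclosure" = "N/A" ]; then
        enclosure=""    # an N/A enclosure is written as an empty field
    fi
    printf '[%s:%s]\n' "$enclosure" "$slot"
}

# Print (do not run) the call for the disk shown above on adapter 0
echo "megacli -CfgLdAdd -r0 $(megacli_pd_arg N/A 0) -a0"
```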
Then check that the disk is visible to the OS:
 <pre>
 fdisk -l
 [11:26:23] server2.place6:~# fdisk -l
 ......
 Disk /dev/sdh: 7.3 TiB, 8000987201536 bytes, 15626928128 sectors
 Units: sectors of 1 * 512 = 512 bytes
 Sector size (logical/physical): 512 bytes / 512 bytes
 I/O size (minimum/optimal): 512 bytes / 512 bytes
 [11:27:24] server2.place6:~#
 </pre>
Then create a fresh GPT partition table:
 <pre>
 /opt/ungleich-tools/disk-create-fresh-gpt /dev/XXX
 [11:31:10] server2.place6:~# /opt/ungleich-tools/disk-create-fresh-gpt /dev/sdh
 ......
 Created a new DOS disklabel with disk identifier 0x9c4a0355.
 Command (m for help): Created a new GPT disklabel (GUID: 374E31AD-7B96-4837-B5ED-7B22C452899E).
 ......
 </pre>
Then create the OSD, passing the device class (ssd or hdd-big):
 <pre>
/opt/ungleich-tools/ceph-osd-create-start /dev/XXX CLASS   # CLASS is ssd or hdd-big
 [11:33:58] server2.place6:~# /opt/ungleich-tools/ceph-osd-create-start /dev/sdh hdd-big
 + set -e
 + [ 2 -lt 2 ]
 ......
 + /opt/ungleich-tools/monit-ceph-create-start osd.14
 osd.14
 [ ok ] Restarting daemon monitor: monit.
 [11:36:14] server2.place6:~#
 </pre>
Then check the rebalancing progress. If you want to add another disk, wait until rebalancing has finished:
 <pre>
 ceph -s
 [12:37:57] server2.place6:~# ceph -s
   cluster:
     id:     1ccd84f6-e362-4c50-9ffe-59436745e445
     health: HEALTH_WARN
             2248811/49628409 objects misplaced (4.531%)
 ......
   io:
     client:   170KiB/s rd, 35.0MiB/s wr, 463op/s rd, 728op/s wr
     recovery: 27.1MiB/s, 6objects/s
 [12:49:41] server2.place6:~#
+</pre>
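To decide whether the next disk can be added, the misplaced percentage can be pulled out of the status output. This is only a sketch: the helper names are made up, it looks at the "objects misplaced" line exclusively (degraded objects or other warnings are ignored), and it reads the @ceph -s@ output from stdin:

```shell
#!/bin/sh
# Print the percentage of misplaced objects found in `ceph -s` output (stdin).
# Prints nothing if there is no "objects misplaced" line.
misplaced_pct() {
    sed -n 's/.*objects misplaced (\([0-9.]*\)%).*/\1/p'
}

# Exit status 0 while objects are still misplaced, non-zero otherwise.
still_rebalancing() {
    [ -n "$(misplaced_pct)" ]
}

# On a node with the admin keyring one would run:
#   ceph -s | still_rebalancing && echo "still rebalancing, wait before adding another disk"
```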
h2. Moving a disk/ssd to another server
(needs to be described better)
Generally speaking:
* //needs to be tested: disable recovery so data won't start moving while the OSD is down
* /opt/ungleich-tools/ceph-osd-stop-disable does the following:
** Stops the OSD and removes its monit configuration on the server you want to take it out of
** Unmounts the disk
* Take the disk out
* Discard the preserved cache on the server you took it out of
** using megacli: @megacli -DiscardPreservedCache -Lall -a0@
* Insert the disk into the new server
* Clear the foreign configuration
** using megacli: @megacli -CfgForeign -Clear -a0@
* The disk will now appear in the OS, and ceph/udev will automatically start the OSD (!)
** No creating of the osd required!
* Verify that the disk exists and that the OSD is started
** using *ps aux*
** using *ceph osd tree*
* */opt/ungleich-tools/monit-ceph-create-start osd.XX* # where osd.XX is the osd + number
** Creates the monit configuration file so that monit watches the OSD
** Reloads monit
* Verify monit using *monit status*
 h2. Removing a disk/ssd
To permanently remove a failed disk from a cluster, use ***ceph-osd-stop-remove-permanently*** from the ungleich-tools repo. Warning: if the disk is still active, the OSD will be shut down AND removed from the cluster -> all data of that disk will need to be rebalanced.
 h2. Handling DOWN osds with filesystem errors
If an email arrives with the subject "monit alert -- Does not exist osd.XX-whoami", the filesystem of an OSD cannot be read anymore. It is very likely that the disk / ssd is broken. Steps that need to be done:
 * Login to any ceph monitor (cephX.placeY.ungleich.ch)
 * Check **ceph -s**, find host using **ceph osd tree**
 * Login to the affected host
 * Run the following commands:
 ** ls /var/lib/ceph/osd/ceph-XX
 ** dmesg
<pre>
Example: if dmesg shows messages like the following, continue with the next step
 [204696.406756] XFS (sdl1): metadata I/O error: block 0x19100 ("xlog_iodone") error 5 numblks 64
 [204696.408094] XFS (sdl1): xfs_do_force_shutdown(0x2) called from line 1233 of file /build/linux-BsFdsw/linux-4.9.65/fs/xfs/xfs_log.c.  Return address = 0xffffffffc08eb612
 [204696.410702] XFS (sdl1): Log I/O Error Detected.  Shutting down filesystem
 [204696.411977] XFS (sdl1): Please umount the filesystem and rectify the problem(
 </pre>
* Create a new ticket in the datacenter light project
** Subject: "Replace broken OSD.XX on serverX.placeY.ungleich.ch"
** Add (partial) output of above commands
** Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently XX, where XX is the osd id, to remove the disk from the cluster
** Remove the physical disk from the host and check whether it is still under warranty; if yes:
*** Write a short letter to the vendor, including the technical details from above
*** Record when you sent the disk in
*** Put the ticket into status waiting
** If there is no warranty, dispose of it
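The dmesg check in the steps above can be narrowed down to the affected devices. A hedged sketch, assuming the kernel messages look like the example shown; @xfs_error_devices@ is a hypothetical helper that reads dmesg output from stdin and only recognises XFS messages containing "I/O error":

```shell
#!/bin/sh
# List the devices for which the kernel logged an XFS I/O error.
# Expects dmesg output on stdin; each device is printed once.
xfs_error_devices() {
    sed -n 's/.*XFS (\([^)]*\)):.*I\/O [Ee]rror.*/\1/p' | sort -u
}

# On the affected host one would run: dmesg | xfs_error_devices
```

The output (for example @sdl1@) can then be pasted into the ticket together with the partial dmesg output.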
h2. [[Create new pool and place new osd]]
h2. Change ceph speed for i/o recovery
By default we want to keep I/O recovery traffic low to not impact customer experience. However, when multiple disks fail at the same time, we might want to prioritise recovery for data safety over performance.
 The default configuration on our servers contains:
 <pre>
 [osd]
 osd max backfills = 1
 osd recovery max active = 1
 osd recovery op priority = 2
 </pre>
 The important settings are *osd max backfills* and *osd recovery max active*, the priority is always kept low so that regular I/O has priority.
 To adjust the number of backfills *per osd* and to change the *number of threads* used for recovery, we can use on any node with the admin keyring:
 <pre>
 ceph tell osd.* injectargs '--osd-max-backfills Y'
 ceph tell osd.* injectargs '--osd-recovery-max-active X'
 </pre>
 where Y and X are the values that we want to use. Experience shows that Y=5 and X=5 doubles to triples the recovery performance, whereas X=10 and Y=10 increases recovery performance 5 times.
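Since the tuning is meant to be temporary, it helps to have the commands for raising and for restoring the values side by side. The sketch below only prints the commands (a dry run); @recovery_tuning_cmds@ is a made-up helper name. It quotes @osd.*@ so that the shell cannot expand it against files in the current directory. Note that injectargs only changes the running daemons; the configuration file keeps the defaults.

```shell
#!/bin/sh
# Print (do not run) the two injectargs calls.
# $1 = osd max backfills (Y), $2 = osd recovery max active (X).
recovery_tuning_cmds() {
    backfills=$1
    active=$2
    printf "ceph tell 'osd.*' injectargs '--osd-max-backfills %s'\n" "$backfills"
    printf "ceph tell 'osd.*' injectargs '--osd-recovery-max-active %s'\n" "$active"
}

recovery_tuning_cmds 5 5    # speed up recovery after multiple disk failures
recovery_tuning_cmds 1 1    # back to our defaults once the cluster is healthy again
```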
 h2. Debug scrub errors / inconsistent pg message
From time to time disks don't save what they are told to save. Ceph scrubbing detects these errors and switches to HEALTH_ERR. Use *ceph health detail* to find out which placement groups (*pgs*) are affected. Usually a ***ceph pg repair <number>*** fixes the problem.
If this does not help, consult https://ceph.com/geen-categorie/ceph-manually-repair-object/.
 h2. Move servers into the osd tree
 New servers have their buckets placed outside the **default root** and thus need to be moved inside.
 Output might look as follows:
 <pre>
 [11:19:27] server5.place6:~# ceph osd tree
 ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF
  -3           0.87270 host server5
 ssd   0.87270     osd.41           up  1.00000 1.00000
  -1         251.85580 root default
  -7          81.56271     host server2
 hdd-big   9.09511         osd.0        up  1.00000 1.00000
 hdd-big   9.09511         osd.5        up  1.00000 1.00000
 ...
 </pre>
 Use **ceph osd crush move serverX root=default** (where serverX is the new server),
 which will move the bucket in the right place:
 <pre>
 [11:21:17] server5.place6:~# ceph osd crush move server5 root=default
 moved item id -3 name 'server5' to location {root=default} in crush map
 [11:32:12] server5.place6:~# ceph osd tree
 ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF
  -1         252.72850 root default
 ...
  -3           0.87270     host server5
 ssd   0.87270         osd.41       up  1.00000 1.00000
 </pre>
h2. How to fix existing osds with wrong partition layout
In the first version of DCL we used a filestore/3 partition based layout.
In the second version of DCL, including OSD autodetection, we use a bluestore/2 partition based layout.
To convert, we delete the old OSD, clean the partitions and create a new OSD:
h3. Inactive OSD
If the OSD is *not active*, we can do the following:
* Find the OSD number: mount the partition and find the whoami file
 <pre>
 root@server2:/opt/ungleich-tools# mount /dev/sda2 /mnt/
 root@server2:/opt/ungleich-tools# cat /mnt/whoami
 root@server2:/opt/ungleich-tools# umount  /mnt/
 </pre>
 * Verify in the *ceph osd tree* that the OSD is on that server
* Delete the OSD
** ceph osd crush remove $osd_name
** ceph osd rm $osd_name
 Then continue below as described in "Recreating the OSD".
 h3. Remove Active OSD
 * Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently OSDID to stop and remove the OSD
 * Then continue below as described in "Recreating the OSD".
 h3. Recreating the OSD
* Create an empty partition table
 ** fdisk /dev/sdX
 ** g
 ** w
 * Create a new OSD
 ** /opt/ungleich-tools/ceph-osd-create-start /dev/sdX CLASS # use hdd, ssd, ... for the CLASS
 h2. How to fix unfound pg
Refer to https://redmine.ungleich.ch/issues/6388
* Check the health state
** ceph health detail
* Check which server has that osd
** ceph osd tree
* Check which VMs are running on that server
** virsh list
* Check the pg map
** ceph osd map [osd pool] [VMID]
* Revert the pg
** ceph pg [PGID] mark_unfound_lost revert
 h2. Enabling per image RBD statistics for prometheus
 <pre>
 [20:26:57] red2.place5:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "one,hdd"
 [20:27:57] black2.place6:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "hdd,ssd"
 </pre>
h2. S3 Object Storage
This section is **UNDER CONSTRUCTION**
h3. Introduction
* See the "Red Hat manual":https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/object_gateway_guide_for_red_hat_enterprise_linux/overview-rgw
* The "ceph docs about object storage":https://docs.ceph.com/docs/mimic/radosgw/
 h3. Architecture
 * S3 requests are handled by a publicly accessible gateway, which also has access to the ceph cluster.
* s3 buckets are usually
h3. Authentication / Users
* Ceph *can* make use of LDAP as a backend
** However it uses the clear text username+password as a token
** See https://docs.ceph.com/docs/mimic/radosgw/ldap-auth/
* We do not want users to store their regular account on machines
* For this reason we use independent users / tokens, but with the same username as in LDAP
Creating a user:
 <pre>
 radosgw-admin user create --uid=USERNAME --display-name="Name of user"
 </pre>
 Listing users:
 <pre>
 radosgw-admin user list
 </pre>
 Deleting users and their storage:
 <pre>
 radosgw-admin user rm --uid=USERNAME --purge-data
 </pre>
h3. Setting up S3 object storage on Ceph
* Set up a gateway node with Alpine Linux
** Change to the edge branch
** Enable the testing repository
* Update the firewall to allow access from this node to the ceph monitors
* Set up the wildcard DNS certificate
 <pre>
 apk add ceph-radosgw
 </pre>
 h3. Wildcard DNS certificate from letsencrypt
Acquiring and renewing this certificate is currently a manual process, as it requires changing DNS settings.
 * run certbot
 * update DNS with the first token
 * update DNS with the second token
 Sample session:
 <pre>
 s3:/etc/ceph# certbot certonly --manual --preferred-challenges=dns --email sre@ungleich.ch --server https://acme-v02.api.letsencrypt.org/directory --agree-tos
 -d *.s3.ungleich.ch -d s3.ungleich.ch
 Saving debug log to /var/log/letsencrypt/letsencrypt.log
 Plugins selected: Authenticator manual, Installer None
 Cert is due for renewal, auto-renewing...
 Renewing an existing certificate
 Performing the following challenges:
 dns-01 challenge for s3.ungleich.ch
 dns-01 challenge for s3.ungleich.ch
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 NOTE: The IP of this machine will be publicly logged as having requested this
 certificate. If you're running certbot in manual mode on a machine that is not
 your server, please ensure you're okay with that.
 Are you OK with your IP being logged?
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 (Y)es/(N)o: y
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 Please deploy a DNS TXT record under the name
 _acme-challenge.s3.ungleich.ch with the following value:
 KxGLZNiVjFwz1ifNheoR_KQoPVpkvRUV1oT2pOvJlU0
 Before continuing, verify the record is deployed.
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 Press Enter to Continue
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 Please deploy a DNS TXT record under the name
 _acme-challenge.s3.ungleich.ch with the following value:
 bkrhtxWZUipCAL5cBfvrjDuftqsZdQ2JjisiKmXBbaI
 Before continuing, verify the record is deployed.
 (This must be set up in addition to the previous challenges; do not remove,
 replace, or undo the previous challenge tasks yet. Note that you might be
 asked to create multiple distinct TXT records with the same name. This is
 permitted by DNS standards.)
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 Press Enter to Continue
 Waiting for verification...
 Cleaning up challenges
 IMPORTANT NOTES:
  - Congratulations! Your certificate and chain have been saved at:
    /etc/letsencrypt/live/s3.ungleich.ch/fullchain.pem
    Your key file has been saved at:
    /etc/letsencrypt/live/s3.ungleich.ch/privkey.pem
    Your cert will expire on 2020-12-09. To obtain a new or tweaked
    version of this certificate in the future, simply run certbot
    again. To non-interactively renew *all* of your certificates, run
    "certbot renew"
  - If you like Certbot, please consider supporting our work by:
    Donating to ISRG / Let's Encrypt:   https://letsencrypt.org/donate
    Donating to EFF:                    https://eff.org/donate-le
 </pre>
 h2. Debugging ceph
 <pre>
     ceph status
     ceph osd status
     ceph osd df
     ceph osd utilization
     ceph osd pool stats
     ceph osd tree
     ceph pg stat
 </pre>
h2. Performance Tuning
 * Ensure that the basic options for reducing rebalancing workload are set:
 <pre>
 osd max backfills = 1
 osd recovery max active = 1
 osd recovery op priority = 2
 </pre>
 * Ensure that "osd_op_queue_cut_off":https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_op_queue_cut_off is set to **high**
 ** Requires OSD restart on change
h2. Ceph theory
 h3. How much data per Server?
 Q: How much data should we add into one server?
 A: Not more than it can handle.
 How much data can a server handle? For this let's have a look at 2 scenarios:
 * How long does it take to compensate the loss of the server?
 * Assuming a server has X TiB storage in Y disks attached and a network speed of Z GiB/s.
 * And our estimated rebuild goal is to compensate the loss of a server within U hours.
 h4. Approach 1
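Then the time to transfer the data is roughly X / Z (with X expressed in GiB), and it should stay below U hours.
Let's take an example:
* A server with @10 disks * 10 TiB@ = 100 TiB ≈ 100 000 GiB of data (rounded; exactly 102 400 GiB). It is connected to the network with 10 Gbit/s ≈ 1.25 GB/s, which we treat as 1.25 GiB/s here.
* 100000 GiB / 1.25 GiB/s = 80000 s ≈ 22.22 h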
 However, our logic assumes that we actually rebuild from the failed server, which... is failed.
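The Approach 1 arithmetic can be replayed for other server sizes and network speeds. A minimal sketch, assuming the data size is given in GiB and the usable throughput in GiB/s; @rebuild_hours@ is a made-up helper name and the calculation ignores disk speed and protocol overhead, just like the example above:

```shell
#!/bin/sh
# Hours needed to move $1 GiB at $2 GiB/s, printed with two decimals.
rebuild_hours() {
    awk -v data_gib="$1" -v speed_gib_s="$2" \
        'BEGIN { printf "%.2f\n", data_gib / speed_gib_s / 3600 }'
}

rebuild_hours 100000 1.25   # the example above, prints 22.22
```

Comparing the result with the rebuild goal U then shows whether a server of that size is acceptable under these simplified assumptions.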
h4. Approach 2: calculating with the remaining servers
However, we can also apply our logic to a rebuild that is distributed over several servers, which now pull in data from each other.
We need to *read* the data (100 TiB) from other servers and distribute it to new OSDs, assuming each server has a 10 Gbit/s network connection.
Now the servers might need to *read* (get data from other OSDs) and *write* (send data to other OSDs). Luckily, networking is 10 Gbit/s duplex, i.e. in both directions.
However how fast can we actually read data from the disks?
* SSDs are in the range of hundreds of MB/s (best case, not necessarily true for random reads) - let's assume
* HDDs are in the range of tens of MB/s (depending on the work load, but 30-40 MB/s random reads seems realistic)
Further assumptions:
* At least one CPU core should be dedicated to each disk.
h3. Disk/SSD speeds
* Tuning for #8473 showed that a 10TB HDD can write up to 180-200MB/s when backfilling (at about 70% cpu usage and 20% disk usage), with max backfills = 8
* Debugging SSD usage in #8461 showed SSDs can read about 470-520MB/s sequential
* Debugging SSD usage in #8461 showed SSDs can write about 170-280MB/s sequential
* Debugging SSD usage in #8461 showed SSDs can write about 4MB/s RANDOM (need to verify this even though 3 runs showed these numbers)
h3. Ceph theoretical foundations
If you are interested in the theoretical foundations of Ceph, check out their "paper":https://www3.nd.edu/~dthain/courses/cse40771/spring2007/papers/ceph.pdf
