h1. The ungleich ceph handbook

{{toc}}

h2. Status

This document is **IN PRODUCTION**.

h2. Introduction

This article describes the ungleich storage architecture, which is based on ceph, as well as the required maintenance commands.

h2. Communication guide

Usually, when a disk fails, no customer communication is necessary, as ceph automatically compensates for the failure and rebalances the data. However, if multiple disks fail at the same time, I/O speed might be reduced and the customer experience thus impacted.

For this reason, communicate with customers whenever the I/O recovery settings are temporarily tuned.

h2. Analysing

h3. ceph osd df tree

Using @ceph osd df tree@ you can see not only the disk usage per OSD, but also the number of PGs on each OSD. This is especially useful to see how well the OSDs are balanced.
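
A trimmed, illustrative sketch of the output (columns and values vary per ceph version and cluster; the names and weights below are taken from the @ceph osd tree@ example further down, the usage values are only placeholders):

<pre>
[11:02:31] server2.place6:~# ceph osd df tree
ID  CLASS   WEIGHT    REWEIGHT SIZE USE AVAIL %USE VAR PGS TYPE NAME
 -1         251.85580        -  ...  ...   ...  ... ...   - root default
 -7          81.56271        -  ...  ...   ...  ... ...   -     host server2
  0 hdd-big   9.09511  1.00000  ...  ...   ...  ... ... 133         osd.0
  5 hdd-big   9.09511  1.00000  ...  ...   ...  ... ... 134         osd.5
...
</pre>

The *PGS* and *%USE* columns are the ones to compare between OSDs of the same class when judging the balance.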

h3. Find out the device of an OSD

Use @mount | grep /var/lib/ceph/osd/ceph-OSDID@ on the server on which the OSD is located:

<pre>
[16:01:23] server2.place6:~# mount | grep /var/lib/ceph/osd/ceph-31
/dev/sdk1 on /var/lib/ceph/osd/ceph-31 type xfs (rw,relatime,attr2,inode64,noquota)
</pre>

h2. Adding a new disk/ssd to the ceph cluster

h3. For Dell servers

First find the disk and then add it to the operating system:

<pre>
megacli -PDList -aALL  | grep -B16 -i unconfigur

# Sample output:
[19:46:50] server7.place6:~#  megacli -PDList -aALL  | grep -B16 -i unconfigur
Enclosure Device ID: N/A
Slot Number: 0
Enclosure position: N/A
Device Id: 0
WWN: 0000000000000000
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 894.252 GB [0x6fc81ab0 Sectors]
Non Coerced Size: 893.752 GB [0x6fb81ab0 Sectors]
Coerced Size: 893.75 GB [0x6fb80000 Sectors]
Sector Size:  0
Firmware state: Unconfigured(good), Spun Up
</pre>

Then add the disk to the OS:

<pre>
# X = adapter number (host is 0, array is 1)
megacli -CfgLdAdd -r0 [enclosure position:slot] -aX

# Sample call, if enclosure and slot are KNOWN (aka not N/A)
megacli -CfgLdAdd -r0 [32:0] -a0

# Sample call, if enclosure is N/A
megacli -CfgLdAdd -r0 [:0] -a0
</pre>
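
Once the logical drive has been created, the disk appears as an additional block device in the OS. A minimal sketch of the follow-up, assuming the ungleich tooling described later in this document and /dev/sdX as the new device:

<pre>
lsblk                                                    # the new logical drive shows up as e.g. /dev/sdX
/opt/ungleich-tools/ceph-osd-create-start /dev/sdX hdd   # create and start the OSD (class hdd as an example)
</pre>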

h2. Moving a disk/ssd to another server

(needs to be described better)

Generally speaking (a combined command sketch follows after this list):

* /opt/ungleich-tools/ceph-osd-stop-disable does the following:
** Stops the OSD and removes the monit configuration on the server you want to take the disk out of
** Unmounts the disk
* Take the disk out
* Discard the preserved cache on the server you took it out of
** using megacli: @megacli -DiscardPreservedCache -Lall -a0@
* Insert it into the new server
* Clear the foreign configuration
** using megacli: @megacli -CfgForeign -Clear -a0@
* The disk will now appear in the OS, and ceph/udev will automatically start the OSD (!)
** No creation of the OSD is required!
* Verify that the disk exists and that the osd is started
** using *ps aux*
** using *ceph osd tree*
* */opt/ungleich-tools/monit-ceph-create-start osd.XX* # where XX is the osd number
** Creates the monit configuration file so that monit watches the OSD
** Reloads monit
* Verify monit using *monit status*
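
A minimal sketch of the commands on the destination server, assuming the moved disk carries osd.31 (the osd id is only an example):

<pre>
megacli -CfgForeign -Clear -a0                        # clear the foreign configuration
ps aux | grep ceph-osd                                # the OSD process should have been started by ceph/udev
ceph osd tree                                         # osd.31 should now show up under the new host
/opt/ungleich-tools/monit-ceph-create-start osd.31    # let monit watch the OSD
monit status                                          # verify monit
</pre>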

h2. Removing a disk/ssd

To permanently remove a failed disk from a cluster, use ***ceph-osd-stop-remove-permanently*** from the ungleich-tools repo. Warning: if the disk is still active, the OSD will be shut down AND removed from the cluster -> all data of that disk will need to be rebalanced.
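
A minimal usage example, assuming the failed disk carries osd id 31 (the id is only an example):

<pre>
/opt/ungleich-tools/ceph-osd-stop-remove-permanently 31
</pre>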

h2. Handling DOWN osds with filesystem errors

If an email arrives with the subject "monit alert -- Does not exist osd.XX-whoami", the filesystem of an OSD cannot be read anymore. It is very likely that the disk / ssd is broken. Steps that need to be done:

* Login to any ceph monitor (cephX.placeY.ungleich.ch)
* Check **ceph -s**, find the affected host using **ceph osd tree**
* Login to the affected host
* Run the following commands:
** ls /var/lib/ceph/osd/ceph-XX
** dmesg
<pre>
# Example: if dmesg shows messages like the following, continue with the next steps
[204696.406756] XFS (sdl1): metadata I/O error: block 0x19100 ("xlog_iodone") error 5 numblks 64
[204696.408094] XFS (sdl1): xfs_do_force_shutdown(0x2) called from line 1233 of file /build/linux-BsFdsw/linux-4.9.65/fs/xfs/xfs_log.c.  Return address = 0xffffffffc08eb612
[204696.410702] XFS (sdl1): Log I/O Error Detected.  Shutting down filesystem
[204696.411977] XFS (sdl1): Please umount the filesystem and rectify the problem(
</pre>

* Create a new ticket in the datacenter light project
** Subject: "Replace broken OSD.XX on serverX.placeY.ungleich.ch"
** Add (partial) output of the above commands
** Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently XX, where XX is the osd id, to remove the disk from the cluster
** Remove the physical disk from the host, check whether there is warranty on it and if yes:
*** Create a short letter to the vendor, including the technical details from above
*** Record when you sent it in
*** Put the ticket into status waiting
** If there is no warranty, dispose of the disk

h2. Change ceph speed for i/o recovery

By default we want to keep I/O recovery traffic low so that it does not impact the customer experience. However, when multiple disks fail at the same time, we might want to prioritise recovery for data safety over performance.

The default configuration on our servers contains:

<pre>
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

The important settings are *osd max backfills* and *osd recovery max active*; the priority is always kept low so that regular I/O has priority.

To adjust the number of backfills *per osd* and to change the *number of threads* used for recovery, we can run the following on any node with the admin keyring:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills Y'
ceph tell osd.* injectargs '--osd-recovery-max-active X'
</pre>

where Y and X are the values that we want to use. Experience shows that Y=5 and X=5 doubles to triples the recovery performance, whereas Y=10 and X=10 increases recovery performance about 5 times.
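
Once recovery has caught up, the same mechanism can be used to return to the defaults shown above:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
</pre>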

h2. Debug scrub errors / inconsistent pg message

From time to time disks don't save what they are told to save. Ceph scrubbing detects these errors and switches to HEALTH_ERR. Use *ceph health detail* to find out which placement groups (*pgs*) are affected. Usually a ***ceph pg repair <number>*** fixes the problem.

If this does not help, consult https://ceph.com/geen-categorie/ceph-manually-repair-object/.
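
An illustrative sequence (the pg id 2.31 is only an example; take the real id from the health output):

<pre>
ceph health detail     # lists the affected pgs, e.g. "pg 2.31 is active+clean+inconsistent, acting [...]"
ceph pg repair 2.31    # instructs the primary osd of that pg to repair it
</pre>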

h2. Move servers into the osd tree

New servers have their buckets placed outside the **default root** and thus need to be moved inside.
The output of *ceph osd tree* might look as follows:

<pre>
[11:19:27] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF 
 -3           0.87270 host server5                             
 41     ssd   0.87270     osd.41           up  1.00000 1.00000 
 -1         251.85580 root default                             
 -7          81.56271     host server2                         
  0 hdd-big   9.09511         osd.0        up  1.00000 1.00000 
  5 hdd-big   9.09511         osd.5        up  1.00000 1.00000 
...
</pre>

Use **ceph osd crush move serverX root=default** (where serverX is the new server),
which will move the bucket to the right place:

<pre>
[11:21:17] server5.place6:~# ceph osd crush move server5 root=default
moved item id -3 name 'server5' to location {root=default} in crush map
[11:32:12] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF 
 -1         252.72850 root default                             
...
 -3           0.87270     host server5                         
 41     ssd   0.87270         osd.41       up  1.00000 1.00000 
</pre>

h2. How to fix existing osds with wrong partition layout

In the first version of DCL we used a filestore, 3-partition based layout.
In the second version of DCL, which includes OSD autodetection, we use a bluestore, 2-partition based layout.

To convert, we delete the old OSD, clean the partitions and create a new OSD:

h3. Inactive OSD

If the OSD is *not active*, we can do the following:

* Find the OSD number: mount the partition and find the whoami file

<pre>
root@server2:/opt/ungleich-tools# mount /dev/sda2 /mnt/
root@server2:/opt/ungleich-tools# cat /mnt/whoami 
0
root@server2:/opt/ungleich-tools# umount  /mnt/
</pre>

* Verify in the *ceph osd tree* output that the OSD is on that server
* Delete the OSD (see the example after this list)
** ceph osd crush remove $osd_name
** ceph osd rm $osd_name
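
A minimal example, assuming the whoami file above contained 0, i.e. the OSD is osd.0:

<pre>
ceph osd crush remove osd.0
ceph osd rm osd.0
</pre>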

Then continue below as described in "Recreating the OSD".

h3. Remove Active OSD

* Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently OSDID to stop and remove the OSD
* Then continue below as described in "Recreating the OSD".

h3. Recreating the OSD

* Create an empty partition table (see the combined sketch after this list)
** fdisk /dev/sdX
** g
** w
* Create a new OSD
** /opt/ungleich-tools/ceph-osd-create-start /dev/sdX CLASS # use hdd, ssd, ... for the CLASS
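
A combined sketch of these steps, assuming the disk is /dev/sdX and the new OSD should get the class hdd:

<pre>
# inside fdisk: "g" creates a new empty GPT partition table, "w" writes it and exits
fdisk /dev/sdX

# create and start the new OSD on the wiped disk
/opt/ungleich-tools/ceph-osd-create-start /dev/sdX hdd
</pre>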

h2. How to fix unfound pg

Refer to https://redmine.ungleich.ch/issues/6388 for background. The general procedure is (a command sketch follows after this list):

* Check the health state
** ceph health detail
* Check which server has that osd
** ceph osd tree
* Check which VMs are running on that server
** virsh list
* Check the pg map
** ceph osd map [osd pool] [VMID]
* Revert the pg
** ceph pg [PGID] mark_unfound_lost revert
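
The same steps as a compact command sketch (pool name, VM id and pg id are placeholders, taken from the outputs of the previous commands):

<pre>
ceph health detail                         # shows the pg(s) with unfound objects
ceph osd tree                              # find the server hosting the affected osd
virsh list                                 # on that server: list the running VMs
ceph osd map [osd pool] [VMID]             # map the VM image to its pg
ceph pg [PGID] mark_unfound_lost revert    # revert the unfound objects
</pre>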