The ungleich ceph handbook » History » Version 13

Nico Schottelius, 12/06/2018 01:23 PM

h1. The ungleich ceph handbook

{{toc}}

h2. Status

This document is **IN PRODUCTION**.

h2. Introduction

This article describes the ungleich storage architecture, which is based on ceph. It covers our architecture as well as the maintenance commands. Required for

h2. Communication guide

Usually when a disk fails, no customer communication is necessary, as ceph automatically compensates by rebalancing. However, if multiple disks fail at the same time, I/O speed might be reduced and customer experience impacted.

For this reason, communicate with customers whenever I/O recovery settings are temporarily tuned.

h2. Adding a new disk/ssd to the ceph cluster

h3. For Dell servers

First find the unconfigured disk, then add it to the operating system:

<pre>
megacli -PDList -aALL  | grep -B16 -i unconfigur

# Sample output:
[19:46:50] server7.place6:~#  megacli -PDList -aALL  | grep -B16 -i unconfigur
Enclosure Device ID: N/A
Slot Number: 0
Enclosure position: N/A
Device Id: 0
WWN: 0000000000000000
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 894.252 GB [0x6fc81ab0 Sectors]
Non Coerced Size: 893.752 GB [0x6fb81ab0 Sectors]
Coerced Size: 893.75 GB [0x6fb80000 Sectors]
Sector Size:  0
Firmware state: Unconfigured(good), Spun Up
</pre>

Then add the disk to the OS:

<pre>
megacli -CfgLdAdd -r0 [enclosure:slot] -aX

# Sample call, if enclosure and slot are KNOWN (aka not N/A)
megacli -CfgLdAdd -r0 [32:0] -a0

# Sample call, if enclosure is N/A
megacli -CfgLdAdd -r0 [:0] -a0
</pre>
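
The two sample calls differ only in the enclosure field. As a minimal sketch, a small helper could build the call from the values reported by *megacli -PDList*. The function name @megacli_add_cmd@ is hypothetical and not part of ungleich-tools; it only prints the command and does not run it.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the megacli call that exposes
# an unconfigured disk as a single-disk RAID0 volume.
# Usage: megacli_add_cmd ENCLOSURE SLOT ADAPTER
# ENCLOSURE may be "N/A", exactly as reported by "megacli -PDList".
megacli_add_cmd() {
    enclosure=$1
    slot=$2
    adapter=$3
    # The enclosure field is left empty when megacli reports N/A
    if [ "$enclosure" = "N/A" ]; then
        enclosure=""
    fi
    printf 'megacli -CfgLdAdd -r0 [%s:%s] -a%s\n' "$enclosure" "$slot" "$adapter"
}

megacli_add_cmd 32 0 0     # enclosure and slot known
megacli_add_cmd N/A 0 0    # enclosure reported as N/A
```

Printing the command first makes it easy to review the enclosure and slot before touching the controller.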

h2. Moving a disk/ssd to another server

(needs to be described better)

Generally speaking:

* /opt/ungleich-tools/ceph-osd-stop-disable does the following:
** Stops the osd and removes its monit configuration on the server you want to take it out of
** umounts the disk
* Take the disk out
* Discard the preserved cache on the server you took it out of
** using megacli
* Insert the disk into the new server
* Clear the foreign configuration
** using megacli
* The disk will now appear in the OS; ceph/udev will automatically start the OSD (!)
** No creation of the osd required!
* Verify that the disk exists and that the osd is started
** using *ps aux*
** using *ceph osd tree*
* */opt/ungleich-tools/monit-ceph-create-start osd.XX* # where osd.XX is the osd + number
** Creates the monit configuration file so that monit watches the OSD
** Reloads monit
* Verify monit using *monit status*
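
The list above names megacli for two steps without giving the calls. The sketch below prints one possible command sequence. The specific megacli options (@-GetPreservedCacheList@, @-DiscardPreservedCache@, @-CfgForeign -Scan@, @-CfgForeign -Clear@) are an assumption about what is meant, and the function name @disk_move_commands@ and the logical drive and adapter numbers are placeholders. Nothing is executed.

```shell
#!/bin/sh
# Hypothetical helper: print the commands for moving an OSD disk.
# Usage: disk_move_commands OSD_ID LOGICAL_DRIVE ADAPTER
# The megacli options are assumed, not taken from the handbook.
disk_move_commands() {
    osd_id=$1
    ld=$2
    adapter=$3
    cat <<EOF
# old server, after ceph-osd-stop-disable and pulling the disk:
megacli -GetPreservedCacheList -a${adapter}
megacli -DiscardPreservedCache -L${ld} -a${adapter}
# new server, after inserting the disk:
megacli -CfgForeign -Scan -aALL
megacli -CfgForeign -Clear -aALL
# verify that the OSD came up, then hand it to monit:
ps aux | grep '[c]eph-osd'
ceph osd tree | grep -w 'osd.${osd_id}'
/opt/ungleich-tools/monit-ceph-create-start osd.${osd_id}
monit status
EOF
}

disk_move_commands 41 3 0
```

Check the preserved cache list before discarding anything, since the logical drive number differs per controller.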

h2. Removing a disk/ssd

To permanently remove a failed disk from a cluster, use ***ceph-osd-stop-remove-permanently*** from the ungleich-tools repo. Warning: if the disk is still active, the OSD will be shut down AND removed from the cluster, so all data on that disk will need to be rebalanced.

h2. Handling DOWN osds with filesystem errors

If an email arrives with the subject "monit alert -- Does not exist osd.XX-whoami", the filesystem of an OSD cannot be read anymore. It is very likely that the disk/ssd is broken. Steps that need to be done:

* Log in to any ceph monitor (cephX.placeY.ungleich.ch)
* Check **ceph -s**, find the host using **ceph osd tree**
* Log in to the affected host
* Run the following commands:
** ls /var/lib/ceph/osd/ceph-XX
** dmesg
* Create a new ticket in the datacenter light project
** Subject: "Replace broken OSD.XX on serverX.placeY.ungleich.ch"
** Add (partial) output of the above commands
** Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently XX, where XX is the osd id, to remove the disk from the cluster
** Remove the physical disk from the host and check whether it is still under warranty; if yes:
*** Write a short letter to the vendor, including the technical details from above
*** Record when you sent it in
*** Put the ticket into status waiting
** If there is no warranty, dispose of it
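
The osd id is used in several of the steps above (the directory name, the ticket subject, the removal script). As a small sketch, it can be extracted from the alert subject like this. The function name @osd_id_from_subject@ is hypothetical, and the sample id 17 is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: extract the osd id from a monit alert subject of the
# form "monit alert -- Does not exist osd.XX-whoami".
# Prints nothing if the subject does not match.
osd_id_from_subject() {
    # Keep only the digits between "osd." and "-whoami"
    printf '%s\n' "$1" | sed -n 's/.*osd\.\([0-9][0-9]*\)-whoami.*/\1/p'
}

subject='monit alert -- Does not exist osd.17-whoami'
osd_id=$(osd_id_from_subject "$subject")

# Pieces needed for the steps above; serverX.placeY stays a placeholder
echo "ls /var/lib/ceph/osd/ceph-${osd_id}"
echo "Replace broken OSD.${osd_id} on serverX.placeY.ungleich.ch"
```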

h2. Change ceph speed for i/o recovery

By default we want to keep I/O recovery traffic low so that it does not impact customer experience. However, when multiple disks fail at the same time, we might want to prioritise recovery for data safety over performance.

The default configuration on our servers contains:

<pre>
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

The important settings are *osd max backfills* and *osd recovery max active*; the priority is always kept low so that regular I/O has priority.

To adjust the number of backfills *per osd* and to change the *number of concurrent recovery operations* per osd, we can use the following on any node with the admin keyring:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills Y'
ceph tell osd.* injectargs '--osd-recovery-max-active X'
</pre>

where Y and X are the values that we want to use. Experience shows that Y=5 and X=5 doubles to triples the recovery performance, whereas X=10 and Y=10 increases recovery performance 5 times.
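
Because the tuning is temporary, it helps to generate the raise and the reset from the same helper. The following is a sketch only: the function name @recovery_tuning_cmds@ is hypothetical, it prints the commands instead of running them, and it quotes @osd.*@ so that the local shell cannot glob-expand it.

```shell
#!/bin/sh
# Hypothetical helper: print the injectargs calls for a given tuning.
# Usage: recovery_tuning_cmds MAX_BACKFILLS RECOVERY_MAX_ACTIVE
recovery_tuning_cmds() {
    backfills=$1
    active=$2
    # Accept positive integers only
    for value in "$backfills" "$active"; do
        case $value in
            ''|*[!0-9]*) echo "not a number: '$value'" >&2; return 1 ;;
        esac
    done
    printf "ceph tell 'osd.*' injectargs '--osd-max-backfills %s'\n" "$backfills"
    printf "ceph tell 'osd.*' injectargs '--osd-recovery-max-active %s'\n" "$active"
}

recovery_tuning_cmds 5 5    # temporarily faster recovery
recovery_tuning_cmds 1 1    # back to the defaults from the [osd] section
```

Values set via injectargs are not persistent; the configuration file values apply again after an OSD restart.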

h2. Debug scrub errors / inconsistent pg message

From time to time disks don't save what they are told to save. Ceph scrubbing detects these errors and switches to HEALTH_ERR. Use *ceph health detail* to find out which placement groups (*pgs*) are affected. Usually a ***ceph pg repair <number>*** fixes the problem.

If this does not help, consult https://ceph.com/geen-categorie/ceph-manually-repair-object/.
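
As a sketch, the affected pgs can be pulled out of the *ceph health detail* output and turned into repair commands. The sample text below is an assumption about the output format (lines of the form "pg ID is ...inconsistent..."), which may differ between ceph releases; the function name @inconsistent_pgs@ is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "ceph health detail" output on stdin and print
# the ids of all placement groups flagged as inconsistent.
inconsistent_pgs() {
    awk '$1 == "pg" && $0 ~ /inconsistent/ { print $2 }'
}

# Assumed sample output, for illustration only
sample='HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 2.1f is active+clean+inconsistent, acting [3,7,12]
2 scrub errors'

# Print (do not run) one repair command per affected pg
printf '%s\n' "$sample" | inconsistent_pgs | while read -r pg; do
    echo "ceph pg repair $pg"
done
```

On a live cluster the input would come from @ceph health detail | inconsistent_pgs@ instead of the sample text.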

h2. Move servers into the osd tree

New servers have their buckets placed outside the **default root** and thus need to be moved inside.
The output of *ceph osd tree* might look as follows:

<pre>
[11:19:27] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF
 -3           0.87270 host server5
 41     ssd   0.87270     osd.41           up  1.00000 1.00000
 -1         251.85580 root default
 -7          81.56271     host server2
  0 hdd-big   9.09511         osd.0        up  1.00000 1.00000
  5 hdd-big   9.09511         osd.5        up  1.00000 1.00000
...
</pre>

Use **ceph osd crush move serverX root=default** (where serverX is the new server),
which will move the bucket to the right place:

<pre>
[11:21:17] server5.place6:~# ceph osd crush move server5 root=default
moved item id -3 name 'server5' to location {root=default} in crush map
[11:32:12] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF
 -1         252.72850 root default
...
 -3           0.87270     host server5
 41     ssd   0.87270         osd.41       up  1.00000 1.00000
</pre>
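
In the plain-text tree, a host outside the root is recognisable only by its indentation: its *host* entry starts in the same column as the *root* entry. The sketch below relies on that layout, which is an assumption based on the sample output above and may not hold for every ceph release. The function name @hosts_outside_root@ is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "ceph osd tree" output on stdin and print the
# names of host buckets that are not indented below a root bucket.
hosts_outside_root() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                # Remember the column at which the root entry starts
                if ($i == "root") rootcol = index($0, " root ")
                # Remember every host name and the column of its entry
                if ($i == "host") { name[++n] = $(i + 1); col[n] = index($0, " host ") }
            }
        }
        END {
            # A host that is not indented further than the root is outside it
            for (j = 1; j <= n; j++) if (col[j] <= rootcol) print name[j]
        }
    '
}

# Typical use on a node with the admin keyring (prints, does not run):
#   ceph osd tree | hosts_outside_root | while read -r host; do
#       echo "ceph osd crush move $host root=default"
#   done
```

Review the printed move commands before running them, since moving a bucket triggers rebalancing.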

h2. How to fix existing osds with wrong partition layout

In the first version of DCL we used a filestore/3-partition layout.
In the second version of DCL, including OSD autodetection, we use a bluestore/2-partition layout.

To convert, we delete the old OSD, clean the partitions and create a new osd.

Assuming the OSD is *not active*, we can do the following:

* Find the OSD number: mount the partition and find the whoami file

<pre>
root@server2:/opt/ungleich-tools# mount /dev/sda2 /mnt/
root@server2:/opt/ungleich-tools# cat /mnt/whoami
0
root@server2:/opt/ungleich-tools# umount  /mnt/
</pre>

* Verify in the *ceph osd tree* that the OSD is on that server
* Delete the OSD
** ceph osd crush remove $osd_name
** ceph osd rm $osd_name
* Create an empty partition table
** fdisk /dev/sdX
** g
** w
* Create a new OSD
** /opt/ungleich-tools/ceph-osd-create-start /dev/sdX CLASS # use hdd, ssd, ... for the CLASS
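
The steps above can be sketched as one printed command sequence. The function name @osd_recreate_cmds@ is hypothetical, @$osd_name@ is assumed to have the form osd.N, and piping *g* and *w* into fdisk is an assumed non-interactive equivalent of typing them. The helper only prints the commands, because the fdisk step destroys the partition table of the given device.

```shell
#!/bin/sh
# Hypothetical helper: print the commands that replace an inactive OSD
# with a freshly created one on the same disk.
# Usage: osd_recreate_cmds OSD_ID DEVICE CLASS
osd_recreate_cmds() {
    osd_id=$1
    dev=$2
    class=$3
    cat <<EOF
ceph osd crush remove osd.${osd_id}
ceph osd rm osd.${osd_id}
printf 'g\nw\n' | fdisk ${dev}
/opt/ungleich-tools/ceph-osd-create-start ${dev} ${class}
EOF
}

# The id comes from the whoami file, the device from the mount step
osd_recreate_cmds 0 /dev/sdX hdd
```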