Task #5786: Document our ceph setup && maintenance guide - Open Infrastructure - ungleich redmine

Updated by Nico Schottelius almost 8 years ago (#1)

Description updated (diff)
Status changed from New to In Progress

Will put sample / first versions in this ticket before migrating to the wiki.

Updated by Nico Schottelius almost 8 years ago (#2)

Description updated (diff)

Updated by Nico Schottelius almost 8 years ago (#3)

Description updated (diff)

Updated by Nico Schottelius almost 8 years ago (#4)

Description updated (diff)

Updated by Nico Schottelius almost 8 years ago (#5)

Description updated (diff)

Updated by Nico Schottelius almost 8 years ago (#6)

Description updated (diff)

Updated by Samuel Hailu almost 8 years ago (#7)

Nico Schottelius wrote:

Introduction¶

This article describes the ungleich storage architecture, which is based on ceph. It covers the architecture as well as the maintenance commands. Required for

Communication guide¶

Usually when a disk fails, no customer communication is necessary, because ceph compensates for the failure and rebalances automatically. However, if multiple disks fail at the same time, I/O speed might be reduced and customer experience impacted.

For this reason, communicate to customers whenever I/O recovery settings are temporarily tuned.

Note: please notify the infrastructure channel before you add or remove a disk.

Adding a new disk/ssd¶

Moving a disk/ssd to another server¶

Removing a disk/ssd¶

Change ceph speed for i/o recovery¶

By default we want to keep I/O recovery traffic low so that it does not impact customer experience. However, when multiple disks fail at the same time, we might want to prioritise recovery, and thus data safety, over performance.

The default configuration on our servers contains:

[...]

The important settings are osd max backfills and osd recovery max active; the recovery priority is always kept low so that regular I/O takes precedence.

To adjust the number of backfills per osd and the number of concurrent recovery operations per osd, we can run the following on any node with the admin keyring:

[...]

where X and Y are the values that we want to use. Experience shows that X=5 and Y=5 double to triple the recovery performance, whereas X=10 and Y=10 increase it fivefold.
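The exact command is not reproduced above, so the following is only a minimal sketch of one common way to change these two settings at runtime, namely `ceph tell osd.* injectargs`. The helper name `injectargs_command` and its parameter names are hypothetical, and the sketch only builds and prints the command instead of running it.

```python
import shlex


def injectargs_command(max_backfills: int, recovery_max_active: int) -> list[str]:
    """Build a `ceph tell` command that injects recovery settings into all OSDs.

    Hypothetical helper: it only assembles the argument list, it does not
    contact the cluster.
    """
    if max_backfills < 1 or recovery_max_active < 1:
        raise ValueError("both values must be positive integers")
    # injectargs takes the options as one string argument.
    args = (
        f"--osd-max-backfills {max_backfills} "
        f"--osd-recovery-max-active {recovery_max_active}"
    )
    return ["ceph", "tell", "osd.*", "injectargs", args]


# Print a shell-quoted version that could be pasted on a node with the admin keyring.
print(shlex.join(injectargs_command(5, 5)))
```

Values injected this way are runtime-only and are lost when an OSD restarts, which fits the "temporarily tuned" use described in the communication guide. Once recovery has finished, the same helper can be used with the normal low values to go back to the defaults.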

Debug scrub errors / inconsistent pg message¶

From time to time disks don't save what they are told to save. Ceph scrubbing detects these errors and the cluster switches to HEALTH_ERR. Use ceph health detail to find out which placement groups (pgs) are affected. Usually a ceph pg repair <number> fixes the problem.

If this does not help, consult https://ceph.com/geen-categorie/ceph-manually-repair-object/.
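The two steps above (find the affected pgs, then repair them) can be sketched as a small script. This is an illustration under assumptions: the sample text in `SAMPLE` is made up to resemble typical `ceph health detail` output, and the function names `inconsistent_pgs` and `repair_commands` are hypothetical. The script only builds the repair commands; it does not run them.

```python
import re

# Matches lines such as "pg 0.6 is active+clean+inconsistent, acting [0,1,2]".
PG_LINE = re.compile(r"^\s*pg (\d+\.[0-9a-f]+) is \S*inconsistent")

# Assumed sample resembling `ceph health detail` output, not taken from a real cluster.
SAMPLE = """\
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 0.6 is active+clean+inconsistent, acting [0,1,2]
2 scrub errors
"""


def inconsistent_pgs(health_detail: str) -> list[str]:
    """Return the ids of all pgs reported as inconsistent."""
    return [
        m.group(1)
        for line in health_detail.splitlines()
        if (m := PG_LINE.match(line))
    ]


def repair_commands(health_detail: str) -> list[list[str]]:
    """Build one `ceph pg repair <pgid>` command per inconsistent pg."""
    return [["ceph", "pg", "repair", pgid] for pgid in inconsistent_pgs(health_detail)]


for cmd in repair_commands(SAMPLE):
    print(" ".join(cmd))
```

Printing the commands first, rather than executing them directly, leaves room to check which pgs are affected before a repair is triggered.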


Updated by Samuel Hailu almost 8 years ago (#8)


Updated by Samuel Hailu almost 8 years ago (#9)


Updated by Nico Schottelius almost 8 years ago (#10)

Description updated (diff)

Updated by Nico Schottelius over 7 years ago (#11)

Status changed from In Progress to Closed
