Task #8069: Investigate potential bottleneck on storage/CEPH at DCL - Open Infrastructure - ungleich redmine

Added by Timothée Floure about 5 years ago. Updated over 1 year ago.

Status: Closed
Priority: Normal
Assignee:
Target version:
Start date: 05/27/2020
Due date:
% Done:
Estimated time:
PM Check date:

Updated by Timothée Floure about 5 years ago

Our hardware:

RAID controllers: perc h700, perc h800
- Technical manual: https://www.dell.com/learn/us/en/04/shared-content~data-sheets/documents~perc-technical-guidebook.pdf
- 2x4 ports, 6 Gb/s SAS 2.0, x8 PCIe 2.0
- 512 MB to 1 GB cache, 800 MHz DDR2
- IO load balancing on H800, not on H700 <- how does it work, is it significant?
Each server has dual 10Gbps connectivity.
Arista switches: 7050s
- Datasheet: https://www.arista.com/assets/data/pdf/Datasheets/7050S_Datasheet.pdf
- 52 x 1/10GbE SFP
- 4GB RAM, Dual-core x86 CPU
- '1.04 Tbps'
- 9MB Dynamic Buffer Allocation
Cables?
Disks?

Updated by Nico Schottelius about 5 years ago

Some questions we should be able to answer:

Real scenarios

NOTE: assuming all disks running at 'full speed'.
NOTE: the big unknown here is how the cache of the RAID controller behaves.
NOTE: unknown IOPS limitations on RAID controllers. <--- TODO, more important than bandwidth!

The R710 server has 8 disks slots (supposedly with a h700 controller). Given that we fully populate the server, what is the maximum bandwidth available per OSD running on that machine?
-> 4 GB/s from PCIe but 3GB/s for SATA -> 375 MB/s per disk modulo caching from RAID controller.

The R815 has 6 disk slots (is that true? -> Balazs). Same question as above.
-> SAS 6GB/s but 3GB/s for SATA -> 500 MB/s per disk, modulo caching from RAID controller.

What about an R815 with an md array (12x 3.5" HDD via SAS cable attached to H800)
-> 4 GB/s from PCIe connector (SAS supports 6GB/s) -> 333 MB/s per device.
- Is the bottleneck likely a) the disk, b) the controller, c) the network of the server, or d) another component in the server?
  - 10 Gbps = 1.25 GB/s = 104 MB/s per disk at full speed.
  - Controller PCIe limits at 500 MB/s per disk at full speed.
  -> Bottleneck likely to be on disk or network.
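
The per-disk figures above can be reproduced with a small sketch. It assumes an even split of each shared link across all disks, a single 10 Gbps link, and the nominal PCIe 2.0 rate of 500 MB/s per lane; SAS/SATA link rates, controller cache and IOPS limits are ignored. The function and constant names are illustrative only.

```python
# Nominal figures quoted in this thread; none of them are measurements.
PCIE2_LANE_MB_S = 500        # PCIe 2.0 throughput per lane
CONTROLLER_LANES = 8         # H700/H800 sit in an x8 slot
NIC_MB_S = 10_000 / 8        # one 10 Gbps link = 1250 MB/s


def per_disk_ceiling(disks: int) -> dict:
    """Split the controller's PCIe link and one NIC evenly across `disks`."""
    pcie = PCIE2_LANE_MB_S * CONTROLLER_LANES / disks
    network = NIC_MB_S / disks
    return {"pcie": pcie, "network": network, "limit": min(pcie, network)}


# Disk counts are the ones discussed above (the R815 slot count is unconfirmed).
for label, disks in [("R710, 8 disks", 8), ("R815 + H800 shelf, 12 disks", 12)]:
    c = per_disk_ceiling(disks)
    print(f"{label}: PCIe {c['pcie']:.0f} MB/s, "
          f"network {c['network']:.0f} MB/s, limit {c['limit']:.0f} MB/s per disk")
```

For 12 disks this gives roughly 333 MB/s from PCIe and 104 MB/s from the network, so under these assumptions the network share is the lower of the two ceilings.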

Given an Arista 7050 and an imaginary bandwidth per disk of 50 MB/s, how many disks can we run on one 7050?
- The Arista is supposed to handle 1.04 Tbps = 130000 MB/s = 2600 * 50 MB/s => not an issue.
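
As a quick check of that arithmetic, here is a minimal sketch. The `quoted_is_bidirectional` flag is an assumption of this example: 52 ports x 10 Gbps x 2 directions also equals 1.04 Tbps, so the datasheet figure may count both directions, in which case the one-way budget would be half.

```python
def disks_per_switch(switch_gbps: int, per_disk_mb_s: int,
                     quoted_is_bidirectional: bool = False) -> int:
    """How many disks at `per_disk_mb_s` fit into the switch's quoted capacity."""
    if quoted_is_bidirectional:
        switch_gbps //= 2                  # assume the figure sums TX and RX
    switch_mb_s = switch_gbps * 1000 // 8  # Gbps -> MB/s
    return switch_mb_s // per_disk_mb_s


print(disks_per_switch(1040, 50))                                # 2600
print(disks_per_switch(1040, 50, quoted_is_bidirectional=True))  # 1300
```

Either way the switching capacity is far above what the disks can generate at 50 MB/s each.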

Is the PCI-E bus (it's not a bus anymore - afair it's point-to-point) on either server model a limitation?
- It provides access to networking, disks and has an interconnect to the cpus

-> Not worried, but TODO.

We are using ceph bluestore (https://ceph.io/community/new-luminous-bluestore/)
- Does it make sense to switch our storage model to use 2 SSDs (e.g. 1 TB) in a RAID1 in front of HDDs and drop the distinction of HDD/SSD?
  - RAID1 is needed because if the SSD fails, all OSDs that have their RocksDB/BlueFS on it fail too

-> TODO

(skip answers if they are too far from what you can gather)

Updated by Timothée Floure about 5 years ago

Regarding the RAID controllers:

RAID0 (striping - redundancy is handled by CEPH across physical servers).
Some controllers are battery-backed:
- Likely write-back cache.
Some are not:
- Likely write-through cache.
- ... or forced write-back (WB) via BIOS/firmware setting?
Read cache defaults to 'Adaptive Read Ahead': When selected, the controller begins using Read-Ahead if the two most recent disk accesses occurred in sequential sectors.
- Fairly useless for random reads.
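
To illustrate why, here is a toy model of the quoted rule (read-ahead kicks in when the two most recent accesses hit consecutive sectors). It is only a sketch of that one sentence, not of the actual PERC firmware; the sector range and access counts are made up for the example.

```python
import random


def readahead_triggers(sectors):
    """Count accesses made right after two consecutive-sector accesses."""
    triggers = 0
    for i in range(2, len(sectors)):
        if sectors[i - 1] == sectors[i - 2] + 1:
            triggers += 1
    return triggers


rng = random.Random(0)                      # fixed seed for repeatability
sequential = list(range(1000))              # a purely sequential scan
scattered = [rng.randrange(10**9) for _ in range(1000)]  # random reads

print("sequential:", readahead_triggers(sequential), "of", len(sequential) - 2)
print("random:    ", readahead_triggers(scattered), "of", len(scattered) - 2)
```

The sequential scan triggers read-ahead on every eligible access, while random reads over a large device almost never produce two consecutive sectors in a row.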

Updated by Timothée Floure about 5 years ago

Regarding PCIe AND SAS/SATA:

Controllers are connected on x8 PCIe 2.0 => 500 MB/s per-lane for PCIe 2.0 -> x8 = 4 GB/s
6 Gb/s SAS 2.0 connectivity -> how is this split between disks? Should be fine anyway.
- PERC H700 supports SATA 3 Gb/s, PERC H800 does not support SATA.
How are our network cards connected? Should be fine anyway. 10 Gbps = 1.25 GB/s -> even PCIe 2.0 x4 is more than enough.
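
The lane budget for the network cards can be sketched the same way, assuming the nominal 500 MB/s per PCIe 2.0 lane and ignoring protocol overhead. How the NICs are actually attached, and whether both 10 Gbps ports share one card, is not known here; the helper names are illustrative.

```python
import math

PCIE2_LANE_MB_S = 500  # nominal PCIe 2.0 throughput per lane


def link_mb_s(gbps: float) -> float:
    """Convert a network link rate in Gbps to MB/s."""
    return gbps * 1000 / 8


def lanes_needed(required_mb_s: float) -> int:
    """Smallest number of PCIe 2.0 lanes that covers `required_mb_s`."""
    return math.ceil(required_mb_s / PCIE2_LANE_MB_S)


one_port = link_mb_s(10)     # 1250 MB/s
two_ports = 2 * one_port     # 2500 MB/s, if both ports sit on one card

print(lanes_needed(one_port))    # 3
print(lanes_needed(two_ports))   # 5
```

A single 10 Gbps port fits comfortably in an x4 slot, but a dual-port card running both ports at line rate would need more than x4 under these assumptions.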

Updated by Timothée Floure about 5 years ago

I'll be AFK for a little while: the big pain point is the hardware RAID controller.

Unknown effect on IOPS (needs more digging, not obvious).
- The internet says (reddit, random wikis, CEPH mailing list) using RAID0 when passthrough is not supported is BAD (lower performance/IOPS, buggy firmware, some (unknown?) implication on cache, ...).
Unknown effect from the cache.
