How We Test Enterprise SSDs

Mechanical hard drives have been the mainstay of enterprise storage for several decades. Unfortunately, the maturation of HDD technology, coupled with the small number of remaining manufacturers, has left few performance differentiators between competing drives. The HDD purchasing decision therefore centers on a handful of key factors: price-per-GB, density, reliability and power consumption.

Unlike HDDs, flash-based storage products display tremendous performance differentiation even among models designed to compete within the same class. SSD technology has been disruptive to the established norms of enterprise storage, and flash-based storage devices present a unique set of metrics and characteristics that require analysis during the procurement process. Performance of any storage device varies depending upon the end application, but flash-based storage amplifies the performance delta.

The SSD manufacturers have settled into product categories largely defined by write endurance, which is a key metric brought on by the finite lifespan of NAND. The products have stratified into three classes: read-centric, mixed-use and write-centric. Our test methodology scales from value-class read-centric SSDs up to the fastest PCIe and memory-channel devices. We separate representatives from each category into distinct test pools defined by write endurance and price.

Performance Variability

In a perfect world, every I/O request occurs at the same speed and latency with little to no variability. Unfortunately, all storage devices provide variable performance during operation. SSDs and other flash-based products can feature wider performance variation in comparison to standard HDDs. Differing performance consistency can be traced back to controllers, firmware, NAND, error control algorithms, internal management techniques and over-provisioning.

It is important to characterize performance variability with flash-based products due to the wide range of contributing factors. Performance variability has a negative impact on both application performance and RAID scaling capabilities. Errant I/Os force applications into waiting for crucial pieces of data, and in some cases the following operations are reliant upon previously requested data. This allows a few I/Os to slow the entire application. 

Software and hardware RAID, and other means of aggregating drives behind a single entity, only exacerbate variability. The speed of a RAID array is only as fast as the slowest member, thus an errant I/O from one drive slows the speed of the entire array in most scenarios. The number of errant I/Os multiplies as drives are added to the array due to the simple fact that more drives are individually generating errant I/Os. 

Average performance measurements provide a basic understanding of performance but do little to expose I/O QoS (Quality of Service). To illustrate performance variability, we inject a high level of granularity into our entire test suite. We measure and plot performance every second, which reveals interesting performance tendencies specific to each drive. Drives with a tighter I/O service distribution perform significantly better in applications and scale well in drive-aggregation environments.

Read-centric (or value-class) SSDs exhibit more variability than robust high-end SSDs. We take a glimpse into the 4KB random steady-state performance consistency of three leading 2.5" products from the read-centric class in the chart above. The Y-axis indicates IOPS performance, and the X-axis denotes the length of time we applied the workload to the Device Under Test (DUT) (as noted in the chart subtitle).

We immediately spot clear differentiators in performance consistency among the products in the sample pool. Product C has a much higher average speed (denoted by the black trend line), but our methodology exposes a significant group of requests that fall below competing SSDs. This SSD might be a good fit due to its higher average speed in single-drive deployments. However, in RAID and other drive aggregation schemes, the significant variability leads to increasing performance loss with the addition of more drives to the pool.

Product B exhibits a tight performance envelope without significant outliers and will scale nearly linearly when aggregated with other drives. Granularity also helps us ascertain the tendencies of internal SSD management functions. Product A has a tight performance envelope as well, and its intermittent dips in performance are indicative of garbage collection routines operating in the background.

Preconditioning

Effective competitive performance analysis requires adherence to industry-accepted preconditioning fundamentals. SNIA (Storage Networking Industry Association) is a standards body that removed some of the barriers to effective performance characterization of flash-based devices. The Solid State Storage Performance Test Specification (PTS) Enterprise v1.1 outlines several key tenets of conditioning methodology that ensure accurate and repeatable test results. Our tests progress after we execute the conditioning protocol, and we present our test results in the same linear fashion.

An SSD typically arrives Fresh Out of Box (FOB). In its FOB state, the drive hasn't experienced a sustained workload, and initial performance testing will produce uncharacteristically high results. Steady state represents the final level of performance for the measured workload, and it only emerges after extended use. The graphic above illustrates the descent from FOB into steady state.

As the test progresses we note a performance reduction during the transition period as the workload forces the SSD controller into a read-modify-write cycle for every pending write operation. The SSDs finally settle into steady state, which is representative of attainable performance during extended use.

SNIA defines three steps for attaining steady state convergence. The first step is to purge the DUT (Device Under Test). This brings it into a consistent state prior to preconditioning and performance measurements. The DUT purge operation requires ATA secure erase commands for SATA devices, or alternatively the SCSI Format Unit command for SAS devices. Some components, such as PCIe-based SSDs, require vendor-specific tools to purge the device.

We initiate two types of preconditioning once the device is in a clean FOB state (via the purge command). Workload Independent Preconditioning (WIP) consists of a workload that is unrelated to the actual test sample. We use 128KB sequential write data to write the entire capacity of the storage device twice for the WIP preconditioning stage. This maps all LBA addresses and over-provisioning areas and fills them with data.

We immediately begin Workload Dependent Preconditioning (WDPC) after completing the WIP phase. WDPC utilizes the test workload of the measured variable to place the device into a steady state that is relevant to the desired workload. This results in average performance that doesn't vary during the performance measurement window, and each workload requires a new iteration of the entire preconditioning process. We open a measurement window to log performance metrics upon steady state convergence.
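
To make the sequence concrete, below is a minimal sketch of how a purge/WIP/WDPC cycle might be scripted around fio, the tool we use for measurement. The device path, queue depths, runtimes and the decision to drive fio from Python are illustrative assumptions, not our exact production scripts.

```python
# Illustrative SNIA-style preconditioning sequence driven through fio.
# Device path, runtimes and queue depths are placeholders for illustration only.
import subprocess

DEVICE = "/dev/nvme0n1"   # hypothetical Device Under Test (DUT)

def fio_cmd(name, *flags, **opts):
    """Build an fio command line from bare flags and key=value options."""
    cmd = ["fio", f"--name={name}", f"--filename={DEVICE}",
           "--ioengine=libaio", "--direct=1", "--group_reporting"]
    cmd += [f"--{flag}" for flag in flags]
    cmd += [f"--{key}={value}" for key, value in opts.items()]
    return cmd

# 1. Purge: interface-specific (ATA Secure Erase, SCSI Format Unit, or a vendor
#    tool for PCIe devices); performed outside fio, so it is only noted here.

# 2. Workload Independent Preconditioning (WIP): 128KB sequential writes across
#    the full capacity, twice, to touch every LBA and the over-provisioned area.
wip = fio_cmd("wip", rw="write", bs="128k", iodepth=32, loops=2)

# 3. Workload Dependent Preconditioning (WDPC): run the measured workload itself
#    (here 4KB random writes) until performance converges on steady state,
#    logging one-second averages for the consistency plots.
wdpc = fio_cmd("wdpc", "time_based", rw="randwrite", bs="4k", iodepth=32,
               numjobs=8, runtime=15000, log_avg_msec=1000,
               write_iops_log="wdpc", write_lat_log="wdpc")

for step in (wip, wdpc):
    print(" ".join(step))          # or: subprocess.run(step, check=True)
```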

The chart above exhibits the transition to steady state with our test methodology. SSDs exhibit varying levels of performance consistency (covered on the preceding page) and we log performance every second to highlight performance variability during the preconditioning phase. The Y-axis indicates IOPS performance and the X-axis denotes the length of time we applied the workload to the DUT (as noted in the chart subtitle).

This preconditioning plot includes 18,000 data points. Blue data points signify IOPS measurements, and grey dots indicate latency metrics (logarithmic vertical axis on the right). The red lines through the data points denote the average results during preconditioning. The latency and IOPS plots are mirror images of each other, which means we can get a good feel for the latency distribution simply by viewing performance results at high granularity. The preconditioning phase occurs very rarely in the life cycle of a normal enterprise SSD, and we include it to verify steady state convergence for each workload.

Random Read, Write And Mixed

Flash-based storage helps close the gap between mechanical storage and system memory. SSDs provide tremendous advantages in random workloads, which are among the hardest types of file access for any storage device. The mechanics of a hard drive do not lend themselves well to small random operations, and any SSD can easily best an HDD's random performance.

We include 4KB (and 8KB) read, write and mixed workload performance as a standard in all product evaluations. Four-kilobyte random access is the industry-standard metric used in marketing materials, and it helps us ascertain if the storage device lives up to its billing. Many enterprise applications lean heavily on 8KB random performance and we include those measurements to highlight performance trends found in many enterprise applications.

We expect performance degradation with write workloads, but some SSDs even experience significant read performance degradation in steady state. We adhere to our standard preconditioning protocols prior to all tests, including read workloads. The chart above takes a closer look at 4KB random read performance with leading PCIe-based SSDs. The Y-axis indicates IOPS performance and the X-axis denotes the length of time we applied the workload to the DUT (as noted in the chart subtitle).


Outstanding I/O (OIO) indicates the number of outstanding operations awaiting completion at any given time. We designed our testing with threaded workloads in mind. Each workload at or above 8 OIO in the chart above has eight different threads submitting I/O operations. For example, eight individual workers each generate a workload with a Queue Depth (QD) of 32 for the 256 OIO measurement window. 4 OIO represents four workers individually generating a QD1 workload, 2 OIO consists of two workers at QD1, and 1 OIO is a single worker at QD1. We measure each 300-second segment at a different OIO level, as noted at the top of the chart.
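
As a quick illustration of how those load levels are composed, the helper below maps an OIO target to the worker count and per-worker queue depth described above; the function is just a sketch of the scheme, not our actual test harness.

```python
# Map a target Outstanding I/O (OIO) level to (workers, per-worker queue depth):
# 1-4 OIO use QD1 workers, while 8 OIO and above use eight workers with the
# queue depth scaled to hit the target.
def oio_to_workers_qd(oio: int) -> tuple[int, int]:
    if oio <= 4:
        return oio, 1              # e.g. 4 OIO = 4 workers x QD1
    return 8, oio // 8             # e.g. 256 OIO = 8 workers x QD32

for oio in (1, 2, 4, 8, 16, 32, 64, 128, 256):
    workers, qd = oio_to_workers_qd(oio)
    print(f"{oio:>3} OIO -> {workers} workers x QD{qd}")
```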

Storage devices typically operate within the 16-64 OIO range in most real-world usage scenarios. We consider optimum scaling in the mid-ranges a desirable trait when evaluating storage products. 100 percent random read workloads aren't subject to as much variability as 100 percent write and mixed workloads, but they certainly are not immune. 

Random write measurements unveil significant variability in this test pool. We include a line for each 300-second segment that marks the average performance, and a sub-chart in the lower-right corner that provides the average performance for the most demanding segment of the test (256 OIO). This test pool consists of market-leading PCIe SSDs that do not experience as much variability as value models. Scatter charts can become a bit muddy when numerous SSDs exhibit significant variability, primarily in the value class. In some cases, we will provide a chart with a results trend line (to remove the scatter), and clicking the image will present the scatter version for viewing. Click the chart above for the trend line version.

Manufacturers list performance data for 100 percent read or write workloads, but purely random read or write workloads are somewhat rare in most environments. Mixed workloads are far more common, and the mixture varies with each application. The test above illustrates performance in 10 percent increments for varying read/write mixtures. The 100/0 percent column on the left represents a pure random read workload. As we move across the chart we mix in writes gradually, ending with a pure write workload in the 0/100 percent column on the right. The test consists of 11 measurement windows. We allow the storage device to settle into a steady state for each write mixture, and we only present the last 300 seconds of each phase.
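
For clarity, the sweep amounts to eleven read/write mixtures generated in 10-percent steps; the one-liner below simply enumerates them.

```python
# Eleven read/write mixtures from 100/0 to 0/100 in 10-percent increments.
mixtures = [(read, 100 - read) for read in range(100, -1, -10)]
print(mixtures)   # [(100, 0), (90, 10), ..., (10, 90), (0, 100)]
```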

This test is particularly good at flushing out weaknesses that are not readily apparent in standard testing. For instance, Product A experiences significant variability in the 60 to 80 percent mixtures, and observing the latency during this test helps to explain the issue (page six). We also note that Product B does not feature chart-topping read performance, but as we move into common mixed-random workloads, it delivers a tangible performance increase that benefits actual application environments.

The Importance Of Low Queue Depth Testing

The early SSDs were radically slower than the latest SATA SSDs, let alone today's NVMe hotrods. Falling prices also mean that data centers increasingly deploy today's SSDs in pedestrian workloads. The increased performance and lower cost combine to alter the nature of relevant storage performance characterization.

Manufacturers love to regale us with the incredible IOPS and throughput numbers that are only attainable under heavy load. Unfortunately, SSDs rarely reach the boundaries of the performance envelope during regular deployment, if at all. On rare occasions, an SSD will reach the upper end of its performance capabilities, but usually it is not within a tenable latency envelope that satisfies the SLA requirements. 

The disconnect between most test techniques and the needs of real-world applications boils down to the effective Queue Depth (QD). The QD indicates how many outstanding requests are waiting in line (the queue) to be serviced. Imagine an empty bucket with a small hole in the bottom. Pouring water into the bucket at a steady rate that exceeds the drain rate will begin to fill it. This is similar to how the QD builds up for an SSD: the data requests (the water) build up faster than the device can respond, so our imaginary bucket of outstanding requests begins to fill.
 
The goal is to keep the bucket empty at all times, and the convenient answer is to drill a bigger hole in the bottom. The water drains faster, so the bucket never fills as deeply. The larger hole represents a faster storage device, which can serve requests faster and more efficiently.

Today's SSDs have reached such a high level of performance that they can serve the requests extremely quickly, thus lowering the overall QD. This, along with poorly optimized applications and operating systems, reduces the relevance of high-queue depth testing for the majority of use-cases.
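
A back-of-the-envelope way to see why queue depths stay low is Little's Law, which ties the average number of outstanding I/Os to the request rate and the device's response time. The numbers below are purely illustrative.

```python
# Little's Law: average outstanding I/O = arrival rate x average service time.
iops = 100_000            # assumed application request rate (per second)
latency_s = 0.0001        # assumed 0.1 ms average completion time
print(iops * latency_s)   # -> 10.0 outstanding I/Os on average
```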

The graph above, provided by Intel, illustrates Intel's own efforts to characterize the real-world load on an SSD during different applications. The chart illustrates that even the most demanding workloads in the data center rarely top a QD of 64.

This is not the only data we have observed indicating the importance of low-QD performance. Our conversations with multiple performance engineering departments from the various storage vendors align with Intel's data. We have also conducted internal testing as we develop a solid methodology for testing database workloads. We find that the QD remains well under 32 in TPC-E and TPC-C testing in our environment, even with severely limited RAM, confirming the importance of low-QD performance.

Generally, the desired latency SLA range is 5-10ms. Intense workloads, such as real-time bidding and financial services, require less than 5ms of response time to be effective. Some analytics routines require roughly 8ms, and data warehousing requires less than 10ms of latency. We include charts that focus on low queue depth performance (1-32 OIO) in our evaluations.

Sequential Read, Write And Mixed

Hard drives actually provide plenty of sequential speed, especially when aggregated into various RAID implementations. SSDs also offer tremendous speed in sequential workloads, especially when backing up and replicating data.

Caching and tiering implementations attempt to keep sequential activity on spindled layers and random activity on flash-based layers. This still requires strong sequential read/write performance when moving data to and from the flash layer. Numerous other workloads require solid sequential performance, making it a key ingredient to providing a well-rounded SSD. 

Sequential read workloads come into play during OLAP, batch processing, content delivery, streaming and backup scenarios. The fastest device in our test pool peaks at roughly 3.1 GB/s, or 3185 MB/s at 128 OIO.

Sequential write performance comes into play during caching, replication, HPC and database logging workloads. As a whole, there is significantly less variation with sequential write workloads in comparison to random read workloads. However, some SSDs still exhibit more variability than others do during sequential write testing. Another key component to analyzing sequential performance is observing latency metrics, as covered on page seven.

The importance of mixed workloads cannot be overstated and the same holds true for sequential access. We analyze high-end PCIe SSDs in this chart. This test also flushes out key differentiators, particularly in the 2.5" value-segment. Many 2.5" SSDs feature high 100 percent sequential read/write scores, but markedly lower performance in the middle range, which can lend a bathtub curve to our results.

The 100/0 percent column represents a pure read workload. As we move across the chart we mix in writes gradually, ending with a pure write workload in the 0/100 percent right-hand column. The test consists of 11 measurement windows. We allow the storage device to settle into a steady state for each write mixture, and we only present the last 300 seconds of each phase.

Workload Testing

Storage subsystem performance varies based on a host of environmental factors. Servers, networking adapters, switches, operating systems, drivers and firmware of all devices in the test environment have a direct impact on our measurements. This makes effective performance characterization with real application workloads challenging. And much like synthetic workloads, even the best-replicated environments only provide results representative of that specific environment.

Our goal is to expand into application testing. But in the interim, we provide a number of workloads tested with industry-accepted emulations. We test the device in a direct-attached configuration. This provides us a level playing field for all storage devices with no networking, protocols or other hardware to interfere with our performance measurements. This test methodology provides a clear view of the performance available to applications from the storage device without artificial hindrances, such as excessive CPU load.

The OLTP/Database test is representative of the transactional workloads found in On-Line Transaction Processing (OLTP) and database usage. The test features 8KB random accesses with a 67 percent read and 33 percent write distribution. We use a threaded workload for all of our application tests, and the sub-chart in the lower-right corner indicates the average result at 256 OIO.

The email server workload consists of a heavier 50 percent read and 50 percent write distribution with 8KB random data. This is indicative of performance in environments with a heavy mixed-random workload.

The Web server workload consists of 100 percent random read activity in a wide range of file sizes. The wide range of multi-threaded requests for varying file sizes is very challenging for the storage device.

Workload      Read    Write   512B   1KB    2KB   4KB    8KB    16KB   32KB   64KB   128KB   512KB
Web Server    100%    0%      22%    15%    8%    23%    15%    2%     6%     7%     1%      1%

The file server workload consists of a number of varying file sizes with an 80 percent random read and 20 percent random write distribution. The addition of write requests, along with a wide range of file sizes, creates a demanding test for storage devices.

Workload      Read    Write   512B   1KB    2KB   4KB    8KB    16KB   32KB   64KB   128KB   512KB
File Server   80%     20%     10%    5%     5%    60%    2%     4%     4%     10%    -       -
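
For reference, the four synthetic application workloads described above can be summarized as follows; the access-size mixes for the Web Server and File Server profiles come straight from the tables, and expressing them as a Python structure is purely illustrative.

```python
# Summary of the synthetic application workloads used in our evaluations.
WORKLOADS = {
    "OLTP/Database": {"read": 67, "write": 33, "sizes": {"8KB": 100}},
    "Email Server":  {"read": 50, "write": 50, "sizes": {"8KB": 100}},
    "Web Server":    {"read": 100, "write": 0,
                      "sizes": {"512B": 22, "1KB": 15, "2KB": 8, "4KB": 23, "8KB": 15,
                                "16KB": 2, "32KB": 6, "64KB": 7, "128KB": 1, "512KB": 1}},
    "File Server":   {"read": 80, "write": 20,
                      "sizes": {"512B": 10, "1KB": 5, "2KB": 5, "4KB": 60, "8KB": 2,
                                "16KB": 4, "32KB": 4, "64KB": 10}},
}

# Sanity check: each size distribution should sum to 100 percent.
for name, spec in WORKLOADS.items():
    assert sum(spec["sizes"].values()) == 100, name
```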

Latency Measurements

Performance data without accompanying latency measurements is worthless. The goal of every storage solution is to deliver consistent and predictable latency. Unfortunately, this isn't always the case. Metrics like Maximum Latency only quantify the single worst I/O during the measurement window. This can be misleading, and a single errant I/O can obscure the view of an otherwise excellent storage device. Standard deviation measurements quantify latency distribution, but we obtain a clearer visual representation of latency QoS with more granularity by viewing one-second measurement intervals.

Low latency is a key strength of flash-based devices, and to applications, low latency is the most desirable characteristic. Our high-granularity testing illustrates latency variability during numerous workloads.

Our standard latency plot includes all latency data recorded during the measurement window. We include this chart for each workload, observing latency under varying loads. The majority of workloads will reside in the 16-64 OIO range.

Our Latency-to-IOPS charts reveal IOPS performance at specific latency points. These charts present relevant latency and IOPS metrics in an easy-to-read format for those familiar with application requirements. The larger version of the chart shows latency data out to 256 OIO. Many devices, particularly 2.5" SSDs, far outstrip the boundaries of reasonable latency under such heavy workloads. We include a smaller sub-chart in the upper-left corner that denotes performance under 2ms for flash-based devices.

Oddly enough, the incredible speed of SSDs can actually constrain latency-to-IOPS performance. Fast storage devices serve I/O requests quickly, making it harder to reach the heavy load where many SSDs deliver the most IOPS. Many applications require response times within certain latency ranges, and the ability to deliver more IOPS within a lower latency envelope is a desirable trait.

We also post latency from our mixed random workload testing. Traditionally, latency is specified with a 4KB random workload at a queue depth of one, and this provides a solid base measurement. Latency measurements change drastically under heavy load and with different write mixtures. We note that the Mangstor MX6300 experiences significantly more variability, and a much higher latency envelope, than competing SSDs from the 20/80 to 0/100 percent mixtures. Testing all write percentages for random data helps find errata that can adversely affect application performance. Measuring latency at all points of the read/write spectrum, and under heavier workloads, is essential for effective competitive performance analysis.

Power Efficiency Testing

Power consumption is a pain point in the datacenter. The cost of power accrues over time and usually ends up exceeding the up-front cost of drive acquisition. Most flash-based storage devices are well positioned to alleviate power constraints. The lack of moving parts in a typical SSD reduces power consumption, and thus heat generation. Not all datacenters are open-air designs, and lower heat generation translates into reduced cooling requirements, amplifying the power savings.

IOPS-per-Watt metrics are important for quantifying the SSD's enhanced efficiency. Some high-performance PCIe SSDs can draw more power than a single HDD, but they deliver many times the performance. The SSD will always prevail in IOPS-per-Watt metrics, which measure the work accomplished per watt of energy expended.

Many decision-makers do not analyze SSD power consumption closely simply because they're more efficient than their spindled counterparts. Some just consider the large power consumption reduction a win. However, a few watts of difference per SSD can mean large savings over the typical five-year service period for a large-scale SSD deployment. We have found tremendous power differentiation among different SSD models, and we designed our power measurements to represent the same high-granularity testing as the rest of our regimen.

Measuring power consumption over an extended period of time is surprisingly difficult. Most power measurement tools have finite amounts of onboard storage, which ultimately limits the total number of measurements. We utilize the Quarch XLC Programmable Power Module to record power measurements because it features a streaming application that provides a virtually unlimited power measurement window. The Quarch XLC is also incredibly flexible - it measures SAS, SATA, AIC (Add In Card) form factor PCIe devices and U.2 2.5" PCIe SSDs. Our full product evaluation of the Quarch XLC PPM is here.

We measure power every second during the entire preconditioning phase of our testing. The chart above contains power measurements during a typical 15,000-second (roughly four-hour) run. We note that power consumption varies and, like performance, it changes once the SSD is in a steady state. We compile the average number, in the lower-right sub-chart, from the last five minutes of the test.

We combine our performance and power measurements during the preconditioning phase to generate a view of the IOPS per Watt for the SSDs. The difference in efficiency between the two competing models weighs in at over 1000 IOPS per Watt. This equates to more work done within a reduced power envelope, and there are even larger differences between many competing models. Many deployments consist of multiple SSDs in the overall architecture, which effectively multiplies efficiency metrics. 
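
The efficiency figure itself is a simple division of the averaged measurements; a minimal sketch with made-up sample values follows.

```python
# IOPS-per-Watt from matched per-second IOPS and power samples.
# The sample values below are illustrative, not measurements of a specific drive.
iops_samples  = [205_000, 198_500, 201_200]   # last-five-minute window (assumed)
power_samples = [23.8, 24.1, 23.9]            # Watts over the same window (assumed)

avg_iops  = sum(iops_samples) / len(iops_samples)
avg_watts = sum(power_samples) / len(power_samples)
print(f"{avg_iops / avg_watts:,.0f} IOPS per Watt")   # roughly 8,400 in this example
```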

The QoS Domino Effect - QoS Testing

The saying that "IOPS data without latency data is worthless" holds true. However, the real story lies in the latency QoS (Quality of Service) during operation, which affects application performance tremendously. We test with one-second granularity to help visualize latency variability, but QoS metrics bring the problem into laser-like focus.

Vendors define QoS metrics in percentiles, such as 99.99th percentile latency. The open-source fio test utility measures the latency of every single I/O issued during the test period, which often equates to billions of I/O requests. For instance, a single 4K preconditioning run often consists of over 4 billion I/O requests.

The 99.99th percentile latency measurement reports the latency at or below which 99.99 percent of requests complete, providing a single figure that quantifies near-worst-case latency by setting aside all but the slowest 0.01 percent of requests.

Focusing on 0.01 percent of I/Os may seem like an incredibly small fraction of the overall number of requests, but it still equates to roughly 400,000 of the slowest I/Os during that 4-billion-request preconditioning run. These errant I/Os are usually referred to as "outliers" or "long-tail latencies" (latency measurements that extend far beyond the normal range). The outliers add up quickly, but the real multiplier lies in the QoS domino effect.
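
For reference, this is how a 99.99th percentile figure can be pulled from a per-I/O latency log; the log file name is a hypothetical stand-in for the kind of logs fio can emit.

```python
# Compute the 99.99th percentile from a per-I/O latency log.
# "latency_us.log" is a hypothetical file with one latency (in microseconds) per line.
import numpy as np

latencies_us = np.loadtxt("latency_us.log")
p9999 = np.percentile(latencies_us, 99.99)
print(f"99.99th percentile latency: {p9999 / 1000:.3f} ms")

# The slowest 0.01 percent of a 4-billion-I/O run is still a large population:
print(f"{int(4_000_000_000 * 0.0001):,} outlier I/Os")   # 400,000
```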

Applications often require information from a preceding operation in order to continue with downstream operations. For instance, a mathematical equation may require a solution before a downstream operation can proceed, and each successive operation triggers other operations. A single outlier slows the first operation in the chain, which then affects a multitude of downstream operations in a chain reaction. If an outlying I/O holds a single operation it can affect hundreds, or thousands, of following operations -- and each successive operation may encounter other outliers.

This hinders application performance tremendously. Even worse, applications can encounter a multitude of outliers simultaneously, thus knocking over other rows of downstream operation dominoes.

In many cases, a few milliseconds of delay at the storage layer results in seconds of delay at the application layer. High-speed SSDs can process over 800,000 requests per second and outliers pile up quickly. Outliers are particularly devastating in RAID implementations. A RAID array is constrained to the speed of the slowest I/O, and each additional device adds yet more outliers in an almost perfect storm that kills the scalability of RAID arrays.

We include QoS data in every article, along with additional high-granularity QoS breakouts. The QoS over IOPS chart plots the measured 99.99th percentile latency metric as the workload intensifies. We plot QoS (measured in latency) on the Y-axis (vertical), and the IOPS of the workload on the X-axis (horizontal). The lowest results on the chart are the most desirable; in this case, Product A offers the best QoS-to-performance ratio.

The Latency over QoS chart sweeps through the various percentile rankings during the 32 OIO measurement period. The chart above plots the various QoS measurements on the X-axis, with the 1st percentile results on the left of the chart, progressing up to the demanding 99.99th percentile results on the right. We plot latency (in milliseconds) on the vertical axis. The device with the lowest results, in this case the Intel DC P3700, offers the best performance during the test.

Standard Deviation is another commonly utilized measurement that characterizes performance variability. Unfortunately, standard deviation is difficult to present in a coherent manner. For instance, if we compare two devices that both offer a standard deviation measurement of 1ms they would appear to be equal. However, one of the SSDs may offer 1 million IOPS in comparison to the competing SSD with only 5,000 IOPS - in which case the faster SSD is clearly better. 

To effectively compare standard deviation we normalize the measurements by comparing the deviation at various performance levels. In the chart above Product A offers 0.05ms of deviation (Y-axis) at 500,000 IOPS, and Product B and D provide 0.075ms at 500,000 IOPS. The lowest deviation for each IOPS measurement signifies the best results.
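
In practice, the normalization is just a matter of comparing each drive's deviation at the same IOPS level, as in the toy comparison below (the figures mirror the example above).

```python
# Compare latency standard deviation at a matched IOPS level rather than in isolation.
deviation_ms_at_500k_iops = {"Product A": 0.050, "Product B": 0.075, "Product D": 0.075}
best = min(deviation_ms_at_500k_iops, key=deviation_ms_at_500k_iops.get)
print(f"Lowest deviation at 500,000 IOPS: {best}")   # -> Product A
```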

Histogram results focus on the percentage of operations that land within various latency ranges at 32 OIO. In the chart above, 72 percent of Product C's operations land between 0.1 and 0.25ms of latency. Desirable histogram results depend on both the percentage of I/Os that land in each range and how many operations land within the lower latency ranges; the best results consist of the highest percentage of operations in the lowest ranges.
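
A hedged sketch of how such a histogram can be produced from recorded latency samples follows; the 0.1 - 0.25ms boundary mirrors the example above, while the remaining bucket edges and the stand-in data are assumptions.

```python
# Bucket latency samples (milliseconds) into ranges and report the share of I/Os in each.
import numpy as np

latencies_ms = np.random.lognormal(mean=-2.0, sigma=0.5, size=100_000)  # stand-in data
buckets = [0, 0.1, 0.25, 0.5, 1, 2, 5, 10, float("inf")]                # ms boundaries (assumed)

counts, _ = np.histogram(latencies_ms, bins=buckets)
for lo, hi, count in zip(buckets, buckets[1:], counts):
    print(f"{lo:>6} - {hi:<6} ms: {100 * count / latencies_ms.size:5.1f}%")
```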

Test Environment

This half-rack holds most of our test equipment, but we also have other servers left open to facilitate easier component swapping in our busy lab. We use these development platforms to explore various performance characteristics prior to actual performance measurements. We log and measure performance statistics from a single server dedicated to the individual class of device under test.

Test platforms vary depending upon the type of storage device we are testing, but the go-to server for DAS testing is an Intel S2600GZ with two Xeon E5-2680 v2 processors and 128GB of Micron DRAM. We utilize Avago HBAs for most direct-attached testing, and we leverage various Avago and Adaptec RAID controllers for RAID testing. We list any hardware variations in individual product evaluations. We measure all presented test results for each evaluation on the same server platform to ensure a level playing field.

The 10GbE Layer 3 Supermicro SSE-X3348TR switch provides the central connectivity in our testing environment. The switch features 48 10GBase-T ports and four additional 40GbE uplink ports with QSFP+ connections. QSFP+ to 4x SFP+ breakout cables offer us plenty of connectivity options in the lab, and the robust feature set allows us to manage our entire test environment in a cohesive manner.

The SSE-X3348TR switch offers 24 link aggregation ports (up to eight members per port), customizable QoS characteristics, dual hot-swappable PSUs and redundant fans, among a plethora of other features. The switch offers up to 1284 Gbps of non-blocking performance and rounds out our testing environment.

Conclusion

As always, your mileage may vary. Performance is always dependent upon the deployment environment. Our test methodology measures performance of all devices in a standardized environment, and produces accurate and repeatable results. 

The nuances of enterprise data storage dictate that no one solution fits all needs. While performance may be slower with some devices, it may be an acceptable trade-off for lower power consumption or a reduced capital expenditure. The key is to identify the strengths and weaknesses of each solution to guide our readers in making informed purchasing decisions. If you have any questions, or requests, feel free to sound off in the comments section.