
Computer Architecture Simulation & Visualisation
Storlite Modelling & Simulation - Final Report
Project Context
Network storage systems are now close to the limits of the performance
and capacity that can be achieved with current technologies. As the
intra-enclosure transmission rate and the number of disks required in
each storage sub-system increase, implementing electrical backplanes
in storage systems is becoming increasingly difficult. The Storlite
(Storage Systems Optical Networking) project has been a collaborative
industry-academia research project, funded under the DTI/EPSRC LINK
Information Storage and Displays programme and led by Xyratex, a disk
storage company based in Havant. It has investigated the application
of affordable short-reach optical backplane technologies to future
storage system architectures. Xyratex has worked with Exxelis (a
company based in Glasgow) and University College London on optical
components, and with ICSA at the University of Edinburgh on the
simulation of RAID systems.
The optical part of the Storlite project (EPSRC grant GR/S28136) has
investigated how to implement optical backplanes for storage systems
to make them more scalable. The modelling and simulation part of the
project has used HASE, a simulation environment developed at the
University of Edinburgh, to create simulation models of RAID storage
systems and of industry standard benchmark traffic generators (SPC
and IOmeter). These models have been used to identify performance
bottlenecks, to evaluate communication protocol options and to
explore the design space for next generation storage sub-system
architectures based on optical backplanes.
RAID Systems
Redundant Arrays of Inexpensive/Independent Disks (RAID) have
been used for hard-disk storage systems for many years. Enterprise
class RAIDs typically use the Fibre Channel Arbitrated Loop (FC-AL)
protocol to move SCSI commands between disks and RAID controllers. In
early systems the disks were organised in a daisy-chain style loop
(known as Just a Bunch Of Disks, JBOD) but this has the disadvantage
that the failure of any device or link in the loop results in the
failure of the entire RAID.
To overcome this, a Managed Bunch of Disks (MBOD) system contains a
hub with port-bypass and management functionality. An MBOD is
tolerant of the failure of any given device or link, but the hub
introduces additional delay into the system; in systems containing
many devices this compounds the already large delay incurred as
communications traverse the loop.
In the more recent Switched Bunch Of Disks (SBOD) configuration, a
simple cross-bar switch replaces the port-bypass function of the
hub. When two devices wish to communicate, the switch is used to
create a virtual loop consisting of just these two devices, with the
switch as an interposer. In such a system, the observed loop delay
no longer scales with the number of devices.
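As a rough numerical illustration of this scaling (a back-of-envelope
sketch only; the per-port and switch delays below are invented figures,
not Storlite measurements):

    #include <cstdio>

    // Back-of-envelope comparison of loop round-trip latency.
    // The delay figures are invented for illustration.
    int main() {
        const double port_delay_ns   = 240.0;  // assumed per-device forwarding delay
        const double switch_delay_ns = 480.0;  // assumed SBOD crossbar traversal delay
        const int sizes[] = {14, 16, 20, 24, 30};

        for (int n : sizes) {
            double mbod = n * port_delay_ns;    // every word traverses all N ports
            double sbod = 2 * switch_delay_ns;  // initiator -> switch -> target and back
            std::printf("%2d disks: MBOD loop ~%5.0f ns, SBOD loop ~%5.0f ns\n",
                        n, mbod, sbod);
        }
        return 0;
    }

Whatever the actual delay values, the MBOD figure grows linearly with
the number of devices while the SBOD figure stays fixed.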
The main objective of this project has been to evaluate the relative
performance of the MBOD and SBOD RAID architectures and to identify
bottlenecks in terms of disk response time, communication bandwidth,
number of disks in the system and strip size for different traffic
profiles.
Modelling & Simulation
Within the first few months of the project it became clear that
meaningful simulations would have to cover long periods of operation,
to remove the effects of system start-up from the results. In the
storage system literature, the usual ratio between simulation time
and wall clock time is between 4 and 6 orders of magnitude.
Standardised benchmarks for storage systems, such as SPC-1, require 3
hours of simulated operation for the system performance to be
validated. Using existing simulation models, 3 hours of simulation
time would equate to 3.5 years of wall clock time (on a standard
Pentium IV machine), which was not practical. The simulation time was
therefore optimised by (a) raising the level of abstraction by
reducing the number of states and transitions in the FC-AL
protocol [9] and (b) creating an overlay protocol that exploits
global knowledge of the state of each device in the simulation to
prevent unnecessary messages from being sent between them.
While the former simplification is relatively straightforward and
fully documented in [3], the latter involved the design of a
new protocol with primitives to let the central switches (MBOD and
SBOD) know about the intention of each disk and RAID controller in the
system. The overlay protocol was designed to be timing correct
whilst also reducing unnecessary events.
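To give a flavour of the second optimisation (a minimal sketch with
hypothetical names, not the project's actual protocol code): because the
switch model can inspect the global state table directly, it can avoid
scheduling events that a real loop would have to carry.

    #include <array>
    #include <cstdio>

    // Sketch of the "global knowledge" idea (names hypothetical): the
    // switch model sees every device's state directly, so it schedules
    // an ARB event only when the target could actually respond, instead
    // of simulating all the words a real loop would carry regardless.
    enum class DevState { Idle, Busy, Failed };

    struct OverlaySwitch {
        std::array<DevState, 30> state{};  // globally visible device states

        // Returns true if an ARB event aimed at `target` is worth scheduling.
        bool shouldScheduleArb(int target) const {
            return state[target] == DevState::Idle;  // otherwise suppress the event
        }
    };

    int main() {
        OverlaySwitch sw;
        sw.state[3] = DevState::Busy;
        std::printf("ARB to disk 2 scheduled: %s\n", sw.shouldScheduleArb(2) ? "yes" : "no");
        std::printf("ARB to disk 3 scheduled: %s\n", sw.shouldScheduleArb(3) ? "yes" : "no");
        return 0;
    }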
The use of these two techniques, together with the improvements to
HASE itself, reduced the ratio between simulation time and wall clock
time to a factor of between 30 and 80, depending on the
simulation. This represents a major improvement compared with existing
simulation models and is believed to be unique to the Storlite
project.
The initial storage system model was built using current Xyratex
products as a base design. The system was built in an ad hoc manner,
with each component being coded as the developer became aware of its
functionality. This system was completed in June 2004 and was then
used for testing the more abstract FC-AL model. This work produced
very promising results, with less than 3% difference observed
between the simulated systems and tests run on the real systems.
Disk Model
The disk used as a basis for this model is a high speed Fibre Channel
disk. In order to optimise the simulation time, the response time of
the disk was modelled according to data provided by the
manufacturer. The data sheet specifies the disk response time as a
function of the disk queue size, the I/O command size, the type of I/O
command (read or write), the randomness of the requests/updates (fully
sequential or random), and other variables such as cache
enabled/disabled. The Arbitrated Loop Port State Machine (LPSM) is
used to determine the action to take upon reception of a Fibre Channel
word. When data is ready to be sent back to the initiator, the disk
model times out and arbitrates to gain tenancy of the loop.
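A minimal sketch of this table-driven approach (all names and figures
below are invented for illustration; the real model is driven by the
manufacturer's data sheet, which is not reproduced here):

    #include <cstdio>
    #include <map>
    #include <tuple>

    // Sketch of a data-sheet-driven disk service-time model (figures
    // invented). The key is (read?, random?, I/O size in KB); the value
    // is a base latency in microseconds, scaled by queue depth.
    using Key = std::tuple<bool, bool, int>;

    double responseTimeUs(const std::map<Key, double>& table,
                          bool isRead, bool isRandom, int sizeKB, int queueDepth) {
        auto it = table.find(std::make_tuple(isRead, isRandom, sizeKB));
        if (it == table.end()) return -1.0;            // size not in the data sheet
        return it->second * (1.0 + 0.1 * queueDepth);  // assumed queue-depth penalty
    }

    int main() {
        std::map<Key, double> table;
        table[std::make_tuple(true,  false, 4)] = 150.0;   // sequential 4KB read
        table[std::make_tuple(true,  true,  4)] = 5600.0;  // random 4KB read (seek dominated)
        table[std::make_tuple(false, true,  4)] = 6100.0;  // random 4KB write
        std::printf("random 4KB read, queue=8: %.0f us\n",
                    responseTimeUs(table, true, true, 4, 8));
        return 0;
    }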
RAID Controller Model
The model of the RAID controller is based on a Xyratex prototype
system that implements RAID level 5 protection. Although the model is
highly parameterisable, the data/parity mapping chosen for all
experiments is 4+1 - i.e. each stripe is composed of 4 data
strips and 1 parity strip. The RAID controller plays several roles
in the system, the first being to translate each high level I/O
command (containing logical unit addresses) that it receives from the
network into several disk level commands (using disk addresses).
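The address translation for the 4+1 layout can be sketched as follows (a
simplified rotating-parity mapping with hypothetical names; the
prototype's actual layout may rotate parity differently):

    #include <cstdio>

    // Sketch of 4+1 RAID-5 address translation: a logical strip number
    // maps to (disk, physical strip), with the parity strip rotated
    // across the disks from stripe to stripe.
    struct DiskAddress { int disk; int strip; };

    constexpr int kDataStrips = 4;  // 4 data strips...
    constexpr int kDisks      = 5;  // ...plus 1 parity strip per stripe

    DiskAddress mapLogicalStrip(long logicalStrip) {
        long stripe     = logicalStrip / kDataStrips;
        int  offset     = static_cast<int>(logicalStrip % kDataStrips);
        int  parityDisk = static_cast<int>(stripe % kDisks);  // rotate parity
        int  disk       = (offset >= parityDisk) ? offset + 1 : offset;  // skip parity disk
        return {disk, static_cast<int>(stripe)};
    }

    int main() {
        for (long ls = 0; ls < 10; ++ls) {
            DiskAddress a = mapLogicalStrip(ls);
            std::printf("logical strip %2ld -> disk %d, strip %d\n", ls, a.disk, a.strip);
        }
        return 0;
    }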
As an FC-AL node, the controller must also implement the LPSM to
determine the action to take upon receipt of a Fibre Channel word,
and it must be able to arbitrate when multiple disk level commands
are ready to be sent on the arbitrated loop. Like the disk model, the
RAID controller times out when it needs to arbitrate, sending an
arbitration (ARB) request on the virtual loop.
Each RAID controller also implements a cache which can be
parameterised in terms of its size (up to 16GB) and replacement
policy.
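A minimal sketch of such a parameterised cache (LRU is shown as one
possible replacement policy; capacity is counted in strips rather than
bytes for simplicity, and the names are hypothetical):

    #include <cstdio>
    #include <list>
    #include <unordered_map>

    // Sketch of a size- and policy-parameterised controller cache.
    class StripCache {
        size_t capacity_;
        std::list<long> lru_;                                   // front = most recent
        std::unordered_map<long, std::list<long>::iterator> pos_;
    public:
        explicit StripCache(size_t capacity) : capacity_(capacity) {}

        bool access(long strip) {                               // returns hit/miss
            auto it = pos_.find(strip);
            if (it != pos_.end()) {                             // hit: move to front
                lru_.splice(lru_.begin(), lru_, it->second);
                return true;
            }
            if (lru_.size() == capacity_) {                     // miss: evict LRU tail
                pos_.erase(lru_.back());
                lru_.pop_back();
            }
            lru_.push_front(strip);
            pos_[strip] = lru_.begin();
            return false;
        }
    };

    int main() {
        StripCache cache(2);
        std::printf("%d %d %d\n", cache.access(1), cache.access(2), cache.access(1)); // 0 0 1
        return 0;
    }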
MBOD and SBOD Switch Models
The MBOD switch model is implemented transparently, i.e. a port on
the MBOD switch does not interfere with the FC-AL protocol. The MBOD
simply provides connectivity between the AL_nodes on the arbitrated
loop: any word received on an input port (from a disk or controller)
is forwarded with a fixed delay to the next output port.
Unlike the MBOD, the SBOD model is not transparent to the Fibre
Channel protocol and plays an active part in the FC-AL arbitration
process. When an attached device wishes to arbitrate, it transmits
ARB words to the switch. After a short delay, the switch returns
these ARB words, thereby giving that device the opportunity to
transmit an OPN; this device is known as the initiator. This forms a
virtual FC-AL with two devices in the loop, with the delay in the
transmission path reduced to the latency across the switch.
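This interaction can be sketched as a small per-port state machine
(heavily simplified and with hypothetical names; the real LPSM has many
more states and word types):

    #include <cstdio>

    // Sketch of the SBOD port's view of arbitration: the switch echoes
    // a device's ARB back to it, the device then sends an OPN naming the
    // target, and the crossbar forms a two-device virtual loop.
    enum class PortState { Monitor, ArbPending, VirtualLoop };

    struct SbodPort {
        PortState state = PortState::Monitor;
        int peer = -1;  // other end of the virtual loop, if any

        void onArbFromDevice() { state = PortState::ArbPending; }  // echo ARB after delay
        void onOpnFromInitiator(int target) {                      // crossbar connects pair
            state = PortState::VirtualLoop;
            peer  = target;
        }
        void onClose() { state = PortState::Monitor; peer = -1; }
    };

    int main() {
        SbodPort port0;
        port0.onArbFromDevice();      // device 0 arbitrates; switch echoes its ARB
        port0.onOpnFromInitiator(7);  // device 0 opens device 7: virtual loop {0, 7}
        std::printf("port 0 connected to %d\n", port0.peer);
        port0.onClose();              // tenancy ends; the port returns to Monitor
        return 0;
    }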
Benchmark Models
We have implemented three different benchmarks: the IOmeter
benchmark [6], the SPC-1 benchmark [7] and the SPC-2
benchmark [8]. To assess the performance of RAID sub-systems, the
IOmeter benchmark, which consists of fully sequential requests, has
traditionally been used to measure the maximum achievable number of
I/O commands per second (IOPS) and data throughput. Whilst the
IOmeter model faithfully represents backup and video streaming
applications, it does not capture the workload generated in more
complex environments. The Storage Performance Council has therefore
designed benchmarks that mimic the command profiles observed in live
systems. The SPC Benchmark-1 (SPC-1) is the first industry standard
benchmark for Storage Area Networks (SANs); it uses a highly
efficient multi-platform and multi-threaded workload to emulate the
precise characteristics of multi-user I/O in a multi-application
environment. SPC-2 emulates traffic generated by large files,
databases and video streaming applications.
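The contrast between the two traffic styles can be sketched as follows
(a minimal illustration; the request mix and proportions below are
invented and are not the published SPC-1 profile):

    #include <cstdio>
    #include <random>

    // Sketch contrasting sequential IOmeter-style traffic with an
    // SPC-1-style random mix (percentages invented; the real SPC-1
    // profile is defined by the Storage Performance Council).
    struct Request { bool read; bool random; long lba; int sizeKB; };

    int main() {
        std::mt19937 rng(42);
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        std::uniform_int_distribution<long> addr(0, 1000000);

        long seqLba = 0;
        for (int i = 0; i < 5; ++i) {
            // IOmeter-style: fully sequential fixed-size reads.
            Request io{true, false, seqLba, 64};
            seqLba += io.sizeKB / 4;  // advance by request size, in 4KB blocks

            // SPC-1-style: mixed random reads/writes (proportions invented).
            Request spc{coin(rng) < 0.6, true, addr(rng), 4};

            std::printf("IOmeter: R lba=%6ld  |  SPC-1: %c lba=%6ld\n",
                        io.lba, spc.read ? 'R' : 'W', spc.lba);
        }
        return 0;
    }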
Results
The main goal of the simulation part of Storlite was to
characterise the performance of the SBOD switch compared with the more
traditional MBOD. This comparison was based on both IOmeter and SPC-1
benchmarks. All the results presented in the following sections were
validated by (a) developing analytical models to determine a
performance upper bound and (b) comparing the simulation results with
results from the hardware prototype; the two matched to within 3%.
Sequential Request Traffic
For each RAID system size (14, 16, 20, 24 and 30 disks), experiments
were run with different I/O request sizes (4K, 8K, 16K, 32K and 64K
bytes). The metrics used for comparison were throughput and the
number of I/O operations per second (IOPS).
Relative Performance: MBOD vs. SBOD
We showed that the SBOD is particularly attractive in comparison with
the MBOD for systems with a large number of disks issuing small
requests. This is explained by the fact that in large systems, the
SBOD round trip delay (SBOD latency) remains constant and independent
of the number of AL_Nodes. Furthermore, when the request size is
small, the transmission overhead of the MBOD is significantly larger
than that associated with the SBOD. Overall, it was found that the
SBOD outperforms the MBOD by between 100% and 275%, depending on the
strip size and number of disks [4]. This result was quite
surprising as only a 100% improvement was initially expected by the
industry.
SBOD Arbitration Policy
Another interesting observation is based on the arbitration policy of
the SBOD. It was shown in [4] that the arbitration policy
implemented by the SBOD was crucial to obtain significant performance
improvement over the MBOD system. It was observed that, under
sequential traffic, the RAID controller had to implement an unfair
policy (gain arbitration of the loop whenever it had a command ready
to transmit to one of the disks). If this condition is not met, the
IOmeter can severely underload the SBOD by giving too much bandwidth to
the disks, preventing new outstanding commands from being inserted into
the controller's queue. Since the arbitration policy in an SBOD system
is left to the switch manufacturer, our research gave some guidance to
implement an efficient arbitration scheme.
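That guidance can be condensed into a few lines (a hypothetical sketch
of the rule described above, not any vendor's implementation):

    #include <cstdio>
    #include <vector>

    // Sketch of the "unfair" arbitration rule suggested by the
    // experiments: the controller port wins whenever it has a command
    // queued; otherwise the disks are served round-robin.
    int chooseWinner(bool controllerHasCommand,
                     const std::vector<bool>& diskRequests, int lastDisk) {
        if (controllerHasCommand) return -1;  // -1 denotes the controller port
        int n = static_cast<int>(diskRequests.size());
        for (int i = 1; i <= n; ++i) {        // fair round-robin among the disks
            int d = (lastDisk + i) % n;
            if (diskRequests[d]) return d;
        }
        return -2;                            // nobody arbitrating
    }

    int main() {
        std::vector<bool> disks = {false, true, false, true};
        std::printf("winner: %d\n", chooseWinner(false, disks, 1));  // disk 3
        std::printf("winner: %d\n", chooseWinner(true,  disks, 1));  // controller (-1)
        return 0;
    }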
SPC-1 Traffic
Optimum Strip Size
When RAID manufacturers benchmark their storage products, they need to
configure the strip size on each volume of the system. Hence, it is
crucial for them to know what strip size will maximise their
equipment's performance according to the benchmark selected to
evaluate their system. In this context, our work has shown that the
optimum strip size for a RAID system tested under SPC-1 traffic is
16K bytes, irrespective of the system size [5]. This result is
valuable because it helps manufacturers to configure their systems
for optimum performance.
Communication Bandwidth
It was also shown that under SPC-1 traffic, using large strips (64K
bytes), the SBOD does not benefit significantly from an increased
transmission rate (from 2Gbps to 4Gbps), whereas the performance
almost doubles for an MBOD system. This implies that the SBOD hardly
benefits from a state of the art 4Gbps Fibre Channel interface, which
is very expensive, but can instead achieve the same level of
performance with a 2Gbps interface. This is not the case for the MBOD
system, which is limited by communication bandwidth because of the
increased loop round-trip time. This result has a significant
economic impact, not only because of the lower speed of the interface
but also because the design of the device is considerably
simplified [5].
Star Connect Topology
As noted earlier, it is becoming increasingly difficult to route the
hundreds of copper tracks needed in the backplanes of future
generation RAID systems to connect disks, SBOD and controllers
together. In [10], we
show how a passive optical multi-splitter topology can be used to
alleviate these problems. It is also shown how such a topology can be
fully interoperable with the communications protocols that exist today
and how its use results in a reduction of up to 48% in the number of
optical drivers required in the system. We also suggest a system with
93% fewer optical drivers and the same performance compared with an
optical implementation of today's state of the art system.
HASE Development
HASE [1] has two main elements written in C/C++: a tool to
build simulations and a discrete event simulation engine (Hase++).
The first of these elements is a construction framework that allows
users to build an abstraction of a model. This framework also
provides for graphical interaction with models, both prior to a
simulation run and afterwards. Before running a simulation, the values
of appropriately defined parameters of the model can be modified,
allowing the effects of varying these values to be investigated
without recompilation of the model. Post simulation, animation
facilities allow users to see graphically the movement of the packets
and the changes of state of the simulation entities. This mode is
mainly used for demonstration purposes but can also be used for
debugging.
The HASE-2 environment in use at the start of the project used the Qt
library for its graphical interface. The use of this library became
problematic because of changes to the Linux support environment on
which HASE-2 runs, and consequent changes to Qt itself, which would
have necessitated considerable re-programming of HASE. Furthermore,
the Qt licence fee for commercial use was prohibitive.
To address these problems, it was decided to port HASE to Java and to
write a new graphical interface, also in Java. Using Java would also
offer platform independence. As a result of this port, a new HASE
tool, known as the HASE-III Integrated Simulation Environment, has
been created in which the HASE framework, the simulation engine and
the graphical interface are clearly separated. HASE-III became
available in mid 2005 and has since been in regular use for both
research and teaching purposes.
Hase++ Extensions
The HASE simulation engine (Hase++) has been extended in a number of
ways to support the simulation of storage systems. The main
modifications are an extension of the asymmetrical bi-clock library,
the creation of real compound entity objects and reductions in
run-time and trace file memory requirements.
The change to the bi-clock library relates to the nature of the
traffic generated by the various benchmarks: the ON and OFF periods
of this traffic need to have different durations, so the bi-clock
library was modified to accommodate this requirement.
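A minimal sketch of the asymmetric behaviour (the interface is
hypothetical; the real Hase++ library is event-driven rather than a
free-running generator):

    #include <cstdio>

    // Sketch of an asymmetric two-phase clock: ON and OFF phases with
    // independent durations, as required by bursty benchmark traffic.
    // Times are in arbitrary simulation units.
    struct BiClock {
        double onPeriod, offPeriod;
        bool   on   = true;
        double next = 0.0;

        // Advance to the next phase boundary and return its timestamp.
        double tick() {
            next += on ? onPeriod : offPeriod;
            on = !on;
            return next;
        }
    };

    int main() {
        BiClock clk{3.0, 7.0};  // 3-unit bursts separated by 7-unit gaps
        for (int i = 0; i < 4; ++i)
            std::printf("phase boundary at t=%.1f (%s starts)\n",
                        clk.tick(), clk.on ? "ON" : "OFF");
        return 0;
    }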
In the RAID simulation models, a central component is the dual Fibre
Channel port, a compound entity composed of several sub-entities. Each
of these sub-entities has the same Fibre Channel address but, as
originally defined, compound entities had no separate existence so
there was no mechanism to allow for parameters at the compound entity
level. The new compound entity mechanism, used for the dual Fibre
Channel port entity, allows the address parameter to be
inherited by all its sub-entities.
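The effect of the new mechanism can be sketched as follows (names
hypothetical; Hase++'s actual entity classes differ in detail):

    #include <cstdio>
    #include <vector>

    // Sketch of compound-entity parameter inheritance: the compound
    // entity owns the Fibre Channel address parameter, and each
    // sub-entity reads it from its parent instead of holding a copy.
    struct SubEntity {
        const struct CompoundEntity* parent;
        int address() const;
    };

    struct CompoundEntity {
        int fcAddress;                 // parameter held once, at compound level
        std::vector<SubEntity> subs;
    };

    int SubEntity::address() const { return parent->fcAddress; }

    int main() {
        CompoundEntity dualPort{0x2A};
        dualPort.subs = {{&dualPort}, {&dualPort}};  // e.g. two port sub-entities
        for (const auto& s : dualPort.subs)
            std::printf("sub-entity sees FC address 0x%X\n", s.address());
        return 0;
    }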
On-going Work
Further work on developing and using the SBOD RAID model is being
undertaken as part of a PhD project. The code for the model has been
restructured and enhanced to produce SimRAID [11], a highly
configurable, parameterised and extensible simulation environment for
performance analysis of storage systems. SimRAID allows
experimentation not only with variations in strip size, disk size and
system topology but also with cache size and cache policies. An MSc
project completed in August 2006 used SimRAID to investigate
read-ahead algorithms.
References
[1] "Hierarchical Computer Architecture Design and Simulation
Environment", P.S. Coe et al., ACM Transactions on Modelling and
Computer Simulation, V8, N4, pp 431-446, Oct 1998.
[2] "Computer Architecture Simulation Models", R.N. Ibbett,
J.C. Diaz y Carballo & D.A.W. Dolman, Proc ITiCSE06, p 353, ACM, 2006.
[3] "Work in Progress: Low level simulation of back-end storage
network using HASE", T. Courtney & F. Chevalier, IEEE Conf. on Mass
Storage Systems and Technology, Maryland, April 2004.
[4] "IOmeter Performance Comparison of SBOD and MBOD", F. Chevalier,
R. Ibbett & T. Courtney, WSEAS Transactions on Computers, Vol. 3,
Issue 6, Dec 2004.
[5] "Performance Comparison of SBOD and MBOD Under SPC-1 Traffic With
Varying Strip Sizes and Numbers of Disks", T. Courtney &
F. Chevalier, IASTED PDCN 2005, Innsbruck, Austria, February 2005.
[6] "IOmeter User's Guide", Intel, Ver 2003.12.16, December 21, 2003.
[7] "SPC Benchmark-1" (SPC-1), Storage Performance Council,
http://www.storageperformance.org/specs/
[8] "SPC Benchmark-2" (SPC-2), Storage Performance Council,
http://www.storageperformance.org/specs/
[9] "Fibre Channel Arbitrated Loop (FC-AL-2)", ANSI (NCITS) T11
Working Group, Rev 7.0 Amendment, October 23, 2001.
[10] "Overcoming the Driver Cost in Extreme Short Range Optical
Interconnect", T. Courtney & F. Chevalier, IASTED Optical
Communication Systems and Networks Conference (OCSN), Banff, Canada,
July 2005.
[11] "SimRAID: An Efficient Performance Evaluation Tool for RAID
Systems", Y. Li, T. Courtney, F. Chevalier & R.N. Ibbett, SCSC2006,
pp 431-438, Calgary, Canada, 2006.
Institute for Computing Systems Architecture, School of Informatics,
University of Edinburgh
Last change 31/12/2006