
Computer Architecture Simulation & Visualisation
Storlite Modelling & Simulation - Final Report
Project Context
Network storage systems are now close to the limits of the performance
and capacity that can be achieved with current technologies. As the
intra-enclosure transmission rate and the number of disks required in
each storage sub-system increase, implementing electrical backplanes
in storage systems is becoming increasingly difficult. The Storlite
(Storage Systems Optical Networking) project has been a collaborative
industry-academia research project, funded under the DTI/EPSRC LINK
Information Storage and Displays programme and led by Xyratex, a disk
storage company based in Havant. It has investigated the application
of affordable short-reach optical backplane technologies to future
storage system architectures. Xyratex has worked with Exxelis (a
company based in Glasgow) and University College London on optical
components, and with ICSA at the University of Edinburgh on the
simulation of RAID systems.
The optical part of the Storlite project (EPSRC grant GR/S28136) has
investigated how to implement optical backplanes for storage systems
to make them more scalable. The modelling and simulation part of the
project has used HASE, a simulation environment developed at the
University of Edinburgh, to create simulation models of RAID storage
systems and of industry standard benchmark traffic generators (SPC
and IOmeter). These models have been used to identify performance
bottlenecks, to evaluate communication protocol options and to
explore the design space for next generation storage sub-system
architectures based on optical backplanes.
RAID Systems
Redundant Arrays of Inexpensive/Independent Disks (RAID) have
been used for hard-disk storage systems for many years. Enterprise
class RAIDs typically use the Fibre Channel Arbitrated Loop (FC-AL)
protocol to move SCSI commands between disks and RAID controllers. In
early systems the disks were organised in a daisy-chain style loop
(known as Just a Bunch Of Disks, JBOD) but this has the disadvantage
that the failure of any device or link in the loop results in the
failure of the entire RAID.
To overcome this, a Managed Bunch of Disks (MBOD) system contains a
hub with port-bypass and management functionality. An MBOD is
tolerant of the failure of any given device or link, but the hub
introduces additional delay into the system; in systems containing
many devices this compounds the already large delay incurred as
communications traverse the loop.
In the more recent Switched Bunch Of Disks (SBOD) configuration, a
simple cross-bar switch replaces the port-bypass function of the
hub. When two devices wish to communicate, the switch is used to
create a virtual loop consisting of just these two devices, with the
switch as an interposer. In such a system, the observed loop delay
no longer scales with the number of devices.
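As a rough numerical illustration of this scaling (a back-of-envelope
sketch only; the per-port and switch delays below are invented figures,
not Storlite measurements):

    #include <cstdio>

    // Back-of-envelope comparison of loop round-trip latency.
    // The delay figures are invented for illustration.
    int main() {
        const double port_delay_ns   = 240.0;  // assumed per-device forwarding delay
        const double switch_delay_ns = 480.0;  // assumed SBOD crossbar traversal delay
        const int sizes[] = {14, 16, 20, 24, 30};

        for (int n : sizes) {
            double mbod = n * port_delay_ns;    // every word traverses all N ports
            double sbod = 2 * switch_delay_ns;  // initiator -> switch -> target and back
            std::printf("%2d disks: MBOD loop ~%5.0f ns, SBOD loop ~%5.0f ns\n",
                        n, mbod, sbod);
        }
        return 0;
    }

Whatever the actual delay values, the MBOD figure grows linearly with
the number of devices while the SBOD figure stays fixed.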
The main objective of this project has been to evaluate the relative
performance of the MBOD and SBOD RAID architectures and to identify
bottlenecks in terms of disk response time, communication bandwidth,
number of disks in the system and strip size for different traffic
profiles.
Modelling & Simulation
Within the first few months of the project it became clear that
meaningful simulations would have to cover long periods of operation,
to remove the effects of system start-up from the results. In the
storage system literature, the usual ratio between simulation time
and wall clock time is between 4 and 6 orders of magnitude.
Standardised benchmarks for storage systems, such as SPC-1, require 3
hours of simulated operation for the system performance to be
validated. Using existing simulation models, 3 hours of simulation
time would equate to 3.5 years of wall clock time (on a standard
Pentium IV machine), which was not practical. The simulation time was
therefore optimised by (a) raising the level of abstraction by
reducing the number of states and transitions in the FC-AL
protocol [9] and (b) creating an overlay protocol that exploits
global knowledge of the state of each device in the simulation to
prevent unnecessary messages from being sent between them.
While the former simplification is relatively straightforward and
fully documented in [3], the latter involved the design of a
new protocol with primitives to let the central switches (MBOD and
SBOD) know about the intention of each disk and RAID controller in the
system. The overlay protocol was designed to be timing correct
whilst also reducing unnecessary events.
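To give a flavour of the second optimisation (a minimal sketch with
hypothetical names, not the project's actual protocol code): because the
switch model can inspect the global state table directly, it can avoid
scheduling events that a real loop would have to carry.

    #include <array>
    #include <cstdio>

    // Sketch of the "global knowledge" idea (names hypothetical): the
    // switch model sees every device's state directly, so it schedules
    // an ARB event only when the target could actually respond, instead
    // of simulating all the words a real loop would carry regardless.
    enum class DevState { Idle, Busy, Failed };

    struct OverlaySwitch {
        std::array<DevState, 30> state{};  // globally visible device states

        // Returns true if an ARB event aimed at `target` is worth scheduling.
        bool shouldScheduleArb(int target) const {
            return state[target] == DevState::Idle;  // otherwise suppress the event
        }
    };

    int main() {
        OverlaySwitch sw;
        sw.state[3] = DevState::Busy;
        std::printf("ARB to disk 2 scheduled: %s\n", sw.shouldScheduleArb(2) ? "yes" : "no");
        std::printf("ARB to disk 3 scheduled: %s\n", sw.shouldScheduleArb(3) ? "yes" : "no");
        return 0;
    }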
The use of these two techniques, together with the improvements to
HASE itself, reduced the ratio between simulation time and wall clock
time to a factor of between 30 and 80, depending on the
simulation. This represents a major improvement compared with existing
simulation models and is believed to be unique to the Storlite
project.
The initial storage system model was built using current Xyratex
products as a base design. The system was built in an ad hoc manner,
with each component being coded as the developer became aware of its
functionality. This system was completed in June 2004 and was then
used for testing the more abstract FC-AL model. This work produced
very promising results, with less than 3% difference observed
between the simulated systems and tests run on the real systems.
Disk Model
The disk used as a basis for this model is a high speed Fibre Channel
disk. In order to optimise the simulation time, the response time of
the disk was modelled according to data provided by the
manufacturer. The data sheet specifies the disk response time as a
function of the disk queue size, the I/O command size, the type of I/O
command (read or write), the randomness of the requests/updates (fully
sequential or random), and other variables such as cache
enabled/disabled. The Arbitrated Loop Port State Machine (LPSM) is
used to determine the action to take upon reception of a Fibre Channel
word. When data is ready to be sent back to the initiator, the disk
model times out and arbitrates to gain tenancy of the loop.
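A minimal sketch of this table-driven approach (all names and figures
below are invented for illustration; the real model is driven by the
manufacturer's data sheet, which is not reproduced here):

    #include <cstdio>
    #include <map>
    #include <tuple>

    // Sketch of a data-sheet-driven disk service-time model (figures
    // invented). The key is (read?, random?, I/O size in KB); the value
    // is a base latency in microseconds, scaled by queue depth.
    using Key = std::tuple<bool, bool, int>;

    double responseTimeUs(const std::map<Key, double>& table,
                          bool isRead, bool isRandom, int sizeKB, int queueDepth) {
        auto it = table.find(std::make_tuple(isRead, isRandom, sizeKB));
        if (it == table.end()) return -1.0;            // size not in the data sheet
        return it->second * (1.0 + 0.1 * queueDepth);  // assumed queue-depth penalty
    }

    int main() {
        std::map<Key, double> table;
        table[std::make_tuple(true,  false, 4)] = 150.0;   // sequential 4KB read
        table[std::make_tuple(true,  true,  4)] = 5600.0;  // random 4KB read (seek dominated)
        table[std::make_tuple(false, true,  4)] = 6100.0;  // random 4KB write
        std::printf("random 4KB read, queue=8: %.0f us\n",
                    responseTimeUs(table, true, true, 4, 8));
        return 0;
    }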
RAID Controller Model
The model of the RAID controller is based on a Xyratex prototype
system that implements RAID level 5 protection. Although the model is
highly parameterisable, the data/parity mapping chosen for all
experiments is 4+1 - i.e. each stripe is composed of 4 data
strips and 1 parity strip. The RAID controller plays several roles
in the system, the first being to translate each high level I/O
command (containing logical unit addresses) that it receives from the
network into several disk level commands (using disk addresses).
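The address translation for the 4+1 layout can be sketched as follows (a
simplified rotating-parity mapping with hypothetical names; the
prototype's actual layout may rotate parity differently):

    #include <cstdio>

    // Sketch of 4+1 RAID-5 address translation: a logical strip number
    // maps to (disk, physical strip), with the parity strip rotated
    // across the disks from stripe to stripe.
    struct DiskAddress { int disk; int strip; };

    constexpr int kDataStrips = 4;  // 4 data strips...
    constexpr int kDisks      = 5;  // ...plus 1 parity strip per stripe

    DiskAddress mapLogicalStrip(long logicalStrip) {
        long stripe     = logicalStrip / kDataStrips;
        int  offset     = static_cast<int>(logicalStrip % kDataStrips);
        int  parityDisk = static_cast<int>(stripe % kDisks);  // rotate parity
        int  disk       = (offset >= parityDisk) ? offset + 1 : offset;  // skip parity disk
        return {disk, static_cast<int>(stripe)};
    }

    int main() {
        for (long ls = 0; ls < 10; ++ls) {
            DiskAddress a = mapLogicalStrip(ls);
            std::printf("logical strip %2ld -> disk %d, strip %d\n", ls, a.disk, a.strip);
        }
        return 0;
    }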
As an FC-AL node, the controller must also implement the LPSM to
determine the action to take upon receipt of a Fibre Channel word,
and it must be able to arbitrate when multiple disk level commands
are ready to be sent on the arbitrated loop. Like the disk model, the
RAID controller times out when it needs to arbitrate, sending an
arbitration (ARB) request on the virtual loop.
Each RAID controller also implements a cache which can be
parameterised in terms of its size (up to 16GB) and replacement
policy.
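A minimal sketch of such a parameterised cache (LRU is shown as one
possible replacement policy; capacity is counted in strips rather than
bytes for simplicity, and the names are hypothetical):

    #include <cstdio>
    #include <list>
    #include <unordered_map>

    // Sketch of a size- and policy-parameterised controller cache.
    class StripCache {
        size_t capacity_;
        std::list<long> lru_;                                   // front = most recent
        std::unordered_map<long, std::list<long>::iterator> pos_;
    public:
        explicit StripCache(size_t capacity) : capacity_(capacity) {}

        bool access(long strip) {                               // returns hit/miss
            auto it = pos_.find(strip);
            if (it != pos_.end()) {                             // hit: move to front
                lru_.splice(lru_.begin(), lru_, it->second);
                return true;
            }
            if (lru_.size() == capacity_) {                     // miss: evict LRU tail
                pos_.erase(lru_.back());
                lru_.pop_back();
            }
            lru_.push_front(strip);
            pos_[strip] = lru_.begin();
            return false;
        }
    };

    int main() {
        StripCache cache(2);
        std::printf("%d %d %d\n", cache.access(1), cache.access(2), cache.access(1)); // 0 0 1
        return 0;
    }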
MBOD and SBOD Switch Models
The MBOD switch model is implemented transparently, i.e. a port on
the MBOD switch does not interfere with the FC-AL protocol. The MBOD
simply provides connectivity between the AL_nodes on the arbitrated
loop: any word received on an input port (from a disk or controller)
is forwarded with a fixed delay to the next output port.
Unlike the MBOD, the SBOD model is not transparent to the Fibre
Channel protocol and plays an active part in the FC-AL arbitration
process. When an attached device wishes to arbitrate, it transmits
ARB words to the switch. After a short delay, the switch returns
these ARB words, thereby giving that device the opportunity to
transmit an OPN; this device is known as the initiator. This forms a
virtual FC-AL with two devices in the loop, with the delay in the
transmission path reduced to the latency across the switch.
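This interaction can be sketched as a small per-port state machine
(heavily simplified and with hypothetical names; the real LPSM has many
more states and word types):

    #include <cstdio>

    // Sketch of the SBOD port's view of arbitration: the switch echoes
    // a device's ARB back to it, the device then sends an OPN naming the
    // target, and the crossbar forms a two-device virtual loop.
    enum class PortState { Monitor, ArbPending, VirtualLoop };

    struct SbodPort {
        PortState state = PortState::Monitor;
        int peer = -1;  // other end of the virtual loop, if any

        void onArbFromDevice() { state = PortState::ArbPending; }  // echo ARB after delay
        void onOpnFromInitiator(int target) {                      // crossbar connects pair
            state = PortState::VirtualLoop;
            peer  = target;
        }
        void onClose() { state = PortState::Monitor; peer = -1; }
    };

    int main() {
        SbodPort port0;
        port0.onArbFromDevice();      // device 0 arbitrates; switch echoes its ARB
        port0.onOpnFromInitiator(7);  // device 0 opens device 7: virtual loop {0, 7}
        std::printf("port 0 connected to %d\n", port0.peer);
        port0.onClose();              // tenancy ends; the port returns to Monitor
        return 0;
    }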
Benchmark Models
We have implemented three different benchmarks: the IOmeter
benchmark [6], the SPC-1 benchmark [7] and the SPC-2
benchmark [8]. To assess the performance of RAID sub-systems, the
IOmeter benchmark, which consists of fully sequential requests, has
traditionally been used to measure the maximum achievable number of
I/O commands per second (IOPS) and data throughput. Whilst the
IOmeter model faithfully represents backup and video streaming
applications, it does not capture the workload generated in more
complex environments. The Storage Performance Council has therefore
designed benchmarks that mimic the command profiles observed in live
systems. The SPC Benchmark-1 (SPC-1) is the first industry standard
benchmark for Storage Area Networks (SANs); it uses a highly
efficient multi-platform and multi-threaded workload to emulate the
precise characteristics of multi-user I/O in a multi-application
environment. SPC-2 emulates traffic generated by large files,
databases and video streaming applications.
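The contrast between the two traffic styles can be sketched as follows
(a minimal illustration; the request mix and proportions below are
invented and are not the published SPC-1 profile):

    #include <cstdio>
    #include <random>

    // Sketch contrasting sequential IOmeter-style traffic with an
    // SPC-1-style random mix (percentages invented; the real SPC-1
    // profile is defined by the Storage Performance Council).
    struct Request { bool read; bool random; long lba; int sizeKB; };

    int main() {
        std::mt19937 rng(42);
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        std::uniform_int_distribution<long> addr(0, 1000000);

        long seqLba = 0;
        for (int i = 0; i < 5; ++i) {
            // IOmeter-style: fully sequential fixed-size reads.
            Request io{true, false, seqLba, 64};
            seqLba += io.sizeKB / 4;  // advance by request size, in 4KB blocks

            // SPC-1-style: mixed random reads/writes (proportions invented).
            Request spc{coin(rng) < 0.6, true, addr(rng), 4};

            std::printf("IOmeter: R lba=%6ld  |  SPC-1: %c lba=%6ld\n",
                        io.lba, spc.read ? 'R' : 'W', spc.lba);
        }
        return 0;
    }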
Results
The main goal of the simulation part of Storlite was to
characterise the performance of the SBOD switch compared with the more
traditional MBOD. This comparison was based on both IOmeter and SPC-1
benchmarks. All the results presented in the following sections were
validated by (a) developing analytical models to determine a
performance upper bound and (b) comparing the simulation results with
results from the hardware prototype; the two matched to within 3%.
Sequential Request Traffic
For each RAID system size (14, 16, 20, 24 and 30 disks), experiments
were run with different I/O request sizes (4K, 8K, 16K, 32K and 64K
bytes). The metrics used for comparison were throughput and the
number of I/O operations per second (IOPS).
Relative Performance: MBOD vs. SBOD
We showed that the SBOD is particularly attractive in comparison with
the MBOD for systems with a large number of disks issuing small
requests. This is explained by the fact that in large systems, the
SBOD round trip delay (SBOD latency) remains constant and independent
of the number of AL_Nodes. Furthermore, when the request size is
small, the transmission overhead of the MBOD is significantly larger
than that associated with the SBOD. Overall, it was found that the
SBOD outperforms the MBOD by between 100% and 275%, depending on the
strip size and number of disks [4]. This result was quite
surprising as only a 100% improvement was initially expected by the
industry.
SBOD Arbitration Policy
Another interesting observation is based on the arbitration policy of
the SBOD. It was shown in [4] that the arbitration policy
implemented by the SBOD was crucial to obtain significant performance
improvement over the MBOD system. It was observed that, under
sequential traffic, the RAID controller had to implement an unfair
policy (gain arbitration of the loop whenever it had a command ready
to transmit to one of the disks). If this condition is not met, the
IOmeter can severely underload the SBOD by giving too much bandwidth to
the disks, preventing new outstanding commands from being inserted into
the controller's queue. Since the arbitration policy in an SBOD system
is left to the switch manufacturer, our research gave some guidance to
implement an efficient arbitration scheme.
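That guidance can be condensed into a few lines (a hypothetical sketch
of the rule described above, not any vendor's implementation):

    #include <cstdio>
    #include <vector>

    // Sketch of the "unfair" arbitration rule suggested by the
    // experiments: the controller port wins whenever it has a command
    // queued; otherwise the disks are served round-robin.
    int chooseWinner(bool controllerHasCommand,
                     const std::vector<bool>& diskRequests, int lastDisk) {
        if (controllerHasCommand) return -1;  // -1 denotes the controller port
        int n = static_cast<int>(diskRequests.size());
        for (int i = 1; i <= n; ++i) {        // fair round-robin among the disks
            int d = (lastDisk + i) % n;
            if (diskRequests[d]) return d;
        }
        return -2;                            // nobody arbitrating
    }

    int main() {
        std::vector<bool> disks = {false, true, false, true};
        std::printf("winner: %d\n", chooseWinner(false, disks, 1));  // disk 3
        std::printf("winner: %d\n", chooseWinner(true,  disks, 1));  // controller (-1)
        return 0;
    }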
SPC-1 Traffic
Optimum Strip Size
When RAID manufacturers benchmark their storage products, they need to
configure the strip size on each volume of the system. Hence, it is
crucial for them to know what strip size will maximise their
equipment's performance according to the benchmark selected to
evaluate their system. In this context, our work has shown that the
optimum strip size for a RAID system tested under SPC-1 traffic is
16K bytes, irrespective of the system size [5]. This result is
valuable because it helps manufacturers to configure their systems
for optimum performance.
Communication Bandwidth
It was also shown that under SPC-1 traffic, using large strips (64K
bytes), the SBOD does not benefit significantly from an increased
transmission rate (from 2Gbps to 4Gbps), whereas the performance
almost doubles for an MBOD system. This implies that the SBOD hardly
benefits from a state of the art 4Gbps Fibre Channel interface, which
is very expensive, but can instead achieve the same level of
performance with a 2Gbps interface. This is not the case for the MBOD
system, which is limited by communication bandwidth because of the
increased loop round-trip time. This result has a significant
economic impact, not only because of the lower speed of the interface
but also because the design of the device is considerably
simplified [5].
Star Connect Topology
As noted earlier, it is becoming increasingly difficult to route the
hundreds of copper tracks needed in the backplanes of future
generation RAID systems to connect disks, SBOD and controllers
together. In [10], we
show how a passive optical multi-splitter topology can be used to
alleviate these problems. It is also shown how such a topology can be
fully interoperable with the communications protocols that exist today
and how its use results in a reduction of up to 48% in the number of
optical drivers required in the system. We also suggest a system with
93% fewer optical drivers and the same performance compared with an
optical implementation of today's state of the art system.
HASE Development
HASE [1] has two main elements written in C/C++: a tool to
build simulations and a discrete event simulation engine (Hase++).
The first of these elements is a construction framework that allows
users to build an abstraction of a model. This framework also
provides for graphical interaction with models, both prior to a
simulation run and afterwards. Before running a simulation, the values
of appropriately defined parameters of the model can be modified,
allowing the effects of varying these values to be investigated
without recompilation of the model. Post simulation, animation
facilities allow users to see graphically the movement of the packets
and the changes of state of the simulation entities. This mode is
mainly used for demonstration purposes but can also be used for
debugging.
The HASE-2 environment in use at the start of the project used the Qt
library for its graphical interface. The use of this library became
problematic because of changes to the Linux support environment on
which HASE-2 runs, and consequent changes to Qt itself, which would
have necessitated considerable re-programming of HASE. Furthermore,
the Qt licence fee for commercial use was prohibitive.
To address these problems, it was decided to port HASE to Java and to
write a new graphical interface, also in Java. Using Java would also
offer platform independence. As a result of this port, a new HASE
tool, known as the HASE-III Integrated Simulation Environment, has
been created in which the HASE framework, the simulation engine and
the graphical interface are clearly separated. HASE-III became
available in mid 2005 and has since been in regular use for both
research and teaching purposes.
Hase++ Extensions
The HASE simulation engine (Hase++) has been extended in a number of
ways to support the simulation of storage systems. The main
modifications are an extension of the asymmetrical bi-clock library,
the creation of real compound entity objects and reductions in
run-time and trace file memory requirements.
The change to the bi-clock library relates to the nature of the
traffic generated by the various benchmarks: the ON and OFF periods
of this traffic need to have different durations, so the bi-clock
library was modified to accommodate this requirement.
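A minimal sketch of the asymmetric behaviour (the interface is
hypothetical; the real Hase++ library is event-driven rather than a
free-running generator):

    #include <cstdio>

    // Sketch of an asymmetric two-phase clock: ON and OFF phases with
    // independent durations, as required by bursty benchmark traffic.
    // Times are in arbitrary simulation units.
    struct BiClock {
        double onPeriod, offPeriod;
        bool   on   = true;
        double next = 0.0;

        // Advance to the next phase boundary and return its timestamp.
        double tick() {
            next += on ? onPeriod : offPeriod;
            on = !on;
            return next;
        }
    };

    int main() {
        BiClock clk{3.0, 7.0};  // 3-unit bursts separated by 7-unit gaps
        for (int i = 0; i < 4; ++i)
            std::printf("phase boundary at t=%.1f (%s starts)\n",
                        clk.tick(), clk.on ? "ON" : "OFF");
        return 0;
    }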
In the RAID simulation models, a central component is the dual Fibre
Channel port, a compound entity composed of several sub-entities. Each
of these sub-entities has the same Fibre Channel address but, as
originally defined, compound entities had no separate existence so
there was no mechanism to allow for parameters at the compound entity
level. The new compound entity mechanism, used for the dual Fibre
Channel port entity, allows the address parameter to be
inherited by all its sub-entities.
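The effect of the new mechanism can be sketched as follows (names
hypothetical; Hase++'s actual entity classes differ in detail):

    #include <cstdio>
    #include <vector>

    // Sketch of compound-entity parameter inheritance: the compound
    // entity owns the Fibre Channel address parameter, and each
    // sub-entity reads it from its parent instead of holding a copy.
    struct SubEntity {
        const struct CompoundEntity* parent;
        int address() const;
    };

    struct CompoundEntity {
        int fcAddress;                 // parameter held once, at compound level
        std::vector<SubEntity> subs;
    };

    int SubEntity::address() const { return parent->fcAddress; }

    int main() {
        CompoundEntity dualPort{0x2A};
        dualPort.subs = {{&dualPort}, {&dualPort}};  // e.g. two port sub-entities
        for (const auto& s : dualPort.subs)
            std::printf("sub-entity sees FC address 0x%X\n", s.address());
        return 0;
    }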
On-going Work
Further work on developing and using the SBOD RAID model is being
undertaken as part of a PhD project. The code for the model has been
restructured and enhanced to produce SimRAID [11], a highly
configurable, parameterised and extensible simulation environment for
performance analysis of storage systems. SimRAID allows
experimentation not only with variations in strip size, disk size and
system topology but also with cache size and cache policies. An MSc
project completed in August 2006 used SimRAID to investigate
read-ahead algorithms.
References
[1] "Hierarchical Computer Architecture Design and Simulation
Environment", P.S. Coe et al., ACM Transactions on Modelling and
Computer Simulation, V8, N4, pp 431-446, Oct 1998.
[2] "Computer Architecture Simulation Models", R.N. Ibbett,
J.C. Diaz y Carballo & D.A.W. Dolman, Proc ITiCSE06, p 353, ACM, 2006.
[3] "Work in Progress: Low level simulation of back-end storage
network using HASE", T. Courtney & F. Chevalier, IEEE Conf. on Mass
Storage Systems and Technology, Maryland, April 2004.
[4] "IOmeter Performance Comparison of SBOD and MBOD", F. Chevalier,
R. Ibbett & T. Courtney, WSEAS Transactions on Computers, Vol. 3,
Issue 6, Dec 2004.
[5] "Performance Comparison of SBOD and MBOD Under SPC-1 Traffic With
Varying Strip Sizes and Numbers of Disks", T. Courtney &
F. Chevalier, IASTED PDCN 2005, Innsbruck, Austria, February 2005.
[6] "IOmeter User's Guide", Intel, Ver 2003.12.16, December 21, 2003.
[7] "SPC Benchmark-1" (SPC-1), Storage Performance Council,
http://www.storageperformance.org/specs/
[8] "SPC Benchmark-2" (SPC-2), Storage Performance Council,
http://www.storageperformance.org/specs/
[9] "Fibre Channel Arbitrated Loop (FC-AL-2)", ANSI (NCITS) T11
Working Group, Rev 7.0 Amendment, October 23, 2001.
[10] "Overcoming the Driver Cost in Extreme Short Range Optical
Interconnect", T. Courtney & F. Chevalier, IASTED Optical
Communication Systems and Networks Conference (OCSN), Banff, Canada,
July 2005.
[11] "SimRAID: An Efficient Performance Evaluation Tool for RAID
Systems", Y. Li, T. Courtney, F. Chevalier & R.N. Ibbett, SCSC2006,
pp 431-438, Calgary, Canada, 2006.
Institute for Computing Systems Architecture, School of Informatics,
University of Edinburgh
Last change 31/12/2006