
Computer Architecture Simulation & Visualisation
Simulation of the UKQCD Computer - Final Report
Project Summary
The QCD Computer Simulation Project was funded by the EPSRC under
grant GR/R27129 (May 2001 - April 2004).
The overall aim of the project was to gain an understanding of the
factors which influence the performance of QCD computers and to
influence future high performance system designs. In order to achieve
the project aims, parameterised simulation models of the proposed
UKQCD computer, QCDOC, were constructed in HASE (Hierarchical computer
Architecture Design and Simulation Environment). QCDOC (Quantum
Chromodynamics On-Chip) is a recent, application-specific
supercomputer designed for QCD grand challenge application
simulations [1]. The HASE models were used:
- to investigate the factors which influence the performance
of the QCDOC computer
- to explore the design parameter space of the simulation models
and to investigate variations in performance against a range of
architectural parameters in order to inform the design of subsequent
generations of QCD computers.
In parallel with designing the models, the capabilities of the
simulation environment, HASE, were extended in response to the demands
placed on it by these models, and a number of new features were
added.
The HASE QCDOC Models
Parameterised hardware-software co-simulation models of the QCDOC
computer were created for conducting a range of
experiments [2,3]. Because of hardware and operating-system
limitations of the host platform, a complete 10K-node system could not
be simulated in full design detail. Models were therefore developed at
two different levels of abstraction:
- The low-level abstraction model can simulate up to 16 processing
nodes. It captures the microarchitectural details of the system,
including the processor, the on-chip cache hierarchy, the system
bus, the communication unit, etc. This model was designed
for studying the performance characteristics of a single node
and nearest neighbour communications when executing QCD code.
- The high-level abstraction model can simulate up to 12K
processing nodes. The nodes simply act as sources of
communication events, with the intervals between events being
based on data taken from the low-level abstraction model.
This model was used to study the performance characteristics of
the custom communication protocol and different ways of
implementing the QCD global sum mechanism.
Co-Simulation
The HASE QCDOC simulation models provide a unified framework in which
not only can the architectural parameters be varied but so too can the
workload parameters. The parallel QCD workload on a system depends on
the lattice QCD volume. This volume is expressed as the number of
lattice sites in four space-time dimensions and is driven by the
degree of approximation required of a QCD calculation. Larger QCD
volumes provide better approximations but also result in larger
workloads per processing node. Another dimension
of the QCD workload per processing node is the workload mapping and
distribution scheme. The HASE QCDOC models incorporate QCD workload
volume, mapping and distribution in the behavioural modelling. Thus
the models allow a simulation user to alter the workload size per
processing node as well as the workload mapping and distribution
through HASE parameter windows. The benefits of this co-simulation
scheme include:
- Wider design parameter options: the architectural parameters
are not limited by a fixed workload size; for instance, the scaling
behaviour of the hardware can only be studied by altering both the
workload per processing node and the total number of processing
nodes.
- Realistic workload configurations: QCD physicists expect to
run a variety of different workloads, of varying sizes and
configurations, on the QCDOC machine.
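As an illustration of how the workload per node follows from the
lattice volume and the node mesh, the sketch below computes a 4-D
local volume; the shapes, names and even-division check are
illustrative assumptions, not the actual HASE parameter windows.

    # Sketch: per-node workload (local lattice volume) for a global 4-D
    # lattice mapped onto a 4-D node mesh. Names and shapes are
    # illustrative, not the actual HASE parameter names.
    def local_volume(lattice, nodes):
        """lattice, nodes: 4-tuples giving the global lattice extent and
        the node-mesh extent in each space-time dimension (x, y, z, t)."""
        local = []
        for l, n in zip(lattice, nodes):
            if l % n:
                raise ValueError(f"extent {l} not divisible across {n} nodes")
            local.append(l // n)
        return tuple(local)

    # Example: a 32^3 x 64 global lattice on a 4 x 4 x 4 x 8 node mesh
    # gives an 8^4 local volume, i.e. 4096 sites per processing node.
    print(local_volume((32, 32, 32, 64), (4, 4, 4, 8)))  # (8, 8, 8, 8)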
HASE Extensions
Modelling QCDOC in HASE has thrown up a number of interesting
simulation problems, including two simulation clocking problems that
appear not to have been solved elsewhere, and these have led to novel
solutions [4].
MeshnD Template
The real QCDOC machine has a 6-D torus topology although,
at any given time, the QCD workload only utilises four of these
dimensions. HASE already had a 1-D, 2-D and 3-D mesh template
generation facility (essentially an automated version of the compound
entity mechanism) but this was inadequate for QCDOC. The mesh
facility was therefore extended to become an n-dimensional mesh/torus
topology generation facility. This extended facility not only
instantiates a mesh of sub-entities but also creates the links between
the sub-entities in the specified topology, i.e. the simulation
designer can simply define a mesh and the sub-entity to be replicated
in the mesh, and does not need to write explicit code to generate
instances of the sub-entities, their interconnections or the visual
display.
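A minimal sketch of the link-generation step, assuming a simple
coordinate representation of the sub-entities (the real facility also
instantiates the entities and creates the visual display):

    # Sketch: neighbour-link generation for an n-dimensional torus, in
    # the spirit of the extended MeshnD facility; names are illustrative.
    import itertools

    def torus_links(shape):
        """Yield (node, neighbour, dimension, direction) for every link
        in an n-D torus of the given shape, with wraparound at the edges."""
        for coord in itertools.product(*(range(s) for s in shape)):
            for dim, size in enumerate(shape):
                for step in (+1, -1):
                    nbr = list(coord)
                    nbr[dim] = (nbr[dim] + step) % size  # torus wraparound
                    yield coord, tuple(nbr), dim, step

    # Example: enumerate the links of a small 2 x 3 torus
    for link in torus_links((2, 3)):
        print(link)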
In many models, the sub-entities within a mesh contain a number of
parameters. In most cases the values of these parameters are required
to be the same in each instance of the sub-entity. In order to
minimise the overheads and errors associated with parameter changes
between simulation runs for a HASE project with a very large number of
replicated sub-entities, a static parameter mechanism was introduced.
If a parameter of a mesh sub-entity is declared as static, only one
copy of the parameter exists in the simulation model, so if it is
altered in one sub-entity, the change will automatically appear in all
other sub-entities.
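A minimal Python analogue of the static parameter idea, using a
class-level attribute as the single shared copy (the parameter name is
a placeholder):

    # Sketch: one shared ("static") parameter across all replicated
    # sub-entities, analogous to the HASE static parameter mechanism.
    class MeshNode:
        cache_line_bytes = 32           # static: one copy for all instances

        def __init__(self, ident):
            self.ident = ident          # ordinary per-instance state

    nodes = [MeshNode(i) for i in range(4096)]
    MeshNode.cache_line_bytes = 64      # change it once ...
    assert all(n.cache_line_bytes == 64 for n in nodes)  # ... seen everywhere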
The Clock Mechanism
The clock mechanism used in most HASE models serves two purposes: (a)
to simulate the actual clock used in the real system under
investigation, and (b) to act as a barrier synchronisation mechanism
between the entities representing the components of the model. Both
require clocking signals to pass between the clock entity and the
clocked entities. This requires links to be defined between these
entities. In experimenting with the QCDOC model, an important
parameter is the number of processors in the multi-dimensional array
and it is desirable to be able to vary this parameter at run time,
without recompiling the project. To enable this to happen, a clock
inheritance mechanism [5] has been designed that allows
entities to self-register with the clock at the start of a
simulation. This same mechanism also solves a second problem, the need
to be able to use multiple clock frequencies, synchronised to the
master clock in the model, reflecting the situation not just in QCDOC
but in system-on-chip designs generally. In this case, entities
self-register with an intermediate clock which self-registers with the
master clock.
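A minimal sketch of the self-registration idea, with an intermediate
clock dividing the master frequency; the callback-style interface is
an assumption, and the real mechanism is described in [5]:

    # Sketch: clock inheritance by self-registration, assuming simplified
    # callback-style entities.
    class Clock:
        def __init__(self, name):
            self.name = name
            self.listeners = []

        def register(self, listener):
            # entities (or intermediate clocks) self-register at start-up,
            # so no static links need to be compiled into the model
            self.listeners.append(listener)

        def tick(self):
            for listener in self.listeners:
                listener.tick()

    class DividedClock(Clock):
        """Intermediate clock running at master frequency / divisor."""
        def __init__(self, name, master, divisor):
            super().__init__(name)
            self.divisor, self.count = divisor, 0
            master.register(self)       # self-register with the master clock

        def tick(self):
            self.count += 1
            if self.count % self.divisor == 0:
                super().tick()          # forward every divisor-th master tick

    class Entity:
        def __init__(self, name, clock):
            self.name = name
            clock.register(self)        # entity self-registers with its clock

        def tick(self):
            print(f"{self.name} clocked")

    master = Clock("master")
    half = DividedClock("half-rate", master, 2)
    Entity("cpu", master)
    Entity("bus", half)
    for _ in range(4):
        master.tick()   # cpu is clocked 4 times, bus only twice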
Memory Modelling
The memory capacity and performance of the host machine on which a
simulation is run are key limitations in modelling large-scale
systems. To address the first limitation, an alternative memory array
declaration scheme has been introduced into HASE. In the case of a
parallel system being run in Single Program Multiple Data (SPMD) mode,
for example, the contents of the instruction memories in the real
system are identical. The new scheme allows the memory arrays in all
the entities in a mesh to use the same memory file.
The second limitation is the length of time required to read and write
the memory initialisation files. An option for defining a HASE memory
array was therefore introduced such that a single physical file can be
used to initialise multiple memory instances. In this case, a single
memory file with separators is created, allowing the memory array in
each entity in a mesh to load a different section of the file.
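A minimal sketch of the separator scheme, assuming a plain-text memory
file and a "####" separator line (both assumptions; the actual HASE
file format differs):

    # Sketch: initialising many memory instances from one physical file
    # divided by separator lines.
    def load_section(path, index, separator="####"):
        """Return the lines of the index-th section of a sectioned file."""
        sections, current = [], []
        with open(path) as f:
            for line in f:
                if line.strip() == separator:
                    sections.append(current)
                    current = []
                else:
                    current.append(line.rstrip("\n"))
        sections.append(current)
        return sections[index]

    # Each entity in the mesh loads only its own section, e.g.
    # node_memory = load_section("mesh.mem", node_index)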
Visualisation Mechanisms
Timing diagrams have been introduced into the framework to inspect the
design parameters and send/receive events of a simulation entity over
time. Unlike the animation facility, the timing diagrams show the
values of multiple parameters changing over time within an entity.
This proved to be extremely useful in designing and validating
complex HASE entities in the QCDOC model.
In addition, the way in which the contents of a large memory array can
be viewed has been improved such that memory contents are displayed in
multiple columns rather than in a single long column.
JavaHASE
Although HASE is freely available for use by others, the design
environment is not really required by those users who only want access
to the simulation models. Furthermore, designers may not wish to
release the source code of their models. JavaHASE [6] solves this
problem by translating HASE projects into applets containing fully
fledged simulation models based on simjava, a Java-based discrete
event simulation library. JavaHASE applets can be
downloaded via the WWW and simulations run on client machines, rather
than on a server. The QCDOC models have proved too complex to be
converted to applets but a spin-off benefit of JavaHASE has been the
conversion of a number of HASE models into applets which can be used
as learning objects, each embedded in its own supporting website
[7,8,9].
Results
Using a combination of hardware and software parameters of the HASE
QCDOC simulation models, a number of performance evaluation
experiments were conducted. In general, the results confirm that the
architecture of the QCDOC computer is very effective at supporting the
QCD application for which it was designed. However, significant
performance improvements could be achieved by incorporating the key
findings of the simulation results into the design, particularly in
relation to the memory hierarchy and the global sum schemes. Our
concerns about the on-chip memory limits, and about the tight coupling
between the optimised QCD kernels and the QCDOC hardware, have been
discussed with the QCDOC design team.
A detailed description and analysis of the results can be found in
[3]. The following subsections summarise the key findings.
Floating-point Unit Performance
The parallel QCD workload is dominated by double-precision
floating-point operations; thus the floating-point unit (FPU)
performance is pivotal to overall code performance. Experiments
varying the workload and system parameters were conducted to explore
the dynamics of the FPU execution pipelines during QCD code
execution. The simulation results enabled us to
quantify the performance limits inherent in the QCD code instruction
mix and to observe the execution pattern of floating-point load-store
and arithmetic instructions. These results provided useful feedback
to the QCDOC design team; their analytical models had been driven by
unrealistic FPU performance estimates.
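A back-of-the-envelope sketch of the kind of instruction-mix bound the
simulations quantify; the single-issue assumption and the mix
fractions are illustrative, not measured QCDOC figures:

    # Sketch: upper bound on FPU utilisation implied by an instruction
    # mix, assuming one instruction issued per cycle.
    def fpu_utilisation_bound(frac_fp_arith, frac_fp_loadstore, frac_other):
        """Fraction of cycles that can perform FP arithmetic when every
        instruction (arithmetic, load/store, other) costs one issue slot."""
        total = frac_fp_arith + frac_fp_loadstore + frac_other
        return frac_fp_arith / total

    # e.g. a mix of 50% FP arithmetic, 35% FP load/store and 15% other
    # caps sustained FPU utilisation at 50% of peak, before any stalls.
    print(fpu_utilisation_bound(0.50, 0.35, 0.15))  # 0.5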
On-chip Cache Hierarchy
An in-depth analysis of performance results across combinations of
memory configuration parameters, including bandwidth and capacity,
enabled us to identify the limits of the QCDOC on-chip memory
hierarchy for QCD code execution. It was observed that while the
custom design configurations result in improvements to the QCD code
performance, the commodity design components, particularly the level 1
data cache design configuration, limit the achievable performance.
QCD Workload Characteristics
The computation and communication overlap and the load balancing
properties were investigated with a range of workload and
co-simulation parameters in the HASE QCDOC model. The QCD workload
characteristics for the overlapping computation and communication
blocks of the code were explored for a number of parallel QCD workload
distributions and mapping schemes. It was identified that an uneven
distribution of lattice volumes onto the processing nodes can reduce
the benefit of overlapping the computation and communication tasks.
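A toy model of the effect, assuming communication can be fully hidden
behind each node's own computation; the times are arbitrary units:

    # Sketch: how uneven lattice distribution erodes compute/communication
    # overlap. All numbers are illustrative.
    def step_time(local_compute_times, comm_time):
        """Per-iteration time with overlapped communication: each node
        hides comm_time behind its own computation, but the step finishes
        only when the slowest node does."""
        return max(max(t, comm_time) for t in local_compute_times)

    balanced   = [100, 100, 100, 100]
    unbalanced = [ 70,  90, 110, 130]   # same total work, unevenly mapped

    print(step_time(balanced, 80))      # 100: comm fully hidden everywhere
    print(step_time(unbalanced, 80))    # 130: slowest node dominates the step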
EDRAM Size Limits
Experiments also showed that, while small or fine-grained QCD workload
data and instructions fit in the available 4 MBytes of on-chip EDRAM,
larger local volumes will not; sustaining the achievable performance
for such volumes would require a substantial rewrite of the existing
QCD kernels. Access to the EDRAM is memory-mapped, so a normal page
fault will not trigger data movement between the on-chip memory and
the large off-chip memory.
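A simple capacity check of this kind is sketched below; the
bytes-per-site figure is a placeholder assumption, since the real
footprint depends on which fields, and at what precision, the kernel
keeps resident:

    # Sketch: does a local lattice volume fit in 4 MBytes of on-chip
    # EDRAM? The bytes-per-site value is an assumed placeholder.
    EDRAM_BYTES = 4 * 1024 * 1024

    def fits_in_edram(local_shape, bytes_per_site):
        sites = 1
        for extent in local_shape:
            sites *= extent
        return sites * bytes_per_site <= EDRAM_BYTES

    print(fits_in_edram((4, 4, 4, 4), 1500))   # True:  256 sites, ~0.4 MB
    print(fits_in_edram((8, 8, 8, 8), 1500))   # False: 4096 sites, ~6 MB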
Hypothetical Design Configurations
Hypothetical (ideal) design configurations of individual entities were
modelled so that the limitations of the other design components could
be quantified. For instance, we measured the maximum achievable
performance for a given instruction mix of the QCD code by reducing
the cache miss penalties to zero clock cycles.
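The sketch below shows the idea with a simple average memory access
time (AMAT) model; the numbers are illustrative, not QCDOC parameters:

    # Sketch: idealising one component to expose the limits of the
    # others, here by zeroing the cache miss penalty in an AMAT model.
    def amat(hit_time, miss_rate, miss_penalty):
        return hit_time + miss_rate * miss_penalty

    realistic = amat(hit_time=1, miss_rate=0.05, miss_penalty=30)
    ideal     = amat(hit_time=1, miss_rate=0.05, miss_penalty=0)
    print(realistic, ideal)   # 2.5 vs 1.0 cycles per access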
QCD Global Sums over Torus and Tree Networks
The QCD global sum schemes over a four-dimensional torus and tree
network were studied on the high-level abstraction model. It was
confirmed that the enhanced pass-through unit over a 4-D torus network
used in QCDOC, and a bit-serial tree network, provide the best global
sum performance for large system configurations.
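A highly simplified step-count comparison of the two approaches,
counting one step per hop and ignoring link widths and pass-through
details (all assumptions):

    # Sketch: step counts for a global sum, comparing dimension-wise
    # reduction over a 4-D torus with a binary-tree reduction.
    import math

    def torus_sum_steps(shape):
        # reduce along each dimension in turn; a ring of size s needs
        # s - 1 hops
        return sum(s - 1 for s in shape)

    def tree_sum_steps(n_nodes):
        # binary combining tree: depth ceil(log2 N) up, the same down
        # again to broadcast the result
        return 2 * math.ceil(math.log2(n_nodes))

    shape = (8, 8, 8, 8)                  # 4096 nodes
    print(torus_sum_steps(shape))         # 28 hops
    print(tree_sum_steps(8 * 8 * 8 * 8))  # 24 hops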
Custom Communication Protocol Performance
The custom communication protocol performance experiments were
conducted over unreliable links on a large-scale system [10]. It was
demonstrated that the custom communication protocol and the buffering
scheme in the serial communication unit offer optimal throughput
and channel utilisation for QCD communication patterns.
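A toy retransmission model of why buffering helps, comparing a single
outstanding packet with a window of buffered packets; the loss rate,
window size and cost model are illustrative assumptions:

    # Sketch: effect of buffering on throughput over an unreliable link,
    # using a toy windowed retransmission model.
    import random

    def throughput(window, loss_rate, packets=10_000, rtt_slots=8, seed=1):
        """Delivered packets per time slot: up to `window` packets are in
        flight; a lost packet is retransmitted after one round trip."""
        rng = random.Random(seed)
        slots, delivered = 0, 0
        while delivered < packets:
            in_flight = min(window, packets - delivered)
            ok = sum(1 for _ in range(in_flight) if rng.random() > loss_rate)
            delivered += ok
            slots += rtt_slots          # one round trip per window of sends
        return delivered / slots

    print(throughput(window=1, loss_rate=0.01))   # stop-and-wait
    print(throughput(window=8, loss_rate=0.01))   # buffered: ~8x better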
Metamodelling
One of the aims of the project was to compare and to predict the QCD
code performance of future-generation supercomputers. In order to
maximise design component re-use and minimise re-design effort, we
introduced a scheme called hierarchical, application-driven
metamodelling, which allows simulation models with alternative system
configurations to be generated efficiently [11]. We
extended the existing HASE QCDOC models and redefined the QCD workload
mapping according to the underlying architecture. A HASE metamodel was
created that is capable of generating QCDOC and Bluegene/L design
configurations for QCD code execution, with an appropriate combination
of parameters.
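The flavour of the scheme can be sketched as a configuration factory;
every parameter value below is an illustrative placeholder, not a
published figure for either machine:

    # Sketch: a metamodel as a configuration factory, deriving
    # alternative machine configurations from one shared, parameterised
    # description. All values are illustrative placeholders.
    BASE = {"torus_dims": 4, "l1_dcache_kb": 32, "on_chip_mb": 4}

    VARIANTS = {
        "QCDOC":      {"torus_dims": 6},   # 6-D torus, 4 dims used by QCD
        "Bluegene/L": {"torus_dims": 3},   # 3-D torus
    }

    def instantiate(name):
        """Merge a variant's overrides into the shared base configuration."""
        return {**BASE, **VARIANTS[name], "name": name}

    for machine in VARIANTS:
        print(instantiate(machine))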
Bluegene/L is a teraflops-scale scientific supercomputer, which is
expected to be available in 2005. Unlike QCDOC, the Bluegene/L
architecture is claimed to be optimised for a range of high-end,
scientific applications, including QCD. The experiments conducted on
the HASE models simulating QCDOC and Bluegene/L design configurations
demonstrate that the QCDOC architectural configurations are highly
optimised and efficient for QCD code execution, particularly the
low-latency, high-bandwidth memory accesses between the level 1 and
level 2 data caches. Furthermore, it was confirmed through experiments
that the four-dimensional custom interconnection network of the QCDOC
machine is better suited to QCD code execution than a
three-dimensional torus interconnection network, which has been the
choice for a number of scientific supercomputers, including Bluegene/L.
References
- "The QCDOC Supercomputer: Hardware, Software, and
Performance", P. A. Boyle, C. Jung & T. Wettig, Conf. for
Computing in High Energy and Nuclear Physics, 2003.
- "Simulation of a Computer Architecture for
Quantum Chromodynamics Calculations", S.R. Alam, R.N. Ibbett & F.
Mallet, Crossroads, The ACM Student Magazine, Interdisciplinary
Computer Science, Issue 9.3, pp 16-23, Spring 2003.
- Sadaf Alam, "Simulation of the UKQCD Computer",
PhD thesis, School of Informatics, University of Edinburgh, 2004.
- HASE User Manuals,
www.icsa.inf.ed.ac.uk/research/groups/hase/manuals
- "An Extensible Clock mechanism for Computer
Architecture Simulations", F. Mallet, S. Alam & R.N. Ibbett, 13th
International Conference on Modelling and Simulation, pp
91-96, USA, May 2002.
- "JavaHase: Automatic Generation of Applets from Hase
Simulation Models", F.Mallet & R.N. Ibbett, Proc. Summer Computer
Simulation Conference, pp 659-664, Canada, July 2003.
- JavaHASE applets,
www.icsa.inf.ed.ac.uk/research/groups/hase
- "Computer Architecture Simulation Applets for Use in
Teaching", R.N. Ibbett & F. Mallet, Proc. Frontiers in
Education 2003, Boulder, Colorado, Nov 2003.
-
"A Simulation Applet for Microcoding Exercises",
R.N. Ibbett, Proc Workshop on Computer Architecture Education at ISCA,
pp 122-129, 2004.
- "Performance Evaluation of Local Communications: A
Case-study", S.R. Alam, R.N. Ibbett & F. Mallet, 15th
International Conference on Parallel and Distributed Computing and
Systems, IASTED, pp 393-398, USA, November 2003.
- "A Methodology for Simulating Scientific
Supercomputing Systems", S. Alam & R.N. Ibbett, Summer Computer
Simulation Conference, USA, July 2004.
HASE Project
Institute for Computing Systems Architecture, School of Informatics,
University of Edinburgh