
Institute for Computing Systems Architecture

Computer Architecture Simulation & Visualisation


Simulation of the UKQCD Computer - Final Report

Project Summary

The QCD Computer Simulation Project was funded by the EPSRC under grant GR/R27129 (May 2001 - April 2004).

The overall aim of the project was to gain an understanding of the factors which influence the performance of QCD computers and to influence future high performance system designs. In order to achieve the project aims, parameterised simulation models of the proposed UKQCD computer, QCDOC, were constructed in HASE (Hierarchical computer Architecture Design and Simulation Environment). QCDOC (Quantum Chromodynamics On-Chip) is a recent, application-specific supercomputer designed for QCD grand challenge application simulations [1]. The HASE models were used:

In parallel with designing the models, the capabilities of the simulation environment, HASE, were extended in response to the demands placed on it by these models, and a number of new features were added.

The HASE QCDOC Models

Parameterised hardware-software co-simulation models of the QCDOC computer were created for conducting a range of experiments [2,3]. Because of hardware and operating system limitations of the host platform, a complete machine of around 10K processing nodes could not be simulated in precise design detail. Models were therefore developed at two different levels of abstraction:

Co-Simulation

The HASE QCDOC simulation models provide a unified framework in which not only the architectural parameters but also the workload parameters can be varied. The parallel QCD workload on a system depends on the lattice QCD volume. This volume is represented as the number of lattice sites in four space-time dimensions, which in turn is determined by the degree of approximation required of a QCD calculation. Large QCD volumes provide better approximations but also result in larger workloads per processing node. Another dimension of the QCD workload per processing node is the workload mapping and distribution scheme. The HASE QCDOC models incorporate QCD workload volume, mapping and distribution in the behavioural modelling. Thus the models allow a simulation user to alter the workload size per processing node, as well as the workload mapping and distribution, through HASE parameter windows. The benefits of this co-simulation scheme include:

HASE Extensions

Modelling QCDOC in HASE has thrown up a number of interesting simulation problems, including two simulation clocking problems that appear not to have been solved elsewhere, and these have led to novel solutions [4].

MeshnD Template

The real QCDOC machine has a 6-D torus topology although, at any given time, the QCD workload only utilises four of these dimensions. HASE already had a 1-D, 2-D and 3-D mesh template generation facility (essentially an automated version of the compound entity mechanism) but this was inadequate for QCDOC. The mesh facility was therefore extended to become an n-dimensional mesh/torus topology generation facility. This extended facility not only instantiates a mesh of sub-entities but also creates the links between the sub-entities in the specified topology, i.e. the simulation designer can simply define a mesh and the sub-entity to be replicated in the mesh, and does not need to write explicit code to generate instances of the sub-entities, their interconnections or the visual display.

In many models, the sub-entities within a mesh contain a number of parameters. In most cases the values of these parameters are required to be the same in each instance of the sub-entity. In order to minimise the overheads and errors associated with parameter changes between simulation runs for a HASE project with a very large number of replicated sub-entities, a static parameter mechanism was introduced. If a parameter of a mesh sub-entity is declared as static, only one copy of the parameter exists in the simulation model, so if it is altered in one sub-entity, the change will automatically appear in all other sub-entities.
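As an illustration of the bookkeeping that the extended topology facility automates, the following sketch (plain Python, not HASE code; the torus dimensions are hypothetical) enumerates the node coordinates of an n-dimensional torus and computes the wrap-around neighbours that the generated links would connect.

    # Sketch of the link bookkeeping the n-D mesh/torus facility automates
    # (plain Python, not HASE code; the torus dimensions are hypothetical).
    from itertools import product

    def torus_nodes(dims):
        """Yield every node coordinate in an n-D mesh with extents 'dims'."""
        return product(*(range(extent) for extent in dims))

    def neighbours(coord, dims):
        """Return the +1/-1 wrap-around neighbours of 'coord' in each dimension."""
        result = []
        for axis, extent in enumerate(dims):
            for step in (+1, -1):
                nbr = list(coord)
                nbr[axis] = (coord[axis] + step) % extent   # torus wrap-around
                result.append(tuple(nbr))
        return result

    dims = (4, 4, 4, 4)                  # a hypothetical 4-D torus of 256 nodes
    links = {c: neighbours(c, dims) for c in torus_nodes(dims)}
    print(len(links), "nodes,", sum(len(v) for v in links.values()) // 2, "undirected links")

In the real facility the same enumeration also drives the creation of the sub-entity instances and the visual layout, so the model designer specifies only the extents and the sub-entity to be replicated.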

The Clock Mechanism

The clock mechanism used in most HASE models serves two purposes: (a) to simulate the actual clock used in the real system under investigation; (b) to act as a barrier synchronisation mechanism between the entities representing the components of the model. Both purposes require clocking signals to pass between the clock entity and the clocked entities, which in turn requires links to be defined between these entities. In experimenting with the QCDOC model, an important parameter is the number of processors in the multi-dimensional array, and it is desirable to be able to vary this parameter at run time, without recompiling the project. To enable this, a clock inheritance mechanism [5] has been designed that allows entities to self-register with the clock at the start of a simulation. The same mechanism also solves a second problem: the need to be able to use multiple clock frequencies, synchronised to the master clock in the model, reflecting the situation not just in QCDOC but in system-on-chip designs generally. In this case, entities self-register with an intermediate clock which itself self-registers with the master clock.
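The following sketch illustrates the self-registration idea in isolation (hypothetical classes, not the HASE API): entities register with a clock when the simulation starts, so no clock links need to be compiled into the model, and a derived clock registers with the master clock to provide a divided frequency.

    # Minimal sketch of the self-registration idea behind the clock mechanism
    # (hypothetical classes, not the HASE API).

    class Clock:
        def __init__(self):
            self.subscribers = []

        def register(self, callback):
            self.subscribers.append(callback)      # entity self-registers at start-up

        def tick(self, cycle):
            for callback in self.subscribers:
                callback(cycle)

    class DerivedClock(Clock):
        """A clock running at 1/divisor of its parent's frequency."""
        def __init__(self, parent, divisor):
            super().__init__()
            self.divisor = divisor
            self._count = 0
            parent.register(self._on_parent_tick)  # the derived clock itself registers

        def _on_parent_tick(self, cycle):
            self._count += 1
            if self._count % self.divisor == 0:
                self.tick(cycle)

    master = Clock()
    bus_clock = DerivedClock(master, divisor=2)    # e.g. a bus at half the CPU rate
    master.register(lambda cycle: print("cpu tick", cycle))
    bus_clock.register(lambda cycle: print("bus tick", cycle))
    for cycle in range(4):
        master.tick(cycle)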

Memory Modelling

The memory capacity and performance of the host machine on which a simulation is run are key limitations in modelling large-scale systems. To address the first limitation, an alternative memory array declaration scheme has been introduced into HASE. In the case of a parallel system being run in Single Program Multiple Data (SPMD) mode, for example, the contents of the instruction memories in the real system are identical; the new scheme allows the memory arrays in all the entities in a mesh to use the same memory file. The second limitation is the length of time required to read and write the memory initialisation files. An option for defining a HASE memory array was therefore introduced such that a single physical file can be used to initialise multiple memory instances. In this case, a single memory file with separators is created, allowing the memory array in each entity in a mesh to load a different section of the file.
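The following sketch illustrates the single-file idea (the file format and separator marker are hypothetical, not the actual HASE memory-file syntax): one physical file holds the initial contents of several memory instances, separated by a marker line, and each entity loads only the section matching its own index.

    # Sketch of the single-file initialisation idea (hypothetical file format and
    # separator, not the actual HASE memory-file syntax).

    SEPARATOR = "%%SECTION"   # hypothetical marker between memory sections

    def load_section(path, index):
        """Return the lines of section 'index' from a file split by SEPARATOR lines."""
        sections, current = [], []
        with open(path) as f:
            for line in f:
                if line.strip() == SEPARATOR:
                    sections.append(current)
                    current = []
                else:
                    current.append(line.rstrip("\n"))
        sections.append(current)
        return sections[index]

    # In SPMD mode every instruction memory would instead load the same file (or
    # the same section), so only one copy of the program needs to be kept on disk.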

Visualisation Mechanisms

Timing diagrams have been introduced into the framework to inspect the design parameters and send/receive events of a simulation entity over time. Unlike the animation facility, the timing diagrams show the values of multiple parameters changing over time within an entity. This proved to be extremely useful in designing and validating complex HASE entities in the QCDOC model. In addition, the way in which the contents of a large memory array can be viewed has been improved such that memory contents are displayed in multiple columns rather than in a single long column.

JavaHASE

Although HASE is freely available for use by others, the design environment is not really required by those users who only want access to the simulation models. Furthermore, designers may not wish to release the source code of their models. JavaHASE [6] solves this problem by translating HASE projects into applets containing fully fledged simulation models based on simjava, a Java discrete event simulation library. JavaHASE applets can be downloaded via the WWW and simulations run on client machines, rather than on a server. The QCDOC models have proved to be too complex to be converted to applets, but a spin-off benefit of JavaHASE has been the conversion of a number of other HASE models into applets which can be used as learning objects, each embedded in its own supporting website [7,8,9].

Results

Using a combination of hardware and software parameters of the HASE QCDOC simulation models, a number of performance evaluation experiments were conducted. In general, the results confirm that the architecture of the QCDOC computer is very effective at supporting the QCD application for which it was designed. However, significant performance improvements could be achieved by incorporating the key findings of the simulation results into the design, particularly in relation to the memory hierarchy and the global sum schemes. We have discussed and shared our concerns about the on-chip memory limits, and about the tight coupling between the optimised QCD kernels and the QCDOC hardware, with the QCDOC design team.

A detailed description and analysis of results can be found in [3]. The following subsections summarise the key findings.

Floating-point Unit Performance

The parallel QCD workload is dominated by double-precision floating-point operations; the floating-point unit (FPU) performance is therefore pivotal to overall code performance. Experiments were conducted by varying workload and system parameters to explore the dynamics of the FPU execution pipelines during execution of the QCD code. The simulation results enabled us to quantify the performance limits inherent in the QCD code instruction mix and to observe the execution pattern of floating-point load-store and arithmetic instructions. These results provided useful feedback to the QCDOC design team, whose analytical models had been driven by unrealistic FPU performance estimates.
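As an illustration of the kind of bound implied by an instruction mix (the figures below are hypothetical, not measured QCDOC values, and the single-issue assumption is a simplification): if a fixed fraction of the dynamic instructions are floating-point arithmetic and the FPU completes at most one such instruction per cycle, that fraction alone caps the sustainable floating-point rate.

    # Illustrative only (hypothetical numbers): an upper bound on sustained
    # floating-point performance implied purely by the instruction mix, assuming
    # one instruction completes per cycle and only FP arithmetic contributes flops.

    clock_hz        = 500e6   # hypothetical processor clock
    flops_per_fpins = 2.0     # e.g. a fused multiply-add counts as two flops
    frac_fp_arith   = 0.45    # hypothetical fraction of instructions that are FP arithmetic

    peak_flops      = clock_hz * flops_per_fpins                  # every slot FP arithmetic
    mix_bound_flops = clock_hz * frac_fp_arith * flops_per_fpins  # ceiling set by the mix

    print(f"peak: {peak_flops/1e9:.2f} Gflop/s, "
          f"mix-limited bound: {mix_bound_flops/1e9:.2f} Gflop/s "
          f"({100*mix_bound_flops/peak_flops:.0f}% of peak)")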

On-chip Cache Hierarchy

An in-depth analysis of performance results across a range of memory configuration parameters, including bandwidth and capacity, enabled us to identify the limits of the QCDOC on-chip memory hierarchy for QCD code execution. It was observed that while the custom design configurations result in improvements to the QCD code performance, the commodity design components, particularly the level 1 data cache configuration, limit the achievable performance.

QCD Workload Characteristics

The computation and communication overlap and the load balancing properties were investigated with a range of workload and co-simulation parameters in the HASE QCDOC model. The QCD workload characteristics for the overlapping computation and communication blocks of the code were explored for a number of parallel QCD workload distributions and mapping schemes. It was identified that an uneven distribution of lattice volumes onto the processing nodes can reduce the benefit of overlapping the computation and communication tasks.
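The following sketch illustrates how the local workload per node follows from the global lattice volume and the node grid, and how a non-divisible extent produces the uneven distribution referred to above (plain Python with hypothetical sizes, not the HASE co-simulation parameters).

    # Sketch of how the local workload per node follows from the global lattice
    # volume and the node grid (hypothetical sizes, not HASE co-simulation code).

    def local_extents(lattice, grid):
        """Per-dimension (sites_per_node, leftover) for a block distribution."""
        return [divmod(sites, nodes) for sites, nodes in zip(lattice, grid)]

    lattice = (32, 32, 32, 64)                 # hypothetical global 4-D QCD volume

    for grid in [(4, 4, 4, 8), (4, 4, 4, 6)]:  # an even and an uneven partition
        per_dim = local_extents(lattice, grid)
        even = all(rem == 0 for _, rem in per_dim)
        base_sites = 1
        for sites, _ in per_dim:
            base_sites *= sites
        # A non-zero leftover means some nodes carry an extra plane of sites and
        # finish later, eroding the computation/communication overlap.
        print(grid, "->", per_dim, "even" if even else "uneven",
              base_sites, "sites/node (base)")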

EDRAM Size Limits

It was also identified through experiments that, while small or fine-grained QCD workload data and instructions fit in the available 4 MBytes of on-chip EDRAM, larger local volumes will not; sustaining the achievable performance for such volumes would require a substantial rewrite of the existing QCD kernels. Access to the EDRAM is memory-mapped, so a normal page fault will not trigger data movement from the on-chip memory to the large off-chip memory.
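A back-of-envelope calculation illustrates the limit; the storage layout assumed below is the standard double-precision Wilson-fermion one and is illustrative only, not the actual footprint of the QCDOC kernels.

    # Back-of-envelope sketch (illustrative, not the actual QCDOC kernel footprint):
    # double-precision storage per lattice site for the basic Wilson-fermion fields,
    # and the largest local volume that fits in the 4 MByte on-chip EDRAM.

    BYTES_PER_DOUBLE = 8
    spinor_reals = 4 * 3 * 2        # 4 spin x 3 colour complex components = 24 reals
    gauge_reals  = 4 * 3 * 3 * 2    # 4 link matrices, each 3x3 complex     = 72 reals

    # source spinor + result spinor + gauge links
    bytes_per_site = (2 * spinor_reals + gauge_reals) * BYTES_PER_DOUBLE
    edram_bytes    = 4 * 1024 * 1024

    max_sites = edram_bytes // bytes_per_site
    print(bytes_per_site, "bytes/site ->", max_sites, "sites fit in 4 MB")
    # A local 4^4 volume (256 sites) fits easily; an 8^4 volume (4096 sites) already
    # approaches the limit once working buffers and instructions are included.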

Hypothetical Design Configurations

Hypothetical (ideal) design configurations of entities were modelled so that the limitations of other design components could be quantified. For instance, we measured the maximum achievable performance for a given instruction mix of the QCD code by reducing the cache miss penalties to zero clock cycles.

QCD Global Sums over Torus and Tree Networks

The QCD global sum schemes over a four-dimensional torus and over a tree network were studied using the high-level abstraction model. It was confirmed that the enhanced pass-through unit over a 4-D torus network used in QCDOC, and a bit-serial tree network, provide the best global sum performance for large system configurations.
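The comparison can be framed in terms of a simple hop-count model, sketched below; this is illustrative only and ignores the per-hop cost reductions provided by the enhanced pass-through unit, which the actual experiments do model. The dimensions and node counts are hypothetical.

    # Illustrative hop-count comparison (a simple latency model, not the actual
    # QCDOC pass-through implementation): a dimension-ordered global sum on a
    # d-dimensional torus versus a combine-and-broadcast on a binary tree.
    import math

    def torus_sum_hops(dims):
        """Ring reduction in each dimension in turn: (n_i - 1) hops per dimension,
        doubled for the broadcast of the result back around each ring."""
        return 2 * sum(n - 1 for n in dims)

    def tree_sum_hops(num_nodes):
        """Combine up a binary tree and broadcast back down: 2 * ceil(log2 P) levels."""
        return 2 * math.ceil(math.log2(num_nodes))

    dims = (8, 8, 8, 8)          # hypothetical 4-D partition of 4096 nodes
    print("torus:", torus_sum_hops(dims), "hops;  tree:", tree_sum_hops(8**4), "hops")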

Custom Communication Protocol Performance

The custom communication protocol performance experiments were conducted over unreliable links on a large-scale system [10]. It was demonstrated that the custom communication protocol and the buffering scheme in the serial communication unit offer the optimal throughput and channel utilisation for QCD communication patterns.

Metamodelling

One of the aims of the project was to compare and to predict the QCD code performance of future-generation supercomputers. In order to maximise design component re-use and minimise re-design effort, we introduced a scheme called hierarchical, application-driven metamodelling; metamodelling allows for efficient generation of simulation models with alternative system configurations by maximising component re-use and minimising redesign overheads [11]. We extended the existing HASE QCDOC models and redefined the QCD workload mapping according to the underlying architecture. A HASE metamodel was created that is capable of generating QCDOC and Bluegene/L design configurations for QCD code execution, with an appropriate combination of parameters. Bluegene/L is a teraflops-scale scientific supercomputer which is expected to be available in 2005. Unlike QCDOC, the Bluegene/L architecture is claimed to be optimised for a range of high-end scientific applications, including QCD.

The experiments conducted on the HASE models simulating QCDOC and Bluegene/L design configurations demonstrate that the QCDOC architectural configurations are highly optimised and efficient for QCD code execution, particularly the low-latency, high-bandwidth memory accesses between the level 1 and level 2 data caches. Furthermore, it was confirmed through experiments that the four-dimensional custom interconnection network of the QCDOC machine is better suited to QCD code execution than a three-dimensional torus interconnection network, which has been the choice for a number of scientific supercomputers, including Bluegene/L.

References

  1. "The QCDOC Supercomputer: Hardware, Software, and Performance", P. A. Boyle, C. Jung & T. Wettig, Conf. for Computing in High Energy and Nuclear Physics, 2003.
  2. "Simulation of a Computer Architecture for Quantum Chromodynamics Calculations", S.R. Alam, R.N. Ibbett & F. Mallet, Crossroads, The ACM Student Magazine, Interdisciplinary Computer Science, Issue 9.3, pp 16-23, Spring 2003.
  3. Sadaf Alam, "Simulation of the UKQCD Computer", PhD thesis, School of Informatics, University of Edinburgh, 2004.
  4. HASE User Manuals, www.icsa.inf.ed.ac.uk/research/groups/hase/manuals
  5. "An Extensible Clock mechanism for Computer Architecture Simulations", F. Mallet, S. Alam & R.N. Ibbett, 13th International Conference on Modelling and Simulation, pp 91-96, USA, May 2002.
  6. "JavaHase: Automatic Generation of Applets from Hase Simulation Models", F.Mallet & R.N. Ibbett, Proc. Summer Computer Simulation Conference, pp 659-664, Canada, July 2003.
  7. JavaHASE applets, www.icsa.inf.ed.ac.uk/research/groups/hase
  8. "Computer Architecture Simulation Applets for Use in Teaching", R.N. Ibbett & F. Mallet, Proc. Frontiers in Education 2003, Boulder, Colorado, Nov 2003.
  9. "A Simulation Applet for Microcoding Exercises", R.N. Ibbett, Proc Workshop on Computer Architecture Education at ISCA, pp 122-129, 2004.
  10. "Performance Evaluation of Local Communications: A Case-study", S.R. Alam, R.N. Ibbett & F. Mallet, 15th International Conference on Parallel and Distributed Computing and Systems, IASTED, pp 393-398, USA, November 2003.
  11. "A Methodology for Simulating Scientific Supercomputing Systems", S. Alam & R.N. Ibbett, Summer Computer Simulation Conference, USA, July 2004.

