Dr. Marc Snir, Director, Mathematics and Computer Science Division, Argonne National Laboratory
At the University of Illinois at Urbana-Champaign, Dr. Snir is the Michael Faiman and Saburo Muroga Professor in Computer Science.
Over the last few decades, large-scale simulations have become an essential tool in science and engineering. Simulations have replaced wind tunnels because they are cheaper and faster; they have replaced combustion experiments because they let us peek at the center of a flame, which is hard to observe in real experiments; they help us understand how the universe started and how a black hole behaves.
The performance of top supercomputers has increased exponentially over the last decades. The current top system, deployed in China, contains the equivalent of 16,000 powerful server nodes and consumes 18 MW of power. While this is an impressive system, it is dwarfed by the data centers of leading cloud providers: For example, the data centers of Amazon are estimated to contain 50,000 to 80,000 servers each, and Amazon is said to have about 30 such centers. Have clouds dethroned supercomputers?
There is much similarity between clouds and supercomputers: Both are built by connecting together tens of thousands of server nodes. However, clouds and supercomputers are typically optimized for different workloads. Supercomputers are large because any one job can utilize a sizeable fraction of the system. Clouds are large because larger data centers are more efficient and can better accommodate variations in a large number of small individual workloads. Supercomputing jobs often require interconnects with higher performance and the ability to access files in parallel. Supercomputers are therefore typically fitted with faster and more expensive interconnects, and use a different I/O architecture.
However, clouds are “good enough” for less demanding parallel workloads, i.e., computations that are “loosely coupled” and require less frequent interaction between the computing nodes. Clouds can replace small clusters in support of such workloads. In addition, technologies developed in support of cloud computing will be increasingly relevant to high-performance computing. For example, solutions for increased energy efficiency apply to both, and the problem of monitoring the health of tens of thousands of servers and reacting quickly to failures is common to both. Container technologies, such as Docker, can facilitate the porting of applications across platforms and the quick deployment of parallel workloads, thus bringing interactive high-performance computing closer to reality. Since the cloud market is much larger than the HPC market, HPC will benefit from such technology sharing.
While high-performance computing has long been synonymous with large-scale simulations, it increasingly covers demanding data analytics workloads. The analysis of large graphs or of large scientific datasets often requires that the data be stored in memory and demands fast communication across nodes – i.e., supercomputing architectures. Large-scale simulations generate masses of data that need to be analyzed in situ, in order to avoid I/O bottlenecks. New applications are emerging that require a tight interaction between large streams of observational data and large-scale simulations: e.g., for monitoring large systems and running “what-if” experiments to cope with abnormal situations. This is leading to a search for unified system designs that can support closely interacting simulations and data analysis workloads. Another requirement that stems from such scenarios is the need to easily couple different parallel applications in complex workloads and to dynamically allocate resources to such workloads. Here, too, cloud-computing technologies could be adapted, with suitable modifications.
The current leading supercomputers perform over one million billion numerical operations per second (a petaflop per second); researchers and vendors are working toward the next threshold, one billion billion numerical operations per second (an exaflop per second), thus moving from petascale to exascale. However, this continued march toward increasing performance is becoming more difficult.
The leading supercomputer in 1993 performed about 60 billion numerical operations per second (60 GF/s); the current leader performs about 30 million billion numerical operations per second (30 PF/s), an improvement by a factor of 500,000. During this same period, the number of transistors per chip has increased by a factor of about 1,600, from 3 million to 5 billion, and the clock speed has increased by a factor of 20. These two factors explain much of the performance gain, but not all of it. Another factor of about 15 can be attributed to an increase in the number of components used by top supercomputers, and to the use of more specialized components.
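The arithmetic behind these factors can be checked with a short sketch. It uses only the rounded figures quoted above; the variable names are illustrative, and treating the transistor and clock gains as simply multiplying together is a simplifying assumption, not a precise performance model.

```python
# Rounded figures quoted in the text.
flops_1993 = 60e9          # 60 GF/s, leading system in 1993
flops_now = 30e15          # 30 PF/s, current leading system
transistors_1993 = 3e6     # transistors per chip in 1993
transistors_now = 5e9      # transistors per chip today
clock_gain = 20            # clock speed improvement over the period

# Overall improvement in peak performance.
total_gain = flops_now / flops_1993

# Gain explained by denser and faster chips (assumed to multiply).
transistor_gain = transistors_now / transistors_1993
chip_gain = transistor_gain * clock_gain

# What is left over must come from more, and more specialized, components.
residual_gain = total_gain / chip_gain

print(f"total gain:      {total_gain:,.0f}")
print(f"transistor gain: {transistor_gain:,.0f}")
print(f"residual gain:   {residual_gain:.1f}")
```

With these inputs the residual comes out close to 15, which matches the remaining factor attributed to larger component counts and specialization.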
The increase in microprocessor performance has also increased the energy consumption of these microprocessors. This energy is released as heat and taxes our ability to cool computer components. As a result, clock speeds have not improved in the last decade. The improvements in chip performance are entirely due to an increase in the amount of on-chip parallelism, with more operations executed simultaneously. The current leading supercomputer has over three million cores, each executing its own sequential stream of instructions; an exascale machine will need hundreds of millions of cores, and programs using the entire machine will need to divide their computation into hundreds of millions of concurrent activities. The increased size of such a system will also increase its energy consumption and the frequency of errors.
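A rough back-of-the-envelope sketch shows why the core count has to grow so much. It assumes the rounded figures used in this article (about 30 PF/s on about three million cores) and, as a deliberate simplification, that per-core performance stays flat because clock speeds no longer improve. The names are illustrative only.

```python
# Rounded figures for the current leading system, as quoted in the text.
current_flops = 30e15      # about 30 PF/s
current_cores = 3e6        # a little over three million cores

exascale_flops = 1e18      # one exaflop per second

# Assumption: per-core performance stays flat, since clocks have stalled.
flops_per_core = current_flops / current_cores

# Cores needed to reach an exaflop under that assumption.
cores_needed = exascale_flops / flops_per_core
growth_factor = cores_needed / current_cores

print(f"per-core rate: {flops_per_core:.1e} operations/s")
print(f"cores needed:  {cores_needed:,.0f}")
print(f"growth factor: {growth_factor:.1f}x")
```

Under this flat per-core assumption the estimate is on the order of a hundred million cores. Slower, more energy-efficient cores would push the count higher still, toward the hundreds of millions mentioned above.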
While these obstacles are significant, they are not insurmountable: Industry is working on low-power circuits; new architectures and better power management can further decrease power consumption. Various techniques can be applied to prevent, detect and correct, or tolerate errors. Multiple application domains in science and engineering can leverage the vast amounts of parallelism that will become available; new algorithms and new programming environments are being developed to facilitate this task. The FY16 Budget Request for the Department of Energy includes $273M for working on these problems and developing the application codes that will leverage exascale performance. This level of investment will need to be maintained for the foreseeable future.
As usual, supercomputing technologies will trickle down: Technologies for a machine-room-sized exascale system will also support multiple petaflop/s in a rack-sized system. New methods and algorithms developed to support research will be applied to industrial problems. And technologies that facilitate the integration of big computing with big data have the potential to significantly change industrial use of High-Performance Computing.