Brute Force vs High Performance Computing - What's the Difference?
A friend of mine suggested long ago that a supercomputer is one that lets you do an order of magnitude more work than you could accomplish with the largest typical computer around. He was thinking of physics problems, but it is not a bad definition for any field. Also long ago, in the oil industry, we started using array processors and CDC, Cray, IBM (and other) computers to do "embarrassingly parallel" work. If we could decompose the problem into many small tasks or many parallel streams of similar work, then we could get a lot accomplished in short order. The array processor and its follow-on computers allowed us to pipeline many "vectors" of numbers through a simple state machine and spit out answers on the other side.
Nowadays, with the major search engines, we might instead be searching for a word string or indexing a web page in a huge map-reduce algorithm. The form is not significantly different: many small tasks, repeated over and over on different sets of data, with the results collected and abstracted to something higher-level (a 3D seismic image or a web search result, for instance).
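To make the "many small tasks, results collected" pattern concrete, here is a minimal sketch of a map-reduce word count in Python. The function names, the sample pages, and the use of a thread pool are my own illustrative assumptions, not anything a particular search engine does. Threads keep the sketch short; a real system would spread the chunks across processes or machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_count(chunk: str) -> Counter:
    """Map step: count the words in one independent chunk of text."""
    return Counter(chunk.lower().split())


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial counts into one."""
    return left + right


def word_count(chunks, workers=4):
    # No chunk depends on any other, so the map step can be farmed out freely.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_count, chunks))
    # Collect the partial results into a single higher-level answer.
    return reduce(reduce_counts, partials, Counter())


# Hypothetical input: three tiny "pages" standing in for a huge data set.
pages = ["the quick brown fox", "the lazy dog", "the fox and the dog"]
totals = word_count(pages)
```

Adding more workers lets you process more chunks at once, but no single chunk finishes any sooner, which is the distinction the next paragraph draws.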
I contend that real "High Performance Computing" is a step above just doing many things in parallel. It is the difference between sending the Concorde across the Atlantic in 3 hours and sending 10,000 yachts. You can move many more people on the yachts, but it will still take a couple of weeks. The folks that took the Concorde would have been to London and back before the others were halfway across.
Real HPC requires a finely tuned balance of power in four dimensions:
1. The network must be very low latency and very high bandwidth. Ideally, the network is a backplane inside the computer. Better still, the network is the space between layers of silicon on a 3D chip.
2. The computing units must be very fast and very capable. There may be more than one, but each unit must be able to do additions, multiplies, fetches, and more in record time. Each unit must have large arrays of register memory, superior algorithms for caching instructions and data, and methods of pipelining instructions. Attached memory must be very fast.
3. The data store (disks on most systems) must be quick enough to serve instructions and problem data to and from the computing units without causing the computing units to wait long.
4. Software must be aligned with the best attributes of the above three elements. Without good software underpinnings the best hardware will starve for work to do.
Each of the above elements must also include the ability to recover from errors and failures. For instance, a network needs alternative routing in case a switching node fails. A disk needs RAID or other methods in order to not lose results. A computing core needs to be able to block off bad registers or circuits and route to redundant elements. Finally, software must checkpoint or retry or save partial results.
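On the software side, "checkpoint or retry or save partial results" can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name `run_with_checkpoint`, the JSON checkpoint file, and the simulated flaky task are all hypothetical, and a production scheduler would do far more.

```python
import json
import os
import tempfile


def run_with_checkpoint(tasks, work, path, max_retries=3):
    """Run work(task) for each task, saving partial results after each one."""
    results = {}
    if os.path.exists(path):
        with open(path) as f:
            results = json.load(f)  # resume from an earlier run's partial results
    for task in tasks:
        key = str(task)
        if key in results:
            continue  # finished before the last interruption; skip it
        for attempt in range(1, max_retries + 1):
            try:
                results[key] = work(task)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up; completed results are already on disk
        # Write to a temp file, then rename, so a crash mid-write
        # never leaves a half-written checkpoint behind.
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(results, f)
        os.replace(tmp, path)
    return results


# Hypothetical task that fails once to simulate a transient fault.
calls = {"n": 0}


def flaky_square(x):
    calls["n"] += 1
    if calls["n"] == 2:
        raise RuntimeError("simulated transient failure")
    return x * x


checkpoint = os.path.join(tempfile.mkdtemp(), "partial.json")
results = run_with_checkpoint([1, 2, 3], flaky_square, checkpoint)
```

Running the same call a second time does no new work, because every task is already recorded in the checkpoint file.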
At the four HPC-using companies I've been associated with, I've seen the full range in all four of the above areas, from excellent to catastrophic. Let's leave software for later. The hardware side is where I've seen the biggest mistakes: not in the computing units (the computers, clusters, big iron, etc.), but typically in network or disk purchases. If there is not a good balance between those three elements of the hardware, money has certainly been wasted. There is no magic bullet, but a good rule of thumb is to split your money into three piles. Give the computers about 40 percent of the budget, and split the remaining 60 percent equally: 30 percent to disk systems and 30 percent to your network infrastructure. Prototype your design by purchasing time in an HPC co-location facility. Hire a specialist, fund some grad students, rent time on the cloud; do whatever it takes to test the waters first. Otherwise you can spend millions and then wonder why the disks are so slow, the computers are not busy, or the network keeps falling over.

The typical VP of IT is centered on enterprise systems, SOX, financial and HR systems, and VDI. He or she is not the right person to "spec out" your HPC system. I recall one high-level network engineer telling me I should purchase low-performance network switches because I didn't want to overload my disks with demand. That was exactly the opposite of what I wanted to do. I wanted to pull data so fast that the disks melted! If they weren't up to the task, I wanted faster disks. If one sentence is all you remember, this is it: "Don't under-budget your disks and network."
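As a quick worked example of the 40/30/30 rule of thumb, the following Python sketch splits a budget into the three piles. The function name and the one-million figure are hypothetical, chosen only to show the arithmetic; percentages are kept as whole numbers so the shares are easy to check.

```python
def split_budget(total, compute_pct=40, disk_pct=30, network_pct=30):
    """Split a hardware budget using the rule-of-thumb percentages."""
    if compute_pct + disk_pct + network_pct != 100:
        raise ValueError("percentages must add up to 100")
    return {
        "compute": total * compute_pct / 100,
        "disk": total * disk_pct / 100,
        "network": total * network_pct / 100,
    }


# Hypothetical budget of 1,000,000 in whatever currency you spend.
plan = split_budget(1_000_000)
for pile, amount in plan.items():
    print(f"{pile:8s} {amount:12,.0f}")
```

Treat the default percentages as a starting point to test against a prototype, not as a fixed formula.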
If you run commercial software on your HPC systems, then you still need someone who pays attention to the tuning between disk access, network traffic, job scheduling, and the computers. If your company writes its own code, however, make sure your software engineers know about the latest in NoSQL databases, message passing, threading, remote procedure calls, and system underpinnings. Linux should be your operating system, and your system administrators should know how to tune all of the system services. Something higher-level than NFS and CIFS should be your disk access protocol: see Lustre, GPFS, GlusterFS, ZFS, Ceph, Hadoop, etc. C, C++, modern Fortran (yes, really), and Python are where the fastest codes will be running. MPI, OpenMP, and OpenCL need to be well understood. Don't rely on open source for everything, although a lot of open-source systems are great! Buy a good compiler and test/tuning suite. Make sure your software engineers are using agile methods. Peer review their codes with university professors if needed. You don't want software to keep you behind the times. Look at accelerators, but weigh the cost of the software development time you will use up making them work efficiently for you.
When these four systems (CPU, Disk, Network, and Software) are aligned, HPC is a wonder to behold. Problems are solved in record time, lives are saved, money made, rockets launched, and weather predicted. However, in many cases our problems really are "embarrassingly parallel". So do we really have HPC everywhere? Look at what your company does, and I'll let you make the call: HPC or BfC (brute-force computing)? When it comes down to it, however, I'll also bet that we don't care, and perhaps we shouldn't.