Publications

Accelogic thrives on state-of-the-art technology. Always pioneering new technologies and solutions, we ensure that our customers integrate the best solutions for their particular applications and needs. Here we share some of our own technology developments.

Reducing wall-clock time in high-performance distributed software through digital compression of data

Juan Gonzalez
ACCELOGIC TECHNICAL REPORT #14-0237
September, 2014

Abstract:
A fast-paced, continual increase in the ratio of processing speed to communication speed feeds an exponentially growing limitation on the performance that can be extracted from High Performance Computing (HPC) systems. Ongoing developments and trends make it clear that this ratio will keep increasing over the next decade. Research in hardware (e.g., computer network and interconnect technologies) can provide improvements in latency and bandwidth, but algorithmic and software solutions are also needed to improve the efficiency of data movement. Data compression has recently arisen as a solution for reducing communication in HPC. However, existing data compression methods for HPC provide small compression factors (typically less than 1.3x). This paper introduces a novel compression theory, which we have codenamed “Compressive Computing”, whose results exceed all existing compression factors for floating-point data. Performance studies show that substantial compression gains are possible for kernel codes typically used in performance-demanding software.
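
For context on that 1.3x figure, the minimal sketch below measures the compression factor a general-purpose byte-oriented compressor (zlib) achieves on floating-point data. It only illustrates how such factors are measured; it is not an implementation of Compressive Computing, and the synthetic data and settings are our own assumptions.

```python
# Measure the compression factor of a generic compressor on float data.
# Byte-oriented compressors typically do poorly on floating-point arrays,
# which is the baseline limitation the abstract refers to.
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "simulation-like" signal: a random walk, smoother than pure noise.
x = np.cumsum(rng.standard_normal(1_000_000)).astype(np.float64)

raw = x.tobytes()
packed = zlib.compress(raw, level=9)
print(f"compression factor: {len(raw) / len(packed):.2f}x")
```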

Novel data compression methods for reduced communications in parallel codes and their application to the acceleration of supercomputer software

Juan Gonzalez
ACCELOGIC TECHNICAL REPORT #14-0225
August, 2014

Abstract:
The communication bottleneck is pervasive in large-scale High Performance Computing (HPC) applications and stands out as one of the biggest challenges lying ahead for HPC. Computational fluid dynamics, geophysics, weather, biology, and computational chemistry are just a sampling of the wide spectrum of HPC applications that, on their path to exascale, are in strong need of new ways of tackling the communication barrier. A key aspect to consider when tackling the HPC communication bottleneck is that the computing capacity of modern processors is increasing much faster than their capacity to communicate with one another over a given network. Taking advantage of the “disposable flops” that become available while waiting for communication to complete, data compression has arisen naturally as a promising weapon for breaking the communication bottleneck. This paper introduces novel compression/decompression methods for accelerating communication in parallel software. The application of the proposed solution is illustrated in the implementation of a large-scale distributed eigensolver.
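
The trade-off the abstract describes can be captured in a back-of-envelope model: if a link moves B bytes/s, a codec sustains S bytes/s, and compression shrinks messages by a factor c, then compression wins whenever S > cB/(c-1). The model and the numbers below are our own illustrative assumptions, not taken from the report.

```python
# When does compressing a message before sending it reduce wall-clock time?
# B: link bandwidth (bytes/s), c: compression factor,
# S: combined compress+decompress throughput (bytes/s).

def effective_bandwidth(B: float, c: float, S: float) -> float:
    """Bytes/s of payload actually achieved when every message is compressed."""
    # Per byte of payload: 1/S seconds of codec work + 1/(c*B) on the wire.
    return 1.0 / (1.0 / S + 1.0 / (c * B))

B = 12.5e9   # ~100 Gb/s link (assumed)
c = 1.8      # assumed compression factor
S = 60e9     # assumed codec throughput

print(f"raw link:  {B / 1e9:.1f} GB/s")
print(f"effective: {effective_bandwidth(B, c, S) / 1e9:.1f} GB/s")
# Compression pays off whenever S > c*B/(c-1):
print(f"break-even codec speed: {c * B / (c - 1) / 1e9:.1f} GB/s")
```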

The Role of Specialized Processors in the Future of HPC – A Preliminary Study on the Implementation and Exploitation of Specialized Linear Algebra Processors (SLAPs)

Juan Gonzalez, Santiago Fonseca, and Rafael Nunez
ACCELOGIC TECHNICAL REPORT #11-0108
October, 2011

Released: May, 2015

Abstract:

The advent of modern heterogeneous processing technologies (e.g., multicore CPUs, GPUs, FPGAs) poses an opportunity for significantly increasing the performance of supercomputer codes. These technologies enable massive computational parallelism as well as revolutionary approaches to memory-access speedup that could potentially reduce the computation times of complete applications by factors on the order of thousands. Exploiting these technologies in HPC environments is not trivial – it requires appropriately balancing processing and data-transfer performance, as well as providing programming tools able to keep development costs comparable to those of conventional software environments. We introduce the concept of the Specialized Linear Algebra Processor (SLAP): a low-cost processor with the ability to compute certain specialized numerical kernels at speeds comparable to (or higher than) those achieved by general-purpose supercomputers. SLAPs can be placed strategically inside a supercomputer network to enhance its performance for certain applications, with relatively low investment in additional hardware or development costs. For example, codes in the FPGA-accelerated LAPACKrc™ library could be used within a supercomputer enhanced with one or a handful of commercial off-the-shelf FPGA systems to inject substantial acceleration into certain applications. To illustrate the potential of SLAPs, we present a demonstration of the NAS CG benchmark that makes use of the Krylov solver component of LAPACKrc™ in a state-of-the-art FPGA system. Results indicate that for certain NAS benchmarks, a single SLAP could outperform an entire supercomputer, which emphasizes the importance of specialized processors in the future of HPC.
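
For readers unfamiliar with the benchmark, the kernel at the heart of NAS CG is the conjugate gradient method. The numpy sketch below shows that Krylov kernel in its simplest form, i.e., the computation a SLAP would offload; it is not the LAPACKrc™ implementation, and the small SPD test matrix is our own stand-in.

```python
# Minimal conjugate gradient solver for a symmetric positive-definite system.
import numpy as np

def conjugate_gradient(A, b, tol=1e-8, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A."""
    x = np.zeros_like(b)
    r = b - A @ x          # residual
    p = r.copy()           # search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p         # the dominant cost: one matrix-vector product per iteration
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Tiny SPD test problem (arbitrary, for illustration only).
M = np.random.default_rng(1).standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)
b = np.ones(50)
x = conjugate_gradient(A, b)
print(np.linalg.norm(A @ x - b))  # residual norm, ~1e-8 or below
```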

The Art of FPGA Algorithm Design – The Case for the Extreme Acceleration of Linear-Algebra-Intensive Software

Juan Gonzalez and Rafael Nunez
ACCELOGIC TECHNICAL REPORT #10-0048
November, 2010

Released: September, 2014

Abstract:

The developments seen in the field of reconfigurable computing during the last ten years bring an unprecedented opportunity for the acceleration of supercomputing applications in computational fluid dynamics (CFD). Reconfigurable computing algorithms implemented in Field-Programmable Gate Arrays (FPGAs) have proven to be 10x to 1,000x faster than traditional CPU-based solutions.
It is estimated that over seventy percent of supercomputer CPU cycles worldwide are spent solving large-scale linear equations. Accelogic is developing unique algorithmic innovations that will enable a single FPGA chip to surpass the performance of a 4,096-CPU cluster for solving large-scale linear systems. This technology has the potential to reduce both cost and power consumption by one to two orders of magnitude, while maintaining code portability and ease of use for FORTRAN and C environments.
We present our recent results in this direction, including insights into why and how things can go wrong when designing FPGA supercomputing kernels, and into why the common-wisdom approach of “porting” or “translating” algorithms onto the FPGA has not delivered the promised levels of performance for CFD. We discuss the critical success factors behind Accelogic’s latest 60x-speedup narrowband linear solver – the fastest FPGA linear solver at the time of writing.

Achieving Maximal Concurrency in Heterogeneous HPC Systems – A Case Study on the Efficient Parallelization of the FFT Algorithm in Multi-CPU/GPU Systems

Juan Gonzalez, Santiago Fonseca, and Rafael Nunez
ACCELOGIC TECHNICAL REPORT #11-0023
June, 2011

Abstract:

Recent advances in computing systems are providing a variety of processing engines. While CPUs have been the preferred processing unit during the last decade, advances in specialized processors (e.g., GPUs) present an alternative for massively parallel in-chip processing. Instead of individually optimizing code for each type of computing architecture, the goal should be to build codes for the supercomputer of the future, i.e., for a computing architecture made of many thousands of CPUs and specialized computing cores working concurrently. Our work is focused on generating the much-needed heterogeneous-ready software with the ability to scale to thousands of specialized cores. In this paper we introduce a general framework for algorithm partitioning, and apply it to the FFT method.
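
To make the partitioning idea concrete, the sketch below uses the textbook four-step (Cooley-Tukey) factorization, which splits one length-N FFT into two batches of smaller, mutually independent FFTs; each batch can be farmed out across CPUs and GPUs. This is the generic decomposition, written by us as an illustration, not the framework introduced in the paper.

```python
# Four-step FFT: one length-N transform becomes N2 independent length-N1
# FFTs plus N1 independent length-N2 FFTs, with a twiddle step in between.
import numpy as np

def four_step_fft(x, N1, N2):
    """FFT of length N = N1*N2 via two batches of smaller FFTs."""
    A = x.reshape(N1, N2)
    # Step 1: N2 independent FFTs of length N1 (parallel across columns).
    B = np.fft.fft(A, axis=0)
    # Step 2: pointwise twiddle-factor multiplication.
    k1 = np.arange(N1)[:, None]
    n2 = np.arange(N2)[None, :]
    B = B * np.exp(-2j * np.pi * k1 * n2 / (N1 * N2))
    # Step 3: N1 independent FFTs of length N2 (parallel across rows).
    D = np.fft.fft(B, axis=1)
    # Step 4: index reshuffle: X[k1 + N1*k2] = D[k1, k2].
    return D.flatten(order="F")

x = np.random.default_rng(2).standard_normal(1024).astype(complex)
print(np.allclose(four_step_fft(x, 32, 32), np.fft.fft(x)))  # True
```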

Extreme-Speed Scalable Direct Sparse Solvers for Heterogeneous Supercomputing – An Enhancement to the LAPACKrc Library

Juan Gonzalez and Rafael Nunez
SciDAC Conference 2010, Chattanooga, Tennessee
September, 2010

Abstract:

At the core of DOE’s scientific priorities is a broad need for more computing power and supercomputing infrastructure. An estimated 70% of computing cycles spent globally in the HPC ecosystem are used to solve large-scale linear algebra problems. Linear algebra is at the core of, and constitutes the primary bottleneck in, important DOE research problems in fusion energy, nuclear accelerator modeling, circuit simulation, and weather modeling, among other research challenges. As part of an ambitious research program co-funded by DOE, NASA, and the Department of Defense, Accelogic is spearheading the development of LAPACKrc, a groundbreaking family of FPGA-based linear algebra solvers able to achieve speedups larger than 100x with a single chip (“rc” stands for “reconfigurable computing,” a radically new computing paradigm that is changing the way high-performance computing is done today). Recent efforts in the LAPACKrc research program have focused on producing FPGA-based direct sparse solvers, a key functionality still missing from the current LAPACKrc solver suite. Our latest direct sparse solver prototype demonstrates a speedup of up to 125x (compared against state-of-the-art CPU direct sparse solvers), which provides support for broad-based science and engineering breakthroughs.
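
For context, a direct sparse solver factors the matrix once and then reuses cheap triangular solves for any number of right-hand sides. The sketch below shows that workflow with SciPy's SuperLU interface, standing in for the kind of CPU solver the 125x figure is measured against; the 1-D Poisson test matrix is our own choice, not one of the paper's benchmarks.

```python
# "Direct sparse solve" made concrete: factor once, solve cheaply.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 10_000
# 1-D Poisson (tridiagonal) matrix: a classic sparse test problem.
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

lu = spla.splu(A)   # direct factorization (the expensive step)
x = lu.solve(b)     # fast triangular solves, reusable for many right-hand sides
print(np.linalg.norm(A @ x - b))  # residual norm, near machine precision
```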

LAPACKrc: Fast linear algebra kernels/solvers for FPGA accelerators

Juan Gonzalez and Rafael Nunez
2009 Journal of Physics: Conf. Ser. 180 012042
May, 2009

Abstract:

We present LAPACKrc, a family of FPGA-based linear algebra solvers able to achieve more than 100x speedup per commodity processor on certain problems. LAPACKrc subsumes some of the LAPACK and ScaLAPACK functionalities, and it also incorporates sparse direct and iterative matrix solvers. Current LAPACKrc prototypes demonstrate speedups between 40x and 150x compared against top-of-the-line hardware/software systems. A technology roadmap is in place to validate the current performance of LAPACKrc in HPC applications, and to increase its computational throughput by factors of hundreds within the next few years.

Fast and Accurate Computation of the Myriad Filter via Branch-and-Bound Search

Rafael Nunez, Juan Gonzalez, Gonzalo Arce, and John Nolan
IEEE Transactions on Signal Processing, Vol. 56, No. 7, July 2008
May, 2008

Abstract:

The myriad filter has been demonstrated to be a robust countermeasure against the negative effect that impulsive noise has on electronic systems. However, its use is still limited in systems where processing speed is critical, as in radar, sonar, and real-time audio and video processing. This limitation has its roots in the challenges imposed by the numerical approximation of the myriad filter. In particular, minimization operations in the interior of nonlinear operations are sensitive components that have a direct impact on the performance of the filtering algorithms. In the case of the myriad filter, the minimization of functions with multiple local minima is a common operation, and poorly chosen algorithms compromise the good behavior of the filter. In this correspondence, we present an alternative for the minimization of the objective function in the computation of the myriad filter. This solution exploits general concepts in global optimization and adapts them to the particular case of myriad filtering. The technique improves both the accuracy and the speed of myriad filter computation, making the method feasible for many problems.
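
Concretely, the sample myriad with linearity parameter K is the global minimizer of the multi-modal cost sum_i log(K^2 + (x_i - beta)^2). The sketch below is a naive grid-plus-polish reference implementation, written by us to make the optimization problem and the filter's robustness concrete; the paper's branch-and-bound search replaces exactly this brute-force step, and the sample data are invented.

```python
# Naive computation of the sample myriad: locate the global minimum of a
# multi-modal objective by coarse grid search, then polish locally.
import numpy as np
from scipy.optimize import minimize_scalar

def myriad_cost(beta, x, K):
    return np.sum(np.log(K**2 + (x - beta) ** 2))

def sample_myriad(x, K, grid_points=2000):
    # The minimizer always lies in [min(x), max(x)].
    grid = np.linspace(x.min(), x.max(), grid_points)
    costs = [myriad_cost(b, x, K) for b in grid]
    b0 = grid[int(np.argmin(costs))]
    h = (x.max() - x.min()) / grid_points
    res = minimize_scalar(myriad_cost, bounds=(b0 - h, b0 + h),
                          args=(x, K), method="bounded")
    return res.x

# Impulsive sample: the myriad shrugs off the outlier, unlike the mean.
x = np.array([0.9, 1.1, 1.0, 0.95, 50.0])
print(sample_myriad(x, K=1.0))   # close to 1.0
print(x.mean())                  # dragged to ~10.8 by the outlier
```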

Reconfigurable Computing in Engineering Mechanics

D.E. Veley, Air Force Research Laboratory. Juan Gonzalez and Rafael Nunez, Accelogic, LLC
May, 2008

Abstract:

The clock speed of computer processors has reached its limit due to thermal constraints. Gone are the days when it was possible to speed up an algorithm simply by running it on a faster processor. New desktop computers are being built with multiple processors to increase speed. DARPA launched the High Productivity Computing Systems (HPCS) program (now in Phase III) to address the pitfalls of parallel computing. Under this DARPA effort, new languages are being developed to enable more efficient programming and more efficient use of resources on massively parallel computing systems. An alternative to this approach is the use of field-programmable gate arrays (FPGAs). FPGAs allow parallel processing and streaming of data on a single chip in a manner that promises to outstrip multithreaded processing. In this paper, we demonstrate the tremendous performance gains achievable today through the use of FPGAs, via a massively parallel large-scale linear equation solver specifically designed for high-speed finite element computations. The solver, developed by Accelogic and currently under evaluation at Wright-Patterson Air Force Base, is the first functional reconfigurable computing code to achieve speedups larger than 50x for large-scale matrix equations using a single FPGA (the baseline for comparison is LAPACK running on the fastest CPU available on the market at the time of writing). We conclude the paper with a brief overview of future directions in computing and some ideas on how to move forward in this new paradigm of parallel computing.

Large-Scale Numerical Solution of Partial Differential Equations with Reconfigurable Computing

Jose A. Camberos, U.S. Air Force Research Laboratory.
Juan Gonzalez and Rafael Nunez, Accelogic, LLC
39th AIAA Thermophysics Conference, Miami, FL
June, 2007

Abstract:

We describe the paradigm shift that algorithm developers must undertake in order to design efficient numerical solutions using Field-Programmable Gate Arrays (FPGAs) and reconfigurable computing in general. FPGAs and reconfigurable computing can bring a new dimension to the algorithm design process, with potentially huge gains in algorithm speed and significant reductions in cost and power consumption. We introduce three levels of parallelism that are possible under this emerging computing paradigm: fine-grained parallelism, coarse-grained parallelism, and algorithm-level parallelism. Furthermore, we illustrate through an example how these levels of parallelism work. Full exploitation of all three levels will be a fundamental ingredient in developing the next generation of partial differential equation solvers over the next decade.
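
As a rough CPU-side analogy (our own construction, not the example from the paper), the sketch below runs one Jacobi relaxation sweep first as a single vectorized expression, the software analogue of fine-grained parallelism, and then as independent domain blocks dispatched to a thread pool, the analogue of coarse-grained parallelism; algorithm-level parallelism appears as the choice of Jacobi itself, noted in the closing comment.

```python
# Two levels of parallelism on one kernel: a Jacobi sweep for 1-D Poisson.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def jacobi_sweep(u, f, h):
    # Fine-grained parallelism: the whole-array expression maps to
    # SIMD/pipelined arithmetic; on an FPGA, a deep custom pipeline.
    u_new = u.copy()
    u_new[1:-1] = 0.5 * (u[:-2] + u[2:] - h * h * f[1:-1])
    return u_new

def jacobi_sweep_blocked(u, f, h, n_blocks=4):
    # Coarse-grained parallelism: split the domain into independent blocks;
    # each block reads only its neighbors' boundary values from the old array.
    n = len(u)
    edges = np.linspace(1, n - 1, n_blocks + 1, dtype=int)
    u_new = u.copy()
    def do_block(lo, hi):
        u_new[lo:hi] = 0.5 * (u[lo-1:hi-1] + u[lo+1:hi+1] - h * h * f[lo:hi])
    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda e: do_block(*e), zip(edges[:-1], edges[1:])))
    return u_new

# Algorithm-level parallelism is a design choice rather than a code
# transformation: e.g., choosing Jacobi (all updates independent, as above)
# over Gauss-Seidel (sequential dependencies) trades convergence rate for
# concurrency.

n, h = 1025, 1.0 / 1024
u, f = np.zeros(n), np.ones(n)
print(np.allclose(jacobi_sweep(u, f, h), jacobi_sweep_blocked(u, f, h)))  # True
```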