MIT-Stanford project uses LLVM to break big data bottlenecks

The more cores you can use, the better, especially with big data. But the easier a big data framework is to work with, the harder it is for a pipeline built from several of them, such as TensorFlow plus Apache Spark, to run in parallel as a single unit.

Researchers from MIT CSAIL, the home of envelope-pushing big data acceleration projects like Milk and Tapir, have paired with the Stanford InfoLab to create a possible solution. Written in the Rust language, Weld generates code for an entire data analysis workflow that runs efficiently in parallel using the LLVM compiler framework.

The group describes Weld as a “common runtime for data analytics” that takes the disjointed pieces of a modern data processing stack and optimizes them in concert. Each individual piece runs fast, but “data movement across the [different] functions can dominate the execution time.”

In other words, the pipeline spends more time shuttling data between its pieces than actually doing work on it. Weld provides a runtime that each library can plug into, offering a common way to run and optimize the data-parallel parts of the pipeline as a whole.
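To make that overhead concrete, here is a minimal Python sketch, illustrative only and not Weld code, contrasting an eager pipeline, where every library call materializes an intermediate result, with the fused single pass a common runtime aims to generate:

```python
import numpy as np

# Eager, library-by-library evaluation: each step materializes a full
# intermediate array, so memory traffic between steps dominates.
def eager_pipeline(a, b):
    t1 = a * b          # first intermediate, written out in full
    t2 = t1 + a         # second intermediate
    return t2.sum()     # a third pass over the data

# What a fusing runtime aims for instead: one pass, no intermediates.
# (Weld would JIT-compile, multithread, and vectorize this loop; plain
# Python only illustrates the shape of the computation.)
def fused_pipeline(a, b):
    total = 0.0
    for x, y in zip(a, b):
        total += x * y + x
    return total

a = np.random.rand(100_000)
b = np.random.rand(100_000)
assert np.isclose(eager_pipeline(a, b), fused_pipeline(a, b))
```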

Frameworks don’t generate code for the runtime themselves. Instead, they call Weld via an API that describes what kind of work is being done. Weld then uses LLVM to generate code that automatically includes optimizations like multithreading or Intel’s AVX2 processor extensions for high-speed vector math.
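As a rough sketch of what that hand-off looks like, the fragment below shows a small program written in the style of Weld’s declarative IR, embedded in Python as a string; the surrounding binding calls are hypothetical placeholders, not the actual Weld API:

```python
# A description of the work in the style of Weld's IR (per the Weld
# paper): a sum of squares over a vector. Weld lowers fragments like
# this to LLVM IR, fusing loops and applying multithreading and
# vectorization along the way.
program = """
|v: vec[f64]| result(for(v, merger[f64,+], |b, i, x| merge(b, x * x)))
"""

# Hypothetical binding calls (placeholders, not the real Weld API):
# module = weld_compile(program)   # IR -> LLVM -> native code
# answer = weld_run(module, data)  # execute over the caller's vector
```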

InfoLab put together preliminary benchmarks comparing the native versions of Spark SQL, NumPy, TensorFlow, and the Python data analysis library Pandas with their Weld-accelerated counterparts. The most dramatic speedups came in the NumPy-plus-Pandas benchmark, where performance improved “by up to two orders of magnitude” when the work was parallelized across 12 cores.

Those familiar with Pandas who want to take Weld for a spin can check out Grizzly, an integration of Weld with Pandas.
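A rough sketch of what Grizzly usage might look like is below; the import path, the DataFrameWeld wrapper, and the evaluate() call are assumptions about the API rather than confirmed signatures, so check the project’s repository for the real interface:

```python
import pandas as pd
import grizzly.grizzly as gr  # assumed import path; verify against the repo

raw = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Wrapping the DataFrame makes subsequent operations build a lazy Weld
# program instead of executing eagerly. DataFrameWeld and evaluate()
# are assumptions here, not confirmed API names.
df = gr.DataFrameWeld(raw)
total = (df["price"] * df["qty"]).sum().evaluate()
print(total)
```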

It’s not the pieces, it’s the pipeline

Weld’s approach comes out of what its creators believe is a fundamental problem with the current state of big data processing frameworks. The individual pieces aren’t slow; most of the bottlenecks arise from having to hook them together in the first place.

Building a new pipeline that’s integrated from the inside out isn’t the answer, either. People want to keep using existing libraries like Spark and TensorFlow; dumping them would mean abandoning the culture of software already built around those products.

Instead, Weld proposes making changes to the internals of those libraries, so they can work with the Weld runtime. Application code that, say, uses Spark wouldn’t have to change at all. Thus, the burden of the work would fall on the people best suited to making those changes — the library and framework maintainers — and not on those constructing apps from those pieces.
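The article doesn’t spell out the mechanics, but the integration pattern it describes can be sketched in a few lines: the library keeps its public API, while each call records work into an expression that a common runtime later fuses and executes. The class below is an illustrative mock of that pattern, not Weld’s actual interface:

```python
# A minimal sketch (assumed design, based on the article's description)
# of lazy, runtime-fused evaluation behind an unchanged library API.

class LazyVector:
    def __init__(self, data, ops=()):
        self.data = data
        self.ops = ops                  # recorded operations, not yet run

    def map(self, fn):
        # Record the operation; in Weld, this would append an IR fragment.
        return LazyVector(self.data, self.ops + (fn,))

    def evaluate(self):
        # One fused pass over the data. Weld would JIT this via LLVM with
        # multithreading and vectorization instead of interpreting it.
        out = []
        for x in self.data:
            for fn in self.ops:
                x = fn(x)
            out.append(x)
        return out

v = LazyVector([1, 2, 3]).map(lambda x: x * x).map(lambda x: x + 1)
print(v.evaluate())  # [2, 5, 10]
```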

Weld also shows that LLVM is a go-to technology for systems that generate code on demand for specific applications, instead of forcing developers to hand-roll custom optimizations. MIT’s previous project, Tapir, used a modified version of LLVM to automatically generate code that can run in parallel across multiple cores.

Another cutting-edge aspect of Weld: it was written in Rust, Mozilla’s language for fast, safe software development. Despite its relative youth, Rust has an active and growing community of professional developers frustrated with having to compromise safety for speed or vice versa. There’s been talk of rewriting existing applications in Rust, but it’s tough to fight that inertia. Greenfield efforts like Weld, with no existing dependencies, are likely to become the standard-bearers for the language as it matures.

Source: InfoWorld Big Data