Nvidia's new TensorRT speeds machine learning predictions

Nvidia has released a new version of TensorRT, a runtime system for serving inferences using deep learning models through Nvidia’s own GPUs.

Inferences, or predictions made from a trained model, can be served from either CPUs or GPUs. Serving inferences from GPUs is part of Nvidia’s strategy to get greater adoption of its processors, countering what AMD is doing to break Nvidia’s stranglehold on the machine learning GPU market.

Nvidia claims the GPU-based TensorRT beats CPU-only approaches across the board for inferencing. In one of Nvidia’s proffered benchmarks, the AlexNet image classification test under the Caffe framework, TensorRT running on Nvidia’s Tesla P40 GPU was 42 times faster than a CPU-only version of the same test: 16,041 images per second vs. 374. (Always take industry benchmarks with a grain of salt.)

Serving predictions from a GPU is also more power-efficient and delivers results with lower latency, Nvidia claims.

TensorRT doesn’t work with anything other than Nvidia’s own GPU lineup, and is a proprietary, closed-source offering. AMD, by contrast, has been promising a more open-ended approach to how its GPUs can be used for machine learning applications, by way of the ROCm open source hardware-independent library for accelerating machine learning.

Source: InfoWorld Big Data

MapD SQL database gains enterprise-level scale-out, high availability

MapD, the SQL database and analytics platform that uses GPU acceleration for performance orders of magnitude ahead of CPU-based solutions, has been updated to version 3.0.

The update provides a mix of high-end and mundane additions. The high-end goodies consist of deep architectural changes that enable even greater performance gains in clustered environments. But the mundane things are no less important, as they’re aimed at making life easier for enterprise database developers—the audience most likely to use MapD.

Previous versions of MapD (not to be confused with Hadoop/Spark vendor MapR) were able to scale vertically but not horizontally. Users could add more GPUs to a given box, but they couldn’t scale MapD across multiple GPU-equipped servers. An online demo shows version 3 allowing users to explore in real time an 11-billion-row database of ship movements across the continental U.S. using MapD’s web-based graphical dashboard app.


A live demo of MapD 3.0 running on multiple nodes. An 11-billion-row database of ship movements throughout the continental U.S. can be explored and manipulated in real time, with both the graphical explorer and standard SQL commands.

Version 3 adds a native shared-nothing distributed architecture to the database—a natural extension of the existing shared-nothing architecture MapD used to split processing across GPUs. Data is automatically sharded in round-robin fashion between physical nodes. MapD founder Todd Mostak noted in a phone call that it ought to be possible in the future to manually adjust sharding based on a given database key.
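The round-robin placement is simple to picture. Here is a minimal Python sketch of the concept (an illustration only, not MapD's implementation; the key-based variant is the future manual option Mostak describes):

```python
# Minimal sketch of sharding rows across physical nodes.
# Round-robin is what MapD 3.0 does automatically; key-based sharding
# is the manual option described as a future possibility.

def shard_round_robin(rows, num_nodes):
    """Assign each row to a node in strict rotation."""
    shards = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        shards[i % num_nodes].append(row)
    return shards

def shard_by_key(rows, num_nodes, key):
    """Place rows with the same key value on the same node."""
    shards = [[] for _ in range(num_nodes)]
    for row in rows:
        shards[hash(row[key]) % num_nodes].append(row)
    return shards

shards = shard_round_robin(list(range(10)), 3)
# Node 0 gets rows 0, 3, 6, 9; node 1 gets 1, 4, 7; node 2 gets 2, 5, 8.
```

Because every node receives an even slice of the data, both query processing and ingest scale roughly linearly as nodes are added.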

The big advantage to using multiple shared-nothing nodes, according to Mostak, isn’t just a linear speed-up in processing—although that does happen. It also means a linear speed-up for ingesting data into the cluster, which is useful in lowering the barrier to entry for database developers who want to try their data out on MapD.

Other features in version 3.0, chief among them high availability, are what you’d expect from a database aimed at enterprise customers. Nodes can be clustered into HA groups, with data synchronized between them by way of a distributed file system (typically GlusterFS) and a distributed log (by way of an Apache Kafka record stream or “topic”).

Another addition aimed at attracting a general database audience is a native ODBC driver. Third-party tools such as Tableau or Qlik Sense can now plug into MapD without the overhead of the previous JDBC-to-ODBC solution.

A hybrid architecture is one thing that’s not yet possible with MapD’s scale-out system. MapD does have cloud instances available in Amazon Web Services, IBM SoftLayer, and Google Cloud, but Mostak pointed out that MapD doesn’t currently support a scenario where nodes in an on-prem installation of MapD can be mixed with nodes from a cloud instance.

Most of MapD’s customers, he explained, have “either-or” setups—either entirely on-prem or entirely in-cloud—with little to no demand to mix the two. At least, not yet.

Source: InfoWorld Big Data

LLVM-powered Pocl puts parallel processing on multiple hardware platforms

LLVM, the open source compiler framework that powers everything from Mozilla’s Rust language to Apple’s Swift, is emerging in yet another powerful role: an enabler of code deployment systems that target multiple classes of hardware for speeding up jobs like machine learning.

To write code that can run on CPUs, GPUs, ASICs, and FPGAs alike—something hugely useful with machine learning apps—it’s best to use something like OpenCL, which allows a program to be written once and then automatically deployed across all those different types of hardware.

Pocl, an implementation of OpenCL that was recently revamped to version 0.14, is using the LLVM compiler framework to do that kind of targeting. With Pocl, OpenCL code can be automatically deployed to any hardware platform with LLVM back-end support.

Pocl uses LLVM’s own Clang front end to take in C code that uses the OpenCL standard. Version 0.14 works with both LLVM 3.9 and the recently released LLVM 4.0. It also offers a new binary format for OpenCL executables, so they can be run on hosts that don’t have a compiler available.

Aside from being able to target multiple processor architectures and hardware types automatically, another reason Pocl uses LLVM is that it aims to “[improve] performance portability of OpenCL programs with the kernel compiler and the task runtime, reducing the need for target-dependent manual optimizations,” according to the release note for version 0.14.

There are other projects that automatically generate OpenCL code tailored to multiple hardware targets. The Lift project, written in Java, is one such code generation system. Lift generates a specially tailored IL (intermediate language) that allows OpenCL abstractions to be readily mapped to the behavior of the target hardware. In fact, LLVM itself works this way: it generates an IL from source code, which is then compiled for a given hardware platform. Another such project, Futhark, generates GPU-specific code.
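The two-stage flow described above, lowering source code to an intermediate language and then translating the IL per target, can be sketched in miniature. The Python below is a toy illustration of the pattern, not how LLVM or Lift actually represent code:

```python
# Toy two-stage compilation: a source expression is first lowered to a
# flat, target-independent IL, and the IL is then executed (or compiled)
# with target-specific implementations of each operation.

def to_il(expr):
    """Lower a nested (op, a, b) tuple into a linear stack-machine IL."""
    if not isinstance(expr, tuple):
        return [("push", expr)]
    op, a, b = expr
    return to_il(a) + to_il(b) + [(op,)]

def run_on_target(il, op_impls):
    """'Back end': execute the IL with a target's op implementations."""
    stack = []
    for instr in il:
        if instr[0] == "push":
            stack.append(instr[1])
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(op_impls[instr[0]](a, b))
    return stack.pop()

il = to_il(("add", ("mul", 2, 3), 4))       # one IL, many possible targets
result = run_on_target(il, {"add": lambda a, b: a + b,
                            "mul": lambda a, b: a * b})
# 2 * 3 + 4 == 10
```

The point is that the front end (lowering) is written once, while each hardware target only has to supply its own back end for the shared IL.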

LLVM is also being used as a code-generating system for other aspects of machine learning. The Weld project generates LLVM-deployed code that is designed to speed up the various phases of a data analysis framework. Code spends less time shuttling data back and forth between components in the framework and more time doing actual data processing.

The development of new kinds of hardware targets is likely to continue driving the need for code generation systems that can target multiple hardware types. Google’s Tensor Processing Unit, for instance, is a custom ASIC devoted to speeding one particular phase of a machine learning job. If hardware types continue to proliferate and become more specialized, having code for them generated automatically will save time and labor.

Source: InfoWorld Big Data

MIT-Stanford project uses LLVM to break big data bottlenecks

The more cores you can use, the better — especially with big data. But the easier a big data framework is to work with, the harder it is for the resulting pipelines, such as TensorFlow plus Apache Spark, to run in parallel as a single unit.

Researchers from MIT CSAIL, the home of envelope-pushing big data acceleration projects like Milk and Tapir, have paired with the Stanford InfoLab to create a possible solution. Written in the Rust language, Weld generates code for an entire data analysis workflow that runs efficiently in parallel using the LLVM compiler framework.

The group describes Weld as a “common runtime for data analytics” that takes the disjointed pieces of a modern data processing stack and optimizes them in concert. Each individual piece runs fast, but “data movement across the [different] functions can dominate the execution time.”

In other words, the pipeline spends more time moving data back and forth between pieces than actually doing work on it. Weld creates a runtime that each library can plug into, providing a common method to run key data across the pipeline that needs parallelization and optimization.

Frameworks don’t generate code for the runtime themselves. Instead, they call Weld via an API that describes what kind of work is being done. Weld then uses LLVM to generate code that automatically includes optimizations like multithreading or Intel’s AVX2 processor extensions for high-speed vector math.
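That division of labor, where the framework describes the work and the runtime fuses and executes it, can be illustrated with a toy lazy pipeline in Python (purely conceptual; Weld's actual API and IR look nothing like this):

```python
# Conceptual sketch: instead of materializing an intermediate result for
# every library call (the data-movement problem Weld targets), each call
# merely records a description of the work. The recorded stages are then
# executed in a single fused pass over the data.

class LazyPipeline:
    def __init__(self, data):
        self.data = data
        self.stages = []          # recorded, not executed

    def map(self, fn):
        self.stages.append(("map", fn))
        return self

    def filter(self, pred):
        self.stages.append(("filter", pred))
        return self

    def run(self):
        """One pass over the data; no intermediate lists are built."""
        out = []
        for x in self.data:
            keep = True
            for kind, fn in self.stages:
                if kind == "map":
                    x = fn(x)
                elif kind == "filter" and not fn(x):
                    keep = False
                    break
            if keep:
                out.append(x)
        return out

result = (LazyPipeline(range(10))
          .map(lambda x: x * x)
          .filter(lambda x: x % 2 == 0)
          .run())
# Squares of 0..9, keeping only the even ones.
```

An eager version of the same pipeline would build a full list of squares first, then scan it again to filter; the fused pass touches each element exactly once, which is the kind of saving Weld generalizes across whole frameworks.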

InfoLab put together preliminary benchmarks comparing the native versions of Spark SQL, NumPy, TensorFlow, and the Python math-and-stats framework Pandas with their Weld-accelerated counterparts. The most dramatic speedups came with the NumPy-plus-Pandas benchmark, where the work was sped up “by up to two orders of magnitude” when parallelized across 12 cores.

Those familiar with Pandas who want to take Weld for a spin can check out Grizzly, a custom implementation of Weld with Pandas.

It’s not the pipeline, it’s the pieces

Weld’s approach comes out of what its creators believe is a fundamental problem with the current state of big data processing frameworks. The individual pieces aren’t slow; most of the bottlenecks arise from having to hook them together in the first place.

Building a new pipeline integrated from the inside out isn’t the answer, either. People want to use existing libraries like Spark and TensorFlow; dumping them would mean abandoning the culture of software already built around those products.

Instead, Weld proposes making changes to the internals of those libraries, so they can work with the Weld runtime. Application code that, say, uses Spark wouldn’t have to change at all. Thus, the burden of the work would fall on the people best suited to making those changes — the library and framework maintainers — and not on those constructing apps from those pieces.

Weld also shows that LLVM is a go-to technology for systems that generate code on demand for specific applications, instead of forcing developers to hand-roll custom optimizations. MIT’s previous project, Tapir, used a modified version of LLVM to automatically generate code that can run in parallel across multiple cores.

Another cutting-edge aspect to Weld: it was written in Rust, Mozilla’s language for fast, safe software development. Despite its relative youth, Rust has an active and growing community of professional developers frustrated with having to compromise safety for speed or vice versa. There’s been talk of rewriting existing applications in Rust, but it’s tough to fight the inertia. Greenfield efforts like Weld, with no existing dependencies, are likely to become the standard-bearers for the language as it matures.

Source: InfoWorld Big Data

Mesosphere DC/OS brings elastic scale to Redis, Couchbase

Mesosphere DC/OS, the datacenter automation solution built atop the Apache Mesos orchestration system to provide one-click management for complex applications, has now hit its 1.9 revision.

With this release, Mesosphere is once again emphasizing DC/OS as a solution for deploying and maintaining large, complex data-centric applications. Version 1.9 adds out-of-the-box support for several major data services and a passel of improvements for DC/OS’s existing container support.

Everyone into the pool!

DC/OS manages a datacenter’s worth of Linux machines as if they were a single pooled resource maintained by high-level commands from a CLI and GUI. Apps like Apache Cassandra, Kafka, Spark, and HDFS — many of them not known for being easy to manage — can be deployed with a few command-line actions and scaled up or down on demand or automatically.

Among the new additions are two major stars of the modern open source data stack: The database/in-memory caching store Redis and the NoSQL database solution Couchbase. Redis in particular has become a valuable component for big data applications as an accelerator for Apache Spark, so being able to accelerate other DC/OS apps with it is a boon.

Version 1.9 also adds support for Elastic; DataStax Enterprise, the commercial offering based on the Apache Cassandra NoSQL system; and Alluxio, a data storage acceleration layer specifically designed for big data systems like Spark.

Managing applications like these through DC/OS improves utilization of a given cluster. Each application supported in DC/OS has its own scheduling system, so apps with complementary behaviors can be packed together more efficiently and automatically migrated between nodes as needed. DC/OS also ensures apps that upgrade frequently (like scrappy new big data frameworks) can be rolled out across a cluster without incurring downtime.

There’s barely a data application these days that isn’t tied into machine learning in some form. Given that Mesosphere was already promoting DC/OS for data-centric apps, it only makes sense the company is also pushing DC/OS as a management solution for machine learning apps built on its supported solutions. This claim has some validity with GPU resources, as DC/OS can manage GPU as simply another resource to be pooled for application use.

Container conscious

Because DC/OS also manages containers with Google’s Kubernetes project, it’s been described as a container solution, but only in the sense that containers are one of many kinds of resources DC/OS manages.

Containers have long been criticized for being opaque. Prometheus, now a Cloud Native Computing Foundation project, was originally developed by SoundCloud for getting insight into running containers, and DC/OS 1.9 supports Prometheus along with Splunk, ELK, and Datadog as targets for managing the logs and metrics it collects from containers.

Version 1.9 also introduces a feature called container process injection. With it, says the company, developers can “remotely run commands in any container in the same namespace as the task being investigated.” Containers are not only opaque by nature but also ephemeral, so being able to connect to them and debug them directly while they’re still running will be useful.

Source: InfoWorld Big Data

3 Kaggle alternatives for collaborative data science

What’s the best way to get a good answer to a tough question? Ask a bunch of people, and make a competition out of it. That’s long been Kaggle’s approach to data science: Turn tough missions, like making lung cancer detection more accurate, into bounty-paying competitions, where the best teams and the best algorithms win.

Now Kaggle is rolling into Google, and while all signs point to it being kept as-is for now, there will be jitters about the long-term prospects for a site with such a devoted community and an idiosyncratic approach.

Here are three other sites that share a similar mission, if not explicitly followed in Kaggle’s footsteps. (Note that some sites, like CrowdAnalytix, may consider accepted solutions in contests as works for hire and thus their property.)


A product of the École Polytechnique Fédérale de Lausanne in Switzerland, CrowdAI is an open source platform for hosting open data challenges and gaining insight into how the problems in question were solved. The platform is quite new, with only six challenges offered so far, but the tutorials derived from those challenges are detailed and valuable, providing step-by-step methodologies to reproduce that work or create something similar. The existing exercises cover common frameworks like Torch or TensorFlow, so it’s a good place to acquire hands-on details for using them.


DrivenData, created by a consultancy that deals in professional data problems, hosts online challenges lasting a few months. Each is focused specifically on pressing problems facing the world at large, like predicting the spread of diseases or mining Yelp data to improve restaurant inspection processes. Like Kaggle, DrivenData also has a data science jobs listing board — a feature people are worried might go missing from Kaggle post-acquisition.


Backed by investors from Accel Partners and SAIF Partners, CrowdAnalytix focuses on hosting data-driven problem-solving competitions, rather than sharing the information that results from them. Contests are offered for finding solutions to problems in categories like modeling, visualization, and research, and each has bounties in the thousands of dollars. Some previous challenges include predicting the real costs of workers’ compensation claims or airline delays. Other contests, though, aren’t hosted for money but as a competitive way to learn a related discipline, such as the R language.

Source: InfoWorld Big Data

Facebook's new machine learning framework emphasizes efficiency over accuracy

In machine learning parlance, clustering or similarity search looks for affinities in sets of data that normally don’t make such a job easy. If you wanted to compare 100 million images against each other and find the ones that looked most like each other, that’s a clustering job. The hard part is scaling well across multiple processors, where you’d get the biggest speedup.

Facebook’s AI research division (FAIR) recently unveiled, with little fanfare, a proposed solution called Faiss. It’s an open source library, written in C++ and with bindings for Python, that allows massive data sets like still images or videos to be searched efficiently for similarities.

It’s also one of a growing class of machine learning solutions that’s exploring better methods of making algorithms operate in parallel across multiple GPUs for speed that’s only available at scale.

A magnet for the needle in the haystack

FAIR described the project and its goals in a paper published at the end of last February. The problem wasn’t only how to run similarity searches, or “k-selection” algorithms, on GPUs, but how to run them effectively in parallel across multiple GPUs, and how to deal with data sets that don’t fit into RAM (such as terabytes of video).

Faiss’ trick is not to search the data itself, but a compressed representation that trades a slight amount of accuracy for an order of magnitude or more of storage efficiency. Think of an MP3: Though MP3 is a “lossy” compression format, it sounds good enough for most ears. In the same manner, Faiss uses an encoding called PQ (product quantization) that can be split efficiently across multiple GPUs.
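Product quantization itself is easy to demonstrate in miniature: split each vector into subvectors and store, for each one, only the index of its nearest codebook centroid. The Python sketch below uses a tiny fixed codebook for brevity, whereas Faiss learns its codebooks (typically with k-means) and operates at vastly larger scale:

```python
# Toy product quantization (PQ): compress a vector into one small
# centroid index per fixed-size subvector. The codebook here is made up
# for illustration; real systems learn it from the data.

def nearest(sub, centroids):
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: dist(sub, centroids[i]))

def pq_encode(vec, centroids, block):
    """Compress `vec` into one centroid index per `block`-sized subvector."""
    return [nearest(vec[i:i + block], centroids)
            for i in range(0, len(vec), block)]

def pq_decode(codes, centroids):
    """Reconstruct an approximate vector from the codes."""
    out = []
    for c in codes:
        out.extend(centroids[c])
    return out

centroids = [(0.0, 0.0), (1.0, 1.0)]          # shared codebook, block size 2
codes = pq_encode([0.9, 1.1, 0.1, -0.2], centroids, block=2)
approx = pq_decode(codes, centroids)
# Four floats are stored as two small integers; the reconstruction is
# close to, but not exactly, the original -- the MP3-style tradeoff.
```

Searching then compares query vectors against these compact codes rather than the raw data, which is what lets the working set fit in GPU memory and be split across devices.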

One example search shown in the paper involves searching the Yahoo Flickr Creative Commons 100 Million data set, a library of 100 million images. Faiss was fed two images — one of a red flower, and one of a yellow flower — and instructed to find a chain of similar images between them. Searching all 100 million images for such similarities took 35 minutes on a set of four Nvidia Titan X GPUs.

FAIR claims Faiss is “8.5× faster than prior GPU state of the art” and provided some benchmarks to support its claim. When compared against two previous GPU k-selection algorithms, FAIR claimed, the Faiss algorithm was not only faster, but came a good deal closer to maximizing the available memory bandwidth for the GPU.

Another advantage with Faiss, said FAIR, was the total end-to-end time for the search — the time needed to construct the PQ version of the data, plus the time needed to actually run the search. Competing solutions took days on end simply to build PQ graph data for one test; with Faiss, a “high-quality” graph can be built in “about half a day.”

Pick up the pace

FAIR’s strategy of slightly sacrificing accuracy is one of a variety of speedup tactics used by the latest generation of machine learning technologies.

Many of these speedups don’t simply goose the performance of high-end hardware like Nvidia Titan boards, but also empower lower-end hardware, like the GPUs in smartphones. Google’s deep learning system TensorFlow was recently upgraded to allow smartphone-grade GPUs to perform image-recognition work.

Another likely long-term advantage of algorithms that can efficiently trade accuracy for speed is to divide labor between a local device (fast, but not as accurate) and a remote back end (more accurate, but requires more processing power). Classifications made by a local device could be used as-is or augmented with more horsepower on the back end if there’s a network connection.

The biggest takeaway with Faiss: There’s still plenty of work to be done in figuring out how machine learning of all stripes can further benefit from massively parallel hardware.

Source: InfoWorld Big Data

5 Python libraries to lighten your machine learning load

Machine learning is exciting, but the work is complex and difficult. It typically involves a lot of manual lifting — assembling workflows and pipelines, setting up data sources, and shunting back and forth between on-prem and cloud-deployed resources.

The more tools you have in your belt to ease that job, the better. Thankfully, Python is a giant tool belt of a language that’s widely used in big data and machine learning. Here are five Python libraries that help relieve the heavy lifting for those trades.


A simple package with a powerful premise, PyWren lets you run Python-based scientific computing workloads as multiple instances of AWS Lambda functions. A profile of the project at The New Stack describes PyWren using AWS Lambda as a giant parallel processing system, tackling projects that can be sliced and diced into little tasks that don’t need a lot of memory or storage to run.

One downside is that Lambda functions can’t run for more than 300 seconds. But if you have a job that takes only a few minutes to complete and needs to run thousands of times across a data set, PyWren may be a good option for parallelizing that work in the cloud at a scale unavailable on user hardware.
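The pattern PyWren exposes is essentially a plain `map` whose invocations fan out in parallel. The stdlib sketch below stands in a thread pool for Lambda invocations so it is self-contained; the commented line shows the roughly analogous shape of a PyWren call:

```python
from concurrent.futures import ThreadPoolExecutor

# PyWren's model: slice a workload into many small, short-lived tasks
# and run them in parallel. PyWren dispatches each call to an AWS Lambda
# invocation; a local thread pool stands in here for illustration.

def simulate(params):
    """A stand-in for a small scientific computation (well under 300 s)."""
    return sum(p * p for p in params)

jobs = [[i, i + 1, i + 2] for i in range(1000)]

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(simulate, jobs))

# With PyWren the call is analogous in shape:
#   futures = pywren.default_executor().map(simulate, jobs)
```

The appeal is that nothing about `simulate` changes: the same function that runs locally is shipped out to thousands of cloud workers.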


Google’s TensorFlow framework is taking off big-time now that it’s at a full 1.0 release. One common question about it: How can I make use of the models I train in TensorFlow without using TensorFlow itself?

Tfdeploy is a partial answer to that question. It exports a trained TensorFlow model to “a simple NumPy-based callable,” meaning the model can be used in Python with Tfdeploy and the NumPy math-and-stats library as the only dependencies. Most of the operations you can perform in TensorFlow can also be performed in Tfdeploy, and you can extend the behaviors of the library by way of standard Python metaphors (such as overloading a class).
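The payoff is that inference reduces to ordinary array math. Here is a toy illustration of what an exported model boils down to, in pure Python with made-up weights (Tfdeploy itself emits NumPy operations from a real trained graph):

```python
import math

# Toy version of the "exported model as a plain callable" idea: once
# training is done, applying the model is just arithmetic, with no
# TensorFlow runtime required. These weights are invented for the demo.

WEIGHTS = [0.5, -0.25, 0.1]
BIAS = 0.2

def predict(features):
    """A one-layer model reduced to a dot product plus a sigmoid."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features))
    return 1.0 / (1.0 + math.exp(-z))

score = predict([1.0, 2.0, 3.0])   # a single probability-like output
```

Because the callable depends on nothing but basic math, it can be dropped into any Python service without installing or loading TensorFlow.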

Now the bad news: Tfdeploy doesn’t support GPU acceleration, if only because NumPy doesn’t do that. Tfdeploy’s creator suggests using the gNumPy project as a possible replacement.


Writing batch jobs is generally only one part of processing heaps of data; you also have to string all the jobs together into something resembling a workflow or a pipeline. Luigi, created by Spotify and named for the other plucky plumber made famous by Nintendo, was built to “address all the plumbing typically associated with long-running batch processes.”

With Luigi, a developer can take several different unrelated data processing tasks — “a Hive query, a Hadoop job in Java, a Spark job in Scala, dumping a table from a database” — and create a workflow that runs them, end to end. The entire description of a job and its dependencies is created as Python modules, not as XML config files or another data format, so it can be integrated into other Python-centric projects.
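The resulting pipeline shape, tasks that name their dependencies and run once each in order, can be sketched with plain Python. (Luigi's real tasks subclass `luigi.Task` and implement `requires()`, `output()`, and `run()`; this stand-in only shows the ordering logic.)

```python
# Simplified sketch of a Luigi-style workflow: each task names its
# dependencies, and the runner executes everything in dependency order,
# running each task exactly once.

order = []   # records execution order

TASKS = {
    "dump_table": [],
    "hive_query": ["dump_table"],
    "spark_job":  ["hive_query", "dump_table"],
}

def build(name):
    """Run `name` and (recursively) everything it requires, once each."""
    if name in order:
        return
    for dep in TASKS[name]:
        build(dep)
    order.append(name)   # where the task's real work would happen

build("spark_job")
# Dependencies run first: dump_table, then hive_query, then spark_job.
```

Because the whole graph is ordinary Python, a workflow can be versioned, tested, and imported like any other module.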


If you’re adopting Kubernetes as an orchestration system for machine learning jobs, the last thing you want is for the mere act of using Kubernetes to create more problems than it solves. Kubelib provides a set of Pythonic interfaces to Kubernetes, originally to aid with Jenkins scripting. But it can be used without Jenkins as well, and it can do everything exposed through the kubectl CLI or the Kubernetes API.


Let’s not forget about this recent and high-profile addition to the Python world, an implementation of the Torch machine learning framework. PyTorch doesn’t just port Torch to Python; it adds many other conveniences, such as GPU acceleration and a library that allows multiprocessing to be done with shared memory (for partitioning jobs across multiple cores). Best of all, it can provide GPU-powered replacements for some of the unaccelerated functions in NumPy.

Source: InfoWorld Big Data

IBM sets up a machine learning pipeline for z/OS

If you’re intrigued by IBM’s Watson AI as a service, but reluctant to trust IBM with your data, Big Blue has a compromise. It’s packaging Watson’s core machine learning technology as an end-to-end solution available behind your firewall.

Now the bad news: It’ll only be available to z System / z/OS mainframe users … for now.

From start to finish

IBM Machine Learning for z/OS isn’t a single machine learning framework. It’s a collection of popular frameworks — in particular Apache SparkML, TensorFlow, and H2O — packaged with bindings to common languages used in the trade (Python, Java, Scala), and with support for “any transactional data type.” IBM is pushing it as a pipeline for building, managing, and running machine learning models through visual tools for each step of the process and RESTful APIs for deployment and management.

There’s a real need for this kind of convenience. Even as the number of frameworks for machine learning mushrooms, developers still have to perform a lot of heavy labor to create end-to-end production pipelines for training and working with models. This is why Baidu outfitted its PaddlePaddle deep learning framework with support for Kubernetes; in time the arrangement could serve as the underpinning for a complete solution that would cover every phase of machine learning.

Other components in IBM Machine Learning fit into this overall picture. The Cognitive Automation for Data Scientists element “assists data scientists in choosing the right algorithm for the data by scoring their data against the available algorithms and providing the best match for their needs,” checking metrics like performance and fitness to task for a given algorithm and workload.

Another function “schedule[s] continuous re-evaluations on new data to monitor model accuracy over time and be alerted when performance deteriorates.” Models trained on data, rather than algorithms themselves, are truly crucial in any machine learning deployment, so IBM’s wise to provide such utilities.
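Such a re-evaluation loop is conceptually simple: score the deployed model against freshly labeled data on a schedule and alert when accuracy falls below a floor. The sketch below is an illustration of the idea, not IBM's implementation:

```python
# Illustrative sketch of continuous model re-evaluation: periodically
# score the deployed model against newly labeled data and flag drift.

def accuracy(model, examples):
    hits = sum(1 for x, y in examples if model(x) == y)
    return hits / len(examples)

def check_drift(model, fresh_examples, floor=0.9):
    """Return (accuracy, alert) for one scheduled re-evaluation."""
    acc = accuracy(model, fresh_examples)
    return acc, acc < floor

model = lambda x: x >= 10                  # stand-in for a trained model
fresh = [(5, False), (12, True), (9, False), (11, False)]  # labels drifted
acc, alert = check_drift(model, fresh)
# 3 of 4 correct: accuracy 0.75, below the 0.9 floor, so alert fires.
```

In production the interesting part is what happens after the alert, typically retraining on the new data, but the monitoring itself is just this comparison run on a schedule.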

z/OS for starters; Watson it ain’t

The decision to limit the offering to z System machines for now makes the most sense as part of a general IBM strategy where machine learning advances are paired directly with branded hardware offerings. IBM’s PowerAI system also pairs custom IBM hardware — in this case, the Power8 processor — with commodity Nvidia GPUs to train models at high speed. In theory, PowerAI devices could run side by side with a mix of other, more mainstream hardware as part of an overall machine learning hardware array.

The z/OS incarnation of IBM Machine Learning is aimed at an even higher and narrower market: existing z/OS customers with tons of on-prem data. Rather than ask those (paying) customers to connect to something outside of their firewalls, IBM offers them first crack at tooling to help them get more from the data. The wording of IBM’s announcement — “initially make [IBM Machine Learning] available [on z/OS]” — implies that other targets are possible later on.

It’s also premature to read this as “IBM Watson behind the firewall,” since Watson’s appeal isn’t the algorithms themselves or the workflow IBM’s put together for them, but rather the volumes of pretrained data assembled by IBM, packaged into models and deployed through APIs. Those will remain exactly where IBM can monetize them best: behind its own firewall of IBM Watson as a service.

Source: InfoWorld Big Data

New big data tools for machine learning spring from home of Spark and Mesos

If the University of California, Berkeley’s AMPLab doesn’t ring bells, perhaps some of its projects will: Spark and Mesos.

AMPLab was planned all along as a five-year computer science research initiative, and it closed down as of last November after running its course. But a new lab is opening in its wake: RISELab, another five-year project at UC Berkeley with major financial backing and the stated goal of “focus[ing] intensely for five years on systems that provide Real-time Intelligence with Secure Execution [RISE].”

AMPLab was created with “a vision of understanding how machines and people could come together to process or to address problems in data — to use data to train rich models, to clean data, and to scale these things,” said Joseph E. Gonzalez, Assistant Professor in the Department of Electrical Engineering and Computer Science at UC Berkeley.

RISELab’s web page describes the group’s mission as “a proactive step to move beyond big data analytics into a more immersive world,” where “sensors are everywhere, AI is real, and the world is programmable.” One example cited: Managing the data infrastructure around “small, autonomous aerial vehicles,” whether unmanned drones or flying cars, where the data has to be processed securely at high speed.

Other big challenges Gonzalez singled out include security, though not the conventional focus on access controls. Rather, it involves concepts like “homomorphic” encryption, where encrypted data can be worked on without first having to decrypt it. “How can we make predictions on data in the cloud,” said Gonzalez, “without the cloud understanding what it is it’s making predictions about?”

Though the lab is in its early days, a few projects have already started to emerge:


Machine learning involves two basic kinds of work: Creating models from which predictions can be derived and serving up those predictions from the models. Clipper focuses on the second task and is described as a “general-purpose low-latency prediction serving system” that takes predictions from machine learning frameworks and serves them up with minimal latency.

Clipper has three aims that ought to draw the attention of anyone working with machine learning: One, it accelerates serving up predictions from a trained model. Two, it provides an abstraction layer across multiple machine learning frameworks, so a developer only has to program to a single API. Three, Clipper’s design makes it possible to respond dynamically to how individual models respond to requests — for instance, to allow a given model that works better for a particular class of problem to receive priority. Right now there’s no explicit mechanism for this, but it is a future possibility.
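The abstraction-layer idea can be sketched as a thin registry: whatever framework a model comes from, it is wrapped as a callable and served through one predict API. This is a conceptual sketch only, not Clipper's actual interface:

```python
# Conceptual sketch of a prediction-serving layer: models from different
# frameworks are wrapped as plain callables behind a single API.
# Clipper's real system adds batching, caching, and latency objectives.

class PredictionServer:
    def __init__(self):
        self.models = {}

    def register(self, name, fn):
        """`fn` wraps a trained model from any framework."""
        self.models[name] = fn

    def predict(self, name, x):
        return self.models[name](x)

server = PredictionServer()
server.register("threshold", lambda x: int(x > 0.5))   # stand-in "model"
server.register("identity", lambda x: x)

label = server.predict("threshold", 0.7)
```

A developer programs against `predict` alone, so swapping a TensorFlow model for a SparkML one becomes a registration change rather than an application rewrite.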


It seems fitting that a RISELab project would complement work done by AMPLab, and one does: Opaque works with Apache Spark SQL to enable “very strong security for DataFrames.” It uses Intel SGX processor extensions to allow DataFrames to be marked as encrypted and have all their operations performed within an “SGX enclave,” where data is encrypted in-place using the AES algorithm and is only visible to the application using it via hardware-level protection.

Gonzalez says this delivers the benefits of homomorphic encryption without the performance cost. The performance hit for using SGX is around 50 percent, but the fastest current implementations of homomorphic algorithms run 20,000 times slower. On the other hand, SGX-enabled processors are not yet offered in the cloud, although Gonzalez said this is slated to happen “in the near future.” The biggest stumbling block, though, may be the implementation, since in order for this to work, “you have to trust Intel,” as Gonzalez pointed out.


Ground is a context management system for data lakes. It provides a mechanism, implemented as a RESTful service in Java, that “enables users to reason about what data they have, where that data is flowing to and from, who is using the data, when the data changed, and why and how the data is changing.”

Gonzalez noted that data aggregation has moved away from strict, data-warehouse-style governance and toward “very open and flexible data lakes,” but that makes it “hard to track how the data came to be.” In some ways, he pointed out, knowing who changed a given set of data and how it was changed can be more important than the data itself. Ground provides a common API and meta model for tracking such information, and it works with many data repositories. (The Git version control system, for instance, is one of the supported data formats in the early alpha version of the project.)
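The kind of record Ground keeps can be pictured as simple lineage metadata attached to each change. The sketch below is illustrative only; Ground's real meta model and RESTful API are far richer, and the dataset and user names are invented:

```python
import time

# Illustrative sketch of lineage tracking: every change to a dataset is
# logged with who made it, when, why, and how -- the questions Ground
# is designed to answer about data in a lake.

lineage = []

def record_change(dataset, user, why, how):
    lineage.append({
        "dataset": dataset,
        "user": user,
        "why": why,
        "how": how,
        "when": time.time(),
    })

record_change("ship_movements", "todd", "dedupe pings", "drop_duplicates")
record_change("ship_movements", "ana", "restrict to 2016", "filter by year")

history = [e for e in lineage if e["dataset"] == "ship_movements"]
# Two entries answering who changed the data, when, why, and how.
```

Even this trivial log shows why such context matters: without it, the second version of the dataset is indistinguishable from the first, and no one can say how it came to be.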

Gonzalez admitted that defining RISELab’s goals can be tricky, but he noted that “at its core is this transition from how we build advanced analytics models, how we analyze data, to how we use that insight to make decisions — connecting the products of Spark to the world, the products of large-scale analytics.”

Source: InfoWorld Big Data