3 big data platforms look beyond Hadoop

A distributed file system, a MapReduce programming framework, and an extended family of tools for processing huge data sets on large clusters of commodity hardware, Hadoop has been synonymous with “big data” for more than a decade. But no technology can hold the spotlight forever.

While Hadoop remains an essential part of big data platforms, the major Hadoop vendors—namely Cloudera, Hortonworks, and MapR—have changed their platforms dramatically. Once-peripheral projects like Apache Spark and Apache Kafka have become the new stars, and the focus has turned to other ways to drill into data and extract insight.

Let’s take a brief tour of the three leading big data platforms, what each adds to the mix of Hadoop technologies to set it apart, and how they are evolving to embrace a new era of containers, Kubernetes, machine learning, and deep learning.

Cloudera Enterprise Data Hub

Cloudera was the first to market with a Hadoop distribution—not surprising given that its core team consisted of engineers who had leveraged Hadoop in places like Yahoo, Google, and Facebook. Hadoop co-creator Doug Cutting serves as chief architect. 

Source: InfoWorld Big Data

9 Splunk alternatives for log analysis

Quick! Name a log analysis service. If the first word that popped out of your mouth was “Splunk,” you’re far from alone.

But Splunk’s success has spurred many others to up their log-analysis game, whether open source or commercial. Here is a slew of contenders that have a lot to offer sysadmins and devops folks alike, from services to open source stacks.

Elasticsearch (ELK stack)

The acronym “LAMP” is used to refer to the web stack that comprises Linux, the Apache HTTP web server, the MySQL database, and PHP (or Perl, or Python). Likewise, “ELK” is used to describe a log analysis stack built from Elasticsearch for search functionality, Logstash for data collection, and Kibana for data visualization. All are open source.

Elastic, the company behind the commercial development of the stack, provides all the pieces either as cloud services or as free, open source offerings with support subscriptions. Elasticsearch, Logstash, and Kibana offer the best alternative to Splunk when used together, considering that Splunk’s strength is in searching and reporting as well as data collection.
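
To make the division of labor concrete, here is a minimal sketch that queries Logstash-fed indices through Elasticsearch’s official Python client. It assumes a local Elasticsearch node, Logstash’s default daily index naming, and a “message” field, all of which will vary by deployment.

    from elasticsearch import Elasticsearch

    # Connect to an assumed local Elasticsearch node (adjust the URL as needed).
    es = Elasticsearch(["http://localhost:9200"])

    # Search Logstash-style daily indices for log lines mentioning "error".
    # The index pattern and the "message" field are illustrative defaults.
    results = es.search(
        index="logstash-*",
        body={
            "query": {"match": {"message": "error"}},
            "sort": [{"@timestamp": {"order": "desc"}}],
            "size": 10,
        },
    )

    # Print the most recent matching log entries.
    for hit in results["hits"]["hits"]:
        print(hit["_source"].get("@timestamp"), hit["_source"].get("message"))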

Source: InfoWorld Big Data

What is TensorFlow? The machine learning library explained

Machine learning is a complex discipline. But implementing machine learning models is far less daunting and difficult than it used to be, thanks to machine learning frameworks—such as Google’s TensorFlow—that ease the process of acquiring data, training models, serving predictions, and refining future results.

Created by the Google Brain team, TensorFlow is an open source library for numerical computation and large-scale machine learning. TensorFlow bundles together a slew of machine learning and deep learning (aka neural networking) models and algorithms and makes them useful by way of a common metaphor. It uses Python to provide a convenient front-end API for building applications with the framework, while executing those applications in high-performance C++.

TensorFlow can train and run deep neural networks for handwritten digit classification, image recognition, word embeddings, recurrent neural networks, sequence-to-sequence models for machine translation, natural language processing, and PDE (partial differential equation) based simulations. Best of all, TensorFlow supports production prediction at scale, with the same models used for training.

How TensorFlow works

TensorFlow allows developers to create dataflow graphs—structures that describe how data moves through a graph, or a series of processing nodes. Each node in the graph represents a mathematical operation, and each connection or edge between nodes is a multidimensional data array, or tensor.

TensorFlow provides all of this for the programmer by way of the Python language. Python is easy to learn and work with, and provides convenient ways to express how high-level abstractions can be coupled together. Nodes and tensors in TensorFlow are Python objects, and TensorFlow applications are themselves Python applications.

The actual math operations, however, are not performed in Python. The libraries of transformations that are available through TensorFlow are written as high-performance C++ binaries. Python just directs traffic between the pieces, and provides high-level programming abstractions to hook them together.
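
To make the graph metaphor concrete, here is a minimal sketch using the TensorFlow 1.x-style graph API (in TensorFlow 2.x the same operations run eagerly by default). The Python code only describes the nodes and edges; the C++ runtime does the arithmetic when the session runs.

    import tensorflow as tf

    # Build a dataflow graph in Python: each op is a node, each tensor an edge.
    a = tf.constant([[1.0, 2.0]])          # 1x2 tensor
    b = tf.constant([[3.0], [4.0]])        # 2x1 tensor
    product = tf.matmul(a, b)              # matrix-multiply node

    # Nothing has been computed yet; 'product' is just a node in the graph.
    # Running a session hands the graph to the C++ runtime for execution.
    with tf.Session() as sess:
        print(sess.run(product))           # [[11.]]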

TensorFlow applications can be run on most any target that’s convenient: a local machine, a cluster in the cloud, iOS and Android devices, CPUs or GPUs. If you use Google’s own cloud, you can run TensorFlow on Google’s custom TensorFlow Processing Unit (TPU) silicon for further acceleration. The resulting models created by TensorFlow, though, can be deployed on most any device where they will be used to serve predictions.

TensorFlow benefits

The single biggest benefit TensorFlow provides for machine learning development is abstraction. Instead of dealing with the nitty-gritty details of implementing algorithms, or figuring out proper ways to hitch the output of one function to the input of another, the developer can focus on the overall logic of the application. TensorFlow takes care of the details behind the scenes.

TensorFlow offers additional conveniences for developers who need to debug and gain introspection into TensorFlow apps. The eager execution mode lets you evaluate and modify each graph operation separately and transparently, instead of constructing the entire graph as a single opaque object and evaluating it all at once. The TensorBoard visualization suite lets you inspect and profile the way graphs run by way of an interactive, web-based dashboard.
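
A brief sketch of the difference, again assuming the TensorFlow 1.x API, where eager mode is opt-in rather than the default:

    import tensorflow as tf

    # Opt in to eager execution (this is the default behavior in TensorFlow 2.x).
    tf.enable_eager_execution()

    x = tf.constant([[2.0, 0.0], [0.0, 2.0]])

    # With eager execution, each operation is evaluated immediately and its
    # result can be inspected like an ordinary Python value, no session needed.
    y = tf.matmul(x, x)
    print(y.numpy())    # [[4. 0.]
                        #  [0. 4.]]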

And of course TensorFlow gains many advantages from the backing of an A-list commercial outfit in Google. Google has not only fueled the rapid pace of development behind the project, but created many significant offerings around TensorFlow that make it easier to deploy and easier to use: the above-mentioned TPU silicon for accelerated performance in Google’s cloud; an online hub for sharing models created with the framework; in-browser and mobile-friendly incarnations of the framework; and much more.

One caveat: Some details of TensorFlow’s implementation make it hard to obtain totally deterministic model-training results for some training jobs. Sometimes a model trained on one system will vary slightly from a model trained on another, even when they are fed the exact same data. The reasons for this are slippery—e.g., how and where random numbers are seeded, or certain nondeterministic behaviors when using GPUs. That said, it is possible to work around those issues, and TensorFlow’s team is considering more controls to affect determinism in a workflow.
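
One common, if partial, mitigation is to pin every random seed in play. The sketch below assumes the TensorFlow 1.x API; note that GPU-level nondeterminism can persist even with all seeds fixed.

    import random

    import numpy as np
    import tensorflow as tf

    # Pin the seeds that feed TensorFlow's graph-level random ops, NumPy, and
    # Python's own RNG. This narrows, but does not eliminate, run-to-run drift:
    # some GPU kernels remain nondeterministic regardless of seeding.
    SEED = 1234
    random.seed(SEED)
    np.random.seed(SEED)
    tf.set_random_seed(SEED)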

TensorFlow vs. the competition

TensorFlow competes with a slew of other machine learning frameworks. PyTorch, CNTK, and MXNet are three major frameworks that address many of the same needs. Below I’ve noted where they stand out and come up short against TensorFlow.

  • PyTorch, in addition to being built with Python, has many other similarities to TensorFlow: hardware-accelerated components under the hood, a highly interactive development model that allows for design-as-you-go work, and many useful components already included. PyTorch is generally a better choice for fast development of projects that need to be up and running in a short time, but TensorFlow wins out for larger projects and more complex workflows.

  • CNTK, the Microsoft Cognitive Toolkit, like TensorFlow uses a graph structure to describe dataflow, but focuses most on creating deep learning neural networks. CNTK handles many neural network jobs faster, and has a broader set of APIs (Python, C++, C#, Java). But CNTK isn’t currently as easy to learn or deploy as TensorFlow.

  • Apache MXNet, adopted by Amazon as the premier deep learning framework on AWS, can scale almost linearly across multiple GPUs and multiple machines. It also supports a broad range of language APIs—Python, C++, Scala, R, JavaScript, Julia, Perl, Go—although its native APIs aren’t as pleasant to work with as TensorFlow’s.

Source: InfoWorld Big Data

Julia vs. Python: Julia language rises for data science

Of the many use cases Python covers, data analytics has become perhaps the biggest and most significant. The Python ecosystem is loaded with libraries, tools, and applications that make the work of scientific computing and data analysis fast and convenient.

But for the developers behind the Julia language—aimed specifically at “scientific computing, machine learning, data mining, large-scale linear algebra, distributed and parallel computing”—Python isn’t fast or convenient enough. In their view, Python is a trade-off: good for some parts of this work, terrible for others.

What is the Julia language?

Created in 2009 by a four-person team and unveiled to the public in 2012, Julia is meant to address the shortcomings in Python and other languages and applications used for scientific computing and data processing. “We are greedy,” they wrote. They wanted more: 

We want a language that’s open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.

(Did we mention it should be as fast as C?)

Here are some of the ways Julia implements those aspirations:

  • Compiled, not interpreted, for speed. Julia is just-in-time (JIT) compiled using the LLVM compiler framework. At its best, Julia can approach or match the speed of C.
  • Straightforward but useful syntax. Julia’s syntax is similar to Python’s—terse, but also expressive and powerful.
  • Dynamic typing with static type benefits. You can specify types for variables, like “unsigned 32-bit integer.” But you can also create hierarchies of types to allow general cases for handling variables of specific types—for instance, to write a function that accepts integers generally without specifying the length or signedness of the integer. And, finally, you can do without typing entirely if it isn’t needed in a particular context.
  • Python, C, and Fortran libraries are just a call away. Julia can interface directly with external libraries written in C and Fortran. It’s also possible to interface with Python code by way of the PyCall library, and even share data between Python and Julia.
  • Metaprogramming. Julia programs can generate other Julia programs, and even modify their own code, in a way that is reminiscent of languages like Lisp.

Julia vs. Python: Julia language advantages

Julia was designed from the start for scientific and numerical computation. Thus it’s no surprise that Julia has many features advantageous for such use cases:

  • Faster by default. Julia’s JIT compilation and type declarations mean it can routinely beat “pure,” unoptimized Python by orders of magnitude. Python can be made faster by way of external libraries, third-party JIT compilers (PyPy), and optimizations with tools like Cython, but Julia is designed to be faster right out of the gate.
  • A math-friendly syntax. A major target audience for Julia is users of scientific computing languages and environments like Matlab, R, Mathematica, and Octave. Julia’s syntax for math operations looks more like the way math formulas are written outside of the computing world, making it easier for non-programmers to pick up on.
  • Automatic memory management. Like Python, Julia doesn’t burden the user with the details of allocating and freeing memory, and it provides some measure of manual control over garbage collection. The idea is that if you switch to Julia, you don’t lose one of Python’s common conveniences.
  • Parallelism. Math and scientific computing thrive when you can make use of the full resources available on a given machine, especially multiple cores. Both Python and Julia can run operations in parallel. But Julia’s syntax is slightly less top-heavy than Python’s, lowering the threshold to its use.

Python vs. Julia: Python advantages

Python is a general-purpose computing language that is easy to learn, and that has developed into a leading language for scientific computing. Some of the reasons Python may still be the better choice for data science work:

  • Julia arrays are 1-indexed. This might seem like an obscure issue, but it’s a potentially jarring one. In most languages, Python and C included, the first element of an array is accessed with a zero—e.g., string[0] in Python for the first character in a string. Julia uses 1 for the first element in an array. This isn’t an arbitrary decision; many other math and science applications, like Mathematica, use 1-indexing, and Julia is intended to appeal to that audience. It’s possible to support zero-indexing in Julia with an experimental feature, but 1-indexing by default may stand in the way of adoption by a more general-use audience with ingrained programming habits.
  • Julia is still young. The Julia language has been under development since 2009, and has undergone a fair amount of feature churn along the way. It still doesn’t have a 1.0 release, although the developers are getting close.
  • Python has far more third-party packages. The breadth and usefulness of Python’s culture of third-party packages remains one of the language’s biggest attractions. Again, Julia’s relative newness means the culture of software around it is still small. Some of that is offset by the ability to use existing C and Python libraries, but Julia needs libraries of its own to thrive.
  • Python’s huge community is a huge advantage. A language is nothing without a large, devoted, and active community around it. Python enjoys just such a community right now. The community around Julia is enthusiastic and growing, but it is still only a fraction of the size of the Python community. 

Source: InfoWorld Big Data

Apache PredictionIO: Easier machine learning with Spark

The Apache Foundation has added a new machine learning project to its roster, Apache PredictionIO, an open-sourced version of a project originally devised by a subsidiary of Salesforce.

What PredictionIO does for machine learning and Spark

Apache PredictionIO is built atop Spark and Hadoop, and serves Spark-powered predictions from data using customizable templates for common tasks. Apps send data to PredictionIO’s event server to train a model, then query the engine for predictions based on the model.

Spark, MLlib, HBase, Spray, and Elasticsearch all come bundled with PredictionIO, and Apache offers supported SDKs for working in Java, PHP, Python, and Ruby. Data can be stored in a variety of back ends: JDBC, Elasticsearch, HBase, HDFS, and local file systems are all supported out of the box. Back ends are pluggable, so a developer can create a custom back-end connector.
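
End to end, the flow looks roughly like the sketch below, which uses the Python SDK against an event server and a deployed engine on their default local ports. The access key, event schema, and query fields are placeholders for whatever the chosen template actually expects.

    import predictionio

    # Send a training event to the event server (default port 7070).
    # The access key and event schema are placeholders; the template in use
    # defines which events and properties it expects.
    event_client = predictionio.EventClient(
        access_key="YOUR_ACCESS_KEY",
        url="http://localhost:7070",
    )
    event_client.create_event(
        event="rate",
        entity_type="user",
        entity_id="u1",
        target_entity_type="item",
        target_entity_id="i42",
        properties={"rating": 4.0},
    )

    # After training and deploying an engine, query it (default port 8000)
    # for predictions based on the trained model.
    engine_client = predictionio.EngineClient(url="http://localhost:8000")
    prediction = engine_client.send_query({"user": "u1", "num": 5})
    print(prediction)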

How PredictionIO templates make it easier to serve predictions from Spark

PredictionIO’s most notable advantage is its template system for creating machine learning engines. Templates reduce the heavy lifting needed to set up the system to serve specific kinds of predictions. They describe any third-party dependencies that might be needed for the job, such as the Apache Mahout machine-learning app framework.

PredictionIO maintains a gallery of existing templates, covering tasks like recommendations and churn rate detection. Some templates also integrate with other machine learning products. For example, two of the prediction templates currently in PredictionIO’s gallery, for churn rate detection and general recommendations, use H2O.ai’s Sparkling Water enhancements for Spark.

PredictionIO can also automatically evaluate a prediction engine to determine the best hyperparameters to use with it. The developer needs to choose and set the metrics used for the evaluation, but this is generally less work than tuning hyperparameters by hand.

When running as a service, PredictionIO can accept predictions singly or as a batch. Batched predictions are automatically parallelized across a Spark cluster, as long as the algorithms used in a batch prediction job are all serializable. (PredictionIO’s default algorithms are.)

Where to download PredictionIO

PredictionIO’s source code is available on GitHub. For convenience, various Docker images are available, as well as a Heroku build pack.

Source: InfoWorld Big Data

11 open source tools to make the most of machine learning

Venerable Shogun was created in 1999 and written in C++, but can be used with Java, Python, C#, Ruby, R, Lua, Octave, and Matlab. The latest version, 6.0.0, adds native support for Microsoft Windows and the Scala language.

Though popular and wide-ranging, Shogun has competition. Another C++-based machine learning library, Mlpack, has been around only since 2011, but professes to be faster and easier to work with (by way of a more integrated API set) than competing libraries.

Project: Shogun
GitHub: https://github.com/shogun-toolbox/shogun

Source: InfoWorld Big Data

ONNX makes machine learning models portable, shareable

Microsoft and Facebook have announced a joint project to make it easier for data analysts to exchange trained models between different machine learning frameworks.

The Open Neural Network Exchange (ONNX) format is meant to provide a common way to represent the data used by neural networks. Most frameworks have their own model format; models from one framework can be used in another only by way of a conversion tool.

ONNX allows models to be swapped freely between frameworks without the conversion process. A model trained on one framework can be used for inference by another framework.

Microsoft claims the ONNX format provides advantages above and beyond not having to convert between model formats. For instance, it allows developers to choose frameworks that reflect the job and workflow at hand, since each framework tends to be optimized for different use cases: “fast training, supporting flexible network architectures, inferencing on mobile devices, etc.”

Facebook notes that a few key frameworks are already on board to start supporting ONNX. Caffe2, PyTorch (both Facebook’s projects), and Cognitive Toolkit (Microsoft’s project) will provide support sometime in September. This, according to Facebook, “will allow models trained in one of these frameworks to be exported to another for inference.”
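
From the PyTorch side, exporting a model for that kind of exchange looks roughly like the sketch below. The toy network and file name are placeholders, and the exact export options have shifted between releases.

    import torch
    import torch.nn as nn

    # A toy model standing in for whatever network was actually trained.
    model = nn.Sequential(
        nn.Linear(10, 32),
        nn.ReLU(),
        nn.Linear(32, 2),
    )
    model.eval()

    # ONNX export traces the model with a sample input of the right shape.
    dummy_input = torch.randn(1, 10)
    torch.onnx.export(model, dummy_input, "model.onnx")

    # The resulting model.onnx file can then be loaded for inference by any
    # other framework or runtime that understands the ONNX format.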

The first wave of ONNX-supporting releases won’t cover everything out of the gate. In PyTorch’s case, Facebook notes that “some of the more advanced programs in PyTorch such as those with dynamic flow control” won’t benefit fully from ONNX support yet.

It’s not immediately clear how ONNX model sizes shape up against those already in common use. Apple’s Core ML format, for instance, was designed by Apple so that small but accurate models could be deployed to and served from end-user devices like the iPhone. But Core ML is proprietary. One of ONNX’s long-term goals is to make it easier to deliver models for inference to many kinds of targets.

Source: InfoWorld Big Data

13 frameworks for mastering machine learning

H2O, now in its third major revision, provides access to machine learning algorithms by way of common development environments (Python, Java, Scala, R), big data systems (Hadoop, Spark), and data sources (HDFS, S3, SQL, NoSQL). H2O is meant to be used as an end-to-end solution for gathering data, building models, and serving predictions. For instance, models can be exported as Java code, allowing predictions to be served on many platforms and in many environments.

H2O can work as a native Python library, or by way of a Jupyter Notebook, or by way of the R language in R Studio. The platform also includes an open source, web-based environment called Flow, exclusive to H2O, which allows interacting with the dataset during the training process, not just before or after. 
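
A minimal sketch of the native Python route, assuming the h2o package is installed and a local H2O instance can be started; the CSV path, column names, and choice of algorithm are placeholders.

    import h2o
    from h2o.estimators import H2OGradientBoostingEstimator

    # Start (or connect to) a local H2O instance.
    h2o.init()

    # Import data into an H2OFrame; the path and column names are placeholders.
    frame = h2o.import_file("training_data.csv")
    predictors = ["feature_1", "feature_2", "feature_3"]
    response = "label"

    # Train a gradient boosting model on the frame.
    model = H2OGradientBoostingEstimator(ntrees=50)
    model.train(x=predictors, y=response, training_frame=frame)

    # Score rows with the same frame API used for training.
    predictions = model.predict(frame)
    print(predictions.head())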

Source: InfoWorld Big Data

IBM speeds deep learning by using multiple servers

For everyone frustrated by how long it takes to train deep learning models, IBM has some good news: It has unveiled a way to automatically split deep-learning training jobs across multiple physical servers — not just individual GPUs, but whole systems with their own separate sets of GPUs.

Now the bad news: It’s available only in IBM’s PowerAI 4.0 software package, which runs exclusively on IBM’s own OpenPower hardware systems.

Distributed Deep Learning (DDL) doesn’t require developers to learn an entirely new deep learning framework. It repackages several common frameworks for machine learning: TensorFlow, Torch, Caffe, Chainer, and Theano. Deep learning projects that use those frameworks can then run in parallel across multiple hardware nodes.

IBM claims the speedup gained by scaling across nodes is nearly linear. One benchmark, using the ResNet-101 neural network model and the ImageNet-22K data set, needed 16 days to complete on one IBM S822LC server. Spread across 64 such systems, the same benchmark concluded in seven hours, or 58 times faster.

IBM offers two ways to use DDL. One, you can shell out the cash for the servers it’s designed for, which sport two Nvidia Tesla P100 units each, at about $50,000 a head. Two, you can run the PowerAI software in a cloud instance provided by IBM partner Nimbix, for around $0.43 an hour.

One thing you can’t do, though, is run PowerAI on commodity Intel x86 systems. IBM has no plans to offer PowerAI on that platform, citing tight integration between PowerAI’s proprietary components and the OpenPower systems designed to support them. Most of the magic, IBM says, comes from a machine-to-machine software interconnection system that rides on top of whatever hardware fabric is available. Typically, that’s an InfiniBand link, although IBM claims it can also work on conventional gigabit Ethernet (still, IBM admits it won’t run anywhere near as fast).

It’s been possible to do deep-learning training on multiple systems in a cluster for some time now, although each framework tends to have its own set of solutions. With Caffe, for example, there’s the Parallel ML System or CaffeOnSpark. TensorFlow can also be distributed across multiple servers, but again any integration with other frameworks is something you’ll have to add by hand.

IBM’s claimed advantage is that DDL works with multiple frameworks and requires less heavy lifting to set things up. But those benefits come at the cost of running on IBM’s own iron.

Source: InfoWorld Big Data

Apache Spark 2.2 gets streaming, R language boosts

With version 2.2 of Apache Spark, a long-awaited feature for the multipurpose in-memory data processing framework is now available for production use.

Structured Streaming, as that feature is called, allows Spark to process streams of data in ways that are native to Spark’s batch-based data-handling metaphors. It’s part of Spark’s long-term push to become, if not all things to all people in data science, then at least the best thing for most of them.

Structured Streaming in 2.2 benefits from a number of other changes aside from losing its experimental designation. It can now work as a source or a sink for data coming from or being written to an Apache Kafka source, with lower latency for Kafka connections than previously.

Kafka, itself an Apache Software Foundation project, is a distributed messaging bus widely used in streaming applications. Kafka has typically been paired with another stream-processing framework, Apache Storm, but Storm is limited to stream processing only, and Spark presents less complex APIs to the developer.

Structured Streaming jobs can now use Spark’s triggering mechanism to run a streaming job once and quit. Databricks, the chief commercial outfit supporting Spark development, claims this is a more efficient execution model than running Spark batch jobs intermittently.
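
Put together in PySpark, reading a Kafka topic and running it through the new run-once trigger looks roughly like this sketch. The broker address, topic, and output paths are placeholders, and the Kafka source requires the separate spark-sql-kafka package on the classpath.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

    # Read a Kafka topic as a streaming DataFrame. The broker and topic
    # names are placeholders for a real deployment.
    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "events")
        .load()
    )

    # Kafka records arrive as binary key/value columns; cast them to strings.
    messages = events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    # Process whatever data is currently available, write it out, and stop:
    # Spark 2.2's run-once trigger, instead of leaving a job running.
    query = (
        messages.writeStream
        .format("parquet")
        .option("path", "/tmp/events")
        .option("checkpointLocation", "/tmp/events-checkpoint")
        .trigger(once=True)
        .start()
    )
    query.awaitTermination()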

The native collection of machine learning libraries in Spark, MLlib, has been outfitted with new algorithms for tasks like performing PageRank on datasets, or running multiclass logistic regression analysis (e.g., which current hit movie will a person in various demographic categories probably like best?). Machine learning is a common use case for Spark. 
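
For instance, requesting multinomial (multiclass) logistic regression through Spark ML’s Python API looks roughly like the following sketch, using a tiny made-up training set.

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("multinomial-lr").getOrCreate()

    # A tiny illustrative training set with three classes.
    training = spark.createDataFrame(
        [
            (0.0, Vectors.dense(0.0, 1.1)),
            (1.0, Vectors.dense(2.0, 1.0)),
            (2.0, Vectors.dense(4.0, -1.0)),
            (0.0, Vectors.dense(0.1, 1.2)),
            (1.0, Vectors.dense(2.2, 0.9)),
            (2.0, Vectors.dense(3.9, -1.2)),
        ],
        ["label", "features"],
    )

    # family="multinomial" requests multiclass logistic regression.
    lr = LogisticRegression(family="multinomial", maxIter=100, regParam=0.01)
    model = lr.fit(training)

    # Per-class coefficients: one row per class in the coefficient matrix.
    print(model.coefficientMatrix)
    print(model.interceptVector)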

Machine learning in Spark also gets a major boost from improved support for the R language. Earlier versions of Spark had wider support for Java and Python than R, but Spark 2.2 adds R support for 10 distributed algorithms. Structured Streaming and the Catalog API (used for accessing query metadata in Spark SQL) can now also be used from R.

Source: InfoWorld Big Data