Google Cloud Machine Learning hits public beta, with additions

Google today unveiled machine learning-related additions to its cloud platform, both to enrich its own cloud-based offerings and to give businesses expanded toolsets for developing their own machine learning-powered products.

The most prominent offering was the public beta of Google Cloud Machine Learning, a platform for building and training machine learning models with the TensorFlow framework, using data stored in the BigQuery and Cloud Storage back ends.

Google says its system simplifies the whole process of creating and deploying machine learning back ends for apps. Some of this is simply by making models faster to train. Google claims Cloud Machine Learning’s distributed training “can train models on terabytes of data within hours, instead of waiting for days.”

Much of it, however, is about Cloud Machine Learning’s APIs reducing the amount of programming required to build useful things. In a live demo, Google built and demonstrated a five-layer neural net for stock market analysis with just a few lines of code.
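The demo’s actual code and data weren’t published, but a network of that shape does take only a few lines in TensorFlow’s Python API. As a rough illustration, here is a minimal sketch using the 1.x-era API, with hypothetical feature counts and layer sizes:

    import tensorflow as tf  # TensorFlow 1.x-era API

    # Hypothetical dimensions; the demo's real architecture wasn't published.
    n_features = 20                    # e.g., engineered market indicators
    hidden_sizes = [64, 64, 32, 16]    # four hidden layers; output below makes five

    x = tf.placeholder(tf.float32, [None, n_features])
    y = tf.placeholder(tf.float32, [None, 1])

    h = x
    for size in hidden_sizes:
        h = tf.layers.dense(h, size, activation=tf.nn.relu)
    prediction = tf.layers.dense(h, 1)  # linear output for a regression target

    loss = tf.reduce_mean(tf.square(prediction - y))
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)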

Another announced feature, HyperTune, removes a common source of drudgery in building machine learning models: tuning parameters to yield the best results. Google claims HyperTune “automatically improves predictive accuracy” by automating that step.
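HyperTune itself is configured through Cloud Machine Learning’s job settings rather than application code, but the step it automates is easy to picture. The sketch below shows the manual version of that drudgery, a plain search loop; every name in it, including the train_and_evaluate() stub, is hypothetical:

    import random

    def train_and_evaluate(learning_rate, hidden_units):
        """Hypothetical stand-in: train a model, return a validation score."""
        return random.random()  # placeholder so the sketch runs end to end

    search_space = {
        "learning_rate": [1e-4, 1e-3, 1e-2],
        "hidden_units": [32, 64, 128],
    }

    best_score, best_params = float("-inf"), None
    for _ in range(20):
        params = {name: random.choice(values) for name, values in search_space.items()}
        score = train_and_evaluate(**params)
        if score > best_score:
            best_score, best_params = score, params

    print(best_params, best_score)

A service-side tuner does the same job more intelligently, steering the search toward promising regions rather than sampling at random.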

Google Cloud Machine Learning was previously available only as an alpha-level tech preview, but even at that stage InfoWorld’s Martin Heller was impressed with its pre-trained APIs for artificial vision, speech, natural language, and language translation.
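Those pre-trained APIs are reachable from ordinary client code. As one example, here is a minimal sketch of a label-detection call against the Vision API using Google’s google-cloud-vision Python client; the client surface has shifted between releases, so treat the exact calls as illustrative, and note that valid Google Cloud credentials and a local image file are assumed:

    from google.cloud import vision  # pip install google-cloud-vision

    client = vision.ImageAnnotatorClient()  # picks up application credentials

    with open("photo.jpg", "rb") as f:      # any local image file
        image = vision.Image(content=f.read())

    response = client.label_detection(image=image)
    for label in response.label_annotations:
        print(label.description, label.score)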

Many of the machine learning tools Google now offers to end users, such as TensorFlow, arose from internal work to bolster Google’s own projects. The revamped version of Google’s office applications, G Suite, is one of the latest to be dressed up with machine learning-powered features. Most of these additions automate common busywork, such as finding a free time slot on a calendar for a meeting.

Google’s machine learning offerings pit it against several other big-league cloud vendors offering their own variations on the same themes, from IBM’s Bluemix and Watson services to Microsoft’s Azure Machine Learning. All of them, along with Amazon, Facebook, and others, recently announced the Partnership on AI, an effort to “study and formulate best practices on AI technologies,” although it seems more like a general clearinghouse for public awareness about machine learning than a way for those nominal rivals to collaborate on shared projects.

Source: InfoWorld Big Data

How MIT's C/C++ extension breaks parallel processing bottlenecks

Earlier this week, MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) announced Milk, a system that speeds up parallel processing of big data sets by as much as three or four times.

If you think this involves learning a whole new programming language, breathe easy. Milk is less a radical departure from existing software development than a refinement of an existing set of C/C++ tools.

All together now

According to the paper authored by the CSAIL team, Milk is an extension to the C/C++ language family that addresses the memory bottlenecks plaguing big data applications. Apps that run in parallel contend with one another for memory access, so any gains from parallel processing are offset by the time spent waiting for memory.

Milk solves these problems by extending OpenMP, an API widely used in C/C++ programming to parallelize work across shared memory. Programmers typically use OpenMP by annotating sections of their code with compiler directives (“pragmas”) that invoke OpenMP extensions, and Milk works the same way. Its directives are syntactically similar, and in some cases they’re minor variants of existing OpenMP pragmas, so existing OpenMP apps don’t have to be heavily reworked to be sped up.

Milk’s big advantage is that it performs what the paper’s authors describe as “DRAM-conscious clustering.” Since data shuttled from memory is cached locally on the CPU, batching together data requests from multiple processes allows the on-CPU cache to be shared more evenly between them.
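Milk does this work inside the compiler and runtime, but the underlying idea can be sketched in a few lines of plain Python: reorder indirect accesses so that neighboring addresses are touched together, then restore the original order. This is a conceptual toy, not Milk’s actual mechanism:

    import random

    table = list(range(1_000_000))     # stand-in for a large in-memory table
    requests = [random.randrange(len(table)) for _ in range(100_000)]

    # Naive order: each lookup may land on a cold cache line.
    naive = [table[i] for i in requests]

    # Clustered order: visit the indices sorted by address so consecutive
    # lookups touch neighboring memory, then put results back in request order.
    order = sorted(range(len(requests)), key=lambda k: requests[k])
    clustered = [None] * len(requests)
    for k in order:
        clustered[k] = table[requests[k]]

    assert naive == clustered          # same answers, friendlier access pattern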

The most advanced use of Milk requires using some functions exposed by the library — in other words, some rewriting — but it’s clearly possible to get some results right away by simply decorating existing code.

Let’s not throw all this out yet

As CPU speeds top out, attention has turned to other ways to ramp up processing power. The most direct option is to scale out: spreading workloads across multiple cores on a single chip, across multiple CPUs, or throughout a cluster of machines. While a plethora of tools exists to spread workloads in these ways, most of the languages used with them weren’t designed with parallelism in mind. Hence the creation of languages like Pony, which provide a fresh set of metaphors for programming in such environments.

Another approach has been to work around the memory-to-CPU bottleneck by moving more of the processing to where the data already resides. Example: the MapD database, which uses GPUs and their local memory for both accelerated processing and distributed data caching.

Each of these approaches has its downsides. With new languages, there’s the pain of scrapping existing workflows and toolchains, some of which have decades of work behind them. Using GPUs has some of the same problems: Shifting workloads to a GPU is easy only if the existing work is abstracted away through a toolkit that can be made GPU-aware. Otherwise, you’re back to rewriting everything from scratch.

A project like Milk, on the other hand, adds a substantial improvement to a tool set that’s already widely used and well understood. It’s always easier to transform existing work than to tear it down and start over, so Milk provides a way to squeeze more out of what we already have.

Source: InfoWorld Big Data

Teradata expands analytics for hybrid cloud

Analytics solutions provider Teradata has released new hybrid-cloud management capabilities to counter rising pressure from both open source and commercial competitors.

Teradata Everywhere expands support for existing cloud-hosted Teradata solutions and adds new hybrid and cross-cloud orchestration components that make it possible to manage Teradata instances across “on-premises appliances, on-premises virtualization environments, managed cloud, and public cloud,” according to the company’s announcement.

Teradata was previously available on Amazon Web Services, but the latest iteration provides up to 32 nodes per instance and conveniences like automated backup functionality. Later this year, Microsoft Azure is set to start running this iteration of Teradata, as are VMware environments, Teradata’s own Managed Cloud in Europe (Germany), and Teradata’s on-premises IntelliFlex platform. (Google Compute Engine support was not among the environments mentioned in the announcement.)

Other improvements in the works, but not slated to debut until next year, are features to allow expansion and rebalancing of data loads between Teradata instances without major downtime and a new “in-stream query re-planning” system designed to optimize queries as they are being executed.

Teradata’s plans involve more than providing a way to run cloud-hosted instances of its database on the infrastructure of one’s choice. Rather, the company says it hopes to make Teradata as “borderless,” or hybrid, as possible. Teradata QueryGrid and Teradata Unity are being revised to better support this goal.

One key change — managing Teradata instances across environments — is available now. But many of the others — for example, automatic capture and replay of selected data changes between Teradata systems or one-click database initialization across systems — are projected to be ready in 2017.

Though powerful, Teradata is facing stiffer competition. After Hadoop came to prominence as a commodity open source data-analysis solution, Teradata made use of it as a data source by way of the commercial MapR distribution.

Cloud services such as Amazon Redshift and Microsoft’s Azure SQL also offer data warehousing. Azure SQL has been enhanced by changes to SQL Server that encourage the bursting-to-the-cloud expansion Teradata is now promising. There’s also pressure from new kinds of dedicated cloud services, such as Snowflake, which promises maximum performance with minimal oversight.

Source: InfoWorld Big Data

Tableau 10 gives users the features they want

Tableau wanted the 10.0 edition of its flagship visual analytics and BI tool to be truly worthy of another digit in front of the decimal point.

The self-dubbed “Google of data visualization” shaped version 10.0 based on feedback from its user community, and paired that with an all-new visual design intended to make Tableau both more attractive and easier to use.

An eyeful of info

The most prominent outward feature of Tableau 10 is a redesigned interface that’s more in line with a Bootstrap-powered website than the legacy toolbar-and-panel desktop-app look found in previous versions of the app.

Tableau 10’s refined interface hews closer to a Bootstrap-powered web app than a desktop app — both for the sake of consistency across devices and the aesthetic appeal of such a look.

That mobile-esque look is no accident. Tableau has a habit of leveraging trends in mobile devices and web interfaces, such as its Vizable app for data exploration. The company has also provided a tool for designing dashboards that are useful on both desktops and mobile devices, and has commissioned custom-designed fonts to make data legends more legible.

Big data brawlers: 4 challengers to Spark

Big (and even not so big) data hasn’t been the same since Apache Spark made inroads with developers and became a staple ingredient in big data clouds.

But Spark is far from perfect. It’s certainly improving, as version 2.0 shows, but if a competitor offers a better handle on what Spark does and more, developers will pay attention.

Here are four projects emerging as possible competition for Spark, each taking a new approach to the conventional in-memory batch processing Spark is famous for, to the streaming Spark continues to work on, or to both.

Apache Apex

What it is: Originally created by DataTorrent, Apex has since been donated to the Apache Foundation. It performs both stream and batch processing on Hadoop under YARN.

Spark 2.0 takes an all-in-one approach to big data

Apache Spark, the in-memory processing system that’s fast become a centerpiece of modern big data frameworks, has officially released its long-awaited version 2.0.

Aside from some major usability and performance improvements, Spark 2.0’s mission is to become a total solution for streaming and real-time data. This comes as a number of other projects — including others from the Apache Foundation — provide their own ways to boost real-time and in-memory processing.
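The centerpiece of that streaming push is Structured Streaming, introduced (as an experimental feature) in 2.0, which lets a streaming job be written with the same DataFrame calls as a batch query. A minimal PySpark sketch, following the word-count pattern from Spark’s documentation (the socket source, host, and port are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("streaming-counts").getOrCreate()

    # Read a stream of text lines from a TCP socket, then run an
    # ordinary DataFrame aggregation over it.
    lines = (spark.readStream
                  .format("socket")
                  .option("host", "localhost")
                  .option("port", 9999)
                  .load())

    counts = lines.groupBy("value").count()

    # Continuously print the running counts to the console.
    query = (counts.writeStream
                   .outputMode("complete")
                   .format("console")
                   .start())
    query.awaitTermination()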

Easier on top, faster underneath

Most of Spark 2.0’s big changes have been known well in advance, which has made them even more hotly anticipated.

One of the largest and most technologically ambitious additions is Project Tungsten, a reworking of Spark’s approach to memory management and code generation. Pieces of Project Tungsten have shown up in earlier releases, but 2.0 adds more, such as applying Tungsten’s memory management to both caching and runtime execution.

Spark-powered Splice Machine goes open source

Splice Machine, the relational SQL database system that uses Hadoop and Spark to provide high-speed results, is now available in an open source edition.

Version 2.0 of Splice Machine added Spark to speed up OLAP-style workloads while still processing conventional OLTP workloads with HBase. The open source version, distributed under the Apache 2.0 license, supplies both engines and most of Splice Machine’s other features, including Apache Kafka streaming support. However, it omits a few enterprise-level options like encryption, Kerberos support, column-level access control, and backup/restore functionality.

Splice Machine is going open source for two reasons. The first is to get the database into the hands of developers, letting them migrate data to it, test it on their own hardware or in the cloud, then upgrade to the full version if it fits the bill. The second, as with any open source project, is to let those developers contribute back if they’re so inclined.

The first motive is more relevant here. Originally, Splice Machine was offered in a free-to-use edition minus some enterprise features. The open source version provides a less ambiguous way to offer a freebie, since there’s less fear a user will casually violate the license agreement by enabling the wrong item (see: Oracle). Going open source also helps deflect criticism of Splice Machine as a proprietary black box, which InfoWorld’s Andy Oliver hinted at in his original 2014 discussion of the database.

Microsoft R Client provides a free taste of R Server

If you’ve been hungering to use the advanced number-crunching technology in Microsoft’s R Server product but feared its price tag, Microsoft has a partial answer: Microsoft R Client.

Free, but not open source, R Client is built with much of the same code as R Server. It even includes many of the same features, such as the “ScaleR” technology that allows R programs to benefit from multicore architectures, although they’re only available here in a limited form.

Put another way, Microsoft R Client is to R Server as SQL Server Express is to SQL Server Enterprise. Users can get a taste of what’s possible in the full-blown product without shelling out a ton of cash, even if it’s no substitute for the original.

R Client also leverages Microsoft R Open, Microsoft’s enhanced distribution of R, formerly Revolution R Open before Microsoft acquired its creator, Revolution Analytics. R Open lets users do most everything they could in a standard R environment, such as using the plethora of open source R packages out there, plus any packages based on the ScaleR extensions.

Spark users want convenience in the cloud — here are new ways they may get it

Over the course of the last couple of years, Apache Spark has enjoyed explosive growth in both usage and mind share. These days, any self-respecting big data offering is obliged to either connect to or make use of it.

Now comes the hard part: Turning Spark into a commodity. More than that, it has to live up to its promise of being the most convenient, versatile, and fast-moving data processing framework around.

There are two obvious ways to do that in this cloud-centric world: Host Spark as a service or build connectivity to Spark into an existing service. Several such approaches were unveiled this week at Spark Summit 2016, and they say as much about the companies offering them as they do about Spark’s meteoric ascent.

Microsoft

Microsoft has pinned a growing share of its future on the success of Azure, and in turn on the success of Azure’s roster of big data tools. Therefore, Spark has been made a first-class citizen in Power BI, Azure HDInsight, and the Azure-hosted R Server.

Twitter open-sources Heron for real-time stream analytics

Heron, the real-time stream-processing system Twitter devised as a replacement for Apache Storm, is finally being open-sourced after powering Twitter for more than two years.

Twitter explained in a blog post that it created Heron because it needed more than speed and scale from its real-time stream processing framework. The company also needed easier debugging, easier deployment and management capabilities, and the ability to work well in a shared, multitenant cluster environment.

Apache Storm was the original solution to Twitter’s problems. Storm was created by a marketing intelligence company called BackType, which Twitter bought in 2011; Twitter eventually open-sourced Storm, donating it to the Apache Foundation.

There’s no question Storm has a lot of advantages. It’s scalable and fault-tolerant, with a decent ecosystem of “spouts,” or connectors for receiving data from established sources. But it was also reputedly hard to work with and hard to get good results from, and despite a recent 1.0 renovation, it’s been challenged by other projects, including Apache Spark and its revamped streaming framework.