Google Cloud Machine Learning hits public beta, with additions

Google today unveiled machine learning-related additions to its cloud platform, both to enrich its own cloud-based offerings and to give businesses expanded toolsets for developing their own machine learning-powered products.

The most prominent offering was the public beta of Google Cloud Machine Learning, a platform for building and training machine learning models with the TensorFlow framework, using data stored in the BigQuery and Cloud Storage back ends.

Google says its system simplifies the whole process of creating and deploying machine learning back ends for apps. Some of this is simply by making models faster to train. Google claims Cloud Machine Learning’s distributed training “can train models on terabytes of data within hours, instead of waiting for days.”

Much of it, however, is about Cloud Machine Learning’s APIs reducing the amount of programming required to build useful things. In a live demo, Google built and demonstrated a five-layer neural net for stock market analysis with just a few lines of code.
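The demo’s actual code and data weren’t published, but a network of that shape does take only a few lines in TensorFlow’s Python API. As a rough illustration, here is a minimal sketch using the 1.x-era API, with hypothetical feature counts and layer sizes:

    import tensorflow as tf  # TensorFlow 1.x-era API

    # Hypothetical dimensions; the demo's real architecture wasn't published.
    n_features = 20                    # e.g., engineered market indicators
    hidden_sizes = [64, 64, 32, 16]    # four hidden layers; output below makes five

    x = tf.placeholder(tf.float32, [None, n_features])
    y = tf.placeholder(tf.float32, [None, 1])

    h = x
    for size in hidden_sizes:
        h = tf.layers.dense(h, size, activation=tf.nn.relu)
    prediction = tf.layers.dense(h, 1)  # linear output for a regression target

    loss = tf.reduce_mean(tf.square(prediction - y))
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)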

Another announced feature, HyperTune, removes a common source of drudgery in building machine learning models: tuning parameters to yield the best results. Google claims HyperTune “automatically improves predictive accuracy” by automating that step.
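HyperTune itself is configured through Cloud Machine Learning’s job settings rather than application code, but the step it automates is easy to picture. The sketch below shows the manual version of that drudgery, a plain search loop; every name in it, including the train_and_evaluate() stub, is hypothetical:

    import random

    def train_and_evaluate(learning_rate, hidden_units):
        """Hypothetical stand-in: train a model, return a validation score."""
        return random.random()  # placeholder so the sketch runs end to end

    search_space = {
        "learning_rate": [1e-4, 1e-3, 1e-2],
        "hidden_units": [32, 64, 128],
    }

    best_score, best_params = float("-inf"), None
    for _ in range(20):
        params = {name: random.choice(values) for name, values in search_space.items()}
        score = train_and_evaluate(**params)
        if score > best_score:
            best_score, best_params = score, params

    print(best_params, best_score)

A service-side tuner does the same job more intelligently, steering the search toward promising regions rather than sampling at random.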

Google Cloud Machine Learning was previously available only as an alpha-level tech preview, but even at that stage InfoWorld’s Martin Heller was impressed with its pre-trained APIs for artificial vision, speech, natural language, and language translation.
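Those pre-trained APIs are reachable from ordinary client code. As one example, here is a minimal sketch of a label-detection call against the Vision API using Google’s google-cloud-vision Python client; the client surface has shifted between releases, so treat the exact calls as illustrative, and note that valid Google Cloud credentials and a local image file are assumed:

    from google.cloud import vision  # pip install google-cloud-vision

    client = vision.ImageAnnotatorClient()  # picks up application credentials

    with open("photo.jpg", "rb") as f:      # any local image file
        image = vision.Image(content=f.read())

    response = client.label_detection(image=image)
    for label in response.label_annotations:
        print(label.description, label.score)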

Many of the machine learning tools Google now offers to end users, such as TensorFlow, arose from internal work to bolster Google’s own projects. The revamped version of Google’s office applications, G Suite, is one of the latest to be dressed up with machine learning-powered features. Most of these additions automate common busywork, such as finding a free time slot on a calendar for a meeting.

Google’s machine learning offerings pit it against several other big-league cloud vendors offering their own variations on the same themes, from IBM’s Bluemix and Watson services to Microsoft’s Azure Machine Learning. All of them, along with Amazon, Facebook, and others, recently announced the Partnership on AI, an effort to “study and formulate best practices on AI technologies,” although it seems more like a general clearinghouse for public awareness about machine learning than a way for those nominal rivals to collaborate on shared projects.

Source: InfoWorld Big Data

How MIT's C/C++ extension breaks parallel processing bottlenecks

Earlier this week, MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) announced Milk, a system that speeds up parallel processing of big data sets by as much as three or four times.

If you think this involves learning a whole new programming language, breathe easy. Milk is less a radical departure from existing software development than a refinement of an existing set of C/C++ tools.

All together now

According to the paper authored by the CSAIL team, Milk is an extension to the C/C++ language family that addresses the memory bottlenecks plaguing big data applications. Apps that run in parallel contend with one another for memory access, so any gains from parallel processing are offset by the time spent waiting for memory.

Milk solves these problems by extending OpenMP, an API widely used in C/C++ programming to parallelize work across shared memory. Programmers typically use OpenMP by annotating sections of their code with compiler directives (“pragmas”) that invoke OpenMP extensions, and Milk works the same way. Its directives are syntactically similar, and in some cases they’re minor variants of existing OpenMP pragmas, so existing OpenMP apps don’t have to be heavily reworked to be sped up.

Milk’s big advantage is that it performs what the paper’s authors describe as “DRAM-conscious clustering.” Since data shuttled from memory is cached locally on the CPU, batching together data requests from multiple processes allows the on-CPU cache to be shared more evenly between them.
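Milk does this work inside the compiler and runtime, but the underlying idea can be sketched in a few lines of plain Python: reorder indirect accesses so that neighboring addresses are touched together, then restore the original order. This is a conceptual toy, not Milk’s actual mechanism:

    import random

    table = list(range(1_000_000))     # stand-in for a large in-memory table
    requests = [random.randrange(len(table)) for _ in range(100_000)]

    # Naive order: each lookup may land on a cold cache line.
    naive = [table[i] for i in requests]

    # Clustered order: visit the indices sorted by address so consecutive
    # lookups touch neighboring memory, then put results back in request order.
    order = sorted(range(len(requests)), key=lambda k: requests[k])
    clustered = [None] * len(requests)
    for k in order:
        clustered[k] = table[requests[k]]

    assert naive == clustered          # same answers, friendlier access pattern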

The most advanced use of Milk requires using some functions exposed by the library — in other words, some rewriting — but it’s clearly possible to get some results right away by simply decorating existing code.

Let’s not throw all this out yet

As CPU speeds top out, attention has turned to other ways to ramp up processing power. The most direct option is to scale out: spreading workloads across multiple cores on a single chip, across multiple CPUs, or throughout a cluster of machines. While a plethora of tools exists to spread workloads in these ways, most of the languages used with them weren’t designed with parallelism in mind. Hence the creation of languages like Pony, which provide a fresh set of metaphors for programming in such environments.

Another approach has been to work around the memory-to-CPU bottleneck by moving more of the processing to where the data already resides. Example: the MapD database, which uses GPUs and their local memory for both accelerated processing and distributed data caching.

Each of these approaches has its downsides. With new languages, there’s the pain of scrapping existing workflows and toolchains, some of which have decades of work behind them. Using GPUs has some of the same problems: Shifting workloads to a GPU is easy only if the existing work is abstracted away through a toolkit that can be made GPU-aware. Otherwise, you’re back to rewriting everything from scratch.

A project like Milk, on the other hand, adds a substantial improvement to a tool set that’s already widely used and well understood. It’s always easier to transform existing work than to tear it down and start over, so Milk provides a way to squeeze more out of what we already have.

Source: InfoWorld Big Data

Teradata expands analytics for hybrid cloud

Analytics solutions provider Teradata has released new hybrid-cloud management capabilities to counter rising pressure from both open source and commercial competitors.

Teradata Everywhere expands support for existing cloud-hosted Teradata solutions and adds new hybrid and cross-cloud orchestration components that make it possible to manage Teradata instances across “on-premises appliances, on-premises virtualization environments, managed cloud, and public cloud,” according to the company’s announcement.

Teradata was previously available on Amazon Web Services, but the latest iteration provides up to 32 nodes per instance and conveniences like automated backup functionality. Later this year, Microsoft Azure is set to start running this iteration of Teradata, as are VMware environments, Teradata’s own Managed Cloud in Europe (Germany), and Teradata’s on-premises IntelliFlex platform. (Google Compute Engine support was not among the environments mentioned in the announcement.)

Other improvements in the works, but not slated to debut until next year, are features to allow expansion and rebalancing of data loads between Teradata instances without major downtime and a new “in-stream query re-planning” system designed to optimize queries as they are being executed.

Teradata’s plans involve more than providing a way to run cloud-hosted instances of its database on the infrastructure of one’s choice. Rather, the company says it hopes to make Teradata as “borderless,” or hybrid, as possible. Teradata QueryGrid and Teradata Unity are being revised to better support this goal.

One key change — managing Teradata instances across environments — is available now. But many of the others — for example, automatic capture and replay of selected data changes between Teradata systems or one-click database initialization across systems — are projected to be ready in 2017.

Though powerful, Teradata is facing stiffer competition. After Hadoop came to prominence as a commodity open source data-analysis solution, Teradata made use of it as a data source by way of the commercial MapR distribution.

Cloud services such as Amazon Redshift and Microsoft’s Azure SQL also offer data warehousing. Azure SQL has been enhanced by changes to SQL Server that encourage the bursting-to-the-cloud expansion Teradata is now promising. There’s also pressure from new kinds of dedicated cloud services, such as Snowflake, which promises maximum performance with minimal oversight.

Source: InfoWorld Big Data

Tableau 10 gives users the features they want

Tableau wanted the 10.0 edition of its flagship visual analytics and BI tool to be truly worthy of another digit in front of the decimal point.

The self-dubbed “Google of data visualization” shaped version 10.0 based on feedback from its user community, and paired that with an all-new visual design intended to make Tableau both more attractive and easier to use.

An eyeful of info

The most prominent outward feature of Tableau 10 is a redesigned interface that’s more in line with a Bootstrap-powered website than the legacy toolbar-and-panel desktop-app look found in previous versions of the app.

Tableau 10’s refined interface hews closer to a Bootstrap-powered web app than a desktop app — both for the sake of consistency across devices and the aesthetic appeal of such a look.

That mobile-esque look is no accident. Tableau has a habit of leveraging trends in mobile devices and web interfaces, such as its Vizable app for data exploration. The company has also provided a tool for designing dashboards that are useful on both desktops and mobile devices, and has commissioned custom-designed fonts to make data legends more legible.

Big data brawlers: 4 challengers to Spark

Big (and even not so big) data hasn’t been the same since Apache Spark made inroads with developers and became a staple ingredient in big data clouds.

But Spark is far from perfect. It’s certainly improving, as version 2.0 shows, but if a competitor offers a better handle on what Spark does and more, developers will pay attention.

Here are four projects emerging as possible competition for Spark, each taking a new approach to the conventional in-memory batch processing Spark is famous for, to the streaming Spark continues to work on, or to both.

Apache Apex

What it is: Originally created by DataTorrent, Apex has since been donated to the Apache Foundation. It performs both stream and batch processing on Hadoop under YARN.

Spark 2.0 takes an all-in-one approach to big data

Apache Spark, the in-memory processing system that’s fast become a centerpiece of modern big data frameworks, has officially released its long-awaited version 2.0.

Aside from some major usability and performance improvements, Spark 2.0’s mission is to become a total solution for streaming and real-time data. This comes as a number of other projects — including others from the Apache Foundation — provide their own ways to boost real-time and in-memory processing.
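The centerpiece of that streaming push is Structured Streaming, introduced (as an experimental feature) in 2.0, which lets a streaming job be written with the same DataFrame calls as a batch query. A minimal PySpark sketch, following the word-count pattern from Spark’s documentation (the socket source, host, and port are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("streaming-counts").getOrCreate()

    # Read a stream of text lines from a TCP socket, then run an
    # ordinary DataFrame aggregation over it.
    lines = (spark.readStream
                  .format("socket")
                  .option("host", "localhost")
                  .option("port", 9999)
                  .load())

    counts = lines.groupBy("value").count()

    # Continuously print the running counts to the console.
    query = (counts.writeStream
                   .outputMode("complete")
                   .format("console")
                   .start())
    query.awaitTermination()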

Easier on top, faster underneath

Most of Spark 2.0’s big changes have been known well in advance, which has made them even more hotly anticipated.

One of the largest and most technologically ambitious additions is Project Tungsten, a reworking of Spark’s approach to memory management and code generation. Pieces of Project Tungsten have shown up in earlier releases, but 2.0 adds more, such as applying Tungsten’s memory management to both caching and runtime execution.

Spark-powered Splice Machine goes open source

Splice Machine, the relational SQL database system that uses Hadoop and Spark to provide high-speed results, is now available in an open source edition.

Version 2.0 of Splice Machine added Spark to speed up OLAP-style workloads while still processing conventional OLTP workloads with HBase. The open source version, distributed under the Apache 2.0 license, supplies both engines and most of Splice Machine’s other features, including Apache Kafka streaming support. However, it omits a few enterprise-level options like encryption, Kerberos support, column-level access control, and backup/restore functionality.

Splice Machine is going open source for two reasons. The first is to get the database into the hands of developers, letting them migrate data to it, test it on their own hardware or in the cloud, then upgrade to the full version if it fits the bill. The second, as with any open source project, is to let those developers contribute back if they’re so inclined.

The first motive is more relevant here. Originally, Splice Machine was offered in a free-to-use edition minus some enterprise features. The open source version provides a less ambiguous way to offer a freebie, since there’s less fear a user will casually violate the license agreement by enabling the wrong item (see: Oracle). Going open source also helps deflect criticism of Splice Machine as a proprietary black box, which InfoWorld’s Andy Oliver hinted at in his original 2014 discussion of the database.

Microsoft R Client provides a free taste of R Server

If you’ve been hungering to use the advanced number-crunching technology in Microsoft’s R Server product but feared its price tag, Microsoft has a partial answer: Microsoft R Client.

Free, but not open source, R Client is built with much of the same code as R Server. It even includes many of the same features, such as the “ScaleR” technology that allows R programs to benefit from multicore architectures, although they’re only available here in a limited form.

Put another way, Microsoft R Client is to R Server as SQL Server Express is to SQL Server Enterprise. Users can get a taste of what’s possible in the full-blown product without shelling out a ton of cash, even if it’s no substitute for the original.

R Client also leverages Microsoft R Open, Microsoft’s enhanced distribution of R, formerly Revolution R Open before Microsoft acquired its creator, Revolution Analytics. R Open lets users do most everything they could in a standard R environment, such as using the plethora of open source R packages out there, plus any packages based on the ScaleR extensions.

Spark users want convenience in the cloud — here are new ways they may get it

Over the course of the last couple of years, Apache Spark has enjoyed explosive growth in both usage and mind share. These days, any self-respecting big data offering is obliged to either connect to or make use of it.

Now comes the hard part: Turning Spark into a commodity. More than that, it has to live up to its promise of being the most convenient, versatile, and fast-moving data processing framework around.

There are two obvious ways to do that in this cloud-centric world: Host Spark as a service or build connectivity to Spark into an existing service. Several such approaches were unveiled this week at Spark Summit 2016, and they say as much about the companies offering them as they do about Spark’s meteoric ascent.

Microsoft

Microsoft has pinned a growing share of its future on the success of Azure, and in turn on the success of Azure’s roster of big data tools. Therefore, Spark has been made a first-class citizen in Power BI, Azure HDInsight, and the Azure-hosted R Server.

Twitter open-sources Heron for real-time stream analytics

Heron, the real-time stream-processing system Twitter devised as a replacement for Apache Storm, is finally being open-sourced after powering Twitter for more than two years.

Twitter explained in a blog post that it created Heron because it needed more than speed and scale from its real-time stream processing framework. The company also needed easier debugging, easier deployment and management capabilities, and the ability to work well in a shared, multitenant cluster environment.

Apache Storm was the original solution to Twitter’s problems. Storm was created by a marketing intelligence company called BackType, which Twitter bought in 2011; Twitter eventually open-sourced Storm, donating it to the Apache Foundation.

There’s no question Storm has a lot of advantages. It’s scalable and fault-tolerant, with a decent ecosystem of “spouts,” or connectors for receiving data from established sources. But it was also reputedly hard to work with and hard to get good results from, and despite a recent 1.0 renovation, it’s been challenged by other projects, including Apache Spark and its revamped streaming framework.