Azure Databricks: Fast analytics in the cloud with Apache Spark

We’re living in a world of big data. The current generation of line-of-business computer systems generates terabytes of data every year, tracking sales and production through CRM and ERP. It’s a flood of data that’s only going to get bigger as we add the sensors of the industrial internet of things, and the data that’s needed to deliver even the simplest predictive-maintenance systems.

Having that data is one thing; using it is another. Big data is often unstructured, spread across many servers and databases. You need something to bring it together. That’s where big data analysis tools like Apache Spark come into play; these distributed analytical tools work across clusters of computers. Building on techniques developed for the MapReduce algorithms used by tools like Hadoop, today’s big data analysis tools go further to support more database-like behavior, working with in-memory data at scale, using loops to speed up queries, and providing a foundation for machine learning systems.

Apache Spark is fast, but Databricks is faster. Founded by the Spark team, Databricks is a cloud-optimized version of Spark that takes advantage of public cloud services to scale rapidly and uses cloud storage to host its data. It also offers tools to make it easier to explore your data, using the notebook model popularized by tools like Jupyter Notebooks.

Microsoft’s new support for Databricks on Azure—called Azure Databricks—signals a new direction of its cloud services, bringing Databricks in as a partner rather than through an acquisition.

Although you’ve always been able to install Spark or Databricks on Azure, Azure Databricks makes it a one-click experience, driving the setup process from the Azure Portal. You can host multiple analytical clusters, using autoscaling to minimize the resources in use. You can clone and edit clusters, tuning them for specific jobs or running different analyses on the same underlying data.

Configuring the Azure Databricks virtual appliance

The heart of Microsoft’s new service is a managed Databricks virtual appliance built using containers running on Azure Container Services. You choose the number of VMs in each cluster it controls, and once it’s configured and running, the service handles load automatically, adding new VMs to handle scaling.

Databricks’ tools interact directly with the Azure Resource Manager, which adds a security group and a dedicated storage account and virtual network to your Azure subscription. It lets you use any class of Azure VM for your Databricks cluster, so if you’re planning on using it to train machine learning systems, you’ll want to choose one of the latest GPU-based VMs. And of course, if one VM model isn’t right for your problem, you can switch it out for another. All you need to do is clone a cluster and change the VM definitions.

Querying in Spark brings engineering to data science

Spark has its own query language based on SQL, which works with Spark DataFrames to handle both structured and unstructured data. DataFrames are the equivalent of a relational table, constructed on top of collections of distributed data in different stores. Using named columns, you can construct and manipulate DataFrames with languages like R and Python; thus, both developers and data scientists can take advantage of them.

The DataFrame API is essentially a domain-specific language for your data, one that extends the data analysis features of your chosen platform. By using familiar libraries with DataFrames, you can construct complex queries that take data from multiple sources, working across columns.

Because Azure Databricks is inherently data-parallel, and its queries are evaluated only when called to deliver actions, results can be delivered very quickly. Because Spark supports most common data sources, either natively or through extensions, you can add Azure Databricks DataFrames and queries to existing data relatively easily, reducing the need to migrate data to take advantage of its capabilities.
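The deferred-execution model described above can be sketched in plain Python. This is a conceptual illustration of how Spark separates transformations (which only build a query plan) from actions (which trigger work); the `LazyFrame` class and its methods are invented for the example and are not the PySpark API:

```python
class LazyFrame:
    """Toy stand-in for a Spark DataFrame: transformations record a
    plan over named columns; nothing runs until an action is called."""

    def __init__(self, rows, plan=None):
        self._rows = rows          # source data (stands in for a distributed store)
        self._plan = plan or []    # recorded transformations, not yet executed

    def filter(self, predicate):
        # Transformation: record the step and return a new frame; no work done.
        return LazyFrame(self._rows, self._plan + [("filter", predicate)])

    def select(self, *columns):
        # Transformation: project named columns, again only recorded.
        return LazyFrame(self._rows, self._plan + [("select", columns)])

    def collect(self):
        # Action: only now is the accumulated plan applied to the data.
        rows = self._rows
        for op, arg in self._plan:
            if op == "filter":
                rows = [r for r in rows if arg(r)]
            elif op == "select":
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows


sales = LazyFrame([
    {"region": "EMEA", "amount": 120},
    {"region": "APAC", "amount": 80},
    {"region": "EMEA", "amount": 200},
])

# Building the query is instant; no data is touched until collect().
query = sales.filter(lambda r: r["region"] == "EMEA").select("amount")
print(query.collect())  # [{'amount': 120}, {'amount': 200}]
```

In real Spark, deferring execution this way lets the engine optimize the whole plan and parallelize it across a cluster before any data is read, which is why results can arrive so quickly.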

Although Azure Databricks provides a high-speed analytics layer across multiple sources, it’s also a useful tool for data scientists and developers trying to build and explore new models, turning data science into data engineering. Using Databricks Notebooks, you can develop scratchpad views of your data, with code and results in a single view.

The resulting notebooks are shared resources, so anyone can use them to explore their data and try out new queries. Once a query is tested and turned into a regular job, its output can be exposed as an element of a Power BI dashboard, making Azure Databricks part of an end-to-end data architecture that allows more complex reporting than a simple SQL or NoSQL service—or even Hadoop.

Microsoft plus Databricks: a new model for Azure Services

Microsoft hasn’t yet detailed its pricing for Azure Databricks, but it does claim that it can improve performance and reduce cost by as much as 99 percent compared to running your own unmanaged Spark installation on Azure’s infrastructure services. If that claim is borne out, it promises significant savings, especially when you factor in no longer having to run your own Spark infrastructure.

Azure’s Databricks service will connect directly to Azure storage services, including Azure Data Lake, with optimizations for queries and caching. There’s also the option of using it with Cosmos DB, so you can take advantage of global data sources and a range of NoSQL data models, including MongoDB and Cassandra compatibility—as well as Cosmos DB’s graph APIs. It should also work well with Azure’s data-streaming tools, giving you a new option for near real-time IoT analytics.

If you’re already using Databricks’ Spark tools, this new service won’t affect you or your relationship with Databricks. It’s only if you take the models and analytics you’ve developed on-premises to Azure’s cloud that you’ll get a billing relationship with Microsoft. You’ll also have fewer management tasks, leaving you more time to work with your data.

Microsoft’s decision to work with an expert partner on a new service makes a lot of sense. Databricks has the expertise, and Microsoft has the platform. If the resulting service is successful, it could set a new pattern for how Azure evolves in the future, building on what businesses are already using and making them part of the Azure hybrid cloud without absorbing those services into Microsoft.

Source: InfoWorld Big Data

Microsoft’s R tools bring data science to the masses

One of Microsoft’s more interesting recent acquisitions was Revolution Analytics, a company that built tools for working with big data problems using the open source statistical programming language R. Mixing an open source model with commercial tools, Revolution Analytics offered a range of tools supporting academic and personal use, alongside software that took advantage of massive amounts of data—including Hadoop. Under Microsoft’s stewardship, the now-renamed R Server has become a bridge between on-premises and cloud data.

Two years on, Microsoft has announced a set of major updates to its R tools. The R programming language has become an important part of its data strategy, with support in Azure and SQL Server—and, more important, in its Azure Machine Learning service, where it can be used to preprocess data before delivering it to a machine learning pipeline. It’s also one of Microsoft’s key cross-platform server products, with versions for both Red Hat Linux and Suse Linux.

R is everywhere in Microsoft’s ecosystem

Outside of Microsoft, the open source R has become a key tool for data science, with a lot of support in academic environments. (It currently ranks fifth among all programming languages, according to IEEE Spectrum.) You don’t need to be a statistical expert to get started with R, because the Comprehensive R Archive Network (CRAN, a public library of R applications) now has more than 9,000 statistical modules and algorithms you can use with your data.

Microsoft’s vision for R is one that crosses the boundaries between desktop, on-premises servers, and the cloud. Locally, there’s a free R development client, as well as R support in Microsoft’s (paid) flagship Visual Studio development environment. On-premises, R Server runs on Windows and Linux, as well as inside SQL Server, giving you access to statistical analysis tools alongside your data. Local big data services based on Hadoop and Spark are also supported, while on Azure you can run R Server alongside Microsoft’s HDInsight services.

R is a tool for data scientists. Although the R language is relatively simple, you need a deep knowledge of statistical analytics to get the most from it. It’s been a long while since I took college-level statistics classes, so I found getting started with R complex because many of the underlying concepts require graduate-level understanding of complex statistical functions. The question isn’t so much whether you can write R code—it’s whether you can understand the results you’re getting.

That’s probably the biggest issue facing any organization that wants to work with big data: getting the skills needed to produce the analysis you want and, more important, to interpret the results you get. R certainly helps here, with built-in graphing tools that help you visualize key statistical measures.

Working with Microsoft R Server

The free Microsoft R Open can help your analytics team get up to speed with R before investing in any of the server products. It’s also a useful tool for quickly trying out new analytical algorithms and exploring the questions you want answered using your data. That approach works well as part of an overall analytics lifecycle, starting with data preparation, moving on to model development, and finally turning the model into tools that can be built into your business applications.

One interesting role for R is alongside GPU-based machine-learning tools. Here, R is employed to help train models before they’re used at scale. Microsoft is bundling its own machine learning algorithms with the latest R Server release, so you can test a model before uploading it to either a local big data instance or to the cloud. During a recent press event, Microsoft demonstrated this approach with astronomy images, training a machine-learning-based classifier on a local server with a library of galaxies before running the resulting model on cloud-hosted GPUs.

R is an extremely portable language, designed to work over discrete samples of data. That makes it very scalable and ideal for data-parallel problems. The same R model can be run on multiple servers, so it’s simple to quickly process large amounts of data. All you need to do is parcel out your data appropriately, then deliver it to your various R Server instances. Similarly, the same code can run on different implementations, so a model built and tested against local data sources can be deployed inside a SQL Server database and run against a Hadoop data lake.
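The parcel-out-and-score pattern described above can be sketched in Python. This is a stand-in for shipping the same model to multiple R Server instances: the scoring function here is a hypothetical model invented for the example, and the thread pool stands in for a fleet of servers:

```python
from concurrent.futures import ThreadPoolExecutor


def score_partition(rows):
    """Hypothetical 'model': flag rows whose value exceeds a threshold.
    In the scenario above, this is the same model code deployed to
    every server instance."""
    return [{"id": r["id"], "flagged": r["value"] > 100} for r in rows]


def partition(data, n):
    """Parcel the data out into n roughly equal chunks."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]


data = [{"id": i, "value": v} for i, v in enumerate([40, 150, 90, 210])]

# Run identical scoring code over every partition in parallel; because
# the partitions are independent, the same approach scales out to many
# machines without changing the model.
with ThreadPoolExecutor() as pool:
    results = [row
               for part in pool.map(score_partition, partition(data, 2))
               for row in part]

print([r["id"] for r in results if r["flagged"]])  # [1, 3]
```

The key property is that the scoring function never sees more than its own chunk, which is what makes the same code portable from a laptop to a SQL Server database or a Hadoop data lake.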

R makes operational data models easy

Thus, R is very easy to operationalize. Your data science team can work on building the model you need, while your developers write applications and build infrastructures that can take advantage of their code. Once it’s ready, the model can be quickly deployed, and it can even be swapped out for improved models in the future without affecting the rest of the application. Likewise, a single model can be used in different applications, working with the same data.

With a common model, your internal dashboards can show you the same answers as customer- and consumer-facing code. You can then use data to respond proactively—for example, providing delay and rebooking information to airline passengers when a model predicts weather delays. That model can be refined as you get more data, reducing the risks of false positives and false negatives.

Building R support into SQL Server makes a lot of sense. As Microsoft’s database platform becomes a bridge between on-premises data and the cloud, as well as between your systems of record and big data tools, having fine-grained analytics tools in your database is a no-brainer. A simple utility takes your R models and turns them into stored procedures, ready for use inside your SQL applications. Database developers can work with data analytics teams to implement those models, and they don’t need to learn any new skills to build them into their applications.

Microsoft is aware that not every enterprise needs or has the budget to employ data scientists. If you’re dealing with common analytics problems, like trying to predict customer churn or detecting fraud in an online store, you have the option of working with a range of predefined templates for SQL Server’s R Services that contain ready-to-use models. Available from Microsoft’s MSDN, they’re fully customizable in any R-compatible IDE, and you can deploy them with a PowerShell script.

Source: InfoWorld Big Data