Spark picks up machine learning, GPU acceleration

Databricks, corporate provider of support and development for the Apache Spark in-memory big data project, has spiced up its cloud-based implementation of Apache Spark with two additions that top IT’s current hot list.

The new features — GPU acceleration and integration with numerous deep learning libraries — can in theory be implemented in any local Apache Spark installation. But Databricks says its versions are tuned to avoid the resource contentions that complicate the use of such features.

Apache Spark isn’t configured out of the box to provide GPU acceleration, and to set up a system to support it, users must cobble together several pieces. To that end, Databricks offers to handle all the heavy lifting.

Databricks also claims that Spark’s behaviors are tuned to get the most out of a GPU cluster by reducing contention across nodes. This seems similar to the strategy used by MIT’s Milk library to accelerate parallel processing applications, wherein memory operations are batched to take maximum advantage of a system’s cache line. Likewise, Databricks’ setup tries to keep GPU operations from interrupting each other.

Another time-saver is direct access to popular machine learning libraries that can use Spark as a data source. Among them is Databricks’ TensorFrames, which allows the TensorFlow library to work with Spark and is GPU-enabled.

Databricks has tweaked its infrastructure to get the most out of Spark. It created a free tier of service to attract customers still wary of deep commitment, providing them with a subset of the conveniences available in the full-blown product. InfoWorld’s Martin Heller checked out the service earlier this year and liked what he saw, precisely because it was free to jump into and easy to get started.

But competition will be fierce, especially since Databricks faces brand-name juggernauts like Microsoft (via Azure Machine Learning), IBM, and Amazon. Thus, it has to find ways to both keep and expand an audience for a service as specific and focused as its own. The plan appears to involve not only adding features like machine learning and GPU acceleration to the mix, but ensuring they bring convenience, not complexity.

Source: InfoWorld Big Data

Fire up big data processing with Apache Ignite

Apache Ignite is an in-memory computing platform that can be inserted seamlessly between a user’s application layer and data layer. Apache Ignite loads data from the existing disk-based storage layer into RAM, improving performance by as much as six orders of magnitude (1 million-fold).

The in-memory data capacity can be easily scaled to handle petabytes of data simply by adding more nodes to the cluster. Further, both ACID transactions and SQL queries are supported. Ignite delivers performance, scale, and comprehensive capabilities far above and beyond what traditional in-memory databases, in-memory data grids, and other in-memory-based point solutions can offer by themselves.

Apache Ignite does not require users to rip and replace their existing databases. It works with RDBMS, NoSQL, and Hadoop data stores. Apache Ignite enables high-performance transactions, real-time streaming, and fast analytics in a single, comprehensive data access and processing layer. It uses a distributed, massively parallel architecture on affordable, commodity hardware to power existing or new applications. Apache Ignite can run on premises, on cloud platforms such as AWS and Microsoft Azure, or in a hybrid environment.

[Figure: Apache Ignite architecture]

The Apache Ignite unified API supports SQL, C++, .Net, Java, Scala, Groovy, PHP, and Node.js. The unified API connects cloud-scale applications with multiple data stores containing structured, semistructured, and unstructured data. It offers a high-performance data environment that allows companies to process full ACID transactions and generate valuable insights from real-time, interactive, and batch queries.

Users can keep their existing RDBMS in place and deploy Apache Ignite as a layer between it and the application layer. Apache Ignite automatically integrates with Oracle, MySQL, Postgres, DB2, Microsoft SQL Server, and other RDBMSes. The system automatically generates the application domain model based on the schema definition of the underlying database, then loads the data. In-memory databases typically provide only a SQL interface, whereas Ignite supports a wider group of access and processing paradigms in addition to ANSI SQL. Apache Ignite supports key/value stores, SQL access, MapReduce, HPC/MPP processing, streaming/CEP processing, clustering, and Hadoop acceleration in a single integrated in-memory computing platform.
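
As a rough illustration of this layering, the following Java sketch configures an Ignite cache for read-through and write-through caching over an existing database. It is a minimal example, not Ignite's automatic schema integration; the Person and PersonStore classes are hypothetical placeholders, and a real store would issue JDBC calls against the underlying RDBMS.

import javax.cache.Cache;
import javax.cache.configuration.FactoryBuilder;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.store.CacheStoreAdapter;
import org.apache.ignite.configuration.CacheConfiguration;

public class CacheLayerSketch {
    // Minimal value type for the example.
    static class Person implements java.io.Serializable {
        final long id; final String name;
        Person(long id, String name) { this.id = id; this.name = name; }
    }

    // Hypothetical store: in a real deployment, load/write/delete would run
    // JDBC statements against the existing RDBMS.
    public static class PersonStore extends CacheStoreAdapter<Long, Person> {
        @Override public Person load(Long key) { return null; /* SELECT ... WHERE id = ? */ }
        @Override public void write(Cache.Entry<? extends Long, ? extends Person> e) { /* INSERT or UPDATE */ }
        @Override public void delete(Object key) { /* DELETE ... WHERE id = ? */ }
    }

    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            CacheConfiguration<Long, Person> cfg = new CacheConfiguration<>("personCache");
            cfg.setReadThrough(true);   // cache misses are loaded from the store
            cfg.setWriteThrough(true);  // cache updates are written back to the store
            cfg.setCacheStoreFactory(FactoryBuilder.factoryOf(PersonStore.class));

            IgniteCache<Long, Person> cache = ignite.getOrCreateCache(cfg);
            cache.put(1L, new Person(1L, "Alice"));                    // persisted via PersonStore.write()
            System.out.println("Loaded from store: " + cache.get(2L)); // miss: loaded via PersonStore.load()
        }
    }
}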

GridGain Systems donated the original code for Apache Ignite to the Apache Software Foundation in the second half of 2014. Apache Ignite was rapidly promoted from an incubating project to a top-level Apache project in 2015. In the second quarter of 2016, Apache Ignite was downloaded nearly 200,000 times. It is used by organizations around the world.

Architecture

Apache Ignite is JVM-based distributed middleware built on a homogeneous cluster topology that does not require separate server and client nodes. All nodes in an Ignite cluster are equal, and any node can play any logical role the application requires at runtime.

A service provider interface (SPI) design is at the core of Apache Ignite. The SPI-based design makes every internal component of Ignite fully customizable and pluggable. This enables tremendous configurability of the system, with adaptability to any existing or future server infrastructure.

Apache Ignite also provides direct support for parallelization of distributed computations based on fork-join, MapReduce, or MPP-style processing. Ignite uses distributed parallel computations extensively, and they are fully exposed at the API level for user-defined functionality.

Key features

In-memory data grid. Apache Ignite includes an in-memory data grid that handles distributed in-memory data management, including ACID transactions, failover, advanced load balancing, and extensive SQL support. The Ignite data grid is a distributed, object-based, ACID transactional, in-memory key-value store. In contrast to traditional database management systems, which utilize disk as their primary storage mechanism, Ignite stores data in memory. By utilizing memory rather than disk, Apache Ignite is up to 1 million times faster than traditional databases.

[Figure: Apache Ignite data grid]
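
A minimal Java sketch of the data grid's key-value API, assuming a single node started with the default configuration and a hypothetical cache name:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class DataGridExample {
    public static void main(String[] args) {
        // Start an Ignite node with the default configuration.
        try (Ignite ignite = Ignition.start()) {
            // Create (or get) a distributed key-value cache.
            IgniteCache<Integer, String> cache = ignite.getOrCreateCache("demoCache");

            cache.put(1, "Hello");
            cache.put(2, "Ignite");

            System.out.println(cache.get(1) + ", " + cache.get(2));
        }
    }
}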

SQL support. Apache Ignite supports free-form ANSI SQL-99 compliant queries with virtually no limitations. Ignite can use any SQL function, aggregation, or grouping, and it supports distributed, noncolocated SQL joins and cross-cache joins. Ignite also supports the concept of field queries to help minimize network and serialization overhead.
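
The following Java sketch shows the flavor of Ignite's SQL support. It assumes a hypothetical Person class annotated so that its fields become queryable SQL columns; the cache name and data are illustrative only.

import java.util.List;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.QueryCursor;
import org.apache.ignite.cache.query.SqlFieldsQuery;
import org.apache.ignite.cache.query.annotations.QuerySqlField;
import org.apache.ignite.configuration.CacheConfiguration;

public class SqlQueryExample {
    // Hypothetical value type; annotated fields become SQL columns, optionally indexed.
    static class Person implements java.io.Serializable {
        @QuerySqlField(index = true) long id;
        @QuerySqlField String name;
        @QuerySqlField(index = true) int age;
        Person(long id, String name, int age) { this.id = id; this.name = name; this.age = age; }
    }

    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            CacheConfiguration<Long, Person> cfg = new CacheConfiguration<>("personCache");
            cfg.setIndexedTypes(Long.class, Person.class);

            IgniteCache<Long, Person> cache = ignite.getOrCreateCache(cfg);
            cache.put(1L, new Person(1L, "Alice", 34));
            cache.put(2L, new Person(2L, "Bob", 27));

            // Distributed SQL: executed across all nodes that hold the cache.
            SqlFieldsQuery qry = new SqlFieldsQuery(
                "SELECT name, age FROM Person WHERE age > ? ORDER BY age").setArgs(30);

            try (QueryCursor<List<?>> cursor = cache.query(qry)) {
                for (List<?> row : cursor)
                    System.out.println(row.get(0) + " is " + row.get(1));
            }
        }
    }
}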

In-memory compute grid. Apache Ignite includes a compute grid that enables parallel, in-memory processing of CPU-intensive or other resource-intensive tasks such as traditional HPC, MPP, fork-join, and MapReduce processing. Support is also provided for standard Java ExecutorService asynchronous processing.

[Figure: Apache Ignite compute grid]
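
A minimal Java sketch of the compute grid, distributing a set of closures across the cluster and reducing the results locally; the character-counting task is illustrative only.

import java.util.ArrayList;
import java.util.Collection;
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.lang.IgniteCallable;

public class ComputeGridExample {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // One closure per word; Ignite load-balances them across cluster nodes.
            Collection<IgniteCallable<Integer>> jobs = new ArrayList<>();
            for (String word : "Count characters with distributed closures".split(" "))
                jobs.add(word::length);

            // Execute in parallel on the compute grid and reduce the results locally.
            int total = ignite.compute().call(jobs).stream().mapToInt(Integer::intValue).sum();

            System.out.println("Total characters: " + total);
        }
    }
}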

In-memory service grid. The Apache Ignite service grid provides complete control over services deployed on the cluster. Users can control how many service instances should be deployed on each cluster node, ensuring proper deployment and fault tolerance. The service grid guarantees continuous availability of all deployed services in case of node failures. It also supports automatic deployment of multiple instances of a service, of a service as a singleton, and of services on node startup.
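
A brief Java sketch of deploying a cluster singleton on the service grid; CounterService is a hypothetical service used only to show the lifecycle hooks.

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.services.Service;
import org.apache.ignite.services.ServiceContext;

public class ServiceGridExample {
    // Hypothetical service; Ignite calls init/execute/cancel over its lifecycle.
    public static class CounterService implements Service {
        @Override public void init(ServiceContext ctx) { System.out.println("Service initialized"); }
        @Override public void execute(ServiceContext ctx) { System.out.println("Service running"); }
        @Override public void cancel(ServiceContext ctx) { System.out.println("Service stopped"); }
    }

    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // Exactly one instance cluster-wide; redeployed automatically if its node fails.
            ignite.services().deployClusterSingleton("counterService", new CounterService());

            // Alternatively, one instance per node:
            // ignite.services().deployNodeSingleton("counterService", new CounterService());
        }
    }
}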

In-memory streaming. In-memory stream processing addresses a large family of applications for which traditional processing methods and disk-based storage, such as disk-based databases or file systems, are inadequate. These applications are extending the limits of traditional data processing infrastructures.

[Figure: Apache Ignite streaming]
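
A minimal Java sketch of ingesting events into an Ignite cache with the data streamer API; the cache name and generated events are hypothetical.

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class StreamingExample {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            ignite.getOrCreateCache("events");

            // The data streamer batches entries and routes them to the nodes that own them,
            // keeping ingestion fast under high event rates.
            try (IgniteDataStreamer<Long, String> streamer = ignite.dataStreamer("events")) {
                streamer.allowOverwrite(true);
                for (long i = 0; i < 1_000_000; i++)
                    streamer.addData(i, "event-" + i);
            } // close() flushes any remaining buffered entries
        }
    }
}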

Streaming support allows users to query rolling windows of incoming data. This enables users to answer questions such as “What are the 10 most popular products over the last hour?” or “What is the average price in a certain product category for the past 12 hours?”

Another common stream processing use case is pipelining a distributed events workflow. As events enter the system at high rates, processing is split into multiple stages, each of which must be properly routed within the cluster. These customizable event workflows support complex event processing (CEP) applications.

In-memory Hadoop acceleration. The Apache Ignite Accelerator for Hadoop enables fast data processing in existing Hadoop environments via the tools and technology an organization is already using.

[Figure: Apache Ignite Hadoop acceleration]

Ignite in-memory Hadoop acceleration is based on the first dual-mode, high-performance in-memory file system that is 100 percent compatible with Hadoop HDFS, along with an in-memory optimized MapReduce implementation. In-memory HDFS and in-memory MapReduce provide easy-to-use extensions to disk-based HDFS and traditional MapReduce. This plug-and-play feature requires minimal to no integration, and it works with any open source or commercial version of Hadoop 1.x or Hadoop 2.x, including Cloudera, Hortonworks, MapR, Apache, Intel, and AWS. The result is up to 100-fold faster performance for MapReduce and Hive jobs.

Distributed in-memory file system. A unique feature of Apache Ignite is the Ignite File System (IGFS), which is a file system interface to in-memory data. IGFS delivers similar functionality to Hadoop HDFS. It includes the ability to create a fully functional file system in memory. IGFS is at the core of the Apache Ignite In-Memory Accelerator for Hadoop.

The data in each file is split into separate blocks and stored in cache. Data in each file can be accessed with a standard Java streaming API. For each part of a file, a developer can calculate an affinity and process the file’s content on the corresponding nodes, avoiding unnecessary network traffic.

Unified API. The Apache Ignite unified API supports a wide variety of common protocols and languages for the application layer to access data, including SQL, Java, C++, .Net, PHP, MapReduce, Scala, Groovy, and Node.js. Ignite supports several protocols for client connectivity to Ignite clusters, including Ignite Native Clients, REST/HTTP, SSL/TLS, and Memcached.

Advanced clustering. Apache Ignite provides one of the most sophisticated clustering technologies on JVMs. Ignite nodes can automatically discover each other, which helps scale the cluster when needed without having to restart the entire cluster. Developers can also take advantage of Ignite’s hybrid cloud support, which allows users to establish connections between private clouds and public clouds such as AWS or Microsoft Azure.

Additional features. Apache Ignite provides high-performance, clusterwide messaging functionality. It allows users to exchange data via publish-subscribe and direct point-to-point communication models.

The distributed events functionality in Ignite allows applications to receive notifications about cache events occurring in a distributed grid environment. Developers can use this functionality to be notified about the execution of remote tasks or any cache data changes within the cluster. Event notifications can be grouped and sent in batches at regular intervals. Batching notifications helps maintain high cache performance and low latency.

Ignite allows for most of the data structures from the java.util.concurrent framework to be used in a distributed fashion. For example, you could add to a double-ended queue (java.util.concurrent.BlockingDeque) on one node and poll it from another node. Or you could have a distributed primary key generator, which would guarantee uniqueness on all nodes.

Ignite distributed data structures include support for these standard Java APIs: Concurrent map, distributed queues and sets, AtomicLong, AtomicSequence, AtomicReference, and CountDownLatch.
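
A short Java sketch of two of these distributed data structures, a queue and an atomic sequence; the names used here are hypothetical.

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteAtomicSequence;
import org.apache.ignite.IgniteQueue;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.CollectionConfiguration;

public class DataStructuresExample {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // Distributed, unbounded queue: items added on one node can be polled from another.
            IgniteQueue<String> queue = ignite.queue("tasks", 0 /* unbounded */, new CollectionConfiguration());
            queue.put("task-1");
            System.out.println(queue.poll());

            // Cluster-wide ID generator: sequence values are unique across all nodes.
            IgniteAtomicSequence ids = ignite.atomicSequence("primaryKeys", 0, true /* create if absent */);
            System.out.println("Next ID: " + ids.incrementAndGet());
        }
    }
}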

Key integrations

Apache Spark. Apache Spark is a fast, general-purpose engine for large-scale data processing. Ignite and Spark are complementary in-memory computing solutions. They can be used together in many instances to achieve superior performance and functionality.

Apache Spark and Apache Ignite address somewhat different use cases and rarely compete for the same task. Some of the key differences are outlined below.

Apache Spark doesn’t provide shared storage, so data from HDFS or other disk storage must be loaded into Spark for processing. State can be passed from Spark job to job only by saving the processed data back into external storage. Ignite can share Spark state directly in memory, without storing the state to disk.

One of the main integrations for Ignite and Spark is the Apache Ignite Shared RDD API. Ignite RDDs are essentially wrappers around Ignite caches that can be deployed directly inside of executing Spark jobs. Ignite RDDs can also be used with the cache-aside pattern, where Ignite clusters are deployed separately from Spark, but still in-memory. The data is still accessed using Spark RDD APIs.

Spark supports a fairly rich SQL syntax, but it doesn’t support data indexing, so it must do full scans all the time. Spark queries may take minutes even on moderately small data sets. Ignite supports SQL indexes, resulting in much faster queries, so using Spark with Ignite can accelerate Spark SQL more than 1,000-fold. The result set returned by Ignite Shared RDDs also conforms to the Spark DataFrame API, so it can be further analyzed using standard Spark DataFrames. Both Spark and Ignite natively integrate with Apache YARN and Apache Mesos, so it’s easier to use them together.
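
A rough Java sketch of the shared RDD pattern follows. It assumes the ignite-spark module is on the classpath and that an Ignite configuration file and cache (configured with Integer as an indexed SQL type) exist; the file path, cache name, and data are hypothetical.

import java.util.Arrays;
import org.apache.ignite.spark.JavaIgniteContext;
import org.apache.ignite.spark.JavaIgniteRDD;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SharedRDDExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ignite-shared-rdd").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Ignite nodes are started (or connected to) from a Spring XML config;
        // "ignite-config.xml" is a hypothetical path.
        JavaIgniteContext<Integer, Integer> ic = new JavaIgniteContext<>(sc, "ignite-config.xml");

        // A shared RDD is a view over an Ignite cache, so its contents outlive this Spark job.
        JavaIgniteRDD<Integer, Integer> sharedRDD = ic.fromCache("sharedNumbers");

        sharedRDD.savePairs(
            sc.parallelize(Arrays.asList(1, 2, 3, 4, 5)).mapToPair(i -> new Tuple2<>(i, i * i)));

        // SQL over the shared cache; with Ignite indexes in place, this avoids
        // the full scans Spark SQL would otherwise perform.
        sharedRDD.sql("SELECT _key, _val FROM Integer WHERE _val > 9").show();

        sc.stop();
    }
}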

When working with files instead of RDDs, it’s still possible to share state between Spark jobs and applications using the Ignite In-Memory File System (IGFS). IGFS implements the Hadoop FileSystem API and can be deployed as a native Hadoop file system, exactly like HDFS. Ignite plugs in natively to any Hadoop or Spark environment. IGFS can be used with zero code changes in plug-and-play fashion.

Apache Cassandra. Apache Cassandra can serve as a high-performance solution for structured queries. But the data in Cassandra should be modeled such that each predefined query results in one row retrieval. Thus, you must know what queries will be required before modeling the data.

While very powerful in certain cases, Cassandra lacks an in-memory option, which can severely limit performance. Cassandra can be useful for OLAP applications but lacks support for transactions, ACID or otherwise, so it is not suited for OLTP. Predefined queries can be efficient with Cassandra, but Cassandra lacks SQL support and does not support joins, aggregations, groupings, or usable indexes. These limitations mean Cassandra cannot support ad hoc queries.

Apache Ignite offers native support for Cassandra. With Ignite, Cassandra users gain very powerful capabilities such as the ability to leverage in-memory computing to reduce query times by 1,000x. They can also leverage ANSI-compliant SQL support to run ad hoc and structured queries against in-memory data using joins, aggregations, groupings, and usable indexes.

Installing Ignite

Despite the breadth of its feature set, Apache Ignite is very easy to use and deploy. There are no custom installers. The code base comes as a single Zip file with only one mandatory dependency: ignite-core.jar. All other dependencies, such as integration with Spring for configuration, can be added to the process à la carte. The project is fully Mavenized; it is composed of more than a dozen Maven artifacts that can be imported and used in any combination. Apache Ignite is based on standard Java APIs. For distributed caches and data grid functionality, Apache Ignite implements the JCache (JSR107) standard.
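
Because Ignite implements JCache, a standard JSR 107 client is enough to get started. The following minimal Java sketch assumes only that ignite-core (and the JCache API jar) is on the classpath; the cache name is hypothetical.

import javax.cache.Cache;
import javax.cache.CacheManager;
import javax.cache.Caching;
import javax.cache.configuration.MutableConfiguration;

public class JCacheExample {
    public static void main(String[] args) {
        // With ignite-core on the classpath, Ignite's provider is discovered automatically
        // and the standard JCache (JSR 107) API starts an Ignite node under the covers.
        CacheManager manager = Caching.getCachingProvider().getCacheManager();

        Cache<Integer, String> cache =
            manager.createCache("jcacheDemo", new MutableConfiguration<Integer, String>());

        cache.put(1, "Hello JCache");
        System.out.println(cache.get(1));

        manager.close();
    }
}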

Apache Ignite is a high-performance, distributed in-memory computing platform for large-scale data sets. It offers transactional and analytical applications throughput gains on the order of 1,000 to 1 million times, along with lower latencies than are possible with traditional disk-based or flash technologies. Ignite sits between the application and data layers and does not require ripping and replacing existing RDBMS, NoSQL, or Hadoop data stores.

Apache Ignite combines a set of in-memory computing capabilities in one well-integrated framework: an in-memory data grid, an in-memory compute grid, an in-memory service grid, in-memory stream processing, and in-memory acceleration for Hadoop, Spark, and Cassandra. In combination with traditional or distributed data stores, Apache Ignite holds the key to high-volume transactions, real-time analytics, and the emerging class of hybrid transaction/analytical processing (HTAP) workloads.

Apache Ignite resources and documentation, including white papers, recorded webinars, and code samples, are available on the GridGain website.

Nikita Ivanov is founder and CTO of GridGain Systems, where he has led the development of advanced and distributed in-memory data processing technologies. He has more than 20 years of experience in software application development, building HPC and middleware platforms, and contributing to the efforts of companies including Adaptec, Visa, and BEA Systems.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Source: InfoWorld Big Data

Big data grab: Now they want your car's telemetry

A year ago the management consulting giant McKinsey & Co. predicted that the internet of things (IoT) could unlock $11 trillion in economic value by 2025. It’s a bold claim, particularly given that IoT currently proves more useful in launching massive DDoS attacks than in recognizing that I need to buy more milk.

Now, McKinsey has a new projection. It involves cars, and it declares that data “exhaust” from autos will be worth $750 billion by 2030. The consulting firm even goes so far as to lay out exactly how we can grab that revenue. If only it were as easy to make money off car data — which consumers may not want to share — as it is to prognosticate about it.

Follow these two easy steps

The automotive industry is huge, which is a big reason that Google, Apple, and others have been looking for opportunities to disrupt it in their favor. Given how much time we spend in our cars, particularly in North America, and how much data those cars generate, it’s easy to imagine massive new auto-related businesses built entirely on data. After all, Uber is a giant data-crunching company, not a cab company.

This isn’t simply a market for one Uber to dominate, suggests McKinsey in its new report, “Monetizing Car Data.” As the report authors conclude, the opportunity to monetize car data could be worth $450 billion to $750 billion within the next 13 years.

[Figure: McKinsey auto data]

The hitch is getting there. According to McKinsey’s analysis:

The opportunity for industry players hinges on their ability to 1) quickly build and test car data-driven products and services focused on appealing customer propositions and 2) develop new business models built on technological innovation, advanced capabilities, and partnerships that push the current boundaries of the automotive industry.

Let me paraphrase: $750 billion can be had by anyone who can 1) figure out cool new products that lots of people will want to buy and 2) sell those products in such a manner that people will pay for them. Um, thanks, McKinsey!

What the report doesn’t say is that auto exhaust, the data on which these hypothetical businesses will be based, may be a little more closely guarded than web exhaust.

Ideas are easy, execution is hard

In a rather blithe and generic manner, McKinsey gets one thing right about this new market: “The first challenge on the path towards car data monetization is communicating to the end customers exactly what is in it for them.”

On the web, the value proposition of giving up personal data in exchange for free stuff has simply become part of the furniture. The tech industry has no problem treating consumers as products. Last week, for example, Google (very quietly) changed its ad policies to enable much more invasive tracking of consumer behavior.

Will it be any different in Autopia?

Let’s assume for a minute that it will be. After all, data about where you go and how you drive generally has more serious implications than which websites you visit.

What’s the incentive for consumers to share that data? McKinsey lists a range of reasons, from proactive maintenance to better insurance rates and more. However, these suggestions tend to overlook history: We haven’t generally been willing to pay proactively for security, we don’t like the idea of giving insurance companies the ability to lower our rates through data (because it will more likely result in raising our rates through that same data), and so on.

On the other hand, we may simply not care enough to stop it. The younger the demographic, the less likely it is to be concerned about privacy, the report unsurprisingly finds, while 90 percent of those surveyed by McKinsey are already aware that “data is openly accessible to applications and third parties.” Given that Pandora’s box of data is already open, it’s not surprising that 79 percent of those surveyed are “willing to consciously grant access to their data,” a percentage that has climbed 11 points since 2015.

Yet businesses still need to figure out how to monetize this willingness to trade data for services. Uber has already figured it out and presumably plenty more such companies are waiting to be born. The market for car data will likely be big, but capitalizing on it will plow through consumer privacy in ways hitherto unimagined.

Source: InfoWorld Big Data

IBM Receives Cloud Company Of The Year Award From Frost & Sullivan

IBM has announced that it has received the 2016 Cloud Company of the Year Award from leading independent technology market research firm Frost & Sullivan. The award acknowledges IBM’s market leadership in delivering a complete and fully integrated stack of cloud services, including IaaS, PaaS, and SaaS.

Particularly important to CIOs and IT managers, Frost & Sullivan highlights IBM’s ability to support hybrid environments via the company’s extensive portfolio of connectivity tools and capabilities that allow enterprises to easily create, deploy, and manage a flexible range of applications and microservices.

“This award recognizes the extraordinary range and depth of IBM’s cloud services portfolio,” said Don Boulia, VP of Cloud Strategy and Portfolio Management at IBM. “IBM Cloud provides clients with flexibility and choice when embracing hybrid solutions. They can continue driving value from their existing investments, while also gaining access to public, scalable infrastructure and services across our global footprint of data centers, including IBM Watson.”

According to Lynda Stadtmueller, vice president of cloud services, Stratecast|Frost & Sullivan, “IBM’s cloud platform supports the concept of ‘hybrid integration’ — that is, a hybrid IT environment in which disparate applications and data are linked via a comprehensive integration platform, allowing the apps to share common management functionality and control.” These capabilities enable customers to leverage Watson and analytics functionality made available through application programming interfaces (APIs) on Bluemix.

Stadtmueller also commented on the economic benefits of IBM Cloud, noting, “IBM Cloud offers a price-performance advantage over competitors due to its infrastructure configurations and service parameters—including a bare metal server option; single-tenant (private) compute and storage options; granular capacity selections for processing, memory, and network for public cloud units; and all-included technical support.”

Competitors in the cloud market do not offer such a broad portfolio of integrated infrastructure, software, and platform solutions, according to Frost & Sullivan.

To win the Frost & Sullivan Company of the Year Award a company must demonstrate excellence in growth, innovation, and leadership. This kind of excellence typically translates into superior performance in three key areas: demand generation, brand development, and competitive positioning. These areas serve as the foundation of a company’s future success and prepare it to deliver on the two criteria that define the Company of the Year Award—Visionary Innovation & Performance and Customer Impact.

Source: CloudStrategyMag

TierPoint Joins AWS Partner Network

TierPoint has announced that the company has joined the AWS Partner Network (APN), offering dedicated high-speed connectivity to Amazon Web Services (AWS) via AWS Direct Connect through TierPoint’s Seattle data center. 

Seattle is the first TierPoint site to become an AWS Direct Connect location and the company expects to make it available through additional data centers in the future.

“Customers want more options for quickly and cost-effectively spinning up their workloads,” said Octavio Morales, TierPoint senior vice president of operations. “Through AWS Direct Connect, we can provide our cloud customers with a dedicated network connection that helps reduce costs with enterprise-level reliability and performance.”

Source: CloudStrategyMag

Red Hat And Ericsson Announce Broad Alliance

Red Hat, Inc. and Ericsson have formed a broad alliance to deliver fully open source and production-ready cloud infrastructure, spanning OpenStack, software-defined networking (SDN), and software-defined infrastructure (SDI). Ericsson and Red Hat are working together to enable customers to embrace the opportunity presented by the Internet of Things (IoT), 5G, and other next-generation communications solutions with modern and agile solutions.

Ericsson is a leading provider of hardware, software, and services for the service provider industry and an industry-acknowledged leader in network functions virtualization (NFV). Red Hat leads the technology industry in offering solutions that are open, scalable, flexible, and secure. It is a leader in OpenStack, which has become a go-to platform for telco and enterprise cloud deployments.

The companies have long worked together to bring Red Hat Enterprise Linux and Red Hat JBoss Middleware to Ericsson customers. Today the companies are expanding the collaboration to focus on NFV infrastructure (NFVi), OpenStack, SDN, SDI, and containers and help define the next generation of modern technology for the communications industry, including:

Upstream collaboration: The companies are taking an “upstream first” approach to collaboration across open source projects and communities — including OPNFV, OpenStack, and OpenDaylight — to address customer concerns about lock-in resulting from proprietary forks, differentiating the partnership from other providers. Engineering teams from both companies will collaborate to address customer requirements in upstream open source projects, helping accelerate technology innovation for scalable cloud deployments.

Solution certification and new joint offerings: Red Hat and Ericsson are collaborating on hardware and software roadmaps aimed at developing new joint offerings for NFV infrastructure, SDN, and SDI. Through the collaboration, the companies plan to certify Ericsson’s platform and portfolio of solutions, including Ericsson Cloud Execution Environment, Ericsson Cloud SDN solution, and Hyperscale Datacenter System 8000, for Red Hat Enterprise Linux and Red Hat OpenStack Platform, backed by reference architectures and labs.

Ericsson is expanding its NFV infrastructure solution to also include Red Hat OpenStack Platform to meet the needs of service providers across the globe who require a fully open and agile infrastructure. For their joint NFV infrastructure, SDN, and SDI solutions, the companies plan to work together to offer easy-to-deploy solutions, including automated deployment and management.

Technical alignment to advance container innovation and adoption: Both Red Hat and Ericsson see container technologies as a major part of the platform evolution and will collaborate on upstream activities in communities such as the CNCF and OCI.

Backed by industry leaders: Ericsson’s Red Hat Enterprise Linux-based workloads will participate in Red Hat’s certification program for applications running on Red Hat OpenStack Platform. The joint solutions will be backed by service-level agreements offered by Ericsson.

Professional services: Customers looking to evolve their businesses in NFV, IT, and data center modernization can benefit from Red Hat’s consulting and training expertise in open source and emerging technology enablement, and from Ericsson’s expertise in end-to-end consulting, systems integration, managed services, and support services. With this combined portfolio of technologies, services, training, and certifications, customers gain access to a global team that can position them for success in today’s dynamic ICT market.

Source: CloudStrategyMag

6 small steps to digital transformation

The phrase “digital transformation” is wearing a little thin, as marketers twist it into a pitch for whatever they’re trying to sell. So let’s settle on a broad yet simple meaning: The journey from inflexible platforms, products, and workflows to a “permanently agile” condition.

InfoWorld Contributing Editor Dan Tynan and I make the argument for this definition in a new Deep Dive you can download here. Naturally, the details of transformation vary infinitely depending on the organization. Yet commonalities persist in nearly all cases, such as devops, cross-silo collaboration, and big data analytics.

But how do you get there from here? As with any big initiative, you need to start small. At the C-level, digital transformation speaks to a burning desire to jump on new business opportunities and reduce the cost of operations. The challenge for IT pros and developers is to pick a project and execute in a way that provides a transformative example. These general guidelines, derived from real-world cases, may help point the way:

1. Sell the real potential

In pursuing digital transformation, business management often seems driven by an envy of high-flying digital natives like Google or Uber or Snapchat. Right away, you need to disabuse zealots of the notion that if you dump a bunch of technology in the hopper, everything will change overnight.

Yet the buzz around digital transformation presents a very real opportunity. The most important changes often end up being organizational, such as breaking down silos to foster collaboration or enabling lines of business to spin up their own projects continually without laborious authorization processes. You don’t want to start by pitching organizational disruption, though — you want to demonstrate real benefits.

Chief among them is the ability to bring products or applications to market much, much faster, which is the big payoff of agility. So identify a project you’re 90 percent sure will prove your point.

2. Pick the right project

Some organizations have already plunged into transformation with, say, IoT initiatives that reach into core products or processes that differentiate the business. Companies such as GE or Ford come to mind.

But beginning your transformational journey with a core initiative is almost impossible, unless a champion at the top of the company drives it. In most cases, the best place to focus is on web and mobile applications that target customers and need to change frequently.

Pay special attention to customer-facing applications that have the potential to drive new revenue quickly. From an IT professional’s perspective, an important aspect of digital transformation is to shift the emphasis from cost reduction to revenue generation. Also remember that this should not be a one-and-done project. The point is to create an environment where you can iterate and refine applications as needs change and as evidence points to how applications can be improved.

It goes without saying that collaboration with business stakeholders is essential, from both political and logistical standpoints. Your success should be their success, and business objectives must be well understood from the start, with feedback solicited continuously along the way.

3. Assemble the right team

The seeds of change always lie within — in the form of people looking for a better way. Sometimes they’re people who have been around a while and feel frustrated by standard procedure yet carry vital institutional knowledge. In other cases, they’re new recruits who are less entrenched in the usual way of doing things and may already have the necessary skills in agile tools, platforms, and methodologies.

Very likely, you’re going to need both types on your team. You need fresh thinking and people motivated and talented enough to spin up a devops skunkworks. But new applications need to be integrated with legacy systems and procedures, so you also need people who really understand that stuff well yet want to get beyond the old, boring way of doing things. Best case, a team like this (or at least the core of it) already exists, so you won’t need to assemble it from scratch.

4. Put devops to work

I’m going to make the assumption that devops has not already been established in your organization. If it has been, particularly at scale, then you’re well on the path to digital transformation. The next hurdle may be to apply devops to the bespoke software that defines a company’s core business, from manufacturing software to collaborative design platforms to logistical systems. (That stuff is heavily guarded. Good luck.)

Otherwise, laying the devops groundwork is the most important step in getting transformation off the ground. Devops dictates that software developers should be empowered to provision their own environments, while operations should have the ability to automate continuous, reliable deployment at scale. It makes agile development — where stakeholders review applications in progress, provide feedback, and change direction if necessary — possible in the real world.

If you’re still mired in waterfall development or have implemented a bureaucratic form of agile development that runs counter to the whole idea, implementing devops could be a very big lift. You may need to choose a smaller project or stipulate that team members already be fluent in true agile and/or devops methodologies. Either way, you’re going to want to avoid a long procurement cycle for new hardware and software for your project.

5. Choose your cloud

Unless you’re explicitly prevented from doing so, you’ll want to spin up your “transformative” project in the public cloud. AWS, Azure, Google Cloud Platform, and IBM Bluemix all offer cloud platforms with the services you need to get almost any devops project off the ground. By nature, pretty much everything in the cloud is self-service, which is a key aspect of digital transformation. Your team can get what it needs with a credit card and avoid the internal procurement bureaucracy.

Which cloud you choose should depend on the skills of your team and the nature of your applications. For example, if they’re Microsoft developers, they most likely will be most comfortable in the Azure cloud. If your applications will tap machine learning, your team may want to give Google Cloud’s exciting TensorFlow API a whirl. And of course, AWS offers the widest array of cloud services on earth.

Remember that a big part of your intent is to prove the worth of devops. Typically, the devops toolchain begins with GitHub or Bitbucket code repositories, which are their own independent cloud services. Jenkins has become the default platform for continuous integration and is supported by all the major public clouds. When it’s time to deploy, you may choose an industrial-strength configuration management solution such as Puppet, Chef, Ansible, or Salt — or opt for cloud templates such as AWS CloudFormation. All the major clouds now support Docker, which lets you spin up applications in containers that are much more lightweight and portable than VMs.

Another consideration is whether or not to adopt a full-fledged PaaS. The PaaS offerings Cloud Foundry and OpenShift are available on most of the leading clouds, with Azure also offering its own Service Fabric PaaS. The major PaaS offerings all support Docker containers.

6. Measure, analyze, report

Digital transformation is all about iteration and continuous cycles of improvement. To know what to improve, you need a deep view of how your applications perform — there’s little point to agility unless you have an informed idea of what to do next.

Much of the data collected from customer-facing applications tends to be semistructured: clickstreams, time-series data, event log files, and so on. All the major public clouds include a bundle of big data technologies (generally centering on the distributed processing frameworks Spark and Hadoop) to operate on the semistructured stuff, typically transforming the results so that SQL analytics software can handle them. Once converted, the data can be mashed up with existing SQL data such as transaction records, product data, and pricing information to gain new insights.
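
As a minimal sketch of that pattern, the following Java example (assuming Spark 2.x and a hypothetical S3 path) parses semistructured JSON clickstream events and runs a SQL aggregation over them:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ClickstreamReport {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("clickstream-report")
            .getOrCreate();

        // Semistructured input: one JSON event per line (the path is hypothetical).
        Dataset<Row> clicks = spark.read().json("s3://example-bucket/clickstream/*.json");

        // Expose the parsed events to SQL analytics.
        clicks.createOrReplaceTempView("clicks");

        Dataset<Row> topPages = spark.sql(
            "SELECT page, COUNT(*) AS hits " +
            "FROM clicks GROUP BY page ORDER BY hits DESC LIMIT 10");

        topPages.show();
        spark.stop();
    }
}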

With these tools, you can collect and analyze the information you need to prove the success of your project and point the way toward continuous improvement. BI tools such as Tableau, Qlik Sense, or Microsoft Power BI deliver powerful visualizations that can showcase your results.

Show your work

You may have noticed that these first steps to digital transformation are really about spinning up a successful devops initiative. That’s because, when implemented correctly, devops has the potential to vastly increase the efficiency of software development — and software now embodies nearly every aspect of business.

But the organizational implications are just as profound. Most large organizations still maintain silos, and essential to digital transformation is the ability to breach those silos and pull together ad hoc groups that span business and technology factions. That’s one reason why, from the start, you need a business champion who will commit to staying involved throughout your project’s lifecycle.

A successful devops project is just a first step. What you really want to put in place is a platform — and a culture — for experimentation, which by definition results in projects that “fail fast” as well as succeed. If you pick the right projects to show agility’s benefits, the cultural change should follow.

Source: InfoWorld Big Data

The wait for TensorFlow on Windows is almost over

When will it be possible to run Google’s TensorFlow deep learning system on Windows with full GPU support? The short answer is “soon.”

The real holdup, though, hasn’t even been TensorFlow. It’s been the lack of a working Windows version of Bazel, Google’s in-house tool that delivers TensorFlow builds.

TensorFlow on Windows seems like a no-brainer. Support for GPU-accelerated applications on Windows is highly robust, and Windows is about as popular a platform as you could ask for. To that end, a GitHub issue has been open with TensorFlow for providing Windows support since November of last year.

But the lack of a Windows version of Bazel has kept TensorFlow off Windows — until now. A working edition of Bazel has finally shipped for Windows, and it’s even available to developers through the Chocolatey package management system.

The other delay is adding GPU support to TensorFlow on Windows. While TensorFlow can fall back to CPUs across multiple nodes as a compatibility measure, it’s best run with full GPU support. After some work, said support for Windows is now on the verge of being merged into the project’s mainline.

An earlier fork of the project, produced some two months ago, provided a Windows build for TensorFlow via CMake and Visual Studio 2015 rather than Bazel. But it lacked support for GPU acceleration, and the cost of not using Bazel for the build process might well have been unsupportable over time.

Getting TensorFlow on Windows, then, is a double milestone. Aside from putting a powerful and useful deep learning tool into the hands of a much broader audience, the process of bringing it to that audience means future Google projects built with Bazel will have native Windows versions sooner, too.

Source: InfoWorld Big Data

Snowflake now offers data warehousing to the masses

Snowflake, the cloud-based data warehouse solution co-founded by Microsoft alumnus Bob Muglia, is lowering storage prices and adding a self-service option, meaning prospective customers can open an account with nothing more than a credit card.

These changes also raise an intriguing question: How long can a service like Snowflake expect to reside on Amazon, which itself offers services that are more or less in direct competition — and where the raw cost of storage undercuts Snowflake’s own storage pricing?

Open to the public

The self-service option, called Snowflake On Demand, is a change from Snowflake’s original sales model. Rather than calling a sales representative to set up an account, Snowflake users can now provision services themselves with no more effort than would be needed to spin up an AWS EC2 instance.

In a phone interview, Muglia said the reason Snowflake is transitioning to this model only now was more technical than anything else. Before self-service could be offered, Snowflake had to put protections in place to ensure that both the service itself and its customers could be protected from everything from malice (denial-of-service attacks) to incompetence (honest customers submitting massively malformed queries).

“We wanted to make sure we had appropriately protected the system,” Muglia said, “before we opened it up to anyone, anywhere.”

This effort was further complicated by Snowflake’s relative lack of hard usage limits, which Muglia characterized as being one of its major standout features. “There is no limit to the number of tables you can create,” Muglia said, but he further pointed out that Snowflake has to strike a balance between what it can offer any one customer and protecting the integrity of the service as a whole.

“We get some crazy SQL queries coming in our direction,” Muglia said, “and regardless of what comes in, we need to continue to perform appropriately for that customer as well as other customers. We see SQL queries that are a megabyte in size — the query statements [themselves] are a megabyte in size.” (Many such queries are poorly formed, auto-generated SQL, Muglia claimed.)

Fewer costs, more competition

The other major change is a reduction in storage pricing for the service — $30/TB/month for capacity storage, $50/TB/month for on-demand storage, and $10/TB/month for compressed storage.

It’s enough of a reduction in price that Snowflake will be unable to rely on storage costs as a revenue source, since those prices barely pay for the use of Amazon’s services as a storage provider. But Muglia is confident Snowflake is profitable enough overall that such a move won’t impact the company’s bottom line.

“We did the data modeling on this,” said Muglia, “and our margins were always lower on storage than on compute running queries.”

According to the studies Snowflake performed, “when customers put more data into Snowflake, they run more queries…. In almost every scenario you can imagine, they were very much revenue-positive and gross-margin neutral, because people run more queries.”

The long-term implications for Snowflake continuing to reside on Amazon aren’t clear yet, especially since Amazon might well be able to undercut Snowflake by directly offering competitive services.

Muglia, though, is confident that Snowflake’s offering is singular enough to stave off competition for a good long time, and is ready to change things up if need be. “We always look into the possibility of moving to other cloud infrastructures,” Muglia said, “although we don’t have plans to do it right now.”

He also noted that Snowflake competes with Amazon Redshift right now, but “we have a very different shape of product relative to Redshift…. Snowflake is storing multiple petabytes of data and is able to run hundreds of simultaneous concurrent queries. Redshift can’t do that; no other product can do that. It’s that differentiation that allows us to effectively compete with Amazon, and for that matter Google and Microsoft and Oracle and Teradata.”

Source: InfoWorld Big Data

StatSocial Moves Its Public Cloud To NYI’s Managed Hybrid Solution

NYI has announced that it has been chosen by StatSocial, a premier provider of social data, to fully manage its critical IT infrastructure. Migrating from a public cloud service to NYI’s fully managed hybrid solution not only reduces StatSocial’s operating costs, but it also increases scalability and enables its core IT team to focus on key business growth and innovation.

StatSocial’s IT environment was initially comprised of unmanaged, dedicated servers and a public cloud service. As the company grew, it quickly became apparent that the environment would be increasingly difficult to manage in-house. Its team looked to NYI to migrate to a more scalable, dynamic and fully managed architecture.

“Like many fast-growing tech companies, we found it easy to spin up environments in the cloud, but quickly realized it made no sense to scale in that environment,” comments Michael Hussey, chief executive officer, StatSocial. “With NYI, we now have a fully managed solution that ensures our core services run reliably, efficiently, and are specifically tailored to our unique needs. NYI has given us back our CTO, who can now focus on supporting development and growth initiatives instead of infrastructure challenges. Perhaps most impressive was how quickly NYI became a seamless extension of our technology team, and at a fraction of the cost of trying to do it all ourselves.”

StatSocial enables brands and publishers to understand, segment and target their web-based audiences by evaluating demographics and lifestyles. Using over 40,000 defining variables, StatSocial provides companies with an incredibly detailed perspective of audience composition and characteristics — a powerful tool in today’s data-driven marketplace.

“For over 20 years, we’ve been helping companies with our customized solutions and ensuring they are fully optimized across the board,” adds Phillip Koblence, co-founder and chief operations officer, NYI. “With our hybrid services, enterprises, SMBs and start-ups can find the right solution for each and every server and application. It would be virtually impossible to find that level of flexibility with a public cloud provider.”

Source: CloudStrategyMag