IDG Contributor Network: How in-memory computing drives digital transformation with HTAP

In-memory computing (IMC) is becoming a fixture in the data center, and Gartner predicts that by 2020, IMC will be incorporated into most mainstream products. One of the benefits of IMC is that it will enable enterprises to start implementing hybrid transactional/analytical processing (HTAP) strategies, which have the potential to revolutionize data processing by providing real-time insights into big data sets while simultaneously driving down costs.

Here’s why IMC and HTAP are tech’s new power couple.

Extreme processing performance with IMC

IMC platforms maintain data in RAM to process and analyze data without continually reading and writing data from a disk-based database. Architected to distribute processing across a cluster of commodity servers, these platforms can easily be inserted between existing application and data layers with no rip-and-replace.

They can also be scaled easily and cost-effectively by adding new servers to the cluster, and they automatically take advantage of the added RAM and CPU processing power. The benefits of IMC platforms include performance gains of 1,000X or more, the ability to scale to petabytes of in-memory data, and high availability thanks to distributed computing.

In-memory computing isn’t new, but until recently, only companies with extremely high-performance, high-value applications could justify the cost of such solutions. However, the cost of RAM has dropped steadily, approximately 10 percent per year for decades. So today the value gained from in-memory computing and the increase in performance it provides can be cost-effectively realized by a growing number of companies in an increasing number of use cases.
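To see what a steady decline of that size adds up to, here is a small worked calculation. It takes the article's approximate 10 percent annual figure at face value; the function name is ours and purely illustrative.

```python
def relative_cost(annual_decline: float, years: int) -> float:
    """Fraction of the starting price that remains after compounding a yearly decline."""
    return (1.0 - annual_decline) ** years


# Assumed input: the article's rough figure of 10 percent per year.
for years in (10, 20, 30):
    remaining = relative_cost(0.10, years)
    print(f"after {years} years: {remaining:.1%} of the original price")
```

Under that assumption, a unit of RAM costs roughly 35 percent of its starting price after 10 years and roughly 12 percent after 20, which is why workloads that could not justify IMC a decade or two ago may be able to now.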

Transactions and analytics on the same data set with HTAP

HTAP is a simple concept: the ability to process transactions (such as investment buy and sell orders) while also performing real-time analytics (such as calculating historical account balances and performance) on the operational data set.

For example, in a recent In-Memory Computing Summit North America keynote, Rafique Awan from Wellington Management described the importance of HTAP to the performance of the company’s new investment book of record (IBOR). Wellington has more than $1 trillion in assets under management.

But HTAP isn’t easy. In the earliest days of computing, the same data set was used for both transaction processing and analytics. However, as data sets grew in size, queries started slowing down the system and could lock up the database.

To ensure fast transaction processing and flexible analytics for large data sets, companies deployed transactional databases, referred to as online transaction processing (OLTP) systems, solely for the purpose of recording and processing transactions. Separate online analytical processing (OLAP) databases were deployed, and data from an OLTP system was periodically (daily, weekly, etc.) extracted, transformed, and loaded (ETLed) into the OLAP system.

This bifurcated architecture has worked well for the last few decades. But the need for real-time transaction and analytics processing in the face of rapidly growing operational data sets has become crucial for digital transformation initiatives, such as those driving web-scale applications and internet of things (IoT) use cases. With separate OLTP and OLAP systems, however, by the time the data is replicated from the OLTP to the OLAP system, it is simply too late—real-time analytics are impossible.
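The staleness problem can be shown with a toy model. This is a minimal sketch, not any vendor's architecture: `OltpStore`, `OlapWarehouse`, and `etl_from` are hypothetical names, and the "ETL" is just a list copy.

```python
class OltpStore:
    """Operational system: records transactions as they happen."""

    def __init__(self):
        self.orders = []

    def record(self, amount):
        self.orders.append(amount)


class OlapWarehouse:
    """Analytical system: only sees what the last batch load copied over."""

    def __init__(self):
        self.snapshot = []

    def etl_from(self, oltp):
        # Stand-in for a periodic extract/transform/load job.
        self.snapshot = list(oltp.orders)

    def total(self):
        return sum(self.snapshot)


oltp, olap = OltpStore(), OlapWarehouse()
oltp.record(100)
olap.etl_from(oltp)            # the scheduled batch runs
oltp.record(250)               # a transaction arrives after the batch

stale_total = olap.total()     # analytics still reports the old picture
live_total = sum(oltp.orders)  # what the operational data actually says
print(stale_total, live_total)
```

Until the next batch runs, every analytical answer is computed from the older snapshot, however fast the warehouse itself may be.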

Another disadvantage of the current strategy of separate OLTP and OLAP systems is that IT must maintain separate architectures, typically on separate technology stacks. This results in hardware and software costs for both systems, as well as the cost for human resources to build and maintain them.

The new power couple

With in-memory computing, the entire transactional data set is already in RAM and ready for analysis. More sophisticated in-memory computing platforms can co-locate compute with the data to run fast, distributed analytics across the data set without impacting transaction processing. This means replicating the operational data set to an OLAP system is no longer necessary.
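A minimal single-process sketch of the idea follows. It assumes a partitioned in-memory store in which each partition answers analytical questions about its own data and only the small partial results are merged. The class and method names are hypothetical, and the short per-partition locks stand in for whatever concurrency control a real platform uses.

```python
import threading
import zlib
from collections import defaultdict


class Partition:
    """One slice of the operational data set, held entirely in memory."""

    def __init__(self):
        self.lock = threading.Lock()
        self.balances = defaultdict(float)

    def apply_trade(self, account, amount):
        # Transactional write: touches only this partition.
        with self.lock:
            self.balances[account] += amount

    def local_total(self):
        # Analytical read: computed where the data lives.
        with self.lock:
            return sum(self.balances.values())


class HtapStore:
    def __init__(self, partitions=4):
        self.partitions = [Partition() for _ in range(partitions)]

    def _owner(self, account):
        index = zlib.crc32(account.encode()) % len(self.partitions)
        return self.partitions[index]

    def trade(self, account, amount):
        self._owner(account).apply_trade(account, amount)

    def total_exposure(self):
        # Only one number per partition crosses the "network".
        return sum(p.local_total() for p in self.partitions)


store = HtapStore()
store.trade("acct-1", 500.0)
store.trade("acct-2", -120.0)
print(store.total_exposure())
```

The analytical query runs against the same live data the transactions just modified, with no copy to a second system in between.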

According to Gartner, in-memory computing is ideal for HTAP because it supports real-time analytics and situational awareness on the live transaction data instead of relying on after-the-fact analyses on stale data. IMC also has the potential to significantly reduce the cost and complexity of the data layer architecture, allowing real-time, web-scale applications at a much lower cost than separate OLTP/OLAP approaches.

To be fair, not all data analytics can be performed using HTAP. Highly complex, long-running queries must still be performed in OLAP systems. However, HTAP can provide businesses with a completely new ability to react immediately to a rapidly changing environment.

For example, for industrial IoT use cases, HTAP can enable the real-time capture of incoming sensor data and simultaneously make real-time decisions. This can result in more timely maintenance, higher asset utilization, and reduced costs, driving significant financial benefits. Financial services firms can process transactions in their IBORs and analyze their risk and capital requirements at any point in time to meet the real-time regulatory reporting requirements that impact their business.

Online retailers can transact purchases while simultaneously analyzing inventory levels and other factors, such as weather conditions or website traffic, to update pricing for a given item in real time. And health care providers can continually analyze the transactional data being collected from hundreds or thousands of in-hospital and home-based patients to provide immediate individual recommendations while also looking at trend data for possible disease outbreaks.

Finally, by eliminating the need for separate databases, an IMC-powered HTAP system can simplify life for development teams and eliminate duplicative costs by reducing the number of technologies in use and downsizing to just one infrastructure.

The fast data opportunity

The rapid growth of data generated by digital transformation initiatives, and the need to make real-time decisions based on that data, are pushing companies to consider IMC-based HTAP solutions. Any business faced with the opportunities and challenges of fast data from initiatives such as web-scale applications and the internet of things, which require ever-greater levels of performance and scale, should take the time to learn more about in-memory computing-driven hybrid transactional/analytical processing.

This article is published as part of the IDG Contributor Network.

Source: InfoWorld Big Data

IDG Contributor Network: Ensuring big data and fast data performance with in-memory computing

In-memory computing (IMC) technologies have been available for years. However, until recently, the cost of memory made IMC impractical for all but the most performance-critical, high-value applications.

Over the last few years, however, with memory prices falling and demand for high performance increasing in just about every area of computing, I’ve watched IMC discussions go from causing glazed eyes to generating mild interest, to eliciting genuine excitement: “Please! I need to understand how this technology can help me!”

Why all the excitement? Because companies that understand the technology also understand that if they don’t incorporate it into their architectures, they won’t be able to deliver the applications and the performance their customers demand today and will need tomorrow. In-memory data grids and in-memory databases, both key elements of an in-memory computing platform, have gained recognition and mindshare as more and more companies have deployed them successfully.

Consider the challenges caused by the explosion in data being collected and processed as part of the digital transformation. As you go through your day, almost everything you do intersects with some form of data production, collection or processing: text messaging, emailing, social media interaction, event planning, research, digital payments, video streaming, interacting with a digital voice assistant…. Every department in your company relies on more sophisticated, web-scale applications (such as ERP, CRM and HRM), which themselves have ever more sophisticated demands for data and analytics.

Now add in the growing range of consumer IoT applications: smart refrigerators, watches and security systems—with nonstop monitoring and data collection—and connected vehicles with constant data exchange related to traffic and road conditions, power consumption and the health of the car. Industrial IoT is potentially even bigger. I recently read that to improve braking efficiency, a train manufacturer is putting 400 sensors in each train, with plans to increase that number to 4,000 over the next five years. And data from all of these applications must be collected and often analyzed in real time.

That’s where in-memory computing comes in. An IMC platform offers a way to transact and analyze data which resides completely in RAM instead of continually retrieving data from disk-based databases into RAM before processing. In addition, in-memory computing solutions are built on distributed architectures so they can utilize parallel processing to further speed the platform versus single node, disk-based database alternatives. These benefits can be gained by simply inserting an in-memory computing layer between existing application and database layers. Taken together, performance gains can be 1,000X or more.

Also, because in-memory computing solutions are distributed systems, it is easy to increase the RAM pool and the processing power of the system by adding nodes to the cluster. The systems will automatically recognize the new node and rebalance data between the nodes.
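The article does not say how data is rebalanced. One common technique for this kind of problem is rendezvous (highest-random-weight) hashing, sketched below as an assumption rather than a description of any particular product. Node names and the `owner` function are illustrative.

```python
import hashlib


def owner(key, nodes):
    """Rendezvous hashing: each node scores the key; the highest score owns it."""

    def score(node):
        digest = hashlib.sha256(f"{node}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(nodes, key=score)


keys = [f"key-{i}" for i in range(1000)]
three_nodes = ["node-a", "node-b", "node-c"]
four_nodes = three_nodes + ["node-d"]

before = {k: owner(k, three_nodes) for k in keys}
after = {k: owner(k, four_nodes) for k in keys}

# Keys whose owner changed when the fourth node joined.
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
```

The useful property is that a key changes owner only if the new node outscores its current owner, so every key that moves lands on the new node and the rest of the data stays where it is.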

Today, IMC use cases continue to expand. Companies are accelerating their operational and customer-facing applications by deploying in-memory data grids between the application and database layers of their systems to cache the data and enable distributed parallel processing across the cluster nodes. Some are using IMC technology for event and stream processing to rapidly ingest, analyze, and filter data on the fly before sending the data elsewhere.
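The caching pattern described above can be sketched in a few lines. This is a single-process read-through cache with hypothetical names; `slow_db_lookup` stands in for a query against the existing database layer.

```python
class ReadThroughCache:
    """Serves reads from memory and falls back to the database only on a miss."""

    def __init__(self, load_from_db):
        self._load = load_from_db
        self._data = {}
        self.misses = 0

    def get(self, key):
        if key not in self._data:
            self.misses += 1
            self._data[key] = self._load(key)
        return self._data[key]


db_calls = []


def slow_db_lookup(key):
    # Hypothetical stand-in for a disk-based database query.
    db_calls.append(key)
    return {"id": key, "name": f"customer-{key}"}


cache = ReadThroughCache(slow_db_lookup)
first = cache.get(42)    # miss: goes to the database
second = cache.get(42)   # hit: served from memory
print(cache.misses, len(db_calls))
```

A production data grid adds distribution, eviction, and write handling on top of this, but the application-facing idea is the same: the application asks the in-memory layer, and the database is consulted only when needed.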

Many large analytic databases and data warehouses are using IMC technology to accelerate complicated queries on large data sets. And companies are beginning to deploy hybrid transactional/analytical processing (HTAP) models which allow them to transact and run queries on the same operational data set, reducing the complexity and cost of their computing infrastructure in use cases such as IoT.

The importance of IMC will continue to increase over the coming years as ongoing development makes new capabilities and technologies available, including:

First-class support for distributed SQL

Strong support for SQL will extend the life of this industry standard, eliminating the need for SQL professionals to learn proprietary languages to create queries—something they can do with a single line of SQL code. Leading in-memory data grids already include ANSI SQL-99 support.

Non-volatile memory (NVM)

NVM retains data during a power loss, eliminating the need for software-based fault-tolerance. A decade from now, NVM will likely be the predominant computing storage model, enabling large-scale, in-memory systems which only use hard disks or flash drives for archival purposes.

Hybrid storage models for large datasets

By supporting a universal interface to all storage media—RAM, flash, disk, and NVM—IMC platforms will give businesses the flexibility to easily adjust storage strategy and processing performance to meet budget requirements without changing data-access mechanisms. 

IMC as a system of record

IMC platforms will increasingly be used by businesses as authoritative data sources for business-critical records. This will in part be driven by IMC support for highly efficient hybrid transactional and analytical processing (HTAP) on the same database as well as the introduction of disk-based persistence layers for high availability and disaster recovery.

Artificial intelligence

Machine learning on small, dense datasets is easily accomplished today, but machine learning on large, sparse data sets requires a data management system that can store terabytes of data and perform fast parallel computations, a perfect IMC use case.

All the new developments around in-memory computing shouldn’t fool you into thinking it’s unproven. It’s a mature, mainstream technology that’s been used for more than a decade in applications including fraud detection, high-speed trading and high performance computing. But it’s now more affordable and vendors are making their IMC platforms easier to use and applicable to more use cases. The sooner you begin exploring IMC, the sooner your company can benefit from it.

This article is published as part of the IDG Contributor Network.

Source: InfoWorld Big Data

Fire up big data processing with Apache Ignite

Apache Ignite is an in-memory computing platform that can be inserted seamlessly between a user’s application layer and data layer. Apache Ignite loads data from the existing disk-based storage layer into RAM, improving performance by as much as six orders of magnitude (1 million-fold).

The in-memory data capacity can be easily scaled to handle petabytes of data simply by adding more nodes to the cluster. Further, both ACID transactions and SQL queries are supported. Ignite delivers performance, scale, and comprehensive capabilities far above and beyond what traditional in-memory databases, in-memory data grids, and other in-memory-based point solutions can offer by themselves.

Apache Ignite does not require users to rip and replace their existing databases. It works with RDBMS, NoSQL, and Hadoop data stores. Apache Ignite enables high-performance transactions, real-time streaming, and fast analytics in a single, comprehensive data access and processing layer. It uses a distributed, massively parallel architecture on affordable, commodity hardware to power existing or new applications. Apache Ignite can run on premises, on cloud platforms such as AWS and Microsoft Azure, or in a hybrid environment.

The Apache Ignite unified API supports SQL, C++, .NET, Java, Scala, Groovy, PHP, and Node.js. The unified API connects cloud-scale applications with multiple data stores containing structured, semistructured, and unstructured data. It offers a high-performance data environment that allows companies to process full ACID transactions and generate valuable insights from real-time, interactive, and batch queries.

Users can keep their existing RDBMS in place and deploy Apache Ignite as a layer between it and the application layer. Apache Ignite automatically integrates with Oracle, MySQL, Postgres, DB2, Microsoft SQL Server, and other RDBMSes. The system automatically generates the application domain model based on the schema definition of the underlying database, then loads the data. In-memory databases typically provide only a SQL interface, whereas Ignite supports a wider group of access and processing paradigms in addition to ANSI SQL. Apache Ignite supports key/value stores, SQL access, MapReduce, HPC/MPP processing, streaming/CEP processing, clustering, and Hadoop acceleration in a single integrated in-memory computing platform.

GridGain Systems donated the original code for Apache Ignite to the Apache Software Foundation in the second half of 2014. Apache Ignite was rapidly promoted from an incubating project to a top-level Apache project in 2015. In the second quarter of 2016, Apache Ignite was downloaded nearly 200,000 times. It is used by organizations around the world.

Architecture

Apache Ignite is JVM-based distributed middleware based on a homogeneous cluster topology implementation that does not require separate server and client nodes. All nodes in an Ignite cluster are equal, and they can play any logical role per runtime application requirement.

A service provider interface (SPI) design is at the core of Apache Ignite. The SPI-based design makes every internal component of Ignite fully customizable and pluggable. This enables tremendous configurability of the system, with adaptability to any existing or future server infrastructure.

Apache Ignite also provides direct support for parallelization of distributed computations based on fork-join, MapReduce, or MPP-style processing. Ignite uses distributed parallel computations extensively, and they are fully exposed at the API level for user-defined functionality.

Key features

In-memory data grid. Apache Ignite includes an in-memory data grid that handles distributed in-memory data management, including ACID transactions, failover, advanced load balancing, and extensive SQL support. The Ignite data grid is a distributed, object-based, ACID transactional, in-memory key-value store. In contrast to traditional database management systems, which utilize disk as their primary storage mechanism, Ignite stores data in memory. By utilizing memory rather than disk, Apache Ignite is up to 1 million times faster than traditional databases.

SQL support. Apache Ignite supports free-form ANSI SQL-99 compliant queries with virtually no limitations. Ignite can use any SQL function, aggregation, or grouping, and it supports distributed, noncolocated SQL joins and cross-cache joins. Ignite also supports the concept of field queries to help minimize network and serialization overhead.
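To make the idea of a distributed SQL aggregate concrete, the sketch below runs the same query on two partitions and merges the partial results. It is not Ignite code: each in-memory SQLite database merely stands in for one cluster node, and the table and function names are hypothetical.

```python
import sqlite3


def make_node(rows):
    """One in-memory SQLite database standing in for one node's partition."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (category TEXT, price REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


nodes = [
    make_node([("books", 10.0), ("toys", 30.0)]),
    make_node([("books", 20.0), ("toys", 50.0), ("books", 30.0)]),
]


def average_price_by_category(nodes):
    # Map step: every node returns partial SUM and COUNT, not a local AVG,
    # because averaging averages is wrong when partitions differ in size.
    query = "SELECT category, SUM(price), COUNT(*) FROM sales GROUP BY category"
    totals = {}
    for conn in nodes:
        for category, subtotal, count in conn.execute(query):
            running_sum, running_count = totals.get(category, (0.0, 0))
            totals[category] = (running_sum + subtotal, running_count + count)
    # Reduce step: combine the partials into the final answer.
    return {c: s / n for c, (s, n) in totals.items()}


averages = average_price_by_category(nodes)
print(averages)
```

The caller writes one ordinary SQL aggregate; splitting it into per-node partials and merging them is the part a distributed SQL engine takes care of.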

In-memory compute grid. Apache Ignite includes a compute grid that enables parallel, in-memory processing of CPU-intensive or other resource-intensive tasks such as traditional HPC, MPP, fork-join, and MapReduce processing. Support is also provided for standard Java ExecutorService asynchronous processing.
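The fork-join and MapReduce styles mentioned here share one shape: split the work, process the pieces in parallel, and merge the results. The sketch below shows that shape with Python's standard `concurrent.futures` executor, which plays a role loosely comparable to Java's ExecutorService. Threads in one process stand in for cluster nodes, and the function names are ours.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """Map step: count the words in one chunk of text."""
    return Counter(chunk.split())


def distributed_word_count(chunks, workers=4):
    # Fork: hand each chunk to a worker.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, chunks))
    # Join / reduce: merge the partial counts.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


result = distributed_word_count(["ignite spark ignite", "spark hadoop", "ignite"])
print(result.most_common())
```

On a real compute grid the chunks would be processed on the nodes that already hold them, so only the small partial counts travel over the network.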

In-memory service grid. The Apache Ignite service grid provides complete control over services deployed on the cluster. Users can control how many service instances should be deployed on each cluster node, ensuring proper deployment and fault tolerance. The service grid guarantees continuous availability of all deployed services in case of node failures. It also supports automatic deployment of multiple instances of a service, of a service as a singleton, and of services on node startup.

In-memory streaming. In-memory stream processing addresses a large family of applications for which traditional processing methods and disk-based storage, such as disk-based databases or file systems, are inadequate. These applications are extending the limits of traditional data processing infrastructures.

Streaming support allows users to query rolling windows of incoming data. This enables users to answer questions such as “What are the 10 most popular products over the last hour?” or “What is the average price in a certain product category for the past 12 hours?”
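A rolling-window query such as "most popular products over the last hour" can be sketched with a queue of recent events and a running counter. This is an illustration under simplifying assumptions (one process, events arriving in timestamp order), with hypothetical class and method names.

```python
from collections import Counter, deque


class SlidingWindowTopN:
    """Tracks per-product counts over the most recent window_seconds."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()   # (timestamp, product), oldest first
        self.counts = Counter()

    def add(self, timestamp, product):
        self.events.append((timestamp, product))
        self.counts[product] += 1
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            _, old_product = self.events.popleft()
            self.counts[old_product] -= 1
            if self.counts[old_product] == 0:
                del self.counts[old_product]

    def top(self, n, now):
        self._evict(now)
        return self.counts.most_common(n)


window = SlidingWindowTopN(window_seconds=3600)
window.add(0, "laptop")
window.add(1800, "phone")
window.add(2000, "phone")
window.add(3700, "tablet")       # the laptop sale is now more than an hour old
top_products = window.top(2, now=3700)
print(top_products)
```

Because counts are updated as events arrive and expire, the query is answered from the running totals instead of rescanning stored history.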

Another common stream processing use case is pipelining a distributed events workflow. As events are coming into the system at high rates, the processing of events is split into multiple stages, each of which has to be properly routed within a cluster for processing. These customizable event workflows support complex event processing (CEP) applications.

In-memory Hadoop acceleration. The Apache Ignite Accelerator for Hadoop enables fast data processing in existing Hadoop environments via the tools and technology an organization is already using.

Ignite in-memory Hadoop acceleration is based on the first dual-mode, high-performance in-memory file system that is 100 percent compatible with Hadoop HDFS and an in-memory optimized MapReduce implementation. Delivering up to 100 times faster performance, in-memory HDFS and in-memory MapReduce provide easy-to-use extensions to disk-based HDFS and traditional MapReduce. This plug-and-play feature requires minimal to no integration. It works with any open source or commercial version of Hadoop 1.x or Hadoop 2.x, including Cloudera, Hortonworks, MapR, Apache, Intel, and AWS. The result is up to 100-fold faster performance for MapReduce and Hive jobs.

Distributed in-memory file system. A unique feature of Apache Ignite is the Ignite File System (IGFS), which is a file system interface to in-memory data. IGFS delivers similar functionality to Hadoop HDFS. It includes the ability to create a fully functional file system in memory. IGFS is at the core of the Apache Ignite In-Memory Accelerator for Hadoop.

The data from each file is split into separate data blocks and stored in cache. Data in each file can be accessed with a standard Java streaming API. For each part of the file, a developer can calculate an affinity and process the file’s content on corresponding nodes to avoid unnecessary networking.
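The block-and-affinity idea can be sketched as follows. This is not the IGFS API: the tiny block size, the simple hash-modulo placement, and all names are assumptions chosen to keep the example short.

```python
import zlib

BLOCK_SIZE = 4  # bytes; unrealistically small so the example is easy to follow


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def block_owner(path, block_index, nodes):
    """Assumed placement rule: hash the (path, block index) pair onto a node."""
    key = f"{path}#{block_index}".encode()
    return nodes[zlib.crc32(key) % len(nodes)]


def plan_processing(path, data, nodes):
    """Group block indexes by owning node so each node reads only local blocks."""
    plan = {node: [] for node in nodes}
    for index, _ in enumerate(split_into_blocks(data)):
        plan[block_owner(path, index, nodes)].append(index)
    return plan


content = b"in-memory file"
blocks = split_into_blocks(content)
plan = plan_processing("/logs/app.log", content, ["node-a", "node-b"])
print(plan)
```

With a plan like this, a job can send each piece of work to the node that already holds the corresponding blocks instead of pulling the blocks across the network.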

Unified API. The Apache Ignite unified API supports a wide variety of common protocols for the application layer to access data. Supported protocols include SQL, Java, C++, .NET, PHP, MapReduce, Scala, Groovy, and Node.js. Ignite supports several protocols for client connectivity to Ignite clusters, including Ignite Native Clients, REST/HTTP, SSL/TLS, and Memcached.

Advanced clustering. Apache Ignite provides one of the most sophisticated clustering technologies on JVMs. Ignite nodes can automatically discover each other, which helps scale the cluster when needed without having to restart the entire cluster. Developers can also take advantage of Ignite’s hybrid cloud support, which allows users to establish connections between private clouds and public clouds such as AWS or Microsoft Azure.

Additional features. Apache Ignite provides high-performance, clusterwide messaging functionality. It allows users to exchange data via publish-subscribe and direct point-to-point communication models.

The distributed events functionality in Ignite allows applications to receive notifications about cache events occurring in a distributed grid environment. Developers can use this functionality to be notified about the execution of remote tasks or any cache data changes within the cluster. Event notifications can be grouped and sent in batches and at timed intervals. Batching notifications helps attain high cache performance and low latency.

Ignite allows for most of the data structures from the java.util.concurrent framework to be used in a distributed fashion. For example, you could add to a double-ended queue (java.util.concurrent.BlockingDeque) on one node and poll it from another node. Or you could have a distributed primary key generator, which would guarantee uniqueness on all nodes.

Ignite distributed data structures include support for these standard Java APIs: Concurrent map, distributed queues and sets, AtomicLong, AtomicSequence, AtomicReference, and CountDownLatch.
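One way a cluster-wide key generator can guarantee uniqueness without a network round trip per key is to hand each node a disjoint range of values. The sketch below illustrates that general technique in a single process; it is not Ignite's implementation, and the class names are hypothetical.

```python
import threading


class ClusterSequence:
    """Stands in for the cluster-wide counter; hands out disjoint ID ranges."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next = 0

    def reserve(self, size):
        with self._lock:
            start = self._next
            self._next += size
            return start, start + size


class NodeIdGenerator:
    """Per-node generator that draws IDs locally from its reserved range.

    Not thread-safe by itself; this sketch assumes one caller per node.
    """

    def __init__(self, sequence, batch=100):
        self._sequence = sequence
        self._batch = batch
        self._current, self._limit = 0, 0

    def next_id(self):
        if self._current >= self._limit:
            # Range exhausted: go back to the shared counter for a new one.
            self._current, self._limit = self._sequence.reserve(self._batch)
        value = self._current
        self._current += 1
        return value


sequence = ClusterSequence()
node_a = NodeIdGenerator(sequence, batch=3)
node_b = NodeIdGenerator(sequence, batch=3)
ids = [node_a.next_id(), node_b.next_id(), node_a.next_id(),
       node_b.next_id(), node_a.next_id(), node_a.next_id()]
print(ids)
```

The trade-off is that the keys are unique but not strictly ordered across nodes, and a node that stops early leaves gaps in the sequence.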

Key integrations

Apache Spark. Apache Spark is a fast, general-purpose engine for large-scale data processing. Ignite and Spark are complementary in-memory computing solutions. They can be used together in many instances to achieve superior performance and functionality.

Apache Spark and Apache Ignite address somewhat different use cases and rarely compete for the same task. The table below outlines some of the key differences.

Apache Spark doesn’t provide shared storage, so data from HDFS or other disk storage must be loaded into Spark for processing. State can be passed from Spark job to job only by saving the processed data back into external storage. Ignite can share Spark state directly in memory, without storing the state to disk.

One of the main integrations for Ignite and Spark is the Apache Ignite Shared RDD API. Ignite RDDs are essentially wrappers around Ignite caches that can be deployed directly inside of executing Spark jobs. Ignite RDDs can also be used with the cache-aside pattern, where Ignite clusters are deployed separately from Spark, but still in-memory. The data is still accessed using Spark RDD APIs.

Spark supports a fairly rich SQL syntax, but it doesn’t support data indexing, so it must do full scans all the time. Spark queries may take minutes even on moderately small data sets. Ignite supports SQL indexes, resulting in much faster queries, so using Spark with Ignite can accelerate Spark SQL more than 1,000-fold. The result set returned by Ignite Shared RDDs also conforms to the Spark DataFrame API, so it can be further analyzed using standard Spark DataFrames. Both Spark and Ignite natively integrate with Apache YARN and Apache Mesos, so it’s easier to use them together.
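The difference between a full scan and an indexed lookup can be shown with plain Python data structures. This is a toy model, not Spark or Ignite code: the dictionary acts as a simple hash index, whereas SQL engines typically use tree-based indexes, and the data set is synthetic.

```python
from collections import defaultdict

# Synthetic data: 10,000 rows spread evenly over 100 cities.
rows = [{"id": i, "city": f"city-{i % 100}"} for i in range(10_000)]


def full_scan(rows, city):
    """Without an index, every query has to look at every row."""
    return [r["id"] for r in rows if r["city"] == city]


def build_index(rows, column):
    """Built once; maps each column value to the IDs of the matching rows."""
    index = defaultdict(list)
    for r in rows:
        index[r[column]].append(r["id"])
    return index


def indexed_lookup(index, value):
    """With an index, a query jumps straight to the matching rows."""
    return index.get(value, [])


city_index = build_index(rows, "city")
scan_result = full_scan(rows, "city-7")
index_result = indexed_lookup(city_index, "city-7")
print(len(scan_result), scan_result == index_result)
```

Both calls return the same rows, but the scan examines all 10,000 records on every query while the lookup touches only the 100 that match, and that gap widens as the data set grows.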

When working with files instead of RDDs, it’s still possible to share state between Spark jobs and applications using the Ignite In-Memory File System (IGFS). IGFS implements the Hadoop FileSystem API and can be deployed as a native Hadoop file system, exactly like HDFS. Ignite plugs in natively to any Hadoop or Spark environment. IGFS can be used with zero code changes in plug-and-play fashion.

Apache Cassandra. Apache Cassandra can serve as a high-performance solution for structured queries. But the data in Cassandra should be modeled such that each predefined query results in one row retrieval. Thus, you must know what queries will be required before modeling the data.

While very powerful in certain cases, Cassandra lacks an in-memory option, which can severely limit performance. Cassandra can be useful for OLAP applications but lacks support for transactions, ACID or otherwise, so it is not employed for OLTP. Predefined queries can be efficient with Cassandra, but Cassandra lacks SQL support and does not support joins, aggregations, groupings, or usable indexes. These limitations mean Cassandra cannot support ad hoc queries.

Apache Ignite offers native support for Cassandra. With Ignite, Cassandra users gain very powerful capabilities such as the ability to leverage in-memory computing to reduce query times by 1,000x. They can also leverage ANSI-compliant SQL support to run ad hoc and structured queries against in-memory data using joins, aggregations, groupings, and usable indexes.

Installing Ignite

Despite the breadth of its feature set, Apache Ignite is very easy to use and deploy. There are no custom installers. The code base comes as a single Zip file with only one mandatory dependency: ignite-core.jar. All other dependencies, such as integration with Spring for configuration, can be added to the process à la carte. The project is fully Mavenized; it is composed of more than a dozen Maven artifacts that can be imported and used in any combination. Apache Ignite is based on standard Java APIs. For distributed caches and data grid functionality, Apache Ignite implements the JCache (JSR107) standard.

Apache Ignite is a high-performance, distributed in-memory computing platform for large-scale data sets. It offers performance gains to transactional and analytical applications on the order of 1,000 to 1 million times faster throughput, as well as lower latencies than are possible with traditional disk-based or flash technologies. Ignite sits between the application and data layers and does not require the rip-and-replacement of existing RDBMS, NoSQL, or Hadoop data stores.

Apache Ignite combines, in one well-integrated framework, a set of in-memory computing capabilities, including an in-memory data grid, an in-memory compute grid, an in-memory service grid, in-memory stream processing, and in-memory acceleration for Hadoop, Spark, and Cassandra. In combination with traditional or distributed data stores, Apache Ignite holds the key to high-volume transactions, real-time analytics, and the emerging class of hybrid transaction/analytical processing (HTAP) workloads.

Apache Ignite resources and documentation, including white papers, recorded webinars, and code samples, are available on the GridGain website.

Nikita Ivanov is founder and CTO of GridGain Systems, where he has led the development of advanced and distributed in-memory data processing technologies. He has more than 20 years of experience in software application development, building HPC and middleware platforms, and contributing to the efforts of companies including Adaptec, Visa, and BEA Systems.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Source: InfoWorld Big Data