MapR Announces Free Stream Processing On-Demand Training for Real-time Analytics and IoT Applications

MapR Technologies, Inc., provider of the Converged Data Platform, announced at Kafka Summit 2016 that it now offers stream processing training via MapR Academy’s free On-Demand Training program. The new training enables Apache Kafka developers to extend their real-time analytics and Internet of Things (IoT) applications. Developers can also benefit from MapR Streams, which provides Kafka compatibility, integrated security, and multi-data-center support as an integrated component of the MapR Converged Data Platform.

“Companies interested in Apache Kafka will immediately see the value of our growing library of free on-demand training courses that focus on data-in-motion for real-time analytics and IoT applications,” said Suzanne Ferry, vice president of global education and training, MapR Technologies. “MapR Academy has created these new courses to deliver best-in-class professional training on critical big data technologies like Kafka and MapR Streams to both development and IT operations teams.”

MapR Streams is a newly released, global publish-and-subscribe event framework integrated directly into the MapR Converged Data Platform, which converges Hadoop, Spark and other open source technologies into one unified, enterprise-grade platform for streaming, real-time database capabilities, and enterprise storage.

Stream processing specialist data Artisans recently tested MapR Streams with Apache Flink against the Yahoo! stream processing benchmark; MapR Streams delivered 10 million events per second with 3X replication.

“We benchmarked MapR Streams with Apache Flink and were blown away by the results,” said Kostas Tzoumas, co-founder and CEO, data Artisans. “Ten million events per second is incredible throughput, and the 3X replication adds a level of data protection that will please enterprise IT shops.”

The free stream processing curriculum from MapR includes:

  • DEV 350 – Streams Essentials training provides developers with a broad understanding of the core concepts behind stream processing and prepares them to begin using MapR Streams. Sample topics include a MapR Streams overview, use cases, Streams architecture, core components, and a summary of the “life of a message.”
  • DEV 351 – Streams Development training gives developers the core concepts needed to build simple MapR Streams applications, along with a basic framework for building and configuring stream processing applications. Sample topics include creating a stream, developing and configuring stream producers and consumers, and describing properties and options for producers and consumers (a minimal producer/consumer sketch follows this list).
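
To make the DEV 351 topics concrete, here is a minimal publish/subscribe sketch written against the open source kafka-python client, which speaks the same Kafka API that MapR Streams exposes. The broker address, topic name, and client library are illustrative assumptions for this example (MapR Streams applications typically use the Kafka Java API with MapR-specific stream:topic paths), not part of the course material.

```python
# Minimal producer/consumer sketch using the open source kafka-python client.
# Broker address and topic name are illustrative assumptions; MapR Streams uses
# Kafka-compatible APIs with its own /stream:topic path convention.
from kafka import KafkaConsumer, KafkaProducer

TOPIC = "sensor-events"  # hypothetical topic

# Producer: publish a handful of messages.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(5):
    producer.send(TOPIC, key=str(i).encode(), value=f"reading-{i}".encode())
producer.flush()  # block until buffered messages are actually sent

# Consumer: read the topic from the beginning and print what arrives.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start at the oldest available offset
    consumer_timeout_ms=5000,      # stop iterating after 5 idle seconds
)
for message in consumer:
    print(message.partition, message.offset, message.value.decode())
```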

More than 50,000 professionals have enrolled in MapR Academy’s On-Demand Training for an easy way to receive free online Hadoop training. The free curriculum also covers Apache Spark, Apache Drill, and other technologies. The courses offer the same depth and content as instructor-led training courses, and include hands-on exercises, labs and quizzes to ensure an effective, interactive learning experience.

Source: insideBigData

Movidius Announces Fathom Deep Learning Accelerator Compute Stick

Movidius, a leader in low-power machine vision technology, today announced the Fathom Neural Compute Stick – the world’s first deep learning acceleration module – and the Fathom deep learning software framework. Together, the two tools allow powerful neural networks to be moved out of the cloud and deployed natively in end-user devices.

The new Fathom Neural Compute Stick is the world’s first embedded neural network accelerator. With the company’s ultra-low power, high performance Myriad 2 processor inside, the Fathom Neural Compute Stick can run fully-trained neural networks at under 1 Watt of power. Thanks to standard USB connectivity, the Fathom Neural Compute Stick can be connected to a range of devices and enhance their neural compute capabilities by orders of magnitude.

Neural Networks are used in many revolutionary applications such as object recognition, natural speech understanding, and autonomous navigation for cars. Rather than engineers programming explicit rules for machines to follow, vast amounts of data are processed offline in self-teaching systems that generate their own rule-sets. Neural networks significantly outperform traditional approaches in tasks such as language comprehension, image recognition and pattern detection.

When connected to a PC, the Fathom Neural Compute Stick behaves as a neural network profiling and evaluation tool, meaning companies will be able to prototype faster and more efficiently, reducing time to market for products requiring cutting edge artificial intelligence.

“As a participant in the deep learning ecosystem, I have been hoping for a long time that something like Fathom would become available,” said Founding Director of the New York University Data Science Center, Dr. Yann LeCun. “The Fathom Neural Compute Stick is a compact, low-power convolutional net accelerator for embedded applications that is quite unique. As a tinkerer and builder of various robots and flying contraptions, I’ve been dreaming of getting my hands on something like the Fathom Neural Compute Stick for a long time. With Fathom, every robot, big and small, can now have state-of-the-art vision capabilities.”

Fathom allows developers to take their trained neural networks out of the PC-training phase and automatically deploy a low-power optimized version to devices containing a Myriad 2 processor. Fathom supports the major deep learning frameworks in use today, including Caffe and TensorFlow.
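
Because Fathom consumes networks that were trained on a PC in frameworks such as TensorFlow, a typical workflow starts with training and saving a model on the desktop. The sketch below shows only that PC-side half using standard TensorFlow/Keras calls; the model architecture, data, and file name are invented for illustration, and the Fathom-specific conversion and deployment step is not shown because its tooling is not described here.

```python
# PC-side half of the workflow: train a tiny network and save the trained artifact.
# Model architecture, data, and file name are invented for illustration; the
# Fathom-specific conversion step that would follow is not shown.
import numpy as np
import tensorflow as tf

# Toy data standing in for a real vision dataset.
x = np.random.rand(256, 32).astype("float32")
y = np.random.randint(0, 10, size=(256,))

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x, y, epochs=2, batch_size=32, verbose=0)

# Persist the trained network; an embedded toolchain would then convert this
# saved model into a low-power optimized form for the target processor.
model.save("trained_model.keras")
```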

“Deep learning has tremendous potential — it’s exciting to see this kind of intelligence working directly in the low-power mobile environment of consumer devices,” said Google’s AI Technical Lead Pete Warden. “With TensorFlow supported from the outset, Fathom goes a long way toward helping tune and run these complex neural networks inside devices.”

Fathom Features

  • Plugged into existing systems (ARM host + USB port), Fathom can accelerate performance on deep learning tasks by 20x to 30x – for example, plug it into a “dumb” drone and run neural network applications on it.
  • It contains the latest Myriad 2 MA2450 chip – the same one Google is using in its undisclosed next-generation deep learning devices.
  • Its ultra-low power draw (under 1.2W) is ideal for many mobile and smart devices – roughly one-tenth of what competitors can achieve today.
  • It can take TensorFlow and Caffe PC networks and put them into embedded silicon at under 1W. Fathom’s images/second/watt is roughly 2x Nvidia’s on similar tests.
  • Fathom takes machine intelligence out of the cloud and into actual devices. It can run deep neural networks in real time on the device itself.
  • With Fathom, you can finally bridge the gap between training (i.e., server-side on GPU blades) and inferencing (running without a cloud connection, in users’ devices). Customers can rapidly convert a PC-trained network and deploy it to an embedded environment – meaning they will be able to put deep learning into end-user products much faster, and far more easily, than before.
  • Application example: plug Fathom into a GoPro and turn it into a camera with deep learning capabilities.

Availability

General availability will be Q4 of this year. Pricing will be sub $100 per unit.

Source: insideBigData

A Brief History of Kafka, LinkedIn’s Messaging Platform

Apache Kafka is a highly scalable messaging system that plays a critical role as LinkedIn’s central data pipeline. But it was not always this way. Over the years, we have had to make hard architecture decisions to arrive at the point where developing Kafka was the right decision for LinkedIn to make. We also had to solve some basic issues to turn this project into something that can support the more than 1.4 trillion messages that pass through the Kafka infrastructure at LinkedIn. What follows is a brief history of Kafka development at LinkedIn and an explanation of how we’ve integrated Kafka into virtually everything we do. Hopefully, this will help others that are making similar technology decisions as their companies grow and scale.

Why did we develop Kafka?

Over six years ago, our engineering team needed to completely redesign LinkedIn’s infrastructure. To accommodate our growing membership and increasing site complexity, we had already migrated from a monolithic application infrastructure to one based on microservices. This change allowed our search, profile, communications, and other platforms to scale more efficiently. It also led to the creation of a second set of mid-tier services to provide API access to data models and back-end services to provide consistent access to our databases.

We initially developed several different custom data pipelines for our various streaming and queuing data. The use cases for these platforms ranged from tracking site events like page views to gathering aggregated logs from other services. Other pipelines provided queuing functionality for our InMail messaging system, etc. These needed to scale along with the site. Rather than maintaining and scaling each pipeline individually, we invested in the development of a single, distributed pub-sub platform. Thus, Kafka was born.

Kafka was built with a few key design principles in mind: a simple API for both producers and consumers, designed for high throughput, and a scaled-out architecture from the beginning.

What is Kafka today at LinkedIn?

Kafka became a universal pipeline, built around the concept of a commit log and designed with speed and scalability in mind. Our early Kafka use cases encompassed both the online and offline worlds, feeding both systems that consume events in real time and those that perform batch analysis. Some common ways we used Kafka included traditional messaging (publishing data from our content feeds and relevance systems to our online serving stores), providing metrics for system health (used in dashboards and alerts), and better understanding how members use our products (tracking user activity and feeding data to our Hadoop grid for analysis and report generation). In 2011 we open sourced Kafka via the Apache Software Foundation, providing the world with a powerful open source solution for managing streams of information.

Today we run several clusters of Kafka brokers for different purposes in each data center. We generally run off the open source Apache Kafka trunk and put out a new internal release a few times a year. However, as our Kafka usage continued to rapidly grow, we had to solve some significant problems to make all of this happen at scale. In the years since we released Kafka as open source, the Engineering team at LinkedIn has developed an entire ecosystem around Kafka.

As pointed out in this blog post by Todd Palino, a key problem for an operation as big as LinkedIn’s is the need for message consistency across multiple datacenters. Many applications, such as those maintaining the indices that enable search, need a view of what is going on in all of our datacenters around the world. At LinkedIn, we use Kafka MirrorMaker to make copies of our clusters. Multiple mirroring pipelines run both within data centers and across data centers and are laid out to keep network costs and latency to a minimum.

The Kafka ecosystem

A key innovation that has allowed Kafka to maintain a mostly self-service model has been our integration with Nuage, the self-service portal for online data-infrastructure resources at LinkedIn. This service offers a convenient place for users to manage their topics and associated metadata, abstracting some of the nuances of Kafka’s administrative utilities and making the process easier for topic owners.

Another open source project, Burrow, is our answer to the tricky problem of monitoring Kafka consumer health. It provides a comprehensive view of consumer status and offers consumer lag checking as a service, without the need to specify thresholds. It monitors committed offsets for all consumers at topic-partition granularity and calculates the status of those consumers on demand.
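
Burrow evaluates consumer health from the offsets that consumers commit back to Kafka, so the signal it works with is the stream of committed offsets per topic-partition. As a hedged illustration of where those commits come from, here is a consumer-group sketch using the open source kafka-python client with manual commits; the topic, group, and broker names are assumptions for the example rather than LinkedIn’s actual configuration.

```python
# Illustration of the committed offsets a tool like Burrow evaluates: a consumer
# group that commits its position only after processing each polled batch.
# Topic, group, and broker names are assumptions for this example.
from kafka import KafkaConsumer


def process(value: bytes) -> None:
    """Stand-in for application-specific message handling."""
    pass


consumer = KafkaConsumer(
    "page-view-events",                # hypothetical topic
    group_id="metrics-aggregator",     # hypothetical consumer group
    bootstrap_servers="localhost:9092",
    enable_auto_commit=False,          # commit explicitly, after work is done
    auto_offset_reset="earliest",
)

try:
    while True:
        batch = consumer.poll(timeout_ms=1000)  # {TopicPartition: [records]}
        for records in batch.values():
            for record in records:
                process(record.value)
        if batch:
            consumer.commit()  # advances the committed offsets Burrow tracks
finally:
    consumer.close()
```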

Scaling Kafka in a time of rapid growth

The scale of Kafka at LinkedIn continues to grow in terms of data transferred, number of clusters, and the number of applications it powers. As a result, we face unique challenges around the reliability, availability, and cost of our heavily multi-tenant clusters. In this blog post, Kartik Paramasivam explains the various improvements we have made to Kafka and its ecosystem at LinkedIn to address these issues.

Samza is LinkedIn’s stream processing platform that empowers users to get their stream processing jobs up and running in production as quickly as possible. Unlike other stream processing systems that focus on a very broad feature set, we concentrated on making Samza reliable, performant and operable at the scale of LinkedIn. Now that we have a lot of production workloads up and running, we can turn our attention to broadening the feature set. You can read about our use-cases for relevance, analytics, site-monitoring, security, etc., here.

Kafka’s strong durability, low latency, and recently improved security have enabled us to use Kafka to power a number of newer mission-critical use cases. These include replacing MySQL replication with Kafka-based replication in Espresso, our distributed document store. We also plan to support the next generation of Databus, our source-agnostic distributed change data capture system, using Kafka. We are continuing to invest in Kafka to ensure that our messaging backbone stays healthy as we ask more and more from it.

The Kafka Summit in San Francisco was recently held on April 26.

Contributed by: Joel Koshy, a member of the Kafka team within the Data Infrastructure group at LinkedIn, who has worked on distributed systems infrastructure and applications for the past eight years. He is also a PMC member and committer for the Apache Kafka project. Prior to LinkedIn, he was with the Yahoo! search team, where he worked on web crawlers. Joel received his PhD in Computer Science from UC Davis and his bachelor’s in Computer Science from IIT Madras.

Source: insideBigData

Qubole and Looker Join Forces to Empower Business Users to Make Data-Driven Decisions

Qubole, the big data-as-a-service company, and Looker, the company that is powering data-driven businesses, today announced that they are integrating Looker’s business analytics with Qubole’s cloud-based big data platform, giving line-of-business users across organizations access to powerful, yet easy-to-use, big data analytics.

Business units face an uphill battle when it comes to gleaning information from vast and disparate sources. Line of business users find it challenging to extract, shape and present the variety and volume of data to executives to help make informed business decisions. As a result, data scientists are overwhelmed with requests to access data or provide fixed reports to line of business users, diverting their attention from gathering data insights through statistics and modeling techniques. Furthermore, line of business users become frustrated when they are forced to decipher the output of SQL aggregations created by data scientists.

Qubole and Looker are addressing this issue by integrating the Qubole Data Service (QDS) and Looker’s analytics data platform. The combination gives line of business users instant access to automated, scalable, self-service data analytics without having to rely on or overburden the data science team — and without having to build and maintain on-premises infrastructure.

“Data has become essential for every business function across the enterprise, but most big data offerings are still too complicated for line of business users to use, substantially reducing the business impact data can have,” said Ashish Thusoo, co-founder and CEO of Qubole. “Qubole and Looker have similar philosophies that it is essential for businesses to make insights accessible to as many people in an organization as possible to stay competitive. The integration of our offerings serves that very purpose.”

QDS is a self-service platform for big data analytics that runs on the three major public clouds: Amazon AWS, Google Compute Engine and Microsoft Azure. QDS automatically provisions, manages and scales up clusters to match the needs of a particular job, and then winds down nodes when they’re no longer needed. QDS is a fully managed big data offering that leverages the latest open source technologies, such as Apache Hadoop, Hive, Presto, Pig, Oozie, Sqoop and Spark, to provide the only comprehensive, “everything-as-a-service” data analytics platform, complete with enterprise security features, an easy-to-use UI and built-in data governance.

“Our customers are using Looker every day to operationalize their data and make better business decisions,” said Keenan Rice, vice president of alliances, Looker. “Now with our support for Qubole’s automated, scalable, big data platform, businesses have greater access to their cloud-based data. At the same time, Qubole’s rapidly growing list of customers utilize our data platform to find, explore and understand the data that runs their business.”

Source: insideBigData

Redis Collaborates with Samsung Electronics to Achieve Groundbreaking Database Performance

Redis today announced the general availability of Redis on Flash on standard x86 servers, including standard SATA-based SSD instances available on public clouds and more advanced NVMe-based SSDs like the Samsung PM1725. Running Redis, the world’s most popular in-memory data structure store, on cost-effective persistent memory options enables customers to process and analyze large datasets at near real-time speeds at 70% lower cost.

The Redis on Flash offering has been optimized to run Redis with flash memory used as a RAM extender. Operational processing and analysis of very large datasets in-memory is often limited by the cost of dynamic random access memory (DRAM). By running a combination of Redis on Flash and DRAM, datacenter managers benefit from leveraging the high throughput and low latency characteristics of Redis while achieving substantial cost savings.

Next-generation persistent memory technology like Samsung’s NVMe SSDs delivers orders of magnitude higher performance at only an incremental added cost compared to standard flash memory. Redis collaborated with Samsung to demonstrate 2 million ops/second with sub-millisecond latency and over 1GB/s of disk bandwidth on a single standard Dell Xeon server, placing 80 percent of the dataset on the NVMe SSDs and only 20 percent on DRAM.
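
The throughput above comes from the Redis and Samsung benchmark itself, which is not reproduced here. As a hedged illustration of the batched access pattern that drives high operations-per-second numbers against any Redis deployment, the sketch below uses the open source redis-py client with pipelining; the host, port, key names, and batch size are assumptions.

```python
# Generic illustration of batching Redis commands with a pipeline to raise
# throughput. Host, port, key names, and batch size are assumptions; this is
# not the Redis/Samsung benchmark harness.
import time

import redis

r = redis.Redis(host="localhost", port=6379)

N = 100_000
start = time.perf_counter()

pipe = r.pipeline(transaction=False)  # plain batching, no MULTI/EXEC overhead
for i in range(N):
    pipe.set(f"sensor:{i}", i)
    if (i + 1) % 1000 == 0:
        pipe.execute()  # flush 1,000 queued commands in one round trip
pipe.execute()          # flush any remainder

elapsed = time.perf_counter() - start
print(f"{N / elapsed:,.0f} SET ops/sec (single client, client-side estimate)")
```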

“We are happy to contribute to a new solution for our customers, one that shows a 40X improvement in throughput at sub-millisecond latencies compared to standard SATA-based SSDs,” stated Mike Williams, vice president, product planning, Samsung Device Solutions Americas. “This solution – using our next generation NVMe SSD technology and Redis in-memory processing – can play a key role in the advancement of high performance computing technology for the analysis of extremely large data sets.”

Spot.IM, a next generation on-demand social network that powers social conversations on leading entertainment and media websites such as Entertainment Weekly and CG Media, is already reaping the benefits of deploying Redis on Flash. Spot.IM’s cutting-edge architecture seeks minimal latency, so the transition from webpage viewing to interactive dialog appears seamless. With Redis’s automatically scaling, highly responsive database, the service is able to easily handle 400,000 to one million user requests a day, to and from third-party websites, at sub-millisecond latencies. As Spot.IM scaled out its architecture in an AWS Virtual Private Cloud (VPC) environment, the company turned to Redis on Flash, delivered as Redis Enterprise Cluster (RLEC), to help optimize the costs of running an extremely demanding, high performance, low latency application without compromising on responsiveness. With RLEC Flash, Spot.IM maintains extremely high throughput (processing several hundred thousand requests per second) at significantly lower costs compared to a pure RAM solution.

“Redis is our main database and a critical component of our highly demanding application, because our architecture needs to handle extremely high speed operations with very little complexity and at minimal cost,” said Ishay Green, CTO, Spot.IM. “Redis technology satisfies all our requirements around high availability, seamless scalability and high performance, and now at a very attractive price point with Redis on Flash.”

Redis on Flash is now available as RLEC (Redis Enterprise Cluster) over standard x86 servers, including SSD backed cloud instances and IBM POWER8 platforms. It is also available to Redis Cloud customers running on a dedicated Virtual Private Cloud environment.

Source: insideBigData

BackOffice Associates Releases Data Stewardship Platform 6.5 and dspConduct for Information Stewardship

BackOffice Associates, a leader in information governance and data modernization solutions, today announced Version 6.5 of its flagship Data Stewardship Platform (DSP) and debuted its newest dspConduct application for comprehensive business process governance and application data management across all data in all systems.

“Next-generation information governance is necessary to maximize the value of an enterprise’s data assets, improve the efficiency of business processes and increase the overall value of the organization,” said David Booth, chairman and CEO, BackOffice Associates. “Our continued vision and offerings are designed to help organizations embrace the next wave in data stewardship.”

dspConduct is built on DSP 6.5 – the most powerful data stewardship platform to date. With this latest release, the DSP continues to drive the consumption and adoption of data stewardship by linking business users and technical experts through the business processes of data. By introducing new user experience paradigms, executive and management reporting, extended data source connectivity, and improved performance and scale, the 6.5 release continues to expand the platform’s capabilities and reach.

dspConduct helps Global 2000 organizations proactively set and enforce strategic data policies across the enterprise. The solution complements master data management (MDM) strategies by ensuring transactions run as planned in critical business systems such as ERP, CRM, PLM, and others.

“We designed dspConduct to extend beyond the traditional capabilities of master data management—bringing today’s business users a single platform that addresses their complex application data landscape with the tools needed to conduct world-class business process governance and achieve measurable business results,” added Rex Ahlstrom, Chief Strategy Officer, BackOffice Associates.

dspConduct helps business users achieve business process governance across all application data found in their organization’s enterprise architecture. The solution empowers users to plan and analyze specific policies for various types of enterprise data—whether customer, supplier, financial, human resources, manufacturing—and then execute and enforce those policies across the organization’s heterogeneous IT system landscape. Built on BackOffice Associates’ more than 20 years of real-world experience meeting the most complex and critical data challenges, dspConduct and the DSP bring to the market a proven solution to maximize the business value of data.

Additional enhancements available in DSP 6.5 include:

  • Highest performance platform for data stewardship to date
  • Native Excel interoperability through the DSP for a simpler business-user experience
  • Native SAP HANA® connectivity and support for migrations to SAP® Business Suite 4 SAP HANA (SAP S/4HANA)
  • Generic interface layer for complete enterprise architecture interconnectivity
  • Native SAP Fiori® apps for migration and data quality metrics accessible by all stakeholders

BackOffice Associates was recently named a Strong Performer by Forrester Research in its independent report, “The Forrester Wave™: Data Governance Stewardship Applications, Q1 2016.”

Source: insideBigData

Understanding TCO: How to Avoid the Four Common Pitfalls that May Lead to Skyrocketing Bills After Implementing a BI Solution

In this special guest feature, Ulrik Pedersen, Chief Operations Officer at TARGIT, highlights the constant battle between IT and finance over Total Cost of Ownership (TCO) when it comes to implementing a new BI solution. With IT budgets increasingly moving from IT departments to specific lines of business that may not be aware of this concept, TCO can quickly become a convoluted quagmire. Ulrik Pedersen joined the TARGIT team as Project Manager in 1999. Since then, he’s taken on the challenge of penetrating the North American market with TARGIT’s Business Intelligence and Analytics solution. Ulrik holds a Master of Science in Economics and Business Administration, B2B Marketing from Aalborg University.

Just about any savvy IT or business professional today understands the value a business intelligence (BI) solution can bring to an organization. From uncovering new sales opportunities to measuring growth to streamlining processes, BI solutions provide many benefits to the organization. However, those benefits come with a price tag, and the total cost of a BI solution isn’t always in proportion to the value it brings.

Organizations need to think carefully before investing in a BI solution to ensure they are aware of hidden costs. Total Cost of Ownership (TCO) isn’t as simple as just adding up infrastructure plus people. In reality, software only accounts for a fraction of the total cost of a BI project, and there are many other direct and indirect costs that rise steadily up front and over time. Having a full understanding of the time and resources a BI solution will cost your organization beyond the initial price tag is essential. These are the four most common pitfalls IT and business leaders should avoid to drive the most value from a BI solution.

1 – Poor Data Quality

The first step in implementing a BI project is pulling data into the data warehouse from the various other corporate systems such as the CRM, HR, and finance systems. Unfortunately, this is also one of the most time-consuming and costly steps because the data must first be cleansed and brought up to standard.

Cleansing and updating data is a long, arduous process that typically comes with a high price tag from the consultants who have to do it. It doesn’t take long for those consultancy hours to add up to a significant expense.

2 – The Never Ending Project

Otherwise known as “scope creep,” long-stretch projects plague companies that struggle to select the most important data to bring into a BI project. Unfortunately for many of these companies, it’s impossible to truly know which data sets they want until they see the numbers. By then, a consultant or data scientist has already taken the time—and handed over the bill—for incorporating that data.

This results in a seemingly never ending process of starting and stopping the BI project. Worse, it’s not uncommon to see corporate priorities change before any analytics objectives can be obtained, rendering everything already done up until that point useless. The business world is changing so rapidly that a slow BI implementation can mean no BI at all.

3 – License Creep

License creep refers to the uncontrolled growth in software licenses within a company. The ultimate goal of any successful BI implementation is to spread the power of analytics to as many users as possible throughout the company. But with many BI solutions, each additional user comes with a price tag, regardless of their level of BI involvement.

Additionally, rolling out an enterprise-wide BI solution usually necessitates additional servers.

It isn’t fair to say license creep is the result of poor project management. Rather, it is the result of unrealistic planning of license costs for a successfully adopted BI solution. Imagine TCO as a line chart: license creep is where that line takes a dramatic 45-degree turn up from the initial cost. Over time, the final price tag can be double the price that was originally quoted.

4 – The Under-Utilization Obstacle

A powerful BI and analytics solution is worthless if users aren’t armed with the know-how they need to take advantage of its various levels of tools. Companies are often won over by the words “self-service,” only to discover that quite a bit of technical expertise is needed, and that when business decision makers need to dig into further detail, they need expensive consultants to help.

As a result, overall under-utilization of the BI platform ensures that the ambition of transforming into a data-driven company will never be realized, and neither will the ROI. Opportunities are lost on multiple levels, including the very basic objective of eliminating the different data-truths floating around a company and aligning every decision-maker with the true data they need.

The Bottom Line

Don’t fall victim to these common TCO pitfalls. Enter the buying process informed about what should – and what shouldn’t – lie ahead in a successful business intelligence implementation and strategy. The right partner is incentivized to ensure you enter into a plan that works best for the unique needs of your company, and works with you toward a fast return on investment and a long-lasting, mutually beneficial relationship.

Source: insideBigData

Streaming Analytics with StreamAnalytix by Impetus

The insideBIGDATA Guide to Streaming Analytics is a useful new resource directed toward enterprise thought leaders who wish to gain strategic insights into this exciting new area of technology. Many enterprises find themselves at a key inflection point in the big data timeline with respect to streaming analytics technology. There is a huge opportunity for direct financial and market growth for enterprises that leverage streaming analytics, and deployments are underway at companies across a broad variety of use cases. The vendor and technology landscape is complex, and numerous open source options are mushrooming. It’s important to choose a platform that supplies a proven, pre-integrated, performance-tuned stack, ease of use, enterprise-class reliability, and the flexibility to protect the enterprise from rapid technology changes. Perhaps the most important reason to evaluate this technology now is that a company’s competitors are very likely implementing enterprise-wide real-time streaming analytics right now and may soon gain significant advantages in customer perception and market share. The complete insideBIGDATA Guide to Streaming Analytics is available for download from the insideBIGDATA White Paper Library.

StreamAnalytix is a state-of-the-art streaming analytics platform based on a best-of-breed open source technology stack. StreamAnalytix is a horizontal product for comprehensive data ingestion across industry verticals. It is developed on an enterprise-grade scale with open source components including Apache Kafka, Apache Storm and Apache Spark, while also incorporating the popular Hadoop and NoSQL platforms into its structure. The solution provides all the required components for streaming application development, not normally found in one place, brought together under one platform with an extremely friendly UI.

A key benefit of StreamAnalytix is its multi-engine abstracted architecture, which enables alternative streaming engines underneath – supporting Spark Streaming for rapid and easy development of real-time streaming analytics applications in addition to the original support for Apache Storm. Being able to choose among multiple streaming engines removes the risk of being locked into a single engine. With a multi-engine streaming analytics platform, you can build Storm streaming pipelines and Spark streaming pipelines and interconnect them, using the best engine for each use case based on the optimal architecture. When new engines become widely accepted in the future, they can be rolled into this multi-engine platform.

The following is an overview of the product and its enterprise-grade, multi-engine open source based platform:

Open source technology

StreamAnalytix is built on Apache Storm and Apache Spark (open source distributed real-time computation systems) and is therefore able to leverage the numerous upgrades, improvements and flow of innovation that are foundational to the global Open Source movement.

Spark streaming

StreamAnalytix’s Spark Streaming support includes a rich array of drag-and-drop Spark data transformations, Spark SQL support, and built-in operators for predictive models with an inline model-test feature.
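
For comparison with the pipelines the visual designer produces, here is a minimal hand-written Spark Streaming job in PySpark, using the classic DStream API that StreamAnalytix targeted at the time: ingest lines from a socket, transform them, and print per-batch word counts. The source, port, and batch interval are illustrative assumptions, not StreamAnalytix output.

```python
# Minimal hand-written Spark Streaming (DStream) pipeline, shown for comparison
# with the pipelines StreamAnalytix assembles visually. Source, port, and batch
# interval are illustrative assumptions.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="WordCountPipeline")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# Ingest: lines of text arriving on a TCP socket (e.g. started with `nc -lk 9999`).
lines = ssc.socketTextStream("localhost", 9999)

# Transform: tokenize and count words within each batch.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()  # sink: print each batch's counts to stdout

ssc.start()
ssc.awaitTermination()
```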

Versatility and comprehensiveness

StreamAnalytix is a “horizontal” product for comprehensive high-speed data ingestion across industry verticals. Its IDE development environment offers a palette of applications based on customer requirements. Multiple components can be dragged and dropped into a smart dashboard in order to create a customized workspace. The visual pipeline designer can be used to create, configure and administer complex real-time data pipelines.

Abstraction layer driving simplicity

The platform’s architecture incorporates an abstraction layer beneath the application definition interface. This innovative setup enables automatic selection of the ideal streaming engine while also allowing concurrent use of several engines.

Compatibility

Built on Apache Storm, Apache Spark, Kafka and Hadoop, the StreamAnalytix platform is seamlessly compatible with all Hadoop distributions and vendors. This enables easy ingestion, processing, analysis, storage and visualization of streaming data from any input data source, proactively boosting split-second decision making.

“Low latency” capability and flexible scalability

The platform’s ability to ingest high-speed streaming data with very low, sub-second latencies makes it ideal for use cases which warrant split-second response, such as flight-alerts or critical control of risk factors prevalent in complex manufacturing environments. Any fast-ingest data store can be used.

Intricate robust analytics

StreamAnalytix offers a wide collection of built-in data-processing operators. These operators enable high-speed data ingestion and processing in terms of complex correlations, multiple aggregation functions, statistical models and window aggregates. For rapid application development, it is possible to port predictive analytics and machine learning models built in SAS or R onto real-time data via PMML.

Detailed data visualization

StreamAnalytix provides comprehensive support for 360-degree real-time data visualization. This means the system delivers incoming data streams instantaneously in the form of appropriate charts and dashboards.

If you prefer, the complete insideBIGDATA Guide to Streaming Analytics is available for download as a PDF from the insideBIGDATA White Paper Library, courtesy of Impetus.

Source: insideBigData

ODPi Publishes First Runtime Specification and Test Suite To Simplify and Expedite Development of Data-Driven Applications

ODPi, a nonprofit organization accelerating the open ecosystem of big data solutions, announced the first release of the ODPi Runtime Specification and test suite to ensure applications will work across multiple Apache Hadoop® distributions.

Designed to make it easier to create big data solutions and data-driven applications, the ODPi Runtime Specification is the first release from the industry-backed organization. While the Hadoop ecosystem is rapidly innovating, a certain degree of diversity and complexity is actually impeding adoption. Founded last year, ODPi has more than 25 members focused on simplification and standardization within the big data ecosystem and on further advancing the work of the Apache Software Foundation.

Derived from Apache Hadoop 2.7, the Runtime Specification covers the HDFS, YARN, and MapReduce components and is part of ODPi Core, the common reference platform.

“The turbulent big data market needs more confidence, more maturity, and less friction for both technology vendors and consumers alike,” said Nik Rouda, senior big data analyst at Enterprise Strategy Group (ESG). “ESG research found that 85% of those responsible for current Hadoop deployments believed that ODPi would add value.”

Key ODPi Runtime Specification Technical Features

The ODPi test framework and self-certification also align closely with the Apache Software Foundation by leveraging Apache Bigtop for comprehensive packaging, testing, and configuration. Additionally, more than half the code in the latest Bigtop release originated in ODPi.

All ODPi Runtime-Compliance tests are linked directly to lines in the ODPi Runtime Specification. To assist with compliance, in addition to the test suite, ODPi also provides a reference build.

The published specification also includes rules and guidelines on how to incorporate additional, non-breaking features, which are allowed provided source code is made available through relevant Apache community processes.

What’s Next for ODPi

The ODPi Operations Specification, which will help enterprises improve installation and management of Hadoop and Hadoop-based applications, will be available later this year. The Operations Specification covers Apache Ambari, the ASF project for provisioning, managing, and monitoring Apache Hadoop clusters.

“ODPi complements the work done in the Apache projects by filling a gap in the big data community in bringing together all members of the Hadoop ecosystem,” said John Mertic, senior manager of ODPi. “Our members – Hadoop distros, app vendors, solution providers, and end-users – are fully committed to leveraging Apache projects and utilizing feedback from real-world use cases to provide industry guidance on how Hadoop should be deployed, configured, and managed. We will continue to expand and contribute to innovation happening inside the Hadoop ecosystem.”

Comments from Members

Ampool

“With its broader, flexible approach to standardizing the Hadoop stack, ODPi is particularly attractive to smaller companies, such as Ampool. Instead of spending testing/qualification cycles across different distributions and respective versions, the reference implementation would really help reduce both the effort and risk of Hadoop integration for us.” – Milind Bhandarkar, Ph.D, founder and CEO, Ampool

DataTorrent

“ODPi will simplify developing and testing applications that work across distros and hence lower the cost of building Hadoop-based big data applications. For example, DataTorrent will be able to certify RTS installation and runtime for ODPi and know it will work with multiple platform providers.” – Thomas Weise, Apache Apex (incubating) PPMC member and architect/co-founder, DataTorrent

Hortonworks

“At Hortonworks, we aim to speed Hadoop adoption through ecosystem interoperability rooted in open source so enterprise customers can reap the benefits of increased choice with more modern data applications and solutions. As a founding member, we are pleased to see ODPi’s first release become available to the ecosystem and look forward to our continued involvement to accelerate the adoption of modern data applications.” – Alan Gates, co-founder, Hortonworks

IBM

“Big Data is the key to enterprises welcoming the cognitive era and there’s a need across the board for advancements in the Hadoop ecosystem to ensure companies can get the most out of their deployments in the most efficient ways possible. With the ODPi Runtime Specification, developers can write their application once and run it across a variety of distributions – ensuring more efficient applications that can generate the insights necessary for business change.” – Rob Thomas, vice president of product development, IBM Analytics

Linaro

“Linaro recognizes the importance of ODPi’s work to promote and advance the state of Apache Hadoop and Big Data technologies for the enterprise while minimizing fragmentation and redundant effort. Linaro’s own focus is similar, developing open source software for the ARM ecosystem, and it makes perfect sense that where these two areas intersect, Linaro and ODPi should work together to ensure ARM is fully supported and that fragmentation is minimized across the industry.” – Martin Stadtler, director of the Linaro Enterprise Group (LEG)

Pivotal

“It was a little over a year ago that ODPi was formed, and we have already proved beneficial to upstream ASF projects (Hadoop, Bigtop, Ambari). There’s a need for a stable enterprise-grade platform that is managed as an industry asset to benefit all of the companies driving value from Hadoop and big data. This is why the first release of the ODPi Runtime Specification and test suite is so exciting. It is a big step toward realizing our goal of accelerating the delivery of business outcomes through big data solutions by driving interoperability on an enterprise-ready core platform.” – Roman Shaposhnik, director of Open Source at Pivotal, Apache Hadoop and Bigtop committer and ASF member

SAS

“As a founding member, SAS’s support of the Open Data Platform Initiative demonstrates our ongoing commitment to developing innovative applications and solutions for our customers that are compatible with the Hadoop ecosystem. ODPi enables us to remain committed to ensuring our applications work with and exploit the Hadoop distribution of our customers’ choice, while being able to bank on the stability and quality expected in demanding business environments.” – Craig Rubendall, vice president of platform R&D, SAS

Source: insideBigData

Paxata Continues to Redefine the Traditional Data-to-Information Pipeline with New Spring ‘16 Release

Paxata, provider of the Adaptive Data Preparation™ platform for the enterprise, announced the availability of its Spring ’16 product release. Paxata’s latest release bridges the gap between analysts and IT with new intuitive capabilities, providing connected information to every person in the enterprise without compromising on security, scale, and cost efficiency. Spring ’16 also enables analysts to collaboratively explore and prepare all of their data, no matter the source or format.

“Our investigations involve a great deal of unknowns in the data, and our customers turn to us to make sense of it,” said Conrad Mulcahy, Associate Managing Director and Director of Data Analytics, K2 Intelligence. “Paxata’s Spring ’16 release allows K2 to do a highly sophisticated MRI on the data. Paxata already showed us hard tissue versus soft tissue, but now we can distinguish between different kinds of soft tissue. Granular observations can be made on the data at an early stage with all of the new capabilities: sophisticated sampling options, cluster and edit, column search and support for nested files. Paxata keeps us from going in the wrong direction early on, keeps us focused, and gets the dialogue with the client headed in the right direction. It’s hard to put a price on how valuable that is for us as investigators, having our clients know that we’re not wasting their valuable time or resources.”

Paxata’s new release serves as another milestone in Paxata’s mission of delivering connected information to every person in the enterprise, without compromising on security, scale, and cost efficiency. Key features of Paxata’s Spring ’16 release include:

  • Advanced filtergrams for comprehensive data profiling with semantic-awareness of timestamp and numeric data, automatically suggested intelligent visualizations and custom bucketing
  • Smart integration of complex nested JSON/XML data and Hadoop compressed files – unfolded, flattened and ready for multi-structured data analysis to address IoT and other high-value use cases (a generic flattening sketch follows this list)
  • Granular searching across all columns of wide datasets and in every cell value for patterns, outliers and duplicate values
  • New options for iterative and flexible data discovery with smart statistical selections of datasets at any scale
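
Paxata performs this unfolding inside its own platform; purely to illustrate the underlying idea of flattening nested records into analyst-friendly columns, here is a generic pandas sketch. The record structure and field names are invented for the example and have nothing to do with Paxata’s implementation.

```python
# Generic illustration of flattening nested JSON into tabular columns, the kind
# of "unfolding" a data preparation tool automates. The record structure and
# field names are invented for this example.
import pandas as pd

records = [
    {
        "device": {"id": "sensor-42", "site": "plant-7"},
        "readings": [
            {"ts": "2016-05-01T00:00:00", "temp_c": 21.5},
            {"ts": "2016-05-01T00:05:00", "temp_c": 22.1},
        ],
    }
]

# One row per nested reading, with device metadata carried along as flat columns.
flat = pd.json_normalize(
    records,
    record_path="readings",
    meta=[["device", "id"], ["device", "site"]],
)
print(flat)
#                     ts  temp_c  device.id device.site
# 0  2016-05-01T00:00:00    21.5  sensor-42     plant-7
# 1  2016-05-01T00:05:00    22.1  sensor-42     plant-7
```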

“Cloudera is committed to advancing Hadoop as a mainstream platform that improves customer experiences and drives new revenue streams through highly scalable, more intelligent storage and processing capabilities,” said Tim Stevens, vice president, Business and Corporate Development at Cloudera. “Paxata continues to deliver on the promise of the Hadoop ecosystem with numerous joint customers who have amplified the benefits of their Cloudera platform by making it accessible through Paxata’s connected information platform for self-service data quality, integration, governance and collaboration.”

In addition to providing quick access to data, the new release provides IT-specific controls to support governance, security and scale, including:

  • Visual column-lineage for detailed and understandable traceability
  • REST API for SAML for complete integration into the IT environment
  • Ability to use analyst projects as repeatable “recipes” to build into ETL, virtualized views or data quality dashboards

“Since we began the self-service data preparation revolution, we have set the pace for delivering major advancements against our roadmap. With every quarterly release, we ask two questions. The first is, ‘How do we make the life of the analyst easier so they can go from raw data to the right information, regardless of analytic use case?’ The second is, ‘How do we lead the industry in moving from legacy scale-up, on-premise, relational worlds to distributed, elastic cloud, scale-out architectures?’” said Prakash Nanduri, Co-Founder and CEO of Paxata. “Every major Fortune 1000 corporation is moving to this new world and Paxata is leading the way. The Spring ’16 release is another major advancement in this transformation. I am proud of the hard work of our team, customers and partners.”

Additional details about the Paxata Spring ’16 release can be found here.

Source: insideBigData