How Arcadia Data brings self-service BI to the data lake

Data has evolved over the years. Complex data structures, unstructured data, real-time processing, growing data volumes, and new varieties of data are all part of the evolution. Platforms have changed as well. “Schema-less,” real-time events, “schema-on-read,” and extract/load/discover/transform (ELDT) are now part of our vernacular.

Despite these changes, many businesses rely on the same data warehouse infrastructure that they’ve relied on for years. Many businesses have also turned to data lakes, through platforms such as Apache Hadoop, NoSQL databases, and Apache Kafka, or cloud storage technologies like Amazon S3, as a cost-effective way of managing large volumes of disparate data sets. Unfortunately, the success rates of these data lakes have been disappointing, as they have not been able to deliver quicker or better value to businesses.

Why isn’t the data lake living up to its promise? It turns out that the data lake architecture is a robust data management strategy, but a weak data intelligence strategy, due to its lack of focus on tools to analyze the data. While the data landscape has evolved, not all tools have kept pace. The data management platforms have changed, and it’s time to rethink the BI and analytics tools as well.

In this article, we’ll examine the advantages of an analytics and BI platform built specifically for data lakes, and how Arcadia Enterprise, the flagship product from Arcadia Data, brings self-service analytics to those environments. 

BI and the data lake

A common practice to enable BI on data lakes is to use traditional tools designed for data warehouses and layer them on the data lake. However, because the data in the data lakes is both large in scale and fast moving, what was inefficient but tolerable in the data warehouse has become impractical and even painful in the data lake. Data movement to BI-specific servers, heavy data performance modeling into cubes and extracts, and slow feedback loops between IT and business analysts have all become barriers to insight. BI tools designed for the data warehouse have limited the value of the data lake.

Instead of constraining the value of your data lake, you should embrace a new approach to BI: Keep your existing BI tools for your remaining data warehouse workloads, but use “native” BI tools, i.e., tools that were architected for data lakes, for your modern data platform. These tools are deemed “native” because they run within the data lake cluster—the BI engine is distributed and runs on each data node. Analytics are performed where the data lives, and data movement to a separate BI-specific cluster is eliminated. In addition, the benefits of a native BI platform include unified security, seamless semantic modeling, and query processing optimizations that result in reduced administration, improved self-service, and high speed and scalability.

Arcadia Enterprise provides an analytics/BI platform native to big data.

Semantic modeling for self-service BI

If faster time-to-insight on large and diverse data sets is what you hope to achieve from your data lake, then using the right tools will be a key factor in achieving that objective. One key inhibitor to self-service analytics with traditional BI tools is the slow process of moving data from the data lake to the dedicated BI server for analysis. This process has the additional drawback of separating the semantic modeling and analytic/visual discovery steps, resulting in long, drawn-out feedback loops. Arcadia Enterprise shortens the analytics lifecycle by keeping data in the data lake, allowing you to do data discovery and semantic modeling in tighter feedback loops without any IT intervention.

Business analysts need a business-oriented view of the data that they can easily explore and analyze. This “business meaning” needs to be documented alongside the data tables and fields of their database. In Arcadia Enterprise, this information is stored as a “dataset,” a semantic layer that provides a business-facing definition to the underlying data.

The semantic layer is critical for helping non-technical users understand the data. In many traditional analytics environments, semantic layers are created and maintained by IT personnel. This is not an ideal situation, because IT personnel often do not have the same view of data as the analysts. They tend to use different terms, often do not understand the business definition of the data, and often do not know what queries would be run by end users. IT personnel typically need frequent interactions with business analysts to properly define the data.

Arcadia Enterprise provides an easy-to-use interface to let business analysts create and maintain an accurate and detailed semantic layer. Semantic layers can be used to enable the right level of information sharing and collaboration across authorized users. Two key features come into play here:

  • Data model definition. Power users or analysts can specify tables (and fields) to join and the join type (e.g., inner, left, right). Users can preview a subset of the joined data.
  • Field definitions. For enriching information about fields, users can set a display name, add a comment/description, specify a default aggregation (used in the visualization interface), and specify geo type (see screen image below). Users can also categorize fields as a “dimension” or a “measure,” set the data type, and hide fields from the visualization interface.
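Conceptually, a dataset pairs physical tables with this business-facing metadata. The sketch below models that idea in plain Python; the class names, fields, and the sample join are hypothetical illustrations, not Arcadia Enterprise's actual internal format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class FieldDef:
    column: str                        # physical column name
    display_name: str                  # business-facing label
    role: str = "dimension"            # "dimension" or "measure"
    default_agg: Optional[str] = None  # e.g., "sum" or "count_distinct"
    comment: str = ""                  # business definition of the field
    hidden: bool = False               # hide from the visualization UI

@dataclass
class Dataset:
    name: str
    base_tables: List[str]
    # (left table, right table, join type, join condition)
    joins: List[Tuple[str, str, str, str]] = field(default_factory=list)
    fields: List[FieldDef] = field(default_factory=list)

events = Dataset(
    name="Web Events",
    base_tables=["events", "devices"],
    joins=[("events", "devices", "left", "events.device_id =")],
    fields=[
        FieldDef("platform", "Platform",
                 comment="Client platform reported by the app"),
        FieldDef("user_id", "Users", role="measure",
                 default_agg="count_distinct"),
        FieldDef("app_instance_id", "App Instance", hidden=True),
    ],
)

# Analysts see only visible fields, under their business names:
visible = [f.display_name for f in events.fields if not f.hidden]
print(visible)  # ['Platform', 'Users']
```

The point of the sketch is that the semantic layer is metadata about the data, maintained separately from the physical tables it describes.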

The Arcadia Enterprise semantic layer UI lets users add meaning to data tables.

Arcadia Enterprise initially makes a best guess about field definitions, which can then be updated as desired. As part of a comprehensive self-service analytics platform, the Arcadia Enterprise semantic layer interface simplifies the access to data for non-technical users. Users do not have to turn to IT personnel to define these layers, but are empowered to make adjustments at will and quickly build visualizations, dashboards, and analytical applications.

Accelerating queries with analytical views

Another factor in shorter analytics lifecycles is of course query performance. Here Arcadia Data leverages a technology we call Smart Acceleration, which is a built-in recommendation engine that analyzes queries and defines pre-computed aggregates to speed up application queries with minimal IT effort. These pre-computed aggregates are known as analytical views.

Arcadia Enterprise analytical views are a semantic caching mechanism that allows you to pre-compute and cache the results of expensive SQL operations, such as grouping and aggregation. Query acceleration on data lakes is an increasingly popular topic, especially with new business requirements around high volumes of data and high levels of concurrent users. Businesses are finding that without query acceleration, they can’t fully address the analytical demands of their end users.

But even with query acceleration, using traditional approaches like moving data to a separate, dedicated BI platform is not a sustainable approach. Analytical views represent a unique, native approach to query acceleration in data lakes, providing users with significant performance benefits:

  • Queries from apps are automatically routed to analytical views with matching SQL expressions.
  • Predictable workloads (querying) can be optimized and completed within a few seconds.
  • As the workload becomes more predictable, the automatic use of analytical views increases.
  • Analytical views that are well-partitioned (and partitioned identically to the base tables) enable incremental refresh.

The beauty of analytical views is that you don’t need to create and continuously refine your own data cube. Analytical views kick in automatically and can optimize for joins, distinct counts, medians, etc. Since your BI applications are built against the base data, you work with a single unified view of the data with access to all fields, even though specific reports may be supported by different analytical views.

An analytical view gathers and maintains aggregated data based on the query used to create it. Think of it as a shadow to the base tables. It is built using syntax similar to creating a logical view. An analytical view tracks aggregates for columns represented in its query and keeps them updated. Queries using an analytical view gain a significant performance benefit and utilize fewer system resources as compared to running the query against base tables.

When a query is run that can be partially or entirely answered by an analytical view, ArcEngine, the core analytics engine in Arcadia Enterprise, automatically uses that analytical view. The end user of the query is completely unaware of the analytical view. This is helpful because the dashboards that run the queries are built against base tables, not separate data structures, so end users don’t need to think about how best to run their queries.
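To make the routing idea concrete, here is a toy simulation using SQLite as a stand-in engine. ArcEngine's actual matching logic is internal to Arcadia Enterprise; the string-matching "router" below is a deliberate oversimplification that only illustrates the principle of transparently answering a base-table query from a pre-computed aggregate.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE events (device_id TEXT, user_id TEXT, platform TEXT, month INT);
INSERT INTO events VALUES
  ('d1','u1','ios',1), ('d2','u1','ios',1),
  ('d3','u2','android',1), ('d4','u3','ios',2);

-- The "analytical view": counts pre-computed per (month, platform).
CREATE TABLE events_month_platform_view AS
  SELECT month, platform,
         COUNT(device_id) AS count_device_id,
         COUNT(user_id)   AS count_user_id
  FROM events GROUP BY month, platform;
""")

def run_query(sql):
    """Route a matching aggregate query to the pre-computed view."""
    accelerated = sql.replace(
        "SELECT month, platform, COUNT(device_id) FROM events GROUP BY month, platform",
        "SELECT month, platform, count_device_id FROM events_month_platform_view")
    return con.execute(accelerated).fetchall()

# The dashboard query is written against the base table only:
q = "SELECT month, platform, COUNT(device_id) FROM events GROUP BY month, platform"
from_view = run_query(q)               # answered from the aggregate
from_base = con.execute(q).fetchall()  # answered from base data
print(sorted(from_view) == sorted(from_base))  # True
```

The caller never references the view; it is substituted behind the scenes, which is the property that lets dashboards stay bound to base tables.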

Here is an example of a base table and an analytical view:

Base table

CREATE TABLE events
(event_id STRING,
app_id STRING,
app_instance_id STRING,
user_id STRING,
device_id STRING,
platform STRING)
PARTITIONED BY (year INT, month INT, day INT);

Analytical view

CREATE ANALYTICAL VIEW events_month_platform_view
(SELECT month, platform,
count(device_id) as count_device_id,
count(user_id) as count_user_id
FROM events
GROUP BY month, platform);

Refresh the analytical view

REFRESH ANALYTICAL VIEW events_month_platform_view;

In this example, the events table is comprised of data generated by a sensor. This data can have multiple dashboards built on it. An analytical view called events_month_platform_view has been created that tracks the count of users and devices by month and platform. If an analytical view has been outdated due to new or modified data, you can update it incrementally by refreshing it (as long as the partitions of the analytical view are a subset of the partitions of the base table). As you would expect, refreshing incrementally will reduce the time needed to bring the analytical view fully up-to-date.
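A minimal sketch of that incremental-refresh idea, assuming (as the text requires) that the view is partitioned identically to the base table, so only partitions with new or modified data need to be re-aggregated:

```python
from collections import Counter

base = {          # partition (month) -> rows of (device_id, platform)
    1: [("d1", "ios"), ("d2", "ios")],
    2: [("d3", "android")],
}
view = {}         # partition -> device counts per platform

def refresh(partitions=None):
    """Re-aggregate only the given partitions (all if None)."""
    for p in partitions if partitions is not None else base:
        view[p] = Counter(platform for _, platform in base[p])

refresh()                      # initial full build
base[2].append(("d4", "ios"))  # new data lands in partition 2 only
refresh(partitions=[2])        # incremental: partition 1 is untouched
print(view[2])  # Counter({'android': 1, 'ios': 1})
```

Because the refresh touches only the changed partition, its cost scales with the new data rather than with the full history.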

Defining analytical views starts with the BI developer/analyst understanding which dashboards and underlying queries will be deployed to business users. Analytical views can be created manually or automatically by the system. If you know exactly which queries need acceleration, you can create analytical views manually via the command line, as shown in the example above, or by using the Arcadia Enterprise “Create Analytical View” UI as shown in the figure below.

The manual option for creating analytical views in the Arcadia Enterprise UI.

However, many times you do not know exactly which queries need to be accelerated, or what the best combination of analytical views is. This is where Smart Acceleration fits in: its recommendation engine identifies which analytical views to build for the dashboards and visuals you select for acceleration.

With the Smart Acceleration Recommendation Manager UI (see below), you can select which dashboards or visuals to accelerate, and the system will provide a list of analytical views you can create. The recommended analytical views are defined based on real-world usage of the dashboards, and will identify the most broadly applicable analytical views so that multiple queries can use the same analytical view. This minimizes redundancy across analytical views and guards against system bloat when building views.
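The coverage idea behind such recommendations can be sketched in a few lines: an aggregate grouped by columns C can serve any query that groups by a subset of C, so a recommender can score candidate views by how many workload queries each would cover. Arcadia's actual engine is proprietary; this heuristic is purely illustrative.

```python
# Group-by column sets observed across dashboard queries (toy workload).
workload = [
    {"month", "platform"},
    {"month", "platform"},
    {"platform"},
    {"month"},
    {"app_id"},
]

def recommend(workload, top_n=2):
    """Rank candidate aggregates by how many workload queries they answer."""
    candidates = {frozenset(q) for q in workload}
    # A view on columns c covers query q when q's columns are a subset of c.
    scores = {c: sum(1 for q in workload if q <= c) for c in candidates}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

best = recommend(workload)
print(best[0] == frozenset({"month", "platform"}))  # True
```

The (month, platform) aggregate wins because it also answers the month-only and platform-only queries, which is exactly the "most broadly applicable view" behavior described above.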

Arcadia Enterprise’s Smart Acceleration Recommendation Manager simplifies the process of accelerating queries.

If you face long delays in getting data to an analytics-ready state for your business users due to extensive performance modeling, or you aren’t getting the query responsiveness that your end users need, then Arcadia Enterprise analytical views and Smart Acceleration are likely the technologies you need.

Semantic layers and analytical views for modern BI

Business analysts and power users today often have the data knowledge needed to explore data sets and document the business meaning. These users can design and maintain accurate, business-friendly semantic data layers. To leverage their knowledge and avoid IT bottlenecks, businesses need a platform that empowers non-technical users to create a semantic layer as a precursor to building dashboards, visualizations, and data applications.

Arcadia Enterprise puts discovery and semantic modeling in adjacent tasks to greatly reduce the time to build visualizations. And with Smart Acceleration, Arcadia Enterprise can make recommendations on how to optimize queries with analytical views. The combination of analytical views and Smart Acceleration allows Arcadia Enterprise to support large numbers of concurrent users without the delays of performance modeling that are typical of OLAP cube-oriented environments.

Once analytical views are created, they need to be refreshed as the underlying data gets updated. Refreshes can be run automatically by a scheduler job at regular intervals. Incremental refreshes usually take little time, since they are aggregating data from new or updated partitions.

From this point onward, your dashboards are ready to be operationalized, making the data lake a unified platform for data discovery, as well as a foundation for production analytics applications for use cases spanning 360-degree customer views to cybersecurity, supply chain analysis, and communications network performance optimization.

Priyank Patel is co-founder and chief product officer at Arcadia Data, where he leads the charge in building visually beautiful and highly scalable analytical products. Prior to co-founding Arcadia, he was part of the founding engineering team at Aster Data, where he designed core components of the Aster Database. He later transitioned into field roles to win the company’s first customers in the Eastern US region and then into product management for the SQL-MapReduce and Analytical Frameworks. Following Teradata’s acquisition of Aster, Priyank led product management for its Big Data Appliance in close partnership with Hortonworks.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to

Source: InfoWorld Big Data

IDG Contributor Network: Tales of eTail West, big data, and e-commerce success—or failure

I attend the eTail West event, annually if I am able, because that’s where the enablers hang out. The worlds of retail, fashion, and e-commerce offer plenty of glitz and glamor. If what you are looking for, however, are serious discussions and actionable business opportunities with the big data and analytics providers, and like-minded brands and retailers, that are putting data-driven strategies and systems in place to make retail and e-commerce go, eTail West is a great place for that.

It was at eTail West last year that a series of events began unfolding that opened my eyes as to some changing realities in data and commerce. None of these is especially noteworthy on its own, but collectively, they speak to the occasionally eye-popping results that are possible when companies put actionable analytics to work across their operations—and to the disappointment, and questioning, that can occur when they (apparently) do not.

At eTail West 2017, I had my normal full schedule of briefings and events with vendors, brands, and retailers in full swing when sales types in one of our world regions picked up the red phone to our company’s top executives, whereupon said execs reached out, um, enthusiastically to me to put together something sizable and crucial in about 24 hours. The mission: Create an analysis on spec in the hopes of dazzling a hot sales prospect in the aforementioned red-phone region. The subject: how effectively an entity I refer to as the Retail Death Star uses analytic insights to bedevil the competition. Using a bracing combination of secondary and primary research, analyst magic, and Starbucks, I pulled it off to (muffled internal) applause. I analyzed how the Death Star is combining some truly high-quality third-party vendor offerings with its own internal data smarts to create systems and processes that use data about a lot of relevant things, including the weather, to anticipate and meet consumer demand.

The Death Star is not ‘killing it’ for this online shopper

My command performance, however, is not really the point. This is. Over the past year, intrigued by the fireworks and hurrahs emanating from the Death Star, seeing its acquisition of a property that looked like it should propel the Death Star into the thick of the e-commerce derby, and noting its more aggressive e-commerce posture in the market, as well as its whiz-bang new website, I have placed two online orders with the Death Star.

The Death Star has now gone two-for-two on those orders.

As in, it has promised me rapid delivery of these items guaranteed to help me save money and live better, and sent me upbeat updates tracking the fulfillment process … before ultimately failing to complete both orders. These were not, by any stretch of the imagination, in the words of the Bon Qui Qui character on MadTV, “complicated orders.” The most recent one was for a cocoa welcome mat. After sending me promise after promise—including an email it sent on 19 April guaranteeing delivery on 18 April—the Death Star did what I suppose it knows best: lowering the price and dropping the shipping charges. It did so only after I had reached out wanting to know where our order was, but I suppose it’s the discount that counts. In the end, though, after all the order intrigue, the Death Star left me feeling order-unfilled.

If this is what hot competition for Amazon’s e-commerce dominance looks like, Jeff Bezos can sleep even more soundly tonight on his comfy cushion of billions.

Enthusiastic product presentation and order execution followed by fulfillment failure mean one thing: the kinds of data gaps, disconnects, and mismatches we at Stratecast have been analyzing for years, which are blowing a several-trillion-dollar hole in the global retail and e-commerce economy.

Two-day delivery? Same-day delivery? How do you feel about delivery in minutes?

My Amazon purchase, by contrast, was ordered, promised, and tracked, and it showed up on time and without a hitch.

“So, what? We all know about Amazon.” So, this: at eTail West 2018 I encountered a company that is positioning itself to give both Amazon and the Death Star a run for their money, and which made me feel a whole lot better about the state of competition in e-commerce. claims to be China’s largest online retailer and largest overall retailer, as well as the country’s largest Internet company by revenue, and one of the top three e-commerce sites on the planet.

Retailers must grapple with the move toward everyone expecting to be able to obtain virtually anything by pushing a button. That and four other data-driven megatrends are shaping global markets: artificial intelligence moving into virtually every aspect of business and society; increasing social connectedness not merely for social interaction purposes, but to drive transactions and commerce; organizations controlling more of the supply chain themselves to reduce risk and improve customer experience; and, as part of that expansion, experimenting with drones to speed delivery of physical products to customers and reduce costs. is stepping up in all of these areas, and its shopping app and e-commerce website are just the beginning. The company now has control of much of its supply chain, leaving nothing to chance in terms of ordering, merchandise, inventory, or fulfillment. It receives a lot of its orders via the WeChat messaging platform, which enjoys a Facebook Messenger level of ubiquity in China and is already a thriving e-commerce engine there. also has bustling AI and robotics labs and is experimenting with drones to handle fulfillment. It is also tapping into the trend toward “Uberization”: Employees can deliver goods to customers located in range of their daily commute and are paid their regular wages and mileage costs to do so.

Competitors have captured much attention (particularly in the US) for offering free two-day shipping, and even same-day delivery on many items. is using a razor-sharp command of data and distribution to beat even that. By crunching massive amounts of data to perform predictive analytics in real time, is able to predict purchases with such timeliness and accuracy that in many cases it is able, for example, to station delivery teams at the edges of residential centers with goods for which consumers are imminently placing orders; in some cases, consumers are receiving their orders in minutes.

What about ‘that other China e-commerce competitor’ and stock market darling?

Great, you ask, but where is Alibaba in all this? For its part, Alibaba positions itself as China’s (“and by some measures, the world’s”) largest online commerce company. Alibaba handles more business than any other e-commerce company, and transactions on its online sites totaled $248 billion last year, more than those of eBay and Amazon combined.

Much of’s competitive thrust versus Alibaba in their common home market of China comes down to supply chain and distribution. controls much of the native infrastructure in China with regard to shipping and other logistics. More broadly, though, it commands a range of commerce and logistical capabilities that would take the combination of Amazon, WeChat, and FedEx to match.

Ultimately the success of and Alibaba on the world stage is going to hinge on the ability of each to drive revenue in the Americas. is moving rapidly toward expanding into the US; US businesses have already exported an estimated $15 billion to $20 billion in goods to Alibaba customers in China.

Whoever wins the battle for e-commerce supremacy in China, and other battles in other nations and regions, one thing about which I am certain is that avoiding a global monopoly on a broad swath of the goods and services we all need or want is kind of important. Another is that data holds the key.

This article is published as part of the IDG Contributor Network. Want to Join?

Source: InfoWorld Big Data

IDG Contributor Network: Big data + AI: Context, trust, and other key secrets to success

IDG Contributor Network: Big data + AI: Context, trust, and other key secrets to success

When Target deduced that a teenager from Minnesota was pregnant—and told her father about it before she’d broken the news herself—it was a reminder of just how powerful data analytics can be, and how companies must wield that power carefully.

Several years later, big data and machine learning are being used together more and more in business, providing a powerful engine for personalized marketing, fraud detection, cybersecurity and many other uses. A 2017 survey from Accenture found that 85 percent of executives plan to invest extensively in AI-related technologies over the next three years.

But machine learning doesn’t merely take existing problems and solve them more quickly. It’s an entirely new model that can address new types of problems, spur innovation, and uncover opportunities. To take advantage of it, businesses and users need to rethink some of their approaches to analytics and be aware of AI’s strengths and weaknesses.

Machine learning today, like AI in general, is both incredibly smart and incredibly dumb. It can look through vast amounts of data with great speed and accuracy, identifying patterns and connections that might have gone unnoticed before, but it does so without the broader context and awareness that people take for granted. Thus, it can divine that a girl is pregnant, but has no idea how to act on that information in an appropriate way.

As you embed machine learning into your own analytics processes, here are four areas you should be aware of to maximize the opportunities it presents.

Context is king

Machine learning can yield compelling insights within the scope of the information it has, but it lacks the wider context to know which results are truly useful. A query about what clothes a retailer should stock in its Manhattan store, for example, might return ten suggestions based on historic sales and demographic data, but some of those suggestions may be entirely impractical or things you’ve tried before. In addition, machines need people to tell them which data sets will be useful to analyze; if AI isn’t programmed to take a variable into account, it won’t see it. Business users must sometimes provide the context—as well as plain common sense—to know which data to look at and which recommendations are useful.

Cast a wide net

Machine learning can uncover precisely the answer you’re looking for—but it’s far more powerful when it uncovers something you didn’t know to ask. Imagine you’re a bank, trying to identify the right incentives to persuade your customers to take out another loan. Machine learning can crunch the numbers and provide an answer—but is securing more loans really the goal? If your objective is to increase revenue, your AI program might have even better suggestions, like opening a new branch, but you won’t know unless you define your query in a broad enough way to encompass other responses. AI needs latitude to do its best work, so don’t limit your queries based on assumptions.

Trust the process

One of the marvels of AI is that it can figure things out without our ever fully understanding how. When Google built a neural network and showed it YouTube videos for a week, the program learned to identify cats even though it had never been trained to do so. That type of mystery is fine for an experiment like Google’s, but what about in business?

One of the biggest challenges for AI is trust. People are less likely to trust a new technology if they don’t know how it arrived at an answer, and with machine learning that’s sometimes the case. The insights may be valuable, but business users need to trust the program before they can act on them. That doesn’t mean accepting every result at face value (see “context” above), but users prefer the ability to see how a solution was arrived at. As with people, it takes time to build trust, and it often forms after we’ve seen repeated good results. At first, we feel the need to verify the output, but once an algorithm has proved itself reliable the trust becomes implicit.

Act responsibly

Target isn’t the only company that failed to see how the power of data analytics could backfire. After Facebook failed to predict how its data could be used by a bad actor like Cambridge Analytica, the best excuse it could muster was that it didn’t see it coming. “I was maybe too idealistic,” Mark Zuckerberg said. For all the good it brings, machine learning is a powerful capability and companies must be aware of potential consequences of its use. This can include how analytics results are used by employees, as in Target’s case, and also how data might be used by a third party when it’s shared. Naivety is rarely a good look, especially in business.

The use of AI is expanding as companies seek new opportunities for growth and efficiency, but technologies like machine learning need to be used thoughtfully. Sometimes the technology is embedded deep within applications, and not every employee needs to know that AI is at work behind the scenes. But for some uses, results need to be assessed critically to ensure they make good business sense. Despite its intelligence, artificial intelligence is still just that—artificial—and it takes people to maximize its use. Keeping the above recommendations in mind will help you do just that.

This article is published as part of the IDG Contributor Network. Want to Join?

Source: InfoWorld Big Data

Review: Amazon SageMaker scales deep learning

Review: Amazon SageMaker scales deep learning

Amazon SageMaker, a machine learning development and deployment service introduced at re:Invent 2017, cleverly sidesteps the eternal debate about the “best” machine learning and deep learning frameworks by supporting all of them at some level. While AWS has publicly supported Apache MXNet, its business is selling you cloud services, not telling you how to do your job.

SageMaker, as shown in the screenshot below, lets you create Jupyter notebook VM instances in which you can write code and run it interactively, initially for cleaning and transforming (feature engineering) your data. Once the data is prepared, notebook code can spawn training jobs in other instances, and create trained models that can be used for prediction. SageMaker also sidesteps the need to have massive GPU resources constantly attached to your development notebook environment by letting you specify the number and type of VM instances needed for each training and inference job.

Trained models can be attached to endpoints that can be called as services. SageMaker relies on an S3 bucket (that you need to provide) for permanent storage, while notebook instances have their own temporary storage.

SageMaker provides 11 customized algorithms that you can train against your data. The documentation for each algorithm explains the recommended input format, whether it supports GPUs, and whether it supports distributed training. These algorithms cover many supervised and unsupervised learning use cases and reflect recent research, but you aren’t limited to the algorithms that Amazon provides. You can also use custom TensorFlow or Apache MXNet Python code, both of which are pre-loaded into the notebook, or supply a Docker image that contains your own code written in essentially any language using any framework. A hyperparameter optimization layer is available as a preview for a limited number of beta testers.

Source: InfoWorld Big Data

The era of the cloud database has finally begun

The era of the cloud database has finally begun

Folks, it’s happening. Although enterprises have spent the last few years shifting on-premises workloads to the public cloud, databases have been a sticking point. Sure, Amazon Web Services can point to 64,000 database migrations over the last two years, but that still leaves millions more stuck in corporate datacenters.

But not, it would appear, for long.

Ryanair, Europe’s largest airline, just signaled a significant shift in cloud migrations, announcing that it is “going all-in” on AWS, moving its infrastructure to the cloud leader. But what makes this so important is that it also includes mention of Ryanair moving away from Microsoft SQL Server and replacing it with Amazon Aurora, “standardizing on … AWS databases.”

When companies embrace cloud databases wholesale, it’s effectively game over.

Why migrating databases to the cloud has been so hard

Source: InfoWorld Big Data

Watch Tech Talk on May 17 for an in-depth GDPR discussion

Watch Tech Talk on May 17 for an in-depth GDPR discussion

Computerworld | May 8, 2018

The GDPR deadline is coming up fast, and most businesses in the U.S. aren’t ready yet. Join Ken Mingis and his panel of experts as they discuss the impact of the new rules and what U.S. organizations must do now to protect customer data. Find the show here on May 17.

Source: InfoWorld Big Data

6 hidden bottlenecks in cloud data migration

Moving terabytes or even petabytes of data to the cloud is a daunting task. But it is important to look beyond the number of bytes. You probably know that your applications are going to behave differently when accessed in the cloud, that cost structures will be different (hopefully better), and that it will take time to move all that data.

Because my company, Data Expedition, is in the business of high-performance data transfer, customers come to us when they expect network speed to be a problem. But in the process of helping companies overcome that problem, we have seen many other factors that threaten to derail cloud migrations if overlooked.

Collecting, organizing, formatting, and validating your data can present much bigger challenges than moving it. Here are some common factors to consider in the planning stages of a cloud migration, so you can avoid time-consuming and expensive problems later.

Cloud migration bottleneck #1: Data storage

The most common mistake we see in cloud migrations is pushing data into cloud storage without considering how that data will be used. The typical thought process is, “I want to put my documents and databases in the cloud and object storage is cheap, so I’ll put my document and database files there.” But files, objects, and databases behave very differently. Putting your bytes into the wrong one can cripple your cloud plans.

Files are organized by a hierarchy of paths, a directory tree. Each file can be quickly accessed, with minimal latency (time to first byte) and high speed (bits per second once the data begins flowing). Individual files can be easily moved, renamed, and changed down to the byte level. You can have many small files, a small number of large files, or any mix of sizes and data types. Traditional applications can access files in the cloud just like they would on premises, without any special cloud awareness.

All of these advantages make file-based storage the most expensive option, and storing files in the cloud has a few other disadvantages as well. To achieve high performance, most cloud-based file systems (like Amazon EBS) can be accessed by only one cloud-based virtual machine at a time, which means all applications needing that data must run on a single cloud VM. Serving multiple VMs (as Azure Files does) requires fronting the storage with a NAS (network-attached storage) protocol like SMB, which can severely limit performance. File systems are fast, flexible, and legacy compatible, but they are expensive, useful only to applications running in the cloud, and do not scale well.

Objects are not files. Remember that, because it is easy to forget. Objects live in a flat namespace, like one giant directory. Latency is high, sometimes hundreds or thousands of milliseconds, and throughput is low, often topping out around 150 megabits per second unless clever tricks are used. Much about accessing objects comes down to clever tricks like multipart upload, byte range access, and key name optimization. Objects can be read by many cloud-native and web-based applications at once, from both within and outside the cloud, but traditional applications require performance-crippling workarounds. Most interfaces for accessing object storage make objects look like files: key names are filtered by prefix to look like folders, custom metadata is attached to objects to appear like file metadata, and some FUSE-based systems cache objects on a VM file system to allow access by traditional applications. But such workarounds are brittle and sap performance. Object storage is cheap, scalable, and cloud native, but it is also slow and difficult to access.
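The flat-namespace point is easy to demonstrate. Below is a minimal simulation, with made-up keys, of how an S3-style listing uses Prefix and Delimiter filtering to carve a flat key space into apparent folders:

```python
# Objects live in a flat key space; "folders" are a prefix-filtering
# convention. This mimics how S3's ListObjectsV2 splits results into
# object keys and "common prefixes" that look like subdirectories.

def list_prefix(keys, prefix="", delimiter="/"):
    """Return (objects, common_prefixes) the way S3-style listings do."""
    objects, common = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # everything up to the next delimiter appears as a "folder"
            common.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common)

keys = ["logs/2018/01.gz", "logs/2018/02.gz", "logs/readme.txt", "data.csv"]
objs, prefixes = list_prefix(keys, prefix="logs/")
# objs == ["logs/readme.txt"], prefixes == ["logs/2018/"]
```

The store itself never maintains a directory tree; every "folder" listing is a filtered scan, which is one reason object latency behaves so differently from file latency.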

Databases have their own complex structure, and they are accessed by query languages such as SQL. Traditional databases may be backed by file storage, but they require a live database process to serve queries. This can be lifted into the cloud by copying the database files and applications onto a VM, or by migrating the data into a cloud-hosted database service. But copying a database file into object storage is only useful as an offline backup. Databases scale well as part of a cloud-hosted service, but it is critical to ensure that the applications and processes that depend on the database are fully compatible and cloud-native. Database storage is highly specialized and application-specific.

Balancing the apparent cost savings of object storage against the functionality of files and databases requires careful consideration of exactly what functionality is required. For example, if you want to store and distribute many thousands of small files, archive them into a ZIP file and store that as a single object instead of trying to store each individual file as a separate object. Incorrect storage choices can lead to complex dependencies that are difficult and expensive to change later.
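The small-files advice in the example above can be sketched with Python's standard library; uploading the resulting bytes as a single object is left to whichever storage SDK you use:

```python
# Pack many small files into one archive so they can be stored as a
# single object rather than thousands of separate ones.
import io
import zipfile

def pack_files(files):
    """files: dict of archive path -> bytes. Returns the ZIP as bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for path, data in files.items():
            zf.writestr(path, data)
    return buf.getvalue()

# a thousand small "documents" become one uploadable blob
docs = {f"docs/report_{i}.txt": b"lorem ipsum " * 20 for i in range(1000)}
archive = pack_files(docs)
```

One object means one PUT, one latency hit, and one listing entry, instead of a thousand of each.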

Cloud migration bottleneck #2: Data preparation

Moving data to the cloud is not as simple as copying bytes into the designated storage type. A lot of preparation needs to happen before anything is copied, and that time requires careful budgeting. Proof-of-concept projects often ignore this step, which can lead to costly overruns later.

Filtering out unnecessary data can save a lot of time and storage costs. For example, a data set may contain backups, earlier versions, or scratch files that do not need to be part of the cloud workflow. Perhaps the most important part of filtering is prioritizing which data needs to be moved first. Data that is being actively used will not tolerate being out of sync by the weeks, months, or years it takes to complete the entire migration process. The key here is to come up with an automated means of selecting which data is to be sent and when, then keep careful records of everything that is and is not done.
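An automated selection pass like the one described can be quite simple. The sketch below, with illustrative skip patterns and an arbitrary one-week "hot" threshold, prioritizes recently modified data and records every decision in a manifest:

```python
# Select, prioritize, and record: skip scratch/backup files, send
# recently modified data first, and log every decision. The patterns
# and the age threshold are illustrative assumptions.
import fnmatch
import time

SKIP_PATTERNS = ["*.bak", "*~", "*.tmp"]
HOT_AGE_SECONDS = 7 * 24 * 3600   # modified within a week => high priority

def plan_migration(entries, now=None):
    """entries: list of (path, mtime). Returns priority lists plus a
    manifest recording what was and was not queued."""
    now = now or time.time()
    high, low, skipped, manifest = [], [], [], {}
    for path, mtime in entries:
        if any(fnmatch.fnmatch(path, pat) for pat in SKIP_PATTERNS):
            skipped.append(path)
            manifest[path] = "skipped"
        elif now - mtime < HOT_AGE_SECONDS:
            high.append(path)
            manifest[path] = "priority"
        else:
            low.append(path)
            manifest[path] = "bulk"
    return high, low, skipped, manifest

now = 1_000_000_000
entries = [("active.db", now - 3600),
           ("old_export.csv", now - 90 * 86400),
           ("cache.tmp", now - 60)]
high, low, skipped, manifest = plan_migration(entries, now=now)
```

The manifest is the important part: it is the record that lets you prove, later, what was sent and what was deliberately left behind.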

Different cloud workflows may require the data to be in a different format or organization than on-premises applications. For example, a legal workflow might require translating thousands of small Word or PDF documents and packing them in ZIP files, a media workflow might involve transcoding and metadata packing, and a bioinformatics workflow might require picking and staging terabytes of genomics data. Such reformatting can be an intensely manual and time-consuming process. It may require a lot of experimentation, a lot of temporary storage, and a lot of exception handling. Sometimes it is tempting to defer any reformatting to the cloud environment, but remember that this does not solve the problem, it just shifts it to an environment where every resource you use has a price.

Part of the storage and formatting questions may involve decisions about compression and archiving. For example, it makes sense to ZIP millions of small text files before sending them to the cloud, but not a handful of multi-gigabyte media files. Archiving and compressing data makes it easier to transfer and store the data, but consider the time and storage space it takes to pack and unpack those archives at either end.

Cloud migration bottleneck #3: Information validation

Integrity checking is the single most important step, and also the easiest to get wrong. Often it is assumed that corruption will occur during the data transport, whether that is by physical media or network transfer, and can be caught by performing checksums before and after. Checksums are a vital part of the process, but it is actually the preparation and importing of the data where you are most likely to suffer loss or corruption.
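Checksums before and after are straightforward to script. A minimal sketch with SHA-256, using in-memory stand-ins for file contents:

```python
# Build a digest manifest on the source side, recompute it on the
# destination, and diff the two to find corrupted or missing files.
import hashlib

def manifest(files):
    """files: dict of name -> bytes. Returns name -> hex digest."""
    return {name: hashlib.sha256(data).hexdigest()
            for name, data in files.items()}

def verify(before, after):
    """Return names whose digests changed or that went missing."""
    return sorted(name for name, digest in before.items()
                  if after.get(name) != digest)

src = {"a.csv": b"1,2,3\n", "b.csv": b"4,5,6\n"}
dst = {"a.csv": b"1,2,3\n", "b.csv": b"4,5,X\n"}  # simulated corruption
bad = verify(manifest(src), manifest(dst))
```

Note that this only proves the bytes match; as the next paragraph explains, identical bytes can still be useless if the receiving application can't interpret them.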

When data is shifting formats and applications, meaning and functionality can be lost even when the bytes are the same. A simple incompatibility between software versions can render petabytes of “correct” data useless. Coming up with a scalable process to verify that your data is both correct and useable can be a daunting task. At worst, it may devolve into a labor-intensive and imprecise manual process of “it looks okay to me.” But even that is better than no validation at all. The most important thing is to ensure that you will be able to recognize problems before the legacy systems are decommissioned!

Cloud migration bottleneck #4: Transfer marshaling

When lifting a single system to the cloud, it is relatively easy to just copy the prepared data onto physical media or push it across the Internet. But this process can be difficult to scale, especially for physical media. What seems “simple” in a proof-of-concept can balloon into a “nightmare” when many and varied systems come into play.

A media device, such as an AWS Snowball, must be connected to each machine. That could mean physically walking the device around one or more data centers, juggling connectors, updating drivers, and installing software. Connecting over the local network saves the physical movement, but software setup can still be challenging and copy speed may drop to well below what could be achieved with a direct Internet upload. Transferring the data directly from each machine over the Internet saves many steps, especially if the data is cloud-ready.

If data preparation involves copying, exporting, reformatting, or archiving, local storage can become a bottleneck. It may be necessary to set up dedicated storage to stage the prepared data. This has the advantage of allowing many systems to perform preparation in parallel, and reduces the contact points for shippable media and data transfer software to just one system.

Cloud migration bottleneck #5: Data transfer

When comparing network transfer to media shipment, it is easy to focus on just the shipping time. For example, an 80 terabyte AWS Snowball device might be sent by next-day courier, achieving an apparent data rate of more than eight gigabits per second. But this ignores the time it takes to acquire the device, configure and load it, prepare it for return, and allow the cloud vendor to copy the data off on the back-end. Customers of ours who do this regularly report that four-week turnaround times (from device ordering to data available in the cloud) are common. That brings the actual data transfer rate of shipping the device down to just 300 megabits per second, much less if the device is not completely filled.
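The back-of-envelope arithmetic above is worth making explicit: the effective throughput of a shipped device is capacity divided by total turnaround time, not by courier time. Decimal terabytes and exact one-day and four-week windows are assumptions for the calculation:

```python
# Effective data rate of a shipped storage device.

def effective_bps(capacity_tb, turnaround_days):
    bits = capacity_tb * 1e12 * 8        # decimal terabytes -> bits
    seconds = turnaround_days * 86400
    return bits / seconds

# ~7.4 Gbps with a full 24-hour window; a shorter courier window
# pushes the "apparent" rate past 8 Gbps.
overnight = effective_bps(80, 1)

# ~265 Mbps over a four-week order-to-available turnaround, in the
# ballpark of the 300 Mbps figure cited above.
realistic = effective_bps(80, 28)
```

The thirty-fold gap between the two numbers is the whole point: shipping time is a small slice of turnaround time.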

Network transfer speeds likewise depend on a number of factors, foremost being the local uplink. You can’t send data faster than the physical bit rate, though careful data preparation can reduce the amount of data you need to send. Legacy protocols, including those that cloud vendors use by default for object storage, have difficulty with speed and reliability across long-distance Internet paths, which can make achieving that bit rate difficult. I could write many articles about the challenges involved here, but this is one you do not have to solve yourself. Data Expedition is one of a few companies that specialize in ensuring that the path is fully utilized regardless of how far away your data is from its cloud destination. For example, one gigabit Internet connection with acceleration software like CloudDat yields 900 megabits per second, three times the net throughput of an AWS Snowball.

The biggest difference between physical shipment and network transfer is also one of the most commonly overlooked during proof-of-concept. With physical shipment, the first byte you load onto the device must wait until the last byte is loaded before you can ship. This means that if it takes weeks to load the device, then some of your data will be weeks out of date by the time it arrives in the cloud. Even when data sets reach the petabyte levels where physical shipment may be faster overall, the ability to keep priority data current during the migration process may still favor network transfer for key assets. Careful planning during the filtering and prioritization phase of data preparation is essential, and may allow for a hybrid approach.

Getting the data into a cloud provider may not be the end of the data transfer step. If it needs to be replicated to multiple regions or providers, plan carefully how it will get there. Upload over the Internet is free, while AWS, for example, charges up to two cents per gigabyte for interregional data transfer and nine cents per gigabyte for transfer to other cloud vendors. Both methods will face bandwidth limitations that could benefit from transport acceleration such as CloudDat.
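The replication pricing above is simple per-gigabyte arithmetic, but it adds up quickly at scale. A quick sketch, using the rates quoted in the text rather than current price-list values:

```python
# Per-gigabyte transfer cost at the rates quoted above (not a current
# price list): $0.02/GB between AWS regions, $0.09/GB to another vendor.

def transfer_cost_usd(gigabytes, per_gb_rate):
    return gigabytes * per_gb_rate

# replicating 100 TB (100,000 GB):
interregional = transfer_cost_usd(100_000, 0.02)  # between AWS regions
cross_cloud = transfer_cost_usd(100_000, 0.09)    # to another cloud vendor
```

At 100 TB the difference between staying in one cloud and replicating across vendors is thousands of dollars per copy, which is why replication paths deserve a line in the migration plan.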

Cloud migration bottleneck #6: Cloud scaling

Once data arrives at its destination in the cloud, the migration process is only half finished. Checksums come first: Make sure that the bytes that arrived match those that were sent. This can be trickier than you may realize. File storage uses layers of caches that can hide corruption of data that was just uploaded. Such corruption is rare, but until you’ve cleared all of the caches and re-read the files, you can’t be sure of any checksums. Rebooting the instance or unmounting the storage does a tolerable job of clearing caches.

Validating object storage checksums requires that each object be read out into an instance for calculation. Contrary to popular belief, object “ETags” are not useful as checksums. Objects uploaded using multipart techniques in particular can only be validated by reading them back out.
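The reason multipart ETags fail as checksums can be shown locally. For multipart uploads, S3's ETag is widely observed (though not contractually guaranteed) to be the MD5 of the concatenated per-part MD5 digests, suffixed with the part count, not the MD5 of the object's bytes:

```python
# Reconstruct the commonly observed S3 multipart ETag and compare it
# against a whole-object MD5 to show why they can never match.
import hashlib

def multipart_etag(data, part_size):
    part_digests = [hashlib.md5(data[i:i + part_size]).digest()
                    for i in range(0, len(data), part_size)]
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"

data = b"x" * (10 * 1024 * 1024)                 # a 10 MB object
etag = multipart_etag(data, 5 * 1024 * 1024)     # uploaded in 5 MB parts
plain_md5 = hashlib.md5(data).hexdigest()
# etag != plain_md5: only reading the object back yields a usable checksum
```

Single-part uploads do get a plain-MD5 ETag, but since you rarely control which path an SDK chooses for large files, reading objects back out remains the only dependable validation.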

Once the transferred data has been verified, it may need further extraction, reformatting, and distribution before your cloud-based applications and services can make use of it. This is pretty much the opposite of the preparation and marshaling that occurred on premises.

The final step of scaling out the data is to verify that it is both correct and useful. This is the other side of the information validation planning discussed above and is the only way to know whether you are truly done.

Cloud migration is more about processes than data. Even seemingly simple tasks like file distribution can require complex migration steps to ensure that the resulting cloud infrastructure matches the desired workflow. Much of the hype surrounding cloud, from cost savings to scalability, is justifiable. But careful planning and anticipation of difficulties is essential to determining what tools and methods are necessary to realize those returns.

Seth Noble is the creator of the patented Multipurpose Transaction Protocol (MTP) technology and a top data transport expert. He is founder and president of Data Expedition, with a dual BS-MS degree from Caltech, and a doctorate in computer science from the University of Oklahoma for work developing MTP.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to

Source: InfoWorld Big Data

IDG Contributor Network: 3 requirements of modern archive for massive unstructured data

Perhaps the least understood component of secondary storage strategy, archive has become a necessity for modern digital enterprises with petabytes of data and billions of files.

So, what exactly is archive, and why is it so important?

Archiving data involves moving data that is no longer frequently accessed off primary systems for long-term retention.

The most apparent benefit of archiving data is to save precious space on expensive primary NAS or to retain data for regulatory compliance, but archiving can reap long-term benefits for your business as well. For example, archiving the results of scientific experiments that would be costly to replicate can prove extremely valuable for future studies.

In addition, a strong archive tier can cost-effectively protect and enable usage of the huge data sets needed for enhanced analytics, machine learning, and artificial intelligence workflows.

Legacy archive fails for massive unstructured data

However, legacy archive infrastructure wasn’t built to meet the requirements of massive unstructured data, resulting in three key failures of legacy archive solutions.

First, the scale of data has changed greatly, from terabytes to petabytes and quickly growing. Legacy archive can’t move high volumes of data quickly enough and can’t scale with today’s exploding data sets.

Second, the way organizations use data has also changed. It’s no longer adequate to simply throw data into a vault and keep it safe; organizations need to use their archived data as digital assets become integral to business. As more organizations employ cloud computing and machine learning/AI applications using their huge repositories of data, legacy archive falls short in enabling usage of archived data.

Third, traditional data management must become increasingly automated and delivered as-a-Service to relieve management overhead on enterprise IT and reduce total cost of ownership as data explodes beyond petabytes.

Modern archive must overcome these failures of legacy solutions and meet the following requirements.

1. Ingest petabytes of data

Because today’s digital enterprises are generating and using petabytes of data and billions of files, a modern archive solution must have the capacity to ingest enormous amounts of data.

Legacy software moves data over single-threaded protocols, an approach that was necessary for writing to tape and worked for terabyte-scale data but fails for today’s petabyte-scale data.

Modern archive needs highly parallel and latency-aware data movement to efficiently move data from where it lives to where it’s needed, without impacting performance. The ability to automatically surface archive-ready data and set policies to snapshot, move, verify, and re-export data can reduce administrator effort and streamline data management.
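The "highly parallel" idea can be illustrated with a generic worker-pool sketch (this is not how any particular product implements it): keep many transfers in flight at once so per-file latency overlaps instead of accumulating serially.

```python
# Parallel data movement: a pool of workers keeps many transfers in
# flight so latency is overlapped rather than paid once per file.
from concurrent.futures import ThreadPoolExecutor

def transfer_one(path):
    """Placeholder for a real upload call; returns the path on success."""
    return path

def transfer_all(paths, workers=16):
    # pool.map preserves input order, so results line up with paths
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transfer_one, paths))

done = transfer_all([f"file_{i}.dat" for i in range(100)])
```

A single-threaded mover pays full round-trip latency per file; with 16 workers, sixteen of those round trips proceed concurrently, which is where most of the throughput gain at petabyte scale comes from.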

In addition, modern archive must be able to scale with exponentially growing data. Unlike legacy archive, which necessitates silos as data grows large, a scale-out archive tier keeps data within the same system for simpler management.

2. API-driven, cloud-native architecture

An API-driven archive solution can plug into customer applications, ensuring that the data can be used. Legacy software wasn’t designed with this kind of automation, making it difficult to use the data after it’s been archived.

Modern archive that’s cloud-native can much more easily plug into customer applications and enable usage. My company’s product, Igneous Hybrid Storage Cloud, is built with event-driven computing, applying the cloud-native concept of having interoperability at every step. Event-driven computing models tie compute to actions on data and are functionally API-driven, adding agility to the software. Building in compatibility with any application is simply a matter of exposing existing APIs to customer-facing applications.

This ensures that data can get used by customer applications. This capability is especially useful in the growing fields of machine learning and AI, where massive repositories of data are needed for compute. The more data, the better—which not only requires a scale-out archive tier, but one that enables that data to be computed.

An example of a machine learning/AI workflow used by Igneous customers involves using Igneous Hybrid Storage Cloud as the archive tier for petabytes of unstructured file data and moving smaller subsets of data to a “hot edge” primary tier from which the data can be processed and computed.

3. As-a-Service delivery

Many of the digital enterprises and organizations with enormous amounts of unstructured file data don’t necessarily have the IT resources or budget to match, let alone the capacity to keep pace with the growing IT requirements of their exponentially growing data.

To keep management overhead reasonable and cost-effective, many organizations are turning to as-a-service solutions. With as-a-service platforms, the vendor remotely monitors, updates, and troubleshoots the software, so that organizations can focus on their business, not IT.

Modern archive solutions that are delivered as-a-service can help organizations save on total cost of ownership (TCO) when taking into account the amount of time it frees up for IT administrators to focus on other tasks—like planning long-term data management and archiving strategy.

This article is published as part of the IDG Contributor Network.

Source: InfoWorld Big Data

IDG Contributor Network: Why data democratization is crucial to your business

In the Information Age, the power of data has been mostly kept in the hands of a few data analysts with the skills and understanding necessary to properly organize, crunch, and interpret the data for their organization. This approach was born out of necessity—most employees were not trained how to effectively use the growing flood of data.

But things have changed with the emergence of technologies capable of making data shareable and interpretable for nondata analysts. Data democratization allows data to pass safely from the hands of a few analysts into the hands of the masses in a company.

Data democratization is a game-changer

Data democratization will catapult companies to new heights of performance, if done right. Indeed, the utopian vision of data democratization is hard to refuse.

“Data democratization means that everybody has access to data and there are no gatekeepers that create a bottleneck at the gateway to the data. The goal is to have anybody use data at any time to make decisions with no barriers to access or understanding,” says Bernard Marr, bestselling author of Big Data in Practice.

The ability to instantly access and understand data will translate into faster decision-making, and that will translate into more agile teams. Those teams will have a competitive advantage over slower data-stingy businesses.

But Marr believes it’s about more than just being able to take instant action. “When you allow data access to any tier of your company, it empowers individuals at all levels of ownership and responsibility to use the data in their decision making,” he says. If the current situation encourages team members to go around data to get things done on time, data democratization creates team members that are more data-driven.

When things happen, in a good or bad sense, and the right people are proactively informed, those people can dig into and understand the anomalies.

Ultimately, for marketers striving to create the ultimate customer experience, data democratization is a must. The question on their minds should not be if data democratization is coming, but how they can create it in their organization quickly and efficiently.

Laying the foundation for data democratization

Businesses that wish to benefit from data democratization will have to create it intentionally. This means an organizational investment must be made in terms of budget, software, and training. 

In the world of data democratization, breaking down information silos is the first step toward user empowerment. This cannot be done without customizable analytics tools capable of desegregating and connecting previously siloed data, making it manageable from a single place.

Ideally, the tools will filter the data and visualizations shared with each individual—whether they are an executive, a director, or a designer—according to each person’s role. Marketing managers, for instance, will need data that allows them to analyze customer segments leading up to a new campaign. CMOs, on the other hand, will need data that allows them to analyze marketing ROI as they build next year’s budgets.

Those tools must help employees visualize their data. The ability to access data in a visual form that its consumers are comfortable with is important. These visualizations must align with the organization’s KPIs: metrics, goals, targets, and objectives that have been aligned from the top down and that enable data-driven decisions.

With the right tools in place, team training becomes the next essential step. Because data democratization depends on the concept of self-service analytics, every team member must be trained up to a minimum level of comfort with the tools, concepts, and processes involved to participate.

Last, you cannot have a democracy without checks and balances. The final step to sharing data across your organization is data governance. Mismanagement or misinterpretation of data is a real concern. Therefore, a center of excellence is recommended to keep the use of data on the straight and narrow. This center of excellence should have the goal of driving adoption of data usage, which it makes possible by owning data accuracy, curation, sharing, and training. These teams are often most successful when they have budget, a cross-section of skill sets, and executive approval.

When executed this way, sharing data can allow every player on your team to realize the value of that data. Fortunately, we don’t have to wait for the future to see what marketing teams can accomplish when this powerful resource is available to them.

The future of data democratization is now

For a sterling example of data democratization in action, you need look no further than the Royal Bank of Scotland, a client of my company Adobe Systems. The bank’s digital marketing leaders invited representatives from multiple parts of its business—including its call center, human resources, and legal department—to help optimize parts of the customer experience. Working off the same data, these nonmarketers could bring fresh insights to the marketing process and revolutionize the bank’s customer experience.

 “Raising visibility from our digital marketing platform and data-driven strategies was vital to the shift,” says the bank’s head of analytics, Giles Richardson. “We had to have concrete, measurable insights and ways for our cross-functional teams to act on them to propel RBS into its next chapter.”

For the Royal Bank of Scotland and other businesses interested in making the move toward data democratization, the journey is not measured in reaching a single destination. It has to be viewed as an ongoing process.

“Expect that data democratization is an evolution where each individual small win, when nontechnical users gain insight because of accessing the data, adds up to ultimately prove the merits of data democratization,” says Marr.

Data democratization is the future of managing big data and realizing its value. Businesses armed with the right tools and understanding are succeeding today because they are arming all their employees with the knowledge necessary to make smart decisions and provide better customer experiences.

Source: InfoWorld Big Data

Google Cloud tutorial: Get started with Google Cloud

When people think of the word Google, they think about search and the immense computational infrastructure that converts your words into a list of websites that probably have exactly what you’re looking to find. It took Google years to hire the engineers, design the custom computers, and create the huge collection of hardware that answers web queries. Now it can be yours with just a few keystrokes and clicks. 

Google rents out much of that expertise and infrastructure to other web companies. If you want to build a clever website or service, Google is ready to charge you to run it on its vast collection of machines. All you need to do is start filling out some web forms and soon you’ll have a big collection of servers ready to scale and handle your chores.

For a quick guide to getting started, and to navigating the many choices along the way, just follow me.  

Step 1: Set up your account

This is the easy part. If you’ve got a Google account, you’re ready to go. You can log in and head right to your Console and Dashboard. There won’t be much to see here when you begin, but soon you’ll start to see details about what your vast computing empire is doing. That is, the load on any server instances you’ve created, the data flowing through the network, and the usage of APIs. You can assure yourself that everything is running smoothly with a glance.

Source: InfoWorld Big Data