IDG Contributor Network: ETL is dead

Extract, transform, and load. It doesn’t sound too complicated. But, as anyone who’s managed a data pipeline will tell you, the simple name hides a ton of complexity.

And while none of the steps are easy, the part that gives data engineers nightmares is the transform. Taking raw data, cleaning it, filtering it, reshaping it, summarizing it, and rolling it up so that it’s ready for analysis. That’s where most of your time and energy goes, and it’s where there’s the most room for mistakes.
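
To make that concrete, here’s a rough sketch of what that “T” often looks like, with pandas standing in for whatever transformation engine you use. The table and column names (order_id, amount, ts, status) are invented for illustration:

    # Rough sketch of the pre-load "T" with pandas; the table and column names
    # (order_id, amount, ts, status) are invented for illustration.
    import pandas as pd

    def transform(raw: pd.DataFrame) -> pd.DataFrame:
        df = raw.copy()
        # Clean: drop rows missing the fields we can't analyze without.
        df = df.dropna(subset=["order_id", "amount", "ts"])
        # Filter: keep only completed orders; everything else gets thrown away.
        df = df[df["status"] == "complete"]
        # Reshape: parse timestamps and bucket them by day.
        df["day"] = pd.to_datetime(df["ts"]).dt.date
        # Summarize and roll up: one row per day, with today's definition of
        # "revenue" baked in for good.
        return df.groupby("day", as_index=False).agg(
            orders=("order_id", "count"),
            revenue=("amount", "sum"),
        )  # this rollup, not the raw data, is what gets loaded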

If ETL is so hard, why do we do it this way?

The answer, in short, is because there was no other option. Data warehouses couldn’t handle the raw data as it was extracted from source systems, in all its complexity and size. So the transform step was necessary before you could load and eventually query data. The cost, however, was steep.

Rather than maintaining raw data that could be transformed into any possible end product, the transform shaped your data into an intermediate form that was less flexible. You lost some of the data’s resolution, imposed the current version of your business’s metrics on the data, and threw out whatever data seemed useless.

And if any of that changed—if you needed hourly data when previously you’d only processed daily data, if your metric definitions changed, or if some of that “useless” data turned out not to be so useless after all—then you’d have to fix your transformation logic, reprocess your data, and reload it.

The fix might take days or weeks.

It wasn’t a great system, but it’s what we had.

So as technologies change and prior constraints fall away, it’s worth asking what we would do in an ideal world—one where data warehouses were infinitely fast and could handle data of any shape or size. In that world, there’d be no reason to transform data before loading it. You’d extract it and load it in its rawest form.

You’d still want to transform the data, because querying low-quality, dirty data isn’t likely to yield much business value. But your infinitely fast data warehouse could handle that transformation right at query time. The transformation and query would all be a single step. Think of it as just-in-time transformation. Or ELT.
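
Here’s a toy sketch of that idea, with SQLite standing in for the imaginary infinitely fast warehouse and an invented raw_orders table. The raw rows are loaded exactly as extracted, and all of the cleaning, filtering, and rollup lives in the query itself:

    # Toy ELT sketch: load raw rows untouched, transform only at query time.
    # SQLite stands in for the warehouse; table and column names are invented.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (order_id TEXT, status TEXT, amount REAL, ts TEXT)")
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
        [
            ("o1", "complete", 20.0, "2017-01-01T09:15:00"),
            ("o2", "refunded", 20.0, "2017-01-01T11:30:00"),
            ("o3", "complete", 35.0, "2017-01-02T08:05:00"),
        ],
    )

    # The transform is the query, so a new metric definition is just a new query.
    daily_revenue = conn.execute(
        """
        SELECT date(ts) AS day, SUM(amount) AS revenue
        FROM raw_orders
        WHERE status = 'complete'
        GROUP BY date(ts)
        ORDER BY day
        """
    ).fetchall()
    print(daily_revenue)  # [('2017-01-01', 20.0), ('2017-01-02', 35.0)]

Change your definition of revenue and you edit a query; there’s nothing to reprocess or reload.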

The advantage of this imaginary system is clear: You wouldn’t have to decide ahead of time which data to discard or which version of your metric definitions to use. You’d always use the freshest version of your transformation logic, giving you total flexibility and agility.

So, is that the world we live in? And if so, should we switch to ELT?

Not quite. Data warehouses have indeed gotten several orders of magnitude faster and cheaper. Transformations that used to take hours and cost thousands of dollars now take seconds and cost pennies. But they can still get bogged down with misshapen data or huge processes.

So there’s still some transformation that’s best accomplished outside the warehouse. Removing irrelevant or dirty data, and doing heavyweight reshaping, is still often a preloading process. But this initial transform is a much smaller step and thus much less likely to need updating down the road.
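
As a sketch, that smaller pre-load “t” might be nothing more than dropping records that won’t parse and flattening heavy nesting, with no business logic anywhere in sight. The field names and the nested user object are invented for illustration:

    # Lightweight pre-load "t": drop unparseable records and flatten nesting.
    # No metric definitions, no filtering on business rules.
    import json
    from typing import Iterable, Iterator

    def pre_load_t(lines: Iterable[str]) -> Iterator[dict]:
        for line in lines:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # dirty row: not worth loading
            user = record.pop("user", None) or {}
            # Flatten one level of nesting so the warehouse gets plain columns.
            record["user_id"] = user.get("id")
            record["user_country"] = user.get("country")
            yield record  # everything else passes through untouched

    # Usage sketch: rows = list(pre_load_t(open("events.jsonl")))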

Basically, it’s gone from a big, all-encompassing ‘T’ to a much smaller ‘t’.

Once the initial transform is done, it’d be nice to move the rest of the transform to query time. But especially with larger data volumes, the data warehouses still aren’t quite fast enough to make that workable. (Plus, you still need a good way to manage the business logic and impose it as people query.)

So instead of moving all of that transformation to query time, more and more companies are doing most of it in the data warehouse—but they’re doing it immediately after loading. This gives them lots more agility than in the old system, but maintains tolerable performance. For now, at least, this is where the biggest “T” is happening.

The lightest-weight transformations—the ones the warehouses can do very quickly—are happening right at query time. This represents another small “t,” but it has a very different focus than the preloading “t.” That’s because these lightweight transformations often involve prototypes of new metrics and more ad hoc exploration, so the total flexibility that query-time transformation provides is ideal.
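
Putting the last two pieces together, here’s a toy sketch of how the in-warehouse “T” and the query-time “t” split the work, with SQLite again standing in for the warehouse and invented table and column names:

    # The big "T" runs once, right after loading; the little "t" is just an
    # ad hoc query on top of the result. Names are invented for illustration.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (order_id TEXT, status TEXT, amount REAL, ts TEXT)")
    # ... lightly cleaned raw rows get loaded here ...

    # Post-load "T": materialize a modeled table inside the warehouse.
    conn.execute(
        """
        CREATE TABLE daily_revenue AS
        SELECT date(ts) AS day, COUNT(*) AS orders, SUM(amount) AS revenue
        FROM raw_orders
        WHERE status = 'complete'
        GROUP BY date(ts)
        """
    )

    # Query-time "t": prototype a metric nobody has committed to yet.
    prototype = conn.execute(
        "SELECT day, revenue / orders AS avg_order_value FROM daily_revenue"
    ).fetchall()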

In short, we’re seeing a huge shift that takes advantage of new technologies to make analytics more flexible, more responsive, and more performant. As a result, employees are making better decisions using data that was previously slow, inaccessible, or worst of all, wrong. And the companies that embrace this shift are outpacing rivals stuck in the old way of doing things.

ETL? ETL is dead. But long live … um … EtLTt?

Source: InfoWorld Big Data

IDG Contributor Network: Bringing embedded analytics into the 21st century

Software development has changed pretty radically over the last decade. Waterfall is out, Agile is in. Slow release cycles are out, continuous deployment is in. Developers avoid scaling up and scale out instead. Proprietary integration protocols have (mostly) given way to open standards.

At the same time, exposing analytics to customers in your application has gone from a rare, premium offering to a requirement. Static reports and SOAP APIs that deliver XML files just don’t cut it anymore.

And yet, the way that most embedded analytics systems are designed is basically the same as it was 10 years ago: inflexible, hard to scale, lacking modern version control, and reliant on specialized, expensive hardware.

Build or Buy?

It’s no wonder that today’s developers often choose to build embedded analytics systems in-house. Developers love a good challenge, so when faced with the choice between an outdated, off-the-shelf solution and building for themselves, they’re going to get to work.

But expectations for analytics have increased, and so even building out the basic functionality that customers demand can sidetrack engineers (whose time isn’t cheap) for months. This is to say nothing of the engineer-hours required to maintain a homegrown system down the line. I simply don’t believe that building it yourself is the right solution unless analytics is your core product.

So what do you do?

Honestly, I’m not sure. Given the market opportunity, I think it’s inevitable that more and more vendors will move into the space and offer modern solutions. And so I thought I’d humbly lay out 10 questions embedded analytics buyers should ask about the solutions they’re evaluating.

  1. How does the solution scale as data volumes grow? Does it fall down or require summarization when dealing with big data?
  2. How does the tool scale to large customer bases? Is supporting 1,000 customers different than supporting 10?
  3. Do I need to maintain specialized ETLs and data ingestion flows for each customer? What if I want to change the ETL behavior? How hard is that?
  4. What’s the most granular level that customers can drill to?
  5. Do I have to pay to keep duplicated data in a proprietary analytics engine? If so, how much latency does that introduce? How do things stay in sync?
  6. Can I make changes to the content and data model myself or is the system a black box where every change requires support or paid professional services?
  7. Does it use modern, open standards like HTML5, JavaScript, iframes, HTTPS, and RESTful APIs? (See the sketch after this list for one common embedding pattern.)
  8. Does the platform offer version control? If so, which parts of the platform (data, data model, content, etc.) are covered by version control?
  9. How customizable is the front-end? Can fonts, color palettes, language, timezones, logos, and caching behavior all be changed? Can customization be done on a customer-by-customer basis or is it one template for all customers?
  10. How much training is required for admins and developers? And how intuitive is the end-user interface?
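
On question 7, the pattern you’ll most often see today is an iframe whose URL is signed server-side and short-lived, plus REST APIs for everything else. Here’s a rough sketch of that signing step; the path, parameter names, and shared secret are all invented for illustration, not any particular vendor’s API:

    # Rough sketch of server-side signing for an embedded-analytics iframe URL.
    # The path, parameter names, and shared secret are invented for illustration;
    # real vendors each document their own signing scheme.
    import hashlib
    import hmac
    import time
    from urllib.parse import urlencode

    EMBED_SECRET = b"shared-secret-issued-by-the-analytics-vendor"  # hypothetical

    def signed_embed_url(customer_id: str, dashboard: str) -> str:
        params = {
            "customer": customer_id,            # scopes the data the iframe can see
            "dashboard": dashboard,
            "expires": int(time.time()) + 300,  # short-lived: five minutes
        }
        query = urlencode(sorted(params.items()))
        signature = hmac.new(EMBED_SECRET, query.encode(), hashlib.sha256).hexdigest()
        return f"https://analytics.example.com/embed/dashboard?{query}&signature={signature}"

    # The application drops the resulting URL into an <iframe> on its own page;
    # the signature and the customer parameter keep tenants from seeing each other's data.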

No vendor that I know of has the “right” answer to all these questions (yet), but they should be taking these issues seriously and working toward these goals.

If they’re not, you can bet your engineers are going to start talking about how they could build something better in a week. HINT: They actually can’t, but good luck winning that fight 😉

Source: InfoWorld Big Data

IDG Contributor Network: Getting off the data treadmill

Most companies start their data journey the same way: with Excel. People who are deeply familiar with the business start collecting some basic data, slicing and dicing it, and trying to get a handle on what’s happening.

The next place they go, especially now, with the advent of SaaS tools that aid in everything from resource planning to sales tracking to email marketing, is into the analytics that come packaged with those tools.

These tools provide basic analytic functions, and can give a window into what’s happening in at least one slice of the business. But drawing connections between those slices (joining finance data with marketing data, or sales with customer service) is where the real value lies. And that’s exactly where these department-specific tools fall down.
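
Even a toy example makes the point; no single department’s packaged tool will do this join for you (the column names here are invented for illustration):

    # A toy join across two departmental slices of the business.
    import pandas as pd

    marketing = pd.DataFrame({"month": ["Jan", "Feb"], "ad_spend": [10_000, 12_000]})
    sales = pd.DataFrame({"month": ["Jan", "Feb"], "new_revenue": [40_000, 42_000]})

    combined = marketing.merge(sales, on="month")
    combined["revenue_per_ad_dollar"] = combined["new_revenue"] / combined["ad_spend"]
    print(combined)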

So when you talk to people in that second phase, understandably, they’re looking forward to the day when all of their data automatically flows into one place. No more manual, laborious hours spent combining data. Just one place to look and see exactly what’s happening in the business.

Except…

Once you give people a taste of the data and they can see what’s happening, naturally, their very next question is, “Well, why did that happen?”

How things usually work

And that’s where things break down. For most of the history of business intelligence, the way you answered “why” questions was to extract the relevant data from that beautiful centralized tool and send it off to an analyst. They would load the data back into a workbook, start from scratch on a new report, and you’d wait.

By the time you got your answer, it was usually too late to use that knowledge in making your decision.

The whole thing is kind of silly, though — you’d successfully gotten rid of a manual, laborious process and replaced it with one that is, well, manual and laborious. You thought you were moving forward, but it turns out you were just on a treadmill.

To sketch it out, here’s what that looks like:

[Diagram: Daniel Mintz]

Another path

Recently, though, more and more businesses are realizing that there’s another way: with the right tools, you can put the means to answer “why” questions in the hands of the people who can (and will) take action based on those answers.

In the old world, you’d find out in February that January leads were down, and wait until March for the analysis that reveals that — d’oh! — the webform wasn’t working on mobile. In the new world, you can get an automated alert about the drop-off in the first week of the year. You can drill into the relevant data immediately by device type, realize that the drop-off only affects mobile, surface the bug, and get it fixed that afternoon.
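
Here’s a rough sketch of that alert-plus-drill pattern in pandas, with invented column names (created_at, device_type). In practice this logic would live in your BI tool rather than a one-off script, but the shape is the same:

    # Automated drop-off check with a drill-down by device type.
    import pandas as pd
    from typing import List, Optional

    def weekly_leads(leads: pd.DataFrame, by: Optional[List[str]] = None) -> pd.DataFrame:
        """Count leads per week, optionally broken out by extra columns."""
        df = leads.copy()
        df["week"] = pd.to_datetime(df["created_at"]).dt.to_period("W")
        keys = ["week"] + (by or [])
        return df.groupby(keys).size().rename("leads").reset_index()

    def check_dropoff(leads: pd.DataFrame, threshold: float = 0.8) -> None:
        """Alert if the latest week falls below 80% of the trailing average."""
        weekly = weekly_leads(leads).sort_values("week")
        if len(weekly) < 2:
            return  # not enough history to compare against
        latest = weekly["leads"].iloc[-1]
        baseline = weekly["leads"].iloc[:-1].mean()
        if latest < threshold * baseline:
            print(f"ALERT: {latest} leads this week vs. a baseline of {baseline:.0f}")
            # Drill into the same numbers by device type to localize the drop.
            print(weekly_leads(leads, by=["device_type"]).tail(10))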

That’s the real value that most businesses aren’t realizing from their data. It’s much less about incorporating the latest machine learning algorithm that delivers a 3% improvement in behavioral prediction, and more about the seemingly simple task of putting the right information in front of the right person at the right time.

The task isn’t simple (especially considering the mountains of data most companies are sitting on). But the good news is that it is achievable, and it doesn’t take a room full of PhDs or millions of dollars in specialized software.

What it does take is focus, and a commitment to being data-driven.

Luckily, it’s worth it. The payoff of facilitating this kind of exploration is enormous. It can be the difference between making the right decision and the wrong one — hundreds of times a month — all across your company.

So if you find yourself stuck on the treadmill, try stepping off. I think you’ll like where the path takes you.

Source: InfoWorld Big Data