In the rush to big data, we forgot about search

I read David Linthicum’s post “Data integration is the one thing the cloud makes worse” with great interest. A big reason I decided my next job would be at a search company was this very problem. (That’s why I now work for LucidWorks, which produces Solr- and Spark-based search tools.) While working with clients, I realized that with big data and the cloud, an already tough problem, finding things, was getting worse. I had seen the meltdown coming as Hadoop became yet another data silo and, as a result, produced few actual insights.

Part of the problem is that the technology industry is trend-driven rather than problem-solving. A few years ago, it was all about client/server under the guise of distributed computing à la Enterprise JavaBeans, followed by web services and then big data. Now it is all about machine learning. Many of these steps were important, and machine learning is an important tool for solving problems.

We lost indexing and search as big data emerged

But sadly, the most important problem-solving trend got lost in the shuffle: indexing and search.

The modern web began with search. The web would be a lot smaller if Yahoo and the search portals of the late 1990s had triumphed. The dot-com bomb happened and yet Google was born from its ashes. Search also birthed big data and arguably the modern machine learning trend. Google, Facebook, and other companies needed more ways to handle their indexing jobs and their large amounts of data distributed at internet scale. Meanwhile, they needed better ways to find and organize data after they ran up against the limits of crowdsourcing and human intelligence.

Amazon blew away the retail market in part because it dared to invest in search technology. The main reason I go to Amazon and not other vendors is that I’ll almost definitely find what I’m looking for. In fact, Amazon may suggest what I want before I get around to searching for it. (Though I have to say that Amazon’s recommendations are now falling behind the curve.) Yet many retailers still use the built-in search in their commerce suite and then wonder why customer conversion and engagement are off. (Hint: Customers can’t find anything to buy.)

Meanwhile, many companies continue to keep old-style enterprise search products. Some of these products aren’t even maintained, belonging to dead or acquired companies. Most people still operate with bookmarks. So if you move some of your data to SaaS solutions, move some of your data to PaaS solutions, move some of your data to IaaS solutions and across multiple vendors’ cloud platforms while maintaining some of your data behind the firewall—yeah, no one is going to find anything!

How to redefine “integration”

To address what Linthicum raised in his post, we need to redefine “integration” for the distributed and cloud computing era.

Data integration used to mean just that: grabbing all the data and dumping it into a big, fat, single area. First this was with databases, then data warehouses, and then Hadoop. Ironically, we moved further away from indexed technology when doing this.

Now, integration must mean that we can index and find the data where it lives, deduplicate it, and derive a result. To find a single source of truth, we need to capture timestamps and source IDs.
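As a rough illustration of why timestamps and source IDs matter, here is a minimal Python sketch of deduplication across sources. The field names (`doc_id`, `source_id`, `timestamp`) and the sample records are hypothetical, and it assumes every source can agree on a shared document key, which in practice is often the hard part.

```python
from datetime import datetime

def deduplicate(records):
    """Keep the most recent copy of each document, keyed by doc_id."""
    latest = {}
    for rec in records:
        current = latest.get(rec["doc_id"])
        # A newer timestamp wins; source_id records where that copy lives.
        if current is None or rec["timestamp"] > current["timestamp"]:
            latest[rec["doc_id"]] = rec
    return latest

# Hypothetical index entries gathered from three different systems.
records = [
    {"doc_id": "policy-42", "source_id": "sharepoint", "timestamp": datetime(2016, 3, 1)},
    {"doc_id": "policy-42", "source_id": "saas-crm", "timestamp": datetime(2017, 1, 15)},
    {"doc_id": "invoice-7", "source_id": "on-prem-db", "timestamp": datetime(2016, 11, 2)},
]

truth = deduplicate(records)
for doc_id, rec in truth.items():
    print(doc_id, "->", rec["source_id"], rec["timestamp"].date())
```

The data itself never moves in this sketch. Only the index entries are compared, which is the point of finding data where it lives.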

To integrate, we need a single search solution that can reach our on-premises data and our cloud data. The worst thing we can do is deploy a search tool that only searches one source of data, serves only one use case, or can’t be used behind our firewall.

In the cloud era, we need to look to search as the glue that lets us find the data and analyze it together, no matter where it lives. We can’t just dump everything into one place; we need tools to let us get to exactly the right data where it lives.

Source: InfoWorld Big Data

No, you shouldn’t keep all that data forever

The modern ethos is that all data is valuable, that it should be stored forever, and that machine learning will one day magically find its value. You’ve probably seen that EMC picture about how there will be 44 zettabytes of data by 2020? Remember how everyone had Fitbits and Jawbone Ups for about a minute? Now Jawbone is out of business. Have you considered this “all data is valuable” fad might be the corporate equivalent? Maybe we shouldn’t take a data storage company’s word on it that we should store all data and never delete anything.

Back in the early days of the web it was said that the main reasons people went there were for porn, jobs, or cat pictures. If we download all of those cat pictures and run a machine learning algorithm on them, we can possibly determine the most popular colors of cats, the most popular breeds of cats, and the fact that people really like their cats. But we don’t need to do this—because we already know these things. Type any of those three things into Google and you’ll find the answer. Also, with all due respect to cat owners, this isn’t terribly important data.

Your company has a lot of proverbial cat pictures. It doesn’t matter what your policy and procedures for inventory retention were in 1999. Any legal issues you had reason to store back then have passed the statute of limitations. There isn’t anything conceivable that you could glean from that old data that could not be gleaned from any of the more recent revisions.

Machine learning or AI isn’t going to tell you anything interesting about any of your 1999 policies and procedures for inventory retention. It might even be sort of a type of “dark data,” because your search tool probably boosts everything else above it, so unless someone queries for “inventory retention procedure for 1999,” it isn’t going to come up.

You’ve got logs going back to the beginning of time. Even the Jawbone UP didn’t capture my every breath and certainly didn’t store my individual steps for all time. Sure, each breath or step may have slightly different characteristics, but that isn’t important. Likewise, it probably doesn’t matter how many exceptions per hour your Java EE application server used to throw in 2006. You use Node.js now anyhow. If “how many errors per hour per year” is a useful metric, you can probably just summarize that. You don’t need to keep every log for all time. It isn’t reasonable to expect it to be useful.
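Summarizing before you delete can be very little work. The sketch below assumes a made-up log format (ISO timestamp, level, message, separated by spaces) and rolls raw lines up into error counts per hour. Real logs will need a real parser.

```python
from collections import Counter

def errors_per_hour(log_lines):
    """Roll raw log lines up into a count of ERROR lines per hour."""
    counts = Counter()
    for line in log_lines:
        # Assumed format: "<ISO timestamp> <LEVEL> <message>"
        timestamp, level, _message = line.split(" ", 2)
        if level == "ERROR":
            counts[timestamp[:13]] += 1  # "YYYY-MM-DDTHH" is the hour bucket
    return dict(counts)

# Hypothetical lines from an old application server log.
logs = [
    "2006-05-01T10:02:11 ERROR NullPointerException in OrderBean",
    "2006-05-01T10:40:03 INFO request served",
    "2006-05-01T10:59:59 ERROR timeout",
    "2006-05-01T11:00:01 ERROR timeout",
]

summary = errors_per_hour(logs)
print(summary)
```

Keep the summary, drop the raw lines, and the metric survives at a tiny fraction of the storage.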

Supposedly, we’re keeping this stuff around for the day when AI or machine learning finds something useful in it. But machine learning isn’t magical. Mostly, machine learning falls into classification, regression, and clustering. Clustering basically groups stuff that looks “similar,” but it isn’t very likely your 2006 app server logs have anything useful in them that can be found via clustering. The other two require you to think of something and “train” the machine learning. This means you need a theory of what could be useful, you need to find something useful, and then you need to train the computer to find it. Don’t you have better things to do?

Storage is cheap, but organization and insight are not. Just because you got a good deal on your SAN or have been running some kind of mirrored JBOD setup with a clustered file system doesn’t mean that storing noise is actually cheap. You need to consider the human costs of organizing, maintaining, and keeping all this stuff around. Moreover, while modern search technology is good at sorting relevant stuff from irrelevant, it does cost you something to do so. So while autumn is on the wane, go ahead and burn some proverbial corporate leaves.

It really is okay if you don’t keep it.

Source: InfoWorld Big Data

Which Spark machine learning API should you use?

You’re not a data scientist. Supposedly, according to the tech and business press, machine learning will stop global warming, except that’s apparently fake news created by China. Maybe machine learning can find fake news (a classification problem)? In fact, maybe it can.

But what can machine learning do for you? And how will you find out? There’s a good place to start close to home, if you’re already using Apache Spark for batch and stream processing. Along with Spark SQL and Spark Streaming, which you’re probably already using, Spark provides MLlib, which is, among other things, a library of machine learning and statistical algorithms in API form.

Here is a brief guide to four of the most essential MLlib APIs, what they do, and how you might use them.  

Basic statistics

Mainly you’ll use these APIs for A/B testing or A/B/C testing. Frequently in business we assume that if two averages are the same, then the two things are roughly equivalent. That isn’t necessarily true. Suppose a car manufacturer replaces the seat in a car and surveys customers on how comfortable it is. At one end, the shorter customers may say the seat is much more comfortable. At the other end, taller customers may say it is so uncomfortable that they wouldn’t buy the car, and the people in the middle balance out the difference. On average the new seat might be slightly more comfortable, but if no one over 6 feet tall buys the car anymore, we’ve failed somehow. Spark’s hypothesis testing allows you to do a Pearson chi-squared or a Kolmogorov–Smirnov test to see how well something “fits” or whether the distribution of values is “normal.” This can be used almost anywhere we have two series of data. That “fit” might be “did you like it” or whether the new algorithm provided “better” results than the old one. You’re just in time to enroll in a Basic Statistics Course on Coursera.
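To make the “same average, different story” point concrete, here is a small plain-Python sketch. It is not Spark’s API. It computes a two-sample Kolmogorov–Smirnov statistic (the largest gap between two empirical distributions) on invented seat-comfort scores. The function name and the survey numbers are made up for illustration.

```python
from bisect import bisect_right
from statistics import mean

def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical cumulative distributions."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_values, x):
        # Fraction of values less than or equal to x.
        return bisect_right(sorted_values, x) / len(sorted_values)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

# Hypothetical comfort scores (1-10) from a seat survey.
old_seat = [5, 5, 5, 5, 5, 5]   # everyone finds it so-so
new_seat = [9, 9, 9, 1, 1, 1]   # short riders love it, tall riders hate it

print(mean(old_seat), mean(new_seat))    # identical averages
print(ks_statistic(old_seat, new_seat))  # but very different distributions
```

A statistic near 0 means the two samples look alike; a large one means they don’t, whatever the averages say. Turning the statistic into a p-value is the part you would leave to a library.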


Classification

What are you? If you take a set of attributes, you can get the computer to sort “things” into their right category. The trick here is coming up with the attributes that match the “class,” and there is no right answer there. There are a lot of wrong answers. If you think of someone looking through a set of forms and sorting them into categories, this is classification. You’ve run into this with spam filters, which use a list of words spam usually has. You may also be able to diagnose patients or determine which customers are likely to cancel their broadcast cable subscription (people who don’t watch live sports). Essentially classification “learns” to label things based on labels applied to past data and can apply those labels in the future. In Coursera’s Machine Learning Specialization there is a course specifically on this that started on July 10, but I’m sure you can still get in.
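The spam filter is the easiest one to sketch. Below is a toy naive Bayes classifier in plain Python, not MLlib, trained on four invented messages. It assumes exactly two labels, “spam” and “ham,” and whitespace-separated words. It is only meant to show what “learning labels from past data” looks like.

```python
import math
from collections import Counter

def train(labeled_messages):
    """Count words per label and messages per label."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in labeled_messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, model):
    """Return the label with the higher naive Bayes log score."""
    word_counts, label_counts = model
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    total_messages = sum(label_counts.values())
    scores = {}
    for label in word_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_messages)
        for word in text.lower().split():
            # Add-one smoothing so unseen words don't zero out the score.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training data: past messages that someone already labeled.
training = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train(training)

print(classify("free prize", model))
print(classify("agenda for lunch", model))
```

The attributes here are just words. Picking better attributes is where the real work goes.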


Clustering

If k-means clustering is the only thing out of someone’s mouth after you ask them about machine learning, you know that they just read the crib sheet and don’t know anything about it. If you take a set of attributes, you may find “groups” of points that seem to be pulled together by gravity. Those are clusters. You can “see” these clusters, but there may be clusters that are close together. There may be one big one and one small one on the side. There may be smaller clusters in the big cluster. Because of these and other complexities, there are a lot of different “clustering” algorithms. Though different from classification, clustering is often used to sort people into groups. The big difference between “clustering” and “classification” is that we don’t know the labels (or groups) up front for clustering. We do for classification. Customer segmentation is a very common use. There are different flavors of that, such as sorting customers into credit or retention risk groups, or into buying groups (fresh produce or prepared foods), but it is also used for things like fraud detection. Here’s a course on Coursera with a lecture series specifically on clustering, and yes, they cover k-means for that next interview, but I find it slightly creepy when half the professor floats over the board (you’ll see what I mean).
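Since k-means will come up in that interview anyway, here is a bare-bones version in plain Python. It is a sketch under simplifying assumptions: two-dimensional points, a fixed number of iterations, and a naive starting guess (the first k points). Library implementations such as MLlib’s choose starting centroids far more carefully. The customer data is invented.

```python
def squared_distance(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k, iterations=10):
    """Alternate between assigning points and moving centroids."""
    centroids = list(points[:k])  # naive init: the first k points
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, clusters

# Hypothetical customers: (store visits per month, items per basket).
customers = [(1, 2), (9, 8), (2, 1), (1, 1), (8, 9), (9, 9)]
centroids, clusters = kmeans(customers, k=2)
print(clusters)
```

Notice that nobody told the code what the two groups mean. Naming them “occasional shoppers” and “regulars” is still your job.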

Collaborative filtering

Man, collaborative filtering is a popularity contest. The company I work for uses this to improve search results. I even gave a talk on this. If enough people click on the second cat picture it must be better than the first cat picture. In a social or e-commerce setting, if you use the likes and dislikes of various users, you can figure out which is the “best” result for most users or even specific sets of people. This can be done on multiple properties for recommender systems. You see this on Google Maps or Yelp when you search for restaurants (you can then filter by service, food, decor, good for kids, romantic, nice view, cost). There is a lecture on collaborative filtering from the Stanford Machine Learning course, which started on July 10 (but you can still get in).
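Here is the popularity contest in miniature. This plain-Python sketch uses a simple user-to-user approach with cosine similarity, which is a swapped-in technique chosen for brevity and not necessarily what any product or MLlib does. The users, items, and like/dislike values (+1 and -1) are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(u) & set(v)
    dot = sum(u[item] * v[item] for item in shared)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(target, ratings):
    """Rank items the target hasn't rated, weighted by similar users' opinions."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == target:
            continue
        similarity = cosine(ratings[target], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical likes (+1) and dislikes (-1).
ratings = {
    "ann": {"cat1": 1, "cat2": 1},
    "bob": {"cat1": 1, "cat2": 1, "cat3": 1},
    "cy": {"cat1": -1, "dog1": 1},
}

print(recommend("ann", ratings))
```

Ann agrees with Bob, so Bob’s other favorite rises to the top. She disagrees with Cy, so Cy’s favorite sinks.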

This is not all you can do (by far) but these are some of the common uses along with the algorithms to accomplish them. Within each of these broad categories are often several alternative algorithms or derivatives of algorithms. Which to pick? Well, that’s a combination of mathematical background, experimentation, and knowing the data. Remember, just because you get the algorithm to run doesn’t mean the result isn’t nonsense.

If you’re new to all of this, then the Machine Learning Foundations course on Coursera is a good place to start — despite the creepy floating half-professor.

Source: InfoWorld Big Data

Data science could keep United out of more trouble

I’ve avoided flying United for many years. On my last trip to Japan about 10 years back, somewhere along the way an employee took my ticket and said I’d get another one in Japan. Wrong! On my return, United told me I had to buy a new ticket for around $7,000.

Anyhow, we’ve all heard about United’s overbooking disaster, where a passenger faced a lot worse abuse than I did. With the right data and analytics, another outcome could have been possible.

When the tickets were sold, United’s ticketing system could have seen there was a high probability that the other flight would arrive late and that crew members frequently bumped passengers. The ticketing system could have reserved a number of seats as standby or told the last four passengers booking them that they might be bumped. Then, when the other flight was coming in with the crew that needed to get back home, United simply could have avoided boarding the last four.

In fact, flight data is a cornucopia of statistical information. You could learn a lot about the following:

  1. Weather patterns by season and even in unseasonable years. Sure, we have radar, but how do these patterns affect objects in the air?
  2. Flight delays (travel sites already report this).
  3. Domino effects, such as how a delayed flight or weather pattern impacts other flights.
  4. Maintenance issues, such as how frequently by plane type (or airline) parts have to be replaced or fail.

Also, you can glean a lot of customer and customer preference information. The company I work for calls these “signals,” which I like better than “events,” because they aren’t always events and “time series” is too generic. You could learn the following:

  1. Which customers will likely cancel if assigned a middle seat (my bladder is small in the air and I have broad shoulders). This goes beyond my profile preference for an aisle or a window to identify how much I prefer an aisle.
  2. Which customers are most price sensitive and influenced by cost.
  3. How frequently a customer flies your airline after being bumped or experiencing other customer service problems.

Using statistics, machine learning, and a simple rules engine — and connecting some of these data sets — airlines could:

  1. Automatically offer discounts and other incentives to passengers with flexible schedules to fill empty seats.
  2. Offer status upgrades to passengers who are likely to be incentivized to fly your airline over others (American is doing this, but I don’t know how targeted it is).
  3. Detect probable weather problems, automatically hold seats, and start rebooking before the connection even lands. (Delta does this once the delay happens, but it does so poorly with suboptimal routes.)
  4. Avoid overbooking and simply offer preselected seats. Also, instead of “dumb bidding” in the open air, send a text message to passengers who are likely to take a lower offer. This prevents people from sitting around and waiting for higher compensation.
  5. When you have to select someone, choose the person least likely to care. You have the data.
  6. Detect problematic decision-making or identify employees who frequently do stupid things (like drag people off airplanes).
  7. Assuming there’s a connection between complaints and bad PR, detect when a policy or practice is likely to cause your stock to drop should it go viral on video.

I realize that not every problem can be solved by search (full disclosure: I work for a search company) and math, but a lot of the dumbest stuff and everyday annoyances could. All it takes is motivation. Unfortunately, so far, U.S.-based airlines seem to lack a strong economic reason to care about customer service.

Source: InfoWorld Big Data

Uber should use data science to fix its culture

Ever since a former employee spoke out about her miserable experience with her boss and HR, the media has piled on Uber. We’ve heard in the past that Uber uses data to analyze its ridership down to what seems like a creepy level. We’ve also heard that it has a toxic and misogynistic culture.

Ironically, some of the same data analysis Uber does on its riders could help it fix its culture.

On Monday, I spoke to Dr. Carissa Romero from Paradigm, a strategy firm that helps companies analyze themselves to improve inclusion and diversity based on the idea that diverse companies outperform others. Romero has a doctorate in psychology and is an expert in fixed and growth mindsets—people’s beliefs about the nature of talents and abilities—and founded Stanford’s applied research center on the subject.

I asked Dr. Romero about the techniques and tools companies can use to find problems and what kinds of interventions are effective. She began by making a distinction between the two fundamental types of bias.

Implicit versus explicit bias

The cases of Susan Fowler and “Amy Vertino” at Uber were cases of explicit bias. Some of it even made it into written form. Finding explicit bias or harassment can be done by a simple text search.

Most workplace problems in this area, however, involve implicit bias. It can be just as damaging, and the person making the mistake may not even know they’re doing it. For example, if I’m hiring a software developer and I have in my mind what that developer “is like,” I may inadvertently make judgments linked to race, gender, or culture that aren’t related to details actually important to the job.

This is also not something you find with a simple text search because people aren’t going to say “sex” or use a racial epithet. Also, many people who make these bias mistakes are not bad people and don’t have bad intent, but they have to make decisions differently and become better informed by data.

Uber’s explicit problems are a part of a self-admitted failure of leadership. You don’t need fancy data analysis to see that. Yet even if the company addresses that issue, it’ll still have a lot of work to do on internal culture and practices if it wants to have a more diverse workplace.

Where is the data?

Much of the data a company needs to determine whether it’s treating all of its employees fairly resides in the systems it’s already using. This starts even before hiring. According to Dr. Romero, “On the recruiting and hiring side, we pull data from a company’s applicant tracking system.

“For example, a common applicant tracking system is Greenhouse. We pull data from Greenhouse to learn about things like the diversity of different applicant sources and pass-through rates at each stage of the hiring process.”

Companies also need to look at employees throughout their “lifecycle” at the company. Some of this information lives in their human resources information system or performance review system.

This isn’t necessarily enough. Paradigm also relies on engagement surveys and focus groups to better understand differences in how engaged employees feel and whether they think their voices are being heard. This qualitative data helps make the quantitative data more understandable.

How do you determine bias?

Bias can exist at different stages of employment, from how applicants are attracted to apply for a job to hiring, evaluation, promotion, and retention, as well as terminations. Different metrics apply to each of these stages.

According to Dr. Romero, in the recruiting phase, it pays to take a hard look at candidate sources. Often, employee referrals result in less diversity. When it comes to hiring, companies should look at the different pass-through rates: If black candidates pass through phone screening at a lower rate than white candidates, that’s an example of quantitative data the company can use to detect bias.
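The pass-through comparison is simple arithmetic once the data is out of the applicant tracking system. Here is a minimal Python sketch. The record layout, stage name, and group labels are hypothetical, and it is not based on Greenhouse’s actual export format or on Paradigm’s tooling.

```python
from collections import defaultdict

def pass_through_rates(candidates, stage):
    """Share of candidates in each group who passed the given stage."""
    reached = defaultdict(int)
    passed = defaultdict(int)
    for candidate in candidates:
        if stage in candidate["stages"]:  # only count those who reached the stage
            reached[candidate["group"]] += 1
            if candidate["stages"][stage]:
                passed[candidate["group"]] += 1
    return {group: passed[group] / reached[group] for group in reached}

# Hypothetical applicant records; True means the candidate passed the stage.
candidates = (
    [{"group": "group_a", "stages": {"phone_screen": True}}] * 3
    + [{"group": "group_a", "stages": {"phone_screen": False}}]
    + [{"group": "group_b", "stages": {"phone_screen": True}}]
    + [{"group": "group_b", "stages": {"phone_screen": False}}] * 3
)

rates = pass_through_rates(candidates, "phone_screen")
print(rates)
```

With counts this small the gap proves nothing by itself. On real data you would check it with a significance test before drawing conclusions.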

Once an employee is hired, performance review scores and promotion rates become key sources. Next, when examining a company’s employee retention rates, look at terminations and longevity. If the data is stratified by demographic group (race, gender, and so on) and there are large disparities, that may be an indication of bias.

Other, more subtle data can also be analyzed. When looking at performance reviews, are “soft skills” mentioned more often for women or people of color compared to men? According to Dr. Romero, “Our data scientist uses a machine learning algorithm to look at whether different language is used to describe candidates from different demographic groups, but we also very often do it manually where we pull a random sample of written feedback to manually code. Then we use statistical tools to analyze the differences.” In other words, they plug the data into R and use their algorithms to crunch data on employees in the same kinds of ways companies are using it to understand their customers.

Dr. Romero also focuses on qualitative data: the “why.” This emerges from interviewing people. Some questions she pointed out: Are recruiters reaching out only on LinkedIn? What are managers looking for in a candidate?

Unfortunately, it can be hard to identify specific biased individuals within a company using statistical analysis. If a manager has only a few reports and hasn’t had to interview many people, the sample size will be too small. Instead, Romero advises companies to focus on establishing practices that prevent bias.

How do you fix it?

According to Dr. Romero, “When you’re evaluating your employees, if you have a standard set of questions that you use to evaluate people in that role, then you’re going to make it less likely that bias influences decisions.” In contrast, “Not having a process would make it more likely that you would have more of these individual cases that people are relying on stereotypes compared to when you have processes in place.”

Process is great, but I’ve worked in organizations that only went through the motions. These organizations subscribe to the mythical-man-month, cargo-cult school of process. According to Romero, such sloppiness can be avoided by creating up-front descriptions of what you’re seeking in each position and clearly establishing metrics for performance. When it comes to performance reviews, force the manager to give an example of why the rating is deserved. According to Dr. Romero:

When you know what you’re evaluating up front and use examples to support your evaluation, biases are less likely to come into play. Evaluators should decide ahead of time what to look for, and organize feedback by relevant attributes. When you’re not clear about what you’re looking for, you’re more likely to rely on an overall feeling. That feeling can be influenced by bias. For example, you may be influenced by how much you like that person personally (vs. how good of a fit for the role they are). You might just like them because they are similar to you in some irrelevant way (maybe you have the same hobbies). Or you might be influenced by a stereotype – for example, what does a typical person look like in this role?

Some issues are more subtle and involve company culture. “Women often feel it’s hard to get heard in a meeting because they’re often interrupted. You might have a moderator for every team meeting or put a sign in the room. Make sure individuals are aware, agendas are distributed ahead of time, ask people for their thoughts,” said Dr. Romero.

According to Dr. Romero, “diversity training” to raise awareness isn’t effective, and neither is copying other companies’ strategies. “Coming up with strategies before you’ve taken a look at your company’s data, and analyzed your process and your culture, is a bad approach. I also think ignoring behavioral science research is a bad approach. So basically, a non-data-driven approach is bad (ignoring your own data and ignoring what behavioral science research tells us).”

Why the need is real

I asked Dr. Romero if everyone needs this stuff, even small companies and startups. “In general, yes,” she replied. “Companies use data when making business decisions, it makes sense to use data when making people-related decisions. A data science approach to understanding people in your organization is helpful.”

This is the crux of the matter: Well-managed companies use data to make decisions. Well-managed companies have processes for making repeated decisions. It only makes sense to have good processes and data for making decisions about people. Good processes and data also happen to help create far more diverse environments.

Obviously, you want to do this because it’s the right thing to do. But as Dr. Romero says, “If you want to get your best work out of employees, you want to create an environment where people from any background can be successful.” Ask McKinsey: Diverse organizations perform better.

Source: InfoWorld Big Data

6 reasons stores can't give you real-time offers (yet)

Like most hardcore people, in the car I roll with my windows down and my radio cranked up to 11—tuned to 91.5, my local NPR station, where Terry Gross recently interviewed Joseph Turow, author of “The Aisles Have Eyes.” Turow reports that retailers are using data gathered from apps on your phone and other information to change prices on the fly.

Having worked in this field for a while, I can tell you that, yes, they’re gathering any data they can get. But the kind of direct manipulation Turow claims, where the price changes on the shelf before your eyes, isn’t yet happening on a wide scale. (Full disclosure: I’m employed by LucidWorks, which offers personalized/targeted search and machine-learning-assisted search as features in products we sell.)

Why not? I can think of a number of reasons.

1. Technology changes behavior slowly

Printers used to be a big deal. There were font and typesetting wars (TrueType, PostScript, and so on), and people printed out pages simply to read comfortably. After all, screen resolutions were low and interfaces were clunky; scanners were cumbersome and email was unreliable. Yet even after these obstacles were overcome, the old ways stuck around. There are still paper books (I mailed all of mine to people in prison), and the government still makes me print things and even get them notarized sometimes.

Obviously, change happens: I now tend to use Uber even if a cab is waiting, and I don’t bother to check the price difference, regardless of surge status. Also, today I buy all my jeans from Amazon—yet still use plastic cards for payment. The clickstream data collected on me is mainly used for email marketing and ad targeting, as opposed to real-time sales targeting.

2. Only some people can be influenced

For years I put zero thought into my hand soap purchase because my partner bought it. Then I split with my partner and became a soap buyer again. I did some research and found a soap that didn’t smell bad, didn’t have too many harsh chemicals, and paid lip service to the environment. Now, to get me to even try something else you’d probably have to give it to me for free. I’m probably not somebody a soap company wants to bother with. I’m not easily influenced.

I’m more easily influenced in other areas—such as cycling and fitness stuff—but those tend to be more expensive, occasional purchases. To reach me the technique needs to be different than pure retailing.

3. High cost for marginal benefit

Much personalization technology, such as the analytics behind real-time discounts, is still expensive to deploy. Basic techniques such as using my interests or previously clicked links to improve the likelihood of my making a purchase are probably “effective enough” for most online retailers.

As for brick and mortar, I have too many apps on my phone already, so getting me to download yours will require a heavy incentive. I also tend to buy only one item because I forgot to buy it online—then I leave—so the cost to overcome my behavioral inertia and influence me will be high.

4. Pay to play

Business interests limit the effectiveness of analytics in influencing consumers, mainly in the form of slotting fees charged to suppliers who want preferential product placement in the aisles.

Meanwhile, Target makes money no matter what soap I buy there. Unless incentivized, it’s not going to care which brand I choose. Effective targeting may require external data (like my past credit card purchases at other retailers) and getting that data may be expensive. The marketplace for data beyond credit card purchases is still relatively immature and fragmented.

5. Personalization is difficult at scale

For effective personalization, you must collect or buy data on everything I do everywhere and store it. You need to run algorithms against that data to model my behavior. You need to identify different means of influencing me. Some of this is best done for a large group (as in the case of product placement), but doing it for individuals requires lots of experimentation and tuning—and it needs to be done fast.

Plus, it needs to be done right. If you bug me too much, I’m totally disabling or uninstalling your app (or other means of contacting me). You need to make our relationship bidirectional. See yourself as my concierge, someone who finds me what I need and anticipates those needs rather than someone trying to sell me something. That gets you better data and stops you from getting on my nerves. (For the last time, Amazon, I’ve already purchased an Instant Pot, and it will be years before I buy another pressure cooker. Stop following me around the internet with that trash!)

6. Machine learning needs to mature

Machine learning is merely math; much of it isn’t even new. But applying it to large amounts of behavioral data—where you have to decide which algorithm to use, which optimizations to apply to that algorithm, and which behavioral data you need in order to apply it—is pretty new. Most retailers are used to buying out-of-the-box solutions. Beyond (ahem) search, some of these barely exist yet, so you’re stuck rolling your own. Hiring the right expertise is expensive and fraught with error.

Retail reality

To influence a specific, individual consumer who walks into a physical store, the cost is high and the effectiveness is low. That’s why most brick-and-mortar businesses tend to use advanced data—such as how much time people spend in which part of the store and what products influenced that decision—at a more statistical level to make systemic changes and affect ad and product placement.

Online retailers have a greater opportunity to influence people at a personal level, but most of that opportunity is in ad placement, feature improvements, and (ahem) search optimization. As for physical stores, eventually, you may well see a price drop before your eyes as some massive cloud determines the tipping point for you to buy on impulse. But don’t expect it to happen anytime soon.

Source: InfoWorld Big Data

12 New Year's resolutions for your data

Your company was once at the forefront of the computing revolution. You deployed the latest mainframes, then minis, then microcomputers. You joined the PC revolution and bought Sparcs during the dot-com era. You bought DB2 to replace some of what you were doing with IMS. Maybe you bought Oracle or SQL Server later. You deployed MPP and started looking at cubes.

Then you jumped on the next big wave and put a lot of your data on the intranet and internet. You deployed VMware to prevent server sprawl, only to discover VM sprawl. When Microsoft came a-knocking, you deployed SharePoint. You even moved from Siebel to Salesforce to hop into SaaS.

Now you have data coming out of your ears and spilling all over the place. Your mainframe is a delicate flower on which nothing can be installed without a six-month study. The rest of your data is all on the SAN. That works out because you have a “great relationship with the EMC/Dell federation” (where you basically pay them whatever they want and they give you the “EMC treatment”). However, the SAN does you no good for finding actual information due to the effects of VM and application sprawl on your data organization.

Now the millennials want to deploy MongoDB because it’s “webscale.” The Hadoop vendor is knocking and wants to build a data lake, which is supposed to magically produce insights by using cheaper storage … and produce yet another storage technology to worry about.

Time to stop the madness! This is the year you wrangle your data and make it work for your organization instead of your organization working for its data. How do you get your data straight? Start with these 12 New Year’s resolutions:

1. Catalog where the data is

You need to know what you have. Whether or not this takes the form of a complicated data mapping and management system isn’t as important as the actual concerted effort to find it.

2. Map data use

Your data is in use by existing applications, and there’s an overall flow throughout the organization. Whether you track this “data lineage” and “data dependency” via software or sweat, you need to know why you’re keeping this stuff, as well as who’s using it and why. What is the data? What is the source system for each piece of data? What is it used for?

3. Understand how data is created

Remember the solid fuel booster at NASA that had a 1-in-300-year failure rate? Remember that the number was pretty much pulled out of the air? Most of the data was on paper and passed around. How is your data created? How are the numbers derived? This is probably an ongoing effort, as there are new sources of data every day, but it’s worthwhile to prevent your organization’s own avoidable and repeated disasters.

4. Understand how data flows through the organization

Knowing how data is used is critical, but you also need to understand how it got there and any transformation it underwent. You need a map of your organization’s data circulatory system, the big form of the good old data flow diagram. This will not only let you find “black holes” (where inputs are used but no results happen) and “miracles” (where a series of insufficient inputs can’t possibly produce the expected result), but also where redundant flows and transformations exist. Many organizations have lots of copies of the same stuff produced by very similar processes that differ by technology stack alone. It’s just data—we don’t have to pledge allegiance to the latest platform in our ETL process.
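As a rough illustration of that map, here is a minimal Python sketch. Every system name, the (source, target, technology) record shape, and the analyze function are hypothetical and invented for this example. It also narrows "miracle" to the classic data flow diagram case of a process that emits output with no input at all, which is simpler than judging whether inputs are merely insufficient.

```python
from collections import defaultdict

# Hypothetical flow records: (source, target, technology). All names are made up.
FLOWS = [
    ("crm_db", "nightly_etl", "legacy_etl_tool"),
    ("nightly_etl", "warehouse", "legacy_etl_tool"),
    ("crm_db", "spark_job", "spark"),
    ("spark_job", "warehouse", "spark"),
    ("orders_db", "audit_export", "cron"),
    ("forecast_model", "exec_dashboard", "python"),
]
# Systems that are legitimately pure sources or pure sinks (an assumption
# someone has to supply; it cannot be inferred from the flows alone).
SOURCES = {"crm_db", "orders_db"}
SINKS = {"warehouse", "exec_dashboard"}


def analyze(flows, sources, sinks):
    """Flag black holes, miracles, and redundant processes in a flow map."""
    inbound = defaultdict(set)
    outbound = defaultdict(set)
    for src, dst, _tech in flows:
        outbound[src].add(dst)
        inbound[dst].add(src)

    nodes = set(inbound) | set(outbound)
    processes = nodes - sources - sinks

    # Black hole: a process that consumes data but produces nothing.
    black_holes = sorted(p for p in processes if inbound[p] and not outbound[p])
    # Miracle (simplified): a process that produces data from no input.
    miracles = sorted(p for p in processes if outbound[p] and not inbound[p])

    # Redundant: processes with identical inputs and identical outputs,
    # for example the same copy job rebuilt on a newer technology stack.
    by_shape = defaultdict(list)
    for p in processes:
        if inbound[p] and outbound[p]:
            by_shape[(frozenset(inbound[p]), frozenset(outbound[p]))].append(p)
    redundant = sorted(sorted(group) for group in by_shape.values() if len(group) > 1)

    return {"black_holes": black_holes, "miracles": miracles, "redundant": redundant}


if __name__ == "__main__":
    for kind, found in analyze(FLOWS, SOURCES, SINKS).items():
        print(kind, found)
```

In this toy data, audit_export swallows its input, forecast_model produces a dashboard feed from nothing, and the two CRM-to-warehouse jobs differ by technology alone. A real inventory would come from your catalog or lineage tooling rather than a hand-written list.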

5. Automate manual data processing

At various times I’ve tried to sneak a post past my editor entitled something like “Ban Microsoft Excel!” (I think I may have worked that into a post or two.) I’m being partly tongue in cheek, but people who routinely monkey with the numbers manually should be replaced by absolutely no one.

I recently watched the movie “Hidden Figures,” and among other details, it depicted the quick pace at which people were replaced by machines (the smarter folk learned how to operate the machines). In truth, we stagnated somewhere along the way, and a large number of people push bits around in email and Excel. You don’t have to get rid of those people, but the latency of fingers on the keyboard is awful. If you map your data, from where it originates to where it flows, you should be able to identify these manual data-munging processes.

6. Find a business process you can automate with machine learning

Machine learning is not magic. You are not going to buy software, turn it loose on your network, and get insights out of the box. However, right now someone in your organization is finding patterns by matching sets of data together and doing an “analysis” that can be done by the next wave of computing. Understand the basics (patterns and grouping, aka clustering, are the easiest examples), and try to find at least one place where it can be introduced to advantage. It isn’t the data revolution, but it’s a good way to start looking forward again.
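To make "clustering" concrete, here is a minimal sketch of k-means, one common grouping algorithm, in plain Python. The customer numbers are invented, the kmeans function is written for this example, and seeding the centroids with the first k points is a simplification to keep the result repeatable (real libraries usually pick starting points more carefully).

```python
# Hypothetical customers as (orders per month, average order value).
POINTS = [(1, 20), (2, 22), (1, 25), (10, 200), (11, 210), (12, 190)]


def kmeans(points, k, iterations=20):
    """Group 2-D points into k clusters; returns (centroids, labels)."""
    # Simplification: seed centroids with the first k points.
    centroids = [points[i] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Step 1: assign each point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Step 2: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                mean_x = sum(x for x, _ in members) / len(members)
                mean_y = sum(y for _, y in members) / len(members)
                new_centroids.append((mean_x, mean_y))
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # Nothing moved, so the grouping has settled.
        centroids = new_centroids
    return centroids, labels


if __name__ == "__main__":
    centroids, labels = kmeans(POINTS, 2)
    print(labels)
    print(centroids)
```

On this toy data the algorithm separates the occasional small spenders from the frequent big spenders without being told which is which. That is the kind of pattern matching someone is probably doing by hand in a spreadsheet today.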

7. Make everything searchable using natural language and voice

My post-millennial son and my Gen-X girlfriend share one major trait: They click the microphone button more often than I do. I use voice on my phone in the car, but almost never otherwise. I learned to type at a young age, and I compose pretty accurate search queries because I practically grew up with computers.

But the future is not communicating with computers on their terms. Training everyone to do that has produced mixed results, so we are probably at the apex of computer literacy and are on our way down. Making your data accessible by natural language search isn’t simply nice to have—it’s essential for the future. It’s also time to start looking into voice if you aren’t there yet. (Disclaimer: I work for Lucidworks, a search technology company with products in this area.)

8. Make everything web-accessible

Big, fat desktop software is generally hated. The maintenance is painful, and sooner or later you need to do something somewhere else on some other machine. Get out of the desktop business! If it isn’t web-based, you don’t want it. Ironically, this is sort of a PC counterrevolution. We went from mainframes and dumb terminals to installing everything everywhere to web browsers and web servers—but the latest trip is worth taking.

9. Make everything accessible via mobile

By any stretch of the numbers, desktop computing is dying. I mean, we still have laptops, but the time we spend on them versus other computing devices is in decline. You can look at sales or searches or whatever numbers you like, but they all point in this direction. Originally you developed an “everything mobile” initiative because the executive got an iPad and wanted to use it on an airplane, and everything looked like crap in the iPad edition of Safari. Then it was the salespeople. Now it’s everyone. If it can’t happen on mobile, then it probably isn’t happening as often as or when/where it should.

10. Make it highly available and distributable

I’m not a big fan of the Oracle theory of computing (stuff everything into your RDBMS and it will be fine, now cut the check, you sheep). Sooner or later outages are going to eat the organization’s confidence. New York City got hit by a hurricane, remember?

It’s time to make your data architecture resilient. That isn’t an old client-server model where you buy Golden Gate or the latest Oracle replication product from a company it recently acquired, then hope for the best. That millennial may be right—you may need a fancy, newfangled database designed for the cloud and distributed computing era. Your reason may not even be to scale but that you want to stay up, handle change better, and have a more affordable offsite replica. The technology has matured. It’s time to take a look.

11. Consolidate

Ultimately the tree of systems and data at many organizations is too complicated and unwieldy to be efficient, accurate, and verifiable. It’s probably time to start chopping at the mistakes of yesteryear. This is often a hard business case to make, but the numbers are there, whether they show how often it goes down, how many people it takes to maintain it, or that you can’t recruit talent to maintain it. Sometimes if it isn’t broke, you still knock it down because it’s eating you alive.

12. Make it visual

People like charts—lots of charts and pretty lines.

This can be the year you drive your organization forward and prove that IT is more than a cost center. It can be the year you build a new legacy. What else are you hoping to get done with data this year? Hit me up on Twitter.

Source: InfoWorld Big Data

10 things you need to worry about in 2017

Each year, including last year, I’ve supplied you with “areas of concern”—that is, stuff that might not go well for you or our comrades in the coming 12 months. I’m happy to oblige once again this year with 10 items that may go bump in the night.

Hadoop distributions

Big data, analytics, and machine learning are alive and well, and they’ll eventually transform business in most of the ways they’ve promised. But the big, fat Hadoop distribution is probably toast.

This isn’t to say everyone involved is in trouble, but we’re looking at more of an à la carte situation, or at least a buffet, where you don’t have to swallow the whole elephant. Burned by projects that never completed or met their promise in previous years, companies will be more reluctant to bite off the whole dish and instead look at what they’re trying to do and actually need at the infrastructure level. Technology companies that can adapt to this reality will make even more money.

Hadoop vendors

Three major Hadoop vendors along with big “do everything” companies (especially the Big Blue one) are in this game. We already saw Pivotal essentially exit. It’s hard to see the market continue to support three Hadoop vendors. See the above item to figure out who I’m betting on.


Oracle

Oracle likes to buy companies. It helps make up for the fact that the core Oracle database is old and clunky, and Oracle doesn’t make anything new or great. If it buys something you use, expect the price to go up. Oracle loves the long tail, particularly entrenched, hard-to-remove, older technology. Once it’s in the company’s clutches, you get that famed Oracle technical support, too.


Databricks

Something will change at Databricks, the cloud company built around Spark, the open source distributed computing framework that has essentially supplanted Hadoop. While Spark is great, the Databricks business model isn’t as compelling, and it seems easily disrupted by one of the big three cloud vendors. The company is run by academics, and it needs hard-knuckled business types to sort out its affairs. I hope the change won’t be too disruptive to Spark’s development and can be accomplished without hurt feelings, so we don’t lose progress.


Deregulation

Now that we have the Trumpocalypse to look forward to, you can expect “deregulation” of everything, from unlimited poison in your groundwater to the death of Net neutrality. Lest you think that will boost the tech economy, note that software vendors make big money selling compliance solutions, fewer of which will be necessary. Also, the Affordable Care Act (Obamacare) and electronic medical/health records have been a boon for tech. Some of Obamacare may remain, but very likely the digital transformation of health will be scaled way back.

Clinton’s plans had their own problems, but regardless of where you stand politically, the Trump presidency will hit us where it hurts—especially after California secedes. (Or will there be six Californias?)

Game consoles

How is this related to enterprise software? Well, the game industry is a good chunk of the tech sector, and some giants depend on console games as blockbusters. Game consoles are specialized computers with very specific programming models and guaranteed upgrades. Everyone is doing “pro” versions to get shorter-term revenue grabs (instead of waiting, say, seven years to sell new consoles), which comes at the cost of a stable platform that game developers can depend on.

Meanwhile, mobile games are huge, Steam keeps rising, and people are playing computer games again. I suspect this will start to depress the console business. Game developers will struggle with how many platforms they need to keep up with, and some giants will stumble.

Yet another hacking scandal

Once again, tech, government, and business will fail to learn the lesson that security can’t be bought and deployed like a product. They will persist in hiring the cheapest developers they can find, flail at project management, and suffer nonexistent or hapless QA. If a program runs, then it has stmt.execute("select something from whatever where bla = " + sql_injection_opportunity) throughout the code. That’s in business; government is at least 20 years behind. Sure, we’re giving Putin a big hug, but don’t expect him to stop hacking us.
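For anyone who hasn’t seen why that concatenated statement is a problem, here is a small sketch using Python’s built-in sqlite3 module rather than the JDBC-style call above. The table, the column names (borrowed from the joke query), and the find_unsafe and find_safe helpers are all made up for illustration. The same idea applies in JDBC, where a PreparedStatement with bound parameters plays the role of the "?" placeholder.

```python
import sqlite3

# In-memory database with a hypothetical table named after the joke query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE whatever (something TEXT, bla TEXT)")
conn.executemany(
    "INSERT INTO whatever VALUES (?, ?)",
    [("public", "a"), ("secret", "b")],
)


def find_unsafe(conn, value):
    """The anti-pattern: user input is glued straight into the SQL text."""
    query = "SELECT something FROM whatever WHERE bla = '" + value + "'"
    return [row[0] for row in conn.execute(query)]


def find_safe(conn, value):
    """Parameterized query: the driver treats the input as data, never as SQL."""
    query = "SELECT something FROM whatever WHERE bla = ?"
    return [row[0] for row in conn.execute(query, (value,))]


# Input crafted to turn the WHERE clause into something that is always true.
malicious = "a' OR '1'='1"

if __name__ == "__main__":
    print(sorted(find_unsafe(conn, malicious)))  # leaks every row
    print(find_safe(conn, malicious))            # matches nothing
    print(find_safe(conn, "a"))                  # normal lookup still works
```

The unsafe version hands back every row, including the one the caller was never supposed to see, while the parameterized version simply finds no match for the odd-looking string.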

The economy

It seems like the Great Recession was just yesterday, but we’re due for another. At the same time, we don’t have a lot of big, new enterprise tech to brag about. I’m not saying it’s time to climb in the lifeboat, but you might want to make sure you have a safety net in case we’re hit with another downturn. My guess is it will be smaller than the dot-bomb collapse, so don’t fret too much.

Telco-cable mergers

With Google dialing back Google Fiber and an impending AT&T-Time Warner merger, our overpriced connections to the internet are unlikely to get cheaper—and speed increases will probably be less frequent.

Your math skills

Thanks to machine learning, it will be harder to command a six-figure developer salary without a mathematical background. As companies figure out what machine learning is and what it can do, before paying a premium for talent, they’ll start to require that developers understand probability, linear algebra, multivariable calculus, and all that junk. For garden-variety programming, they’ll continue to accelerate their plan to buy talent in “low-cost countries.”

Now let’s crank it to 11: As you may have heard, we’ve elected a narcissistic agent of the white supremacist (now rebranded “alt-right”) movement who doesn’t even know how to use a computer, and we’ve put him in charge of the nukes. This is going to be a disaster for everyone, of course, but for tech in particular if we all survive. But hey, next week I’ll try looking on the bright side.

Source: InfoWorld Big Data

Could Google or Facebook decide an election?

At this writing, it’s Wednesday morning after the U.S. election. None of my friends is sober, probably including my editor.

I had a different article scheduled originally, which made the assumption that I’d been wrong all along, because that’s what everyone said. The first article in which I mentioned President Trump posted on Sept. 10, 2015, and covered data analytics in the marijuana industry. Shockingly, both Trump and marijuana won big.

I thought I was being funny. Part of the reason I was sure “President Trump” was a joke was that Facebook kept nagging me to go vote. First, it wanted me to vote early; eventually it wanted me to vote on Election Day. It wasn’t only Facebook—my Android phone kept nagging me to vote. (You’d think it would have noticed that I’d already voted or at least hung out at one of the polling places it offered to find for me, but whatever.)

This made me think. With the ubiquity of Google and Facebook, could they eventually decide elections? Politics are regional. In my state, North Carolina, if you turn out votes in the center of the state it goes Democratic. If you turn out votes in the east and west, it goes Republican. Political operatives have geographically targeted voters in this manner for years, but they have to pay to get in your face. Google and Facebook are already there.

What if instead of telling everyone to vote, they were to target voters by region? Let’s say Google and Facebook support a fictitious party we’ll call Fuchsia. In districts that swing heavily Fuchsia, they push notifications saying “go vote.” In districts that go for the other guys, they simply don’t send vote notifications and ads and instead provide scant information on polling station locations. That alone could swing some areas.

Targeted notifications could have an even more dramatic effect in districts that could go either way. Google and Facebook collect tons of psychometric data; Facebook even got caught doing it. Facebook and Google don’t only know what you “like” but what you hate and what you fear. Existing political operations know this too, but Google and Facebook have it at a much more granular level.

To go a step further, what if Facebook manipulated your feed to increase your fear level if fear is the main reason you vote? What if your personalized Google News focused on your candidates’ positives or negatives depending on whether they want you to stay home or go to the polls? In fact, if you incorporate search technology against current events and the news, you could even have articles on other topics that passively mention either your candidate or the candidate you fear.

The point I’m trying to make is that the same technology used to manipulate you into buying stuff can be used to manipulate how or if you vote. We’re still a little way from this, but not far. Even a small amount of targeting could turn a close vote in a key state.

Source: InfoWorld Big Data

Big data face-off: Spark vs. Impala vs. Hive vs. Presto

Today AtScale released its Q4 benchmark results for the major big data SQL engines: Spark, Impala, Hive/Tez, and Presto.

The findings prove a lot of what we already know: Impala is better for needles in moderate-size haystacks, even when there are a lot of users. Presto also does well here. Hive and Spark do better on long-running analytics queries.

I spoke to Joshua Klar, AtScale’s vice president of product management, and he noted that many of the company’s customers use two engines. Generally they view Hive as more stable and tend to run their long-running queries on it. All of their Hive customers use Tez, and none use MapReduce any longer.

In my experience, the stability gap between Spark and Hive closed a while ago, so long as you’re smart about memory management. As I noted recently, I don’t see a long-term future for Hive on Tez, because Impala and Presto are better for those normal BI queries, and Spark generally performs better for analytics queries (that is, for finding smaller haystacks inside of huge haystacks). In an era of cheap memory, if you can afford to do large-scale analytics, you can afford to do it in-memory, and everything else is more of a BI pattern.

While all of the engines have shown improvement over the last AtScale benchmark, Hive/Tez with the new LLAP (Live Long and Process) feature has made impressive gains across the board. The performance still hasn’t caught up with Impala and Spark, but according to this benchmark, it isn’t as slow and unwieldy as before — and at least Hive/Tez with LLAP is now practical to use in BI scenarios.

The full benchmark report is worth reading, but key highlights include:

  • Spark 2.0 improved its large query performance by an average of 2.4X over Spark 1.6 (so upgrade!). Small query performance was already good and remained roughly the same.

  • Impala 2.6 is 2.8X as fast for large queries as version 2.3. Small query performance was already good and remained roughly the same.

  • Hive 2.1 with LLAP is over 3.4X faster than 1.2 and its small query performance doubled. If you’re using Hive, this isn’t an upgrade you can afford to skip.

Not really analyzed is whether SQL is always the right way to go and how, say, a functional approach in Spark would compare. You need to take these benchmarks within the scope in which they are presented.

The bottom line is that all of these engines have dramatically improved in one year. Both Impala and Presto continue to lead in BI-type queries, and Spark leads performance-wise in large analytics queries. I’d like to see what could be done to address the concurrency issue with memory tuning, but that’s actually consistent with what I observed in the Google Dataflow/Spark Benchmark released by my former employer earlier this year. Either way, it is time to upgrade!

Source: InfoWorld Big Data