Devs will lead us to the big data payoff at last

In 2011, McKinsey & Co. published a study trumpeting that “the use of big data will underpin new waves of productivity growth and consumer surplus” and called out five areas ripe for a big data bonanza. In personal location data, for example, McKinsey projected a $600 billion increase in economic surplus for consumers. In health care, $300 billion in additional annual value was waiting for that next Hadoop batch process to run.

Five years later, according to a follow-up McKinsey report, we’re still waiting for the hype to be fulfilled. A big part of the problem, the report intones, is, well, us: “Developing the right business processes and building capabilities, including both data infrastructure and talent” is hard and mostly unrealized. All that work with Hadoop, Spark, Hive, Kafka, and so on has produced less benefit than we thought it would.

In part that’s because keeping up with all that open source software and stitching it together is a full-time job in itself. But you can also blame the bugbear that stalks every enterprise: institutional inertia. Not to worry, though: The same developers who made open source the lingua franca of enterprise development are now making big data a reality through the public cloud.

Paltry big data progress

On the surface the numbers look pretty good. According to a recent Syncsort survey, a majority of respondents (62 percent) are looking to Hadoop for advanced/predictive analytics, with data discovery and visualization (57 percent) also commanding attention.

Yet when you examine this investment more closely, a comparatively modest return emerges in the real world. By McKinsey’s estimates, we’re still falling short for a variety of reasons:

  • Location-based data has seen 50 to 60 percent of potential value captured, mainly because not everyone can afford a GPS-enabled smartphone
  • In U.S. retail, we’re seeing 30 to 40 percent, due to a lack of analytical talent and an abundance of still-siloed data
  • Manufacturing comes in at 20 to 30 percent, again because data remains siloed in legacy IT systems and because management remains unconvinced that big data will drive big returns
  • U.S. health care limps along at a dismal 10 to 20 percent, beset by poor interoperability and data sharing, along with a paucity of proof that clinical utility will result
  • The E.U. public sector also lags at 10 to 20 percent, thanks to an analytics talent shortage and data siloed in various government agencies

These aren’t the only areas measured by McKinsey, but they provide a good sampling of big data’s impact across a range of industries. To date, that impact has been muted. This brings us to the most significant hole in big data’s progress: culture. As the report’s authors describe:

Adapting to an era of data-driven decision making is not always a simple proposition. Some companies have invested heavily in technology but have not yet changed their organizations so they can make the most of these investments. Many are struggling to develop the talent, business processes, and organizational muscle to capture real value from analytics.

Given that people are the primary problem holding up big data’s progress, you could be forgiven for abandoning all hope.

Big data’s cloudy future

Nonetheless, things may be getting better. For example, in a recent AtScale survey of more than 2,500 data professionals across 1,400 companies and 77 countries, roughly 20 percent of respondents reported clusters of more than 100 nodes, a full 74 percent of which are in production. This represents double-digit year-over-year growth.

It’s even more encouraging to see where these nodes are running, which probably accounts for the increase in success rates. According to the same survey, more than half of respondents run their big data workloads in the cloud today and 72 percent plan to do so going forward. This aligns with anecdotal data from Gartner that interest in data lakes has mushroomed along with a propensity to build those lakes in public clouds.

This makes sense. Given that the very nature of data science — asking questions of our data to glean insight — requires a flexible approach, the infrastructure powering our big data workloads needs to enable this flexibility. In an interview, AWS product chief Matt Wood makes it clear that because “your resource mix is continually evolving, if you buy infrastructure it’s almost immediately irrelevant to your business because it’s frozen in time.”

Infrastructure elasticity is imperative to successful big data projects. Apparently more and more enterprises got this memo and are building accordingly. Perhaps not surprisingly, this shift in culture isn’t happening top-down; rather, it’s a bottom-up, developer-driven phenomenon.

What should enterprises do? Ironically, it’s more a matter of what they shouldn’t do: obstruct developers. In short, the best way to ensure an enterprise gets the most from its data is to get out of the way of its developers. They’re already taking advantage of the latest and greatest big data technologies in the cloud.

Source: InfoWorld Big Data

Report: OpenStack’s Global Traction Expands

According to a newly released Forrester Research report, “OpenStack’s Global Traction Expands For Its Newton Release,” OpenStack® has “grown into a de facto standard platform for the private cloud market and now serves as the foundation for public clouds, particularly in Europe and China.”

OpenStack® is the most widely deployed open source cloud computing software. The December 2016 report analyzes the state of OpenStack at the time of the October 2016 Barcelona Summit, which was convened to showcase Newton, the latest release of OpenStack software, and to plan for the 15th release, codenamed Ocata and expected in February 2017. The report also details important next steps for infrastructure and operations leaders investing in the OpenStack platform.

“In the past year, telcos like CableLabs, SK Telecom, and Verizon have shelved their previous objections to the Neutron networking project and flocked to the OpenStack community, contributing features like Doctor,” the report states. “Leading I&O professionals, application developers, and CIOs at firms like American Express, Disney, and Walmart have embraced OpenStack for their digital businesses. It’s the foundation of many private (and, increasingly, of many public) cloud services that give your company the agility it needs to respond to customer demand, from core systems to the mobile apps that deliver differentiated customer experiences.”

The report from Forrester Research also examines:

  • OpenStack growth among public cloud providers, which now includes 21 self-reported public providers globally. (Note: OpenStack Foundation data adds that these clouds are located in a combined 66 datacenters across 53 cities globally.)
  • How, “for some, OpenStack underpins a bigger effort to transform the network with network function virtualization (NFV).”
  • How “the Newton release focuses on container support and simple network requests. Magnum has expanded its scope to offer other orchestration cluster tools, including Docker Swarm, Kubernetes, and Mesos with either VM or bare metal.”
  • How, for those outside the Fortune 100 and high-tech, “the OpenStack ecosystem wants your participation and will be willing to provide you with ‘above and beyond’ support.”

Source: CloudStrategyMag

6 Big Data Predictions For 2017

The market has evolved from technologists looking to learn and understand new big data technologies to customers who want to learn about new projects, new companies, and most importantly, how organizations are actually benefitting from the technology. According to John Schroeder, executive chairman and founder of MapR Technologies, Inc., the acceleration in big data deployments has shifted the focus to the value of the data. 

John has crystallized his view of market trends into these six major predictions for 2017: 

1 – Artificial Intelligence is Back in Vogue

In the 1960s, Ray Solomonoff laid the foundations of a mathematical theory of AI, introducing universal Bayesian methods for inductive inference and prediction. In 1980 the first National Conference of the American Association for Artificial Intelligence (AAAI) was held at Stanford and marked the application of those theories in software. AI is now back in mainstream discussions and has become the umbrella buzzword for machine intelligence, machine learning, neural networks, and cognitive computing. Why is AI a rejuvenated trend? The three V’s come to mind: Velocity, Variety, and Volume. Platforms can now process the three V’s with modern and traditional processing models that scale horizontally, providing 10 to 20X the cost efficiency of traditional platforms. Google has documented how simple algorithms executed frequently against large datasets yield better results than other approaches using smaller sets. We’ll see the highest value from applying AI to high-volume, repetitive tasks, where machine consistency is more effective than human intuitive oversight, with its attendant error and cost.

2 – Big Data for Governance or Competitive Advantage

In 2017, the governance vs. data value tug of war will be front and center. Enterprises have a wealth of information about their customers and partners. Leading organizations will manage their data across regulated and non-regulated use cases. Regulated use cases require governance, data quality, and lineage so a regulatory body can report on and track data through every transformation back to its originating source. That rigor is mandatory and necessary, but it is limiting for non-regulated use cases such as customer 360 or offer serving, where higher cardinality, real-time access, and a mix of structured and unstructured data yield more effective results.

3 – Companies Focus on Business-Driven Applications to Keep Data Lakes from Becoming Swamps

In 2017 organizations will shift from the “build it and they will come” data lake approach to a business-driven data approach. Today’s world requires analytics and operational capabilities to address customers, process claims, and interface with devices in real time at an individual level. For example, any e-commerce site must provide individualized recommendations and price checks in real time. Healthcare organizations must process valid claims and block fraudulent ones by combining analytics with operational systems. Media companies are now personalizing content served through set-top boxes. Auto manufacturers and ride-sharing companies are interoperating at scale with cars and their drivers. Delivering these use cases requires an agile platform that can provide both analytical and operational processing, increasing value from additional use cases that span back-office analytics to front-office operations. In 2017, organizations will push aggressively beyond an “asking questions” approach and architect to drive initial and long-term business value.

4 – Data Agility Separates Winners and Losers

Software development has become agile, with DevOps providing continuous delivery. In 2017, processing and analytic models will evolve to provide a similar level of agility as organizations realize that data agility, the ability to understand data in context and take business action, is the source of competitive advantage, not simply having a large data lake. The emergence of agile processing models will enable the same instance of data to support batch analytics, interactive analytics, global messaging, database, and file-based models. More agile analytic models are also enabled when a single instance of data can support a broader set of tools. The end result is an agile development and application platform that supports the broadest range of processing and analytic models.

5 – Blockchain Transforms Select Financial Service Applications

In 2017, select transformational use cases will emerge in financial services, with broad implications for the way data is stored and transactions are processed. Blockchain provides a global distributed ledger that changes the way data is stored and transactions are processed. The blockchain runs on computers distributed worldwide, where the chains can be viewed by anyone. Transactions are stored in blocks; each block refers to the preceding block and is timestamped, storing the data in a form that cannot be altered after the fact. Hackers find it practically impossible to tamper with the blockchain undetected, since the whole world has a view of the entire chain. Blockchain provides obvious efficiency for consumers. For example, customers won’t have to wait for that SWIFT transaction or worry about the impact of a central datacenter leak. For enterprises, blockchain presents cost savings and an opportunity for competitive advantage.
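
To make that ledger structure concrete, here is a minimal, illustrative Python sketch of a hash-chained block list (a toy model, not any particular blockchain implementation): each block carries a timestamp, its transactions, and the hash of the preceding block, so altering an earlier block invalidates every link that follows.

```python
import hashlib
import json
import time

def block_hash(block):
    # Hash the block's contents deterministically.
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def make_block(transactions, prev_hash):
    # Each block is timestamped and refers to the preceding block's hash.
    return {"timestamp": time.time(), "transactions": transactions, "prev_hash": prev_hash}

# Build a tiny chain: a genesis block, then two more blocks.
chain = [make_block(["genesis"], prev_hash="0" * 64)]
for txns in (["alice->bob:10"], ["bob->carol:4"]):
    chain.append(make_block(txns, prev_hash=block_hash(chain[-1])))

def verify(chain):
    # Tampering with any earlier block breaks every later prev_hash link.
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1]) for i in range(1, len(chain)))

print(verify(chain))                              # True
chain[1]["transactions"] = ["alice->bob:1000"]    # tamper with history
print(verify(chain))                              # False
```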

6 – Machine Learning Maximizes Microservices Impact

This year we will see increased activity around integrating machine learning and microservices. Previously, microservices deployments have focused on lightweight services, and those that do incorporate machine learning have typically been limited to “fast data” integrations applied to narrow bands of streaming data. In 2017, we’ll see development shift to stateful applications that leverage big data, incorporating machine learning approaches that use large amounts of historical data to better understand the context of newly arriving streaming data.
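
As a hedged illustration of that pattern (the data and thresholds are hypothetical, not any vendor’s API), the sketch below fits a trivial baseline on historical readings and then scores newly arriving streaming values against that historical context:

```python
import statistics

# Hypothetical historical readings (e.g., per-minute order counts).
historical = [102, 98, 110, 95, 105, 99, 101, 97, 108, 103]

# "Train" a trivial model offline: mean and standard deviation of the history.
mean = statistics.mean(historical)
stdev = statistics.pstdev(historical)

def contextualize(event_value, threshold=3.0):
    # Score a newly arriving streaming value against the historical baseline.
    z = (event_value - mean) / stdev if stdev else 0.0
    return {"value": event_value, "z_score": round(z, 2), "anomalous": abs(z) > threshold}

# Simulated stream of new events.
for value in (101, 99, 187):
    print(contextualize(value))
```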

“Our predictions are strongly influenced by leading customers who have gained significant business value by integrating analytics into operational use cases,” said Schroeder. “Our customer use of the MapR converged data platform provides agility to Devops where they can use a broad range of processing models from Hadoop to Spark, SQL, NoSQL, files and message streaming — whatever is required for their current and future use cases in private, public and hybrid cloud deployments.”

Source: CloudStrategyMag

Hyperscale Data Center Count Passes The 300 Milestone In December

New data from Synergy Research Group shows that the number of large data centers operated by hyperscale providers hit the 300 mark in December, after a flurry of year-end data center openings by Amazon, Google, and Alibaba. One notable feature of the global footprint is that despite a major ongoing push to locate new operations in countries around the world, the U.S. still accounts for 45% of major cloud and internet data center sites. The next prominent locations are China and Japan, with 8% and 7% respectively. The three leading countries are then followed by the UK, Australia, Canada, Singapore, Germany, and India, each of which accounts for 3% to 5% of the total. The research is based on an analysis of the data center footprint of 24 of the world’s major cloud and internet service firms, including the largest operators in SaaS, IaaS, PaaS, search, social networking, and e-commerce.

On average each of the 24 firms had 13 data center sites. The companies with the broadest data center footprint are the leading cloud providers — AWS, Microsoft, and IBM. Each has 40 or more data center locations with at least two in each of the four regions — North America, APAC, EMEA, and Latin America. Google and Oracle also have a notably broad data center presence. The remaining firms tend to have their data centers focused primarily in either the U.S. (Apple, Twitter, Salesforce, Facebook, eBay, LinkedIn, Yahoo) or China (Tencent, Baidu). Previously Alibaba also was focused mainly in China but it has now opened data centers in the US, Hong Kong, Singapore, Japan, and the UAE. 

“Hyperscale growth goes on unabated and we are forecasting that hyperscale operators will pass the 400 data center mark by the end of 2018,” said John Dinsdale, a chief analyst and research director at Synergy Research Group. “What is remarkable is that the US still accounts for nearly half of all hyperscale data centers, reflecting the US dominance of cloud and internet technologies. While other countries are now featuring more prominently due to either their scale or the unique characteristics of their local markets, the major players continue to invest heavily in U.S. data center operations.”

Source: CloudStrategyMag

Q&A: Hortonworks CTO unfolds the big data road map

Hortonworks has built its business on big data and Hadoop, but the Hortonworks Data Platform provides analytics and supports a range of technologies across the Hadoop ecosystem, including MapReduce, Pig, Hive, and Spark. Hortonworks DataFlow, meanwhile, offers streaming analytics and uses technologies like Apache NiFi and Kafka.

InfoWorld Executive Editor Doug Dineley and Editor at Large Paul Krill recently spoke with Hortonworks CTO Scott Gnau about how the company sees the data business shaking out, the Spark vs. Hadoop face-off, and Hortonworks’ release strategy and efforts to build out the DataFlow platform for data in motion.

InfoWorld: How would you define Hortonworks’ present position?

Gnau: We sit in a sweet spot where we want to leverage the community for innovation. At the same time, we also have to be somewhat the adult supervision to make sure that all this new stuff, when it gets integrated, works. That gets to one core belief that we have, that we really are responsible for a platform and not just a collection of tech. We’ve modified the way that we bring new releases to market such that we only rebase the core. When I say “rebase the core,” that means new HDFS, new Yarn. We only rebase the core once a year, but we will integrate new versions of projects on a quarterly basis. What that allows us to do, when you think about when you rebase the core or when you bring in changes to the core Hadoop functionality, there’s a lot of interaction with the different projects. There’s a lot of testing, and it introduces instability. It’s software development 101. It’s not that it’s bad tech or bad developers. It introduces instability.

InfoWorld: This rebasing event, do you aim to do that at the same time each year?

Gnau: If we do it annually, yes, it will be at the same time each year. That would be the goal. The next target will be in the second half of 2017. In between, up to as frequently as quarterly, we will have nonrebasing releases where we’ll either add new projects or add new functionality or newer versions of projects to that core.

How that manifests itself is in a couple of advantages. Number one is we think we can get newer stuff out faster in a way that’s more consumable because of the stability that it implies for our customers. We also think conversely, that our customers will be more amenable to staying closer to the latest release because it’s very understandable what’s in and what changed.

The example I have for that is we recently did the 2.5 release, and basically in 2.5, there were only two things we changed: Hive and Spark. It makes it very easy if you think about a customer who has their operations staff running around doing change management. Inside of it, we actually allowed for the first time that customers could choose a new version of Spark or the old version of Spark or actually run both at the same time. Now if you’re running change management, you’re saying, “OK, I can install all the new software, and I can default it to run on the old version of Spark, so I don’t have to go test anything.” Where I have feature functionality that wants to take advantage of the new version of Spark, I can simply have them use that version for those applications.

InfoWorld: There’s been talk that Spark is displacing Hadoop. What’s happening as far as Spark versus Hadoop?

Gnau: I don’t think it’s Spark versus Hadoop. It’s Spark and Hadoop. We’ve been very successful and a lot of customers have been very successful down that path. I mentioned that even in our new release where, when the latest version of Spark came out, within 90 minutes of it being published to Git, it was in our distribution. We’re highly committed to that as an execution engine for the use cases where it’s popular, so we’ve invested not only in the packaging, but also with the contributions and committers we have, and in tools like Apache Zeppelin, which enables data scientists and Spark users to create notebooks and be more efficient about how they share algorithms and how they optimize the algorithms that they’re writing against those data sets. I don’t view it as either/or but more as an “and.”

In the end, for business-critical applications that are making a difference and are customer-facing, there is a lot of value behind the platform from a security, operationalization, backup and recovery, business continuity, and all those things that come with a platform. Again, I think the “and” becomes more important than the “or.” Spark is really good for some workloads and really horrible for others, so I don’t think it’s Spark versus the world. I think it’s Spark and the world for the use cases where it makes sense.

InfoWorld: Where does it make sense? Obviously you’re committed to Hive for SQL. Spark also offers a SQL implementation. Do you make use of that? This space is interesting in that all these platform vendors want to offer every tool for basically every kind of processing.

Gnau: There are Spark vendors that want to offer only Spark.

InfoWorld: That’s true. I’m thinking of Cloudera, you and MapR, the established Hadoop vendors. These platforms have lots of tools, and we’d like to understand which of those tools are being used for what sorts of analytics.

Gnau: Simplistic, interactive workloads on reasonably small sets of data fit Spark. If you get into petabytes, you’re not going to be able to buy enough memory to make Spark work effectively. If you get into very sophisticated SQL, it’s not going to run. Yes, there are many tools for many things, and ultimately there is that interactive, simplistic, memory-resident, small-data use case that Spark fits. When you start to get to the bleeding edge of any of those parameters, it’s going to be less effective, and the goal is to have that then bleed into Hive.
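
To make Gnau’s routing criteria concrete, here is a purely illustrative Python sketch (hypothetical thresholds and function names, not a Hortonworks feature) of steering a query to Spark or Hive based on estimated scan size and SQL complexity:

```python
# Hypothetical query router: small, interactive work goes to Spark;
# very large scans or complex SQL go to Hive. Thresholds are illustrative only.
SPARK_MAX_BYTES = 512 * 1024 ** 3   # ~512 GB of scanned data

def choose_engine(estimated_scan_bytes, complex_sql=False):
    if complex_sql or estimated_scan_bytes > SPARK_MAX_BYTES:
        return "hive"    # big or sophisticated SQL: let Hive handle it
    return "spark"       # small, interactive, memory-resident work

def run_query(sql, estimated_scan_bytes, complex_sql=False):
    engine = choose_engine(estimated_scan_bytes, complex_sql)
    # A real system would submit to Spark SQL or Hive (e.g., over JDBC);
    # here we simply report the routing decision.
    return {"engine": engine, "sql": sql}

print(run_query("SELECT region, SUM(sales) FROM orders GROUP BY region",
                estimated_scan_bytes=20 * 1024 ** 3))
```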

InfoWorld: How opinionated can you be about your platform and how free are you in deciding you are no longer going to support a tool or are retiring a tool?

Gnau: The hardest thing any product company can do is retire a product, the most horrid thing in the world. I don’t know that you will see us retire a whole lot, but maybe there will be things that get put out to pasture. The nice thing is that there is still a live community out there, so even though we may not be focused on trying to drive investment because we’re not seeing demand in the market, there will still be a community [that] can go out and pick up things, so I see it more as an out to pasture.

InfoWorld: To take one example, Storm is still obviously a core element and I assume that’s because you’ve decided it’s a better way to do stream processing than Spark or others.

Gnau: It’s not a better way. It provides windowing functions, which are important to a number of use cases. I can imagine a world where you’ll write SQL and you’ll send that SQL off, and we’ll grab it and we’ll actually help decide how it should run and where it should run. That’s going to be necessary for the thing itself to be sustainable.

There are some capabilities along those lines that we’re doing here and there as placeholders, but I think as an industry, if we don’t make it simpler to consume, there will be a problem industry-wide, regardless of whether we’re smart or Cloudera is smart, whatever. It will be an industry problem because it won’t be consumable by the masses. It’s got to be consumable and easy. We’re going to create some tools that will help you decide how you deploy and help you manage where you can have an application that thinks they’re talking to an API versus I’ve got to run Hive for this and HBase for this and having to understand all those different things.

InfoWorld: Can you identify technologies that are emerging that you expect to be in the platform in the coming year or so?

Gnau: The biggest thing that is important is the whole notion of data in motion versus data at rest. When I say “data in motion,” I’m not talking about just streaming. I’m not talking about just data flow. I’m talking about data that’s moving and how do you do all of those things? How do you apply complex event processing, simple event processing? How do you actually guarantee delivery? How do you encrypt and protect and how do you validate and create provenance, all the provenance in data in motion? I see that as a huge bucket of opportunity.

Obviously, we made the acquisition of Onyara and released Hortonworks DataFlow based on Apache NiFi. Certainly that’s one of the most visible things. I would say that it is not NiFi alone; what you see inside Hortonworks DataFlow includes NiFi and Storm and Kafka, a bunch of components. You’ll see us building out DataFlow as a platform for data in motion, and we already have and will continue to invest along those lines. When I’m out and about and people say, “What do you think about streaming?” I say, well, streaming is a very small subset of the data-in-motion problem. It’s an important thing to solve, but we need to think about it as a bigger opportunity because we don’t want to solve just one problem and then have six other problems that prevent us from being successful. That’s going to be driven by devices, IoT, all the buzzwords out there.

InfoWorld: In this data-in-motion future, how central or how important is a time series database, a database built to store time series data as opposed to using something else?

Gnau: Time series analytics are important. I would submit that there are a number of ways that those analytics can be engineered. Time series database is one of the ways. I don’t know that a specific time series database is required for all the use cases. There may be other ways to get to the same answer, but time series and the temporal nature of data are increasingly important, and I think you will see some successful projects come up along those lines.

Source: InfoWorld Big Data

HPE & Cisco Maintain Lead In Cloud Infrastructure

New Q3 data from Synergy Research Group shows that HPE maintained a narrow lead over Cisco in the strategically important cloud infrastructure equipment market, while Dell EMC is now challenging the top two after the completion of their historic merger. Meanwhile, ODMs (contract manufacturers) in aggregate continue to increase their share of the market, driven by continued heavy investments in data centers by hyperscale cloud providers. Microsoft and IBM round out the group of top cloud infrastructure vendors. HPE and Cisco have been in a closely contested leadership battle in this market for the last sixteen quarters, over which time their total revenues have been virtually identical. Across the different types of cloud deployment, Cisco continues to hold a commanding lead in public cloud infrastructure while HPE has a clear lead in private cloud.

Total cloud infrastructure equipment revenues, including public and private cloud, hardware and software, are poised to reach $70 billion in 2016 and continue to grow at a double-digit pace. Servers, OS, storage, networking and virtualization software combined accounted for 94% of the Q3 cloud infrastructure market, with the balance comprising cloud security and cloud management. By segment, HPE has a clear lead in the cloud server segment and is a main challenger in storage, while Cisco is dominant in the networking segment and also has a growing server product line. Dell EMC is the second-ranked server vendor and has a clear lead on storage. Microsoft features heavily in the ranking due to its position in server OS and virtualization applications, while IBM maintains a strong position across a range of cloud technology markets.

“Growth in private cloud infrastructure is slowing down as enterprises shift more attention and workloads to the public cloud, but that means that there is a continued boom in shipments of infrastructure gear to public cloud providers,” said John Dinsdale, a chief analyst and research director at Synergy Research Group. “For traditional IT infrastructure vendors there is one fly in the ointment though — hyperscale cloud providers account for an ever-increasing share of data center gear and many of them are on a continued drive to deploy own-designed servers, storage and networking equipment, manufactured for them by ODMs. ODMs in aggregate now control a large and growing share of public cloud infrastructure shipments.”

Source: CloudStrategyMag

Faction Releases Internetwork eXchange (FIX)

Faction has released its Faction Internetwork eXchange (FIX), allowing enterprises to easily and cost-effectively connect private cloud and colocation resources into public clouds privately and securely. This extends Faction’s patent-pending “bring your network as-is” private cloud networking to hybrid cloud designs, allowing enterprises to easily add the use of public cloud to their private clouds without complex networking changes.

“IT managers today want to take advantage of new technologies, but find connecting private cloud workloads to public clouds to be a difficult and costly proposition,” comments Matthew Wallace, VP of Product for Faction. “Our patent-pending SDN technology allows our customers to bring their existing networks to our private cloud without any reconfiguration. Faction Internetwork eXchange now extends that capability to public clouds, giving them the security, cost savings, and simplicity they need to take full advantage of public cloud resources now, without costly migrations or complicated new tools. The FIX is also designed to be a more cost-effective way to connect to the public cloud than a dedicated connection or using the public Internet.”

Faction Internetwork eXchange allows organizations to quickly add private, secure connectivity to hundreds of other networks over their existing infrastructure. No additional hardware or software is needed to create many virtual circuits to different networks, all over their existing ports. This allows customers to easily pursue hybrid cloud and multi-cloud strategies without spending money up front on hardware, tools, and training. Among the hundreds of available connection endpoints, Faction can easily connect customers into popular public cloud destinations, such as Amazon AWS, Microsoft Azure, Google Cloud Platform, IBM Softlayer, and many others.

Source: CloudStrategyMag

Google BigQuery provides insight into Stack Overflow discussion data

Software development discussion site Stack Overflow has started offering quarterly snapshots of its question-and-answer database through Google’s BigQuery.

Stack Exchange, the parent company of Stack Overflow and its sister sites, has previously made its data available to researchers through its online data explorer. But now researchers with a Google Cloud Platform account can plug directly into the data set using Google’s data exploration tools, which have fewer limitations than Stack Overflow’s.

If you have a Google Cloud account, you can log in and begin exploring the data directly from a SQL-style web interface. Results from queries can be exported to CSV or JSON, saved to other tables in Google BigQuery, or exported to Google Sheets. BigQuery also comes with a REST API, so it can be used with third-party visualization tools or software stacks.
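
As a minimal sketch (assuming the google-cloud-bigquery Python client and the table and column names of the public bigquery-public-data.stackoverflow dataset), a query for the most-asked-about tags in 2016 might look like this:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()  # uses your default Google Cloud credentials and project

# Count questions per tag for 2016. Table and column names are assumed from the
# bigquery-public-data.stackoverflow public dataset, where tags are pipe-delimited.
sql = """
    SELECT tag, COUNT(*) AS questions
    FROM `bigquery-public-data.stackoverflow.posts_questions`,
         UNNEST(SPLIT(tags, '|')) AS tag
    WHERE EXTRACT(YEAR FROM creation_date) = 2016
    GROUP BY tag
    ORDER BY questions DESC
    LIMIT 10
"""

for row in client.query(sql).result():
    print(row.tag, row.questions)
```

The same result set could then be exported to CSV or Google Sheets, or fed into a visualization tool via the REST API.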

Stack Overflow’s question-and-answer format is popular with developers seeking quick solutions to common problems. Though it has a reputation for being insular and unwelcoming, it’s widely trafficked, and many of its highest-voted answers circulate widely as great explainers. For example, a popular question about why processing a sorted array is faster than processing an unsorted one not only gives a detailed technical answer, but also serves as a great explainer of the concept of branch prediction failure.

One possible application for Stack Overflow’s data, with or without BigQuery’s tool set, is sentiment analysis of the topics and discussions taking place on the site; in other words, getting broad hints about how developers feel about a technology.

If discussions about a language are paired with discussions about an IDE for that language, those threads could be parsed for details about what people are (or aren’t) doing most often with that pairing. Thus, you could figure out what developers might need but aren’t yet asking for.

Stack Overflow’s yearly surveys of its developers provide a similar snapshot of its audience’s mindsets: what languages are popular or how developers classify themselves. But such surveys are self-conscious and self-reporting, and they’re limited to the categories devised for them. Discussions on the site could provide more open-ended, direct, and detailed data about what developers like, hate, look for, and struggle with.

Note that this data set comes from Stack Overflow, and not from any of the other IT-related Stack Exchange sites, such as Server Fault (for IT admins) or Super User (for “computer enthusiasts and power users”). If these data sets go online through Google BigQuery as well, they could open up possibilities for even larger and more sophisticated analyses across multiple IT disciplines.

Source: InfoWorld Big Data

MariaDB crashes open source big data analytics competitors

MySQL variant MariaDB is aiming for the OLAP market with the public release of its latest feature, ColumnStore 1.0.

The move is part of MariaDB’s mission to broaden its reach and be a cheaper alternative to analytics databases like Teradata or Vertica. But the company faces stiff open source competition.

Doing more with less

Originally announced in April, ColumnStore isn’t a new project; it’s a port of an existing one, InfiniDB, that used the MySQL engine. After the company that produced InfiniDB went defunct in 2015, MariaDB took over the project, continued supporting its existing customer base, and realized that InfiniDB’s column-oriented technology could add OLAP capabilities to the traditionally OLTP-oriented MySQL. (Column-stored data allows for high-speed reading and searching of datasets.)

MariaDB believes there are multiple advantages to blending the two approaches. One is being able to perform queries that mix both columnar InfiniDB data and row-based MariaDB data — for instance, being able to create SQL JOINs across both kinds of data. Another is having a native SQL querying layer for an OLAP solution, which many OLAP products have been adding separately with widely varying efficacy.
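
As a hedged sketch of what that mix can look like in practice (assuming a MariaDB server with the ColumnStore engine installed and cross-engine joins enabled, a pymysql connection, and hypothetical table names), a row-oriented InnoDB table can be joined with a column-oriented ColumnStore table in ordinary SQL:

```python
import pymysql  # pip install pymysql

conn = pymysql.connect(host="localhost", user="analytics", password="secret", database="shop")

with conn.cursor() as cur:
    # Row-oriented OLTP table (InnoDB) and column-oriented OLAP table (ColumnStore).
    cur.execute("""CREATE TABLE IF NOT EXISTS customers (
                       id INT PRIMARY KEY, region VARCHAR(32)) ENGINE=InnoDB""")
    cur.execute("""CREATE TABLE IF NOT EXISTS sales_facts (
                       customer_id INT, amount DECIMAL(10,2), sold_at DATE) ENGINE=ColumnStore""")

    # A single SQL JOIN spanning both storage engines.
    cur.execute("""SELECT c.region, SUM(s.amount) AS revenue
                   FROM sales_facts s
                   JOIN customers c ON c.id = s.customer_id
                   GROUP BY c.region""")
    for region, revenue in cur.fetchall():
        print(region, revenue)

conn.close()
```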

But the biggest advantage is cost. MariaDB claims that ColumnStore “on average costs 90.3% less per terabyte compared to commercial data warehouses,” but offers little specific detail — what size of database, which specific commercial competitors, etc. — to back up the claim. A sample customer story involving the World Bank’s Institute for Health Metrics and Evaluation mostly cites earlier versions of MySQL (due to existing infrastructure) and the in-memory MemSQL database as the other choices considered, rather than any of the more commercial data-warehousing solutions.

Not the only game in town

Late 2015 saw a major open source competitor to conventional data warehouses or OLAP analytics solutions emerge: Greenplum Database, the data warehouse solution open sourced by Pivotal.

In a way, Greenplum vs. ColumnStore amounts to a clash between two long-standing open source database projects. With ColumnStore, it’s MySQL/MariaDB; with Greenplum, it’s PostgreSQL, since Greenplum is derived from that project.

That said, the two have evolved far past their roots; the competition between them is less about what underlying technology they use and more about how large an existing audience each of them is likely to capture.

Greenplum is likely to appeal to those who are already settled on Pivotal in some form or another. ColumnStore is for those still on MariaDB, but about to outgrow it because they’re tackling problems of far larger scope than MariaDB was set to handle. By offering ColumnStore, MariaDB aims to stave off migrations not just to competing products, but to new-breed warehousing services like Snowflake that are both increasingly cost-effective and ANSI SQL-compliant.

Source: InfoWorld Big Data

IDG Contributor Network: Holiday shopping season and fraud: Not one without the other

Consumers are shopping online more than ever before. The holiday season has e-commerce marketing and sales teams working overtime to churn out attractive holiday campaigns. With the holidays comes a flurry of fraudulent transactions, with fraudsters lurking in the dark, ready to spoil the spirit of the season. As sales increase, so will the total dollar amount of fraudulent transactions.

“Retailers need to constantly improve their level of fraud prevention by incorporating consumer purchasing behavior analytics and originating IP addresses for online orders. This should help minimize a spike in online orders,” said Max Silber, Director of Mobility at MetTel, a B2B communications and IT firm in New York.

Know thy past

According to Riskified, an enterprise e-commerce fraud prevention vendor, retailers see a 100% rise in the number of purchases made using international credit cards, and the company therefore advises merchants to scour past data to understand and separate successful orders and transactions from fraudulent ones.

“Most merchants are likely to discover that they’ve been overly risk averse during the holidays. Our analysts have determined that top holiday sales days are actually far safer than average shopping days, and that any given order placed during the holidays is 55% less likely to be fraudulent. Partially because merchants were unaware of this, 4 out of 5 orders rejected during last year’s holidays were, in fact, legitimate,” wrote Riskified’s Ephraim Rinsky in a blog post.

Rinsky writes that it’s crucial for e-commerce merchants to distinguish between customer profiles in order to anticipate future behavior.

“The fraud rate among returning customers is about half that of new customers: 1.4% compared to 2.6%. This means that returning customers should be treated very differently than new ones. This distinction is especially critical during the holidays, when order volume is so much greater.”

As e-commerce platforms keep accumulating consumer data, these businesses become even more valuable targets to cyber-criminals looking for economic gains.

“In addition to hacking into companies’ customer databases, cyber-criminals can also spoof companies’ identities to trick customers into divulging their personal information by sending emails with misleading subject lines such as ‘Click to track your transaction,’” said Gus Anagnos, VP of Global Alliances at crowdsourced cybersecurity firm Synack.

The problem with same-day delivery

The interesting question about fraud isn’t only the increase in the sheer volume of transactions, but also the improvement in logistics. Many retailers, notably Amazon, now offer same-day delivery services, which opens up another front in the fight against fraud.

“The increasing demand for same-day delivery will raise the bar for fraud detection service providers. The faster the turnaround from order to shipment, the more sophisticated the tool to give a go/no-go assessment for each transaction. It will be increasingly difficult for brands of any size to manually handle fraud detection on their own,” said Thom O’Leary from Fixergroup.

Same-day delivery shrinks the window between the transaction and the moment the purchase shows up at the door. That means consumers have less time to notice a problem and contact the merchant or their bank about a fraudulent transaction, which matters because once the item ships, there is little that can be done to recover it. To complicate matters, most consumers contact their bank first, which adds lead time before the merchant is notified of the issue, sometimes weeks.

“There are solutions that can help mitigate this by analyzing the order information, such as the billing address, shipping address, and IP address of the purchaser to determine if there is a higher risk for fraud. In which case you can choose to hold the shipment until the transaction can be verified,” said ExpandLab‘s Eddie Spradley.
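
The kind of order screening Spradley describes can be illustrated with a minimal, rule-based Python sketch; the signals, weights, and threshold here are hypothetical, not any vendor’s model:

```python
# Hypothetical rule-based fraud screen: score an order on a few signals
# and hold it for review if the score crosses a threshold.
def fraud_score(order):
    score = 0
    if order["billing_address"] != order["shipping_address"]:
        score += 2                         # mismatched addresses
    if order["ip_country"] != order["billing_country"]:
        score += 3                         # order placed from another country
    if order["amount"] > 1000:
        score += 1                         # unusually large basket
    if not order["returning_customer"]:
        score += 1                         # new customers carry more risk
    return score

def decision(order, hold_threshold=4):
    return "hold for review" if fraud_score(order) >= hold_threshold else "ship"

order = {
    "billing_address": "12 Main St", "shipping_address": "99 Other Ave",
    "billing_country": "US", "ip_country": "RO",
    "amount": 1450.00, "returning_customer": False,
}
print(decision(order))   # hold for review
```

In practice the held orders would go to manual verification before shipment, which is exactly the trade-off same-day delivery compresses.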

With holiday season shopping in full swing, it’s clear that e-commerce merchants need to do a better job of mining past data to understand future customer behavior and the threats that come with it. That understanding becomes even more important with new and improved delivery methods, such as same-day delivery, which pose a whole new dilemma for merchants.

This article is published as part of the IDG Contributor Network.

Source: InfoWorld Big Data