IDG Contributor Network: In an age of fake news, is there really such a thing as fake data?

Deloitte Global predicts that medium and large enterprises will increase their use of machine learning in 2018, doubling the number of implementations and pilot projects underway in 2017. And, according to Deloitte, by 2020, that number will likely double again.

Machine learning is clearly on the rise among companies of all sizes and in all industries, and these systems depend on data to learn. Training a machine learning model requires thousands or millions of data points, all of which need to be labeled and cleaned. Training data is what makes apps smart, teaching them the life lessons, experiences, sights, and rules that help them know how to react to different situations. What the developer of an AI app is really trying to do is simulate the experiences and knowledge that take people lifetimes to accrue.

The challenge many companies face in developing AI solutions is acquiring all the training data needed to build smart algorithms. While companies maintain data internally across different databases and files, few can quickly amass the volume of data required. Only tech-savvy, forward-thinking organizations that began storing their data years ago could even begin to try.

As a result, a new business is emerging that essentially sells synthetic data—fake data, really—that mimics the characteristics of the real deal. Companies that tout the benefits of synthetic data claim that effective algorithms can be developed using only a fraction of pure data, with the rest created synthetically. They also claim that it drastically reduces costs and saves time. But does it deliver on these claims?

Synthetic data: buyer beware

When you don’t have enough real data, just make it up. Seems like an easy answer, right? For example, if I’m training a machine learning application to detect the number of cranes on a construction site and I have examples of only 20 cranes, I could create new ones by changing the color of some cranes, the angle of others, and the size of still others, so that the algorithm is trained on hundreds of cranes. While this may seem easy and harmless enough, in reality things are not that simple. The quality of a machine learning application is directly proportional to the quality of the data with which it is trained.
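To make the crane example concrete, here is a minimal sketch of that kind of image augmentation, assuming Python with the Pillow library; the folder names and the specific rotation, resize, and color-shift transformations are illustrative choices, not a recommended recipe.

```python
# A minimal augmentation sketch (assumes Python with Pillow installed).
# Each real crane photo is rotated, resized, and color-shifted to produce
# extra "synthetic" training examples, mirroring the crane example above.
from pathlib import Path
from PIL import Image, ImageEnhance

def augment(image_path: Path, out_dir: Path) -> None:
    """Write several altered copies of one real image to out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    original = Image.open(image_path)
    variants = {
        "rotated": original.rotate(15, expand=True),             # change the angle
        "resized": original.resize((original.width // 2,
                                    original.height // 2)),      # change the size
        "recolored": ImageEnhance.Color(original).enhance(0.5),  # change the color
    }
    for name, img in variants.items():
        img.save(out_dir / f"{image_path.stem}_{name}.png")

# Hypothetical usage: augment every real crane photo in a folder.
# for path in Path("real_cranes").glob("*.jpg"):
#     augment(path, Path("synthetic_cranes"))
```

Each pass through a script like this multiplies the apparent size of the training set, which is exactly why the approach looks so appealing and why the quality concerns below still apply.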

A trained model needs to work accurately and effectively in the real world. Users of synthetically derived data have to take a huge leap of faith that it will produce a machine learning app that works in the real world and that every scenario the app will encounter has been addressed. Unfortunately, the real world doesn’t work that way. New situations are always arising that no one can predict with any degree of accuracy. Additionally, there are unseen patterns in real data that you simply can’t mimic.

Yet, while accumulating enough training data the traditional way could take months or years, synthetic data is developed in weeks or months. This is an attractive option for companies looking to swiftly deploy a machine learning app and begin realizing the business benefits immediately. In some situations where many images need to be identified quickly to eliminate manual, tedious processes, maybe it’s okay to not have a perfectly trained algorithm—maybe providing 30 percent accuracy is good enough.

But what about the mission- or life-critical situations where a bad decision by the algorithm could result in disaster or even death? Take, for example, a health care app that works to identify abnormalities in X-rays. Or an autonomous vehicle operating on synthetic training data. Because the app knows only what it has been trained on, what if it was never given data that tells it how to react to a real-world possibility such as a broken traffic light?

How do you make sure you’re getting quality data in your machine learning app?

The use of synthetic data is clearly on the rise: many AI software developers, insights-as-a-service providers, and AI vendors are using it to more easily get AI apps up and running and solving problems out of the gate. But when working with these firms, there are some key questions you should ask to make sure you are getting quality machine learning solutions.

Do you understand my industry and the business challenge at hand?

When working with a company developing your machine learning algorithm, it’s important that it understands the specific challenges facing your industry and the critical nature of your business. Before it can aggregate the relevant data and build an AI solution, the company needs an in-depth understanding of the business problem it is trying to solve.

How do you aggregate data?

It’s also important for you to know how the provider is getting the data that may be needed. Ask directly whether it uses synthetic data and, if so, what percentage of the training data is synthetic and how much is pure. Based on this, determine whether your application can afford to make a few mistakes now and then.

What performance metrics do you use to assess the solution?

You should also find out how the provider assesses the quality of the solution. Ask what measurement tools it uses to see how the algorithm performs in real-world situations. Additionally, you should determine how often it retrains the algorithm on new data.
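To illustrate the kinds of numbers worth asking for, here is a minimal sketch assuming Python and scikit-learn; the binary crane-detection labels and the report helper are hypothetical, and a real assessment would also track how these figures hold up on fresh, non-synthetic data over time.

```python
# A minimal evaluation sketch (assumes Python with scikit-learn installed).
# y_true holds real-world labels from a holdout set; y_pred holds the model's
# predictions on that same set (1 = crane present, 0 = no crane).
from sklearn.metrics import accuracy_score, precision_score, recall_score

def report(y_true, y_pred) -> None:
    """Print headline metrics measured on real-world (non-synthetic) data."""
    print(f"accuracy:  {accuracy_score(y_true, y_pred):.2f}")
    print(f"precision: {precision_score(y_true, y_pred):.2f}")
    print(f"recall:    {recall_score(y_true, y_pred):.2f}")

# Toy example with made-up labels, purely for illustration.
report(y_true=[1, 0, 1, 1, 0, 1], y_pred=[1, 0, 0, 1, 0, 1])
```

Numbers like these are only meaningful when the holdout set comes from real-world data rather than from the same synthetic generator used for training.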

Perhaps most important, you need to assess if the benefits of using synthetic data outweigh the risks. It’s often tempting to follow the easiest path with the quickest results, but sometimes getting it right—even when the road is longer—is worth the journey.

IDG Contributor Network: What matters more: Controlling the Internet's wiring, or its data? Both

In an interesting move, Facebook and Microsoft have forged an alliance to lay a new fiber optic cable under the Atlantic Ocean. Putting aside the environmental concerns this raises for many (including those of us inhabiting islands), the project also raises questions about how control of the Internet — and the data belonging to its users — has become a prize in a multi-strand tug of war involving technology companies and broadband service providers from here to Spain and back.

So, why does it matter who controls the building of Internet infrastructure? How does that relate to who controls user data? And what’s the implication for businesses that rely on Internet technology to deliver their products and services?

Google — part of Alphabet, Inc. — organizes the world’s data and, it could be argued, knows more about the average person than the average person knows about their closest friends and relatives. The company also accounts for more than 10 percent of all advertising spend globally. In holding the most information on, in, and about the Internet, Google could be the most powerful company on the planet.

Google also is involved, through one of the Alphabet companies, in the delivery of Internet services. The company has been experimenting with and investing in satellite and balloon technology (Project Loon) that could deliver Internet access to even the remotest regions of Earth — just as its existing Google Earth satellite project delivers images of those same regions, as well as eerily accurate photography of addresses such as your own.

IDG Contributor Network: Data sharing and medical research, fundraising is only the first step

Last week, Sean Parker (Facebook’s founding president and, notoriously, a cofounder of Napster) announced the single largest donation to support immunotherapy cancer research. Totaling $250 million, the donation will support research conducted across six academic institutions, with the possibility of incorporating additional researchers if more funding is secured down the line.

I think it goes without saying that all donations to support medical research, particularly programs like immunotherapy that have a more difficult time receiving traditional funding, are fantastic.

However, a project like this isn’t just notable for the size of the donation, but also for the breadth of coordination that will be required to synthesize research across so many organizations. As past experience shows, innovating new models in research and discovery can be a challenge. For example, the now-defunct Prize4Life was founded to incentivize research into cures for ALS (Lou Gehrig’s disease). The organization was well funded and recognized for innovations such as a crowdsourcing approach to data science to try to foster breakthroughs. The data experiment failed, however, and ultimately so did the organization.

More recently, Theranos has provided a cautionary tale for those looking to change processes without the strength of underlying data management and related quality standards. That company is perceived to have an execution problem, but what it really has is a data problem: trying to design testing that relies on the collection, analysis, and management of massive amounts of private data is a very ambitious undertaking.