It's (not) elementary: How Watson works

What goes into making a computer understand the world through senses, learning and experience, as IBM says Watson does? First and foremost, tons and tons of data.

To build a body of knowledge for Watson to work with on Jeopardy, researchers put together 200 million pages of content, both structured and unstructured, including dictionaries and encyclopedias. When asked a question, Watson initially analyzes it using more than 100 algorithms, identifying any names, dates, geographic locations or other entities. It also examines the phrase structure and the grammar of the question to better gauge what’s being asked. In all, it uses millions of logic rules to determine the best answers.

Today Watson is frequently being applied to new areas, which means learning new material. Researchers begin by loading Word documents, PDFs, and web pages into Watson to build up its knowledge. Question-and-answer pairs are then added to train Watson on the subject. To answer a question, Watson searches millions of documents to find thousands of possible answers. Along the way it collects evidence and uses a scoring algorithm to rate each item’s quality. Based on that scoring, it ranks all possible answers and offers the best one.
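
To make that retrieve-score-rank loop concrete, here is a minimal Python sketch; the tiny corpus, the evidence scorers, and the weights are invented for illustration and are not DeepQA's actual internals.

```python
# Minimal sketch of the retrieve -> score -> rank pattern described above.
# The corpus, scorers, and weights are hypothetical stand-ins, not DeepQA internals.

def keyword_overlap(question, doc):
    """Fraction of question terms that also appear in the document."""
    q_terms = set(question.lower().split())
    d_terms = set(doc["text"].lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

def source_reliability(question, doc):
    """Crude prior on how trustworthy the source is."""
    return doc.get("reliability", 0.5)

def answer_question(question, corpus, scorers, weights):
    # 1. Retrieve candidate documents that share any term with the question.
    candidates = [d for d in corpus
                  if any(t in d["text"].lower() for t in question.lower().split())]
    # 2. Score each candidate with several independent evidence scorers.
    ranked = []
    for doc in candidates:
        total = sum(w * s(question, doc) for w, s in zip(weights, scorers))
        ranked.append((total, doc["answer"]))
    # 3. Rank all candidates and offer the best-supported answer first.
    return sorted(ranked, key=lambda pair: pair[0], reverse=True)

corpus = [
    {"text": "Mount Everest is the highest mountain above sea level",
     "answer": "Mount Everest", "reliability": 0.9},
    {"text": "K2 is the second highest mountain on Earth",
     "answer": "K2", "reliability": 0.8},
]
print(answer_question("What is the highest mountain?", corpus,
                      [keyword_overlap, source_reliability], [0.7, 0.3]))
```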

Over time, Watson learns from its experience. It’s also updated automatically as new information is published. In terms of nuts and bolts, Watson uses IBM’s DeepQA software along with a variety of other proprietary and open-source technologies. In its original form, that included Hadoop and Apache UIMA (Unstructured Information Management Architecture) software and a cluster of 90 Power 750 computers packing a total of 2880 processor cores.

Today Watson is delivered via the cloud, but as competition heats up, IBM is keeping quiet about the underlying specifics.

“Our DeepQA reasoning and other foundational cognitive skills make use of deep-learning techniques, proprietary algorithms and open-source kernels and frameworks that make use of hardware technologies that are optimized for those workloads,” said IBM Watson vice president and CTO Rob High. 

Source: InfoWorld Big Data

Why being a data scientist 'feels like being a magician'

The data scientist role was thrust into the limelight early this year when it was named 2016’s “hottest job,” and there’s been considerable interest in the position ever since. Just recently, the White House singled data scientists out with a special appeal for help.

Those in the job can expect to earn a median base salary of roughly $116,840 — if they have what it takes. But what is it like to be a data scientist? Read on to hear what three people currently on the front lines had to say.

How the day breaks down

That data scientists spend a lot of time working with data goes without saying. What may be less obvious is that meetings and face-to-face time are also a big part of the picture.

“Typically, the day starts with meetings,” said Tanu George, an account manager and data scientist with LatentView Analytics. Those meetings can serve all kinds of purposes, she said, including identifying a client’s business problem, tracking progress, or discussing reports.

Tanu George is a data scientist with LatentView Analytics.

By midmorning the meetings die down, she said. “This is when we start doing the number crunching,” typically focused on trying to answer the questions asked in meetings earlier.

Afternoon is often spent on collaborative meetings aimed at interpreting the numbers, followed by sharing analyses and results via email at the end of the day.

Roughly 50 percent of George’s time is taken up in meetings, she estimates, with another 20 percent in computation work and 30 percent in interpretation, including visualizing and putting data into actionable form.

Meetings with clients also represent a significant part of the day for Ryan Rosario, an independent data scientist and mentor at online education site Springboard. “Clients explain the problem and what they’d like to see for an outcome,” he said.  

Next comes a discussion of what kinds of data are needed. “More times than not, the client actually doesn’t have the data or know where to get it,” Rosario said. “I help develop a plan for how to get it.”

Ryan Rosario is an independent data scientist and engineer.

A lot of data science is not working with the data per se but more trying to understand the big picture of “what does this mean for a company or client,” said Virginia Long, a predictive analytics scientist at healthcare-focused MedeAnalytics. “The first step is understanding the area — I’ll spend a lot of time searching the literature, reading, and trying to understand the problem.”

Figuring out who has what kind of data comes next, Long said. “Sometimes that’s a challenge,” she said. “People really like the idea of using data to inform their decisions, but sometimes they just don’t have the right data to do that. Figuring out ways we can collect the right data is sometimes part of my job.”

Once that data is in hand, “digging in” and understanding it comes next. “This is the flip side of the basic background research,” Long said. “You’re really finding out what’s actually in the data. It can be tedious, but sometimes you’ll find things you might not have noticed otherwise.”

Virginia Long is a predictive analytics scientist at MedeAnalytics.

Long also spends some of her time creating educational materials for both internal and external use, generally explaining how various data science techniques work.

“Especially with all the hype, people will see something like machine learning and see just the shiny outside. They’ll say, ‘oh we need to do it,'” she explained. “Part of every day is at least some explaining of what’s possible and how it works.”

Best and worst parts of the job

Meetings are George’s favorite part of her day: “They make me love my job,” she said.

For Rosario, whose past roles have included a stint as a machine learning engineer at Facebook, the best parts of the job have shifted over time.

“When I worked in Silicon Valley, my favorite part was massaging the data,” he said. “Data often comes to us in a messy format, or understandable only by a particular piece of software. I’d move it into a format to make it digestible.”

As a consultant, he loves showing people what data can do.

“A lot of people know they need help with data, but they don’t know what they can do with it,” he said. “It feels like being a magician, opening their minds to the possibilities. That kind of exploration and geeking out is now my favorite part.”

Long’s favorites are many, including the initial phases of researching the context of the problem to be solved as well as figuring out ways to get the necessary data and then diving into it headfirst.

Though some reports have suggested that data scientists still spend an inordinate amount of their time on “janitorial” tasks, “I don’t think of it as janitorial,” Long said. “I think of it as part of digging in and understanding it.”

As for the less exciting bits, “I prefer not to have to manage projects,” Long said. Doing so means “I often have to spend time managing everyone else’s priorities while trying to get my own things done.”

As for Rosario, who was trained in statistics and data science, systems building and software engineering are the parts he prefers to de-emphasize.

Preparing for the role

It’s no secret that data science requires considerable education, and these three professionals are no exception. LatentView Analytics’ George holds a bachelor’s degree in electrical and electronics engineering along with an MBA, she said.

Rosario holds a BS in statistics and math of computation as well as an MS in statistics and an MS in computer science from UCLA; he’s currently finishing his PhD in statistics there.

As for MedeAnalytics’ Long, she holds a PhD in behavioral neuroscience, with a focus on learning, memory and motivation.

“I got tired of running after the data,” Long quipped, referring to the experiments conducted in the scientific world. “Half of your job as a scientist is doing the data analysis, and I really liked that aspect. I also was interested in making a practical difference.”

The next frontier

And where will things go from here?

“I think the future has a lot more data coming,” said George, citing developments such as the internet of things (IoT). “Going forward, all senior and mid-management roles will incorporate some aspect of data management.”

The growing focus on streaming data means that “a lot more work needs to be done,” Rosario agreed. “We’ll see a lot more emphasis on developing algorithms and systems that can merge together streams of data. I see things like the IoT and streaming data being the next frontier.”

Security and privacy will be major issues to tackle along the way, he added.

Data scientists are still often expected to be “unicorns,” Long said, meaning that they’re asked to do everything single-handedly, including all the coding, data manipulation, data analysis and more.

“It’s hard to have one person responsible for everything,” she said. “Hopefully, different types of people with different skill sets will be the future.”

Words of advice

For those considering a career in data science, Rosario advocates pursuing at least a master’s degree. He also suggests trying to think in terms of data.

“We all have problems around us, whether it’s managing our finances or planning a vacation,” he said. “Try to think about how you could solve those problems using data. Ask if the data exists, and try to find it.”

For early portfolio-building experience, common advice suggests finding a data set from a site such as Kaggle and then figuring out a problem that can be solved using it.

“I suggest the inverse,” Rosario said. “Pick a problem and then find the data you’d need to solve it.”

“I feel like the best preparation is some sense of the scientific method, or how you approach a problem,” said MedeAnalytics’ Long. “It will determine how you deal with the data and decide to use it.”

Tools can be mastered, but “the sensibility of how to solve the problem is what you need to get good at,” she added.

Of course, ultimately, the last mile for data scientists is presenting their results, George pointed out.

“It’s a lot of detail,” she said. “If you’re a good storyteller, and if you can weave a story out of it, then there’s nothing like it.”

Source: InfoWorld Big Data

Meet Apache Spot, a new open source project for cybersecurity

Hard on the heels of the discovery of the largest known data breach in history, Cloudera and Intel on Wednesday announced that they’ve donated a new open source project to the Apache Software Foundation with a focus on using big data analytics and machine learning for cybersecurity.

Originally created by Intel and launched as the Open Network Insight (ONI) project in February, the effort is now called Apache Spot and has been accepted into the ASF Incubator.

“The idea is, let’s create a common data model that any application developer can take advantage of to bring new analytic capabilities to bear on cybersecurity problems,” Mike Olson, Cloudera co-founder and chief strategy officer, told an audience at the Strata+Hadoop World show in New York. “This is a big deal, and could have a huge impact around the world.”

Based on Cloudera’s big data platform, Spot taps Apache Hadoop for virtually unlimited log management and data storage scale, along with Apache Spark for machine learning and near-real-time anomaly detection. The software can analyze billions of events to detect unknown and insider threats and provide new network visibility.

Essentially, it uses machine learning as a filter to separate bad traffic from benign and to characterize network traffic behavior. It also uses a process including context enrichment, noise filtering, whitelisting and heuristics to produce a shortlist of most likely security threats.
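
A toy sketch of that whitelist, filter, score, and shortlist flow might look like the following; a simple z-score heuristic stands in for Spot's actual Spark-based models, and the fields, whitelist, and threshold are illustrative only.

```python
# Toy sketch of the whitelist -> noise filter -> anomaly score -> shortlist flow.
# A z-score stands in for Spot's actual Spark-based models; the fields,
# whitelist, and threshold are illustrative only.
from statistics import mean, stdev

WHITELIST = {"10.0.0.5", "10.0.0.6"}   # known-good hosts (hypothetical)

def shortlist_threats(flows, z_threshold=1.0):
    # Whitelisting and noise filtering: drop known-good hosts and empty flows.
    candidates = [f for f in flows if f["src"] not in WHITELIST and f["bytes"] > 0]

    # Characterize "normal" traffic volume, then score each flow's deviation from it.
    volumes = [f["bytes"] for f in candidates]
    mu = mean(volumes)
    sigma = stdev(volumes) if len(volumes) > 1 else 1.0
    scored = [(abs(f["bytes"] - mu) / (sigma or 1.0), f) for f in candidates]

    # Keep only the most anomalous flows as the analyst's shortlist.
    return sorted((pair for pair in scored if pair[0] >= z_threshold),
                  key=lambda pair: pair[0], reverse=True)

flows = [
    {"src": "10.0.0.7", "dst": "203.0.113.9",  "bytes": 1_200},
    {"src": "10.0.0.8", "dst": "198.51.100.4", "bytes": 950},
    {"src": "10.0.0.9", "dst": "192.0.2.77",   "bytes": 48_000_000},  # exfiltration-like spike
]
print(shortlist_threats(flows))
```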

By providing common open data models for network, endpoint, and user, meanwhile, Spot makes it easier to integrate cross-application data for better enterprise visibility and new analytic functionality. Those open data models also make it easier for organizations to share analytics as new threats are discovered.
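
For a sense of why a shared model matters, consider this hypothetical illustration: once every producer emits events in the same shape, a single analytic works across all of them. The field names below are invented for the example and are not Apache Spot's actual schema.

```python
# Hypothetical illustration of a shared event shape; these field names are
# invented for the example, not Apache Spot's actual open data model.
from dataclasses import dataclass

@dataclass
class NetworkEvent:
    timestamp: str
    src_ip: str
    dst_ip: str
    dst_port: int
    bytes_sent: int
    user: str        # links network activity to the user data model

def flag_watched_ports(events, watch_ports=frozenset({23, 3389})):
    """Any analytic written against the shared shape works for every producer."""
    return [e for e in events if e.dst_port in watch_ports]

events = [
    NetworkEvent("2016-09-28T12:00:00Z", "10.0.0.7", "198.51.100.4", 443, 5_200, "alice"),
    NetworkEvent("2016-09-28T12:00:05Z", "10.0.0.9", "192.0.2.77", 3389, 48_000, "bob"),
]
print(flag_watched_ports(events))
```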

Other contributors to the project so far include eBay, Webroot, Jask, Cybraics, Cloudwick, and Endgame.

“The open source community is the perfect environment for Apache Spot to take a collective, peer-driven approach to fighting cybercrime,” said Ron Kasabian, vice president and general manager for Intel’s Analytics and Artificial Intelligence Solutions Group. “The combined expertise of contributors will help further Apache Spot’s open data model vision and provide the grounds for collaboration on the world’s toughest and constantly evolving challenges in cybersecurity analytics.”

Source: InfoWorld Big Data

IBM promises a one-stop analytics shop with AI-powered big data platform

Big data is in many ways still a wild frontier, requiring wily smarts and road-tested persistence on the part of those hoping to find insight in all the petabytes. On Tuesday, IBM announced a new platform it hopes will make things easier.

Dubbed Project DataWorks, the new cloud-based platform is the first to integrate all types of data and bring AI to the table for analytics, IBM said.

Project DataWorks is available on IBM’s Bluemix cloud platform and aims to foster collaboration among the many types of people who need to work with data. Tapping technologies including Apache Spark, IBM Watson Analytics and the IBM Data Science Experience launched in June, the new offering is designed to give users self-service access to data and models while ensuring governance and rapid-iteration capabilities.

Project DataWorks can ingest data faster than any other data platform, at rates from 50 to hundreds of gigabits per second, from sources including enterprise databases, the internet of things (IoT), and social media, according to IBM. Meanwhile, what the company calls “cognitive” capabilities, such as those found in its Watson artificial intelligence software, can help pave a speedier path to new insights, it says.

“Analytics is no longer something in isolation for IT to solve,” said Derek Schoettle, general manager of cloud data services for IBM Analytics, in an interview. “In the world we’re entering, it’s a team sport where data professionals all want to be able to operate on a platform that lets them collaborate securely in a governed manner.”

Users can open any data set in Watson Analytics for answers to questions phrased in natural language, such as “what drives this product line?” Whereas often a data scientist might have to go through hundreds of fields manually to find the answer, Watson Analytics allows them to do it near instantaneously, IBM said.

More than 3,000 developers are working on the Project DataWorks platform, Schoettle said. Some 500,000 users have been trained on the platform, and more than a million business analysts are using it through Watson Analytics.

Available now, the software can be purchased through a pay-as-you-go plan starting at $75 per month for 20GB. Enterprise pricing is also available.

“Broadly speaking, this brings two things to the table that weren’t there before,” said Gene Leganza, a vice president and research director with Forrester Research.

First is “a really comprehensive cloud-based platform that brings together all the elements you’d need to drive data innovation,” Leganza said. “It’s data management, it’s analytics, it’s Watson, it’s collaboration across different roles, and it’s a method to get started. It’s really comprehensive, and the fact that it’s cloud-based means everyone has access.”

The platform’s AI-based capabilities, meanwhile, can help users “drive to the next level of innovation with data,” he said.

Overall, it’s “an enterprise architect’s dream” because it could put an end to the ongoing need to integrate diverse products into a functioning whole, Leganza said.

Competition in the analytics market has been largely segmented according to specific technologies, agreed Charles King, principal analyst with Pund-IT.

“If Project DataWorks delivers what IBM intends,” King said, “it could change the way that organizations approach and gain value from analyzing their data assets.”

Source: InfoWorld Big Data

SAP woos SMB developers with an 'express' edition of Hana

SAP has made no secret of the fact that its bets for the future rest largely on its Hana in-memory computing platform. But broad adoption is a critical part of making those bets pay off.

Aiming to make Hana more accessible to companies of all shapes and sizes, the enterprise software giant on Monday unveiled a downloadable “express” edition that developers can use for free.

The new express edition of SAP Hana can be used free of charge on a laptop or PC to develop, test, and deploy production applications that use up to 32GB of memory; users who need more memory can upgrade for a fee. Either way, the software delivers database, application, and advanced analytics services, allowing developers to build applications that use Hana’s transactional and analytical processing against a single copy of data, whether structured or unstructured.
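
As a minimal sketch of that “one copy of data, two workloads” idea, the snippet below runs a transactional insert and an analytical aggregate against the same table from Python. It assumes the SAP HANA Python client (hdbcli) is installed and that a SALES table already exists; the host, port, and credentials are placeholders.

```python
# Minimal sketch: one table serves both a transactional write and an analytical
# query. Assumes the SAP HANA Python client (hdbcli) is installed and a SALES
# table already exists; host, port, user, and password are placeholders.
from hdbcli import dbapi

conn = dbapi.connect(address="hxehost", port=39015, user="DEVUSER", password="...")
cur = conn.cursor()

# Transactional write...
cur.execute("INSERT INTO SALES (ORDER_ID, REGION, AMOUNT) VALUES (?, ?, ?)",
            (1001, "EMEA", 250.0))
conn.commit()

# ...and an analytical aggregate against the very same in-memory table.
cur.execute("SELECT REGION, SUM(AMOUNT) FROM SALES GROUP BY REGION")
for region, total in cur.fetchall():
    print(region, total)

cur.close()
conn.close()
```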

Originally launched more than five years ago, Hana uses an in-memory computing engine in which data to be processed is held in RAM instead of being read from disks or flash storage. This makes for faster performance. Hana was recently updated with expanded analytics capabilities and tougher security, among other features.

Hana also forms the basis for S/4Hana, the enterprise suite that SAP released in early 2015.

The new express edition of Hana can be downloaded from the SAP developer center and installed on commodity servers, desktops, and laptops using a binary installation package with support for either SUSE Linux Enterprise Server or Red Hat Enterprise Linux. Alternatively, it can be installed on Windows or Mac OS by downloading a virtual machine installation image that is distributed with SUSE Linux Enterprise Server.

Tutorials, videos, and community support are available. The software can also be obtained through the SAP Cloud Appliance Library, which provides deployment options for popular public cloud platforms.

“The new easy-to-consume model via the cloud or PC and free entry point make a very attractive offering from SAP,” said Cindy Jutras, president of research firm Mint Jutras. “Now companies such as small-to-midsize enterprises have access to a data management and app development platform that has traditionally been used by large enterprises.”

Source: InfoWorld Big Data

Salesforce is betting its Einstein AI will make CRM better

If there was any doubt that AI has officially arrived in the world of enterprise software, Salesforce just put it to rest. The CRM giant on Sunday announced Einstein, a set of artificial intelligence capabilities it says will help users of its platform serve their customers better.

AI’s potential to augment human capabilities has already been proven in multiple areas, but tapping it for a specific business purpose isn’t always straightforward. “AI is out of reach for the vast majority of companies because it’s really hard,” John Ball, general manager for Salesforce Einstein, said in a press conference last week.

With Einstein, Salesforce aims to change all that. Billing the technology as “AI for everyone,” it’s putting Einstein’s capabilities into all its clouds, bringing machine learning, deep learning, predictive analytics, and natural language processing into each piece of its CRM platform.

In Salesforce’s Sales Cloud, for instance, machine learning will power predictive lead scoring, a new tool that can analyze all data related to leads — including standard and custom fields, activity data from sales reps, and behavioral activity from prospects — to generate a predictive score for each lead. The models will continuously improve over time by learning from signals like lead source, industry, job title, web clicks, and emails, Salesforce said. 
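
In the same spirit, a bare-bones lead-scoring model can be sketched with off-the-shelf tools; the features, toy data, and choice of logistic regression below are assumptions for illustration, not Einstein’s implementation.

```python
# Bare-bones lead scoring in the spirit described above. The features, toy data,
# and logistic-regression model are assumptions for illustration, not how
# Salesforce Einstein actually works.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features per lead: web_clicks, emails_opened, rep_activities, senior_title
X = np.array([
    [12, 5, 3, 1],
    [ 1, 0, 0, 0],
    [ 7, 2, 4, 1],
    [ 0, 1, 1, 0],
    [15, 8, 6, 1],
    [ 2, 0, 2, 0],
])
y = np.array([1, 0, 1, 0, 1, 0])   # 1 = lead eventually converted

model = LogisticRegression().fit(X, y)

# The predicted probability of conversion becomes the 0-100 lead score.
new_lead = np.array([[9, 3, 2, 1]])
print(round(model.predict_proba(new_lead)[0, 1] * 100))
```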

Another tool will analyze CRM data combined with customer interactions such as inbound emails from prospects to identify buying signals earlier in the sales process and recommend next steps to increase the sales rep’s ability to close a deal.

In Service Cloud, Einstein will power a tool that aims to improve productivity by pushing a prioritized list of response suggestions to service agents based on case context, case history, and previous communications.

Salesforce’s Marketing, Commerce, Community, Analytics, IoT and App Clouds will benefit similarly from Einstein, which leverages all data within Salesforce — including activity data from its Chatter social network, email, calendar, and ecommerce as well as social data streams and even IoT signals — to train its machine learning models.

The technology draws on recent Salesforce acquisitions including MetaMind. Roughly 175 data scientists have helped build it, Ball said.

Every vendor is now facing the challenge of coming up with a viable AI product, said Denis Pombriant, managing principal at Beagle Research Group.

“Good AI has to make insight and knowledge easy to grasp and manipulate,” Pombriant said. “By embedding products like Einstein into customer-facing applications, we can enhance the performance of regular people and enable them to do wonderful things for customers. It’s not about automation killing jobs; it’s about automation making new jobs possible.”

Most of Salesforce’s direct competitors, including Oracle, Microsoft, and SAP, have AI programs of their own, some of them dating back further than Salesforce’s, Pombriant noted.

Indeed, predictive analytics has been an increasingly significant part of the marketer’s toolbox for some time, and vendors including Pegasystems have been applying such capabilities to CRM.

“I think more than any other move, such as IoT, AI is the next big thing we need to focus on,” Pombriant said. “If IoT is going to be successful, it will need a lot of good AI to make it all work.”

New Einstein features will start to become available next month as part of Salesforce’s Winter ‘17 release. Many will be added into existing licenses and editions; others will require an additional charge.

Also on Sunday, Salesforce announced a new research group focused on delivering deep learning, natural language processing, and computer vision to Salesforce’s product and engineering teams.

Source: InfoWorld Big Data

New programming language promises a 4X speed boost on big data

Memory management is challenging enough with traditional data sets, but when big data enters the picture, things can slow way, way down. A new programming language announced by MIT this week aims to remedy that problem, and so far it’s been found to deliver fourfold speed boosts on common algorithms.

The principle of locality is what governs memory management in most computer chips today, meaning that if a program needs a chunk of data stored at some memory location, it’s generally assumed to need the neighboring chunks as well. In big data, however, that’s not always the case. Instead, programs often must act on just a few data items scattered across huge data sets.

Fetching data from main memory is the major performance bottleneck in today’s chips, so having to fetch it more frequently can slow execution considerably.

“It’s as if, every time you want a spoonful of cereal, you open the fridge, open the milk carton, pour a spoonful of milk, close the carton, and put it back in the fridge,” explained Vladimir Kiriansky, a doctoral student in electrical engineering and computer science at MIT.

With that challenge in mind, Kiriansky and other researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have created Milk, a new language that lets application developers manage memory more efficiently in programs that deal with scattered data points in large data sets.

Essentially, Milk adds a few commands to OpenMP, an API for languages such as C and Fortran that makes it easier to write code for multicore processors. Using it, the programmer inserts a few additional lines of code around any instruction that iterates through a large data collection looking for a comparatively small number of items. Milk’s compiler then figures out how to manage memory accordingly.

With a program written in Milk, when a core discovers that it needs a piece of data, it doesn’t request it — and the attendant adjacent data — from main memory. Instead, it adds the data item’s address to a list of locally stored addresses. When the list gets long enough, all the chip’s cores pool their lists, group together those addresses that are near each other, and redistribute them to the cores. That way, each core requests only data items that it knows it needs and that can be retrieved efficiently.
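
That access pattern can be imitated in a few lines. The sketch below only simulates the buffering and locality grouping described above (Milk itself is a set of OpenMP extensions for compiled code, not a Python library), and the cache-line size and addresses are invented for the example.

```python
# Simulation of the batching idea described above: instead of fetching each
# scattered address immediately (plus its unneeded cache-line neighbors), each
# core buffers the addresses it wants; the buffers are then pooled, grouped by
# locality, and fetched together. The cache-line size and addresses are invented.
from collections import defaultdict

CACHE_LINE = 8   # pretend 8 items share one "cache line"

def batched_accesses(per_core_wanted_addresses):
    # Pool every core's deferred addresses.
    pooled = [a for addrs in per_core_wanted_addresses for a in addrs]

    # Group addresses that fall on the same cache line so each line is fetched once.
    lines = defaultdict(list)
    for addr in pooled:
        lines[addr // CACHE_LINE].append(addr)

    # Each group is now one efficient, contiguous fetch instead of many scattered ones.
    return dict(lines)

# Two "cores" that each touch a few scattered items in a huge array.
core0 = [3, 1025, 4099]
core1 = [1027, 5, 4100]
print(batched_accesses([core0, core1]))
# {0: [3, 5], 128: [1025, 1027], 512: [4099, 4100]}
```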

In tests on several common algorithms, programs written in the new language were four times as fast as those written in existing languages, MIT says. That could get even better, too, as the researchers work to improve the technology further. They’re presenting a paper on the project this week at the International Conference on Parallel Architectures and Compilation Techniques.

Source: InfoWorld Big Data

Big data hits $46 billion in revenue — and counting

Big data has been a big buzzword for more than a few years already, and it has solid numbers to back that up, including $46 billion in 2016 revenues for vendors of related products and services. But the big data era is just beginning to dawn, with the real growth yet to come.

So suggests a new report from SNS Research, which predicts that by the end of 2020, companies will spend more than $72 billion on big data hardware, software, and professional services. While revenue is currently dominated by hardware sales and professional services, that promises to change: By the end of 2020, software revenue will exceed hardware investments by more than $7 billion, the researcher predicts.

“Despite challenges relating to privacy concerns and organizational resistance, big data investments continue to gain momentum throughout the globe,” the company said in a summary of the report, which was announced Monday.

Others echo the same sentiment.

“Sooner rather than later, big data will become table stakes for enterprises,” said Tony Baer, a principal analyst at Ovum. “It will not provide unique competitive edge to innovators, but will add a new baseline to the analytics and decision support that enterprises must incorporate into their decision-making processes.”

It is indeed still early days for such initiatives, said Frank Scavo, president of Computer Economics.

“Business intelligence and data warehousing are top areas for technology spending this year, but only about one-quarter of organizations are including big data in their investment plans,” said Scavo, citing his own company’s research. “So, what we are seeing today is just the tip of the iceberg.”

Cloud storage and services are making big data affordable for most organizations, but realizing the benefits can be a challenge. That’s in large part due to the current shortage of business analysts and IT professionals with the right skills, particularly data scientists, he said.

“If you’re planning to invest in big data, you’d better be ready to invest in your people to develop the needed skills,” Scavo said. “At the same time, if you’re an IT professional just starting out in your career, big data would be a great area to focus on.”

Source: InfoWorld Big Data

Google Analytics just got a new AI tool to help find insights faster

Services like Google Analytics are great for amassing key data to help you make the most of your web efforts, but zeroing in on the parts that matter most can be a time-consuming challenge. On Friday, Google added a new feature to its analytics service that taps AI to surface insights automatically.

Now available in the Assistant screen in the Google Analytics mobile app, the new automated insights feature “lets you see in 5 minutes what might have taken hours to discover previously,” wrote Ajay Nainani, product manager for Google Analytics, in a blog post.

The tool taps Google machine intelligence to find key insights from among the thousands of metric and dimension combinations that can be reported in Google Analytics. More specifically, it combs through data and offers relevant insights and recommendations.

If you’re a retailer trying to get ready for the holiday season, for instance, the tool can instantaneously surface opportunities and anomalies hiding in your data, such as which products are experiencing higher-than-normal sales growth, which advertising channels are driving the most conversions and the best returns, and what devices customers are using to engage with your brand.
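
As a toy illustration of combing through metric and dimension combinations for movers, the snippet below compares each segment’s latest value with its recent average and surfaces the biggest changes; the data and the simple percent-change rule are assumptions for the example, not how Google’s machine-intelligence models work.

```python
# Toy illustration of scanning metric/dimension combinations: compare each
# segment's latest value to its recent average and surface the biggest movers.
# The data and percent-change rule are invented for the example, not Google's
# actual machine-intelligence models.
def surface_insights(history, current, top_n=3):
    """history/current map (dimension, metric) -> recent average / latest value."""
    changes = []
    for key, baseline in history.items():
        latest = current.get(key, 0.0)
        pct = (latest - baseline) / baseline * 100 if baseline else float("inf")
        changes.append((abs(pct), pct, key))
    changes.sort(reverse=True)
    return [(key, round(pct, 1)) for _, pct, key in changes[:top_n]]

history = {("mobile", "sessions"): 1000, ("desktop", "sessions"): 2000,
           ("email", "conversions"): 50}
current = {("mobile", "sessions"): 1650, ("desktop", "sessions"): 1900,
           ("email", "conversions"): 80}
print(surface_insights(history, current))
# [(('mobile', 'sessions'), 65.0), (('email', 'conversions'), 60.0), (('desktop', 'sessions'), -5.0)]
```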

The tool also offers quick tips on how to improve your Google Analytics data. And because it’s based on artificial intelligence, it gets smarter over time as it learns more about your business and how you use the software, Google says.

The new automated insights feature is now available with the official Google Analytics mobile app on Android and iOS for English-speaking users. Google’s now working on bringing it to the web version of the software and to other languages as well, Nainani said. Meanwhile, Google invites users to suggest insights they’d like to see automated and is collecting those ideas through an online form.

Source: InfoWorld Big Data

Meet the newest member of SAP's Hana family: a data warehouse

SAP has already placed big bets on Hana, and now it’s adding more with a new data warehouse tailored specifically for the in-memory computing platform.

Launched on Wednesday, SAP BW/4Hana promises to minimize data movement and duplication by enabling data to be analyzed wherever it resides, whether within or outside the enterprise. It can also integrate live streaming and time-series sensor data collected from internet of things (IoT) environments. 

Back in 2014, SAP added Hana support to its longstanding Business Warehouse data warehousing software, but BW/4Hana goes a big step further. Like S/4Hana, the enterprise suite SAP released last year, the new data warehouse is optimized for Hana and will not run on any other platform.

“We believe we have to adhere to the principles of real-time, in-memory computing,” said Ken Tsai, a vice president and head of cloud platform and data management product marketing at SAP. “The classic way of building a data warehouse is no longer viable.”