Get started in data science: 5 steps you can take online for free

Making a career change is never easy, but few things are more motivating than the prospect of a good salary and a dearth of competition. That’s a fair summary of the data science world today, as at least one well-publicized study has made clear, so why not investigate a little further?

There’s been a flurry of free resources popping up online to help those who are intrigued learn more. Here’s a small sampling for each step of the way.

1. Understand what it is

Microsoft’s website might not automatically spring to mind as a likely place to look, but sure enough, a few months ago the software giant published a really nice series of five short videos entitled “Data Science for Beginners.” Each video focuses on a specific aspect, such as “The 5 questions data science answers” and “Is your data ready for data science?”

2. Dig a little deeper

If you think a data science career might be for you, a good next step is to get a feel for the lay of the land by tapping into some of the big blogs and community websites out there. The newly revamped OpenDataScience.com is one example; KDnuggets is another useful resource. A recent post on Data Science Central (another good site) lists key accounts to follow on Twitter, and KDnuggets suggests some good e-books to read before taking the plunge.

Digital transformation is giving IT spending a big boost

Digital transformation may promise critical benefits for the companies undertaking it, but it’s also delivering a major boost to IT spending around the world.

That’s according to market researcher IDC, which on Monday released new data indicating that global spending on IT products and services will grow from nearly $2.4 trillion in 2016 to more than $2.7 trillion in 2020. A big part of that growth, it says, will come from companies investing in cloud, mobility, and big data technologies as part of their digital transformation efforts. Such efforts are now particularly prominent in financial services and manufacturing.
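The rounded figures above imply a fairly modest growth rate. A quick back-of-the-envelope calculation (using the article's approximate "$2.4 trillion" and "$2.7 trillion" figures; IDC's exact numbers may differ):

```python
# Implied compound annual growth rate (CAGR) from IDC's rounded figures.
# Both spending figures are approximations quoted in the article.
spend_2016 = 2.4  # trillions of dollars, 2016
spend_2020 = 2.7  # trillions of dollars, 2020
years = 2020 - 2016

cagr = (spend_2020 / spend_2016) ** (1 / years) - 1
print(f"Implied CAGR: {cagr:.1%}")  # roughly 3% per year
```

In other words, the forecast works out to roughly 3 percent annual growth over the four-year period.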

Purchases on the consumer side accounted for nearly a quarter of all IT revenues in 2015, thanks largely to what IDC calls “the ongoing smartphone explosion,” but in general consumer spending on PCs, tablets, and smartphones has been waning. Even the modest growth forecast for the tablet market will be driven by commercial segments, it said.

“While the consumer and public sectors have dragged on overall IT spending so far in 2016, we see stronger momentum in other key industries including financial services and manufacturing,” said Stephen Minton, vice president of customer insights and analysis at IDC. “Enterprise investment in new project-based initiatives, including data analytics and collaborative applications, remains strong.”

Google is using AI to compress images better than JPEG

Small is beautiful, as the old saying goes, and nowhere is that more true than in media files. Compressed images are considerably easier to transmit and store than uncompressed ones are, and now Google is using neural networks to beat JPEG at the compression game.

Google began by taking a random sample of 6 million 1,280×720 images on the web. It then broke those down into nonoverlapping 32×32 tiles and zeroed in on the 100 with the worst compression ratios. The goal there, essentially, was to focus on improving performance on the “hardest-to-compress” data, because it’s bound to be easier to succeed on the rest.
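The tiling-and-ranking step can be sketched in a few lines. This is an illustration only: zlib's DEFLATE stands in for the codec Google actually measured against, and a random noise image stands in for a real web image.

```python
import zlib
import numpy as np

def hardest_tiles(image, tile=32, k=100):
    """Split an image into nonoverlapping tile x tile patches and return the
    k patches that compress worst.  zlib's DEFLATE is used here as a cheap
    stand-in for the codec Google actually measured against."""
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - h % tile, tile):      # 720 -> 22 full rows of tiles
        for x in range(0, w - w % tile, tile):  # 1280 -> 40 full columns
            patches.append(image[y:y + tile, x:x + tile])

    def ratio(p):
        # Compressed bytes / raw bytes: higher means harder to compress.
        raw = p.tobytes()
        return len(zlib.compress(raw)) / len(raw)

    return sorted(patches, key=ratio, reverse=True)[:k]

# A 1,280x720 grayscale image of random noise; noise is nearly
# incompressible, so every tile here counts as "hard."
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(720, 1280), dtype=np.uint8)
worst = hardest_tiles(img, k=100)
print(len(worst), worst[0].shape)  # 100 (32, 32)
```

Note that 720 is not evenly divisible by 32, so the sketch simply drops the partial row of tiles at the bottom edge.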

The researchers then used TensorFlow, the machine-learning system Google open-sourced last year, to train a set of experimental neural network architectures. They trained the networks for one million steps and then collected a series of technical metrics to determine which models produced the best compression results.

In the end, their models outdid the JPEG compression standard’s performance on average. The next challenge, the researchers said, will be to beat compression methods derived from video compression codecs on large images, because “they employ tricks such as reusing patches that were already decoded.” WebP, which was derived from the VP8 video codec, is an example of such a method.

High technology: How IT is fueling the budding cannabis industry

The cannabis industry is growing up, and it would be tough to imagine more convincing proof than Microsoft’s recent announcement that it’s getting involved.

Though the software giant will stay very much in the background — its role will focus primarily on providing Azure cloud services for a compliance-focused software push — the move is still widely viewed as a telling sign.

“Having them come out and say, ‘we’re willing to have our name in the same sentence as the word cannabis,’ adds to the legitimacy of our industry,” said Kyle Sherman, cofounder and CEO of software maker Flowhub.

Stigma is a longstanding problem for those trying to run a legitimate business in the cannabis industry, thanks largely to the fact that marijuana remains illegal in the U.S. federal government’s eyes. Twenty-five states have already passed laws that allow for some degree of medical or recreational use, but that can be cold comfort for entrepreneurs unable to get a bank account because of banks’ lingering concerns.

New R extension gives data scientists quick access to IBM's Watson

Data scientists have a lot of tools at their disposal, but not all of them are equally accessible. Aiming to put IBM’s Watson AI within closer reach, analytics firm Columbus Collaboratory on Thursday released a new open-source R extension called CognizeR.

R is an open-source language that’s widely used by data scientists for statistical and analytics applications. Previously, data scientists would have had to exit R to tap Watson’s capabilities, coding the calls to Watson’s APIs in another language, such as Java or Python.
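To make the pain point concrete, here is a rough sketch of the kind of raw REST call data scientists previously had to write by hand outside of R. The endpoint URL and credentials are placeholders, not a real Watson address; Watson services of this era exposed REST APIs with HTTP Basic authentication.

```python
import base64
import json
import urllib.request

def build_watson_request(url, username, password, payload):
    """Build (but do not send) an authenticated POST to a Watson-style
    REST API.  The URL and credentials are illustrative placeholders."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_watson_request(
    "https://example.com/watson/v1/analyze",  # placeholder endpoint
    "user", "secret",
    {"text": "I love this product!"},
)
print(req.get_method(), req.get_header("Content-type"))  # POST application/json
```

CognizeR's pitch is that this sort of boilerplate, and the context switch out of R that goes with it, disappears.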

Now, CognizeR lets them tap into Watson’s so-called “cognitive” artificial-intelligence services without leaving their native development environment.

“Data scientists can now seamlessly tap into our cognitive services to unlock data that lives in unstructured forms like chats, emails, social media, images, and documents,” wrote Rob High, vice president and CTO for Watson, in a blog post.

6 'data' buzzwords you need to understand

Take one major trend spanning the business and technology worlds, add countless vendors and consultants hoping to cash in, and what do you get? A whole lot of buzzwords with unclear definitions.

In the world of big data, the surrounding hype has spawned a brand-new lingo. Need a little clarity? Read on for a glossary of sorts highlighting some of the main data types you should understand.

1. Fast data

The shining star in this constellation of terms is “fast data,” which is popping up with increasing frequency. It refers to “data whose utility is going to decline over time,” said Tony Baer, a principal analyst at Ovum who says he coined the term back in 2012.

It’s things like Twitter feeds and streaming data that need to be captured and analyzed in real time, enabling immediate decisions and responses. A capital markets trading firm may rely on it for conducting algorithmic or high-frequency trades.
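The "declining utility" idea can be made concrete with a minimal sketch: each event's value decays with age, and a consumer acts only on events still fresh enough to matter. The half-life and threshold below are arbitrary illustrations, not standard values.

```python
# A minimal sketch of "fast data": utility decays with event age, so only
# sufficiently fresh events are worth acting on.

def utility(event_time, now, half_life=2.0):
    """Utility halves every `half_life` seconds after the event occurred."""
    age = now - event_time
    return 0.5 ** (age / half_life)

def actionable(events, now, threshold=0.5):
    """Keep only events whose decayed utility still clears the threshold."""
    return [e for e in events if utility(e["t"], now) >= threshold]

now = 100.0
events = [
    {"t": 99.5, "tick": "AAPL up"},   # 0.5 s old -> utility ~0.84
    {"t": 98.0, "tick": "retweet"},   # 2.0 s old -> utility 0.5
    {"t": 90.0, "tick": "old news"},  # 10 s old  -> utility ~0.03
]
fresh = actionable(events, now)
print([e["tick"] for e in fresh])  # ['AAPL up', 'retweet']
```

A real streaming system would apply the same logic continuously over an unbounded feed rather than a fixed list, but the decay principle is the same.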

Use Apache Spark? This tool can help you tap machine learning

Finding insight in oceans of data is one of enterprises’ most pressing challenges, and increasingly AI is being brought in to help. Now, a new tool for Apache Spark aims to put machine learning within closer reach.

Announced on Friday, Sparkling Water 2.0 is a major new update from H2O.ai that’s designed to make it easier for companies using Spark to bring machine-learning algorithms into their analyses. It’s essentially an API (application programming interface) that lets Spark users tap H2O’s open-source artificial-intelligence platform instead of — or alongside — the algorithms included in Spark’s own MLlib machine-learning library.

Among the highlights of the new software is the ability to run Spark and Scala through H2O’s Flow user interface. Sparkling Water 2.0 also brings a new visualization component to MLlib, giving users the ability to see their algorithmic results in an easy-to-digest form.

The software supports the Apache Zeppelin notebook as well as Spark 2.0 and all previous versions. It offers production support for machine-learning pipelines. Model and data governance can be handled through H2O’s Steam data-science hub.

Amazon's Elastic File System is now open for business

Following an extended preview period, Amazon’s Elastic File System is now generally available in three geographical regions, with more on the way.

Originally announced last year, EFS is a fully managed elastic file storage service for deploying and scaling durable file systems in the Amazon Web Services cloud. It’s currently available in the U.S. East (northern Virginia), U.S. West (Oregon), and EU (Ireland) regions, the company announced Wednesday.

Customers can use EFS to create file systems that are accessible to multiple Amazon Elastic Compute Cloud (Amazon EC2) instances via the Network File System (NFS) protocol. They can also scale those systems up or down without needing to provision storage or throughput.
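In practice, attaching an EFS file system to an EC2 instance follows the standard NFSv4.1 mount pattern. The file-system ID and region below are placeholders; substitute your own.

```shell
# Mount an EFS file system on an EC2 instance via NFSv4.1.
# fs-12345678 and us-east-1 are placeholders.
sudo mkdir -p /mnt/efs
sudo mount -t nfs4 -o nfsvers=4.1 \
  fs-12345678.efs.us-east-1.amazonaws.com:/ /mnt/efs
```

Once mounted, the file system behaves like ordinary shared POSIX storage, and the same mount can be repeated across many instances.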

EFS is designed for a wide range of file workloads, including big data analytics, media processing, and genomics analysis, AWS said.

Cloud or on-prem? This big-data service now swings both ways

There are countless “as-a-service” offerings on the market today, and typically they live in the cloud. Back in 2014, startup BlueData blazed a different trail by launching its EPIC Enterprise big-data-as-a-service offering on-premises instead.

On Wednesday, BlueData announced that the software can now run on Amazon Web Services (AWS) and other public clouds, making it the first BDaaS platform to work both ways, the company says.

“The future of big data analytics will be neither 100 percent on-premises nor 100 percent in the cloud,” said Kumar Sreekanti, CEO of BlueData. “We’re seeing more multicloud and hybrid deployments, with data both on-prem and in the cloud. BlueData provides the only solution that can meet the realities of these mixed environments in the enterprise.”

BlueData’s EPIC (short for “Elastic Private Instant Clusters”) platform taps embedded Docker container technology to let businesses spin up virtual Hadoop or Spark clusters within minutes on their existing infrastructure, the company says, giving data scientists on-demand access to the applications, data, and infrastructure.

3 reasons Twitter just bought machine-learning startup Magic Pony

Twitter has made no secret of its interest in machine learning in recent years, and on Monday the company put its money where its mouth is once again by purchasing London startup Magic Pony Technology, which has focused on visual processing.

“Magic Pony’s technology — based on research by the team to create algorithms that can understand the features of imagery — will be used to enhance our strength in live [streaming] and video and opens up a whole lot of exciting creative possibilities for Twitter,” Twitter cofounder and CEO Jack Dorsey wrote in a blog post announcing the news.

The startup’s team includes 11 Ph.D.s with expertise across computer vision, machine learning, high-performance computing, and computational neuroscience, Dorsey said. They’ll join Twitter’s Cortex group, made up of engineers, data scientists, and machine-learning researchers.

Terms of the deal were not disclosed.