Mesosphere DC/OS brings elastic scale to Redis, Couchbase

Mesosphere DC/OS, the datacenter automation solution built atop the Apache Mesos orchestration system to provide one-click management for complex applications, has now hit its 1.9 revision.

With this release, Mesosphere is once again emphasizing DC/OS as a solution for deploying and maintaining large, complex data-centric applications. Version 1.9 adds out-of-the-box support for several major data services and a passel of improvements for DC/OS’s existing container support.

Everyone into the pool!

DC/OS manages a datacenter’s worth of Linux machines as if they were a single pooled resource maintained by high-level commands from a CLI and GUI. Apps like Apache Cassandra, Kafka, Spark, and HDFS — many of them not known for being easy to manage — can be deployed with a few command-line actions and scaled up or down on demand or automatically.
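
Concretely, standing up one of these services is a couple of CLI calls. A minimal sketch in Python, assuming the dcos CLI is installed and authenticated against a cluster (the package and app names are illustrative):

```python
import subprocess

def dcos(*args):
    """Run a DC/OS CLI command and return its stdout."""
    result = subprocess.run(["dcos", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout

# Install a data service from the package catalog in one step
# (check `dcos package search` for actual package names).
print(dcos("package", "install", "cassandra", "--yes"))

# Scale a Marathon-managed app on demand ("/my-app" is hypothetical).
print(dcos("marathon", "app", "update", "/my-app", "instances=5"))
```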

Among the new additions are two major stars of the modern open source data stack: the in-memory database and caching store Redis, and the NoSQL database Couchbase. Redis in particular has become a valuable component for big data applications as an accelerator for Apache Spark, so being able to accelerate other DC/OS apps with it is a boon.

Version 1.9 also adds support for Elastic; DataStax Enterprise, the commercial offering based on the Apache Cassandra NoSQL system; and Alluxio, a data storage acceleration layer specifically designed for big data systems like Spark.

Managing applications like these through DC/OS also improves utilization of a given cluster. Each application supported in DC/OS has its own scheduling system, so apps with complementary behaviors can be packed together more efficiently and automatically migrated between nodes as needed. DC/OS also ensures that apps that upgrade frequently (like scrappy new big data frameworks) can be rolled out across a cluster without incurring downtime.

There’s barely a data application these days that isn’t tied into machine learning in some form. Given that Mesosphere was already promoting DC/OS for data-centric apps, it only makes sense that the company is also pushing DC/OS as a management solution for machine learning apps built on its supported services. The claim has particular validity for GPU workloads, as DC/OS can manage GPUs as simply another resource to be pooled for application use.

Container conscious

Because DC/OS also orchestrates containers, it’s sometimes described as a container solution, but that’s true only in the sense that containers are one of many kinds of resources DC/OS manages.

Containers have long been criticized for being opaque. Prometheus, now a Cloud Native Computing Foundation project, was originally developed by SoundCloud for getting insight into running containers, and DC/OS 1.9 supports Prometheus along with Splunk, ELK, and Datadog as targets for the logs and metrics it collects from containers.

Version 1.9 also introduces a feature called container process injection. With it, the company says, developers can “remotely run commands in any container in the same namespace as the task being investigated.” Containers are not only opaque by nature but also ephemeral, so being able to connect to and debug them directly while they’re still running will be useful.
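
In CLI terms, this surfaces as a task-level exec. A hedged sketch (the task ID below is hypothetical; real IDs come from `dcos task`):

```python
import subprocess

# Run `ps aux` inside the namespace of a running task's container.
# "my-service.instance-abc123" is a hypothetical task ID.
subprocess.run(
    ["dcos", "task", "exec", "my-service.instance-abc123", "ps", "aux"],
    check=True,
)
```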

Source: InfoWorld Big Data

MapR unveils platform for IoT analytics at the edge

At Strata + Hadoop World in San Jose, Calif., Tuesday, MapR Technologies took the wraps off a new small-footprint edition of its Converged Data Platform geared toward capturing, processing, and analyzing data from internet of things (IoT) devices at the edge.

MapR Edge, designed to work in conjunction with the core MapR Converged Enterprise Edition, provides local processing at the edge, aggregation of insights at the core, and the ability to then push intelligence back to the edge.

“You can think of it as a mini-cluster that’s close to the source and can do analytics where the data resides, but then send data back to the core,” says Dale Kim, senior director, Industry Solutions, at MapR Technologies.

“The use cases for IoT continue to grow, and in many situations, the volume of data generated at the edge requires bandwidth levels that overwhelm the available resources,” Jason Stamper, analyst, Data Platforms & Analytics, 451 Research, added in a statement. “MapR is pushing the computation and analysis of IoT data close to the sources, allowing more efficient and faster decision-making locally, while also allowing subsets of the data to be reliably transported to a central analytics deployment.”

Many core IoT use cases, like connected vehicles and oil rigs, operate in conditions with limited connectivity, so sending massive streams of data back to a central analytics core is impractical. The idea behind MapR Edge is to capture and process most of that data at the edge, where the data is created, then send summarized data back to the core, which aggregates that summarized data from hundreds or thousands of edge IoT devices.
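
MapR hasn’t published code for this, but the underlying pattern is easy to sketch: aggregate locally, ship only compact summaries upstream. A toy Python illustration of the idea (not MapR Edge’s actual API; `send_to_core` is a hypothetical uplink call):

```python
from statistics import mean

# Toy illustration of the edge pattern: summarize raw readings locally so
# that only compact aggregates cross the constrained uplink.
def summarize(window):
    values = [r["value"] for r in window]
    return {
        "sensor": window[0]["sensor"],
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }

readings = [{"sensor": "rig-7/pressure", "value": v} for v in (101.2, 99.8, 103.5)]
summary = summarize(readings)   # a few dozen bytes instead of the full stream
# send_to_core(summary)         # hypothetical call to the central cluster
```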

MapR Technologies calls this concept “Act Locally, Learn Globally,” which means that IoT applications leverage local data from numerous sources for constructing machine learning or deep learning models with global knowledge. These models are then deployed to the edge to enable real-time decisions based on local events.

To make it work, MapR Edge integrates a globally distributed elastic data fabric that supports distributed processing and geo-distributed database applications.

MapR Edge capabilities include:

  • Distributed data aggregation. Provides high-speed local processing, useful for location-restricted or sensitive data such as personally identifiable information (PII), and consolidates IoT data from edge sites.
  • Bandwidth awareness. Adjusts throughput from the edge to the cloud and/or data center, even with environments that are only occasionally connected.
  • Global data plane. Provides global view of all distributed clusters in a single namespace, simplifying application development and deployment.
  • Converged analytics. Combines operational decision-making with real-time analysis of data at the edge.
  • Unified security. End-to-end IoT security provides authentication, authorization and access control from the edge to the central clusters. MapR Edge also delivers secure encryption on the wire for data communicated between the edge and the main data center.
  • Standards based. MapR Edge adheres to standards including POSIX and HDFS API for file access, ANSI SQL for querying, Kafka API for event streams and HBase and OJAI API for NoSQL database.
  • Enterprise-grade reliability. Delivers a reliable computing environment to tolerate multiple hardware failures that can occur in remote, isolated deployments.

MapR Edge deployments are intended to be used in conjunction with central analytics and operational clusters running on the MapR Converged Enterprise Edition. MapR Edge is available in three- to five-node configurations, is optimized for small-form-factor commodity hardware like the Intel NUC Mini PC, and can store up to 50TB per cluster.

Jack Norris, senior vice president, Data and Applications, MapR Technologies, notes that MapR Edge has all the data protection capabilities of the MapR Converged Data Platform.

“There’s redundancy built in,” he says. “High availability, self-healing, all the capabilities of the MapR technology are extended to the edge device.”

“Our customers have pioneered the use of big data and want to continuously stay ahead of the competition,” Ted Dunning, chief application architect, MapR Technologies, said in a statement Tuesday. “Working in real-time at the edge presents unique challenges and opportunities to digitally transform an organization. Our customers want to act locally, but learn globally, and MapR Edge lets them do that more efficiently, reliably, securely and with much more impact.”

This story, “MapR unveils platform for IoT analytics at the edge” was originally published by CIO.

Source: InfoWorld Big Data

Microsoft Leads In Burgeoning SaaS Market

New Q4 data from Synergy Research Group shows that the enterprise SaaS market grew 32% year on year to reach almost $13 billion in quarterly revenues, with ERP and collaboration being the highest-growth segments. For the third successive quarter, Microsoft is the clear leader in overall enterprise SaaS, having overtaken long-time market leader Salesforce. Other leading SaaS providers include SAP, Oracle, Adobe, ADP, IBM, Workday, Intuit, Cisco, and Google. Among the major SaaS vendors, those with the highest growth rates were Oracle and Google, the latter thanks to a big push for its G Suite collaborative apps.

The enterprise SaaS market is somewhat mature compared to other cloud markets like IaaS and PaaS and consequently has a lower growth rate. Nonetheless, Synergy forecasts that it will more than double in size over the next three years, with strong growth across all segments and all geographic regions.

“There are a variety of factors driving the SaaS market which will guarantee substantial growth for many years to come,” said John Dinsdale, a chief analyst and research director at Synergy Research Group. “Traditional enterprise software vendors like SAP, Oracle and IBM are all pushing to convert their huge base of on-premise software customers to a SaaS subscription relationship. Meanwhile, relatively new cloud-based vendors like Workday and Zendesk are aggressively targeting the enterprise market, and industry giants Microsoft and Google are on a charge to grow their subscriber bases, especially in the collaboration market.”

Source: CloudStrategyMag

CTP Achieves Google Cloud Partner Specialization In Application Development

Cloud Technology Partners (CTP) has announced that it has achieved Google’s Cloud Application Development Specialization. CTP is one of the first Google consulting partners to earn this specialization, highlighting the success of its Digital Innovation services, which help clients design, build and run cloud-native applications.

“Google is a leading public cloud platform for building and deploying cloud-native applications, and is often the platform of choice for our clients wanting to develop data-intensive workloads,” said Rob Lancaster, vice president of Global Alliances at Cloud Technology Partners. “Achieving the Google Cloud Application Development Specialization reaffirms to the marketplace that CTP has the vision, the skills, and a track record of customer success building and deploying solutions on Google.”

The Google Cloud Partner Specialization program is designed to provide Google Cloud customers with qualified partners that have demonstrated technical proficiency and proven success in key service areas.

CTP has worked with Google on a number of innovative client projects, including developing an IoT and data analytics application for Land O’Lakes that was featured on stage at last year’s Google Cloud Next conference. By leveraging cloud, IoT, and big data technologies, Land O’Lakes farmers are now producing 650 percent more corn on 13 percent fewer acres than they were 50 years ago.

“CTP helped us build applications that streamline data capture and knowledge transfer, all in real time,” said Teddy Bekele, vice president of IT at Land O’Lakes.

“We welcome the recognition by Google in response to the tremendous results we’ve delivered for our clients that leverage Google Cloud Platform, and we look forward to continuing to expand our Google Cloud expertise and offerings,” said John Treadway, senior vice president of Cloud Technology Partners’ Digital Innovation practice.

Source: CloudStrategyMag

Zoomdata Announces Expanded Support For Google Cloud Platform

Zoomdata has announced support for Google’s Cloud Spanner and PostgreSQL on the Google Cloud Platform (GCP), as well as enhancements to the existing Zoomdata Smart Connector for Google BigQuery. With these new capabilities, Zoomdata is one of the first visualization analytics partners to offer such deeply integrated and optimized support for Google Cloud Platform’s Cloud Spanner, PostgreSQL, Google BigQuery, and Cloud Dataproc services.

Google Cloud Spanner is the first and only relational database service that is both strongly consistent and horizontally scalable. Zoomdata’s Smart Connector for Cloud Spanner is available today for testing on Google Cloud Launcher. It supports key data analytic capabilities, including streaming analytics (Live Mode), aggregate analytics (group by), time series handling, and federated blending of data from Cloud Spanner and other data sources via Zoomdata Fusion.
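
Independent of Zoomdata’s connector, here is what such an aggregate (“group by”) query looks like against Cloud Spanner using Google’s own Python client; the instance, database, and table names are hypothetical:

```python
from google.cloud import spanner  # pip install google-cloud-spanner

# Instance, database, and table names are hypothetical.
client = spanner.Client()
database = client.instance("demo-instance").database("demo-db")

# The kind of aggregate ("group by") query a BI connector pushes down.
with database.snapshot() as snapshot:
    rows = snapshot.execute_sql(
        "SELECT region, COUNT(*) AS orders, SUM(total) AS revenue "
        "FROM Orders GROUP BY region"
    )
    for region, orders, revenue in rows:
        print(region, orders, revenue)
```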

Zoomdata has also added a Smart Connector for PostgreSQL to its Google Cloud Platform launcher. The connector is optimized to take full advantage of the powerful object-relational database system, so users can now easily connect to, visualize, and explore data from PostgreSQL running on GCP. In addition, Zoomdata has enhanced its Smart Connector for Google BigQuery to support visual drill-through to full record “details” and to speed up the generation of visualizations.

“The Zoomdata team is committed to delivering a big data visualization experience that optimizes GCP’s core data management services, including support for Google BigQuery,” said Russ Cosentino, Zoomdata co-founder and VP, Channels. “As a launch partner for Google Cloud Dataproc, and now offering optimized support for Google Cloud Spanner and PostgreSQL on GCP, Zoomdata is an ideal choice for helping business users deliver value against their data workloads on Google.”

Zoomdata is an open platform that provides visual analytics for big and fast data. Architected for both cloud and on-premises deployments, it delivers visual analysis of huge datasets in seconds. Zoomdata’s patented Data Sharpening™ technology delivers the industry’s fastest visual analytics for real-time streaming and historical data; its microservices architecture makes this possible by using Apache Spark as a complementary high-performance engine. Zoomdata Fusion enables users to perform analytics across disparate data sources in a single view, without the need to move or transform data.

Source: CloudStrategyMag

Dataguise DgSecure Is Now Integrated With Google Cloud Storage

Dataguise has announced that DgSecure Detect now supports sensitive data detection on Google Cloud Storage (GCS). Integration with GCS extends the range of platforms supported by DgSecure Detect, which helps data-driven enterprises move to the cloud with confidence by providing precise sensitive data detection across the enterprise, both on premises and in the cloud. With DgSecure Detect, organizations can leverage Google’s powerful, simple, and cost-effective object storage service with a complete understanding of where sensitive data is located — an important first step to ensuring data protection and privacy compliance.

DgSecure Detect discovers, counts, and reports on sensitive data assets in real time within the unified object-based storage of GCS. The highly scalable, resilient, and customizable solution precisely identifies and summarizes the location of this data, down to the element level. DgSecure allows organizations to comb through structured, semi-structured, or unstructured content to find any data deemed “sensitive” by the organization. The range of sensitive data that is discoverable by DgSecure Detect is nearly unlimited using the solution’s custom sensitive data type definition capabilities.
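
Dataguise’s detection engine is proprietary, but the element-level idea can be illustrated with a toy pattern-based detector; the patterns and type names below are simplified stand-ins, not DgSecure’s actual rules:

```python
import re

# Toy illustration of element-level sensitive-data detection. Dataguise's
# actual engine (neural-like networks, weighted keywords) is proprietary.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def detect(text):
    """Return counts of each sensitive-data type found in free-form text."""
    return {name: len(p.findall(text)) for name, p in PATTERNS.items()}

sample = "Contact jane@example.com, SSN 123-45-6789."
print(detect(sample))  # {'ssn': 1, 'email': 1, 'credit_card': 0}
```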

Sensitive Data Detection Capabilities for Google Cloud Storage:

  • Detects high volumes of disparate, constantly moving, and changing data with time-stamping to support incremental change and life cycle management
  • Supports a flexible information governance model that has a mix of highly invested (curated) data as well as raw, unexplored (gray) data, such as IoT (Internet of Things) data, clickstreams, feeds, and logs
  • Processes structured, semi-structured, and unstructured or free-form data formats
  • Provides automated detection and processing of a variety of file formats and file/directory structures, leveraging meta-data and schema-on-read where applicable
  • Provides deep content inspection using patent-pending techniques, such as neural-like network (NLN) technology, and dictionary-based and weighted keyword matches to detect sensitive data more accurately

These new capabilities enable enterprises from a range of industries, including finance, insurance, healthcare, government, technology, and retail, to gain accurate insight into where sensitive data resides in GCS so it can be protected properly. DgSecure helps organizations comply with regulatory mandates for PII, PHI, and PCI data, such as the European Union’s General Data Protection Regulation (GDPR), the Health Insurance Portability and Accountability Act (HIPAA), and other data privacy and data residency laws.

“With support for GCS, Dataguise provides broad cross-platform support of sensitive data detection within the industry’s most popular data repositories and platforms, both on premises and in the cloud,” said JT Sison, VP, marketing and business development, Dataguise. “Demonstration of DgSecure Detect at Google Cloud Next will be the first public display of the technology, and we invite attendees to meet with Dataguise and Google regarding this innovative solution.”

Source: CloudStrategyMag

SBS Group Selected As An Indirect Cloud Solution Provider By Microsoft

Microsoft has named SBS Group an Indirect Cloud Solution Provider (ICSP). Formerly called a Tier-2 Distribution Partner, an ICSP provides the connection between Microsoft and resellers of Microsoft’s cloud solutions, including Azure, Office 365, Power BI, and the recently launched Dynamics 365 service.

The ICSP program is built to help ease the complexity of selling Microsoft solutions. Technology resellers can partner with an ICSP for support with sales, service, administration, and billing. The Stratos Cloud Alliance, SBS Group’s new ICSP program, is the only ICSP specializing in Dynamics 365, Microsoft’s business-solutions-focused service. SBS Group has vast experience in the Dynamics landscape, having operated in the Microsoft ERP and CRM spaces for over 30 years. The Stratos Cloud Alliance will leverage that knowledge and experience to provide superior Dynamics 365 implementation, training, and support services for technology partners to resell. Additionally, the Stratos Cloud Alliance will offer unique partner enablement services, giving partners the option to develop, market, and deliver their own Dynamics 365 solutions and services.

“We serve several communities including customers, partners, independent software vendors, and Microsoft,” said James Bowman, president and CEO of SBS Group. “It is our mission to deliver innovative solutions that serve the evolving needs of these communities. Seven years ago, we pioneered the Master VAR program, enabling other Dynamics partners to grow their businesses. Last year, we led the Microsoft Dynamics community into the ‘cloud’ when we launched the first online Cloud Solution Provider (CSP) Marketplace focused on Dynamics solutions. We are leveraging these experiences in launching the Stratos Cloud Alliance. This program will help ERP and CRM-focused partners in their digital transformation process and enable Managed Service Providers (MSPs) and IT-focused solution providers to expand their solution portfolios for their customers.”

The Stratos Cloud Alliance (SCA) features a comprehensive portfolio of Microsoft Cloud Business and Productivity Solutions, ISV Products and Tools, and Partner and Customer Services. The SCA offers three flexible partner models (including a white-label option) with value-added features and benefits for ERP and CRM resellers, managed service providers, and accounting and consulting firms. All partner tiers are powered by best-in-class e-commerce capabilities and include dedicated partner teams and support services designed to simplify onboarding and streamline the partner experience.

Source: CloudStrategyMag

Google's new cloud service eases data preparation for machine learning

One of the challenges that data scientists face when running machine learning workloads is processing information before it’s ready for use. Google unveiled a new cloud service Thursday aimed at easing that pain.

Google Cloud Dataprep will automatically detect data schemas, joins, and anomalies like missing or duplicate values, without requiring coding. After that, it will help users build a set of rules for processing the information. Those rules are then built in Apache Beam format and can be run on products like Google’s Cloud Dataflow to process information as it’s loaded into services like the BigQuery data warehouse.
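
For a sense of what that means, here is roughly what a simple cleaning rule looks like when expressed as a Beam pipeline feeding BigQuery; this is an illustrative sketch, not Dataprep’s literal output, and the bucket, project, and table names are hypothetical:

```python
import apache_beam as beam  # pip install apache-beam[gcp]

# Drop records with missing or extra fields, the kind of rule Dataprep
# builds for you automatically.
def is_complete(record):
    fields = record.split(",")
    return len(fields) == 3 and all(fields)

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/raw.csv")
        | "Clean" >> beam.Filter(is_complete)
        | "Parse" >> beam.Map(
            lambda r: dict(zip(["id", "name", "amount"], r.split(",")))
        )
        | "Load" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.clean_table",
            schema="id:STRING,name:STRING,amount:STRING",
        )
    )
```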

While Cloud Dataprep is built to prepare data for machine learning, the system also uses machine learning itself to try to determine which rules will be most useful for customers. As of Thursday, it’s available in private beta.

BigQuery is receiving a number of enhancements as well, including a new Commercial Datasets program that’s now available in public beta. It will let users take information from AccuWeather, Dow Jones, Xignite, HouseCanary, and Remine and directly feed it into BigQuery for further processing.

BigQuery can now also query data stored in Cloud Bigtable, Google’s managed NoSQL database offering for low-latency data. That means users can write one SQL query that taps into information from both Bigtable and BigQuery. In the past, they’d have to write a program to search Bigtable.
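
Here is what that single query might look like through the google-cloud-bigquery Python client, assuming `bigtable_events` has already been defined as an external table backed by a Bigtable instance (all names are hypothetical):

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

# Table names are hypothetical; `bigtable_events` is assumed to be an
# external (federated) table defined over a Cloud Bigtable instance.
client = bigquery.Client()
query = """
    SELECT u.user_id, COUNT(e.event) AS events
    FROM my_dataset.users AS u
    JOIN my_dataset.bigtable_events AS e ON u.user_id = e.user_id
    GROUP BY u.user_id
"""
for row in client.query(query).result():
    print(row.user_id, row.events)
```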

Advertising customers will be able to send data from Google AdWords, DoubleClick Campaign Manager, DoubleClick for Publishers, and YouTube to BigQuery for further use in analytics and other big data applications. That feature may help encourage the company’s fleet of advertising customers to try Google’s cloud as it faces down Amazon and Microsoft.

Speaking of database news, the company announced that its Cloud SQL managed database offering now offers beta support for PostgreSQL in addition to MySQL.

All of the news was announced as part of Google Cloud Next, the company’s user conference for businesses and enterprises taking place in San Francisco. The announcements come alongside other news about the company’s cloud platform, including changes to pricing and support for custom runtimes in App Engine.

Source: InfoWorld Big Data

Uber should use data science to fix its culture

Ever since a former employee spoke out about her miserable experience with her boss and HR, the media has piled on Uber. We’ve heard in the past that Uber uses data to analyze its ridership down to what seems like a creepy level. We’ve also heard that it has a toxic and misogynistic culture.

Ironically, some of the same data analysis Uber does on its riders could help it fix its culture.

On Monday, I spoke to Dr. Carissa Romero from Paradigm, a strategy firm that helps companies analyze themselves to improve inclusion and diversity based on the idea that diverse companies outperform others. Romero has a doctorate in psychology and is an expert in fixed and growth mindsets—people’s beliefs about the nature of talents and abilities—and founded Stanford’s applied research center on the subject.

I asked Dr. Romero about the techniques and tools companies can use to find problems and what kinds of interventions are effective. She began by making a distinction between the two fundamental types of bias.

Implicit versus explicit bias

The cases of Susan Fowler and “Amy Vertino” at Uber were ones of explicit bias. Some of it even made it into written form. Finding explicit bias or harassment can be done with a simple text search.

Most workplace problems in this area, however, involve implicit bias. It can be equally damaging, and the person making the mistake may not even know they’re doing it. For example, if I’m hiring a software developer and I have in my mind what that developer “is like,” I may inadvertently make judgments linked to race, gender, or culture that aren’t related to details actually important to the job.

This is also not something you find with a simple text search, because implicitly biased comments won’t say “sex” or use a racial epithet. Also, many people who make these biased mistakes are not bad people and don’t have bad intent, but they need to make decisions differently and be better informed by data.

Uber’s explicit problems are part of a self-admitted failure of leadership. You don’t need fancy data analysis to see that. Yet even if the company addresses that issue, it’ll still have a lot of work to do on internal culture and practices if it wants a more diverse workplace.

Where is the data?

Much of the data a company needs to determine whether it’s treating all of its employees fairly resides in the systems it’s already using. This starts even before hiring. According to Dr. Romero, “On the recruiting and hiring side, we pull data from a company’s applicant tracking system.

“For example, a common applicant tracking system is Greenhouse. We pull data from Greenhouse to learn about things like the diversity of different applicant sources and pass-through rates at each stage of the hiring process.”

Companies also need to look at employees throughout their “lifecycle” at the company. Some of this information lives in their human resources information system or performance review system.

This isn’t necessarily enough. Paradigm also relies on engagement surveys and focus groups to better understand differences in how engaged employees feel and whether they think their voices are being heard. This qualitative data helps make the quantitative data more understandable.

How do you determine bias?

Bias can exist at different stages of employment, from how applicants are attracted to apply for a job to hiring, evaluation, promotion, and retention, as well as terminations. Different metrics apply to each of these stages.

According to Dr. Romero, in the recruiting phase, it pays to take a hard look at candidate sources. Often, employee referrals result in less diversity. When it comes to hiring, companies should look at the different pass-through rates: If black candidates pass through phone screening at a lower rate than white candidates, that’s an example of quantitative data the company can use to detect bias.
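
Checking whether such a gap is more than noise is a standard significance test on the two pass-through rates. A minimal sketch with invented numbers:

```python
from scipy.stats import fisher_exact  # pip install scipy

# Illustrative numbers: phone-screen outcomes by group.
#                   passed  rejected
black_candidates = [18, 82]   # 18% pass-through
white_candidates = [35, 65]   # 35% pass-through

_, p_value = fisher_exact([black_candidates, white_candidates])
print(f"p = {p_value:.4f}")  # a small p suggests the gap isn't chance
```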

Once an employee is hired, performance review scores and promotion rates become key sources. Next, when examining a company’s employee retention rates, look at terminations and longevity. If the data is stratified by demographic group (race, gender, and so on) and there are large disparities, that may be an indication of bias.

Other, more subtle data can also be analyzed. When looking at performance reviews, are “soft skills” mentioned more often for women or people of color compared to men? According to Dr. Romero, “Our data scientist uses a machine learning algorithm to look at whether different language is used to describe candidates from different demographic groups, but we also very often do it manually where we pull a random sample of written feedback to manually code. Then we use statistical tools to analyze the differences.” In other words, they plug the data into R and use their algorithms to crunch data on employees in the same kinds of ways companies are using it to understand their customers.
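
Paradigm’s exact model isn’t public, but a common version of this analysis fits a classifier to predict the reviewee’s demographic group from review text and then inspects the most predictive terms. A toy scikit-learn sketch with invented reviews:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented sample data: (review text, group label). Real analyses
# need far more data than this.
reviews = [
    ("great technical depth, strong architecture work", "men"),
    ("shipped the migration ahead of schedule", "men"),
    ("very helpful and supportive, a team player", "women"),
    ("friendly, keeps everyone organized", "women"),
]
texts, groups = zip(*reviews)

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
clf = LogisticRegression().fit(X, groups)

# Terms whose weights most strongly predict each group hint at
# systematically different language in reviews.
weights = sorted(zip(clf.coef_[0], vec.get_feature_names_out()))
print("most 'men'-coded terms:", [t for _, t in weights[:3]])
print("most 'women'-coded terms:", [t for _, t in weights[-3:]])
```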

Dr. Romero also focuses on qualitative data: the “why.” This emerges from interviewing people. Some questions she pointed out: Are recruiters reaching out only on LinkedIn? What are managers looking for in a candidate?

Unfortunately, it can be hard to pinpoint specific biased individuals within a company using statistical analysis. If a manager has only a few reports and hasn’t had to interview many people, the sample size will be too low. Instead, Romero advises companies to focus on establishing practices that prevent bias in the first place.

How do you fix it?

According to Dr. Romero, “When you’re evaluating your employees, if you have a standard set of questions that you use to evaluate people in that role, then you’re going to make it less likely that bias influences decisions.” In contrast, “Not having a process would make it more likely that you would have more of these individual cases that people are relying on stereotypes compared to when you have processes in place.”

Process is great, but I’ve worked in organizations that only went through the motions. These organizations subscribe to the mythical-man-month, cargo-cult school of process. According to Romero, such sloppiness can be avoided by creating up-front descriptions of what you’re seeking in each position and clearly establishing metrics for performance. When it comes to performance reviews, force the manager to give an example of why the rating is deserved. According to Dr. Romero:

When you know what you’re evaluating up front and use examples to support your evaluation, biases are less likely to come into play. Evaluators should decide ahead of time what to look for, and organize feedback by relevant attributes. When you’re not clear about what you’re looking for, you’re more likely to rely on an overall feeling. That feeling can be influenced by bias. For example, you may be influenced by how much you like that person personally (vs. how good of a fit for the role they are). You might just like them because they are similar to you in some irrelevant way (maybe you have the same hobbies). Or you might be influenced by a stereotype – for example, what does a typical person look like in this role?

Some issues are more subtle and involve company culture. “Women often feel it’s hard to get heard in a meeting because they’re often interrupted. You might have a moderator for every team meeting or put a sign in the room. Make sure individuals are aware, agendas are distributed ahead of time, ask people for their thoughts,” said Dr. Romero.

According to Dr. Romero, what isn’t effective is “diversity training” to raise awareness, nor is copying other companies’ strategies. “Coming up with strategies before you’ve taken a look at your company’s data, and analyzed your process and your culture, is a bad approach. I also think ignoring behavioral science research is a bad approach. So basically, a non-data-driven approach is bad (ignoring your own data and ignoring what behavioral science research tells us).”

Why the need is real

I asked Dr. Romero if everyone needs this stuff, even small companies and startups. “In general, yes,” she replied. “Companies use data when making business decisions; it makes sense to use data when making people-related decisions. A data science approach to understanding people in your organization is helpful.”

This is the crux of the matter: Well-managed companies use data to make decisions. Well-managed companies have processes for making repeated decisions. It only makes sense to have good processes and data for making decisions about people. Good processes and data also happen to help create far more diverse environments.

Obviously, you want to do this because it’s the right thing to do. But as Dr. Romero says, “If you want to get your best work out of employees, you want to create an environment where people from any background can be successful.” Ask McKinsey: Diverse organizations perform better.

Source: InfoWorld Big Data

3 Kaggle alternatives for collaborative data science

What’s the best way to get a good answer to a tough question? Ask a bunch of people, and make a competition out of it. That’s long been Kaggle’s approach to data science: Turn tough missions, like making lung cancer detection more accurate, into bounty-paying competitions, where the best teams and the best algorithms win.

Now Kaggle is rolling into Google, and while all signs point to it being kept as-is for now, there will be jitters about the long-term prospects for a site with such a devoted community and an idiosyncratic approach.

Here are three other sites that share a similar mission, even if they haven’t explicitly followed in Kaggle’s footsteps. (Note that some sites, like CrowdAnalytix, may consider accepted contest solutions to be works for hire and thus their property.)

CrowdAI

A product of the École Polytechnique Fédérale de Lausanne in Switzerland, CrowdAI is an open source platform for hosting open data challenges and gaining insight into how the problems in question were solved. The platform is quite new, with only six challenges offered so far, but the tutorials derived from those challenges are detailed and valuable, providing step-by-step methodologies to reproduce that work or create something similar. The existing exercises cover common frameworks like Torch or TensorFlow, so it’s a good place to acquire hands-on details for using them.

DrivenData

DrivenData, created by a consultancy that deals in professional data problems, hosts online challenges lasting a few months. Each is focused specifically on pressing problems facing the world at large, like predicting the spread of diseases or mining Yelp data to improve restaurant inspection processes. Like Kaggle, DrivenData also has a data science jobs listing board — a feature people are worried might go missing from Kaggle post-acquisition.

CrowdAnalytix

Backed by investors from Accel Partners and SAIF Partners, CrowdAnalytix focuses on hosting data-driven problem-solving competitions rather than sharing the information that results from them. Contests are offered for finding solutions to problems in categories like modeling, visualization, and research, and each has bounties in the thousands of dollars. Previous challenges include predicting the real costs of workers’ compensation claims and predicting airline delays. Other contests, though, aren’t hosted for money but as a competitive way to learn a related discipline, such as the R language.

Source: InfoWorld Big Data