Artificial Intelligence

Machine learning, crowdsourced data, and the birth of Gengo.ai

Wednesday, October 31, 2018


How crowdsourced data can help you construct great data sets to drive your machine learning and AI platforms: A chat with Charly Walther from Gengo.

The glue that holds machine learning and artificial intelligence together is data. Without data from which to build complex learning algorithms and create lifelike AI experiences, neither is worth a brass farthing.

Charly Walther, the VP of product and growth at Gengo.ai, knows this space well: he joined Gengo from Uber, where he was a product manager in the Advanced Technologies Group. He says that good AI should learn to produce data all by itself, and that building proprietary data sets through unique methods such as crowdsourcing can help pave the way to a successful platform.

We recently took a few moments with Charly to get his insight on all things machine learning, on how Gengo.ai was able to produce a high-quality multilingual data structure with it, and on how developers can lean on the platform for their own AI projects.

ADM: Can you explain what machine-learning training data is?

Walther: It’s a very exciting time for startups entering the field of machine learning and AI. While the opportunities are tremendous, the hype surrounding machine learning distracts investors and founders from a key hurdle: in many cases it’s data, not algorithms, that will determine whether the final product will deliver on the promises made.

Put simply, AI training data refers to input data for a machine-learning model alongside the desired output, which the AI should eventually learn to produce by itself. For example, a training data set can contain images of cats and dogs labeled correctly as cats and dogs respectively, tweets marked as either positive or negative for a sentiment analysis algorithm, or audio recordings alongside transcripts for a transcription machine-learning engine.
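As a minimal sketch of that idea, assuming a toy sentiment task, training data can be written down as pairs of input and desired output. The `LabeledExample` class and the sample tweets are invented for illustration and do not come from Gengo.ai.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledExample:
    """One training data point: an input plus the output the model should learn."""
    text: str   # input shown to the model
    label: str  # desired output, supplied by a human annotator


# Hypothetical sentiment data set; the tweets are made up for this sketch.
training_data = [
    LabeledExample("I love this phone", "positive"),
    LabeledExample("Worst customer service ever", "negative"),
    LabeledExample("The update made everything faster", "positive"),
]

# Split into model inputs and desired outputs, the usual shape for training.
inputs = [example.text for example in training_data]
targets = [example.label for example in training_data]

# Count examples per label as a basic sanity check on the data set.
label_counts = Counter(targets)
print(label_counts)
```

The same shape applies to the other examples mentioned: the input could be an image or an audio recording, and the label a category or a transcript.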

Without rigorously accurate and categorized data, machine-learning algorithms simply cannot attain an advanced level of operation. Algorithms find relationships, develop understanding, make decisions, and evaluate their confidence based on the training data they’re given. And the better the training data is, the better the model performs.

So, if businesses want to succeed with their machine-learning projects, their top priority should be building proprietary data sets. The question is, what’s the most cost-effective way to do that?

ADM: How does Gengo fit into the crowdsourced data landscape?

Walther: Gengo started life as a crowd-powered translation service with a focus on quality and precision. After several years, we had built a multilingual crowd of 21K+ skilled linguists and were providing a range of language services to leading Fortune 500 companies like Amazon and Facebook. Often, we helped them with projects to improve their AI investments, which gave us an idea: what if we optimized our offering for the needs of developers working on machine-learning projects that require specialized language datasets? This led to the formation of Gengo.ai: a platform for any business or enterprise around the world that needs access to high-quality multilingual data to succeed with their AI strategy.

ADM: How does your approach differ from other crowdsourcing approaches?

Walther: With Gengo.ai, developers have access to a very large crowd platform that is 100-percent focused on language tasks — those that involve natural language, speech, communication, and multilingual projects. Many developers are probably already familiar with Amazon Mechanical Turk: an early player in this field, and a great resource, especially for smaller tasks. It’s also cost-competitive, because it draws on an almost infinite pool of cheap labor.

But it also has its drawbacks. With Mechanical Turk, you’re less likely to find a concentration of specific experts in your crowd, which can be a factor for developers with specialized needs. Also, Amazon offers no hand-holding to ensure the quality of the data, which means the burden of quality control falls on you, the client. This means developers will often need to iterate through several submissions to the crowd to arrive at the dataset they need. This can impact overall project duration and costs.
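To make the quality-control burden concrete, here is a rough sketch of one common approach a client might take: collect several answers per item and keep only the labels most workers agree on. Majority voting, the `aggregate_by_majority` helper, the 0.6 threshold, and the sample votes are all assumptions for illustration; the interview does not describe how any particular platform handles this.

```python
from collections import Counter


def aggregate_by_majority(votes, min_agreement=0.6):
    """Return (label, agreement) for one item, or (None, agreement) when
    workers disagree too much. Assumes `votes` holds at least one label."""
    counts = Counter(votes)
    top_label, top_count = counts.most_common(1)[0]
    agreement = top_count / len(votes)
    if agreement < min_agreement:
        return None, agreement
    return top_label, agreement


# Hypothetical crowd answers: item id -> labels from different workers.
crowd_votes = {
    "tweet-1": ["positive", "positive", "negative"],
    "tweet-2": ["negative", "negative", "negative"],
    "tweet-3": ["positive", "negative", "neutral"],
}

accepted = {}      # items with a sufficiently agreed label
needs_review = []  # items to resubmit to the crowd or check by hand
for item_id, votes in crowd_votes.items():
    label, agreement = aggregate_by_majority(votes)
    if label is None:
        needs_review.append(item_id)
    else:
        accepted[item_id] = label

print(accepted)
print(needs_review)
```

Items that land in `needs_review` are the ones that would drive the extra iterations mentioned above.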

Other crowdsourcing solutions such as Upwork are great for finding 1-2 people with whom you can interact closely. However, these don’t scale well because they aren’t built with platform technology such as job distribution systems and quality management systems to manage tasks effectively across 10, 100, or 1,000+ people.

ADM: What are the benefits of using crowdsourcing to provision language training data?

Walther: Like most outsourcing decisions, it comes down to defining your core competence. Our clients tell us that they know that they don’t have the in-house expertise to manage the collection and curation of a large language data set, that the opportunity cost of building a platform for this with their own engineering resources is too high, and therefore the ROI doesn’t pencil out. By outsourcing, the development team also benefits from the cost and time efficiencies of a service that’s specifically designed to manage the process of defining, submitting and gathering language-based data. Ultimately, it all boils down to an important outcome: faster time to market of a better-trained AI product.

ADM: How should a developer structure a request for language training data?

Walther: Here are some general guidelines to maximize your project success.

Project scope is critical. Be clear on whether you already have data and need it labelled or curated, whether you don't have data and want it collected, or both.
Determine whether your data tasks require the workers to work in a specific tool (maybe even proprietary to your company), or whether you will leave it up to the data provider to figure out the working environment.
Figure out what specific instructions the workers will need. For example, what to do when a data point doesn't apply: should they skip it, or mark it as N/A, or what? You probably want to give your own instructions, then have the data provider experts add additional instructions since they have seen more edge cases and might be able to anticipate certain problems.
Align with your provider on timeline requirements. There can be a certain ramp-up phase to identify the right worker cohort, and sometimes the data provider might even need to onboard additional workers. Be careful with urgent requests and define the timeline early on.
Pricing is of course a consideration. At the outset, specify whether (a) you have a certain number of data points for which you need pricing, or (b) you have a budget and want to know how many data points can be collected/curated within that budget.
Calibrate early on. Where large volumes of data are being managed, you should start with a trial run to align on quality, speed, detail of instructions, etc. Once you’re happy with the results, only then should you hit GO on the entire data set.
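The checklist above could be captured in a small request spec, with the two pricing questions and the trial run reduced to simple arithmetic. This is only a sketch: the `DataRequest` fields, the per-point price in cents, and the 5 percent pilot size are hypothetical and do not reflect any real Gengo.ai API or price list.

```python
import math
from dataclasses import dataclass


@dataclass
class DataRequest:
    """Hypothetical request spec mirroring the guidelines above."""
    needs_collection: bool      # scope: must new data be collected?
    needs_labeling: bool        # scope: must existing data be labelled/curated?
    tool: str                   # working environment the workers will use
    not_applicable_policy: str  # e.g. "skip" or "mark N/A"
    deadline_days: int          # timeline agreed with the provider


def cost_for_points(num_points, price_cents):
    """Pricing option (a): a known number of data points."""
    return num_points * price_cents


def points_for_budget(budget_cents, price_cents):
    """Pricing option (b): a fixed budget, as a whole number of data points."""
    return budget_cents // price_cents


def pilot_size(total_points, percent=5, minimum=50):
    """Size of the trial run used to calibrate before the full data set."""
    target = math.ceil(total_points * percent / 100)
    return min(total_points, max(minimum, target))


request = DataRequest(
    needs_collection=False,
    needs_labeling=True,
    tool="provider's choice",
    not_applicable_policy="mark N/A",
    deadline_days=30,
)

# Illustrative numbers only: 8 cents per data point.
print(cost_for_points(10_000, 8))      # total in cents for 10,000 points
print(points_for_budget(100_000, 8))   # points that fit a 100,000-cent budget
print(pilot_size(10_000))              # trial batch before hitting GO
```

Prices are kept in integer cents so that both directions of the calculation stay exact.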

ADM: Tell me about some interesting AI training projects that you’ve worked on recently.

Walther: Here are three examples that give you an idea of the enormous breadth of projects we undertake:

A Japanese auto company needed to train its navigation system to understand non-native Japanese speakers. For this, we sourced exactly the right crowd to provide the data.
A highly curated data set for a system designed to provide machine speech in situations where Chinese intonation is critical. The objective was to make sure the machine's pronunciation is both accurate and natural.
Sentiment analysis in Arabic for a client who wanted to develop a machine capable of understanding political discourse.