Criteo Open Sources One Terabyte Machine Learning Dataset

Posted on Tuesday, June 23, 2015 by STUART PARKERSON, Global Sales

Criteo is releasing to the open source community an anonymized machine learning dataset of more than four billion lines, totaling over one terabyte in size, built from Criteo’s advertising click prediction data. The dataset is hosted on Microsoft Azure, and details on how to access, use, and download it can be found on the Criteo Labs website.

The goal of releasing the dataset is to support academic research and innovation in distributed machine learning algorithms. Anonymized datasets pulled from real-world applications can help academic researchers test, refine, and advance machine learning platforms.

Criteo relies on its own proprietary distributed learning algorithms to predict when a consumer is most likely to click on a particular ad, with the goal of increasing the return on an advertiser’s investment in ad delivery. Criteo sees over 30 billion HTTP requests per day (peaking at as many as two million requests per second), delivers three billion unique banner advertisements per day, and stores 20 terabytes of new data daily, with a capacity for 37 petabytes of raw storage.

The released dataset has already been put to use as a benchmark by researchers at Carnegie Mellon University. “Criteo's one terabyte dataset has proven invaluable for benchmarking the scalability of the learning algorithms for high throughput click-through-rate estimation, which we are developing as part of our Marianas Labs project,” said Alexander Smola, Professor at Carnegie Mellon University.
