Community Data License Agreement announced by Linux Foundation

Wednesday, October 25, 2017

The Linux Foundation has announced the Community Data License Agreement to help open up the world's data.

The Linux Foundation has announced the Community Data License Agreement (CDLA) family of open data agreements. In an era of expansive and often underused data, the CDLA licenses are an effort to define a licensing framework to support collaborative communities built around curating and sharing “open” data.

Inspired by the collaborative software development models of open source software, the CDLA licenses are designed to enable individuals and organizations of all types to share data as easily as they currently share open source software code. Soundly drafted licensing models can help people form communities to assemble, curate and maintain vast amounts of data, measured in petabytes and exabytes, to bring new value to communities of all types, to build new business opportunities and to power new applications that promise to enhance safety and services.

The growth of big data analytics, machine learning and artificial intelligence (AI) technologies has allowed people to extract unprecedented levels of insight from data. Now the challenge is to assemble the critical mass of data for those tools to analyze. The CDLA licenses are designed to help governments, academic institutions, businesses and other organizations open up and share data, with the goal of creating communities that curate and share data openly.

For instance, if automakers, suppliers and civil infrastructure services can share data, they may be able to improve safety, decrease energy consumption and improve predictive maintenance. Self-driving cars are heavily dependent on AI systems for navigation, and need massive volumes of data to function properly. Once on the road, they can generate nearly a gigabyte of data every second. For the average car, that means two petabytes of sensor, audio, video and other data each year.
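The two figures above can be reconciled with a quick back-of-the-envelope calculation. The sketch below is illustrative only: it assumes decimal units (1 petabyte = 1,000,000 gigabytes) and a flat rate of 1 gigabyte per second, where the article says "nearly" a gigabyte, and the function name is hypothetical. Under those assumptions, two petabytes a year corresponds to roughly an hour and a half of driving per day.

```python
# Back-of-the-envelope check of the article's figures.
# Assumptions (not stated in the article): decimal units and a constant
# data rate of exactly 1 GB/s while the car is operating.
GB_PER_PB = 1_000_000
SECONDS_PER_HOUR = 3600
DAYS_PER_YEAR = 365


def implied_driving_hours(petabytes_per_year: float, gb_per_second: float) -> float:
    """Return the hours of operation per year needed to produce the given volume."""
    seconds = petabytes_per_year * GB_PER_PB / gb_per_second
    return seconds / SECONDS_PER_HOUR


hours_per_year = implied_driving_hours(2.0, 1.0)
hours_per_day = hours_per_year / DAYS_PER_YEAR
print(f"{hours_per_year:.0f} hours per year, about {hours_per_day:.1f} hours per day")
# prints "556 hours per year, about 1.5 hours per day"
```

For comparison, a car generating data around the clock at the same rate would produce about 31.5 petabytes a year, so the two-petabyte figure only makes sense as an estimate for typical daily use.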

Similarly, climate modeling can integrate measurements captured by government agencies with simulation data from other organizations and then use machine learning systems to look for patterns in the information. It’s estimated that a single model can yield a petabyte of data, a volume that challenges standard computer algorithms, but is useful for machine learning systems. This knowledge may help improve agriculture or aid in studying extreme weather patterns.

And if governmental agencies share aggregated data on building permits, school enrollment figures, and sewer and water usage, their citizens benefit: commercial entities can anticipate future needs and respond with infrastructure and facilities that arrive ahead of citizens’ demands.

“An open data license is essential for the frictionless sharing of the data that powers both critical technologies and societal benefits,” said Jim Zemlin, Executive Director of The Linux Foundation. “The success of open source software provides a powerful example of what can be accomplished when people come together around a resource and advance it for the common good. The CDLA licenses are a key step in that direction and will encourage the continued growth of applications and infrastructure.”

CDLA Licenses Promote Sharing While Reducing Risk

The Linux Foundation, in collaboration with a broad set of participating organizations, drafted the CDLA licenses to meet the needs of companies, organizations and communities that have valuable data assets such as these to share. The licenses are intended to let contributors and consumers of open datasets use and contribute data in a uniform fashion, while clarifying the terms of that sharing and reducing risk.

There are two CDLA licenses: a Sharing license that encourages contributions of data back to the data community, and a Permissive license that puts no additional sharing requirements on recipients or contributors of open data. Both encourage and facilitate the productive use of data.

Community Data License Agreement implications include:

Data producers can share data with greater clarity about what recipients may do with it. Data producers can also choose between the Sharing and Permissive licenses and select the model that best aligns with their interests. In either case, data producers should enjoy the clarity of recognized terms and disclaimers of liabilities and warranties.

Data communities can standardize on a license or set of licenses that lets them share data on known, equal terms that balance the needs of data producers and data users. Data communities have a high degree of flexibility to add their own governance and requirements for curating data as a community, particularly around areas such as personally identifiable information.

Data users who are looking for datasets to help kick off training an AI system or for any other use will have the ability to find data shared under a known license model with terms that clearly state their rights and responsibilities.

The CDLA is data privacy agnostic and relies on the publisher and curators of the data to create their own governance structure around what data they curate and how. Each producer or curator of data will have to work through various jurisdictional requirements and legal issues.