
Data Subsidization for AI Training and Global Data Protection Inequalities




By Noa Mor (The Hebrew University of Jerusalem) and Eran Toch (Tel Aviv University)


Sora, OpenAI’s highly anticipated video generation model, launched in over 150 countries in December 2024. The festivities, however, did not extend to most European countries, where the product was not supported. Although the company did not explain the unavailability in Europe, it appears related to the EU’s regulation of digital technologies. This regulatory framework includes, among other instruments, the AI Act, the Digital Markets Act, and the General Data Protection Regulation (GDPR), which imposes limitations on personal data processing and is enforced by strong Data Protection Authorities (DPAs). Sam Altman, OpenAI’s CEO, stated: “We're going to try our hardest to be able to launch [Sora] there, but we don't have any timeline to share yet,” adding that this process will take “a while.” This is not the first time an OpenAI product launch has been delayed in the EU due to regulatory considerations. The same happened earlier in 2024 with ChatGPT’s advanced voice feature, which became available in the EU a few months after its global release, likely following the DPAs’ review.

 

OpenAI is not alone. In June 2024, Meta announced that, following the European DPAs’ request to delay the training of its LLMs “using public content shared by adults on Facebook and Instagram,” it would not launch Meta AI—the company’s advanced chatbot—in Europe “at the moment.” Meta explained that “without including local information we’d only be able to offer people a second-rate experience.”

 

Apple’s AI features, as well as Google’s Bard and Gemini, are additional examples of AI products whose EU launches were suspended due to regulatory issues. Regarding Bard’s delay, the European DPAs explained that they “had not had any detailed briefing nor sight of a data protection impact assessment or any supporting documentation at this point,” and were awaiting this information and responses to questions they had raised. Both Google products have since become available in Europe.

 

The Deepening Global Inequalities in Data Protection

 

While European privacy protection is not flawless, it enjoys influence unmatched by many other privacy regimes, even those with modern data protection laws (often inspired by the GDPR). Beyond the political and economic power associated with the EU, much of this impact in the privacy context may be attributed to vigorous data protection authorities.

 

True, despite the delays mentioned above, Sora and other AI technologies will probably become available in the EU in the future. However, as with other AI products that have made their way to the European market, they will likely do so only once they are more mature and after their privacy, safety, and broader implications have been better scrutinized and addressed. In contrast, AI technologies more frequently reach non-European users at earlier, sometimes experimental, stages in these respects.

 

One concerning outcome of the lower privacy protections afforded to non-Europeans, coupled with prolonged periods during which their data may have been used for AI training, is that their data effectively subsidizes global data usage for these purposes and compensates for the stricter controls protecting Europeans’ privacy. In other words, non-Europeans’ data may be used more intensively than Europeans’ data to power technologies used by both. Indeed, while Meta, for example, takes pride in adjusting its treatment of European users’ data for AI training, and notifying them about such uses and their rights, people in other regions have been feeding the company’s AI models for years without similar safeguards.

 

Advanced AI models, particularly generative AI models, rely on enormous amounts of data, which are increasingly scarce. If fewer populations “share the burden” of personal data usage by private companies, users from less protective privacy regimes may be subject to more intrusive monitoring, collection, and use of their data. This asymmetry is also evident in Meta’s four distinct AI Terms of Service: one each for the EU, the UK, Brazil, and the rest of the world.

 

Furthermore, the additional data burden that may be placed on the shoulders of non-Europeans will likely hit marginalized communities the hardest. These groups, often less technologically savvy, may struggle to navigate the privacy settings needed to (partly) limit the use of their data. The Pew Research Center found, for instance, that older American social media users are less likely than younger users to change their privacy settings.

 

The privacy harms caused by personal data usage for AI training can be substantial. These concerns might be masked by the term “Publicly Available Information” (PAI), often used to describe such data. PAI is typically perceived and presented as content that is accessible to everyone. It is frequently framed by digital platforms, in their policies and communications, as information that users “share” or “choose to share” with the public (see examples here, here, and here). But do they really? Default settings and other design features, privacy policies, the opacity surrounding the use of personal data, and the ambiguity regarding its purposes all strongly serve the vast creation and collection of personal data. The scope and nature of the data collected and used extend far beyond what many users might believe they are “publicly sharing.” This may cover detailed data and metadata deciphered from users’ content, behavior, and engagement patterns, including the prompts users have entered when interacting with AI technologies. With transformative technological advancements, the applications of this data have morphed and expanded drastically, enabling myriad analytics and use cases. Furthermore, digital platforms like Meta can also influence the nature of the content that will be (publicly) generated by tweaking exposure and engagement metrics for their four billion users. Moreover, certain user data, such as hashtags, group names, emojis, and product rankings, may serve, at no cost, the otherwise expensive labeling and sorting stages that precede the training process. The company therefore has a strong interest in users continuously generating such “public” data.
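To make the “free labeling” point concrete, here is a minimal, purely illustrative Python sketch of how user-supplied hashtags on public posts could stand in for paid annotation. The posts, the hashtag-to-label mapping, and all names in it are assumptions invented for this example; they do not describe any platform’s actual data or pipeline.

```python
# Hypothetical sketch: treating users' own hashtags as free weak labels.
# All data and mappings below are illustrative assumptions only.
import re

# Toy mapping from hashtags to topic labels (invented for illustration).
HASHTAG_LABELS = {
    "#vacation": "travel",
    "#wanderlust": "travel",
    "#homecooking": "food",
    "#recipe": "food",
}

def extract_hashtags(text):
    """Return all hashtags found in a post's text, lowercased."""
    return [tag.lower() for tag in re.findall(r"#\w+", text)]

def weak_label(post_text):
    """Assign a topic label based solely on hashtags the user chose to add."""
    for tag in extract_hashtags(post_text):
        if tag in HASHTAG_LABELS:
            return HASHTAG_LABELS[tag]
    return None  # No hashtag match: this post would need costly manual labeling.

# Toy "public" posts standing in for scraped content.
public_posts = [
    "Beach day in Lisbon! #vacation #sunset",
    "Tried a new pasta tonight #homecooking",
    "Thoughts on the election debate...",
]

for post in public_posts:
    label = weak_label(post)
    print(f"{(label or 'unlabeled'):>10} | {post}")
```

In this sketch, the labeling work is done by the users themselves at posting time, which is precisely why such “public” signals are commercially valuable before any model training begins.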

 

The unfair data practices that undermine users’ agency over their (“publicly available”) data were highlighted by Melinda Claybaugh, Meta’s Global Privacy Director, in her answers to the Australian Senate inquiry in September 2024.

 

Claybaugh confirmed that, unlike Europeans’ data, Australian users’ data created from 2007 onwards is being used to train the company’s AI models, without an opt-out option being provided. Senator David Shoebridge asked: “The truth of the matter is that, unless you consciously had set those posts to private, since 2007, Meta has just decided you will scrape all of the photos and all of the text from every public post on Instagram or Facebook that Australians have shared since 2007, unless there was a conscious decision to set them on private. But that’s actually the reality, isn’t it?” Claybaugh confirmed that it was. The inquiry also revealed that while minors’ accounts were not scraped for this purpose, information about minors (such as images) posted on adult accounts would be scraped. Claybaugh was unable to confirm whether the company also scraped adult accounts that had been created when the users were still minors.

 

Of course, this disadvantaged privacy situation is not the problem of Australians alone; it is a warning to any population lacking robust privacy controls.

 

Where Do We Go from Here, Then?

 

To address the concerns around the division between “Class A” and “Class B” privacy regimes in general, and around the allocation of the personal-data burden in particular, a few directions could be considered:

 

Better Understanding. Efforts and resources should be dedicated to studying the nature and implications of the increasingly segmented privacy landscape. What are the gaps between communities worldwide regarding personal data used for AI training? What are their short- and long-term human rights ramifications? Which groups and individuals will be most adversely affected? What are the effects on local and global markets and competition? What are the consequences for technological performance? What are the linguistic and cultural outcomes? How could we mitigate these challenges? Governments, international bodies, civil society organizations, private companies, and academics should engage with questions like these to inform policy-making and privacy strategies.

 

Collaborating. Global collaboration between privacy authorities should be encouraged to bridge existing privacy gaps. Such collaboration would not only benefit places lacking effective data protection but also form a robust privacy coalition in which diverse perspectives are exchanged, strategies sought, and concerns addressed. Designated bodies, gatherings, and publications could support these efforts. Beyond global solidarity considerations and the broad expected advantages, such joint efforts are justified by the interconnected nature of privacy regimes.

 

Setting Expectations and Seeking Alternatives. Much of the responsibility rests with tech companies themselves. In its latest Human Rights Report, Meta stated: “Every day, we actively seek to translate human rights principles into meaningful action.” A meaningful next step the company could take tomorrow would be to reduce the global discrimination in privacy protections among its users. To sustain innovation, this may entail pursuing alternatives to the current unfair personal data use, such as diligent reliance on synthetic data, responsible payment for data, digitization of offline archives, more data-efficient models, and privacy-centered training processes. An additional, and much more modest, requirement of private companies would be to make privacy differences among communities, including differences in the data used for AI training, visible to everyone and easily accessible in their Transparency Reports, Human Rights Reports, and user notifications.

 

 

 

Noa Mor

The Hebrew University of Jerusalem

Faculty of Law 

DLI Alum

 

 

Eran Toch

Tel Aviv University

Department of Industrial Engineering

DLI Alum



Cornell Tech | 2025





