Tuesday, 8 April 2025

OpenAI under fire as new study reveals signs of using copyrighted content in training

A new study suggests OpenAI's models memorised copyrighted content, raising concerns about fair use and data transparency.

A recent study suggests that some of OpenAI's AI models may have learned directly from copyrighted material without permission. This finding adds weight to ongoing legal battles brought by authors, developers, and other rights-holders who say their work has been unfairly used to build these models.

While OpenAI has argued that using such content is covered by "fair use," the study raises new concerns over whether that defence holds up. Researchers from the University of Washington, Stanford University, and the University of Copenhagen worked together to develop a method for checking whether AI models have memorised specific pieces of text during training.

New method reveals hidden memorisation

The study focused on what the authors call "high-surprisal" words: uncommon words that stand out when placed in certain sentences. For example, "radar" in the sentence "Jack and I sat perfectly still with the radar humming" is considered a high-surprisal word. It's less expected in this context than words like "engine" or "radio," which might appear more often before the word "humming."

Using this idea, the researchers tested several of OpenAIโ€™s models, including GPT-3.5 and GPT-4. They took text snippets from fiction books and articles published in The New York Times, removed the high-surprisal words, and asked the models to guess the missing word.

If a model guessed the missing words with high accuracy, it suggested that the model had seen those exact phrases or passages during its training, an indication of memorisation. Since the data included copyrighted books and journalism, this poses serious ethical and legal concerns.
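The masking-and-guessing procedure described above can be sketched roughly as follows. This is an illustrative Python sketch, not the researchers' actual code: the toy unigram frequency table and the `guess_fn` model stub are assumptions standing in for a real language model and corpus statistics.

```python
import math

def pick_high_surprisal_word(sentence, word_freq):
    """Pick the least frequent word in the sentence under a toy unigram model.

    Surprisal is -log p(word), so the rarest word has the highest surprisal.
    Words missing from the table get a small floor probability.
    """
    def surprisal(word):
        return -math.log(word_freq.get(word, 1e-8))
    return max(sentence.lower().split(), key=surprisal)

def memorisation_probe(sentence, target, guess_fn):
    """Mask the target word and check whether the model's guess recovers it.

    guess_fn stands in for a call to the model under test; an exact match
    on a rare word is taken as evidence the model saw the passage in training.
    """
    masked = sentence.replace(target, "[MASK]")
    guess = guess_fn(masked)
    return guess.strip().lower() == target.lower()
```

In practice `guess_fn` would query the model under test (e.g. via an API), and the surprisal estimate would come from a reference language model rather than a hand-built frequency table; a high hit rate across many masked passages from a copyrighted work is what the study treats as a signal of memorisation.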

GPT-4 showed signs of memorising books

The tests showed that GPT-4, OpenAI's most advanced model, appears to have memorised sections from popular fiction. Some of this content came from a dataset called BookMIA, which includes samples from copyrighted ebooks. The model also recalled parts of New York Times articles, though less frequently than it did fiction.

These findings point to the possibility that GPT-4 was trained, at least in part, on copyrighted materials. That's a major issue for creators whose work may have been included without their consent.

Abhilasha Ravichander, a PhD student at the University of Washington and one of the study's authors, explained the significance of this discovery. "To have large language models that are trustworthy, we need to be able to audit them and understand how they work," she said. "Our study offers one way to investigate that, but it also highlights the urgent need for more transparency around the data these models are trained on."

OpenAI has been the subject of several lawsuits over its use of copyrighted content. It has defended its approach by arguing that training AI with such content is fair use, a legal principle in the US that allows limited use of copyrighted material without permission.

At the same time, OpenAI has tried to show it takes content rights seriously. It has licensing agreements with some publishers and offers an "opt-out" process so creators can request that their work not be used in training.

Still, the company continues to lobby governments around the world to support looser rules regarding AI training. It wants clearer legal protections that would allow models to be trained on a broad range of online content, including some copyrighted material, without facing legal risks.

But as this new study shows, there's a fine line between learning from data and copying it. And until lawmakers draw that line, the debate around fair use in AI training will likely remain heated.
