The Internet Archive Is Under Siege — And the Collateral Damage Could Be Civilization’s Memory

The Internet Archive, the San Francisco-based nonprofit that has spent more than a quarter century preserving the web’s sprawling, chaotic history, is caught in a vise. On one side: publishers who’ve won legal victories restricting the organization’s lending practices. On the other: the very artificial intelligence companies those publishers claim to fear most. Squeezed in between are the public and the historical record itself.

The Electronic Frontier Foundation fired a sharp broadside this month, arguing in a detailed public statement covered by Slashdot that major publishers’ legal campaign against the Internet Archive won’t accomplish what it claims to. Blocking the Archive from lending digitized books, the EFF contends, does nothing to prevent large AI companies from training models on copyrighted text. What it does do is cut off access for researchers, historians, people with disabilities, and anyone without proximity to a well-stocked physical library.

The argument is blunt: publishers are targeting the wrong entity.

This confrontation has been building for years, but the stakes have sharpened considerably as generative AI transforms how copyrighted material is consumed and reproduced. The core legal battle traces to Hachette v. Internet Archive, the 2023 federal court ruling that found the Archive’s National Emergency Library, launched during COVID-19 lockdowns to provide unlimited digital lending, constituted copyright infringement. The Second Circuit upheld that ruling in 2024, and publishers have since pressed their advantage, demanding restrictions that go well beyond the emergency lending program.

The lawsuit initially involved four major publishers: Hachette Book Group, HarperCollins, John Wiley & Sons, and Penguin Random House. Their complaint was straightforward — the Internet Archive was scanning physical books and lending digital copies without authorization, undermining the commercial market for ebooks. The court agreed. But the EFF and other digital rights advocates say the remedy has metastasized beyond its original scope, threatening the Archive’s Controlled Digital Lending program and its broader preservation mission.

Controlled Digital Lending, or CDL, operates on a simple premise: a library that owns a physical copy of a book should be able to lend a digitized version of that same copy, one borrower at a time, mirroring the “own-to-loan” ratio of traditional lending. The concept has support from legal scholars and library associations. It also has powerful enemies.

The AI Misdirection

Here’s where the argument gets tangled. Publishers have increasingly framed their opposition to the Internet Archive in the language of AI risk. The concern, as articulated in court filings and public statements, is that digitized books could feed the training datasets of large language models. It’s a legitimate worry — companies like OpenAI, Meta, and Google have faced their own copyright lawsuits over training data — but the EFF argues this framing is strategically misleading when aimed at the Archive.

The reason is straightforward. The Internet Archive isn’t an AI company. It doesn’t train models. It doesn’t sell access to bulk text corpora. The organizations actually scraping the web for training data operate at an entirely different scale, with entirely different business models and entirely different legal exposure. Shutting down the Archive’s lending program does approximately nothing to prevent GPT-5 or its successors from ingesting copyrighted material.

What it does accomplish: eliminating one of the few remaining institutions that provides free, open access to knowledge that would otherwise be locked behind paywalls or physically inaccessible.

The EFF’s statement draws a sharp line. “The publishers’ real targets should be the companies making billions from AI,” the organization argues, not a nonprofit library operating on a shoestring budget. The Internet Archive’s annual revenue is roughly $37 million — less than what some AI startups spend on compute in a single quarter.

And yet the Archive is the one in the legal crosshairs.

This isn’t purely about legal strategy. There’s an economic logic at work. Suing the Internet Archive is cheaper, simpler, and more likely to produce favorable precedent than going after deep-pocketed tech giants with armies of lawyers. A ruling that restricts digital lending sets a baseline that publishers can then extend to other contexts. The Archive, in this reading, isn’t the target so much as the test case.

The collateral damage, though, is real and growing. The Wayback Machine — the Archive’s most visible product, containing over 866 billion saved web pages — isn’t directly at issue in the publishing lawsuit. But the organization’s financial and operational capacity to maintain all its services is strained by ongoing litigation. The Archive suffered a significant cyberattack in October 2024 that compromised user data and temporarily knocked services offline, further taxing its resources. As the EFF noted, the combination of legal fees, security investments, and operational costs threatens the institution’s long-term viability.

Brewster Kahle, the Archive’s founder and digital librarian, has been vocal about what he sees as an existential threat — not just to his organization, but to the concept of universal access to knowledge. The Archive stores not only books and web pages but also music, video, software, and government documents. Much of this material exists nowhere else in digital form. Some of it exists nowhere else at all.

Consider what disappears if the Archive can’t operate. Millions of web pages already gone from the live internet, preserved only in the Wayback Machine. Out-of-print books that no publisher has any commercial interest in reprinting. Government reports scrubbed during administration transitions. The digital historical record is far more fragile than most people realize, and the Internet Archive is, in many cases, the only safety net.

The publishing industry’s position isn’t without merit. Copyright exists for a reason, and the market for ebooks is a significant revenue stream. When the Archive lent unlimited copies during the pandemic, it did bypass the licensing structures that publishers and authors depend on. The court was right to find that the National Emergency Library overstepped.

But the question now is whether the remedy is proportionate to the harm. Publishers aren’t just asking for the emergency program to remain shut down — they’re pushing to restrict the Archive’s standard lending practices and establish precedents that could constrain digital libraries broadly. The Authors Guild has supported the publishers’ position, arguing that any unlicensed digital lending undermines the market for authorized editions.

So where does AI actually fit in?

The major AI copyright cases are proceeding on separate tracks. The New York Times sued OpenAI and Microsoft in December 2023. A coalition of authors including Sarah Silverman and Michael Chabon filed suit against Meta over its use of copyrighted books to train LLaMA. Getty Images sued Stability AI over image generation trained on its photographs. These cases involve companies with billions in revenue and explicit commercial use of copyrighted material for profit. They’re the cases that will actually determine how copyright law applies to AI training.

The Internet Archive litigation, by contrast, is about library lending — a practice with centuries of legal and cultural precedent. Conflating the two issues serves publishers’ rhetorical purposes but obscures the actual policy questions at stake.

There’s a broader institutional failure here too. The U.S. Copyright Office has been studying the intersection of AI and copyright for over a year without producing definitive guidance. Congress has held hearings but passed no legislation. In the absence of clear rules, litigation fills the vacuum — and litigation, by its nature, produces winners and losers rather than balanced policy.

The library community has watched these developments with mounting alarm. The American Library Association has consistently supported the principle of controlled digital lending, viewing it as the natural extension of first-sale doctrine into the digital age. First sale — the legal principle that allows libraries to lend physical books they’ve purchased without seeking permission — doesn’t currently apply to digital copies, a gap that critics say leaves digital-only knowledge effectively unlendable.

Some legal scholars argue that CDL should be protected under fair use, particularly for preservation purposes. Others contend that any digital copying without a license is infringement, regardless of how the copy is used. The Hachette decision sided with the latter view, at least as applied to the Archive’s specific practices. Whether that reasoning extends to all forms of digital lending remains an open question — one that future cases will inevitably test.

Meanwhile, the AI companies continue to train. OpenAI has acknowledged using “publicly available” text from the internet, a category that conveniently includes enormous amounts of copyrighted material. Meta’s LLaMA models were trained on datasets including Books3, a collection of over 196,000 pirated books. Google has been characteristically vague about its training data but has scanned millions of books through its Google Books project, which survived its own lengthy copyright battle.

None of these companies need the Internet Archive to access copyrighted text. They have the resources, the infrastructure, and — in some cases — the existing datasets to train on whatever they want. Restricting the Archive restricts the public. It doesn’t restrict AI.

That’s the fundamental asymmetry the EFF is highlighting. The entities with the power and the motive to infringe copyright at massive scale are well-funded corporations. The entity being most aggressively curtailed is a nonprofit library. The publishers’ legal campaign, whatever its merits on the narrow question of unauthorized lending, does not address the AI problem it increasingly invokes to justify its scope.

The question for policymakers — and ultimately for the courts — is whether the public interest in preservation and access counts for anything in this calculus. Copyright law has always involved a balance between the rights of creators and the needs of the public. The Constitution itself frames copyright as existing “to promote the Progress of Science and useful Arts,” not as an absolute property right.

If the Internet Archive is forced to significantly curtail its operations, no private company is going to step in and preserve the historical record out of altruism. Google’s book-scanning project has largely stalled. Amazon has no interest in lending books for free. The market won’t produce a replacement for the Archive because the Archive does things the market doesn’t value — until they’re gone.

That’s the real risk. Not that AI companies will be emboldened, or that publishers will lose revenue, but that a vast portion of human knowledge will simply become inaccessible. Not destroyed, necessarily, but locked away — in out-of-print editions, in expired web pages, in formats no one can read anymore. The digital dark age that archivists have warned about for decades wouldn’t arrive with a dramatic crash. It would arrive quietly, one court order at a time.

The EFF’s intervention is unlikely to change the legal outcome of Hachette v. Internet Archive at this point. But it’s doing something arguably more important: reframing the debate. This isn’t about one nonprofit’s lending practices. It’s about who gets to access knowledge, who gets to preserve it, and whether the law can distinguish between a library and a tech conglomerate.

So far, the answer to that last question isn’t encouraging.

The Internet Archive Is Under Siege — And the Collateral Damage Could Be Civilization’s Memory first appeared on Web and IT News.

Published by

awnewsor

3 months ago
