Josh Dzieza | The Verge

AI research papers are getting better, and it’s a big problem for scientists

2026-06-09T14:17:41-04:00

Last summer, Peter Degen’s postdoctoral supervisor came to him with an unusual problem: One of his papers was being cited too much. Citations are the currency of academia, but there was something unusual about these. Published in 2017, the paper had assessed the accuracy of a particular type of statistical analysis on epidemiological data and had received a respectable few dozen citations in other research papers over the years, but now it was being referenced every few days, hundreds of times, placing it among the most cited papers of his career. Another professor might be thrilled. Degen’s adviser asked him to investigate.

Degen, a postdoctoral researcher at the University of Zurich Center for Reproducible Science and Research Synthesis, found that the citing papers all followed a similar pattern. Like the original, they were analyzing the Global Burden of Disease study, a publicly available dataset compiled by the Institute for Health Metrics and Evaluation at the University of Washington. But they were using the dataset to churn out a seemingly endless supply of predictions: about the future likelihood of stroke among adults over 20 years old, of testicular cancer among young adults, of falls among elderly people in China, of colorectal cancer among people who eat minimal whole grains, of disease X among population Y, and so on.

Searching on GitHub for code that would be used to do this sort of analysis, Degen followed some links and wound up on the Chinese social media site Bilibili, where he discovered a Guangzhou-based company touting tutorials on how to produce publishable research in under two hours using its software tools and AI writing assistance. These studies were not very good. Researchers who analyzed a subset of studies about headaches found they were rife with errors and misrepresentations. But they were also not as flagrantly wrong as AI-generated papers of the recent past, making them more difficult to filter out.

“It’s a huge burden on the peer-review system, which is already at the limit,” Degen said. “There’s just too many papers being published and there’s not enough peer reviewers, and if the LLMs make it so much easier to mass produce papers, then this will reach a breaking point.”

Optimists about generative AI have high hopes for its ability to produce future scientific breakthroughs — accelerating discovery, eliminating most types of cancer — but the technology is currently undermining one of the pillars of scientific research, inundating editors and reviewers with an endless stream of papers. Paradoxically, the better the technology gets at producing competent papers, the worse the crisis becomes.

For the past decade, academic publishing has been contending with so-called “paper mills,” black-market companies that mass-produce papers and sell authorship slots to academics, doctors, or others who hope to gain a competitive edge by having published research on their resumes. It has been a game of cat and mouse, with publishers — often pressed by so-called science sleuths, researchers who specialize in ferreting out fraudulent research — closing one vulnerability only to have the mills find a new one. Generative AI was a boon to the mills, helping them to skirt plagiarism detectors by creating wholly new images and text. Still, the technology’s telltale hallucinations meant that publishers could at least theoretically screen out much of their work. In practice, papers still got through, only to get retracted when sleuths encountered a diagram of a rat with inexplicably gargantuan genitals labeled “testtomcels” or prose sprinkled with “as an AI assistant”s that someone forgot to delete.

But now AI has improved to the point where it can produce convincing papers almost wholesale, allowing desperate academics in need of a publication to mill papers of their own. The result is a deluge of scientific slop that threatens to swamp publishing, peer review, grant making, and the research system as it exists today.

Matt Spick, a lecturer in health and biomedical data analytics at the University of Surrey and an associate editor at Scientific Reports, first noticed the phenomenon when he received three strikingly similar papers analyzing the US National Health and Nutrition Examination Survey (NHANES), another public dataset. He checked Google Scholar and realized that it wasn’t a coincidence: There had been a sudden explosion in papers citing NHANES that all followed a similar formula, each purporting to discover an association between, for example, eating walnuts and cognitive function or drinking skim milk and depression.

“If you’ve got enough computing power, you go through and you measure every single pairwise association, and eventually you find some that haven’t been written on before and you just publish: There is a correlation between this and that,” Spick said. These correlations are often misleading simplifications of phenomena with multiple causes or random statistical flukes. “One was that how many years you spend in education will cause postoperative hernia complications. That is just a random correlation. What am I supposed to do with that? Leave school early so that I won’t get a postoperative hernia complication later?”
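The brute-force approach Spick describes can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any of the papers in question: the function name, the 40 variables, the 200 simulated people, and the use of the Fisher z-transform as the significance test are all assumptions made for the example. Every column is pure random noise, yet some pairs still clear the conventional p < 0.05 bar.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_chance_findings(n_variables=40, n_people=200, z_critical=1.96, seed=0):
    """Count variable pairs that look 'significant' in pure noise.

    All parameters are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    # Every column is independent Gaussian noise, so no real association exists.
    columns = [[rng.gauss(0, 1) for _ in range(n_people)] for _ in range(n_variables)]
    pairs = list(itertools.combinations(range(n_variables), 2))
    hits = 0
    for i, j in pairs:
        r = pearson(columns[i], columns[j])
        # Fisher z-transform: roughly standard normal when the true correlation is zero.
        z = math.atanh(r) * math.sqrt(n_people - 3)
        if abs(z) > z_critical:  # two-sided test at about the 5 percent level
            hits += 1
    return hits, len(pairs)


hits, total = count_chance_findings()
print(f"{hits} of {total} pairs pass p < 0.05 by chance alone")
```

With 40 variables there are 780 possible pairs, so a 5 percent false-positive rate should yield a few dozen "findings" even though nothing real is there. Correcting for the number of comparisons would remove most of them.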

Over the years, sleuths have developed a variety of methods for detecting inauthentic papers. Some search for “tortured phrases,” instances where someone was trying to skirt plagiarism detectors by feeding an existing paper through a synonym generator, which often has the effect of turning technical terms like “reinforcement learning” into nonsense like “reinforcement getting to know,” to cite one recent example. Other sleuths track duplicated images, perform network analysis of authors, or check citations for hallucinated publications, a classic sign of LLM use. Spick searches for masses of papers following the same template as they analyze public datasets.
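A tortured-phrase search can be as simple as a lookup against a list of known substitutions. The sketch below is a hypothetical illustration rather than any sleuth's actual tool: the function name and the phrase list are assumptions, and only the first entry comes from the example above.

```python
import re

# Hypothetical seed list mapping a tortured phrase to the term it likely replaced.
# The first entry is the example cited in this article; the second is a commonly
# reported example added for illustration. A real screener would use a far longer list.
KNOWN_TORTURED_PHRASES = {
    "reinforcement getting to know": "reinforcement learning",
    "counterfeit consciousness": "artificial intelligence",
}


def find_tortured_phrases(text, phrases=KNOWN_TORTURED_PHRASES):
    """Return (phrase, likely_original, position) tuples for each match.

    Positions refer to the lowercased, whitespace-normalized text.
    """
    normalized = re.sub(r"\s+", " ", text.lower())
    findings = []
    for phrase, original in phrases.items():
        for match in re.finditer(re.escape(phrase), normalized):
            findings.append((phrase, original, match.start()))
    return sorted(findings, key=lambda item: item[2])


sample = "We apply reinforcement  getting to know to schedule hospital beds."
for phrase, original, position in find_tortured_phrases(sample):
    print(f"{phrase!r} at {position}: probably meant {original!r}")
```

A fixed list only catches substitutions someone has already catalogued, which is one reason this kind of screening struggles once generated text stops making such mistakes.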

These papers may not necessarily be wrong, though they are often misleading. Nor are they strictly speaking fraudulent. They’re just useless, and suddenly very easy to make. Last year, several journals began restricting submissions of papers analyzing public datasets, citing a flood of redundant research.

Spick fears these measures may be fighting the last battle. In recent months, AI companies have released a range of “agentic” science assistants capable of analyzing data, generating hypotheses, and writing research papers with a high degree of autonomy. While a possible step toward the goal of AI-accelerated science, these systems also come with novel risks. When Carnegie Mellon researchers tested several agentic tools, they found that they sometimes invented data or used misleading techniques, but that these errors were only apparent upon close analysis of the full workflow; the final papers looked polished.

Announcing an AI paper writing assistant earlier this year, OpenAI’s then-vice president for science, Kevin Weil, predicted, “I think 2026 will be for AI and science what 2025 was for AI and software engineering.” Spick and some colleagues, curious what it could do, gave the tool, called Prism, some data from an already published paper documenting ripening times of eggplants and peppers. Prism analyzed the data, proposed a new statistical method that could be applied to it, and wrote an entire paper complete with charts and correct citations.

“We were all looking at each other like, ‘What the [expletive], this is actually a decent piece of work!’” Spick recalled. Unlike the generated papers he’d encountered previously, this one didn’t follow a template, nor was it using a single well-known database. It took 25 minutes and 50 seconds to produce.

“I’m genuinely not sure at what point we will suddenly realize that more are getting through than we realize because we can’t easily tell the difference anymore,” Spick said.

This raises some philosophical questions, Spick said, like: Does it matter who or what writes the paper if the information is accurate? And should science be in the business of publishing every possible fact?

“Part of science is supposed to be the filter. We’re supposed to publish the stuff that we think is interesting, not publish literally everything that we can possibly find,” Spick said. “Because if we do that, science is just spamming the world with all the data, irrespective of whether it constitutes actual new knowledge or not, and in any kind of medium-term time frame, it’s almost impossible to work out what’s meaningful and what isn’t.”

This is the immediate practical challenge posed by AI agents. They threaten to overwhelm the human systems that create and organize knowledge. Research funders are contending with onslaughts of proposals perfectly tailored to their particular grant, unable to parse which projects represent the next step in years of work and which were generated in minutes. Conference organizers, journal editors, and peer reviewers are all struggling to sort through a flood of material that all seems good enough at first glance to warrant a close read. There is an enormous and growing asymmetry between the time it takes to produce new work and the time it takes a subject-matter expert to vet it.

For Marit Moe-Pryce, the managing editor of the international relations journal Security Dialogue, submissions are up 100 percent over where they were a year before. Just as problematic: All the submissions have become pretty good. Gone are the blatant hallucinations and leftover prompts; everything has suddenly become coherent, well structured, and stylistically similar, making it difficult to say whether a given paper was wholly generated, written by an experienced academic, or produced by a young scholar using AI as an editor.

“The main problem that we see currently from the desk is that the fraudulent side and the academic side are conflating, which ends up with a big gray mass of articles that we as editors need to sit and try to figure out, ‘What is this? Is this something that we need to engage with? Is it not?’” Moe-Pryce said.

One paper made it past at least 10 editors and two rounds of peer review before she noticed a fake citation — a very plausible one, involving several former editors of the journal on a topic they could have written about but never did. She then found several more. She doesn’t know at what stage of revision the hallucinations were introduced, but the close call underscored the level of care required to ensure nothing false gets published. Now that models increasingly cite real papers, she has to read for whether the works cited are the ones an expert would actually use, AI not yet having mastered the difference between canonical literature and more peripheral work.

“It’s incredibly detailed, and this is a normal part of the editorial work. The difference is that now you have to do that for all the rubbish that comes through the door,” Moe-Pryce said. “That’s why our workload becomes so unmanageable.”

Academic papers go through a multi-stage review process before publication. First, manuscripts are triaged for obvious problems, then sent to a journal’s editor, who decides whether it might be worth publishing. The editor then sends it to an associate editor with experience in the field, who again vets it before recruiting two or three subject-matter specialists — the “peers” in peer review — to read the paper and write responses. The editors and reviewers are typically working for free, volunteering their time in addition to their primary academic job.

The review system was already struggling under increasing volumes of submissions, and now AI is increasing those volumes while also making the bad ones more difficult to filter out. Moe-Pryce now spends more time sorting papers before deciding what to send out for review, and prospective reviewers, swamped themselves, are less and less likely to respond. Where she previously could send four queries out and get three replies, it now takes her a dozen tries to get two people. Increasingly, she reaches out to 20 reviewers and hears nothing.

“It’s fatigue. Academic journals have mushroomed, and then you have AI helping everyone fraudulent or not generate more, faster, so you have a massive increase in volume,” she said. “AI currently holds the potential to bring down the publishing system as we know it.”

The journal Accountability in Research has seen a 60 percent surge in submissions this year, according to David Resnik, an associate editor at the journal. Ironically, he has been besieged by likely AI-generated papers about fraudulent academic papers that have mined public data compiled by the organization Retraction Watch.

He, too, is struggling to find reviewers. At times, he’s had to send out 20 requests just to get two responses — and he’s suspected that some of the responses he’s received are AI-generated themselves. He has reason to be suspicious. A survey conducted by the publishing company Frontiers last year found that more than half of researchers have used AI assistance in their peer review.

“I’m very worried about this straining, breaking the back of the peer-review system,” said Resnik.

AI agents arrive at a time when the quality filters of academia are already struggling to cope with a superabundance of papers. The number of scientific papers published has grown exponentially in recent years, according to an analysis of data published in Quantitative Science Studies, while the number of PhDs who might review them has not. Unfortunately, the authors attribute this explosion in productivity not to rapid progress in science but to the fact that commercial and professional incentives align to publish the maximum quantity of papers.

Many journals have shifted to an “open access” model where they earn revenue by charging authors processing fees to have their papers published, as opposed to charging for subscriptions. In earnings calls, publishing companies tout the recent 20 percent or more increase in submissions as a positive growth story. Universities and funding agencies, meanwhile, look at researchers’ publication metrics when deciding whom to fund or promote, which means researchers are under pressure to “publish or perish.” Nor is it only traditional academics who are under this pressure to publish. Overseas medical students can improve their chances at a US residency program by having a few peer-reviewed papers on their resumes. In China, medical doctors have strong incentives to publish despite having neither the time nor the resources to conduct research, making quick paper generation an attractive option.

If you introduce an infinite paper-writing machine to a system that defines productivity by the number of papers written, people will use it to write a lot of papers. A study published in Nature this year found that scientists who adopted AI published three times more papers and received nearly five times more citations than those who didn’t. They also became research project leaders 1.37 years earlier than those who did not use AI. While individually beneficial, the embrace of AI to mass-produce papers may be detrimental to science as a collective endeavor, beyond exhausting journal editors and peer reviewers. The same study found a collective narrowing of focus as these newly productive scientists gravitated toward well-studied fields with abundant existing data for AI to synthesize.

There are no easy solutions to this problem. In 2022, the scientific organization STM launched an initiative called Integrity Hub to contend with paper mills. Since then, it has been engaged in an “arms race” with AI, according to Joris van Rossum, the project’s program director — assembling automated tools that check for plagiarism, then tortured phrases, then fake citations — but the group must now consider more sweeping remedies.

“We anticipate a future where it’s going to be more realistic to enable submitters to demonstrate authenticity rather than trying to detect fabrication,” he said. That is, once fraudulent manuscripts are impossible to detect, publishers will have to find a way for researchers to prove their work is real — perhaps by working with instrument manufacturers to develop ways of watermarking their images, he said, or having researchers submit more of the data behind their work so it can be analyzed for suspicious signals.

This would entail changing the way research is done on a massive scale, and while it might stem outright fraud, it would do little to reduce the volume problem. Using AI to assist with peer review, as some have proposed — and some reviewers are already doing, permitted or not — raises a nest of other possible risks. Studies have found that models often continue to cite retracted studies as valid and write superficially good reviews while overlooking methodological problems. AI reviewers also appear to prefer AI-generated writing.

“It’s not really a tractable problem,” said Reese Richardson, a postdoctoral fellow at Northwestern University who studies mass-produced papers. “I think that the only way out of this situation is to actually change the way that the scientific enterprise awards prestige and awards resources. As long as we have this hyper-competitive, hyper-unequal rat race where people’s productivity and their worth as scientists is being measured by how many publications they put out and how many times they get cited, it’s just going to incentivize this behavior.”

Vincent Larivière, the editor-in-chief of Quantitative Science Studies, had a similar diagnosis. His journal has seen a 40 percent increase in submissions this year.

“We need a reform of what matters in science,” Larivière said. The conflation of scientific productivity with publication counts has had a distorting effect on science, causing research to gravitate toward small, tractable problems that are guaranteed to result in something publishable. AI could do great things, he said — help cure cancer, develop fusion energy — but right now it is being used to generate papers to “pad CVs.”

“Of course we need more science,” he said, “but do we need more papers?”

There’s an internet choke point in the Middle East — is the solution in the North Pole?

2026-06-09T14:17:23-04:00

The vast majority of the world’s data — emails, financial transactions, the internet — is carried by fiber optic cables that run along the ocean floor and converge at a few narrow choke points. Periodically, policymakers will release reports noting that this arrangement seems risky, but these routes are the shortest, often in use since the telegraph era, and the system has managed remarkably well. Cables break regularly, and traffic gets rerouted until a repair ship can come and fix the cut. But the war in Iran, coming after several years of disruptions from conflict in Yemen, is spurring governments and companies to consider alternate routes, including one going across the North Pole.

The current problems began in 2024, when a Houthi missile struck a cargo ship in the Bab-el-Mandeb Strait off the coast of Yemen, causing the vessel to drift for days and drag its anchor across three of the more than a dozen submarine cables crammed into the narrow Red Sea passage.

Cable repair is carried out by specialized ships that fish up the broken ends and splice them back together. It’s delicate work that involves slowly dragging grapnels along the seafloor and floating very still for hours while fiber strands are spliced together, none of which can be safely done in a war zone. Consequently, it took more than four months to broker the agreements necessary to bring in a ship. Last September, another four cables were severed, likely by a commercial vessel dragging its anchor, again disrupting internet traffic in Africa, Asia, and the Middle East. Again, months of negotiations before a repair could be done.

The Red Sea cuts spurred companies and governments to look for alternate routes, and the Strait of Hormuz seemed promising. Then the US and Israel attacked Iran, cable projects were halted, and now the world is looking elsewhere once again.

“When the Red Sea shut everything down, everyone swung over to the Persian Gulf, and now you can’t do that either,” said Roderick Beck, a cable industry veteran who sources telecom capacity for ISPs. “The Persian Gulf will never go back to what it was before, when the Iranians wouldn’t dare assert control.”

The Gulf states, which have been aggressively building data centers in an attempt to shift their economies from oil to AI, are looking to avoid the Red Sea by going overland, building routes to Europe via Syria, Iraq, and Oman. But the most ambitious proposal is in Europe, where the repeated cable cuts have the continent looking to the Arctic.

Earlier this year, a European Union panel on cable resilience recommended building two Arctic cables to establish a route to Asia that avoids the Red Sea, through which 90 percent of Europe’s traffic currently passes. One cable would go through Canada’s Northwest Passage. The other would link Scandinavia to Asia by going straight across the North Pole.

The second of these routes is already in the early planning stages. Called Polar Connect, it’s being led by Nordic academic-network operators, Sweden’s polar research agency, and the telecom firm GlobalConnect Carrier. This year, the EU designated it a “Cable Project of European Interest” and has put approximately 9 million euros toward preparatory work. (The EU report estimated the full cost would be approximately 2 billion.) A route survey is planned for this summer.

“It started before the unrest, but the geopolitical situation has resulted in an increased interest in finding alternate routes,” said Pär Jansson, Senior Vice President (Carrier) at GlobalConnect, the telecom company working on the Polar project. The group’s white paper notes that Europe’s data currently has three routes to Asia, none of them ideal: through the Red Sea, through Russia, or through the US, a “long route controlled by non-European entities.” The cable would make Europe’s data infrastructure more resilient, lower latency between the EU and Asia, and “strengthen Europe’s autonomy,” Jansson said, adding that it could also allow for better environmental monitoring of the Arctic.

Others have attempted an Arctic cable, never successfully. “People have discussed this for at least 20 years,” said Alan Mauldin, a research director at TeleGeography, the cable industry research firm. Installation would be challenging and expensive, requiring retrofitting a cable ship for Arctic conditions and procuring icebreakers to escort it across the North Pole. But the real obstacle is maintenance.

“What if there is damage to the cable from, it’s called ice scour, when ice scrapes against the cable and damages it. Then you can’t repair it until summer,” Mauldin said. “We’ve seen so many projects come and go. There’s a reason for that, right? It’s very challenging.”

Beck raised the same repair issue. “The problem is icebergs,” said Beck. They can drag along the bottom of the ocean floor, digging long grooves deeper than a cable can be buried. “That’s what happened to Quintillion. Twice.”

Quintillion was the last attempt at an Arctic cable. In 2016 it acquired the assets of Arctic Fibre, the previous attempt to build an Arctic cable between Europe and Asia. Quintillion activated a portion that ran from Nome along the northern coast of Alaska to Prudhoe Bay, but in June 2023, sea ice broke it. Because there are no icebreaker cable ships, Quintillion had to wait for the summer ice to melt before it could fix the cable. Then in January of last year, an iceberg struck again. This time in deep winter, no one could repair the cable for eight months. The rest of the route was never laid.

The high repair costs and potential for lengthy downtimes make an Arctic cable financially unattractive, Mauldin and Beck said. The question is whether governments now see the cable as strategically important enough to outweigh that. “I think the EU is really big on this thing because they think it’s data sovereignty, but it would be enormously expensive. It’s never been done before,” said Beck.

Jansson is aware of the challenges, but he believes the new geopolitical situation and new technologies will make it feasible. Tech companies are building data centers in the Nordic countries, he said, and will want fast and resilient connectivity, but ultimately it will require public investment. He places the cost estimate for the Norway-Japan leg at “below 1 billion euros.”

The goal is for it to go live in 2030. That may be the easy part.

How Project Maven taught the military to love AI

2026-06-09T14:17:08-04:00

In the first 24 hours of the assault on Iran, the US military struck more than 1,000 targets, nearly double the scale of the “shock and awe” attack on Iraq over two decades ago. This acceleration was made possible by AI systems that speed up the targeting process. Chief among them is the Maven Smart System.

In her new book, Project Maven: A Marine Colonel, His Team, and the Dawn of AI Warfare, journalist Katrina Manson investigates the development of Maven from its inception in 2017 as an experiment in applying computer vision to drone footage. The project spurred employee protests at Google, the military’s initial contractor, prompting the company to back out. Pushed forward by a Marine intelligence officer named Drew Cukor, whose story forms the backbone of Project Maven, the system ended up being built by Palantir and draws on technologies developed by Microsoft, Amazon, Anthropic, and others. Now used across the US armed forces and recently purchased by NATO, Maven synthesizes satellite imagery, radar, social media, and dozens of other data sources to identify and target entities on the battlefield. It also speeds up what’s called the “kill chain.”

Maven combines computer vision with a sort of workflow management system that finds targets, pairs them with weapons, and allows users to quickly click through the other steps of a targeting cycle. A process that once took hours can now be completed in seconds. An official tells Manson that the technology has allowed the US to go from hitting under a hundred targets a day to a thousand, and with the addition of LLMs, up to five thousand targets a day.

One of the thousand targets struck on the first day of the Iran war was a girls’ school, killing more than 150 people, mostly children. The school had previously been part of an Iranian naval base, yet it was listed online as a school and playgrounds were visible on satellite imagery. While much of the coverage after the strike focused on possible hallucinations by Claude, the technology historian Kevin Baker wrote in The Guardian that Maven and the acceleration it enabled is the more relevant place to look. “A chatbot did not kill those children,” he wrote. “People failed to update a database, and other people built a system fast enough to make that failure lethal.”

The pace of war is set to accelerate further. Manson uncovers military programs to develop fully autonomous weapons — including an explosive-laden drone Jet Ski — capable of targeting and destroying targets on their own.

I spoke to Manson about Maven and how AI is changing warfare.

This interview has been condensed and edited for clarity.

Colonel Cukor was an early and determined proponent of AI. Can you say a bit about him and what his initial motivations were?

He is chief of Project Maven, so he was the day-to-day doer and leader, but he also had this very long-term vision, which comes from his frustration that US military operators in Afghanistan were equipped with very poor intelligence tools. There was this idea that the US essentially fought that war 40 times over, every six months, because information wasn’t being handed over [when troops rotated in]. He was frustrated that data was in Excel and PowerPoint and he wanted an analytic tool that would bring intelligence to the frontline military operators. But he also had this vision for what he called “white dots” — that there would be white dots shown on a map infused with intelligence information, like a coordinate, what is there, the elevation, what is known about it. And this becomes one of the driving forces of what he tries to create through Project Maven.

How was Maven initially conceived in the military, was it as this interface and information management system?

It comes out of this project called Project Maven that starts in 2017. The actual project already existed and had already got a funding stream. It was to use AI against satellite imagery, but then it got repurposed for drone video imagery. This is because the US is thinking about how to develop AI technologies for any potential conflict with China. They had this idea that eventually war would run faster than humans could think, so they wanted to bring AI into this. The initial idea proposed by Colonel Cukor is to apply AI to drone video footage. They were sometimes managing to analyze as little as 4 percent of the collection, so they wanted AI essentially to take the place of human eyes in analyzing what was there, but it was always bigger.

The public first heard about Maven with the Google protests in 2018, and I remember Google at the time saying that this technology would not be used to kill people. But it sounds like targeting was always the intention?

A spokesperson from Google at the time said that flagging images for review on the drone feed with the help of AI was intended to save lives and was for non-offensive uses only. That is not what my reporting shows. My reporting shows that many of the US military operators were motivated by the aim to save US lives and reduce civilian harm, so in that sense, it is “not offensive” because you’re analyzing intelligence information. But in the wider sense and very quickly, in the very real sense, AI target selection was intended for targeting.

I asked someone in the book if targeting for offensive weapon strikes was intended to be part of Project Maven, and he replied, “Yeah, of course, it’s not like we’re doing it for kicks. The goal of the intel is to take out high-value targets.”

When the Google deal falls apart, that’s when Palantir steps in. Can you tell me about Palantir’s role in the project?

Two things happen. Microsoft and AWS [Amazon Web Services] take a much bigger role in producing the algorithms and also in the compute, and alongside that, Cukor goes to Palantir and says, “Can you help?” He’s pitching this idea of the white dots on a screen. He has this 10-year vision for how the US military will remake themselves, and they’ve been trying out algorithms, which at that stage are not very good at identifying anything, and are also having to sit in systems that aren’t fit for purpose. They had a lot of problems with users not believing in AI and finding the displays very distracting. So he wants a user interface that will please the user.

So he pitches to Palantir that they create a user interface, which actually Palantir doesn’t want to do. I’m told they didn’t believe that AI was going to take off, and they also didn’t want to just make a fancy user interface. They wanted to crunch the data. But that wasn’t initially what Cukor was pitching them and he was very persuasive. He also wanted them to be less arrogant, and he ends up counseling them on how to attempt to remake their reputation inside the Department of Defense and to get these contracts, which initially, I don’t think are worth much money. But today, nearly 10 years later, I’ve reported that Maven Smart System is going to become by the end of September a “program of record” and Palantir is the prime contractor, so in the end, it’s going to be lucrative for them.

Ukraine sounded like a pretty big inflection point in the development of these systems. What happened there?

This becomes a really important moment where the artillery fire team realizes that AI can help them speed up their operations and targeting. It becomes much more explicit that intelligence is going to feed into operations. When the US is supporting Ukraine, even before Russia’s invasion, the 18th Airborne Corps is over in Wiesbaden in Germany and very quickly they start to use computer vision on the Maven Smart System to figure out where the Russian positions are, where the tanks are, what is happening. The algorithms fail very quickly. The algorithms were used to the desert in the Middle East and in Afghanistan. The algorithms couldn’t recognize tanks and other features in the snow. They collect new satellite footage over the Russian tanks and other equipment and send it back to the US to retrain the algorithms really quickly, so they become much better at spotting tanks.

The US starts sending what they end up calling “points of interest” to the Ukrainians, who then use that to target Russian equipment and personnel. The language of “points of interest” is interesting because the US is trying to thread this needle to provide support to the Ukrainians without being seen in Russia’s eyes as a direct participant in the war. So they evolved this idea that a “target” is something that has gone through a process, and they are giving the Ukrainians everything just shy of that. I’m able to report that at the high point on one day in 2022, the US passes 267 points of interest to Ukraine.

What are the parts of the targeting process that are getting automated that cause that kind of acceleration?

The US military would say nothing is yet automated, because there is this extra stage of targeting, which is really key, which is the legal decision to strike something. In the case of why the kill chain is speeding up, what I’ve been told is that a lot of the processes involved in getting permission to strike a target have traditionally been extremely analog and slow, involving telephones and swivel chairs. So this is part of shifting this process onto digital platforms and then eventually getting to automate it.

The 18th Airborne Corps had humans at six key steps. So the human decides when and how to shoot at a target. They assess what’s called an operational approach. They assess the data collected, they decide to act, communicate the decision, execute the fire, and then communicate what happened. And then with the arrival of Maven’s AI, they reduced the human role in the loop to only two places: the decision to act and the action itself. They can supervise the machine making the decision during the automated collection process, but the assessments throughout would all be AI enabled. Even at the NGA [National Geospatial-Intelligence Agency], they are producing intelligence reports that no human eyes or hands have touched that are entirely AI generated. So there’s been this huge shift into really making data and the system king.

The other reason that they’re able to get to so many targets in a day is because the Maven Smart System is using large language models. I’ve reported [they’re using] Claude from Anthropic, and I was told it was helping speed up the processes. And Centcom [US Central Command] themselves said that with the help of AI, they were able to speed up processes that used to take days and hours down to as little as seconds. The commander, the US would say, is still making the decision. But I’ve also spoken to US military ethicists who say that there is a risk of the gamification of war, and that people may end up trusting the targets that they’re being offered on screen without understanding fully the data that’s supporting it.

Now, the pushback is that this is data that’s better tagged than ever before, that this AI-based system, essentially being a database system, means that you can audit the data and go deep into it and also give headquarters a way of following what military operators at the edge are doing with much greater transparency and accountability than ever before. This enormous operation that the US has undertaken in Iran will ultimately be a case in point. And we’ll be looking for data and accountability about how the US has, in the end, used this platform.

There’s a technology scholar, Kevin Baker, who wrote a piece about how Claude got a lot of blame initially for the school strike in Iran. But he pointed to this longer-term acceleration and said that these steps may have left time for deliberation or noticing errors or contradictory intelligence. I’m curious if there were concerns in the military that things were getting too fast?

There’s a really significant debate inside the US military about how far they should lean into this. Some are saying it’s inevitable, and others are really warning that that human assessment at the last minute is the thing that can save lives. And I don’t think that the debates proved out, but the direction of travel is clear in that the Maven Smart System is becoming a program of record. That Central Command commander is taking time out of these operations to go on to X and say that they are using AI and that they’re finding it helpful. Then you have people like retired Defense Secretary Jim Mattis saying that targeting is no substitute for strategy, that hitting a lot of things, essentially, doesn’t get you to victory.

There’s one example that I keep going back to in my mind, which is in 1999, when the US strikes the Chinese Embassy in Belgrade. In the analysis that the US offers publicly afterwards, they say that the embassy was incorrectly labeled on a map. The embassy had moved recently. The map hadn’t been updated. One map had; others hadn’t. Someone even tried to make a call because they got worried and wanted to check, but they weren’t able to reach someone in time.

In an example like that, if your systems flag a problem and they’re digitally connected, on the one hand, it could be much easier to raise anomalies, problems, risks of mistake. On the other, the target selection from what could be an erroneous targeting database could be made even quicker without those checks. So the decision that the US military makes about leaning into AI on the targeting cycle will only be as good as the data that is feeding it.

You Could Be Next

2026-03-16T14:00:57-04:00

The LinkedIn post seemed like yet another scam job offer, but Katya was desperate enough to click. After college, she’d struggled to make a living as a freelance journalist, gone to grad school, then pivoted to what she hoped would be a more stable career in content marketing — only to find AI had automated much of the work. This company was called Crossing Hurdles, and it promised copywriting jobs starting at $45 per hour.

Katya clicked and was taken to a page for another company, called Mercor, where she was instructed to interview on-camera with an AI named Melvin. “It just seemed like the sketchiest thing in the world,” Katya says. She closed the tab. But a few weeks later, still unemployed, she got a message inviting her to apply to Mercor. This time, she looked up the company. Mercor, it seemed, sold data to train AI, and she was being recruited to create that data. “My job is gone because of ChatGPT, and I was being invited to train the model to do the worst version of it imaginable,” she says. The idea depressed her. But her financial situation was increasingly dire, and she had to find a new place to live in a hurry, so she turned on her webcam and said “hello” to Melvin.

It was a strange, if largely pleasant, experience. Manifesting on Katya’s laptop as a disembodied male voice, Melvin seemed to have actually read her résumé and asked specific questions about it. A few weeks later, Katya, who like most workers in this story asked to use a pseudonym out of fear of retaliation, received an email from Mercor offering her a job. If she accepted, she should sign the contract, submit to a background check, and install monitoring software onto her computer. She signed immediately.

She was added to a Slack channel, where it was clear she was entering a project already underway. Hundreds of people were busy writing examples of prompts someone might ask a chatbot, writing the chatbot’s ideal response to those prompts, then creating a detailed checklist of criteria that defined that ideal response. Each task took several hours to complete before the data was sent to workers stationed somewhere down the digital assembly line for further review. Katya wasn’t told whose AI she was training — managers referred to it only as “the client” — or what purpose the project served. But she enjoyed the work. She was having fun playing with the models, and the pay was very good. “It was like having a real job,” she says.

Two days after Katya started, the project was abruptly paused. A few days after that, a supervisor popped into the room to let everyone know it had been canceled. “I’m working assuming that I can plan around this. I’m saving up for first and last month’s rent for an apartment,” Katya says, “and then I’m back on my ass. No warning, no security, nothing.” Several days later, she got an email from Mercor with another offer, this one for a job evaluating what seemed to be conversations between chatbots and real users — many appeared to be from people in Malaysia and Vietnam practicing English — according to various criteria, like how well the chatbot followed instructions and the appropriateness of its tone. Sign the contract, the email said, and you’ll have a Zoom onboarding call in 45 minutes. It was 6:30PM on a Sunday night. Scarred from the abrupt disappearance of the previous gig, she accepted the offer and worked until she couldn’t stay awake.

Machine-learning systems learn by finding patterns in enormous quantities of data, but first that data has to be sorted, labeled, and produced by people. ChatGPT got its startling fluency from thousands of humans hired by companies such as Scale AI and Surge AI to write examples of things a helpful chatbot assistant would say and to grade its best responses. A little over a year ago, concerns began to mount in the industry about a plateau in the technology’s progress. Training models based on this type of grading yielded chatbots that were very good at sounding smart but still too unreliable to be useful. The exception was software engineering, where the ability of models to automatically check whether bits of code worked — did the code compile, did it print HELLO WORLD — allowed them to trial-and-error their way to genuine competence.
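The kind of unambiguous feedback described above can be sketched in a few lines. This is only an illustration, not how any lab's training pipeline works: the function name `passes_check` and the three sample attempts are hypothetical, and the check simply asks the two questions from the text, whether the code compiles and whether it prints HELLO WORLD.

```python
import contextlib
import io

EXPECTED_OUTPUT = "HELLO WORLD"  # the example check named in the article


def passes_check(candidate_source: str) -> bool:
    """Hypothetical checker: True only if the code compiles, runs, and prints the expected text."""
    # Question 1: did the code compile?
    try:
        program = compile(candidate_source, "<candidate>", "exec")
    except SyntaxError:
        return False

    # Question 2: did it print HELLO WORLD?
    # Note: exec() on untrusted code is unsafe; a real system would sandbox it.
    captured = io.StringIO()
    try:
        with contextlib.redirect_stdout(captured):
            exec(program, {})
    except Exception:
        return False
    return captured.getvalue().strip() == EXPECTED_OUTPUT


# Three made-up attempts, standing in for a model's trial and error.
attempts = [
    'print("HELLO WORLD"',   # missing parenthesis: does not compile
    'print("hello world")',  # compiles and runs, but prints the wrong text
    'print("HELLO WORLD")',  # compiles and prints the expected text
]
results = [passes_check(attempt) for attempt in attempts]
print(results)
```

Each attempt gets a yes-or-no verdict with no human grader involved, which is the property the next paragraph says most other professions lack.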

The problem was that few other human activities offer such unambiguous feedback. There are no objective tests for whether financial analysis or advertising copy is “good.” Undeterred, AI companies set out to make such tests, collectively paying billions of dollars to professionals of all types to write exacting and comprehensive criteria for a job well done. Mercor, the company Katya stumbled upon, was founded in 2023 by three then-19-year-olds from the Bay Area, Brendan Foody, Adarsh Hiremath, and Surya Midha, as a jobs platform that used AI interviews to match overseas engineers with tech companies. The company received so many inquiries from AI developers seeking professionals to produce training data that it decided to adapt. Last year, Mercor was valued at $10 billion, making its trio of founders the world’s youngest self-made billionaires. OpenAI has been a client; so has Anthropic.

Each of these data companies touts its stable of pedigreed experts. Mercor says around 30,000 professionals work on its platform each week, while Scale AI claims to have more than 700,000 “M.A.’s, Ph.D.’s, and college graduates.” Surge AI advertises its Supreme Court litigators, McKinsey principals, and platinum recording artists. These companies are hiring people with experience in law, finance, and coding, all areas where AI is making rapid inroads. But they’re also hiring people to produce data for practically any job you can imagine. Job listings seek chefs, management consultants, wildlife-conservation scientists, archivists, private investigators, police sergeants, reporters, teachers, and rental-counter clerks. One recent job ad called for experts in “North American early to mid-teen humor” who can, among other requirements, “explain humor using clear, logical language, including references to North American slang, trends, and social norms.” It is, as one industry veteran put it, the largest harvesting of human expertise ever attempted.

These companies have found rich recruiting ground among the growing ranks of the highly educated and underemployed. Aside from the 2008 financial crash and the pandemic, hiring is at its lowest point in decades. This past August, the early-career job-search platform Handshake found that job postings on the site had declined more than 16 percent compared with the year before and that listings were receiving 26 percent more applications. Meanwhile, Handshake launched an initiative last year connecting job seekers with roles producing AI training data. “As AI reshapes the future of work,” the company wrote, announcing the program, “we have the responsibility to rethink, educate, and prepare our network to navigate careers and participate in the AI economy.”

There is an underlying tension between the predictions of generally intelligent systems that can replace much of human cognitive labor and the money AI labs are actually spending on data to automate one task at a time. It is the difference between a future of abrupt mass unemployment and something more subtle but potentially just as disruptive: a future in which a growing number of people find work teaching AI to do the work they once did. The first wave of these workers consists of software engineers, graphic designers, writers, and other professionals in fields where the new training techniques are proving effective. They find themselves in a surreal situation, competing for precarious gigs pantomiming the careers they’d hoped to have.

Each of the more than 30 workers I spoke with occupied a position along a vast and growing data-supply chain. There are people crafting checklists that define a good chatbot response, typically called “rubrics,” and other people grading those rubrics. Others grade chatbot answers according to those rubrics, and still others take the rubrics and write out what’s often described as a “golden output,” or the ideal chatbot answer. Others are asked to explain every step they took to arrive at this golden output in the voice of a chatbot thinking to itself, producing what’s called a “reasoning trace” for AI to follow later when it encounters a similar task out in the real world.
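To make the vocabulary concrete, here is a minimal sketch of how a rubric and a golden output might relate. Everything in it is an assumption for illustration: the `Criterion` class, the `grade` function, the three criteria, and the sample responses are invented, and in the work described here the judgments are made by human graders rather than by simple string checks.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    """One checklist item in a hypothetical rubric."""
    description: str
    is_met: Callable[[str], bool]  # stand-in for a human grader's judgment


def grade(response: str, rubric: List[Criterion]) -> float:
    """Return the fraction of rubric criteria that the response satisfies."""
    met = sum(1 for criterion in rubric if criterion.is_met(response))
    return met / len(rubric)


# Invented rubric for a prompt asking a chatbot for a weekly fitness plan.
rubric = [
    Criterion("Mentions a rest day", lambda r: "rest day" in r.lower()),
    Criterion("Stays under 50 words", lambda r: len(r.split()) <= 50),
    Criterion("Avoids medical claims", lambda r: "cure" not in r.lower()),
]

# A "golden output" is written to satisfy every criterion.
golden_output = "Train three days a week, walk on two, and keep one rest day."
weak_output = "This plan will cure your back pain."

print(grade(golden_output, rubric))
print(grade(weak_output, rubric))
```

The golden output meets all three criteria, while the weak response meets only the word limit, so the rubric turns a subjective sense of "good" into a score that can be compared across answers.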

Sometimes the labs want only rubrics for prompts their AI can’t already do, which means companies like Mercor ask workers to produce “stumpers,” or requests that will make the model fail. “It sounds easy, but it’s really hard,” says a worker who was trying to stump models by asking them to make inventory-management dashboards. Models fail in counterintuitive ways. They may be able to solve advanced-physics exam questions, but ask them for transit directions and they’ll recommend transferring on nonconnecting train lines. Finding these weak spots takes time and creativity.

One type of project gathers groups of lawyers, human-resources managers, teachers, consultants, or bankers for something Mercor calls world-building. “You and your team will role-play a real-life team within your profession,” the training materials read. The teams are given dedicated emails, calendars, and chat apps and asked to create a hundred or more documents that would be associated with some corporate undertaking, like a fictional mining company analyzing whether to enter the data-center business.

After several 16-hour days of fantasy document production, one worker recounts, the resulting slide decks, meeting notes, and financial forecasts are sent to another team, which uses them as grist in their attempts to stump a model operating in this simulated corporate environment. Then, having stumped the model, that team writes new, more nuanced rubrics, golden answers, and so on. Workers can only guess who the customer is or how many others are working on the project — based on references to teams like Management Consulting World No. 133, there could be hundreds, maybe thousands.

There are people hired to evaluate the ability of image models to follow their prompts and others who summarize video clips in extraordinary detail, presumably to train video models. Efforts to improve AI’s ability to have spoken conversations have resulted in a surging demand for voice actors, who might find themselves recording “authentic, emotionally resonant” speeches, according to one listing. “I just tell people I’m an AI trainer, then it sounds more professional than what I’m doing,” says an aspiring screenwriter who was instructed to record himself pretending to ask a chatbot for a fitness plan while pots and pans clanged in the kitchen. Another time, he was told to record himself dispensing financial advice over the phone to a parade of people he assumed were other workers.

This audio might then be broken down and sent to someone like Ernest, who used to make a living as an online tutor until the company he worked for replaced him with a chatbot. When we spoke, he was listening to minutelong clips of random dialogue slowed to 0.1x speed and marking when someone started and stopped speaking down to the millisecond. Many of the clips included a person talking with a chatbot and interjecting “huh” or “I see,” so he assumes he was improving AI’s ability to have naturally flowing conversation, but he has no actual idea.

As is standard practice in the field, the project was referred to by a codename and the client only ever as “the client.” The entire system is designed so that workers have minimal insight into the supply chain they are part of. If they find out who the customer is, they are contractually forbidden from telling anyone, even their own colleagues. Nor are they allowed to describe the details of their work beyond broad generalities like “providing expertise in XYZ domain to improve models for a top AI lab,” according to one Mercor agreement. So afraid are workers of inadvertently violating their confidentiality agreements and getting fired that when they discuss their work in public forums, they mask their already codenamed projects with additional codenames, for example by referring to a project called “Raven” as “Poe.”

“I’m being handed a shovel and told to dig my own grave.”

Katya’s second project with Mercor was far more stressful. There was less work to go around, and it came in fits and starts. Managers would drop a message in the Slack channel saying new tasks were incoming in half an hour, and, she says, “everyone in Slack would drop what they were doing and jump on them like piranhas,” working as fast as they could while the bar showing how many tasks remained slid toward zero. Then they were back in Slack again, politely begging supervisors for more work and more hours, talking about their kids’ birthdays or their need to pay rent, or telling anyone who might be listening that their availability was wide open in case there was more work to be done. Soon, Katya was dropping everything at the sound of a Slack ding too. “Sometimes I’m on the toilet or at dinner and I get the Slack notification. I’m like, ‘Oh, sorry, I gotta work now.’”

That project soon ended and then came another. It was nearly identical to the first, which she had enjoyed, but now, on top of writing rubrics, she had to stump the model and complete the more difficult task in the same amount of time. She was also getting paid $8 an hour less. This is common at Mercor. Nearly every worker I spoke with reported that demands increased, time requirements shrank, and pay decreased as projects continued. Those who couldn’t meet the new demands got “offboarded” and replaced by new recruits.

Chris joined Mercor last year, after a difficult few months struggling to find film work. Unlike many people who suspect they’re casualties of automation, he knew for certain that this was the case. He’d had a recurring job drafting episodes for an unscripted television show — doing preinterviews, sketching scenes, writing the reality TV equivalent of a screenplay. But in late 2024, he was told the show would be running on a “skeleton crew” and his work was no longer needed. He found out later the company was using ChatGPT to draft new episodes. So that October, when Chris received an offer to write an entire sci-fi screenplay for a major AI company, he said “yes,” grim as the prospect was. Since then, he has gone from gig to gig. “This is my only source of income right now,” he says. “I know people who are award-winning producers and directors, and they’re not advertising that they’re doing this work, but that’s how they’re putting food on the table.”

His first jobs with Mercor were, like Katya’s, relatively pleasant and well paid, but soon came the 6PM fist-bump-emoji Slack exhortations to “come on team, let’s push through this,” followed by sudden halts and months of silence. “You were just constantly waiting for the crack of the starting gun at any hour of the day,” Chris says. Then it was crunch time again and managers, increasingly panicked as deadlines neared, started threatening workers with offboarding if they didn’t complete tasks quickly enough.

The time he spent working was tracked to the second by software called Insightful, which monitored everything he did on his computer. Time that the software deemed “unproductive” could be deducted from his pay, and if a few minutes passed without him typing, the system pinged him to ask whether he had been working. Sometimes Chris saw people post in Slack that they’d gone over the target time on a particularly tricky task and that they hoped it would be okay; the next day, they would be gone.

Increasingly worried he would be offboarded too, he started working off the clock, deactivating Insightful while reading instructions so he could move faster. If he went over the target time, he turned the clock off and kept working for free.

Companies say this software is necessary to accurately track hours and prevent workers from cheating, which, in this case, means using AI, something all data companies strictly forbid. The ground truth of verified human expertise is what they’re selling, and when AI trains on AI-generated data, it gradually degrades, a phenomenon researchers call “model collapse.” Employees of data companies say it is a constant battle to screen out AI slop. For workers, AI is a particular temptation as pressure increases. When the retail expert trying to stump models with analytics dashboards had her target time dropped from eight hours per task to five to three and a half, she turned off Insightful and sought outside help. “To be honest, I went into Copilot and ChatGPT and put my prompt in there and said, ‘How can I work this so you guys can’t answer it?’” Then she went to another chatbot and asked if the prompt sounded AI generated and, if so, to make it sound more human.

“It’s just so horrible, the mental effect of it,” says Mimi, a screenwriter who has worked on multiple streaming shows and has been training AI for Mercor for several months. She found out about Mercor from a fellow screenwriter who dropped one of its job links in a Writers Guild of America Facebook group.

Like a lot of people in this line of work, Mimi is conflicted. “One documentary-maker who’s won Emmys, he messaged me and he was like, ‘I’m being handed a shovel and told to dig my own grave,’ and that’s exactly how everyone thinks about it,” she says. Still, as a single mom, she needed the money. She was thankful for the work at first, then the project was paused, unpaused, and paused again. For five weeks, she was told a project would be starting imminently. When it finally did, requirements were added, while the expected time shortened, and she raced to keep up under the watchful eye of Insightful. She felt that someone put it well on Slack when they said it was like they were living in a fishbowl waiting for their human masters to drop in food, and only the ones who were fast enough to swim to the top could eat.

“Last night, I got so fucking stressed because my kid came home and it was 7PM, and I get this message, ‘The tasks are out!’ and I’m just working, just trying to get as many hours in before I can go to bed,” Mimi says, choking up. “I spend no time with my kid, and at one point, he can’t find something for school and I just start screaming at him. This work is turning me into a fucking demon.” She’s especially disturbed by the surveillance: “The idea that somebody can measure your time and that all the little bits that go into being a human are taken away because they’re not profitable, that you can’t charge for going to the toilet because that’s not time you’re working, you can’t charge for making a cup of coffee because that’s not time you’re working, you can’t charge for having a stretch because your back hurts. This is why unions were formed, so people could have guaranteed hours and guaranteed lunch breaks and guaranteed holidays and sick pay. This is the gig economy to the very extreme.”

This is what concerns her more than the AI itself: that it’s bringing to knowledge work the sort of precarious platform labor that has transformed taxi driving and food delivery. Meanwhile, she watches in horror the desperate gratitude of her colleagues as they rejoice at the 7PM announcement of incoming work.

“How long are these tasks expected to last?” one worker asked in Slack.

“I’m wondering too, I’d like to know whether I can sleep or not.”

With no answer forthcoming, they swapped tips on how to stave off sleep.

“Nobody knows what’s going on. Everybody’s really confused.”

When Mercor began recruiting aggressively last year, it framed itself as a more worker-friendly version of the platforms that had come before it. Criticizing his rival Scale AI on a podcast, Foody, Mercor’s CEO, said, “Having phenomenal people that you treat incredibly well is the most important thing in this market.” Workers who joined during this time do report being treated well; the pay was better than elsewhere, and instead of being managed by opaque algorithms, as is common, there were actual human supervisors they could go to with questions.

But people who have worked in management at data companies say they often start out this way, wooing workers off incumbent platforms with promises of better treatment, only for conditions to degrade as they compete to win eight-figure contracts doled out by the half-dozen AI companies who are interested in buying this data in bulk. At Mercor, there was the additional complication of management largely consisting of people in their 20s with minimal work experience who had been given hundreds of millions of investor dollars to pursue rapid growth.

“I don’t care if somebody’s 21 and they’re my manager,” says Chris, the reality TV producer. “But they’ve never worked at this scale. When you try to find some kind of guidance in Slack, very maturely and clearly explaining what the situation is, you get a meme back with a corgi rolling its eyes and it says, ‘Use your judgment.’ But it’s like, ‘Use your judgment and fuck it up, and you get fired.’ You went to Harvard, you graduated last year, and your guidance for a group of people, many of whom are experienced professionals, is a meme?”

Lawyers, designers, producers, writers, scientists — all complained of inexperienced managers giving contradictory instructions, demanding long hours or mandatory Zoom meetings for ostensibly flexible work, and threatening people with offboarding for moving too slowly, threats that were particularly galling for mid-career professionals who felt their 20-year-old bosses barely understood the fields they were trying to automate.

“The founders pride themselves on ‘9-9-6,’” says a lawyer, referring to a term that originated in China to describe 72-hour workweeks associated with burnout and suicide but has been appropriated by Silicon Valley as aspirational. “You need to be accessible at all hours, and they’re going to pump out messages at 6AM, and you better jump because the perception is you will be offboarded and another person will replace you.”

“It’s not just that team leads are young, project managers are young, senior project managers are young. It’s that the senior-senior project managers, the ones responsible for the project in its entirety, are young. I guess that comes from the top because they’re young, right?” says Lindsay, a graphic designer and illustrator in her 50s who came to Mercor after 85 percent of her work evaporated over the past year, owing, she believes, to improvements in generative AI.

Increasingly desperate for work, she scoured job boards; it seemed the only listings matching her expertise were offers to help build the technology she blamed for demolishing her career. “I swallowed my hatred and signed up,” she says. After some initial work producing graphic-design data, she was invited to join a job for Meta grabbing videos from Instagram Reels and tagging whatever was in them. It was boring, and at $21 per hour, the pay was middling, but Lindsay needed the money. So, she discovered when she was brought into the project’s Slack, did approximately 5,000 others.

In early November, a Mercor representative announced that Lindsay’s project would be ending owing to “scope changes,” though workers had previously been told the project would run through the end of the year. Lindsay and thousands of others found themselves removed from the company’s Slack.

Soon, an email arrived in their inboxes, inviting them to a new project called Nova paying $16 per hour.

Thousands of workers poured into the new Slack only to discover it was the exact same job, now paying 24 percent less. All but two of the Slack channels had been deleted, including the watercooler, support, and help rooms. The ability to direct-message one another had also been cut off. There were no team leads to be found. With no one to ask for assistance, workers flooded the main rooms with pleas and indignation.

“Nobody knows what’s going on. Everybody’s really confused,” says Lindsay. “The messages are coming so fast in that channel. It’s just absolute chaos. ‘Help, please. What do I do? What am I supposed to do? Where do I go? Can I get started tasking? Am I supposed to redo all the assessments that I’ve done before?’”

Someone emailed support asking for help, and for some reason that email was sent to every one of the thousand-some people on the project, who seized on it and began to reply-all with their bafflement and outrage. “It was absolute carnage,” says Lindsay. “There’s no other word for it.”

Workers began posting complaints on Mercor’s subreddit, only to have their posts quickly deleted by the Mercor representatives who moderate it. In response, two unsanctioned Mercor subreddits were created, where workers could freely express such sentiments as “CHILDREN RUN THIS COMPANY, THEY WILL SOON HAVE THEIR DAY OF RECKONING.”

“It’s just really sad,” says Lindsay. “There are some people in there where it’s genuinely the difference between them being able to feed their families and not feed their families.”

“I hate gen AI,” she adds. “I think AI should be used for curing cancer. I think it should be used for space exploration, not in the creative industries. But I need to be able to pay my rent. And then when people like Mercor pull this stuff where they treat you like nothing more than a lab rat — I’ve been working for a very long time. I have never, ever been treated as badly as this.”

Intermittent work, extreme secrecy, and abrupt firings are the norm across the data industry. On Surge AI’s work platform, called Data Annotation Tech, workers are not only regularly terminated without explanation; they are often not even told they’ve been fired. They just log in one day and find the dashboard empty of tasks. The phenomenon is so ubiquitous they call it simply “the dash of death.”

Last year, a Texan with a master’s degree in divinity who was teaching voice models to respond to queries with appropriate levels of feeling — different tones for a user telling them their dog died versus asking for a trip itinerary — logged in to work one morning and found his dashboard empty. Scrolling to the bottom of the page for the support button, he discovered it no longer worked. That’s when he knew he had been terminated. His mind raced through possible reasons: Had he worked too much? Had his quality slipped? He knew he would never find out. “I felt cut adrift,” he says. Anxious about how he would pay his bills and care for his ailing dog, he grew depressed, then horrified. He thought about his teacher friends who couldn’t get their students to write and all the people graduating with now-worthless computer-science degrees. “The technology makes us see everything as a utility, something to be used,” he says, a category that he feels includes discarded data workers like himself. He resolved to become a chaplain, figuring that no matter what the AI future holds, people will need a fellow human to be there for them.

The on-again, off-again nature of the work is not just the result of company culture; it stems from the cadence of AI development itself. People across the industry described the pattern. A model builder, like OpenAI or Anthropic, discovers that its model is weak on chemistry, so it pays a data vendor like Mercor or Scale AI to find chemists to make data. The chemists do tasks until there is a sufficient quantity for a batch to go back to the lab, and the job is paused until the lab sees how the data affects the model. Maybe the lab moves forward, but this time, it’s asking for a slightly different type of data. When the job resumes, the vendor discovers the new instructions make the tasks take longer, which means the cost estimate the vendor gave the lab is now wrong, which means the vendor cuts pay or tries to get workers to move faster. The new batch of data is delivered, and the job is paused once more. Maybe the lab changes its data requirements again, discovers it has enough data, and ends the project or decides to go with another vendor entirely. Maybe now the lab wants only organic chemists and everyone without the relevant background gets taken off the project. Next, it’s biology data that’s in demand, or architectural sketches, or K–12 syllabus design.

To compete, data companies arrange things so that they will always have workers on call while preserving their freedom to drop them at a moment’s notice. “Every vendor is going to have some kind of setup whereby they don’t really make promises to people,” says a senior employee of a major data company. The companies rarely have much notice of these shifts themselves, sometimes because the AI developers aren’t sure exactly what data they need in the first place, other times because they are shopping around for the best deal. “They want to keep us in the dark,” the employee continues, “so we inevitably keep the contributors in the dark, then a purchase falls through and you have a thousand people you’ve trained and formed a relationship with just saying, like, ‘What the fuck? Why isn’t there work?’ It’s a horrible feeling from an operator’s perspective, too, but obviously it’s way worse for them.”

The workers at the bottom of this supply chain exist in a state of extreme precarity and maximum competitive frenzy — especially because their strict confidentiality agreements make it impossible for them to establish any kind of seniority or relationship that might outlast a particular project. “The power is all on one side because they can’t talk about it,” says Matthew McMullen, a strategy and operations executive who has worked in the industry since the self-driving-car boom in the mid-2010s. “The labs benefit from you not being able to leverage your experience in the market, and this silence is like their pricing power. The silence is their ability to extract mass information from people without giving them the power to object or to unionize or to make companies themselves. As long as they can’t prove what they’ve done, these raters can’t demand what they’re worth. The only way that people can demand things is by showing their ability to step up, to take on more work. The only power that they have is to keep going, to get back in line.”

Which is what they do. When a project for Mercor ends, managers often post a link to other projects on the platform and encourage people to apply. “But again, there are thousands of people applying, so you throw your application into a hole and hope to hear back at some undefined point,” says Katya. While they wait, workers sign up for Handshake, Micro1, Alignerr, or another of the ever-growing number of data providers.

These companies are always recruiting. Like Mercor, many use AI interviewers and automated evaluations, meaning they have no incentive to limit the number of interviews they do. Mercor offers referral bonuses of several hundred dollars, leading some to promote the company so aggressively that mentions of it have been banned from several subreddits. Katya has applied for dozens of jobs and gotten three, not an unusual ratio.

Nor do companies bear any cost for overhiring. Because workers are ostensibly independent contractors, they are not owed paid time off, breaks, healthcare, overtime pay, or unemployment benefits. It’s free to keep them hanging around, and a surplus of vetted workers ensures they will jump quickly to finish tasks before someone else does. It all combines to create an arrangement in which employers can turn labor on and off like a tap. (Reached for comment, Mercor spokesperson Heidi Hagberg said that “the nature of this is project based contract work, meaning it can extend, pause, or end at any time, especially as the client’s scopes and needs evolve,” and that many of the worker complaints “were centered around the misalignment of expectations of a full-time job versus project-based work.”)

If you move fast and get lucky and have the right combination of expertise and stay on the right side of each platform’s unique and mysterious recipe of productivity metrics, you can make decent money. I spoke to a playwright making $10,000 a month and a multitalented chemist who at various points found gigs demonstrating poker and singing for AI. But even then, there is an inescapable awareness of ephemerality because producing training data means working toward your own obsolescence. While the number of people doing data work may continue to rise, any particular gig will last only as long as it takes for the machines to successfully mimic it. It takes years for a human to develop expertise, and sooner or later, they’re going to run out of skills to sell.

A worker with a master’s in linguistics had found steady rubric work for a year, but late in 2025, he noticed it was becoming more difficult to stump the models. Any obscure theory or Indigenous language he asked about, the model would find the correct papers. Instead of submitting three or four rubrics per week, he was lucky to get one. Everyone else on the project was following the same trajectory, so he wasn’t surprised when it came to an end. Their know-how had been extracted. In the past, he’d always been able to find a new gig, but now when he looked around, he saw only requests for medical experts, human-resources managers, and teachers. He has now been without work for five months and isn’t sure what to do next.

To the extent that policy responses to AI automation are discussed at all, they mostly concern what to do when AI renders large categories of workers obsolete. Maybe this will happen, but another possibility is that particular tasks will get automated and humans redistributed to other parts of the production process, some revising so-so AI output, others crafting rubrics to improve it. Much of this work will be inherently intermittent, which means it will be done by independent contractors, workers whom current regulations leave almost wholly unprotected. Daron Acemoglu, a professor of economics at MIT who studies automation, compares the situation to that of weavers, who before the industrial revolution were “like the labor aristocracy,” self-employed artisans in control of their own time. Then came weaving machines, and in order to survive, they were forced to take new jobs in factories, where they worked longer hours for less money under the close supervision of management. The problem wasn’t simply that technology took their jobs; it enabled a new organization of work that gave all power to the owners of capital, who made work a nightmare until labor organizing and regulation set limits.

Early labor skirmishes are already happening, mostly in California, which has some of the most aggressive rules around classifying platform workers. Three class-action lawsuits have been filed against Mercor in the past six months. (Similar suits were previously filed against Surge AI and Scale AI, which is settling.) The lawsuits all accuse the companies of misclassifying workers as independent contractors given the “extraordinary control” they exert over them. This is “an entirely new kind of work,” one that the company trains people to do and that cannot be done except on the company’s platform. Workers have so little visibility into what they’re working on that one person, alleges a suit filed in December, accepted a Mercor project only to be tasked with recording himself reading sexually explicit scripts. Once he discovered this, the worker risked deactivation if he abandoned the project, forcing him to “choose between being paid and being humiliated.”

These companies are reminiscent of Uber and Lyft a decade ago, says Glenn Danas, a partner at the law firm Clarkson, which is suing Mercor and several other data platforms. Yet in some ways these workers are in a worse position, more replaceable despite their advanced degrees. Uber drivers have to be physically present in a city to work, and they can organize and push for regulation there. If the same were to happen with data workers, companies could just recruit from somewhere else where people will work for less. When Mercor cut pay for its Meta project to $16 per hour, it dropped below the minimum wage in California and other states, yet people there kept working because they needed the money. This was something at least one supervisor acknowledged, writing in Slack, “While we won’t actively hire from any states where the minimum wage is above the project’s rate, if you are already active on the project and would like to work at the $16/hr rate, we want to enable you to do so.”

Entire professions risk a similar race to the bottom, says Acemoglu, if companies are able to pit workers against one another, each selling their data before someone else can underbid them. “We may also need unionlike organizations that exercise some sort of collective ownership and prevent any kind of simple divide-and-rule strategies by large companies to drive down data prices,” he says. “If there isn’t the legal infrastructure for a data economy of this sort, many of the people who produce the data will be underpaid or, to use a more loaded term, exploited.”

Katya was among the thousands of people invited to join the $16-an-hour Project Nova and was appalled by the low pay. “I think that was Mercor’s experiment in how close to the bottom they can scrape without jeopardizing the data that they’re getting,” she says. Her main project had been paused for weeks and might resume the next day or never.

In the end, she decided the money wasn’t worth it. She applied to work at a local coffee shop. It wasn’t the career pivot she’d imagined when she went to grad school; she just hoped working as a barista would be more stable. “At least when you work at a coffee shop for minimum wage, you have some friends to talk to and a boss who pretends to care about you. You have some kind of security; you know what your hours are going to be week to week,” she says.

But then she heard her phone ding. One of her projects was back on.

How many AIs does it take to read a PDF?

2026-02-23T06:04:40-05:00

Image: Kristen Radtke / The Verge

Last November, the House Oversight Committee had just released 20,000 pages of documents from the estate of Jeffrey Epstein, and Luke Igel and some friends were clicking around, trying to follow the threads of conversation through garbled email threads and a PDF viewer that was, frankly, “gross.” In the coming months, the Department of Justice would release its own batches of files, more than three million of them — again, all PDFs.

This was a problem. While the Department of Justice had run optical character recognition over the text, it was not very good, Igel said, rendering the files more or less unsearchable.

“There was no interface the government put out that allowed you to actually see any sort of summary of things like flights, things like calendar events, things like text messages. There was no real index. You just had to get lucky and hope that the document ID that you were looking at contains what you’re looking for,” said Igel, cofounder of the AI video editing startup Kino. What if, Igel thought, they built a Gmail clone to view and search all this correspondence in a more intuitive way?

To do this, they would need to extract the information contained in PDFs, which is far less straightforward than it might sound. Despite rapid progress in AI’s ability to build complex software and solve advanced physics problems, the ubiquitous format of PDF remains something of a grand challenge. Edwin Chen, the CEO of the data company Surge, includes it among AI’s “unsexy failures” limiting real-world usefulness. Last year, he found that even state-of-the-art models asked to extract information from a PDF will instead summarize it, confuse footnotes with body text, or outright hallucinate contents. In a half-joking timeline of AI development, the researcher Pierre-Carl Langlais placed “PDF parsing is solved!” shortly before AGI.

First, Igel’s friend, the “tech jester” Riley Walz, used his remaining credits on Google’s Gemini. It only worked reliably for some of the cleanest scans, and would be prohibitively expensive to run on millions of documents anyway, so Igel reached out to his former MIT classmate Adit Abraham, who happened to work in the office above his, where he ran a PDF-parsing AI company called Reducto.

Reducto, one of several companies trying to solve PDFs, was able to extract information from email threads with cryptic decoding errors, heavily redacted call logs, and low-quality scans of handwritten flight manifests. After the data was exported in a usable format, Igel and Walz went on a building spree, creating essentially a full Epstein-themed app ecosystem: Jmail, an unsettling, searchable prototype of Epstein’s inbox; Jflights, an interactive globe crisscrossed with flight paths, each one clickable to view underlying PDFs of flight data, passenger manifests, and scanned email invitations; Jamazon, to search Epstein’s Amazon purchases; and Jikipedia, to search businesses and people who turn up in the files, citing, naturally, more PDFs.

“That’s where the magic of extracting information of PDFs became real for me,” Igel said. “It’s going to completely change the way a lot of jobs happen.”

PDFs are notoriously difficult for machines to parse, in part, because they were never meant to be read by them. The format was developed by Adobe in the early 1990s as a way to reproduce documents while preserving their precise visual appearance, first when printing them on paper, then later when depicting them on a screen. Where formats like HTML represent text in logical order, PDF consists of character codes, coordinates, and other instructions for painting an image of a page.

Optical character recognition (OCR) can turn those pictures of words back into text computers can use, but if it comes across a PDF where text is displayed in multiple columns — as many academic papers are — it will plow ahead left to right and create an unintelligible jumble. OCR tools are designed to detect and correct for these sorts of formatting variations, but tables, images, diagrams, captions, footnotes, and headers all present further obstacles. If you give an AI assistant like ChatGPT a PDF, it will cycle through a variety of these tools, sometimes fail, sometimes pass the PDF to a large vision model to perform OCR, sometimes hallucinate, and generally take a very long time and use a lot of computing power for uneven results.
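The column problem can be made concrete with a minimal sketch in Python. Everything here is hypothetical and invented for illustration: the `Word` tuple, the coordinates, and the `gutter_x` parameter marking the gap between columns. Real PDF and OCR tools rely on far more elaborate layout analysis, and nothing below reflects how any particular product works.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # horizontal position on the page (hypothetical units)
    y: float  # vertical position, measured from the top of the page


def naive_order(words: list[Word]) -> str:
    """Read row by row, left to right, ignoring columns entirely."""
    return " ".join(w.text for w in sorted(words, key=lambda w: (w.y, w.x)))


def column_aware_order(words: list[Word], gutter_x: float) -> str:
    """Read the whole left column first, then the right column.

    gutter_x is an assumed, already-known x position of the column gap.
    """
    left = sorted((w for w in words if w.x < gutter_x), key=lambda w: (w.y, w.x))
    right = sorted((w for w in words if w.x >= gutter_x), key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in left + right)


# A toy two-column page: each column holds one sentence spread over two lines.
page = [
    Word("PDFs", 10, 0), Word("store", 50, 0),
    Word("painted", 10, 10), Word("glyphs.", 70, 10),
    Word("Columns", 300, 0), Word("confuse", 370, 0),
    Word("naive", 300, 10), Word("readers.", 350, 10),
]

print(naive_order(page))
# PDFs store Columns confuse painted glyphs. naive readers.
print(column_aware_order(page, gutter_x=200))
# PDFs store painted glyphs. Columns confuse naive readers.
```

The hard part in practice is the step this sketch assumes away: knowing where the columns are in the first place.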

“The key issue is that they cannot recognize editorial structure,” said Langlais. “It’s all fine while it’s relatively simple text, but then you’ve got all these tables, you’ve got forms. A PDF is part of some kind of textual culture with norms that it needs to understand.”

A further problem that arises from and compounds PDF’s inherent difficulty is that models rarely train on them. This has begun to change, partly because AI developers are increasingly desperate for high-quality data, and PDFs contain a disproportionate amount of it. Government reports, textbooks, academic papers — all PDFs. “PDF documents have the potential to provide trillions of novel, high-quality tokens for training language models,” wrote researchers at the Allen Institute for AI last year in a paper announcing a new specialized PDF-reading model.

“The lore has it that the very first PDF ever was an IRS 1040,” said Duff Johnson, CEO of the PDF Association, the industry organization that helps develop the PDF global standard, ISO 32000-2:2020, itself a PDF nearly a thousand pages long. In 1994, the IRS wanted a way to share forms that were absolutely consistent without printing and mailing every possible document, so it mailed CDs full of PDFs instead. From there, PDF spread with email to become a fundamental component of digital work. Book publishers sending manuscripts to the printer, patent applicants submitting diagrams of new devices, anyone who needed to share a document that would look the same to whomever received it turned to PDF.

“There’s no other technology solving the problem the PDF solves,” said Duff. Websites are temporary, appearing differently depending on the browser, mediated by CSS. Links rot. Word docs change depending on your machine and can be edited and overwritten. A PDF is the same no matter who opens it, when, or how.

“That’s what engineering companies need. That’s what lawyers need. That’s what governments need. That’s what anybody who’s doing anything in the world, who has records to maintain, they need that,” Duff said. “Earlier today I opened up a PDF from 1995. I didn’t worry about it. I just opened it. It worked fine. It worked perfectly. I would expect no less.” (It was a PDF about PDFs.)

There has been a shift over the last year or so toward specialized PDF-parsing models, said Luca Soldaini, a researcher at the Allen Institute for AI who worked on their PDF model, olmOCR. They trained a vision language model — like a large language model, but with pixels instead of word tokens — on about 100,000 PDFs: public domain books, academic papers, brochures, documents from the Library of Congress with human-written transcriptions. The model was further trained to optimize specific problem areas, like parsing tables without mixing up the rows and columns.

“If text is large on the page, the model will learn to say, ‘Oh, that’s probably a header,’” said Soldaini. The model was the most popular one the institute released last year, Soldaini said, rivaling the institute’s generalist models. A PDF reading AI doesn’t capture the spotlight like those models, Soldaini said, but people are actually using it.

A few months later, researchers at Hugging Face, the company that runs a popular open-source AI platform, had just published a 5 billion-document dataset for training multilingual models and were thinking about what to do next. They had already processed the whole of Common Crawl, the enormous archive of mostly HTML text scraped from the web that forms the foundation of many large language models. Like many AI researchers, Hugging Face’s Hynek Kydlíček recalled, they were wondering whether they had run out of easily available data.

“We thought, let’s look at the Common Crawl and, like, maybe there is more stuff we just haven’t seen,” said Kydlíček. Indeed, there was: roughly 1.3 billion PDFs. “That’s how we figured out that PDFs could be actually a super big and super high-quality source we can still train on,” Kydlíček said. “But the format of PDFs is, like, super super hard to extract text from.”

Kydlíček and his collaborators rigged up a system that separated PDFs into easy to parse — mostly text — and difficult to parse, full of images and charts. The hard PDFs were sent to a version of olmOCR that had been modified by Reducto, called RolmOCR. After they stripped out the PDFs of horse racing results that made up an inexplicably large quantity of the corpus, the team declared they had “liberated three trillion of the finest tokens,” now available for model training.

Yet parsing PDFs well enough for model training is one thing. Extracting them with the degree of accuracy demanded by lawyers and engineers is another. When the Hugging Face team did their first tests, they found their model would invent text when there wasn’t any, filling blank pages with nonsense and describing images and art. They trained it to correct these errors, but it’s impossible to anticipate every formatting oddity or off-kilter scan.

“It’s solved in like 98 percent of cases, and like in many areas you always have this problem of getting these last 2 percent,” Kydlíček said. “I would say OCR is one of the best economic use cases for visual language models, so there are a lot of eyes on it right now, a lot of people throwing a lot of resources onto this. So I’m very certain that we will improve fairly fast, but because all these language models are probabilistic, there is just no way to guarantee it will be correct.”

One of the teams doing the best work, Kydlíček said, is Reducto, the company Igel is using to parse the Epstein files. Abraham cofounded the company as a service that managed customers’ long-term histories with language models, similar to the “memory” feature that is now standard in chatbots. Abraham kept getting requests to manage people’s files as well, which naturally were in the form of PDFs. He found working with them to be “shockingly hard.”

“One of our core intuitions was all these documents were made for humans like you and I to interpret, and there’s a lot of visual information here that we take for granted, like that every gap between two paragraphs is me telling you, ‘Hey, this is a new idea.’ Every indentation is me telling you, ‘Hey, this is a sub idea of the parent idea.’ The question was like, how do you encode all of that context?”

Much of the team had a background in self-driving vehicles, where computer vision models “segment” data into entities like car, pedestrian, dumpster. They took a similar approach to PDFs, using a model to first divide the page into headers, tables, footnotes, and so on, before passing them to other specialized models for parsing. When they posted about their approach in early 2024, the response was immediate.

“This wasn’t supposed to be a pivot,” said Abraham. Other developers reached out to say that their progress had been stymied by PDFs. “It kind of spiraled from there.”

Reducto now uses a growing assortment of small, specialized models taking multiple passes to parse a PDF. When the segmenting model detects a table, it goes to a table-parsing model. If a chart is detected, different elements get sent to different models: one trained to extract axes, another to read legends, and so on. A vision language model then takes a pass on the output to correct errors. Using this approach, Reducto is able to turn charts into spreadsheets with a high degree of accuracy, something Abraham says the company’s financial clients have long requested and that stymies far larger frontier models.
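The general shape of that segment-then-dispatch idea can be sketched in a few lines of Python. This is not Reducto's code, and every name in it is hypothetical: the `Region` class, the `PARSERS` table, and the toy parser functions are plain stand-ins for what would really be trained models.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Region:
    kind: str     # label a (hypothetical) segmenting model would assign
    content: str  # stand-in for the pixels or text inside the region


def parse_table(region: Region) -> str:
    """Toy table parser: turns 'a | b; c | d' into CSV-style rows."""
    rows = [row.split("|") for row in region.content.split(";")]
    return "\n".join(",".join(cell.strip() for cell in row) for row in rows)


def parse_header(region: Region) -> str:
    """Toy header parser: uppercasing is an arbitrary placeholder behavior."""
    return region.content.strip().upper()


def parse_text(region: Region) -> str:
    """Fallback parser for any region kind without a specialist."""
    return region.content.strip()


# Registry mapping each region kind to its specialized parser.
PARSERS: dict[str, Callable[[Region], str]] = {
    "table": parse_table,
    "header": parse_header,
}


def parse_page(regions: list[Region]) -> list[str]:
    """Route each segmented region to the parser registered for its kind."""
    return [PARSERS.get(r.kind, parse_text)(r) for r in regions]


page = [
    Region("header", " quarterly results "),
    Region("paragraph", " Revenue rose. "),
    Region("table", "Q1 | 10; Q2 | 12"),
]
print(parse_page(page))
# ['QUARTERLY RESULTS', 'Revenue rose.', 'Q1,10\nQ2,12']
```

The appeal of the design is that a new document element only requires registering one more specialist, rather than retraining a single model to handle everything.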

Still, like self-driving cars, PDFs have a long tail of unusual challenges.

“There’s a big difference between getting a car to stay in a lane versus getting a car to handle whatever would show up on the street, and we see with PDFs a similar thing. I’ve seen the most insane documents you could imagine,” said Abraham. PDF files that contain other PDFs, legal documents with passages sometimes underlined and sometimes crossed out, faxes of medical forms that doctors have scrawled over and drawn lines connecting ideas on different edges of the page. “I don’t think PDFs are a fully solved problem. I wish that were the case. We’re close, but there’s still plenty to do.”

There will be no shortage of PDFs to parse. The format does not appear to be going anywhere. Why would it, asked Duff of the PDF Association, with some incredulity at the very thought. Companies once tried to unseat PDF, Duff said, but their products are “now a footnote in history,” while PDFs continue to proliferate.

“Look at the Google Trends for PDF,” Duff said. It shows a steadily rising curve (with dips in August) year after year. “No other technology looks like that. More and more people over time are including PDF in their searches, because that tends to be where the high-quality content is.

“What’s going to happen is that all the world’s systems will instead understand and use PDF better and better,” Duff said. “The AI companies didn’t focus on PDF, because PDF is very hard, until they realized that, well, it turns out a lot of the really high-quality stuff is in fact in PDF, and so now we have to deal with it.”

Why are Epstein’s emails full of equals signs?

2026-06-09T14:16:39-04:00

Many of the emails released by the Department of Justice from its investigation into Jeffrey Epstein are full of garbled symbols like:

Or:

The scrambled text is so ubiquitous that it’s spurred conspiracy theories that it could be some kind of code. But as believable as it might be that a cabal of elite sex traffickers would communicate in a secret language, the reality is probably more boring: The symbols are likely artifacts from the way the Department of Justice converted the emails to PDFs.

“The glyphs and symbols are probably some artifact of a poor conversion process,” said Chris Prom, professor and archivist at the University of Illinois Urbana-Champaign. Specifically, the symbols look like remnants of Multipurpose Internet Mail Extensions, or MIME, a 30-year-old standard for encoding emails. The protocol underlying email transmits messages as short strings of simple ASCII characters, so as people started writing longer messages and trying to include formatting and symbols, MIME was developed as a way of encoding them in ASCII.

With MIME, the “=” is used to signal either that a string of text should be broken for transmission and rejoined — a “soft line break” — or, when followed by two other characters, that it should be converted to a particular non-ASCII mark. If you wanted to actually write “=” in an email, for example, it would be encoded as “=3D.” During normal use, the recipient’s email client decodes these symbols before displaying the formatted message.
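The MIME encoding that uses “=” this way is called quoted-printable, and Python's standard library can decode it with the `quopri` module. The short sketch below uses an invented message, not text from the Epstein files, to show what a reader sees when the decoding step is skipped versus when it is done properly.

```python
import quopri

# An invented quoted-printable message body. It contains an encoded "=",
# an encoded "é" (two UTF-8 bytes), and a soft line break ("=" at line end).
raw = b"Meet at 5 =3D five o'clock, at the caf=C3=A9 on the cor=\nner."

# What a reader sees if the conversion software skips decoding.
undecoded = raw.decode("ascii")

# What a mail client normally shows after decoding.
decoded = quopri.decodestring(raw).decode("utf-8")

print(undecoded)
print(decoded)
# Meet at 5 = five o'clock, at the café on the corner.
```

The undecoded version is littered with equals signs and splits “corner” across two lines, which is the kind of residue a partly failed conversion would leave behind.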

Whatever software the Department of Justice used to extract the emails and convert them to PDFs appears to have mangled some of the decoding, said Peter Wyatt, the chief technology officer of the PDF Association, who examined a batch of the Epstein documents.

“It was in the news, and it was a whole lot of PDFs,” he said. The association performed similar analyses of the Mueller report and Manafort documents. “Generally speaking, we’re interested in anything to do with PDF. That’s kind of what we do and what we’re about.”

The clarity of the text and URLs led Wyatt to believe these documents were extracted digitally then converted to PDF, rather than physically printed and scanned, as the Mueller report was. “So things have improved since that time,” Wyatt said.

Specifically, the Department of Justice likely extracted the email data, converted it to PDF, then redacted it. In order to strip the document of metadata and bake in the redactions so that the black bars couldn’t be removed, they then converted the documents to image files like JPEG before converting them back into PDF. The software used to initially extract and convert the data also captured portions of the underlying MIME format instead of properly decoding it. Or more simply: emails, sometimes partially decoded, converted to PDF, converted to JPEG, converted to PDF.

That at least explains the profusion of “=”. But it doesn’t fully explain why the “=” sometimes replaces letters, like the “J” in “Jeffrey.” No one I spoke to could definitively answer this question, except to say that email is hard and converting it to PDF is harder, and the DoJ was converting a lot of documents in a hurry. (The redactions have been notably inconsistent throughout the files, too.)

Prom thought it might be a character set conversion problem, which he saw frequently when the archival tool he was testing couldn’t find the specific character set or font the email server was using.

Craig Ball, a forensic examiner who teaches at the University of Texas at Austin School of Law, pointed out that different email clients implement standards in slightly different ways, adding to the difficulty of conversion. “My hunch is that this is an incompatibility between the code pages used by the transmitting mail client (possibly a BlackBerry) and the application used to print the messages to PDF,” Ball wrote. “The presence of BlackBerry and iPhone signatures in these emails suggests the messages traversed multiple systems with different encoding practices, compounding the decoding issues during PDF generation.”

“You’re looking at hundreds of different methods of converting these files from hundreds of different people using whatever software they had available to them, some of which might have been good, some of which might not have been,” said Prom.

“The PDF standard is quite complex,” wrote Prom. “And email to PDF is particularly fraught.”

Jimmy Wales trusts the process

2026-06-09T14:16:21-04:00

Wikipedia will be 25 years old in January. During that time, the encyclopedia has gone from a punchline about the unreliability of online information to the factual foundation of the web. The project’s status as a trusted source of facts has made it a target of authoritarian governments and powerful individuals, who are attempting to undermine the site and threaten the volunteer editors who maintain it. (For more on this conflict and how Wikipedia is responding, you can read my feature from September.)

Now Wikipedia’s cofounder Jimmy Wales has written a new book, The Seven Rules of Trust: A Blueprint for Building Things That Last. In it, Wales describes a global decline in people’s trust in government, media, and each other, and looks to Wikipedia and other organizations for lessons about how trust can be maintained or recovered. Trust, he writes, is at its core an interpersonal assessment of someone’s reliability and is best thought of in personal terms, even at the scale of organizations. Transparency, reciprocity (you have to give trust to get trust), and a common purpose are other ingredients to which he attributes Wikipedia’s success.

We spoke over video call about his book, how Wikipedia handles contentious topics, and the threats facing the project and other fact-based institutions.

The interview has been condensed and edited for clarity.

The Verge: You wrote a book about trust, and a global crisis in trust. Can you tell me what that crisis is and how we got there?

Jimmy Wales: If you look at the Edelman Trust Barometer survey, which has been going since 2000, you’ve seen this steady erosion of trust in journalism and media and business and to some degree in each other. I think it gives rise in a business context to a lot of increased cost and complexity, and politically, I think it’s tied up with the rise of populism. So I think it’s important that we focus on this issue and think about, What’s gone wrong? How do we get back to a culture of trust?

What do you think has gone wrong?

I think there’s a number of things that have gone wrong. The trend actually goes back to before the Edelman data. Some of the things I would point to are the decline of the business model for local journalism. To the extent that the business model for journalism has been very difficult, full stop, you see the rise of low-quality outlets, clickbait headlines, all of that. But also that local piece means people aren’t necessarily getting information that they can verify with their own eyes, and I think that tends to undermine trust. In more recent times, obviously the toxicity of social media hasn’t been helpful.

Why has Wikipedia so far bucked that trend and continued to be fairly widely trusted?

Part of the rationale for writing the book is to say, “Look, Wikipedia has gone from being kind of a joke to one of the few things people trust, even though we’re far from perfect.” I think transparency is hugely important. The idea that Wikipedia is an open, collaborative system and you can come and see how decisions are made, you can join and participate in those decisions — that’s been very helpful. I think neutrality is really important. The idea that we shouldn’t take sides on controversial topics is one that resonates with a lot of people. I don’t want to come to an encyclopedia or frankly a newspaper and be told only one side of the story. I want to get the full picture so I can understand the situation for myself.

You brought up the Edelman survey and decline in trust in media, government, and to a lesser extent individuals. Are we seeing a decline in trust or a transfer of trust from institutions to individuals? In the book, you say we are hardwired to trust at an interpersonal level by gauging other people’s authenticity, which is a trait that plays very well on social platforms, where some very trusted figures also gain extra trust by telling their followers not to trust in the media, the FDA, the universities. Do you see this dynamic playing a role, and if you do, how has Wikipedia, which is an institution, continued to be trusted?

I think there’s some truth to that. But I also think it’s incomplete because I think a lot of people who support Donald Trump will also say they don’t really trust him. They just think it’s not relevant. They’ve sort of lost faith in the idea of people being honest. So they’re more likely to say, “All politicians lie, so why is that a big deal?” I obviously think it is a big deal. I think that’s very problematic.

Similarly, I think a lot of the people who are jumping on a bandwagon undermining trust in science, for example, basically see a way to get successful doing it. I mean, that’s a pretty cynical view of those particular people, and I’m not a very cynical person, but it’s hard to come to any other conclusion sometimes, that there’s a lot of grifting going on.

I interviewed Frances Frei for the book, and she’s a Harvard academic who also has business experience. One of the things she said to me was, people often say that once you’ve lost trust — that’s it, you’ll never get it back. And she says that’s not true. You can rebuild trust. There are certain definable things that organizations and people can do to rebuild trust. So when we think about institutions being attacked, they probably should reflect on what made them vulnerable.

You have some examples in the book, like the back-and-forth about masking and covid, and obviously journalists do make errors. But I tend to think that most publications are fairly transparent about issuing corrections, though maybe not to the level of Wikipedia. How much of the decline in trust has to do with actual mistakes made by those institutions, versus people or groups that want to be able to define their own reality undermining what they see as rival centers of facts, whether that’s academia or science or journalism?

I absolutely think it’s both. In many cases, we have seen media with a real blind spot, and I typically would view it more often as a blind spot problem, rather than deliberately being biased. I live in London. All three of the major political parties were all opposed to Brexit, and in London you could not really find anybody who was openly supporting Brexit, not among my social group. Everybody thought it was a completely ridiculous idea. And yet the public voted for it.

I think a big part of that was that London wasn’t listening and the media tended too often to portray Brexit support as having to do with racism and so on. Which, of course, if that’s how you come at people, they tend to not go, “Oh, you’re right, I’m sorry. I’m going to stop being racist now and change my political views.” They’re more likely to say, “Hold on a minute, you’re not listening to me. I’m not being racist. There are these problems, functional problems, and I don’t think I’m being listened to.” To the extent the media isn’t representative of broader segments of society and isn’t listening to problems that people are having, that’s a problem. And then we also have people who are taking advantage of it and who see that opportunity to campaign and build trust by pointing the finger at the other guy.

Debates on Wikipedia talk pages can get heated. People rebut other people’s proposals without a lot of pleasantry. There is real conflict, but they are generally productive conflicts. People keep engaging with each other and usually reach a compromise, which I feel is very unique in online discourse. What do you think the mechanism or mechanisms are that make this possible?

We have a purpose to build an encyclopedia, to be high-quality and neutral, and we have a commitment to civility as a virtue in the community. We’re human beings, so of course sometimes those conversations are, I might say, a bit brusque but hopefully not stretching quite into personal attacks. There’s also this view that you really shouldn’t attack people personally. And if it gets overheated, you should probably apologize, and things like that, which is not that unusual except in online contexts. I mean, normally I think most people in real life, if you get into a proper nasty quarrel with someone, there is a sort of feeling like, Yeah, that wasn’t productive and maybe we need to apologize to each other and find a better way to deal with each other. In terms of how do we foster more of that? I think in online spaces, it has to do with changing culture. And in many cases, I think it’s the design of algorithms.

I don’t go on Facebook very much anymore, but if one day I logged in and Facebook had an option that said, “We’d like to show you things we think you will disagree with, but that we have some signals in our algorithm that are of quality. Would you like to see that?” I’d be like, yes, sign me up for that. As opposed to: “Our research has shown that you tend to get agitated about trolls, so we’re going to send more trolls your way because you stay on the site longer.” Or “we’re only going to send you stuff we think you’re going to agree with,” which is also not really healthy intellectually.

One of your other examples of a functional online space was the subreddit /changemyview, which feels similar to Wikipedia in some ways. It’s text-based. There are rules. You’re there for a specific purpose. Is it possible for a big platform like Facebook or X or whatever to become a healthy space, or do you need to be kind of constrained and purpose-built?

I think it’s hard for sure. And I think that’s a great question because I don’t think anybody knows right now. On Facebook, you’ll find pockets of groups that have good, well-run community members who are keeping the peace and insisting on certain standards. And you find horrible places as well. I think Reddit it’s the same. And another thing that I do think is interesting is looking back, because I’m now old, and I remember before the World Wide Web and I remember Usenet, which was a giant, enormous, largely unmoderated message board. That was super toxic. It had endless flame wars and horribleness and spam and all kinds of nonsense. So I always try to mention that when people have this view of the lovely, sweet days of the early internet — it was such a utopia. I’m like, it was kind of horrible then too. It turns out we don’t need algorithms to be horrible to each other. That’s actually something humans can do, and humans can be great to each other at the same time. But I do think, as consumers of internet spaces, I think we should say, “Actually, I really would much rather be in places that are good for me.”

You recently weighed in on one of the most contentious topics on Wikipedia or anywhere, the Israel-Gaza conflict. You wrote that you thought that it shouldn’t be called a genocide in wiki voice. You normally stay out of content debates on Wikipedia. Why did you decide to weigh in on that one?

I think it’s really important that Wikipedia remain neutral and that we refrain from saying things that are controversial in wiki voice. I think that’s not healthy for us and not healthy for the world. So it felt important to weigh in and say, “Let’s take a deeper look at this.” And the other thing is normally, we have this idea of consensus in the community, and I would say it has a certain usually constructive ambiguity, like what is consensus? How do you define that? We’ve avoided for good reason, I think, saying, “it’s 80 percent” or any kind of simple rule like that. And the reason is because there are so many different areas in editing where there are different levels of certainty and different levels of consensus. My simplest example is, which picture of the Eiffel Tower should we have as the main picture on the Eiffel Tower wiki page? Well, maybe somebody does a straw poll and it’s 60-40. Personally, if I’m in the 40 percent, I’m going to go, Most people don’t agree with me, oh well, because it isn’t that important.

Whereas in other cases, if you’ve got a significant number of good Wikipedians who are saying, “I don’t agree with this, I don’t think this should be in wiki voice,” you shouldn’t go for 60 percent. That’s nowhere near good enough, particularly not if it has enormous implications for the reputation of Wikipedia and neutrality. We should hold ourselves to a very high standard. This is the kind of thing that over the years, we have to reexamine over and over and over. Where are we drawing these lines? And are we doing a good job of it? And should we ratchet it up and be more serious about it? And over the years, we have gotten more serious about it. And I think we should be even more serious about it.

Some of the editors said they felt that there was a consensus, that they’d debated this question for months, and that to frame the article as you wanted would be to give both sides of the debate equal weight, rather than to represent the proportional view of experts and institutions. What are your thoughts on that critique?

Yeah, I think they’re wrong. I think we have to always dig deep and examine it, and I think it’s absolutely fine to say, “The consensus of academic genocide researchers is that this was genocide.” That, as far as I can tell, is a fact, so that’s fine. Report on that fact. That doesn’t mean that Wikipedia should say it in our own voice.

And that’s actually important more broadly that if there’s significant disagreement within the community of Wikipedians and we don’t have consensus, and if people are putting forward policy-based reasons to disagree with that, which they are, then hold on. We should always be looking for as much agreement as possible. So what can we all agree on? Oftentimes that may be stepping back, going meta and saying, “Okay, well, we can all agree to report on the facts. We’re not all going to agree on using wiki voice here. So we’re not going to do that. But we are going to report the facts that we can all agree on.”

And it’s important for two reasons. One, it’s what you want from an encyclopedia. You don’t want to be jumping to a conclusion while there’s still live debate. And two, socially within the community, it means we can all have a win-win situation where we can all point at this and say, “Yeah, we disagree but we can point to this with pride and say, ‘Actually, this is a good presentation. If you read this, you’ll understand the debate.’” Brilliant. That’s where we want to be.

When I see people attack Wikipedia for bias, it often comes down to which sources editors deem reliable. They’ll say, “Well, you don’t let us cite Breitbart, so now it’s going to be biased.” How are you thinking about how to draw the line of what is an acceptable source, and how to maintain neutrality as these decisions no longer seem neutral to people who have a completely different media diet made up of sources deemed unreliable?

It’s something we will always be grappling with. Wikipedia does not have firm rules. That’s one of the core pillars. We don’t completely ban sources. We may deprecate them and say, “Well, it’s not preferred as a source. We’d rather have something better.” And then I make no apologies at all for saying not all sources are equal. I always say, if I have a choice between The New England Journal of Medicine and Breitbart, I’m going with The New England Journal of Medicine. That’s just the way it is, and I think that’s fine. When I say we have to grapple with it and take seriously the question of bias, I think we do. But sometimes we’re going to conclude, Actually, I think we’re fine here.

Elon Musk has been a loud voice complaining about bias on Wikipedia. Now he has Grokipedia, an AI-rewritten version of Wikipedia that draws on a bunch of sources that Wikipedia won’t allow. Have you looked at Grokipedia?

A little. Not enough. I need to do a deep dive.

What are your thoughts on it?

I think a lot of the criticism that it’s getting is not surprising to me. I use large language models a lot and I know about the hallucination problem, and I see it all the time. Large language models really aren’t good enough to write an encyclopedia. And what’s particularly true is the more obscure the topic, then the more likely they are to hallucinate. I also think in terms of the question of trust, I’m not sure anybody’s going to trust an encyclopedia that has a thumb on the scales. Which is to say, when I’m not happy about something in Wikipedia, I open a conversation and enter the discourse. I’m sure if Elon doesn’t like something, it’s just going to change. I don’t see how you can trust a process like that. You know, it is reported that Grokipedia seems to agree with Elon Musk’s political views quite well. Fine. It’s Elon, but that might not be what we all want from an encyclopedia.

Are you concerned that it could be what some people want, or that people will start to use or prefer an AI-revised version of Wikipedia that conforms to their worldview?

Obviously you can’t dismiss that out of hand, but I actually reflect on various research that we cite in the book about trust, that if people feel like there’s a thumb on the scale, then even if they agree with that thumb on the scale, they are likely to trust it less.

I have great confidence in ordinary people. I think that if you ask people, “Would you prefer to have a news source that reflects all your own prejudices and biases and that you agree with every day?” or “Would you rather get something that is neutral and gives you insight into things you might not agree with?” I don’t think it’d be a contest. Most people would prefer the latter. That doesn’t mean they automatically click on it, and they may prefer their preferred outlet. That’s fine. That’s humanity. But I don’t think we’re about to all go off into our little mind bubbles permanently.

How are you thinking about Wikipedia and AI more generally? The internet is increasingly full of AI-generated slop, and the foundation noted earlier this year that bots scraping the site were straining Wikipedia’s servers. Do you see AI presenting a threat, possible benefit, both?

Both. AI slop on the internet I don’t think is a huge issue for Wikipedia because we’ve spent, you know, now nearly 25 years studying sources and debating the quality of sources. And so I think Wikipedians aren’t likely to be fooled by, you know, sort of fluff content that is generated by AI.

Obviously, crawling Wikipedia and hammering our servers, that’s not cool. So we hope we find a reasoned solution to that. The money that supports Wikipedia is the small donors giving an average of just over $10. They’re not donating to subsidize billion-dollar companies crawling Wikipedia. So you know, “pay for what you’re using” seems like a fair request.

Then the other thing that I think is super interesting are questions around how we, the community, might use the technology in a new way. I’m not a very good programmer, but I’m a programmer and I just wrote a little thing that I can feed a short Wikipedia entry that maybe has five sources and feed it the five sources and say, “Is there anything in the sources that should be in Wikipedia but isn’t? Or is there anything in Wikipedia that isn’t supported by the sources?” I haven’t even had time to play with it, but even at a first pass, I thought, this is actually not terrible.

Going back to why Wikipedia works, editors do seem to largely trust each other to be working in good faith, but it also seems like they have a lot of trust or respect for Wikipedia’s rules and processes in a way that feels rare in online communities. Where does that come from?

I think it probably has to do with everything being genuinely community-driven and genuinely consensus-driven. The rules aren’t imposed, the rules are people writing down accepted best practices. Certainly in the early days, that was absolutely how it worked. We would be doing something for a while and then we would notice, like, Oh, actually, you know, best practice is this, so we should maybe write that down as a guide for people, and it becomes policy at some point. That helps to build trust in the rules, that they’re genuinely not imposed top-down, that they are the product of our values and a process and the purpose of Wikipedia.

Feeding the machine

2026-01-30T10:57:17-05:00

When he was 19 years old, Brendan Foody started Mercor with two of his high school friends as a way for his other friends, who also had startups, to hire software engineers overseas. It launched in 2023 as essentially a staffing agency, albeit a highly automated one. Language models reviewed resumes and did the interviewing. Within months, Mercor was bringing in $1 million in annualized revenue and turning a modest profit.

Then, in early 2024, the company Scale AI approached Mercor with a big request: They needed 1,200 software engineers. At the time, Scale was one of the only well-known names in the historically back-of-house business of producing AI training data. It had grown to a valuation of nearly $14 billion by orchestrating hundreds of thousands of people around the world to label data for self-driving cars, e-commerce algorithms, and language-model-powered chatbots. Now that OpenAI, Anthropic, and other companies were trying to teach their chatbots to code, Scale needed software engineers to produce the training data.

This, Foody sensed, could herald a larger change in the AI industry. He’d heard about growing demand for specialized data work, and now here was Scale asking for a thousand coders. When the engineers he recruited started complaining about missed pay (Scale has a reputation among data workers for chaotic platform management and is being sued in California over wage theft, among other infractions), Foody decided to cut out the middleman.

In September, Foody announced that Mercor had reached $500 million in annualized revenue, making it “the fastest growing company of all time.” The previous titleholder was Anysphere, which makes the AI coding tool Cursor. In a sign of the times, Cursor recently noted that its users produce the exact sort of training data labs are paying for, and The Information recently reported that OpenAI and xAI are interested in buying it.

Mercor’s most recent fundraising round valued the company at $10 billion. Foody and his two cofounders are 22 years old, making them the youngest self-made billionaires. At least one of their early employees has already left to start an AI data company of her own.

While discussions of AI infrastructure typically focus on the gargantuan buildout of data centers, an analogous race is happening with training data. Labs have already exhausted all the easily accessible data, adding to questions about whether early rapid progress through sheer increases in scale will continue. Meanwhile, most recent improvements have come through new training techniques that make use of smaller datasets tailor-made by experts in particular fields, like programming and finance, and AI companies will pay premium prices for them.

There are no good statistics on how much labs are spending, but rough estimates from investors and industry insiders place the figure at over $10 billion this year and growing, the vast majority coming from five or so companies. These companies have yet to find a way to make money from AI, but the people selling them training data have. For now, they are some of the only AI companies turning a profit.

The data industry has long been the most undervalued and unglamorous aspect of AI development, according to a 2021 study by Google researchers, seen as regrettably necessary janitorial work to be done as quickly and cheaply as possible. Yet modern machine learning could not exist without its ecosystem of data suppliers, and the two spheres move in tandem.

The enormous datasets that proved the viability of machine learning in the early 2010s were made possible by the emergence several years before of Amazon Mechanical Turk, an early crowdsourcing platform where thousands of people could be paid pennies to label images of dogs and cats. The push to develop autonomous vehicles fed the growth of a new batch of companies, among them Scale AI, which refined the crowdsourcing approach through a dedicated work platform called Remotasks where workers used semi-automated annotation software to draw boxes around stop signs and traffic cones.

The turn to language model chatbots after the launch of ChatGPT initiated another transformation of the industry. ChatGPT got its humanlike fluency from a training approach called reinforcement learning from human feedback, or RLHF, which involved paying contractors to rate the quality of chatbot responses. A second model, trained on these ratings, then rewarded ChatGPT whenever it did something that this second model predicted humans would like. Providing these ratings was a more nuanced affair than past iterations of crowdsourced data work, particularly as the chatbots got more advanced; it takes someone with medical training to judge whether medical advice is good.

Scale supplied many of the human ratings, but a new company, Surge AI, self-funded by a data scientist named Edwin Chen, quietly grew to become the industry’s other major provider. In Chen’s past jobs at Google, Twitter, and Facebook, he had been dismayed at the poor quality of the data he received from vendors, full of mislabelings done for minimal pay by people who lacked relevant backgrounds. The vendors, Chen said, were just “body shops,” throwing people at the problem and trying to substitute quantity for quality.

Where Scale had its Remotasks platform, Surge has Data Annotation Tech: smaller, more targeted in its recruiting, and with tighter quality controls. It also paid better, around $30 an hour, though like Scale, Surge is also being sued in California for misclassification and unpaid wages. Demand from OpenAI and the labs trying to catch up was immense. The company has been profitable since it launched, and last year, it reportedly took in more than $1 billion in revenue, surpassing Scale’s reported $870 million. Earlier this year, Reuters reported that Surge is considering taking funding for the first time, looking for a $1 billion investment at a $15 billion valuation. According to Forbes, Chen still owns approximately 75 percent of it.

Data about which chatbot responses people prefer is a crude signal, however. Models are prone to learning simple hacks like “tell the user they’ve made an excellent point” instead of something as complex as “check for factual consistency with reliable sources.” Even when domain experts are doing the judging, the results often just sound more expert but are still too unreliable to actually be useful. Models ace bar exams but invent case law, pass CPA tests but pick the wrong cells in a spreadsheet. In July, researchers at MIT released a study finding that 95 percent of the businesses that have adopted generative AI have seen zero return.

AI companies hope that reinforcement learning with more granular criteria will change this. Recent improvements in math and coding are a proof of concept. OpenAI’s o1 and DeepSeek’s R1 showed that given a bunch of math and coding problems and a few step-by-step examples of how humans thought their way to solutions, models can become quite adept at these domains. As they trial-and-error their way to correct solutions, models weigh possible approaches, backtrack, and display other problem-solving techniques developers have called “reasoning.”

The problem is that math and coding problems are idealized, self-contained tasks compared to what a software engineer might encounter in the real world, so scores on benchmarks don’t reflect actual performance. To make models useful, AI companies need more data that is reflective of real tasks an engineer might do — hence the rush to hire software engineers.

The other problem is that math and coding might be the easiest possible domains for AI to conquer. For reinforcement learning to work, models need a clear signal of success to optimize for. This is why the method works so well for games like Go: Winning is a clear, unambiguous outcome, so models can try a million ways to achieve it. Similarly, code either runs or it doesn’t. The analogy isn’t perfect; ugly, inefficient code can still run, but it provides something verifiable to optimize for.
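
To make the idea of a verifiable signal concrete, here is a minimal sketch of a pass-or-fail reward for code, assuming a candidate solution and its tests arrive as plain strings. The function name `run_reward` and the sample snippets are hypothetical illustrations, not any lab's actual pipeline, and a real system would run untrusted code in a sandbox rather than with a bare `exec`.

```python
def run_reward(source: str, tests: str) -> float:
    """Return 1.0 if the candidate code runs and passes its tests, else 0.0.

    Hypothetical helper for illustration only. exec() on untrusted code is
    unsafe; a real training setup would isolate execution in a sandbox.
    """
    namespace = {}
    try:
        exec(source, namespace)  # define the candidate's functions
        exec(tests, namespace)   # failed assertions raise AssertionError
    except Exception:
        return 0.0
    return 1.0


TESTS = "assert add(2, 3) == 5\n"

clean = "def add(a, b):\n    return a + b\n"
# Ugly and wasteful, but it still runs, so it earns the same reward.
ugly = (
    "def add(a, b):\n"
    "    total = 0\n"
    "    for _ in range(1):\n"
    "        total = a + b\n"
    "    return total\n"
)
wrong = "def add(a, b):\n    return a - b\n"

print(run_reward(clean, TESTS), run_reward(ugly, TESTS), run_reward(wrong, TESTS))
```

The clean and ugly versions score identically, which is the imperfection noted above: the signal is unambiguous, but it says nothing about code quality.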

Few other things in life are like this. There is no universal test for determining whether a legal brief or consulting analysis is “good.” Success depends on the context, goals, audience, and countless other variables.

“There seems to be a belief in the community that there’s a single reward function, that if we can just specify what we want these AI systems to do, then we can train them to [do it],” said Joelle Pineau, chief AI officer at Cohere, an enterprise-focused AI lab. But, she said, the reality is more varied and nuanced.

“[Reinforcement learning] wants one reward function. It’s not very good about finding solutions when you have multiple conflicting values that need to coexist, so we may need a very different paradigm than that.”

In lieu of a new paradigm, AI companies are attempting to brute force the problem by paying — via companies like Mercor and Surge — thousands of lawyers, consultants, and other professionals to write out in painstaking detail the criteria for what counts as a job well done in every conceivable context. The hope is that these lists, often called grading rubrics, will allow models to reinforcement-learn their way to competence in the same way they have begun doing with software engineering.

Rubrics are extremely labor-intensive to produce. People who work on them said that it is not unusual to spend 10 hours or more refining a single one, which might include more than a dozen different criteria. Companies guard the details of their training methods closely, but an example OpenAI released for its recent medical benchmark offers a good indication of what they’re like. Asked a question about an unresponsive neighbor, the model gets rewarded if its response includes advice to check for a pulse, locate a defibrillator, perform CPR, and 16 other criteria. There are nearly 50,000 such criteria in the benchmark, with different ones applying to different prompts. Labs are ordering tens to hundreds of thousands of rubrics with millions of criteria between them per training run, according to people in the data industry.
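
A rough sketch of how a rubric like this could be turned into a reward follows. Everything in it is an assumption for illustration: the `Criterion` structure, the point values, and especially the keyword matching, which stands in for the human or model grader a real benchmark would use to judge each criterion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    description: str
    keywords: tuple[str, ...]  # hypothetical stand-in for a real grader
    points: int                # hypothetical weight


def score(response: str, rubric: list[Criterion]) -> float:
    """Return the fraction of rubric points the response earns."""
    text = response.lower()
    earned = sum(
        c.points for c in rubric if any(k in text for k in c.keywords)
    )
    possible = sum(c.points for c in rubric)
    return earned / possible if possible else 0.0


# Three of the criteria named in the article; the weights are invented.
rubric = [
    Criterion("Advises checking for a pulse", ("pulse",), 5),
    Criterion("Advises locating a defibrillator", ("defibrillator",), 5),
    Criterion("Advises performing CPR", ("cpr",), 5),
]

response = "Check for a pulse and start CPR right away."
print(round(score(response, rubric), 2))
```

The response above meets two of the three criteria, so it earns two-thirds of the available points. With dozens of criteria per prompt, the reward becomes much finer-grained than a single thumbs-up or thumbs-down.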

These rubrics need to be “super granular,” according to Mercor’s Foody. Producing consulting rubrics, Foody said, would start by creating a taxonomy of all the industries a consulting company operates in, then all the types of consulting it does in each of those industries, then all the types of reports and analyses a consultant might produce in each of those categories.

Performing these tasks typically requires doing things on computers, and each of those things needs a rubric, too. Sending an email requires a lot of steps — opening a browser, beginning a new message, typing it out, and so on. But what if your only verifier for success was whether the email was sent or received? It’s important to check for more actions than just one, according to Aakash Sabharwal, Scale’s VP of engineering.
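
The difference between checking only the outcome and checking the steps along the way can be sketched as follows. The step names and both scoring functions are hypothetical, loosely based on the email example above, and are not Scale's actual verifier.

```python
# Hypothetical ordered steps for sending an email.
REQUIRED_STEPS = ["open_browser", "new_message", "type_body", "send"]


def outcome_only(trace: list[str]) -> float:
    """Reward 1.0 if the email was sent, regardless of how."""
    return 1.0 if "send" in trace else 0.0


def stepwise(trace: list[str]) -> float:
    """Reward the fraction of required steps completed in order,
    stopping at the first step that is missing."""
    position = 0
    completed = 0
    for step in REQUIRED_STEPS:
        try:
            position = trace.index(step, position) + 1
        except ValueError:
            break
        completed += 1
    return completed / len(REQUIRED_STEPS)


# The model sent a message without ever typing anything into it.
blank_email = ["open_browser", "new_message", "send"]
print(outcome_only(blank_email), stepwise(blank_email))
```

The outcome-only check gives the blank email full marks, while the stepwise check gives it half credit because the typing step never happened.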

Models learn to perform these tasks in simplified versions of software called reinforcement learning environments, often described as AI “gyms,” where models can stumble around until they figure out how to do the clicking and dragging required to score well on the grading rubric. The market for these environments is booming, too.

As with rubrics, each one needs to be tailored to its use. “Sometimes it’s a DoorDash or a Salesforce clone, but a lot of times it’s just an enterprise-specific environment,” said Alex Ratner, cofounder and CEO of Snorkel AI. Snorkel makes annotation software but recently launched a human data service of its own.

Ratner cites a recurring irony in AI development known as Moravec’s paradox, named for a researcher working on computer vision in the 1980s who observed that the things that come easiest to humans are often the most difficult for machines. At the time, conventional wisdom was that machine vision would be solved before chess; after all, only a select few humans have the talent and training to be grandmasters, whereas even children can see. Now models can solve complex one-off coding challenges, but they flounder on more basic real-world engineering tasks without close human supervision, misusing tools and making obvious errors.

“That kind of real work, with ambiguous, intermediate metrics of success that seem way more mundane than a coding competition, that is where models struggle,” Ratner said. “That’s the counterintuitive frontier, and that’s where people are trying to lean in, ourselves included, with building more complex environments, more nuanced rubrics.”

According to vendors, the most in-demand fields are the ones that sit at the sweet spot of verifiability and economic value. Software engineering continues to be the largest, followed by finance and consulting. Law is popular, though so far it is proving to be less verifiable and thus less amenable to reinforcement learning. Physics, chemistry, and math are all in demand. Really, it’s nearly anything you can imagine. There are ads for nuclear engineers and animal trainers.

“It’s everything from clinical hospital settings to legal deep research to — we got a request for woodworking the other day,” Ratner said. “It’s every nook and cranny of human expertise.”

Encoding all of humanity’s skill and know-how into checklists is an enormous, possibly quixotic undertaking, but the frontier labs have billions to spend, and the sheer scale of their demand is reconfiguring the data industry. New entrants seem to appear by the day, and everyone is touting successively more pedigreed experts getting paid ever higher rates.

Surge touts its Fields Medalist mathematicians, Supreme Court litigators, and Harvard historians. Mercor advertises its Goldman analysts and McKinsey consultants. Handshake AI, another fast-growing expert provider, boasts of its physicists from Berkeley and Stanford and the ability to draw alumni from more than 1,000 universities.

Garrett Lord, the CEO and cofounder of Handshake, started picking up signals about the changing data market last year, when incumbent data providers came around asking for experts. Handshake had experts. Lord founded the company in 2014 as a sort of LinkedIn-meets-Glassdoor for college students and recent grads looking for internships and first jobs. More than a thousand college career centers pay for access, as do companies looking to recruit from Handshake’s 20 million alumni, grad students, masters, and PhDs. Early this year, Lord entered the AI data market himself, launching essentially a second company within his existing one, called Handshake AI.

Then, in June, Meta hired away Scale’s CEO and took a 49 percent stake in the company. Rival labs fled, wary that Scale would no longer be a neutral provider — could they trust the data now that it was being provided by a quasi-Meta subsidiary? It was like breaking a billion-dollar piñata over all the data startups. Handshake saw demand triple overnight.

In November, Handshake surpassed a $150 million run rate, exceeding the original decade-old business. There is more demand than the company can meet, Lord said. “We’ve gone from three to 150 people in five months,” Lord said. “We’ve had 18 people start on a Monday. We’re running out of desks.”

The ravenous demand of AI model-builders is pulling any company that might have data to offer into its gravitational field. Turing, which began as a staffing agency but pivoted to training data after OpenAI approached the company in 2022, also saw demand spike following the Scale deal. As did Labelbox, which makes annotation software but last year launched its own expert-annotator service, called Alignerr, where buyers can search for experts, called “Alignerrs,” who’ve been vetted by Labelbox’s AI interviewer, named Zara.

Staffing agencies, content moderation subcontractors, and other adjacent businesses are also reorienting around the labs. Invisible Technologies started 10 years ago as a personal assistant bot that directed tasks to workers overseas, but it started posting twentyfold revenue increases as AI labs hired those workers to produce data. This year, it brought on an ex-McKinsey executive as CEO, took on venture funding, and is positioning itself as an AI training company. The company Pareto followed the same trajectory, launching in 2020 by offering executive assistants based in the Philippines and now selling AI training data services.

The company Micro1 began in 2022 as a staffing agency for hiring software engineers, who had been vetted by AI, but now it’s a data labeling company too. In July, Reuters reported that the company had seen annualized revenue go from $10 million to $100 million this year and was finalizing a Series A funding round valuing the company at $500 million.

Even Uber is angling to get a piece of the action. In October, it bought a Belgian data labeling startup and is in the process of rolling out an annotation platform to US workers, so drivers can annotate when they aren’t driving.

Then there is a long list of smaller, niche players. The company Sapien is paying data labelers in crypto. Rowan Stone, CEO of Sapien, told The Verge in July that the data labeling company — which specializes in vertical models focused on just one thing and has Scale cofounder Lucy Guo on its advisory board — is “absorbing the collective knowledge of humanity.” They aren’t even the only human data startup paying in crypto tokens.

Stellar, Aligned, FlexiBench, Revelo, Deccan AI — everyone is touting their talent networks, their experts in the loop, their data enrichment pipelines. The company Mechanize rose above the scrum on a wave of viral outrage by announcing in April that its goal was “the full automation of all work.” How will it accomplish this provocative goal? By selling training data and environments, like everyone else.

Like Nvidia, the dominant designer of AI chips, these companies sell the picks and shovels for the AI gold rush, capturing the billions in debt-financed spending flowing out of the frontier labs as they race to achieve superintelligence. It’s a safer business than prospecting, and it is much easier to start selling data than to design new chips, so startups are proliferating.

“It’s like everyone and their mother realized, ‘Hey, I’m doing a human data startup,’” said Adam J. Gramling, a former Scale employee who said he received approximately 300 recruiting messages on LinkedIn when he announced his departure in one of Scale’s recent rounds of layoffs. “This Cambrian explosion happened, and now let’s see who survives.”

The data industry may be growing quickly, but it is a historically tumultuous business. The industry is littered with former giants felled by a sudden change in training techniques or customer departure. In August 2020, the Australian data annotation company Appen’s market cap surpassed the equivalent of $4.3 billion USD; now, it’s less than $130 million, a 97 percent decline. For Appen, 80 percent of its revenue came from just five clients — Microsoft, Apple, Meta, Google, and Amazon — which made even a single client departure an existential event.

Today’s market is also highly concentrated. On a recent podcast, Foody compared Mercor’s customer concentration to Nvidia, where four customers represent 61 percent of its revenue. If investors tire of giving money to model-builders, or the labs take a different approach to training, the effects could be devastating. All of the AI developers use multiple data suppliers already, and as the exodus from Scale showed, they are quick to take their money elsewhere.

All this lends itself to a fiercely competitive atmosphere. On podcasts and in interviews, the CEOs take swipes at the business models of their rivals. Chen still thinks most of his competitors are “body shops.” Foody refers to Surge and Scale as legacy crowdsourcers in an era of highly paid experts. Handshake’s Lord says his rivals are spending thousands on recruiters spamming physicists on TikTok, but they’re all already on his platform. All three say Scale had quality problems even before it was tainted by Meta’s investment. Every time one of these barbs is reported, a Scale spokesperson snipes back, accusing Foody of seeking publicity or mocking Chen for his lengthy fundraising round. Scale is also currently suing Mercor, claiming it poached an employee who stole clients on their way out the door.

For now, there is more than enough money flowing from the labs for everyone. They want rubrics, environments, experts of every conceivable type, but they’re still buying the old types of data too. “It’s always increasing,” says Surge’s Chen. “These ever-increasing new forms of training, they’re almost complementary to each other.”

Even Scale is growing after its post-Meta setback, and major customers have come back, at least in some capacity. Interim CEO Jason Droege said in an onstage interview in September that the company is still working with Google, Microsoft, OpenAI, and xAI. To better compete in the enterprise AI space, Scale has also started a program called the “Human Frontier Collective” for white-collar professionals in STEM fields like computer science, engineering, mathematics, and cognitive science.

Scale told The Verge that its data and applications businesses are each generating nine figures of revenue, with its data business growing each month since the Meta investment and its applications business doubling from the first half to the second half of 2025. It also said that the third quarter of 2025 was its public sector business’s best quarter since 2020, partly due to government contracts. Scale also reportedly expects revenue for this year to more than double, to $2 billion. (The company declined to comment on the figure on the record.)

It has diversified into selling evaluations, the tests that AI developers use to see where their models are weak and need more training data, according to Bing Liu, Scale’s head of research. The business strategy: Companies will use the evaluations to see where their own models are lacking in data and then, ideally, buy those types of data from Scale.

The 11-digit valuations of just-launched data companies could be seen as signs of an AI bubble, but they could also represent a bet on a certain trajectory of AI development. (Both can also be true.) The goal held out by the AI labs when justifying their enormous expenditures is an imminent breakthrough to artificial general intelligence, something, to use the definition in OpenAI’s charter, that is “highly autonomous” and can “outperform humans at most economically valuable work.”

The term is amorphous and disputed, but one thing artificial general intelligence should be able to do is, well, generalize. If you train it to do math and accounting, it should be able to do your taxes without further rounds of reinforcement learning on tax law, state-specific tax rules, the most recent edition of TurboTax, and so on. A generally capable agent should not need massive amounts of new data to handle each variety of task in every domain.

“The future where the AI labs are right is one where as performance goes up, the need for human data goes down, until you can take the human out of the loop entirely,” said Daniel Kang, assistant professor of computing and data science at the University of Illinois Urbana-Champaign, who has written about the demand for training data. Instead, the opposite seems to be happening. Labs are spending more on data than ever before, and improvements are coming from bespoke datasets tailored to increasingly specific applications. Given current training trends, Kang predicts that getting high-quality human data in each discrete domain will be the primary bottleneck for future AI progress.

In this scenario, AI looks more like a “normal technology,” Kang said. Normal technology here being something like steam engines or the internet — potentially transformative, but also not computer god. (This is also, he hypothesized, why companies are less keen to trumpet their spending on data than they are on data centers: It cuts against their fundraising narrative.) In the AI-as-normal future, companies will need to buy new data whenever they want to automate a particular task, and keep buying data as workflows change.

The data companies are betting on that too. “The labs very much want to say that we’re going to have superintelligence that generalizes as soon as possible,” said Foody. “The way it’s playing out in practice is that reinforcement learning has a limited generalization radius, so they need to build evals across all the things that they want to optimize for, and their investments in that are exploding very quickly.”

Other companies, predicting that the frontier models will not “just hit this point of generalization where it’s just magic and you can do everything,” in the words of Ryan Wexler, who manages AI infrastructure investments at SignalFire, are positioning themselves to cater to the many companies that will need to tune models to suit their purposes.

SignalFire invested in Centaur AI, a medical and scientific data company. Rather than the frontier labs, most of Centaur’s customers are medical institutions like Memorial Sloan Kettering or Medtronic with highly specific applications and low margins for error. Last year, the smart mattress company Eight Sleep wanted to add “snore detection” to its bed’s suite of capabilities. Existing models struggled, so the company hired Centaur to enlist more than 50,000 people to label snores.

“The attempts to make the God model, I don’t know what will happen there, but I’m very confident that demand will keep growing among everyone else,” said Centaur’s founder and CEO, Erik Duhaime. “Everyone was sold some dream that this will be easy, plug and play,” Duhaime said. “Now they’re realizing, ‘Oh, we need to customize this thing for our use case.’”

Matt Fitzpatrick, the CEO of Invisible, is also focusing on its enterprise services. If you look at “spend curves over time,” he said, the enterprise is “where a lot of this will move.” Since January, the company has overhauled its business to focus more on attracting enterprise clients, with about 30 percent of its data annotation pool now being people with PhDs and master’s degrees. Fitzpatrick describes the company as a “digital assembly line” where experts “anywhere on Earth” can be called in to generate data. Invisible is currently often being asked to provide environments for software development and contact centers, he said.

If AGI is to be achieved one order of contact-center training rubrics at a time, the future looks bright for data vendors, which is perhaps why a new grandeur has entered the language of the CEOs. Turing’s CEO predicts that AI data annotator will become the most common job on the planet in the coming years, with billions of people evaluating and training models. Handshake’s Lord sees the nascent formation of a new category of work, comparing it to Uber drivers a decade ago.

“We’re going to need a huge build-out of data and evals across every industry in the economy,” Foody said. At Mercor, he says, the customer support team responds to tickets the AI agent can’t manage, but also updates its rubrics so it can field those questions next time. “If you zoom out,” he said, “it feels like the entire economy will become a reinforcement learning environment.”

If investors don’t find this vision as enticing as a country of geniuses in a data center, as Anthropic’s Dario Amodei described the impending transformation, they can take consolation in the fact that someone, at least, has found a way to make money off AI.

How Wikipedia survives while the rest of the internet breaks

2026-01-20T13:03:36-05:00

When armies invade, hurricanes form, or governments fall, a Wikipedia editor will typically update the relevant articles seconds after the news breaks. So quick are editors to change “is” to “was” in cases of notable deaths that they are said to have the fastest past tense in the West. So it was unusual, according to one longtime editor who was watching the page, that on the afternoon of January 20th, 2025, hours after Elon Musk made a gesture resembling a Nazi salute at a rally following President Donald Trump’s inauguration and well into the ensuing public outcry, no one had added the incident to the encyclopedia.

Then, just before 4PM, an editor by the name of PickleG13 added a single sentence to Musk’s 8,600-word biography: “Musk appeared to perform a Nazi salute,” citing an article in The Jerusalem Post. In a note explaining the change, the editor wrote, “This controversy will be debated, but it does appear and is being reported that Musk may have performed a Hitler salute.” Two minutes later, another editor deleted the line for violating Wikipedia’s stricter standards for unflattering information in biographies of living people.

But PickleG13 was correct. That evening, as the controversy over the gesture became a vortex of global attention, another editor called for an official discussion about whether it deserved to be recorded in Wikipedia. At first, the debate on the article’s “talk page,” where editors discuss changes, was much the same as the one playing out across social media and press: it was obviously a Nazi salute vs. it was an awkward wave vs. it couldn’t have been a wave, just look at the touch to his shoulder, the angle of his palm vs. he’s autistic vs. no, he’s antisemitic vs. I don’t see the biased media calling out Obama for doing a Nazi salute in this photo I found on Twitter vs. that’s just a still photo, stop gaslighting people about what they obviously saw. But slowly, through the barbs and rebuttals and corrections, the trajectory shifted.

Wikipedia is the largest compendium of human knowledge ever assembled, with more than 7 million articles in its English version, the largest and most developed of 343 language projects. Started nearly 25 years ago, the site was long mocked as a byword for the unreliability of information on the internet, yet today it is, without exaggeration, the digital world’s factual foundation. It’s what Google puts at the top of search results otherwise awash in ads and spam, what social platforms cite when they deign to correct conspiracy theories, and what AI companies scrape in their ongoing quest to get their models to stop regurgitating info-slurry — and consult with such frequency that they are straining the encyclopedia’s servers. Each day, it’s where approximately 70 million people turn for reliable information on everything from particle physics to rare Scottish sheep to the Erfurt latrine disaster of 1184, a testament both to Wikipedia’s success and to the total degradation of the rest of the internet as an information resource.

But as impressive as this archive is, it is the byproduct of something that today looks almost equally remarkable: strangers on the internet disagreeing on matters of existential gravity and breathtaking pettiness and, through deliberation and debate, building a common ground of consensus reality.

“One of the things I really love about Wikipedia is it forces you to have measured, emotionless conversations with people you disagree with in the name of trying to construct the accurate narrative,” said DF Lovett, a Minnesota-based writer and marketer who mostly edits articles about local landmarks and favorite authors but later joined the salute debate to argue that “Elon Musk straight-arm gesture controversy” was a needlessly awkward description. “It’s basically the only place on the internet that doesn’t function as a confirmation bias machine,” he said, which is also why he thinks people sometimes get mad at it. Wikipedia is one of the few platforms online where tremendous computing power isn’t being deployed in the service of telling you exactly what you want to hear.

Whether Musk had made a Nazi salute or was merely awkward, the editors decided, was not for them to say, even if they had their opinions. What was a fact, they agreed, was that on January 20th, Musk had “twice extended his right arm toward the crowd in an upward angle,” that many observers compared the gesture to a Nazi salute, and that Musk denied any meaning behind the motion. Consensus was reached. The lines were added back. Approximately 7,000 words of deliberation to settle, for a time, three sentences. This was Wikipedia’s process working as intended.

It was at this point that Musk himself cannonballed into the discourse, tweeting that the encyclopedia was “legacy media propaganda!”

This was not Musk’s first time attacking the site — that appears to have been in 2019, when he complained that it accurately described him as an early investor in Tesla rather than its founder. But recently he has taken to accusing the encyclopedia of a liberal bias, mocking it as “wokepedia,” and calling for it to be defunded. In so doing, he has joined a growing number of powerful people, groups, and governments that have made the site a target. In August, Republicans on the US House Oversight Committee sent a letter to the Wikimedia Foundation requesting information on attempts to “inject bias” into the encyclopedia and data about editors suspected of doing so.

Musk repeating the salute before saying: “My heart goes out to you. It is thanks to you that the future of civilization is assured.”

When governments have cowed the press and flooded social platforms with viral propaganda, Wikipedia has become the next target, and a more stubborn one. Because it is edited by thousands of mostly pseudonymous volunteers around the world — and in theory, by anyone who feels like it — its contributors are difficult for any particular state to persecute. Since it’s supported by donations, there is no government funding to cut off or advertisers to boycott. And it is so popular and useful that even highly repressive governments have been hesitant to block it.

Instead, they have developed an array of more sophisticated strategies. In Hong Kong, Russia, India, and elsewhere, government officials and state-aligned media have accused the site of ideological bias while online vigilantes harass editors. In several cases, editors have been sued, arrested, or threatened with violence.

When several dozen editors gathered in San Francisco this February, many were concerned that the US could be next. The US, with its strong protections for online speech, has historically been a refuge when the encyclopedia has faced attacks elsewhere in the world. It is where the Wikimedia Foundation, the nonprofit that supports the project, is based. But the site has become a popular target for conservative media and influencers, some of whom now have positions in the Trump administration. In January, the Forward published slides from the Heritage Foundation, the think tank responsible for Project 2025, outlining a plan to reveal the identities of editors deemed antisemitic for adding information critical of Israel, a cudgel that the administration has wielded against academia.

“It’s about creating doubt, confusion, attacking sources of trust,” an editor told the assembled group. “It came for the media and now it’s coming for Wikipedia and we need to be ready.”

In 1967, Hannah Arendt published an essay in The New Yorker about what she saw as an inherent conflict between politics and facts. As varieties of truth go, she wrote, facts are fragile. Unlike axioms and mathematical proofs that can be derived by anyone at any time, there is nothing necessary about the fact, to use Arendt’s example, that German troops crossed the border with Belgium on the night of August 4th, 1914, and not some other border at some other time. Like all facts, this one is established through witnesses, testimony, documents, and collective agreement about what counts as evidence — it is political, and as the propaganda machines of the 20th century showed, political power is perfectly capable of destroying it. Furthermore, the powerful will always be tempted to, because facts represent a sort of rival power, a constraint and limit “hated by tyrants who rightly fear the competition of a coercive force they cannot monopolize,” and at risk in democracies, where they are suspiciously impervious to public opinion. Facts, in other words, don’t care about your feelings. “Unwelcome facts possess an infuriating stubbornness,” Arendt wrote.

This infuriating stubbornness turns out to be important, though. A lie might be more plausible or useful than a fact, but it lacks a fact’s dumb arbitrary quality of being the case for no particular reason and no matter your opinion or influence. History once rewritten can be rewritten again and becomes insubstantial. Rather than believe the lie, people stop believing anything at all, and even those in power lose their bearings. This gives facts “great resiliency” that is “oddly combined” with their fragility. Having a stubborn common ground of shared reality turns out to be a basic precondition of collective human life — of politics. Even political power seems to recognize this, Arendt wrote, when it establishes ideally impartial institutions insulated from its own influence, like the judiciary, the press, and academia, charged with producing facts according to methods other than the pure exercise of power.

Leonardo DiCaprio is an American actor and film producer.

Outside Wikipedia, original research is a key part of scholarly work. However, Wikipedia editors must base their contributions on reliable, published sources, not their own original research.

On the floor of the US Senate, Republican Sen. Jim Inhofe displayed a snowball — on February 26th, 2015, in winter — as evidence the globe was not warming, in a year that was found to be Earth’s warmest on record at the time.

A species of riffle beetle in the superfamily Byrrhoidea, named after actor and environmentalist Leonardo DiCaprio to acknowledge his work in “promoting environmental awareness and bringing the problems of climate change and biodiversity loss into the spotlight.”

Wikipedia has come to play a similar role of factual ballast to an increasingly unmoored internet, but without the same institutional authority and with its own methods developed piecemeal over the last two decades for arriving at consensus fact. How to defend it from political attacks is not straightforward. At the conference, many editors felt both that attacks from the Trump administration were a genuine threat and that being cast as “the resistance” risked jeopardizing the encyclopedia’s position of trusted neutrality.

“I would really argue not to take the attack approach, to really take the passive approach,” said one editor when someone broached the idea of actively debunking some of the false information swamping the rest of the internet. “People see us as credible because we don’t attack, because we are just providing information to everyone all the time in a boring way. Sometimes boring is good. Boring is credible.”

Even the editor at the summit who had been most directly affected by the Trump administration urged against a direct response. Jamie Flood had been a librarian and outreach specialist at the National Agricultural Library, where among other duties she led group trainings and uploaded research on topics like germplasm and childhood nutrition to Wikipedia. Museums and libraries around the world employ such “Wikipedians in residence” to act as liaisons with the encyclopedia’s community for the same reason that the World Health Organization partnered with Wikipedia during the covid-19 pandemic to make the latest information available: if you want research to reach the public, there is no better place.

Along with several other Wikipedians employed by the federal government, Flood had just been laid off by DOGE, collateral damage in a general dismantling of research and archival institutions. “I’m a casualty of this administration’s war on information,” Flood said.

Still, Wikipedia absolutely should not counterattack, Flood said. “Wikipedia is always in the background. They’re not making a big statement, and I don’t think they should. I’ve been training people for a long time and I still go back to this early quote of Jimmy Wales, one of the founders: ‘Imagine a world where all knowledge is freely available to everyone.’ That’s enough. That’s a statement in and of itself. In a time of misinformation, in a time of suppression, having this place where people can come and bring knowledge and share knowledge, that is a statement.”

Wikipedia should be, in other words, as stubborn as a fact. But then, facts are fragile things.

A common refrain among Wikipedians is that the site works in practice but not in theory. It seems to flout everything we’ve learned about human behavior online: anonymous strangers discussing divisive topics and somehow, instead of dissolving into factions and acrimony, working together to build something of value.

The project’s origins go back to 1999. Wales, a former options trader who had founded a laddish web portal called Bomis, wanted to start a free online encyclopedia. He hired an acquaintance from an Ayn Rand listserv that Wales previously ran, a philosophy PhD student named Larry Sanger. Their first attempt, called Nupedia, was not so different from encyclopedias as they have existed since Diderot’s Encyclopédie in 1751. Experts would write articles that went through seven stages of editorial review. It was slow going. After a year, Nupedia had just over 20 articles.

In an attempt to speed things along, they decided to experiment with wikis, a web format gaining popularity among open-source software developers that allowed multiple people to collaboratively edit a project. (Wiki is the Hawaiian word for “quick.”) The wiki was intended to be a forum where the general public could contribute draft articles that would then be fed into Nupedia’s peer-review pipeline, but the experts objected and the crowdsourced site was given its own domain, Wikipedia.com. It went live on January 15th, 2001. Within days, it had more articles than all of Nupedia, albeit of varying quality. After a year, Wikipedia had more than 20,000 articles.

There were few rules at first, but one that Wales said was “non-negotiable” was that Wikipedia should be written from a “neutral point of view.” The policy, abbreviated as NPOV, was imported from the “nonbias policy” Sanger had written for Nupedia. But on Wikipedia, Wales considered it as much a “social concept of cooperation” as an editorial standard. If this site was going to be open to anyone to edit, the only way to avoid endless flame wars over who is right was, provocatively speaking, to set questions of truth aside. “We could talk about that and get nowhere,” Wales wrote to the Wikipedia email list. “Perhaps the easiest way to make your writing more encyclopedic is to write about what people believe, rather than what is so,” he explained.

Ideally, the neutrality principle would allow people of different views to agree, if not on the matter at hand, then at least on what it was they were disagreeing about. “If you’ve got a kind and thoughtful Catholic priest and a kind and thoughtful Planned Parenthood activist, they’re never going to agree about abortion, but they can probably work together on an article,” Wales would later say.

This view faced an immediate challenge, which is that people believe all sorts of things: that the Earth is 6,000 years old, that climate change is a scam, that the Holocaust was a hoax, that the Irish potato famine was overblown, that chiropractors are all charlatans, that they have discovered a new geometry, and that Mother Teresa was a jerk.

Lawrence Mark Sanger is an American Internet project developer and philosopher who cofounded Wikipedia along with Jimmy Wales.

Anti-denialist banner at the 2017 Climate March in Washington, DC.

Mary Teresa Bojaxhiu was an Albanian Indian Catholic nun, founded the Missionaries of Charity, and is a Catholic saint.

Young Earth creationism (YEC) is a form of creationism that holds as a central tenet that the Earth and its lifeforms were created by supernatural acts of the Abrahamic God between about 10,000 and 6,000 years ago, contradicting established scientific data that puts the age of Earth around 4.54 billion years.

In response, the early volunteers added another rule. You can’t just say things; any factual claim needs a citation that readers can check for themselves. When people started emailing Wales their proofs that Einstein was wrong about relativity, he clarified that the cited source could not be your own “original research.” Sorry, Wales wrote to an Einstein debunker, it doesn’t matter whether your theory is true. When it is published in a physics journal, you can cite that.

Instead of trying to ascertain the truth, editors assessed the credibility of sources, looking to signals like whether a publication had a fact-checking department, got cited by other reputable sources, and issued corrections when it got things wrong.

At their best, these ground rules ensured debates followed a productive dialectic. An editor might write that human-caused climate change was a fact; another might change the line to say there was ongoing debate; a third editor would add the line back, backed up by surveys of climate scientists, and demand peer-reviewed studies supporting alternate theories. The outcome was a more accurate description of the state of knowledge than many journalists were promoting at the time by giving “both sides” equal weight, and also a lot of work to arrive at. A 2019 study published in Nature found that Wikipedia’s most polarizing articles — eugenics, global warming, Leonardo DiCaprio — are the highest quality, because each side keeps adding citations in support of their views. Wikipedia: a machine for turning conflict into bibliographies.

Coupled with some technical features of wikis, like the ability for anyone to edit anyone else’s writing, and some early administrative rules, like not being allowed to undo someone else’s edit more than three times per day, users were practically forced to talk through disagreements and arrive at “consensus.” This became Wikipedia’s governing principle.

This may make the process sound more peaceful than it is. Disputes were constant. Early on, Sanger, who had remained partial to a more hierarchical, expert-driven model, clashed repeatedly with editors he decried as “anarchists” and demanded greater authority for himself, which the editors rejected. When revenue from Bomis dried up after the dot-com crash, Wales laid Sanger off and took over management of the project.

Wales governed from a greater remove, appearing only occasionally to broker peace between warring editors, resolve an impasse, or reassure people that they didn’t need to spend time devising procedures to screen out a sudden influx of neo-Nazis that were planning to overwhelm discussion, because if they showed up, “I will personally ban them all if necessary, and that’s that.” Editors sometimes ironically referred to him as their “God King” or “benevolent dictator,” but he described his role as a sort of constitutional monarch safeguarding the community as it developed the processes to fully govern itself. Because Wikipedia was under a Creative Commons license, anyone who didn’t like the way the project was run could copy it and start their own, as a group of Spanish users did when the possibility of running ads was raised in 2002. The next year Wales established a nonprofit, the Wikimedia Foundation, to raise funds and handle the technical and legal work required to keep the project running. The encyclopedia itself, however, would be entirely edited and managed by volunteers.

In early 2004, Wales delegated his moderating powers to a group of elected editors, called the Arbitration Committee. From that point onward, he was essentially another editor, screenname Jimbo Wales, liable to have his edits undone like anyone else. He attempted several times to update his own birthdate to reflect the fact that his mother says he was born slightly before midnight on August 7th, 1966, not on August 8th, as his birth certificate read, only to be reprimanded for editing his own page and trying to cite his own “original research.” (After several years of debates and citable coverage from reliable sources, August 7th eventually won, with a note explaining the discrepancy.)

AGF

Over the ensuing two decades, editors amended policies to cope with conspiracy theorists, revisionist historians, militant fandoms, and other perennial goblins of the open web. There are the three core content guidelines of Neutral Point of View, Verifiability, and No Original Research; the five pillars of Wikipedia; and a host of rules around editor conduct, like the injunction to avoid ad hominem attacks and assume good faith of others, defined and refined in interlinked articles and essays. There are specialized forums and noticeboards where editors can turn for help making an article more neutral, figuring out whether a source is reliable, or deciding whether a certain view is fringe or mainstream. By 2005, the pages where editors stipulated policy and debated articles were found to be growing faster than the articles themselves. Today, this administrative backend is at least five times the size of the encyclopedia it supports.

The most important thing to know about this system is that, like the neutrality principle from which it arose, it largely ignores content to focus on process. If editors disagree about, for example, whether the article for the uninhabited islands claimed by both Japan and China should be titled “Senkaku Islands,” “Diaoyu Islands,” or “Pinnacle Islands,” they first try to reach an agreement on the article’s Talk page, not by arguing who is correct, but by arguing which side’s position better accords with specific Wikipedia policies. If they can’t agree, they can summon an uninvolved editor to weigh in, or file a “request for comment” and open the issue to wider debate for 30 days.

If this fails and editors begin to quarrel, they might get called before the Arbitration Committee, but this elected panel of editors will also not decide who is right. Instead, they will examine the reams of material generated by the debate and rule only on who has violated Wikipedia process. They might ban an editor for 30 days for conspiring off-Wiki to sway debate, or forbid another editor from working on articles about Pacific islands over repeated ad hominem attacks, or in extreme cases ban someone for life. Everyone else can go back to debating, following the process this time.

As a result, explosive political controversies and ethnic conflicts are reduced to questions of formatting consistency. But because process decides all, process itself can be a source of intense strife. The topics of “gun control” and “the Balkans” are officially designated as “contentious” due to recurring edit wars, where people keep reverting each other’s edits without attempting to build consensus; so, too, are the Wikipedia manual of style and the question of what information belongs in sidebars. In one infamous battle, debate over whether to capitalize “into” in the film title Star Trek Into Darkness raged for more than 40,000 words.

In 2009, law professors David A. Hoffman and Salil K. Mehra published a paper analyzing conflicts like these on Wikipedia and noted something unusual. Wikipedia’s dispute resolution system does not actually resolve disputes. In fact, it seems to facilitate them continuing forever.

These disputes may be crucial to Wikipedia’s success, the researchers wrote. Online communities are in perpetual danger of dissolving into anarchy. But because disputes on Wikipedia are won or lost based on who has better followed Wikipedia process, every dispute becomes an opportunity to reiterate the project’s rules and principles.

Trolls who repeatedly refuse to follow the process eventually get banned, but initial infractions are often met with explanations of how Wikipedia works. Several of the editors I spoke with began as vandals only to be won over by someone explaining to them how they could contribute productively. Editors will often restrict who can work on controversial topics to people who have logged a certain number of edits, ensuring that only those bought into the ethos of the project can participate.

In 2016, researchers published a study of 10 years of Wikipedia edits about US politics. They found that articles became more neutral over time — and so, too, did the editors themselves. When editors arrived, they often proposed extreme edits, received pushback, and either left the project or made increasingly moderate contributions.

This is obviously not the reigning dynamic of the rest of the internet. The social platforms where culture and politics increasingly play out are governed by algorithms that have the opposite effect of Wikipedia’s bureaucracy in nearly every respect. Optimized to capture attention, they boost the novel, extreme, and sensational rather than subjecting them to increased scrutiny, and by sending content to users most likely to engage with it, they sort people into clusters of mutual agreement. This phenomenon has many names. Filter bubbles, epistemological fragmentation, bespoke realities, the sense that everyone has lost their minds. On Wikipedia, it’s called a “point of view split,” and editors banned it early. You are simply not allowed to make a new article on the same topic. Instead, you must make the case for a given perspective’s place amid all the others while staying, literally, on the same page.

In February, the conservative organization Media Research Center released a report claiming that “Wikipedia Effectively Blacklists ALL Right-Leaning Media.” It was essentially a summary of a publicly available policy page on Wikipedia that lists discussions about the reliability of sources and color codes them according to the latest consensus — green for generally reliable, yellow for lack of clear consensus, and red for generally unreliable. ProPublica is green because it has an “excellent reputation for fact-checking and accuracy, is widely cited by reliable sources, and has received multiple Pulitzer Prizes.” Newsweek is yellow after a decline in editorial standards following its 2013 acquisition and recent use of AI to write articles. Newsmax, the One America News Network, and several other popular right-leaning sources are red due to repeatedly publishing stories that were proven wrong. (As are some left-leaning sources, like Occupy Democrats.) The New York Post (generally unreliable, but marginally reliable on entertainment) used the report as the basis for an editorial titled “Big Tech must block Wikipedia until it stops censoring and pushing disinformation.”

The page is called Reliable sources/Perennial sources, as in sources that are perennially discussed. Editors made the page in 2018 as a repository for past discussions that they could refer to instead of having to repeatedly debate the reliability of the Daily Mail — the first publication to be deprecated, the year before — every time someone tried to cite it. It is not a list of preapproved or banned sources, the page reads. Context matters, and consensus can change.

But to Wikipedia’s critics, the page has become a symbol of the encyclopedia’s biases. Sanger, the briefly tenured cofounder, has found a receptive audience in right-wing activist Christopher Rufo and other conservatives by claiming Wikipedia has strayed from its neutrality principle by making judgments about the reliability of sources. Instead, he argues, it should present all views equally, including things “many Republicans believe,” like the existence of widespread fraud in the 2020 election and the FBI playing a role in the January 6th Capitol attack.

Last spring, the reliable source page collided with one of the most intense political flashpoints on Wikipedia, the Israel-Palestine conflict. In April, an editor asked whether it was time to reevaluate the reliability of the Anti-Defamation League in light of changes to the way it categorizes antisemitic incidents to include protests of Israel, among other recent controversies. About 120 editors debated the topic for two months, producing text equal to 1.9 The Old Man and the Seas, or “tomats,” a standard unit of Wikipedia discourse. The consensus was that the ADL was reliable on antisemitism generally but not when the Israel-Palestine conflict was involved.

Unusually for a Wikipedia administrative process, the decision received enormous attention. The Times of Israel called it a “staggering blow” for the ADL, which mustered Jewish groups to petition the foundation to overrule the editors. The foundation responded with a fairly technical explanation of how Wikipedia’s self-governing reliability determinations work.

In the year since, conservative and pro-Israel organizations have published a series of reports examining the edit histories of articles to make a case that Wikipedia is biased against Israel. In March, the ADL itself issued one such report, called “Editing for Hate,” claiming that a group of 30 “malicious editors” slanted articles to be critical of Israel and favorable to Palestine. As evidence, the report highlights examples like the removal of the phrase “Palestinian terrorism” from the introduction of the article on Palestinian political violence.

Yet the edit histories show that these examples are often plucked from long editing exchanges, the outcome of which goes unmentioned. The “terrorism” line that the ADL cited was indeed removed — it had also only just been added, was added back shortly after being cut, then was removed again, added back, and revised repeatedly before editors brokered a compromise on the talk page.

Breitbart, Pirate Wires, and other right-leaning publications now regularly mine Wikipedia’s lengthy debates for headlines like “How Wikipedia Launders Regime Propaganda,” accusing the site of being a mouthpiece for the Democratic Party, or “Cover Up: Wikipedia Editors Propose Deleting Page on Iran Advocating for Israel’s Destruction,” despite the article having just been created and the outcome being to merge the contents into the article on Iran-Israel relations. These reports are a dependable source of viral outrage on X. The strategy also appears effective at convincing lawmakers. In May, Rep. Debbie Wasserman Schultz (D-FL) and 22 other members wrote to the Wikimedia Foundation citing the ADL report and demanding Wikimedia “rein in antisemitism, uphold neutrality.”

The August letter from House Republicans requesting information on attempts to influence the encyclopedia, data on editors who had been disciplined by Arbcom, and other records also cited the ADL report.

While some search for bias in the minutiae of edit histories, others try to encompass all of Wikipedia. Last year, a researcher at the conservative Manhattan Institute scraped Wikipedia for mentions of political terms and public officials and used a GPT language model to analyze them for bias. The report found “a mild to moderate” tendency to associate figures on the political right with more negative sentiment than those on the left. The study, which was not peer reviewed, has become a regular fixture in claims of liberal bias on Wikipedia.

The report still illustrates the challenges of evaluating the neutrality of a text as vast and stripped of subjective opinion as Wikipedia. An examination of the datasets shows that the passages GPT classified as non-neutral are often anodyne factual statements: that a lawmaker won or lost an election, represented a certain district, or died. The analysis also conflated unrelated people of the same name, so, for example, most of the non-neutral statements about Mike Johnson concern not Mike Johnson the current Republican speaker of the House but a robber in a 1923 silent film, a prog-rock guitarist, multiple football players, and a famous yodeler.

But the more fundamental question is whether balanced sentiment or balanced anything across the contemporary political spectrum is the correct expectation for a project that operates by a different standard, one based on measures of reliability. Supposing the sentiment readings do reflect a real imbalance, is that due to the biases of editors, biases in their sources, or some other external imbalance, like a tendency by right-leaning politicians to express negative sentiments of fear or anger (a possibility the report raises, then dismisses)?

Wikipedia has a long history of attempting to disentangle and correct its various biases. The site’s editor community has been overwhelmingly white, male, and based in the United States and Europe since the site began. In 2018, 90 percent of editors were men, and only 18 percent of biographies in the encyclopedia were of women. That year, the Canadian physicist Donna Strickland won a Nobel Prize, and people turning to Wikipedia to learn about her discovered she lacked an article.

But the causal connection between these facts was not straightforward. Women have been historically excluded from the sciences, underrepresented in coverage of the sciences, and therefore underrepresented in the sources Wikipedia editors can cite. An editor had tried to make an article on Strickland several months before the Nobel but was overruled due to a lack of coverage in reliable sources. “Wikipedia is a mirror of the world’s biases, not the source of them. We can’t write articles about what you don’t cover,” tweeted then-executive director Katherine Maher.

Wikipedia’s sourcing guidelines are conservative in their deference to traditional institutions of knowledge production, like established newsrooms and academic peer review, and this means that it is sometimes late to ideas in the process of moving from fringe to mainstream. The possibility that covid-19 emerged from a lab was relegated to a section on conspiracy theories and is only now, after reporting by reliable sources, gaining a toehold on the covid pandemic article. Similarly, as awareness grew of the ways Western academic and journalistic institutions have excluded the perspectives of colonized people, critics argued that Wikipedia’s reliance on these same institutions made it impossible for the encyclopedia to be truly comprehensive.

Not all the bias comes from the project’s sources, though. A study that attempted to control for offline inequalities by examining only contemporary sociologists of similar achievement found that male academics were still more likely to have articles. As volunteers, editors work on topics they think are important, and the encyclopedia’s emphases and omissions reflect their demographics. Minor skirmishes in World War II and every episode of The Simpsons have an article, some of which are longer than the articles on the Ethiopian civil war or climate change in the Maldives. In an effort to fill in these gaps, the foundation has for several years funded editor recruitment and training initiatives under the banner of “knowledge equity.”

“Most editors on Wikipedia are English-speaking men, and our coverage is of things that are of interest to English-speaking men,” said a retired market analyst in Cincinnati who has been editing for over 20 years. “Our sports coverage is second to none. Video games, we got it covered. Wars, the history of warfare, my god. Trains, radio stations… But our coverage of foods from other countries is very low, and there is an absolute systemic bias against coverage of women and people of color.” For her part, she tries to fill gaps around food, creating new articles whenever she encounters a Peruvian chili sauce or African fufu that lacks one.

Yet these initiatives have come under attack as “DEI” by conservative influencers and Musk, who called for Wikipedia to be defunded until “they restore balance.”

These accusations of bias, familiar from attacks on the media and social platforms, encounter some unique challenges when leveled against Wikipedia. Crucially, if you think something is wrong on Wikipedia, you can fix it yourself, though it will require making a case based on verifiability rather than ideological “balance.”

Over the years, Wikipedia has developed an immune response to outside grievances. When people on X start complaining about Wikipedia’s suppression of UFO sightings or refusal to change the name of the Gulf of Mexico to Gulf of America, an editor often restricts the page to people who are logged in and puts up a notice directing newcomers to read the latest debate. If anything important was missed, they are welcome to suggest it, the notice reads, provided their suggestion meets Wikipedia’s rules, which can be read about on the following pages. That is, Wikipedia’s first and best line of defense is to explain how Wikipedia works.

Occasionally, people stick around and learn to edit. More often, they get bored and leave.

It was not unusual for skirmishes to break out over the Wikipedia page for Asian News International, or ANI. It is the largest newswire service in India, and as its Wikipedia article explains, it has a history of promoting false anti-Muslim and pro-government propaganda. It was these facts that various anonymous editors — not logged into Wikipedia accounts, so appearing only as IP addresses — attempted to remove last spring.

As typically happens, an experienced editor quickly reinstated the deleted sentences, noting that they had been removed without explanation. Then came another drive-by edit: actually, ANI is not propaganda and very credible, someone wrote, citing a YouTube video. Reverted: YouTube commentary is not a reliable source. Then another IP address, deleting a sentence about ANI promoting a false viral story about necrophilia in Pakistan. Reverted again. Another IP address, deleting the mention of propaganda with the explanation that the sources were “leftist dogs and swine.”

As the edit battle escalated, an editor locked the page so that only people who were logged in and had made a certain number of edits could make changes, ending the barrage of IP addresses.

Two months later, ANI sued.

The lawsuit revealed that several of the IP addresses had belonged to representatives of ANI attempting to remove unflattering information about the company. Blocked from doing so, ANI sued for defamation under a recent amendment to India’s equivalent of Section 230 that places stricter requirements on platforms to moderate content. When the Wikimedia Foundation declined to reveal the identities of three editors who had defended the page, the presiding judge said he would ask the government to block the site, threatening to cut off the country with the highest number of English Wikipedia readers after the US and the UK. “If you don’t like India,” the judge said, “please don’t work in India.”

During the appeal, Wikimedia’s lawyer argued that disclosing the identities of editors would destroy the encyclopedia’s self-regulating system and expose contributors to reprisals. Also, he noted, the sentences in question, like every assertion on Wikipedia, were only summarizing other sources, and those sources — the publications The Caravan and The Ken — had not been sued for defamation. (As with editors, the foundation’s first response to external threats is often to explain how Wikipedia works.) The judge dismissed the argument, saying that journalism might be “read by a hundred people, you don’t bother about it… it does not have the gravitas.” Wikipedia, however, is read by millions.

By this point the case had garnered enough coverage to warrant its own Wikipedia page. This seemed to enrage the judge, particularly the line noting that the judge’s demand to reveal the identities of editors had been described as “censorship and a threat to the flow of information.” This “borders on contempt,” the judge said, demanding that the foundation take the page down within 36 hours. In a rare move, the foundation complied.

The case alarmed editors around the world. An open letter calling on the Wikimedia Foundation to protect the anonymity of the editors garnered more than 1,300 signatures, the most of any letter directed at the foundation. Nevertheless, last December, the foundation disclosed the editors’ identities to the judge under seal. Responding to outrage on Wikipedia’s editor forum, Wales asked for calm and urged people not to jump to conclusions.

The Wikimedia Foundation has historically taken a hard line against attempts to influence the project. In 2017, when the Turkish government demanded several articles be deleted, Wikipedia refused and was blocked for nearly three years as it fought to the country’s Constitutional Court and won. For the second half of 2024, the most recent data available, the foundation complied with about 8 percent of requests for user data, compared to Google’s 82 percent and Meta’s 77 percent. And the data provided was sparse, because Wikipedia retains almost none.

But attempts to influence the site have grown more sophisticated. The change is likely due to multiple factors: a global rise of political movements that wish to control independent media, the increased centrality of Wikipedia, and a technical change to the website itself. In 2015, Wikipedia switched to the encrypted HTTPS protocol by default, making it impossible for outside observers to see which pages users visited, only that they were visiting the Wikipedia domain. This meant that governments that had previously been censoring specific articles on opposition figures or historic protests had to choose between blocking all of Wikipedia or none of it. Almost every country save China (and Russia, for several hours) chose not to block it. This was a victory for open knowledge, but it also meant governments had a greater interest in controlling what was written in the encyclopedia.

Instead of brute censorship, what has emerged is a sort of gray-zone information warfare. After mainland China quashed protests against the Hong Kong national security law in 2019, a battle began over how the protests would be remembered. Editors in mainland China, who can edit using VPNs, argued for the inclusion of state-friendly media that described the protests as “riots” or “terrorist attacks” while removing citations to independent media for unreliability and bias. In one case, an editor attempted to strip all citations to one of Hong Kong’s premier papers, Apple Daily, hours before it was shut down by the government. By conspiring offline and using fake accounts, they won elections to admin positions and with them the power to see other editors’ IP addresses, which they discussed using to reveal their opponents’ identities to the police. Shortly afterward, the Wikimedia Foundation banned or restricted more than a dozen editors operating from mainland China, saying that the project had been “infiltrated” and that “some users have been physically harmed as a result.”

Russia employed similar tactics after its invasion of Ukraine in 2022. State media and government officials attacked Wikipedia in the press with accusations of anti-Russian bias, promulgation of fake news, and foreign manipulation. The site remained accessible, but Russian search engines put a banner above it saying it was in violation of the law. Meanwhile, the government harassed the foundation with a series of fines for publishing “false” information about the military, which the foundation has refused to pay. Finally, on the encyclopedia, state-aligned editors pushed the government’s view while vigilantes doxxed and threatened their opposition. Last year, the head of Wikimedia Russia was declared a “foreign agent” and forced to resign from his job as a professor at Moscow State University.

In neighboring Belarus, editor Mark Bernstein was doxxed by a pro-Russian group in 2022, arrested, and sentenced to three years of home confinement. As many as five other editors have been detained by Belarusian authorities in recent months, according to media reports and editors.

As these battles continued, the Russian government supported the creation of a more compliant alternative, called Ruwiki, which launched early last year with the copying of 1.9 million articles from the originals, edited to reflect the government view. On Ruwiki, edits must comply with Russian laws and are subject to approval from outside experts. There, the map of Ukraine does not include Donetsk or Kherson, the war is a “special operation” in response to NATO aggression, and accounts of torture in Bucha are fake news.

Wikipedia remains online in Russia, but with Ruwiki, the government may now feel emboldened to block it. In May, at a hearing on media safety for children, the head of the Russian Duma Committee on the Protection of the Family said that the encyclopedia’s “interpretation of our historical events feels so hostile that we need to raise the issue of blocking this information resource,” and that the encyclopedia’s depiction of history is opposed to Russian “traditional, spiritual values.”

The goal of these campaigns is what the Wikimedia Foundation calls “project capture.” The term originates in an independent report the foundation commissioned in response to the takeover of the Croatian-language Wikipedia by a cabal of far-right editors.

In 2010, a group of editors won election to admin positions and began citing far-right alternative media to rewrite history. On Croatian Wikipedia, the Nazis invaded Poland to stop a genocide against the German people, Croatia’s role in the Holocaust is foreign propaganda, and Ratko Mladić was a decorated military leader whose conviction by the UN for genocide (briefly noted quite far down) was the result of an international conspiracy. When other editors attempted to correct the articles, the admins banned them for violating rules against hate speech or harassment.

The encyclopedia became so warped that it began receiving press coverage. The Croatian Minister of Education warned students not to use it. In an interview with a Croatian paper, Wales confirmed the foundation was aware of the problem and looking into it. Yet the foundation has a policy of allowing Wikipedia projects to self-govern, and interfering with Croatian Wikipedia risked opening a door to the many governments and companies that want things on Wikipedia changed.

Editors mounted a resistance and attempted to vote the admins out, but the admins defeated the attempt using votes from what were later revealed to be dozens of fake accounts. Because the admins were the only ones with the technical ability to trace IP addresses, the opposition had no way to prove this. The cabal now controlled all the levers of power. By 2019, nearly all of the editors who opposed them had been banned or harassed off the project.

In 2020, one of the few remaining dissident editors compiled a comprehensive textual and statistical analysis of editing patterns of dozens of accounts and filed a request for an admin to run IP traces to see if they were sock puppets. The admin stalled, then attempted to fudge the traces, but did so in such a transparent way that it was clear the accounts were indeed fakes.

This was the evidence required to procedurally break the cabal. High-ranking admins called “stewards” from other-language Wikipedias administered a new vote on banning the Croatian admins. This time, the admins lost. Their ringleader, username Kubura, was banned from all Wikipedia projects forever, a punishment that had been leveled against fewer than a dozen others in Wikipedia history. A local daily covered the incident with the headline “Kubura’s Downfall: Banned Globally, His Followers Retreat, Leaderless.”

The foundation’s postmortem analysis compared the takeover to “state capture, one of the most pressing issues of today’s worldwide democratic backsliding.” The clique still cited the reliability of sources and invoked rules of debate, but it bent these processes to serve its nationalist purpose. As many governments have discovered, it is extremely difficult to insert propaganda into Wikipedia without running afoul of some rule or another. But what the Croatia capture showed is that Wikipedia’s processes are only effective if they are administered by people who believe in the spirit of the project. If they can be silenced or replaced, it becomes possible to steer the encyclopedia in a different direction.

One editor I spoke with, who asked to remain anonymous for reasons that will be obvious, had been editing Wikipedia for several years while living in a Middle Eastern country where much other media is tightly controlled. One day he received a call from a member of the intelligence service inviting him to lunch. He cried for hours — everyone knew what this meant.

The meeting was cordial but clear. They didn’t want him to stop editing Wikipedia. They wanted his help. They knew the encyclopedia has rules and you can’t just insert flagrant propaganda, but as a respected member of the community, maybe he could edit in ways that were a little friendlier to the government, maybe decide in its favor when certain topics came up for debate. In exchange, maybe the service could help him if he ever got in trouble with the police, for example, over his sexuality; he was gay in a country where that was illegal.

He fled the country weeks later. He now edits from abroad, but he knows of five to 10 others who have faced arrest or intimidation over their editing. They must do constant battle with editors he believes to be government agents who push the state’s perspective, debating tirelessly for hours because it is literally their job.

It’s a rare person who is able to uproot their life in the service of a volunteer side project. Understandably, many others faced with such threats become more cautious in their editing or stop altogether. Multiple editors based in India said that they now avoid editing topics related to their country. The ANI case had a chilling effect, as have recurring harassment campaigns. The far-right online publication OpIndia regularly accuses Wikipedia of “anti-Hindu and anti-India bias,” in ways that parallel attacks from the US right, down to citations of Manhattan Institute research and quotes from the disgruntled cofounder, Sanger. The organization has published the real names and employers of editors it accuses of being “leftists” or “Islamists,” leading at least one veteran editor to delete their account.

Even ancient history can be cause for reprisals. In February, after the release of a Bollywood action film about Chhatrapati Sambhaji Maharaj, a 17th-century king who fought the Mughals, accounts on X began whipping up outrage over several facts on Sambhaji’s Wikipedia page that they deemed to be anti-Hindu. When editors reversed attempts to delete the offending lines, another X user posted their usernames and called on government officials to investigate them. Days later, local press reported that the Maharashtra cyber police opened cases against at least four editors.

“Various editors have left Wikipedia over this persecution, fearing their own safety,” said an Indian Wikipedia editor who asked to remain anonymous out of fear of retaliation. “I believe this is completely useful for the right wing, if you issue cases and file complaints against editors, they tend not to edit those pages anymore, fearing for their safety in real life.”

He still edits, but mostly sticks to the safer ground of the Roman Empire.

In April, the Trump administration’s interim US attorney for DC, Edward Martin Jr., sent a letter to the Wikimedia Foundation accusing the organization of disseminating “propaganda” and intimating that it had violated its duties as a tax-exempt nonprofit.

From a legal perspective, it was an odd document. The tax status of nonprofits is not generally the jurisdiction of the US attorney for DC, and many of the supposed violations, like having foreign nationals on its board or permitting “the rewriting of key, historical events and biographical information of current and previous American leaders,” are not against the law. Sanger is quoted, criticizing editor anonymity. In several cases, the rules Martin accuses Wikipedia of violating are Wikipedia’s own, like a commitment to neutrality. But the implied threat was clear.

“We’ve been anticipating something like this letter happening for some time,” a longtime editor, Lane Rasberry, said. It fits the pattern seen in India and elsewhere. He has been hearing more reports of threats against editors who work on pages related to trans issues and has been conducting security trainings to prevent their identities being revealed. Several US-based editors told me they now avoid politically contentious topics out of fear that they could be doxxed and face professional or legal retaliation. “There are more Wikipedia editors getting threats, more people getting scared,” Rasberry said.

Talking to editors, I encountered a confounding spread of opinions about the seriousness of the threat to Wikipedia, often in the same conversation. The site has sloughed off more than two decades of attacks, and so far the latest round is no different. The Heritage Foundation plan to dox editors has yet to materialize. Musk’s calls for his followers to stop donating have resulted in surges in donations, according to publicly available data.

In India, the High Court struck down the order to take down the article about ANI’s defamation case, though the case itself is ongoing. Wikipedia’s critics on the right and in the Silicon Valley elite often propose generative AI as the solution to Wikipedia’s perceived biases, for each user a bespoke source of ideologically agreeable information. Yet all these projects remain wholly reliant on Wikipedia, and so far the most aggressive such initiative, Musk’s Grok, has spent much of its existence flailing between fact-checking Musk’s own conspiracy theories and proclaiming itself MechaHitler.

But new threats continue to appear. In August, the foundation lost its case arguing for an exemption from the UK’s Online Safety Act, which would force Wikipedia to verify the identities of its editors, though it is continuing to appeal. In Portugal the foundation received a court order arising from a defamation case brought by Portuguese American businessman Cesar DePaço, who objected to information on his page about past criminal allegations and links to the far-right Portuguese party Chega. Complying with the ruling, the foundation struck several facts from his biography and disclosed “a small amount of user data” about eight editors. The foundation is now bringing the case before the European Court of Human Rights. And in the US, there is the recent House Oversight letter.

No matter the outcome, these cases contribute to a general increase in pressure on the project’s already strained editors. English Wikipedia has fewer than 40,000 active editors, defined as users who have made five or more edits in the last month. The number of active administrators, crucial to maintaining the site and enforcing policy, peaked in 2008 and now stands at around 450. AI threatens to squeeze the editor pipeline further. The more people who get information from AI summaries of Wikipedia rather than the site itself, the fewer people who will wander down a rabbit hole, encounter an error that needs correcting, and become editors themselves.
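The "active editor" threshold described above (five or more edits in the last month) can be expressed as a small counting rule. The sketch below is only an illustration: the function name, the 30-day reading of "last month," and the `(username, timestamp)` input format are assumptions, not how Wikipedia's own statistics are computed.

```python
from collections import Counter
from datetime import datetime, timedelta


def active_editors(edits, now, window_days=30, min_edits=5):
    """Return the users with at least `min_edits` edits in the window.

    `edits` is an iterable of (username, timestamp) pairs. Treating
    "the last month" as 30 days is an assumption made for this sketch.
    """
    cutoff = now - timedelta(days=window_days)
    counts = Counter(user for user, ts in edits if cutoff <= ts <= now)
    return {user for user, n in counts.items() if n >= min_edits}


# Hypothetical edit log: "a" edits daily, "b" edits too rarely,
# and "c" was busy but outside the window.
now = datetime(2025, 9, 1)
edits = (
    [("a", now - timedelta(days=i)) for i in range(5)]
    + [("b", now - timedelta(days=1))] * 4
    + [("c", now - timedelta(days=40))] * 6
)
print(active_editors(edits, now))
```

Under this rule, an editor who was prolific six weeks ago but has since gone quiet drops out of the count, which is one way burnout shows up in the headline number.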

At the same time, people are using AI to add plausible-looking but false or biased information to the encyclopedia, increasing the workload for editors. Harassment, ideological editing campaigns, government investigations, targeted lawsuits — even if they lead nowhere, they will make the prospect of editing more daunting and increase the odds that current editors burn out. “Wikipedia should not be taken for granted,” Rasberry said. “This is an existential threat.”

The first reactions to the Martin letter on the Wikipedia editor forums were radical: the foundation should leave the US, maybe for France, or Iceland, or Germany. This would not be unprecedented, an editor pointed out. The Encyclopédistes fled to Switzerland when the ancien régime attempted to censor them. Maybe the site should go dark in protest.

But moderation soon prevailed. “The community needs to chill on the blackout talk,” wrote an editor by the name of Tazerdadog. “We’re not there yet.” Right now, the best response to these threats is to double down on Wikipedia’s policies, particularly the refusal to be censored and its dedication to neutral point of view, they wrote.

NPOV

“I 100% agree with you, Tazerdadog,” replied “Jimbo Wales.” “Emphasizing to the WMF that NPOV is non-negotiable is not really the issue.” In fact, Wales wrote, he is chairing a working group on strengthening the policy. The initiative was announced in March, framed as a response to the global rise in threats to sources of neutral information, and to a fragmentation of the public’s understanding of the very concepts of neutrality and facts. Wikipedia’s response, it seemed, would be to neutral harder.

In May, I met Wales for coffee at a members club in Chelsea where he had been granted an honorary membership after giving a talk. (Wikipedia, as journalists have noted for years, did not make Wales a tech billionaire.) Extravagant bouquets of pastel flowers were arranged in an arch above the doorway and festooned the tables of the interior. Wales, dressed to meet his wife at the Chelsea Flower Show, matched the decor in a green linen suit and floral shirt. He does not, he said, normally dress like a leprechaun.

He was not particularly concerned about the attacks on Wikipedia, he said, though he warned that he is “pathologically optimistic.” Wikipedia has been attacked since it began. It fought Turkey’s ban to the Constitutional Court and won. Even Russian Wikipedia has proven resilient. In the US, the government lacks much of the leverage it has deployed against other institutions. Wikipedia doesn’t rely on government funding, and protections for online speech are strong. In the last fiscal year, the foundation took in $170 million in donations, with an average size of about $10.

As for the accusations of bias, why not investigate? Whether the attacks are in good faith or bad, it doesn’t really matter, Wales said. The foundation had already decided that it was a good time, given the fragmented and polarizing world, to examine and bolster Wikipedia’s neutrality processes. Wales, leaning over the coffee table, seemed excited at the prospect.

“If somebody turns up on a talk page and says, ‘Hey, this article is a mess, it’s wrong. It’s really biased,’ the right answer is to not scream at them and run and hide. The right answer is go, ‘Oh, tell me more. Let’s dig in. Where is it biased? How do we think about how do we fix that?’”

Let’s figure out the best methodologies for studying neutrality, Wales said. Let’s look at how editors evaluate the reliability of sources. Maybe Wikipedia does use the label “far-right” more than “far-left,” Wales said, a criticism that has been leveled at the site. Is that because the media uses the term more, and does Wikipedia use the term more or less than the media does, and does the media use the term more because there are more far-right movements in the world today?

“You have to chew on these things. There’s no simple answers.”

But there are answers. If the social platforms and language models that increasingly shape our understanding of the world are inscrutable black boxes, Wikipedia is the opposite, maybe the most legible, endlessly explainable information management system ever made. For any sentence, there is a source, and a reason that that source was used, and a reason for that reason.

“Let’s dig in,” Wales repeated. “Let’s assess the evidence. Let’s talk to a lot of different people. Let’s really try and understand.” Come, be part of the process. His working group is starting to discuss the best approach. The meetings, Wales acknowledged, have been very tedious so far.

As for the letter from the interim DC attorney, Trump withdrew Martin’s nomination in May, though he still has a position leading the Justice Department’s retribution-oriented “task force on weaponization.” In any case, the Wikimedia Foundation responded promptly.

“The foundation staff spent a lot of passion writing it,” Wales said of the reply. “Then they ran it by me for review, and I was ready to jump in, but I was like, actually, it’s perfect.”

“It’s very calm,” Wales said. “Here are the answers to your questions, here is what we do.” It explains how Wikipedia works.

How AI can make history

2024-02-15T09:00:00-05:00

Like millions of other people, the first thing Mark Humphries did with ChatGPT when it was released in late 2022 was ask it to perform parlor tricks, like writing poetry in the style of Bob Dylan — which, while very impressive, did not seem particularly useful to him, a historian studying the 18th-century fur trade. But Humphries, a 43-year-old professor at Wilfrid Laurier University in Waterloo, Canada, had long been interested in applying artificial intelligence to his work. He was already using a specialized text recognition tool designed to transcribe antiquated scripts and typefaces, though it made frequent errors that took time to correct. Curious, he pasted the tool’s garbled interpretation of a handwritten French letter into ChatGPT. AI corrected the text, fixing all the Fs that had been misread as an S and even adding missing accents. Then Humphries asked ChatGPT to translate it to English. It did that, too. Maybe, he thought, this thing would be useful after all.

For Humphries, AI tools held a tantalizing promise. Over the last decade, millions of documents in archives and libraries have been scanned and digitized — Humphries was involved in one such effort himself — but because their wide variety of formats, fonts, and vocabulary rendered them impenetrable to automated search, working with them required stupendous amounts of manual research. For a previous project, Humphries pieced together biographies for several hundred shellshocked World War I soldiers from assorted medical records, war diaries, newspapers, personnel files, and other ephemera. It had taken years and a team of research assistants to read, tag, and cross-reference the material for each individual. If new language models were as powerful as they seemed, he thought, it might be possible to simply upload all this material and ask the model to extract all the documents related to every soldier diagnosed with shell shock.

“That’s a lifetime’s work right there, or at least a decade,” said Humphries. “And you can imagine scaling that up. You could get an AI to figure out if a soldier was wounded on X date, what was happening with that unit on X date, and then access information about the members of that unit, that as historians, you’d never have the time to chase down on an individual basis,” he said. “It might open up new ways of understanding the past.”

Improved database management may be a far cry from the world-conquering superintelligence some predict, but it’s characteristic of the way language models are filtering into the real world. From law to programming to journalism, professionals are trying to figure out whether and how to integrate this promising, risky, and very weird technology into their work. For historians, a technology capable of synthesizing entire archives that also has a penchant for fabricating facts is as appealing as it is terrifying, and the field, like so many others, is just beginning to grapple with the implications of such a potentially powerful but slippery tool.

AI seemed to be everywhere at the 137th annual meeting of the American Historical Association last month, according to Cindy Ermus, an associate professor of history at the University of Texas at San Antonio. She chaired one of several panels on the topic. Ermus described her and many of her colleagues’ relationship to AI as that of “curious children,” wondering with both excitement and wariness what aspects of their work it will change and how. “It’s going to transform every part of historical research, from collection, to curation, to writing, and of course, teaching,” she said. She was particularly impressed by Lancaster University lecturer Katherine McDonough’s presentation of a machine learning program capable of searching historic maps, initially trained on ordnance surveys of 19th-century Britain.

“She searched the word ‘restaurant,’ and it pulled up the word ‘restaurant’ in tons of historical maps through the years,” Ermus said. “To the non-historian, that might not sound like a big deal, but we’ve never been able to do that before, and now it’s at our fingertips.”

Another attendee, Lauren Tilton, professor of liberal arts and digital humanities at the University of Richmond, had been working with machine learning for over a decade and recently worked with the Library of Congress to apply computer vision to the institution’s vast troves of minimally labeled photos and films. All archives are biased — in what material is saved to begin with and in how it is organized. The promise of AI, she said, is that it can open up archives at scale and make them searchable for things the archivists of the past didn’t value enough to label.

“The most described materials in the archive are usually the sort of voices we’ve heard before — the famous politicians, famous authors,” she said. “But we know that there are many stories by people of minoritized communities, communities of color, LGBTQ communities that have been hard to tell, not because people haven’t wanted to, but because of the challenges of how to search the archive.”

AI systems have their own biases, however. They have the well-documented tendency to reflect the gender, racial, and other biases of their training data — the fact that, as Ermus pointed out, when she asked GPT-4 to create an image of a history professor, it drew an elderly white man with elbow patches on his blazer — but they also display a bias that Tilton calls “presentism.” Because the vast preponderance of training data is scraped from the contemporary internet, models reflect a contemporary worldview. Tilton encountered this phenomenon when she found image recognition systems struggled to make sense of older photos, for example, labeling typewriters as computers and their paperweights as their mice. These were image recognition systems, but language models have a similar problem.

Impressed with ChatGPT, Humphries signed up for the OpenAI API and set out to make an AI research assistant. He was trying to track 18th-century fur traders through a morass of letters, journals, marriage certificates, legal documents, parish records, and contracts in which they appear only fleetingly. His goal was to design a system that could automate the process.

One of the first challenges he encountered was that 18th-century fur traders do not sound anything like a language model assumes. Ask GPT-4 to write a sample entry, as I did, and it will produce lengthy reflections on the sublime loneliness of the wilderness, saying things like, “This morn, the skies did open with a persistent drizzle, cloaking the forest in a veil of mist and melancholy,” and “Bruno, who had faced every hardship with the stoicism of a seasoned woodsman, now lay still beneath the shelter of our makeshift tent, a silent testament to the fragility of life in these untamed lands.”

Whereas an actual fur trader would be far more concise. For example, “Fine Weather. This morning the young man that died Yesterday was buried and his Grave was surrounded with Pickets. 9 Men went to gather Gum of which they brought wherewith to Gum 3 Canoes, the others were employed as yesterday,” as one wrote in 1806, referring to gathering tree sap to seal the seams of their bark canoes.

“The problem is that the language model wouldn’t pick up on a record like that, because it doesn’t contain the type of reflective writing that it’s trained to see as being representative of an event like that,” said Humphries. Trained on contemporary blog posts and essays, it would expect the death of a companion to be followed by lengthy emotional remembrances, not an inventory of sap supplies.

By fine-tuning the model on hundreds of examples of fur trader prose, Humphries got it to pull out journal entries in response to questions, but not always relevant ones. The antiquated vocabulary still posed a problem — words like varangue, a French term for the rib of a canoe that would rarely appear in the model’s training data, if ever.

After much trial and error, he ended up with an AI assembly line using multiple models to sort documents, search them for keywords and meaning, and synthesize answers to queries. It took a lot of time and a lot of tinkering, but GPT helped teach him the Python he needed. He named the system HistoryPearl, after his smartest cat.
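The shape of such an assembly line (sort, then search, then synthesize) can be sketched in a few lines of Python. This is not HistoryPearl's code, which the article does not show. Every name here is hypothetical, the sample records are invented for illustration, and each stage is a plain function standing in for what would be a model call in a real system.

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    kind: str  # hypothetical labels such as "journal" or "contract"
    text: str


def sort_documents(docs, wanted_kind):
    """Stage 1: keep only documents of the requested kind."""
    return [d for d in docs if d.kind == wanted_kind]


def search_keywords(docs, keywords):
    """Stage 2: keep documents mentioning any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [d for d in docs if any(k in d.text.lower() for k in lowered)]


def synthesize(question, docs):
    """Stage 3: stand-in for a model call; it only lists the evidence found."""
    if not docs:
        return f"No records found for: {question}"
    cited = "; ".join(f"[{d.doc_id}] {d.text}" for d in docs)
    return f"{question} -> {cited}"


def answer(question, docs, wanted_kind, keywords):
    """Chain the three stages into one query."""
    hits = search_keywords(sort_documents(docs, wanted_kind), keywords)
    return synthesize(question, hits)


# Invented sample records, loosely modeled on the entries quoted above.
records = [
    Document("j1", "journal", "Fine Weather. 9 Men went to gather Gum."),
    Document("c1", "contract", "Engagement of a voyageur for three years."),
    Document("j2", "journal", "Repaired the varangue of the large canoe."),
]
print(answer("Which entries mention canoe repair?", records,
             "journal", ["gum", "varangue"]))
```

Keeping the stages separate means each one can be swapped for a stronger model, or checked by a historian, without touching the others.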

He tested his system against edge cases, like the Norwegian trader Ferdinand Wentzel, who wrote about himself in the third person and deployed an odd sense of humor, for example, writing about the birth of his son by speculating about his paternity and making self-deprecating jokes about his own height — “F. W.’s Girl was safely delivered of a boy. – I almost believe it is his Son for his features seem to bear some resemblance of him & his short legs seem to determine this opinion beyond doubt.” This sort of writing stymied earlier models, but HistoryPearl could pull it up in response to a vaguely phrased question about Wentzel’s humor, along with other examples of Wentzel’s wit Humphries hadn’t been looking for.

The tool still missed some things, but it performed better than the average graduate student Humphries would normally hire to do this sort of work. And faster. And much, much cheaper. Last November, after OpenAI dropped prices for API calls, he did some rough math. What he would pay a grad student around $16,000 to do over the course of an entire summer, GPT-4 could do for about $70 in around an hour.
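The rough math is easy to reproduce. The snippet below uses only the two dollar figures quoted above; the function name is made up for illustration, and it deliberately does not compare hours, since the article gives no figure for how long a summer of grad student work is.

```python
def cost_comparison(human_cost, model_cost):
    """Return (how many times cheaper, fractional saving) for the model."""
    ratio = human_cost / model_cost
    savings = 1 - model_cost / human_cost
    return ratio, savings


# Figures from the text: about $16,000 for a grad student, about $70 for GPT-4.
ratio, savings = cost_comparison(16_000, 70)
print(f"about {ratio:.0f} times cheaper, a {savings:.1%} reduction")
```

On those figures the model comes out roughly 229 times cheaper, a reduction of about 99.6 percent.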

“That was the moment where I realized, ‘Okay, this begins to change everything,’” he said. As a researcher, it was exciting. As a teacher, it was frightening. Organizing fur trading records may be a niche application, but a huge number of white collar jobs consist of similar information management tasks. His students were supposed to be learning the sorts of research and thinking skills that would allow them to be successful in just these sorts of jobs. In November, he published a newsletter imploring his peers in academia to take the rapid development of AI seriously. “AI is simply starting to outrun many people’s imaginations,” he wrote. “They’re still talking about the technology as if it is a theoretical thing without the full understanding that it poses a very real, existential threat to our whole raison d’être as higher educators.”

In the meantime, though, he was pleased that his tinkering had resulted in what he calls a “proof of concept”: reliable enough to be potentially useful, though not yet enough to fully trust. Humphries and his research partner, the historian Lianne Leddy, submitted a grant application to scale their research up to all 30,000 voyageurs in their database. In a way, he found the labor required to develop this labor-saving system comforting. The largest improvements in the model came from feeding it the right data, something he was able to do only because of his expertise in the material. Lately, he has been thinking that there may actually be more demand for domain experts with the sort of research and critical assessment skills the humanities teach. This year he will teach an applied generative AI program he designed, run out of the Faculty of Arts.

“In some ways this is old wine in new bottles, right?” he said. In the mid 20th century, he pointed out, companies had vast corporate archives staffed by researchers who were experts, not just in storing and organizing documents, but in the material itself. “In order to make a lot of this data useful, people are needed who have both the ability to figure out how to train models, but more importantly, who understand what is good content and what’s not. I think that’s reassuring,” he said. “Whether I’m just deluding myself, that’s another question.”