In front of a suburban house on the outskirts of the northern German city of Hamburg, a single word—“LAION”—is scrawled in pencil across a mailbox. It’s the only indication that the home belongs to the person behind a massive data-gathering effort central to the artificial intelligence boom that has seized the world’s attention.
That person is high school teacher Christoph Schuhmann, and LAION, short for “Large-scale AI Open Network,” is his passion project. When Schuhmann is not teaching physics and computer science to German teens, he works with a small team of volunteers building the world’s biggest free AI training data set, which has already been used in text-to-image generators such as Google’s Imagen and Stable Diffusion.
Databases like LAION are central to AI text-to-image generators, which rely on them for the enormous amounts of visual material used to deconstruct and create new images. The debut of these products late last year was a paradigm-shifting event: it sent the tech sector’s AI arms race into hyperdrive and raised myriad ethical and legal issues. Within a matter of months, lawsuits had been filed against generative AI companies Stability AI and Midjourney for copyright infringement, and critics were sounding the alarm about the violent, sexualized, and otherwise problematic images within their training datasets, which critics say introduce biases that are nearly impossible to mitigate.
But these aren’t Schuhmann’s concerns. He just wants to set the data free.
The 40-year-old teacher and trained actor helped found LAION two years ago after hanging out on a Discord server for AI enthusiasts. The first iteration of OpenAI’s DALL-E, a deep learning model that generates digital images from language prompts—say, creating an image of a pink chicken sitting on a sofa in response to such a request—had just been released, and Schuhmann was both inspired and concerned that it would encourage big tech companies to make more data proprietary. “I instantly understood that if this is centralized to one, two or three companies, it will have really bad effects for society,” Schuhmann said.
In response, he and other members of the server decided to create an open-source dataset to help train text-to-image diffusion models, a months-long process similar to teaching someone a foreign language with millions of flash cards. The group used raw HTML code collected by the California nonprofit Common Crawl to locate images around the web and associate them with descriptive text. The process involves no human curation.
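The core idea can be sketched in a few lines. This is an illustrative simplification, not LAION's actual pipeline (which also filters candidate pairs, for instance by scoring image-text similarity with a model like CLIP): scan raw HTML for `<img>` tags and pair each image URL with its alt text.

```python
# Illustrative sketch (not LAION's actual code): harvest image-caption
# candidates from raw HTML by pairing each <img> tag's src URL with its
# alt-text attribute, the descriptive text the surrounding page supplies.
from html.parser import HTMLParser


class ImageAltExtractor(HTMLParser):
    """Collects (src, alt) pairs for every <img> with non-empty alt text."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        alt = (attr.get("alt") or "").strip()
        if src and alt:  # keep only images that come with a usable caption
            self.pairs.append((src, alt))


def extract_pairs(html: str):
    """Return all (image URL, caption) candidates found in an HTML string."""
    parser = ImageAltExtractor()
    parser.feed(html)
    return parser.pairs


# Example: the captioned image is kept, the decorative spacer is dropped.
sample = (
    '<img src="chicken.jpg" alt="a pink chicken on a sofa">'
    '<img src="spacer.gif" alt="">'
)
print(extract_pairs(sample))  # [('chicken.jpg', 'a pink chicken on a sofa')]
```

At web scale the same logic runs over billions of pages, which is why the resulting dataset inherits whatever the web contains, good and bad alike.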
Within a few weeks, Schuhmann and his colleagues had 3 million image-text pairs. After three months, they released a dataset with 400 million pairs. That number is now over 5 billion, making LAION the largest free dataset of images and captions.
As LAION’s reputation grew, the team worked without pay, receiving a one-off donation in 2021 from the machine-learning company Hugging Face. Then one day, a former hedge fund manager entered the Discord chat.
Emad Mostaque offered to cover the costs of computing power, no strings attached. He wanted to launch his own open-source generative AI business and was keen to tap LAION to train his product. The team initially scoffed at the proposal, taking him for a kook.
“We were very skeptical in the beginning,” Schuhmann said. “But after four weeks or so we got access to GPUs in the cloud that would normally have cost around $9,000 or $10,000.”
When Mostaque launched Stability AI in 2022, he used LAION’s dataset for Stable Diffusion, its flagship AI image generator, and hired two of the organization’s researchers. A year on, the company is seeking a $4 billion valuation, thanks largely to the data made available by LAION. For his part, Schuhmann hasn’t profited from LAION and says he isn’t interested in doing so. “I’m still a high school teacher. I have rejected job offers from all different kinds of companies because I wanted this to stay independent,” he said.
Many of the images and links in databases like LAION have been sitting in plain sight on the web, in some cases for decades. It took the AI boom to reveal their true value: the bigger and more diverse a dataset is, and the higher the quality of its images, the clearer and more precise an AI-generated image will be.
That realization, in turn, has raised a number of legal and ethical questions about whether publicly available materials can be used to feed databases—and if the answer is yes, whether creators should be paid.
To build LAION, founders scraped visual data from companies such as Pinterest, Shopify and Amazon Web Services—which did not comment on whether LAION’s use of their content violates their terms of service—as well as YouTube thumbnails, images from portfolio platforms like DeviantArt and EyeEm, photos from government websites including the US Department of Defense, and content from news sites such as The Daily Mail and The Sun.
If you ask Schuhmann, he says that anything freely available online is fair game. But there is currently no AI regulation in the European Union, and the forthcoming AI Act, whose language will be finalized early this summer, will not rule on whether copyrighted materials can be included in big data sets. Rather, lawmakers are discussing whether to include a provision requiring the companies behind AI generators to disclose what materials went into the data sets their products were trained on, thus giving the creators of those materials the option of taking action.
The basic idea behind the provision, European Parliament Member Dragos Tudorache told Bloomberg, is simple: “As a developer of generative AI, you have an obligation to document and be transparent about the copyrighted material that you have used in the training of algorithms.”
Such regulation wouldn’t be an issue for Stability AI, but it could be a problem for other text-to-image generators—“no one knows what OpenAI actually used to train DALL-E 2,” Schuhmann said, citing it as an example of how tech companies lock up public data. It would also upend what is now the status quo in data collection.
“It has become a tradition within the field to just assume you don’t need consent or you don’t need to inform people, or they don’t even have to be aware of it. There is a sense of entitlement that whatever is on the web, you can just crawl it and put it in a data set,” said Abeba Birhane, a Senior Fellow in Trustworthy AI at Mozilla Foundation who has studied LAION.
Although LAION has not been sued directly, it has been named in two lawsuits: one accusing Stability and Midjourney of using copyrighted images by artists to train their models, and another by Getty Images against Stability, which alleges that 12 million of its images were scraped by LAION and used to train Stable Diffusion.
Because LAION is open-source, it’s impossible to know which or how many other companies have used the dataset. Google has acknowledged that it tapped LAION to help train its Imagen and Parti AI text-to-image models. Schuhmann believes that other large companies are quietly doing the same and simply not disclosing it.
Worst of the Web
Sitting in the living room as his son played Minecraft, Schuhmann likened LAION to a “small research boat” atop a “big information technology tsunami,” taking samples of what’s beneath to display to the world.
“This is a tiny amount of what’s available publicly on the Internet,” he said of LAION’s database. “It’s really easy to get because even we, with maybe a budget of $10,000 from donors, can do it.”
But what’s publicly available isn’t always what the public wants—or is legally allowed to see. In addition to SFW photos of cats and fire trucks, LAION’s dataset contains millions of images of pornography, violence, child nudity, racist memes, hate symbols, copyrighted art, and works scraped from private company websites. Schuhmann said he was unaware of any child nudity in LAION’s data set, though he acknowledged he did not review the data in great depth. If notified about such content, he said, he would remove links to it immediately.
Schuhmann consulted lawyers and ran an automated tool to filter out illegal content before he began assembling the database, but he is less interested in sanitizing LAION’s holdings than in learning from them. “We could have filtered out violence from the data we released,” he said, “but we decided not to because it will speed up the development of violence detection software.” LAION does provide a takedown form to request the removal of photos, but the dataset has already been downloaded thousands of times.
Offensive content lifted from LAION appears to have been integrated into Stable Diffusion, where despite recently tightened filters, it’s easy to generate fake Islamic State beheading photos or Holocaust images. Some experts believe such material can also create biases within an AI generator itself: tools like DALL-E 2 and Stable Diffusion have been criticized for reproducing racial stereotypes even when a text prompt doesn’t imply the subject’s race.
Such biases were why Google decided not to release Imagen, which had been trained on LAION.
When reached for comment, Stability AI said it trained Stable Diffusion on a curated subset of LAION’s database. The company sought to “give the model a much more diverse and wide-ranging dataset than that of the original SD,” it wrote in an e-mail, adding that it tried to remove “adult content using LAION’s NSFW filter.”
Even advocates of open source-based AI warn of the implications of training AI on uncurated datasets. According to Yacine Jernite, who leads the Machine Learning and Society team at Hugging Face, generative AI tools built on tainted data will reflect that data’s biases. “The model is a very direct reflection of what it’s trained on.”
Introducing guardrails after the product is up and running isn’t sufficient, Jernite added, as users will always find ways to circumvent the safety measures. “That’s what happens when you take a model that is trained to emulate what people do on the Internet in general and then say, ‘Okay, but don’t do that.’ People will find a way to still make it do that,” they said.
Gil Elbaz, founder of the data nonprofit Common Crawl, doubts whether “there’s a straight line that you can draw from the training sets to what’s produced,” and instead likened the process to an artist who goes to museums for inspiration but is blocked from making replicas of artworks. Instead, he said, “it’s important for society to decide what use cases are legal or not legal.”
It won’t only be left up to society. As regulators in Europe craft legislation to navigate the uses of artificial intelligence, they are grappling with the fact that the data now being mined for the current AI boom has for years been generated in a legal gray zone that is only now coming under serious scrutiny. “AI wouldn’t have been possible at this level of complexity without years of the accumulation of data,” said Tudorache, the European Parliament member.
But to Schuhmann, it’s not the datasets that should be monitored. In his eyes, the worst-case scenario for AI is one in which Big Tech is able to crowd out developers by catering their tools to a regulatory framework. “If we try to slow things down and over-regulate,” he warned, “there is a big danger that in the end, only a few big corporate players can afford to fulfill all the formal requirements.”
Image credits: Maria Feck/Bloomberg