Sliding-scale analysis of 'open-source' language models: not-for-profit organisations come out on top, says Dutch language scientist
In their study, Dingemanse and Liesenfeld assessed 40 large language models — systems that learn to generate text by making associations between words and phrases in large volumes of data — all of which claim to be open source. The pair rated the models on 14 parameters, including the availability of code and training data, what documentation is published and how easy the model is to access. On this basis, models were divided into open, partially open or closed categories.
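The sliding-scale idea can be sketched in a few lines of code. Everything below is a hypothetical illustration: the parameter names, scores, and category thresholds are invented for the example and are not the authors' actual 14-parameter rubric or data.

```python
# Hypothetical sketch of a sliding-scale openness assessment.
# Parameter names and scores are illustrative, not the study's real rubric.

# Each assessed dimension is rated open (2), partial (1), or closed (0).
models = {
    "ModelA": {"source code": 2, "training data": 0,
               "model weights": 2, "documentation": 1},
    "ModelB": {"source code": 1, "training data": 1,
               "model weights": 1, "documentation": 1},
}

def openness_category(scores: dict) -> str:
    """Collapse per-parameter ratings into open / partially open / closed."""
    total = sum(scores.values())
    maximum = 2 * len(scores)
    if total == maximum:
        return "open"
    if total == 0:
        return "closed"
    return "partially open"

# League table: rank models by total openness score, most open first.
ranking = sorted(models, key=lambda m: sum(models[m].values()), reverse=True)
for name in ranking:
    print(name, sum(models[name].values()), openness_category(models[name]))
```

The point of the sliding scale is visible in the output: a model that publishes its weights but hides its training data still scores points, instead of being forced into a binary open/closed label.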
Mark Dingemanse, a language scientist at the University of Nijmegen in the Netherlands, says that some big firms claim to offer open-source models while trying to get away with revealing as little as possible, a practice known as 'open-washing'.
“To our surprise, it was the small players, with relatively few resources, that go the extra mile,” says Dingemanse, who together with his colleague Andreas Liesenfeld, a computational linguist, created a league table that identifies the most and least open models (see table). They published their findings in conference proceedings.
The study cuts through “a lot of the hype and fluff around the current open-sourcing debate”, says Abeba Birhane, a cognitive scientist at Trinity College Dublin and adviser on AI accountability to the Mozilla Foundation, a non-profit organization based in Mountain View, California.
This sliding-scale approach to analysing openness is a useful and practical one, says Amanda Brock, chief executive officer of OpenUK, a London-based not-for-profit company that focuses on open technology.
The lack of openness about the data used to train the models is worrying, say the authors. Around half of the models that they analysed do not provide any details about data sets beyond generic descriptors, they say.
They also found that peer-reviewed papers detailing the models are extremely rare. Peer review seems to have “almost completely fallen out of fashion”, replaced by blog posts with cherry-picked examples, or by corporate preprints that are low on detail. “A nice, flashy-looking paper is usually released by companies on their website. But if you pore over it, there is no specification whatsoever of what data went into that system,” says Dingemanse.
Openness is essential for reproducibility, Dingemanse argues: if you cannot reproduce a result, it is hard to call it science. Researchers need enough information to build their own versions of models, which is the only way to innovate. Models must also be open to scrutiny. “If we cannot look inside to know how the sausage is made, we also don’t know whether to be impressed by it,” Dingemanse says. For example, it might not be an achievement for a model to pass a particular exam if it was trained on many examples of the test. And without data accountability, no one can be sure whether inappropriate or copyrighted data has been used.
The pair hope to help fellow scientists avoid the traps they themselves fell into when searching for models to study.
What Europeans Think about Tech: From Facebook, Instagram, and Twitter to Coming Face-to-Face with Artificial Intelligence
Peter Sarlin, founder and CEO of Helsinki-based Silo AI, one of Europe’s largest independent artificial-intelligence labs, worries that in the age of ChatGPT, regional social nuances across Europe will start to disappear. Because large language models derive their information predominantly from North American data, he says, preserving an understanding of what normal conversation looks like in each region has become a priority.
Europeans are still drawn to the power of American tech. Generation after generation of technology has been dominated by big US companies, whose products have become embedded in Europe’s social and economic infrastructure. Many European businesses run on Microsoft Office and Amazon Web Services, and reach customers through Apple’s and Google’s app stores. European politics happens on WhatsApp, and its news media circulates on Facebook, Instagram, and Twitter. US tech companies operate on a different scale: only two of the 10 most valuable public European corporations are in tech — German business-software provider SAP and Dutch semiconductor-equipment maker ASML — whereas six of the world’s 10 most valuable public corporations are US tech companies. Microsoft and Nvidia are each worth more than 15 times the average company’s value.
American dominance of AI models, together with the fact that critical digital infrastructure is controlled by private foreign companies, is driving the European concept of “AI sovereignty”. Europe is investing heavily in supercomputers and AI research to try to catch up with the US and to create domestic champions. But it starts from a long way behind: the continent lags the US and China in the availability of both capital and computing power, and it lacks big homegrown tech companies—the Microsofts, Googles, and Metas—which are vital conduits linking AI products to users.
Raluca Csernatoni, a researcher at Carnegie Europe, questions how meaningful European AI sovereignty can be in the absence of such a homegrown champion.