What Are the Stochastic Parrots Really Parroting?
This will be my last essay about the chatbot plague for a while, as I’m off to Western Australia to witness the April 20 total solar eclipse near Exmouth.
As the popularity of ChatGPT, Bing¹ and—unusually late to the game—Google’s Bard increases meteorically, their critics—myself included—point out that there’s nothing there (and that the dangers lie elsewhere).
Neural networks are, in essence, an effort to copy the plasticity of the human brain. And while the plasticity of the brain might be mimicked—nobody knows for sure—other qualities of the brain are not copied, such as agency, the capacity to recognise and create a narrative (the power of story) and—last but certainly not least—sentience (let alone consciousness). Of course, it’s not easy—maybe impossible—to simply ‘add’ those qualities, and it’s even more unrealistic to expect them to arise by themselves. That’s not going to happen with the current generation of LLM (Large Language Model) chatbots.
What they will do is parrot the information they’ve been fed, without knowing how to sift the false from the true, and deciding relevance by the number of mentions rather than their truth or quality. So where do ChatGPT, Bing and Google’s Bard get their information?
ChatGPT (via Scribbr):
ChatGPT is an AI language model that was trained on a large body of text from a variety of sources (e.g., Wikipedia, books, news articles, scientific journals). The dataset only went up to 2021, meaning that it lacks information on more recent events.
It’s also important to understand that ChatGPT doesn’t access a database of facts to answer your questions. Instead, its responses are based on patterns that it saw in the training data.
So ChatGPT is not always trustworthy. It can usually answer general knowledge questions accurately, but it can easily give misleading answers on more specialist topics.
Another consequence of this way of generating responses is that ChatGPT usually can’t cite its sources accurately. It doesn’t really know what source it’s basing any specific claim on. It’s best to check any information you get from it against a credible source.
TL;DR: ChatGPT uses anything it’s fed, irrespective of veracity. It doesn’t check whether a source is credible (let alone remember which source it’s citing at all). Keep in mind that finding a credible source is a complex and time-consuming process that basically requires agency; that is, an inquiring mind intent on finding the most credible source possible. ChatGPT, Bing and Google’s Bard do not have agency, so they are not inclined to search for credible sources, and therefore never will. They may stumble upon them by accident, not on purpose. Even worse, they may fake a credible source.
Therefore, as quoted above, it can—and will—give misleading answers; that is, it lies. But the term ‘lie’ sounds bad, so ChatGPT’s and Bing’s programmers call it ‘hallucinating’: the chatbot simply parrots parts of its training data without caring whether they are false or true, because it doesn’t know the difference between the two. Hence the euphemism ‘hallucinate’: the poor chatbots only reflect what they encounter, and it’s not their problem if the training data contains obvious falsehoods.
This ‘hallucinating’ could have been prevented, I suspect, by carefully curating the training data. But again, like checking whether a source is credible, this is a complex and time-consuming process. So of course OpenAI—which developed ChatGPT and whose models power Bing Chat—didn’t do this, but preferred to get its large language model to market as quickly as possible and deal with the fallout later. That’s a commercial decision, not an ethical one (and the twain rarely meet).
A large language model (LLM) consists of a neural network with many parameters (typically a billion weights or more), trained on large quantities of unlabelled text using self-supervised learning (according to Wikipedia, which I’ll use as the credible source for this). The next sentence in that entry is telling: “This has shifted the focus of natural language processing research away from the previous paradigm of training specialised supervised models for specific tasks.”
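To make ‘self-supervised learning on unlabelled text’ a little more concrete, here is a deliberately tiny sketch in Python. It is not how an actual LLM is built (those are transformer networks with billions of weights, and the corpus below is invented for illustration), but it shows the two points that matter: the ‘labels’ are just the next words of the raw text itself, so no human annotation or fact-checking is involved, and the model’s notion of relevance is nothing more than frequency in the training data.

```python
from collections import Counter, defaultdict

# Toy illustration only: a real LLM is a transformer with billions of
# weights, but the core idea of self-supervised learning is the same --
# the "label" for each position is simply the next token of the text itself.

corpus = (
    "the moon orbits the earth . "
    "the earth orbits the sun . "
    "the earth is flat . "          # falsehoods in the training data
    "the earth orbits the sun . "   # are ingested just like facts
)

tokens = corpus.split()

# Self-supervision: every (current token -> next token) pair is a training
# example, extracted from the raw text with no human labelling involved.
next_token_counts: dict[str, Counter] = defaultdict(Counter)
for current, nxt in zip(tokens, tokens[1:]):
    next_token_counts[current][nxt] += 1

def generate(prompt: str, length: int = 6) -> str:
    """Greedily continue a prompt with the most frequent next token."""
    out = prompt.split()
    for _ in range(length):
        candidates = next_token_counts.get(out[-1])
        if not candidates:
            break
        # "Relevance" is decided by frequency in the training data,
        # not by whether the continuation is true.
        out.append(candidates.most_common(1)[0][0])
    return " ".join(out)

print(generate("the earth"))
```

Scale that counting table up by many orders of magnitude and swap it for a neural network, and you have the gist of what these chatbots do: continue a prompt in the statistically most plausible way, true or not.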
TL;DR: it’s easier to feed chatbots tons of unfiltered data than to bring them up on a carefully selected diet.
Longer version: it’s quicker to train the next generation of LLMs on a huge amount of unlabelled data (emphasis mine) than to build one up gradually from scratch on carefully curated data. The latter would, most probably, eventually deliver a much better LLM (and Google’s reluctance to enter the fray, delaying Bard and then only gradually opening it up for public consumption, speaks volumes in this matter). But in this capitalistic world, first to market often matters more than best to market, and ‘damn the torpedoes’.
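By way of contrast, here is a minimal, purely hypothetical sketch of what even the crudest curation step might look like (the allowlist, the source labels and the document records are all invented for illustration). The point is not the code, which is trivial, but that every single document needs some provenance check before it goes into the training pot; done properly, at web scale and by qualified humans, that is exactly the slow and expensive work that gets skipped when raw text is shovelled in wholesale.

```python
# Hypothetical example: filter a raw text dump down to documents whose
# provenance can be checked against an allowlist of sources considered
# credible. The source names and records below are invented.

CREDIBLE_SOURCES = {"peer_reviewed_journal", "encyclopedia", "public_archive"}

raw_documents = [
    {"source": "peer_reviewed_journal", "text": "The Earth orbits the Sun."},
    {"source": "anonymous_forum",       "text": "Climate change is a hoax."},
    {"source": "unknown",               "text": "Miracle cure suppressed!"},
]

def curate(documents):
    """Keep only documents whose source is on the allowlist."""
    for doc in documents:
        if doc["source"] in CREDIBLE_SOURCES:
            yield doc

curated = list(curate(raw_documents))
print(f"kept {len(curated)} of {len(raw_documents)} documents")
```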
Make no mistake, the torpedoes have already hit both ChatGPT and Bing Chat. Nevertheless, despite these very obvious flaws, most people love these chatbots, meaning they’re here to stay (hopefully only until an improved version takes over).
In the meantime, these chatbots’ responses depend hugely on their training data, and on the correction of their responses by their AI trainers, although the latter will have a hard time keeping up, considering how much data the training set contains. If ChatGPT’s training set is bigger than the 570 GB mentioned in the footnote, then it contains—going by an estimate on Quora of about 179 million words per GB—more than 102 billion words. As per Wikipedia, about 375 people work at OpenAI, so if I optimistically assume that 100 of them are AI trainers, then each of them has to keep up with at least a billion words (at an average of 100,000 words per novel, that’s ten thousand novels of input).
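Spelled out, that back-of-the-envelope estimate looks like this (the 100-trainer figure is, again, my own optimistic assumption, not an official number):

```python
# Back-of-the-envelope estimate using the figures quoted above.
training_set_gb = 570            # reported size of ChatGPT's training set
words_per_gb = 179_000_000       # Quora estimate
trainers = 100                   # optimistic assumption, not an official figure
words_per_novel = 100_000        # average novel length

total_words = training_set_gb * words_per_gb              # ~102 billion words
words_per_trainer = total_words / trainers                # over a billion each
novels_per_trainer = words_per_trainer / words_per_novel  # ~10,000 novels each

print(f"{total_words:,} words in total")
print(f"{words_per_trainer:,.0f} words per trainer")
print(f"{novels_per_trainer:,.0f} novel-equivalents per trainer")
```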
Therefore, it’s fair to assume that the number of AI trainers is unable to keep up with the huge amount of training data (which will only increase over time), meaning the vast majority of the chatbots’ responses will be unsupervised. It’s also safe to assume that the training data itself is not curated, but comes—vaguely—from ‘a variety of sources’. So is that training data representative of humanity’s knowledge in general? Again, it’s safe to assume that it’s not, because the credibility of its sources has not been checked. We already see this in many of the chatbots’ responses, which contain obvious falsehoods and misrepresentations (a flaw even openly admitted by their creators).
In this light, I strongly suspect that today’s LLMs do not reflect the ‘average’ or median knowledge of humanity at large, but merely the median knowledge of the training data, which will necessarily be—because it’s not curated for objectivity—heavily biased. Even if the training data contains—amongst other sources not mentioned—“Wikipedia, books, news articles, scientific journals”.
For one, keep in mind that most scientific journal publications are behind paywalls (research costs money, which somehow needs to be recovered). And if anybody reading this thinks that the majority of news articles are unbiased, then I have a bridge to nowhere to sell to you.
Here’s the problem: early idealists hoped that making information free on the internet would help educate the masses. Unfortunately, as social media—where what we see is determined by the providers’ algorithms, aimed at maximising engagement—increasingly demonstrate, the masses at large do not care to be educated. They predominantly use the internet to find like-minded people who believe the same things they do, and if many of those beliefs are conspiracy theories (to give one example), then so be it. I think it’s fair to say that the internet has helped more people to maintain their beliefs—ill-founded or not—than to question them, and that the internet (unintentionally, as it is just a tool) has prevented more people from learning, by reinforcing their beliefs through connecting them to like-minded people, than it has helped people to learn.
Then, as right-wing propaganda has become much less restrained (especially after the election of people like Trump, Johnson, Bolsonaro and Orbán, to name some of the usual suspects), the amount of misinformation on the internet has exploded. Not to mention the politically motivated propaganda and sheer misinformation spread through bots, especially during elections. If the training data for the chatbots has not been curated, then this misinformation is in there as well.
TL;DR: the chatbots’ training data is heavily biased, which shows up in their responses.
The old computing adage ‘garbage in = garbage out’ still applies: lies in = lies out. Hence the chatbots often display racist, misogynistic and ableist behaviour. Once people point that out, the people handling the chatbots—be they ‘programmers’, ‘trainers’, whatever—try to correct them, patching the bias after the fact. Wouldn’t it be more efficient to remove the bias on the input side?
But this is internet capitalism 2.0; that is, we throw anything at the wall to see what sticks and damn the torpedoes, because we must be first to market. So expect deepfakes that are indistinguishable from real pictures, even more—and ever more disingenuous—bots on social media (and other places) than there already are, AI art that gives mere plagiarism a bad name, and AI-generated fiction that makes derivative drivel look like a masterpiece.
Which makes me wonder: is there already a chatbot on Substack? If not, when will the first one—even if curated by a human—appear? I sincerely hope checks and balances are in place, and I think the Substack community at large can be instrumental in this (report chatbot-like posts if you see them). Let’s keep this a humans-only place on the internet.
Author’s note: my thanks to the recent new subscribers. I’m not sure whether emailing this “thank you” directly would be appreciated, so I’ll say it here!
¹ More accurately, Bing Chat (even more precisely: “Bing Conversational Experiences”), as Bing is also still a search engine run by Microsoft.