Researchers at the A.I. company Anthropic claim to have found clues about the inner workings of large language models, possibly helping to prevent their misuse and to curb their potential threats.
One of the weirder, more unnerving things about today’s leading artificial intelligence systems is that nobody — not even the people who build them — really knows how the systems work.
That’s because large language models, the type of A.I. systems that power ChatGPT and other popular chatbots, are not programmed line by line by human engineers, as conventional computer programs are.
Instead, these systems essentially learn on their own, by ingesting vast amounts of data and identifying patterns and relationships in language, then using that knowledge to predict the next words in a sequence.
One consequence of building A.I. systems this way is that it’s difficult to reverse-engineer them or to fix problems by identifying specific bugs in the code. Right now, if a user types “Which American city has the best food?” and a chatbot responds with “Tokyo,” there’s no real way of understanding why the model made that error, or why the next person who asks may receive a different answer.
And when large language models do misbehave or go off the rails, nobody can really explain why. (I encountered this problem last year when a Bing chatbot acted in an unhinged way during an interaction with me. Not even top executives at Microsoft could tell me with any certainty what had gone wrong.)
The inscrutability of large language models is not just an annoyance but a major reason some researchers fear that powerful A.I. systems could eventually become a threat to humanity.