Do we need a new Turing Test?

Beating the imitation game matters much less than it seems


Since late last year, ChatGPT’s surprising conversational ability has captured the world’s imagination and spurred the question “Can machines really think?” in many bar-room conversations, quickly followed by “How would you test it?”

This testing question is fascinating and deep, but it likely says more about our innate human need to understand our place in the world than it does about AI. Any system we call “intelligent” can be viewed in two ways: as a thinking entity to be interacted with, or as a functional system that performs tasks.

The first view leads to anthropomorphization of the system (treating it as analogous to a human or ascribing human behavior to it); the second to a somewhat dismissive sense that the system is “a mere tool” (or, in the derogatory term of the moment for LLMs, a Stochastic Parrot). (*1)

As a society, we will likely see increasingly intelligent systems over the coming years and decades, getting closer and closer to a system that “appears” intelligent. A test will end up being something that, at best, assigns a “level” of intelligence or disqualifies a system from being called “thinking” on a technicality.

In the near term, it is likely going to be far more helpful to focus not on intelligence but on trust. Specifically, trust that the system will be able to perform the function it was designed for.

The fact that bots are now markedly better than humans at solving “Are you a robot?” web CAPTCHAs should definitely give us pause for thought!

Let’s dive in…


The Turing Test: Old, New, or Defunct?

In a recent Wired article, Ben Ash Blum asks whether we need a new Turing Test. Although he does not directly answer his own question, he makes the point that the real challenge may be how humans relate to such systems, not how intelligent they are.

In 2004, I had a strangely prescient moment and wrote a paper [Willmott04] arguing that the Turing Test wasn’t likely to be that useful to us. (The conference website is sadly defunct, but the technical report remains online.) Specifically, I argued that a Turing Test-like process to determine whether a specific system was intelligent was a pointless exercise, and that it would be more important to learn how to live in a world where arbitrarily many of the systems we interact with are automated. Today we certainly live in such a world, and it seems increasingly unlikely that one will always know which systems respond through a human and which through an AI system.

The Original Turing Test

The original Turing Test was a thought experiment first described in Alan Turing’s 1950 paper “Computing Machinery and Intelligence” [Turing50], which Turing named the “Imitation Game.” With a few modifications, it is broadly seen as a useful test of machine intelligence.

Turing’s imitation game went like this:

  • Turing first proposes a party game “in which a man and a woman go into separate rooms, and guests try to tell them apart by writing a series of questions and reading the typewritten answers sent back. In this game, both the man and the woman aim to convince the guests that they are the other.”
  • The AI version of the game continues: “We now ask the question, ‘What will happen when a machine takes the part of A in this game?’ Will the interrogator decide wrongly as often when the game is played like this as he does when the game is played between a man and a woman?”

There are variations of the game, but this setup covers the crux: “Could a machine fool a human interrogator into thinking it was human if all that could be used for communication were typewritten notes?”

We are, therefore, talking about the ability of a system to imitate human behavior, and about whether a credible set of observers (the interrogators) can detect the imitation.
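
To make the setup concrete, here is a minimal sketch of the imitation game as a protocol, in Python. Everything in it (the Responder interface, run_imitation_game, the judge callable) is an illustrative assumption rather than anything from Turing’s paper; the only structural point it captures is that the judge sees nothing but anonymised, typewritten answers.

```python
# A toy rendering of the imitation game; all names here are illustrative assumptions.
import random
from typing import Callable, Protocol


class Responder(Protocol):
    def reply(self, question: str) -> str:
        """Return a typewritten answer to a question."""
        ...


Transcript = dict[str, list[tuple[str, str]]]


def run_imitation_game(human: Responder, machine: Responder,
                       questions: list[str],
                       judge: Callable[[Transcript], str]) -> bool:
    """Return True if the judge fails to identify the machine."""
    # Assign the anonymous labels A and B at random, so the judge only
    # ever sees typewritten answers, never who produced them.
    contestants = [human, machine]
    random.shuffle(contestants)
    players = dict(zip("AB", contestants))

    transcripts: Transcript = {
        label: [(q, player.reply(q)) for q in questions]
        for label, player in players.items()
    }

    # The judge returns the label ("A" or "B") they believe is the machine.
    guess = judge(transcripts)
    machine_label = next(l for l, p in players.items() if p is machine)
    return guess != machine_label
```

Nothing in this sketch constrains how the machine produces its answers; that indifference to the internals is exactly what the philosophical debate below turns on.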

Philosophical Debate

The Turing Test sparked a great deal of scientific debate. In particular, there are still strongly opposing camps on whether or not it, or something similar, can be used as a test of intelligence.

The crux of the argument “against” is that it would be possible to construct a symbol-manipulation system that responded to the queries purely on a syntactic basis, with no semantic understanding of the meaning of the messages (John Searle and many others).

The crux of the argument “for” is that if a system convincingly simulates intelligence in every instance, it must be intelligent. In other words, imitation = being.

Both of these hold some attraction in terms of logic, but it seems unlikely we’ll find a way to “prove” one or the other is correct, since they depend on whether some (possibly non-physical) property of the system exists (or is necessary). Specifically, whether some kind of “consciousness” is required that is elevated above pure syntactic knowledge.

LLMs as Searle’s Chinese Room

In many ways, today's LLMs look very much like an implementation of Searle’s famous “Chinese Room”: a system trained on human knowledge that responds to queries by mechanically processing the input using probabilities.
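
To illustrate what “mechanical processing using probabilities” means, here is a deliberately tiny sketch of next-token sampling. The probability table is invented purely for illustration; a real LLM derives an equivalent (vastly larger) conditional distribution from weights learned over huge text corpora, but the generation loop is similarly syntactic: pick the next token from a distribution conditioned on the tokens so far, with no step at which “meaning” is consulted.

```python
# A toy next-token sampler: pure symbol manipulation over probabilities.
# The table below is invented for illustration only.
import random

NEXT_TOKEN_PROBS = {
    ("the", "cat"): {"sat": 0.6, "slept": 0.3, "spoke": 0.1},
    ("cat", "sat"): {"on": 0.8, "quietly": 0.2},
    ("sat", "on"):  {"the": 0.9, "a": 0.1},
    ("on", "the"):  {"mat": 0.7, "sofa": 0.3},
}


def generate(prompt: list[str], max_new_tokens: int = 4) -> list[str]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        context = tuple(tokens[-2:])      # condition on the last two tokens
        dist = NEXT_TOKEN_PROBS.get(context)
        if dist is None:                  # no known continuation: stop
            break
        # Sample the next token from the conditional distribution.
        next_token = random.choices(list(dist), weights=list(dist.values()))[0]
        tokens.append(next_token)
    return tokens


print(" ".join(generate(["the", "cat"])))  # e.g. "the cat sat on the mat"
```

The point is not that this toy resembles an LLM in capability, only that the generation step is a rule-following, probabilistic lookup: exactly the kind of manipulation Searle’s room operator performs.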

LLMs now perform at such a level that they could certainly persuade some human observers, for extended periods, that they are dealing with a real human. The narrower the domain of discourse, the easier this would be.

As such, we are very close to being able to say that machines can pass versions of the Turing Test. That should already give us a clue that it’s likely to be of low practical value in the long term. (To be clear, Turing already assumed this would be the case!)

Trust Holds More Weight than Intelligence

At some point in the future, humanity will likely be faced with the question of whether “AI systems have rights,” and these questions will be tied up both with intelligence and the even more squirrelly concepts of sentience and consciousness.

For now, though, the more pressing issue with any system is “Do we trust it to function as intended?”

Intelligence (or “being able to think”) is a measure of capability. Trust, however, is a measure of both capability and intent. This is key when trusting systems to operate on our behalf, even for simple tasks such as taking notes.

Trust is also generally bounded to a certain domain of action. One might trust a person to park a car but not to plan a budget. For someone else, the reverse may be true.

Trust in a system comes from multiple sources, including:

  • What is the system capable of?
  • How has it been tested?
  • Are there applicable certifications?
  • Who operates the system?
  • What are its terms of use?
  • Who is responsible if something goes wrong?
  • In the case of failure, can we recover, or was permanent harm done?

The key point is that these are precisely the same questions one would ask about a service provided by a human or a human organization. Much of human society and commerce is underpinned by mechanisms to establish such trust.

It seems highly likely that we’ll accept AI systems as “trusted” entities long before we figure out if we can universally agree to call them “intelligent.”

In the long run, there will be at least three questions to ask (and tests to devise that go with them):

  • Intelligence: What are the capabilities of systems? Measured on some kind of scale up to human intelligence and beyond.
  • Sentience: Can the system be considered sentient? A different bar that will be much fought over (especially since sentient entities should probably have certain rights).
  • Trust: A measure of the confidence we have that a system will perform the function it promises in a given context.

The basis of that trust will depend more heavily on the environment, who the system represents, whether someone owns the system, what binding statements it makes, legal knowledge, and other context than it will on the precise technical implementation of the system.

In the meantime, it’s awe-inspiring that Alan Turing had the foresight to ask such deep questions 70 years before the technology was even close to achieving what he envisaged!

References

  • [Turing50] A. M. Turing, “Computing Machinery and Intelligence,” Mind, vol. 59, no. 236, pp. 433–460, 1950.
  • [Willmott04] Willmott (2004). Technical report referenced above; the original conference website is defunct, but the report remains available online.

Notes

  • (*1) I’m not a fan of using pseudo-offensive terms like Stochastic Parrot for LLMs or any AI system, for two reasons: 1) the term is meant to point out the “non-thinking” nature of the system, yet it simultaneously invokes the idea of an animal, which is alive; 2) given that the point is to argue the model is not smart, it might also offend parrots.