Testing ChatGPT in Mathematics — Can ChatGPT do maths?
ChatGPT, “a chatbot powered by a state-of-the-art language model trained by OpenAI. It is designed to assist users in having natural, human-like conversations on a wide range of topics. Because it is powered by a powerful language model, it can understand and respond to many different types of questions and statements, and can help users with tasks such as answering questions, providing information, and even engaging in small talk.”
This was the definition generated by the chatbot itself. So, with this understanding, and with Artificial Intelligence being a key theme in many, if not all, industries, this chatbot has to be tested on what it can and cannot do.
Before I proceed, one needs to understand that this is a Large Language Model: it was fed a very large set of text data and, using Natural Language Processing and Deep Learning, a bot was built to answer queries based on its 'learning' of that training data.
It could have remained an academic project, a large investment in research to understand Artificial Intelligence, but in the words of Sam Altman of OpenAI, "soon you will be able to have helpful assistants that talk to you, answer questions, and give advice." This is why the public beta has to be tested: for its ability to replace the existing variety of chatbots in many companies, its understanding of conversations, the accuracy of its responses and, more importantly, its limitations.
Testing an NLP model is not a simple task in itself, and testing one like OpenAI's, arguably the largest, is much more complicated. Without getting into the methodology of how to approach testing Artificial Intelligence, this blog will study one aspect of ChatGPT that it needs to master for any meaningful commercial implementation.
Mathematics and GPT:
One has to start with a caveat that ChatGPT can do only simple mathematics. However, it claims that it can support probability and statistics. Hence, I didn't trouble it with calculus, complex linear algebra and other advanced mathematics.
Instead, I restricted testing to ChatGPT's understanding of some of the basic building blocks of mathematics, plus some numerical problems that are more logical and only slightly mathematical.
Why is this important?
OpenAI has used Reinforcement Learning from Human Feedback to build ChatGPT. In simpler terms, this means that the model has been trained with the help of people who ‘taught’ the model desired outputs and based on this, the model has learnt.
- Unlike prose or language, maths is objective. While an approach can be creative, it should always be logical. By reading and 'learning' a large volume of poetry, one can come up with a variety of poems, but that falls under the realm of creativity with no scope for objective evaluation. However, 'learning' maths is different. One needs to apply what is learnt to solving different sets of problems. This is where evaluation of the model's learning becomes easier, because most of mathematics is objective. Choosing mathematics as the test is basically to test ChatGPT's ability to 'learn' rather than what it has learnt.
- In commercial applications, discussions with a chatbot will involve discussions about money — what one is owed, what one has to pay, options, frequencies, penalties, paybacks, refunds and other types of transactions that involve money. In other words, mathematical computations. So, a chatbot should be more intelligent than one that just reads from FAQs and is trained on a few sets of questions.
- Feedback: ChatGPT is an exponential achievement. However, compared to 'intelligence' as we humans understand it, it is quite nascent. Personally, I wouldn't use the word 'intelligence' to describe any current implementation of AI. Only constructive and objective feedback on what we expect from a chatbot can help build the next iteration of ChatGPT, which again will be an exponential improvement.
I started with GPT’s understanding of numbers.
Immediately, GPT failed to answer the basics of numbers. It thought natural numbers could be less than zero and gave an example. But natural numbers run from 1 to infinity and do not contain negative numbers.
How about Prime numbers? GPT thought that 0 is a prime number.
But it knew the list of prime numbers and did basic division correctly. The problem continued with its understanding of 'whole numbers.' Again, it failed on a fundamental building block of mathematics.
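For reference, these definitions are easy to encode. Below is a minimal Python sketch of my own (not ChatGPT's output) that captures the conventions used in this post: natural numbers start at 1, whole numbers start at 0, and 0 is not prime.

```python
def is_natural(n: int) -> bool:
    # Natural numbers as used here: 1, 2, 3, ... (no zero, no negatives).
    return isinstance(n, int) and n >= 1

def is_whole(n: int) -> bool:
    # Whole numbers: 0, 1, 2, 3, ...
    return isinstance(n, int) and n >= 0

def is_prime(n: int) -> bool:
    # A prime has exactly two distinct divisors, 1 and itself,
    # so 0 and 1 are not prime.
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

print(is_natural(-3))  # False: natural numbers are never negative
print(is_prime(0))     # False: 0 is not a prime number
print(is_whole(0))     # True: 0 is the smallest whole number
```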
I didn't test GPT on fractions, as I knew that GPT has not 'learnt' mathematics the way it is taught to children.
There appears to be a big problem with GPT and decimals. When I mixed decimals with addition, GPT failed most of the time. It did correctly add one number with a decimal part and one without. On other occasions, it was a toss of a coin, mostly skewed towards failure.
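The exact sums I used are in the screenshots; the ones below are placeholders of my own, but they show how trivially conventional code handles mixed decimal addition, with Python's decimal module even avoiding the usual floating-point rounding quirk:

```python
from decimal import Decimal

# Placeholder sums of the kind tested: one operand with a decimal part, or both.
print(7 + 2.5)                          # 9.5
print(0.1 + 0.2)                        # 0.30000000000000004 (float rounding)
print(Decimal("0.1") + Decimal("0.2"))  # 0.3 (exact decimal arithmetic)
```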
How about comparison of numbers? Can GPT place a number as smaller, larger or equal? As it stands, it failed in certain scenarios. It got most things right, but when one scales numbers by multiples of 100, 1,000, 10,000 and so on, it failed. Is it because it fails to understand the decimal system?
So, extrapolating from this, I tried to find out whether GPT had an issue understanding powers of 10. It did. It got the numbers right but the English words for them wrong.
As you can see, it gave different answers to a straightforward and objective question.
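The exact prompts are in the screenshots above; as a stand-in, here is how trivially ordinary code settles both the comparisons and the English names for powers of ten:

```python
# Stand-in comparisons of the kind tested: the same digits scaled by powers of ten.
print(4500 < 45000)            # True
print(3.2 * 10 ** 4 == 32000)  # True

# English names for powers of ten, which is where ChatGPT slipped.
names = {2: "hundred", 3: "thousand", 6: "million", 9: "billion", 12: "trillion"}
for power, name in names.items():
    print(f"10^{power} = {10 ** power:,} = one {name}")
```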
This is a natural language processor, and language should be its forte. How about a simple maths problem described in words? Here too, GPT failed. While it got most of them right, it failed a few tests. It is either an issue with how it understands the problem or with how it computes the solution. We don't know, yet.
The next test was multiplication. ChatGPT did get simple multiplication of whole numbers right, but when decimals were mixed in, it failed. The other area where it failed was when I mixed multiple operators. At first it got the answer wrong, but when I forced it to do something different, it got the answer right, along with a PEMDAS explanation.
Wrong, silly and impressive all at the same time.
But as you can see, if you type the same thing in Python, you get the right answer. And when I forced ChatGPT not to use Python, it got the answer right. Has ChatGPT 'learnt' Python wrong, or is it using a different method to compute? We wouldn't know.
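The expression from my prompt is in the screenshot; as a stand-in, this is the kind of precedence Python applies when evaluating a mixed-operator expression:

```python
# A stand-in mixed-operator expression; Python applies standard precedence
# (PEMDAS): multiplication and division before addition and subtraction.
print(3 + 4 * 2 - 6 / 3)    # 9.0  -> 4*2 and 6/3 are evaluated first
print((3 + 4) * 2 - 6 / 3)  # 12.0 -> parentheses change the order
```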
The other thing is that ChatGPT doesn't know what it doesn't know. I tried asking it to find the next number in a sequence. It declined, saying that it was a language model and wouldn't be able to do sequences. That is fair enough.
But when asked the same question with different sets of numbers, it jumped in to give the answer along with the logic. It is impressive that it gave the logic, but also a failure that it didn't know whether it could do sequencing or not.
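The sequences I used are in the screenshots; as an illustration of how mechanical this task is for simple patterns, a few lines of Python of my own can guess the next term of an arithmetic or geometric sequence:

```python
def next_term(seq):
    # A minimal guess at the next term for two simple patterns:
    # constant difference (arithmetic) or constant ratio (geometric).
    diffs = [b - a for a, b in zip(seq, seq[1:])]
    if len(set(diffs)) == 1:
        return seq[-1] + diffs[0]          # e.g. 2, 5, 8, 11 -> 14
    if all(seq):
        ratios = [b / a for a, b in zip(seq, seq[1:])]
        if len(set(ratios)) == 1:
            return seq[-1] * ratios[0]     # e.g. 3, 6, 12, 24 -> 48
    return None                            # pattern not recognised

print(next_term([2, 5, 8, 11]))   # 14
print(next_term([3, 6, 12, 24]))  # 48.0
```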
Word Problems and ChatGPT:
Since it is a language model, I tried asking simple maths questions in words. For a large language model, there appears to be an issue of comprehension. In the first example, when I asked about an 'Apple Phone,' it thought that the statement was ambiguous, as 'apple' could be a phone or a fruit. Had I not specified 'phone,' that would have been quite impressive.
This test was meant to trick the language model by mixing two completely different items, and the result was mixed. It was able to identify that 'apple' could mean the fruit or the company, but it failed when 'Apple' was followed by 'Phone.' It should have narrowed it down to the company.
Then, I made it unambiguous. I wanted it to compare iPods and iPads. It did know a lot about the characteristics of the iPad and the general market, but it failed to understand the context of the question.
Then, I was specific.
Then, I narrowed it down.
It should be clear that the model couldn’t compute the answer.
How about another example in words? Again, it failed to solve the problem, but it did understand the problem correctly.
Round peg in a Square hole
Another interesting problem I gave involved a variation of the round peg and square hole. Asked in different ways, it failed to answer correctly. But finally, I was able to understand what was going wrong. It was calculating absolute area if you gave it a circle and a square, and absolute volume if you gave it a cylinder and a cube. So, it compared overall area to area, or volume to volume, and answered whether one could fit into the other.
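The correct check is dimensional, not area- or volume-based: a round peg fits into a square hole only if its diameter is no larger than the side of the square. Here is a minimal sketch of my own contrasting the two approaches (the dimensions are placeholders, not the ones from my prompts):

```python
import math

def fits_by_dimension(radius, side):
    # Correct check: the peg fits only if its diameter does not exceed the side.
    return 2 * radius <= side

def fits_by_area(radius, side):
    # ChatGPT's apparent approach: compare total areas, which can mislead.
    return math.pi * radius ** 2 <= side ** 2

r, s = 3, 5.5
print(fits_by_dimension(r, s))  # False: diameter 6 > side 5.5, it will not fit
print(fits_by_area(r, s))       # True:  area 28.27 < 30.25, a misleading "yes"
```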
Exploring advanced Maths:
Just because the model told me that it couldn't do advanced maths didn't stop it from attempting it and standing its ground on its answers. Take a look at the quadratic equation that it got wrong.
The issue here is not just about getting something right or wrong but how emphatic it was when questioned about the solution again.
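The quadratic formula itself is entirely mechanical, which makes the confident wrong answer all the more striking. A minimal solver, with placeholder coefficients rather than the equation from my prompt:

```python
import cmath

def solve_quadratic(a, b, c):
    # Roots of ax^2 + bx + c = 0 via the quadratic formula.
    disc = cmath.sqrt(b ** 2 - 4 * a * c)
    return (-b + disc) / (2 * a), (-b - disc) / (2 * a)

# Placeholder equation: x^2 - 5x + 6 = 0 has roots 3 and 2.
print(solve_quadratic(1, -5, 6))  # ((3+0j), (2+0j))
```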
There also appeared to be a problem with basic multiplication/addition combinations.
It got the logic of matrix multiplication right, along with the steps on how to do it. That was impressive. But it got the actual calculation wrong.
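For reference, the row-by-column procedure that ChatGPT described correctly but executed wrongly looks like this in a few lines of Python (the matrices are placeholders, not the ones from my prompt):

```python
def matmul(A, B):
    # Row-by-column matrix multiplication: entry (i, j) is the dot product
    # of row i of A with column j of B.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```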
Comprehension:
I asked ChatGPT a simple question about an insurance policy that would expire five months from now. It didn't understand my statement correctly; it took it to mean that my policy had already expired and was not active.
Conclusion:
In commercial applications, a chatbot is expected to answer the most common questions and also act as if a human were interacting with the customer. For a language model, it is all the more important for us to know whether it can understand the customer correctly. It was precisely there that ChatGPT failed, and at random. Be it word problems or words with some logic, it failed with varying degrees of severity and frequency, and the reasons remain unknown.
In OpenAI’s webpage, they have clearly mentioned what this ChatGPT can do. In their own words, “The dialogue format makes it possible for ChatGPT to answer followup questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests.”
In some of the examples above, it failed to admit its mistakes, challenged correct premises and processed requests that it couldn't, or wasn't supposed to, handle.
The basis of my tests was maths, for the reasons I stated earlier. On the examples I tested ChatGPT with, it did fail at basic mathematics. But it is also important to appreciate that it has learned from examples to do certain arithmetic operations. The limitations, though, are plenty, starting with its lack of understanding of foundational number types.
To call ChatGPT just pattern matching is a disservice to the amount of work that has gone into building an impressive chatbot. But, at its core, the model learnt from the examples it was fed, together with human intervention and reinforcement. Obviously, real life is more complex, much more than the examples, and one cannot just apply the same logic from the examples.
It is also a conundrum that Artificial Intelligence is bad at what modern computers are good at: mathematics. The difference lies in how they are built. Modern computers are instructed to do things. They are given rules, algorithms and steps, and are tightly bound by what they can do. AI, on the other hand, is supposed to 'learn' things.
Learning is an abstract concept. We still haven’t understood how we ‘learn’ things. We can feed more data and more examples and assume that the model, however sophisticated, synthesises logic from examples. Learning is also incremental. That is why we do not teach elementary kids partial differential equations first. We build towards it.
ChatGPT is an exponential change. It is also a non-starter in some of the foundational pillars of learning. It is this dichotomy that will drive research and also limit the commercial adoption.
We are a long way from making a machine 'intelligent' in the classic sense, but we should be able to narrow down what we want a specific machine to do and make it the best at that.