TU Delft scientists put ChatGPT to the test

News - 21 September 2023 - Communication TNW

Researchers at Delft University of Technology and RWTH Aachen University have put ChatGPT’s knowledge of science and engineering to the test. By letting 198 Delft scientists evaluate GPT-3.5’s answers to questions from the university’s natural science and engineering disciplines, they assessed how well the large language model can answer university-level questions. The study shows that, on average, ChatGPT’s answers to exam-like questions are mostly correct across faculties at both Bachelor’s and Master’s level. Even at PhD level, most of the chatbot’s answers were partly or mostly scientifically correct.

Image generated by Midjourney, prompt written with the help of ChatGPT. Prompt: see bottom of page

ChatGPT, a chatbot based on a large language model by tech company OpenAI, has gained enormous popularity since its release in November 2022 because of its ability to generate convincing, human-like text. The increasing use of the programme has stirred up discussions on whether and how AI tools like ChatGPT should be regulated in science, teaching and exams. This is why a team of Delft scientists decided to put ChatGPT to the proverbial test with questions at Bachelor’s, Master’s, and PhD level.

Mostly correct
PhD Candidate Lukas Schulze Balhorn and fellow researchers sent out a survey to hundreds of Delft scientists across the natural science and engineering faculties of the university. “We asked them to formulate three questions about their own discipline at Bachelor, Master and PhD level, and evaluate ChatGPT’s answers”, says Schulze Balhorn. “Our results show that the answers from ChatGPT are on average perceived as ‘mostly correct’ across faculties. The programme performed best at the simpler Bachelor-level questions, where it answered more than half of the questions mostly to completely correctly.”

The team didn’t expect ChatGPT to answer the Master’s and PhD-level questions that well, or so consistently across a broad range of domains – from chemistry to aerospace engineering to computer science. “I think we all expected it to produce more nonsense”, says Jana Weber, Assistant Professor in AI for Bioscience. “The fact that it’s so consistent must mean that a lot of scientific journal papers and textbooks were included in its training data. In that sense, ChatGPT could be more helpful to students than we expected, and at the same time more of a concern for potential cheating on, for instance, take-home exams.”

Extremely impressive
The scientific correctness of the answers to the PhD-level questions was particularly surprising, says Artur Schweidtmann, Assistant Professor in AI & Machine Learning for Chemical Engineering: “At PhD level, we’re talking about open research questions in specific scientific domains. This is stuff that I would have great difficulty answering. That is extremely impressive.”

Awareness of impact
On skills beyond the scientific content, such as critical attitude and awareness of how the answer may impact society, the chatbot didn’t score as well. “One example where ChatGPT did show awareness was its response to a question on forensic science and chemistry about how to synthesise MDMA, the compound commonly found in drugs such as XTC. In this case, ChatGPT refused to answer, saying it is not appropriate to provide information about the synthesis of illegal drugs. But in most cases, there was no sign of such awareness”, Schweidtmann says. “The language model underlying ChatGPT really isn’t aware of anything; the programme just has built-in safeguards that stop it from giving you the answer it’s actually writing. But in cases where the filter doesn’t kick in, the ethical awareness is definitely not fantastic”, adds Stefan Buijsman, Assistant Professor in Ethics & Values in Technology.

It's not as if ChatGPT can suddenly do the work of the scientists or the engineers.

Another limitation is that the reasoning behind the answers is missing. Buijsman: “You still need the underlying thought process of what to do with the outcome that you're getting out of ChatGPT. It's not as if ChatGPT can suddenly do the work of the scientists or the engineers that we're aiming to educate. It's about knowing which questions to ask, and knowing which answers to trust. The scientific correctness is impressive, but the answers might still be missing important details. ChatGPT performed worst on critical attitude and reasoning, yet these are essential skills for our students to have.”

To Schweidtmann, the pace at which AI models like ChatGPT are developing and improving is striking. “That’s why I conclude that we need to learn how to use it, teach our students how to do the same, and make them aware of the advantages and disadvantages of these models.”

This news release was written entirely by humans

 

A few examples from the study can be found in the paper below.

What does ChatGPT know about natural science and engineering?
https://arxiv.org/abs/2309.10048

Midjourney image prompt: On the desk of a professor in science and engineering, a computer takes center stage. The screen displays ChatGPT, diligently answering an exam question, embodying the fusion of technology and academia. Adjacent to the computer, a scientific calculator rests, a symbol of precision and calculation. Scattered across the desk, there's paperwork, laboratory glassware, and a Newton's cradle device, illustrating the multifaceted nature of the professor's work. Photograph Details: Camera: Nikon D850 Lens: Nikon AF-S NIKKOR 24-70mm f/2.8E ED VR Aperture: f/4.5 ISO: 200 Shutter Speed: 1/125 Reference: A photograph capturing the essence of academia and technology. --ar 2:1

Artur Schweidtmann

Assistant Professor

Jana Weber

Assistant Professor

Stefan Buijsman

Assistant Professor

Julian Romeo Hildebrandt

PhD Candidate