- Volume 1, Issue 1 2023
By Touseef Mehmood
10.20547/aibd.231103
Keywords: Large Language Models (LLMs), Hallucination, ROUGE Metric, BLEU Score
Large Language Models (LLMs) have been used extensively in recent years for natural language processing (NLP), but they have a major flaw: they often produce hallucinated results. Hallucination refers to a large language model generating coherent but non-factual text; the model produces consistent yet incorrect responses that often lack a clear source. This research examines the critical issue of hallucinations in LLMs, focusing on ChatGPT and Gemini. To explore this, we created a dataset of Roman Urdu questions and their human-generated answers, prompted ChatGPT and Gemini to answer these questions through their APIs, and compared the outputs to the reference answers using LAMA and MixTrail scores. Our evaluation found hallucinations in the models' responses, which contained incorrect and misleading information. To improve hallucination detection, we created a custom metric that assigns weights to ROUGE and BLEU scores using linear regression. The weights are learnt through machine learning applied to the dataset, with ROUGE and BLEU as features, yielding a weighted average score that combines the individual metrics. Results are reported for the combined metric, along with the hallucination issues observed in ChatGPT and Gemini. This research also helps clarify the limits of LLMs on Urdu text and offers suggestions for reducing hallucinations in these models.
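The abstract describes combining ROUGE and BLEU through weights learnt by linear regression. The following is a minimal sketch of that idea, not the authors' released code: the feature values and human-assigned targets shown here are hypothetical placeholders, and the exact labeling scheme used in the paper is an assumption.

```python
# Minimal sketch: learn a weighted combination of ROUGE and BLEU scores
# with linear regression (hypothetical data, not the paper's dataset).
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features: one row per model answer -> [ROUGE score, BLEU score]
X = np.array([[0.62, 0.41],
              [0.18, 0.09],
              [0.75, 0.55],
              [0.30, 0.12]])
# Hypothetical targets: human-judged similarity to the reference answer
y = np.array([0.70, 0.15, 0.85, 0.25])

reg = LinearRegression().fit(X, y)
w_rouge, w_bleu = reg.coef_
bias = reg.intercept_

def combined_score(rouge: float, bleu: float) -> float:
    """Weighted average of ROUGE and BLEU using the learned weights."""
    return w_rouge * rouge + w_bleu * bleu + bias

# Score a new answer whose ROUGE is 0.50 and BLEU is 0.35
print(combined_score(0.50, 0.35))
```

In this reading, a low combined score for a fluent answer would flag a likely hallucination, since the response diverges from the human reference despite being coherent.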
