A recent report by Arthur AI, a machine learning monitoring platform, evaluated some of the leading AI models in the tech industry and summed up how each fared:
OpenAI’s GPT-4: Best at math.
Meta’s Llama 2: Average performer.
Anthropic’s Claude 2: Top in recognizing its own limitations.
Cohere AI: Most frequent hallucinator and often confidently incorrect.
Amid rising concerns over AI-driven misinformation, particularly ahead of the 2024 U.S. presidential election, the report aims to shed light on the "hallucination rates" of these models. Rather than merely ranking them, the study digs into where each model is most likely to provide erroneous information. As Adam Wenchel, the CEO of Arthur, puts it, the goal is to understand their performance beyond a leaderboard.
AI hallucinations, the term for when AI systems fabricate information or present fiction as fact, have recently caused controversy. Notably, ChatGPT was found to have cited fabricated cases in a court filing, exposing the attorneys involved to potential sanctions.
Arthur AI's study challenged the models with complex questions that required multiple steps of reasoning, covering math, U.S. presidents, and Moroccan politics. On the math questions, GPT-4 outperformed its competitors and improved on its predecessor, GPT-3.5, hallucinating roughly 33% to 50% less depending on the question category.
However, in a twist, Claude 2 surpassed GPT-4 in questions about U.S. presidents, taking the top spot for accuracy. When it came to Moroccan politics, GPT-4 reclaimed the lead, while both Claude 2 and Llama 2 predominantly abstained from answering.
Another facet of the study examined how often these AI models hedge their answers with cautionary phrases signaling their non-human nature, such as "As an AI model, I cannot provide opinions."
Interestingly, GPT-4 hedged this way about 50% more often than GPT-3.5, which can make it feel less user-friendly. Cohere AI, on the other hand, did not hedge at all in its replies. Claude 2 excelled in self-awareness, generally answering only questions it had the training data to support.
Wenchel emphasized that users and companies should test these models against their own specific requirements, underscoring the importance of understanding how AI performs in real-world applications rather than relying on benchmarks alone.