A scatter plot showing the relationship between performance (Arena Score) and cost (Cost per 1M Output Tokens) for a range of Large Language Models (LLMs). The y-axis shows the Arena Score, a measure of model performance, ranging from 1160 to 1380. The x-axis, on a logarithmic scale, shows cost ($/1M output tokens), spanning 0.1 to 100. Each point corresponds to an LLM, color-coded by developing organization. Prominent models such as Gemini 2.0 Flash, DeepSeek-R1, GPT-4, and Claude 3 Opus are labeled. The plot reveals a general trade-off: as cost decreases, Arena Score tends to decrease as well. An annotation reading "Cheaper", with an arrow pointing left toward the lower-cost region, is present, along with a reference to lmarena.ai/price.
A horizontal bar chart ranking the top 25 Large Language Models (LLMs) by hallucination rate. The models are listed vertically on the left, best to worst (lowest hallucination rate at the top). Each bar's length corresponds to the model's hallucination rate, shown as a percentage to the right of the bar. Entries include Google Gemini-2.0-Flash-001 (0.7%), OpenAI o3-mini-high-reasoning (0.8%), and Snowflake-Arctic-Instruct (3.0%). The data comes from Vectara's hallucination leaderboard, last updated February 5th, 2025. The chart highlights the variability in hallucination rates across different LLMs.
Gemini is really good, and we keep making it better: not just in quality and performance, but also in cost efficiency.
With outputs costing $0.40 per million tokens and a 0.7% hallucination rate (per Vectara's leaderboard), Gemini 2.0 Flash is best-in-class on the cost-quality trade-off, and it has been my go-to model for most applications.
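As a minimal sketch of what "go-to model" looks like in practice, here is a call to Gemini 2.0 Flash via the google-genai Python SDK. The SDK choice, the prompt, and the GEMINI_API_KEY environment variable are my assumptions, not something specified above:

```python
# Minimal sketch: one call to Gemini 2.0 Flash with the google-genai SDK.
# Assumes `pip install google-genai` and a GEMINI_API_KEY env var (assumption).
import os

from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

response = client.models.generate_content(
    model="gemini-2.0-flash",  # the model discussed above
    contents="Summarize the trade-off between LLM cost and Arena Score.",
)
print(response.text)

# Back-of-the-envelope cost at $0.40 per 1M output tokens:
# a 1,000-token response costs 1_000 / 1_000_000 * 0.40 = $0.0004.
```

At that price, even a chatty application generating millions of output tokens a day stays in single-digit dollars, which is why the cost axis on the scatter plot above matters as much as the score axis.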