In the nineteenth century, steam engines began to replace horses as the primary source of mechanical power. To evaluate and compare their output, engineers introduced the concept of "horsepower," a unit that quantified an engine's ability to perform work equivalent to that of a horse. This was a significant milestone of the industrial revolution, marking the shift from animal labor to mechanized power sources.
However, power and efficiency are not the same thing. Power refers to the rate at which work is done, while efficiency describes how well a task is accomplished and how effectively input energy is converted into useful work with minimal losses. With the advent and rapid spread of artificial intelligence (AI), the question arose of how to measure its "power" and "efficiency" in a meaningful way.
This turned out to be a difficult task because of the ambiguity and complexity of defining AI itself. Artificial intelligence is broadly understood as the ability of a computer to learn, make decisions, and perform actions typically associated with human intelligence, such as reasoning, problem-solving, and understanding natural language. However, this definition is not strict and can vary depending on the context and the specific applications of AI.

Understanding AI Evaluation
AI systems are often expected to perform tasks that require human-level intelligence, such as image recognition, natural language processing, and decision-making. Given the potential impact of AI on society, evaluating these systems is essential for several reasons. First, effective evaluation helps in assessing the performance and utility of AI applications. Second, it plays a crucial role in identifying biases, errors, and unintended consequences that may arise in the deployment of these systems.
Stages of AI Assessment
- Defining the task. A fundamental step that includes a clear description of the problem to be solved, covering both technical and business aspects.
- Data collection. After the task is defined, a data set on which the assessment will be carried out must be collected or created. The data should be representative of the target audience and objectives.
- Model development. At this stage, the algorithms that will be used to solve the problem are created.
- Evaluation of the model. Various metrics are used to evaluate the effectiveness of the model, and the results are compared with reference or existing solutions (a minimal sketch of this step follows the list).
- Integration. Successful models are integrated into workflows, where their performance and business impact continue to be evaluated.
- Monitoring and support. After deployment, the model's performance must be monitored regularly under real-world conditions. This includes checking for data drift, changes in the input data, and other factors that may affect the model's effectiveness.
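To make the evaluation step concrete, here is a minimal sketch in Python (assuming scikit-learn is available). The synthetic dataset, the logistic-regression model, and the majority-class baseline are illustrative choices, not part of any specific workflow described above.

```python
# Minimal sketch of the "evaluation" stage: train a simple model and
# compare its accuracy against a trivial baseline (majority-class classifier).
# The dataset and models here are purely illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

baseline_acc = accuracy_score(y_val, baseline.predict(X_val))
model_acc = accuracy_score(y_val, model.predict(X_val))

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"model accuracy:    {model_acc:.3f}")
# During monitoring, the same metric would be recomputed periodically on
# fresh production data and compared against this validation score.
```

The same pattern carries over to the monitoring stage: the chosen metric is recomputed on live data and a significant drop relative to the validation score is treated as a signal of possible data drift.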
AI Assessment Indicators
Artificial intelligence is evaluated using a number of metrics. A metric is an indicator that objectively assesses the success of a particular product. Different models are evaluated differently, depending on the types of tasks they perform.
When choosing a metric for evaluating a model, make sure it fits the specific task and domain. In complex tasks, it is better to analyze metrics for each action individually. In practice, multiple metrics are often used together to evaluate a model comprehensively. The metrics are then compared against a benchmark.
A benchmark is a standard or reference point used to measure or evaluate the performance, quality, or efficiency of something. It is typically a dataset generated by experts and is used to assess how well a model performs a given task in comparison to other models or predefined standards. For example, in the context of large language models, benchmarks might include datasets for tasks such as text generation, machine translation, question answering, and sentiment analysis.
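As a rough illustration of how a benchmark is used, the sketch below scores a model against a tiny hand-written reference set using exact-match accuracy. The `answer_question` function and the two benchmark items are hypothetical stand-ins for a real model under test and a real expert-curated dataset.

```python
# Illustrative sketch: scoring a model against a tiny "benchmark" of
# reference answers using exact-match accuracy. The benchmark items and
# the answer_question() function are hypothetical placeholders.
benchmark = [
    {"question": "What is the capital of France?", "reference": "Paris"},
    {"question": "How many planets are in the Solar System?", "reference": "8"},
]

def answer_question(question: str) -> str:
    # In a real evaluation this would call the model under test.
    canned = {
        "What is the capital of France?": "Paris",
        "How many planets are in the Solar System?": "9",
    }
    return canned.get(question, "")

def normalize(text: str) -> str:
    return text.strip().lower()

correct = sum(
    normalize(answer_question(item["question"])) == normalize(item["reference"])
    for item in benchmark
)
print(f"exact-match accuracy: {correct / len(benchmark):.2f}")  # 0.50 in this toy case
```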
LLMs and Their Evaluation
Large language models are neural networks that are trained on billions of words and phrases to capture the diversity and complexity of human language. These models can perform tasks such as translating, generating text and code, answering questions, summarizing content, and even creating artistic works.
A prominent example of an LLM is ChatGPT, developed by OpenAI. Its latest iteration, GPT-4, is particularly notable for its multimodal capabilities, meaning it can process not only text but also other modalities such as images and audio. This versatility has paved the way for groundbreaking applications in AI and natural language processing (NLP), enabling more interactive and intuitive human-computer interactions.
Hugging Face hosts the Open LLM Leaderboard, a platform that evaluates and ranks large language models and chatbots. To evaluate LLMs, the aforementioned benchmarks are used: standard test tasks such as machine translation, answering questions based on context, generating coherent and plausible text, and so on. The platform evaluates models based on four key indicators (a sketch of loading one of these benchmarks follows the list):
- The AI2 Reasoning Challenge is a set of science questions. For example, AI might be asked to determine which object is made of artificial material among options like a wool sweater, a metal ruler, a glass bowl, or a rubber ball. The challenge assesses the model's understanding of material properties and its ability to apply logical reasoning to answer correctly.
- HellaSwag focuses on tasks that require understanding of context, knowledge of the world, and the ability to draw conclusions. It presents short texts with indirect and ambiguous instructions, requiring the model to use intuitive understanding to arrive at the correct answers. This test is designed to evaluate a model's ability to handle nuanced and context-dependent information.
- Massive Multitask Language Understanding (MMLU) is a comprehensive assessment covering text model skills in 57 different fields, including basic mathematics, law and computer science. This wide-ranging evaluation tests the model's versatility and depth of knowledge in various domains, ensuring it can perform well across diverse subjects.
- TruthfulQA is a tool that checks how much a model is prone to repeating false information from the Internet. The test contains 817 questions covering 38 categories, including healthcare, law, finance and politics. By evaluating the model's responses against factual accuracy, TruthfulQA helps determine how reliably the model can provide truthful information.
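As a sketch of how one of these benchmarks can be inspected locally, the snippet below loads the ARC-Challenge test split with the Hugging Face `datasets` library and formats one item as a multiple-choice prompt. The dataset identifier (`allenai/ai2_arc`) and the field names are assumptions based on the publicly hosted dataset and may need adjusting.

```python
# Sketch of inspecting a leaderboard benchmark locally with the `datasets`
# library. The dataset id and field names below are based on the public
# allenai/ai2_arc dataset and may differ in other versions.
from datasets import load_dataset

arc = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")
item = arc[0]

# Format one item as a multiple-choice prompt for an LLM under test.
options = "\n".join(
    f"{label}. {text}"
    for label, text in zip(item["choices"]["label"], item["choices"]["text"])
)
prompt = f"Question: {item['question']}\n{options}\nAnswer:"
print(prompt)
print("reference answer:", item["answerKey"])
```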
These benchmarks and evaluations are crucial for understanding the strengths and limitations of large language models. They provide valuable insights into the models' capabilities, guide further development, and ensure that the deployed models meet high standards of performance and reliability. As LLMs continue to evolve, such rigorous evaluation frameworks will play an essential role in advancing AI technologies and their applications in real-world scenarios.
Key Metrics Used to Evaluate LLMs
- Accuracy. Accuracy measures the proportion of correct predictions made by the model out of all predictions. In the context of language models, it evaluates how accurately the model's output matches the reference or expected output. In classification tasks, such as sentiment analysis or topic classification, accuracy is straightforward to interpret and provides a clear measure of overall performance.
- Recall. Also known as sensitivity, recall measures the proportion of actual positive cases that the model correctly identifies. It is particularly important in tasks where missing a positive case is costly. For LLMs, it reflects how fully the model answers a question or covers all aspects of a task.
- F1-score. The F1-Score is the harmonic mean of precision and recall, providing a single metric that balances both. It is useful when you need to account for both false positives and false negatives. In cases where there is an imbalance between the classes (e.g., detecting rare events), the F1-Score provides a more balanced view of the model's performance.
- Coherence. Coherence measures the logical consistency and flow of the model's responses. A coherent response maintains a logical structure and is contextually appropriate. It can be evaluated using human judgment or automated metrics like perplexity. Human evaluators might score responses based on logical flow, while perplexity measures how predictable the next word is given the previous context. In dialogue systems and text generation, coherence is critical for producing responses that make sense and are contextually relevant.
- Relevance. Evaluates how well the LLM response matches the context and the user's request. Relevance can be assessed through human evaluation or automated metrics that compare the model’s output to a reference response using similarity measures.
- Hallucination. Hallucination measures the model's tendency to generate content that is factually incorrect or logically inconsistent. This includes making up information or distorting facts. It can be evaluated by comparing the model's output to a verified source of truth or through human evaluation to check for factual accuracy.
- Question-answering Accuracy. Evaluates how effectively the LLM handles direct user requests. It can be assessed by comparing the model's answers to a set of reference answers using metrics like Exact Match (EM) and F1-Score. In customer support or virtual assistants, high question-answering accuracy ensures that users receive correct and helpful information in response to their queries.
- Toxicity. Checks whether the LLM's responses contain offensive or harmful content. Toxicity can be evaluated using automated tools like the Perspective API, which scores text based on its potential to be perceived as toxic, or through human evaluation.
- BLEU Score. Used for translation tasks from one language to another, BLEU compares the generated translation with one or more reference translations, measuring the degree of n-gram overlap (sequences of n words) between them. The higher the BLEU value, the closer the translation is to the reference.
- METEOR. Takes into account not only n-gram matches but also synonyms, morphological variants, and word order. This metric is designed to assess translation quality more accurately, especially for languages with rich morphology.
- TER. Translation Edit Rate measures the number of edits required to convert the generated translation into the reference, including insertions, deletions, and substitutions. A low TER value indicates high translation quality.
- Levenshtein Distance. Calculates the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into another. It is useful for evaluating spelling correction or other tasks in which exact character-level alignment matters (a short implementation appears in the sketch after this list).
- ROUGE Score. A set of metrics used primarily for evaluating automatic summarization and machine translation systems. It measures the overlap between the generated summary or translation and a set of reference summaries or translations. ROUGE is especially useful for tasks where capturing the essence of the text is critical.
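To make a few of these metrics concrete, the illustrative sketch below computes accuracy, recall, and F1-score with scikit-learn on toy labels and implements Levenshtein distance directly; the label lists and strings are made up for demonstration. BLEU, ROUGE, METEOR, and TER are usually computed with dedicated libraries and are omitted here for brevity.

```python
# Illustrative toy example of a few of the metrics described above.
# The label lists and strings are made up purely for demonstration.
from sklearn.metrics import accuracy_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]   # reference labels (e.g. correct/incorrect answers)
y_pred = [1, 0, 0, 1, 0, 1]   # model predictions

print("accuracy:", accuracy_score(y_true, y_pred))
print("recall:  ", recall_score(y_true, y_pred))
print("f1-score:", f1_score(y_true, y_pred))

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print("levenshtein('kitten', 'sitting'):", levenshtein("kitten", "sitting"))  # 3
```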
These metrics collectively provide a comprehensive framework for evaluating the performance of large language models across a variety of tasks and applications. By using a combination of these metrics, researchers and developers can gain a detailed understanding of the strengths and weaknesses of their models, guiding further improvements and ensuring that the models meet the desired standards for accuracy, reliability, and user safety.
Is it Possible to Check the IQ of an AI?
The question of whether artificial intelligence can be evaluated using IQ tests has intrigued researchers for years. One notable experiment was conducted by DeepMind, a leading AI research lab. This experiment aimed to test the abstract thinking capabilities of their AI models by assigning tasks that were similar to those found in traditional IQ tests. Instead of a standard IQ test, the tasks involved identifying relationships between colors, shapes, and sizes. Impressively, the AI models managed to answer correctly 75% of the time, demonstrating a significant level of abstract reasoning.
However, this is not the only attempt to quantify the intelligence of AI systems. Researcher Maxim Lott designed an adapted version of an IQ test specifically for AI. The original IQ test typically presents tasks in the form of pictures, which can be challenging for text-based AI models to interpret. To address this, Lott created detailed text descriptions of each picture, making the test more accessible for AI. According to his findings, the Claude-3 neural network achieved a score of 101 points, placing it at the top of the rankings. Following closely was ChatGPT-4, which scored 85 points. For context, the average human IQ is around 100 points, suggesting that these AI models are approaching, but not yet matching, human levels of performance in certain cognitive tasks.
These experiments highlight the evolving capabilities of AI and the ongoing efforts to measure and understand their intelligence. While AI systems are making strides in specific areas, their "intelligence" remains fundamentally different from human intelligence, encompassing a diverse range of strengths and limitations.
Conclusion
Evaluating artificial intelligence is a multidimensional task that requires attention to both engineering and ethical issues. Understanding the processes and factors that affect how AI systems behave is a key step toward developing more effective and safer systems. Continuously updating assessment methods will contribute to more responsible and appropriate use of AI technologies, benefiting both business and society as a whole.