Hello! I’m Takahiro Sato (@T), an SRE at Fintech. I’ve published this article for the 11th day of Merpay & Mercoin Tech Openness Month 2025.
Site Reliability Engineering (SRE), a discipline of reliability management advocated by Google and widely popularized by the Site Reliability Engineering Book, has redefined the relationship between development and operations. Starting with SLI/SLO and error budgets, its practice is built around metrics such as availability, latency, error rate, traffic, resource saturation, and durability.
In recent years, the progress of Large Language Models (LLMs) has been remarkable. As opportunities to use LLMs in services increase, we often encounter phenomena that are easily overlooked by conventional metrics, such as the following:
- Answer quality changes after a few lines of a prompt are changed.
- Hallucinations surge even when latency and error rates are good.
- Answer styles drastically change with minor model updates.
In other words, to protect the "reliability of LLM services", it is becoming necessary to monitor not only classic infrastructure metrics but also LLM-specific quality metrics.
In this article, we will walk through the entire process, from selecting essential metrics for evaluating the reliability of LLM services to concrete measurement and evaluation methods. We will also include a demo using the DeepEval library.
1. General Evaluation Metrics for LLM Services
What metrics should we focus on to measure the reliability of LLM services? LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide lists the following representative examples of evaluation perspectives:
Metric Name | Description |
---|---|
Answer relevancy | Measures how appropriately the answer responds to the question. |
Task completion | Measures how accurately the given task is accomplished. |
Correctness | Gauges how closely the answer matches a pre-prepared correct answer. |
Hallucination | Gauges whether the content includes factually incorrect or fabricated information. |
Tool correctness | Gauges whether the correct tool was selected and executed to achieve the task. |
Contextual relevancy | Gauges how appropriate the retrieved information is for the question. |
Responsible metrics | Gauges whether the content includes discriminatory or offensive expressions, or whether it is biased toward specific attributes. |
Task-specific metrics | Gauges the performance of LLMs on specific tasks such as summarization or translation. |
By monitoring infrastructure SLIs such as availability and latency, the typical metrics for conventional services, we have been able to understand customer satisfaction in relation to the user journey. With LLM services, however, the quality of the generation itself directly affects customer satisfaction: whether a response matches the user's intent, whether it is grounded in facts, and whether the task was completed correctly. Therefore, in addition to conventional SLIs such as availability and latency, it is necessary to design SLIs that capture the generation quality unique to LLM services and to establish a metric system that quantitatively shows whether customers can quickly obtain the correct answer they intended. So, when designing metrics for LLM services, which metrics should we select?
1.1. Pitfalls of General Evaluation Metrics
General evaluation perspectives such as answer relevancy, correctness, and the presence or absence of hallucinations, as shown in the table above, form a useful framework, but they cannot always capture the unique success conditions of every LLM service use case. For example, without use-case-specific metrics such as comprehensiveness and absence of contradictions for a summarization service, or relevance of the retrieved context for RAG, it is often impossible to fully measure the value that users receive. The article The Accuracy Trap: Why Your Model’s 90% Might Mean Nothing explains that although a customer churn prediction model achieved a 92% accuracy rate during testing, in practice it generated false positives and caused oversights that resulted in an increased churn rate.
The lesson here seems to be this: prioritize end-to-end evaluation from the user's perspective. LLM services have complex internal structures such as RAG and agent mechanisms, but no matter how much the intermediate components are improved, the ROI will not increase unless the answers that users receive improve. The evaluation metrics chosen for an LLM service should treat the system as a black box and measure its final output end to end. In doing so, they should also be checked against business outcomes such as reduced support time and improved sales.
1.2. What Makes a Good Evaluation Metric?
The Complete LLM Evaluation Playbook: How To Run LLM Evals That Matter lists the following three conditions for excellent evaluation metrics:
- Quantitative
- It must be possible to calculate a numerical score as an evaluation result. If the result can be evaluated numerically, it is desirable to be able to set a threshold that serves as a passing line or to measure the effect of model improvements by tracking changes in the score over time.
- Reliable
- It must be possible to obtain consistently stable evaluation results. Given that LLM output fluctuates unpredictably, it would be problematic if the evaluation metrics were also unstable. For example, although evaluation methods using LLMs (such as LLM-as-a-judge, described later) are more accurate than conventional methods, they tend to have more variability in the evaluation results, so caution is required.
- Accurate
- It must be possible to accurately reflect the performance of the LLM with criteria that are close to actual human evaluation. Ideally, an output with a high evaluation score is one that a human user would also find satisfactory. For that reason, it is necessary to evaluate output using criteria that match human expectations.
Also, no matter how high an evaluation metric value is, it is meaningless if it does not lead to business results such as sales and customer satisfaction. The article calls this metric-outcome fit (MOF) and explains that 95% of LLM metric evaluations performed in the field lack this connection and do not create value. The article goes on to state that the only way to avoid using the wrong metrics is to keep confirming and adjusting whether the metrics can reliably judge as favorable the cases that are actually considered good business results.
2. Overall Picture of Metric Evaluation Methods
In this next section, we will introduce the types of methods for actually evaluating metrics. There are roughly four types, and each has its own advantages and disadvantages.
- Statistical methods (string-based, n-gram-based, and surface-based)
- Methods using models other than LLMs (classifier, learned metrics, and small-LM metrics)
- Hybrid methods that combine statistical methods and models other than LLMs (embedding-based metrics)
- Methods using the LLM itself (LLM-based and generative evaluator)
2.1. Statistical Methods
Statistical methods compare manually created reference (correct answer) data with the output text at the string level, measure their similarity, and evaluate the result.
- BLEU
- It assigns a score calculated from the geometric mean of the 1- to 4-gram precisions between the model’s output and the expected reference translation. This precision-based score is then multiplied by a brevity penalty, which penalizes outputs that are too short.
- ROUGE
- ROUGE-L is often used for summary evaluation. It calculates the F1 score based on LCS (longest common subsequence) for recall and precision, while ROUGE-1/2 measures how well the summary covers the original document based on n-gram recall.
- METEOR
- This metric evaluates both precision and recall. It takes into account differences in word order and synonym matching. (The final score is calculated by multiplying the harmonic mean of precision and recall by a word-order penalty.)
- Edit distance or Levenshtein distance (available only in Japanese)
- This metric measures the difference between the output and a reference string as the number of edits. In practice, it is rarely used as-is for comparing sentences of differing lengths, and it is not used much when the cost of adopting it is considered.
ref: LLM evaluation metrics — BLEU, ROUGE and METEOR explained
These statistical indicators are simple to calculate and have high reproducibility (consistency), but they do not consider the meaning or context of the text, so they are not suitable for evaluating long-form answers or outputs that require advanced reasoning generated by LLMs. In fact, pure statistical methods cannot evaluate the logical consistency or correctness of the meaning of the output, and the accuracy is said to be insufficient for complex outputs.
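As a rough illustration of these metrics, the sketch below computes BLEU, ROUGE-L, and Levenshtein distance for a single candidate/reference pair. It assumes the `nltk` and `rouge-score` packages are installed, and the example sentences are our own; this is a minimal sketch, not a production evaluation pipeline.

```python
# Minimal sketch of the statistical metrics above (assumes `pip install nltk rouge-score`).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the hunter cut open the wolf and rescued little red riding hood"
candidate = "a hunter cut the wolf open and saved little red riding hood"

# BLEU: geometric mean of 1- to 4-gram precision, multiplied by a brevity penalty.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,  # avoid zero scores on short texts
)

# ROUGE-L: F1 over the longest common subsequence.
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure

# Levenshtein (edit) distance via simple dynamic programming.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

print(f"BLEU={bleu:.3f}  ROUGE-L={rouge_l:.3f}  edit distance={levenshtein(reference, candidate)}")
```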
2.2. Methods Using Models Other Than LLMs
This is an evaluation method that uses machine learning models dedicated to evaluation, such as classification models and embedding models, and relatively lightweight natural language processing models.
- NLI (Natural Language Inference) model
- This model classifies whether the output of the LLM is consistent with (entailment), contradicts (contradiction), or is unrelated to (neutral) a given reference text (such as factual information). In this case, the model outputs a probability between 0.0 and 1.0 indicating how logically consistent the text is (see the sketch after this list).
- Dedicated model trained based on transformer-type language models (such as NLI and BLEURT)
- This is a method of scoring the similarity between the output of the LLM and the expected correct answer. Model-based methods can evaluate the meaning of the text to some extent, but because the evaluation model itself has uncertainty, the consistency (stability) of the score is lacking. For example, it has been pointed out that NLI models cannot make good judgments when the input sentence is long, and that BLEURT is affected by bias in its training data, which can skew evaluations.
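As a minimal sketch of the NLI idea, the code below scores entailment between a reference text and an LLM output with an off-the-shelf Hugging Face model (`roberta-large-mnli`, chosen purely as an example); the softmax probability of the entailment class can serve as the 0.0–1.0 consistency score. The example sentences are our own.

```python
# Minimal NLI-based consistency check (assumes `pip install transformers torch`;
# roberta-large-mnli is used only as an example checkpoint).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "A hunter cut open the wolf's belly and rescued Little Red Riding Hood and her grandmother."
hypothesis = "Little Red Riding Hood was saved by a hunter."  # the LLM output to check

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = logits.softmax(dim=-1)[0]

# Print the probability for each label (contradiction / neutral / entailment).
for idx, p in enumerate(probs):
    print(f"{model.config.id2label[idx]}: {p:.3f}")
```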
2.3. Hybrid Methods That Use Statistical Methods and Models Other Than LLMs Simultaneously
These methods sit between the approaches above: they combine text embedded and vectorized by a pre-trained language model with statistical distance calculations to perform the evaluation.
- Bidirectional encoder representations from transformers (BERT) Score
- Calculates the cosine similarity (available only in Japanese) between the context vectors of each word obtained by BERT, etc., and measures the semantic overlap between the output sentence and the reference sentence.
- MoverScore
- Creates a distribution using word embeddings for each of the output sentence and the reference sentence, and calculates the Earth Mover’s Distance (Optimal Transport Distance) (available only in Japanese) from there to measure the difference between the two.
These methods are superior to BLEU and other statistical methods in that they can capture semantic closeness beyond the word level and surface level, but they have the weakness that they are ultimately affected by the performance and bias of the original embedding model (BERT, etc.). For example, if the pre-training model does not have an appropriate vector representation for the context of a specialized field or the latest knowledge, accurate evaluation is not possible. There is also a risk that the social bias included in the evaluation model will manifest in the score.
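For reference, BERTScore can be computed in a few lines with the `bert-score` package; the sketch below is a minimal example with our own sample sentences and assumes the package and its model weights are available.

```python
# Minimal BERTScore example (assumes `pip install bert-score`).
from bert_score import score

candidates = ["A hunter cut the wolf open and saved Little Red Riding Hood."]
references = ["The hunter cut open the wolf's belly and rescued Little Red Riding Hood."]

# Precision/recall/F1 computed from cosine similarity of contextual token embeddings.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1[0]:.3f}")
```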
2.4. Methods Using LLMs (LLM-as-a-judge)
Among the evaluation methods now available, LLM-as-a-judge has been attracting attention in recent years. This is a method in which an LLM itself measures and evaluates the quality of the output. The approach gives an advanced LLM instructions such as "Please evaluate whether the given answer meets the criteria" and extracts evaluation scores and judgments from the model. LLMs can understand the meaning of sentences and make complex judgments, so the major advantage is that they can automate evaluations close to human subjectivity. In fact, with the G-Eval method, which uses GPT-4 as an evaluator, the correlation between the evaluation score and human evaluation is greatly improved compared to conventional automatic evaluations, as described in the article G-Eval Simply Explained: LLM-as-a-Judge for LLM Evaluation. On the other hand, LLM-based evaluations have issues with score stability (reliability) because the results can fluctuate depending on the model's response. Even if the same answer is re-evaluated, there is no guarantee of getting the same score every time, because the model's randomness and output fluctuations also affect the evaluation results.
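To make the idea concrete, here is a minimal, hand-rolled LLM-as-a-judge sketch using the OpenAI API; the prompt wording and JSON format are our own illustrative assumptions, not a standard, and libraries such as DeepEval (introduced later) wrap this pattern with much more structure.

```python
# Minimal hand-rolled LLM-as-a-judge sketch (assumes `pip install openai` and
# OPENAI_API_KEY; the judging prompt and JSON schema here are illustrative only).
import json
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer: str, criteria: str) -> dict:
    prompt = (
        "You are an evaluator. Score the answer against the criteria on a 1-5 scale.\n"
        f"Criteria: {criteria}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        'Respond as JSON: {"score": <1-5>, "reason": "<short justification>"}'
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

print(judge(
    question="Summarize Little Red Riding Hood for children.",
    answer="A girl visits her sick grandmother, a wolf tricks them, and a hunter saves them.",
    criteria="The answer is relevant, factually faithful to the tale, and easy for children to follow.",
))
```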
Here are some of the typical methods of LLM-as-a-judge:
- G-Eval
- A mechanism that scores evaluation criteria on a scale of 1–5. The LLM returns the evaluation score and the reason for the evaluation result (the result of chain of thought).
- QAG Score
- Automatically generates QA (yes, no, or unknown) from the output, solves the same QA in the original text, and scores the match rate between the two.
- SelfCheckGPT
- Samples N times with the same prompt, and estimates the factuality by measuring the consistency between the generated sentences (e.g., multiple comparison modes such as N-gram, QA, BERTScore). The greater the variation, the higher the possibility of hallucinations.
- DAG (deep acyclic graph)
- A decision-tree-style metric provided by DeepEval. Each node is a yes/no LLM judgment, and a fixed score is returned depending on the path taken, so LLM-as-a-judge judgments are combined with Boolean decision nodes and the resulting partial scores are deterministic.
- Prometheus2 Model
- A 7B/8x7B evaluation model distilled from feedback from high-quality judges, including GPT-4, and numerous evaluation traces. It has demonstrated agreement of 0.6–0.7 with humans/GPT-4 for direct scoring and 72–85% for pairwise comparison.
The following table summarizes the measurement and evaluation methods of the indicators discussed so far.
Type | Specific Method | Advantages | Disadvantages |
---|---|---|---|
Statistical Methods | BLEU, ROUGE, METEOR, and Edit Distance (Levenshtein Distance) | – Provides simple and fast calculation<br>– Features high reproducibility<br>– Requires no additional learning and is easy to implement | – Evaluates only surface matches without considering meaning or context<br>– Not suitable for output that requires logical consistency or advanced reasoning |
Methods Using Models Other Than LLMs | NLI (Natural Language Inference) Model, BLEURT, Transformer-Based Dedicated Evaluation Model | – Can evaluate meaning, understanding, and logical consistency to some extent<br>– Offers lower calculation costs than LLMs, and can be fine-tuned independently | – Depends on the uncertainty and bias of the evaluation model itself<br>– Accuracy tends to decrease for long sentences and content on specialized fields |
Hybrid Methods | BERTScore and MoverScore | – Captures semantic closeness with embeddings and offers higher accuracy than statistical indicators<br>– Deterministic and easily maintains reproducibility | – Depends on the learning range and bias of the embedding source model<br>– Difficult to adapt to the latest knowledge or narrow specialized fields |
Methods Using LLMs (LLM-as-a-judge) | G-Eval, QAG Score, SelfCheckGPT, DAG (Deep Acyclic Graph), and Prometheus2 Model | – Can automate complex judgments that closely resemble human evaluation<br>– Can evaluate multifaceted quality of answers in one go | – Output is probabilistic and scores tend to fluctuate<br>– High model usage cost and sensitive to prompts |
Actually measuring and evaluating with these methods requires a tool that can do so efficiently. Therefore, in the next section, we will introduce DeepEval, one of the LLM evaluation libraries I came across in the reference articles.
3. DeepEval
DeepEval is a Python library for evaluating LLM services. It provides a framework for creating test cases, defining evaluation metrics, and running evaluations. DeepEval supports metrics that evaluate various aspects such as response relevance, fidelity, and contextual accuracy, and also supports custom metrics, automatic generation of evaluation datasets, and integration with test frameworks such as Pytest. The official documentation provides detailed installation instructions, as well as instructions on basic usage, how to set various evaluation metrics, how to create custom metrics, and more.
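For example, a minimal run with one of the built-in metrics looks roughly like the following, based on the getting-started documentation; it assumes `pip install deepeval` and an `OPENAI_API_KEY`, since the default metrics use an OpenAI model as the judge, and the test case content is our own.

```python
# Minimal DeepEval example with a built-in metric (assumes `pip install deepeval`
# and OPENAI_API_KEY set; class names follow the official getting-started docs).
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Summarize Little Red Riding Hood in one sentence.",
    actual_output="A girl visits her grandmother, a wolf deceives them, and a hunter rescues both.",
)

metric = AnswerRelevancyMetric(threshold=0.7, model="gpt-4o")
evaluate(test_cases=[test_case], metrics=[metric])  # prints scores, reasons, and pass/fail
```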
Now, let’s look at the practical application of evaluation procedures based on a simple summarization service.
3.1. Practical Example: Determining Metrics and Measurement Methods for Summarization Services
The summarization service discussed here is assumed to receive long texts such as articles and documents as input and generate a summary of their content. I believe this is the first kind of service people envision as a specialty of LLMs. In the following sections, let's envision a service that summarizes Grimm's Fairy Tales into sentences simple enough for even children to understand.
3.2. Selection of Indicators
From the perspective of summarization, the general evaluation indicators that come to mind are Answer Relevancy, Correctness, and Hallucination. DeepEval's G-Eval can cover these three indicators, but we first need to check whether it satisfies the conditions in "1.2. What Makes a Good Evaluation Metric?".
- Quantitative
- G-Eval returns a continuous score from 0 to 1, so it can be said that a numerical score can be calculated as an evaluation result.
- Reliable
- G-Eval is inherently probabilistic, but if you apply the following three measures, you can almost reproduce the same score for the same input: (1) set the temperature option passed to the LLM to 0, (2) fix evaluation_steps and skip the CoT generation step, and (3) specify a Rubric to keep the evaluation score consistent. This allows you to obtain stable evaluation results. (Strictly speaking, sampling noise and system randomness on the OpenAI side remain, so complete reproducibility is not possible. We recommend using an API/backend where top_p=0 and a seed can be fixed, or ultimately using majority-vote/ensemble evaluation.)
- Accurate
- G-Eval supports evaluation with references (i.e., expected_output; in this case, the original text of Grimm's Fairy Tales as correct answer data). It has been shown both in papers and in actual operation that G-Eval has a high correlation with human judgment in tasks that focus on fact verification.
In light of the above, it seems appropriate to use DeepEval’s G-Eval for the metric evaluation of the Answer Relevancy, Correctness, and Hallucination metrics.
3.3. Decomposition of Evaluation Perspectives
In this next section, we will list the perspectives and steps necessary for evaluating the selected indicators and the order in which they should be evaluated. Fortunately, there is a document from Google Cloud, Vertex AI documentation – Metric prompt templates for model-based evaluation, that is helpful for decomposing the evaluation perspectives, so I will refer to it here.
- Answer Relevancy
- STEP1. Identify user intent – List the explicit and implicit requirements in the prompt.
- STEP2. Extract answer points – Summarize the key claims or pieces of information in the response.
- STEP3. Check coverage – Map answer points to each requirement; note any gaps.
- STEP4. Detect off-topic content – Flag irrelevant or distracting segments.
- STEP5. Assign score – Choose 1-5 from the rubric and briefly justify the choice.
- Correctness
- STEP1. Review reference answer (ground truth).
- STEP2. Isolate factual claims in the model response.
- STEP3. Cross-check each claim against the reference or authoritative sources.
- STEP4. Record discrepancies – classify as omissions, factual errors, or contradictions.
- STEP5. Assign score using the rubric, citing the most significant discrepancies.
- Hallucination
- STEP1. Highlight factual statements – names, dates, statistics, citations, etc.
- STEP2. Compare the result with the provided context and known reliable data.
- STEP3. Label claims as verified, unverifiable, or false.
- STEP4. Estimate hallucination impact – proportion and importance of unsupported content.
- STEP5. Assign score following the rubric and list specific hallucinated elements.
3.4. Calculating Evaluation Scores
Now, let’s actually conduct evaluation measurements and calculate evaluation scores. First, we’ll prepare the material to be summarized and the prompt. This time, we’ll use the original text of Little Red Riding Hood from Grimm’s Fairy Tales and prepare the following prompt:
Please create a summary of the following Grimm's Fairy Tale content.
Requirements:
1. Identify and include major characters and important elements
2. Logically organize the flow of content
3. Include important events and turning points
4. Be faithful to the original text content
5. Keep the summary within 500 characters
Grimm's Fairy Tale content: {Little Red Riding Hood original text}
Summary:
The evaluation script used is as follows:
import asyncio
import openai
from deepeval.metrics.g_eval.g_eval import GEval
from deepeval.metrics.g_eval.utils import Rubric
from deepeval.test_case.llm_test_case import LLMTestCase, LLMTestCaseParams
async def evaluate_comprehensive_metrics(client: openai.AsyncOpenAI, test_case: LLMTestCase, prompt_name: str, original_text: str) -> dict:
"""Execute G-Eval metrics evaluation"""
# Answer Relevancy evaluation
geval_answer_relevancy = GEval(
name="Answer Relevancy",
evaluation_steps=[
"STEP1. **Identify user intent** – List the explicit and implicit requirements in the prompt.",
"STEP2. **Extract answer points** – Summarize the key claims or pieces of information in the response.",
"STEP3. **Check coverage** – Map answer points to each requirement; note any gaps.",
"STEP4. **Detect off-topic content** – Flag irrelevant or distracting segments.",
"STEP5. **Assign score** – Choose 1-5 from the rubric and briefly justify the choice.",
],
rubric=[
Rubric(score_range=(0, 2), expected_outcome="Largely unrelated or fails to answer the question at all."),
Rubric(score_range=(3, 4), expected_outcome="Misunderstands the main intent or covers it only marginally; most content is off-topic."),
Rubric(score_range=(5, 6), expected_outcome="Answers the question only partially or dilutes focus with surrounding details; relevance is acceptable but not strong."),
Rubric(score_range=(7, 8), expected_outcome="Covers all major points; minor omissions or slight digressions that don't harm overall relevance."),
Rubric(score_range=(9, 10), expected_outcome="Fully addresses every aspect of the user question; no missing or extraneous information and a clear, logical focus."),
],
evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.RETRIEVAL_CONTEXT],
model="gpt-4o"
)
# Correctness
geval_correctness = GEval(
name="Correctness",
evaluation_steps=[
"STEP1. **Review reference answer** (ground truth).",
"STEP2. **Isolate factual claims** in the model response.",
"STEP3. **Cross-check** each claim against the reference or authoritative sources.",
"STEP4. **Record discrepancies** – classify as omissions, factual errors, or contradictions.",
"STEP5. **Assign score** using the rubric, citing the most significant discrepancies.",
],
rubric=[
Rubric(score_range=(0, 2), expected_outcome="Nearly everything is incorrect or contradictory to the reference."),
Rubric(score_range=(3, 4), expected_outcome="Substantial divergence from the reference; multiple errors but some truths remain."),
Rubric(score_range=(5, 6), expected_outcome="Partially correct; at least one important element is wrong or missing."),
Rubric(score_range=(7, 8), expected_outcome="Main facts are correct; only minor inaccuracies or ambiguities."),
Rubric(score_range=(9, 10), expected_outcome="All statements align perfectly with the provided ground-truth reference or verifiable facts; zero errors.")
],
evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.RETRIEVAL_CONTEXT],
model="gpt-4o"
)
# Hallucination
geval_hallucination = GEval(
name="Hallucination",
evaluation_steps=[
"STEP1. **Highlight factual statements** – names, dates, statistics, citations, etc.",
"STEP2. **Compare with provided context** and known reliable data.",
"STEP3. **Label claims** as verified, unverifiable, or false.",
"STEP4. **Estimate hallucination impact** – proportion and importance of unsupported content.",
"STEP5. **Assign score** following the rubric and list specific hallucinated elements.",
],
rubric=[
Rubric(score_range=(0, 2), expected_outcome="Response is dominated by fabricated or clearly false content."),
Rubric(score_range=(3, 4), expected_outcome="Key parts rely on invented or unverifiable information."),
Rubric(score_range=(5, 6), expected_outcome="Some unverified or source-less details appear, but core content is factual."),
Rubric(score_range=(7, 8), expected_outcome="Contains minor speculative language that remains verifiable or harmless."),
Rubric(score_range=(9, 10), expected_outcome="All content is grounded in the given context or universally accepted facts; no unsupported claims.")
],
evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.RETRIEVAL_CONTEXT],
model="gpt-4o"
)
await asyncio.to_thread(geval_answer_relevancy.measure, test_case)
await asyncio.to_thread(geval_correctness.measure, test_case)
await asyncio.to_thread(geval_hallucination.measure, test_case)
# Function to estimate rubric score (for display purposes)
def extract_rubric_score_from_normalized(normalized_score, rubric_list):
"""Identify rubric range from normalized score (0.0-1.0)"""
scaled_score = normalized_score * 10
for rubric_item in rubric_list:
score_range = rubric_item.score_range
if score_range[0] <= scaled_score <= score_range[1]:
return {
'scaled_score': scaled_score,
'rubric_range': score_range,
'expected_outcome': rubric_item.expected_outcome
}
return None
answer_relevancy_rubric_info = extract_rubric_score_from_normalized(
geval_answer_relevancy.score, geval_answer_relevancy.rubric
)
correctness_rubric_info = extract_rubric_score_from_normalized(
geval_correctness.score, geval_correctness.rubric
)
hallucination_rubric_info = extract_rubric_score_from_normalized(
geval_hallucination.score, geval_hallucination.rubric
)
return {
"answer_relevancy_score": geval_answer_relevancy.score,
"answer_relevancy_rubric_info": answer_relevancy_rubric_info,
"answer_relevancy_reason": geval_answer_relevancy.reason,
"correctness_score": geval_correctness.score,
"correctness_rubric_info": correctness_rubric_info,
"correctness_reason": geval_correctness.reason,
"hallucination_score": geval_hallucination.score,
"hallucination_rubric_info": hallucination_rubric_info,
"hallucination_reason": geval_hallucination.reason,
}
async def generate_summary(client: openai.AsyncOpenAI, prompt_template: str, full_story: str, model: str = "gpt-4o") -> str:
"""Generate summary using LLM"""
prompt = prompt_template.format(context=full_story)
try:
response = await client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
max_tokens=300,
temperature=0.0, top_p=0, logit_bias={}
)
content = response.choices[0].message.content
return content.strip() if content else ""
except Exception as e:
return f"Error: {str(e)}"
async def process_prompt(client: openai.AsyncOpenAI, prompt_info: dict, full_story: str, context: list) -> dict:
model = prompt_info.get("model", "gpt-4o")
# Generate summary
summary = await generate_summary(client, prompt_info["template"], full_story, model)
# Create test case
test_case = LLMTestCase(
input=prompt_info["template"], # Prompt
actual_output=summary, # Summary result
retrieval_context=context # Original text of the fairy tale to be summarized
)
# Execute evaluation
metrics_result = await evaluate_comprehensive_metrics(client, test_case, prompt_info['name'], full_story)
return {
"prompt_name": prompt_info['name'],
"model": model,
"summary": summary,
**metrics_result
}
async def main():
# Load the original fairy tale text
with open('little_red_riding_hood.txt', 'r', encoding='utf-8') as f:
full_story = f.read().strip()
context = [full_story]
prompts = [
{
"name": "prompt-01",
"template": """Please create a summary of the following `story`.
Requirements:
1. Identify and include major characters and important elements
2. Logically organize the flow of content
3. Include important events and turning points
4. Be faithful to the original text content
5. Keep the summary within 500 characters
story: {context}
Summary:""",
"model": "gpt-4o"
},
]
async with openai.AsyncOpenAI() as client:
tasks = [
process_prompt(client, prompt_info, full_story, context)
for prompt_info in prompts
]
all_results = await asyncio.gather(*tasks)
# Result display processing
...
if __name__ == "__main__":
asyncio.run(main())
The executed summary result was as follows:
Once upon a time, there was a lovely little girl called Little Red Riding Hood. She received a red hood from her grandmother and always wore it.
One day, she went through the forest to her grandmother's house to deliver sweets and wine to her sick grandmother.
On the way, she met a wolf and told him where she was going. The wolf went ahead and swallowed the grandmother, then deceived Little Red Riding Hood and swallowed her too.
However, a hunter who was passing by cut open the wolf's belly and rescued Little Red Riding Hood and her grandmother. Little Red Riding Hood learned a lesson and vowed never to stray from the path in the forest again.
The results evaluated by G-Eval are as follows (excerpt from the first run):
- Answer Relevancy: 0.912
- Expected Outcome: Fully addresses every aspect of the user question; no missing or extraneous information and a clear, logical focus.
- Reason: The summary includes key characters like Little Red Riding Hood, her grandmother, the wolf, and the hunter. It logically organizes the flow of events, such as the journey through the forest, the encounter with the wolf, and the rescue. Important events like the wolf's deception and the rescue by the hunter are covered. The summary is faithful to the original text and concise, with no extraneous information.
- Correctness: 0.901
- Expected Outcome: All statements align perfectly with the provided ground-truth reference or verifiable facts; zero errors.
- Reason: The main facts in the Actual Output align well with the Retrieval Context, including the characters, events, and moral of the story. Minor details like the specific dialogue and actions are slightly condensed but do not affect the overall accuracy.
- Hallucination: 0.903
- Expected Outcome: All content is grounded in the given context or universally accepted facts; no unsupported claims.
- Reason: The output closely follows the context with accurate details about Little Red Riding Hood, her grandmother, the wolf, and the hunter. The sequence of events and character actions are consistent with the context, with no unsupported claims.
Looking at the evaluation reasons that determined the scores, it appears that each indicator is being evaluated appropriately. As introduced in 3.2 Selection of Indicators, G-Eval experiences evaluation fluctuations. Therefore, we executed the above script 50 times. The scatter plot of the measured evaluation values is shown below.
As a result, all indicators achieved scores of approximately 0.9 or higher. Could we then simply adopt each indicator's score of approximately 0.9 as its SLI and set an SLO target of 0.9 or higher?
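As a sketch of how such repeated measurements might feed into an SLI/SLO decision, the snippet below aggregates per-run scores and checks the share of runs meeting a 0.9 threshold. The `run_scores` values are hypothetical placeholders, not the actual 50 measured values.

```python
# Hedged sketch: aggregate repeated G-Eval scores before committing to an SLO.
# `run_scores` is a hypothetical stand-in for the measured values per metric.
from statistics import mean, pstdev

run_scores = {
    "answer_relevancy": [0.91, 0.90, 0.93, 0.89, 0.92],  # ...50 values in practice
    "correctness":      [0.90, 0.92, 0.88, 0.91, 0.90],
    "hallucination":    [0.90, 0.91, 0.89, 0.92, 0.90],
}

SLO_TARGET = 0.9
for name, scores in run_scores.items():
    good_ratio = sum(s >= SLO_TARGET for s in scores) / len(scores)
    print(f"{name}: mean={mean(scores):.3f} stdev={pstdev(scores):.3f} "
          f"runs >= {SLO_TARGET}: {good_ratio:.0%}")
```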
3.5. Review of Evaluation Metrics
As introduced above, this service is meant to summarize Grimm's Fairy Tales in sentences simple enough for even children to understand. To make the above summary result understandable for children, we should also consider the following indicators:
- Readability: Are there difficult kanji characters (words) or expressions that children cannot read?
- "deceived"?, "lesson"?, "wine"? (The Japanese version of the summary used old expressions and difficult kanji)
- Safety/Toxicity: Are there expressions that, when compared with modern compliance, are too violent for children?
- E.g., cut open the belly
It is necessary to select evaluation indicators with an awareness of closely linking them to customer value and business KPIs. In the case of this summarization service, rather than general evaluation indicators, the above indicators should be prioritized as task-specific metrics considering the target audience. Accordingly, the prompt would also need to be modified.
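As a sketch, such task-specific indicators could also be defined with DeepEval's G-Eval; the criteria text below is our own illustrative wording, not a vetted rubric (DeepEval also provides a built-in ToxicityMetric that may cover part of the safety aspect).

```python
# Sketch of task-specific metrics for a children's summary (illustrative criteria only).
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

readability_for_children = GEval(
    name="Readability for Children",
    criteria=(
        "Evaluate whether the summary uses vocabulary, kanji, and sentence structures "
        "that elementary-school children can read and understand without help."
    ),
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o",
)

safety_for_children = GEval(
    name="Safety for Children",
    criteria=(
        "Evaluate whether the summary avoids descriptions that are too violent or "
        "frightening for children, such as graphic depictions of injury."
    ),
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o",
)
```

These metrics could then be measured against the same test case with `.measure()`, just like the metrics in the evaluation script above.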
That said, it is difficult to create a perfect set of indicators on the first attempt. The Complete LLM Evaluation Playbook: How To Run LLM Evals That Matter states that it is desirable to start with one evaluation indicator and eventually narrow it down to five. It is necessary to select, measure, and evaluate indicators while being aware of how much the evaluation indicator scores match the metric outcome fit—the connection between indicators and outcomes (frequent use by children).
(In the case of an actual service, as a business KPI, providing images rather than text might yield better results)
3.6. Exploring Automation Possibilities
In the example above, humans performed the indicator selection, evaluation score calculation, and review of the evaluation metrics. G-Eval, however, uses a mechanism in which a GPT-4-class model decomposes and reasons about the evaluation procedure by itself and returns only the final score, so the application of evaluation criteria, scoring, and aggregation can be automated in one pass in place of a human operator. Here is an example of that procedure (a code sketch follows the list below):
1. Present the evaluation task: Give the LLM used for evaluation a task description such as "Please score the generated text that will be presented according to certain evaluation criteria on a scale of 1 to 5." When doing this, clearly state the definition of the evaluation criteria and give the LLM the context of the task (for example, present the indicator list from the general evaluation metrics for LLM services).
2. Decompose the evaluation perspectives: For the indicators selected in step 1, have the model itself list the necessary perspectives and steps.
3. Calculate the score: Finally, have the model evaluate the actual input and output according to the evaluation steps generated above.
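In DeepEval's G-Eval, steps 2 and 3 of this procedure correspond to passing only `criteria` and letting the judge model auto-generate the evaluation steps via chain of thought. A minimal sketch follows; the criteria wording and the stand-in test case are our own assumptions.

```python
# Sketch: let G-Eval decompose the evaluation steps itself from a criteria statement.
# (The test case here is a stand-in for the one built in the evaluation script above.)
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

test_case = LLMTestCase(
    input="Please summarize Little Red Riding Hood for children.",
    actual_output="A girl visits her grandmother, a wolf deceives them, and a hunter rescues both.",
)

auto_summary_quality = GEval(
    name="Summary Quality (auto-decomposed)",
    # Only the criteria are given; evaluation_steps are generated by the judge model's CoT.
    criteria="Assess whether the summary is relevant, faithful to the original tale, and free of fabricated content.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o",
)

auto_summary_quality.measure(test_case)
print(auto_summary_quality.score, auto_summary_quality.reason)
```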
One caution: when LLMs act as evaluators, they tend to overrate LLM-like outputs and are vulnerable to score manipulation by the insertion of just a few words. Even if we try to mitigate this with measures such as evaluating with an LLM from a different model family, pairwise comparison where two answers are compared side by side, or anomaly detection, complete neutrality cannot be guaranteed. Also, as introduced in 3.2 Selection of Indicators, G-Eval has reproducibility issues where the evaluation of the same answer fluctuates due to its probabilistic nature, requiring measures such as fixing evaluation prompts and seeds. For these reasons, it is essential to take a two-stage approach in which human review is always used in conjunction to correct and verify the final judgments.
4. Summary
In this article, we introduced a range of topics from selecting essential metrics for evaluating the reliability of LLM services to specific measurement and evaluation methods and included demonstrations using the DeepEval library. How to define metrics for LLM service reliability evaluation as SLIs, which cannot be fully measured by conventional metrics such as availability and latency alone, is a new field for SRE as well. The approach of using evaluation tools such as DeepEval, which we tested for this article, is just one of many options. The field of LLM evaluation metrics is still under active research, and there seems to be no single correct answer yet to the question of how to measure the reliability of LLM services. However, even if new evaluation metrics and new measurement methods are discovered in the future, I believe that one fundamental question will remain unchanged: Do these metrics really represent customer satisfaction? Along with technological progress, I hope we can continue to engage in daily SRE work without forgetting this question.
Tomorrow’s article will be “AI Hackathon at Mercari Mobile Dev Offsite” by @k_kinukawa san. Stay tuned!
References
- Site Reliability Engineering Book: https://sre.google/books/
- LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide: https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation
- The Accuracy Trap: Why Your Model’s 90% Might Mean Nothing: https://medium.com/%40edgar_muyale/the-accuracy-trap-why-your-models-90-might-mean-nothing-f3243fce6fe8
- The Complete LLM Evaluation Playbook: How To Run LLM Evals That Matter: https://www.confident-ai.com/blog/the-ultimate-llm-evaluation-playbook
- Levenshtein Distance: https://note.com/noa813/n/nb7ffd5a8f5e9
- LLM evaluation metrics — BLEU, ROUGE and METEOR explained: https://avinashselvam.medium.com/llm-evaluation-metrics-bleu-rogue-and-meteor-explained-a5d2b129e87f
- BERTScore: https://openreview.net/pdf?id=SkeHuCVFDr
- BERT: https://en.wikipedia.org/wiki/BERT_(language_model)
- Cosine Similarity: https://atmarkit.itmedia.co.jp/ait/articles/2112/08/news020.html
- MoverScore: https://arxiv.org/abs/1909.02622
- Earth Mover’s Distance (Optimal Transport Distance): https://zenn.dev/derwind/articles/dwd-optimal-transport01#%E6%9C%80%E9%81%A9%E8%BC%B8%E9%80%81%E8%B7%9D%E9%9B%A2
- G-Eval (Paper): https://arxiv.org/abs/2303.16634
- G-Eval Simply Explained: LLM-as-a-Judge for LLM Evaluation: https://www.confident-ai.com/blog/g-eval-the-definitive-guide
- QAG Score: https://arxiv.org/abs/2210.04320
- SelfCheckGPT: https://arxiv.org/abs/2303.08896
- DAG (deep acyclic graph): https://deepeval.com/docs/metrics-dag
- Prometheus2 Model: https://arxiv.org/abs/2405.01535
- DeepEval: https://deepeval.com/docs/getting-started
- Vertex AI – Metric Prompt Templates for Model-Based Evaluation: https://cloud.google.com/vertex-ai/generative-ai/docs/models/metrics-templates
- Little Red Riding Hood: https://ja.wikipedia.org/wiki/%E8%B5%A4%E3%81%9A%E3%81%8D%E3%82%93