Using DeepEval for Large Language Model (LLM) Evaluation in Python
DeepEval is an open-source Python framework that lets you evaluate outputs from large language models (LLMs) and retrieval augmented generation (RAG) applications using 14+ built-in metrics like hallucination, toxicity, and bias detection. Here’s how to get started:
- Install DeepEval using pip install deepeval
- Configure the API key using export OPENAI_API_KEY="your-key"
- Create test cases using the LLMTestCase() class
- Define metrics and run evaluations using the assert_test() function
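Putting these steps together, here is a minimal sketch of a DeepEval check. The query, output, and ground-truth strings are placeholders, and an OpenAI API key is assumed to be configured:

# Minimal sketch of a DeepEval check; the strings below are placeholders
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import HallucinationMetric

test_case = LLMTestCase(
    input="What is the minimum age to use the service?",  # user query
    actual_output="You must be 16 years or older.",  # LLM response
    context=["Users must be 16 years or older to use the Services."]  # ground truth
)

# Passes or fails based on the hallucination score returned by the LLM judge
assert_test(test_case, [HallucinationMetric(threshold=0.7)])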
This guide covers DeepEval installation, setup, metrics implementation, and practical testing examples to help you build reliable LLM applications.
What is LLM evaluation?
LLM evaluation measures how well language models perform on specific tasks like text generation, question answering, and reasoning. This process ensures models produce accurate, safe, and reliable outputs.
We can evaluate LLMs using automated tests and benchmarks like GLUE, SuperGLUE, HellaSwag, TruthfulQA, etc. We can also use humans or LLMs like ChatGPT and Gemini as judges to evaluate LLMs for relevancy, faithfulness, hallucination, etc. Frameworks like DeepEval, Evidently AI, MLflow LLM Evaluate, and TruLens help us evaluate LLMs using different metrics and benchmarks.
Let’s discuss how to evaluate LLMs using DeepEval, starting with DeepEval basics.
What is DeepEval?
DeepEval is an open-source LLM evaluation framework for testing LLMs and RAG applications. It provides different metrics and tools to test LLMs for accuracy, relevance, coherence, and ethical alignment. DeepEval also allows us to define custom metrics to evaluate LLMs using ChatGPT and Gemini as LLM judges.
To evaluate LLMs using DeepEval, we can create unit tests for LLM outputs and execute the tests manually. We can also automate DeepEval tests by integrating them into CI/CD pipelines to evaluate model responses continuously. Let’s discuss how to install and set up DeepEval.
Install and set up DeepEval
You can install DeepEval on your machine using pip by executing the following command in the command-line terminal:
pip install deepeval
DeepEval uses an LLM as a judge to evaluate LLM responses. Hence, we need to configure an LLM to execute DeepEval test cases. By default, DeepEval uses OpenAI GPT models to evaluate LLMs, so you need an OpenAI API key to evaluate LLMs using DeepEval. If you want to use Gemini instead, you can create a Gemini API key.
To use Gemini with DeepEval, you can set the model name and Gemini API key by executing the following command in the command line terminal:
deepeval set-gemini --model-name="gemini-model-name" --google-api-key="Your_Gemini_API_Key"
By default, DeepEval uses gpt-4.1 in OpenAI API calls, which can be expensive. You can select a smaller or cost-efficient model to reduce the per-token costs while evaluating LLMs using DeepEval. For example, you can set gpt-3.5-turbo as the default model to run tests in DeepEval using the following command:
deepeval set-openai --model="gpt-3.5-turbo"
Finally, you must configure the API keys in the environment variables by executing one of the following commands:
# For using Gemini
export GOOGLE_API_KEY='Your_Gemini_API_KEY'

# For using ChatGPT
export OPENAI_API_KEY='Your_OPENAI_API_KEY'
After initializing the environment variables, we can start testing LLMs using DeepEval. DeepEval also provides a web-based Confident AI interface that allows us to view and analyze past LLM evaluation tests. Let’s discuss enabling the Confident AI web interface before starting with the LLM evaluation.
Set up Confident web UI
To set up the Confident web UI, execute the following command in the command line terminal:
deepeval login
After executing this command, we get an input prompt asking for the Confident AI API key:
At the same time, we are also redirected to the Confident AI signup page, which looks as follows:
Once we fill in all the details and sign up, we get the Confident AI API key in a pop-up on the homepage. The Confident AI homepage looks as follows:
You must copy the Confident AI API key and save it somewhere for future use. You can also copy the API key from project settings on the sidebar menu in the UI.
Once we enter the API key in the input prompt in the command-line terminal, we get a message confirming successful login to Confident AI.
We can use the Confident web UI to analyze the LLM evaluation test results, which we will see in the following sections. Let’s first discuss the different steps to evaluate LLMs using DeepEval.
Step-by-step process to test LLMs in DeepEval
To evaluate an LLM using DeepEval, we can use the following six-step process:
- Step 1: Initialize environment variables
- Step 2: Define a function to get the LLM output
- Step 3: Define a test case
- Step 4: Define a metric
- Step 5: Execute the test case
- Step 6: Analyze test results
Let’s discuss each step individually.
Step 1: Initialize environment variables
We need an OpenAI or Gemini API key to execute DeepEval tests. If you want to use ChatGPT models, you can set the OPENAI_API_KEY environment variable as follows:
import osos.environ["OPENAI_API_KEY"]="Your_OpenAI_API_Key"
You can set the GOOGLE_API_KEY environment variable for using Gemini models as follows:
import osos.environ["GOOGLE_API_KEY"]="Your_Gemini_API_Key"
After setting the environment variables, we will define a function to get the outputs from the LLM we want to test.
Step 2: Define a function to get the LLM output
We must evaluate an LLM for a given query and the corresponding ground truth. Hence, we will define a function that takes a query and returns the LLM output we will use to execute DeepEval tests.
For this article, we will evaluate a codecademy-assistant LLM hosted in Ollama, which we created by fine-tuning a qwen3 model on the Codecademy terms of service. We will define the get_model_output() function to get answers for a query using the fine-tuned model, as shown in the following code:
from ollama import Client

def get_model_output(query):
    client = Client()

    # Get model response
    output = client.chat(
        model="codecademy-assistant",
        messages=[{"role": "user", "content": query}],
        think=False
    )

    # Retrieve model response
    response = output["message"]["content"]
    return response
The get_model_output() function returns the LLM output for any query. We will evaluate the LLM outputs against the actual terms and conditions in different test cases to see how good the model is as a virtual assistant. Let’s see how to define a DeepEval test case.
Step 3: Define a test case
We use the LLMTestCase class to define an LLM evaluation test case. The LLMTestCase() constructor takes the query and the LLM output as inputs to its input and actual_output parameters. It also takes the ground truth as a list of statements through its retrieval_context or context parameter. You can define a DeepEval test case as follows:
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="user query",
    actual_output="LLM output",
    context=["Ground truth statement"]
)
After defining the test case, we will define the metric for evaluating the LLM output.
Step 4: Define a metric
DeepEval provides the deepeval.metrics module containing different functions to define LLM evaluation metrics. For example, we can define the HallucinationMetric to test the LLM output for hallucination as follows:
from deepeval.metrics import HallucinationMetric

hallucination_metric = HallucinationMetric(threshold=0.7)
The threshold parameter decides whether to pass or fail a test case based on the score given by ChatGPT or Gemini for a given ground truth and LLM output pair.
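If you want to inspect the raw score and the judge’s reasoning rather than only a pass/fail result, you can also run a metric on its own. The following is a minimal sketch, assuming the test_case from Step 3 and the metric’s measure() interface:

# Sketch: run the metric directly to inspect its score and reasoning
from deepeval.metrics import HallucinationMetric

hallucination_metric = HallucinationMetric(threshold=0.7)
hallucination_metric.measure(test_case)  # test_case defined in Step 3

print(hallucination_metric.score)   # numeric score assigned by the LLM judge
print(hallucination_metric.reason)  # the judge's explanation for the score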
Step 5: Execute the test case
We use the assert_test() function to execute DeepEval test cases. The assert_test() function takes a test case as its first input argument and a list of metrics as its second input argument. You can execute the hallucination test using the assert_test() statement as follows:
from deepeval import assert_test

assert_test(test_case, [hallucination_metric])
When executed, the assert_test() function runs the LLM evaluation tests and shows the results on the standard output. To run the test, we must save the code in a Python file with the filename starting with test_. After that, we can execute the file using the deepeval test run command as follows:
deepeval test run test_filename.py
If you don’t name the Python file with the test_ prefix, DeepEval throws a ValueError exception with the message “ValueError: Test will not run. Please ensure the file starts with test_ prefix.”, as shown in the following image:
After running the DeepEval test, we can analyze the results using the Confident web UI.
Step 6: Analyze test results
To analyze a DeepEval test’s results, we can execute the deepeval view command in the command-line terminal and enter the Confident AI API key. The deepeval view command redirects us to the Confident web UI, where we can analyze the test results.
Now that we know all the steps to execute a DeepEval test case for evaluating LLMs, let’s discuss some metrics we can use to evaluate LLMs using DeepEval.
Metrics to evaluate LLMs using DeepEval
DeepEval provides different built-in metrics, such as answer relevancy, bias, toxicity, and PII leakage, to evaluate LLMs. It also allows us to create custom metrics for LLM evaluation. Let’s discuss the different LLM evaluation metrics in DeepEval and how to define the metrics in Python code.
Answer relevancy
The answer relevancy metric uses LLM-as-a-judge to measure the relevance of an LLM’s output for a given query. We can define the answer relevancy metric using the AnswerRelevancyMetric() function as follows:
from deepeval.metrics import AnswerRelevancyMetric

metric = AnswerRelevancyMetric(threshold=0.7)
When testing LLMs for the answer relevancy metric, the LLMTestCase object must have the query and LLM output assigned to the input and actual_output parameters.
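For example, a minimal answer relevancy test might look like the following sketch (the strings are placeholders, and an OpenAI API key is assumed to be set):

# Sketch: answer relevancy test case; input and actual_output are placeholders
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

test_case = LLMTestCase(
    input="How do I reset my Codecademy password?",
    actual_output="Go to account settings and click 'Reset password' to receive a reset email."
)

# The LLM judge scores how relevant the output is to the input query
assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])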
Bias
The bias metric uses LLM-as-a-judge to check if the LLM output contains gender, racial, or political bias. We can define the bias metric using the BiasMetric() function as follows:
from deepeval.metrics import BiasMetric

metric = BiasMetric(threshold=0.5)
Like answer relevancy, the LLMTestCase object must have the query and LLM output assigned to the input and actual_output parameters when testing LLMs for bias.
Toxicity
The toxicity metric measures how toxic the LLM outputs are. We can define the toxicity metric using the ToxicityMetric() function as follows:
from deepeval.metrics import ToxicityMetric

metric = ToxicityMetric(threshold=0.5)
PII leakage
The PII leakage metric determines whether the LLM output contains personally identifiable information (PII) or sensitive data that should be protected. We can define the PII leakage metric using the PIILeakageMetric() function.
from deepeval.metrics import PIILeakageMetric

metric = PIILeakageMetric(threshold=0.5)
Role violation
The role violation metric checks if the LLM output violates the assigned role or character. We can define the role violation metric using the RoleViolationMetric() function as follows:
from deepeval.metrics import RoleViolationMetric

metric = RoleViolationMetric(role="Codecademy assistant", threshold=0.5)
Hallucination
The hallucination metric checks if the LLM output is factually correct compared to a ground truth statement. We can define the hallucination metric using the HallucinationMetric() function as follows:
from deepeval.metrics import HallucinationMetric

metric = HallucinationMetric(threshold=0.5)
While testing an LLM for hallucination, the ground truth should be passed to the LLMTestCase object using the context parameter.
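For example, a hallucination test case passes the ground truth as a list of statements through the context parameter, as in this sketch (placeholder strings, OpenAI API key assumed):

# Sketch: hallucination test case with the ground truth passed via context
from deepeval.test_case import LLMTestCase
from deepeval.metrics import HallucinationMetric

test_case = LLMTestCase(
    input="What is the refund window?",
    actual_output="Refunds are available within 30 days of purchase.",
    context=["Refunds are available within 30 days of purchase."]  # list of ground-truth statements
)

metric = HallucinationMetric(threshold=0.5)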
Custom metrics in DeepEval
DeepEval allows us to create custom metrics using the GEval() function. To do this, we can define the evaluation criteria, metric name, evaluation parameters, and threshold using the name, criteria, evaluation_params, and threshold parameters as follows:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Define custom metric using GEval
correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct with respect to the expected output.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    threshold=0.5
)
For every evaluation parameter passed to the GEval() function, we must pass the corresponding input to the LLMTestCase object. For example, we have defined the ACTUAL_OUTPUT and EXPECTED_OUTPUT parameters in the Correctness metric. Hence, the LLMTestCase object must have actual_output and expected_output parameters for the correctness_metric to execute successfully. Like the correctness_metric, you can create any custom metric by defining the evaluation criteria and parameters.
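As a quick sketch, a test case that matches the Correctness metric above would provide both fields (the strings here are placeholders):

# Sketch: test case providing the fields required by the Correctness metric
from deepeval.test_case import LLMTestCase

correctness_test_case = LLMTestCase(
    input="Can I share my Codecademy account?",
    actual_output="Accounts are personal and should not be shared.",  # ACTUAL_OUTPUT
    expected_output="Each account is for a single user and must not be shared."  # EXPECTED_OUTPUT
)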
Now that we know the stepwise process to test LLMs and the different LLM evaluation metrics, let’s test the codecademy-assistant LLM using DeepEval and the different metrics.
Evaluating LLMs using DeepEval
We will use DeepEval to run the following tests on the codecademy-assistant LLM.
- A unit test with a single predefined metric.
- A unit test with a custom metric.
- A unit test with multiple metrics.
- Multiple unit tests in a single Python file.
Let’s discuss how to run each type of LLM evaluation test individually.
Run single unit tests
To run a DeepEval unit test, we must define a function containing a test case and a metric. Then, we use the assert_test() function to execute the test case. For example, we can write a DeepEval test to test an LLM output for hallucination as follows:
from ollama import Client
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import HallucinationMetric
import os

os.environ["OPENAI_API_KEY"] = "Your_OPENAI_API_KEY"

# Define a function to get the model output for a given query
def get_model_output(query):
    client = Client()

    # Get model response
    output = client.chat(
        model="codecademy-assistant",
        messages=[{"role": "user", "content": query}],
        think=False
    )

    # Retrieve model response
    response = output["message"]["content"]
    return response

def test_hallucination():
    # Define query, ground truth, and model output
    query = "What is the minimum age required to be able to use Codecademy?"
    ground_truth = "You must be 16 years or older to use the Services. If you are less than 18 years of age and would like to register to use any part of the Services, please ask your parent or legal guardian to review and agree to these terms before you use any part of the Services, or ask them to complete the purchase or registration on your behalf."
    model_output = get_model_output(query)
    print("The model output is:\n", model_output)

    # Define deepeval test case
    test_case = LLMTestCase(
        input=query,
        actual_output=model_output,
        context=[ground_truth]
    )

    # Define metric
    metric = HallucinationMetric(threshold=0.7)

    # Execute test
    try:
        assert_test(test_case, [metric])
    except AssertionError:
        print("Test failed.")

if __name__ == "__main__":
    test_hallucination()
In this code, we defined the test_hallucination() function to implement the hallucination test case. When we execute this Python file using the deepeval test run command, the unit test takes its name from the function, i.e., test_hallucination, and the test results are shown in the standard output:
To implement DeepEval tests, we can also define custom metrics instead of built-in metrics. Let’s discuss how to do so.
Run unit tests with custom metrics
We can use the GEval() function to implement custom metrics and run DeepEval tests as shown in the following code:
from ollama import Client
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval
import os

os.environ["OPENAI_API_KEY"] = "Your_OPENAI_API_KEY"

# Define a function to get the model output for a given query
def get_model_output(query):
    client = Client()

    # Get model response
    output = client.chat(
        model="codecademy-assistant",
        messages=[{"role": "user", "content": query}],
        think=False
    )

    # Retrieve model response
    response = output["message"]["content"]
    return response

def test_correctness():
    # Define query, ground truth, and model output
    query = "Can we learn AI from Codecademy?"
    ground_truth = "You can learn about AI (Artificial Intelligence) through Codecademy."
    model_output = get_model_output(query)
    print("The model output is:\n", model_output)

    # Define deepeval test case
    test_case = LLMTestCase(
        input=query,
        actual_output=model_output,
        expected_output=ground_truth
    )

    # Define custom metric using GEval
    correctness_metric = GEval(
        name="Correctness",
        criteria="Determine whether the actual output is factually correct with respect to the expected output.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
        threshold=0.5
    )

    # Execute test
    try:
        assert_test(test_case, [correctness_metric])
    except AssertionError:
        print("Test failed.")

if __name__ == "__main__":
    test_correctness()
We implemented the test_correctness() function in this code to test the LLM output for correctness using the GEval() function. The output of the test_correctness DeepEval test is as follows:
We can also implement multiple unit tests in a single Python file to evaluate LLMs instead of a single DeepEval test.
Run multiple unit tests
To run multiple unit tests in DeepEval, we need to create a separate function for each test case. For example, we can test the LLM output for correctness and hallucination by implementing two test cases using the test_correctness() and test_hallucination() functions as shown in the following code:
from ollama import Client
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval, HallucinationMetric
import os

os.environ["OPENAI_API_KEY"] = "Your_OPENAI_API_KEY"

# Define a function to get the model output for a given query
def get_model_output(query):
    client = Client()

    # Get model response
    output = client.chat(
        model="codecademy-assistant",
        messages=[{"role": "user", "content": query}],
        think=False
    )

    # Retrieve model response
    response = output["message"]["content"]
    return response

def test_correctness():
    # Define query, ground truth, and model output
    query = "Can we learn AI from Codecademy?"
    ground_truth = "You can learn about AI (Artificial Intelligence) through Codecademy."
    model_output = get_model_output(query)

    # Define deepeval test case
    correctness_test_case = LLMTestCase(
        input=query,
        actual_output=model_output,
        expected_output=ground_truth
    )

    # Define correctness metric using GEval
    correctness_metric = GEval(
        name="Correctness",
        criteria="Determine whether the actual output is factually correct with respect to the expected output.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
        threshold=0.5
    )

    # Execute test
    try:
        assert_test(correctness_test_case, [correctness_metric])
    except AssertionError:
        print("Correctness test failed.")

def test_hallucination():
    # Define query, ground truth, and model output
    query = "What is the minimum age required to be able to use Codecademy?"
    ground_truth = "You must be 16 years or older to use the Services. If you are less than 18 years of age and would like to register to use any part of the Services, please ask your parent or legal guardian to review and agree to these terms before you use any part of the Services, or ask them to complete the purchase or registration on your behalf."
    model_output = get_model_output(query)

    # Define hallucination test case
    hallucination_test_case = LLMTestCase(
        input=query,
        actual_output=model_output,
        context=[ground_truth]
    )

    # Define metric
    hallucination_metric = HallucinationMetric(threshold=0.6)

    # Execute test
    try:
        assert_test(hallucination_test_case, [hallucination_metric])
    except AssertionError:
        print("Hallucination test failed.")

if __name__ == "__main__":
    test_correctness()
    test_hallucination()
After executing this code, we get an output containing each test case with its metric and evaluation status as shown in the following image:
Multiple test cases are good for evaluating different outputs. However, you can also evaluate a single output against different metrics using a single test case.
Run a single unit test with multiple metrics
To run a single unit test with multiple metrics, you can define all the metrics and pass them to the assert_test() function in a list. After this, the assert_test() function evaluates the test case for each metric.
from ollama import Client
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import HallucinationMetric, GEval
import os

os.environ["OPENAI_API_KEY"] = "Your_OPENAI_API_KEY"

def get_model_output(query):
    client = Client()

    # Get model response
    output = client.chat(
        model="codecademy-assistant",
        messages=[{"role": "user", "content": query}],
        think=False
    )

    # Retrieve model response
    response = output["message"]["content"]
    return response

def test_two_metrics():
    # Define query, ground truth, and model output
    query = "What is the minimum age required to be able to use Codecademy?"
    ground_truth = "You must be 16 years or older to use the Services. If you are less than 18 years of age and would like to register to use any part of the Services, please ask your parent or legal guardian to review and agree to these terms before you use any part of the Services, or ask them to complete the purchase or registration on your behalf."
    model_output = get_model_output(query)

    # Define deepeval test case
    test_case = LLMTestCase(
        input=query,
        actual_output=model_output,
        expected_output=ground_truth,  # for correctness metric
        context=[ground_truth]         # for hallucination metric
    )

    # Define custom metric using GEval
    correctness_metric = GEval(
        name="Correctness",
        criteria="Determine whether the actual output is factually correct with respect to the expected output.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
        threshold=0.5
    )

    # Define hallucination metric
    hallucination_metric = HallucinationMetric(threshold=0.7)

    # Execute test
    try:
        assert_test(test_case, [hallucination_metric, correctness_metric])
    except AssertionError:
        print("Test failed.")

if __name__ == "__main__":
    test_two_metrics()
In this code, we defined two metrics to test a single LLM output for hallucination and correctness. DeepEval shows each metric, score, and evaluation status in the output.
Now that we know how to evaluate LLMs using different approaches, let’s discuss how to analyze a test case in the Confident web UI after executing it on our machine.
Using Confident web UI to analyze LLM evaluation tests
To analyze a test case in the web UI, we first need to execute the deepeval view command after getting the test case results in the command-line terminal. The deepeval view command gives us an input prompt for entering the Confident AI API key.
Once we enter the Confident API key, we are redirected to the Confident UI containing the test results.
You can click on each test case and metric to see the input, context, LLM output, score, evaluation status, etc., for each test case and metric included in the DeepEval test, as shown in the following image:
Conclusion
As LLMs become core components of modern AI systems, the importance of rigorous and systematic evaluation cannot be overstated. DeepEval helps us move beyond manual testing and implement structured, reproducible, and meaningful tests for LLM evaluation. Whether we’re fine-tuning an LLM for a niche domain or deploying it at scale, DeepEval provides the flexibility and tools to ensure the models are accurate, safe, fair, and aligned with our expectations. In this article, we discussed DeepEval basics, installation, and setup. We also discussed implementing built-in and custom metrics to test a fine-tuned LLM using different test cases.
To learn more about security and testing in generative AI, you can take this navigating AI ethical challenges and risks course that discusses challenges, risks, data privacy, algorithmic bias, and decision-making frameworks for responsible use of LLMs. You might also like the IT automation with generative AI course that discusses AI fundamentals, SRE practices, ethical considerations, server monitoring, and automation system integration.
Frequently asked questions
1. What is LLM as a judge?
LLM as a judge is an evaluation methodology in which we use a large language model like Gemini or ChatGPT to evaluate other LLMs for faithfulness, bias, hallucination, toxicity, etc.
2. What is the difference between DeepEval and GEval?
DeepEval is a framework for evaluating LLMs, whereas GEval is a metric within DeepEval that uses LLM-as-a-judge to evaluate LLM outputs against custom criteria that we define.
3. What is model bias in LLM?
Model bias in LLMs is the tendency of AI models to produce outputs that reflect prejudices, stereotypes, or unfair generalizations in their training data.
4. What is UniEval?
UniEval is an LLM evaluation framework for evaluating unified multimodal models without extra models, images, or annotations.
5. What is contextual recall in Deepeval?
Contextual recall in DeepEval is an LLM evaluation metric to evaluate the quality of a RAG pipeline’s retriever by measuring how well the retrieved document aligns with the expected output for any given query.
6. Is DeepEval free or paid?
DeepEval is completely free and open-source. You only pay for LLM API costs (like OpenAI or Gemini) used as judges for evaluation.