Using DeepEval for Large Language Model (LLM) Evaluation in Python
DeepEval is an open-source Python framework that lets you evaluate outputs from large language models (LLMs) and retrieval augmented generation (RAG) applications using 14+ built-in metrics like hallucination, toxicity, and bias detection. Here’s how to get started:
- Install DeepEval using pip install deepeval
- Configure the API key using export OPENAI_API_KEY="your-key"
- Create test cases using the LLMTestCase() class
- Define metrics and run evaluations using the assert_test() function
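Putting these steps together, here is a minimal sketch of a DeepEval check. The query, output, and ground-truth strings are placeholders, and an OpenAI API key is assumed to be configured:

# Minimal sketch of a DeepEval check; the strings below are placeholders
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import HallucinationMetric

test_case = LLMTestCase(
    input="What is the minimum age to use the service?",  # user query
    actual_output="You must be 16 years or older.",  # LLM response
    context=["Users must be 16 years or older to use the Services."]  # ground truth
)

# Passes or fails based on the hallucination score returned by the LLM judge
assert_test(test_case, [HallucinationMetric(threshold=0.7)])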
This guide covers DeepEval installation, setup, metrics implementation, and practical testing examples to help you build reliable LLM applications.
What is LLM evaluation?
LLM evaluation measures how well language models perform on specific tasks like text generation, question answering, and reasoning. This process ensures models produce accurate, safe, and reliable outputs.
We can evaluate LLMs using automated tests and benchmarks like GLUE, SuperGLUE, HellaSwag, TruthfulQA, etc. We can also use humans or LLMs like ChatGPT and Gemini as judges to evaluate LLMs for relevancy, faithfulness, hallucination, etc. Frameworks like DeepEval, Evidently AI, MLflow LLM Evaluate, and TruLens help us evaluate LLMs using different metrics and benchmarks.
Let’s discuss how to evaluate LLMs using DeepEval, starting with DeepEval basics.
What is DeepEval?
DeepEval is an open-source LLM evaluation framework for testing LLMs and RAG applications. It provides different metrics and tools to test LLMs for accuracy, relevance, coherence, and ethical alignment. DeepEval also allows us to define custom metrics to evaluate LLMs using ChatGPT and Gemini as LLM judges.
To evaluate LLMs using DeepEval, we can create unit tests for LLM outputs and execute the tests manually. We can also automate DeepEval tests by integrating them into CI/CD pipelines to evaluate model responses continuously. Let’s discuss how to install and set up DeepEval.
Install and set up DeepEval
You can install DeepEval on your machine using pip by executing the following command in the command-line terminal:
pip install deepeval
DeepEval uses an LLM as a judge to evaluate LLM responses. Hence, we need to configure an LLM to execute DeepEval test cases. By default, DeepEval uses OpenAI GPT models to evaluate LLMs, so you need an OpenAI API key to evaluate LLMs using DeepEval. If you want to use Gemini instead, you can create a Gemini API key.
To use Gemini with DeepEval, you can set the model name and Gemini API key by executing the following command in the command line terminal:
deepeval set-gemini --model-name="gemini-model-name" --google-api-key="Your_Gemini_API_Key"
By default, DeepEval uses gpt-4.1 in OpenAI API calls, which can be expensive. You can select a smaller or cost-efficient model to reduce the per-token costs while evaluating LLMs using DeepEval. For example, you can set gpt-3.5-turbo as the default model to run tests in DeepEval using the following command:
deepeval set-openai --model="gpt-3.5-turbo"
Finally, you must configure the API keys in the environment variables by executing one of the following commands:
# For using Gemini
export GOOGLE_API_KEY='Your_Gemini_API_KEY'

# For using ChatGPT
export OPENAI_API_KEY='Your_OPENAI_API_KEY'
After initializing the environment variables, we can start testing LLMs using DeepEval. DeepEval also provides a web-based Confident AI interface that allows us to view and analyze past LLM evaluation tests. Let’s discuss enabling the Confident AI web interface before starting with the LLM evaluation.
Set up Confident web UI
To set up the Confident web UI, execute the following command in the command line terminal:
deepeval login
After executing this command, we get an input prompt asking for the Confident AI API key:
At the same time, we are also redirected to the Confident AI signup page, which looks as follows:
Once we fill in all the details and sign up, we get the Confident AI API key in a pop-up on the homepage. The Confident AI homepage looks as follows:
You must copy the Confident AI API key and save it somewhere for future use. You can also copy the API key from project settings on the sidebar menu in the UI.
Once we enter the API key in the input prompt in the command-line terminal, we get a message confirming successful login to Confident AI.
We can use the Confident web UI to analyze the LLM evaluation test results, which we will see in the following sections. Let’s first discuss the different steps to evaluate LLMs using DeepEval.
Step-by-step process to test LLMs in DeepEval
To evaluate an LLM using DeepEval, we can use the following six-step process:
- Step 1: Initialize environment variables
- Step 2: Define a function to get the LLM output
- Step 3: Define a test case
- Step 4: Define a metric
- Step 5: Execute the test case
- Step 6: Analyze test results
Let’s discuss each step individually.
Step 1: Initialize environment variables
We need an OpenAI or Gemini API key to execute DeepEval tests. If you want to use ChatGPT models, you can set the OPENAI_API_KEY environment variable as follows:
import osos.environ["OPENAI_API_KEY"]="Your_OpenAI_API_Key"
You can set the GOOGLE_API_KEY environment variable for using Gemini models as follows:
import osos.environ["GOOGLE_API_KEY"]="Your_Gemini_API_Key"
After setting the environment variables, we will define a function to get the outputs from the LLM we want to test.
Step 2: Define a function to get the LLM output
We must evaluate an LLM for a given query and the corresponding ground truth. Hence, we will define a function that takes a query and returns the LLM output we will use to execute DeepEval tests.
For this article, we will evaluate a codecademy-assistant LLM hosted in Ollama, which we created by fine-tuning a qwen3 model on the Codecademy terms of service. We will define the get_model_output() function to get answers for a query using the fine-tuned model, as shown in the following code:
from ollama import Client

def get_model_output(query):
    client = Client()

    # Get model response
    output = client.chat(
        model="codecademy-assistant",
        messages=[{"role": "user", "content": query}],
        think=False
    )

    # Retrieve model response
    response = output["message"]["content"]
    return response
The get_model_output() function returns the LLM output for any query. We will evaluate the LLM outputs against the actual terms and conditions in different test cases to see how good the model is as a virtual assistant. Let’s see how to define a DeepEval test case.
Step 3: Define a test case
We use the LLMTestCase class to define an LLM evaluation test case. The LLMTestCase() constructor takes the query and the LLM output as inputs to its input and actual_output parameters. It also takes the ground truth as a list of statements through its retrieval_context or context parameter. You can define a DeepEval test case as follows:
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="user query",
    actual_output="LLM output",
    context=["Ground truth statement"]
)
After defining the test case, we will define the metric for evaluating the LLM output.
Step 4: Define a metric
DeepEval provides the deepeval.metrics module containing different functions to define LLM evaluation metrics. For example, we can define the HallucinationMetric to test the LLM output for hallucination as follows:
from deepeval.metrics import HallucinationMetric

hallucination_metric = HallucinationMetric(threshold=0.7)
The threshold parameter decides whether to pass or fail a test case based on the score given by ChatGPT or Gemini for a given ground truth and LLM output pair.
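If you want to inspect the raw score and the judge’s reasoning rather than only a pass/fail result, you can also run a metric on its own. The following is a minimal sketch, assuming the test_case from Step 3 and the metric’s measure() interface:

# Sketch: run the metric directly to inspect its score and reasoning
from deepeval.metrics import HallucinationMetric

hallucination_metric = HallucinationMetric(threshold=0.7)
hallucination_metric.measure(test_case)  # test_case defined in Step 3

print(hallucination_metric.score)   # numeric score assigned by the LLM judge
print(hallucination_metric.reason)  # the judge's explanation for the score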
Step 5: Execute the test case
We use the assert_test() function to execute DeepEval test cases. The assert_test() function takes a test case as its first input argument and a list of metrics as its second input argument. You can execute the hallucination test using the assert_test() statement as follows:
from deepeval import assert_test

assert_test(test_case, [hallucination_metric])
When executed, the assert_test() function runs the LLM evaluation tests and shows the results on the standard output. To run the test, we must save the code in a Python file with the filename starting with test_. After that, we can execute the file using the deepeval test run command as follows:
deepeval test run test_filename.py
If you don’t name the Python file with the test_ prefix, DeepEval throws a ValueError exception with the message “ValueError: Test will not run. Please ensure the file starts with test_ prefix.”, as shown in the following image:
After running the DeepEval test, we can analyze the results using the Confident web UI.
Step 6: Analyze test results
To analyze a DeepEval test’s results, we can execute the deepeval view command in the command-line terminal and enter the Confident AI API key. The deepeval view command redirects us to the Confident web UI, where we can analyze the test results.
Now that we know all the steps to execute a DeepEval test case for evaluating LLMs, let’s discuss some metrics we can use to evaluate LLMs using DeepEval.
Metrics to evaluate LLMs using DeepEval
DeepEval provides different built-in metrics, such as answer relevancy, bias, toxicity, and PII leakage, to evaluate LLMs. It also allows us to create custom metrics for LLM evaluation. Let’s discuss the different LLM evaluation metrics in DeepEval and how to define the metrics in Python code.
Answer relevancy
The answer relevancy metric uses LLM-as-a-judge to measure the relevance of an LLM’s output for a given query. We can define the answer relevancy metric using the AnswerRelevancyMetric() function as follows:
from deepeval.metrics import AnswerRelevancyMetric

metric = AnswerRelevancyMetric(threshold=0.7)
When testing LLMs for the answer relevancy metric, the LLMTestCase object must have the query and LLM output assigned to the input and actual_output parameters.
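For example, a minimal answer relevancy test might look like the following sketch (the strings are placeholders, and an OpenAI API key is assumed to be set):

# Sketch: answer relevancy test case; input and actual_output are placeholders
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

test_case = LLMTestCase(
    input="How do I reset my Codecademy password?",
    actual_output="Go to account settings and click 'Reset password' to receive a reset email."
)

# The LLM judge scores how relevant the output is to the input query
assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])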
Bias
The bias metric uses LLM-as-a-judge to check if the LLM output contains gender, racial, or political bias. We can define the bias metric using the BiasMetric() function as follows:
from deepeval.metrics import BiasMetric

metric = BiasMetric(threshold=0.5)
Like answer relevancy, the LLMTestCase object must have the query and LLM output assigned to the input and actual_output parameters when testing LLMs for bias.
Toxicity
The toxicity metric measures how toxic the LLM outputs are. We can define the toxicity metric using the ToxicityMetric() function as follows:
from deepeval.metrics import ToxicityMetric

metric = ToxicityMetric(threshold=0.5)
PII leakage
The PII leakage metric determines whether the LLM output contains personally identifiable information (PII) or sensitive data that should be protected. We can define the PII leakage metric using the PIILeakageMetric() function.
from deepeval.metrics import PIILeakageMetric

metric = PIILeakageMetric(threshold=0.5)
Role violation
The role violation metric checks if the LLM output violates the assigned role or character. We can define the role violation metric using the RoleViolationMetric() function as follows:
from deepeval.metrics import RoleViolationMetric

metric = RoleViolationMetric(role="Codecademy assistant", threshold=0.5)
Hallucination
The hallucination metric checks if the LLM output is factually correct compared to a ground truth statement. We can define the hallucination metric using the HallucinationMetric() function as follows:
from deepeval.metrics import HallucinationMetric

metric = HallucinationMetric(threshold=0.5)
While testing an LLM for hallucination, the ground truth should be passed to the LLMTestCase object using the context parameter.
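For example, a hallucination test case passes the ground truth as a list of statements through the context parameter, as in this sketch (placeholder strings, OpenAI API key assumed):

# Sketch: hallucination test case with the ground truth passed via context
from deepeval.test_case import LLMTestCase
from deepeval.metrics import HallucinationMetric

test_case = LLMTestCase(
    input="What is the refund window?",
    actual_output="Refunds are available within 30 days of purchase.",
    context=["Refunds are available within 30 days of purchase."]  # list of ground-truth statements
)

metric = HallucinationMetric(threshold=0.5)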
Custom metrics in DeepEval
DeepEval allows us to create custom metrics using the GEval() function. To do this, we can define the evaluation criteria, metric name, evaluation parameters, and threshold using the name, criteria, evaluation_params, and threshold parameters as follows:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Define custom metric using GEval
correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct with respect to the expected output.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    threshold=0.5
)
For every evaluation parameter passed to the GEval() function, we must pass the corresponding input to the LLMTestCase object. For example, we have defined the ACTUAL_OUTPUT and EXPECTED_OUTPUT parameters in the Correctness metric. Hence, the LLMTestCase object must have actual_output and expected_output parameters for the correctness_metric to execute successfully. Like the correctness_metric, you can create any custom metric by defining the evaluation criteria and parameters.
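As a quick sketch, a test case that matches the Correctness metric above would provide both fields (the strings here are placeholders):

# Sketch: test case providing the fields required by the Correctness metric
from deepeval.test_case import LLMTestCase

correctness_test_case = LLMTestCase(
    input="Can I share my Codecademy account?",
    actual_output="Accounts are personal and should not be shared.",  # ACTUAL_OUTPUT
    expected_output="Each account is for a single user and must not be shared."  # EXPECTED_OUTPUT
)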
Now that we know the stepwise process to test LLMs and the different LLM evaluation metrics, let’s test the codecademy-assistant LLM using DeepEval and the different metrics.
Evaluating LLMs using DeepEval
We will use DeepEval to run the following tests on the codecademy-assistant LLM.
- A unit test with a single predefined metric.
- A unit test with a custom metric.
- A unit test with multiple metrics.
- Multiple unit tests in a single Python file.
Let’s discuss how to run each type of LLM evaluation test individually.
Run single unit tests
To run a DeepEval unit test, we must define a function containing a test case and a metric. Then, we use the assert_test() function to execute the test case. For example, we can write a DeepEval test to test an LLM output for hallucination as follows:
from ollama import Client
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import HallucinationMetric
import os

os.environ["OPENAI_API_KEY"] = "Your_OPENAI_API_KEY"

# Define a function to get the model output for a given query
def get_model_output(query):
    client = Client()

    # Get model response
    output = client.chat(
        model="codecademy-assistant",
        messages=[{"role": "user", "content": query}],
        think=False
    )

    # Retrieve model response
    response = output["message"]["content"]
    return response

def test_hallucination():
    # Define query, ground truth, and model output
    query = "What is the minimum age required to be able to use Codecademy?"
    ground_truth = "You must be 16 years or older to use the Services. If you are less than 18 years of age and would like to register to use any part of the Services, please ask your parent or legal guardian to review and agree to these terms before you use any part of the Services, or ask them to complete the purchase or registration on your behalf."
    model_output = get_model_output(query)
    print("The model output is:\n", model_output)

    # Define deepeval test case
    test_case = LLMTestCase(
        input=query,
        actual_output=model_output,
        context=[ground_truth]
    )

    # Define metric
    metric = HallucinationMetric(threshold=0.7)

    # Execute test
    try:
        assert_test(test_case, [metric])
    except AssertionError:
        print("Test failed.")

if __name__ == "__main__":
    test_hallucination()
In this code, we defined the test_hallucination() function to implement the hallucination test case. When we execute this Python file using the deepeval test run command, the unit test takes its name from the function, i.e., test_hallucination, and the test results are shown in the standard output:
To implement DeepEval tests, we can also define custom metrics instead of built-in metrics. Let’s discuss how to do so.
Run unit tests with custom metrics
We can use the GEval() function to implement custom metrics and run DeepEval tests as shown in the following code:
from ollama import Client
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval
import os

os.environ["OPENAI_API_KEY"] = "Your_OPENAI_API_KEY"

# Define a function to get the model output for a given query
def get_model_output(query):
    client = Client()

    # Get model response
    output = client.chat(
        model="codecademy-assistant",
        messages=[{"role": "user", "content": query}],
        think=False
    )

    # Retrieve model response
    response = output["message"]["content"]
    return response

def test_correctness():
    # Define query, ground truth, and model output
    query = "Can we learn AI from Codecademy?"
    ground_truth = "You can learn about AI (Artificial Intelligence) through Codecademy."
    model_output = get_model_output(query)
    print("The model output is:\n", model_output)

    # Define deepeval test case
    test_case = LLMTestCase(
        input=query,
        actual_output=model_output,
        expected_output=ground_truth
    )

    # Define custom metric using GEval
    correctness_metric = GEval(
        name="Correctness",
        criteria="Determine whether the actual output is factually correct with respect to the expected output.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
        threshold=0.5
    )

    # Execute test
    try:
        assert_test(test_case, [correctness_metric])
    except AssertionError:
        print("Test failed.")

if __name__ == "__main__":
    test_correctness()
We implemented the test_correctness() function in this code to test the LLM output for correctness using the GEval() function. The output of the test_correctness DeepEval test is as follows:
We can also implement multiple unit tests in a single Python file to evaluate LLMs instead of a single DeepEval test.
Run multiple unit tests
To run multiple unit tests in DeepEval, we need to create a separate function for each test case. For example, we can test the LLM output for correctness and hallucination by implementing two test cases using the test_correctness() and test_hallucination() functions as shown in the following code:
from ollama import Client
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval, HallucinationMetric
import os

os.environ["OPENAI_API_KEY"] = "Your_OPENAI_API_KEY"

# Define a function to get the model output for a given query
def get_model_output(query):
    client = Client()

    # Get model response
    output = client.chat(
        model="codecademy-assistant",
        messages=[{"role": "user", "content": query}],
        think=False
    )

    # Retrieve model response
    response = output["message"]["content"]
    return response

def test_correctness():
    # Define query, ground truth, and model output
    query = "Can we learn AI from Codecademy?"
    ground_truth = "You can learn about AI (Artificial Intelligence) through Codecademy."
    model_output = get_model_output(query)

    # Define deepeval test case
    correctness_test_case = LLMTestCase(
        input=query,
        actual_output=model_output,
        expected_output=ground_truth
    )

    # Define correctness metric using GEval
    correctness_metric = GEval(
        name="Correctness",
        criteria="Determine whether the actual output is factually correct with respect to the expected output.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
        threshold=0.5
    )

    # Execute test
    try:
        assert_test(correctness_test_case, [correctness_metric])
    except AssertionError:
        print("Correctness test failed.")

def test_hallucination():
    # Define query, ground truth, and model output
    query = "What is the minimum age required to be able to use Codecademy?"
    ground_truth = "You must be 16 years or older to use the Services. If you are less than 18 years of age and would like to register to use any part of the Services, please ask your parent or legal guardian to review and agree to these terms before you use any part of the Services, or ask them to complete the purchase or registration on your behalf."
    model_output = get_model_output(query)

    # Define hallucination test case
    hallucination_test_case = LLMTestCase(
        input=query,
        actual_output=model_output,
        context=[ground_truth]
    )

    # Define metric
    hallucination_metric = HallucinationMetric(threshold=0.6)

    # Execute test
    try:
        assert_test(hallucination_test_case, [hallucination_metric])
    except AssertionError:
        print("Hallucination test failed.")

if __name__ == "__main__":
    test_correctness()
    test_hallucination()
After executing this code, we get an output containing each test case with its metric and evaluation status as shown in the following image:
Multiple test cases are good for evaluating different outputs. However, you can also evaluate a single output against different metrics using a single test case.
Run a single unit test with multiple metrics
To run a single unit test with multiple metrics, you can define all the metrics and pass them to the assert_test() function in a list. After this, the assert_test() function evaluates the test case for each metric.
from ollama import Client
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import HallucinationMetric, GEval
import os

os.environ["OPENAI_API_KEY"] = "Your_OPENAI_API_KEY"

def get_model_output(query):
    client = Client()

    # Get model response
    output = client.chat(
        model="codecademy-assistant",
        messages=[{"role": "user", "content": query}],
        think=False
    )

    # Retrieve model response
    response = output["message"]["content"]
    return response

def test_two_metrics():
    # Define query, ground truth, and model output
    query = "What is the minimum age required to be able to use Codecademy?"
    ground_truth = "You must be 16 years or older to use the Services. If you are less than 18 years of age and would like to register to use any part of the Services, please ask your parent or legal guardian to review and agree to these terms before you use any part of the Services, or ask them to complete the purchase or registration on your behalf."
    model_output = get_model_output(query)

    # Define deepeval test case
    test_case = LLMTestCase(
        input=query,
        actual_output=model_output,
        expected_output=ground_truth,  # for correctness metric
        context=[ground_truth]         # for hallucination metric
    )

    # Define custom metric using GEval
    correctness_metric = GEval(
        name="Correctness",
        criteria="Determine whether the actual output is factually correct with respect to the expected output.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
        threshold=0.5
    )

    # Define hallucination metric
    hallucination_metric = HallucinationMetric(threshold=0.7)

    # Execute test
    try:
        assert_test(test_case, [hallucination_metric, correctness_metric])
    except AssertionError:
        print("Test failed.")

if __name__ == "__main__":
    test_two_metrics()
In this code, we defined two metrics to test a single LLM output for hallucination and correctness. DeepEval shows each metric, score, and evaluation status in the output.
Now that we know how to evaluate LLMs using different approaches, let’s discuss how to analyze a test case in the Confident web UI after executing it on our machine.
Using Confident web UI to analyze LLM evaluation tests
To analyze a test case in the web UI, we first need to execute the deepeval view command after getting the test case results in the command-line terminal. The deepeval view command gives us an input prompt for entering the Confident AI API key.
Once we enter the Confident API key, we are redirected to the Confident UI containing the test results.
You can click on each test case and metric to see the input, context, LLM output, score, evaluation status, etc., for each test case and metric included in the DeepEval test, as shown in the following image:
Conclusion
As LLMs become core components of modern AI systems, the importance of rigorous and systematic evaluation cannot be overstated. DeepEval helps us move beyond manual testing and implement structured, reproducible, and meaningful tests for LLM evaluation. Whether we’re fine-tuning an LLM for a niche domain or deploying it at scale, DeepEval provides the flexibility and tools to ensure the models are accurate, safe, fair, and aligned with our expectations. In this article, we discussed DeepEval basics, installation, and setup. We also discussed implementing built-in and custom metrics to test a fine-tuned LLM using different test cases.
To learn more about security and testing in generative AI, you can take this navigating AI ethical challenges and risks course that discusses challenges, risks, data privacy, algorithmic bias, and decision-making frameworks for responsible use of LLMs. You might also like the IT automation with generative AI course that discusses AI fundamentals, SRE practices, ethical considerations, server monitoring, and automation system integration.
Frequently asked questions
1. What is LLM as a judge?
LLM as a judge is an evaluation methodology in which we use a large language model like Gemini or ChatGPT to evaluate other LLMs for faithfulness, bias, hallucination, toxicity, etc.
2. What is the difference between DeepEval and GEval?
DeepEval is a framework for evaluating LLMs, whereas GEval is a metric within DeepEval that uses LLM-as-a-judge to evaluate LLM outputs against custom criteria that we define.
3. What is model bias in LLM?
Model bias in LLMs is the tendency of AI models to produce outputs that reflect prejudices, stereotypes, or unfair generalizations in their training data.
4. What is UniEval?
UniEval is an LLM evaluation framework for evaluating unified multimodal models without extra models, images, or annotations.
5. What is contextual recall in Deepeval?
Contextual recall in DeepEval is an LLM evaluation metric to evaluate the quality of a RAG pipeline’s retriever by measuring how well the retrieved document aligns with the expected output for any given query.
6. Is DeepEval free or paid?
DeepEval is completely free and open-source. You only pay for LLM API costs (like OpenAI or Gemini) used as judges for evaluation.