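Starting April 29, 2025, the Gemini 1.5 Pro and Gemini 1.5 Flash models are not available in projects that have not previously used them, including new projects. For details, see "Model versions and lifecycle".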

Evaluate Gen AI agents

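After you build and evaluate your Gen AI model, you can use the model to build an agent such as a chatbot. The Gen AI evaluation service lets you measure your agent's ability to complete the tasks and goals for your use case.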

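Overview

You have the following options to evaluate your agent:

Final response evaluation: Evaluate the final output of the agent (whether or not the agent achieved its goal).
Trajectory evaluation: Evaluate the path (sequence of tool calls) the agent took to reach the final response.

With the Gen AI evaluation service, you can trigger an agent execution and get metrics for both trajectory evaluation and final response evaluation in a single Vertex AI SDK query.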

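Supported agents

The Gen AI evaluation service supports the following categories of agents:

Supported agents	Description
Agent built with the Agent Engine template	Agent Engine (LangChain on Vertex AI) is a Google Cloud platform where you can deploy and manage agents.
LangChain agents built with the Agent Engine customizable template	LangChain is an open source platform.
Custom agent function	A custom agent function is a flexible function that takes in a prompt for the agent and returns a response and a trajectory in a dictionary.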
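
To illustrate the custom agent function row in the table above, here is a minimal sketch. The tool `lookup_order_status`, the function name `custom_agent`, and the canned order ID are hypothetical. The dictionary keys `response` and `predicted_trajectory` are an assumption chosen to mirror the dataset column names used later on this page.

```python
from typing import Any


def lookup_order_status(order_id: str) -> str:
    """Hypothetical tool: returns a canned status string."""
    return f"Order {order_id} is out for delivery."


def custom_agent(prompt: str) -> dict[str, Any]:
    """Toy agent: makes one tool call and records it in the trajectory."""
    # A real agent would derive the tool and its input from the prompt.
    tool_input = {"order_id": "order_1"}
    answer = lookup_order_status(**tool_input)
    return {
        "response": answer,
        "predicted_trajectory": [
            {"tool_name": "lookup_order_status", "tool_input": tool_input}
        ],
    }


result = custom_agent("Where is my order?")
print(result["response"])
print(result["predicted_trajectory"])
```

Returning the trajectory next to the response is what makes both final response metrics and trajectory metrics computable from a single call.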

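Define metrics for agent evaluation

Define your metrics for final response evaluation or trajectory evaluation:

Final response evaluation

Final response evaluation follows the same process as model response evaluation. For more information, see "Define your evaluation metrics".

Trajectory evaluation

The following metrics help you evaluate the agent's ability to follow the expected trajectory: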

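Exact match

If the predicted trajectory is identical to the reference trajectory, with the same tool calls in the same order, the `trajectory_exact_match` metric returns a score of 1; otherwise, it returns 0.

Metric input parameters

Input parameter	Description
`predicted_trajectory`	The list of tool calls the agent used to reach the final response.
`reference_trajectory`	The list of tool calls the agent is expected to use to satisfy the query.

Output scores

Value	Description
0	The predicted trajectory doesn't match the reference trajectory.
1	The predicted trajectory matches the reference trajectory.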
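
The exact-match rule can be sketched in a few lines of plain Python. This is an illustration, not the service's implementation: the function name `exact_match_sketch` is hypothetical, and the sketch assumes two tool calls are equal only when both `tool_name` and `tool_input` are equal.

```python
def exact_match_sketch(predicted, reference):
    """Return 1 when both trajectories hold the same tool calls in the same order."""
    return 1 if predicted == reference else 0


reference = [
    {"tool_name": "set_device_info",
     "tool_input": {"device_id": "device_2", "updates": {"status": "OFF"}}},
]
same = [
    {"tool_name": "set_device_info",
     "tool_input": {"device_id": "device_2", "updates": {"status": "OFF"}}},
]
wrong_device = [
    {"tool_name": "set_device_info",
     "tool_input": {"device_id": "device_3", "updates": {"status": "OFF"}}},
]

score_same = exact_match_sketch(same, reference)           # 1
score_wrong = exact_match_sketch(wrong_device, reference)  # 0: device_id differs
print(score_same, score_wrong)
```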

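In-order match

If the predicted trajectory contains all the tool calls from the reference trajectory in the same order, possibly with extra tool calls in between, the `trajectory_in_order_match` metric returns a score of 1; otherwise, it returns 0.

Metric input parameters

Input parameter	Description
`predicted_trajectory`	The predicted trajectory the agent used to reach the final response.
`reference_trajectory`	The trajectory the agent is expected to follow to satisfy the query.

Output scores

Value	Description
0	The tool calls in the predicted trajectory don't follow the order of the reference trajectory.
1	The predicted trajectory contains all the reference tool calls in the same order.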
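
One way to read the in-order rule is as a subsequence check: walk through the predicted trajectory once and advance through the reference whenever the next expected call appears. The sketch below follows that reading. The names `in_order_match_sketch` and `call`, and the `get_weather` tool, are hypothetical, and tool calls are assumed to match only when name and input are both equal.

```python
def in_order_match_sketch(predicted, reference):
    """Return 1 when reference is a subsequence of predicted, else 0."""
    next_ref = 0  # index of the next reference call still to be found
    for tool_call in predicted:
        if next_ref < len(reference) and tool_call == reference[next_ref]:
            next_ref += 1
    return 1 if next_ref == len(reference) else 0


def call(name, **tool_input):
    """Small helper to build a tool-call dictionary."""
    return {"tool_name": name, "tool_input": tool_input}


reference = [
    call("get_user_preferences", user_id="user_y"),
    call("set_temperature", location="Living Room", temperature=23),
]
with_extra = [reference[0], call("get_weather", city="Paris"), reference[1]]
swapped = [reference[1], reference[0]]

score_extra = in_order_match_sketch(with_extra, reference)  # 1: extra call is allowed
score_swapped = in_order_match_sketch(swapped, reference)   # 0: order is broken
print(score_extra, score_swapped)
```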

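Any-order match

If the predicted trajectory contains all the tool calls from the reference trajectory, in any order and possibly with extra tool calls, the `trajectory_any_order_match` metric returns a score of 1; otherwise, it returns 0.

Metric input parameters

Input parameter	Description
`predicted_trajectory`	The list of tool calls the agent used to reach the final response.
`reference_trajectory`	The list of tool calls the agent is expected to use to satisfy the query.

Output scores

Value	Description
0	The predicted trajectory doesn't contain all the tool calls in the reference trajectory.
1	The predicted trajectory contains all the tool calls in the reference trajectory.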
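
A minimal sketch of the any-order rule, under two assumptions that this page does not confirm: tool calls match only when name and input are both equal, and a reference call that appears twice must also be found twice in the prediction. The names `any_order_match_sketch` and `call`, and the `get_weather` tool, are hypothetical.

```python
def any_order_match_sketch(predicted, reference):
    """Return 1 when every reference call can be found in predicted, in any order."""
    remaining = list(predicted)  # copy, so matched calls can be consumed
    for ref_call in reference:
        if ref_call in remaining:
            remaining.remove(ref_call)
        else:
            return 0
    return 1


def call(name, **tool_input):
    """Small helper to build a tool-call dictionary."""
    return {"tool_name": name, "tool_input": tool_input}


reference = [
    call("get_user_preferences", user_id="user_y"),
    call("set_temperature", location="Living Room", temperature=23),
]
shuffled = [reference[1], call("get_weather", city="Paris"), reference[0]]
incomplete = [reference[0]]

score_shuffled = any_order_match_sketch(shuffled, reference)      # 1
score_incomplete = any_order_match_sketch(incomplete, reference)  # 0
print(score_shuffled, score_incomplete)
```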

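Precision

The `trajectory_precision` metric measures how many of the tool calls in the predicted trajectory are actually relevant or correct according to the reference trajectory.

Precision is calculated as follows: count how many actions in the predicted trajectory also appear in the reference trajectory, then divide that count by the total number of actions in the predicted trajectory.

Metric input parameters

Input parameter	Description
`predicted_trajectory`	The list of tool calls the agent used to reach the final response.
`reference_trajectory`	The list of tool calls the agent is expected to use to satisfy the query.

Output scores

Value	Description
A float in the range [0,1]	The higher the score, the more precise the predicted trajectory.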
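
The precision calculation described above can be worked through on a small example. This is a sketch under stated assumptions: the function name `precision_sketch` is hypothetical, tool calls match only when name and input are both equal, and an empty predicted trajectory is scored 0.0 (the page does not define that case).

```python
def precision_sketch(predicted, reference):
    """Share of predicted tool calls that also appear in the reference."""
    if not predicted:
        return 0.0  # assumption: nothing predicted means nothing correct
    matched = sum(1 for tool_call in predicted if tool_call in reference)
    return matched / len(predicted)


reference = [
    {"tool_name": "get_user_preferences", "tool_input": {"user_id": "user_y"}},
    {"tool_name": "set_temperature",
     "tool_input": {"location": "Living Room", "temperature": 23}},
]
predicted = [
    {"tool_name": "get_user_preferences", "tool_input": {"user_id": "user_z"}},
    {"tool_name": "set_temperature",
     "tool_input": {"location": "Living Room", "temperature": 23}},
]

# Only set_temperature matches, so precision is 1 matched / 2 predicted.
precision = precision_sketch(predicted, reference)
print(precision)  # 0.5
```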

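Recall

The `trajectory_recall` metric measures how many of the essential tool calls from the reference trajectory are actually captured in the predicted trajectory.

Recall is calculated as follows: count how many actions in the reference trajectory also appear in the predicted trajectory, then divide that count by the total number of actions in the reference trajectory.

Metric input parameters

Input parameter	Description
`predicted_trajectory`	The list of tool calls the agent used to reach the final response.
`reference_trajectory`	The list of tool calls the agent is expected to use to satisfy the query.

Output scores

Value	Description
A float in the range [0,1]	The higher the score, the better the recall of the predicted trajectory.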
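
Recall mirrors precision with the roles of the two trajectories swapped. The sketch below assumes tool calls match only when name and input are both equal and scores an empty reference as 0.0; the names `recall_sketch` and `call`, and the `get_weather` tool, are hypothetical.

```python
def recall_sketch(predicted, reference):
    """Share of reference tool calls that also appear in the prediction."""
    if not reference:
        return 0.0  # assumption: the page does not define the empty case
    matched = sum(1 for ref_call in reference if ref_call in predicted)
    return matched / len(reference)


def call(name, **tool_input):
    """Small helper to build a tool-call dictionary."""
    return {"tool_name": name, "tool_input": tool_input}


reference = [
    call("get_user_preferences", user_id="user_y"),
    call("set_temperature", location="Living Room", temperature=23),
]
partial = [reference[0]]
complete_with_extra = [reference[0], call("get_weather", city="Paris"), reference[1]]

recall_partial = recall_sketch(partial, reference)            # 1 of 2 found
recall_full = recall_sketch(complete_with_extra, reference)   # 2 of 2 found
print(recall_partial, recall_full)  # 0.5 1.0
```

Extra tool calls lower precision but leave recall untouched, which is why the two metrics are usually read together.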

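Single tool use

The `trajectory_single_tool_use` metric checks whether a specific tool, named in the metric spec, is used in the predicted trajectory. The metric doesn't check the order of tool calls or how many times the tool is used, only whether it is present.

Metric input parameters

Input parameter	Description
`predicted_trajectory`	The list of tool calls the agent used to reach the final response.

Output scores

Value	Description
0	The tool is absent.
1	The tool is present.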
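
As a rough illustration of the presence check, the sketch below looks only at tool names and ignores inputs, order, and repetition. The function name `single_tool_use_sketch` is hypothetical, and matching on `tool_name` alone is an assumption based on the description above.

```python
def single_tool_use_sketch(predicted, tool_name):
    """Return 1 when any call in the trajectory uses the named tool, else 0."""
    return 1 if any(c["tool_name"] == tool_name for c in predicted) else 0


predicted = [
    {"tool_name": "get_user_preferences", "tool_input": {"user_id": "user_z"}},
    {"tool_name": "set_temperature",
     "tool_input": {"location": "Living Room", "temperature": 23}},
]

present = single_tool_use_sketch(predicted, "set_temperature")  # 1
absent = single_tool_use_sketch(predicted, "set_device_info")   # 0
print(present, absent)
```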

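In addition, the following two agent performance metrics are added to the evaluation results by default. You don't need to specify them in `EvalTask`.

`latency`

Time taken by the agent to return a response.

Value	Description
A float	Calculated in seconds.

`failure`

A boolean that describes whether the agent invocation resulted in an error or succeeded.

Output scores

Value	Description
1	Error
0	Valid response returned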
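
To make the two default metrics concrete, the sketch below wraps an agent call, times it, and records whether it raised. It only illustrates what the values mean and is not how the service measures them; `run_with_stats`, `ok_agent`, and `broken_agent` are hypothetical names.

```python
import time


def run_with_stats(agent_fn, prompt):
    """Call agent_fn(prompt) and report latency (seconds) and failure (0 or 1)."""
    start = time.perf_counter()
    try:
        response = agent_fn(prompt)
        failure = 0  # a valid response was returned
    except Exception:
        response = None
        failure = 1  # the invocation raised an error
    latency = time.perf_counter() - start
    return {"response": response, "latency": latency, "failure": failure}


def ok_agent(prompt):
    return f"Echo: {prompt}"


def broken_agent(prompt):
    raise RuntimeError("tool backend unavailable")


ok_stats = run_with_stats(ok_agent, "Turn off the lights")
bad_stats = run_with_stats(broken_agent, "Turn off the lights")
print(ok_stats["failure"], bad_stats["failure"])
```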

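Prepare your dataset for agent evaluation

Prepare your dataset for final response evaluation or trajectory evaluation.

The data schema for final response evaluation is similar to that of model response evaluation.

For computation-based trajectory evaluation, your dataset needs to provide the following information:

Input type	Input field contents
`predicted_trajectory`	The list of tool calls the agent used to reach the final response.
`reference_trajectory` (not required for the `trajectory_single_tool_use` metric)	The list of tool calls the agent is expected to use to satisfy the query.

Evaluation dataset examples

The following example shows a dataset for trajectory evaluation. Note that `reference_trajectory` is required for all metrics except `trajectory_single_tool_use`.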

import pandas as pd

# Expected tool calls, one inner list per evaluation example.
reference_trajectory = [
    # example 1
    [
        {
            "tool_name": "set_device_info",
            "tool_input": {
                "device_id": "device_2",
                "updates": {
                    "status": "OFF"
                }
            }
        }
    ],
    # example 2
    [
        {
            "tool_name": "get_user_preferences",
            "tool_input": {
                "user_id": "user_y"
            }
        },
        {
            "tool_name": "set_temperature",
            "tool_input": {
                "location": "Living Room",
                "temperature": 23
            }
        },
    ]
]

# Tool calls the agent actually made, in the same example order.
predicted_trajectory = [
    # example 1
    [
        {
            "tool_name": "set_device_info",
            "tool_input": {
                "device_id": "device_3",
                "updates": {
                    "status": "OFF"
                }
            }
        }
    ],
    # example 2
    [
        {
            "tool_name": "get_user_preferences",
            "tool_input": {
                "user_id": "user_z"
            }
        },
        {
            "tool_name": "set_temperature",
            "tool_input": {
                "location": "Living Room",
                "temperature": 23
            }
        },
    ]
]

# One row per example, one column per trajectory type.
eval_dataset = pd.DataFrame({
    "predicted_trajectory": predicted_trajectory,
    "reference_trajectory": reference_trajectory,
})

Import your evaluation dataset

You can import your dataset in the following formats:

JSONL or CSV file stored in Cloud Storage
BigQuery table
Pandas DataFrame
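
As a sketch of the JSONL option, the following writes trajectory rows to a local file, one JSON object per line, and reads them back. The row layout (column names as keys) and the file name are assumptions for illustration, and copying the file to a Cloud Storage bucket is not shown.

```python
import json
import os
import tempfile

# Each row holds one evaluation example, keyed by dataset column name.
rows = [
    {
        "predicted_trajectory": [
            {"tool_name": "set_device_info",
             "tool_input": {"device_id": "device_3", "updates": {"status": "OFF"}}}
        ],
        "reference_trajectory": [
            {"tool_name": "set_device_info",
             "tool_input": {"device_id": "device_2", "updates": {"status": "OFF"}}}
        ],
    },
]

path = os.path.join(tempfile.mkdtemp(), "eval_dataset.jsonl")

# Write one JSON object per line.
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read the file back to confirm that the rows survive the round trip.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded), "row(s) loaded")
```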

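The Gen AI evaluation service provides example public datasets that demonstrate how you can evaluate your agents. The following code shows how to import a public dataset from a Cloud Storage bucket: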

# Run this in a notebook (IPython) session: the "!" and "%run" lines are notebook commands.
import json

# Name of the dataset to import.
dataset = "on-device"  # Alternatives: "customer-support", "content-creation"

# Copy the tools file and the dataset file to the working directory.
!gcloud storage cp gs://cloud-ai-demo-datasets/agent-eval-datasets/{dataset}/tools.py .
!gcloud storage cp gs://cloud-ai-demo-datasets/agent-eval-datasets/{dataset}/eval_dataset.json .

# Load the dataset examples.
with open("eval_dataset.json") as f:
    eval_dataset = json.load(f)

# Run the tools file so that its definitions are available in this session.
%run -i tools.py

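where `dataset` is one of the following public datasets:

"on-device": an on-device home assistant that controls home devices. The agent helps with queries such as "Schedule the air conditioning in the bedroom so that it is on between 11pm and 8am, and off the rest of the time."
"customer-support": a customer support agent. The agent helps with queries such as "Can you cancel any pending orders and escalate any open support tickets?"
"content-creation": a marketing content creation agent. The agent helps with queries such as "Reschedule campaign X to be a one-time campaign on social media site Y with a 50% reduced budget, only on December 25, 2024."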

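Run agent evaluation

Run an evaluation for trajectory evaluation or final response evaluation:

For agent evaluation, you can mix response evaluation metrics and trajectory evaluation metrics, as in the following code: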

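# Import paths assume the preview evaluation module of the Vertex AI SDK; adjust them to your SDK version.
from vertexai.preview.evaluation import EvalTask
from vertexai.preview.evaluation.metrics import TrajectorySingleToolUse

# Checks whether the named tool appears in the predicted trajectory.
single_tool_use_metric = TrajectorySingleToolUse(tool_name='tool_name')

# EVAL_DATASET and RUNNABLE are placeholders for your dataset and your agent.
# custom_trajectory_eval_metric and response_follows_trajectory_metric are
# defined in the "Custom metrics" section.
eval_task = EvalTask(
    dataset=EVAL_DATASET,
    metrics=[
        "rouge_l_sum",
        "bleu",
        custom_trajectory_eval_metric,  # custom computation-based metric
        "trajectory_exact_match",
        "trajectory_precision",
        single_tool_use_metric,
        response_follows_trajectory_metric,  # LLM-based metric
    ],
)
eval_result = eval_task.evaluate(
    runnable=RUNNABLE,
)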

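Custom metrics

You can customize a large language model-based metric for trajectory evaluation by using a templated interface or by writing it from scratch. For more details, see the section on model-based metrics. Here is a templated example: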

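# Import path assumes the preview evaluation module of the Vertex AI SDK; adjust it to your SDK version.
from vertexai.preview.evaluation.metrics import (
    PointwiseMetric,
    PointwiseMetricPromptTemplate,
)

response_follows_trajectory_prompt_template = PointwiseMetricPromptTemplate(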
    criteria={
        "Follows trajectory": (
            "Evaluate whether the agent's response logically follows from the "
            "sequence of actions it took. Consider these sub-points:\n"
            "  - Does the response reflect the information gathered during the trajectory?\n"
            "  - Is the response consistent with the goals and constraints of the task?\n"
            "  - Are there any unexpected or illogical jumps in reasoning?\n"
            "Provide specific examples from the trajectory and response to support your evaluation."
        )
    },
    rating_rubric={
        "1": "Follows trajectory",
        "0": "Does not follow trajectory",
    },
    input_variables=["prompt", "predicted_trajectory"],
)

response_follows_trajectory_metric = PointwiseMetric(
    metric="response_follows_trajectory",
    metric_prompt_template=response_follows_trajectory_prompt_template,
)

You can also define a custom computation-based metric for trajectory evaluation or response evaluation, as follows:

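# Import path assumes the preview evaluation module of the Vertex AI SDK; adjust it to your SDK version.
from vertexai.preview.evaluation.metrics import CustomMetric

def essential_tools_present(instance, required_tools=("tool1", "tool2")):
    """Return the fraction of required tools that appear in the predicted trajectory."""
    trajectory = instance["predicted_trajectory"]
    tools_present = [tool_used["tool_name"] for tool_used in trajectory]
    # With no required tools, the requirement is trivially satisfied.
    if len(required_tools) == 0:
        return {"essential_tools_present": 1}
    score = 0
    for tool in required_tools:
        if tool in tools_present:
            score += 1
    return {
        "essential_tools_present": score / len(required_tools),
    }

custom_trajectory_eval_metric = CustomMetric(name="essential_tools_present", metric_function=essential_tools_present)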

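View and interpret results

For trajectory evaluation or final response evaluation, the evaluation results are displayed as follows:

[Figure: tables for agent evaluation metrics]

The evaluation results contain the following information: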

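Final response metrics

Instance-level results

Column	Description
response	Final response generated by the agent.
latency_in_seconds	Time taken to generate the response.
failure	Indicates whether a valid response was generated.
score	A score calculated for the response, as specified in the metric spec.
explanation	The explanation for the score, as specified in the metric spec.

Aggregate results

Column	Description
mean	Average score for all instances.
standard deviation	Standard deviation of all the scores.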

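Trajectory metrics

Instance-level results

Column	Description
predicted_trajectory	Sequence of tool calls the agent followed to reach the final response.
reference_trajectory	Sequence of expected tool calls.
score	A score calculated for the predicted trajectory and reference trajectory, as specified in the metric spec.
latency_in_seconds	Time taken to generate the response.
failure	Indicates whether a valid response was generated.

Aggregate results

Column	Description
mean	Average score for all instances.
standard deviation	Standard deviation of all the scores.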
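
The aggregate columns can be reproduced from instance-level scores. The sketch below uses made-up scores and assumes the sample standard deviation (dividing by n - 1); this page does not say which convention the service uses.

```python
import statistics

# Hypothetical instance-level scores for one metric, for example trajectory_exact_match.
scores = [1, 0, 1, 1]

mean_score = statistics.mean(scores)
# Assumption: sample standard deviation, which needs at least two scores.
std_score = statistics.stdev(scores)

print(mean_score)  # 0.75
print(std_score)   # 0.5
```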

What's next

Try the following agent evaluation notebooks:
