Configure Output Evaluation

Output Evaluation scores each specialized agent run against quality criteria, examples, and a baseline score you define. Use Output Evaluation to measure agent quality consistently as you iterate, without manually reviewing every run.

Configure Output Evaluation per agent on the Output Evaluation sub-tab of the Quality tab.

Output Evaluation results are independent of Execution Guardrails outcomes. Output Evaluation sometimes marks a run as Passed when Execution Guardrails flags it, and vice versa. See Configure Execution Guardrails for details on guardrail-based scoring.

Prerequisites

A specialized agent exists in your Opal instance. See Create a specialized agent.
You have permission to edit the agent.

Access the Output Evaluation sub-tab

Open the Output Evaluation sub-tab to configure scoring or view results for a specialized agent.

Go to home.optimizely.com.
Select your organization.
Click Opal.
Click Agents.
Click the Your Agents tab.

Select a specialized agent.
Click Quality.
Click Output Evaluation.

Configure evaluation criteria

Evaluation criteria are statements that an LLM-as-Judge model applies when scoring each run. An LLM-as-Judge model is a large language model (LLM) that evaluates other model outputs against quality standards. Add up to 10 criteria for each agent. Criteria apply to both Output Evaluation Examples and Conversation Evaluation Samples.

Click Add Criterion in the Evaluation Criteria section.
Enter the criterion as a clear, testable statement. For example, "Content must be at least 250 words long." or "Content must not include em dashes."
(Optional) Repeat steps 1–2 for each additional criterion, up to 10.
Click Save to save your criteria.
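To illustrate what "testable" means, the two example criteria above can each be expressed as a check with a clear true or false result. The following is a minimal sketch for drafting criteria outside Opal. It is not how the LLM-as-Judge model works, and the function names are hypothetical.

```python
from typing import Dict


def meets_min_words(text: str, minimum: int = 250) -> bool:
    """Return True when the text has at least `minimum` whitespace-separated words."""
    return len(text.split()) >= minimum


def has_no_em_dash(text: str) -> bool:
    """Return True when the text contains no em dash character (U+2014)."""
    return "\u2014" not in text


def check_output(text: str) -> Dict[str, bool]:
    """Map each example criterion from this article to a pass/fail result."""
    return {
        "Content must be at least 250 words long.": meets_min_words(text),
        "Content must not include em dashes.": has_no_em_dash(text),
    }
```

A criterion that cannot be reduced to a yes-or-no judgment like these, such as "Content should be good," is likely too vague to score consistently.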

Add output examples or conversation samples

Output Evaluation displays two sections for every specialized agent: Output Evaluation Examples (Single-Turn) and Conversation Evaluation Samples (Multi-Turn). Populate the section that matches your agent's interaction mode (single-turn or multi-turn). Opal ignores entries in the section that does not match the agent's mode.

Screenshot of the Output Evaluation Examples and Conversation Evaluation Samples sections highlighted

Add an output example for a single-turn agent

Output examples are preferred outputs that show what a successful single-turn run looks like. Add up to five.

Click Add Example in the Output Evaluation Examples section.
Enter a Name.
Provide the input variables for the example.
Enter the sample output.
Click Add to save your changes.
(Optional) Repeat steps 1–5 for each additional example, up to five total.

Add a conversation sample for a multi-turn agent

Conversation samples are reference conversations that show ideal multi-turn behavior. Add up to five.

Click Add Sample in the Conversation Evaluation Samples section.
Enter a Name.
(Optional) Enter a Description.
Enter the Turn 1 information, including the User message and, optionally, the Expected assistant response, Tool calls, and Notes.
(Optional) Click Add Turn and enter the same information for each additional turn.
Click Add to save your changes.
(Optional) Repeat steps 1–6 for each additional sample, up to five total.
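If you draft conversation samples before entering them in Opal, it can help to model the fields from the steps above. The following is a sketch only: the class and field names are hypothetical, do not represent an Opal API or export format, and assume that Name and the User message are the required fields.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_SAMPLES = 5  # this article allows up to five conversation samples


@dataclass
class Turn:
    user_message: str                         # assumed required for each turn
    expected_response: Optional[str] = None   # optional
    tool_calls: List[str] = field(default_factory=list)  # optional
    notes: Optional[str] = None               # optional


@dataclass
class ConversationSample:
    name: str                                 # assumed required
    turns: List[Turn]
    description: Optional[str] = None         # optional


def validate_samples(samples: List[ConversationSample]) -> List[str]:
    """Return a list of problems found in the drafted samples (empty if none)."""
    problems = []
    if len(samples) > MAX_SAMPLES:
        problems.append(f"Too many samples: {len(samples)} (limit {MAX_SAMPLES})")
    for sample in samples:
        if not sample.name.strip():
            problems.append("A sample is missing a name")
        if not sample.turns:
            problems.append(f"Sample '{sample.name}' has no turns")
        for number, turn in enumerate(sample.turns, start=1):
            if not turn.user_message.strip():
                problems.append(
                    f"Sample '{sample.name}', turn {number}: user message is empty"
                )
    return problems
```

Running `validate_samples` on a draft set before data entry catches missing user messages and sets that exceed the five-sample limit.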

Set the baseline evaluation score

The baseline evaluation score is the lowest score that counts as Passed.

Select 75%, 80%, 85%, 90%, or 95% from the baseline evaluation score drop-down list.
Click Update to save your changes.

A lower baseline marks more runs as Passed but accepts more variation in quality. A higher baseline enforces stricter quality but marks more runs as Failed. The default value is 85%.

How Output Evaluation scores runs

Output Evaluation uses an LLM-as-Judge model to score each agent execution. The model evaluates the run against:

The evaluation criteria you defined.
The output examples or conversation samples you provided.
A default rubric.

Each run produces an Evaluation Score between 0% and 100%. Opal compares the score to the baseline evaluation score:

Passed – The score met or exceeded the baseline.
Failed – The score fell below the baseline.
Not Evaluated – Opal has not yet scored the run.
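The status rules above can be sketched as a small function. This is an illustration of the comparison described in this article, not Opal's implementation. The function and constant names are hypothetical, and an unscored run is represented here as `None`.

```python
from typing import Optional

ALLOWED_BASELINES = (75, 80, 85, 90, 95)  # options in the baseline drop-down list
DEFAULT_BASELINE = 85                     # default value stated in this article


def run_status(score: Optional[float], baseline: int = DEFAULT_BASELINE) -> str:
    """Classify a run from its Evaluation Score (0 to 100) and the baseline."""
    if baseline not in ALLOWED_BASELINES:
        raise ValueError(f"Baseline must be one of {ALLOWED_BASELINES}")
    if score is None:
        return "Not Evaluated"            # the run has not been scored yet
    if not 0 <= score <= 100:
        raise ValueError("Score must be between 0 and 100")
    # A score equal to the baseline counts as Passed.
    return "Passed" if score >= baseline else "Failed"
```

For example, a run that scores 85 passes at the default baseline but fails if you raise the baseline to 90.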

Default rubric

Opal scores every agent against a default rubric, even without custom criteria. The default rubric covers:

Accuracy – Whether the output is factually correct based on the agent's inputs and context.
Completeness – Whether the output addresses every part of the task.
Format consistency – Whether the output follows the expected structure or format.
Usefulness – Whether the output is practical and actionable for the task.

When your custom criteria conflict with the default rubric, the custom criteria take priority.

Multi-turn evaluation

For multi-turn agents, Opal scores the conversation at three distinct levels so you can identify where quality breaks down.

Per criterion – Identifies which quality dimensions are weakest.
Per turn – Evaluates how each individual response performed.
Across the conversation – Measures coherence, goal completion, and efficiency.

View Output Evaluation results

The Summary card on the Output Evaluation sub-tab displays the following:

Total Executions for this version – The total number of runs for the current agent version.
Avg. Output Evaluation Score – The average score across evaluated runs.
Passed – The count of runs that met or exceeded the baseline.
Failed – The count of runs that fell below the baseline.
Not Evaluated – The count of runs Opal has not yet scored.
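The relationship between these figures can be sketched as follows. This is a minimal illustration with hypothetical names, not Opal's implementation. It assumes that unscored runs are represented as `None` and, as the Avg. Output Evaluation Score description states, that the average covers evaluated runs only.

```python
from typing import Dict, List, Optional


def summarize(scores: List[Optional[float]], baseline: int = 85) -> Dict[str, object]:
    """Compute Summary-card style figures from a list of per-run scores."""
    evaluated = [s for s in scores if s is not None]
    passed = sum(1 for s in evaluated if s >= baseline)
    return {
        "total_executions": len(scores),
        # The average ignores runs that have not been scored.
        "avg_score": sum(evaluated) / len(evaluated) if evaluated else None,
        "passed": passed,
        "failed": len(evaluated) - passed,
        "not_evaluated": len(scores) - len(evaluated),
    }
```

Under these assumptions, Passed, Failed, and Not Evaluated always add up to Total Executions.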

Screenshot of the Output Evaluation results summary.

To see individual run scores, go to the Logs tab. Each run row in the Logs table shows its Evaluation Score and Status. Click a run row to open the execution details panel. The panel displays the Evaluation Score alongside the Guardrail Status.

If you use Opti ID, administrators can turn off generative AI in the Opti ID Admin Center. See Turn generative AI off across Optimizely applications.
