Output Evaluation scores each specialized agent run against quality criteria, examples, and a baseline score you define. Use Output Evaluation to measure agent quality consistently as you iterate, without manually reviewing every run.
Configure Output Evaluation per agent on the Output Evaluation sub-tab of the Quality tab.
Output Evaluation results are independent of Execution Guardrails outcomes. Output Evaluation can mark a run as Passed even when Execution Guardrails flags it, and vice versa. See Configure Execution Guardrails for details on guardrail-based scoring.
Prerequisites
- A specialized agent exists in your Opal instance. See Create a specialized agent.
- You have permission to edit the agent.
Access the Output Evaluation sub-tab
Open the Output Evaluation sub-tab to configure scoring or view results for a specialized agent.
- Go to home.optimizely.com.
- Select your organization.
- Click Opal.
- Click Agents.
- Click the Your Agents tab.
- Select a specialized agent.
- Click Quality.
- Click Output Evaluation.
Configure evaluation criteria
Evaluation criteria are statements that an LLM-as-Judge model applies when scoring each run. An LLM-as-Judge model is a Large Language Model (LLM) trained to evaluate other model outputs against quality standards. Add up to 10 criteria for each agent. Criteria apply to both Output Evaluation Examples and Conversation Evaluation Samples.
- Click Add Criterion in the Evaluation Criteria section.
- Enter the criterion as a clear, testable statement. For example: Content must be at least 250 words long. or Content must not include em dashes.
- (Optional) Repeat steps 1–2 for each additional criterion, up to 10 total.
- Click Save to save your changes.
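To illustrate what "clear, testable" means, the two example criteria above can each be reduced to a check with an unambiguous yes-or-no answer. The sketch below is only an illustration: Opal applies criteria through its LLM-as-Judge model, not through code like this, and the function names are hypothetical.

```python
# Illustrative only: Opal evaluates criteria with an LLM-as-Judge model.
# These hypothetical helpers show that each example criterion has a
# single, unambiguous pass/fail answer.

EM_DASH = "\u2014"  # the em dash character, written as an escape


def meets_min_words(content: str, minimum: int = 250) -> bool:
    """Criterion: 'Content must be at least 250 words long.'"""
    return len(content.split()) >= minimum


def has_no_em_dash(content: str) -> bool:
    """Criterion: 'Content must not include em dashes.'"""
    return EM_DASH not in content


if __name__ == "__main__":
    draft = "word " * 300  # made-up content for the demonstration
    print(meets_min_words(draft))
    print(has_no_em_dash(draft))
```

A criterion such as "Content should be engaging" has no comparable check, which is a sign it may score inconsistently from run to run.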
Add output examples or conversation samples
The Output Evaluation sub-tab displays two sections for every specialized agent: Output Evaluation Examples (Single-Turn) and Conversation Evaluation Samples (Multi-Turn). Populate the section that matches your agent's interaction mode. Opal ignores entries in the section that does not match the agent's mode.
Add an output example for a single-turn agent
Output examples are preferred outputs that show what a successful single-turn run looks like. Add up to five.
- Click Add Example in the Output Evaluation Examples section.
- Enter a Name.
- Provide the input variables for the example.
- Enter the sample output.
- Click Add to save your changes.
- (Optional) Repeat steps 1–5 for each additional example, up to five total.
Add a conversation sample for a multi-turn agent
Conversation samples are reference conversations that show ideal multi-turn behavior. Add up to five.
- Click Add Sample in the Conversation Evaluation Samples section.
- Enter a Name.
- (Optional) Enter a Description.
- Enter the Turn 1 information, including the User message and optionally the Expected assistant response, Tool calls, and Notes.
- (Optional) Click Add Turn and repeat the previous step with the turn's information.
- Click Add to save your changes.
- (Optional) Repeat steps 1–6 for each additional sample, up to five total.
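It can help to draft conversation samples as structured data before entering them in the UI. The following is a minimal sketch of the fields described in the steps above. The class names, field names, and the representation of tool calls as strings are assumptions made for illustration, not an Opal schema or API.

```python
from dataclasses import dataclass, field
from typing import Optional

MAX_SAMPLES = 5  # from this article: up to five samples per agent


@dataclass
class Turn:
    # The user message is the only turn field the steps do not mark optional.
    user_message: str
    expected_assistant_response: Optional[str] = None
    tool_calls: list[str] = field(default_factory=list)  # assumed shape
    notes: Optional[str] = None


@dataclass
class ConversationSample:
    name: str
    turns: list[Turn]
    description: Optional[str] = None


def validate_samples(samples: list[ConversationSample]) -> list[str]:
    """Return a list of human-readable problems; empty means the draft looks OK."""
    problems = []
    if len(samples) > MAX_SAMPLES:
        problems.append(f"{len(samples)} samples exceed the limit of {MAX_SAMPLES}")
    for sample in samples:
        if not sample.name.strip():
            problems.append("a sample is missing a name")
        if not sample.turns:
            problems.append(f"{sample.name!r} has no turns")
        for number, turn in enumerate(sample.turns, start=1):
            if not turn.user_message.strip():
                problems.append(f"{sample.name!r} turn {number} is missing a user message")
    return problems
```

For example, `validate_samples([ConversationSample(name="Refund request", turns=[Turn(user_message="I want a refund")])])` returns an empty list.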
Set the baseline evaluation score
The baseline evaluation score is the lowest score that counts as Passed.
- Select 75%, 80%, 85%, 90%, or 95% from the baseline evaluation score drop-down list.
- Click Update to save your changes.
A lower baseline marks more runs as Passed but accepts more variation in quality. A higher baseline enforces stricter quality but marks more runs as Failed. The default value is 85%.
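The trade-off can be seen by counting how many runs would pass at each available baseline. This is a rough sketch with made-up scores; `passed_counts` is a hypothetical helper, not part of Opal.

```python
# The five baseline options offered in the drop-down list.
BASELINE_OPTIONS = [75, 80, 85, 90, 95]


def passed_counts(scores, baselines=BASELINE_OPTIONS):
    """Map each baseline to the number of scores that meet or exceed it."""
    return {b: sum(1 for s in scores if s >= b) for b in baselines}


if __name__ == "__main__":
    scores = [72, 78, 84, 85, 91, 96]  # made-up evaluation scores
    for baseline, passed in passed_counts(scores).items():
        failed = len(scores) - passed
        print(f"baseline {baseline}%: {passed} passed, {failed} failed")
```

With these sample scores, raising the baseline from 75% to 95% reduces the Passed count from five runs to one.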
How Output Evaluation scores runs
Output Evaluation uses an LLM-as-Judge model to score each agent execution. The model evaluates the run against:
- The evaluation criteria you defined.
- The output examples or conversation samples you provided.
- A default rubric.
Each run produces an Evaluation Score between 0% and 100%. Opal compares the score to the baseline evaluation score and assigns one of the following statuses:
- Passed – The score met or exceeded the baseline.
- Failed – The score fell below the baseline.
- Not Evaluated – Opal has not yet scored the run.
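The status rules above amount to a simple comparison. The sketch below assumes an unscored run is represented as `None`. The function name and that representation are illustrative, not how Opal stores results.

```python
from typing import Optional

DEFAULT_BASELINE = 85.0  # the default baseline stated in this article


def evaluation_status(score: Optional[float], baseline: float = DEFAULT_BASELINE) -> str:
    """Return the status for a run given its score (0-100) and the baseline."""
    if score is None:
        # Assumption: a run without a score has not been evaluated yet.
        return "Not Evaluated"
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    # A score equal to the baseline counts as Passed.
    return "Passed" if score >= baseline else "Failed"
```

Note the boundary case: a score exactly at the baseline is Passed, because the baseline is the lowest score that counts as Passed.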
Default rubric
Opal scores every agent against a default rubric, even without custom criteria. The default rubric covers:
- Accuracy – Whether the output is factually correct based on the agent's inputs and context.
- Completeness – Whether the output addresses every part of the task.
- Format consistency – Whether the output follows the expected structure or format.
- Usefulness – Whether the output is practical and actionable for the task.
When your custom criteria conflict with the default rubric, the custom criteria take priority.
Multi-turn evaluation
For multi-turn agents, Opal scores the conversation at three distinct levels so you can identify where quality breaks down.
- Per criterion – Identifies which quality dimensions are weakest.
- Per turn – Evaluates how each individual response performed.
- Across the conversation – Measures coherence, goal completion, and efficiency.
View Output Evaluation results
The Summary card on the Output Evaluation sub-tab displays the following:
- Total Executions for this version – The total number of runs for the current agent version.
- Avg. Output Evaluation Score – The average score across evaluated runs.
- Passed – The count of runs that met or exceeded the baseline.
- Failed – The count of runs that fell below the baseline.
- Not Evaluated – The count of runs Opal has not yet scored.
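The Summary card fields can be approximated from a list of run scores, which is useful for sanity-checking exported data. This is a sketch under assumptions: unscored runs are represented as `None`, and the dictionary keys are hypothetical names rather than Opal field names.

```python
from statistics import mean
from typing import Optional


def summarize(scores: list[Optional[float]], baseline: float) -> dict:
    """Compute Summary-card style totals from per-run scores (None = not scored)."""
    evaluated = [s for s in scores if s is not None]
    passed = sum(1 for s in evaluated if s >= baseline)
    return {
        "total_executions": len(scores),
        # The average covers evaluated runs only; None when nothing is scored yet.
        "avg_score": mean(evaluated) if evaluated else None,
        "passed": passed,
        "failed": len(evaluated) - passed,
        "not_evaluated": len(scores) - len(evaluated),
    }
```

In this sketch, Passed, Failed, and Not Evaluated always add up to the total number of executions.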
To see individual run scores, go to the Logs tab. Each run row in the Logs table shows its Evaluation Score and Status. Click a run row to open the execution details panel. The panel displays the Evaluation Score alongside the Guardrail Status.
If you use Opti ID, administrators can turn off generative AI in the Opti ID Admin Center. See Turn generative AI off across Optimizely applications.