Configure Output Evaluation

Output Evaluation scores each specialized agent run against quality criteria, examples, and a baseline score you define. Use Output Evaluation to measure agent quality consistently as you iterate, without manually reviewing every run.

Configure Output Evaluation per agent on the Output Evaluation sub-tab of the Quality tab.

Output Evaluation results are independent of Execution Guardrails outcomes. Output Evaluation can mark a run as Passed even when Execution Guardrails flags it, and vice versa. See Configure Execution Guardrails for details on guardrail-based scoring.

Prerequisites

Access the Output Evaluation sub-tab

Open the Output Evaluation sub-tab to configure scoring or view results for a specialized agent.

  1. Go to home.optimizely.com.
  2. Select your organization.
  3. Click Opal.

    Screenshot of the Optimizely home page showing the Opal option in the navigation
  4. Click Agents.

    Screenshot of the Agents page in Opal where the list of agents is displayed
  5. Click the Your Agents tab.

    Screenshot of the Agents page in Opal where the Your Agents tab is selected
  6. Select a specialized agent.
  7. Click Quality.
  8. Click Output Evaluation.

Configure evaluation criteria

Evaluation criteria are statements that an LLM-as-Judge model applies when scoring each run. An LLM-as-Judge model is a large language model (LLM) used to evaluate other model outputs against quality standards. Add up to 10 criteria for each agent. Criteria apply to both Output Evaluation Examples and Conversation Evaluation Samples.

  1. Click Add Criterion in the Evaluation Criteria section.
  2. Enter the criterion as a clear, testable statement. For example, "Content must be at least 250 words long." or "Content must not include em dashes."
  3. (Optional) Repeat steps 1–2 for each additional criterion, up to 10.
  4. Click Save to save your changes.

    Screenshot of a specialized agent on the Quality tab with the Evaluation Criteria highlighted.
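
If you draft criteria outside Opal before entering them, a quick check can catch problems early. The following Python sketch is illustrative only: `check_criteria` is a hypothetical helper, not part of Opal, and the only rule taken from this article is the limit of 10 criteria. The empty and duplicate checks are assumptions about what makes a draft list tidy.

```python
MAX_CRITERIA = 10  # limit stated in this article


def check_criteria(criteria):
    """Return a list of problems found in a draft list of criteria (hypothetical helper)."""
    problems = []
    cleaned = [c.strip() for c in criteria]
    if len(cleaned) > MAX_CRITERIA:
        problems.append(f"Too many criteria: {len(cleaned)} (maximum {MAX_CRITERIA})")
    if any(not c for c in cleaned):
        problems.append("Empty criterion found")
    # Assumption: criteria that differ only by letter case count as duplicates.
    if len({c.lower() for c in cleaned}) < len(cleaned):
        problems.append("Duplicate criteria found")
    return problems


draft = [
    "Content must be at least 250 words long.",
    "Content must not include em dashes.",
]
print(check_criteria(draft))
```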

Add output examples or conversation samples

The Output Evaluation sub-tab displays two sections for every specialized agent: Output Evaluation Examples (Single-Turn) and Conversation Evaluation Samples (Multi-Turn). Populate the section that matches your agent's interaction mode. Opal ignores entries in the section that does not match the agent's mode.

Screenshot of a specialized agent with the Output Evaluation Examples section highlighted

Add an output example for a single-turn agent

Output examples are preferred outputs that show what a successful single-turn run looks like. Add up to five.

  1. Click Add Example in the Output Evaluation Examples section.
  2. Enter a Name.
  3. Provide the input variables for the example.
  4. Enter the sample output.
  5. Click Add to save your changes.

    Screenshot of the Output Evaluation Examples showing how to add an example.
  6. (Optional) Repeat steps 1–5 for each additional example, up to five total.

Add a conversation sample for a multi-turn agent

Conversation samples are reference conversations that show ideal multi-turn behavior. Add up to five.

  1. Click Add Sample in the Conversation Evaluation Samples section.
  2. Enter a Name.
  3. (Optional) Enter a Description.
  4. Enter the Turn 1 information, including the User message and optionally the Expected assistant response, Tool calls, and Notes.
  5. (Optional) Click Add Turn and repeat the previous step with the turn's information.
  6. Click Add to save your changes.

    Screenshot of Conversation Evaluation Samples with one example filled out and the add button highlighted
  7. (Optional) Repeat steps 1–6 for each additional sample, up to five total.

Set the baseline evaluation score

The baseline evaluation score is the lowest score that counts as Passed.

  1. Select 75%, 80%, 85%, 90%, or 95% from the baseline evaluation score drop-down list.

    Screenshot of the baseline evaluation score drop-down list.
  2. Click Update to save your changes.

A lower baseline marks more runs as Passed but accepts more variation in quality. A higher baseline enforces stricter quality but marks more runs as Failed. The default value is 85%.
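
To get a feel for this trade-off before choosing a value, you can replay a set of scores against each available baseline. The following Python sketch is illustrative only: the scores are made-up sample data, and `count_passed` is a hypothetical helper, not part of Opal.

```python
# Hypothetical sample scores (percent); not real Opal data.
SAMPLE_SCORES = [72.0, 78.5, 84.0, 85.0, 91.0, 96.5]

# Baseline options listed in the drop-down list.
BASELINES = [75, 80, 85, 90, 95]


def count_passed(scores, baseline):
    """Count scores that meet or exceed the baseline."""
    return sum(1 for score in scores if score >= baseline)


# Number of passing runs at each baseline.
passed_by_baseline = {b: count_passed(SAMPLE_SCORES, b) for b in BASELINES}

for baseline, passed in passed_by_baseline.items():
    print(f"Baseline {baseline}%: {passed} of {len(SAMPLE_SCORES)} runs pass")
```

With this sample data, each step up in the baseline moves one more run from Passed to Failed.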

How Output Evaluation scores runs

Output Evaluation uses an LLM-as-Judge model to score each agent execution. The model evaluates the run against:

  • The evaluation criteria you defined.
  • The output examples or conversation samples you provided.
  • A default rubric.

Each run produces an Evaluation Score between 0% and 100%. Opal compares the score to the baseline evaluation score:

  • Passed – The score met or exceeded the baseline.
  • Failed – The score fell below the baseline.
  • Not Evaluated – Opal has not yet scored the run.
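
The comparison above can be sketched in a few lines of Python. This is a minimal illustration, not Opal's implementation: `evaluation_status` is a hypothetical function, and representing an unscored run as `None` is an assumption made for the example. The default of 85 mirrors the default baseline described in this article.

```python
from typing import Optional


def evaluation_status(score: Optional[float], baseline: float = 85.0) -> str:
    """Return the status for a run; a score of None means the run is not yet scored."""
    if score is None:
        return "Not Evaluated"
    if not 0.0 <= score <= 100.0:
        raise ValueError("score must be between 0 and 100")
    # A score equal to the baseline counts as Passed.
    return "Passed" if score >= baseline else "Failed"


print(evaluation_status(85.0))  # equal to the baseline
print(evaluation_status(84.9))  # just below the baseline
print(evaluation_status(None))  # not yet scored
```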

Default rubric

Opal scores every agent against a default rubric, even without custom criteria. The default rubric covers:

  • Accuracy – Whether the output is factually correct based on the agent's inputs and context.
  • Completeness – Whether the output addresses every part of the task.
  • Format consistency – Whether the output follows the expected structure or format.
  • Usefulness – Whether the output is practical and actionable for the task.

When your custom criteria conflict with the default rubric, the custom criteria take priority.

Multi-turn evaluation

For multi-turn agents, Opal scores the conversation at three distinct levels so you can identify where quality breaks down.

  • Per criterion – Identifies which quality dimensions are weakest.
  • Per turn – Evaluates how each individual response performed.
  • Across the conversation – Measures coherence, goal completion, and efficiency.
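
As a rough illustration of how per-criterion and per-turn views point to different weak spots, the following Python sketch finds the weakest criterion and the weakest turn in a small set of scores. Everything here is an assumption made for the example: the data shape, the sample values, the helper names, and the use of simple averages. This article does not describe how Opal aggregates scores internally, and the sketch does not cover the conversation-level measures.

```python
from statistics import mean

# Hypothetical scores (percent) for each turn, keyed by criterion.
turn_scores = [
    {"accuracy": 90, "completeness": 80},
    {"accuracy": 70, "completeness": 60},
    {"accuracy": 95, "completeness": 85},
]


def weakest_criterion(turns):
    """Return the criterion with the lowest average score across all turns."""
    averages = {c: mean(t[c] for t in turns) for c in turns[0]}
    return min(averages, key=averages.get)


def weakest_turn(turns):
    """Return the 1-based number of the turn with the lowest average score."""
    averages = [mean(t.values()) for t in turns]
    return averages.index(min(averages)) + 1


print(weakest_criterion(turn_scores))
print(weakest_turn(turn_scores))
```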

View Output Evaluation results

The Summary card on the Output Evaluation sub-tab displays the following:

  • Total Executions for this version – The total number of runs for the current agent version.
  • Avg. Output Evaluation Score – The average score across evaluated runs.
  • Passed – The count of runs that met or exceeded the baseline.
  • Failed – The count of runs that fell below the baseline.
  • Not Evaluated – The count of runs Opal has not yet scored.

Screenshot of the Output Evaluation results summary.
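
The relationship between these values can be sketched in Python. This is an illustration under stated assumptions, not Opal's implementation: `summarize` and its dictionary keys are hypothetical names, the sample scores are made up, and `None` stands in for a run that has not been scored. As described above, the average covers evaluated runs only.

```python
from typing import List, Optional


def summarize(scores: List[Optional[float]], baseline: float) -> dict:
    """Compute Summary-card style values from a list of run scores."""
    evaluated = [s for s in scores if s is not None]
    passed = sum(1 for s in evaluated if s >= baseline)
    return {
        "total_executions": len(scores),
        # Average over evaluated runs only; None when nothing is scored yet.
        "avg_score": sum(evaluated) / len(evaluated) if evaluated else None,
        "passed": passed,
        "failed": len(evaluated) - passed,
        "not_evaluated": len(scores) - len(evaluated),
    }


# Hypothetical run scores; None marks a run that is not yet evaluated.
summary = summarize([90.0, 80.0, None, 100.0], baseline=85.0)
print(summary)
```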

To see individual run scores, go to the Logs tab. Each run row in the Logs table shows its Evaluation Score and Status. Click a run row to open the execution details panel. The panel displays the Evaluation Score alongside the Guardrail Status.

Related articles

If you use Opti ID, administrators can turn off generative AI in the Opti ID Admin Center. See Turn generative AI off across Optimizely applications.