Preferred output examples

Preferred output examples are a core component of AI Evaluations (evals) in Optimizely Opal, letting you establish a benchmark for the quality of content generated by specialized agents. By designating preferred output examples, you provide Opal with a clear standard against which subsequent agent executions can be measured, leading to an automated quality score.

What is a preferred output example?

A preferred output example is a specific, high-quality output from a specialized agent execution that you select as a "good" output. This chosen output serves as the gold standard or reference point for a particular agent. It embodies the desired characteristics, tone, accuracy, and overall quality you expect from your agent. The preferred output example should also reflect the desired tool call pattern.

Think of it as providing Opal with an exemplary answer to a question or a perfect example of a generated asset. Opal then uses this example to understand what "good" looks like for that specific agent and task.

Purpose and benefits

The primary purposes of preferred output examples are to

  • Establish a quality benchmark – Define the desired quality level for agent outputs.
  • Automate quality assessment – Let Opal automatically score future outputs against this benchmark.
  • Monitor performance over time – Track how well an agent consistently meets quality expectations.
  • Reduce manual review – Streamline the content creation process by reducing the need for extensive manual checks.

How to create a preferred output example

Creating a preferred output example involves a simple process within the specialized agent workflow.

  1. Execute a specialized agent – Run your specialized agent with a specific prompt or task.
  2. Review the output – Examine the generated output to ensure it meets your quality standards.
  3. Select as preferred output example – If the output is satisfactory, add it as a preferred output. Once designated, this preferred output becomes the reference point for all future evaluations of that agent's performance on similar tasks. To designate an output, complete the steps in the following sections:

To make your preferred output examples most valuable, you should

  • Add at least three varied preferred output examples.
  • Use real use cases as examples.

Add preferred output examples manually

After running your specialized agents a few times, you can copy and paste "good" output as preferred outputs to help Opal evaluate its responses. You can add up to five examples. A hypothetical sketch of what each example captures follows these steps.

  1. Go to Opal > Agents.
  2. Click your specialized agent, or click More (...) > Edit Agent.
  3. Click Add example.
  4. Enter a Name for the example.
  5. Enter example Input Variables that correspond with this example output. Variables marked as Required? in step five of the Define input variables (optional) section are required here as well.
  6. Paste the output from the agent run in the Output Example field. 
  7. Click Add.
  8. (Optional) Repeat steps three through seven to add up to five examples.
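
Conceptually, each example you add pairs a name and a set of input variables with the pasted output. The following Python sketch shows that shape; the field names are illustrative only, not Opal's actual schema.

```python
# Hypothetical sketch of what one manually added example captures.
# Field names are illustrative, not Opal's actual schema.
preferred_output_example = {
    "name": "Spring sale email",          # step 4: a label for the example
    "input_variables": {                  # step 5: must cover variables marked Required?
        "product": "running shoes",
        "audience": "returning customers",
    },
    "output_example": (                   # step 6: pasted verbatim from a good agent run
        "Subject: Lace up for spring savings!\n\n"
        "Hi there, our best-selling running shoes are 20% off this week..."
    ),
}
```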

Set agent execution as preferred output

  1. Go to Opal > Agents.
  2. Click your specialized agent, or click More (...) > Edit Agent.
  3. Select the Logs tab. 
  4. Click More (...) for a particular execution and select Link as Output Eval.

The quality score

After you set the preferred output examples, every subsequent execution of that specialized agent is automatically compared against them. Opal then generates a quality score for each new output.

  • Comparison mechanism – Opal's internal evaluation mechanisms analyze various aspects of the new output in relation to the preferred output examples (a toy stand-in for this step is sketched after this list).
  • Score generation – Opal produces a numerical score between 0 and 100, indicating the degree of alignment with the preferred output.
  • Pass or fail indication – Alongside the numerical score, a visual indicator quickly communicates whether the output meets a predefined threshold of quality.
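
Opal's comparison logic is model-based and not public, so the following sketch uses naive token overlap purely as a toy stand-in to make the 0-100 scoring flow concrete; it does not reflect how Opal actually scores outputs.

```python
# Toy stand-in for the comparison step. Opal's real evaluation is
# model-based and not public; token overlap is used here only to make
# the 0-100 score concrete.

def toy_quality_score(output: str, preferred_example: str) -> int:
    """Score 0-100 by the fraction of the example's tokens found in the output."""
    out_tokens = set(output.lower().split())
    ref_tokens = set(preferred_example.lower().split())
    if not ref_tokens:
        return 0
    return round(100 * len(out_tokens & ref_tokens) / len(ref_tokens))

score = toy_quality_score(
    "Subject: Spring savings on running shoes",
    "Subject: Lace up for spring savings on running shoes",
)
print(score)  # 67 with this toy metric
```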

Quality score criteria

Opal compares each agent's output to your preferred output examples and rates how well it performed by assessing the following factors:

  • Completeness and scope
    • What it means – The agent's response addresses all necessary components of your request and provides the required depth of information.
    • Why it matters – It ensures your queries are fully answered without missing critical details.
  • Structural and format consistency
    • What it means – The output strictly adheres to expected formatting, markdown, punctuation, and structural conventions (for example, proper use of headings, lists, and code blocks).
    • Why it matters – It makes the information easy to read, understand, and integrate into your workflow.
  • Overall usefulness and adherence
    • What it means – The response effectively achieves the implied objective of your task and provides practical utility.
    • Why it matters – It ensures the output is actionable and helpful in achieving your goals.

Each agent output is assigned a percentage score from 0 to 100 based on how well it meets the preceding criteria. This rubric helps categorize performance and identify areas for improvement (a minimal score-to-band lookup is sketched after this list).

  • 90-100 (Excellent or Exceptional) – The output is an excellent match, meeting or exceeding all explicit and implied expectations.
  • 80-89 (Good or Strong) – The output is a strong match with only minor, non-critical deviations (for instance, slight format inconsistencies or negligible omissions).
  • 70-79 (Adequate or Acceptable) – The output is generally acceptable but has some notable issues (for example, a few incorrect data points, noticeable format drift, or minor missing sections).
  • 60-69 (Needs Improvement or Below Standard) – The output is below expectations, containing significant structural or factual differences that make it difficult to use or inconsistent with the intended purpose.
  • 0-59 (Poor or Unacceptable) – The output is a poor match, exhibiting major errors, failing to follow fundamental content requirements, or completely missing the intended objective.
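
The bands above translate directly into a simple lookup; a minimal sketch:

```python
# Minimal lookup from a 0-100 quality score to the rubric bands above.

def rubric_band(score: int) -> str:
    """Return the rubric category for a 0-100 quality score."""
    if score >= 90:
        return "Excellent or Exceptional"
    if score >= 80:
        return "Good or Strong"
    if score >= 70:
        return "Adequate or Acceptable"
    if score >= 60:
        return "Needs Improvement or Below Standard"
    return "Poor or Unacceptable"

print(rubric_band(84))  # Good or Strong
```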

Baseline quality

The baseline quality score serves as the lowest acceptable quality percentage you expect the specialized agent output to achieve. You define this threshold.

When Opal evaluates an agent's output, it compares the quality score against this baseline quality score. If the quality score meets or exceeds the baseline quality score, Opal marks the output as Passed. If it falls below, Opal marks it as Failed. The agent's execution logs display the pass or fail status, quality score, and evaluation feedback, providing immediate feedback on performance.
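
The check itself reduces to a single comparison; a minimal sketch:

```python
# The pass/fail rule described above: an output passes when its quality
# score meets or exceeds the baseline you define.

def evaluation_status(quality_score: int, baseline_quality: int) -> str:
    """Return the evaluation status per the baseline quality threshold."""
    return "Passed" if quality_score >= baseline_quality else "Failed"

print(evaluation_status(82, 80))  # Passed
print(evaluation_status(74, 80))  # Failed
```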

Leverage output evals 

Output evals give you critical insights to optimize your specialized agents.

  • Identify underperforming outputs – Quickly spot outputs that receive low quality scores or fail the evaluation.
  • Analyze discrepancies – Investigate why a particular output did not meet the preferred output's standard. This could reveal issues with the prompt, the agent's configuration, or the underlying AI model.
  • Refine prompts – Adjust the agent's input prompt to be more specific, provide better context, or include clearer instructions to guide Opal towards the desired output.
  • Update preferred outputs – Update the preferred output to set a new, higher benchmark as your needs evolve or if you discover an even better example.
  • Iterative improvement – Use the quality scores and baseline quality to track the effectiveness of your prompt engineering and agent refinement efforts, creating a continuous loop of improvement.

Best practices

  • Choose wisely – Select preferred outputs that truly represent the ideal outcome. They should be comprehensive, accurate, and align with all relevant guidelines.
  • Set a realistic baseline – Define a baseline quality score that is achievable but also ensures a meaningful level of quality. Adjust it as your agent improves.
  • Be specific in prompts – The more precise your initial prompt to the agent, the better its chances of generating an output that can serve as a strong preferred output.
  • Review regularly – Periodically review your preferred outputs to ensure they remain relevant and reflect current quality expectations.
  • Combine with human review – Use human review for nuanced judgment and strategic adjustments.

If you use Opti ID, administrators can turn off generative AI in the Opti ID Admin Center. See Turn generative AI off across Optimizely applications.