# Auto-Evaluations

> A judge LLM asynchronously scores agent responses for retrieval quality, faithfulness, and response quality. This page covers how to enable auto-evaluation and how to read, filter, and export the results.

<Info>Admin › Evaluation › Auto-Evaluations</Info>

When auto-evaluation is enabled on an agent, a **judge LLM** asynchronously scores a sampled share of the agent's responses and records each result.

<Frame caption="Auto-evaluation results screen — Score Trend chart, filters, results table">
  <img src="https://mintcdn.com/cloocus/Nim6rqpdwJuim_F0/images/monitoring/evaluations-auto-top.png?fit=max&auto=format&n=Nim6rqpdwJuim_F0&q=85&s=52c5442435b28eb0eff17d98a8da7282" alt="Auto-evaluation results screen — Score Trend chart, filters, results table" width="2880" height="1800" data-path="images/monitoring/evaluations-auto-top.png" />
</Frame>

<Note>
  Auto-evaluation is a licensed feature. It requires a license with the `evaluation` feature enabled.
</Note>

***

## Evaluation Types

| Type                  | Description                                                                                 |
| --------------------- | ------------------------------------------------------------------------------------------- |
| **Retrieval Quality** | Evaluates whether retrieved documents are relevant to the question                          |
| **Faithfulness**      | Evaluates whether the answer is grounded in the retrieved content (hallucination detection) |
| **Response Quality**  | Evaluates overall usefulness and accuracy                                                   |

***

## Evaluation Process

```mermaid theme={null}
flowchart LR
    A[Agent response] --> B{Auto-evaluation enabled?}
    B -->|Yes| C[Sampling]
    B -->|No| D[End]
    C --> E[Pass to judge LLM]
    E --> F[Generate score + reasoning]
    F --> G[Save result]
    G --> H[Reflect to dashboard]
```
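
The flow above can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation: `maybe_evaluate`, `should_sample`, and the `judge` callable are hypothetical names, and the sketch runs synchronously for readability even though the real evaluation is asynchronous.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class EvaluationResult:
    score: float     # 0.0 to 1.0, where 1.0 is best
    reasoning: str   # the judge's explanation of the score


def should_sample(sampling_rate_percent: float, rng: random.Random) -> bool:
    """Return True for roughly `sampling_rate_percent` percent of calls."""
    return rng.random() * 100 < sampling_rate_percent


def maybe_evaluate(
    response_text: str,
    enabled: bool,
    sampling_rate_percent: float,
    judge: Callable[[str], Tuple[float, str]],  # hypothetical judge LLM call
    rng: random.Random,
) -> Optional[EvaluationResult]:
    """Mirror the flowchart: enabled check, sampling, then judge scoring."""
    if not enabled:
        return None  # auto-evaluation disabled: end
    if not should_sample(sampling_rate_percent, rng):
        return None  # response not selected by sampling
    score, reasoning = judge(response_text)
    return EvaluationResult(score=score, reasoning=reasoning)
```

With a 10% sampling rate, roughly one response in ten reaches the judge. The others produce no evaluation record.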

***

## Enabling Auto-Evaluation (Activate on the Agent)

Auto-evaluation is enabled per agent. Results only start accumulating once it's turned on.

<Frame caption="Workspace > Agents > Auto-evaluation — judge model, sampling rate, evaluation type settings">
  <img src="https://mintcdn.com/cloocus/z12HbjPvLk3VcOGS/images/monitoring/evaluations-auto-enable.png?fit=max&auto=format&n=z12HbjPvLk3VcOGS&q=85&s=1b3d1d61df577300e71737b4e210eb69" alt="Agent auto-evaluation activation settings screen" width="902" height="738" data-path="images/monitoring/evaluations-auto-enable.png" />
</Frame>

<Steps>
  <Step title="Edit the agent">
    Open the target agent's edit screen in **Workspace > Agents**.
  </Step>

  <Step title="Activate auto-evaluation">
    Activate it in the **Auto-evaluation** section of the agent settings.

    | Setting              | Description                                            |
    | -------------------- | ------------------------------------------------------ |
    | **Enabled**          | Whether auto-evaluation is used                        |
    | **Sampling rate**    | Share of responses to evaluate (1–100%, default 10%)   |
    | **Judge model**      | LLM model used for evaluation                          |
    | **Evaluation types** | Select which evaluation types to enable                |

    **Sampling rate guidance:**

    | Situation           | Recommended | Reason                         |
    | ------------------- | :---------: | ------------------------------ |
    | New agent           |   50–100%   | Quickly assess initial quality |
    | After stabilization |    5–10%    | Cost saving + monitoring       |
    | Critical business   |    20–30%   | Quality assurance              |
  </Step>

  <Step title="Save">
    Once you save the agent, auto-evaluation runs on that agent's subsequent responses.
  </Step>
</Steps>
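
To get a feel for what a sampling rate means in volume, the arithmetic can be sketched as below. The function name and the traffic figure are illustrative assumptions. Only the 1–100% range comes from the settings table.

```python
def expected_sampled_responses(responses_per_day: int, sampling_rate_percent: float) -> float:
    """Estimate how many responses per day are passed to the judge LLM."""
    if not 1 <= sampling_rate_percent <= 100:
        raise ValueError("sampling rate must be between 1 and 100 percent")
    return responses_per_day * sampling_rate_percent / 100


# Hypothetical agent handling 2,000 responses per day
for rate in (10, 30, 100):
    print(rate, expected_sampled_responses(2000, rate))
```

Judge-model cost grows in proportion to this number, which is why a lower rate is suggested once an agent has stabilized.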

<Tip>
  Use a judge model that is at least as capable as the model being evaluated. For example, judging GPT-4o responses with GPT-4o-mini may reduce evaluation accuracy.
</Tip>

***

## Evaluation Result Fields

<Accordion title="Evaluation result fields (8)" icon="list">
  | Field               | Description                         |
  | ------------------- | ----------------------------------- |
  | **Chat/Message ID** | Evaluated message                   |
  | **Model ID**        | Model that generated the response   |
  | **Judge Model ID**  | LLM used for evaluation             |
  | **Evaluation type** | retrieval, faithfulness, quality    |
  | **Score**           | 0.0–1.0 (1.0 is best)               |
  | **Reasoning**       | LLM's explanation of the score      |
  | **Status**          | pending, completed, failed          |
  | **Error message**   | Error content on evaluation failure |
</Accordion>
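
If you process results in your own scripts, the fields above could be modeled roughly as follows. This is a sketch under assumptions: the class name and attribute names are hypothetical, while the allowed evaluation types, statuses, and score range are taken from the table.

```python
from dataclasses import dataclass
from typing import Optional

EVALUATION_TYPES = {"retrieval", "faithfulness", "quality"}
STATUSES = {"pending", "completed", "failed"}


@dataclass
class AutoEvaluationRecord:
    chat_id: str
    message_id: str
    model_id: str          # model that generated the response
    judge_model_id: str    # LLM used for evaluation
    evaluation_type: str   # one of EVALUATION_TYPES
    status: str            # one of STATUSES
    score: Optional[float] = None          # 0.0 to 1.0, where 1.0 is best
    reasoning: Optional[str] = None        # judge's explanation of the score
    error_message: Optional[str] = None    # populated when the evaluation failed

    def __post_init__(self) -> None:
        # Reject values that fall outside what the result fields allow.
        if self.evaluation_type not in EVALUATION_TYPES:
            raise ValueError(f"unknown evaluation type: {self.evaluation_type}")
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status}")
        if self.score is not None and not 0.0 <= self.score <= 1.0:
            raise ValueError("score must be between 0.0 and 1.0")
```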

***

## Score Trend Chart

A line chart shows how the average score changes over time. Use the toggle at the top right of the chart to change the aggregation unit.

| Aggregation Unit              | Description                                                               |
| ----------------------------- | ------------------------------------------------------------------------- |
| **Hour / Day / Week / Month** | Aggregate average scores by hour, day, week, or month                     |
| **Auto**                      | Automatically determine the aggregation unit based on the selected period |

To narrow down by model or evaluation type, use the top filters (model / type).
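
The aggregation units can be reproduced on exported data with a short script. The sketch below is an assumption about how such bucketing might work, not the chart's actual logic: `bucket_start` and `average_by_bucket` are hypothetical helpers, and weeks are assumed to start on Monday.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, Iterable, Tuple


def bucket_start(ts: datetime, unit: str) -> datetime:
    """Truncate a timestamp to the start of its hour, day, week, or month."""
    if unit == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if unit == "day":
        return day
    if unit == "week":
        return day - timedelta(days=day.weekday())  # assumes Monday-based weeks
    if unit == "month":
        return day.replace(day=1)
    raise ValueError(f"unsupported unit: {unit}")


def average_by_bucket(
    rows: Iterable[Tuple[datetime, float]], unit: str
) -> Dict[datetime, float]:
    """Average (timestamp, score) pairs per bucket, ordered by bucket start."""
    buckets = defaultdict(list)
    for ts, score in rows:
        buckets[bucket_start(ts, unit)].append(score)
    return {start: mean(scores) for start, scores in sorted(buckets.items())}
```

For example, two scores of 0.5 and 1.0 recorded on the same day collapse into a single daily point of 0.75.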

***

## Filter Options

Filter using the dropdowns at the top of the results.

| Filter              | Description                                       |
| ------------------- | ------------------------------------------------- |
| **Date range**      | Select the evaluation period (e.g., last 7 days)  |
| **Model**           | Filter by a specific model                        |
| **Evaluation type** | Retrieval Quality, Faithfulness, Response Quality |
| **Status**          | All / Completed / Processing / Failed             |

***

## Auto-Evaluation Statistics

Summary cards appear at the top of the results screen.

| Card              | Description                                  |
| ----------------- | -------------------------------------------- |
| **Total**         | Total number of auto-evaluations             |
| **Completed**     | Number of successfully completed evaluations |
| **Processing**    | Number of evaluations still processing       |
| **Average score** | Overall average score (%)                    |
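
The cards can be approximated from a list of results as sketched below. Two assumptions are baked in and may not match the product: the **Processing** card is taken to count results whose status is `pending`, and the average is taken to be the mean of completed scores multiplied by 100. The function name is hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple


def summarize(rows: List[Tuple[str, Optional[float]]]) -> Dict[str, float]:
    """Compute summary-card style totals from (status, score) pairs."""
    completed_scores = [
        score for status, score in rows
        if status == "completed" and score is not None
    ]
    return {
        "total": len(rows),
        "completed": sum(1 for status, _ in rows if status == "completed"),
        # Assumption: the Processing card corresponds to the pending status.
        "processing": sum(1 for status, _ in rows if status == "pending"),
        # Assumption: scores of 0.0 to 1.0 are shown as a percentage.
        "average_score_percent": (
            round(mean(completed_scores) * 100, 1) if completed_scores else 0.0
        ),
    }
```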

***

## Export

Use the **CSV** button at the top right of the results screen to export the evaluation results matching the current filter as CSV (id, chat\_id, message\_id, model\_id, score, reasoning, etc.).
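
Once exported, the file can be analyzed with Python's standard `csv` module. The sketch below averages scores per model. The column names come from the list above, but the sample rows, the column order, and the handling of empty scores are assumptions for illustration.

```python
import csv
import io
from collections import defaultdict
from statistics import mean
from typing import Dict


def average_score_by_model(csv_text: str) -> Dict[str, float]:
    """Average the score column per model_id, skipping rows without a score."""
    scores = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = (row.get("score") or "").strip()
        if raw:  # assumption: unscored rows leave the score column empty
            scores[row["model_id"]].append(float(raw))
    return {model: mean(values) for model, values in scores.items()}


# Illustrative sample data, not real export output
SAMPLE = """id,chat_id,message_id,model_id,score,reasoning
1,c1,m1,model-a,0.5,Partially grounded
2,c1,m2,model-a,1.0,Fully grounded
3,c2,m3,model-b,0.5,Missing context
4,c2,m4,model-b,,
"""

print(average_score_by_model(SAMPLE))
```

To read a real export, pass the file contents, for example `average_score_by_model(open(path, encoding="utf-8").read())`, where `path` points to the downloaded CSV.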

***

## Use Cases

<Accordion title="Monitoring Response Quality" icon="chart-line">
  1. Check daily/weekly score trends in the Score Trend chart
  2. When a specific model's score drops, check that period's traces
  3. Click a low-scoring individual evaluation to review its reasoning
  4. Adjust prompt, Knowledge Base, and tool settings
</Accordion>

***

## Troubleshooting

<Accordion title="What if auto-evaluation fails (failed)?" icon="triangle-exclamation">
  When an evaluation ends in the `failed` status:

  * **Check the error message**: Review the error message for that item in the results table.
  * **Common causes**: Judge model API errors, timeouts, or exceeded token limits.
  * **Re-run**: Failed evaluations are not currently re-run automatically. Re-enabling auto-evaluation in the agent settings applies only to subsequent responses.
</Accordion>
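
When many evaluations fail, grouping them by error message helps show whether one cause dominates. The following is a small sketch over rows represented as dictionaries. The `status` and `error_message` keys and the sample messages are assumptions, not a documented schema.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def failure_causes(rows: Iterable[Dict[str, Optional[str]]]) -> Counter:
    """Count failed evaluations per error message."""
    return Counter(
        (row.get("error_message") or "unknown")
        for row in rows
        if row.get("status") == "failed"
    )


# Illustrative rows only
rows = [
    {"status": "failed", "error_message": "timeout"},
    {"status": "failed", "error_message": "timeout"},
    {"status": "failed", "error_message": None},
    {"status": "completed", "error_message": None},
]
print(failure_causes(rows).most_common())
```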

***

## Related Pages

<Columns cols={3}>
  <Card title="Evaluation" icon="star" href="/en/monitoring/evaluations">
    Full overview of evaluation — manual feedback, Arena, Leaderboard, and more
  </Card>

  <Card title="Tracing" icon="route" href="/en/monitoring/tracing">
    Trace the cause of low evaluation scores
  </Card>

  <Card title="Agent Settings" icon="robot" href="/en/workspace/agents">
    Configure auto-evaluation on an agent
  </Card>
</Columns>
