Auto-Evaluations

Auto-evaluation is a licensed feature. It requires a license with the evaluation feature enabled.

Evaluation Types

| Type | Description |
|---|---|
| Retrieval Quality | Evaluates whether retrieved documents are relevant to the question |
| Faithfulness | Evaluates whether the answer is grounded in the retrieved content (hallucination detection) |
| Response Quality | Evaluates overall usefulness and accuracy |

Evaluation Process
Enabling Auto-Evaluation (Activate on the Agent)
Auto-evaluation is enabled per agent. Results only start accumulating once it’s turned on.
Activate auto-evaluation
Activate it in the Auto-evaluation section of the agent settings.

| Setting | Description |
|---|---|
| Enabled | Whether auto-evaluation is turned on for the agent |
| Sampling rate | Share of responses to evaluate (1–100%, default 10%) |
| Judge model | LLM used for evaluation |
| Evaluation types | Which evaluation types to run |

Sampling rate guidance:

| Situation | Recommended | Reason |
|---|---|---|
| New agent | 50–100% | Quickly assess initial quality |
| After stabilization | 5–10% | Cost saving + monitoring |
| Critical business | 20–30% | Quality assurance |
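
To make the cost side of this guidance concrete, the sketch below estimates how many judge-model calls a given sampling rate would produce. It assumes one judge call per sampled response per enabled evaluation type, which this page does not confirm. The function name `estimated_judge_calls` and the traffic figures are hypothetical.

```python
def estimated_judge_calls(responses_per_day: int,
                          sampling_rate_pct: float,
                          enabled_types: int = 3) -> float:
    """Expected judge-model calls per day.

    Assumption (not confirmed by this page): every sampled response
    triggers one judge call per enabled evaluation type.
    """
    if not 1 <= sampling_rate_pct <= 100:
        raise ValueError("sampling rate must be between 1 and 100 percent")
    return responses_per_day * (sampling_rate_pct / 100) * enabled_types


# Hypothetical agent handling 2,000 responses per day, all three types enabled.
for label, rate in [("New agent", 100),
                    ("Critical business", 20),
                    ("After stabilization", 10)]:
    calls = estimated_judge_calls(2000, rate)
    print(f"{label}: ~{calls:.0f} judge calls/day")
```

Under these assumptions, dropping from 100% to 10% after stabilization cuts judge-model usage by a factor of ten while still producing a steady stream of scores for monitoring.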

Evaluation Result Fields
Evaluation result fields (8)

| Field | Description |
|---|---|
| Chat/Message ID | Evaluated message |
| Model ID | Model that generated the response |
| Judge Model ID | LLM used for evaluation |
| Evaluation type | retrieval, faithfulness, quality |
| Score | 0.0–1.0 (1.0 is best) |
| Reasoning | LLM’s explanation of the score |
| Status | pending, completed, failed |
| Error message | Error content on evaluation failure |
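
For readers who process results in their own tooling, the fields above could be modeled roughly as follows. This is a minimal sketch, not the product's schema: the class name and attribute names are hypothetical, the Chat/Message ID row is split into two attributes, and it assumes the score may be absent while an evaluation is pending or failed.

```python
from dataclasses import dataclass
from typing import Optional

# Values taken from the table above.
VALID_TYPES = {"retrieval", "faithfulness", "quality"}
VALID_STATUSES = {"pending", "completed", "failed"}


@dataclass
class EvaluationResult:
    """Hypothetical in-memory model of one auto-evaluation result."""
    chat_id: str
    message_id: str
    model_id: str            # model that generated the response
    judge_model_id: str      # LLM used for evaluation
    evaluation_type: str
    status: str
    score: Optional[float] = None        # assumed empty until completed
    reasoning: Optional[str] = None
    error_message: Optional[str] = None  # set when the evaluation failed

    def __post_init__(self) -> None:
        if self.evaluation_type not in VALID_TYPES:
            raise ValueError(f"unknown evaluation type: {self.evaluation_type}")
        if self.status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {self.status}")
        if self.score is not None and not 0.0 <= self.score <= 1.0:
            raise ValueError("score must be between 0.0 and 1.0")


# Example record with made-up identifiers.
result = EvaluationResult(
    chat_id="chat-1", message_id="msg-1", model_id="model-a",
    judge_model_id="judge-1", evaluation_type="faithfulness",
    status="completed", score=0.85,
    reasoning="Answer is grounded in the retrieved content",
)
print(result.evaluation_type, result.score)
```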

Score Trend Chart
Shows the trend of average scores by date as a line chart. Use the toggle at the top right of the chart to change the aggregation unit.

| Aggregation Unit | Description |
|---|---|
| Hour / Day / Week / Month | Aggregate average scores by hour, day, week, or month |
| Auto | Automatically determine the aggregation unit based on the selected period |
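
The fixed aggregation units can be sketched as bucketing each score by the start of its hour, day, week, or month and averaging per bucket. The code below is only an illustration of that idea, not the chart's implementation: the function names are hypothetical, weeks are assumed to start on Monday, and Auto is left out because this page does not describe how it picks a unit.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def bucket_start(ts: datetime, unit: str) -> datetime:
    """Return the start of the hour/day/week/month containing ts."""
    if unit == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if unit == "day":
        return day
    if unit == "week":
        # Assumption: weeks start on Monday.
        return day - timedelta(days=day.weekday())
    if unit == "month":
        return day.replace(day=1)
    raise ValueError(f"unsupported unit: {unit}")


def average_by_unit(points, unit):
    """Average (timestamp, score) pairs per bucket, ordered by time."""
    buckets = defaultdict(list)
    for ts, score in points:
        buckets[bucket_start(ts, unit)].append(score)
    return {start: mean(scores) for start, scores in sorted(buckets.items())}


# Made-up scores for illustration.
points = [
    (datetime(2024, 1, 1, 10, 0), 0.8),
    (datetime(2024, 1, 1, 15, 30), 0.6),
    (datetime(2024, 1, 3, 9, 0), 0.4),
]
for start, avg in average_by_unit(points, "day").items():
    print(start.date(), round(avg, 2))
```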

Filter Options
Filter using the dropdowns at the top of the results.

| Filter | Description |
|---|---|
| Date range | Select the evaluation period (e.g., last 7 days) |
| Model | Filter by a specific model |
| Evaluation type | Retrieval Quality, Faithfulness, Response Quality |
| Status | All / Completed / Processing / Failed |

Auto-Evaluation Statistics
Summary cards appear at the top of the results screen.

| Card | Description |
|---|---|
| Total | Total number of auto-evaluations |
| Completed | Number of successfully completed evaluations |
| Processing | Number of evaluations still processing |
| Average score | Overall average score (%) |

Export
Use the CSV button at the top right of the results screen to export the evaluation results matching the current filter as CSV (id, chat_id, message_id, model_id, score, reasoning, etc.).

Use Cases
Monitoring Response Quality
- Check daily/weekly score trends in the Score Trend chart
- When a specific model’s score drops, check that period’s traces
- Click a low-scoring individual evaluation to review its reasoning
- Adjust prompt, Knowledge Base, and tool settings
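
Part of this review can also be done offline on a CSV export. The sketch below averages scores per model and lists low-scoring rows with their reasoning. It relies only on the `model_id`, `message_id`, `score`, and `reasoning` columns named in the Export section; the sample rows, the helper names, and the 0.5 threshold are made up for illustration, and rows without a numeric score are skipped on the assumption that they did not complete.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def load_rows(csv_file):
    """Read exported rows, skipping those without a numeric score."""
    rows = []
    for row in csv.DictReader(csv_file):
        try:
            row["score"] = float(row["score"])
        except (KeyError, TypeError, ValueError):
            continue  # assumed to be an evaluation that did not complete
        rows.append(row)
    return rows


def average_by_model(rows):
    """Mean score per model_id."""
    by_model = defaultdict(list)
    for row in rows:
        by_model[row["model_id"]].append(row["score"])
    return {model: mean(scores) for model, scores in by_model.items()}


def low_scoring(rows, threshold=0.5):
    """Rows below the (hypothetical) threshold, worst first."""
    return sorted((r for r in rows if r["score"] < threshold),
                  key=lambda r: r["score"])


# Made-up sample standing in for an exported file.
SAMPLE = """id,chat_id,message_id,model_id,score,reasoning
1,c1,m1,model-a,0.9,Answer is grounded in the retrieved content
2,c1,m2,model-a,0.3,Answer cites facts missing from the documents
3,c2,m3,model-b,0.7,Mostly relevant documents retrieved
4,c2,m4,model-b,,
"""

rows = load_rows(io.StringIO(SAMPLE))
print(average_by_model(rows))
for row in low_scoring(rows):
    print(row["message_id"], row["score"], row["reasoning"])
```

To run this against a real export, replace `io.StringIO(SAMPLE)` with a file opened via `open(path, newline="", encoding="utf-8")`.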

Troubleshooting
What if auto-evaluation fails (failed)?
When auto-evaluation is in the failed state:
- Check the error message: Review the error content for that item in the results table
- Common causes: Judge model API errors, timeouts, token limit exceeded
- Re-run: Automatic re-run is not currently supported. Re-enabling auto-evaluation in the agent settings resumes evaluation from subsequent responses.

Related Pages
- Evaluation: Full overview of evaluation, including manual feedback, Arena, Leaderboard, and more
- Tracing: Trace the cause of low evaluation scores
- Agent Settings: Configure auto-evaluation on an agent
