Auto-Evaluations
When auto-evaluation is enabled on an agent, a judge LLM asynchronously evaluates the quality of the agent’s responses (sampled at the configured rate) and records a result for each evaluation.
[Screenshot: auto-evaluation results screen showing the Score Trend chart, filters, and results table]
Auto-evaluation is a licensed feature: it requires a license with the evaluation feature enabled.

Evaluation Types

| Type | Description |
| --- | --- |
| Retrieval Quality | Evaluates whether the retrieved documents are relevant to the question |
| Faithfulness | Evaluates whether the answer is grounded in the retrieved content (hallucination detection) |
| Response Quality | Evaluates the overall usefulness and accuracy of the response |

Evaluation Process
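
As a rough illustration of the flow described at the top of this page (a response is sampled, a judge LLM scores it asynchronously, and the result is recorded), the sketch below may help. Every function name here is hypothetical, the judge is a placeholder rather than a real LLM call, and running one judge call per evaluation type is an assumption, not documented behavior.

```python
import asyncio
import random


async def placeholder_judge(question, answer, evaluation_type):
    """Hypothetical stand-in for a judge LLM call; returns (score, reasoning)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return 0.8, f"placeholder reasoning for {evaluation_type}"


async def evaluate_response(message_id, question, answer, evaluation_types,
                            judge=placeholder_judge):
    """Assumption: one judge call per evaluation type, one record per call."""
    results = []
    for evaluation_type in evaluation_types:
        record = {"message_id": message_id, "evaluation_type": evaluation_type,
                  "status": "pending", "score": None, "reasoning": None,
                  "error_message": None}
        try:
            score, reasoning = await judge(question, answer, evaluation_type)
            record.update(status="completed", score=score, reasoning=reasoning)
        except Exception as exc:  # e.g. API error, timeout, token limit
            record.update(status="failed", error_message=str(exc))
        results.append(record)
    return results


async def respond_and_maybe_evaluate(message_id, question, answer,
                                     sampling_rate_percent, evaluation_types):
    """Return the answer right away; evaluation runs as a background task."""
    task = None
    if random.random() * 100 < sampling_rate_percent:
        task = asyncio.create_task(
            evaluate_response(message_id, question, answer, evaluation_types))
    return answer, task
```

The point of the design is that the user-facing answer never waits on the judge: the evaluation task is only awaited (or stored) later, and a judge failure produces a `failed` record instead of an error in the chat.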


Enabling Auto-Evaluation (Activate on the Agent)

Auto-evaluation is enabled per agent. Results only start accumulating once it’s turned on.
[Screenshot: agent auto-evaluation activation settings screen]
  1. Edit the agent

Open the target agent’s edit screen in Workspace > Agents.

  2. Activate auto-evaluation

Activate it in the Auto-evaluation section of the agent settings.

| Setting | Description |
| --- | --- |
| Enabled | Whether auto-evaluation is used |
| Sampling rate | Share of responses to evaluate (1–100%, default 10%) |
| Judge model | LLM used for evaluation |
| Evaluation types | Select which evaluation types to enable |

Sampling rate guidance:

| Situation | Recommended | Reason |
| --- | --- | --- |
| New agent | 50–100% | Quickly assess initial quality |
| After stabilization | 5–10% | Cost saving + monitoring |
| Critical business | 20–30% | Quality assurance |

  3. Save

Once you save the agent, auto-evaluation runs on that agent’s subsequent responses.
Use a judge model that is at least as capable as the model being evaluated. For example, judging GPT-4o responses with GPT-4o-mini may reduce evaluation accuracy.
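
To make the settings above concrete, here is a minimal sketch of how they might be modeled and how a sampling rate translates into evaluation volume. The class, field, and function names are hypothetical and are not the product's actual configuration schema; only the 1–100% range and the 10% default come from the table above.

```python
import random
from dataclasses import dataclass, field


@dataclass
class AutoEvalSettings:
    """Hypothetical model of the per-agent auto-evaluation settings."""
    enabled: bool = False
    sampling_rate_percent: int = 10  # documented range: 1-100, default 10
    judge_model: str = ""
    evaluation_types: list = field(default_factory=list)

    def __post_init__(self):
        if not 1 <= self.sampling_rate_percent <= 100:
            raise ValueError("sampling rate must be between 1 and 100 percent")


def should_evaluate(settings, rng=random):
    """Decide whether a single response is sampled for evaluation."""
    if not settings.enabled:
        return False
    return rng.random() * 100 < settings.sampling_rate_percent


def expected_evaluated_responses(responses_per_day, settings):
    """Average number of responses per day that would be sampled."""
    if not settings.enabled:
        return 0.0
    return responses_per_day * settings.sampling_rate_percent / 100
```

For instance, under these assumptions an agent handling 2,000 responses a day at the default 10% rate would have about 200 responses evaluated per day, which is the figure to weigh against judge model cost.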

Evaluation Result Fields

| Field | Description |
| --- | --- |
| Chat/Message ID | Evaluated message |
| Model ID | Model that generated the response |
| Judge Model ID | LLM used for evaluation |
| Evaluation type | retrieval, faithfulness, quality |
| Score | 0.0 to 1.0 (1.0 is best) |
| Reasoning | LLM’s explanation of the score |
| Status | pending, completed, failed |
| Error message | Error content on evaluation failure |
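
If you process results in your own tooling, the fields above could be represented roughly as follows. This is a sketch only: the attribute names are illustrative guesses based on the field list, not a documented schema, and the validation rules simply restate the allowed values given in the table.

```python
from dataclasses import dataclass
from typing import Optional

EVALUATION_TYPES = {"retrieval", "faithfulness", "quality"}
STATUSES = {"pending", "completed", "failed"}


@dataclass
class EvaluationResult:
    """Hypothetical record mirroring the documented result fields."""
    chat_id: str
    message_id: str
    model_id: str
    judge_model_id: str
    evaluation_type: str
    status: str
    score: Optional[float] = None          # 0.0 to 1.0, 1.0 is best
    reasoning: Optional[str] = None
    error_message: Optional[str] = None    # set when status is "failed"

    def validate(self) -> None:
        """Raise ValueError if a field falls outside the documented values."""
        if self.evaluation_type not in EVALUATION_TYPES:
            raise ValueError(f"unknown evaluation type: {self.evaluation_type}")
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status}")
        if self.score is not None and not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score out of range: {self.score}")
```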

Score Trend Chart

Shows the trend of average scores by date as a line chart. Use the toggle at the top right of the chart to change the aggregation unit.
| Aggregation Unit | Description |
| --- | --- |
| Hour / Day / Week / Month | Aggregates average scores by hour, day, week, or month |
| Auto | Automatically determines the aggregation unit based on the selected period |
To narrow down by model or evaluation type, use the top filters (model / type).
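
The aggregation behind such a chart can be approximated offline, for example on exported data. The sketch below groups `(timestamp, score)` pairs into hour, day, week, or month buckets and averages each bucket. The function names and bucket label formats are assumptions, and the rule the chart uses for Auto is not documented here, so it is left out.

```python
from collections import defaultdict
from datetime import datetime


def bucket_key(timestamp: datetime, unit: str) -> str:
    """Map a timestamp to a bucket label for the given aggregation unit."""
    if unit == "hour":
        return timestamp.strftime("%Y-%m-%d %H:00")
    if unit == "day":
        return timestamp.strftime("%Y-%m-%d")
    if unit == "week":
        iso = timestamp.isocalendar()  # (ISO year, ISO week, ISO weekday)
        return f"{iso[0]}-W{iso[1]:02d}"
    if unit == "month":
        return timestamp.strftime("%Y-%m")
    raise ValueError(f"unsupported aggregation unit: {unit}")


def average_by_bucket(rows, unit):
    """Average scores per bucket; rows is an iterable of (datetime, score)."""
    totals = defaultdict(lambda: [0.0, 0])  # bucket -> [sum, count]
    for timestamp, score in rows:
        entry = totals[bucket_key(timestamp, unit)]
        entry[0] += score
        entry[1] += 1
    return {key: total / count for key, (total, count) in sorted(totals.items())}
```

Calling `average_by_bucket(rows, "day")` on a few days of results yields one average per date, which corresponds to one point per day on a line chart.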

Filter Options

Filter using the dropdowns at the top of the results.
| Filter | Description |
| --- | --- |
| Date range | Select the evaluation period (e.g., last 7 days) |
| Model | Filter by a specific model |
| Evaluation type | Retrieval Quality, Faithfulness, Response Quality |
| Status | All / Completed / Processing / Failed |

Auto-Evaluation Statistics

Summary cards appear at the top of the results screen.
| Card | Description |
| --- | --- |
| Total | Total number of auto-evaluations |
| Completed | Number of successfully completed evaluations |
| Processing | Number of evaluations still processing |
| Average score | Overall average score (%) |
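
For reference, the four cards could be reproduced from a list of result records along these lines. This is a sketch under two assumptions that the page does not confirm: that the Processing card counts records whose status is `pending`, and that the average covers only completed evaluations. The function name and record shape are hypothetical.

```python
def summarize(evaluations):
    """Compute card-style statistics from a list of result dicts."""
    completed = [e for e in evaluations
                 if e["status"] == "completed" and e.get("score") is not None]
    # Assumption: "Processing" in the UI corresponds to the "pending" status.
    processing = sum(1 for e in evaluations if e["status"] == "pending")
    average_percent = None
    if completed:
        # Scores are 0.0-1.0; the card shows a percentage.
        average_percent = sum(e["score"] for e in completed) / len(completed) * 100
    return {
        "total": len(evaluations),
        "completed": len(completed),
        "processing": processing,
        "average_score_percent": average_percent,
    }
```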

Export

Use the CSV button at the top right of the results screen to export the evaluation results matching the current filter as CSV (id, chat_id, message_id, model_id, score, reasoning, etc.).
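
Once exported, the CSV can be screened with a few lines of standard-library Python. The sketch below assumes the column names listed above (`id`, `chat_id`, `message_id`, `model_id`, `score`, `reasoning`) and that `score` is exported on the 0.0 to 1.0 scale; the sample rows are made up for illustration, and the 0.5 threshold is an arbitrary choice.

```python
import csv
import io

# Made-up sample standing in for an exported file.
SAMPLE_CSV = """id,chat_id,message_id,model_id,score,reasoning
1,c1,m1,model-a,0.9,Answer is grounded in the retrieved documents
2,c1,m2,model-a,0.3,Answer cites facts not found in the retrieved documents
3,c2,m3,model-b,,
"""


def low_scoring_rows(csv_file, threshold=0.5):
    """Return rows whose score is below the threshold, skipping empty scores."""
    rows = []
    for row in csv.DictReader(csv_file):
        raw_score = (row.get("score") or "").strip()
        if not raw_score:
            continue  # e.g. pending or failed evaluations without a score
        if float(raw_score) < threshold:
            rows.append(row)
    return rows


if __name__ == "__main__":
    for row in low_scoring_rows(io.StringIO(SAMPLE_CSV)):
        print(row["id"], row["model_id"], row["score"], row["reasoning"])
```

With a real export, replace `io.StringIO(SAMPLE_CSV)` with `open("export.csv", newline="", encoding="utf-8")`, where the file name is whatever you saved the export as.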

Use Cases

  1. Check daily/weekly score trends in the Score Trend chart
  2. When a specific model’s score drops, check that period’s traces
  3. Click a low-scoring individual evaluation to review its reasoning
  4. Adjust prompt, Knowledge Base, and tool settings

Troubleshooting

When auto-evaluation is in the failed state:
  • Check the error message: Review the error content for that item in the results table
  • Common causes: Judge model API errors, timeouts, token limit exceeded
  • Re-run: Automatic re-run is not currently supported. Re-enabling auto-evaluation in the agent settings resumes evaluation from subsequent responses.

  • Evaluation: Full overview of evaluation (manual feedback, Arena, Leaderboard, and more)
  • Tracing: Trace the cause of low evaluation scores
  • Agent Settings: Configure auto-evaluation on an agent