The evaluation feature measures AI response quality in two ways: manual feedback and auto-evaluation. Combining direct user assessment with LLM-based automatic quality measurement provides a systematic quality management framework.

Manual Evaluation (Feedbacks tab)

A feature where users directly evaluate AI responses. View all feedback in Admin > Evaluations > Feedbacks tab.
Evaluations tab main

How Feedback is Collected

Collected via feedback buttons below AI responses on the chat screen.
| Feedback Type | Description |
| --- | --- |
| Like | When the response is useful and accurate |
| Dislike | When the response is inaccurate or not helpful |
| Comment | Additional free-text feedback |

Feedback Data

| Field | Description |
| --- | --- |
| User | User who left the feedback |
| Type | Like / Dislike |
| Model ID | Model that generated the response |
| Reason | Feedback reason |
| Comment | Detailed opinion |
| Created | Feedback creation time |

Feedback Management

| Feature | Description | Permission |
| --- | --- | --- |
| View all | View feedback from all users | Admin |
| Export | Export all feedback as JSON (see the analysis sketch below) | Admin |
| Delete all | Bulk delete all feedback | Admin |
| Per-item delete | Delete own feedback | Regular user |
Feedback detail
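If you want to analyze the export outside the admin UI, a minimal sketch like the one below can summarize like/dislike counts per model. The key names and values it reads (model_id, type, "like"/"dislike") are assumptions based on the fields listed above; check your actual export for the exact keys.

```python
import json
from collections import Counter

# Summarize exported feedback per model. The key names and type values used here
# ("model_id", "type", "like", "dislike") are assumptions; adjust to the actual export.
with open("feedback-export.json", encoding="utf-8") as f:
    feedback = json.load(f)

counts = Counter(
    (item.get("model_id"), str(item.get("type", "")).lower()) for item in feedback
)
models = sorted({model for model, _ in counts if model})

for model in models:
    likes = counts[(model, "like")]
    dislikes = counts[(model, "dislike")]
    rated = likes + dislikes
    positive = likes / rated if rated else 0.0
    print(f"{model}: {likes} likes, {dislikes} dislikes ({positive:.0%} positive)")
```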

Arena Evaluation

A feature for blind-comparing two model responses side by side.

Setup

Configure in Admin > Settings (gear) > Evaluations.
| Setting | Description |
| --- | --- |
| Enable Arena | Toggle Arena mode |
| Arena models | Compose model pairs to compare |
When Arena is enabled, two models’ responses appear anonymously side by side during chat, and users select the better one.

Leaderboard (Leaderboard tab)

In Admin > Evaluations > Leaderboard, see model rankings based on Arena blind-comparison results, calculated as Elo ratings. Each time a user picks the better response in Arena, the participating models' Elo scores are updated, providing an objective, real-usage-based ranking of model quality (a minimal rating-update sketch follows the table below).
Leaderboard tab
| Field | Description |
| --- | --- |
| Model | Evaluated model |
| Elo Rating | Score derived from Arena comparison results |
| Matches | Number of Arena comparisons |
| Win rate | Share of comparisons in which the model was selected |
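Elo updates work pairwise: the selected model gains rating and the unselected model loses it, with the size of the change depending on the expected outcome. Below is a minimal sketch of a single update; the K-factor and starting rating are illustrative assumptions, not necessarily the values the product uses.

```python
# Illustrative Elo update after one Arena comparison (the exact K-factor and
# initial rating used by the product are assumptions here).
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(winner: float, loser: float, k: float = 32.0) -> tuple[float, float]:
    """Return updated (winner, loser) ratings after a single comparison."""
    exp_win = expected_score(winner, loser)
    winner += k * (1.0 - exp_win)   # winner gains in proportion to how unexpected the win was
    loser -= k * (1.0 - exp_win)    # loser gives up the same amount
    return winner, loser

# Example: a user picks model A's response over model B's in an Arena comparison.
a, b = 1000.0, 1000.0               # assumed starting ratings
a, b = update_elo(a, b)
print(a, b)                         # 1016.0 984.0
```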

Auto-evaluation (Auto Evaluations tab)

When auto-evaluation is enabled on an agent, a judge LLM asynchronously evaluates the quality of sampled responses and records the results. View the results in the Admin > Evaluations > Auto Evaluations tab.
Auto-evaluation results screen
Auto-evaluation is a licensed feature and requires a license with the evaluation feature enabled.

Evaluation Types

| Type | Description |
| --- | --- |
| Retrieval Quality | Evaluates whether the retrieved documents are relevant to the question |
| Faithfulness | Evaluates whether the answer is grounded in the retrieved content (hallucination detection) |
| Response Quality | Evaluates the overall usefulness and accuracy of the response |
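As an illustration of how a judge LLM might be instructed for each type, the templates below sketch typical scoring instructions; they are not the product's actual prompts.

```python
# Illustrative judge instructions per evaluation type (not the product's actual
# prompts). Each asks the judge LLM for a score between 0.0 and 1.0.
JUDGE_PROMPTS = {
    "retrieval": (
        "Given the user question and the retrieved documents, rate from 0.0 to 1.0 "
        "how relevant the documents are to the question."
    ),
    "faithfulness": (
        "Given the retrieved documents and the answer, rate from 0.0 to 1.0 how well "
        "the answer is grounded in the documents (1.0 = no unsupported claims)."
    ),
    "quality": (
        "Given the user question and the answer, rate from 0.0 to 1.0 the overall "
        "usefulness and accuracy of the answer."
    ),
}
```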

Evaluation Process

When an agent with auto-evaluation enabled generates a response, the response is sampled according to the configured sampling rate, sent asynchronously to the judge model for each enabled evaluation type, and the resulting score, reasoning, and status are recorded.
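A minimal sketch of this flow is shown below, assuming a hypothetical call_judge helper that wraps the judge-model API call and field names taken from the result table that follows; the actual pipeline is internal to the product.

```python
import random

# Simplified sketch of the auto-evaluation flow (not the product's actual code).
# `call_judge` is a hypothetical helper standing in for the judge-model API call.
def maybe_evaluate(response, sampling_rate, judge_model, eval_types, call_judge):
    if random.random() > sampling_rate:      # only a sampled share of responses is evaluated
        return []
    records = []
    for eval_type in eval_types:             # e.g. ["retrieval", "faithfulness", "quality"]
        record = {
            "message_id": response["message_id"],
            "model_id": response["model_id"],
            "judge_model_id": judge_model,
            "evaluation_type": eval_type,
            "status": "pending",
        }
        try:
            score, reasoning = call_judge(judge_model, eval_type, response)
            record.update(score=score, reasoning=reasoning, status="completed")
        except Exception as exc:             # API error, timeout, token limit exceeded, ...
            record.update(status="failed", error_message=str(exc))
        records.append(record)
    return records
```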

Evaluation Result Fields

| Field | Description |
| --- | --- |
| Chat/Message ID | Evaluated message |
| Model ID | Model that generated the response |
| Judge Model ID | LLM used for the evaluation |
| Evaluation type | retrieval, faithfulness, quality |
| Score | 0.0–1.0 (1.0 is best) |
| Reasoning | The judge LLM's explanation of the score |
| Status | pending, completed, failed |
| Error message | Error details when the evaluation fails |

Score Trend Chart

Visualizes daily average score trends.
| Mode | Description |
| --- | --- |
| All types | One average-score line per model |
| Specific type | Detailed lines per model and type |

Filter Options

| Filter | Description |
| --- | --- |
| Date range | Pick the evaluation period |
| Model | Filter by a specific model |
| Evaluation type | Retrieval Quality, Faithfulness, Response Quality |
| Status | pending, completed, failed |
| Score range | Minimum/maximum score (0.0–1.0) |

Auto-evaluation Statistics

Provides summary statistics for all auto-evaluations.
| Metric | Description |
| --- | --- |
| Total | Total auto-evaluation count |
| Completed | Successfully completed evaluations |
| Pending | Evaluations still being processed |
| Failed | Evaluations that failed with errors |
| Average score | Overall average score |
| Per-model statistics | Count and average score per model |
| Per-type statistics | Count and average score per evaluation type |

Export

Export auto-evaluation data.
| Format | Description |
| --- | --- |
| CSV | For spreadsheet analysis (id, chat_id, message_id, model_id, score, reasoning, etc.) |
| JSON | Full data for programmatic integration |
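For quick offline analysis, the CSV export can be summarized per model, for example with pandas. This is a minimal sketch: it assumes the column names listed above (model_id, score) and an illustrative file name.

```python
import pandas as pd

# Summarize the auto-evaluation CSV export per model. Column names follow the
# fields listed above (model_id, score); the file name is an illustrative assumption.
df = pd.read_csv("auto-evaluations-export.csv")

summary = (
    df.groupby("model_id")["score"]
      .agg(count="count", average="mean", minimum="min", maximum="max")
      .sort_values("average", ascending=False)
)
print(summary)
```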

Configuring Auto-evaluation on an Agent

Auto-evaluation is enabled per agent.
1. Edit agent

Open the target agent's edit screen in Workspace > Agents.
2. Activate auto-evaluation

Activate auto-evaluation in the Auto-evaluation section of the agent settings.

| Setting | Description |
| --- | --- |
| Enabled | Whether auto-evaluation is used |
| Sampling rate | Share of responses to evaluate (1%–100%, default 10%) |
| Judge model | LLM model used for evaluation |
| Evaluation types | Which evaluation types to enable |

Sampling rate guidance (a rough cost estimate follows these steps):

| Situation | Recommended | Reason |
| --- | --- | --- |
| New agent | 50–100% | Quickly assess initial quality |
| Stabilized | 5–10% | Cost saving + monitoring |
| Critical business | 20–30% | Quality assurance |
3. Save

After saving the agent, auto-evaluation runs on subsequent responses from this agent.
Use a judge model of equal or higher capability than the model being evaluated. For example, evaluating GPT-4o responses with GPT-4o-mini may reduce evaluation accuracy.
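To reason about the sampling-rate trade-off from the guidance table above, the sketch below estimates daily judge-model call volume. The traffic figure is an illustrative assumption, and it assumes each enabled evaluation type results in a separate judge call per sampled response.

```python
# Rough estimate of judge-model call volume at different sampling rates.
# The daily response count is an illustrative assumption, and each enabled
# evaluation type is assumed to trigger one judge call per sampled response.
responses_per_day = 1_000
enabled_eval_types = 3   # Retrieval Quality, Faithfulness, Response Quality

for sampling_rate in (0.05, 0.10, 0.30, 1.00):
    sampled = responses_per_day * sampling_rate
    judge_calls = sampled * enabled_eval_types
    print(f"sampling rate {sampling_rate:>4.0%}: ~{sampled:>6.0f} evaluated responses, "
          f"~{judge_calls:>6.0f} judge calls per day")
```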

Use Cases

Monitoring quality over time:
  1. Check daily/weekly score trends in the Score Trend chart
  2. When a model's score drops, check the traces from that period
  3. Click low-score individual evaluations to review the reasoning
  4. Adjust prompts, Knowledge Bases, or tool settings

Choosing a default model:
  1. Enable Arena evaluation to collect blind-comparison data
  2. Compare per-model average scores in the auto-evaluation statistics
  3. Set the model with the best cost-quality efficiency as the default

Improving an agent from user feedback:
  1. Identify models/agents with high "Dislike" rates from manual feedback
  2. Analyze comments to understand common dissatisfaction patterns
  3. Improve the agent's system prompt or Knowledge Base
  4. Track auto-evaluation score changes after the improvement
When an auto-evaluation is in the failed state:
  • Check the error message: review the error details for that item in the results table
  • Common causes: judge-model API errors, timeouts, token limit exceeded
  • Re-run: automatic re-runs are not currently supported; re-enable auto-evaluation in the agent settings to evaluate subsequent responses

Next Steps

Tracing

Trace causes of low evaluation scores

Agent Settings

Configure auto-evaluation on agents

Usage

Check token usage including evaluation cost