The evaluation feature measures AI response quality in two ways: manual feedback and auto-evaluation. Combining direct user assessment with LLM-based automatic quality measurement provides a systematic quality management framework.

Manual Evaluation (Feedbacks tab)

A feature where users directly evaluate AI responses. View all feedback in Admin > Evaluations > Feedbacks tab.
Evaluations tab main

How Feedback is Collected

Collected via feedback buttons below AI responses on the chat screen.
| Feedback Type | Description |
| --- | --- |
| Like | When the response is useful and accurate |
| Dislike | When the response is inaccurate or not helpful |
| Comment | Additional free-text feedback |

Feedback Data

| Field | Description |
| --- | --- |
| User | User who left the feedback |
| Type | Like / Dislike |
| Model ID | Model that generated the response |
| Reason | Feedback reason |
| Comment | Detailed opinion |
| Created | Feedback creation time |

Feedback Management

| Feature | Description | Permission |
| --- | --- | --- |
| View all | View feedback from all users | Admin |
| Export | Export all feedback as JSON (see the analysis sketch below) | Admin |
| Delete all | Bulk delete all feedback | Admin |
| Per-item delete | Delete own feedback | Regular user |
Feedback detail
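If you want to analyze the export outside the admin UI, a minimal sketch like the one below can summarize like/dislike counts per model. The key names and values it reads (model_id, type, "like"/"dislike") are assumptions based on the fields listed above; check your actual export for the exact keys.

```python
import json
from collections import Counter

# Summarize exported feedback per model. The key names and type values used here
# ("model_id", "type", "like", "dislike") are assumptions; adjust to the actual export.
with open("feedback-export.json", encoding="utf-8") as f:
    feedback = json.load(f)

counts = Counter(
    (item.get("model_id"), str(item.get("type", "")).lower()) for item in feedback
)
models = sorted({model for model, _ in counts if model})

for model in models:
    likes = counts[(model, "like")]
    dislikes = counts[(model, "dislike")]
    rated = likes + dislikes
    positive = likes / rated if rated else 0.0
    print(f"{model}: {likes} likes, {dislikes} dislikes ({positive:.0%} positive)")
```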

Arena Evaluation

A feature for blind-comparing two model responses side by side.

Setup

Configure in Admin > Settings (gear) > Evaluations.
| Setting | Description |
| --- | --- |
| Enable Arena | Toggle Arena mode |
| Arena models | Compose model pairs to compare |
When Arena is enabled, two models’ responses appear anonymously side by side during chat, and users select the better one.

Leaderboard (Leaderboard tab)

In Admin > Evaluations > Leaderboard, see model rankings based on Arena blind-comparison results, calculated as Elo ratings. Each time a user picks the better response in Arena, the participating models' Elo scores are updated, providing an objective, real-usage-based ranking of model quality (a minimal rating-update sketch follows the table below).
Leaderboard tab
| Field | Description |
| --- | --- |
| Model | Evaluated model |
| Elo Rating | Score derived from Arena comparison results |
| Matches | Number of Arena comparisons |
| Win rate | Share of comparisons in which the model was selected |
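Elo updates work pairwise: the selected model gains rating and the unselected model loses it, with the size of the change depending on the expected outcome. Below is a minimal sketch of a single update; the K-factor and starting rating are illustrative assumptions, not necessarily the values the product uses.

```python
# Illustrative Elo update after one Arena comparison (the exact K-factor and
# initial rating used by the product are assumptions here).
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(winner: float, loser: float, k: float = 32.0) -> tuple[float, float]:
    """Return updated (winner, loser) ratings after a single comparison."""
    exp_win = expected_score(winner, loser)
    winner += k * (1.0 - exp_win)   # winner gains in proportion to how unexpected the win was
    loser -= k * (1.0 - exp_win)    # loser gives up the same amount
    return winner, loser

# Example: a user picks model A's response over model B's in an Arena comparison.
a, b = 1000.0, 1000.0               # assumed starting ratings
a, b = update_elo(a, b)
print(a, b)                         # 1016.0 984.0
```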

Auto-evaluation (Auto Evaluations tab)

When auto-evaluation is enabled on an agent, a judge LLM asynchronously evaluates the quality of sampled responses and records the results. View the results in the Admin > Evaluations > Auto Evaluations tab.
Auto-evaluation results screen
Auto-evaluation is a licensed feature and requires a license with the evaluation feature enabled.

Evaluation Types

| Type | Description |
| --- | --- |
| Retrieval Quality | Evaluates whether the retrieved documents are relevant to the question |
| Faithfulness | Evaluates whether the answer is grounded in the retrieved content (hallucination detection) |
| Response Quality | Evaluates the overall usefulness and accuracy of the response |
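As an illustration of how a judge LLM might be instructed for each type, the templates below sketch typical scoring instructions; they are not the product's actual prompts.

```python
# Illustrative judge instructions per evaluation type (not the product's actual
# prompts). Each asks the judge LLM for a score between 0.0 and 1.0.
JUDGE_PROMPTS = {
    "retrieval": (
        "Given the user question and the retrieved documents, rate from 0.0 to 1.0 "
        "how relevant the documents are to the question."
    ),
    "faithfulness": (
        "Given the retrieved documents and the answer, rate from 0.0 to 1.0 how well "
        "the answer is grounded in the documents (1.0 = no unsupported claims)."
    ),
    "quality": (
        "Given the user question and the answer, rate from 0.0 to 1.0 the overall "
        "usefulness and accuracy of the answer."
    ),
}
```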

Evaluation Process

When an agent with auto-evaluation enabled generates a response, the response is sampled according to the configured sampling rate, sent asynchronously to the judge model for each enabled evaluation type, and the resulting score, reasoning, and status are recorded.
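A minimal sketch of this flow is shown below, assuming a hypothetical call_judge helper that wraps the judge-model API call and field names taken from the result table that follows; the actual pipeline is internal to the product.

```python
import random

# Simplified sketch of the auto-evaluation flow (not the product's actual code).
# `call_judge` is a hypothetical helper standing in for the judge-model API call.
def maybe_evaluate(response, sampling_rate, judge_model, eval_types, call_judge):
    if random.random() > sampling_rate:      # only a sampled share of responses is evaluated
        return []
    records = []
    for eval_type in eval_types:             # e.g. ["retrieval", "faithfulness", "quality"]
        record = {
            "message_id": response["message_id"],
            "model_id": response["model_id"],
            "judge_model_id": judge_model,
            "evaluation_type": eval_type,
            "status": "pending",
        }
        try:
            score, reasoning = call_judge(judge_model, eval_type, response)
            record.update(score=score, reasoning=reasoning, status="completed")
        except Exception as exc:             # API error, timeout, token limit exceeded, ...
            record.update(status="failed", error_message=str(exc))
        records.append(record)
    return records
```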

Evaluation Result Fields

| Field | Description |
| --- | --- |
| Chat/Message ID | Evaluated message |
| Model ID | Model that generated the response |
| Judge Model ID | LLM used for the evaluation |
| Evaluation type | retrieval, faithfulness, quality |
| Score | 0.0–1.0 (1.0 is best) |
| Reasoning | The judge LLM's explanation of the score |
| Status | pending, completed, failed |
| Error message | Error details when the evaluation fails |

Score Trend Chart

Visualizes daily average score trends.
| Mode | Description |
| --- | --- |
| All types | One average-score line per model |
| Specific type | Detailed lines per model and type |

Filter Options

| Filter | Description |
| --- | --- |
| Date range | Pick the evaluation period |
| Model | Filter by a specific model |
| Evaluation type | Retrieval Quality, Faithfulness, Response Quality |
| Status | pending, completed, failed |
| Score range | Minimum/maximum score (0.0–1.0) |

Auto-evaluation Statistics

Provides summary statistics for all auto-evaluations.
| Metric | Description |
| --- | --- |
| Total | Total auto-evaluation count |
| Completed | Successfully completed evaluations |
| Pending | Evaluations still being processed |
| Failed | Evaluations that failed with errors |
| Average score | Overall average score |
| Per-model statistics | Count and average score per model |
| Per-type statistics | Count and average score per evaluation type |

Export

Export auto-evaluation data.
| Format | Description |
| --- | --- |
| CSV | For spreadsheet analysis (id, chat_id, message_id, model_id, score, reasoning, etc.) |
| JSON | Full data for programmatic integration |
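For quick offline analysis, the CSV export can be summarized per model, for example with pandas. This is a minimal sketch: it assumes the column names listed above (model_id, score) and an illustrative file name.

```python
import pandas as pd

# Summarize the auto-evaluation CSV export per model. Column names follow the
# fields listed above (model_id, score); the file name is an illustrative assumption.
df = pd.read_csv("auto-evaluations-export.csv")

summary = (
    df.groupby("model_id")["score"]
      .agg(count="count", average="mean", minimum="min", maximum="max")
      .sort_values("average", ascending=False)
)
print(summary)
```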

Configuring Auto-evaluation on an Agent

Auto-evaluation is enabled per agent.
1. Edit agent

Open the target agent's edit screen in Workspace > Agents.
2. Activate auto-evaluation

Activate auto-evaluation in the Auto-evaluation section of the agent settings.

| Setting | Description |
| --- | --- |
| Enabled | Whether auto-evaluation is used |
| Sampling rate | Share of responses to evaluate (1%–100%, default 10%) |
| Judge model | LLM model used for evaluation |
| Evaluation types | Which evaluation types to enable |

Sampling rate guidance (a rough cost estimate follows these steps):

| Situation | Recommended | Reason |
| --- | --- | --- |
| New agent | 50–100% | Quickly assess initial quality |
| Stabilized | 5–10% | Cost saving + monitoring |
| Critical business | 20–30% | Quality assurance |
3. Save

After saving the agent, auto-evaluation runs on subsequent responses from this agent.
Use a judge model of equal or higher capability than the model being evaluated. For example, evaluating GPT-4o responses with GPT-4o-mini may reduce evaluation accuracy.
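To reason about the sampling-rate trade-off from the guidance table above, the sketch below estimates daily judge-model call volume. The traffic figure is an illustrative assumption, and it assumes each enabled evaluation type results in a separate judge call per sampled response.

```python
# Rough estimate of judge-model call volume at different sampling rates.
# The daily response count is an illustrative assumption, and each enabled
# evaluation type is assumed to trigger one judge call per sampled response.
responses_per_day = 1_000
enabled_eval_types = 3   # Retrieval Quality, Faithfulness, Response Quality

for sampling_rate in (0.05, 0.10, 0.30, 1.00):
    sampled = responses_per_day * sampling_rate
    judge_calls = sampled * enabled_eval_types
    print(f"sampling rate {sampling_rate:>4.0%}: ~{sampled:>6.0f} evaluated responses, "
          f"~{judge_calls:>6.0f} judge calls per day")
```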

Use Cases

Monitoring quality over time:
  1. Check daily/weekly score trends in the Score Trend chart
  2. When a model's score drops, check the traces from that period
  3. Click low-score individual evaluations to review the reasoning
  4. Adjust prompts, Knowledge Bases, or tool settings

Choosing a default model:
  1. Enable Arena evaluation to collect blind-comparison data
  2. Compare per-model average scores in the auto-evaluation statistics
  3. Set the model with the best cost-quality efficiency as the default

Improving an agent from user feedback:
  1. Identify models/agents with high "Dislike" rates from manual feedback
  2. Analyze comments to understand common dissatisfaction patterns
  3. Improve the agent's system prompt or Knowledge Base
  4. Track auto-evaluation score changes after the improvement
When an auto-evaluation is in the failed state:
  • Check the error message: review the error details for that item in the results table
  • Common causes: judge-model API errors, timeouts, token limit exceeded
  • Re-run: automatic re-runs are not currently supported; re-enable auto-evaluation in the agent settings to evaluate subsequent responses

Next Steps

Tracing

Trace causes of low evaluation scores

Agent Settings

Configure auto-evaluation on agents

Usage

Check token usage including evaluation cost