Manual Evaluation (Feedbacks tab)
A feature where users directly evaluate AI responses. View all feedback in Admin > Evaluations > Feedbacks tab.
How Feedback is Collected
Collected via feedback buttons below AI responses on the chat screen.
| Feedback Type | Description |
|---|---|
| Like | When the response is useful and accurate |
| Dislike | When the response is inaccurate or not helpful |
| Comment | Additional text feedback |
Feedback Data
| Field | Description |
|---|---|
| User | User who left feedback |
| Type | Like/Dislike |
| Model ID | Model that generated the response |
| Reason | Feedback reason |
| Comment | Detailed opinion |
| Created | Feedback creation time |
Feedback Management
| Feature | Description | Permission |
|---|---|---|
| View all | View feedback from all users | Admin |
| Export | Export all feedback as JSON | Admin |
| Delete all | Bulk delete all feedback | Admin |
| Per-item delete | Delete own feedback | Regular user |
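For offline analysis, the exported feedback JSON can be processed with a short script, for example to find models with high dislike rates. A minimal sketch in Python, assuming the export is a list of objects whose keys mirror the fields above (key names such as `model_id` and `type`, and the literal values `like`/`dislike`, are assumptions rather than a documented schema):

```python
import json
from collections import Counter, defaultdict

# Load the JSON export from Admin > Evaluations > Feedbacks > Export.
with open("feedback-export.json", encoding="utf-8") as f:
    feedback = json.load(f)

# Count Like/Dislike per model. Key names and values are assumed, not documented.
counts: dict[str, Counter] = defaultdict(Counter)
for item in feedback:
    counts[item["model_id"]][item["type"].lower()] += 1

for model, c in counts.items():
    total = c["like"] + c["dislike"]
    dislike_rate = c["dislike"] / total if total else 0.0
    print(f"{model}: {total} ratings, dislike rate {dislike_rate:.0%}")
```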

Arena Evaluation
A feature for blind-comparing two model responses side by side.
Setup
Configure in Admin > Settings (gear) > Evaluations.
| Setting | Description |
|---|---|
| Enable Arena | Toggle Arena mode |
| Arena models | Configure the model pairs to compare |
Leaderboard (Leaderboard tab)
Admin > Evaluations > Leaderboard shows model rankings calculated as Elo ratings from Arena blind comparison results. Each time a user picks the better response in Arena, that model's Elo score is updated, providing objective, real-usage-based model quality rankings.
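The document does not specify the rating parameters the leaderboard uses; for reference, the standard Elo update that such rankings are based on looks like the sketch below (the K-factor of 32 is an assumption, not a documented value):

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(winner: float, loser: float, k: float = 32.0) -> tuple[float, float]:
    """Return updated (winner, loser) ratings after one Arena pick."""
    e_winner = expected_score(winner, loser)
    winner += k * (1.0 - e_winner)          # winner gains what it "underperformed"
    loser -= k * (1.0 - e_winner)           # loser loses the same amount
    return winner, loser

# Example: a user picks model A (rated 1500) over model B (rated 1600).
print(update_elo(1500, 1600))  # A gains ~20.5 points, B loses ~20.5
```

A higher K-factor makes ratings react faster to each Arena pick; a lower one makes them more stable.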
| Field | Description |
|---|---|
| Model | Evaluated model |
| Elo Rating | Score derived from Arena comparison results |
| Matches | Number of Arena comparisons |
| Win rate | Share of Arena comparisons in which the model was selected |
Auto-evaluation (Auto Evaluations tab)
When auto-evaluation is enabled on an agent, after each response, a judge LLM asynchronously evaluates quality and records results. View results in Admin > Evaluations > Auto Evaluations tab.
Auto-evaluation is a licensed feature and requires a license with the evaluation feature enabled.
Evaluation Types
| Type | Description |
|---|---|
| Retrieval Quality | Evaluate whether retrieved documents are relevant to the question |
| Faithfulness | Evaluate whether the answer is grounded on retrieved content (hallucination detection) |
| Response Quality | Evaluate overall usefulness and accuracy |
Evaluation Process
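In outline: each response is sampled according to the agent's sampling rate; for sampled responses, the judge LLM scores every enabled evaluation type asynchronously; and each result is recorded with a pending, completed, or failed status plus a score and reasoning. A minimal sketch of that flow in Python, assuming a generic `judge_llm` chat-completion helper (prompt wording, function names, and response format are illustrative, not the product's actual implementation):

```python
import random

SAMPLING_RATE = 0.10   # evaluate 10% of responses (the documented default)
EVAL_TYPES = ["retrieval", "faithfulness", "quality"]

def maybe_evaluate(question: str, context: str, answer: str, judge_llm) -> list[dict]:
    """Sample a response and, if selected, score it with the judge model."""
    if random.random() >= SAMPLING_RATE:
        return []  # not sampled; no evaluation record is created

    results = []
    for eval_type in EVAL_TYPES:
        record = {"type": eval_type, "status": "pending"}
        try:
            # Illustrative judge prompt; the real prompt is product-internal.
            reply = judge_llm(
                f"Evaluate the {eval_type} of this answer on a 0.0-1.0 scale.\n"
                f"Question: {question}\nRetrieved context: {context}\nAnswer: {answer}\n"
                "Respond as: <score> | <one-sentence reasoning>"
            )
            score_text, reasoning = reply.split("|", 1)
            record.update(status="completed", score=float(score_text),
                          reasoning=reasoning.strip())
        except Exception as exc:  # judge API error, timeout, token limit, ...
            record.update(status="failed", error_message=str(exc))
        results.append(record)
    return results
```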
Evaluation Result Fields
| Field | Description |
|---|---|
| Chat/Message ID | Evaluated message |
| Model ID | Model that generated the response |
| Judge Model ID | LLM used for evaluation |
| Evaluation type | retrieval, faithfulness, quality |
| Score | 0.0 ~ 1.0 (1.0 is best) |
| Reasoning | LLM’s explanation of the score |
| Status | pending, completed, failed |
| Error message | Error content on evaluation failure |
Score Trend Chart
Visualizes daily average score trends.
| Mode | Description |
|---|---|
| All types | Average score line per model |
| Specific type | Detailed lines per model + type |
Filter Options
| Filter | Description |
|---|---|
| Date range | Pick evaluation period |
| Model | Filter by specific model |
| Evaluation type | Retrieval Quality, Faithfulness, Response Quality |
| Status | pending, completed, failed |
| Score range | Min/max score (0.0 ~ 1.0) |
Auto-evaluation Statistics
Provides summary statistics for all auto-evaluations.
| Metric | Description |
|---|---|
| Total | Total auto-evaluation count |
| Completed | Successfully completed evaluation count |
| Pending | Evaluations still being processed |
| Failed | Evaluations failed with errors |
| Average score | Overall average score |
| Per-model statistics | Count and average score per model |
| Per-type statistics | Count and average score per evaluation type |
Export
Export auto-evaluation data.
| Format | Description |
|---|---|
| CSV | For spreadsheet analysis (id, chat_id, message_id, model_id, score, reasoning, etc.) |
| JSON | Full data for programmatic integration |
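The CSV export feeds directly into standard tooling. A minimal sketch computing per-model average scores, using the column names listed above (`model_id`, `score`); the file name is illustrative:

```python
import csv
from collections import defaultdict

totals: dict[str, list[float]] = defaultdict(list)

# Columns taken from the CSV description above: model_id, score, ...
with open("auto-evaluations.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        if row.get("score"):  # skip pending/failed rows without a score
            totals[row["model_id"]].append(float(row["score"]))

for model, scores in sorted(totals.items()):
    print(f"{model}: {len(scores)} evaluations, avg score {sum(scores)/len(scores):.2f}")
```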
Configuring Auto-evaluation on an Agent
Auto-evaluation is enabled per agent.
Activate auto-evaluation
Activate in the Auto-evaluation section of agent settings.
| Setting | Description |
|---|---|
| Enabled | Whether auto-evaluation is used |
| Sampling rate | Share of responses to evaluate (1%~100%, default 10%) |
| Judge model | LLM model used for evaluation |
| Evaluation types | Pick which types to enable |
Sampling rate guidance:
| Situation | Recommended | Reason |
|---|---|---|
| New agent | 50–100% | Quickly assess initial quality |
| Stabilized | 5–10% | Cost saving + monitoring |
| Critical business | 20–30% | Quality assurance |
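When picking a sampling rate, it helps to estimate the extra judge-model traffic it creates. A rough sketch, assuming an illustrative volume of 4,000 responses per day and all three evaluation types enabled (one judge call per type is an assumption):

```python
daily_responses = 4_000  # illustrative traffic for a single agent
enabled_eval_types = 3   # Retrieval Quality, Faithfulness, Response Quality

for rate in (0.05, 0.10, 0.30, 1.00):
    sampled = daily_responses * rate
    judge_calls = sampled * enabled_eval_types
    print(f"sampling {rate:.0%}: ~{sampled:.0f} responses judged, "
          f"~{judge_calls:.0f} judge calls/day")
```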
Use Cases
Response Quality Monitoring
- Check daily/weekly score trends in the Score Trend chart
- When a model’s score drops, check that period’s traces
- Click low-score individual evaluations to review reasoning
- Adjust prompts, Knowledge Bases, tool settings
Cross-model Quality Comparison
- Enable Arena evaluation to collect blind comparison data
- Compare per-model average scores in auto-evaluation statistics
- Set the model with best cost-quality efficiency as default
Feedback-based Improvement
- Identify models/agents with high “Dislike” rates from manual feedback
- Analyze comments to understand common dissatisfaction patterns
- Improve the agent’s system prompt or Knowledge Base
- Track auto-evaluation score changes after improvement
What if auto-evaluation fails?
When an auto-evaluation is in the failed state:
- Check error message: Review the error details for that item in the result table
- Common causes: Judge model API errors, timeouts, token limit exceeded
- Re-run: Auto re-run isn’t currently supported. Re-enable auto-evaluation in agent settings to evaluate subsequent responses.
Next Steps
- Tracing: Trace causes of low evaluation scores
- Agent Settings: Configure auto-evaluation on agents
- Usage: Check token usage including evaluation cost
