Admin › Evaluation › Arena · Leaderboard
Arena
A feature for blind-evaluating two models by comparing their responses side by side.Setup
Admin › Evaluation › Arena

| Setting | Description |
|---|---|
| Arena models | Toggle whether Arena mode is used |
| Manage | Configure the models to compare (use default Arena models or add custom ones) |

Leaderboard
Admin › Evaluation › Leaderboard

| Column | Description |
|---|---|
| RK | Rank (descending by evaluation score) |
| Model | Evaluated model |
| Evaluation | Score derived from Arena comparison results (Elo rating) |
| Wins | Number of wins in Arena comparisons |
| Losses | Number of losses in Arena comparisons |
The leaderboard is in beta, and evaluation criteria may change as the algorithm is revised. It updates in real time based on the Elo evaluation system.
Use Cases
Comparing Quality Across Models
Comparing Quality Across Models
- Enable Arena evaluation to collect blind comparison data
- Compare average scores in the per-model statistics of auto-evaluation
- Set the model with the best cost-to-quality efficiency as the default model
Related Pages
Evaluation
Full overview and guide to evaluation methods
Auto-Evaluations
Automatic quality scoring by a judge LLM
Usage
Check token usage per model
