Admin › Evaluation › Arena · Leaderboard
Arena · Leaderboard runs blind head-to-head comparisons between two models (Arena) and ranks the models by accumulating the results into Elo ratings (Leaderboard). Because the rankings come from real users’ choices rather than a fixed benchmark, they show how the models compare in actual use.

Arena

Arena shows two models’ responses side by side so that users can evaluate them blindly, without knowing which model wrote which response.

Setup

Admin › Evaluation › Arena
Figure: Arena setup (Arena model toggle, model management)

| Setting | Description |
| --- | --- |
| Arena models | Toggle whether Arena mode is used |
| Manage | Configure the models to compare (use default Arena models or add custom ones) |
To add a comparison model directly, click + in the Manage item. Name and ID are required; you can also set a description, access permissions, and the models to include. Leaving the model selection empty includes all models.
Figure: Add Arena model modal (name, ID, description, permissions, model selection)
When Arena is enabled, two models’ responses appear anonymously side by side while a user chats, and the user selects the better response.
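The pairing behavior described above can be pictured with a minimal sketch. This is not the product's implementation: the function names, the model IDs, and the "Response A"/"Response B" labels are hypothetical, and the only behavior taken from this page is that an empty model selection includes all models and that the two responses are shown anonymously.

```python
import random

def resolve_pool(selected_ids, all_model_ids):
    """Return the candidate models; an empty selection means every model."""
    return list(selected_ids) if selected_ids else list(all_model_ids)

def draw_blind_pair(pool, rng=random):
    """Pick two distinct models and hide them behind neutral labels."""
    if len(pool) < 2:
        raise ValueError("need at least two models to compare")
    first, second = rng.sample(pool, 2)
    # The user only ever sees the keys; the values stay server-side.
    return {"Response A": first, "Response B": second}

all_models = ["model-x", "model-y", "model-z"]  # hypothetical model IDs
pool = resolve_pool([], all_models)             # empty selection -> all models
pair = draw_blind_pair(pool, random.Random(0))  # seeded only for repeatability
print(sorted(pair))                             # labels shown to the user
```

Keeping the label-to-model mapping out of the user's view is what makes the comparison blind; the mapping is only needed afterwards to credit the win to the right model.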

Leaderboard

Admin › Evaluation › Leaderboard
The Leaderboard calculates Elo-based model rankings from the results of Arena blind comparisons. Each time a user picks the better response in Arena, the Elo scores are updated, so the ranking reflects how the models perform in real usage.
Figure: Leaderboard (Elo rating-based model ranking table)
You can search rankings by model name in the search box at the top.
| Column | Description |
| --- | --- |
| RK | Rank (descending by evaluation score) |
| Model | Evaluated model |
| Evaluation | Score derived from Arena comparison results (Elo rating) |
| Wins | Number of wins in Arena comparisons |
| Losses | Number of losses in Arena comparisons |
Example: RK 1 · Cloocus general model - GPT-oss-120B · Evaluation 1061 · Wins 4 · Losses 0
The Leaderboard is in beta, so the evaluation criteria may change as the algorithm is revised. Rankings update in real time using the Elo rating system.
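For readers unfamiliar with Elo, the following sketch shows the textbook update applied after a single Arena comparison. The starting rating of 1000 and the K-factor of 32 are assumptions chosen for illustration; this page does not state which values or which variant of the algorithm the Leaderboard uses, and the function names are hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(winner, loser, k=32.0):
    """Return the new (winner, loser) ratings after one comparison.

    k is the K-factor; 32 is an assumed value, not the product's setting.
    """
    delta = k * (1.0 - expected_score(winner, loser))
    return winner + delta, loser - delta

# Two models start level (1000 is an assumed starting rating).
ratings = {"model-a": 1000.0, "model-b": 1000.0}

# A user picks model-a's response in one blind comparison.
ratings["model-a"], ratings["model-b"] = update_elo(
    ratings["model-a"], ratings["model-b"]
)
print(ratings)  # {'model-a': 1016.0, 'model-b': 984.0}
```

In this formulation a win over a higher-rated model moves the score more than a win over a lower-rated one, which is why the Evaluation column cannot be derived from the Wins and Losses counts alone.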

Use Cases

  1. Enable Arena evaluation to collect blind comparison data
  2. Compare average scores in the per-model statistics of auto-evaluation
  3. Set the model with the best cost-to-quality efficiency as the default model
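Step 3 can be sketched as a simple ranking by quality per unit cost. Everything here is an illustrative assumption: the `ModelStats` fields, the score scale, the cost unit, the minimum-score threshold, and the sample numbers are made up and are not exported by the product in this form.

```python
from dataclasses import dataclass

@dataclass
class ModelStats:
    name: str
    avg_score: float    # average auto-evaluation score (scale assumed)
    cost_per_1k: float  # cost per 1,000 tokens (unit assumed)

def quality_per_cost(stats):
    """Score points obtained per unit of cost."""
    return stats.avg_score / stats.cost_per_1k

def pick_default(candidates, min_score=0.0):
    """Choose the most cost-efficient model that meets a quality floor."""
    eligible = [c for c in candidates if c.avg_score >= min_score]
    if not eligible:
        raise ValueError("no model meets the minimum score")
    return max(eligible, key=quality_per_cost)

# Illustrative numbers only, not measured results.
candidates = [
    ModelStats("model-a", 8.4, 0.60),
    ModelStats("model-b", 7.9, 0.20),
    ModelStats("model-c", 6.1, 0.15),
]
print(pick_default(candidates, min_score=7.0).name)  # model-b
```

The quality floor matters because a pure score-to-cost ratio tends to favor the cheapest model even when its quality is too low to be a sensible default; without the floor, this sample data would select model-c.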

Evaluation

Full overview and guide to evaluation methods

Auto-Evaluations

Automatic quality scoring by a judge LLM

Usage

Check token usage per model