Model Arena

Grade D · 54/100 · completed
Web App
containerized / python · tiny
16 Files
1,468 LOC
1 Framework
7 Languages

Pipeline State

Status: completed
Run ID: #369762
Phase: done
Progress: 1%
Started:
Finished: 2026-04-13 01:31:02
LLM tokens: 0

Pipeline Metadata

Stage: Skipped
Decision: skip_scaffold_dup
Novelty: 35.23
Framework unique
Isolation
Last stage change: 2026-04-16 18:15:42
Deduplication group: #52266
Member of a group with 1 similar repo; canonical: #89897
Top concepts (2)
Project Description · Web Backend
Repobility · code-quality intelligence platform · https://repobility.com

AI Prompt

Create a self-hosted web tool for blind AI model comparison and ranking, similar to Chatbot Arena. The tool should allow users to enter a prompt and select a category, then have two different models respond simultaneously via real-time streaming. After viewing the side-by-side responses, users should be able to vote ("A Wins", "Tie", or "B Wins"). Key features to include are an ELO leaderboard that tracks rankings, support for multiple OpenAI-compatible APIs (like OpenAI, Anthropic, or Ollama), and the ability to configure models via a YAML file without changing code. The frontend should be simple, using vanilla JavaScript, and the deployment should be easy with Docker Compose.
python fastapi web-tool ai-comparison elo-ranking docker streaming javascript yaml
Generated by gemma4:latest
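The prompt above calls for configuring models through a YAML file without code changes. A plausible shape for such a file is sketched below; every key name is hypothetical rather than taken from the repository, and the category list follows the four categories mentioned elsewhere in this report (general, coding, reasoning, creative):

```yaml
# models.yaml: hypothetical layout for an OpenAI-compatible provider list
models:
  - name: gpt-4o-mini
    provider: openai
    base_url: https://api.openai.com/v1
    api_key_env: OPENAI_API_KEY          # read from the environment, never stored
  - name: llama3
    provider: ollama
    base_url: http://localhost:11434/v1  # Ollama's OpenAI-compatible endpoint
categories: [general, coding, reasoning, creative]
```

Keeping only an environment-variable name (rather than the key itself) in the file is what makes it safe to commit alongside the Docker Compose setup.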

Catalog Information

A self-hosted web tool that lets teams compare AI models blind and rank them using ELO.

Description

Model Arena is a lightweight, self-hosted web application that enables teams to compare two AI models side‑by‑side on the same prompt without revealing their identities. It streams responses in real time, allowing users to vote on which model performs better. The platform tracks performance with an ELO leaderboard, supports multiple OpenAI‑compatible providers, and estimates cost per response. Users can configure models via a simple YAML file and run the service with a single Docker command. Model Arena is ideal for internal model evaluation, budgeting, and unbiased benchmarking.
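The catalog repeatedly mentions an ELO leaderboard but shows no formula. A standard logistic ELO update is sketched below; the K-factor of 32 is an assumption, not a value from the repository, and the 'A'/'tie'/'B' outcomes mirror the three vote buttons described above:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Win probability of A over B under the logistic ELO model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, outcome: str, k: float = 32.0):
    """Apply one vote. outcome is 'A', 'B', or 'tie' (the three vote buttons)."""
    score_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[outcome]
    expected_a = expected_score(rating_a, rating_b)
    # B's actual and expected scores are the complements of A's.
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b
```

With both models at 1000, an "A Wins" vote moves the pair to 1016/984, while a tie leaves both ratings unchanged.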

Description (translated from Arabic)

Model Arena is a lightweight web platform that can be deployed locally, letting teams compare AI models blind on the same prompt. The tool shows each model's reply side by side with real-time streaming of results, allowing users to vote for the better answer without knowing the model's identity. Performance rankings are computed with the ELO system, and results can be filtered by category (general, coding, reasoning, creative). The application supports multiple OpenAI-compatible providers and estimates the cost of each response based on provider pricing. Models can be configured through a simple YAML file, and the application runs with a single Docker command. The system keeps prompts confidential and records every vote together with a full history of ELO changes. It is an ideal solution for internal model evaluation, budget planning, and unbiased performance analysis.

Novelty

7/10

Tags

ai-model-comparison blind-evaluation elo-ranking real-time-streaming cost-estimation multi-provider-support prompt-privacy leaderboard-analytics

Technologies

fastapi openai uvicorn
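The stack above (fastapi, openai, uvicorn) implies an async fan-out of one prompt to two providers so both responses stream simultaneously. A minimal, provider-agnostic sketch using asyncio.gather follows; run_battle and fake_complete are hypothetical names, and a real implementation would pass a wrapper around an OpenAI-compatible chat endpoint as the complete callable:

```python
import asyncio

async def run_battle(prompt, model_a, model_b, complete):
    """Send the same prompt to both models concurrently and collect replies.

    `complete(model, prompt)` is an async callable, e.g. a wrapper around an
    OpenAI-compatible chat completion call (hypothetical here).
    """
    reply_a, reply_b = await asyncio.gather(
        complete(model_a, prompt), complete(model_b, prompt)
    )
    # Model identities stay hidden from the voter: only anonymised slots leave.
    return {"A": reply_a, "B": reply_b}

# Stub backend standing in for a real provider call.
async def fake_complete(model, prompt):
    await asyncio.sleep(0)  # simulate network latency
    return f"{model}: {prompt.upper()}"

result = asyncio.run(run_battle("hello", "m1", "m2", fake_complete))
```

In the real app each slot would be a token stream (e.g. FastAPI's StreamingResponse) rather than a completed string, but the blind A/B pairing works the same way.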

Claude Models

claude-opus-4.6

Quality Score

D
53.5/100
Structure: 44
Code Quality: 75
Documentation: 39
Testing: 0
Practices: 72
Security: 92
Dependencies: 60

Strengths

  • Consistent naming conventions (snake_case)
  • Good security practices: no major issues detected
  • Containerized deployment (Docker)

Weaknesses

  • No LICENSE file: legal ambiguity for contributors
  • No tests found: high risk of regressions
  • No CI/CD configuration: manual testing and deployment

Recommendations

  • Add a test suite: start with critical-path integration tests
  • Set up CI/CD (GitHub Actions recommended) to automate testing and deployment
  • Add a linter configuration to enforce code style consistency
  • Add a LICENSE file (MIT recommended for open source)

Security & Health

Tech Debt: 4.1h (grade D)
OWASP: A (100%)
Quality Gate: PASS
Risk: A (7)
License: Unknown
Duplication: 0.7%
Methodology: Repobility · https://repobility.com/research/state-of-ai-code-2026/

Languages

python: 36.8%
css: 27.8%
javascript: 19.5%
html: 8.8%
markdown: 5.8%
yaml: 0.9%
text: 0.4%

Frameworks

FastAPI

Concepts (2)

Source: Repobility analyzer (https://repobility.com)
Category | Name | Description | Confidence
auto_description | Project Description | A self-hosted blind AI model comparison tool with ELO rankings. Inspired by Chatbot Arena (LMSYS); a lightweight, self-hosted alternative for internal/private model evaluation. | 80%
auto_category | Web Backend | web-backend | 70%

Quality Timeline

1 quality score recorded.


Embed Badge

Add to your README:

![Quality](https://repos.aljefra.com/badge/93966.svg)