Agentbench Live

B+ 87 completed
Other
cli / python · small
Files: 114
LOC: 7,211
Frameworks: 4
Languages: 7

Pipeline State

completed
Run ID
#347670
Phase
done
Progress
1%
Started
Finished
2026-04-13 01:31:02
LLM tokens
0

Pipeline Metadata

Stage
Cataloged
Decision
proceed
Novelty
61.64
Framework unique
Isolation
Last stage change
2026-05-10 03:35:02
Deduplication group #66075
Member of a group with 1 similar repo; this repo is the canonical one.
Top concepts (2)
Project Description, Web Backend
Repobility analyzer · published findings · https://repobility.com

AI Prompt

Create a command-line benchmark system, similar to AgentBench-Live, for evaluating AI agents on real-world tasks. I need it to support various domains like Code, Data Analysis, Multi-step workflows, Research, and Tool Use. The system should be built using Python and incorporate frameworks like Django or Flask for structure, and use SQLAlchemy for potential data persistence. It should allow for running tests and generating leaderboards based on task execution results.
python cli benchmark ai-agent django flask pytest sqlalchemy testing
Generated by gemma4:latest
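
The leaderboard step the prompt asks for can be sketched as a small aggregation over task results. This is only an illustrative sketch, not code from the analyzed repository: `TaskResult`, `build_leaderboard`, `format_leaderboard`, the domain slugs, and the pass-rate ranking rule are all hypothetical names and choices introduced here.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical slugs for the five domains named in the prompt.
DOMAINS = ("code", "data-analysis", "multi-step", "research", "tool-use")


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one agent attempting one task (assumed shape)."""
    agent: str
    domain: str
    passed: bool


def build_leaderboard(results):
    """Return (agent, pass_rate, tasks_run) rows, best pass rate first."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        if r.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {r.domain}")
        total[r.agent] += 1
        passed[r.agent] += int(r.passed)
    rows = [(agent, passed[agent] / total[agent], total[agent]) for agent in total]
    # Sort by pass rate descending, then by agent name so ties are stable.
    return sorted(rows, key=lambda row: (-row[1], row[0]))


def format_leaderboard(rows):
    """Render leaderboard rows as a fixed-width text table for CLI output."""
    lines = [f"{'rank':<5}{'agent':<12}{'pass rate':>10}{'tasks':>7}"]
    for rank, (agent, rate, n) in enumerate(rows, start=1):
        lines.append(f"{rank:<5}{agent:<12}{rate:>10.1%}{n:>7}")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [
        TaskResult("agent-a", "code", True),
        TaskResult("agent-a", "research", False),
        TaskResult("agent-b", "code", True),
    ]
    print(format_leaderboard(build_leaderboard(demo)))
```

Ranking by raw pass rate is the simplest choice; a real benchmark would probably weight domains or require a minimum number of tasks before an agent is ranked.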

Catalog Information

Description

The open benchmark for AI agent task execution.

Novelty

3/10

Tags

python cli benchmark ai-agent django flask pytest sqlalchemy testing

Technologies

anthropic openai pydantic

Claude Models

claude-opus-4-6

Quality Score

B+
86.8/100
Structure
95
Code Quality
89
Documentation
83
Testing
85
Practices
74
Security
92
Dependencies
60

Strengths

  • CI/CD pipeline configured (github_actions)
  • Good test coverage (76% test-to-source ratio)
  • Code linting configured (possibly ruff)
  • Consistent naming conventions (snake_case)
  • Good security practices: no major issues detected
  • Properly licensed project

Security & Health

Tech Debt (B): 4.6h
OWASP (100%): A
Quality Gate: PASS
Risk (2): A
License: MIT
Duplication: 0.0%

Languages

python
56.3%
json
17.7%
html
14.3%
yaml
7.0%
markdown
3.6%
toml
0.8%
text
0.2%

Frameworks

Django Flask pytest SQLAlchemy

Concepts (2)

Category | Name | Description | Confidence
auto_description | Project Description | ![License: MIT](https://opensource.org/licenses/MIT) ![Python 3.9+](https://www.python.org/downloads/) ![Agents Tested](#-leaderboard) | 80%
auto_category | Web Backend | web-backend | 70%

Quality Timeline

1 quality score recorded.

Embed Badge

Add to your README:

![Quality](https://repos.aljefra.com/badge/71750.svg)