Agentbench Live

B+ 87 completed

Other

cli / python · small

Files: 114
LOC: 7,211


Pipeline State

completed

Run ID: #347670
Phase: done

Progress

Started

Finished

2026-04-13 01:31:02

LLM tokens

Pipeline Metadata

Stage: Cataloged
Decision: proceed
Novelty: 61.64
Framework unique: -
Isolation: -
Last stage change: 2026-05-10 03:35:02

Deduplication group #66075

Member of a group with 1 similar repo(s); this repo is the canonical one.

Top concepts (2)

Project Description, Web Backend

Repobility analyzer · published findings · https://repobility.com

AI Prompt

Create a command-line benchmark system, similar to AgentBench-Live, for evaluating AI agents on real-world tasks. I need it to support various domains like Code, Data Analysis, Multi-step workflows, Research, and Tool Use. The system should be built using Python and incorporate frameworks like Django or Flask for structure, and use SQLAlchemy for potential data persistence. It should allow for running tests and generating leaderboards based on task execution results.

python cli benchmark ai-agent django flask pytest sqlalchemy testing

Generated by gemma4:latest

Catalog Information

Description

The open benchmark for AI agent task execution.

Novelty: 3/10
Technologies: anthropic, openai, pydantic
Claude Models: claude-opus-4-6
Quality Score: B+ (86.8/100)

Score categories: Structure, Code Quality, Documentation, Testing, Practices, Security, Dependencies

Strengths

CI/CD pipeline configured (github_actions)
Good test coverage (76% test-to-source ratio)
Code linting configured (possibly ruff)
Consistent naming conventions (snake_case)
Good security practices: no major issues detected
Properly licensed project

Security & Health

Tech Debt (B): 4.6h
OWASP (100%)
Quality Gate: PASS
Risk (2)

License: MIT
Duplication: 0.0%


Languages

python: 56.3%
json: 17.7%
html: 14.3%
yaml: 7.0%
markdown: 3.6%
toml: 0.8%
text: 0.2%

Frameworks

Django Flask pytest SQLAlchemy

Concepts (2)

Category	Name	Description	Confidence
auto_description	Project Description	![License: MIT](https://opensource.org/licenses/MIT) ![Python 3.9+](https://www.python.org/downloads/) ![Agents Tested](#-leaderboard)	80%
auto_category	Web Backend	web-backend	70%

Quality Timeline

1 quality score recorded.

Embed Badge

Add to your README:

![Quality](https://repos.aljefra.com/badge/71750.svg)
