Agentic Bench

C+ · 78/100 · completed
Framework: unknown / markdown · small
Files: 62
LOC: 6,298
Frameworks: 1
Languages: 6

Pipeline State

State: completed
Run ID: #366638
Phase: done
Progress: 1%
Started:
Finished: 2026-04-13 01:31:02
LLM tokens: 0

Pipeline Metadata

Stage: Cataloged
Decision: proceed
Novelty: 64.39
Framework: unique
Isolation:
Last stage change: 2026-05-10 03:35:17
Deduplication group: #52009 (member of a group with 3 similar repos; canonical: #76035)
Top concepts (2): Project Description, Testing

AI Prompt

Create a framework for validating and benchmarking agent-driven machine learning models, similar to the agentic-bench concept. I need it to handle the entire workflow: model investigation, GPU execution, and report generation, all triggered by natural language instructions. The system should support connecting to various model providers using API keys for HuggingFace, OpenAI, Anthropic, and Google. Key features must include automated steps for research (like checking model info or searching), cost estimation, executing inference on a GPU, and finally generating an HTML report with metrics. Please structure it to allow for both high-level agentic calls and direct execution of helper scripts for specific tasks.
python ml-benchmarking agentic llm gpu api-integration automation research
Generated by gemma4:latest
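
The prompt describes a two-tier design: a high-level agentic entry point plus standalone helper scripts for research, cost estimation, GPU inference, and HTML reporting. Below is a minimal, stdlib-only Python sketch of how those pieces might fit together; every name here (PROVIDER_ENV_KEYS, BenchmarkRun, run_benchmark) is hypothetical, and the steps are stubs rather than the repository's actual code.

```python
import os
from dataclasses import dataclass, field
from pathlib import Path

# Provider keys the prompt asks for; read from the environment so no
# secrets live in code. These env-var names are assumptions.
PROVIDER_ENV_KEYS = {
    "huggingface": "HF_TOKEN",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
}


@dataclass
class BenchmarkRun:
    model_id: str
    findings: dict = field(default_factory=dict)

    # --- helper steps, each also usable as a standalone script ---

    def investigate(self) -> None:
        """Research step: record model metadata (stubbed here)."""
        self.findings["model_info"] = {"model_id": self.model_id}

    def estimate_cost(self, gpu_hourly_usd: float = 1.50,
                      est_hours: float = 0.25) -> float:
        """Cost estimate from assumed GPU pricing; numbers are placeholders."""
        cost = gpu_hourly_usd * est_hours
        self.findings["estimated_cost_usd"] = round(cost, 2)
        return cost

    def run_inference(self) -> None:
        """GPU execution step (stubbed; a real run would dispatch to a GPU)."""
        self.findings["metrics"] = {"latency_ms": None, "accuracy": None}

    def report(self, out: Path = Path("report.html")) -> Path:
        """Render the collected findings as a minimal HTML report."""
        rows = "".join(
            f"<tr><td>{k}</td><td>{v}</td></tr>"
            for k, v in self.findings.items()
        )
        out.write_text(
            f"<html><body><h1>{self.model_id}</h1>"
            f"<table>{rows}</table></body></html>"
        )
        return out


def run_benchmark(model_id: str) -> Path:
    """High-level 'agentic' entry point chaining all helper steps."""
    missing = [v for v in PROVIDER_ENV_KEYS.values() if v not in os.environ]
    if missing:
        print(f"warning: missing provider keys: {missing}")
    run = BenchmarkRun(model_id)
    run.investigate()
    run.estimate_cost()
    run.run_inference()
    return run.report()


if __name__ == "__main__":
    print(run_benchmark("example/model"))
```

Each method mirrors one of the prompt's helper scripts, so an agent can chain them end to end via run_benchmark() while a user can still invoke any single step directly.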

Catalog Information

The agentic-bench project is a framework for validating and benchmarking agent-driven models.

Description

Agentic-bench is an open-source tool designed to facilitate the evaluation and comparison of agent-driven models. It provides a standardized approach to model validation and benchmarking, enabling researchers and developers to assess the performance and efficiency of their agents in a consistent manner. The framework supports various types of agent-driven models and allows for customization of benchmarking scenarios.

Novelty

7/10

Tags

model-validation benchmarking agent-driven-models performance-evaluation efficiency-assessment

Claude Models

claude-opus-4.6

Quality Score

C+ (77.9/100)
Structure: 87
Code Quality: 70
Documentation: 73
Testing: 85
Practices: 63
Security: 100
Dependencies: 60
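
One detail worth noting: 77.9 is not the plain mean of the seven sub-scores, which comes to about 76.9, so the scanner presumably applies per-category weights (the weighting scheme itself is not published on this page). A quick check:

```python
# Sub-scores as published in the breakdown above.
scores = {"structure": 87, "code_quality": 70, "documentation": 73,
          "testing": 85, "practices": 63, "security": 100, "dependencies": 60}

mean = sum(scores.values()) / len(scores)
print(f"unweighted mean: {mean:.1f}")  # ~76.9, vs the reported 77.9
```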

Strengths

  • CI/CD pipeline configured (github_actions)
  • Good test coverage (75% test-to-source ratio)
  • Code linting configured (likely ruff)
  • Consistent naming conventions (snake_case)
  • Good security practices: no major issues detected
  • Properly licensed project

Security & Health

Tech Debt: 5.1h (B)
OWASP: A (100%)
Quality Gate: PASS
Risk: A (2)
License: MIT
Duplication: 0.0%

Languages

markdown: 51.3%
python: 35.7%
html: 6.3%
json: 5.8%
yaml: 0.4%
toml: 0.4%

Frameworks

pytest

Concepts (2)

| Category | Name | Description | Confidence |
| --- | --- | --- | --- |
| auto_description | Project Description | A framework that autonomously validates arbitrary ML models through Claude Code skill chains, running model investigation → GPU execution → report generation end to end. | 80% |
| auto_category | Testing | testing | 70% |

Quality Timeline

1 quality score recorded.

Embed Badge

Add to your README:

![Quality](https://repos.aljefra.com/badge/90822.svg)