Agentic Bench

C+ · 78/100 · completed
Framework: unknown / markdown · small
Files: 62
LOC: 6,298
Frameworks: 1
Languages: 6

Pipeline State

State: completed
Run ID: #366638
Phase: done
Progress: 1%
Started:
Finished: 2026-04-13 01:31:02
LLM tokens: 0

Pipeline Metadata

Stage: Cataloged
Decision: proceed
Novelty: 64.39
Framework: unique
Isolation:
Last stage change: 2026-05-10 03:35:17
Deduplication group: #52009 (member of a group with 3 similar repos; canonical: #76035)
Top concepts (2): Project Description, Testing

AI Prompt

Create a framework for validating and benchmarking agent-driven machine learning models, similar to the agentic-bench concept. I need it to handle the entire workflow: model investigation, GPU execution, and report generation, all triggered by natural language instructions. The system should support connecting to various model providers using API keys for HuggingFace, OpenAI, Anthropic, and Google. Key features must include automated steps for research (like checking model info or searching), cost estimation, executing inference on a GPU, and finally generating an HTML report with metrics. Please structure it to allow for both high-level agentic calls and direct execution of helper scripts for specific tasks.
python ml-benchmarking agentic llm gpu api-integration automation research
Generated by gemma4:latest
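
The prompt describes a two-tier design: a high-level agentic entry point plus standalone helper scripts for research, cost estimation, GPU inference, and HTML reporting. Below is a minimal, stdlib-only Python sketch of how those pieces might fit together; every name here (PROVIDER_ENV_KEYS, BenchmarkRun, run_benchmark) is hypothetical, and the steps are stubs rather than the repository's actual code.

```python
import os
from dataclasses import dataclass, field
from pathlib import Path

# Provider keys the prompt asks for; read from the environment so no
# secrets live in code. These env-var names are assumptions.
PROVIDER_ENV_KEYS = {
    "huggingface": "HF_TOKEN",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
}


@dataclass
class BenchmarkRun:
    model_id: str
    findings: dict = field(default_factory=dict)

    # --- helper steps, each also usable as a standalone script ---

    def investigate(self) -> None:
        """Research step: record model metadata (stubbed here)."""
        self.findings["model_info"] = {"model_id": self.model_id}

    def estimate_cost(self, gpu_hourly_usd: float = 1.50,
                      est_hours: float = 0.25) -> float:
        """Cost estimate from assumed GPU pricing; numbers are placeholders."""
        cost = gpu_hourly_usd * est_hours
        self.findings["estimated_cost_usd"] = round(cost, 2)
        return cost

    def run_inference(self) -> None:
        """GPU execution step (stubbed; a real run would dispatch to a GPU)."""
        self.findings["metrics"] = {"latency_ms": None, "accuracy": None}

    def report(self, out: Path = Path("report.html")) -> Path:
        """Render the collected findings as a minimal HTML report."""
        rows = "".join(
            f"<tr><td>{k}</td><td>{v}</td></tr>"
            for k, v in self.findings.items()
        )
        out.write_text(
            f"<html><body><h1>{self.model_id}</h1>"
            f"<table>{rows}</table></body></html>"
        )
        return out


def run_benchmark(model_id: str) -> Path:
    """High-level 'agentic' entry point chaining all helper steps."""
    missing = [v for v in PROVIDER_ENV_KEYS.values() if v not in os.environ]
    if missing:
        print(f"warning: missing provider keys: {missing}")
    run = BenchmarkRun(model_id)
    run.investigate()
    run.estimate_cost()
    run.run_inference()
    return run.report()


if __name__ == "__main__":
    print(run_benchmark("example/model"))
```

Each method mirrors one of the prompt's helper scripts, so an agent can chain them end to end via run_benchmark() while a user can still invoke any single step directly.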

Catalog Information

The agentic-bench project is a framework for validating and benchmarking agent-driven models.

Description

Agentic-bench is an open-source tool designed to facilitate the evaluation and comparison of agent-driven models. It provides a standardized approach to model validation and benchmarking, enabling researchers and developers to assess the performance and efficiency of their agents in a consistent manner. The framework supports various types of agent-driven models and allows for customization of benchmarking scenarios.

Novelty

7/10

Tags

model-validation benchmarking agent-driven-models performance-evaluation efficiency-assessment

Claude Models

claude-opus-4.6

Quality Score

C+ (77.9/100)
Structure: 87
Code Quality: 70
Documentation: 73
Testing: 85
Practices: 63
Security: 100
Dependencies: 60
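
One detail worth noting: 77.9 is not the plain mean of the seven sub-scores, which comes to about 76.9, so the scanner presumably applies per-category weights (the weighting scheme itself is not published on this page). A quick check:

```python
# Sub-scores as published in the breakdown above.
scores = {"structure": 87, "code_quality": 70, "documentation": 73,
          "testing": 85, "practices": 63, "security": 100, "dependencies": 60}

mean = sum(scores.values()) / len(scores)
print(f"unweighted mean: {mean:.1f}")  # ~76.9, vs the reported 77.9
```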

Strengths

  • CI/CD pipeline configured (github_actions)
  • Good test coverage (75% test-to-source ratio)
  • Code linting configured (likely ruff)
  • Consistent naming conventions (snake_case)
  • Good security practices: no major issues detected
  • Properly licensed project

Security & Health

Tech Debt: 5.1h (B)
OWASP: A (100%)
Quality Gate: PASS
Risk: A (2)
License: MIT
Duplication: 0.0%

Languages

markdown: 51.3%
python: 35.7%
html: 6.3%
json: 5.8%
yaml: 0.4%
toml: 0.4%

Frameworks

pytest

Concepts (2)

| Category | Name | Description | Confidence |
| --- | --- | --- | --- |
| auto_description | Project Description | A framework that autonomously validates arbitrary ML models through Claude Code skill chains, running model investigation → GPU execution → report generation end to end. | 80% |
| auto_category | Testing | testing | 70% |

Quality Timeline

1 quality score recorded.

Embed Badge

Add to your README:

![Quality](https://repos.aljefra.com/badge/90822.svg)