Agentic Bench
C+ · 78 · completed
Framework · unknown / markdown · small
Files: 62
LOC: 6,298
Frameworks: 1
Languages: 6
Pipeline State
State: completed
Run ID: #366638
Phase: done
Progress: 1%
Started: —
Finished: 2026-04-13 01:31:02
LLM tokens: 0
Pipeline Metadata
Stage: Cataloged
Decision: proceed
Novelty: 64.39
Framework unique: —
Isolation: —
Last stage change: 2026-05-10 03:35:17
Deduplication group: #52009
Member of a group with 3 similar repo(s); canonical: #76035
Top concepts (2)
Project Description · Testing
Repobility — the code-quality scanner for AI-generated software · https://repobility.com
AI Prompt
Create a framework for validating and benchmarking agent-driven machine learning models, similar to the agentic-bench concept. I need it to handle the entire workflow: model investigation, GPU execution, and report generation, all triggered by natural language instructions. The system should support connecting to various model providers using API keys for HuggingFace, OpenAI, Anthropic, and Google. Key features must include automated steps for research (like checking model info or searching), cost estimation, executing inference on a GPU, and finally generating an HTML report with metrics. Please structure it to allow for both high-level agentic calls and direct execution of helper scripts for specific tasks.
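The prompt above describes a pipeline of research, cost estimation, GPU inference, and HTML reporting. A minimal sketch of that workflow is shown below; every class and function name here is hypothetical and illustrative, not part of agentic-bench's actual API, and the GPU step is stubbed out.

```python
# Hypothetical sketch of the prompted workflow: research a model, estimate
# cost, run inference, and emit an HTML report. Names are illustrative only.
import html
from dataclasses import dataclass, field

@dataclass
class BenchmarkRun:
    model_id: str
    provider: str              # e.g. "huggingface", "openai", "anthropic"
    metrics: dict = field(default_factory=dict)

def estimate_cost(num_tokens: int, price_per_1k: float) -> float:
    """Rough cost estimate before committing any GPU time."""
    return num_tokens / 1000 * price_per_1k

def run_inference(run: BenchmarkRun, prompts: list[str]) -> BenchmarkRun:
    # Stand-in for the GPU execution step; a real system would call the
    # provider's API or load the model onto a GPU here.
    run.metrics["prompts"] = len(prompts)
    # Assume ~200 tokens per prompt and a placeholder price for the estimate.
    run.metrics["est_cost_usd"] = estimate_cost(len(prompts) * 200, 0.01)
    return run

def render_report(run: BenchmarkRun) -> str:
    """Render collected metrics as a small HTML fragment."""
    rows = "".join(
        f"<tr><td>{html.escape(k)}</td><td>{v}</td></tr>"
        for k, v in run.metrics.items()
    )
    return (f"<h1>Benchmark: {html.escape(run.model_id)}</h1>"
            f"<table>{rows}</table>")

run = run_inference(BenchmarkRun("meta-llama/Llama-3-8B", "huggingface"),
                    ["Hello", "World"])
print(render_report(run))
```

In a real implementation, provider API keys would be read from the environment and the inference step dispatched per provider; this sketch only fixes the data flow between the stages.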
python ml-benchmarking agentic llm gpu api-integration automation research
Generated by gemma4:latest
Catalog Information
The agentic-bench project is a framework for validating and benchmarking agent-driven models.
Description
Agentic-bench is an open-source tool designed to facilitate the evaluation and comparison of agent-driven models. It provides a standardized approach to model validation and benchmarking, enabling researchers and developers to assess the performance and efficiency of their agents in a consistent manner. The framework supports various types of agent-driven models and allows for customization of benchmarking scenarios.
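The description mentions customizable benchmarking scenarios. A minimal sketch of what such a scenario definition could look like follows; the `Scenario` type and `validate` helper are assumptions for illustration, not agentic-bench's real configuration format.

```python
# Hypothetical scenario definition; agentic-bench's actual configuration
# format is not shown in this catalog entry.
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    name: str
    prompts: list                      # inputs fed to every model under test
    max_tokens: int = 256              # generation budget per prompt
    metrics: tuple = ("latency_s", "tokens_per_s")

def validate(scenario: Scenario) -> None:
    """Reject obviously malformed scenarios before any GPU time is spent."""
    if not scenario.prompts:
        raise ValueError(f"scenario {scenario.name!r} has no prompts")
    if scenario.max_tokens <= 0:
        raise ValueError("max_tokens must be positive")

qa = Scenario("qa-smoke", ["What is 2+2?", "Name a prime > 10."])
validate(qa)
print(qa.name, len(qa.prompts))
```

Validating the scenario up front keeps misconfiguration errors cheap, which matters when each run consumes paid API calls or GPU hours.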
Description (translated from Arabic)
This project is a framework for analyzing and evaluating the performance of agent-driven models. It allows developers and researchers to evaluate the performance of their models consistently, helping to compare performance and efficiency across different types of agent-driven models.
Novelty: 7/10
Tags
model-validation benchmarking agent-driven-models performance-evaluation efficiency-assessment
Claude Models: claude-opus-4.6
Quality Score: C+ (77.9/100)
Structure: 87
Code Quality: 70
Documentation: 73
Testing: 85
Practices: 63
Security: 100
Dependencies: 60
Strengths
- CI/CD pipeline configured (github_actions)
- Good test coverage (75% test-to-source ratio)
- Code linting configured (ruff (possible))
- Consistent naming conventions (snake_case)
- Good security practices (no major issues detected)
- Properly licensed project
Security & Health
Tech Debt: B (5.1h)
OWASP: A (100%)
Quality Gate: PASS
Risk: A (2)
License: MIT
Duplication: 0.0%
Languages
Frameworks: pytest
Concepts (2)
| Category | Name | Description | Confidence |
|---|---|---|---|
| auto_description | Project Description | A framework that autonomously validates arbitrary ML models using Claude Code skill chains. Runs model investigation → GPU execution → report generation end to end. (translated from Japanese) | 80% |
| auto_category | Testing | testing | 70% |
Embed Badge
Add to your README:
