Terminal Bench Science

Grade B (82) · completed
Framework · containerized / yaml · small
Files: 250
LOC: 5,963
Frameworks: 0
Languages: 7

Pipeline State

Status: completed
Run ID: #344607
Phase: done
Progress: 1%
Started:
Finished: 2026-04-13 01:31:02
LLM tokens: 0

Pipeline Metadata

Stage: Cataloged
Decision: proceed
Novelty: 61.76
Isolation: Framework unique
Last stage change: 2026-05-10 03:35:34
Deduplication group: #54120 (member of a group with 1 similar repo; this repo is canonical)
Top concepts (1): Documentation

AI Prompt

Create a framework for evaluating AI agents on complex scientific workflows that are executed entirely in the terminal. The system should support defining tasks, which can be structured using YAML, Python, or shell scripts. I need to be able to manage task proposals and review processes, potentially using TOML for rubrics. Please structure the project to handle these different file types and provide documentation on how to run the benchmarks.
Keywords: yaml, python, shell, scientific, ai-agent, workflow, terminal, framework, evaluation
Generated by gemma4:latest
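The prompt above asks for tasks defined in YAML with TOML rubrics. A minimal sketch of what one such task file might look like — every field name here is hypothetical, not taken from the repository:

```yaml
# Hypothetical task definition — field names are illustrative only.
id: protein-fold-001
description: Run a folding simulation and report the lowest-energy structure
setup: scripts/setup.sh            # shell script that prepares the container
command: python run_fold.py --steps 1000
rubric: rubrics/protein-fold-001.toml   # TOML rubric used during review
timeout_seconds: 600
```

A companion TOML rubric would then list the criteria a reviewer (human or automated) scores the run against.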

Catalog Information

A framework for evaluating AI agents on complex scientific workflows executed in the terminal.

Description

This framework provides a collection of intricate scientific workflows that run in a terminal environment, designed to test the capabilities of AI agents in realistic settings. Users can execute sequential tasks that require precise terminal command control while measuring execution time and result accuracy. It includes a command‑line interface that simplifies test setup and generates comprehensive performance reports. The framework targets AI researchers and developers building autonomous agents who need a reliable benchmark for their solutions. It addresses the lack of robust evaluation tools for AI in complex scientific contexts, offering a repeatable and scalable testing environment.
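The description mentions measuring execution time and result accuracy for terminal commands. A minimal sketch of how such a measurement could work, assuming nothing about the framework's actual API (`run_task` and its return shape are invented for illustration):

```python
import subprocess
import time

def run_task(command: str, expected_output: str, timeout: float = 60.0) -> dict:
    """Run one terminal task; score wall-clock time and output accuracy.

    Illustrative sketch only — not the framework's real interface.
    """
    start = time.monotonic()
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    elapsed = time.monotonic() - start
    return {
        "elapsed_s": elapsed,                                # execution time
        "exit_code": result.returncode,                      # did it run cleanly?
        "correct": result.stdout.strip() == expected_output, # result accuracy
    }

report = run_task("echo 42", expected_output="42")
print(report["correct"], report["exit_code"])
```

A real harness would add per-task sandboxing (the catalog notes Docker deployment) and aggregate these per-task dicts into the performance reports the CLI generates.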

Description (translated from Arabic)

This framework provides a set of complex scientific workflows executed through the terminal, designed specifically to test the capabilities of AI agents in a realistic environment. It lets users run sequential tasks that require precise control of terminal commands, with the ability to measure execution time and result accuracy. The framework includes a command-line interface that simplifies test setup and generates comprehensive performance reports. It targets AI researchers and agent developers who need a reliable benchmark for evaluating their solutions. It addresses the shortage of comprehensive evaluation tools for AI in complex scientific contexts while providing a repeatable environment. It distinguishes itself by focusing on realistic rather than simplified scenarios, ensuring closer alignment with practical applications.

Novelty

8/10

Tags

benchmarking ai-evaluation scientific-workflows terminal-automation performance-measurement workflow-simulation

Claude Models

claude-opus-4.6

Quality Score

B (81.9/100)

Structure: 79
Code Quality: 88
Documentation: 77
Testing: 85
Practices: 65
Security: 100
Dependencies: 60

Strengths

  • CI/CD pipeline configured (github_actions)
  • Good test coverage (135% test-to-source ratio)
  • Consistent naming conventions (snake_case)
  • Good security practices: no major issues detected
  • Containerized deployment (Docker)
  • Properly licensed project

Weaknesses

  • 1 'god file' with >500 LOC needs decomposition

Recommendations

  • Add a linter configuration to enforce code style consistency
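The recommendation above could be met with a small linter configuration. A hedged sketch using Ruff in `pyproject.toml` — the tool choice and rule selection are assumptions, not taken from the repository:

```toml
# Hypothetical linter setup; adjust rules to the project's conventions.
[tool.ruff]
line-length = 100
target-version = "py311"

[tool.ruff.lint]
select = ["E", "F", "I"]  # pycodestyle errors, pyflakes, import sorting
```

Since the repo already uses TOML for rubrics, keeping tool configuration in `pyproject.toml` avoids adding another config format.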

Security & Health

Tech Debt: C (6.1h)
OWASP: A (100%)
Quality Gate: PASS
Risk: A (2)
License: Apache-2.0
Duplication: 8.0%

Languages

yaml: 25.9%
python: 25.4%
shell: 18.9%
markdown: 17.9%
toml: 11.7%
json: 0.2%
text: 0.1%

Frameworks

None detected

Concepts (1)

Category: auto_category
Name: Documentation
Description: docs
Confidence: 70%

Quality Timeline

1 quality score recorded.

Embed Badge

Add to your README:

![Quality](https://repos.aljefra.com/badge/68666.svg)