Nim Skill Test

D 52 completed

Testing

unknown / typescript · tiny

Files

3,784

LOC

Frameworks

Languages


Pipeline State

completed

Run ID

#370925

Phase

done

Progress

Started

Finished

2026-04-13 01:31:02

LLM tokens

Pipeline Metadata

Stage

Cataloged

Decision

proceed

Novelty

58.01

Framework unique

—

Isolation

—

Last stage change

2026-05-10 03:34:36

Deduplication group #48162

Member of a group with 3 similar repos; canonical: #67468

Top concepts (1)

Project Description


AI Prompt

Create a TypeScript application to evaluate how well NIM-hosted LLMs can autonomously complete multi-step skill tasks. The system should support running benchmark tests like 'apocalypse-radio' and 'moltbook', which involve using tools like bash, curl, and python3 within fresh Docker containers. I need a test runner, `src/run.ts`, that manages these experiments, tracks progress milestones for each model, and persists results using SQLite via `src/db.ts`. The dashboard should display the pass rate and detailed experiment logs.
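The pass rate and milestone tracking mentioned in the prompt could be computed roughly as follows. This is a minimal sketch, not the project's code: the `ExperimentResult` and `ModelSummary` shapes and the `summarize` function are hypothetical names, since the real schema in `src/db.ts` is not shown here, and a run is assumed to pass only when it reaches every milestone.

```typescript
// Hypothetical record for one experiment run; the real columns in src/db.ts may differ.
interface ExperimentResult {
  model: string;
  test: string; // e.g. "apocalypse-radio" or "moltbook"
  milestonesReached: number;
  milestonesTotal: number;
}

// Hypothetical per-model aggregate shown on a dashboard row.
interface ModelSummary {
  model: string;
  runs: number;
  passed: number;
  passRate: number; // fraction in [0, 1]
  bestMilestone: number; // furthest milestone reached in any run
}

// Aggregate runs per model. Assumption: a run passes only if all milestones were reached.
function summarize(results: ExperimentResult[]): ModelSummary[] {
  const byModel = new Map<string, ModelSummary>();
  for (const r of results) {
    const s: ModelSummary = byModel.get(r.model) ?? {
      model: r.model,
      runs: 0,
      passed: 0,
      passRate: 0,
      bestMilestone: 0,
    };
    s.runs += 1;
    if (r.milestonesTotal > 0 && r.milestonesReached >= r.milestonesTotal) {
      s.passed += 1;
    }
    s.bestMilestone = Math.max(s.bestMilestone, r.milestonesReached);
    s.passRate = s.passed / s.runs;
    byModel.set(r.model, s);
  }
  // Highest pass rate first, as a leaderboard would list it.
  return Array.from(byModel.values()).sort((a, b) => b.passRate - a.passRate);
}
```

Keeping the furthest milestone alongside the pass rate lets a dashboard separate models that fail at the first step from models that nearly finish.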

typescript llm ai docker testing automation benchmark sqlite bash

Generated by gemma4:latest

Catalog Information

Evaluate NIM-hosted LLMs on autonomous multi-step skill tasks.

Description

This tool evaluates NIM-hosted large language models on autonomous execution of multi-step skill documents. It launches dozens of models in isolated Docker containers, equips each with a bash interface and a skill.md instruction set, and lets them attempt tasks such as registering on GitLab or posting on Moltbook. The system tracks progress milestones, pass rates, and logs detailed experiment data, providing a real‑time dashboard for visualizing results. It supports retry logic, concurrency control, and false‑positive filtering to ensure reliable metrics. Targeted at AI researchers and LLM developers, it offers a reproducible benchmark for measuring model autonomy and problem‑solving capabilities.
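The retry logic and concurrency control described above might look something like the sketch below. The helper names `withRetry` and `runWithLimit`, the fixed delay between attempts, and the worker-pool approach are all assumptions made for illustration; the repository's actual implementation is not shown on this page.

```typescript
// Retry an async task up to `maxAttempts` times, waiting `delayMs` between attempts.
// Assumption: every error is treated as transient; a real runner would likely be more selective.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts: number,
  delayMs: number,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}

// Run tasks with at most `limit` in flight at once; results keep the input order.
async function runWithLimit<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;

  // Each worker repeatedly claims the next unclaimed task index until none remain.
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  const workerCount = Math.max(1, Math.min(limit, tasks.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

A runner could wrap each model's experiment as `() => withRetry(runOne, 3, 1000)` (where `runOne` is a hypothetical function) and pass the list to `runWithLimit` to cap how many containers and API calls are active at a time. Note that in this sketch a task that still fails after its retries rejects the whole batch.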

Description (translated from Arabic)

This system is a tool for evaluating how well large language models hosted on the NIM platform can execute multi-step skill documents autonomously. It runs a set of models inside Docker containers, gives each model a bash tool and a skill.md instruction set, and then lets the model attempt to complete the tasks on its own. The system monitors success criteria such as the pass rate and the progress milestones each model reaches before failing, and it records detailed logs for every experiment. The dashboard displays results in real time, with support for retries and API rate-limit management. It targets AI researchers and model developers who need a reliable tool for measuring model performance in complex applied scenarios. It addresses the difficulty teams face in measuring model effectiveness on multi-step tasks and gives them precise data on strengths and weaknesses. It distinguishes itself through detailed milestone tracking, the ability to handle transient errors, and an easy-to-use interface for viewing and analysis.

Novelty

7/10

Claude Models

claude-opus-4.6

Quality Score

51.5/100

Structure

Code Quality

Documentation

Testing

Practices

Security

Dependencies

Strengths

Good security practices: no major issues detected

Weaknesses

No LICENSE file: legal ambiguity for contributors
No tests found: high risk of regressions
No CI/CD configuration: manual testing and deployment
177 duplicate lines detected: consider DRY refactoring
1 "god file" with >500 LOC needs decomposition

Recommendations

Add a test suite: start with critical-path integration tests
Set up CI/CD (GitHub Actions recommended) to automate testing and deployment
Add a linter configuration to enforce code style consistency
Add a LICENSE file (MIT recommended for open source)

Security & Health

4.6h

Tech Debt (C)

OWASP (100%)

PASS

Quality Gate

Risk (3)


Unknown

License

0.4%

Duplication


Languages

typescript

63.8%

markdown

25.1%

json

11.2%

Frameworks

None detected

Concepts (1)

Category	Name	Description	Confidence
auto_description	Project Description	Test how well NIM-hosted LLMs can autonomously follow a multi-step skill document — using only bash, curl, python3, and ssh-keygen in a fresh Docker container.	80%

Quality Timeline

1 quality score recorded.


Embed Badge

Add to your README:

![Quality](https://repos.aljefra.com/badge/95132.svg)
