Merit
C 64 · completed
Other · library / python · small
Files: 52
LOC: 6,300
Frameworks: 0
Languages: 3
Pipeline State
Status: completed
Run ID: #387302
Phase: done
Progress: 1%
Started:
Finished: 2026-04-13 01:31:02
LLM tokens: 0
Pipeline Metadata
Stage: Skipped
Decision: skip_scaffold_dup
Novelty: 37.53
Framework unique: —
Isolation: —
Last stage change: 2026-04-16 18:15:42
Deduplication group: #47572
Member of a group with 1 similar repo(s) — canonical #27562
Top concepts (2): Project Description, Library
Repobility · MCP-ready · https://repobility.com
AI Prompt
Create a Python framework, similar to MERIT, for multi-dimensional evaluation of language models, suitable for NeurIPS-style research. I need it to go beyond simple accuracy by measuring logical consistency, factual accuracy, reasoning quality, and alignment. The tool should support both heuristic metrics and LLM-as-judge evaluation. Please include functionality to run evaluations against datasets like ARC, generate paper-ready LaTeX tables from results, and allow comparing multiple experiment outputs.
python library llm-evaluation nlp reasoning metrics neurips framework
Generated by gemma4:latest
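The prompt above asks for an evaluation API that combines heuristic metrics with LLM-as-judge scoring across several dimensions. A minimal sketch of what such an interface could look like follows; all names here (`EvalResult`, `heuristic_consistency`, `evaluate`) are illustrative assumptions, not MERIT's actual API.

```python
# Hypothetical sketch of a multi-dimensional evaluation interface.
# Names and scoring logic are illustrative, not MERIT's real API.
from dataclasses import dataclass, field


@dataclass
class EvalResult:
    """Scores in [0.0, 1.0], one per evaluation dimension."""
    scores: dict = field(default_factory=dict)

    def aggregate(self) -> float:
        # Unweighted mean across dimensions; a real framework
        # would likely support per-dimension weights.
        return sum(self.scores.values()) / len(self.scores)


def heuristic_consistency(answer: str, reference: str) -> float:
    """Toy heuristic metric: Jaccard token overlap as a consistency proxy."""
    a, r = set(answer.lower().split()), set(reference.lower().split())
    return len(a & r) / len(a | r) if a | r else 0.0


def evaluate(answer: str, reference: str) -> EvalResult:
    # A real framework would add LLM-as-judge scores alongside heuristics.
    return EvalResult(scores={
        "factual_accuracy": heuristic_consistency(answer, reference),
        "logical_consistency": 1.0 if answer else 0.0,  # placeholder check
    })


result = evaluate("Paris is the capital of France",
                  "The capital of France is Paris")
print(round(result.aggregate(), 2))  # → 1.0
```

Paper-ready output (e.g. a LaTeX table of per-dimension scores) would then be a straightforward rendering step over `EvalResult.scores`.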
Catalog Information
Description
A NeurIPS-oriented framework for multi-dimensional evaluation of reasoning in language models. MERIT goes beyond accuracy to measure logical consistency, factual accuracy, reasoning quality, and alignment using both heuristic metrics and LLM-as-judge evaluation.
Novelty: 3/10
Tags: python library llm-evaluation nlp reasoning metrics neurips framework
Technologies: anthropic
Claude Models: claude-opus-4-6
Quality Score: C (64.3/100)
Structure: 67
Code Quality: 64
Documentation: 62
Testing: 60
Practices: 50
Security: 92
Dependencies: 60
Strengths
- Good test coverage (39% test-to-source ratio)
- Consistent naming conventions (snake_case)
- Good security practices — no major issues detected
- Properly licensed project
Weaknesses
- No CI/CD configuration — manual testing and deployment
- 5 bare except/catch blocks swallowing errors
- 478 duplicate lines detected — consider DRY refactoring
Recommendations
- Set up CI/CD (GitHub Actions recommended) to automate testing and deployment
- Add a linter configuration to enforce code style consistency
- Replace bare except/catch blocks with specific exception types
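The bare-except recommendation above can be illustrated with a short sketch; `load_results` is a hypothetical helper for this example, not a function from the repository.

```python
# Illustrative fix for the "bare except blocks swallowing errors" finding.
# `load_results` is a hypothetical function name used only for this sketch.
import json
import logging

logger = logging.getLogger(__name__)


def load_results(path: str) -> dict:
    # Bad pattern: `except:` also swallows KeyboardInterrupt/SystemExit
    # and hides programming errors. Catch only what you can handle.
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        logger.warning("results file %s missing; using empty results", path)
        return {}
    except json.JSONDecodeError as exc:
        # Re-raise with context instead of silently returning bad data.
        raise ValueError(f"malformed results file: {path}") from exc
```

Narrowing the handled exception types keeps genuine bugs visible while still recovering from the expected failure modes.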
Security & Health
Tech Debt: B (5.1h)
OWASP: A (100%)
Quality Gate: PASS
Risk: A (2)
Repobility · open methodology · https://repobility.com/research/
License: MIT
Duplication: 5.2%
Languages
Frameworks: None detected
Concepts (2)

| Category | Name | Description | Confidence |
|---|---|---|---|
| auto_description | Project Description | | 80% |
| auto_category | Library | library | 70% |