Merit

Grade: C (64) · completed
Category: Other · library / python · small
Files: 52
LOC: 6,300
Frameworks: 0
Languages: 3

Pipeline State

Status: completed
Run ID: #387302
Phase: done
Progress: 1%
Started:
Finished: 2026-04-13 01:31:02
LLM tokens: 0

Pipeline Metadata

Stage: Skipped
Decision: skip_scaffold_dup
Novelty: 37.53
Framework: unique
Isolation:
Last stage change: 2026-04-16 18:15:42
Deduplication group: #47572 (member of a group with 1 similar repo; canonical #27562)
Top concepts (2): Project Description, Library
Repobility · MCP-ready · https://repobility.com

AI Prompt

Create a Python framework, similar to MERIT, for multi-dimensional evaluation of language models, suitable for NeurIPS-style research. I need it to go beyond simple accuracy by measuring logical consistency, factual accuracy, reasoning quality, and alignment. The tool should support both heuristic metrics and LLM-as-judge evaluation. Please include functionality to run evaluations against datasets like ARC, generate paper-ready LaTeX tables from results, and allow comparing multiple experiment outputs.
python library llm-evaluation nlp reasoning metrics neurips framework
Generated by gemma4:latest
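The prompt above describes a multi-dimensional evaluator with heuristic metrics, optional LLM-as-judge scoring, and LaTeX table export. A minimal sketch of that shape is below; every name here (the `EvalResult` class, the dimension list, the scoring rules) is a hypothetical illustration of the requested design, not MERIT's actual API.

```python
from dataclasses import dataclass, field
from statistics import mean

# Dimension names follow the prompt; the scoring logic is a toy stand-in.
DIMENSIONS = ("consistency", "factuality", "reasoning", "alignment")

@dataclass
class EvalResult:
    scores: dict = field(default_factory=dict)

    @property
    def overall(self) -> float:
        # Simple unweighted mean across dimensions.
        return mean(self.scores.values())

def heuristic_consistency(answer: str, reference: str) -> float:
    # Toy heuristic metric: token overlap with the reference answer.
    a, r = set(answer.lower().split()), set(reference.lower().split())
    return len(a & r) / max(len(r), 1)

def evaluate(answer: str, reference: str, judge=None) -> EvalResult:
    result = EvalResult()
    result.scores["consistency"] = heuristic_consistency(answer, reference)
    # An LLM-as-judge callable would score the remaining dimensions;
    # this sketch falls back to the heuristic score when none is given.
    for dim in DIMENSIONS[1:]:
        result.scores[dim] = judge(dim, answer) if judge else result.scores["consistency"]
    return result

def to_latex_row(name: str, result: EvalResult) -> str:
    # One paper-ready LaTeX table row per model, as the prompt asks.
    cells = " & ".join(f"{result.scores[d]:.2f}" for d in DIMENSIONS)
    return f"{name} & {cells} & {result.overall:.2f} \\\\"
```

Dataset runners (e.g. over ARC) and experiment comparison would layer on top of `evaluate`, mapping it over examples and diffing the resulting score tables.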

Catalog Information

Description

A NeurIPS-oriented framework for multi-dimensional evaluation of reasoning in language models. MERIT goes beyond accuracy to measure logical consistency, factual accuracy, reasoning quality, and alignment using both heuristic metrics and LLM-as-judge evaluation.

Novelty

3/10

Tags

python library llm-evaluation nlp reasoning metrics neurips framework

Technologies

anthropic

Claude Models

claude-opus-4-6

Quality Score

Overall: C (64.3/100)
Structure: 67
Code Quality: 64
Documentation: 62
Testing: 60
Practices: 50
Security: 92
Dependencies: 60

Strengths

  • Good test coverage (39% test-to-source ratio)
  • Consistent naming conventions (snake_case)
  • Good security practices: no major issues detected
  • Properly licensed project

Weaknesses

  • No CI/CD configuration: testing and deployment are manual
  • 5 bare except/catch blocks swallowing errors
  • 478 duplicate lines detected; consider DRY refactoring

Recommendations

  • Set up CI/CD (GitHub Actions recommended) to automate testing and deployment
  • Add a linter configuration to enforce code style consistency
  • Replace bare except/catch blocks with specific exception types
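The last recommendation can be illustrated with a before/after sketch. The function names and file paths are hypothetical examples, not code from this repository.

```python
import logging

logger = logging.getLogger(__name__)

def load_scores_bad(path):
    try:
        with open(path) as f:
            return [float(line) for line in f]
    except:  # bare except: hides typos, KeyboardInterrupt, everything
        return []

def load_scores(path):
    try:
        with open(path) as f:
            return [float(line) for line in f]
    except FileNotFoundError:
        # A missing file is an expected, recoverable condition.
        logger.warning("missing score file: %s", path)
        return []
    except ValueError as exc:
        # Malformed data is a real problem; surface it instead of swallowing it.
        raise RuntimeError(f"bad score line in {path}") from exc
```

Catching specific exception types keeps genuine bugs visible while still handling the failure modes the code actually anticipates.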

Security & Health

Tech Debt: B (5.1h)
OWASP: A (100%)
Quality Gate: PASS
Risk: A (2)
License: MIT
Duplication: 5.2%

Languages

python: 95.0%
markdown: 4.6%
text: 0.4%

Frameworks

None detected

Concepts (2)

Category · Name · Confidence
auto_description · Project Description · 80%
auto_category · Library (library) · 70%

Quality Timeline

1 quality score recorded.


Embed Badge

Add to your README:

![Quality](https://repos.aljefra.com/badge/111603.svg)