Data Check

C+ 77 completed
Data Tool
cli / python · tiny
Files: 42
LOC: 8,305
Frameworks: 2
Languages: 4

Pipeline State

completed
Run ID
#355239
Phase
done
Progress
1%
Started
Finished
2026-04-13 01:31:02
LLM tokens
0

Pipeline Metadata

Stage
Cataloged
Decision
proceed
Novelty
59.49
Framework unique
Isolation
Last stage change
2026-05-10 03:35:24
Deduplication group #55153
Member of a group with 4 similar repo(s); canonical: #61249
Top concepts (2)
Project Description, Web Backend
Repobility (the analyzer behind this table) · https://repobility.com

AI Prompt

Create a command-line toolkit in Python for automated, multi-dimensional data quality inspection, specifically for LLM training data. The tool needs a composable rule engine that can handle built-in rules (like checking for required fields, format, and PII) and allow for custom rules defined in YAML. It must include statistical anomaly detection using both IQR and Z-score methods for numerical and length anomalies. Finally, implement an auto-fix pipeline that can perform deduplication, strip whitespace, and redact PII, and output structured reports in Markdown, JSON, and HTML formats.
python cli data-quality llm validation anomaly-detection fastapi pytest yaml
Generated by gemma4:latest
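
The IQR and Z-score checks named in the prompt can be illustrated with a minimal sketch that uses only the Python standard library. The function names `iqr_outliers` and `zscore_outliers`, the default thresholds (1.5 and 3.0), and the sample data are assumptions made for illustration; they are not taken from the repository.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds threshold (hypothetical helper)."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        # All values are identical, so nothing can be an outlier.
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Text lengths of sample records; the last one is suspiciously long.
lengths = [12, 15, 14, 13, 16, 15, 14, 13, 15, 14, 400]
print(iqr_outliers(lengths))
print(zscore_outliers(lengths))
```

The IQR rule is generally less sensitive to the outliers themselves, because a single extreme value inflates the standard deviation that the Z-score depends on; running both and comparing the results is one plausible reason a tool would offer the two methods side by side.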

Catalog Information

A toolkit that automates data quality inspection, including validation, anomaly detection, and distribution analysis.

Description

The toolkit provides a comprehensive suite for inspecting data quality across large datasets. It automates validation against user-defined schemas and checks for missing or inconsistent values. Advanced statistical methods detect anomalies and outliers, while distribution analysis highlights shifts in data patterns. The solution is accessible via both a command-line interface and a lightweight web API, enabling integration into existing pipelines. Designed for data engineers and analysts, it helps ensure reliable data for downstream analytics and machine learning.
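
As a rough illustration of validating records and flagging missing values, the sketch below composes small rule functions and applies them to each record. The names `required_fields`, `max_length`, and `validate`, as well as the record layout, are hypothetical and are not the toolkit's actual interface.

```python
from typing import Callable, Optional

# A rule takes one record and returns an error message, or None if it passes.
Rule = Callable[[dict], Optional[str]]


def required_fields(*names: str) -> Rule:
    """Build a rule that flags missing or empty fields (hypothetical helper)."""
    def check(record: dict) -> Optional[str]:
        missing = [n for n in names if record.get(n) in (None, "")]
        return f"missing fields: {', '.join(missing)}" if missing else None
    return check


def max_length(field: str, limit: int) -> Rule:
    """Build a rule that flags a string field longer than limit (hypothetical helper)."""
    def check(record: dict) -> Optional[str]:
        value = record.get(field) or ""
        if len(value) > limit:
            return f"{field} longer than {limit} characters"
        return None
    return check


def validate(records, rules):
    """Apply every rule to every record; return (index, message) pairs."""
    issues = []
    for i, record in enumerate(records):
        for rule in rules:
            message = rule(record)
            if message is not None:
                issues.append((i, message))
    return issues


records = [
    {"prompt": "What is 2 + 2?", "response": "4"},
    {"prompt": "Summarize this text.", "response": ""},
]
rules = [required_fields("prompt", "response"), max_length("prompt", 100)]
issues = validate(records, rules)
print(issues)
```

Keeping each rule as a plain function that returns a message or `None` makes it straightforward to build the rule list from a configuration file, which is one way user-defined checks could be plugged in.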

Description (translated from Arabic)

This tool provides a comprehensive suite for inspecting data quality across large datasets. It automates validation against user-defined schemas and checks for missing or inconsistent values. It uses advanced statistical methods to detect anomalies and outliers, while distribution analysis highlights shifts in data patterns. The solution is accessible through a simple command-line interface and a lightweight web API, enabling integration into existing pipelines. Designed for data engineers and analysts, it helps ensure reliable data for downstream analytics and machine learning. The tool can generate detailed distribution reports that help assess consistency and reliability. It also supports integration with ETL systems to run continuous automated checks.

Novelty

6/10

Tags

data-quality validation anomaly-detection distribution-analysis automated-inspection data-profiling statistical-analysis

Technologies

anthropic click fastapi numpy openai pydantic scipy uvicorn

Claude Models

claude-opus-4.6

Quality Score

C+ (76.9/100)
Structure: 88
Code Quality: 64
Documentation: 80
Testing: 75
Practices: 68
Security: 100
Dependencies: 60

Strengths

  • CI/CD pipeline configured (github_actions)
  • Good test coverage (47% test-to-source ratio)
  • Code linting configured (possibly ruff)
  • Consistent naming conventions (snake_case)
  • Good security practices: no major issues detected
  • Properly licensed project

Weaknesses

  • 510 duplicate lines detected; consider DRY refactoring
  • 3 'god files' with >500 LOC need decomposition

Security & Health

Tech Debt (B): 5.1h
OWASP (100%): A
Quality Gate: PASS
Risk (2): A
License: MIT
Duplication: 3.8%

Languages

python: 80.8%
markdown: 11.6%
yaml: 6.7%
toml: 0.8%

Frameworks

FastAPI pytest

Concepts (2)

Category | Name | Description | Confidence
auto_description | Project Description | Data quality inspection toolkit - automated validation, anomaly detection, and distribution analysis | 80%
auto_category | Web Backend | web-backend | 70%

Quality Timeline

1 quality score recorded.

Embed Badge

Add to your README:

![Quality](https://repos.aljefra.com/badge/79363.svg)