Winnow

B 85 completed
Other
cli / python · small
Files: 59
LOC: 3,763
Frameworks: 2
Languages: 5

Pipeline State

Status: completed
Run ID: #392853
Phase: done
Progress: 1%
Started:
Finished: 2026-04-13 01:31:02
LLM tokens: 0

Pipeline Metadata

Stage: Skipped
Decision: skip_scaffold_dup
Novelty: 34.37
Framework unique
Isolation
Last stage change: 2026-04-16 18:15:42
Deduplication group #47678
Member of a group with 1 similar repo (canonical: #118510)
Top concepts (2): Project Description, Web Backend

AI Prompt

Create a command-line tool using Python that functions as a workbench for curating ML training data. The tool should allow users to filter junk data, deduplicate records, compute embeddings, find outliers, and search a dataset. It needs a CLI interface and a Python SDK. Please structure the project to include features like length filtering, whitespace filtering, and exact deduplication, and ensure it's built with FastAPI for any potential API components, and use pytest for testing.
python cli fastapi pytest ml data-curation toolkit sdk
Generated by gemma4:latest

Catalog Information

Winnow is a Python toolkit for curating ML training data. Filter junk, deduplicate, compute embeddings, find outliers, and search your dataset — from one pip install, with a CLI and a Python SDK.
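
The filtering and deduplication steps named above (length filtering, whitespace filtering, exact deduplication) can be sketched in a few lines of standard-library Python. This is only an illustrative sketch: the function names, parameters, and thresholds below are hypothetical and are not taken from Winnow's actual SDK or CLI.

```python
import hashlib
from typing import Iterable, Iterator, List


def length_filter(records: Iterable[str], min_chars: int = 10,
                  max_chars: int = 10_000) -> Iterator[str]:
    """Yield records whose stripped length is within [min_chars, max_chars]."""
    for text in records:
        if min_chars <= len(text.strip()) <= max_chars:
            yield text


def whitespace_filter(records: Iterable[str],
                      max_ratio: float = 0.5) -> Iterator[str]:
    """Drop records that are empty or consist mostly of whitespace."""
    for text in records:
        if not text:
            continue
        ratio = sum(ch.isspace() for ch in text) / len(text)
        if ratio <= max_ratio:
            yield text


def exact_dedup(records: Iterable[str]) -> Iterator[str]:
    """Keep the first occurrence of each byte-identical record."""
    seen = set()
    for text in records:
        # Hashing keeps memory bounded per record for long texts.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield text


def curate(records: Iterable[str]) -> List[str]:
    """Chain the three steps lazily; order of surviving records is preserved."""
    return list(exact_dedup(whitespace_filter(length_filter(records))))


if __name__ == "__main__":
    sample = [
        "hello world, this is fine",
        "hello world, this is fine",      # exact duplicate
        "short",                          # fails the length filter
        " " * 40 + "tiny text here",      # mostly whitespace
        "another clean record",
    ]
    print(curate(sample))
```

Because each step is a generator, the pipeline streams records one at a time, so a large dataset does not need to fit in memory; only the set of seen hashes grows.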

Novelty

3/10

Tags

python cli fastapi pytest ml data-curation toolkit sdk

Technologies

fastapi pydantic

Claude Models

claude-opus-4-6

Quality Score

B (84.9/100)
Structure: 96
Code Quality: 90
Documentation: 90
Testing: 75
Practices: 62
Security: 92
Dependencies: 60

Strengths

  • CI/CD pipeline configured (github_actions)
  • Good test coverage (38% test-to-source ratio)
  • Code linting configured (possibly ruff)
  • Consistent naming conventions (snake_case)
  • Good security practices: no major issues detected
  • Properly licensed project

Weaknesses

  • 110 duplicate lines detected; consider DRY refactoring

Security & Health

Tech Debt (C): 4.1h
OWASP (100%): A
Quality Gate: PASS
Risk (3): A
License: Apache-2.0
Duplication: 0.6%
Provenance: Repobility (https://repobility.com); every score reproducible from /scan/

Languages

python: 57.8%
markdown: 17.6%
html: 11.8%
yaml: 8.7%
toml: 4.2%

Frameworks

FastAPI pytest

Concepts (2)

Category | Name | Description | Confidence
auto_description | Project Description | Quick Start • Install • Features • | 80%
auto_category | Web Backend | web-backend | 70%

Quality Timeline

1 quality score recorded.

Embed Badge

Add to your README:

![Quality](https://repos.aljefra.com/badge/117191.svg)