Winnow
B 85 completed
Other
cli / python · small
59
Files
3,763
LOC
2
Frameworks
5
Languages
Pipeline State
completed
Run ID: #392853
Phase: done
Progress: 1%
Started:
Finished: 2026-04-13 01:31:02
LLM tokens: 0
Pipeline Metadata
Stage: Skipped
Decision: skip_scaffold_dup
Novelty: 34.37
Framework unique: —
Isolation: —
Last stage change: 2026-04-16 18:15:42
Deduplication group #47678
Member of a group with 1 similar repo(s); canonical: #118510
Top concepts (2)
Project Description, Web Backend
AI Prompt
Create a command-line tool using Python that functions as a workbench for curating ML training data. The tool should allow users to filter junk data, deduplicate records, compute embeddings, find outliers, and search a dataset. It needs a CLI interface and a Python SDK. Please structure the project to include features like length filtering, whitespace filtering, and exact deduplication, and ensure it's built with FastAPI for any potential API components, and use pytest for testing.
python cli fastapi pytest ml data-curation toolkit sdk
Generated by gemma4:latest
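The three features the prompt names (length filtering, whitespace filtering, and exact deduplication) can be sketched in a few lines of standard-library Python. This is a minimal illustration only, not Winnow's actual API: the function names `length_filter`, `whitespace_filter`, `exact_dedup`, and `curate`, along with the thresholds `min_chars`, `max_chars`, and `max_ratio`, are hypothetical and chosen for the example.

```python
import hashlib
from typing import Iterable, Iterator, List


def length_filter(records: Iterable[str], min_chars: int = 5,
                  max_chars: int = 10_000) -> Iterator[str]:
    """Keep records whose character count lies in [min_chars, max_chars]."""
    for text in records:
        if min_chars <= len(text) <= max_chars:
            yield text


def whitespace_filter(records: Iterable[str],
                      max_ratio: float = 0.5) -> Iterator[str]:
    """Drop empty records and records that are mostly whitespace."""
    for text in records:
        if not text:
            continue
        ratio = sum(ch.isspace() for ch in text) / len(text)
        if ratio <= max_ratio:
            yield text


def exact_dedup(records: Iterable[str]) -> Iterator[str]:
    """Yield the first occurrence of each record, compared by SHA-256 digest."""
    seen = set()
    for text in records:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield text


def curate(records: Iterable[str]) -> List[str]:
    """Chain the three hypothetical stages: length, whitespace, then dedup."""
    return list(exact_dedup(whitespace_filter(length_filter(records))))


sample = ["hello world", "hello world", "hi", "a    \n\n\n   ", "another record"]
cleaned = curate(sample)
```

In this sketch the stages are generators, so a large dataset streams through without being held in memory until `curate` collects the result. Hashing before comparison keeps the `seen` set small when individual records are long.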
Catalog Information
Description
Winnow is a Python toolkit for curating ML training data. Filter junk, deduplicate, compute embeddings, find outliers, and search your dataset — from one pip install, with a CLI and a Python SDK.
Novelty
3/10
Tags
python cli fastapi pytest ml data-curation toolkit sdk
Technologies
fastapi pydantic
Claude Models
claude-opus-4-6
Quality Score
B
84.9/100
Structure
96
Code Quality
90
Documentation
90
Testing
75
Practices
62
Security
92
Dependencies
60
Strengths
- CI/CD pipeline configured (github_actions)
- Good test coverage (38% test-to-source ratio)
- Code linting configured (ruff (possible))
- Consistent naming conventions (snake_case)
- Good security practices: no major issues detected
- Properly licensed project
Weaknesses
- 110 duplicate lines detected; consider DRY refactoring
Security & Health
4.1h
Tech Debt (C)
A
OWASP (100%)
PASS
Quality Gate
A
Risk (3)
Provenance: Repobility (https://repobility.com) — every score reproducible from /scan/
Apache-2.0
License
0.6%
Duplication
Languages
Frameworks
FastAPI pytest
Concepts (2)
| Category | Name | Description | Confidence |
|---|---|---|---|
| auto_description | Project Description | Quick Start • Install • Features | 80% |
| auto_category | Web Backend | web-backend | 70% |
