Winnow

B 85 completed

Other

cli / python · small

Files

3,763

LOC

Frameworks

Languages

Pipeline State

completed

Run ID

#392853

Phase

done

Progress

Started

Finished

2026-04-13 01:31:02

LLM tokens

Pipeline Metadata

Stage

Skipped

Decision

skip_scaffold_dup

Novelty

34.37

Framework unique

—

Isolation

—

Last stage change

2026-04-16 18:15:42

Deduplication group #47678

Member of a group with 1 similar repo (canonical: #118510)

Top concepts (2)

Project Description, Web Backend

AI Prompt

Create a command-line tool using Python that functions as a workbench for curating ML training data. The tool should allow users to filter junk data, deduplicate records, compute embeddings, find outliers, and search a dataset. It needs a CLI interface and a Python SDK. Please structure the project to include features like length filtering, whitespace filtering, and exact deduplication, and ensure it's built with FastAPI for any potential API components, and use pytest for testing.

python cli fastapi pytest ml data-curation toolkit sdk

Generated by gemma4:latest
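
The prompt above names three basic curation steps: length filtering, whitespace filtering, and exact deduplication. The following is a minimal sketch of how such steps could be chained in plain Python. It is not taken from the Winnow codebase; the function names (`length_filter`, `whitespace_filter`, `exact_dedup`, `curate`) and the default thresholds are hypothetical and chosen only for illustration.

```python
import hashlib
from typing import Iterable, Iterator, List


def length_filter(records: Iterable[str], min_chars: int = 5,
                  max_chars: int = 10_000) -> Iterator[str]:
    """Keep records whose stripped length lies in [min_chars, max_chars]."""
    for text in records:
        if min_chars <= len(text.strip()) <= max_chars:
            yield text


def whitespace_filter(records: Iterable[str],
                      max_ratio: float = 0.5) -> Iterator[str]:
    """Drop empty records and records that are mostly whitespace."""
    for text in records:
        if not text:
            continue
        ratio = sum(ch.isspace() for ch in text) / len(text)
        if ratio <= max_ratio:
            yield text


def exact_dedup(records: Iterable[str]) -> Iterator[str]:
    """Keep the first occurrence of each byte-identical record."""
    seen = set()
    for text in records:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield text


def curate(records: Iterable[str]) -> List[str]:
    """Hypothetical pipeline: length filter, whitespace filter, then dedup."""
    return list(exact_dedup(whitespace_filter(length_filter(records))))


if __name__ == "__main__":
    sample = [
        "hello world",
        "hello world",                      # exact duplicate
        "hi",                               # too short
        "a" + " " * 6 + "b" + " " * 6,      # mostly whitespace
        "another clean record",
    ]
    print(curate(sample))
```

Hashing each record keeps the dedup set small for long texts; a real tool would likely also normalize text before hashing, which this sketch deliberately omits.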

Catalog Information

Winnow is a Python toolkit for curating ML training data. Filter junk, deduplicate, compute embeddings, find outliers, and search your dataset — from one pip install, with a CLI and a Python SDK.

Novelty

3/10

Technologies

fastapi pydantic

Claude Models

claude-opus-4-6

Quality Score

84.9/100

Structure

Code Quality

Documentation

Testing

Practices

Security

Dependencies

Strengths

CI/CD pipeline configured (github_actions)
Good test coverage (38% test-to-source ratio)
Code linting configured (ruff (possible))
Consistent naming conventions (snake_case)
Good security practices: no major issues detected
Properly licensed project

Weaknesses

110 duplicate lines detected; consider DRY refactoring

Security & Health

Tech Debt (C): 4.1h

OWASP (100%)

Quality Gate: PASS

Risk (3)

Provenance: Repobility (https://repobility.com) — every score reproducible from /scan/

License: Apache-2.0

Duplication: 0.6%

Languages

python	57.8%
markdown	17.6%
html	11.8%
yaml	8.7%
toml	4.2%

Frameworks

FastAPI pytest

Concepts (2)

Category	Name	Description	Confidence
auto_description	Project Description	Quick Start • Install • Features •	80%
auto_category	Web Backend	web-backend	70%

Quality Timeline

1 quality score recorded.

Embed Badge

Add to your README:

![Quality](https://repos.aljefra.com/badge/117191.svg)
