Scrawl

B+ 85 completed
Other
cli / markdown · small
Files: 82
LOC: 8,927
Frameworks: 2
Languages: 7

Pipeline State

completed
Run ID: #1545629
Phase: done
Progress: 0%
Started: 2026-04-16 23:28:34
Finished: 2026-04-16 23:28:34
LLM tokens: 0

Pipeline Metadata

Stage: Cataloged
Decision: proceed
Novelty: 52.05
Framework unique
Isolation
Last stage change: 2026-05-10 03:34:51
Deduplication group #49344
Member of a group with 27 similar repo(s) · framework fastapi · canonical #664866

AI Prompt

Create a command-line tool and a web interface for processing Social Security Disability case files (PDFs) into anonymized Markdown. The system should use a 4-stage pipeline: Triage, Extraction, Anonymization, and Assembly. For extraction, support both born-digital PDFs using pymupdf4llm and scanned pages using Docling. The anonymization stage must use Presidio NER with custom SSA recognizers, ensuring that legal elements like statute citations and court names are preserved while redacting PII. The web interface should allow users to upload PDFs and monitor progress, while the CLI should support batch processing and page classification.
python cli web-app ocr pdf-processing fastapi anonymization nlp hipaa
Generated by gemma4:latest
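
The prompt above asks that legal elements such as statute citations survive redaction. The sketch below illustrates that idea with plain regular expressions standing in for Presidio NER, which is what the prompt actually specifies. The pattern names, the `anonymize` function, and the sample text are all hypothetical and are not taken from the Scrawl repository.

```python
import re

# Hypothetical patterns for illustration only. The citation pattern is listed
# first so that, at any position, a citation is matched before a PII pattern.
PATTERNS = [
    # e.g. "42 U.S.C. § 423(d)" or "20 C.F.R. § 404.1520"
    ("CITATION", r"\b\d+\s+(?:U\.S\.C\.|C\.F\.R\.)\s+§\s*\d+(?:\.\d+)?(?:\([a-z0-9]+\))*"),
    ("SSN", r"\b\d{3}-\d{2}-\d{4}\b"),
    # Deliberately broad: any run of 3+ digits (claim numbers and similar).
    ("NUMBER", r"\b\d{3,}\b"),
]

COMBINED = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in PATTERNS))


def anonymize(text: str) -> str:
    """Redact PII-like matches while leaving statute citations untouched."""

    def replace(match: re.Match) -> str:
        if match.lastgroup == "CITATION":
            return match.group(0)  # preserve the legal citation verbatim
        return f"<{match.lastgroup}>"  # replace PII with a typed placeholder

    return COMBINED.sub(replace, text)


sample = (
    "Claimant SSN 123-45-6789, claim 9876543, denied under "
    "20 C.F.R. § 404.1520 and 42 U.S.C. § 423(d)."
)
print(anonymize(sample))
```

Matching everything in a single pass means the broad `NUMBER` pattern never sees the digits inside a citation, because the citation alternative consumes them first. A Presidio-based version would more likely express this as recognizer configuration or an allow-list, and that design is not shown here.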

Catalog Information

Create a command-line tool and a web interface for processing Social Security Disability case files (PDFs) into anonymized Markdown. The system should use a 4-stage pipeline: Triage, Extraction, Anonymization, and Assembly. For extraction, support both born-digital PDFs using pymupdf4llm and scanned pages using Docling. The anonymization stage must use Presidio NER with custom SSA recognizers, ensuring that legal elements like statute citations and court names are preserved while redacting PII.
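
The four stages described above could be wired together roughly as follows. This is a minimal sketch under stated assumptions: the `Page` type, the `MIN_CHARS` threshold, and both extractor functions are hypothetical placeholders, and the real project would call pymupdf4llm and Docling where the stubs are.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Page:
    number: int
    embedded_text: str  # text layer found in the PDF page, "" if none


# Assumed heuristic: a page with very little embedded text is treated as scanned.
MIN_CHARS = 40


def triage(page: Page) -> str:
    """Stage 1: classify a page as 'digital' or 'scanned'."""
    return "digital" if len(page.embedded_text.strip()) >= MIN_CHARS else "scanned"


def extract_digital(page: Page) -> str:
    # Placeholder for a born-digital extractor (pymupdf4llm in the description).
    return page.embedded_text.strip()


def extract_scanned(page: Page) -> str:
    # Placeholder for an OCR-based extractor (Docling in the description).
    return f"[OCR needed for page {page.number}]"


EXTRACTORS = {"digital": extract_digital, "scanned": extract_scanned}


def run_pipeline(pages: Iterable[Page], anonymize: Callable[[str], str]) -> str:
    """Stages 2-4: extract each page, anonymize it, assemble one Markdown document."""
    sections = []
    for page in pages:
        markdown = EXTRACTORS[triage(page)](page)           # Stage 2: extraction
        clean = anonymize(markdown)                          # Stage 3: anonymization
        sections.append(f"## Page {page.number}\n\n{clean}")
    return "\n\n".join(sections)                             # Stage 4: assembly


pages = [
    Page(1, "Jane Doe appeared before the Administrative Law Judge on the record."),
    Page(2, ""),
]
print(run_pipeline(pages, lambda text: text.replace("Jane Doe", "<PERSON>")))
```

Passing the anonymizer in as a callable keeps the skeleton independent of any particular NER library, which makes each stage easy to test in isolation.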

Tags

python cli web-app ocr pdf-processing fastapi anonymization nlp hipaa

Quality Score

B+ (85.0/100)
Structure: 92
Code Quality: 89
Documentation: 81
Testing: 85
Practices: 62
Security: 100
Dependencies: 90

Strengths

  • CI/CD pipeline configured (github_actions)
  • Good test coverage (56% test-to-source ratio)
  • Code linting configured (ruff (possible))
  • Consistent naming conventions (snake_case)
  • Good security practices — no major issues detected

Weaknesses

  • No LICENSE file — legal ambiguity for contributors
  • 105 duplicate lines detected — consider DRY refactoring

Recommendations

  • Add a LICENSE file (MIT recommended for open source)

Languages

markdown: 53.2%
python: 31.3%
shell: 6.1%
html: 4.6%
yaml: 2.1%
json: 1.7%
toml: 0.9%

Frameworks

FastAPI, pytest

Symbols

variable: 104
function: 50
constant: 33
method: 32
class: 27

API Endpoints (8)

Method  Path                       Handler              Framework
GET     /dashboard                 (not listed)         FastAPI/Flask
GET     /cases/{case_id}/download  download_anonymized  FastAPI/Flask
GET     /cases/{case_id}/events    sse_events           FastAPI/Flask
GET     /cases/{case_id}/review    review_page          FastAPI/Flask
POST    /cases/{case_id}/start     start_pipeline       FastAPI/Flask
GET     /cases/{case_id}/status    status_page          FastAPI/Flask
POST    /upload                    upload_files         FastAPI/Flask
GET     /upload                    upload_page          FastAPI/Flask
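
The `sse_events` handler name suggests that `/cases/{case_id}/events` streams progress as Server-Sent Events. The catalog lists only the route and handler, so the event names and payload fields below are assumptions. The sketch shows what the wire format of such a stream might look like, using only the standard library.

```python
import json
from typing import Iterator


def sse_frame(event: str, payload: dict) -> str:
    """Format one Server-Sent Events frame: an event name, a data line, a blank line."""
    # json.dumps without indent emits a single line, so one "data:" field suffices.
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def progress_events(case_id: str, stages: list[str]) -> Iterator[str]:
    """Yield one hypothetical 'progress' frame per finished stage, then a 'done' frame."""
    total = len(stages)
    for index, stage in enumerate(stages, start=1):
        percent = round(100 * index / total)
        yield sse_frame("progress", {"case_id": case_id, "stage": stage, "percent": percent})
    yield sse_frame("done", {"case_id": case_id})


for frame in progress_events("abc", ["triage", "extraction", "anonymization", "assembly"]):
    print(frame, end="")
```

In a FastAPI application a generator like this would typically be wrapped in a streaming response with the `text/event-stream` media type. That wiring is omitted here to keep the example free of third-party dependencies.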

Quality Timeline

1 quality score recorded.


Embed Badge

Add to your README:

![Quality](https://repos.aljefra.com/badge/1369382.svg)