Generic Extractor
The Generic Extractor is a server that extracts navigable document trees from PDFs, using OCR and LLMs to provide summaries, cross-references, and lazy-loadable content.
Description
This project provides a server for extracting hierarchical document structures from PDFs. It uses Docling for OCR and an LLM (Gemini 3 Flash via OpenRouter) to extract a navigable document tree with summaries, cross-references, and lazy-loadable content. The server is configurable through domain-specific extraction configs, which define the structure of the extracted documents.
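The "navigable document tree" with lazy-loadable content can be pictured with a small data structure. The following is only a minimal sketch under stated assumptions: the `DocNode` class, its field names, and the `loader` callback are hypothetical and are not taken from the project's code. It shows a node that carries a summary and cross-references up front and fetches its full text only on first access.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class DocNode:
    """One node of a hypothetical document tree (names are illustrative)."""

    node_id: str
    title: str
    summary: str = ""
    cross_refs: list[str] = field(default_factory=list)  # ids of related nodes
    children: list[DocNode] = field(default_factory=list)
    # Assumed callback that fetches the full text; invoked at most once.
    loader: Optional[Callable[[], str]] = None
    _content: Optional[str] = field(default=None, repr=False)

    def content(self) -> str:
        """Return the node's full text, loading it on first access."""
        if self._content is None:
            self._content = self.loader() if self.loader else ""
        return self._content

    def find(self, node_id: str) -> Optional[DocNode]:
        """Depth-first search for a node by id, e.g. to follow a cross-reference."""
        if self.node_id == node_id:
            return self
        for child in self.children:
            hit = child.find(node_id)
            if hit is not None:
                return hit
        return None
```

With this shape, a client could render titles and summaries for the whole tree immediately and call `content()` only for the section a reader opens; a cross-reference is followed by passing its id to `find()` on the root.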
Novelty
7/10
Tags
Technologies
Claude Models
Quality Score
Strengths
- Code linting appears to be configured (possibly ruff)
Weaknesses
- No LICENSE file: legal ambiguity for contributors
- No tests found: high risk of regressions
- No CI/CD configuration: testing and deployment are manual
- 1 file with critical complexity needs refactoring
- Potential hardcoded secrets in 4 files
- 1701 duplicate lines detected: consider DRY refactoring
- 4 'god files' with >500 LOC need decomposition
Recommendations
- Add a test suite, starting with critical-path integration tests
- Set up CI/CD (GitHub Actions recommended) to automate testing and deployment
- Add a LICENSE file (MIT recommended for open source)
- Move hardcoded secrets to environment variables or a secrets manager
- Address 24 TODO/FIXME items; consider tracking them as issues
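
As an illustration of the recommendation to move hardcoded secrets into environment variables, here is a minimal sketch. The function name `load_api_key` and the variable name `OPENROUTER_API_KEY` are assumptions chosen for the example; the project may use different names.

```python
import os


def load_api_key(env_var: str = "OPENROUTER_API_KEY") -> str:
    """Read an API key from the environment instead of hardcoding it.

    The default variable name is an assumption made for this example.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail at startup rather than on the first LLM request.
        raise RuntimeError(
            f"Set the {env_var} environment variable before starting the server."
        )
    return key
```

Failing loudly at startup keeps a missing key from surfacing later as an opaque authentication error, and the key itself never has to appear in the repository.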
Security & Health
Languages
Frameworks
Concepts (2)
| Category | Name | Description | Confidence |
|---|---|---|---|
| auto_description | Project Description | Config-driven hierarchical document extraction server. Uploads a PDF, runs OCR via Docling, then uses an LLM (Gemini 3 Flash via OpenRouter) to extract a navigable document tree with summaries, cross-references, and lazy-loadable content. | 80% |
| auto_category | Web Backend | web-backend | 70% |
