Generic Extractor

D 51 completed
Api
monorepo / json · small
106
Files
16,166
LOC
3
Frameworks
11
Languages

Pipeline State

completed
Run ID
#365331
Phase
done
Progress
1%
Started
Finished
2026-04-13 01:31:02
LLM tokens
0

Pipeline Metadata

Stage
Cataloged
Decision
proceed
Novelty
77.67
Framework unique
Isolation
Last stage change
2026-05-10 03:35:17
Deduplication group #65462
Member of a group with 1 similar repo; this repo is the canonical one.
Top concepts (2)
Project Description, Web Backend
All rows above produced by Repobility · https://repobility.com

AI Prompt

Create a server that acts as a generic document extractor. The system should accept a PDF upload, run OCR using Docling, and then use an LLM (like Gemini 3 Flash via OpenRouter) to generate a navigable document tree. The architecture should involve a FastAPI service for the OCR sidecar and an Axum server in Rust to orchestrate the pipeline. I need endpoints to list configurations, trigger an extraction via PDF upload, and endpoints to retrieve the full extraction tree or lazy-load specific content nodes using references.
rust fastapi axum llm pdf ocr json document-extraction api monorepo
Generated by gemma4:latest
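
The flow the prompt describes (upload, OCR, LLM tree building, then full-tree and lazy per-node retrieval) can be sketched as below. This is a minimal in-memory illustration only, not the repository's code: `ExtractionService`, `fake_ocr`, `fake_llm`, and the `content_ref` field are hypothetical names, and the two callables stand in for the Docling sidecar and the OpenRouter call.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ExtractionService:
    """In-memory stand-in for the orchestrating server (hypothetical design)."""
    run_ocr: Callable[[bytes], str]          # stand-in for the OCR sidecar
    build_tree: Callable[[str, str], dict]   # stand-in for the LLM call
    configs: List[str] = field(default_factory=lambda: ["default"])
    _trees: Dict[str, dict] = field(default_factory=dict)
    _content: Dict[str, str] = field(default_factory=dict)

    def list_configs(self) -> List[str]:
        """Corresponds to the 'list configurations' endpoint."""
        return list(self.configs)

    def extract(self, pdf_bytes: bytes, config: str) -> str:
        """Corresponds to the upload endpoint; returns an extraction id."""
        if config not in self.configs:
            raise KeyError(f"unknown config: {config}")
        text = self.run_ocr(pdf_bytes)
        tree = self.build_tree(text, config)
        extraction_id = hashlib.sha256(pdf_bytes).hexdigest()[:12]
        # Move node bodies out of the tree so clients fetch them by reference.
        self._strip_content(tree, extraction_id)
        self._trees[extraction_id] = tree
        return extraction_id

    def _strip_content(self, node: dict, extraction_id: str) -> None:
        body = node.pop("content", None)
        if body is not None:
            ref = f"{extraction_id}/{node['id']}"
            self._content[ref] = body
            node["content_ref"] = ref
        for child in node.get("children", []):
            self._strip_content(child, extraction_id)

    def get_tree(self, extraction_id: str) -> dict:
        """Full tree: summaries and references, but no node bodies."""
        return self._trees[extraction_id]

    def get_content(self, ref: str) -> str:
        """Lazy-load one node body by its reference."""
        return self._content[ref]


def fake_ocr(pdf: bytes) -> str:
    # Placeholder: pretends the PDF bytes are already text.
    return pdf.decode("utf-8", errors="ignore")


def fake_llm(text: str, config: str) -> dict:
    # Placeholder: returns a fixed two-level tree instead of calling a model.
    return {
        "id": "root",
        "summary": text[:20],
        "children": [{"id": "s1", "summary": "first section", "content": text}],
    }


if __name__ == "__main__":
    service = ExtractionService(fake_ocr, fake_llm)
    eid = service.extract(b"hello world", "default")
    print(service.get_tree(eid))
```

Keeping node bodies in a separate store keyed by reference means the tree response stays small, and a client pays for content only when it opens a node.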

Catalog Information

The Generic Extractor is a server that extracts navigable document trees from PDFs, using OCR and LLMs to provide summaries, cross-references, and lazy-loadable content.

Description

This project provides a server for extracting hierarchical document structures from PDFs. It uses Docling for OCR and an LLM (Gemini 3 Flash via OpenRouter) to extract a navigable document tree with summaries, cross-references, and lazy-loadable content. The server is configurable through domain-specific extraction configs, which define the structure of the extracted documents.
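
One way to picture a config that "defines the structure of the extracted documents" is a mapping from each node type to the child types it may contain, checked against the tree the LLM returns. The sketch below assumes that shape; the listing does not show the project's real config format, so `LEGAL_CONFIG`, the `type` field, and `validate_tree` are all hypothetical.

```python
from typing import Dict, List

# Hypothetical config: each node type lists the child types it may contain.
LEGAL_CONFIG: Dict[str, List[str]] = {
    "document": ["chapter"],
    "chapter": ["section"],
    "section": [],
}


def validate_tree(node: dict, config: Dict[str, List[str]]) -> List[str]:
    """Return human-readable violations; an empty list means the tree fits."""
    node_type = node.get("type")
    if node_type not in config:
        return [f"unknown node type: {node_type!r}"]
    errors: List[str] = []
    allowed = config[node_type]
    for child in node.get("children", []):
        child_type = child.get("type")
        if child_type not in allowed:
            errors.append(f"{child_type!r} not allowed under {node_type!r}")
        else:
            # Only descend into children that are valid at this level.
            errors.extend(validate_tree(child, config))
    return errors
```

Validating the model's output against the config before storing it gives a cheap guard against trees that skip or invent hierarchy levels.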

Description (translated from Arabic)

This project provides a server for extracting the hierarchical structure of documents from PDF files. It uses Docling for OCR and an LLM (Gemini 3 Flash via OpenRouter) to extract a navigable document tree with summaries, cross-references, and lazy-loadable content. The server can be configured through domain-specific extraction configs, which define the structure of the extracted documents.

Novelty

7/10

Tags

document-extraction pdf-processing ocr llm navigable-document-tree summaries cross-references lazy-loadable-content

Technologies

axum serde tokio

Claude Models

claude-opus-4.6

Quality Score

D
50.7/100
Structure
52
Code Quality
67
Documentation
61
Testing
0
Practices
64
Security
49
Dependencies
60

Strengths

  • Code linting configured (ruff (possible))

Weaknesses

  • No LICENSE file: legal ambiguity for contributors
  • No tests found: high risk of regressions
  • No CI/CD configuration: manual testing and deployment
  • 1 file with critical complexity needs refactoring
  • Potential hardcoded secrets in 4 files
  • 1701 duplicate lines detected: consider DRY refactoring
  • 4 'god files' with >500 LOC need decomposition

Recommendations

  • Add a test suite: start with critical path integration tests
  • Set up CI/CD (GitHub Actions recommended) to automate testing and deployment
  • Add a LICENSE file (MIT recommended for open source)
  • Move hardcoded secrets to environment variables or a secrets manager
  • Address 24 TODO/FIXME items: consider tracking them as issues
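
The recommendation to move secrets into environment variables could look like the following minimal sketch. The variable name `OPENROUTER_API_KEY` and the helper `load_api_key` are assumptions for illustration; the listing does not say which files or keys the scanner flagged.

```python
import os


def load_api_key(var_name: str = "OPENROUTER_API_KEY") -> str:
    """Read an API key from the environment and fail loudly if it is absent."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        # var_name is a hypothetical default; use whatever name the deployment defines.
        raise RuntimeError(f"{var_name} is not set; export it or use a secrets manager")
    return key
```

Failing at startup when the variable is missing is usually preferable to falling back to a default value, since a silent fallback tends to reintroduce a hardcoded secret.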

Security & Health

16.1h
Tech Debt (C)
A
OWASP (100%)
PASS
Quality Gate
A
Risk (2)
Unknown
License
8.7%
Duplication

Languages

json
32.6%
rust
31.7%
typescript
20.5%
markdown
6.5%
shell
3.0%
python
2.8%
yaml
1.7%
toml
0.6%
sql
0.4%
javascript
0.0%
css
0.0%

Frameworks

FastAPI Next.js Axum

Concepts (2)

Category | Name | Description | Confidence
auto_description | Project Description | Config-driven hierarchical document extraction server. Uploads a PDF, runs OCR via Docling, then uses an LLM (Gemini 3 Flash via OpenRouter) to extract a navigable document tree with summaries, cross-references, and lazy-loadable content. | 80%
auto_category | Web Backend | web-backend | 70%

Quality Timeline

1 quality score recorded.

Embed Badge

Add to your README:

![Quality](https://repos.aljefra.com/badge/89505.svg)