Generic Extractor

D 51 completed

Api

monorepo / json · small

Files: 106

LOC: 16,166

Frameworks

Languages


Pipeline State

completed

Run ID

#365331

Phase

done

Progress

Started

Finished

2026-04-13 01:31:02

LLM tokens

Pipeline Metadata

Stage

Cataloged

Decision

proceed

Novelty

77.67

Framework unique

—

Isolation

—

Last stage change

2026-05-10 03:35:17

Deduplication group #65462

Member of a group with 1 similar repo; this repo is the canonical one.

Top concepts (2)

Project Description, Web Backend

All rows above produced by Repobility · https://repobility.com

AI Prompt

Create a server that acts as a generic document extractor. The system should accept a PDF upload, run OCR using Docling, and then use an LLM (like Gemini 3 Flash via OpenRouter) to generate a navigable document tree. The architecture should involve a FastAPI service for the OCR sidecar and an Axum server in Rust to orchestrate the pipeline. I need endpoints to list configurations, trigger an extraction via PDF upload, and endpoints to retrieve the full extraction tree or lazy-load specific content nodes using references.
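The flow described in the prompt (list configs, upload a PDF, run OCR, build a tree, fetch the tree back) can be sketched in a few lines. This is only an illustrative sketch: `Pipeline`, `Extraction`, `fake_ocr`, and `fake_build_tree` are hypothetical names, and the Docling OCR step and the LLM call are replaced by injected stand-in functions so the example runs without any services.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List
import uuid


@dataclass
class Extraction:
    extraction_id: str
    config_name: str
    tree: dict


class Pipeline:
    """Hypothetical orchestrator: OCR and tree building are injected callables."""

    def __init__(
        self,
        configs: Dict[str, dict],
        ocr: Callable[[bytes], str],
        build_tree: Callable[[str, dict], dict],
    ) -> None:
        self.configs = configs
        self.ocr = ocr
        self.build_tree = build_tree
        self.extractions: Dict[str, Extraction] = {}

    def list_configs(self) -> List[str]:
        # Corresponds to the "list configurations" endpoint.
        return sorted(self.configs)

    def extract(self, config_name: str, pdf_bytes: bytes) -> Extraction:
        # Corresponds to the "trigger an extraction via PDF upload" endpoint.
        if config_name not in self.configs:
            raise KeyError(f"unknown config: {config_name}")
        text = self.ocr(pdf_bytes)
        tree = self.build_tree(text, self.configs[config_name])
        result = Extraction(uuid.uuid4().hex, config_name, tree)
        self.extractions[result.extraction_id] = result
        return result

    def get_tree(self, extraction_id: str) -> dict:
        # Corresponds to the "retrieve the full extraction tree" endpoint.
        return self.extractions[extraction_id].tree


def fake_ocr(pdf_bytes: bytes) -> str:
    # Stand-in for the OCR sidecar: treats the upload as plain text.
    return pdf_bytes.decode("utf-8", errors="replace")


def fake_build_tree(text: str, config: dict) -> dict:
    # Stand-in for the LLM step: first line is the title, the rest are children.
    lines = [line for line in text.splitlines() if line.strip()]
    return {
        "title": lines[0] if lines else "",
        "children": [{"text": line} for line in lines[1:]],
    }
```

Injecting the OCR and tree-building steps as callables keeps the orchestration testable without a running sidecar or an API key, which may be a useful starting point given that the report finds no tests.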

rust fastapi axum llm pdf ocr json document-extraction api monorepo

Generated by gemma4:latest

Catalog Information

The Generic Extractor is a server that extracts navigable document trees from PDFs, using OCR and LLMs to provide summaries, cross-references, and lazy-loadable content.

Description

This project provides a server for extracting hierarchical document structures from PDFs. It uses Docling for OCR and an LLM (Gemini 3 Flash via OpenRouter) to extract a navigable document tree with summaries, cross-references, and lazy-loadable content. The server is configurable through domain-specific extraction configs, which define the structure of the extracted documents.
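To make "navigable document tree with summaries, cross-references, and lazy-loadable content" concrete, here is a minimal sketch of one possible shape for such a tree. The field names (`node_id`, `content_ref`, `cross_refs`) and the in-memory `store` are assumptions for illustration, not the project's actual schema; the sample document is invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    node_id: str
    title: str
    summary: str
    content_ref: Optional[str] = None  # key used to fetch the full text later
    cross_refs: List[str] = field(default_factory=list)  # ids of related nodes
    children: List["Node"] = field(default_factory=list)


def outline(node: Node, depth: int = 0) -> List[str]:
    """Render the tree using summaries only; no full content is touched."""
    lines = ["  " * depth + f"{node.node_id} {node.title}: {node.summary}"]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


def load_content(store: Dict[str, str], ref: str) -> str:
    """Resolve a content reference on demand (the lazy-load step)."""
    return store[ref]


# Invented sample data: full text lives outside the tree, keyed by reference.
store = {"c1": "Full text of the Parties section.", "c2": "Full text of the Term section."}
root = Node(
    "0",
    "Lease Agreement",
    "Two-party lease",
    children=[
        Node("1", "Parties", "Names the landlord and tenant", content_ref="c1"),
        Node("2", "Term", "Twelve months", content_ref="c2", cross_refs=["1"]),
    ],
)
```

A client would call `outline(root)` to navigate by summary, then `load_content(store, node.content_ref)` only for the node it actually opens, which keeps the initial tree response small.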

Description (translated from Arabic)

This project provides a server for extracting hierarchical document structure from PDF files. It uses Docling for OCR and an LLM (Gemini 3 Flash via OpenRouter) to extract a navigable document tree with summaries, cross-references, and lazy-loadable content. The server can be configured through domain-specific extraction configs, which define the structure of the extracted documents.

Novelty

7/10

Technologies

axum serde tokio

Claude Models

claude-opus-4.6

Quality Score

50.7/100

Structure

Code Quality

Documentation

Testing

Practices

Security

Dependencies

Strengths

Code linting possibly configured (ruff)

Weaknesses

No LICENSE file: legal ambiguity for contributors
No tests found: high risk of regressions
No CI/CD configuration: manual testing and deployment
1 file with critical complexity needs refactoring
Potential hardcoded secrets in 4 files
1701 duplicate lines detected: consider DRY refactoring
4 'god files' with >500 LOC need decomposition

Recommendations

Add a test suite: start with critical path integration tests
Set up CI/CD (GitHub Actions recommended) to automate testing and deployment
Add a LICENSE file (MIT recommended for open source)
Move hardcoded secrets to environment variables or a secrets manager
Address 24 TODO/FIXME items: consider tracking them as issues
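For the recommendation to move hardcoded secrets to environment variables, a minimal sketch might look like the following. The variable name `OPENROUTER_API_KEY` and the helper `load_api_key` are assumptions chosen for illustration; the report does not say which secrets were flagged or how the project names them.

```python
import os
from typing import Mapping, Optional


def load_api_key(
    env: Optional[Mapping[str, str]] = None,
    name: str = "OPENROUTER_API_KEY",  # hypothetical variable name
) -> str:
    """Read a secret from the environment and fail loudly if it is missing."""
    source = os.environ if env is None else env
    key = source.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; refusing to start without it")
    return key
```

Failing at startup rather than at the first LLM call makes a missing key obvious immediately, and accepting an optional mapping lets tests pass a plain dictionary instead of modifying the real environment.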

Security & Health

Tech Debt (C): 16.1h

OWASP (100%)

Quality Gate: PASS

Risk (2)


License: Unknown

Duplication: 8.7%


Languages

json: 32.6%
rust: 31.7%
typescript: 20.5%
markdown: 6.5%
shell: 3.0%
python: 2.8%
yaml: 1.7%
toml: 0.6%
sql: 0.4%
javascript: 0.0%
css: 0.0%

Frameworks

FastAPI Next.js Axum

Concepts (2)

Category	Name	Description	Confidence
auto_description	Project Description	Config-driven hierarchical document extraction server. Uploads a PDF, runs OCR via Docling, then uses an LLM (Gemini 3 Flash via OpenRouter) to extract a navigable document tree with summaries, cross-references, and lazy-loadable content.	80%
auto_category	Web Backend	web-backend	70%

Quality Timeline

1 quality score recorded.


Embed Badge

Add to your README:

![Quality](https://repos.aljefra.com/badge/89505.svg)

