OmicIDX ETL

C+ 74 completed
CLI Tool
cli / python · small
Files: 57
LOC: 8,872
Frameworks: 0
Languages: 6

Pipeline State

Status: completed
Run ID: #367466
Phase: done
Progress: 1%
Started:
Finished: 2026-04-13 01:31:02
LLM tokens: 0

Pipeline Metadata

Stage: Skipped
Decision: skip_scaffold_dup
Novelty: 39.55
Framework unique
Isolation
Last stage change: 2026-04-16 18:15:42
Deduplication group #47804
Member of a group with 2 similar repo(s); canonical #73474
Top concepts (1)
Project Description

AI Prompt

Create a command-line tool in Python for running ETL pipelines on NCBI metadata resources. I need it to handle extraction and transformation for multiple sources, specifically SRA, GEO, Biosample, and PubMed. The tool should support running these extractions via CLI commands, accepting destination paths such as S3 buckets, and writing data in a specified format such as Parquet or JSONL. Please structure the project to manage these different workflows cleanly.
python cli etl ncbi bioinformatics command-line s3 data-pipeline
Generated by gemma4:latest
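
The command layout that the prompt asks for (one command per source, a destination path, and an output format) could be sketched roughly as follows. This is only an illustration under stated assumptions: the catalog lists click as the project's CLI library, but the sketch uses the standard-library `argparse` instead so that it does not guess at the project's actual commands. The program name `etl-sketch`, the `--format` flag, and the `plan_run` helper are all hypothetical.

```python
import argparse

# Sources and formats named in the prompt above.
SOURCES = ("sra", "geo", "biosample", "pubmed")
FORMATS = ("parquet", "jsonl")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one subcommand per NCBI source (hypothetical layout)."""
    parser = argparse.ArgumentParser(
        prog="etl-sketch",  # hypothetical name, not the real tool's entry point
        description="Illustrative ETL command layout for NCBI metadata sources.",
    )
    subparsers = parser.add_subparsers(dest="source", required=True)
    for source in SOURCES:
        sub = subparsers.add_parser(source, help=f"Extract {source} metadata")
        # Destination may be a local directory or an s3:// style URL.
        sub.add_argument("dest", help="Destination path, e.g. s3://bucket/prefix")
        sub.add_argument(
            "--format",
            dest="output_format",
            choices=FORMATS,
            default="jsonl",
            help="Output file format",
        )
    return parser


def plan_run(argv: list[str]) -> dict:
    """Parse arguments and return the run plan; no extraction is performed here."""
    args = build_parser().parse_args(argv)
    return {"source": args.source, "dest": args.dest, "format": args.output_format}
```

Calling `plan_run(["sra", "s3://example-bucket/sra", "--format", "parquet"])` returns the parsed plan as a dictionary. Keeping each source behind its own subcommand is one way to keep the workflows separate, as the prompt requests.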

Catalog Information

Extracts and transforms metadata from NCBI resources for a genomic metadata platform, storing the results in cloud storage.

Description

This command‑line tool provides ETL pipelines for several NCBI datasets, including Biosample, Bioproject, SRA, GEO, and PubMed. It validates incoming data with a schema library and writes the output as compressed JSONL or Parquet files to a cloud bucket. The tool is designed for bioinformatics teams that need up‑to‑date, standardized metadata for downstream analysis or integration into larger data pipelines. It simplifies the extraction process by offering a single command for each dataset and handles incremental updates automatically. The result is a ready‑to‑use, query‑friendly dataset that can be consumed by analytics or machine‑learning workflows.
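
The validate-then-write step described above might look something like the following minimal sketch. It makes several assumptions: the catalog lists pydantic as the schema library, but this sketch substitutes a standard-library dataclass so that it stays self-contained; the `SampleRecord` fields, the `validate` and `write_jsonl_gz` helpers, and the example accession are hypothetical; and it writes to a local path rather than a cloud bucket.

```python
import gzip
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Iterable, Iterator


@dataclass
class SampleRecord:
    """Hypothetical minimal schema; the real field set is not given in this entry."""

    accession: str
    title: str

    def __post_init__(self) -> None:
        # Basic checks standing in for schema-library validation.
        if not isinstance(self.accession, str) or not self.accession:
            raise ValueError("accession must be a non-empty string")
        if not isinstance(self.title, str):
            raise ValueError("title must be a string")


def validate(raw_records: Iterable[dict]) -> Iterator[SampleRecord]:
    """Yield only the records that match the schema; skip the rest."""
    for raw in raw_records:
        try:
            yield SampleRecord(**raw)
        except (TypeError, ValueError):
            # TypeError: missing or unexpected keys; ValueError: failed checks.
            continue


def write_jsonl_gz(records: Iterable[SampleRecord], path: Path) -> int:
    """Write one JSON object per line to a gzip-compressed file; return the count."""
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(asdict(record)) + "\n")
            count += 1
    return count
```

A call such as `write_jsonl_gz(validate(raw_records), Path("biosample.jsonl.gz"))` streams records through validation and into the compressed file without holding the whole dataset in memory.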

Description (translated from Arabic)

This program provides a command-line interface for running metadata extraction and transformation pipelines against several NCBI databases, such as Biosample, Bioproject, SRA, GEO, and PubMed. The data is validated against a predefined schema, and the results are saved as compressed JSONL or Parquet files in a cloud storage bucket. The program is aimed at bioinformatics teams that need standardized, up-to-date metadata for analysis or for integration into larger data pipelines. It simplifies extraction by providing a single command for each database, with automatic support for incremental updates. The result is a ready-to-read, queryable dataset that can be consumed by analytics or machine-learning applications.

Novelty

6/10

Tags

metadata-extraction etl-pipelines genomic-data ncbi-integration cloud-storage data-transformation bioinformatics command-line-interface

Technologies

click pydantic

Claude Models

claude-opus-4.6 claude-sonnet-4.6

Quality Score

Overall: C+ (74.2/100)
Structure: 70
Code Quality: 90
Documentation: 61
Testing: 55
Practices: 79
Security: 84
Dependencies: 60

Strengths

  • CI/CD pipeline configured (github_actions)
  • Code linting configured (possibly ruff)
  • Consistent naming conventions (snake_case)
  • Good security practices: no major issues detected
  • Containerized deployment (Docker)

Weaknesses

  • No LICENSE file: legal ambiguity for contributors
  • 243 duplicate lines detected: consider DRY refactoring

Recommendations

  • Add a LICENSE file (MIT recommended for open source)

Security & Health

Tech Debt (B): 4.1h
OWASP (100%): A
Quality Gate: PASS
Risk (1): A
Repobility (the analyzer behind this table) · https://repobility.com
License: MIT
Duplication: 4.0%

Languages

python: 74.1%
markdown: 14.8%
yaml: 7.2%
sql: 1.9%
toml: 1.2%
shell: 0.8%

Frameworks

None detected

Concepts (1)

Category: auto_description
Name: Project Description
Description: ETL pipelines for OmicIDX metadata resources.
Confidence: 80%

Quality Timeline

1 quality score recorded.


Embed Badge

Add to your README:

![Quality](https://repos.aljefra.com/badge/91657.svg)