Welcome to Crump¶
Examines and syncs CSV, Parquet, and CDF files into PostgreSQL or SQLite databases in batched files using easy to edit configuration files.
Overview¶
crump is a command-line tool and Python library for easy syncing CSV, Parquet, and CDF files to PostgreSQL or SQLite databases, and extracxting data from CDF files. It provides a declarative, configuration-based approach to data synchronization with automatic schema management..
Key Features¶
Data File Support¶
- CSV Support: Read and sync standard CSV files
- Native CDF Processing: Built-in support for Common Data Format (CDF) science files
- Automatic Extraction: Extracts CDF variables to CSV, Parquet, or directly to database
- Array Variable Handling: Automatically expands multi-dimensional array variables
- Apache Parquet Support: Built-in support for Apache Parquet files and sync Parquet files directly to database
- Extract to Parquet: Convert CDF files to Parquet format with
--parquetflag
Data Synchronization¶
- Configuration-Based: Examines your CSV files with the prepare command, and defines sync jobs in YAML with sensible column mappings
- Column Mapping: Sync all columns, rename them, or only sync a subset
- Automatic Table Creation: Creates target tables if they don't exist
- Schema Evolution: Automatically adds new columns as needed, never deletes existing columns. Optionally keeps a history of data changes in a history table.
- Index Management: Suggests and creates database indexes based on column types
- Dual Interface: Use as a CLI tool or import as a Python library
- Filename-Based Extraction: Extract values from filenames (dates, versions, etc.) and store in database columns
- Automatic Cleanup: Delete stale records based on extracted filename values
- Compound Primary Keys: Support for multi-column primary keys
- Dry-Run Mode: Preview all changes without modifying the database
- Idempotent Operations: Safe to run multiple times, uses upsert
- Rich Output: Beautiful terminal output with Rich library
Quick Example¶
# Create a configuration file
crump prepare users.csv --config crump_config.yml --job users_sync
# Look at the mapping it generated for you in crump_config.yml and edit as needed.
# Crump has mapped your columns and suggested keys and indexes
# get ready to sync - you db must be available
export DATABASE_URL="sqlite:///test.db"
# Or for Postgres
# export DATABASE_URL="postgresql://user:pass@localhost:5432/mydb"
# preview changes first (requires --db-url or DATABASE_URL)
crump sync users.csv crump_config.yml --job users_sync --dry-run
# Sync the file to database
crump sync users.csv crump_config.yml --job users_sync
# Later that day the v2 of the file arrives
# Sync the new file, old records from v1 are removed automatically, updates are applied to rows that match based on primary key
crump sync users_v2.csv crump_config.yml --job users_sync
Use Cases¶
- Rapid data ingestion: Quickly get lots of data files dumpoed into a database with minimal setup and no code.
- Daily Data Updates: Sync daily CSV exports with automatic date extraction and cleanup
- Science Data Processing: Process CDF science files with metadata extraction
- Data Warehousing: Load CSV data into PostgreSQL with column transformations
- Incremental Updates: Replace partitioned data (by date, version, etc.) while preserving other partitions
- Configuration-Driven ETL: Define data pipelines in YAML without writing code
Next Steps¶
- Installation Guide - Install crump
- Quick Start - Get started in 5 minutes
- Configuration - Learn about YAML configuration
- CLI Reference - Command-line interface documentation
- Features - Detailed feature documentation
- API Reference - Use crump as a Python library
Support¶
If you have any questions or run into issues, please open an issue on GitHub.