PromptFork

Audit a messy DataFrame against an expected schema with dtype coercion

Produces a reusable schema-validation and dtype-coercion script that flags every column that drifted from spec, coerces what it safely can, and quarantines what it cannot instead of producing silent NaNs.

Prompt
You are a senior data engineer who treats a DataFrame as untrusted until it passes a schema.

I have a messy file or extract: [DESCRIBE SOURCE — e.g. 'a monthly CSV export from the finance tool, hand-edited by analysts'].
Expected schema (what a clean row should look like):
[COLUMN: EXPECTED DTYPE — e.g. 'order_id: int, amount: float, currency: 3-letter str, placed_at: datetime, status: one of {paid,refunded,pending}']
Python 3, pandas [2.x].

Build a validation and coercion module with:
1. A single declarative schema: per column, the expected dtype, nullable yes/no, allowed values or regex, and a date format hint if relevant. Define it once and reuse it.
2. A coercer that converts safely where it can (numeric strings to numbers, ISO or loose dates to datetime with a stated format, strip whitespace, normalize case for enums). Every coercion must be explicit and traceable.
3. A validator that reports, per column: how many values matched, were coerced, failed coercion, were null-but-required, or violated allowed values. Show counts and a few example offending rows.
4. A split on failure: clean rows proceed; rows that fail a hard rule go to a quarantined DataFrame with a reason column. Never silently drop or NaN a row without recording why.
5. A strict-mode flag: strict=True raises if any required column fails; strict=False quarantines and continues.
6. A __main__ that reads the file, runs validate, prints the per-column report, and writes clean plus quarantine to two CSVs.

Rules:
- Never infer the schema from the file alone. Trust the declared schema; the file must conform.
- Coercion must be lossless or explicitly flagged. A '$1,200.00' amount that becomes 1200.0 is fine; an 'N/A' that becomes NaN must be logged as a failed coercion, not hidden.
- Do not mutate the input in place. Return new frames.

Output the module in one fenced block, then a short runbook: how to edit the schema, how to read the report, and what strict vs quarantine mode does.

Success signal: the output is good only if every coercion is explicit and logged, every dropped or quarantined row has a reason, and a required-column violation never produces a silent NaN.

Use case

Use when a CSV or DB extract arrives with inconsistent types (numbers as strings, mixed dates) and you need a typed, validated DataFrame you can trust downstream.
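
As a rough illustration of the kind of module the prompt asks for, the sketch below coerces numbers stored as strings and dates with a stated format, then quarantines every row that fails with a reason. It is a minimal sketch under assumptions, not the prompt's actual output: the `SCHEMA` layout, the `coerce_column` and `validate` names, and the sample data are all hypothetical, and it assumes the file was loaded with `dtype=str` so every cell is a string or missing.

```python
import pandas as pd

# Hypothetical schema layout; column names follow the prompt's own example.
SCHEMA = {
    "order_id": {"kind": "int", "nullable": False},
    "amount": {"kind": "float", "nullable": False},
    "placed_at": {"kind": "datetime", "nullable": False, "format": "%Y-%m-%d"},
    "status": {"kind": "enum", "nullable": False,
               "allowed": {"paid", "refunded", "pending"}},
}


def coerce_column(raw: pd.Series, rule: dict) -> tuple[pd.Series, pd.Series]:
    """Return (coerced values, per-row failure reason or None)."""
    text = raw.astype(object).str.strip()
    reason = pd.Series([None] * len(raw), index=raw.index, dtype="object")
    if rule["kind"] in ("int", "float"):
        cleaned = text.str.replace(r"[$,]", "", regex=True)
        values = pd.to_numeric(cleaned, errors="coerce")
        # A non-null input that became NaN is a failed coercion, not a missing value.
        reason[values.isna() & raw.notna()] = "failed numeric coercion"
        if rule["kind"] == "int":
            reason[values.notna() & (values % 1 != 0)] = "non-integer value"
    elif rule["kind"] == "datetime":
        values = pd.to_datetime(text, format=rule["format"], errors="coerce")
        reason[values.isna() & raw.notna()] = "failed datetime coercion"
    else:  # enum: normalize case, then check membership
        values = text.str.lower()
        reason[values.notna() & ~values.isin(rule["allowed"])] = "value not allowed"
    if not rule["nullable"]:
        reason[raw.isna()] = "null in required column"
    return values, reason


def validate(df: pd.DataFrame, schema: dict, strict: bool = False):
    """Split df into (clean, quarantine) without mutating the input."""
    coerced = {}
    reasons = pd.Series("", index=df.index, dtype="object")
    for column, rule in schema.items():
        values, reason = coerce_column(df[column], rule)
        coerced[column] = values
        failed = reason.notna()
        reasons[failed] = reasons[failed] + column + ": " + reason[failed] + "; "
    bad = reasons != ""
    if strict and bad.any():
        raise ValueError(f"{int(bad.sum())} row(s) failed validation")
    clean = pd.DataFrame(coerced, index=df.index)[~bad].copy()
    # Clean rows hold only whole, non-null numbers in int columns, so the cast is safe.
    int_columns = [c for c, r in schema.items() if r["kind"] == "int"]
    clean = clean.astype({c: "int64" for c in int_columns})
    # Quarantined rows keep their original, uncoerced values plus the reason.
    quarantine = df[bad].assign(reason=reasons[bad].str.rstrip("; "))
    return clean, quarantine


# Made-up sample rows for illustration only.
raw = pd.DataFrame({
    "order_id": ["1", "2", "3", None, "5"],
    "amount": ["$1,200.00", "N/A", " 15.5 ", "7", "9.99"],
    "placed_at": ["2026-01-05", "2026-01-06", "2026-01-07", "2026-01-08", "08/01/2026"],
    "status": ["Paid", "refunded", "shipped", "pending", "paid"],
})
clean, quarantine = validate(raw, SCHEMA)
print(quarantine[["reason"]])
```

Keeping the original values in the quarantine frame, rather than the coerced ones, is one way to make each failure traceable back to the source file; a real module would likely also emit the per-column counts the prompt asks for.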

When to use this

Use it as the first step of any pipeline that consumes a hand-edited or third-party file. It is not a substitute for enforced schemas at the source.

Follow-up prompts

  • Turn the coercion rules into a pandera or Great Expectations schema for CI enforcement.
  • Add a per-column null-and-cardinality report to catch silent empties.
  • Wire the quarantine output into an alert so bad rows surface before a dashboard breaks.
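
For the second follow-up, a per-column null-and-cardinality report might look roughly like the sketch below. The function name `null_cardinality_report` and its output columns are assumptions chosen for illustration; it uses only standard pandas calls and counts whitespace-only strings separately, since those are the "silent empties" that `isna()` alone would miss.

```python
import pandas as pd


def null_cardinality_report(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: nulls, null share, blank strings, distinct values."""
    rows = []
    for column in df.columns:
        series = df[column]
        # Compare non-null values as stripped text to catch whitespace-only cells.
        as_text = series.dropna().astype(str).str.strip()
        rows.append({
            "column": column,
            "nulls": int(series.isna().sum()),
            "null_share": float(series.isna().mean()),
            "blank_strings": int((as_text == "").sum()),
            "distinct": int(series.nunique(dropna=True)),
        })
    return pd.DataFrame(rows).set_index("column")


# Made-up sample data for illustration only.
sample = pd.DataFrame({"currency": ["USD", " ", None, "USD"], "order_id": [1, 2, 3, 4]})
report = null_cardinality_report(sample)
print(report)
```

A column with a low distinct count and a high blank-string count is usually worth inspecting before it reaches the validator.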
#pandas #data-cleaning #data-quality #validation #python
Source: promptfork seed
License: CC-BY-4.0
Published: 6/22/2026
