Audit a messy DataFrame against an expected schema with dtype coercion

Produces a reusable schema-validation and dtype-coercion script that flags every column that drifted from spec, coerces what it safely can, and quarantines what it cannot instead of producing silent NaNs.

Open in Studio

Prompt

You are a senior data engineer who treats a DataFrame as untrusted until it passes a schema.

I have a messy file or extract: [DESCRIBE SOURCE — e.g. 'a monthly CSV export from the finance tool, hand-edited by analysts'].
Expected schema (what a clean row should look like):
[COLUMN: EXPECTED DTYPE — e.g. 'order_id: int, amount: float, currency: 3-letter str, placed_at: datetime, status: one of {paid,refunded,pending}']
Python 3, pandas [2.x].

Build a validation and coercion module with:
1. A single declarative schema: per column, the expected dtype, nullable yes/no, allowed values or regex, and a date format hint if relevant. Define it once and reuse it.
2. A coercer that converts safely where it can (numeric strings to numbers, ISO or loose dates to datetime with a stated format, strip whitespace, normalize case for enums). Every coercion must be explicit and traceable.
3. A validator that reports, per column: how many values matched, were coerced, failed coercion, were null-but-required, or violated allowed values. Show counts and a few example offending rows.
4. A split on failure: clean rows proceed; rows that fail a hard rule go to a quarantined DataFrame with a reason column. Never silently drop or NaN a row without recording why.
5. A strict-mode flag: strict=True raises if any required column fails; strict=False quarantines and continues.
6. A __main__ that reads the file, runs validate, prints the per-column report, and writes clean plus quarantine to two CSVs.

Rules:
- Never infer the schema from the file alone. Trust the declared schema; the file must conform.
- Coercion must be lossless or explicitly flagged. A '$1,200.00' amount that becomes 1200.0 is fine; a 'N/A' that becomes NaN must be logged as a coercion, not hidden.
- Do not mutate the input in place. Return new frames.

Output the module in one fenced block, then a short runbook: how to edit the schema, how to read the report, and what strict vs quarantine mode does.

Success signal: the output is good only if every coercion is explicit and logged, every dropped or quarantined row has a reason, and a required-column violation never produces a silent NaN.

Use case

Use when a CSV or DB extract arrives with inconsistent types (numbers as strings, mixed dates) and you need a typed, validated DataFrame you can trust downstream.

When to use this

As the first step of any pipeline that consumes a hand-edited or third-party file. Not a substitute for enforced schemas at the source.

Follow-up prompts

Turn the coercion rules into a pandera or Great Expectations schema for CI enforcement.
Add a per-column null-and-cardinality report to catch silent empties.
Wire the quarantine output into an alert so bad rows surface before a dashboard breaks.

#pandas#data-cleaning#data-quality#validation#python

Source: promptfork seed
License: CC-BY-4.0
Published: 6/22/2026

Report

Audit a messy DataFrame against an expected schema with dtype coercion

Use case

When to use this

Follow-up prompts

Explore more

More prompts you might like

Pandas data-cleaning pipeline for a messy CSV

Reshape a messy human-made spreadsheet into tidy long-form data

RAG system prompt that refuses to hallucinate and cites sources

Scaffold a clean PyTorch training loop with eval and early stopping

Pick the right Ollama model and generate an install plus run script for your hardware

Wire a local RAG pipeline to Ollama with a doc loader and vector store