Audit a messy DataFrame against an expected schema with dtype coercion
Produces a reusable schema-validation and dtype-coercion script that flags every column that drifted from spec, coerces what it safely can, and quarantines what it cannot instead of producing silent NaNs.
You are a senior data engineer who treats a DataFrame as untrusted until it passes a schema.
I have a messy file or extract: [DESCRIBE SOURCE — e.g. 'a monthly CSV export from the finance tool, hand-edited by analysts'].
Expected schema (what a clean row should look like):
[COLUMN: EXPECTED DTYPE — e.g. 'order_id: int, amount: float, currency: 3-letter str, placed_at: datetime, status: one of {paid,refunded,pending}']
Python 3, pandas [2.x].
Build a validation and coercion module with:
1. A single declarative schema: per column, the expected dtype, nullable yes/no, allowed values or regex, and a date format hint if relevant. Define it once and reuse it.
2. A coercer that converts safely where it can (numeric strings to numbers, ISO or loose dates to datetime with a stated format, strip whitespace, normalize case for enums). Every coercion must be explicit and traceable.
3. A validator that reports, per column: how many values matched, were coerced, failed coercion, were null-but-required, or violated allowed values. Show counts and a few example offending rows.
4. A split on failure: clean rows proceed; rows that fail a hard rule go to a quarantined DataFrame with a reason column. Never silently drop or NaN a row without recording why.
5. A strict-mode flag: strict=True raises if any required column fails; strict=False quarantines and continues.
6. A __main__ that reads the file, runs validate, prints the per-column report, and writes clean plus quarantine to two CSVs.
Rules:
- Never infer the schema from the file alone. Trust the declared schema; the file must conform.
- Coercion must be lossless or explicitly flagged. A '$1,200.00' amount that becomes 1200.0 is fine; a 'N/A' that becomes NaN must be logged as a coercion, not hidden.
- Do not mutate the input in place. Return new frames.
Output the module in one fenced block, then a short runbook: how to edit the schema, how to read the report, and what strict vs quarantine mode does.
Success signal: the output is good only if every coercion is explicit and logged, every dropped or quarantined row has a reason, and a required-column violation never produces a silent NaN.Use case
Use when a CSV or DB extract arrives with inconsistent types (numbers as strings, mixed dates) and you need a typed, validated DataFrame you can trust downstream.
When to use this
As the first step of any pipeline that consumes a hand-edited or third-party file. Not a substitute for enforced schemas at the source.
Follow-up prompts
- Turn the coercion rules into a pandera or Great Expectations schema for CI enforcement.
- Add a per-column null-and-cardinality report to catch silent empties.
- Wire the quarantine output into an alert so bad rows surface before a dashboard breaks.
- Source
- promptfork seed
- License
- CC-BY-4.0
- Published
- 6/22/2026