Pandas data-cleaning pipeline for a messy CSV
Produce a reproducible Pandas cleaning pipeline: types, missing values, dedupe, outliers.
You are a data engineer. I have a messy CSV with these columns: [COLUMNS + WHAT'S WRONG]. Write a reproducible Pandas cleaning pipeline.

The pipeline should: load with correct dtypes, standardize column names, parse dates, handle missing values (state the strategy per column and why), strip/normalize strings, deduplicate, detect and handle obvious outliers, and validate the result with assertions.

Rules:
- One function per step; a `clean(df)` that composes them so it's testable and re-runnable.
- No silent data loss: log row counts before/after each step.
- Comment only the non-obvious decisions.

Return the full script plus a short note on which choices depend on domain knowledge I should confirm.
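For orientation, here is a minimal sketch of the shape of script this prompt is asking for. It is an illustration under stated assumptions, not a reference answer: the column names (`order_date`, `customer`, `amount`), the sample rows, the date format, the 10,000 outlier threshold, and the per-column missing-value strategy are all hypothetical stand-ins for whatever replaces the placeholder above.

```python
import functools
import io
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("clean")

# Hypothetical sample data standing in for the messy CSV.
RAW = """ Order Date ,Customer,Amount
2024-01-05,  alice ,10.5
2024-01-05,Alice,10.5
not a date,bob,20
2024-02-01,,30
2024-03-01,carol,999999
"""


def logged(step):
    """Log row counts before and after a step so no rows vanish silently."""
    @functools.wraps(step)
    def wrapper(df):
        before = len(df)
        out = step(df)
        log.info("%s: %d -> %d rows", step.__name__, before, len(out))
        return out
    return wrapper


def load(source):
    # Read everything as text first; dtypes are set explicitly in later steps.
    return pd.read_csv(source, dtype=str)


def standardize_columns(df):
    out = df.copy()
    out.columns = (
        out.columns.str.strip().str.lower().str.replace(r"\s+", "_", regex=True)
    )
    return out


def normalize_strings(df):
    out = df.copy()
    out["customer"] = out["customer"].str.strip().str.title()
    return out


def parse_dates(df):
    out = df.copy()
    # Assumed format; unparseable values become NaT instead of raising.
    out["order_date"] = pd.to_datetime(
        out["order_date"], format="%Y-%m-%d", errors="coerce"
    )
    return out


def coerce_amount(df):
    out = df.copy()
    out["amount"] = pd.to_numeric(out["amount"], errors="coerce")
    return out


def handle_missing(df):
    # Assumed strategy: a row with no customer is unusable; other gaps are kept.
    return df.dropna(subset=["customer"])


def deduplicate(df):
    # Runs after normalization so "  alice " and "Alice" collapse to one row.
    return df.drop_duplicates().reset_index(drop=True)


def flag_outliers(df, upper=10_000):
    # Assumed threshold; outliers are flagged rather than dropped.
    out = df.copy()
    out["amount_outlier"] = out["amount"] > upper
    return out


def validate(df):
    assert df["customer"].notna().all(), "customer must not be missing"
    assert not df.duplicated().any(), "duplicate rows remain"
    return df


STEPS = [standardize_columns, normalize_strings, parse_dates, coerce_amount,
         handle_missing, deduplicate, flag_outliers, validate]


def clean(df):
    for step in STEPS:
        df = logged(step)(df)
    return df


cleaned = clean(load(io.StringIO(RAW)))
```

Because `clean` only composes plain functions that return new frames, each step can be unit-tested on a tiny DataFrame and the whole pipeline can be re-run on the same input. The choices marked as assumed in the comments are the ones a domain owner would need to confirm.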
- Source: promptfork seed
- License: CC-BY-4.0
- Published: 6/23/2026