Pandas data-cleaning pipeline for a messy CSV

Produce a reproducible Pandas cleaning pipeline: types, missing values, dedupe, outliers.

Prompt

You are a data engineer. I have a messy CSV with these columns: [COLUMNS + WHAT'S WRONG]. Write a reproducible Pandas cleaning pipeline.

The pipeline should: load with correct dtypes, standardize column names, parse dates, handle missing values (state the strategy per column and why), strip/normalize strings, deduplicate, detect and handle obvious outliers, and validate the result with assertions.

Rules:
- One function per step; a `clean(df)` that composes them so it's testable and re-runnable.
- No silent data loss: log row counts before and after each step.
- Comment only the non-obvious decisions.
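As a rough illustration of the first two rules, the structure might look like the sketch below. The `logged` decorator, the `STEPS` list, and the two sample steps are hypothetical names chosen for this example, not part of the prompt; a real response would have one step per cleaning task.

```python
import logging
from functools import reduce, wraps

import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("cleaning")


def logged(step):
    """Wrap a cleaning step so it reports row counts before and after."""
    @wraps(step)
    def wrapper(df: pd.DataFrame) -> pd.DataFrame:
        before = len(df)
        out = step(df)
        log.info("%s: %d -> %d rows (%d dropped)",
                 step.__name__, before, len(out), before - len(out))
        return out
    return wrapper


@logged
def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Work on a copy so the caller's frame is never mutated.
    out = df.copy()
    out.columns = (
        out.columns.str.strip().str.lower().str.replace(r"\s+", "_", regex=True)
    )
    return out


@logged
def deduplicate(df: pd.DataFrame) -> pd.DataFrame:
    return df.drop_duplicates().reset_index(drop=True)


# Hypothetical step order; each step takes and returns a DataFrame.
STEPS = [standardize_columns, deduplicate]


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply every step in order; safe to re-run on a fresh load."""
    return reduce(lambda acc, step: step(acc), STEPS, df)


raw = pd.DataFrame({" Order ID ": [1, 1, 2], "Amount": [10.0, 10.0, 5.0]})
cleaned = clean(raw)
```

Because every step has the same signature, each one can be unit-tested on a small frame, and the log shows exactly where rows disappear.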

Return the full script plus a short note on which choices depend on domain knowledge I should confirm.
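For a sense of what the individual steps in a response could look like, here is a minimal sketch. The column names (`order_date`, `amount`, `city`), the median and `"unknown"` fill strategies, and the 1.5 × IQR threshold are all assumptions made for illustration; they are exactly the kind of domain-dependent choices the prompt asks the model to flag for confirmation.

```python
import pandas as pd


def parse_dates(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Unparseable values become NaT instead of raising, so they can be reviewed.
    out["order_date"] = pd.to_datetime(out["order_date"], errors="coerce")
    return out


def normalize_strings(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["city"] = out["city"].str.strip().str.lower()
    return out


def handle_missing(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Assumed strategy: median for a skewed numeric column, a sentinel for text.
    out["amount"] = out["amount"].fillna(out["amount"].median())
    out["city"] = out["city"].fillna("unknown")
    return out


def flag_outliers(df: pd.DataFrame, column: str = "amount", k: float = 1.5) -> pd.DataFrame:
    # Flag rather than drop, so no rows are lost silently.
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    out = df.copy()
    out[f"{column}_outlier"] = ~out[column].between(q1 - k * iqr, q3 + k * iqr)
    return out


def validate(df: pd.DataFrame) -> pd.DataFrame:
    assert df["amount"].notna().all(), "amount still has missing values"
    assert pd.api.types.is_datetime64_any_dtype(df["order_date"]), "order_date not parsed"
    assert not df.duplicated().any(), "duplicate rows remain"
    return df


raw = pd.DataFrame({
    "order_date": ["2024-01-05", "2024-01-06", "not a date", "2024-01-08", "2024-01-09"],
    "amount": [10.0, 12.0, None, 11.0, 500.0],
    "city": [" Paris", "paris ", None, "Lyon", "LYON"],
})

cleaned = validate(
    flag_outliers(handle_missing(normalize_strings(parse_dates(raw))))
)
```

Flagging outliers in a separate boolean column keeps the decision reversible; whether to cap, drop, or keep the flagged rows depends on what the column means in your domain.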

Source: promptfork seed
License: CC-BY-4.0
Published: 6/23/2026

