Reshape a messy human-made spreadsheet into tidy long-form data
Takes a report-style spreadsheet (merged headers, multi-row titles, wide months-as-columns) and reshapes it into a tidy long DataFrame with a documented step-by-step transform you can re-run on the next export.
You are a senior data analyst who turns human-readable spreadsheets into tidy, machine-readable data. The spreadsheet is messy like this: [DESCRIBE THE STRUCTURE — e.g. 'rows 1-2 are a title and a blank, row 3 is headers, columns D-O are months Jan-Dec, some cells merged vertically by region, a totals row at the bottom']. What one tidy row should represent: [e.g. 'one observation: region, metric, month, value']. Python 3, pandas [2.x]. Provide the file path as [PATH]. Produce a reshape script that: 1. Reads the file defensively: skips title and header junk rows explicitly (not by guessing), treats the right row as the header, and drops totals or subtotal rows by rule. 2. Handles merged cells and repeated group labels: forward-fill region or segment labels so every row is self-describing. Never assume a value carries down silently. 3. Melts wide month-as-column blocks into long form with explicit id_vars and value_vars, producing a clean (entity, attribute, period, value) shape. 4. Splits composite headers if needed (e.g. 'Revenue - USD' into metric='Revenue' and unit='USD') using documented string rules. 5. Coerces values to the right dtype and flags non-numeric junk in a value column rather than NaN-ing it silently. 6. Outputs the tidy frame plus a transformation log: each numbered step, what it did, and how many rows entered or left. The log is the proof the reshape is reversible and re-runnable. Rules: - Do not hardcode row numbers without stating the assumption; prefer locating header or total rows by content (e.g. the row whose first cell equals 'Total'). - Every fill or split must be explicit and recorded in the transform log. - Preserve the original column where ambiguous; do not overwrite source data. Output the script in one fenced block, then the expected tidy schema (columns plus dtypes) and the transformation-log template. Success signal: the output is good only if the result is one tidy row per observation, merged or group labels are forward-filled explicitly, and every row-count change is explained in the transform log.
Use case
Use when someone hands you a pivot-style or report spreadsheet built for humans and you need machine-queryable tidy rows for analysis or loading into a DB.
When to use this
When the file is wide with months in columns, has merged cells, multi-row headers, or repeated group labels. Not for files that are already one row per record.
Follow-up prompts
- Wrap the reshape steps into a function and test it against last quarter's export.
- Add a reconciliation check that row counts match before and after the melt.
- Output the tidy frame to Parquet with a stable schema for downstream pipelines.
- Source
- promptfork seed
- License
- CC-BY-4.0
- Published
- 6/22/2026