Build a robust PyTorch Dataset and DataLoader with an augmentation pipeline

Produces a custom PyTorch Dataset with correct transforms, a tuned DataLoader, and a debuggable augmentation pipeline that handles edge cases instead of throwing on the first weird sample.

Prompt

You are a senior ML engineer who writes correct, debuggable PyTorch data pipelines.

I need a custom Dataset and a tuned DataLoader for: [DESCRIBE THE DATA, e.g. 'image classification on a folder of JPGs of varying sizes, plus a CSV of labels', 'multimodal: short text plus a thumbnail per sample']. Target PyTorch version: [2.x].

Modality and shape:
- Source(s): [LOCAL PATHS / S3 / HTTP / MIXED]
- Sample shape: [e.g. 'RGB image plus integer label', 'variable-length token sequence plus label']
- Approximate dataset size and whether it fits in RAM: [N ROWS / FITS / DOES NOT FIT]
- Class balance: [BALANCED / IMBALANCED — give rough ratios]

Build:
1. A Dataset subclass (map-style) with __init__, __len__, __getitem__. Use lazy file I/O: do not load the whole set into memory unless it fits. Do not cache augmented outputs or any other state that would silently repeat across epochs.
2. A transform pipeline that differs for train vs eval (train augments, eval only normalizes). Make every transform deterministic for a given seed so results are reproducible.
3. Robust handling of the obvious failure modes: corrupt or missing files, wrong dtype, a sample that fails to decode — catch, log, and return a safe replacement or skip, never crash the whole run.
4. A collate_fn if samples are variable-length (e.g. padding for text, or stacking ragged boxes).
5. A DataLoader configured with the right batch size, num_workers, pin_memory, and drop_last for [TRAIN / EVAL]. Justify each setting in a comment.
6. A __main__ smoke test that pulls one batch, prints shapes and dtypes, and confirms no NaNs.

Rules:
- Do not hardcode paths, classes, or normalization stats. Take them as args or compute stats from a sample.
- Train and eval transforms must diverge (eval never applies random crop or flip).
- Say plainly where I must supply real labels and files.

Output the full module in one fenced block, then a 4-item runbook: how to point it at my data, how to seed it, how to verify one batch, and how to switch num_workers safely.

Success signal: the output is good only if train and eval transforms differ, a corrupt sample cannot crash the run, and the smoke test prints concrete shapes and dtypes for one batch.
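For reference when judging a response (this is not part of the prompt), here is a minimal sketch of what build steps 3, 4, and 6 might look like for variable-length sequences. The names `SafeSequenceDataset`, `pad_collate`, and `fake_load` are hypothetical, and `fake_load` stands in for real file decoding. It assumes the "skip" strategy: a failed sample becomes `None` and the collate function drops it.

```python
import logging
from typing import Callable, Sequence

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

log = logging.getLogger(__name__)


class SafeSequenceDataset(Dataset):
    """Map-style dataset that returns None for any sample that fails to load."""

    def __init__(self, items: Sequence, labels: Sequence[int], load_fn: Callable):
        if len(items) != len(labels):
            raise ValueError("items and labels must have the same length")
        self.items, self.labels, self.load_fn = items, labels, load_fn

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, idx: int):
        try:
            # Lazy I/O: the sample is only loaded when it is requested.
            x = torch.as_tensor(self.load_fn(self.items[idx]), dtype=torch.float32)
            if x.ndim != 1 or x.numel() == 0 or not torch.isfinite(x).all():
                raise ValueError("expected a non-empty, finite 1-D sequence")
        except Exception as exc:  # broad on purpose: one bad sample must not stop the run
            log.warning("skipping sample %d: %s", idx, exc)
            return None
        return x, int(self.labels[idx])


def pad_collate(batch):
    """Drop failed samples, then pad sequences to the longest one in the batch."""
    batch = [b for b in batch if b is not None]
    if not batch:
        return None  # every sample failed; the training loop should skip this batch
    seqs, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in seqs])
    padded = pad_sequence(list(seqs), batch_first=True, padding_value=0.0)
    return padded, lengths, torch.tensor(labels, dtype=torch.long)


def fake_load(item):
    """Hypothetical loader: None simulates a missing or corrupt file."""
    if item is None:
        raise FileNotFoundError("missing file")
    return item


if __name__ == "__main__":
    # Smoke test: pull one batch, print shapes and dtypes, confirm no NaNs.
    items = [[1.0, 2.0, 3.0], None, [4.0], [5.0, 6.0]]
    ds = SafeSequenceDataset(items, [0, 1, 1, 0], fake_load)
    loader = DataLoader(ds, batch_size=4, shuffle=False, num_workers=0,
                        collate_fn=pad_collate)
    x, lengths, y = next(iter(loader))
    print(x.shape, x.dtype, lengths.tolist(), y.shape, y.dtype)
    assert not torch.isnan(x).any()
```

Dropping failed samples makes batches slightly smaller than `batch_size`. If a fixed batch size matters, returning a replacement sample from `__getitem__` is the alternative the prompt allows.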

Use case

Use when your data is not a tidy ImageFolder and you need a Dataset class that loads, augments, and batches correctly the first time.

When to use this

Use it before training begins. Specify the data location, the modality, and whether augmentation must differ between train and eval.
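The train/eval split can be sketched as follows, using plain tensor operations instead of any particular augmentation library. `compute_stats`, `Normalize`, `RandomHorizontalFlip`, and `build_transform` are hypothetical names. The sketch assumes images are float tensors shaped (C, H, W) and that normalization stats are computed from a sample batch, not hardcoded.

```python
import torch


def compute_stats(images: torch.Tensor):
    """Per-channel mean and std from a sample batch shaped (N, C, H, W)."""
    mean = images.mean(dim=(0, 2, 3))
    std = images.std(dim=(0, 2, 3)).clamp_min(1e-6)  # guard against division by zero
    return mean, std


class Normalize:
    def __init__(self, mean: torch.Tensor, std: torch.Tensor):
        # Reshape to (C, 1, 1) so the stats broadcast over height and width.
        self.mean = mean.view(-1, 1, 1)
        self.std = std.view(-1, 1, 1)

    def __call__(self, img: torch.Tensor) -> torch.Tensor:
        return (img - self.mean) / self.std


class RandomHorizontalFlip:
    def __init__(self, p: float, generator: torch.Generator):
        self.p, self.generator = p, generator

    def __call__(self, img: torch.Tensor) -> torch.Tensor:
        # Drawing from an explicit generator keeps the flip sequence seedable.
        if torch.rand(1, generator=self.generator).item() < self.p:
            return img.flip(-1)  # flip along the width axis
        return img


def build_transform(train: bool, mean: torch.Tensor, std: torch.Tensor, seed: int = 0):
    """Train: seeded random flip, then normalize. Eval: normalize only."""
    steps = []
    if train:
        generator = torch.Generator().manual_seed(seed)
        steps.append(RandomHorizontalFlip(0.5, generator))
    steps.append(Normalize(mean, std))

    def apply(img: torch.Tensor) -> torch.Tensor:
        for step in steps:
            img = step(img)
        return img

    return apply
```

Two train transforms built with the same seed produce the same flip sequence in a single process. With `num_workers > 0`, each worker process gets its own copy of the generator state, so per-worker seeding needs separate handling.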

Follow-up prompts

Add a weighted sampler to correct class imbalance at the batch level.
Visualize a full augmented batch as a grid to sanity-check transforms before training.
Convert the Dataset to a streaming WebDataset variant for data too large to fit in memory or on local disk.
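As a rough idea of what the first follow-up might produce, the sketch below builds a sampler that weights each sample by the inverse frequency of its class. `make_balanced_sampler` is a hypothetical helper, and it assumes all integer labels are available up front as a list.

```python
from collections import Counter

import torch
from torch.utils.data import WeightedRandomSampler


def make_balanced_sampler(labels, seed: int = 0) -> WeightedRandomSampler:
    """Sample with replacement so each class is drawn about equally often."""
    counts = Counter(labels)
    # Inverse-frequency weight per sample: rare classes get larger weights.
    weights = torch.tensor([1.0 / counts[y] for y in labels], dtype=torch.double)
    generator = torch.Generator().manual_seed(seed)
    return WeightedRandomSampler(
        weights, num_samples=len(labels), replacement=True, generator=generator
    )
```

Pass the result as `sampler=` to the DataLoader and leave `shuffle` unset, since the two options are mutually exclusive.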

#pytorch #machine-learning #data-loading #augmentation #python

Source: promptfork seed
License: CC-BY-4.0
Published: 6/22/2026
