Build a robust PyTorch Dataset and DataLoader with an augmentation pipeline
Produces a custom PyTorch Dataset with correct transforms, a tuned DataLoader, and a debuggable augmentation pipeline that handles edge cases instead of throwing on the first weird sample.
You are a senior ML engineer who writes correct, debuggable PyTorch data pipelines. I need a custom Dataset and a tuned DataLoader for: [DESCRIBE THE DATA, e.g. 'image classification on a folder of JPGs of varying sizes, plus a CSV of labels', 'multimodal: short text plus a thumbnail per sample']. PyTorch [2.x].
Modality and shape:
- Source(s): [LOCAL PATHS / S3 / HTTP / MIXED]
- Sample shape: [e.g. 'RGB image plus integer label', 'variable-length token sequence plus label']
- Approximate dataset size and whether it fits in RAM: [N ROWS / FITS / DOES NOT FIT]
- Class balance: [BALANCED / IMBALANCED, give rough ratios]
Build:
1. A Dataset subclass (map-style) with __init__, __len__, __getitem__. Lazy file I/O only: do not load the whole set into memory unless it fits. Cache nothing that would break across epochs by accident.
2. A transform pipeline that differs for train vs eval (train augments, eval only normalizes). Make every transform deterministic for a given seed so results are reproducible.
3. Robust handling of the obvious failure modes: corrupt or missing files, wrong dtype, a sample that fails to decode. Catch, log, and return a safe replacement or skip; never crash the whole run.
4. A collate_fn if samples are variable-length (e.g. padding for text, or stacking ragged boxes).
5. A DataLoader configured with the right batch size, num_workers, pin_memory, and drop_last for [TRAIN / EVAL]. Justify each setting in a comment.
6. A __main__ smoke test that pulls one batch, prints shapes and dtypes, and confirms no NaNs.
Rules:
- Do not hardcode paths, classes, or normalization stats. Take them as args or compute stats from a sample.
- Train and eval transforms must diverge (eval never applies random crop or flip).
- Say plainly where I must supply real labels and files.
Output the full module in one fenced block, then a 4-item runbook: how to point it at my data, how to seed it, how to verify one batch, and how to switch num_workers safely.
Success signal: the output is good only if train and eval transforms differ, a corrupt sample cannot crash the run, and the smoke test prints concrete shapes and dtypes for one batch.
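As a rough illustration of the second condition in the success signal (a corrupt sample cannot crash the run), here is a minimal sketch of a wrapper that logs a failing sample and falls back to the next index. `SafeDataset`, `FlakyData`, and `max_retries` are hypothetical names, not part of PyTorch. The sketch deliberately avoids importing torch so it runs standalone; it relies only on `__len__` and `__getitem__`, the protocol a map-style Dataset follows, and a real module would presumably subclass `torch.utils.data.Dataset`.

```python
import logging

logger = logging.getLogger("safe_dataset")


class SafeDataset:
    """Map-style wrapper: a failing sample is logged and replaced, not raised.

    Hypothetical helper for illustration only. In a real module this would
    subclass torch.utils.data.Dataset; the protocol used here is the same.
    """

    def __init__(self, base, max_retries=3):
        self.base = base                # any object with __len__ and __getitem__
        self.max_retries = max_retries  # how many following samples to try
        self.failed = set()             # indices that raised, for later inspection

    def __len__(self):
        return len(self.base)

    def __getitem__(self, index):
        n = len(self.base)
        for offset in range(self.max_retries + 1):
            candidate = (index + offset) % n
            try:
                return self.base[candidate]
            except Exception as exc:  # broad on purpose: decode errors vary by library
                self.failed.add(candidate)
                logger.warning("sample %d failed to load (%s); trying the next one",
                               candidate, exc)
        # Many consecutive failures usually mean a wrong path, not one bad file.
        raise RuntimeError(
            f"{self.max_retries + 1} consecutive samples failed starting at index {index}"
        )


class FlakyData:
    """Stand-in for real file decoding: index 2 simulates a corrupt file."""

    def __len__(self):
        return 5

    def __getitem__(self, i):
        if i == 2:
            raise OSError("cannot decode")
        return (f"sample_{i}", i % 2)


dataset = SafeDataset(FlakyData())
samples = [dataset[i] for i in range(len(dataset))]
print(samples)
print("failed indices:", sorted(dataset.failed))
```

Falling back to a neighbour duplicates that neighbour in the epoch, which is usually acceptable for a handful of bad files. With num_workers greater than 0, each worker process holds its own copy of `failed`, so the log is the more reliable record.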
Use case
Use when your data is not a tidy ImageFolder and you need a Dataset class that loads, augments, and batches correctly the first time.
When to use this
Run this before training begins. Specify the data location, the modality, and how augmentation should differ between train and eval.
Follow-up prompts
- Add a weighted sampler to correct class imbalance at the batch level.
- Visualize a full augmented batch as a grid to sanity-check transforms before training.
- Convert the Dataset to a streaming WebDataset variant for data too large to fit on disk.
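To make the first follow-up (a weighted sampler) more concrete, here is a small sketch of the usual inverse-frequency weighting, assuming integer class labels. `inverse_frequency_weights` is a hypothetical helper and the 90:9:1 label list is made up. The sketch is pure Python so it runs without torch; `random.choices` stands in for sampling with replacement, and the closing comment shows where the weights would go in PyTorch.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per sample, equal to 1 / (number of samples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Made-up labels with a 90:9:1 class imbalance.
labels = [0] * 90 + [1] * 9 + [2] * 1
weights = inverse_frequency_weights(labels)

# Stand-in for weighted sampling with replacement, seeded for reproducibility.
rng = random.Random(0)
drawn = rng.choices(labels, weights=weights, k=3000)
draw_counts = Counter(drawn)
print(draw_counts)  # each class should appear roughly 1000 times

# In PyTorch the same per-sample weights would be passed as:
#   sampler = torch.utils.data.WeightedRandomSampler(
#       weights, num_samples=len(weights), replacement=True)
#   loader = torch.utils.data.DataLoader(dataset, batch_size=..., sampler=sampler)
# Leave shuffle unset: DataLoader treats sampler and shuffle as mutually exclusive.
```

Because every class carries the same total weight, batches are balanced in expectation, at the cost of repeating minority-class samples many times per epoch. That makes train-time augmentation more important for the rare classes.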
- Source: promptfork seed
- License: CC-BY-4.0
- Published: 6/22/2026