PromptFork

Build a robust PyTorch Dataset and DataLoader with an augmentation pipeline

Produces a custom PyTorch Dataset with correct transforms, a tuned DataLoader, and a debuggable augmentation pipeline that handles edge cases instead of throwing on the first weird sample.

Prompt
You are a senior ML engineer who writes correct, debuggable PyTorch data pipelines.

I need a custom Dataset and a tuned DataLoader for: [DESCRIBE THE DATA — e.g. 'image classification on a folder of JPGs of varying sizes, plus a CSV of labels', 'multimodal: short text plus a thumbnail per sample']. PyTorch [2.x].

Modality and shape:
- Source(s): [LOCAL PATHS / S3 / HTTP / MIXED]
- Sample shape: [e.g. 'RGB image plus integer label', 'variable-length token sequence plus label']
- Approximate dataset size and whether it fits in RAM: [N ROWS / FITS / DOES NOT FIT]
- Class balance: [BALANCED / IMBALANCED — give rough ratios]

Build:
1. A Dataset subclass (map-style) with __init__, __len__, __getitem__. Use lazy file I/O only: do not load the whole set into memory unless it fits. Do not cache anything, such as augmented outputs, that would silently repeat across epochs.
2. A transform pipeline that differs for train vs eval (train augments, eval only normalizes). Make the random augmentations reproducible for a given seed, so two runs with the same seed see the same augmented samples.
3. Robust handling of the obvious failure modes: corrupt or missing files, wrong dtype, a sample that fails to decode — catch, log, and return a safe replacement or skip, never crash the whole run.
4. A collate_fn if samples are variable-length (e.g. padding for text, or stacking ragged boxes).
5. A DataLoader configured with the right batch size, num_workers, pin_memory, and drop_last for [TRAIN / EVAL]. Justify each setting in a comment.
6. A __main__ smoke test that pulls one batch, prints shapes and dtypes, and confirms no NaNs.

Rules:
- Do not hardcode paths, classes, or normalization stats. Take them as args or compute stats from a sample.
- Train and eval transforms must diverge (eval never applies random crop or flip).
- Say plainly where I must supply real labels and files.

Output the full module in one fenced block, then a 4-item runbook: how to point it at my data, how to seed it, how to verify one batch, and how to switch num_workers safely.

Success signal: the output is good only if train and eval transforms differ, a corrupt sample cannot crash the run, and the smoke test prints concrete shapes and dtypes for one batch.
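
As a rough illustration of what a passing answer might contain, here is a minimal sketch of build items 1 to 3 and 6. It is not the prompt's required output. The names `SafeDataset`, `make_transform`, `fake_decode`, and `smoke_test` are hypothetical, the decoder is a stand-in for real file I/O, and the mean and std values are placeholders you would pass in or compute from your data.

```python
import logging
from typing import Callable, Optional, Sequence, Tuple

import torch
from torch.utils.data import DataLoader, Dataset

log = logging.getLogger("safe_dataset")
Transform = Callable[[torch.Tensor], torch.Tensor]


class SafeDataset(Dataset):
    """Map-style dataset: decodes lazily, falls back to a neighbour on failure."""

    def __init__(self, paths: Sequence[str], labels: Sequence[int],
                 decode: Callable[[str], torch.Tensor],
                 transform: Optional[Transform] = None, max_tries: int = 10):
        if len(paths) != len(labels):
            raise ValueError("paths and labels must have the same length")
        self.paths, self.labels = list(paths), list(labels)
        self.decode, self.transform, self.max_tries = decode, transform, max_tries

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int) -> Tuple[torch.Tensor, int]:
        # Try the requested index, then its neighbours (wrapping around).
        for attempt in range(self.max_tries):
            i = (idx + attempt) % len(self.paths)
            try:
                x = self.decode(self.paths[i]).to(torch.float32)
                if self.transform is not None:
                    x = self.transform(x)
                return x, self.labels[i]
            except Exception as exc:  # corrupt or missing file, bad dtype, ...
                log.warning("bad sample %d (%s): %s", i, self.paths[i], exc)
        # Many failures in a row usually means a wrong path, so fail loudly.
        raise RuntimeError(f"{self.max_tries} consecutive samples failed from index {idx}")


def make_transform(mean: torch.Tensor, std: torch.Tensor, train: bool) -> Transform:
    """Train: random horizontal flip, then normalize. Eval: normalize only."""
    def apply(x: torch.Tensor) -> torch.Tensor:
        if train and torch.rand(()).item() < 0.5:
            x = torch.flip(x, dims=[-1])
        return (x - mean) / std
    return apply


def fake_decode(path: str) -> torch.Tensor:
    """Hypothetical decoder standing in for real image loading."""
    if "corrupt" in path:
        raise OSError("cannot decode")
    return torch.arange(3 * 8 * 8).reshape(3, 8, 8) / 191.0


def smoke_test():
    torch.manual_seed(0)  # seeds the random flip when num_workers=0
    mean = torch.full((3, 1, 1), 0.5)   # placeholder stats, not real ones
    std = torch.full((3, 1, 1), 0.25)
    paths = ["a.jpg", "b.jpg", "corrupt.jpg", "d.jpg"]
    ds = SafeDataset(paths, [0, 1, 0, 1], fake_decode,
                     make_transform(mean, std, train=True))
    loader = DataLoader(ds, batch_size=4, shuffle=False, num_workers=0)
    xb, yb = next(iter(loader))
    print(xb.shape, xb.dtype, yb.shape, yb.dtype)
    assert not torch.isnan(xb).any()
    return xb, yb


xb, yb = smoke_test()
```

Falling back to a neighbouring index keeps batch sizes fixed but duplicates that neighbour, so a high failure rate would skew the data; returning `None` and filtering in a collate_fn is another option. With `num_workers > 0`, each worker gets its own seed, so reproducibility also depends on passing a seeded `generator` to the DataLoader.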

Use case

Use when your data is not a tidy ImageFolder and you need a Dataset class that loads, augments, and batches correctly the first time.
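
Batching is where ragged data usually breaks first. The sketch below shows one way a padding collate_fn for variable-length token sequences might look, assuming each sample is a `(1-D token tensor, int label)` pair. The name `pad_collate` and the `pad_id` default are illustrative assumptions, not something the prompt prescribes.

```python
from typing import List, Tuple

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


def pad_collate(batch: List[Tuple[torch.Tensor, int]], pad_id: int = 0):
    """Pad sequences to the longest in the batch and return a validity mask."""
    seqs, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in seqs], dtype=torch.long)
    tokens = pad_sequence(list(seqs), batch_first=True, padding_value=pad_id)
    # mask[i, j] is True where position j holds a real token of sequence i.
    mask = torch.arange(tokens.size(1))[None, :] < lengths[:, None]
    return tokens, mask, torch.tensor(labels, dtype=torch.long)


# A plain list works as a map-style dataset for a quick check.
samples = [
    (torch.tensor([5, 6, 7]), 1),
    (torch.tensor([8]), 0),
    (torch.tensor([9, 10]), 1),
]
loader = DataLoader(samples, batch_size=3, shuffle=False, collate_fn=pad_collate)
tokens, mask, labels = next(iter(loader))
print(tokens.shape, mask.shape, labels.shape)
```

Returning the mask alongside the padded tokens lets the model ignore padding without having to guess which value was used as `pad_id`.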

When to use this

Use this before training begins. Specify the data location, the modality, and whether augmentation must differ between train and eval.

Follow-up prompts

  • Add a weighted sampler to correct class imbalance at the batch level.
  • Visualize a full augmented batch as a grid to sanity-check transforms before training.
  • Convert the Dataset to a streaming webdataset variant for data too large to fit on disk.
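
For the first follow-up, a weighted sampler could be sketched roughly as below, assuming integer class labels are available as a list up front. `make_balanced_sampler` is a hypothetical helper; inverse class frequency is one common weighting choice, not the only one.

```python
from collections import Counter
from typing import Sequence

import torch
from torch.utils.data import WeightedRandomSampler


def make_balanced_sampler(labels: Sequence[int], seed: int = 0) -> WeightedRandomSampler:
    """Draw each sample with probability inversely proportional to its class count."""
    counts = Counter(labels)
    weights = torch.tensor([1.0 / counts[y] for y in labels], dtype=torch.double)
    gen = torch.Generator().manual_seed(seed)  # makes the draw order reproducible
    return WeightedRandomSampler(weights, num_samples=len(labels),
                                 replacement=True, generator=gen)


# Illustrative 9:1 imbalance; real labels come from your dataset.
labels = [0] * 900 + [1] * 100
sampler = make_balanced_sampler(labels)
drawn = [labels[i] for i in sampler]
print("minority share after resampling:", sum(drawn) / len(drawn))
```

Pass the result as `sampler=` to the DataLoader and leave `shuffle` unset, since the two options are mutually exclusive. Sampling with replacement means minority samples repeat within an epoch, which makes train-time augmentation more important.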
#pytorch #machine-learning #data-loading #augmentation #python
Source
promptfork seed
License
CC-BY-4.0
Published
6/22/2026
