Wire a local RAG pipeline to Ollama with a doc loader and vector store

Produces a complete, local-first RAG pipeline with document loading, chunking, Ollama embeddings, a vector store, retrieval, and a grounded answer step with citations, requiring no cloud APIs.

Prompt

You are a senior engineer who builds local-first RAG systems that stay grounded.

Build a complete local RAG pipeline wired to Ollama. Context:
- Documents: [FILE TYPES — e.g. 'PDFs and Markdown in a ./docs folder', and approx count/size]
- Language: [Python / TypeScript]
- Vector store: [Chroma / Qdrant local / LanceDB / FAISS in-memory]
- Embedding model (Ollama): [nomic-embed-text / mxbai-embed-large / suggest one]
- Generation model (Ollama): [llama3.1 / qwen2.5 / suggest one for my hardware]
- Hardware: [GPU and VRAM / CPU only / Apple Silicon]

Build a pipeline with these stages, each its own function:
1. Load — ingest the document types from the path, extract text, and track source plus page or section for citation.
2. Chunk — split with a sensible strategy (recursive or semantic) and chunk size plus overlap chosen for the doc type; explain the choice.
3. Embed — call the Ollama embedding model locally; batch to stay efficient; store vectors with metadata (source, chunk index).
4. Store — persist to the chosen vector store so re-embedding is not needed on every run.
5. Retrieve — take a query, embed it, return the top-k chunks with a similarity score; expose k and the score threshold as knobs.
6. Answer — build a prompt that uses ONLY the retrieved chunks, instruct the model to answer from them and to say when the context does not contain the answer, and require per-claim citations to source and chunk.
7. Guardrail — if retrieval returns nothing above threshold, the pipeline returns 'no relevant context found' instead of hallucinating.
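As a rough illustration of stages 2 and 5 above, the sketch below splits text into overlapping character windows and ranks stored vectors by cosine similarity with a score threshold. It is a minimal sketch under stated assumptions, not the pipeline the prompt will generate: the names `Chunk`, `chunk_text`, `cosine`, and `retrieve` are hypothetical, fixed-size windows stand in for recursive or semantic splitting, an in-memory list stands in for the vector store, and the vectors are assumed to come from whichever Ollama embedding model is chosen.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # file the text came from, kept for citations
    index: int   # position of the chunk within that source
    text: str


def chunk_text(text, source, size=800, overlap=100):
    """Split text into fixed-size character windows that overlap by `overlap` chars."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for i, start in enumerate(range(0, len(text), step)):
        chunks.append(Chunk(source, i, text[start:start + size]))
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, index, k=4, threshold=0.5):
    """Return up to k (score, chunk) pairs scoring at or above threshold, best first.

    `index` is a list of (Chunk, vector) pairs; an empty result is what
    the guardrail stage checks for.
    """
    scored = [(cosine(query_vec, vec), chunk) for chunk, vec in index]
    scored = [pair for pair in scored if pair[0] >= threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The default `size`, `overlap`, `k`, and `threshold` values are placeholders; they are the knobs the tuning checklist is meant to cover.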

Requirements:
- Everything runs locally — no OpenAI or Anthropic API calls.
- Show the exact Ollama model pulls needed and approximate disk/RAM cost.
- No silent errors; each stage logs what it did.

Output, in this exact order:
1. A design overview (stages, store, models, why).
2. The full runnable pipeline as one script with clear function boundaries.
3. A usage example: index a folder, then ask a question and print the grounded answer with citations.
4. A tuning checklist (chunk size, top-k, threshold, model choice) and how to tell retrieval quality is good.

Success signal: the output is good only if the pipeline runs fully local, answers are grounded in retrieved chunks with citations, and a no-match query returns an explicit 'no relevant context found' instead of a guess.
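One way to check the success signal is to look at how the answer and guardrail stages might fit together. The following is a hedged sketch, not a definitive implementation: `build_prompt`, `answer`, and `NO_CONTEXT` are hypothetical names, the `[source#chunk]` citation tag is an assumed format, and `generate` is assumed to be any callable that sends a prompt to a local Ollama generation model and returns its text.

```python
from typing import Callable, List, Optional, Tuple

NO_CONTEXT = "no relevant context found"

# One retrieved hit: (similarity score, source, chunk index, chunk text)
Hit = Tuple[float, str, int, str]


def build_prompt(question: str, hits: List[Hit]) -> Optional[str]:
    """Build a grounded prompt from retrieved chunks, or None if there are none."""
    if not hits:
        return None
    # Tag each chunk so the model can cite it as [source#chunk_index].
    context = "\n".join(f"[{source}#{idx}] {text}" for _, source, idx, text in hits)
    return (
        "Answer using ONLY the context below. "
        "Cite every claim as [source#chunk]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


def answer(question: str, hits: List[Hit], generate: Callable[[str], str]) -> str:
    """Return the model's grounded answer, or the fixed no-context message."""
    prompt = build_prompt(question, hits)
    if prompt is None:
        return NO_CONTEXT  # guardrail: the model is never called
    return generate(prompt)
```

Passing `generate` in as a parameter keeps the guardrail testable without a running model: a query with no hits can be shown to return the fixed message without any generation call.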

Use case

Use when you want to ask questions of your own documents privately with a local model, using retrieval and citations rather than stuffing everything into the prompt.

When to use this

For private document Q&A where data must not leave the machine. Not suited to corpora of millions of documents or to tasks that need frontier-model reasoning.
