Status: Elective / Extended Module — complete the elective modules first.
Overview
Vector-schema RAG replaces Scout's prompt-stuffing approach with a genuine retrieval pipeline for scenarios where many datasets with unknown schemas must be supported. Schema documentation — table names, column descriptions, spatial types, category vocabularies — is embedded into a vector database and retrieved by semantic similarity at query time. This prevents LLMs from hallucinating column names and table relationships, and reduces prompt token count by only injecting the schemas relevant to the user's question.
Key Concepts
1. Schema-as-Document Embedding
Each dataset's schema is split into fine-grained text chunks — title/description, individual field definitions, category vocabulary — and embedded using a sentence-transformer model. STAC catalog items are the natural input: they already contain structured metadata written in human-readable prose. No manual description generation is required; you embed what the catalog already says.
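The chunking step can be sketched in a few lines of Python. The item shape and field names below are illustrative stand-ins, not Scout's actual catalog schema:

```python
# Split a STAC-like catalog item into fine-grained text chunks ready for
# embedding. The dict layout here is a simplified stand-in for a real STAC
# item, which carries similar metadata under "properties" and extensions.

def schema_chunks(item: dict) -> list[tuple[str, str]]:
    """Return (chunk_id, text) pairs, one per embeddable chunk."""
    chunks = [(f"{item['id']}:description",
               f"{item['title']}. {item['description']}")]
    for field in item["fields"]:
        chunks.append((f"{item['id']}:field:{field['name']}",
                       f"Column {field['name']} ({field['type']}): "
                       f"{field['description']}"))
    if item.get("categories"):
        chunks.append((f"{item['id']}:category_vocabulary",
                       "Valid category values: " + ", ".join(item["categories"])))
    return chunks

item = {
    "id": "sf-places-2026",
    "title": "San Francisco places",
    "description": "Points of interest with category labels.",
    "fields": [
        {"name": "name", "type": "string", "description": "Place name"},
        {"name": "category", "type": "string", "description": "Place category"},
    ],
    "categories": ["pharmacy", "cafe", "park"],
}

for chunk_id, text in schema_chunks(item):
    print(chunk_id, "->", text)
```

Each chunk gets its own vector, so a question about categories can retrieve just the vocabulary chunk without dragging in every column definition.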
2. Retrieval to Prevent Hallucination
When a user asks a question, the query is embedded and compared against all schema chunks in the vector store. The top-k most semantically similar chunks are retrieved and injected into the system prompt. The LLM receives only the schemas relevant to the question, which eliminates both the context-window overflow problem (60 schemas × full schema text) and the retrieval noise problem (irrelevant column names confusing SQL generation).
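The similarity search itself is just cosine similarity over embedding vectors. A toy top-k retrieval, using made-up 3-dimensional vectors in place of real 384-dimensional sentence-transformer output:

```python
import numpy as np

def top_k(query_vec, chunk_vecs, chunk_ids, k=2):
    """Return the k chunk ids most cosine-similar to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    m = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = m @ q                    # cosine similarity per chunk
    order = np.argsort(-sims)[:k]   # highest similarity first
    return [(chunk_ids[i], float(sims[i])) for i in order]

# Toy 3-d embeddings; a real model emits hundreds of dimensions.
ids = ["places:category_vocabulary", "parcels:zoning_field", "trees:species_field"]
vecs = np.array([[0.9, 0.1, 0.0],
                 [0.1, 0.9, 0.1],
                 [0.0, 0.2, 0.9]])
query = np.array([0.8, 0.2, 0.1])   # pretend: embedding of "show me pharmacies"

print(top_k(query, vecs, ids))
```

With real embeddings the vectors come from the model, but the ranking step is exactly this: normalize, dot-product, sort, take the top k.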
3. Grounded SQL Generation
With retrieved schema chunks injected, the LLM generates SQL against column names and vocabulary values that actually exist in the database. Synonym routing is a key benefit: querying "show me drugstores" retrieves the same category_vocabulary chunk as "show me pharmacies" because their embeddings are semantically close — no hardcoded synonym list required.
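Getting the retrieved chunks in front of the model is plain prompt assembly. A minimal sketch, with prompt wording that is illustrative rather than Scout's actual template:

```python
def build_sql_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble a system prompt that grounds SQL generation in retrieved schema text."""
    schema_block = "\n".join(f"- {c}" for c in retrieved_chunks)
    return (
        "You translate questions into SQL. Use ONLY the tables, columns, and\n"
        "category values described below; if something is missing, say so.\n\n"
        f"Relevant schema:\n{schema_block}\n\n"
        f"Question: {question}\nSQL:"
    )

chunks = [
    "Table places, column category (string): place category. "
    "Valid values include pharmacy, cafe, park.",
]
prompt = build_sql_prompt("show me drugstores", chunks)
print(prompt)
```

Because the vocabulary chunk arrives in the prompt, the model can map "drugstores" onto the real `pharmacy` value instead of inventing a column or category that does not exist.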
1. The Scale Problem
Imagine Scout deployed at a city's Open Data Portal. Instead of one GeoParquet, there are 60 datasets: parcels, zoning, tree inventory, traffic sensors, bike lanes, inspection records, permit history, flood zones... Each has its own schema, its own column names, its own vocabulary.
The prompt stuffing approach fails here for two reasons:
- Context window overflow: Even if all 60 schemas fit in a 200K-token window today, you're paying for all 60 schemas on every single query. Most queries are about one dataset.
- Retrieval noise: Injecting irrelevant schema context actively degrades SQL generation quality — the model gets confused by columns that don't apply to the user's intent.
The right approach is to retrieve before you inject: find the 2-3 schemas that are most relevant to the user's question, then build a minimal, targeted prompt.
2. How Vector Search Solves It
Vector embeddings turn text into numerical arrays ("vectors") that encode semantic meaning. Sentences with similar meaning get vectors that are close in space, regardless of exact word choice.
The workflow: offline, embed each STAC catalog item's schema chunks into a vector database. At query time, embed the user's question, perform semantic similarity search to retrieve the top-k most relevant schema chunks, inject those chunks into the system prompt, and call the LLM. The LLM generates SQL with accurate column names because it only receives schemas that match the question.
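A from-scratch miniature of this two-phase workflow, using word-overlap vectors in place of a real embedding model and an in-memory dict in place of ChromaDB. It shows the offline/online split, though only a true semantic model also matches synonyms:

```python
from collections import Counter
import math

# Offline phase: "embed" each schema chunk. A bag-of-words Counter stands in
# for sentence-transformer vectors here; real embeddings also capture
# synonyms, which raw word overlap cannot.
INDEX: dict[str, Counter] = {}

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def index_chunk(chunk_id: str, text: str) -> None:
    INDEX[chunk_id] = embed(text)

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, k: int = 1) -> list[str]:
    q = embed(question)
    ranked = sorted(INDEX, key=lambda cid: cosine(q, INDEX[cid]), reverse=True)
    return ranked[:k]

# Offline: index description chunks for two datasets.
index_chunk("places:description",
            "points of interest coffee shops cafes pharmacies parks")
index_chunk("buildings:description",
            "buildings footprints height stories roof construction")

# Online: route each question to its most relevant schema chunk.
print(retrieve("coffee shop near a park"))   # routes to the places chunk
print(retrieve("tall buildings downtown"))   # routes to the buildings chunk
```

Swapping `embed` for a sentence-transformer call and `INDEX` for a ChromaDB collection gives the real pipeline; the control flow does not change.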
This is the same pattern used by Vanna AI (open source Text-to-SQL with RAG), Databricks Genie (natural language over data lakehouses), Snowflake Copilot (NL-to-SQL over Snowflake catalogs), and Esri's ArcGIS Assistant (NL over GIS layers).
The geospatial-specific insight: STAC catalogs are already structured metadata — they're the natural embedding input. You don't need to generate descriptions from scratch; you embed the STAC item summaries, keywords, and column descriptions you already wrote in Phase 1.
3. Why This Matters for Product Patterns
This module represents a major product expansion decision:
- From a tool to a platform: Prompt stuffing makes Scout a single-dataset tool. RAG makes it a platform that can onboard new datasets without touching the application code. New dataset → run ETL → add to vector index → done.
- The "data catalog" as a product: The vector database becomes a living, searchable index of everything the system knows. This is the same concept behind Alation, Atlan, and DataHub — data cataloging platforms that GIS organizations pay serious money for.
- Cost optimization: Retrieving 3 relevant schemas instead of injecting 60 reduces prompt token count by 95%+. At scale, this is the difference between a viable and an unviable cost structure.
- The cold start problem: When users onboard a new dataset, how does the system know what questions it can answer? The vector index enables "similarity by analogy" — a new "tree_inventory" dataset is automatically queryable for tree-related questions because its embeddings are close to similar datasets.
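The cost claim above is easy to sanity-check with rough numbers. The per-schema token count here is a hypothetical average, not a measured value; substitute your own counts:

```python
# Back-of-envelope prompt-cost comparison. 800 tokens per schema is an
# assumed average for illustration only.
TOKENS_PER_SCHEMA = 800
N_DATASETS = 60
TOP_K = 3

stuffing = N_DATASETS * TOKENS_PER_SCHEMA   # every schema, every query
rag = TOP_K * TOKENS_PER_SCHEMA             # only the retrieved schemas
reduction = 1 - rag / stuffing

print(f"stuffing: {stuffing} tokens, RAG: {rag} tokens, "
      f"reduction: {reduction:.0%}")
```

Retrieving 3 of 60 schemas is a 95% reduction regardless of the per-schema size, since the ratio is what matters.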
5. Exercises
The exercise directory at src/exercises/vector-rag/ contains three scripts:
- 01_embed_catalog.py builds a ChromaDB index from the Phase 1 STAC catalog.
- 02_retrieval_compare.py runs side-by-side token-count and SQL-quality comparisons between prompt stuffing and RAG.
- 03_onboard_buildings.py adds a second dataset (Overture buildings) to the index and exports a retrieval_index.json snapshot for the Next.js API route.
Exercise 1: Embed the STAC Catalog
Reads the STAC catalog from Phase 1, splits each dataset into fine-grained
chunks (title/description, schema fields, category vocabulary), and embeds them
using sentence-transformers/all-MiniLM-L6-v2 — fully local, no API key.
Stores vectors in a persistent ChromaDB instance at output/chroma/.
Exercise 2: Retrieval vs. Prompt Stuffing
Requires LLM_PROVIDER and an API key set in your environment. Runs 6 test
queries through both pipelines and prints a comparison table showing token count
reduction (prompt-stuffing vs. RAG) and the synonym test — querying "pharmacy"
and "drugstore" should retrieve the same schema chunk.
Exercise 3: Onboard a Second Dataset
Adds an Overture Maps buildings dataset to the index without running any ETL.
Runs routing disambiguation tests to verify that "coffee shop near a park"
routes to the places dataset and "tall buildings in SoMa" routes to the
buildings dataset. Exports output/retrieval_index.json — a flat JSON
snapshot that the Next.js RAG API route reads at query time.
What to Observe
- Semantic routing: Query "show me pharmacies" then "show me drugstores". Both retrieve the same sf-places-2026:category_vocabulary chunk, no synonym list needed.
- Token delta: Note the prompt token counts shown in the comparison table. The reduction is your cost-per-query improvement at scale.
- Retrieval failures: Ask about something not in the index. The LLM should use LIKE '%keyword%' on the name column (a graceful fallback).
Scout UI: RAG Toggle
After running Exercise 3, toggle to RAG mode on the /scout page. The
bottom panel gains a green-themed mode indicator and a "Retrieved schema chunks"
panel showing exactly which chunks were selected for your query (with similarity
scores). Compare the same query in both modes to see the token count difference
live.
6. Tooling Landscape
| Tool | Type | Best for | Self-hosted? |
|---|---|---|---|
| ChromaDB | Vector DB | Local dev, notebooks, quick prototypes | Yes |
| Qdrant | Vector DB | Production, spatial payload filtering | Yes or Cloud |
| pgvector | Postgres ext | Existing Postgres infra, hybrid with PostGIS | Yes |
| Weaviate | Vector DB | Hybrid BM25+vector, rich metadata | Yes or Cloud |
| Pinecone | Managed | Zero ops, large scale | Cloud only |
| OpenAI embeddings | Embedding model | Highest quality, requires API | No |
| sentence-transformers | Embedding model | Local, no cost, good quality | Yes |
For learning: start with ChromaDB + sentence-transformers. Zero servers, zero
cost, runs in the same uv environment.