> [!TIP]
> Try it out! You can explore the comparisons discussed in this module on the Spatial Eval Dashboard. Complete the module first — the "Richmond bbox" problem encountered there is the direct motivation for this investigation.
Overview
Evaluating LLM geographic knowledge requires designing prompts, ground-truth datasets, and scoring rubrics that measure how well models reason about places, distances, and spatial relationships. The Scout evaluation framework applies this methodology in practice: a structured experiment comparing multiple LLMs on a corpus of San Francisco neighborhoods, scored against authoritative boundary sources including DataSF, OpenStreetMap, and Wikidata. The framework produces spatial metrics (IoU, centroid error) that quantify model accuracy in a way that reveals systematic biases invisible from qualitative inspection.
Key Concepts
1. Evaluation Design
The core question is what to test: topology (does the model's bounding box overlap the official polygon?), distance estimation (how far is the model's centroid from the reference?), and place famousness effects (do models know iconic places better than vernacular ones?). Designing the place corpus across four tiers — from Golden Gate Park to hyper-local vernacular regions like "Mid-Market" — exposes where LLM geographic knowledge degrades and where models agree confidently on a wrong answer.
2. Ground-Truth Construction
A spatial QA dataset requires authoritative reference geometries for each place. The framework draws from DataSF (SF Planning Department's official neighborhood boundaries), OpenStreetMap/Nominatim, Wikipedia infobox coordinates, and Wikidata structured geometries. DataSF is the primary source because it provides official polygons — enabling IoU scoring that quantifies accuracy against a legal boundary, not just visual inspection.
3. Scoring and Model Comparison
Three primary metrics capture different failure modes: centroid error in meters (is the model's center point correct?), bbox area error as a percentage (does the model's polygon match the reference size?), and Intersection over Union (IoU, where 1.0 is perfect and 0 is no spatial overlap). Aggregating these across models and place tiers produces a model × tier heatmap that reveals which providers know SF best, where knowledge falls off, and whether bigger models are actually better geographers.
When Claude generated `lat BETWEEN 37.773 AND 37.788` for "in the Richmond," it revealed something interesting: LLMs carry implicit geographic knowledge from their training on text data, and that knowledge is unaudited, uneven, and model-dependent. This is part of why many organizations are building geo foundation models and world models: training explicitly on geospatial data captures semantic relations, geometry, and topology directly, rather than through text representations of geographies. This module frames a structured experiment to measure that implicit knowledge.

1. The Research Question
Do different LLMs agree on where a place is, and how do their answers compare to authoritative sources?
More specifically:
- If you ask Claude, GPT-4o, Gemini, Mistral, and Llama the same question ("What are the approximate lat/lon bounding box coordinates of the Richmond district in San Francisco?"), do they produce the same bbox?
- How does each model's answer compare to the OpenStreetMap polygon for the Richmond, or the Wikipedia coordinate, or a crowd-sourced boundary?
- Does model agreement correlate with how famous the place is? (The Richmond vs. the Dogpatch vs. the Outer Sunset vs. Visitacion Valley)
- Do models disagree more on vague places (informal, socially constructed regions with no official boundary) than on formal administrative boundaries like census tracts or ZIP codes?
3. Experimental Design
3.1 Place corpus
Choose ~25 places at multiple specificity and famousness levels:
- Tier 1 (Famous, well-defined — easy): Golden Gate Park SF, Chinatown SF, The Castro SF.
- Tier 2 (Common, moderately vague — medium): The Richmond SF (the motivating example), The Mission SF, The Tenderloin SF, SoMa (South of Market) SF.
- Tier 3 (Local knowledge required — hard): The Dogpatch SF, The Outer Sunset SF, Visitacion Valley SF, Noe Valley SF, Excelsior SF.
- Tier 4 (Hyper-local vernacular — very hard, likely high disagreement): "Upper Haight" vs "Lower Haight", "North Beach" (overlaps with Fisherman's Wharf depending on who you ask), "Mid-Market" (an informal planning term, not a neighborhood), "The Panhandle" (a park? A neighborhood around it?).
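The tiered corpus can be encoded as simple data. This is a sketch: the real corpus presumably lives as a file under data/, and the exact structure and place-name spellings here are assumptions.

```python
# Illustrative place corpus keyed by difficulty tier (1 = easy ... 4 = very hard).
# Names follow the tiers above; the on-disk format under data/ is an assumption.
PLACE_CORPUS = {
    1: ["Golden Gate Park SF", "Chinatown SF", "The Castro SF"],
    2: ["The Richmond SF", "The Mission SF", "The Tenderloin SF", "SoMa SF"],
    3: ["The Dogpatch SF", "The Outer Sunset SF", "Visitacion Valley SF",
        "Noe Valley SF", "Excelsior SF"],
    4: ["Upper Haight", "Lower Haight", "North Beach", "Mid-Market",
        "The Panhandle"],
}

def places_for_tiers(tiers):
    """Flatten the corpus into (tier, place) pairs for the requested tiers."""
    return [(tier, place) for tier in tiers for place in PLACE_CORPUS.get(tier, [])]
```

Iterating `places_for_tiers([1, 2, 3, 4])` yields the full place grid that the batch scripts cross with the model list.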
3.2 Models to test
Current default set is defined in data/models.json. This central registry
allows models to be toggled on or off without changing code. The registry includes Anthropic (Sonnet and Haiku), OpenAI (GPT-4o and GPT-4o-mini), and Google (Gemini 1.5 Pro and Flash).
Testing a large and small model from the same provider (Sonnet vs. Haiku, GPT-4o
vs. GPT-4o-mini) measures whether geographic knowledge scales with model size.
To add or remove models, edit the default boolean in data/models.json, or
explicitly pass --models at the command line.
3.3 The prompt
The same prompt is sent to all models with no system message: ask for bounding box coordinates of a named place in San Francisco, structured as a JSON object with centroid_lat, centroid_lon, bbox_north, bbox_south, bbox_east, bbox_west, confidence, and notes fields. Force structured output. Parse with json.loads. Flag responses that don't parse or produce obviously impossible coordinates.
Run each model × place pair 3 times and average the coordinates. LLMs are stochastic, so a single run may not be representative.
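The parse-and-validate step might look like the following sketch. The field names follow the prompt schema above, but the coordinate sanity bounds and rejection rules are assumptions, not the pipeline's actual checks.

```python
import json
import statistics

# Loose sanity bounds for San Francisco (assumed, not from the pipeline).
SF_LAT = (37.6, 37.9)
SF_LON = (-122.6, -122.3)

COORD_KEYS = ("centroid_lat", "centroid_lon",
              "bbox_north", "bbox_south", "bbox_east", "bbox_west")

def parse_response(raw):
    """Parse one model response; return the dict, or None if a check fails."""
    try:
        resp = json.loads(raw)
        vals = {k: float(resp[k]) for k in COORD_KEYS}
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None  # counts toward the parse failure rate
    # Reject obviously impossible coordinates (wrong region, swapped lat/lon).
    if not (SF_LAT[0] <= vals["centroid_lat"] <= SF_LAT[1]
            and SF_LON[0] <= vals["centroid_lon"] <= SF_LON[1]):
        return None
    # A bbox whose south edge is at or above its north edge is malformed.
    if vals["bbox_south"] >= vals["bbox_north"] or vals["bbox_west"] >= vals["bbox_east"]:
        return None
    return {**resp, **vals}

def average_runs(parsed_runs):
    """Average coordinates over the (typically 3) successful runs."""
    return {k: statistics.mean(r[k] for r in parsed_runs) for k in COORD_KEYS}
```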
3.4 Reference sources
For each place, pull ground-truth data from:
| Source | What it provides | How to access |
|---|---|---|
| OpenStreetMap / Nominatim | Polygon boundary for named places that OSM has mapped | geopy.geocoders.Nominatim or requests to nominatim.openstreetmap.org |
| Wikipedia | Lat/lon coordinate from the article's infobox (if it exists) | wikipedia Python package or Wikipedia API |
| Wikidata | Structured coordinates + bounding box for Wikidata entities | SPARQL query to query.wikidata.org |
| DataSF | San Francisco's official neighborhood boundaries (SF Planning Dept) | data.sfgov.org — GeoJSON download, free |
DataSF is particularly valuable: SF has official neighborhood boundaries published by the Planning Department. The Richmond has an official polygon. Comparing LLM bboxes to the official polygon gives you a quantified accuracy metric, not just visual inspection.
4. Analysis
4.1 Metrics per place × model
- Centroid error (meters): distance from LLM centroid to reference centroid
- Bbox area error (%): |LLM_area - reference_area| / reference_area
- IoU (Intersection over Union): overlap between LLM bbox polygon and reference polygon. IoU = 1 is perfect; 0 is no overlap at all.
- Parse failure rate: % of runs where the model returned non-JSON or impossible coordinates
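For axis-aligned bounding boxes these metrics reduce to a few lines of dependency-free Python. This is a sketch; the actual 03_compute_metrics.py presumably uses shapely/geopandas for true polygon geometries.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    r = 6_371_000  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def _deg_area(box):
    """Width x height of a (south, west, north, east) box in squared degrees.
    Ignores latitude distortion, acceptable when comparing boxes over the
    same small area (one SF neighborhood)."""
    return (box[2] - box[0]) * (box[3] - box[1])

def bbox_iou(a, b):
    """Intersection over Union of two (south, west, north, east) boxes."""
    south, west = max(a[0], b[0]), max(a[1], b[1])
    north, east = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, north - south) * max(0.0, east - west)
    union = _deg_area(a) + _deg_area(b) - inter
    return inter / union if union else 0.0

def area_error_pct(llm_box, ref_box):
    """|LLM_area - reference_area| / reference_area, as a percentage."""
    return 100.0 * abs(_deg_area(llm_box) - _deg_area(ref_box)) / _deg_area(ref_box)
```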
4.2 Aggregate metrics
- Model × tier heatmap: average IoU across models and place tiers, which models know more, and where does knowledge fall off?
- Model agreement surface: for each place, what is the spread of LLM centroids? A tight cluster = high model consensus. A scattered cloud = the models disagree. Overlay the reference polygon to see who's right.
- Size calibration: LLMs tend to over-estimate how large neighborhoods are (they may conflate "the Mission" with "the Mission District" broadly construed). Measure bbox area relative to the official polygon area.
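The aggregation behind the model × tier heatmap is just a grouped mean over per-run scores; a minimal sketch without pandas (the record field names are illustrative):

```python
from collections import defaultdict

def model_tier_means(records):
    """Group per-run scores into heatmap cells.
    records: iterable of dicts with 'model', 'tier', and 'iou' keys
    (field names are illustrative). Returns {(model, tier): mean IoU}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        cell = (rec["model"], rec["tier"])
        sums[cell] += rec["iou"]
        counts[cell] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}
```

Pivoting the resulting dict into a seaborn heatmap (models as rows, tiers as columns) is then a one-liner in the notebook.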
4.3 The map
The output everyone will understand: for a given place (e.g., "the Richmond"), draw:
- The official DataSF polygon (ground truth)
- The Wikipedia coordinate (a point)
- The Nominatim result (a polygon, if OSM has it)
- One marker per model, colored by provider
- Ellipses showing the 1-sigma spread of each model's 3 runs
This immediately reveals: which model is most accurate, which is most confident (tight ellipse), and whether the models cluster together or scatter randomly.
5. Implementation Sketch
The evaluation directory (src/exercises/spatial-eval/) contains five scripts: 01_collect_llm_responses.py (query all models × all places, save raw JSON), 02_collect_reference_data.py (pull Nominatim, Wikipedia, Wikidata, DataSF), 03_compute_metrics.py (IoU, centroid error, parse failure rate), 04_visualize.ipynb (map, heatmap, scatter plots), plus a data/ directory for the place corpus, raw LLM outputs, reference geometries, and computed metrics.
Key Python packages: anthropic and openai for LLM APIs; geopy.geocoders.Nominatim, wikipedia, and requests for reference data; geopandas, shapely.geometry.box, and shapely.ops.unary_union for spatial analysis; folium, pydeck, matplotlib, and seaborn for visualization.
6. Running the Eval Framework
The src/exercises/spatial-eval/ directory is a complete, runnable
implementation. A top-level dispatcher script (run.py) manages the evaluation
commands.
Setup (one-time)
Set API keys for the providers you are using (ANTHROPIC_API_KEY, OPENAI_API_KEY, GEMINI_API_KEY) in your environment before running any commands.
First run (full pipeline)
Run these commands in order. Each is idempotent, so re-running is safe and won't duplicate data.
1. Populate GT-vs-GT agreement data: `uv run python run.py gt-compare` (run once; re-run only if you add GT sources).
2. Run all models × all 25 places × 3 runs: `uv run python run.py batch-eval`.
3. Launch the dashboard: `uv run streamlit run app.py`.
That's it for a full baseline run. No flags are needed; the defaults cover all models, all places, and the primary GT source (sf_planning).
Test a single place first (Smoke Test)
Before running the full batch (which makes many API calls), smoke-test one place with `uv run python run.py smoke --place "The Mission" --model "claude-3-5-haiku-latest"`.
Returning: add new data without re-running everything
All scripts use INSERT OR REPLACE, so they skip already-cached LLM responses and only compute what's missing. You can add a newly supported model (`--models "gemini/gemini-2.0-flash"`), re-run one place with higher repetition (`--places "Visitacion Valley" --runs 5`), add OSM + Wikidata to the GT agreement matrix (`run.py gt-compare --gt-sources osm wikidata`), or run the LLM eval against an additional GT source (`run.py batch-eval --gt-sources osm`).
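The caching pattern described above can be sketched with stdlib sqlite3. The table schema below is an assumption for illustration, not the real results.db schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the real pipeline uses results.db on disk

# Assumed schema: one row per (model, place, run) triple.
conn.execute("""CREATE TABLE IF NOT EXISTS responses (
    model TEXT, place TEXT, run INTEGER, raw_json TEXT,
    PRIMARY KEY (model, place, run))""")

def cached(model, place, run):
    """Return the cached raw response for this key, or None if absent."""
    row = conn.execute(
        "SELECT raw_json FROM responses WHERE model=? AND place=? AND run=?",
        (model, place, run)).fetchone()
    return row[0] if row else None

def save(model, place, run, raw_json):
    """INSERT OR REPLACE keys on the primary key, so re-running a script
    overwrites rather than duplicates; this is what makes runs idempotent."""
    conn.execute("INSERT OR REPLACE INTO responses VALUES (?, ?, ?, ?)",
                 (model, place, run, raw_json))
    conn.commit()
```

A collection script checks `cached(...)` first and only calls the LLM API when it returns None.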
View results from the CLI
- Leaderboard ranked by composite score: `uv run python run.py leaderboard --gt-source sf_planning`
- Metric breakdown: `uv run python run.py analysis --metric iou --gt-source sf_planning`
- GT-vs-GT agreement matrix: `uv run python run.py analysis --source-type gt`
Updating the deployed Streamlit app
results.db is committed to git so Streamlit Community Cloud can read
pre-computed results (its filesystem is ephemeral, so scripts cannot run there).
After any local eval run, commit and push the updated DB.
TODO (if results grow past ~50 MB): migrate away from SQLite-in-git. Options: export to Parquet committed to git, or move to Supabase. See the `db.py` module docstring for the migration path.
Ground Truth Sources
| Source | Key | Coverage | Notes |
|---|---|---|---|
| SF Analysis Neighborhoods | sf_planning | 41 neighborhoods | Primary — DataSF official boundaries |
| Click That Hood | click_that_hood | 37 neighborhoods | Community-sourced; codeforamerica/click_that_hood |
| Zillow (2017) | zillow | 92 polygons | 2017 Zillow neighborhood boundaries |
| OpenStreetMap | osm | varies | Live Nominatim query, disk-cached |
| Wikidata | wikidata | varies | P3896 geoshape, disk-cached |
Query strings for OSM and Wikidata are managed in data/source_aliases.json.
Use the ⚙️ Source Config tab in the Streamlit app to review and edit them.
Experiments to Try
- Compare GT sources: Run `run_gt_comparison.py` and look at the GT Agreement tab. Places where `sf_planning` and `click_that_hood` disagree (low IoU) represent genuine semantic ambiguity, so don't penalize LLMs too harshly there.
- Which model knows SF best?: Check `leaderboard.py`. Look at both composite score and DE-9IM distribution. An "overlaps" relation is much better than "disjoint".
- Tier analysis: The corpus has 4 tiers (1 = iconic, 4 = obscure). Do models degrade on tier 3–4 places? Run `run_analysis.py --metric composite` and look for the pattern.
- GT sensitivity: Run the same model against multiple GT sources. If a model scores 0.7 vs `sf_planning` but 0.2 vs `click_that_hood`, the GT sources disagree more than the model is wrong.
- Add a new GT source: Create `sources/my_source.py`, decorate with `@register_gt("my_source")`, add `from . import my_source` to `sources/__init__.py`. It automatically appears in all CLIs and the Streamlit app.
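The registration mechanism likely amounts to a decorator that populates a module-level registry at import time; a minimal sketch (the real `@register_gt` in the repo may differ in signature and return contract):

```python
GT_REGISTRY = {}  # key -> loader function; consulted by the CLIs and the app

def register_gt(key):
    """Decorator that registers a ground-truth loader under a short key."""
    def wrap(fn):
        GT_REGISTRY[key] = fn
        return fn
    return wrap

# Roughly what sources/my_source.py would contain (loader body is stubbed):
@register_gt("my_source")
def load_my_source(place_name):
    """Return a reference geometry record for place_name (stub)."""
    return {"place": place_name, "geometry": None}
```

Because sources/__init__.py imports each source module, the decorator runs at import time and the new key shows up everywhere the registry is read.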
7. What You Might Find (Hypotheses)
Based on what's known about LLM geographic knowledge:
- Famousness dominates accuracy. Expect Golden Gate Park to score well (centroid error < 100 m, IoU > 0.8 across all models) and Visitacion Valley to score poorly (IoU < 0.3, high variance).
- Models agree more than they're right. For a moderately famous neighborhood, most models will produce similar bboxes, but all of them may be systematically offset or oversized relative to the official polygon. This is the most dangerous failure mode: confident consensus on a wrong answer.
- Bigger models are not always better geographers. Geographic knowledge depends on training data distribution, not just model capacity. A model trained heavily on web text may know more SF neighborhood trivia than a more capable model trained on more heavily filtered data.
- Models inflate bbox size. Human descriptions of neighborhoods describe the "core" area (the blocks everyone agrees on). LLMs trained on those descriptions may overestimate size to be safe. The official polygon may be significantly smaller than what any model returns.
- High-confidence responses are not more accurate. The `confidence` field in the prompt may not correlate with centroid error. This is testable and would be a publishable finding if true.
8. Connecting This Back to Scout
The direct product implication: if you were to deploy Scout commercially, the "Richmond problem" is not hypothetical. You could:
- Run this experiment with your target city
- Use the IoU scores to build a geographic knowledge reliability map: for which neighborhoods can you trust LLM training data? For which do you need to supply a polygon?
- Encode that as a product spec: "For neighborhoods with avg model IoU < 0.5, we require explicit polygon injection before generating SQL."
This turns a research finding into a concrete engineering requirement.
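That spec is a one-line gate in code; a sketch in which the threshold and all names are illustrative, not part of Scout:

```python
IOU_THRESHOLD = 0.5  # below this, LLM geographic priors are untrusted

def needs_polygon_injection(avg_model_iou):
    """True if Scout must be given an explicit polygon before generating SQL."""
    return avg_model_iou < IOU_THRESHOLD

def route_query(neighborhood, reliability_map):
    """Choose a query strategy from the per-neighborhood reliability map
    (a dict of neighborhood -> average model IoU from the eval)."""
    iou = reliability_map.get(neighborhood, 0.0)  # unknown place: assume unreliable
    return "inject_polygon" if needs_polygon_injection(iou) else "llm_bbox_ok"
```

Neighborhoods absent from the reliability map default to polygon injection, which is the safe failure mode.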
9. If You Want to Publish This
The right venue is ACM SIGSPATIAL (November, annual). The GeoAI workshop co-located with SIGSPATIAL is a natural fit for this kind of empirical LLM evaluation. A 4-page short paper with the map outputs and IoU heatmap would be competitive.
Prior work to cite and differentiate from:
- Purves et al. on vernacular geography (your comparison source, not prior LLM work)
- GeoEval benchmark (you're extending from formal places to vernacular regions)
- Any SIGSPATIAL 2024 papers on LLM spatial reasoning (your experimental setup is similar, your corpus of vague/vernacular places is the novelty)