0004 — Itinerary Finder (location → web-scraped itineraries)

ID: 0004
Status: In progress (v0.1–v0.7 shipped; v1 polish remaining)
Owner: @satya
Created: 2026-05-11
Updated: 2026-05-13
Related ADRs: -

Slice tracker — see progress.md. v0.1 through v0.7 are live end-to-end (workers + backend + mobile). What’s left for v1: source allowlist toggle, auto-refresh policy, quality scoring, url_launcher on source pills.

Planners start a trip from a blank page: they type a city into Google, read eight blogs, and stitch a rough itinerary together by hand. Treeper already ingests the fragments they collect (spec 0003). This feature makes the public web a first-class input: any location search returns ready-made itineraries scraped from blogs, Wikivoyage, and travel publishers, with the source URLs visible so the user can dig deeper.

From PRODUCT.md §2:

  • Solo planner — wants a starting itinerary instead of a blank page.
  • Inspiration hoarder — already collects links; now sees curated ones for the destination they searched.
  • Curated planner — sees attribution and can pick the source they trust before importing into their trip.

Goals:

  • F4.1 POST /itineraries/search on the NestJS API: returns cached itineraries for a location instantly and enqueues a background scrape.
  • F4.2 Python pipeline in apps/workers that searches the web, fetches pages, extracts itineraries via the existing OpenRouter LLMClient, and persists them.
  • F4.3 Global cache — itineraries are shared across all users, keyed by normalized location_id. No per-user scoping.
  • F4.4 Permanent storage — no TTL on cached rows. Refresh is explicit (refresh=true flag).
  • F4.5 Every itinerary stores its itinerary_sources (≥1 row) with URL, domain, title, fetched_at, and content hash.
  • F4.6 Source allowlist is off by default in v0 (any domain) but the config knob exists so v1 can flip it on per-environment.
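
F4.3 depends on two searches for the same place resolving to the same key. The spec does not define the normalization, so the following is only a minimal sketch: the function name `normalize_location_key` and the rules (strip accents, fold case, collapse separators into hyphens) are assumptions, not the shipped behaviour.

```python
import re
import unicodedata


def normalize_location_key(raw: str) -> str:
    """Hypothetical normalizer: fold accents and case, collapse separators."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", raw)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold, then turn every run of non-alphanumerics into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", unaccented.casefold())
    return slug.strip("-")
```

With these rules, `normalize_location_key("  Kyōto, Japan ")` and `normalize_location_key("KYOTO  japan")` both give `kyoto-japan`. Names written in non-Latin scripts would collapse to an empty string here, so a real implementation would likely resolve the query to a canonical place first.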

Non-goals:

  • Quality scoring + ranking (v1).
  • Activity geocoding beyond best-effort from the source text.
  • Per-user “save itinerary to my trip” — that’s a thin extension to spec 0001 (trip-planner) and is tracked separately.
  • Source allowlist UI / admin curation (v1).
  • Refresh / staleness policy (v1; user-visible only via refresh=true).

User stories:

  • As a solo planner, when I search “Kyoto 5 days”, I see existing itineraries immediately and new ones stream in as the worker finds them.
  • As an inspiration hoarder, I can open any itinerary and see exactly which URLs it was built from, with one tap to the source.
  • As any user searching the same location later, I get the cached results another user’s search produced — no duplicate scraping cost.
flowchart LR
Flutter[["Flutter<br/>Discover surface"]]
Nest[["NestJS<br/>/itineraries/*"]]
Worker[["Python worker<br/>treeper_workers.itinerary"]]
Search["Search provider<br/>(Brave / SerpAPI)"]
Fetch["httpx + trafilatura"]
LLM["OpenRouter LLM<br/>(structured extract)"]
DB[("Postgres<br/>itineraries + sources<br/>global cache")]
RT{{"Supabase Realtime<br/>itinerary_scrape_jobs"}}
Flutter -- "POST /itineraries/search" --> Nest
Nest -- "forward" --> Worker
Worker --> Search --> Fetch --> LLM --> DB
Worker -- "update job row" --> RT
RT -- "row update" --> Flutter
Flutter -- "GET /itineraries/:id" --> Nest
Nest -- "RLS public read" --> DB
  • Search box on Discover tab → results screen with two sections: “Already on Treeper” (cache) and “Finding more…” (SSE stream).
  • Itinerary detail screen shows a Sources strip at the bottom with favicon + domain + external-link icon.

Acceptance criteria:

AC-1  F4.1  POST /itineraries/search returns 200 with `cached: []`
            and a `job_id` in <300ms when no rows exist.
AC-2  F4.1  Same call returns existing rows when cache is non-empty,
            without enqueuing a duplicate job for the same
            (location_id, duration_days) within 60s.
AC-3  F4.2  Worker completes a scrape job for a single location in
            p95 ≤ 60s with ≥3 sources fetched.
AC-4  F4.3  Two different users searching the same location see the
            same itinerary rows (global cache, no user_id column).
AC-5  F4.4  Cached rows persist indefinitely; no TTL job clears them.
AC-6  F4.5  Every `itineraries` row has ≥1 `itinerary_sources` row;
            insert is rejected (DB check) if the LLM produced no
            traceable source.
AC-7  F4.6  With `ITINERARY_SOURCE_ALLOWLIST` unset, any non-blocked
            domain is scraped. With it set, only matching domains are
            fetched (manual toggle, no UI in v0).
AC-8  F4.2  Re-running the worker on the same input does not insert
            duplicate `itinerary_sources` (uniqueness on content_hash).
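
AC-7 can be read as a small gate in front of the fetch step. The sketch below is illustrative only: it assumes `ITINERARY_SOURCE_ALLOWLIST` holds a comma-separated list of domains and that subdomains match their parent, neither of which the spec states. `BLOCKED_DOMAINS`, `parse_allowlist`, and `is_fetch_allowed` are hypothetical names.

```python
from typing import FrozenSet, Optional
from urllib.parse import urlsplit

# Hypothetical blocklist; the spec says "non-blocked" without listing domains.
BLOCKED_DOMAINS = frozenset({"blocked.example"})


def parse_allowlist(raw: Optional[str]) -> Optional[FrozenSet[str]]:
    """Return None when the variable is unset or empty (allowlist off)."""
    if not raw or not raw.strip():
        return None
    return frozenset(d.strip().lower() for d in raw.split(",") if d.strip())


def _matches(host: str, domains: FrozenSet[str]) -> bool:
    # A domain matches itself and any of its subdomains (assumed rule).
    return any(host == d or host.endswith("." + d) for d in domains)


def is_fetch_allowed(url: str, allowlist: Optional[FrozenSet[str]]) -> bool:
    host = (urlsplit(url).hostname or "").lower()
    if not host or _matches(host, BLOCKED_DOMAINS):
        return False
    # With no allowlist configured, every non-blocked domain passes.
    return allowlist is None or _matches(host, allowlist)
```

A worker could build the allowlist once at startup with `parse_allowlist(os.environ.get("ITINERARY_SOURCE_ALLOWLIST"))` and call `is_fetch_allowed` before each fetch.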

New tables under public.*. Existing trip tables are untouched.

locations              id, name, country, lat, lng, slug,
                       normalized_key UNIQUE, created_at
itineraries            id, location_id → locations,
                       title, duration_days, summary, tags text[],
                       hero_image_url, source_type, lang,
                       created_at, updated_at
itinerary_days         id, itinerary_id → itineraries, day_number,
                       title, narrative
itinerary_activities   id, day_id → itinerary_days, ord, name, kind,
                       description, lat, lng,
                       est_cost_cents, est_duration_min
itinerary_sources      id, itinerary_id → itineraries, url, domain,
                       title, published_at, fetched_at,
                       content_hash UNIQUE
itinerary_scrape_jobs  id, location_id, params jsonb, status,
                       attempts, last_error, started_at, finished_at

Indexes: locations(normalized_key), itineraries(location_id, duration_days), GIN on itineraries.tags, itinerary_sources(content_hash).
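
The unique `content_hash` is what makes re-runs idempotent (AC-8). As a rough illustration, the hash could be a SHA-256 over whitespace-normalized page text. The canonical form and the `InMemorySources` stand-in below are assumptions; the real uniqueness check lives in Postgres.

```python
import hashlib
from typing import Dict


def content_hash(text: str) -> str:
    """SHA-256 over whitespace-normalized text (an assumed canonical form)."""
    canonical = " ".join(text.split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InMemorySources:
    """Stand-in for itinerary_sources; the real table enforces UNIQUE in the DB."""

    def __init__(self) -> None:
        self._by_hash: Dict[str, dict] = {}

    def insert_if_new(self, url: str, text: str) -> bool:
        digest = content_hash(text)
        if digest in self._by_hash:
            # Same content already stored: skip, like ON CONFLICT DO NOTHING.
            return False
        self._by_hash[digest] = {"url": url, "content_hash": digest}
        return True
```

Inserting the same page twice returns `True` and then `False`, even if the second fetch differs only in whitespace.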

RLS:

  • All itinerary_* and locations tables: public read for any authenticated user (global cache); writes restricted to service role (worker). No owner_id.

NestJS (apps/backend):

  • POST /itineraries/search { location, duration_days?, refresh? } → { cached: Itinerary[], job_id: string }
  • GET /itineraries/:id → itinerary + nested days[], sources[]
  • GET /locations/:slug/itineraries?days=5
  • GET /jobs/:id/stream (SSE) → event: itinerary.created
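
For reference, a client of the SSE endpoint receives standard `event:` / `data:` frames separated by blank lines. The parser below is a minimal sketch of that wire format only; `iter_sse_events` is a hypothetical helper, and the payload carried in `data:` is not defined by this spec.

```python
from typing import Iterable, Iterator, List, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event, data) pairs from already-decoded SSE lines."""
    event = "message"  # default event name when no "event:" field is sent
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the buffered event, if it carried data.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space is not part of the value
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)
```

A consumer would filter for `event == "itinerary.created"` and then fetch the new row via GET /itineraries/:id.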

Python workers (apps/workers):

  • POST /ai/itineraries/start — invoked by NestJS to enqueue a job (mirrors the existing /ai/imports/start pattern from spec 0003).
  • Realtime listener on itinerary_scrape_jobs where status='queued'.
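
AC-2 requires that a repeated search for the same (location_id, duration_days) within 60s reuses the existing job instead of enqueuing another. Whichever side owns the enqueue call could guard it roughly like this. `JobDeduper` is a hypothetical in-memory stand-in; a real deployment would more likely query itinerary_scrape_jobs.

```python
import time
from typing import Callable, Dict, Hashable, Tuple


class JobDeduper:
    """Reuses a recent job id for the same key inside a time window."""

    def __init__(self, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._window_s = window_s
        self._clock = clock
        # (location_id, duration_days) -> (job_id, enqueued_at)
        self._recent: Dict[Tuple[Hashable, Hashable], Tuple[str, float]] = {}

    def get_or_enqueue(self, location_id: Hashable, duration_days: Hashable,
                       enqueue: Callable[[], str]) -> str:
        key = (location_id, duration_days)
        now = self._clock()
        hit = self._recent.get(key)
        if hit is not None and now - hit[1] < self._window_s:
            return hit[0]  # reuse the recent job instead of enqueuing again
        job_id = enqueue()
        self._recent[key] = (job_id, now)
        return job_id
```

The window is measured from the first enqueue, so a burst of identical searches maps to one job until 60s have passed.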

Non-functional requirements:

  • First response < 300ms when cache has rows.
  • Worker p95 ≤ 60s for 5 sources / 1 location.
  • Idempotent: same input twice never duplicates sources.
  • Per-domain concurrency = 2; honor robots.txt; UA = TreeperBot/0.1 (+https://treeper.app/bot).
  • All secrets (OpenRouter, Supabase service role, search API key) via env. Service-role key never crosses the public HTTP boundary.
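
The politeness rules above (two concurrent fetches per domain, robots.txt, a fixed user agent) can be sketched with the standard library. `PoliteGate` is a hypothetical helper, not the worker's actual code. It assumes robots.txt has already been downloaded for a domain and treats a domain with no loaded rules as fetchable.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "TreeperBot/0.1 (+https://treeper.app/bot)"
PER_DOMAIN_LIMIT = 2


class PoliteGate:
    """Caps in-flight fetches per domain and applies loaded robots.txt rules."""

    def __init__(self) -> None:
        # One semaphore per domain, created lazily on first use.
        self._slots = defaultdict(lambda: asyncio.Semaphore(PER_DOMAIN_LIMIT))
        self._robots = {}

    def load_robots(self, domain: str, robots_txt: str) -> None:
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self._robots[domain] = parser

    def allowed(self, url: str) -> bool:
        parser = self._robots.get(urlsplit(url).hostname or "")
        # Assumption: with no rules loaded for a domain, the fetch is permitted.
        return parser is None or parser.can_fetch(USER_AGENT, url)

    async def run(self, url: str, fetch):
        """Call `await fetch(url)` under the domain's limit, or return None."""
        if not self.allowed(url):
            return None
        async with self._slots[urlsplit(url).hostname or ""]:
            return await fetch(url)
```

`fetch` is any coroutine function taking a URL, for example a thin wrapper around an httpx client that sends `USER_AGENT`.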

Risks:

  • R1 Scraping any source without an allowlist brings legal and quality variance. Mitigation: attribution is always stored; the allowlist toggle is ready for v1.
  • R2 The LLM may hallucinate activities that are not in the source. Mitigation: the prompt forces a source_snippet per activity; reject if missing.
  • R3 A permanent cache means rows go stale. Mitigation: updated_at + manual refresh=true. Auto-refresh in v1.
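
The R2 mitigation can be pictured as a post-extraction filter. The sketch below goes one step further than the spec: besides rejecting activities with no `source_snippet`, it also checks that the snippet occurs in the fetched text. That substring check and the name `filter_grounded` are assumptions.

```python
from typing import List


def _canonical(text: str) -> str:
    # Collapse whitespace and fold case so trivial differences do not matter.
    return " ".join(text.split()).casefold()


def filter_grounded(activities: List[dict], source_text: str) -> List[dict]:
    """Keep only activities whose source_snippet appears in the page text."""
    haystack = _canonical(source_text)
    kept = []
    for activity in activities:
        snippet = _canonical(str(activity.get("source_snippet") or ""))
        if snippet and snippet in haystack:
            kept.append(activity)
    return kept
```

An exact substring match is deliberately strict; if the model paraphrases its snippets, a fuzzier comparison would be needed.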

Open questions:

  • Q1 Search provider for v0: Brave (cheap) vs SerpAPI (richer)? Decision pending; pipeline abstracts the provider behind a Protocol.
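
Abstracting the provider behind a Protocol might look like the following. This is a sketch of the idea rather than the worker's real interface: `SearchHit`, `SearchProvider`, `StaticProvider`, `candidate_urls`, and the query template are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass(frozen=True)
class SearchHit:
    url: str
    title: str


class SearchProvider(Protocol):
    """Anything with a matching search() method satisfies this structurally."""

    def search(self, query: str, limit: int) -> List[SearchHit]: ...


class StaticProvider:
    """Test double; a Brave, SerpAPI, or SearXNG adapter would expose the same method."""

    def __init__(self, hits: List[SearchHit]) -> None:
        self._hits = hits

    def search(self, query: str, limit: int) -> List[SearchHit]:
        return self._hits[:limit]


def candidate_urls(provider: SearchProvider, location: str, days: int,
                   limit: int = 5) -> List[str]:
    # The query template is an assumption, not the pipeline's actual query.
    query = f"{location} {days} day itinerary"
    return [hit.url for hit in provider.search(query, limit)]
```

Because the Protocol is structural, swapping providers means passing a different object to `candidate_urls`; no inheritance is required.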

Slice  Surface                                                   Status
v0.1   Migrations 0011 + 0012; worker pipeline scaffold;         Shipped
       mobile Discover search page + itinerary deep-link
v0.2   NestJS /itineraries/*; mobile results list +              Shipped
       ItineraryCard navigation
v0.3   Job-status polling for streamed-in itineraries            Shipped
v0.4   Supabase Realtime row updates replacing polling           Shipped
v0.5   SearXNG provider — self-hosted free meta-search,          Shipped
       default in place of paid Brave API
v0.6   GET /itineraries/recent endpoint + mobile                 Shipped
       integration via ItinerariesRepository.recent()
v0.7   Hero + activity image extraction + rehost                 Shipped
       (image_extract.py, images.py, migration 0013,
       richer ScrapedItinerary schema with themes / pace /
       budget / season / cost)
v1     Source allowlist toggle, quality scoring, auto-refresh,   Planned
       url_launcher on source pills

Related:

  • Spec 0003: trip-imports (same Python workers app, same LLM client).
  • ADR (future): itinerary-source allowlist policy.