0004 — Itinerary Finder (location → web-scraped itineraries)

ID:           0004
Status:       In progress (v0.1–v0.7 shipped; v1 polish remaining)
Owner:        @satya
Created:      2026-05-11
Updated:      2026-05-13
Related ADRs: -

Slice tracker — see progress.md. v0.1 through v0.7 are live end-to-end (workers + backend + mobile). What’s left for v1: source allowlist toggle, auto-refresh policy, quality scoring, url_launcher on source pills.

1. Why

Planners start a trip from a blank page. They paste a city into Google, read 8 blogs, and stitch a rough itinerary by hand. Treeper already ingests their fragments (spec 0003). This feature makes the public web a first-class input: any location search returns ready itineraries scraped from blogs, Wikivoyage, and travel publishers — with the source URLs visible so the user can dig deeper.

2. Who it is for

From PRODUCT.md §2:

Solo planner — wants a starting itinerary instead of a blank page.
Inspiration hoarder — already collects links; now sees curated ones for the destination they searched.
Curated planner — sees attribution and can pick the source they trust before importing into their trip.

3. Scope

In scope (v0)

F4.1 POST /itineraries/search on the NestJS API — returns cached itineraries for a location instantly and enqueues a background scrape.
F4.2 Python pipeline in apps/workers that searches the web, fetches pages, extracts itineraries via the existing OpenRouter LLMClient, and persists them.
F4.3 Global cache — itineraries are shared across all users, keyed by normalized location_id. No per-user scoping.
F4.4 Permanent storage — no TTL on cached rows. Refresh is explicit (refresh=true flag).
F4.5 Every itinerary stores its itinerary_sources (≥1 row) with URL, domain, title, fetched_at, and content hash.
F4.6 Source allowlist is off by default in v0 (any domain) but the config knob exists so v1 can flip it on per-environment.
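
F4.3 depends on every spelling of a place collapsing to one cache key. A minimal sketch of one way to derive that key, assuming a simple slug rule (strip accents, case-fold, hyphenate); the function name `normalize_location_key` and the rule itself are illustrative assumptions, not the shipped normalizer:

```python
import re
import unicodedata


def normalize_location_key(raw: str) -> str:
    """Collapse free-text location input into a stable key (assumed rule)."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", raw)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold, then turn every run of non-alphanumerics into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", unaccented.casefold()).strip("-")
```

Under this rule, "Kyōto, Japan" and "kyoto   JAPAN" map to the same key, so both searches would hit the same cached rows.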

Out of scope

Quality scoring + ranking (v1).
Activity geocoding beyond best-effort from the source text.
Per-user “save itinerary to my trip” — that’s a thin extension to spec 0001 (trip-planner) and is tracked separately.
Source allowlist UI / admin curation (v1).
Refresh / staleness policy (v1; user-visible only via refresh=true).

4. User stories

As a solo planner, when I search “Kyoto 5 days”, I see existing itineraries immediately and new ones stream in as the worker finds them.
As an inspiration hoarder, I can open any itinerary and see exactly which URLs it was built from, with one tap to the source.
As any user searching the same location later, I get the cached results another user’s search produced — no duplicate scraping cost.

4a. Architecture

flowchart LR
  Flutter[["Flutter<br/>Discover surface"]]
  Nest[["NestJS<br/>/itineraries/*"]]
  Worker[["Python worker<br/>treeper_workers.itinerary"]]
  Search["Search provider<br/>(Brave / SerpAPI)"]
  Fetch["httpx + trafilatura"]
  LLM["OpenRouter LLM<br/>(structured extract)"]
  DB[("Postgres<br/>itineraries + sources<br/>global cache")]
  RT{{"Supabase Realtime<br/>itinerary_scrape_jobs"}}

  Flutter -- "POST /itineraries/search" --> Nest
  Nest -- "forward" --> Worker
  Worker --> Search --> Fetch --> LLM --> DB
  Worker -- "update job row" --> RT
  RT -- "row update" --> Flutter
  Flutter -- "GET /itineraries/:id" --> Nest
  Nest -- "RLS public read" --> DB
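
The worker leg of the flow above (search, fetch, LLM extract, persist) can be sketched as a function over injected stages. This is an illustration under assumptions: `SearchProvider`, `Page`, and `run_pipeline` are hypothetical names, and the stage callables stand in for the real search, httpx + trafilatura, and OpenRouter steps.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol


class SearchProvider(Protocol):
    """Hypothetical provider interface; any search backend can satisfy it."""

    def search(self, query: str, limit: int) -> List[str]:
        """Return candidate page URLs for a query."""
        ...


@dataclass
class Page:
    url: str
    text: str  # main-content text extracted from the fetched HTML


def run_pipeline(
    location: str,
    provider: SearchProvider,
    fetch: Callable[[str], str],
    extract: Callable[[str, List[Page]], List[dict]],
    persist: Callable[[dict], None],
    limit: int = 5,
) -> int:
    """Search -> fetch -> extract -> persist; returns itineraries stored."""
    urls = provider.search(f"{location} itinerary", limit)
    # Drop pages where fetching or text extraction produced nothing.
    pages = [p for p in (Page(u, fetch(u)) for u in urls) if p.text.strip()]
    itineraries = extract(location, pages) if pages else []
    for itinerary in itineraries:
        persist(itinerary)
    return len(itineraries)
```

Injecting the stages keeps the provider swappable and lets the pipeline be exercised with stubs, without network or LLM calls.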

5. UX notes

Search box on Discover tab → results screen with two sections: “Already on Treeper” (cache) and “Finding more…” (SSE stream).
Itinerary detail screen shows a Sources strip at the bottom with favicon + domain + external-link icon.

6. Acceptance criteria

AC-1  F4.1   POST /itineraries/search returns 200 with `cached: []`
             and a `job_id` in <300ms when no rows exist.
AC-2  F4.1   Same call returns existing rows when cache is non-empty,
             without enqueuing a duplicate job for the same
             (location_id, duration_days) within 60s.
AC-3  F4.2   Worker completes a scrape job for a single location in
             p95 ≤ 60s with ≥3 sources fetched.
AC-4  F4.3   Two different users searching the same location see the
             same itinerary rows (global cache, no user_id column).
AC-5  F4.4   Cached rows persist indefinitely; no TTL job clears them.
AC-6  F4.5   Every `itineraries` row has ≥1 `itinerary_sources` row;
             insert is rejected (DB check) if the LLM produced no
             traceable source.
AC-7  F4.6   With `ITINERARY_SOURCE_ALLOWLIST` unset, any non-blocked
             domain is scraped. With it set, only matching domains are
             fetched (manual toggle, no UI in v0).
AC-8  F4.2   Re-running the worker on the same input does not insert
             duplicate `itinerary_sources` (uniqueness on content_hash).
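
AC-7 can be read as a small predicate over the URL's host. A hedged sketch, assuming `ITINERARY_SOURCE_ALLOWLIST` holds a comma-separated list of domains and that subdomains of a listed domain also match; both the format and the helper names are assumptions, not confirmed behavior:

```python
from typing import FrozenSet, Optional
from urllib.parse import urlsplit


def parse_allowlist(raw: Optional[str]) -> Optional[FrozenSet[str]]:
    """Unset or blank means 'no allowlist'; otherwise a set of domains."""
    if raw is None or not raw.strip():
        return None
    return frozenset(d.strip().lower() for d in raw.split(",") if d.strip())


def _matches(host: str, domains: FrozenSet[str]) -> bool:
    # Exact match, or a subdomain of a listed domain.
    return any(host == d or host.endswith("." + d) for d in domains)


def is_domain_allowed(
    url: str,
    allowlist: Optional[FrozenSet[str]],
    blocked: FrozenSet[str] = frozenset(),
) -> bool:
    host = (urlsplit(url).hostname or "").lower()
    if not host or _matches(host, blocked):
        return False
    # With no allowlist configured, any non-blocked domain is scraped.
    return True if allowlist is None else _matches(host, allowlist)
```

A worker would typically build the allowlist once at startup, for example from `os.environ.get("ITINERARY_SOURCE_ALLOWLIST")`, and call `is_domain_allowed` before each fetch.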

7. Data model

New tables under public.*. Existing trip tables are untouched.

locations              id, name, country, lat, lng, slug,
                       normalized_key UNIQUE, created_at
itineraries            id, location_id → locations,
                       title, duration_days, summary, tags text[],
                       hero_image_url, source_type, lang,
                       created_at, updated_at
itinerary_days         id, itinerary_id → itineraries, day_number,
                       title, narrative
itinerary_activities   id, day_id → itinerary_days, ord, name, kind,
                       description, lat, lng,
                       est_cost_cents, est_duration_min
itinerary_sources      id, itinerary_id → itineraries, url, domain,
                       title, published_at, fetched_at,
                       content_hash UNIQUE
itinerary_scrape_jobs  id, location_id, params jsonb, status,
                       attempts, last_error, started_at, finished_at

Indexes: locations(normalized_key), itineraries(location_id, duration_days), GIN on itineraries.tags, itinerary_sources(content_hash).
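
The UNIQUE constraint on `content_hash` is what makes re-runs idempotent (AC-8). A sketch of how the hash and the skip-on-duplicate behavior might look, assuming SHA-256 over whitespace-normalized text; the spec does not state the hash function or the normalization, and a dict stands in for the table here:

```python
import hashlib
import re
from typing import Dict, Iterable, List, Tuple


def content_hash(extracted_text: str) -> str:
    """SHA-256 over whitespace-normalized text (normalization is an assumption)."""
    canonical = re.sub(r"\s+", " ", extracted_text).strip()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def insert_new_sources(
    table: Dict[str, str], pages: Iterable[Tuple[str, str]]
) -> List[str]:
    """Insert (url, text) pages keyed by hash; skip hashes already present.

    `table` stands in for itinerary_sources with its UNIQUE content_hash.
    Returns the URLs that were actually inserted.
    """
    inserted = []
    for url, text in pages:
        digest = content_hash(text)
        if digest in table:
            continue  # same content already stored: the re-run is a no-op
        table[digest] = url
        inserted.append(url)
    return inserted
```

Against Postgres the same effect would come from the unique index itself (for example an insert that ignores conflicts on `content_hash`); the in-memory version only illustrates the contract.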

RLS:

All itinerary_* and locations tables: public read for any authenticated user (global cache); writes restricted to service role (worker). No owner_id.

8. APIs / contracts

NestJS (apps/backend):

POST /itineraries/search { location, duration_days?, refresh? } → { cached: Itinerary[], job_id: string }
GET /itineraries/:id → itinerary + nested days[], sources[]
GET /locations/:slug/itineraries?days=5
GET /jobs/:id/stream (SSE) → event: itinerary.created
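
The search contract, combined with the 60s duplicate-job window from AC-2, can be modelled in a few lines. This is a language-neutral sketch written in Python, not the NestJS handler: `InMemoryStore` and `search` are hypothetical names, the `refresh` flag is deliberately left out, and reusing the earlier `job_id` inside the window is an assumption about how "no duplicate job" is surfaced to the client.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

DEDUPE_WINDOW_S = 60.0  # AC-2: no duplicate job for the same key within 60s

Key = Tuple[str, Optional[int]]  # (location_id, duration_days)


@dataclass
class InMemoryStore:
    itineraries: Dict[Key, List[dict]] = field(default_factory=dict)
    recent_jobs: Dict[Key, Tuple[str, float]] = field(default_factory=dict)


def search(
    store: InMemoryStore, location_id: str, duration_days: Optional[int], now: float
) -> dict:
    """Return cached rows at once and a job_id for the background scrape."""
    key = (location_id, duration_days)
    cached = list(store.itineraries.get(key, []))
    job = store.recent_jobs.get(key)
    if job is not None and now - job[1] < DEDUPE_WINDOW_S:
        job_id = job[0]  # a job for this key is recent enough: reuse it
    else:
        job_id = str(uuid.uuid4())
        store.recent_jobs[key] = (job_id, now)
    return {"cached": cached, "job_id": job_id}
```

Passing `now` explicitly (seconds from any monotonic clock) keeps the window logic testable without sleeping.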

Python workers (apps/workers):

POST /ai/itineraries/start — invoked by NestJS to enqueue a job (mirrors the existing /ai/imports/start pattern from spec 0003).
Realtime listener on itinerary_scrape_jobs where status='queued'.

9. Non-functional requirements

First response < 300ms when cache has rows.
Worker p95 ≤ 60s for 5 sources / 1 location.
Idempotent: same input twice never duplicates sources.
Per-domain concurrency = 2; honor robots.txt; UA = TreeperBot/0.1 (+https://treeper.app/bot).
All secrets (OpenRouter, Supabase service role, search API key) via env. Service-role key never crosses the public HTTP boundary.
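
The crawl-politeness requirements above (per-domain concurrency of 2, robots.txt, fixed user agent) map onto standard-library pieces. A minimal sketch, assuming an asyncio-based fetcher; `DomainLimiter` and `robots_allows` are hypothetical helpers, and the robots.txt body is passed in as text so the check itself needs no network access:

```python
import asyncio
from typing import Awaitable, Callable, Dict
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "TreeperBot/0.1 (+https://treeper.app/bot)"
PER_DOMAIN_CONCURRENCY = 2


def robots_allows(robots_txt: str, url: str) -> bool:
    """Check a URL against an already-downloaded robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(USER_AGENT, url)


class DomainLimiter:
    """Caps in-flight fetches per host with one semaphore per domain."""

    def __init__(self, limit: int = PER_DOMAIN_CONCURRENCY) -> None:
        self._limit = limit
        self._semaphores: Dict[str, asyncio.Semaphore] = {}

    async def run(self, url: str, fetch: Callable[[str], Awaitable[str]]) -> str:
        host = (urlsplit(url).hostname or "").lower()
        if host not in self._semaphores:
            self._semaphores[host] = asyncio.Semaphore(self._limit)
        async with self._semaphores[host]:
            return await fetch(url)
```

Requests to different hosts still run in parallel; only fetches that share a host queue behind the same semaphore.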

10. Risks & open questions

R1 Scraping any source without an allowlist → legal risk and quality variance. Mitigation: attribution always stored; allowlist toggle ready for v1.
R2 LLM hallucinates activities not in source. Mitigation: prompt forces source_snippet per activity; reject if missing.
R3 Permanent cache → rows go stale. Mitigation: updated_at + manual refresh=true. Auto-refresh in v1.
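Q1 Search provider for v0: Brave (cheap) vs SerpAPI (richer)? In practice, slice v0.5 shipped self-hosted SearXNG as the default in place of the paid Brave API; the pipeline abstracts the provider behind a Protocol, so the choice stays swappable.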
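
The R2 mitigation can be enforced after extraction rather than trusted to the prompt alone. A sketch, assuming each extracted activity is a dict with an optional `source_snippet` field; the extra check that the snippet actually occurs in the fetched text goes beyond the "reject if missing" rule stated above and is an assumption, as is the helper name:

```python
import re
from typing import Dict, List, Tuple


def _squash(text: str) -> str:
    """Normalize whitespace and case so minor reflowing does not break matching."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def split_by_traceability(
    activities: List[Dict], source_text: str
) -> Tuple[List[Dict], List[Dict]]:
    """Return (kept, rejected) activities based on their source_snippet."""
    haystack = _squash(source_text)
    kept, rejected = [], []
    for activity in activities:
        snippet = _squash(activity.get("source_snippet") or "")
        # Reject when the snippet is missing, or (assumed stricter rule)
        # when it cannot be found in the page text it claims to quote.
        if snippet and snippet in haystack:
            kept.append(activity)
        else:
            rejected.append(activity)
    return kept, rejected
```

Rejected activities could be logged with the job's `last_error` or simply dropped; which of the two is appropriate is left open here.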

11. Rollout plan

Slice  Surface                                                  Status
v0.1   Migrations 0011 + 0012; worker pipeline scaffold;        Shipped
       mobile Discover search page + itinerary deep-link
v0.2   NestJS `/itineraries/*`; mobile results list +           Shipped
       ItineraryCard navigation
v0.3   Job-status polling for streamed-in itineraries           Shipped
v0.4   Supabase Realtime row updates replacing polling          Shipped
v0.5   SearXNG provider: self-hosted free meta-search,          Shipped
       default in place of paid Brave API
v0.6   `GET /itineraries/recent` endpoint + mobile              Shipped
       integration via `ItinerariesRepository.recent()`
v0.7   Hero + activity image extraction + rehost                Shipped
       (`image_extract.py`, `images.py`, migration 0013,
       richer `ScrapedItinerary` schema with themes / pace /
       budget / season / cost)
v1     Source allowlist toggle, quality scoring, auto-refresh,  Planned
       `url_launcher` on source pills

12. References

Spec 0003 — trip-imports (same Python workers app, same LLM client).
ADR (future) — itinerary-source allowlist policy.