0004 — Itinerary Finder (location → web-scraped itineraries)
- ID: 0004
- Status: In progress (v0.1–v0.7 shipped; v1 polish remaining)
- Owner: @satya
- Created: 2026-05-11
- Updated: 2026-05-13
- Related ADRs: -

Slice tracker: see progress.md. v0.1 through v0.7 are live end-to-end (workers + backend + mobile). What’s left for v1: source allowlist toggle, auto-refresh policy, quality scoring, `url_launcher` on source pills.
1. Why
Planners start a trip from a blank page. They paste a city into Google, read 8 blogs, and stitch a rough itinerary by hand. Treeper already ingests their fragments (spec 0003). This feature makes the public web a first-class input: any location search returns ready itineraries scraped from blogs, Wikivoyage, and travel publishers — with the source URLs visible so the user can dig deeper.
2. Who it is for
From PRODUCT.md §2:
- Solo planner — wants a starting itinerary instead of a blank page.
- Inspiration hoarder — already collects links; now sees curated ones for the destination they searched.
- Curated planner — sees attribution and can pick the source they trust before importing into their trip.
3. Scope
In scope (v0)
- F4.1 `POST /itineraries/search` on the NestJS API — returns cached itineraries for a location instantly and enqueues a background scrape.
- F4.2 Python pipeline in `apps/workers` that searches the web, fetches pages, extracts itineraries via the existing OpenRouter `LLMClient`, and persists them.
- F4.3 Global cache — itineraries are shared across all users, keyed by normalized `location_id`. No per-user scoping.
- F4.4 Permanent storage — no TTL on cached rows. Refresh is explicit (`refresh=true` flag).
- F4.5 Every itinerary stores its `itinerary_sources` (≥1 row) with URL, domain, title, fetched_at, and content hash.
- F4.6 Source allowlist is off by default in v0 (any domain) but the config knob exists so v1 can flip it on per-environment.
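F4.3 depends on different spellings of one place resolving to the same cache key. The spec does not state the normalization rule, so the following is only a minimal sketch under assumed rules (strip accents, lowercase, hyphenate); `normalize_location_key` is a hypothetical helper name, not the worker's actual function.

```python
import re
import unicodedata


def normalize_location_key(raw: str) -> str:
    """Return a lowercase, accent-free, hyphen-separated key (assumed rule)."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", raw)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Collapse every run of characters outside a-z / 0-9 into one hyphen.
    # Caveat: non-Latin names (e.g. kanji) collapse to an empty key here,
    # so a real implementation would need transliteration or a geocoder id.
    return re.sub(r"[^a-z0-9]+", "-", without_marks.lower()).strip("-")


key_a = normalize_location_key("  Kyōto,  Japan ")
key_b = normalize_location_key("KYOTO Japan")
```

Both inputs map to the same key, which is the property a global cache keyed by location needs.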
Out of scope
- Quality scoring + ranking (v1).
- Activity geocoding beyond best-effort from the source text.
- Per-user “save itinerary to my trip” — that’s a thin extension to spec 0001 (trip-planner) and is tracked separately.
- Source allowlist UI / admin curation (v1).
- Refresh / staleness policy (v1; user-visible only via `refresh=true`).
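Although the allowlist UI is out of scope, the config knob from F4.6 still has to be read somewhere in the worker. A minimal sketch of that check follows, assuming `ITINERARY_SOURCE_ALLOWLIST` holds a comma-separated list of domains (the spec does not define the format) and that subdomains of a listed domain count as matches. `load_allowlist` and `is_allowed` are hypothetical names, and the separate blocked-domain list is not modelled here.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from urllib.parse import urlsplit

ENV_VAR = "ITINERARY_SOURCE_ALLOWLIST"


def load_allowlist(env: Mapping[str, str] | None = None) -> frozenset[str] | None:
    """Return the configured domains, or None when the allowlist is off."""
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR, "").strip()
    if not raw:
        return None  # unset or empty: any domain may be fetched
    # Assumed format: comma-separated domains, case-insensitive.
    return frozenset(d.strip().lower() for d in raw.split(",") if d.strip())


def is_allowed(url: str, allowlist: frozenset[str] | None) -> bool:
    """True if the URL's host may be fetched under the given allowlist."""
    host = (urlsplit(url).hostname or "").lower()
    if not host:
        return False  # not an absolute http(s)-style URL
    if allowlist is None:
        return True
    # Match the domain itself or any subdomain of it.
    return any(host == d or host.endswith("." + d) for d in allowlist)


allowlist = load_allowlist({ENV_VAR: "wikivoyage.org, Example.com"})
```

Passing the environment as a mapping keeps the function testable without mutating `os.environ`.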
4. User stories
- As a solo planner, when I search “Kyoto 5 days”, I see existing itineraries immediately and new ones stream in as the worker finds them.
- As an inspiration hoarder, I can open any itinerary and see exactly which URLs it was built from, with one tap to the source.
- As any user searching the same location later, I get the cached results another user’s search produced — no duplicate scraping cost.
4a. Architecture
```mermaid
flowchart LR
  Flutter[["Flutter<br/>Discover surface"]]
  Nest[["NestJS<br/>/itineraries/*"]]
  Worker[["Python worker<br/>treeper_workers.itinerary"]]
  Search["Search provider<br/>(Brave / SerpAPI)"]
  Fetch["httpx + trafilatura"]
  LLM["OpenRouter LLM<br/>(structured extract)"]
  DB[("Postgres<br/>itineraries + sources<br/>global cache")]
  RT{{"Supabase Realtime<br/>itinerary_scrape_jobs"}}

  Flutter -- "POST /itineraries/search" --> Nest
  Nest -- "forward" --> Worker
  Worker --> Search --> Fetch --> LLM --> DB
  Worker -- "update job row" --> RT
  RT -- "row update" --> Flutter
  Flutter -- "GET /itineraries/:id" --> Nest
  Nest -- "RLS public read" --> DB
```

5. UX notes
- Search box on Discover tab → results screen with two sections: “Already on Treeper” (cache) and “Finding more…” (SSE stream).
- Itinerary detail screen shows a `Sources` strip at the bottom with favicon + domain + external-link icon.
6. Acceptance criteria
| ID | Feature | Criterion |
|---|---|---|
| AC-1 | F4.1 | `POST /itineraries/search` returns 200 with `cached: []` and a `job_id` in <300ms when no rows exist. |
| AC-2 | F4.1 | Same call returns existing rows when cache is non-empty, without enqueuing a duplicate job for the same (location_id, duration_days) within 60s. |
| AC-3 | F4.2 | Worker completes a scrape job for a single location in p95 ≤ 60s with ≥3 sources fetched. |
| AC-4 | F4.3 | Two different users searching the same location see the same itinerary rows (global cache, no user_id column). |
| AC-5 | F4.4 | Cached rows persist indefinitely; no TTL job clears them. |
| AC-6 | F4.5 | Every `itineraries` row has ≥1 `itinerary_sources` row; insert is rejected (DB check) if the LLM produced no traceable source. |
| AC-7 | F4.6 | With `ITINERARY_SOURCE_ALLOWLIST` unset, any non-blocked domain is scraped. With it set, only matching domains are fetched (manual toggle, no UI in v0). |
| AC-8 | F4.2 | Re-running the worker on the same input does not insert duplicate `itinerary_sources` (uniqueness on content_hash). |

7. Data model
New tables under public.*. Existing trip tables are untouched.
| Table | Columns |
|---|---|
| `locations` | id, name, country, lat, lng, slug, normalized_key UNIQUE, created_at |
| `itineraries` | id, location_id → locations, title, duration_days, summary, tags text[], hero_image_url, source_type, lang, created_at, updated_at |
| `itinerary_days` | id, itinerary_id → itineraries, day_number, title, narrative |
| `itinerary_activities` | id, day_id → itinerary_days, ord, name, kind, description, lat, lng, est_cost_cents, est_duration_min |
| `itinerary_sources` | id, itinerary_id → itineraries, url, domain, title, published_at, fetched_at, content_hash UNIQUE |
| `itinerary_scrape_jobs` | id, location_id, params jsonb, status, attempts, last_error, started_at, finished_at |

Indexes: `locations(normalized_key)`, `itineraries(location_id, duration_days)`, GIN on `itineraries.tags`, `itinerary_sources(content_hash)`.
RLS:
- All `itinerary_*` and `locations` tables: public read for any authenticated user (global cache); writes restricted to service role (worker). No `owner_id`.
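The `content_hash UNIQUE` column is what makes repeated worker runs idempotent (AC-8). The sketch below illustrates the idea with Python's built-in `sqlite3` as a stand-in for Postgres. The hashing rule (SHA-256 over whitespace-normalized text) and the trimmed-down table are assumptions for illustration, not the shipped schema or worker code.

```python
import hashlib
import sqlite3


def content_hash(text: str) -> str:
    """SHA-256 of whitespace-normalized text (assumed canonical form)."""
    canonical = " ".join(text.split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def insert_source(conn: sqlite3.Connection, itinerary_id: int, url: str, text: str) -> None:
    """Insert a source row; a row with the same content_hash is skipped."""
    conn.execute(
        "INSERT INTO itinerary_sources (itinerary_id, url, content_hash) "
        "VALUES (?, ?, ?) ON CONFLICT (content_hash) DO NOTHING",
        (itinerary_id, url, content_hash(text)),
    )


# In-memory stand-in with only the columns this sketch needs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE itinerary_sources ("
    " id INTEGER PRIMARY KEY,"
    " itinerary_id INTEGER NOT NULL,"
    " url TEXT NOT NULL,"
    " content_hash TEXT NOT NULL UNIQUE)"
)

page = "Day 1: Fushimi Inari at dawn.\nDay 2: Arashiyama."
insert_source(conn, 1, "https://example.com/kyoto", page)
# Second run over the same page (different whitespace only) inserts nothing.
insert_source(conn, 1, "https://example.com/kyoto", page.replace("\n", "  "))

row_count = conn.execute("SELECT COUNT(*) FROM itinerary_sources").fetchone()[0]
```

The `ON CONFLICT ... DO NOTHING` clause uses the same syntax in Postgres, so the duplicate is dropped by the database rather than by application logic.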
8. APIs / contracts
NestJS (apps/backend):
- `POST /itineraries/search` `{ location, duration_days?, refresh? }` → `{ cached: Itinerary[], job_id: string }`
- `GET /itineraries/:id` → itinerary + nested `days[]`, `sources[]`
- `GET /locations/:slug/itineraries?days=5`
- `GET /jobs/:id/stream` (SSE) → `event: itinerary.created`
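The search endpoint has to avoid enqueuing a second scrape for the same (location_id, duration_days) within 60s (AC-2). The real handler lives in NestJS; the Python sketch below only illustrates the windowing logic. `JobDeduper` is a hypothetical name and the in-memory dict is an assumption. A production version would more plausibly consult `itinerary_scrape_jobs`.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field

DEDUPE_WINDOW_SECONDS = 60.0  # window taken from AC-2


@dataclass
class JobDeduper:
    """Remembers the most recent job per (location_id, duration_days)."""

    # key -> (job_id, enqueued_at in seconds)
    _recent: dict = field(default_factory=dict)

    def get_or_enqueue(
        self, location_id: str, duration_days: int | None, now: float
    ) -> tuple[str, bool]:
        """Return (job_id, created); created is False when a job is reused."""
        key = (location_id, duration_days)
        hit = self._recent.get(key)
        if hit is not None and now - hit[1] < DEDUPE_WINDOW_SECONDS:
            return hit[0], False  # still inside the window: reuse the job
        job_id = str(uuid.uuid4())
        self._recent[key] = (job_id, now)
        return job_id, True


deduper = JobDeduper()
first_id, first_created = deduper.get_or_enqueue("loc-kyoto", 5, now=0.0)
second_id, second_created = deduper.get_or_enqueue("loc-kyoto", 5, now=30.0)
third_id, third_created = deduper.get_or_enqueue("loc-kyoto", 5, now=61.0)
```

Taking `now` as a parameter keeps the window logic deterministic in tests. A caller would pass a monotonic clock reading.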
Python workers (apps/workers):
- `POST /ai/itineraries/start` — invoked by NestJS to enqueue a job (mirrors the existing `/ai/imports/start` pattern from spec 0003).
- Realtime listener on `itinerary_scrape_jobs` where `status='queued'`.
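For consumers of `GET /jobs/:id/stream`, the server-sent-events framing can be parsed in a few lines. The sketch below handles the standard `event:` / `data:` fields and blank-line dispatch. The JSON payload shape (`itinerary_id`) is an assumption for illustration, since the spec names only the event type.

```python
from __future__ import annotations

import json
from collections.abc import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (event_name, data) pairs from raw SSE lines."""
    event, data = "message", []  # "message" is the SSE default event type
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the buffered event, if it has data.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            name, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # one optional space after the colon
            if name == "event":
                event = value
            elif name == "data":
                data.append(value)


sample_stream = [
    ": keep-alive",
    "event: itinerary.created",
    'data: {"itinerary_id": "it_123"}',  # hypothetical payload
    "",
]
events = [(name, json.loads(body)) for name, body in iter_sse_events(sample_stream)]
```

The same generator works over a streamed HTTP response body by passing its line iterator instead of the sample list.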
9. Non-functional requirements
- First response < 300ms when cache has rows.
- Worker p95 ≤ 60s for 5 sources / 1 location.
- Idempotent: same input twice never duplicates sources.
- Per-domain concurrency = 2; honor `robots.txt`; UA = `TreeperBot/0.1 (+https://treeper.app/bot)`.
- All secrets (OpenRouter, Supabase service role, search API key) via env. Service-role key never crosses the public HTTP boundary.
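The crawl-politeness requirements above (per-domain concurrency of 2, honoring `robots.txt`) can be sketched with the standard library alone. This is an illustration under assumptions, not the worker's code: `DomainLimiter` and `allowed_by_robots` are hypothetical names, the fetch is faked with a sleep, and the robots rules are passed in as text rather than downloaded.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

ROBOTS_TOKEN = "TreeperBot"  # product token of the UA string in the spec
PER_DOMAIN_CONCURRENCY = 2


def allowed_by_robots(robots_txt: str, url: str) -> bool:
    """Check a URL against already-downloaded robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(ROBOTS_TOKEN, url)


class DomainLimiter:
    """Caps the number of in-flight fetches per hostname."""

    def __init__(self, limit: int = PER_DOMAIN_CONCURRENCY) -> None:
        self._semaphores = defaultdict(lambda: asyncio.Semaphore(limit))

    async def run(self, url: str, fetch):
        domain = urlsplit(url).hostname or ""
        async with self._semaphores[domain]:
            return await fetch(url)


async def _demo() -> int:
    """Run six fake fetches against one domain; return peak concurrency."""
    limiter = DomainLimiter()
    in_flight = 0
    peak = 0

    async def fake_fetch(url: str) -> str:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stands in for the HTTP round trip
        in_flight -= 1
        return url

    urls = [f"https://example.com/page/{i}" for i in range(6)]
    await asyncio.gather(*(limiter.run(u, fake_fetch) for u in urls))
    return peak


ROBOTS = "User-agent: *\nDisallow: /private/\n"
peak_in_flight = asyncio.run(_demo())
```

One semaphore per hostname keeps a slow domain from blocking fetches to other domains, while still bounding load on each individual site.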
10. Risks & open questions
- R1 Scraping any-source w/o allowlist → legal & quality variance. Mitigation: attribution always stored; allowlist toggle ready for v1.
- R2 LLM hallucinates activities not in source. Mitigation: prompt forces `source_snippet` per activity; reject if missing.
- R3 Permanent cache → rows go stale. Mitigation: `updated_at` + manual `refresh=true`. Auto-refresh in v1.
- Q1 Search provider for v0 — Brave (cheap) vs SerpAPI (richer)? Decision pending; pipeline abstracts the provider behind a Protocol.
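Q1 notes that the pipeline hides the search provider behind a Protocol. A minimal sketch of what such an abstraction could look like with `typing.Protocol` follows. `SearchProvider`, `SearchHit`, `StaticProvider`, and `candidate_urls` are all hypothetical names, and the query template is an assumption. The actual interface in `apps/workers` may differ.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class SearchHit:
    url: str
    title: str


class SearchProvider(Protocol):
    """Anything with a matching search() method satisfies this interface."""

    def search(self, query: str, limit: int) -> list[SearchHit]: ...


class StaticProvider:
    """Test double that returns canned hits; no network access."""

    def __init__(self, hits: list[SearchHit]) -> None:
        self._hits = hits

    def search(self, query: str, limit: int) -> list[SearchHit]:
        return self._hits[:limit]


def candidate_urls(
    provider: SearchProvider,
    location: str,
    duration_days: int | None = None,
    limit: int = 5,
) -> list[str]:
    """Build a query (assumed template) and return the hit URLs."""
    if duration_days:
        query = f"{location} {duration_days} day itinerary"
    else:
        query = f"{location} itinerary"
    return [hit.url for hit in provider.search(query, limit)]


provider = StaticProvider(
    [
        SearchHit("https://example.com/kyoto-5-days", "Kyoto in 5 days"),
        SearchHit("https://example.org/kyoto-guide", "Kyoto guide"),
        SearchHit("https://example.net/kyoto", "Kyoto"),
    ]
)
urls = candidate_urls(provider, "Kyoto", duration_days=5, limit=2)
```

Because `Protocol` uses structural typing, a Brave, SerpAPI, or SearXNG adapter would only need a compatible `search` method, with no shared base class.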
11. Rollout plan
| Slice | Surface | Status |
|---|---|---|
| v0.1 | Migrations 0011 + 0012; worker pipeline scaffold; mobile Discover search page + itinerary deep-link | Shipped |
| v0.2 | NestJS `/itineraries/*`; mobile results list + ItineraryCard navigation | Shipped |
| v0.3 | Job-status polling for streamed-in itineraries | Shipped |
| v0.4 | Supabase Realtime row updates replacing polling | Shipped |
| v0.5 | SearXNG provider — self-hosted free meta-search, default in place of paid Brave API | Shipped |
| v0.6 | `GET /itineraries/recent` endpoint + mobile integration via `ItinerariesRepository.recent()` | Shipped |
| v0.7 | Hero + activity image extraction + rehost (`image_extract.py`, `images.py`, migration 0013, richer `ScrapedItinerary` schema with themes / pace / budget / season / cost) | Shipped |
| v1 | Source allowlist toggle, quality scoring, auto-refresh, `url_launcher` on source pills | Planned |
12. References
- Spec 0003 — trip-imports (same Python workers app, same LLM client).
- ADR (future) — itinerary-source allowlist policy.