Commit graph

3 commits

Author SHA1 Message Date
0aa4c6c2bb enrichment: drop LLM for structured info, dedup images by sha + phash
Per user request, the LLM is no longer asked to extract rooms/size/rent/WBS —
those come from the inberlinwohnen.de scraper which is reliable. Haiku is now
used for one narrow job: pick which <img> URLs from the listing page are
actual flat photos (vs. logos, badges, ads, employee portraits). On any LLM
failure the unfiltered candidate list passes through.

Image dedup runs in two tiers:
1. SHA256 of bytes — drops different URLs that point to byte-identical files
2. Perceptual hash (Pillow + imagehash, Hamming distance ≤ 5) — drops the
   "same image at a different resolution" duplicates from srcset / CDN
   variants that were filling galleries with 2–4× copies

UI:
- Wohnungsliste falls back to scraper-only display (rooms/size/rent/wbs)
- Detail panel only shows images + "Zur Original-Anzeige →"; description /
  features / pros & cons / kv table are gone
- Per-row "erneut versuchen" link + the "analysiert…/?" status chips were
  tied to LLM extraction and are removed; the header "Bilder nachladen (N)"
  button still surfaces pending/failed batches for admins

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 15:29:55 +02:00
a8f698bf5e enrichment: capture failure cause + admin retry button
Each enrichment failure now records {"_error": "...", "_step": "..."} into
enrichment_json, mirrors the message into the errors log (visible in
/logs/protokoll), and the list shows the cause as a tooltip on the
"Fehler beim Abrufen der Infos" text. Admins also get a "erneut versuchen"
link per failed row that re-queues just that flat (POST /actions/enrich-flat).

The pipeline raises a typed EnrichmentError per step (fetch / llm / crash)
so future failure modes don't get swallowed as a silent "failed".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 15:05:39 +02:00
eb66284172 enrichment: Haiku flat details + image gallery on expand
apply service
- POST /internal/fetch-listing: headless Playwright fetch of a listing URL,
  returns {html, image_urls[], final_url}. Uses the same browser
  fingerprint/profile as the apply run so bot guards don't kick in

web service
- New enrichment pipeline (web/enrichment.py):
  /internal/flats → upsert → kick() enrichment in a background thread
    1. POST /internal/fetch-listing on apply
    2. llm.extract_flat_details(html, url) — Haiku tool-use call returns
       structured JSON (address, rooms, rent, description, pros/cons, etc.)
    3. Download each image directly to /data/flats/<slug>/NN.<ext>
    4. Persist enrichment_json + image_count + enrichment_status on the flat
- llm.py: minimal Anthropic /v1/messages wrapper, no SDK
- DB migration v5 adds enrichment_json/_status/_updated_at + image_count
- Admin "Altbestand anreichern" button (POST /actions/enrich-all) queues
  backfill for all pending/failed rows; runs in a detached task
- GET /partials/wohnung/<id> renders _wohnung_detail.html
- GET /flat-images/<slug>/<n> serves the downloaded image

UI
- Chevron on each list row toggles an inline detail pane (HTMX fetch on
  first open, hx-preserve keeps it open across the 3–30 s polls)
- CSS .flat-gallery normalises image tiles to a 4/3 aspect with object-fit:
  cover so different source sizes align cleanly
- "analysiert…" / "?" chips on the list reflect enrichment_status

Config
- ANTHROPIC_API_KEY + ANTHROPIC_MODEL wired into docker-compose's web
  service (default model: claude-haiku-4-5-20251001)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 14:46:12 +02:00