healthcare: bound NPPES-stale window [3,10]yr + restore verify_ok gate

- Add NPPES_STALE_MAX_YEARS (default 10): the longer a record has gone
  untouched, the more likely the practice closed or moved, and a bounce burns
  the warming IP. The observed institutional distribution clusters at 3-7 years
  with ~0 beyond 8, so 10 is a safe ceiling that still mails the whole real
  pool while excluding any ancient outlier record. MIN stays 3 (keeps the
  'out of date' claim credible).
- Restore the SMTP-verification gate (verify_ok) that the shared
  institutional_verified selector had; the swap to nppes_stale dropped it. With
  the gate restored, we only mail inboxes we have already proved live.
- enrich: process the re-fetch queue STALEST-FIRST so a bounded (--limit) or
  --max-age refresh spends its budget on the most-overdue cache entries (and new
  NPIs) first, never starving them behind merely-aging ones.
- Selector unit-tested (10 cases incl. window edges, verify gate, deactivated).
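
The selection rules described above (stale window, verify gate, deactivation check) can be sketched roughly as follows. This is an illustration only, not the selector from this commit: NPPES_STALE_MAX_YEARS and verify_ok are named in the message, but NPPES_STALE_MIN_YEARS, is_selectable, years_since, and the record fields last_updated and deactivated are assumed names. Treating both window edges as inclusive is also an assumption read from the [3,10] notation.

```python
from datetime import date

# NPPES_STALE_MAX_YEARS is named in the commit message; the MIN constant name
# and every record field used below are hypothetical, chosen for illustration.
NPPES_STALE_MIN_YEARS = 3
NPPES_STALE_MAX_YEARS = 10


def years_since(last_updated: date, today: date) -> float:
    """Approximate record age in years (365.25-day years)."""
    return (today - last_updated).days / 365.25


def is_selectable(record: dict, today: date) -> bool:
    """Return True only if the record passes all three gates."""
    if record.get("deactivated"):
        return False  # deactivated NPIs are never mailed
    if not record.get("verify_ok"):
        return False  # SMTP verification must have succeeded
    age = years_since(record["last_updated"], today)
    # Assumed inclusive window: stale enough to be "out of date",
    # but not so old that the practice has likely closed or moved.
    return NPPES_STALE_MIN_YEARS <= age <= NPPES_STALE_MAX_YEARS
```

Under these assumptions, a verified record last updated about five years ago is selected, while an unverified record of the same age, or one last updated sixteen years ago, is skipped.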
justin 2026-06-20 15:28:12 -05:00
parent 9e155d214c
commit 744f0a89cf
2 changed files with 28 additions and 11 deletions

@@ -175,11 +175,17 @@ def main() -> int:
     cache = load_cache(args.cache)
     log(f"cache={args.cache} entries={len(cache):,}")
-    # Determine which NPIs need a (re)fetch.
+    # Determine which NPIs need a (re)fetch, STALEST FIRST so a bounded run
+    # (--limit) always spends its budget on the most-overdue cache entries.
+    # Never-fetched entries have an empty fetched_at, which sorts first, so new
+    # NPIs are prioritized over merely-aging ones.
     todo = [n for n in npis if not is_fresh(cache.get(n, {}), today, args.max_age)]
+    todo.sort(key=lambda n: cache.get(n, {}).get("fetched_at", "") or "")
+    n_due = len(todo)
     if args.limit:
         todo = todo[:args.limit]
-    log(f"to_fetch={len(todo):,} (of {len(npis):,} unique NPIs; limit={args.limit or 'all'})")
+    log(f"to_fetch={len(todo):,} (of {n_due:,} due / {len(npis):,} unique NPIs; "
+        f"limit={args.limit or 'all'})")
     fetched = 0
     t0 = time.time()
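
The stalest-first ordering in this hunk can be tried in isolation with a small sketch. It assumes fetched_at holds ISO-8601 date strings, so lexicographic order matches chronological order. The helper name stalest_first and the sample NPIs are made up for illustration, and the is_fresh filter is skipped (every NPI is treated as due).

```python
def stalest_first(npis, cache, limit=None):
    """Order NPIs by fetched_at ascending, then apply an optional limit.

    Missing or None fetched_at values map to "", which sorts before any
    ISO-8601 date string, so never-fetched NPIs come first.
    """
    todo = sorted(npis, key=lambda n: cache.get(n, {}).get("fetched_at", "") or "")
    return todo[:limit] if limit else todo


# Made-up sample data: two cached entries, one entry with a null timestamp,
# and one NPI that is absent from the cache entirely.
cache = {
    "1111111111": {"fetched_at": "2025-01-10"},
    "2222222222": {"fetched_at": "2023-06-01"},
    "3333333333": {"fetched_at": None},
}
npis = ["1111111111", "2222222222", "3333333333", "4444444444"]

print(stalest_first(npis, cache, limit=3))
# ['3333333333', '4444444444', '2222222222']
```

Because Python's sort is stable, the two never-fetched NPIs keep their input order, and the limit then cuts off the freshest entry rather than an overdue one.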