Commit graph

2 commits

Author SHA1 Message Date
justin
4d3af2aeae otc: domain->email scraper + filing-agent domain filtering
scrape_otc_emails.py: fetch each issuer domain's IR/contact pages (gzip,
HTML-only, early-abort, prefer ir@/investor@/info@), extract a contact email.
Skip filing-agent domains (DFN/Donnelley/Broadridge/etc.) that leak into the
extracted domain -- those are not the issuer's site. Same filter added to the
harvester's DOMAIN_NOISE for future runs. Phone (100%) is the fallback channel
for email misses.
2026-06-14 06:56:45 -05:00
justin
fdea97e57e otc: EDGAR harvester for US-domestic OTC issuers + domain-from-filings
Pilot -> production: harvest_otc_issuers.py pulls the OTC/None universe (2,771),
keeps US-domestic (requires BOTH a US state-of-incorporation AND a US-state
business address -- disambiguates the 'DE'=Delaware-vs-Germany trap that leaked
Infineon etc.), and extracts each issuer's website DOMAIN directly from its
latest 10-K/8-K/DEF-14A filing (free, no scrape; ~58-60% find rate in testing).
Outputs cik,name,ticker,state_inc,phone,city,state,zip,domain -- ready for the
domain->email scrape + verify step. Phone is 100% (clean fallback call channel).
Reincorporation-to-TX / RA / foreign-qual / franchise-tax / annual-report fit.
2026-06-14 01:24:56 -05:00