new-site/scripts/ops/carbonio/README.md
justin 2d220a273d ops(carbonio): add noreply@ mailbox auto-purge + daily cron
Server-side classifier for the noreply@performancewest.net Carbonio mailbox
(35,337 msgs, ~98.6% machine noise). Deletes bounces/auto-replies/ticket
auto-acks, keeps genuine human Re: replies + unsubscribes (move to Trash,
reversible).

Classifier precedence: unsubscribe guard > RFC3834 Auto-Submitted header >
machine From-address (localpart/strong-token/display-bot) > STRONG auto
subjects (deletes deceptive Re: auto-acks) > human Re: keep > broad auto-ack
subjects > default keep. Subjects RFC2047 MIME-decoded first.

Three-phase execution: Phase1 fast MAILER-DAEMON search-delete, Phase1.5 fast
search-delete of common auto classes (guarded against Re:/unsub), Phase2
header-classify the small remainder with KEEP-caching.

Validated 23/23 against hand-labelled live sample. Initial backfill reduced
35,337 -> 68 (67 human replies + 1 unsubscribe). Daily cron installed in root
crontab: 17 4 * * * --apply --days 3.
2026-06-21 04:55:50 -05:00


Carbonio noreply@ mailbox auto-purge

Server-side maintenance for the noreply@performancewest.net mailbox on the Carbonio (Zextras) mail host co.carrierone.com.

Problem

The noreply@ mailbox accumulated 35,337 messages (~488 MB). A sampled audit showed ~98.6% were machine noise: bounce DSNs (this box's own Postfix backscatter), out-of-office / auto-reply messages, and helpdesk/ticket auto-acknowledgements. Buried in the rest were a small number of genuine human replies to the trucking (DOT#/MCS-150) and telecom/FCC campaigns -- these land here because of the historical Reply-To behaviour -- plus the occasional unsubscribe request.

Policy (explicit)

  • DELETE: bounces, ticket/case auto-acknowledgements, out-of-office and auto-reply messages, delivery-status notifications, authentication reports.
  • KEEP: genuine human replies (Re:/Fwd:) and unsubscribe/opt-out requests.
  • Fail-safe: when a message is not clearly machine-generated, KEEP it.
  • Deletions move messages to /Trash (reversible); nothing is ever hard-deleted.

Why a header/sender classifier, not subject matching

Subject text alone is unreliable: auto-responders frequently answer with a deceptive Re: prefix (for example, Re: <our subject> in response to one of our campaigns). The classifier therefore applies the following rules in precedence order, and the first rule that matches decides the outcome:

  1. Unsubscribe guard (compliance) -- always KEEP, overrides everything.
  2. RFC 3834 Auto-Submitted: header -- if present and != no, the sending system has declared the message automatic (bounces = auto-generated, vacation/auto-replies = auto-replied). This is the single most reliable signal and it catches the deceptive Re: auto-responders.
  3. Machine From-address -- exact bot localparts (mailer-daemon, postmaster, no-reply, ...), strong tokens anywhere in the localpart (...-bounces@, expense-noreply-...@, auth-results@), and display-name bots (Mail Delivery System, System Administrator, ...).
  4. STRONG auto subjects -- unambiguous machine markers no human types (New Ticket Created, (autoresponse), Auto Re:, your request with id ##...##, we're on it, Undeliverable, Authentication Report, ...). Checked before the human Re: guard so ticket auto-acks dressed as Re: are still removed.
  5. Human Re:/Fwd: -- KEEP.
  6. Ticket tag [##...##] / broad auto-ack subjects -- DELETE.
  7. Default -> KEEP (human-safe).
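
The precedence list above can be sketched in Python as follows. This is a minimal illustration, not the script's actual decide() implementation: the token lists are small samples taken from the examples in this README, the "noreply" token and the broad-subject list are assumptions, and the regular expressions are guesses at reasonable matching rules.

```python
import re
from email import message_from_string
from email.header import decode_header, make_header
from email.utils import parseaddr

# Illustrative samples only; the real rule tables are larger.
BOT_LOCALPARTS = {"mailer-daemon", "postmaster", "no-reply"}
STRONG_TOKENS = ("-bounces", "noreply", "auth-results")      # "noreply" is assumed
BOT_DISPLAY = ("mail delivery system", "system administrator")
STRONG_SUBJECTS = ("new ticket created", "(autoresponse)", "auto re:",
                   "undeliverable", "authentication report")
BROAD_SUBJECTS = ("automatic reply", "out of office")        # assumed list
UNSUB = re.compile(r"unsubscribe|opt[ -]?out", re.I)
HUMAN_REPLY = re.compile(r"^\s*(re|fwd?)\s*:", re.I)
TICKET_TAG = re.compile(r"\[##.*?##\]")


def decide(raw_headers: str) -> str:
    """Return "KEEP" or "DELETE" for one message's raw header block."""
    msg = message_from_string(raw_headers)
    # Decode RFC 2047 encoded-words before any subject matching.
    subject = str(make_header(decode_header(msg.get("Subject", ""))))
    subj = subject.lower()
    display, addr = parseaddr(msg.get("From", ""))
    local = addr.split("@")[0].lower()
    display = display.lower()

    # 1. Unsubscribe guard overrides everything.
    if UNSUB.search(subject):
        return "KEEP"
    # 2. RFC 3834: any Auto-Submitted value other than "no" means automatic.
    auto = msg.get("Auto-Submitted", "").strip().lower()
    if auto and auto.split(";")[0].strip() != "no":
        return "DELETE"
    # 3. Machine From-address: exact localpart, strong token, or bot display name.
    if (local in BOT_LOCALPARTS
            or any(tok in local for tok in STRONG_TOKENS)
            or any(name in display for name in BOT_DISPLAY)):
        return "DELETE"
    # 4. Strong auto subjects, checked before the human Re: guard.
    if any(marker in subj for marker in STRONG_SUBJECTS):
        return "DELETE"
    # 5. Human reply or forward.
    if HUMAN_REPLY.match(subject):
        return "KEEP"
    # 6. Ticket tag or broad auto-ack subject.
    if TICKET_TAG.search(subject) or any(b in subj for b in BROAD_SUBJECTS):
        return "DELETE"
    # 7. Fail-safe default.
    return "KEEP"
```

For example, a message with `Subject: Re: Your MCS-150 filing` is kept when it comes from an ordinary sender, but the same subject with `Auto-Submitted: auto-replied` is deleted at rule 2, before the Re: guard is ever consulted.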

Subjects are RFC 2047 MIME-decoded first (campaign subjects contain an em-dash, so they arrive =?utf-8?Q?...?= encoded and would otherwise evade matching).
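
As an illustration of why decoding must come first, the following sketch uses Python's standard email.header module to decode an encoded subject. The sample subject is made up for the example, and the helper name decode_subject is hypothetical; the script may decode differently.

```python
from email.header import decode_header, make_header


def decode_subject(raw: str) -> str:
    """Return the readable form of a possibly RFC 2047-encoded header value."""
    return str(make_header(decode_header(raw)))


# A made-up campaign reply whose subject contains an em-dash (U+2014),
# so the whole subject arrives as a Q-encoded word.
encoded = "=?utf-8?Q?Re:_MCS-150_=E2=80=94_update?="
decoded = decode_subject(encoded)

# Before decoding, a "starts with Re:" test fails; after decoding it succeeds.
print(encoded.lower().startswith("re:"))   # False
print(decoded.lower().startswith("re:"))   # True
```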

The ruleset was validated against a hand-labelled set drawn from the live mailbox: 23/23 cases correct, including keeping the real Re: replies from the same campaigns whose auto-responder twins were deleted.

Execution model

nr_purge.sh runs in three stages so the expensive part stays small:

  • Phase 1 -- fast server-side search-delete of from:MAILER-DAEMON bounces (the ~97% bulk), guarded against unsubscribe. No per-message fetch.
  • Phase 1.5 -- fast search-delete of the common machine classes not sent by MAILER-DAEMON (from:postmaster, Undeliverable, automatic reply, out of office, delivery status notification), each hard-guarded with AND NOT (subject:Re OR subject:Fwd OR subject:unsubscribe ...) so anything ambiguous falls through to the accurate classifier.
  • Phase 2 -- header-classify the small remainder one message at a time using the full decide() ruleset; KEEP decisions are cached so survivors are not re-fetched on subsequent pages.
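
The Phase 1.5 guard can be pictured as a small query builder. The sketch below is an assumption about how such queries might be assembled, not the script's code: the helper name guarded_query is hypothetical, only the three guard terms spelled out above are included (the real guard list is longer), and wrapping the match clause in parentheses and quoting multi-word subjects are choices made here for the example.

```python
# Only the guard terms named explicitly in this README; the real list continues.
GUARD_TERMS = ("Re", "Fwd", "unsubscribe")


def guarded_query(match: str, guard_terms=GUARD_TERMS) -> str:
    """Combine a machine-class match clause with the AND NOT subject guard."""
    guard = " OR ".join(f"subject:{term}" for term in guard_terms)
    return f"({match}) AND NOT ({guard})"


# One query per machine class; anything with a guarded subject is left
# for the per-message classifier in Phase 2.
for clause in ("from:postmaster", 'subject:"out of office"'):
    print(guarded_query(clause))
```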

On the initial backfill this reduced 35,337 -> 68 messages (67 genuine human replies + 1 unsubscribe), moving ~35,269 machine items to Trash.

Usage

# read-only preview of the N most-recent messages (prints survivors + sample deletes)
bash nr_purge.sh --preview 150

# full purge (move matches to /Trash)
bash nr_purge.sh --apply

# date-bounded purge (only inspect last N days) -- used by the daily cron
bash nr_purge.sh --apply --days 3

# Phase-1-only fast bounce sweep
bash nr_purge.sh --apply --quick

Deployment

The script lives on the Carbonio host at /opt/zextras/nr_purge.sh, with a copy in ~zextras/. It must run as the zextras user, which owns zmmailbox.

A daily cron is installed in root's crontab (not the zextras crontab, which Carbonio/zmcontrol regenerates and would wipe):

17 4 * * * su - zextras -c 'bash /opt/zextras/nr_purge.sh --apply --days 3' >> /var/log/nr_purge_cron.log 2>&1

--days 3 keeps the daily run cheap: it only header-inspects mail from the last three days (a few dozen messages), which is more than enough overlap to catch anything that arrived since the previous run.

To (re)deploy after editing nr_purge.sh:

scp -P 22022 nr_purge.sh justin@co.carrierone.com:/tmp/nr_purge.sh
ssh -p 22022 justin@co.carrierone.com \
  'sudo cp /tmp/nr_purge.sh /opt/zextras/nr_purge.sh && sudo chown zextras: /opt/zextras/nr_purge.sh && sudo chmod +x /opt/zextras/nr_purge.sh'

Notes / gotchas

  • zmmailbox search -l is reliable up to 1000 results per page; offset paging (-o) does not work reliably, and large limits (2000+) silently return an empty result. The script therefore loops on "delete the top page, re-search" instead of offset paging.
  • Trash still counts against mailbox size until emptied. The initial backfill left Trash populated (reversible); emptying it is an optional, irreversible follow-up.
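
The "delete the top page, re-search" loop, combined with the KEEP cache from Phase 2, can be modelled in a few lines. This is a sketch against an in-memory list rather than zmmailbox: the purge function and its search, classify and trash callables are hypothetical stand-ins, and the demo uses a page size of 10 instead of 1000 to keep it short.

```python
from typing import Callable, List, Set

PAGE_LIMIT = 1000  # largest page size the notes above describe as reliable


def purge(search: Callable[[int], List[str]],
          classify: Callable[[str], str],
          trash: Callable[[List[str]], None],
          limit: int = PAGE_LIMIT) -> Set[str]:
    """Repeatedly fetch the top page, trash deletable ids, cache the keepers."""
    keep_cache: Set[str] = set()
    while True:
        page = search(limit)
        # Survivors from earlier pages reappear at the top; skip them.
        fresh = [mid for mid in page if mid not in keep_cache]
        if not fresh:
            break  # nothing new on the top page: done
        doomed = []
        for mid in fresh:
            if classify(mid) == "DELETE":
                doomed.append(mid)
            else:
                keep_cache.add(mid)
        if doomed:
            trash(doomed)
    return keep_cache


# Demo on a fake 25-message mailbox with two human replies.
mailbox = [f"m{i}" for i in range(25)]
keepers = {"m3", "m17"}
classified: List[str] = []


def fake_classify(mid: str) -> str:
    classified.append(mid)  # record each per-message fetch
    return "KEEP" if mid in keepers else "DELETE"


def fake_trash(ids: List[str]) -> None:
    for mid in ids:
        mailbox.remove(mid)


kept = purge(lambda n: mailbox[:n], fake_classify, fake_trash, limit=10)
print(mailbox, len(classified))
```

Each message is classified once, because cached keepers are skipped when they reappear on the next top page. One limitation of this approach: if the kept messages alone ever fill an entire page, older mail behind them is never reached, so the model only works while survivors stay well below the page limit.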