new-site/scripts/ops/carbonio/README.md
justin 2d220a273d ops(carbonio): add noreply@ mailbox auto-purge + daily cron
Server-side classifier for the noreply@performancewest.net Carbonio mailbox
(35,337 msgs, ~98.6% machine noise). Deletes bounces/auto-replies/ticket
auto-acks, keeps genuine human Re: replies + unsubscribes (move to Trash,
reversible).

Classifier precedence: unsubscribe guard > RFC3834 Auto-Submitted header >
machine From-address (localpart/strong-token/display-bot) > STRONG auto
subjects (deletes deceptive Re: auto-acks) > human Re: keep > broad auto-ack
subjects > default keep. Subjects RFC2047 MIME-decoded first.

Three-phase execution: Phase1 fast MAILER-DAEMON search-delete, Phase1.5 fast
search-delete of common auto classes (guarded against Re:/unsub), Phase2
header-classify the small remainder with KEEP-caching.

Validated 23/23 against hand-labelled live sample. Initial backfill reduced
35,337 -> 68 (67 human replies + 1 unsubscribe). Daily cron installed in root
crontab: 17 4 * * * --apply --days 3.
2026-06-21 04:55:50 -05:00


# Carbonio `noreply@` mailbox auto-purge
Server-side maintenance for the `noreply@performancewest.net` mailbox on the
Carbonio (Zextras) mail host `co.carrierone.com`.
## Problem
The `noreply@` mailbox accumulated **35,337 messages (~488 MB)**. A sampled
audit showed **~98.6% were machine noise**: bounce DSNs (this box's own Postfix
backscatter), out-of-office / auto-reply messages, and helpdesk/ticket
auto-acknowledgements. Buried in the rest were a small number of **genuine human
replies** to the trucking (DOT#/MCS-150) and telecom/FCC campaigns -- these land
here because of the historical Reply-To behaviour -- plus the occasional
**unsubscribe** request.
## Policy (explicit)
- **DELETE**: bounces, ticket/case auto-acknowledgements, out-of-office and
auto-reply messages, delivery-status notifications, authentication reports.
- **KEEP**: genuine human replies (`Re:`/`Fwd:`) and unsubscribe/opt-out
requests.
- **Fail-safe**: when a message is not clearly machine-generated, KEEP it.
- Deletions **move to `/Trash`** (reversible), never hard-delete.
## Why a header/sender classifier, not subject matching
Subject text alone is unreliable: auto-responders frequently reply with a
deceptive `Re:` prefix, answering a campaign message with `Re: <our subject>`
as if a person had replied. The classifier therefore uses, in precedence order:
1. **Unsubscribe guard** (compliance) -- always KEEP, overrides everything.
2. **RFC 3834 `Auto-Submitted:` header** -- if present with any value other than
`no`, the sending system has declared the message automatic (bounces =
`auto-generated`, vacation/auto-replies = `auto-replied`). This is the single
most reliable signal, and it catches the deceptive `Re:` auto-responders.
3. **Machine From-address** -- exact bot localparts (`mailer-daemon`,
`postmaster`, `no-reply`, ...), strong tokens anywhere in the localpart
(`...-bounces@`, `expense-noreply-...@`, `auth-results@`), and display-name
bots (`Mail Delivery System`, `System Administrator`, ...).
4. **STRONG auto subjects** -- unambiguous machine markers no human types
(`New Ticket Created`, `(autoresponse)`, `Auto Re:`, `your request with id
##...##`, `we're on it`, `Undeliverable`, `Authentication Report`, ...).
Checked **before** the human `Re:` guard so ticket auto-acks dressed as `Re:`
are still removed.
5. **Human `Re:`/`Fwd:`** -- KEEP.
6. **Ticket tag `[##...##]` / broad auto-ack subjects** -- DELETE.
7. **Default -> KEEP** (human-safe).
Subjects are RFC 2047 MIME-decoded first (campaign subjects contain an em-dash,
so they arrive `=?utf-8?Q?...?=` encoded and would otherwise evade matching).
The ruleset was validated against a hand-labelled set drawn from the live
mailbox: **23/23 cases correct**, including keeping the real `Re:` replies from
the same campaigns whose auto-responder twins were deleted.
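
As a rough illustration of the precedence order above, here is a minimal sketch; it is not the production `decide()`. It assumes the caller has already extracted the `Auto-Submitted` value, the `From` header, and the MIME-decoded subject. The helper names `lower` and `has`, the three-argument signature, the subject-only unsubscribe check, and the abbreviated pattern lists (including the step 6 phrases) are assumptions made for the example.

```sh
# lower TEXT: print TEXT lower-cased (ASCII only)
lower() { printf '%s' "$1" | tr '[:upper:]' '[:lower:]'; }

# has TEXT NEEDLE...: succeed if TEXT contains any NEEDLE as a literal substring
has() {
  _text=$1; shift
  for _needle in "$@"; do
    case $_text in *"$_needle"*) return 0 ;; esac
  done
  return 1
}

# decide AUTO_SUBMITTED FROM SUBJECT: print KEEP or DELETE.
# Pattern lists are shortened examples taken from this README.
decide() {
  _auto=$(lower "$1"); _from=$(lower "$2"); _subj=$(lower "$3")
  _addr=${_from##*<}; _addr=${_addr%%>*}   # "Name <a@b>" -> "a@b"
  _local=${_addr%%@*}                      # "a@b" -> "a"

  # 1. Unsubscribe guard: overrides everything (assumed to look at the subject).
  if has "$_subj" unsubscribe opt-out; then echo KEEP; return; fi
  # 2. RFC 3834 header present with a value other than "no".
  if [ -n "$_auto" ] && [ "$_auto" != no ]; then echo DELETE; return; fi
  # 3. Machine From-address: exact localparts, strong tokens, display-name bots.
  case $_local in
    mailer-daemon|postmaster|no-reply) echo DELETE; return ;;
    *-bounces*|*noreply*|auth-results) echo DELETE; return ;;
  esac
  if has "$_from" "mail delivery system" "system administrator"; then
    echo DELETE; return
  fi
  # 4. STRONG auto subjects, checked before the human Re: guard.
  if has "$_subj" "new ticket created" "(autoresponse)" "auto re:" \
         "we're on it" undeliverable "authentication report"; then
    echo DELETE; return
  fi
  # 5. Human Re:/Fwd: replies.
  case $_subj in re:*|fwd:*) echo KEEP; return ;; esac
  # 6. Ticket tag or broad auto-ack subject (phrases here are placeholders).
  case $_subj in *"[##"*"##]"*) echo DELETE; return ;; esac
  if has "$_subj" "automatic reply" "out of office"; then echo DELETE; return; fi
  # 7. Default: human-safe.
  echo KEEP
}
```

For example, `decide '' 'Help Desk <support@example.com>' 'Re: New Ticket Created'` prints `DELETE` because step 4 runs before the `Re:` guard, while `decide '' 'Jane <jane@example.com>' 'Re: DOT update'` prints `KEEP`.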
## Execution model
`nr_purge.sh` runs in three stages so the expensive part stays small:
- **Phase 1** -- fast server-side search-delete of `from:MAILER-DAEMON` bounces
(the ~97% bulk), guarded against unsubscribe. No per-message fetch.
- **Phase 1.5** -- fast search-delete of the common non-MAILER machine classes
(`from:postmaster`, `Undeliverable`, `automatic reply`, `out of office`,
`delivery status notification`), each hard-guarded with
`AND NOT (subject:Re OR subject:Fwd OR subject:unsubscribe ...)` so anything
ambiguous falls through to the accurate classifier.
- **Phase 2** -- header-classify the small remainder one message at a time using
the full `decide()` ruleset; KEEP decisions are cached so survivors are not
re-fetched on subsequent pages.
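
The Phase 1.5 guard can be pictured as a small query builder. This is a sketch only: `guarded_query` is a hypothetical helper, the parenthesised layout is an assumption, and the guard list is limited to the three terms this README spells out (the real guard has more, elided above as `...`).

```sh
# guarded_query CLASS GUARD_TERM...
# Print a search query that matches CLASS unless the subject contains any of
# the guard terms, so ambiguous mail falls through to the Phase 2 classifier.
guarded_query() {
  _class=$1; shift
  _guard=
  for _term in "$@"; do
    # Join guard terms with " OR ", with no separator before the first one.
    _guard="${_guard:+$_guard OR }subject:$_term"
  done
  printf '(%s) AND NOT (%s)\n' "$_class" "$_guard"
}

guarded_query 'from:postmaster' Re Fwd unsubscribe
# (from:postmaster) AND NOT (subject:Re OR subject:Fwd OR subject:unsubscribe)
```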
On the initial backfill this reduced **35,337 -> 68** messages (67 genuine human
replies + 1 unsubscribe), moving ~35,269 machine items to Trash.
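
One way to read the Phase 2 KEEP-caching is as an id cache that is consulted before any header fetch. The sketch below simulates that with stand-ins: `classify_id`, `process_page`, the temp-file cache, and the even/odd verdict rule are all hypothetical and exist only to make the example runnable without a mail server.

```sh
KEEP_CACHE=$(mktemp)              # one kept message id per line
trap 'rm -f "$KEEP_CACHE"' EXIT   # remove the cache file when the shell exits
FETCHES=0                         # number of simulated header fetches
DELETED=""                        # ids that would be moved to /Trash

# classify_id ID: stand-in for "fetch headers, then run decide()".
# Demo rule only: even ids are treated as KEEP, odd ids as DELETE.
classify_id() {
  FETCHES=$((FETCHES + 1))
  if [ $(( $1 % 2 )) -eq 0 ]; then VERDICT=KEEP; else VERDICT=DELETE; fi
}

# process_page ID...: classify every id on the page that is not a cached KEEP.
process_page() {
  for _id in "$@"; do
    # Survivors from earlier pages are skipped without a fetch.
    if grep -qxF -- "$_id" "$KEEP_CACHE"; then continue; fi
    classify_id "$_id"
    if [ "$VERDICT" = KEEP ]; then
      printf '%s\n' "$_id" >> "$KEEP_CACHE"
    else
      DELETED="${DELETED:+$DELETED }$_id"
    fi
  done
}

process_page 1 2 3 4   # first page: four fetches
process_page 2 4 5 6   # next page: 2 and 4 are cached, only 5 and 6 are fetched
echo "fetches=$FETCHES deleted=$DELETED"   # fetches=6 deleted=1 3 5
```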
## Usage
```sh
# read-only preview of the N most-recent messages (prints survivors + sample deletes)
bash nr_purge.sh --preview 150
# full purge (move matches to /Trash)
bash nr_purge.sh --apply
# date-bounded purge (only inspect last N days) -- used by the daily cron
bash nr_purge.sh --apply --days 3
# Phase-1-only fast bounce sweep
bash nr_purge.sh --apply --quick
```
## Deployment
The script lives on the Carbonio host at `/opt/zextras/nr_purge.sh` (with a copy
in `~zextras/`). It must run as the `zextras` user, which owns `zmmailbox`.
A daily cron is installed in **root's** crontab (not the zextras crontab, which
Carbonio/`zmcontrol` regenerates and would wipe):
```cron
17 4 * * * su - zextras -c 'bash /opt/zextras/nr_purge.sh --apply --days 3' >> /var/log/nr_purge_cron.log 2>&1
```
`--days 3` keeps the daily run cheap: it only header-inspects mail from the last
three days (a few dozen messages), which is more than enough overlap to catch
anything that arrived since the previous run.
To (re)deploy after editing `nr_purge.sh`:
```sh
scp -P 22022 nr_purge.sh justin@co.carrierone.com:/tmp/nr_purge.sh
ssh -p 22022 justin@co.carrierone.com \
'sudo cp /tmp/nr_purge.sh /opt/zextras/nr_purge.sh && sudo chown zextras: /opt/zextras/nr_purge.sh && sudo chmod +x /opt/zextras/nr_purge.sh'
```
## Notes / gotchas
- `zmmailbox search -l` works with up to 1000 results per page; offset paging
(`-o`) does not work reliably, and large limits (2000+) silently return nothing.
The script therefore loops on "delete the top page, re-search" instead of
offset paging.
- Trash still counts against mailbox size until emptied. The initial backfill
left Trash populated (reversible); emptying it is an optional, irreversible
follow-up.
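
The "delete the top page, re-search" loop from the first note can be sketched as follows. To stay runnable without a Carbonio host, the mailbox is simulated by a shell variable; `search_page`, `trash_ids`, and `purge_loop` are hypothetical names, and in a real script the first two would wrap the corresponding `zmmailbox` search and move calls.

```sh
PAGE=1000        # page size this README reports as reliable
MAILBOX_IDS=""   # simulated mailbox: space-separated ids matching the query

# search_page: stand-in for one search call; prints at most PAGE ids, one per line.
search_page() {
  [ -n "$MAILBOX_IDS" ] || return 0
  printf '%s\n' $MAILBOX_IDS | head -n "$PAGE"
}

# trash_ids ID...: stand-in for moving messages to /Trash; drops them from the mailbox.
trash_ids() {
  for _id in "$@"; do
    _rest=
    for _m in $MAILBOX_IDS; do
      [ "$_m" = "$_id" ] || _rest="$_rest $_m"
    done
    MAILBOX_IDS=$_rest
  done
}

# purge_loop: always take the first page, trash it, search again. No offsets.
purge_loop() {
  MOVED=0; ROUNDS=0
  while :; do
    _page=$(search_page)
    [ -n "$_page" ] || break      # an empty page means nothing is left to purge
    set -- $_page                 # split the page into one argument per id
    trash_ids "$@"
    MOVED=$((MOVED + $#)); ROUNDS=$((ROUNDS + 1))
  done
}

# Small demo: 7 matching messages, page size 3.
PAGE=3; MAILBOX_IDS="101 102 103 104 105 106 107"
purge_loop
echo "moved=$MOVED rounds=$ROUNDS"   # moved=7 rounds=3
```

Because each round removes what it just listed, the next search starts from a fresh first page and no offset is needed. A production loop would likely also cap the number of rounds so that a failing move cannot spin forever.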