Server-side classifier for the noreply@performancewest.net Carbonio mailbox (35,337 msgs, ~98.6% machine noise). Deletes bounces/auto-replies/ticket auto-acks, keeps genuine human Re: replies + unsubscribes (move to Trash, reversible). Classifier precedence: unsubscribe guard > RFC3834 Auto-Submitted header > machine From-address (localpart/strong-token/display-bot) > STRONG auto subjects (deletes deceptive Re: auto-acks) > human Re: keep > broad auto-ack subjects > default keep. Subjects RFC2047 MIME-decoded first. Three-phase execution: Phase1 fast MAILER-DAEMON search-delete, Phase1.5 fast search-delete of common auto classes (guarded against Re:/unsub), Phase2 header-classify the small remainder with KEEP-caching. Validated 23/23 against hand-labelled live sample. Initial backfill reduced 35,337 -> 68 (67 human replies + 1 unsubscribe). Daily cron installed in root crontab: 17 4 * * * --apply --days 3.
# Carbonio `noreply@` mailbox auto-purge

Server-side maintenance for the `noreply@performancewest.net` mailbox on the
Carbonio (Zextras) mail host `co.carrierone.com`.

## Problem

The `noreply@` mailbox accumulated **35,337 messages (~488 MB)**. A sampled
audit showed **~98.6% were machine noise**: bounce DSNs (this box's own Postfix
backscatter), out-of-office / auto-reply messages, and helpdesk/ticket
auto-acknowledgements. Buried in the rest were a small number of **genuine human
replies** to the trucking (DOT#/MCS-150) and telecom/FCC campaigns -- these land
here because of the historical Reply-To behaviour -- plus the occasional
**unsubscribe** request.

## Policy (explicit)

- **DELETE**: bounces, ticket/case auto-acknowledgements, out-of-office and
  auto-reply messages, delivery-status notifications, authentication reports.
- **KEEP**: genuine human replies (`Re:`/`Fwd:`) and unsubscribe/opt-out
  requests.
- **Fail-safe**: when a message is not clearly machine-generated, KEEP it.
- Deletions **move to `/Trash`** (reversible), never hard-delete.

## Why a header/sender classifier, not subject matching

Subject text alone is unreliable: auto-responders frequently reply with a
deceptive `Re:` prefix (e.g. an auto-responder answering our campaign with
`Re: <our subject>`). The classifier therefore uses, in precedence order:

1. **Unsubscribe guard** (compliance) -- always KEEP, overrides everything.
2. **RFC 3834 `Auto-Submitted:` header** -- if present and != `no`, the sending
   system has declared the message automatic (bounces = `auto-generated`,
   vacation/auto-replies = `auto-replied`). This is the single most reliable
   signal and it catches the deceptive `Re:` auto-responders.
3. **Machine From-address** -- exact bot localparts (`mailer-daemon`,
   `postmaster`, `no-reply`, ...), strong tokens anywhere in the localpart
   (`...-bounces@`, `expense-noreply-...@`, `auth-results@`), and display-name
   bots (`Mail Delivery System`, `System Administrator`, ...).
4. **STRONG auto subjects** -- unambiguous machine markers no human types
   (`New Ticket Created`, `(autoresponse)`, `Auto Re:`,
   `your request with id ##...##`, `we're on it`, `Undeliverable`,
   `Authentication Report`, ...). Checked **before** the human `Re:` guard so
   ticket auto-acks dressed as `Re:` are still removed.
5. **Human `Re:`/`Fwd:`** -- KEEP.
6. **Ticket tag `[##...##]` / broad auto-ack subjects** -- DELETE.
7. **Default -> KEEP** (human-safe).

Subjects are RFC 2047 MIME-decoded first (campaign subjects contain an em-dash,
so they arrive `=?utf-8?Q?...?=` encoded and would otherwise evade matching).

The ruleset was validated against a hand-labelled set drawn from the live
mailbox: **23/23 cases correct**, including keeping the real `Re:` replies from
the same campaigns whose auto-responder twins were deleted.

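
The precedence order above can be sketched as a small shell function. This is an illustration, not the actual `decide()` from `nr_purge.sh`: the function signature, the pattern lists (abbreviated to the examples named above), the subject-only unsubscribe check, and the stand-in "broad auto-ack" phrases are all assumptions. It expects a subject that has already been RFC 2047-decoded.

```sh
# Hypothetical sketch of the precedence order; NOT the real decide() from nr_purge.sh.
# Args: $1 = Auto-Submitted header value (may be empty), $2 = From header,
#       $3 = subject (already RFC 2047-decoded). Prints KEEP or DELETE.
lower() { printf '%s' "$1" | tr '[:upper:]' '[:lower:]'; }

decide() {
  local auto from subj addr localpart display
  auto=$(lower "$1"); from=$(lower "$2"); subj=$(lower "$3")

  # Split "Display Name <local@domain>" into display name and localpart.
  addr=${from##*<}; addr=${addr%%>*}
  localpart=${addr%%@*}
  display=${from%%<*}

  # 1. Unsubscribe guard: always KEEP (checking only the subject is an assumption).
  case $subj in *unsubscribe*|*opt-out*) echo KEEP; return ;; esac

  # 2. RFC 3834 Auto-Submitted present and not "no".
  case $auto in ''|no) ;; *) echo DELETE; return ;; esac

  # 3. Machine From-address: exact localparts, strong tokens, display-name bots.
  case $localpart in
    mailer-daemon|postmaster|no-reply) echo DELETE; return ;;
    *-bounces|*noreply*|auth-results)  echo DELETE; return ;;  # illustrative tokens
  esac
  case $display in
    *"mail delivery system"*|*"system administrator"*) echo DELETE; return ;;
  esac

  # 4. STRONG auto subjects, checked before the human Re: guard.
  case $subj in
    *"new ticket created"*|*"(autoresponse)"*|*"auto re:"*|*"we're on it"*|\
    *undeliverable*|*"authentication report"*) echo DELETE; return ;;
  esac

  # 5. Human Re:/Fwd: replies.
  case $subj in re:*|fwd:*) echo KEEP; return ;; esac

  # 6. Ticket tag / broad auto-ack subjects (the two phrases are stand-ins).
  case $subj in
    *"[##"*"##]"*|*"automatic reply"*|*"out of office"*) echo DELETE; return ;;
  esac

  # 7. Default: KEEP (human-safe).
  echo KEEP
}

decide ""             "Jane Doe <jane@example.com>"    "Re: MCS-150 filing"      # prints KEEP
decide "auto-replied" "Jane Doe <jane@example.com>"    "Re: MCS-150 filing"      # prints DELETE
decide ""             "Helpdesk <support@example.com>" "Re: New Ticket Created"  # prints DELETE
```

The second and third calls show why the header and STRONG-subject checks sit above the `Re:` guard: both messages carry a `Re:` prefix but are still classified as machine-generated.
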
## Execution model

`nr_purge.sh` runs in three stages so the expensive part stays small:

- **Phase 1** -- fast server-side search-delete of `from:MAILER-DAEMON` bounces
  (the ~97% bulk), guarded against unsubscribe. No per-message fetch.
- **Phase 1.5** -- fast search-delete of the common non-MAILER machine classes
  (`from:postmaster`, `Undeliverable`, `automatic reply`, `out of office`,
  `delivery status notification`), each hard-guarded with
  `AND NOT (subject:Re OR subject:Fwd OR subject:unsubscribe ...)` so anything
  ambiguous falls through to the accurate classifier.
- **Phase 2** -- header-classify the small remainder one message at a time using
  the full `decide()` ruleset; KEEP decisions are cached so survivors are not
  re-fetched on subsequent pages.

On the initial backfill this reduced **35,337 -> 68** messages (67 genuine human
replies + 1 unsubscribe), moving ~35,269 machine items to Trash.

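
As a rough illustration of the Phase 1.5 guard, the snippet below wraps each machine-class search term in the `AND NOT (...)` clause. `guarded_query` and `GUARD` are hypothetical names, the `subject:` prefixes on the phrase terms are assumed, and the guard holds only the three terms spelled out above (the real list is longer and is elided in this document).

```sh
# Guard terms shown in the text; the real guard has more terms (elided as "..." above).
GUARD='subject:Re OR subject:Fwd OR subject:unsubscribe'

# Hypothetical helper: wrap one search term so guarded subjects are excluded.
guarded_query() {
  printf '(%s) AND NOT (%s)\n' "$1" "$GUARD"
}

# Print the guarded query for each Phase 1.5 class (subject: prefixes are assumed).
for term in \
  'from:postmaster' \
  'subject:Undeliverable' \
  'subject:"automatic reply"' \
  'subject:"out of office"' \
  'subject:"delivery status notification"'
do
  guarded_query "$term"
done
```

Keeping the guard in one place means a message such as `Re: Out of office` is excluded from every fast sweep and reaches the Phase 2 classifier instead.
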
## Usage

```sh
# read-only preview of the N most-recent messages (prints survivors + sample deletes)
bash nr_purge.sh --preview 150

# full purge (move matches to /Trash)
bash nr_purge.sh --apply

# date-bounded purge (only inspect last N days) -- used by the daily cron
bash nr_purge.sh --apply --days 3

# Phase-1-only fast bounce sweep
bash nr_purge.sh --apply --quick
```

## Deployment

The script lives on the Carbonio host at `/opt/zextras/nr_purge.sh` (and a copy
in `~zextras/`). It must run as the `zextras` user (owns `zmmailbox`).

A daily cron is installed in **root's** crontab (not the zextras crontab, which
Carbonio/`zmcontrol` regenerates and would wipe):

```cron
17 4 * * * su - zextras -c 'bash /opt/zextras/nr_purge.sh --apply --days 3' >> /var/log/nr_purge_cron.log 2>&1
```

`--days 3` keeps the daily run cheap: it only header-inspects mail from the last
three days (a few dozen messages), which is more than enough overlap to catch
anything that arrived since the previous run.

To (re)deploy after editing the script:

```sh
scp -P 22022 nr_purge.sh justin@co.carrierone.com:/tmp/nr_purge.sh
ssh -p 22022 justin@co.carrierone.com \
  'sudo cp /tmp/nr_purge.sh /opt/zextras/nr_purge.sh && sudo chown zextras: /opt/zextras/nr_purge.sh && sudo chmod +x /opt/zextras/nr_purge.sh'
```

## Notes / gotchas

- `zmmailbox search -l` works up to 1000 results/page; offset paging (`-o`) does
  not work reliably and large limits (2000+) silently return empty. The script
  loops on "delete the top page, re-search" instead of offset paging.
- Trash still counts against mailbox size until emptied. The initial backfill
  left Trash populated (reversible); emptying it is an optional, irreversible
  follow-up.

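
The "delete the top page, re-search" loop from the first note can be sketched as follows. `search_page`, `trash_ids`, `purge_all`, and the in-memory `MAILBOX` are hypothetical stand-ins so the loop can run anywhere; on the Carbonio host the two helpers would wrap `zmmailbox` calls, whose exact form in `nr_purge.sh` is not shown in this document. The round cap is an added safety assumption, not something the script is known to do.

```sh
# Fake mailbox so the loop is runnable without zmmailbox (all names are hypothetical).
MAILBOX="101 102 103 104 105 106 107"   # message ids still matching the query
PAGE=3                                  # stand-in for the 1000-results page limit
MAX_ROUNDS=1000                         # safety cap in case deletes silently fail

# Print the first page of matching ids, one per line (always starts from the top).
search_page() {
  n=0
  for m in $MAILBOX; do
    [ "$n" -lt "$PAGE" ] || break
    echo "$m"
    n=$((n + 1))
  done
}

# "Move to Trash": drop the given ids from the fake mailbox.
trash_ids() {
  remaining=""
  for m in $MAILBOX; do
    case " $* " in
      *" $m "*) ;;                          # id was deleted, skip it
      *) remaining="$remaining $m" ;;       # id survives
    esac
  done
  MAILBOX=${remaining# }
}

# Delete the top page, then search again, until a search comes back empty.
purge_all() {
  rounds=0
  while :; do
    ids=$(search_page)
    [ -n "$ids" ] || break                  # empty page: nothing left to delete
    trash_ids $ids                          # unquoted on purpose: one argument per id
    rounds=$((rounds + 1))
    [ "$rounds" -lt "$MAX_ROUNDS" ] || { echo "safety cap hit" >&2; return 1; }
  done
  echo "$rounds"
}

purge_all   # prints 3 (pages of 3, 3 and 1 ids)
```

Because each pass removes everything it just listed, the next search naturally returns the following batch, so no offset is ever needed.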