ops(carbonio): add noreply@ mailbox auto-purge + daily cron

Server-side classifier for the noreply@performancewest.net Carbonio mailbox
(35,337 msgs, ~98.6% machine noise). Deletes bounces/auto-replies/ticket
auto-acks by moving them to Trash (reversible); keeps genuine human Re:
replies + unsubscribes.

Classifier precedence: unsubscribe guard > RFC3834 Auto-Submitted header >
machine From-address (localpart/strong-token/display-bot) > STRONG auto
subjects (deletes deceptive Re: auto-acks) > human Re: keep > broad auto-ack
subjects > default keep. Subjects RFC2047 MIME-decoded first.

Three-phase execution: Phase1 fast MAILER-DAEMON search-delete, Phase1.5 fast
search-delete of common auto classes (guarded against Re:/unsub), Phase2
header-classify the small remainder with KEEP-caching.

Validated 23/23 against hand-labelled live sample. Initial backfill reduced
35,337 -> 68 (67 human replies + 1 unsubscribe). Daily cron installed in root
crontab: 17 4 * * * --apply --days 3.
justin 2026-06-21 04:55:50 -05:00
parent e414ec4a5f
commit 2d220a273d
3 changed files with 333 additions and 0 deletions


@ -0,0 +1,121 @@
# Carbonio `noreply@` mailbox auto-purge
Server-side maintenance for the `noreply@performancewest.net` mailbox on the
Carbonio (Zextras) mail host `co.carrierone.com`.
## Problem
The `noreply@` mailbox accumulated **35,337 messages (~488 MB)**. A sampled
audit showed **~98.6% were machine noise**: bounce DSNs (this box's own Postfix
backscatter), out-of-office / auto-reply messages, and helpdesk/ticket
auto-acknowledgements. Buried in the rest were a small number of **genuine human
replies** to the trucking (DOT#/MCS-150) and telecom/FCC campaigns -- these land
here because of the historical Reply-To behaviour -- plus the occasional
**unsubscribe** request.
## Policy (explicit)
- **DELETE**: bounces, ticket/case auto-acknowledgements, out-of-office and
auto-reply messages, delivery-status notifications, authentication reports.
- **KEEP**: genuine human replies (`Re:`/`Fwd:`) and unsubscribe/opt-out
requests.
- **Fail-safe**: when a message is not clearly machine-generated, KEEP it.
- Deletions **move to `/Trash`** (reversible), never hard-delete.
## Why a header/sender classifier, not subject matching
Subject text alone is unreliable: auto-responders frequently reply with a
deceptive `Re:` prefix (e.g. an auto-responder answering our campaign with
`Re: <our subject>`). The classifier therefore uses, in precedence order:
1. **Unsubscribe guard** (compliance) -- always KEEP, overrides everything.
2. **RFC 3834 `Auto-Submitted:` header** -- if present and != `no`, the sending
system has declared the message automatic (bounces = `auto-generated`,
vacation/auto-replies = `auto-replied`). This is the single most reliable
signal and it catches the deceptive `Re:` auto-responders.
3. **Machine From-address** -- exact bot localparts (`mailer-daemon`,
`postmaster`, `no-reply`, ...), strong tokens anywhere in the localpart
(`...-bounces@`, `expense-noreply-...@`, `auth-results@`), and display-name
bots (`Mail Delivery System`, `System Administrator`, ...).
4. **STRONG auto subjects** -- unambiguous machine markers no human types
(`New Ticket Created`, `(autoresponse)`, `Auto Re:`, `your request with id
##...##`, `we're on it`, `Undeliverable`, `Authentication Report`, ...).
Checked **before** the human `Re:` guard so ticket auto-acks dressed as `Re:`
are still removed.
5. **Human `Re:`/`Fwd:`** -- KEEP.
6. **Ticket tag `[##...##]` / broad auto-ack subjects** -- DELETE.
7. **Default -> KEEP** (human-safe).
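
Rule 3 above depends on pulling the localpart out of the `From:` header before matching. A minimal sketch of that step follows; the two regexes are deliberately shortened from the real ones in `nr_purge.sh`, and `is_machine_sender` is a hypothetical helper name used only for illustration.

```sh
#!/bin/bash
# Illustrative only: shortened regexes, not the full lists from nr_purge.sh.
EXACT_RE='^(mailer-daemon|postmaster|no-reply|noreply)([._+-].*)?$'
TOKEN_RE='noreply|no-reply|bounces|auth-results'

# Echo the lowercased localpart of the first address found in a From: line.
from_localpart() {
  printf '%s\n' "$1" \
    | grep -oE '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+' | head -n 1 \
    | sed -E 's/@.*$//' | tr 'A-Z' 'a-z'
}

# Return 0 when the localpart matches the exact list or contains a strong token.
is_machine_sender() {
  local lp
  lp=$(from_localpart "$1")
  printf '%s' "$lp" | grep -qE "$EXACT_RE" && return 0
  printf '%s' "$lp" | grep -qE "$TOKEN_RE"
}

from_localpart 'From: "Jane Doe" <Jane.Doe@example.com>'
```

Only the address is matched here, not the display name, so a human whose display name happens to contain a word like "system" is not caught by this rule.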
Subjects are RFC 2047 MIME-decoded first (campaign subjects contain an em-dash,
so they arrive `=?utf-8?Q?...?=` encoded and would otherwise evade matching).
The ruleset was validated against a hand-labelled set drawn from the live
mailbox: **23/23 cases correct**, including keeping the real `Re:` replies from
the same campaigns whose auto-responder twins were deleted.
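
The precedence ladder and the hand-labelled check can be sketched as a small table-driven test. This is a reduced stand-in, not the real `decide()`: `decide_sketch` is a hypothetical function that takes pre-extracted fields, its patterns are shortened, and the six cases are invented examples rather than the 23 live samples.

```sh
#!/bin/bash
# decide_sketch <auto-submitted> <from-localpart> <subject>
# Reduced ladder: unsubscribe > Auto-Submitted > machine From > strong subject > Re:/Fwd: > default.
decide_sketch() {
  local as="$1" lp="$2" s
  s=$(printf '%s' "$3" | tr 'A-Z' 'a-z')
  if printf '%s' "$s" | grep -qE 'unsubscribe|opt[ -]?out'; then echo "KEEP unsubscribe"; return; fi
  if [ -n "$as" ] && [ "$as" != "no" ]; then echo "DEL auto-submitted"; return; fi
  if printf '%s' "$lp" | grep -qE '^(mailer-daemon|postmaster|no-reply|noreply)$'; then echo "DEL from-machine"; return; fi
  if printf '%s' "$s" | grep -qE 'new ticket|\(autoresponse\)|auto ?re:|undeliverable'; then echo "DEL strong-auto-subject"; return; fi
  if printf '%s' "$s" | grep -qE '^[[:space:]]*(re|fwd?):'; then echo "KEEP human-reply"; return; fi
  echo "KEEP default"
}

pass=0; fail=0
# Columns: expected | Auto-Submitted | From localpart | Subject (all invented).
while IFS='|' read -r want as lp subj; do
  got=$(decide_sketch "$as" "$lp" "$subj")
  if [ "${got%% *}" = "$want" ]; then pass=$((pass + 1))
  else fail=$((fail + 1)); echo "MISMATCH want=$want got=$got subj=$subj"; fi
done <<'EOF'
KEEP|auto-generated|mailer-daemon|Please unsubscribe me
DEL|auto-replied|jane|Re: DOT update
DEL||postmaster|Delivery report
DEL||support|Re: New Ticket Created
KEEP||jane|Re: DOT update
KEEP||bob|Question about pricing
EOF
echo "pass=$pass fail=$fail"
```

The second and fifth rows are the "twin" situation described above: the same `Re:` subject is deleted when the sender declares `Auto-Submitted: auto-replied` and kept when it does not.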
## Execution model
`nr_purge.sh` runs in three stages so the expensive part stays small:
- **Phase 1** -- fast server-side search-delete of `from:MAILER-DAEMON` bounces
(the ~97% bulk), guarded against unsubscribe. No per-message fetch.
- **Phase 1.5** -- fast search-delete of the common non-MAILER machine classes
(`from:postmaster`, `Undeliverable`, `automatic reply`, `out of office`,
`delivery status notification`), each hard-guarded with
`AND NOT (subject:Re OR subject:Fwd OR subject:unsubscribe ...)` so anything
ambiguous falls through to the accurate classifier.
- **Phase 2** -- header-classify the small remainder one message at a time using
the full `decide()` ruleset; KEEP decisions are cached so survivors are not
re-fetched on subsequent pages.
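
As a rough illustration of the Phase 1.5 guard, the sketch below only builds and prints the search strings; it does not call `zmmailbox`. The guard text is the one used by `nr_purge.sh` in this commit, the class list is the one named in the bullet above, and `phase15_queries` is a hypothetical helper name.

```sh
#!/bin/bash
# Guard appended to every Phase 1.5 query so Re:/Fwd:/unsubscribe mail is never bulk-swept.
GUARD='AND NOT (subject:Re OR subject:Fwd OR subject:unsubscribe OR subject:"opt out")'

# Print one guarded search string per machine class.
phase15_queries() {
  local term
  for term in \
    'from:postmaster' \
    'subject:Undeliverable' \
    'subject:"automatic reply"' \
    'subject:"out of office"' \
    'subject:"delivery status notification"'
  do
    printf 'in:inbox %s %s\n' "$term" "$GUARD"
  done
}

phase15_queries
```

Anything excluded by the guard is not kept outright; it simply remains in the inbox for Phase 2 to classify from its headers.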
On the initial backfill this reduced **35,337 -> 68** messages (67 genuine human
replies + 1 unsubscribe), moving ~35,269 machine items to Trash.
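
The Phase 2 loop (classify the top page, delete its DELs, cache KEEPs, re-search) can be simulated without a mail server. In this sketch the mailbox is a temp file of invented `id decision` pairs, and `search_top_page`, `classify_sim` and `delete_ids` are hypothetical stand-ins for the `zmmailbox` search, the header classifier and the move to Trash.

```sh
#!/bin/bash
# Simulated mailbox: "<id> <decision>" per line, newest first. Invented data.
MAILBOX=$(mktemp); SEEN=$(mktemp)
printf '%s\n' "101 DEL" "102 KEEP" "103 DEL" "104 DEL" "105 KEEP" "106 DEL" > "$MAILBOX"
PAGE=3   # stands in for the 1000-result page limit

search_top_page() { head -n "$PAGE" "$MAILBOX" | awk '{print $1}'; }
classify_sim()    { awk -v id="$1" '$1 == id { print $2 }' "$MAILBOX"; }
delete_ids() {    # stdin: ids one per line
  local id
  while read -r id; do
    [ -z "$id" ] && continue
    grep -v "^$id " "$MAILBOX" > "$MAILBOX.tmp"
    mv "$MAILBOX.tmp" "$MAILBOX"
  done
}

moved=0
while :; do
  ids=$(search_top_page)
  [ -z "$ids" ] && break
  newwork=0; delbuf=""
  for id in $ids; do
    grep -qx "$id" "$SEEN" && continue          # cached KEEP: skip the fetch
    newwork=1
    if [ "$(classify_sim "$id")" = "DEL" ]; then
      delbuf+="$id"$'\n'; moved=$((moved + 1))
    else
      echo "$id" >> "$SEEN"
    fi
  done
  [ -n "$delbuf" ] && printf '%s' "$delbuf" | delete_ids
  [ "$newwork" = "0" ] && break                 # page was all cached KEEPs: done
done
survivors=$(awk '{ printf "%s%s", sep, $1; sep = " " }' "$MAILBOX")
echo "moved=$moved survivors=$survivors"
rm -f "$MAILBOX" "$SEEN"
```

One property worth noting from the simulation: the stop condition is only safe while the number of kept messages is smaller than one page. If the cached KEEPs ever filled a whole page, the loop would stop without seeing anything beyond it. With 68 survivors against a 1000-result page that margin is wide.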
## Usage
```sh
# read-only preview of the N most-recent messages (prints survivors + sample deletes)
bash nr_purge.sh --preview 150
# full purge (move matches to /Trash)
bash nr_purge.sh --apply
# date-bounded purge (only inspect last N days) -- used by the daily cron
bash nr_purge.sh --apply --days 3
# Phase-1-only fast bounce sweep
bash nr_purge.sh --apply --quick
```
## Deployment
The script lives on the Carbonio host at `/opt/zextras/nr_purge.sh` (and a copy
in `~zextras/`). It must run as the `zextras` user (owns `zmmailbox`).
A daily cron is installed in **root's** crontab (not the zextras crontab, which
Carbonio/`zmcontrol` regenerates and would wipe):
```cron
17 4 * * * su - zextras -c 'bash /opt/zextras/nr_purge.sh --apply --days 3' >> /var/log/nr_purge_cron.log 2>&1
```
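
The install step is meant to be re-runnable: it drops any existing `nr_purge.sh` line and appends the current one. A sketch of that merge as a pure text filter, so it can be tried without touching a real crontab; `merge_cron_line` is a hypothetical helper and the `backup.sh` entry is an invented example of an unrelated job.

```sh
#!/bin/bash
# merge_cron_line <marker> <new-line>
# stdin: existing crontab text. stdout: same text minus lines containing the marker, plus the new line.
merge_cron_line() {
  local marker="$1" line="$2"
  grep -vF -- "$marker" || true     # grep exits 1 when nothing survives; that is fine here
  printf '%s\n' "$line"
}

new_line="17 4 * * * su - zextras -c 'bash /opt/zextras/nr_purge.sh --apply --days 3' >> /var/log/nr_purge_cron.log 2>&1"
existing=$(cat <<'EOF'
0 2 * * * /usr/local/bin/backup.sh
17 4 * * * su - zextras -c 'bash /opt/zextras/nr_purge.sh --apply' >> /var/log/nr_purge_cron.log 2>&1
EOF
)

merged=$(printf '%s\n' "$existing" | merge_cron_line 'nr_purge.sh' "$new_line")
printf '%s\n' "$merged"
```

Running the merge a second time over its own output yields the same text, which is the property that makes the installer safe to repeat.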
`--days 3` keeps the daily run cheap: it only header-inspects mail from the last
three days (a few dozen messages), which is more than enough overlap to catch
anything that arrived since the previous run.
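
In the script, `--days N` becomes a relative-date term (`AND after:-Nday`) appended to each search. A small sketch of that construction follows; `build_query` is a hypothetical helper, and the numeric check is an added assumption about sensible input handling, not something this README describes.

```sh
#!/bin/bash
# build_query <base-query> [days]
# Appends the relative-date bound when days is given; rejects non-numeric values.
build_query() {
  local base="$1" days="${2:-}" dateq=""
  if [ -n "$days" ]; then
    case "$days" in
      *[!0-9]*) echo "days must be a whole number, got: $days" >&2; return 1 ;;
    esac
    dateq=" AND after:-${days}day"
  fi
  printf '%s%s\n' "$base" "$dateq"
}

build_query 'in:inbox from:MAILER-DAEMON' 3
build_query 'in:inbox from:MAILER-DAEMON'
```

With no day count the query is left unbounded, which corresponds to the initial full backfill.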
To (re)deploy after editing `nr_purge.sh`:
```sh
scp -P 22022 nr_purge.sh justin@co.carrierone.com:/tmp/nr_purge.sh
ssh -p 22022 justin@co.carrierone.com \
'sudo cp /tmp/nr_purge.sh /opt/zextras/nr_purge.sh && sudo chown zextras: /opt/zextras/nr_purge.sh && sudo chmod +x /opt/zextras/nr_purge.sh'
```
## Notes / gotchas
- `zmmailbox search -l` works up to 1000 results/page; offset paging (`-o`) does
not work reliably and large limits (2000+) silently return empty. The script
loops on "delete the top page, re-search" instead of offset paging.
- Trash still counts against mailbox size until emptied. The initial backfill
left Trash populated (reversible); emptying it is an optional, irreversible
follow-up.
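
Related to the paging limits above, the script does not move a whole page in one call: ids are comma-joined and moved in batches of up to 200 per `moveMessage`. The sketch below shows only the batching, printing each batch instead of calling `zmmailbox`; `batch_ids` is a hypothetical helper name and the batch size of 2 is chosen just to keep the output short.

```sh
#!/bin/bash
# batch_ids <size>
# stdin: ids one per line. stdout: one comma-joined batch per line.
batch_ids() {
  local size="$1" id buf=()
  while read -r id; do
    [ -z "$id" ] && continue
    buf+=("$id")
    if [ "${#buf[@]}" -ge "$size" ]; then
      (IFS=,; printf '%s\n' "${buf[*]}")   # join with commas in a subshell so IFS is not leaked
      buf=()
    fi
  done
  if [ "${#buf[@]}" -gt 0 ]; then
    (IFS=,; printf '%s\n' "${buf[*]}")     # flush the final partial batch
  fi
}

printf '%s\n' 1 2 3 4 5 | batch_ids 2
```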


@ -0,0 +1,10 @@
#!/bin/bash
# Install daily noreply@ auto-purge cron in ROOT crontab (NOT zextras' -- that one is
# regenerated by Carbonio/zmcontrol and would wipe our line). Root crontab is stable.
# Invokes the purge as the zextras user. Date-bounded (last 3 days) so it stays cheap.
set -e
SCRIPT=/opt/zextras/nr_purge.sh
LOG=/var/log/nr_purge_cron.log
CRON_LINE="17 4 * * * su - zextras -c 'bash $SCRIPT --apply --days 3' >> $LOG 2>&1"
( crontab -l 2>/dev/null | grep -v 'nr_purge.sh' ; echo "$CRON_LINE" ) | crontab -
echo "=== root crontab nr_purge line ==="; crontab -l | grep nr_purge

scripts/ops/carbonio/nr_purge.sh Executable file

@ -0,0 +1,202 @@
#!/bin/bash
# nr_purge.sh -- auto-purge noreply@ mailbox on Carbonio.
# Policy: DELETE bounces + ticket auto-acks + auto-replies; KEEP human replies + unsubscribes.
# Discriminator: RFC 3834 Auto-Submitted header (reliable; catches fake "Re:" auto-responders).
# Reversible: deletions MOVE to /Trash (not hard delete).
#
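# Modes:
# (no args) preview: classify most-recent $PREVIEW_N msgs, read-only, print decisions+survivors
# --preview N preview N most-recent
# --apply full three-phase purge (Phase1 bulk bounces, Phase1.5 search-delete of common auto classes, Phase2 header-classify remainder)
# --apply --quick Phase1 only (bulk bounce delete), skip Phase1.5 and header classify
# --days N with --apply: restrict every search to the last N days (also settable via NR_DAYS)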
set -uo pipefail
M="noreply@performancewest.net"
TRASH="/Trash"
PREVIEW_N=200
MODE="preview"; QUICK=0; DAYS="${NR_DAYS:-}"
while [ $# -gt 0 ]; do case "$1" in
--apply) MODE="apply";;
--quick) QUICK=1;;
--days) DAYS="${2:-}"; shift;;
--preview) MODE="preview"; PREVIEW_N="${2:-200}"; shift;;
*) ;;
esac; shift; done
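# Optional date bound, appended to every apply-phase search (daily cron uses a small window; initial run leaves blank=all)
# Reject non-numeric values so a typo cannot produce a malformed search query.
case "$DAYS" in *[!0-9]*) echo "--days expects a whole number, got: $DAYS" >&2; exit 2;; esac
DATEQ=""; [ -n "$DAYS" ] && DATEQ=" AND after:-${DAYS}day"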
TS=$(date +%Y%m%d_%H%M%S); LOG="/tmp/nr_purge_$TS.log"
zm(){ zmmailbox -z -m "$M" "$@" 2>/dev/null; }
# ---- RFC 2047 MIME-header decode (handles =?utf-8?Q?..?= and ?B?..?=) ----
mime_decode(){ perl -MEncode -CS -ne 'print Encode::decode("MIME-Header",$_)' 2>/dev/null; }
# Machine-sender address localparts (exact, lowercased): definitionally non-human.
# Matched against the localpart of the From address only (not display name) to avoid eating humans.
FROM_MACHINE_RE='^(mailer-daemon|postmaster|auto-reply|autoreply|auto-responder|autoresponder|no-reply|noreply|donotreply|do-not-reply|bounce|bounces|mdaemon|odoobot|helpdesk|notification|notifications|notify|sysadmin|system|root|abuse)([._+-].*)?$'
# Strong machine tokens that may appear ANYWHERE in the localpart (no human address has these).
FROM_TOKEN_RE='noreply|no-reply|donotreply|do-not-reply|mailer-daemon|auto-reply|autoreply|autoresponse|auto-response|bounces|auth-results|postmaster'
# Display-name bots (substring, lowercased) that use human-ish addresses but are clearly automated.
FROM_DISPLAY_BOT_RE='odoobot|mail delivery (sub)?system|system administrator|microsoft outlook|mail administrator|postmaster'
# STRONG auto markers checked BEFORE the human "Re:" guard -- unambiguous machine subjects that
# no human types, so safe to delete even when wearing a "Re:" prefix (e.g. ticket auto-acks).
STRONG_AUTO_RE='new ticket|ticket created|ticket #|ticket no\.?|ticket has been|has been (assigned|received|resolved|closed|created|opened|updated)|your request with id|request with id|we.?re on it|\(autoresponse\)|auto ?re:|automatic (reply|response)|auto-?response|out of office|out-of-office|authentication report|undeliverable|undelivered|delivery status notification|could ?n.?t be delivered|could not be delivered|message could ?n.?t be|failure notice|returned mail|welcome to .*help ?desk|new case notification'
# Broader auto-ack / bounce subject patterns (lowercased subject), checked AFTER the Re: guard.
SUBJ_AUTO_RE='has been (received|resolved|closed|updated|created|opened|assigned)|case (received|closed|resolved|notification)|ticket ?#|ticket no\.?|ticket has been|your ticket|new ticket|ticket created|your request with id|thanks,? we got your|we have received your|out of office|out-of-office|automatic reply|automatic response|auto[- ]?reply|autoreply|auto-?response|\(autoresponse\)|new case|message (recieved|received)|delivery (status notification|failure|has failed)|undelivered|undeliverable|failure notice|^failed:|returned mail|mail delivery|could not be delivered|could ?n.?t be delivered|delayed mail|invalid address|address not found|recipient (address )?rejected|new email address|quota|read-?receipt|priority opened|authentication report|help ?desk'
# from_localpart <header-block> -> echoes lowercased localpart of From address
from_localpart(){
printf '%s' "$1" | grep -iE '^From:' | head -1 \
| sed -E 's/^From:[[:space:]]*//I' \
| grep -oE '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+' | head -1 \
| sed -E 's/@.*$//' | tr 'A-Z' 'a-z'
}
from_display(){ printf '%s' "$1" | grep -iE '^From:' | head -1 | sed -E 's/^From:[[:space:]]*//I' | tr 'A-Z' 'a-z'; }
# decide <header-block> <decoded-subject> -> prints "KEEP <reason>" | "DEL <reason>"
# Precedence: (1) unsubscribe wins; (2) Auto-Submitted header; (3) machine From-address (exact
# localpart / strong token / display-bot); (4) STRONG auto subjects (delete even if "Re:");
# (5) genuine human Re:; (6) ticket-tag / broad auto-ack subjects; (7) default keep.
decide(){
local H="$1" subj="$2"
local s as lp disp
s=$(printf '%s' "$subj" | tr 'A-Z' 'a-z')
as=$(printf '%s' "$H" | grep -iE '^Auto-Submitted:' | head -1 | sed -E 's/^Auto-Submitted:[[:space:]]*//I' | tr 'A-Z' 'a-z' | tr -d ' ')
lp=$(from_localpart "$H"); disp=$(from_display "$H")
# 1) compliance: unsubscribe/opt-out always KEEP (overrides every machine signal)
if printf '%s' "$s" | grep -qE 'unsubscribe|opt[ -]?out|remove me|stop emailing'; then echo "KEEP unsubscribe"; return; fi
# 2) RFC 3834 Auto-Submitted present & != no -> machine
if [ -n "$as" ] && [ "$as" != "no" ]; then echo "DEL auto-submitted=$as"; return; fi
# 3) machine From-address (exact localpart, strong token anywhere, or display-name bot)
if printf '%s' "$lp" | grep -qE "$FROM_MACHINE_RE"; then echo "DEL from-machine=$lp"; return; fi
if printf '%s' "$lp" | grep -qE "$FROM_TOKEN_RE"; then echo "DEL from-token=$lp"; return; fi
if printf '%s' "$disp" | grep -qE "$FROM_DISPLAY_BOT_RE"; then echo "DEL from-bot"; return; fi
# 4) STRONG auto subjects: unambiguous machine markers, delete even if dressed as "Re:"
if printf '%s' "$s" | grep -qE "$STRONG_AUTO_RE"; then echo "DEL strong-auto-subject"; return; fi
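# 5) genuine human threaded reply (auto ones already removed above); case-insensitive so
# lowercase "re:" / "fwd:" replies are kept too, in line with the fail-safe KEEP policy
if printf '%s' "$subj" | grep -qiE '^[[:space:]]*(re|fwd?)[:[]'; then echo "KEEP human-reply"; return; fi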
# 6) ticket tag / broad auto-ack subject patterns
if printf '%s' "$subj" | grep -qE '^[[:space:]]*\[##'; then echo "DEL ticket-tag"; return; fi
if printf '%s' "$s" | grep -qE "$SUBJ_AUTO_RE"; then echo "DEL auto-ack-subject"; return; fi
# 7) default: keep (human-safe)
echo "KEEP default"
}
# fetch the joined+decoded Subject for a message id
get_subject(){ zm getRestURL "/?id=$1&fmt=rfc822" | sed -n '1,/^$/p' \
| awk 'BEGIN{IGNORECASE=1} /^Subject:/{s=$0;getline n; while(n ~ /^[ \t]/){sub(/^[ \t]+/," ",n); s=s n; getline n} print s; exit}' \
| sed -E 's/^Subject:[[:space:]]*//I' | mime_decode; }
# ---- classify one message id -> prints "KEEP <reason>" or "DEL <reason>" ----
classify(){
local H subj
H=$(zm getRestURL "/?id=$1&fmt=rfc822" | sed -n '1,/^$/p')
subj=$(printf '%s' "$H" | awk 'BEGIN{IGNORECASE=1} /^Subject:/{s=$0;getline n; while(n ~ /^[ \t]/){sub(/^[ \t]+/," ",n); s=s n; getline n} print s; exit}' | sed -E 's/^Subject:[[:space:]]*//I' | mime_decode)
decide "$H" "$subj"
}
# classify + emit decoded subject (single fetch) -> "<DECISION>\t<subject>"
classify2(){
local H subj
H=$(zm getRestURL "/?id=$1&fmt=rfc822" | sed -n '1,/^$/p')
subj=$(printf '%s' "$H" | awk 'BEGIN{IGNORECASE=1} /^Subject:/{s=$0;getline n; while(n ~ /^[ \t]/){sub(/^[ \t]+/," ",n); s=s n; getline n} print s; exit}' | sed -E 's/^Subject:[[:space:]]*//I' | mime_decode)
printf '%s\t%s\n' "$(decide "$H" "$subj")" "$(printf '%s' "$subj" | cut -c1-70)"
}
ids_for(){ zm search -l 1000 -t message "$1" | grep -w mess | awk '{print $2}'; }
move_to_trash(){ # stdin: ids one per line
local buf=() id n=0
while read -r id; do [ -z "$id" ] && continue; buf+=("$id"); n=$((n+1))
if [ ${#buf[@]} -ge 200 ]; then
local c="${buf[*]}"; zm moveMessage "${c// /,}" "$TRASH" >/dev/null; buf=(); fi
done
if [ ${#buf[@]} -gt 0 ]; then local c="${buf[*]}"; zm moveMessage "${c// /,}" "$TRASH" >/dev/null; fi
echo "$n"
}
if [ "$MODE" = "preview" ]; then
echo "=== PREVIEW (read-only) most-recent $PREVIEW_N ===" | tee -a "$LOG"
IDS=$(zm search -l "$PREVIEW_N" -t message "in:inbox" | grep -w mess | awk '{print $2}')
keep=0; del=0; survivors="/tmp/nr_survivors_$TS.txt"; deletes="/tmp/nr_deletes_$TS.txt"
: > "$survivors"; : > "$deletes"
for id in $IDS; do
line=$(classify2 "$id") # "<DECISION>\t<subject>"
d=${line%%$'\t'*}; subj=${line#*$'\t'}
if [[ "$d" == KEEP* ]]; then keep=$((keep+1)); echo "id=$id [$d] $subj" >> "$survivors"
else del=$((del+1)); echo "id=$id [$d] $subj" >> "$deletes"; fi
done
echo "kept=$keep deleted=$del" | tee -a "$LOG"
echo "--- SURVIVORS (would KEEP) ---" | tee -a "$LOG"
cat "$survivors" | tee -a "$LOG"
echo "--- sample DELETES (first 25) ---" | tee -a "$LOG"
head -25 "$deletes" | tee -a "$LOG"
exit 0
fi
# ---- APPLY ----
echo "=== APPLY purge $TS (move to $TRASH) ===" | tee -a "$LOG"
# Phase 1: fast bulk bounce delete (MAILER-DAEMON = definitionally bounce), keep-guard on unsubscribe
echo "PHASE1 bulk bounces..." | tee -a "$LOG"
p1=0; g1=0
while :; do
B=$(ids_for "in:inbox from:MAILER-DAEMON AND NOT (subject:unsubscribe OR subject:\"opt out\")$DATEQ")
[ -z "${B// }" ] && break
n=$(printf '%s\n' "$B" | move_to_trash)
p1=$((p1+n)); echo " moved $n (cum $p1)" | tee -a "$LOG"
[ "$n" -lt 1 ] && break
g1=$((g1+1)); [ "$g1" -gt 200 ] && { echo " PHASE1 guard stop" | tee -a "$LOG"; break; }
done
echo "PHASE1 done: $p1 bounces -> Trash" | tee -a "$LOG"
[ "$QUICK" = "1" ] && { echo "quick mode: stop after phase1"; exit 0; }
# Phase 1.5: fast SEARCH-based bulk delete of the common non-MAILER machine classes
# (postmaster bounces, Undeliverable/Undelivered DSNs, OOO/automatic-reply). These are
# matched server-side (no per-message fetch) and HARD-guarded so anything wearing a
# genuine "Re:"/Fwd: or unsubscribe falls through to the accurate Phase 2 classifier.
echo "PHASE1.5 fast search-delete of common auto classes..." | tee -a "$LOG"
GUARD='AND NOT (subject:Re OR subject:Fwd OR subject:unsubscribe OR subject:"opt out")'
p15=0
for q in \
"in:inbox from:postmaster $GUARD$DATEQ" \
"in:inbox subject:Undeliverable $GUARD$DATEQ" \
"in:inbox subject:\"Undelivered Mail\" $GUARD$DATEQ" \
"in:inbox subject:\"automatic reply\" $GUARD$DATEQ" \
"in:inbox subject:\"out of office\" $GUARD$DATEQ" \
"in:inbox subject:\"failure notice\" $GUARD$DATEQ" \
"in:inbox subject:\"delivery status notification\" $GUARD$DATEQ" \
; do
qg=0
while :; do
B=$(ids_for "$q")
[ -z "${B// }" ] && break
n=$(printf '%s\n' "$B" | move_to_trash)
p15=$((p15+n)); echo " [$q] moved $n (cum $p15)" | tee -a "$LOG"
[ "$n" -lt 1 ] && break
qg=$((qg+1)); [ "$qg" -gt 50 ] && break
done
done
echo "PHASE1.5 done: $p15 auto-class -> Trash" | tee -a "$LOG"
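# Phase 2: header-classify the remainder. Offset paging is unreliable on this box,
# so we loop: classify the top page, delete its DELs, cache KEEPs as "seen" so we
# don't re-fetch them next pass. Terminate when a page yields only already-seen KEEPs.
# Caveat: this assumes the KEEP survivors number fewer than one page (1000); if they
# ever fill a page, messages beyond it would not be examined.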
echo "PHASE2 header-classify remainder..." | tee -a "$LOG"
p2=0; guard=0; SEEN="/tmp/nr_seen_$TS.txt"; : > "$SEEN"
while :; do
IDS=$(zm search -l 1000 -t message "in:inbox AND NOT from:MAILER-DAEMON$DATEQ" | grep -w mess | awk '{print $2}')
[ -z "${IDS// }" ] && break
delbuf=""; newwork=0
for id in $IDS; do
grep -qx "$id" "$SEEN" && continue # already classified KEEP, skip
newwork=1
d=$(classify "$id")
if [[ "$d" == DEL* ]]; then delbuf+="$id"$'\n'; else echo "$id" >> "$SEEN"; fi
done
if [ -n "${delbuf// }" ]; then
n=$(printf '%s' "$delbuf" | move_to_trash); p2=$((p2+n)); echo " page moved $n (cum $p2)" | tee -a "$LOG"
fi
# A page with no new (unseen) messages means everything left is cached-KEEP -> done.
if [ "$newwork" = "0" ]; then echo " page all-seen-KEEP, stop" | tee -a "$LOG"; break; fi
guard=$((guard+1)); [ "$guard" -gt 120 ] && { echo "guard stop" | tee -a "$LOG"; break; }
done
echo "PHASE2 done: $p2 auto/ack -> Trash; survivors cached in $SEEN ($(wc -l < "$SEEN"))" | tee -a "$LOG"
echo "TOTAL moved to Trash: $((p1+p15+p2))" | tee -a "$LOG"