From 2d220a273da6fb5a113f4763f7a4ace6a190dfc1 Mon Sep 17 00:00:00 2001
From: justin
Date: Sun, 21 Jun 2026 04:55:50 -0500
Subject: [PATCH] ops(carbonio): add noreply@ mailbox auto-purge + daily cron

Server-side classifier for the noreply@performancewest.net Carbonio mailbox
(35,337 msgs, ~98.6% machine noise). Deletes bounces/auto-replies/ticket
auto-acks by moving them to Trash (reversible); keeps genuine human Re:
replies and unsubscribes.

Classifier precedence: unsubscribe guard > RFC3834 Auto-Submitted header >
machine From-address (localpart/strong-token/display-bot) > STRONG auto
subjects (deletes deceptive Re: auto-acks) > human Re: keep > broad auto-ack
subjects > default keep. Subjects are RFC2047 MIME-decoded first.

Three-phase execution: Phase1 fast MAILER-DAEMON search-delete, Phase1.5 fast
search-delete of common auto classes (guarded against Re:/unsub), Phase2
header-classify the small remainder with KEEP-caching.

Validated 23/23 against hand-labelled live sample. Initial backfill reduced
35,337 -> 68 (67 human replies + 1 unsubscribe). Daily cron installed in root
crontab: 17 4 * * * --apply --days 3.
---
 scripts/ops/carbonio/README.md          | 121 ++++++++++++++
 scripts/ops/carbonio/nr_cron_install.sh |  10 ++
 scripts/ops/carbonio/nr_purge.sh        | 202 ++++++++++++++++++++++++
 3 files changed, 333 insertions(+)
 create mode 100644 scripts/ops/carbonio/README.md
 create mode 100644 scripts/ops/carbonio/nr_cron_install.sh
 create mode 100755 scripts/ops/carbonio/nr_purge.sh

diff --git a/scripts/ops/carbonio/README.md b/scripts/ops/carbonio/README.md
new file mode 100644
index 0000000..e2ebf7a
--- /dev/null
+++ b/scripts/ops/carbonio/README.md
@@ -0,0 +1,121 @@
+# Carbonio `noreply@` mailbox auto-purge
+
+Server-side maintenance for the `noreply@performancewest.net` mailbox on the
+Carbonio (Zextras) mail host `co.carrierone.com`.
+
+## Problem
+
+The `noreply@` mailbox accumulated **35,337 messages (~488 MB)**. A sampled
+audit showed **~98.6% were machine noise**: bounce DSNs (this box's own Postfix
+backscatter), out-of-office / auto-reply messages, and helpdesk/ticket
+auto-acknowledgements. Buried in the rest were a small number of **genuine human
+replies** to the trucking (DOT#/MCS-150) and telecom/FCC campaigns -- these land
+here because of the historical Reply-To behaviour -- plus the occasional
+**unsubscribe** request.
+
+## Policy (explicit)
+
+- **DELETE**: bounces, ticket/case auto-acknowledgements, out-of-office and
+  auto-reply messages, delivery-status notifications, authentication reports.
+- **KEEP**: genuine human replies (`Re:`/`Fwd:`) and unsubscribe/opt-out
+  requests.
+- **Fail-safe**: when a message is not clearly machine-generated, KEEP it.
+- Deletions **move to `/Trash`** (reversible), never hard-delete.
+
+## Why a header/sender classifier, not subject matching
+
+Subject text alone is unreliable: auto-responders frequently reply with a
+deceptive `Re:` prefix (e.g. an auto-responder answering our campaign with
+`Re: `). The classifier therefore uses, in precedence order:
+
+1. **Unsubscribe guard** (compliance) -- always KEEP, overrides everything.
+2. **RFC 3834 `Auto-Submitted:` header** -- if present and != `no`, the sending
+   system has declared the message automatic (bounces = `auto-generated`,
+   vacation/auto-replies = `auto-replied`). This is the single most reliable
+   signal and it catches the deceptive `Re:` auto-responders.
+3. **Machine From-address** -- exact bot localparts (`mailer-daemon`,
+   `postmaster`, `no-reply`, ...), strong tokens anywhere in the localpart
+   (`...-bounces@`, `expense-noreply-...@`, `auth-results@`), and display-name
+   bots (`Mail Delivery System`, `System Administrator`, ...).
+4. **STRONG auto subjects** -- unambiguous machine markers no human types
+   (`New Ticket Created`, `(autoresponse)`, `Auto Re:`, `your request with id
+   ##...##`, `we're on it`, `Undeliverable`, `Authentication Report`, ...).
+   Checked **before** the human `Re:` guard so ticket auto-acks dressed as `Re:`
+   are still removed.
+5. **Human `Re:`/`Fwd:`** -- KEEP.
+6. **Ticket tag `[##...##]` / broad auto-ack subjects** -- DELETE.
+7. **Default -> KEEP** (human-safe).
+
+Subjects are RFC 2047 MIME-decoded first (campaign subjects contain an em-dash,
+so they arrive `=?utf-8?Q?...?=` encoded and would otherwise evade matching).
+
+The ruleset was validated against a hand-labelled set drawn from the live
+mailbox: **23/23 cases correct**, including keeping the real `Re:` replies from
+the same campaigns whose auto-responder twins were deleted.
+
+## Execution model
+
+`nr_purge.sh` runs in three stages so the expensive part stays small:
+
+- **Phase 1** -- fast server-side search-delete of `from:MAILER-DAEMON` bounces
+  (the ~97% bulk), guarded against unsubscribe. No per-message fetch.
+- **Phase 1.5** -- fast search-delete of the common non-MAILER machine classes
+  (`from:postmaster`, `Undeliverable`, `automatic reply`, `out of office`,
+  `delivery status notification`), each hard-guarded with
+  `AND NOT (subject:Re OR subject:Fwd OR subject:unsubscribe ...)` so anything
+  ambiguous falls through to the accurate classifier.
+- **Phase 2** -- header-classify the small remainder one message at a time using
+  the full `decide()` ruleset; KEEP decisions are cached so survivors are not
+  re-fetched on subsequent pages.
+
+On the initial backfill this reduced **35,337 -> 68** messages (67 genuine human
+replies + 1 unsubscribe), moving ~35,269 machine items to Trash.
+
+## Usage
+
+```sh
+# read-only preview of the N most-recent messages (prints survivors + sample deletes)
+bash nr_purge.sh --preview 150
+
+# full purge (move matches to /Trash)
+bash nr_purge.sh --apply
+
+# date-bounded purge (only inspect last N days) -- used by the daily cron
+bash nr_purge.sh --apply --days 3
+
+# Phase-1-only fast bounce sweep
+bash nr_purge.sh --apply --quick
+```
+
+## Deployment
+
+The script lives on the Carbonio host at `/opt/zextras/nr_purge.sh` (and a copy
+in `~zextras/`). It must run as the `zextras` user (owns `zmmailbox`).
+
+A daily cron is installed in **root's** crontab (not the zextras crontab, which
+Carbonio/`zmcontrol` regenerates and would wipe):
+
+```cron
+17 4 * * * su - zextras -c 'bash /opt/zextras/nr_purge.sh --apply --days 3' >> /var/log/nr_purge_cron.log 2>&1
+```
+
+`--days 3` keeps the daily run cheap: it only header-inspects mail from the last
+three days (a few dozen messages), which is more than enough overlap to catch
+anything that arrived since the previous run.
+
+To (re)deploy after editing `nr_purge.sh`:
+
+```sh
+scp -P 22022 nr_purge.sh justin@co.carrierone.com:/tmp/nr_purge.sh
+ssh -p 22022 justin@co.carrierone.com \
+  'sudo cp /tmp/nr_purge.sh /opt/zextras/nr_purge.sh && sudo chown zextras: /opt/zextras/nr_purge.sh && sudo chmod +x /opt/zextras/nr_purge.sh'
+```
+
+## Notes / gotchas
+
+- `zmmailbox search -l` works up to 1000 results/page; offset paging (`-o`) does
+  not work reliably and large limits (2000+) silently return empty. The script
+  loops on "delete the top page, re-search" instead of offset paging.
+- Trash still counts against mailbox size until emptied. The initial backfill
+  left Trash populated (reversible); emptying it is an optional, irreversible
+  follow-up.
diff --git a/scripts/ops/carbonio/nr_cron_install.sh b/scripts/ops/carbonio/nr_cron_install.sh
new file mode 100644
index 0000000..ce59b58
--- /dev/null
+++ b/scripts/ops/carbonio/nr_cron_install.sh
@@ -0,0 +1,10 @@
+#!/bin/bash
+# Install daily noreply@ auto-purge cron in ROOT crontab (NOT zextras' -- that one is
+# regenerated by Carbonio/zmcontrol and would wipe our line). Root crontab is stable.
+# Invokes the purge as the zextras user. Date-bounded (last 3 days) so it stays cheap.
+set -e
+SCRIPT=/opt/zextras/nr_purge.sh
+LOG=/var/log/nr_purge_cron.log
+CRON_LINE="17 4 * * * su - zextras -c 'bash $SCRIPT --apply --days 3' >> $LOG 2>&1"
+( crontab -l 2>/dev/null | grep -v 'nr_purge.sh' || true ; echo "$CRON_LINE" ) | crontab -  # || true: grep -v exits 1 when nothing else is in the crontab, which would abort the subshell under set -e
+echo "=== root crontab nr_purge line ==="; crontab -l | grep nr_purge
diff --git a/scripts/ops/carbonio/nr_purge.sh b/scripts/ops/carbonio/nr_purge.sh
new file mode 100755
index 0000000..230425c
--- /dev/null
+++ b/scripts/ops/carbonio/nr_purge.sh
@@ -0,0 +1,202 @@
+#!/bin/bash
+# nr_purge.sh -- auto-purge noreply@ mailbox on Carbonio.
+# Policy: DELETE bounces + ticket auto-acks + auto-replies; KEEP human replies + unsubscribes.
+# Discriminator: RFC 3834 Auto-Submitted header (reliable; catches fake "Re:" auto-responders).
+# Reversible: deletions MOVE to /Trash (not hard delete).
+#
+# Modes:
+#   (no args)        preview: classify most-recent $PREVIEW_N msgs, read-only, print decisions+survivors
+#   --preview N      preview N most-recent
+#   --apply          full three-phase purge (Phase1 bulk bounces, Phase1.5 search-delete of auto classes, Phase2 header-classify remainder)
+#   --apply --quick  Phase1 only (bulk bounce delete), skip header classify
+set -uo pipefail
+M="noreply@performancewest.net"
+TRASH="/Trash"
+PREVIEW_N=200
+MODE="preview"; QUICK=0; DAYS="${NR_DAYS:-}"
+while [ $# -gt 0 ]; do case "$1" in
+  --apply) MODE="apply";;
+  --quick) QUICK=1;;
+  --days) DAYS="${2:-}"; shift;;
+  --preview) MODE="preview"; PREVIEW_N="${2:-200}"; shift;;
+  *) ;;
+esac; shift; done
+# Optional date bound (--days N or $NR_DAYS) appended to every phase's search; daily cron uses a small window, initial run leaves blank=all
+DATEQ=""; [ -n "$DAYS" ] && DATEQ=" AND after:-${DAYS}day"
+TS=$(date +%Y%m%d_%H%M%S); LOG="/tmp/nr_purge_$TS.log"
+zm(){ zmmailbox -z -m "$M" "$@" 2>/dev/null; }
+
+# ---- RFC 2047 MIME-header decode (handles =?utf-8?Q?..?= and ?B?..?=) ----
+mime_decode(){ perl -MEncode -CS -ne 'print Encode::decode("MIME-Header",$_)' 2>/dev/null; }
+
+# Machine-sender address localparts (exact, lowercased): definitionally non-human.
+# Matched against the localpart of the From address only (not display name) to avoid eating humans.
+FROM_MACHINE_RE='^(mailer-daemon|postmaster|auto-reply|autoreply|auto-responder|autoresponder|no-reply|noreply|donotreply|do-not-reply|bounce|bounces|mdaemon|odoobot|helpdesk|notification|notifications|notify|sysadmin|system|root|abuse)([._+-].*)?$'
+# Strong machine tokens that may appear ANYWHERE in the localpart (no human address has these).
+FROM_TOKEN_RE='noreply|no-reply|donotreply|do-not-reply|mailer-daemon|auto-reply|autoreply|autoresponse|auto-response|bounces|auth-results|postmaster'
+# Display-name bots (substring, lowercased) that use human-ish addresses but are clearly automated.
+FROM_DISPLAY_BOT_RE='odoobot|mail delivery (sub)?system|system administrator|microsoft outlook|mail administrator|postmaster'
+# STRONG auto markers checked BEFORE the human "Re:" guard -- unambiguous machine subjects that
+# no human types, so safe to delete even when wearing a "Re:" prefix (e.g. ticket auto-acks).
+STRONG_AUTO_RE='new ticket|ticket created|ticket #|ticket no\.?|ticket has been|has been (assigned|received|resolved|closed|created|opened|updated)|your request with id|request with id|we.?re on it|\(autoresponse\)|auto ?re:|automatic (reply|response)|auto-?response|out of office|out-of-office|authentication report|undeliverable|undelivered|delivery status notification|could ?n.?t be delivered|could not be delivered|message could ?n.?t be|failure notice|returned mail|welcome to .*help ?desk|new case notification'
+# Broader auto-ack / bounce subject patterns (lowercased subject), checked AFTER the Re: guard.
+SUBJ_AUTO_RE='has been (received|resolved|closed|updated|created|opened|assigned)|case (received|closed|resolved|notification)|ticket ?#|ticket no\.?|ticket has been|your ticket|new ticket|ticket created|your request with id|thanks,? we got your|we have received your|out of office|out-of-office|automatic reply|automatic response|auto[- ]?reply|autoreply|auto-?response|\(autoresponse\)|new case|message (recieved|received)|delivery (status notification|failure|has failed)|undelivered|undeliverable|failure notice|^failed:|returned mail|mail delivery|could not be delivered|could ?n.?t be delivered|delayed mail|invalid address|address not found|recipient (address )?rejected|new email address|quota|read-?receipt|priority opened|authentication report|help ?desk'
+
+# from_localpart <headers> -> echoes lowercased localpart of From address
+from_localpart(){
+  printf '%s' "$1" | grep -iE '^From:' | head -1 \
+    | sed -E 's/^From:[[:space:]]*//I' \
+    | grep -oE '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+' | head -1 \
+    | sed -E 's/@.*$//' | tr 'A-Z' 'a-z'
+}
+from_display(){ printf '%s' "$1" | grep -iE '^From:' | head -1 | sed -E 's/^From:[[:space:]]*//I' | tr 'A-Z' 'a-z'; }
+
+# decide <headers> <subject> -> prints "KEEP <reason>" | "DEL <reason>"
+# Precedence: (1) unsubscribe wins; (2) Auto-Submitted header; (3) machine From-address (exact
+#   localpart / strong token / display-bot); (4) STRONG auto subjects (delete even if "Re:");
+#   (5) genuine human Re:; (6) ticket-tag / broad auto-ack subjects; (7) default keep.
+decide(){
+  local H="$1" subj="$2"
+  local s as lp disp
+  s=$(printf '%s' "$subj" | tr 'A-Z' 'a-z')
+  as=$(printf '%s' "$H" | grep -iE '^Auto-Submitted:' | head -1 | sed -E 's/^Auto-Submitted:[[:space:]]*//I' | tr 'A-Z' 'a-z' | tr -d ' ')
+  lp=$(from_localpart "$H"); disp=$(from_display "$H")
+  # 1) compliance: unsubscribe/opt-out always KEEP (overrides every machine signal)
+  if printf '%s' "$s" | grep -qE 'unsubscribe|opt[ -]?out|remove me|stop emailing'; then echo "KEEP unsubscribe"; return; fi
+  # 2) RFC 3834 Auto-Submitted present & != no -> machine
+  if [ -n "$as" ] && [ "$as" != "no" ]; then echo "DEL auto-submitted=$as"; return; fi
+  # 3) machine From-address (exact localpart, strong token anywhere, or display-name bot)
+  if printf '%s' "$lp" | grep -qE "$FROM_MACHINE_RE"; then echo "DEL from-machine=$lp"; return; fi
+  if printf '%s' "$lp" | grep -qE "$FROM_TOKEN_RE"; then echo "DEL from-token=$lp"; return; fi
+  if printf '%s' "$disp" | grep -qE "$FROM_DISPLAY_BOT_RE"; then echo "DEL from-bot"; return; fi
+  # 4) STRONG auto subjects: unambiguous machine markers, delete even if dressed as "Re:"
+  if printf '%s' "$s" | grep -qE "$STRONG_AUTO_RE"; then echo "DEL strong-auto-subject"; return; fi
+  # 5) genuine human threaded reply (auto ones already removed above)
+  if printf '%s' "$subj" | grep -qE '^[[:space:]]*(Re|RE|Fwd|Fw|FW)[:[]'; then echo "KEEP human-reply"; return; fi
+  # 6) ticket tag / broad auto-ack subject patterns
+  if printf '%s' "$subj" | grep -qE '^[[:space:]]*\[##'; then echo "DEL ticket-tag"; return; fi
+  if printf '%s' "$s" | grep -qE "$SUBJ_AUTO_RE"; then echo "DEL auto-ack-subject"; return; fi
+  # 7) default: keep (human-safe)
+  echo "KEEP default"
+}
+
+# fetch the joined+decoded Subject for a message id (getline's result is checked so a folded Subject at end of input cannot loop forever)
+get_subject(){ zm getRestURL "/?id=$1&fmt=rfc822" | sed -n '1,/^$/p' \
+  | awk 'BEGIN{IGNORECASE=1} /^Subject:/{s=$0; while((getline n) > 0 && n ~ /^[ \t]/){sub(/^[ \t]+/," ",n); s=s n} print s; exit}' \
+  | sed -E 's/^Subject:[[:space:]]*//I' | mime_decode; }
+
+# ---- classify one message id -> prints "KEEP <reason>" or "DEL <reason>" ----
+classify(){
+  local H subj
+  H=$(zm getRestURL "/?id=$1&fmt=rfc822" | sed -n '1,/^$/p')
+  subj=$(printf '%s' "$H" | awk 'BEGIN{IGNORECASE=1} /^Subject:/{s=$0; while((getline n) > 0 && n ~ /^[ \t]/){sub(/^[ \t]+/," ",n); s=s n} print s; exit}' | sed -E 's/^Subject:[[:space:]]*//I' | mime_decode)
+  decide "$H" "$subj"
+}
+
+# classify + emit decoded subject (single fetch) -> "<decision>\t<subject>"
+classify2(){
+  local H subj
+  H=$(zm getRestURL "/?id=$1&fmt=rfc822" | sed -n '1,/^$/p')
+  subj=$(printf '%s' "$H" | awk 'BEGIN{IGNORECASE=1} /^Subject:/{s=$0; while((getline n) > 0 && n ~ /^[ \t]/){sub(/^[ \t]+/," ",n); s=s n} print s; exit}' | sed -E 's/^Subject:[[:space:]]*//I' | mime_decode)
+  printf '%s\t%s\n' "$(decide "$H" "$subj")" "$(printf '%s' "$subj" | cut -c1-70)"
+}
+
+ids_for(){ zm search -l 1000 -t message "$1" | grep -w mess | awk '{print $2}'; }
+
+move_to_trash(){ # stdin: ids one per line
+  local buf=() id n=0
+  while read -r id; do [ -z "$id" ] && continue; buf+=("$id"); n=$((n+1))
+    if [ ${#buf[@]} -ge 200 ]; then
+      local c="${buf[*]}"; zm moveMessage "${c// /,}" "$TRASH" >/dev/null; buf=(); fi
+  done
+  if [ ${#buf[@]} -gt 0 ]; then local c="${buf[*]}"; zm moveMessage "${c// /,}" "$TRASH" >/dev/null; fi
+  echo "$n"
+}
+
+if [ "$MODE" = "preview" ]; then
+  echo "=== PREVIEW (read-only) most-recent $PREVIEW_N ===" | tee -a "$LOG"
+  IDS=$(zm search -l "$PREVIEW_N" -t message "in:inbox" | grep -w mess | awk '{print $2}')
+  keep=0; del=0; survivors="/tmp/nr_survivors_$TS.txt"; deletes="/tmp/nr_deletes_$TS.txt"
+  : > "$survivors"; : > "$deletes"
+  for id in $IDS; do
+    line=$(classify2 "$id")   # "<decision>\t<subject>"
+    d=${line%%$'\t'*}; subj=${line#*$'\t'}
+    if [[ "$d" == KEEP* ]]; then keep=$((keep+1)); echo "id=$id [$d] $subj" >> "$survivors"
+    else del=$((del+1)); echo "id=$id [$d] $subj" >> "$deletes"; fi
+  done
+  echo "kept=$keep deleted=$del" | tee -a "$LOG"
+  echo "--- SURVIVORS (would KEEP) ---" | tee -a "$LOG"
+  cat "$survivors" | tee -a "$LOG"
+  echo "--- sample DELETES (first 25) ---" | tee -a "$LOG"
+  head -25 "$deletes" | tee -a "$LOG"
+  exit 0
+fi
+
+# ---- APPLY ----
+echo "=== APPLY purge $TS (move to $TRASH) ===" | tee -a "$LOG"
+# Phase 1: fast bulk bounce delete (MAILER-DAEMON = definitionally bounce), keep-guard on unsubscribe
+echo "PHASE1 bulk bounces..." | tee -a "$LOG"
+p1=0; g1=0
+while :; do
+  B=$(ids_for "in:inbox from:MAILER-DAEMON AND NOT (subject:unsubscribe OR subject:\"opt out\")$DATEQ")
+  [ -z "${B// }" ] && break
+  n=$(printf '%s\n' "$B" | move_to_trash)
+  p1=$((p1+n)); echo " moved $n (cum $p1)" | tee -a "$LOG"
+  [ "$n" -lt 1 ] && break
+  g1=$((g1+1)); [ "$g1" -gt 200 ] && { echo " PHASE1 guard stop" | tee -a "$LOG"; break; }
+done
+echo "PHASE1 done: $p1 bounces -> Trash" | tee -a "$LOG"
+[ "$QUICK" = "1" ] && { echo "quick mode: stop after phase1"; exit 0; }
+
+# Phase 1.5: fast SEARCH-based bulk delete of the common non-MAILER machine classes
+# (postmaster bounces, Undeliverable/Undelivered DSNs, OOO/automatic-reply). These are
+# matched server-side (no per-message fetch) and HARD-guarded so anything wearing a
+# genuine "Re:"/Fwd: or unsubscribe falls through to the accurate Phase 2 classifier.
+echo "PHASE1.5 fast search-delete of common auto classes..." | tee -a "$LOG"
+GUARD='AND NOT (subject:Re OR subject:Fwd OR subject:unsubscribe OR subject:"opt out")'
+p15=0
+for q in \
+  "in:inbox from:postmaster $GUARD$DATEQ" \
+  "in:inbox subject:Undeliverable $GUARD$DATEQ" \
+  "in:inbox subject:\"Undelivered Mail\" $GUARD$DATEQ" \
+  "in:inbox subject:\"automatic reply\" $GUARD$DATEQ" \
+  "in:inbox subject:\"out of office\" $GUARD$DATEQ" \
+  "in:inbox subject:\"failure notice\" $GUARD$DATEQ" \
+  "in:inbox subject:\"delivery status notification\" $GUARD$DATEQ" \
+; do
+  qg=0
+  while :; do
+    B=$(ids_for "$q")
+    [ -z "${B// }" ] && break
+    n=$(printf '%s\n' "$B" | move_to_trash)
+    p15=$((p15+n)); echo " [$q] moved $n (cum $p15)" | tee -a "$LOG"
+    [ "$n" -lt 1 ] && break
+    qg=$((qg+1)); [ "$qg" -gt 50 ] && break
+  done
+done
+echo "PHASE1.5 done: $p15 auto-class -> Trash" | tee -a "$LOG"
+
+# Phase 2: header-classify the remainder. Offset paging is unreliable on this box,
+# so we loop: classify the top page, delete its DELs, cache KEEPs as "seen" so we
+# don't re-fetch them next pass. Terminate when a page yields only already-seen KEEPs.
+echo "PHASE2 header-classify remainder..." | tee -a "$LOG"
+p2=0; guard=0; SEEN="/tmp/nr_seen_$TS.txt"; : > "$SEEN"
+while :; do
+  IDS=$(zm search -l 1000 -t message "in:inbox AND NOT from:MAILER-DAEMON$DATEQ" | grep -w mess | awk '{print $2}')
+  [ -z "${IDS// }" ] && break
+  delbuf=""; newwork=0
+  for id in $IDS; do
+    grep -qx "$id" "$SEEN" && continue # already classified KEEP, skip
+    newwork=1
+    d=$(classify "$id")
+    if [[ "$d" == DEL* ]]; then delbuf+="$id"$'\n'; else echo "$id" >> "$SEEN"; fi
+  done
+  if [ -n "${delbuf// }" ]; then
+    n=$(printf '%s' "$delbuf" | move_to_trash); p2=$((p2+n)); echo " page moved $n (cum $p2)" | tee -a "$LOG"
+  fi
+  # A page with no new (unseen) messages means everything left is cached-KEEP -> done.
+  if [ "$newwork" = "0" ]; then echo " page all-seen-KEEP, stop" | tee -a "$LOG"; break; fi
+  guard=$((guard+1)); [ "$guard" -gt 120 ] && { echo "guard stop" | tee -a "$LOG"; break; }
+done
+echo "PHASE2 done: $p2 auto/ack -> Trash; survivors cached in $SEEN ($(wc -l < "$SEEN"))" | tee -a "$LOG"
+echo "TOTAL moved to Trash: $((p1+p15+p2))" | tee -a "$LOG"
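
The precedence order used by `decide()` can be exercised offline, without a Carbonio host or `zmmailbox`. The following is a minimal sketch, assuming a hypothetical `toy_decide` helper that takes the `Auto-Submitted` value, the From localpart, and the decoded subject as plain arguments. Its patterns are deliberately shortened stand-ins, not the regexes in `nr_purge.sh`, and the labelled cases are invented for illustration rather than taken from the live mailbox.

```sh
# Illustrative only: toy_decide and its shortened patterns are assumptions,
# not the production ruleset in nr_purge.sh.
toy_decide(){
  local as="$1" lp="$2" subj="$3" s
  s=$(printf '%s' "$subj" | tr 'A-Z' 'a-z')
  # 1) unsubscribe guard overrides every machine signal
  if printf '%s' "$s" | grep -qE 'unsubscribe|opt[ -]?out'; then echo "KEEP unsubscribe"; return; fi
  # 2) RFC 3834: any Auto-Submitted value other than "no" means automatic
  if [ -n "$as" ] && [ "$as" != "no" ]; then echo "DEL auto-submitted"; return; fi
  # 3) machine sender localpart (exact match)
  if printf '%s' "$lp" | grep -qE '^(mailer-daemon|postmaster|no-?reply)$'; then echo "DEL from-machine"; return; fi
  # 4) strong auto subjects win even over a "Re:" prefix
  if printf '%s' "$s" | grep -qE 'new ticket|out of office|undeliverable'; then echo "DEL strong-auto-subject"; return; fi
  # 5) human threaded reply
  if printf '%s' "$subj" | grep -qE '^[[:space:]]*(Re|RE|Fwd|FW):'; then echo "KEEP human-reply"; return; fi
  # 6) default: keep (human-safe)
  echo "KEEP default"
}

# Hand-labelled cases, one per line: expected|auto-submitted|localpart|subject
fails=0
while IFS='|' read -r want as lp subj; do
  got=$(toy_decide "$as" "$lp" "$subj")
  [ "$got" = "$want" ] || { echo "MISMATCH: want=$want got=$got subj=$subj"; fails=$((fails+1)); }
done <<'EOF'
KEEP unsubscribe|auto-replied|mailer-daemon|Please unsubscribe me
DEL auto-submitted|auto-replied|jane|Re: your filing
DEL from-machine||postmaster|Delivery report
DEL strong-auto-subject||support|Re: New Ticket Created
KEEP human-reply||jane|Re: your filing
KEEP default||jane|Question about pricing
EOF
echo "mismatches=$fails"
```

The second and fifth cases share a subject and differ only in the `Auto-Submitted` value, which is the "auto-responder twin" situation the README describes. Keeping a table like this next to the script would let a rule change be checked before it is deployed to the mail host.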