analytics: filter email-scanner / headless traffic out of Umami stats
Email security gateways (Microsoft Defender Safe Links / ATP, Proofpoint,
Mimecast, Barracuda, etc.) auto-fetch and often render every link in a
campaign email to scan for malware. The advanced ones drive a real headless
browser, execute JS, and fire Umami pageviews/clicks that masquerade as human
visits -- inflating campaign click-through.
New site/public/js/pw-bot-filter.js queries multiple real-browser signals and
gates Umami via its official data-before-send hook (umamiBeforeSend), dropping
all events when the visitor is a bot. Signals (from empirical chromium probing):
decisive: navigator.webdriver, HeadlessChrome UA, known scanner UAs, zero/
collapsed screen|viewport|outer geometry, window LARGER than the
physical screen (impossible on real HW; uses outerW/H so page zoom
does not false-positive), software GPU rasterizer (SwiftShader/
llvmpipe/swrast via WebGL UNMASKED_RENDERER), zero logical CPUs.
soft (>=2 to trip): tiny screen, inner>screen, low color depth, empty
navigator.languages, no input device (no fine/coarse pointer + no
hover + 0 touch), no WebGL on a desktop UA.
Designed to FAIL OPEN: only strong/corroborated evidence suppresses, so real
visitors (incl. zoomed, privacy-tooled, remote-desktop, kiosk) still count.
Wired before the Umami tag in Base.astro (Astro pages) and all 86 static
public/**/*.html pages; both load with defer so order is guaranteed and the
hook is defined before Umami reads it.
Tested end-to-end with chromium (site/tests/bot-filter.test.sh, 4/4):
default headless-new, spoofed-Windows-UA + normal 1366x768 window, and
spoofed-UA + 1x1 window are all caught; hook returns null to drop the event.
This commit is contained in:
parent
40da017b79
commit
f481a1d13c
89 changed files with 341 additions and 87 deletions
67
site/tests/bot-filter.test.sh
Normal file
67
site/tests/bot-filter.test.sh
Normal file
|
|
@ -0,0 +1,67 @@
|
|||
#!/usr/bin/env bash
|
||||
# End-to-end test for the analytics email-scanner / headless filter.
|
||||
#
|
||||
# Verifies pw-bot-filter.js (a) correctly flags headless/automated browsers as
|
||||
# bots across several scanner disguises, and (b) wires the Umami before-send
|
||||
# hook so flagged traffic is dropped. Run from the repo root:
|
||||
#
|
||||
# bash site/tests/bot-filter.test.sh
|
||||
#
|
||||
# Requires: chromium (or google-chrome) + python3. Skips cleanly if absent.
|
||||
set -uo pipefail
|
||||
|
||||
FILTER="$(cd "$(dirname "$0")/.." && pwd)/public/js/pw-bot-filter.js"
|
||||
CHROME="$(command -v chromium || command -v chromium-browser || command -v google-chrome || true)"
|
||||
if [ -z "$CHROME" ]; then echo "SKIP: no chromium/chrome found"; exit 0; fi
|
||||
[ -f "$FILTER" ] || { echo "FAIL: $FILTER missing"; exit 1; }
|
||||
|
||||
WORK="$(mktemp -d)"
|
||||
SRV_PID=""
|
||||
cleanup() { [ -n "$SRV_PID" ] && kill "$SRV_PID" 2>/dev/null; rm -rf "$WORK"; }
|
||||
trap cleanup EXIT
|
||||
cp "$FILTER" "$WORK/pw-bot-filter.js"
|
||||
cat > "$WORK/probe.html" <<'HTML'
|
||||
<!doctype html><html><head><meta charset="utf-8"><title>t</title>
|
||||
<script src="/pw-bot-filter.js"></script></head><body><script>
|
||||
document.title = JSON.stringify({
|
||||
isBot: window.pwIsBot,
|
||||
reasons: window.pwBotReasons,
|
||||
hook: typeof window.umamiBeforeSend,
|
||||
drop: window.umamiBeforeSend("event", {x:1}) === null
|
||||
});
|
||||
</script></body></html>
|
||||
HTML
|
||||
( cd "$WORK" && exec python3 -m http.server 8791 >/dev/null 2>&1 ) &
|
||||
SRV_PID=$!
|
||||
sleep 1
|
||||
|
||||
run() { # args -> JSON from page title
|
||||
"$CHROME" --headless=new --no-sandbox --disable-dev-shm-usage "$@" \
|
||||
--dump-dom "http://localhost:8791/probe.html" 2>/dev/null \
|
||||
| grep -oE '\{.*\}' | head -1
|
||||
}
|
||||
PASS=0; FAIL=0
|
||||
check() { # name expr_json key wantbool
|
||||
local name="$1" json="$2"
|
||||
local got; got="$(printf '%s' "$json" | python3 -c "import sys,json;print(json.loads(sys.stdin.read()).get('isBot'))" 2>/dev/null)"
|
||||
if [ "$got" = "True" ]; then echo "PASS $name (isBot=True)"; PASS=$((PASS+1));
|
||||
else echo "FAIL $name (isBot=$got; json=$json)"; FAIL=$((FAIL+1)); fi
|
||||
}
|
||||
|
||||
# 1. default headless-new
|
||||
check "headless-new default" "$(run)"
|
||||
# 2. spoofed Windows UA + normal window (sophisticated scanner)
|
||||
check "spoofed UA + normal window" "$(run --window-size=1366,768 \
|
||||
--user-agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36')"
|
||||
# 3. spoofed UA + tiny window
|
||||
check "spoofed UA + tiny window" "$(run --window-size=1,1 \
|
||||
--user-agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0 Safari/537.36')"
|
||||
|
||||
# hook wiring on the default run
|
||||
HJSON="$(run)"
|
||||
HK="$(printf '%s' "$HJSON" | python3 -c "import sys,json;d=json.loads(sys.stdin.read());print(d.get('hook'),d.get('drop'))" 2>/dev/null)"
|
||||
if [ "$HK" = "function True" ]; then echo "PASS umamiBeforeSend drops bot events"; PASS=$((PASS+1));
|
||||
else echo "FAIL hook wiring ($HK)"; FAIL=$((FAIL+1)); fi
|
||||
|
||||
echo ""; echo "$PASS passed, $FAIL failed"
|
||||
[ "$FAIL" -eq 0 ]
|
||||
Loading…
Add table
Add a link
Reference in a new issue