Internal

Site-Watch

A daily competitor-monitoring service that watches grouped sets of brand and competitor sites, captures the changes that matter, and emails per-group digests with a one-click link to the full diff report.

Role: Director of Technology
Timeframe: 2025
Stack: Python · Playwright · BeautifulSoup · diff-match-patch · PyMuPDF

Health brand strategy depends on knowing what competitors are doing: pricing pages, indication updates, new patient resources, formulary changes. The strategy team was checking sites by hand, which meant they checked rarely and inconsistently. I built a Python service that does it every day, automatically, and emails the right people when something has actually changed.

The problem

Strategy work is competitive intelligence work. The team needed to know when an indication was added to a competitor's product page, when a new copay assistance program launched, when a label update went live. These changes drive client recommendations and pitch updates.

The existing process was a spreadsheet of URLs and a recurring calendar event. Sites were checked sporadically, often well after a change went live. Small updates were missed entirely. There was no diff capability, so reviewers had to remember what a page looked like last time, which they couldn't.

The brief I gave myself was simple: check every site every day, store snapshots, surface meaningful diffs to the people who can act on them. Engineering should run it. The brand teams should consume it.

The architecture

Site-Watch is a Python scheduler that walks each site listed in config.json, captures every page as plain text, and compares it against the most recent archived snapshot. It does the expensive work (screenshots, visual diffs, report generation, email) only when something has actually changed. The output is a self-contained HTML report per host, uploaded over SFTP to a public web server, with an emailed summary linking to it. Sites are grouped by audience, and each group has its own recipient list, so the right team gets the right alerts.

Figure: daily run. config.json (sites + emails); Schedule (daily); Archive (prior snapshots); Capture (Playwright · BeautifulSoup); Text diff (diff-match-patch); Visual diff (PIL · NumPy); HTML report (per host); SFTP upload (paramiko); Email digest (yagmail).

Capture

For each site, the crawler starts from the configured URL and follows internal links, using an exclude_pages pattern list to skip irrelevant routes. requests fetches each page and BeautifulSoup extracts the plain text; get_text() drops scripts, styles, and markup as a side effect. PDFs are handled the same way through PyMuPDF, which extracts text from each page. Snapshots land in archive/<host>/<date>/<timestamp>/<page-slug>.txt, so the snapshot store is just the filesystem, organised by host and run.
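The link filtering and snapshot layout described above can be sketched with the standard library alone. This is a minimal illustration, not the service's actual code: the function names are hypothetical, exclude_pages entries are assumed to be regular expressions matched against the URL path, and the HHMMSS timestamp format is a guess at what <timestamp> looks like.

```python
import re
from datetime import datetime
from pathlib import Path
from urllib.parse import urljoin, urlparse


def internal_links(base_url, hrefs, exclude_pages):
    """Resolve hrefs against base_url; keep same-host pages that are not excluded."""
    host = urlparse(base_url).netloc
    excluded = [re.compile(p) for p in exclude_pages]  # assumption: regex patterns
    kept = []
    for href in hrefs:
        url = urljoin(base_url, href).split("#", 1)[0]  # drop #fragments
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or parsed.netloc != host:
            continue  # external link, mailto:, tel:, etc.
        if any(rx.search(parsed.path) for rx in excluded):
            continue
        if url not in kept:
            kept.append(url)
    return kept


def page_slug(url):
    """Turn a URL path into a filesystem-safe slug; the site root becomes 'index'."""
    path = urlparse(url).path.strip("/")
    return re.sub(r"[^A-Za-z0-9._-]+", "-", path) or "index"


def snapshot_path(archive_root, url, run_started):
    """Build archive/<host>/<date>/<timestamp>/<page-slug>.txt for one page."""
    return (
        Path(archive_root)
        / urlparse(url).netloc
        / run_started.strftime("%Y-%m-%d")
        / run_started.strftime("%H%M%S")  # assumed timestamp format
        / f"{page_slug(url)}.txt"
    )
```

Zero-padded date and time segments keep a plain alphabetical sort of the directories in chronological order, which makes "most recent snapshot" a cheap lookup.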

Diff

Plain text gets compared with diff-match-patch, which computes the differences between the old and new snapshot. The diff is rendered into the HTML report with <ins> and <del> highlighting on the changed words. The semantic-cleanup pass collapses a one-word edit into a one-word highlight rather than reporting an entire paragraph as a delete + insert.
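A minimal sketch of that diff-and-render step follows. diff_main and diff_cleanupSemantic are the diff-match-patch library's own calls; the two function names and the hand-rolled renderer are hypothetical, shown only to make the shape of the data concrete. The library returns a list of (op, text) tuples where op is -1 for a deletion, 1 for an insertion, and 0 for unchanged text.

```python
import html


def diff_snapshots(old_text, new_text):
    """Return diff-match-patch tuples with semantic cleanup applied."""
    # Third-party import kept local so render_diff() can be used without it.
    from diff_match_patch import diff_match_patch

    dmp = diff_match_patch()
    diffs = dmp.diff_main(old_text, new_text)
    dmp.diff_cleanupSemantic(diffs)  # rewrites the list in place
    return diffs


def render_diff(diffs):
    """Render (op, text) tuples as HTML with <ins> and <del> highlighting."""
    parts = []
    for op, text in diffs:
        escaped = html.escape(text).replace("\n", "<br>\n")
        if op == 1:
            parts.append(f"<ins>{escaped}</ins>")
        elif op == -1:
            parts.append(f"<del>{escaped}</del>")
        else:
            parts.append(escaped)
    return "".join(parts)
```

Escaping before wrapping matters here: snapshot text comes from third-party pages, so it should never be trusted as markup inside the report.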

Screenshots and visual diff

When the text diff is non-empty, Playwright opens a headless Chromium against the page and grabs a full-page screenshot. Pillow + NumPy then generate a three-panel image (old, new, and a red-highlighted overlay of the changed pixels) so a reviewer can spot the change visually in seconds. This only runs on pages that actually changed, which is the difference between a fast daily run and an expensive one.
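The three-panel image can be sketched in NumPy as below. This is an illustration under assumptions, not the service's code: the function names and the threshold value are hypothetical, and padding the shorter screenshot with white is one possible way to handle full-page captures whose heights differ between runs.

```python
import numpy as np


def pad_to(img, height, width):
    """Pad an H x W x 3 uint8 image with white up to the given size."""
    canvas = np.full((height, width, 3), 255, dtype=np.uint8)
    canvas[: img.shape[0], : img.shape[1]] = img
    return canvas


def three_panel(old, new, threshold=16):
    """Return (panel, changed_pixel_count) for two RGB screenshots."""
    height = max(old.shape[0], new.shape[0])
    width = max(old.shape[1], new.shape[1])
    old_p, new_p = pad_to(old, height, width), pad_to(new, height, width)

    # Largest absolute channel delta per pixel, computed in int16 so that
    # uint8 subtraction cannot wrap around.
    delta = np.abs(old_p.astype(np.int16) - new_p.astype(np.int16)).max(axis=2)
    mask = delta > threshold

    overlay = new_p.copy()
    overlay[mask] = (255, 0, 0)  # paint changed pixels red

    # Old, new, and overlay side by side.
    return np.hstack([old_p, new_p, overlay]), int(mask.sum())
```

With Pillow, the inputs would come from np.asarray(Image.open(path).convert("RGB")) and the result could be written out with Image.fromarray(panel).save(...). A small threshold absorbs anti-aliasing and compression noise that would otherwise light up the overlay.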

Report and delivery

For each host with changes, the run generates an index.html with a side-by-side text diff per page and a "Compare Screenshots" modal showing the visual diff. Style and script assets are bundled in so the report works as a standalone artifact. The bundle is uploaded over SFTP to a public web server (paramiko), and yagmail sends a per-group email with the site name, change count, a "View report" button linking to the public URL, and the list of changed pages.
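To make the email contents concrete, here is a small sketch that assembles a per-group digest from the fields listed above (site name, change count, report link, changed pages). The HostReport type, the build_digest function, and the subject wording are all hypothetical.

```python
import html
from dataclasses import dataclass


@dataclass
class HostReport:
    site_name: str
    report_url: str
    changed_pages: list[str]


def build_digest(group_name, reports):
    """Return (subject, html_body) covering every changed host in one group."""
    total = sum(len(r.changed_pages) for r in reports)
    subject = f"Site-Watch: {total} changed page(s) in {group_name}"
    sections = []
    for r in reports:
        pages = "".join(f"<li>{html.escape(p)}</li>" for p in r.changed_pages)
        sections.append(
            f"<h3>{html.escape(r.site_name)}: {len(r.changed_pages)} change(s)</h3>"
            f'<p><a href="{html.escape(r.report_url, quote=True)}">View report</a></p>'
            f"<ul>{pages}</ul>"
        )
    return subject, "\n".join(sections)
```

The resulting pair could then be handed to yagmail, along the lines of yagmail.SMTP(user, password).send(to=recipients, subject=subject, contents=body), once the SFTP upload has succeeded and the report URL is live.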

Key decisions

Snapshot text, not HTML

Comparing raw HTML would surface every cache-buster, every analytics nonce, every dynamic ID as a change. BeautifulSoup's text extraction removes almost all of that as a side effect, so the diff is signal: copy changes, indication updates, new sections. The signal-to-noise ratio is dramatically better than with a byte-level diff, and there's no custom strip-rule list to maintain.

Screenshot only on change

Capturing every page every day through Playwright across hundreds of URLs is expensive in time and CPU. Snapshotting text is cheap. Triggering the headless browser only when the text diff is non-empty keeps the daily run lightweight, so the report email arrives within minutes of the schedule firing.
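The gate itself is only a few lines. In this sketch the function name and return shape are hypothetical, capture_visuals stands in for whatever drives Playwright, and treating a first-seen page as "new" without a screenshot is an assumption about how pages with no prior snapshot are handled.

```python
def process_page(url, old_text, new_text, capture_visuals):
    """Run the expensive capture step only when the text snapshot changed."""
    if old_text is None:
        # Assumption: a page with no prior snapshot has nothing to diff against.
        return {"url": url, "status": "new", "visuals": None}
    if old_text == new_text:
        return {"url": url, "status": "unchanged", "visuals": None}
    # Only now is the headless browser worth starting.
    return {"url": url, "status": "changed", "visuals": capture_visuals(url)}
```

Passing the capture step in as a callable keeps the cheap path free of any browser dependency and makes the gate easy to test.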

Filesystem snapshots, SFTP delivery

A database for snapshot history would be over-engineered. The natural unit is "the text of page X on date Y", which is exactly what the filesystem is shaped for: one file per page per timestamp, archived under host. The reports are static HTML, so any web server can host them. SFTP is the lowest-friction delivery to whatever box is already serving the team's other reports.
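With that layout, finding the previous snapshot of a page is a glob and a sort rather than a query. A minimal sketch, assuming zero-padded <date> and <timestamp> directory names so that alphabetical order matches chronological order; latest_snapshot is a hypothetical helper name.

```python
from pathlib import Path


def latest_snapshot(archive_root, host, slug):
    """Return the newest archived snapshot for a page, or None if none exists."""
    # Layout: <archive_root>/<host>/<date>/<timestamp>/<slug>.txt
    candidates = Path(archive_root, host).glob(f"*/*/{slug}.txt")
    # Order by the (date, timestamp) directory names.
    ordered = sorted(candidates, key=lambda p: p.parts[-3:-1])
    return ordered[-1] if ordered else None
```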

Per-group config

config.json groups sites by audience (DTC competitors, HCP competitors, personal sites) and pairs each group with its own email recipient list. The DTC strategist sees DTC moves without being on the HCP firehose, and vice versa. Adding a competitor is a JSON edit, not a code change.
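As an illustration of what such a grouped config might look like, the sketch below parses a made-up example. The real config.json schema is not shown in this write-up, so every key name here (groups, name, recipients, sites, url) and every address is hypothetical; only exclude_pages is taken from the description above.

```python
import json

# Hypothetical schema for illustration only.
EXAMPLE_CONFIG = """
{
  "groups": [
    {
      "name": "DTC competitors",
      "recipients": ["dtc-strategy@example.com"],
      "sites": [
        {"url": "https://competitor-a.example/", "exclude_pages": ["^/careers"]}
      ]
    },
    {
      "name": "HCP competitors",
      "recipients": ["hcp-strategy@example.com", "medical@example.com"],
      "sites": [
        {"url": "https://competitor-b.example/hcp/", "exclude_pages": []}
      ]
    }
  ]
}
"""


def load_groups(raw_json):
    """Parse the config and return {group name: (recipients, site URLs)}."""
    groups = {}
    for group in json.loads(raw_json)["groups"]:
        if not group["recipients"]:
            # A group nobody receives is almost certainly a config mistake.
            raise ValueError(f"group {group['name']!r} has no recipients")
        urls = [site["url"] for site in group["sites"]]
        groups[group["name"]] = (group["recipients"], urls)
    return groups
```

Failing loudly on an empty recipient list is a design choice worth considering: a silently unmonitored group defeats the point of the service.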

Outcomes

Site-Watch has become the kind of tool nobody mentions but everyone uses. It catches competitor moves the team would otherwise miss and surfaces them as a daily email with a one-click link to the full diff. Because the expensive work only fires on real changes, the run cost is essentially nothing.

What I'd do differently

I'd add a noise filter for session-bound URL parameters and dynamic IDs that survive the text strip on some sites. Today these can produce a noisy email on an otherwise quiet competitor day. A small per-host allow/deny list of tokens to normalise before diffing would collapse those into the actual signal.
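One possible shape for that filter is a per-host list of regex rules applied to both snapshots before they are compared. This is a sketch of the proposal, not existing code: the NORMALISERS table, the host name, the patterns, and the placeholder tokens are all hypothetical.

```python
import re

# Hypothetical per-host rules: (pattern, placeholder) pairs applied in order.
NORMALISERS = {
    "competitor-a.example": [
        (r"\b(?:sessionid|sid|utm_[a-z]+)=[^\s&]+", "<param>"),
        (r"\bid-[0-9a-f]{8,}\b", "<id>"),
    ],
}


def normalise(host, text):
    """Replace known-volatile tokens so they no longer register as changes."""
    for pattern, placeholder in NORMALISERS.get(host, []):
        text = re.sub(pattern, placeholder, text)
    return text
```

Two snapshots that differ only in a session token then normalise to identical text, so the text diff is empty and no email goes out. Hosts without rules pass through untouched.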

I'd also build a small admin UI for the URL list. Today the config sits in the repo, so adding a new competitor means a PR. A tiny self-service page that updates config.json and reloads would put the team in direct control of what's tracked, without an engineer in the loop.