Projects · March 23, 2024

Ford Web Scraper — Price Comparison Tool

Project Type: Web scraping and automated price comparison
Sites Targeted: ford.ca (manufacturer) and fordtodealers.ca (dealer portal)
Team Size: Solo — Neal Miran
Tech Stack: Python, Selenium, Docker, Azure Container Instances, Azure Logic Apps, Gmail SMTP
Trigger Cadence: Once daily via Azure Logic Apps
Status: Retired — 2025
Delivery: Side project — built to help a friend
The Ford Web Scraper was built to solve a specific and recurring operational problem: the team maintaining fordtodealers.ca was manually cross-referencing pricing from ford.ca once a month to keep their dealer portal in sync. The two sites operate independently — fordtodealers.ca is a dealer-facing portal, while ford.ca is the manufacturer's public-facing site — and they have no formal communication channel or shared data pipeline. Ford does not expose a public API for pricing data, so there was no programmatic way to pull prices from the source. This meant the fordtodealers.ca team was opening the manufacturer site, navigating to each model, recording trim prices by hand, and then updating their portal manually. The process was error-prone, time-consuming, and entirely dependent on someone remembering to do it each month.

The scraper automated the full pipeline: it navigated both sites using Selenium, extracted trim-level pricing for every model, compared the values, and delivered a structured email via Gmail showing where prices aligned and where they diverged — including a dedicated column for the price difference on any mismatched trims.

The container ran once daily on Azure Container Instances, triggered by a Logic Apps schedule. The project was retired in 2025.
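The compare step described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the data shapes (dicts keyed by `(model, trim)`) and the function name are assumptions.

```python
# Illustrative sketch: join two price datasets keyed by (model, trim),
# compute the difference per row, and capture one-sided trims separately
# rather than silently dropping them.

def compare_prices(ford_prices, dealer_prices):
    """Return matched rows (with diffs) plus trims present on only one site.

    Both inputs map (model, trim) -> price in dollars.
    """
    matched, only_ford, only_dealer = [], [], []
    for key in sorted(set(ford_prices) | set(dealer_prices)):
        model, trim = key
        if key in ford_prices and key in dealer_prices:
            diff = dealer_prices[key] - ford_prices[key]
            matched.append({
                "model": model, "trim": trim,
                "ford": ford_prices[key], "dealer": dealer_prices[key],
                "diff": diff, "mismatch": diff != 0,
            })
        elif key in ford_prices:
            only_ford.append(key)    # on ford.ca but missing from the portal
        else:
            only_dealer.append(key)  # on the portal but missing from ford.ca
    return matched, only_ford, only_dealer
```

Returning the one-sided trims as separate lists is what lets the report distinguish "price differs" from "trim missing on one side".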
  • Dual-Site Navigation with Selenium: The scraper used Selenium to drive a headless browser against both ford.ca and fordtodealers.ca in a single run. Pricing was captured at two points on each site: the navigation menu, which surfaces prices at the model level, and the individual model page, which shows prices at the trim level. Selenium was required because pricing content on both sites is rendered by JavaScript after page load and is not present in the initial HTML response.
  • Trim-Level Price Extraction: For each model, the scraper collected the price of every available trim from each site, building a structured dataset keyed by model and trim name that could be joined for comparison.
  • Price Comparison and Diff Calculation: Once data was collected from both sites, the scraper joined the datasets by model and trim, computed the price difference for each row, and flagged any discrepancies. Trims present on one site but absent on the other were captured separately rather than silently dropped.
  • Conditional Email Subject Line: The email subject line included the word "Mismatch" only when at least one price discrepancy was detected. A subject line without "Mismatch" signalled immediately — before opening the email — that all prices were in sync for that day's run.
  • Summary Table with Anchor Links: The email opened with a summary table listing every model and its comparison status. Each model name in the table was an anchor link that jumped directly to that model's section further down in the email, allowing the recipient to scan the summary and navigate to a specific mismatch without scrolling through the full report.
  • Red Mismatch Badges: Any model with at least one price discrepancy was marked with a red badge in the summary table, making mismatches immediately visible at a glance. Models where all prices matched carried no badge — reducing visual noise and keeping the focus on the rows that needed attention.
  • Per-Model Source Links: Each model section in the email included source links to the exact pages on ford.ca and fordtodealers.ca from which the prices were scraped. The recipient could click directly to the live page to verify a price or investigate a discrepancy, without having to navigate the site manually.
  • Back-to-Top Anchor Links: Each model section ended with a "back to top" anchor link returning the reader to the summary table. On a long email covering many models, this made it practical to work through mismatches one at a time without losing context.
  • Containerised with Docker: The entire scraper — Python runtime, Selenium, Chrome driver, and dependencies — was packaged into a Docker container, making the environment fully reproducible and removing any dependency on the host machine's configuration.
  • Azure Container Instances: Hosting in Azure removed the need for a dedicated local machine to run the scraper. A locally-hosted solution would have required a machine to be on and available at the scheduled time every day — unreliable and impractical for a side project. Running on ACI meant the pipeline would execute reliably regardless of any local environment. The container spun up, executed the full scrape-compare-email pipeline in under two minutes, and exited. Because ACI bills only for the time the container is actually running, the execution cost was minimal — at most a couple of minutes of compute per day.
  • Azure Logic Apps Trigger: A Logic Apps workflow triggered the container once per day on a fixed schedule, replacing any need for an always-running cron job or dedicated scheduler.
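The conditional subject line and the badge-and-anchor summary table described in the list above could be assembled roughly like this. The markup, helper names, and row shape are illustrative assumptions, not the project's actual code.

```python
# Illustrative sketch of the email assembly: a "Mismatch" subject prefix only
# when at least one discrepancy exists, and a summary table whose model names
# are anchor links, with a red badge on models containing any mismatched trim.

def build_subject(rows, run_date):
    any_mismatch = any(r["mismatch"] for r in rows)
    prefix = "Mismatch - " if any_mismatch else ""
    return f"{prefix}Ford Price Comparison - {run_date}"

def build_summary_table(rows):
    # Collapse trim rows to one flag per model: does it have any mismatch?
    models = {}
    for r in rows:
        models[r["model"]] = models.get(r["model"], False) or r["mismatch"]
    cells = []
    for model, has_mismatch in sorted(models.items()):
        badge = (' <span style="background:red;color:white;">MISMATCH</span>'
                 if has_mismatch else "")
        # Anchor link jumps to the model's detail section further down.
        cells.append(f'<tr><td><a href="#{model}">{model}</a>{badge}</td></tr>')
    return "<table>" + "".join(cells) + "</table>"
```

Keeping the badge off matched models mirrors the design goal stated above: the summary draws the eye only to rows that need attention.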
[Screenshot: fordtodealers.ca — Homepage (dealer portal)]
Python: Core scripting language used to orchestrate the scraping logic, data comparison, and email generation end to end.
Selenium: Browser automation library used to drive a headless Chrome instance against both sites. Necessary because both ford.ca and fordtodealers.ca rely on JavaScript to render pricing content — static HTTP requests alone would not return the data needed.
Docker: The scraper was packaged as a Docker image including the Python runtime, Selenium, and a compatible Chrome/ChromeDriver binary. Containerising the scraper ensured the environment was consistent across local development and the Azure execution environment.
Azure Container Instances (ACI): Hosted the Docker container in Azure, removing the need for a dedicated local machine to be on and available at the scheduled run time every day. ACI ran the container on demand — it started when triggered, completed the full pipeline in under two minutes, and stopped. Because ACI bills only for active execution time, the cost of running the scraper daily was minimal.
Azure Logic Apps: Provided the daily schedule trigger. The Logic Apps workflow fired once per day and sent a start signal to the Azure Container Instance. No persistent scheduler or cron infrastructure was required.
Gmail SMTP: The comparison output was serialised into a structured HTML email and dispatched via Gmail's SMTP interface. The email layout was designed to be readable in a standard mail client with no additional tools required on the recipient's end.
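The Gmail dispatch step might look like the sketch below, using only the Python standard library. The addresses and the app-password environment variable are assumptions for illustration, not the project's actual values.

```python
# Illustrative sketch of the dispatch step: wrap the generated HTML report in
# a MIME message and send it through Gmail's SMTP endpoint.
import os
import smtplib
from email.mime.text import MIMEText

def build_message(subject, html_body, sender, recipient):
    msg = MIMEText(html_body, "html")  # HTML part renders in a standard client
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    return msg

def send_report(msg):
    # Gmail requires STARTTLS on port 587; for a script like this an
    # app password (here read from an assumed env var) stands in for
    # interactive login.
    with smtplib.SMTP("smtp.gmail.com", 587) as smtp:
        smtp.starttls()
        smtp.login(msg["From"], os.environ["GMAIL_APP_PASSWORD"])
        smtp.send_message(msg)
```

Separating message construction from sending keeps the HTML layout testable without touching the network.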
Ford does not expose a public API for vehicle pricing. The only way to access current pricing data from ford.ca is to navigate the site as a browser would. This made Selenium the right tool for the job, but it also meant the scraper was sensitive to structural changes on either site: if page layout, element selectors, or URL patterns changed, the extraction logic needed to be updated to match.

ford.ca and fordtodealers.ca are independent sites with different HTML structures, navigation patterns, and URL schemes. Each required its own traversal and extraction logic — assumptions that held on one site could not be carried over to the other.

The join between the two datasets relied on model and trim names being consistent. In practice there were naming inconsistencies — capitalisation differences, abbreviations, and slight variations for the same trim across the two sites. Building a reliable normalisation step was necessary to avoid false mismatches in the comparison output, where a trim would appear as "missing" on one side simply because it was named slightly differently.

Getting Selenium and Chrome to run correctly in a headless, containerised environment required matching the Chrome binary version to the ChromeDriver version and configuring Chrome with the appropriate headless flags for a no-display environment. Debugging this locally first — confirming the container ran cleanly in Docker Desktop before pushing to ACI — avoided discovering environment issues after deployment.

Configuring Logic Apps to trigger an ACI container on a schedule required setting up the correct Azure connector, scoping the managed identity permissions so Logic Apps could start the container instance, and validating that the trigger reliably reached ACI.
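A normalisation step like the one described above can be sketched as follows. The specific rules and the alias table are hypothetical: the real mappings would be driven by the inconsistencies actually observed between the two sites.

```python
# Illustrative sketch of trim-name normalisation: canonicalise names before
# joining the two datasets, so capitalisation, spacing, punctuation, and
# abbreviation differences don't surface as false "missing trim" mismatches.
import re

# Hypothetical alias table mapping abbreviations seen on one site to the
# full trim name used on the other.
ALIASES = {
    "plat": "platinum",
    "lar": "lariat",
}

def normalise(name):
    key = re.sub(r"\s+", " ", name.strip().lower())   # collapse whitespace
    key = re.sub(r"[^a-z0-9 ]", "", key)              # drop punctuation/marks
    words = [ALIASES.get(w, w) for w in key.split()]  # expand abbreviations
    return " ".join(words)
```

Joining on `normalise(name)` instead of the raw string means "F-150 Lariat" and "f150 lariat" land on the same key.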
Testing the trigger in isolation — separate from the scraper logic — made it easier to confirm the scheduling layer was working before testing the full end-to-end pipeline.

The goal was not to produce a raw data dump but to give the fordtodealers.ca team something they could act on immediately. Structuring the email to surface discrepancies clearly — with model, trim, both prices, and the difference in a single row — meant the team could open the email and know exactly which entries needed to be updated, without additional sorting or interpretation.

The scraper replaced the monthly manual cross-referencing process entirely. The fordtodealers.ca team received a daily email showing the current state of pricing across both sites, with any discrepancies immediately visible. Price differences that had previously gone undetected between monthly update cycles were surfaced the day they appeared. The containerised, event-driven architecture kept ongoing infrastructure costs minimal — ACI only ran when triggered, and Logic Apps handled the schedule with no always-on components required. The project ran reliably until it was retired in 2025, and demonstrated a practical pattern for bridging data gaps between systems that have no formal integration: when no API exists and manual reconciliation is the only alternative, a targeted scraper with a structured output pipeline can replace that process reliably.

Related projects

310Maxx Rebranding to OxfordMaxxsupport

A full rebranding of the 310Maxx tenant support platform to OxfordMaxxsupport, introducing new brand assets, enriched building pages, a self-serve account creation flow backed by an on-premises database integration, and a new FAQ section.
IT Demand Intake Application

A self-service IT demand intake application built on Power Apps, Power Automate, and SharePoint — giving the entire organisation a structured channel for submitting IT requests through a multi-stage approval workflow covering intake, team assignment, estimation, and sign-off. CI/CD through Azure DevOps, reporting through Power BI.

Transparency & Insights — Automated Data Ingestion Pipeline

An end-to-end automated ingestion pipeline for financial and non-financial data submitted by third-party property managers — built with SSIS, a custom SQL database, a two-stage validation engine, and a custom web portal — replacing a fully manual, email-driven process and connecting directly into JD Edwards.