Coles Web Scraper Breakdown: 6 Powerful Strategies I Used in My Capstone Project for Scraping Pricing Data

An end-to-end breakdown of how I built a Coles web scraper for my academic capstone project to collect pricing data across all product categories, covering how I navigated anti-bot mechanisms like Incapsula with rotating proxies, human-like behavior simulation, API-oriented data parsing, and smart session control.

This blog is written solely for educational purposes, to share my personal experience and technical learnings from a postgraduate capstone project conducted at an Australian university. It documents my exploration of publicly visible web structures to understand how dynamic product pricing is rendered on retail websites like Coles.com.au.

The techniques discussed are not intended for misuse, unauthorized automation, or commercial scraping. All scraping was conducted ethically, with respect to Coles’ system integrity, robots.txt directives, and Terms & Conditions. No private APIs, authentication systems, or secured endpoints were accessed or bypassed. No scraped data is shared or redistributed. The sole purpose of this blog is to showcase technical expertise in web automation and reflect on responsible web data access in an academic context.

Coles is a registered trademark of Coles Group Limited. This blog is independently written and is not affiliated with or endorsed by Coles Group Limited.

Ethics Comes First: Foundation of My Scraping Journey

Before diving into code or scraping techniques, I focused on understanding the legal and ethical framework of the task. This project was part of my university capstone, where the objective was to gather real-time pricing data from Australian supermarkets like Coles and Woolworths to build a price comparison dataset for analysis.

Web scraping offers immense analytical potential, but without ethical consideration it can easily cross into unauthorized territory. Ignoring a site’s terms, overwhelming servers, or accessing private data can result in legal consequences, blocked IPs, or damage to your reputation as a developer. Scraping modern e-commerce platforms is also no longer as simple as sending requests and parsing HTML, especially with retailers like Coles, who deploy sophisticated web application firewalls such as Incapsula to actively detect and deter bots. I wanted to approach this challenge ethically, stealthily, and scalably, respecting both the site’s functionality and the project’s requirements.

In practice, that meant respecting robots.txt, avoiding overloading the server, and using only publicly visible traffic. This blog shares my journey not just in technical problem-solving, but in practicing ethical web scraping.

Ditching HTML Parsing: API Scraping for Stability

Instead of relying on traditional HTML scraping, which is fragile and heavily dependent on UI changes, this project uses the JSON endpoints that Coles’ React frontend calls to render its pages. These endpoints return structured JSON responses, which makes data extraction more efficient and accurate. They are not publicly documented but are visible in client-side browser traffic.

A critical component was the build_id, a dynamic value embedded in a <script> tag with the ID __NEXT_DATA__, which the script parsed with JavaScript inside the browser to construct valid API endpoints.

Here’s the overall flow:

  1. Launch a stealth browser using Selenium Wire + undetected_chromedriver with Smartproxy.
  2. Set location cookies and store ID, mimicking what a real user would select for click-and-collect.
  3. Extract the build ID by parsing the HTML with JavaScript inside the browser.
  4. Query the category discovery endpoint to dynamically retrieve all available product categories.
  5. Loop through each category’s paginated product API and extract relevant fields like name, price, promotion, and unit size.

Because these endpoints return structured JSON, the pipeline is far more stable and scalable than DOM scraping.
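
To make the first three steps concrete, here is a minimal sketch of how such a stealth session could be launched, assuming Selenium Wire’s undetected_chromedriver integration and selenium-stealth. The proxy gateway, credentials, and option values are placeholders, not the exact configuration used in the project.

# Sketch only: stealth browser via Selenium Wire + undetected_chromedriver.
# Proxy host, port, and credentials below are placeholders.
import json
from seleniumwire import undetected_chromedriver as uc
from selenium_stealth import stealth

proxy_url = "http://USER:PASS@dc.smartproxy.com:10001"  # placeholder gateway
seleniumwire_options = {"proxy": {"http": proxy_url, "https": proxy_url}}

options = uc.ChromeOptions()
options.add_argument("--window-size=1366,768")

driver = uc.Chrome(options=options, seleniumwire_options=seleniumwire_options)

# Mask common fingerprint properties (vendor, platform, WebGL renderer).
stealth(
    driver,
    languages=["en-AU", "en"],
    vendor="Google Inc.",
    platform="Win32",
    webgl_vendor="Intel Inc.",
    renderer="Intel Iris OpenGL Engine",
    fix_hairline=True,
)

driver.get("https://www.coles.com.au/")

# Step 3: read the build ID from the __NEXT_DATA__ script tag.
next_data = driver.execute_script(
    "return document.getElementById('__NEXT_DATA__').textContent;"
)
build_id = json.loads(next_data).get("buildId")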

Tools I Used to Create the Coles Web Scraper Script

Building a robust and stealthy web scraper required more than just writing scripts. It involved choosing the right tools to handle anti-bot measures, maintain modularity, and ensure data integrity. Below is the tech stack that powered my Coles web scraping tool:

  • Selenium Wire + undetected_chromedriver: Simulates a real browser and captures network traffic
  • Smartproxy (datacenter): Rotates IPs per session using a session ID
  • Selenium Stealth: Spoofs browser fingerprints (vendor, WebGL, etc.)
  • MongoDB: Stores scraped data in a structured format
  • Python libraries (requests, random, configparser, etc.): Control timing, logic, and config loading

All configurations like proxy credentials and category mappings were managed through external config files to keep the script modular and reusable.
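
For illustration, here is a sketch of how such a config file might be loaded with configparser. The file name and the Smartproxy section are assumptions; only the Coles section mirrors what appears later in this post.

# Sketch: load proxy and store settings from an external INI file.
# The file name and the [Smartproxy] section/keys are illustrative.
import configparser

config = configparser.ConfigParser()
config.read("scraper.ini")

proxy_user = config.get("Smartproxy", "Username")
proxy_pass = config.get("Smartproxy", "Password")
fulfillment_store_id = config.get("Coles", "FulfillmentStoreId", fallback="0357")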

Navigating Bot Detection and Anti-Scraping Ethically

Coles uses Incapsula, a sophisticated web application firewall designed to detect and block bots and scraping automation. To responsibly work around these defences, I implemented proxy rotation along with other techniques that mimicked real user behavior while staying within ethical boundaries:

Proxy Rotation: Each session used a randomly generated session ID with a different Smartproxy port. This helped reduce the chance of being flagged based on IP.
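
A hedged sketch of that idea follows; the session-in-username convention, gateway host, and port range are assumptions and may not match Smartproxy’s exact credential format.

# Sketch: build a per-session proxy URL with a random session ID.
# Gateway host, port range, and username format are assumptions.
import random
import string

def new_proxy_url(user: str, password: str) -> str:
    session_id = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
    port = random.randint(10001, 10100)  # assumed datacenter port range
    return f"http://{user}-session-{session_id}:{password}@dc.smartproxy.com:{port}"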

User-Agent Spoofing: The script imitated real browsers using a modern Chrome user-agent string.

Browser Fingerprint Masking: JavaScript-visible properties like navigator.vendor, WebGLRenderer, and navigator.webdriver were altered to resemble a genuine user.

Randomized Window Size: Simulated different screen sizes to avoid uniform patterns.

Human-Like Timing: The script incorporated random delays between actions to mimic how a real person would browse the site.
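
In practice this can be as simple as the helper below (the delay bounds are illustrative):

import random
import time

def human_pause(low: float = 2.0, high: float = 6.0) -> None:
    # Sleep for a random interval between actions; bounds are illustrative.
    time.sleep(random.uniform(low, high))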

CAPTCHA Awareness: If Incapsula issued a CAPTCHA, the script paused and prompted for manual CAPTCHA resolution and renewed the session before continuing.

[Screenshot: Coles web scraper CAPTCHA detection and wait logic using Selenium]
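
In place of the screenshot, here is a simplified sketch of what that detection-and-pause logic can look like; the page-source keywords used to spot an Incapsula challenge are assumptions and vary in practice.

# Sketch: pause for manual solving when a challenge page is detected.
# The keyword checks are heuristics; Incapsula markup varies.
def handle_captcha(driver) -> bool:
    page = driver.page_source.lower()
    if "incapsula" in page or "captcha" in page:
        input("CAPTCHA detected - solve it in the browser, then press Enter...")
        return True  # caller renews the session before continuing
    return False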

Working with Coles API for Web Scraper Setup

Instead of clicking through product pages, I leveraged Coles’ frontend-facing APIs that return JSON data. This approach is faster, cleaner, and insulated from UI changes.

  1. The script first discovered available categories by querying Coles’ browse API.
  2. Then, for each category, it looped through pages until no more products were returned.
  3. All data was structured, clean, and easy to parse – thanks to JSON responses.
  4. Cookies and headers were preserved to simulate an authentic shopping session.

The API endpoints depended on a dynamic build_id, which the script extracted directly from a JSON blob embedded in the homepage.

Category Discovery API

This API returns a list of available product categories for the selected location. It is dynamically generated based on store and location preferences (set via cookies). This structure was inferred from publicly available browser network traffic, for academic understanding only.

Store, location, and click-and-collect method are set via cookies:

# Read the store ID from the config file and attach location cookies to the requests session
fulfillment_store_id = config.get('Coles', 'FulfillmentStoreId', fallback='0357')
session.cookies.set("fulfillmentStoreId", fulfillment_store_id, domain=".coles.com.au")
session.cookies.set("shopping-method", "clickAndCollect", domain=".coles.com.au")

Endpoint:

https://www.coles.com.au/_next/data/{build_id}/en/browse.json

Purpose:
Used to extract main product categories (Level 1 menus like Bakery, Meat, Frozen, etc.) and their slugs.

Example Response Structure:

{
  "pageProps": {
    "allProductCategories": {
      "catalogGroupView": [
        {
          "level": 1,
          "type": "CATALOG",
          "name": "Bakery",
          "seoToken": "bakery",
          "productCount": 120
        }
      ]
    }
  }
}

Filtering Logic:

Only categories meeting all of the following are considered valid for scraping:

  • level == 1
  • type == "CATALOG"
  • productCount > 0

Code Snippet:

categories = {}
for cat in data.get("pageProps", {}).get("allProductCategories", {}).get("catalogGroupView", []):
    if cat.get("level") == 1 and cat.get("type") == "CATALOG" and cat.get("productCount", 0) > 0:
        categories[cat["name"]] = cat["seoToken"]

Product Listing API (per category)

Coles serves product data through a paginated JSON feed used by its React frontend. The feed behaves like a RESTful API, powers the client interface, and returns products for the click-and-collect location and store ID set in the cookies.

Endpoint:

https://www.coles.com.au/_next/data/{build_id}/en/browse/{category}.json?page={page}&slug={category}

Purpose:

Used to fetch product listings for a given category and page number.

Pagination:

Start with page = 1 and keep increasing until no products are returned.
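
A simplified sketch of that pagination loop, assuming a requests-style session that already carries the cookies and headers captured from the browser:

# Sketch: page through one category until the API returns no more products.
# Assumes `session` holds the browser's cookies/headers and `build_id` is known.
def fetch_category(session, build_id: str, category: str) -> list:
    products, page = [], 1
    while True:
        url = (f"https://www.coles.com.au/_next/data/{build_id}/en/browse/"
               f"{category}.json?page={page}&slug={category}")
        response = session.get(url, timeout=30)
        response.raise_for_status()
        results = (response.json().get("pageProps", {})
                                  .get("searchResults", {})
                                  .get("results", []))
        if not results:
            break
        products.extend(results)
        page += 1
    return products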

Example Response Structure:

{
  "pageProps": {
    "searchResults": {
      "results": [
        {
          "_type": "PRODUCT",
          "id": "123456",
          "name": "Coles Full Cream Milk 2L",
          "pricing": {
            "now": 3.10,
            "was": 3.50,
            "comparable": 1.55
          },
          "merchandiseHeir": {
            "category": "Dairy"
          }
        }
      ]
    }
  }
}

Parsing Logic:

Only items with _type == "PRODUCT" are processed.

Extract fields like:

  • id, name
  • pricing.now, pricing.was, pricing.comparable
  • category from merchandiseHeir

Code Snippet:

# `results` is the product list from pageProps.searchResults.results
extracted = []
for item in results:
    if item.get("_type") != "PRODUCT":
        continue

    pricing = item.get("pricing", {})
    merchandise = item.get("merchandiseHeir", {})

    extracted.append({
        "product_code": item.get("id"),
        "item_name": item.get("name"),
        "best_price": pricing.get("now"),
        "item_price": pricing.get("was"),
        "unit_price": pricing.get("comparable"),
        "category": merchandise.get("category", "Unknown")
    })

Where the build_id Comes From

Both APIs require a build_id, a dynamic identifier that Coles’ Next.js frontend uses to version its data routes. This value is extracted from the page HTML via:

Code Snippet:

json_data = driver.execute_script("return document.getElementById('__NEXT_DATA__').textContent;")
build_id = json.loads(json_data).get("buildId")

If the build_id isn’t present or changes format, scraping will fail—so it’s always extracted dynamically at runtime.

Data Format and Structuring the Output

Once the data was extracted through API calls, I organized it into a clean and analyzable format before storing it in MongoDB. Each product entry captured the following fields:

  • Product ID and name
  • Current and previous prices
  • Unit size and comparable pricing
  • Category and timestamp of scraping
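
As an illustration, here is a minimal sketch of writing such entries to MongoDB with pymongo; the connection string, database, and collection names are placeholders, and `extracted` is the list built in the parsing snippet above.

# Sketch: persist extracted product entries (names below are placeholders).
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["coles_capstone"]["products"]

for entry in extracted:
    entry["scraped_at"] = datetime.now(timezone.utc)  # timestamp of scraping

if extracted:
    collection.insert_many(extracted)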

Final Notes and Conclusion

Though the Coles web scraper worked effectively and captured complete pricing data across all categories, occasional manual CAPTCHA solving was necessary, especially at session start or after a proxy switch, because the project used datacenter proxies. Running the scraper with residential ISP proxies (which cost more than datacenter proxies) did not trigger any CAPTCHA challenges.

This was more than just a scraping task—it was an exercise in ethical automation, bot evasion, and system resilience. From understanding Incapsula defenses to managing sessions across proxy IPs, it brought together everything I’ve learned in scraping and automation.


Note: All data was stored locally for private academic analysis and **not published, redistributed, or monetized**.

This article reflects personal academic research and does not endorse or promote unauthorized scraping. All findings are based on browser-visible data as of 6 June 2025. All activities:

  • Respected Coles’ robots.txt rules
  • Did not access login-only areas or secure APIs
  • Used visible client-side traffic only
  • Did not flood or overload any server
  • Never redistributed collected data

Under Australian laws, particularly the *Copyright Act 1968 (Cth)* and *Criminal Code Act 1995 (Division 477)*, no unauthorized access, circumvention, or distribution of protected material occurred.
