
How can I extract data from lists and their associated detail pages? (deep scraping)

Deep scraping uses workflows to connect multiple robots: one to gather lists of items, another to get details from subpage(s).

Written by Melissa Shires

When building your web scrapers or monitors, you'll often encounter this scenario:

  • A list page shows basic information, like product names and prices.

  • Detail pages hold the rest of the information, and you reach each one by clicking an item in the list.

This is called deep scraping: extracting data from multiple linked pages on a website. Simply put, it's how you scrape an entire website across multiple layers rather than a single web page.
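To make the pattern concrete, here is a minimal sketch of the two layers in plain Python (not Browse AI itself). The site URL and CSS selectors are hypothetical, and it assumes the requests and beautifulsoup4 packages:

```python
# Layer 1: scrape a list page for detail-page links.
# Layer 2: visit each link and pull fields from the detail page.
# "example.com" and the "a.item" / "h1" selectors are placeholders.
import requests
from bs4 import BeautifulSoup

LIST_URL = "https://example.com/category/widgets"  # hypothetical list page

list_page = BeautifulSoup(requests.get(LIST_URL).text, "html.parser")

# Collect the URL behind each repeating item on the list page.
detail_urls = [a["href"] for a in list_page.select("a.item")]

# Visit each detail page and extract the fields you care about.
for url in detail_urls:
    detail_page = BeautifulSoup(requests.get(url).text, "html.parser")
    title = detail_page.select_one("h1")
    print(url, title.get_text(strip=True) if title else "(no title)")
```

Browse AI replaces both layers with trained robots, so you don't have to write selectors by hand.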

Depending on how the website you want to scrape or monitor is structured, you can deep scrape by:

  • Category - category > sub category > page

  • Site search - search term > results > page

  • Navigation - navigate > scrape links > pages

  • Site map - scrape site map > pages

Popular use cases for deep scraping include:

  • E-commerce price and product monitoring to build a database of products and prices by category or search term.

  • Scraping and monitoring directories for lead generation.

  • Competitive monitoring to scrape entire websites.

How do I deep scrape with Browse AI?

There are two ways to set up deep scraping with Browse AI, depending on the data you want to scrape or monitor and the structure of the website.

Bulk run

  • What it does: upload a CSV of URLs or input parameters to scrape up to 500,000 pages at once.

  • When it's ideal: site search deep scraping; a single-use static dataset.

Workflow

  • What it does: connect multiple robots to automatically scrape all pages and subpages.

  • When it's ideal: monitored or 'live' data; category deep scraping.

Bulk run: How to deep scrape with bulk runs

Read our Bulk Run Guide to learn how to set up a bulk run that scrapes and monitors up to 500,000 URLs or input parameters.
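If you generate your URL list programmatically, a short Python sketch like the one below can produce the CSV for a bulk run. The column header is an assumption; it must match your robot's actual input parameter name:

```python
# Sketch: write a bulk-run CSV with one column per robot input parameter.
# "originUrl" is a placeholder header -- use your robot's real parameter name.
import csv

urls = [
    "https://example.com/products/1",
    "https://example.com/products/2",
]

with open("bulk_run.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["originUrl"])     # header row: the input parameter name
    writer.writerows([u] for u in urls)  # one row per page to scrape
```

You can then upload the resulting bulk_run.csv when setting up the bulk run.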

Workflows: How to deep scrape using workflows

A basic workflow to deep scrape involves two main steps:

  • Robot A: a robot that scrapes a list or category page to collect basic information and URLs.

  • Robot B: a robot that visits each URL to extract or scrape detailed information from individual pages.

Note that workflows let you connect as many robots as you need, depending on the structure of the data you want to scrape.

Step 1: Train Robot A to scrape a list or category page to get a list of URLs

In this first step, you'll create a robot that extracts a list of URLs. To train it (a sample of the output it produces follows these steps):

  1. Go to the category page URL.

  2. Use Capture List to select repeating items (like products, job listings, or properties). Make sure to capture the URL field that links to the detail pages.

  3. Finish, approve and name this robot.
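For a concrete picture, Robot A's captured list might look like the rows below. The field names are hypothetical and depend on what you choose to capture; the important part is that the detail-page URL is its own field, so a workflow can pass it on:

```python
# Hypothetical shape of Robot A's captured list: one row per repeating item,
# with the detail-page URL captured as a dedicated field.
robot_a_rows = [
    {"name": "Widget Pro",  "price": "$49", "url": "https://example.com/products/1"},
    {"name": "Widget Mini", "price": "$29", "url": "https://example.com/products/2"},
]
```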

Step 2: Train Robot B to extract, monitor or scrape details

  1. Create a second robot, starting from one of the detail-page URLs Robot A collected.

  2. Train the robot to scrape, structure, and monitor the data you need from the detail page.

  3. Finish, approve and name this robot.

Step 3: Connect these two robots together using workflows

Automatically feed the URLs from Robot A into Robot B by creating a workflow. (If you'd rather orchestrate this from code, see the API sketch after these steps.)

  1. Go to Workflows.

  2. Set up a workflow that feeds data from Robot A to Robot B.

  3. Schedule the workflow to run automatically at your preferred frequency.
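As an alternative to the Workflows UI, you can approximate the hand-off yourself with Browse AI's REST API by triggering a Robot B task for each URL Robot A captured. The sketch below assumes the v2 task-creation endpoint and an inputParameters request body, and the parameter name originUrl is a placeholder; confirm all of these against the current API docs:

```python
# Sketch: queue one Robot B run per detail-page URL via the Browse AI API.
# Assumes POST https://api.browse.ai/v2/robots/{robotId}/tasks accepts a JSON
# body like {"inputParameters": {...}} -- verify against the API reference.
import requests

API_KEY = "YOUR_API_KEY"        # from your Browse AI dashboard
ROBOT_B_ID = "YOUR_ROBOT_B_ID"  # the detail-page robot

# In practice these would come from Robot A's captured list.
detail_urls = [
    "https://example.com/products/1",
    "https://example.com/products/2",
]

for url in detail_urls:
    resp = requests.post(
        f"https://api.browse.ai/v2/robots/{ROBOT_B_ID}/tasks",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"inputParameters": {"originUrl": url}},
        timeout=30,
    )
    resp.raise_for_status()
    print("Queued Robot B task for", url)
```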

Common use cases for deep scraping

E-commerce product monitoring

  • Robot A collects product information from category pages.

  • Robot B visits each product page to gather specifications, reviews, and availability.

Real estate data

  • Robot A scans property listing pages for basic details.

  • Robot B visits individual property pages to collect specifications and amenities.

Lead generation of business listings

  • Robot A processes directory pages to gather business listings.

  • Robot B visits each business profile to extract contact details and services.

Best practices

  • Map out the structure of the website and its pages before you build. How many layers, and therefore how many robots, do you need?

  • Make sure to set up a monitor on one or both of your robots to keep this data up to date.
