Scrape business information from Yelp

You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade because it is faster, easier, and more robust! Download and upgrade here if you haven't already done so!

Yelp is one of the largest business directory websites on the Internet. This tutorial will show you how to collect business information on Yelp.

To scrape Yelp, you can either use our ready-to-use Task Template, available on the home page, or follow this tutorial to build the task from scratch.



To demonstrate, we will use this URL as an example: https://www.yelp.com/search?find_desc=&find_loc=Seattle%2C+WA&ns=1
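If you want to search a different city, the example URL can be rebuilt from its query parameters. The snippet below is a minimal sketch using Python's standard library; the function name `yelp_search_url` is hypothetical, and the parameter names (`find_desc`, `find_loc`, `ns`) are simply read off the example URL above rather than taken from any Yelp documentation.

```python
from urllib.parse import urlencode


def yelp_search_url(location, description=""):
    """Build a Yelp search URL shaped like the example in this tutorial.

    The parameter names are copied from the example URL; they are an
    assumption about how Yelp's search page works, not a documented API.
    """
    params = {"find_desc": description, "find_loc": location, "ns": 1}
    # urlencode escapes the comma as %2C and the space as +
    return "https://www.yelp.com/search?" + urlencode(params)


print(yelp_search_url("Seattle, WA"))
# https://www.yelp.com/search?find_desc=&find_loc=Seattle%2C+WA&ns=1
```

The resulting string can be pasted into Octoparse in step 1 in place of the example URL.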

The main steps are listed below. A demo task file is also available for download.


1. Go to Web Page - to open the target webpage

  • Paste the URL and click Start


2. Auto-detect webpage - to create the workflow

  • Select Auto-detect web page data

  • Wait for the detection to be complete

  • Untick Add a page scroll and then select Create workflow

  • Go to Data Preview and delete the fields you do not need


3. Select subpage URL - to go into detail pages

  • Choose Select subpage URL

  • Select the Title URL from the drop-down menu (you can confirm if it's the correct link in Data Preview)

  • Click Confirm

You will notice a Click URLs in the list action is created in the workflow.


4. Adjust settings for Pagination

Yelp loads its search results with AJAX, so the pagination step needs an AJAX timeout. The auto-generated XPath for Pagination and Loop Item does not always work reliably, so it may also need to be modified.

  • Click Click to Paginate and set Timeout to 10s

  • Click Apply to save the changes



5. Extract Data - to get data from the detail pages

  • Select the information on the web page

  • Click Text

  • Repeat the steps to extract all the data you need

Some business pages do not display a phone number or a website address, so the position of each piece of information varies from page to page. To make each field always locate the correct information, we need to modify its XPath.

  • Switch to Vertical View

  • Double-click the XPath of a field and paste in the matching XPath from the list below


We have prepared some useful XPaths for Yelp pages.

  • Business Website: //p[text()='Business website']/following-sibling::p[1]

  • Phone Number: //p[text()='Phone number']/following-sibling::p[1]

  • Address: //address

  • Business Owner: //p[text()='Business Owner']/../preceding-sibling::p[1]
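To see why anchoring on the label text is more robust than relying on position, here is a minimal sketch in Python. It uses only the standard library, whose XPath support does not include the `following-sibling` axis, so the helper `value_after_label` (a hypothetical name) mimics the logic of `//p[text()='…']/following-sibling::p[1]` by hand. The markup is a simplified, made-up stand-in for a business page, not real Yelp HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup (real Yelp pages are more complex).
PAGE_WITH_WEBSITE = """
<div>
  <div><p>Business website</p><p>example.com</p></div>
  <div><p>Phone number</p><p>(206) 555-0100</p></div>
</div>
"""
PAGE_WITHOUT_WEBSITE = """
<div>
  <div><p>Phone number</p><p>(206) 555-0199</p></div>
</div>
"""


def value_after_label(root, label):
    """Mimic //p[text()='<label>']/following-sibling::p[1]."""
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children):
            if child.tag == "p" and (child.text or "") == label:
                # Return the first <p> sibling that follows the label.
                for sibling in children[i + 1:]:
                    if sibling.tag == "p":
                        return sibling.text
    return None  # label not present on this page


for page in (PAGE_WITH_WEBSITE, PAGE_WITHOUT_WEBSITE):
    root = ET.fromstring(page)
    print(value_after_label(root, "Phone number"),
          value_after_label(root, "Business website"))
```

On the second page the phone number has moved to the first block, yet the label-based lookup still finds it, and the missing website comes back empty instead of picking up the wrong value.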


6. Set up wait time - to control scraping speed

Yelp uses anti-scraping measures and may block your IP address if you scrape too fast. To avoid this, slow the task down by setting a wait time.

  • Select Extract Data in the workflow - go to the Options section - tick Wait before action and set it to 10s

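The idea behind Wait before action is simply a fixed pause ahead of every extraction. The sketch below illustrates that idea in plain Python; it is not how Octoparse is implemented, and `visit_with_delay` and `handle` are hypothetical names chosen for the example.

```python
import time


def visit_with_delay(urls, handle, wait_seconds=10.0, sleep=time.sleep):
    """Call handle(url) for each URL, pausing before every call.

    `sleep` is injectable so the pause can be replaced in tests.
    """
    results = []
    for url in urls:
        sleep(wait_seconds)  # the "wait before action" pause
        results.append(handle(url))
    return results
```

With the default of 10 seconds, 100 detail pages take at least 1,000 seconds (about 17 minutes) of waiting alone, which is the trade-off for a lower risk of being blocked.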

The final workflow is shown below. Once everything is in place, you can run the task.

[Image: final workflow]

7. Run task - to get the data

  • Run the task in the top right corner

  • Select Run task on your device to run the task on your local device, or select Run task in the cloud to run the task on the Cloud (for premium users only)

Here is the sample output:

[Image: sample output]