Scrape customer reviews from Tripadvisor

This tutorial is written for the latest version of Octoparse. If you are running an older version, we strongly recommend upgrading, because the latest version is faster, easier to use, and more robust. Download and upgrade here if you haven't already done so!

In this tutorial, we will show you how to scrape customer review data from Tripadvisor: the hotel's basic information, the reviewers' names, and the customers' comments.

We have a pre-built template for scraping Tripadvisor hotel reviews. You can find it by searching on the home screen and use it directly.

If you want to build your own task, please check out the detailed steps below. To follow through, you may want to use the URL in this tutorial:

The main steps are shown in the menu on the right. [Download the sample task file here.]


1. Go to Web Page - open the targeted web page

  • Enter the URL on the home page and click Start


2. Extract the data - to extract the hotel's basic information

  • Click on the data field(s) you need, then click Extract Element data in the Tips panel


3. Create a Loop Item - to scrape information from the customer reviews

  • Scroll down the page, select the first two reviews and click to extract Text

  • Click on the Read more button in the first review section and choose Click element in the Tips panel

  • Change the XPath of Click Item to //div[@data-test-target='expand-review'] and click Apply

  • Click on your desired data field(s), such as username, review title, comment, etc., in the first review section

  • Select Element data in the Tips panel

  • Rename the data field by double-clicking on it.

  • Click on the Edit button to delete the step Extract Data 1


4. Modify the XPaths - to locate certain data fields more accurately

The auto-generated XPath is not always accurate enough, so we need to modify the XPath of some fields to fetch the data more precisely.

Take the review title as an example.

  • Click on the More button > Customize XPath

  • Change the XPath for the title to //div[@data-test-target='review-title']/a/span/span

  • Click Apply to save the settings

We have also prepared an XPath for the review content below. You can copy and paste it into Customize XPath.

Content: //div[@data-test-target="review-title"]/following-sibling::div[1]/div[1]/div[1]
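To see what the title XPath selects, here is a minimal Python sketch that runs it against a small, hypothetical piece of markup. The sample HTML is an assumption made up for illustration and is far simpler than the real Tripadvisor page. It uses the standard-library ElementTree, which supports only a subset of XPath: it needs a leading "." for a document-wide search, and it does not support the following-sibling axis used by the content XPath, so only the title is shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup (an assumption for illustration only);
# the real page structure is more complex and may change over time.
SAMPLE = """
<html><body>
  <div class="review">
    <div data-test-target="review-title"><a href="#"><span><span>Great stay</span></span></a></div>
    <div><div><div>Clean rooms and friendly staff.</div></div></div>
  </div>
  <div class="review">
    <div data-test-target="review-title"><a href="#"><span><span>Too noisy</span></span></a></div>
    <div><div><div>Street noise all night.</div></div></div>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Same path as the tutorial's title XPath, prefixed with "." because
# ElementTree only accepts paths relative to the element being searched.
TITLE_XPATH = ".//div[@data-test-target='review-title']/a/span/span"

# One matching <span> per review, so one title per review.
titles = [node.text for node in root.findall(TITLE_XPATH)]
print(titles)
```

Because the path is anchored on the data-test-target attribute rather than on positions or generated class names, it matches exactly one element per review in this sample.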


5. Clean data - to extract the rating

Tripadvisor displays each customer's rating as bubbles rather than as a number. To convert this kind of data into a number, we are going to use Clean Data. To learn more about Clean Data, please click here.

  • Click on the bubble for the rating

  • Choose OuterHTML in the Tips panel

  • Click on the More button > Customize XPath

  • Change the XPath of the rating to //div[@data-test-target="review-rating"]/span and click Apply

  • Click on the More button > Clean Data

  • Click on Add Step > Match with Regular Expression

  • Click on Try the RegEx tool!

  • Tick Start with, then enter rating bubble_ in the input box

  • Tick End with, then enter " in the input box

  • Click Generate > Apply

  • Click Add Step > Replace with Regular Expression

  • Enter ([0-9]+)([0-9]{1}) in the Regular Expression input box

  • Enter $1.$2 in the With input box

  • Click Evaluate > Confirm
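The two Clean Data steps above can be sketched in Python to show what they do to the extracted OuterHTML. This is an illustration under assumptions, not Octoparse's actual implementation: the sample markup and the function name clean_rating are hypothetical, and the lookbehind/lookahead pattern is one plausible equivalent of what the RegEx tool generates from the Start with and End with values. Note that Python writes the replacement as \1.\2 where the tutorial's With box uses $1.$2.

```python
import re

# Hypothetical OuterHTML of a rating element (assumed for illustration).
outer_html = '<span class="ui_bubble_rating bubble_45"></span>'


def clean_rating(html: str) -> str:
    """Turn a bubble class such as 'bubble_45' into the string '4.5'."""
    # Step 1 (Match with Regular Expression): keep the text that starts
    # after 'rating bubble_' and ends before the next double quote.
    match = re.search(r'(?<=rating bubble_).*?(?=")', html)
    if match is None:
        return ""  # no rating found in this snippet
    # Step 2 (Replace with Regular Expression): split off the last digit
    # and put a decimal point in front of it, e.g. '45' -> '4.5'.
    return re.sub(r'([0-9]+)([0-9]{1})', r'\1.\2', match.group(0))


print(clean_rating(outer_html))
```

The first group is greedy but must leave one digit for the second group, so a two-digit value is always split as one digit, a point, and one digit.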


6. Create pagination - to load reviews from multiple pages

  • Scroll down to click the Next button and choose Loop click next page

  • Set the AJAX timeout to 10s (to learn more about AJAX, please click here)


7. Run the task - to get your desired data


Here is the sample output for your reference.
