You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade because it is faster, easier, and more robust! Download and upgrade here if you haven't already done so!
In this tutorial, we will show how to scrape customer review data from TripAdvisor. We will scrape the hotel's basic information, the reviewers' names, and the customers' comments.
We have a pre-built template for scraping TripAdvisor hotel reviews, and you can search on the home screen to use it directly.
If you want to build your own task, follow the detailed steps below. To follow along, you may want to use the URL in this tutorial:
The main steps are shown in the menu on the right. [Download the sample task file here.]
1. Go to Web Page - open the targeted web page
Enter the URL to the home page and click Start
2. Extract the data - to extract the hotel's basic information
Click on the data field(s) you need, then click Extract Element data in the Tips panel
3. Create a Loop Item - to scrape information from the customer reviews
Click on the Read more button in the first review section and choose Click element in the tips panel
Modify the XPath of Click Item as: //div[@data-test-target='expand-review'] and click Apply
Click on your desired data field(s), such as username, review title, comment, etc., in the first review section
Select Element data in the Tips panel
Rename the data field by double-clicking on it.
Click on the Edit button to delete the step Extract Data 1
4. Modify the XPaths - to locate certain data fields more accurately
Note that the auto-generated XPath is not always accurate, so we need to modify the XPaths of some fields to fetch the data more precisely.
Take the review title as an example.
Click on the More button > Customize XPath
Modify the XPath for the title as: //div[@data-test-target='review-title']/a/span/span
Click Apply to save the settings
We have also prepared an XPath for the review content below. You can copy and paste it into Customize XPath.
Content: //div[@data-test-target="review-title"]/following-sibling::div[1]/div[1]/div[1]
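To see what these two XPaths select, here is a minimal Python sketch run against a small, hypothetical piece of markup. The sample HTML and the helper name content_for are assumptions made for illustration; the real TripAdvisor page is more complex and may change. Python's built-in ElementTree supports only a subset of XPath (no following-sibling axis), so the content XPath is emulated by walking the sibling elements by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup (an assumption, not the real page source).
SAMPLE = """
<div>
  <div class="review">
    <div data-test-target="review-title"><a href="#"><span><span>Great stay</span></span></a></div>
    <div><div><div>Clean rooms and friendly staff.</div></div></div>
  </div>
  <div class="review">
    <div data-test-target="review-title"><a href="#"><span><span>Too noisy</span></span></a></div>
    <div><div><div>Street noise all night.</div></div></div>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE)

# Title XPath from the tutorial; ElementTree needs a leading "." for a relative search.
title_nodes = root.findall(".//div[@data-test-target='review-title']/a/span/span")
titles = [node.text for node in title_nodes]

# ElementTree keeps no parent pointers, so build a child -> parent map first.
parent_of = {child: parent for parent in root.iter() for child in parent}


def content_for(title_div):
    """Emulate following-sibling::div[1]/div[1]/div[1] for one title div."""
    siblings = list(parent_of[title_div])
    later = siblings[siblings.index(title_div) + 1:]
    next_div = next((s for s in later if s.tag == "div"), None)
    if next_div is None:
        return None
    target = next_div.find("div[1]/div[1]")
    return target.text if target is not None else None


title_divs = root.findall(".//div[@data-test-target='review-title']")
contents = [content_for(div) for div in title_divs]

for title, content in zip(titles, contents):
    print(title, "|", content)
```

Each title is paired with the text found in the first div that follows it, which is the relationship the content XPath relies on.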
5. Clean data - to extract the rating
As you can see, each customer's rating is displayed as bubbles on TripAdvisor. To convert this kind of data into a number, we are going to use Clean Data. To learn more about Clean Data, please click here.
Click on the bubble for the rating
Choose OuterHTML in the Tips panel
Click on the More button > Customize XPath
Modify the XPath of the rating as: //div[@data-test-target="review-rating"]/span and click Apply
Click on the More button > Clean Data
Click on Add Step > Match with Regular Expression
Click on Try the RegEx tool!
Tick Start with, then enter rating bubble_ in the input box
Tick End with, then enter " in the input box
Click Generate > Apply
Click Add Step > Replace with Regular Expression
Enter ([0-9]+)([0-9]{1}) in the Regular Expression input box
Enter $1.$2 in the With input box
Click Evaluate > Confirm
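The two Clean Data steps above can be reproduced outside Octoparse as a rough Python sketch. The sample OuterHTML string, the function name rating_from_outer_html, and the exact pattern used for the "Start with / End with" step are assumptions; the RegEx tool may generate a different but equivalent expression. Python writes the replacement groups as \1 and \2 where the tutorial uses $1 and $2.

```python
import re

# Hypothetical OuterHTML of one rating element (the class name is an assumption).
outer_html = '<span class="ui_bubble_rating bubble_45"></span>'


def rating_from_outer_html(html):
    """Turn a bubble class such as bubble_45 into the string '4.5'."""
    # Step 1 (Match with Regular Expression): keep the text that starts
    # after 'rating bubble_' and ends before the next double quote.
    match = re.search(r'(?<=rating bubble_).*?(?=")', html)
    if match is None:
        return None
    raw = match.group(0)
    # Step 2 (Replace with Regular Expression): put a decimal point
    # before the last digit, using the pattern from the tutorial.
    return re.sub(r'([0-9]+)([0-9]{1})', r'\1.\2', raw)


print(rating_from_outer_html(outer_html))
```

The first group is greedy but must leave one digit for the second group, so a two-digit value is split into its leading digit and its final digit.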
6. Create pagination - to load reviews from multiple pages
Scroll down, click the Next button, and choose Loop click next page
Set the AJAX timeout to 10s (to learn more about AJAX, please click here)
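As a conceptual illustration only, the sketch below shows the general idea behind a pagination loop with a capped wait after each click. It does not describe how Octoparse works internally; FakeReviewSite, wait_for, and scrape_all are hypothetical names, and the fake site stands in for a real browser session.

```python
import time


def wait_for(condition, timeout_s, poll_s=0.1):
    """Poll condition() until it returns True or timeout_s seconds elapse."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_s)
    return condition()


class FakeReviewSite:
    """Hypothetical in-memory stand-in for a paginated review list."""

    def __init__(self, pages):
        self._pages = pages
        self._index = 0

    def reviews(self):
        return list(self._pages[self._index])

    def has_next(self):
        return self._index < len(self._pages) - 1

    def click_next(self):
        self._index += 1


def scrape_all(site, ajax_timeout_s=10.0):
    """Collect reviews page by page, waiting up to ajax_timeout_s per click."""
    collected = []
    while True:
        collected.extend(site.reviews())
        if not site.has_next():
            break
        before = site.reviews()
        site.click_next()
        # Stop if the review list has not changed within the timeout.
        if not wait_for(lambda: site.reviews() != before, ajax_timeout_s):
            break
    return collected


site = FakeReviewSite([["review 1", "review 2"], ["review 3"], ["review 4"]])
all_reviews = scrape_all(site)
print(all_reviews)
```

The timeout caps how long the loop waits for new content after a click, so a slow or failed page load cannot stall the run indefinitely.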
7. Run the task - to get your desired data
Click Save, then click Run in the upper-right corner
Select Run on your device or Run in the cloud to run the task.
Here is the sample output for your reference.