Scrape customer reviews from Trustpilot
Written by Wyatt
Updated over a year ago

You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend upgrading, as the latest version is faster, easier to use, and more robust. Download and upgrade here if you haven't already done so!

In this tutorial, we will show you how to scrape customer reviews from Trustpilot.com, which is a consumer review website hosting reviews of businesses worldwide.

We will use the link below to scrape customer reviews of Bank of America:

In this case, we are going to scrape the following information: username, total number of reviews posted, location, rating, date posted, title, and review content, as shown below.

The main steps are shown in the menu on the right, and you can download the sample task file here.


1. "Go to Web Page" - to open the target webpage

  • Copy and paste the link on the Octoparse home page

  • Click on Start


2. Set up the pagination loop - to scrape data from multiple pages

  • Scroll down to the end of the page and click Next page.

  • Select Loop click on the Tips panel

  • Set the AJAX timeout to 5s (optional; 5-10s is recommended, depending on your local network speed; see the sketch below these steps)

  • Click Apply to save settings
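
For reference, here is a minimal sketch (in Python with Selenium) of what this pagination loop and the AJAX timeout correspond to if you scripted the same steps yourself. Octoparse does all of this for you; the placeholder URL is an assumption, and the button XPath is the one we set up in Step 3 below.

    # Illustration only: a pagination loop with an explicit AJAX-style wait.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    driver.get("https://www.trustpilot.com/review/<business>")  # placeholder URL

    while True:
        # ... extract the reviews on the current page here ...
        buttons = driver.find_elements(By.XPATH, '//a[@name="pagination-button-next"]')
        if not buttons:
            break  # simplified exit condition: no Next page button found
        buttons[0].click()
        # This wait plays the same role as the 5-10s AJAX timeout in Octoparse:
        # give the next page of reviews time to load before extracting again.
        WebDriverWait(driver, 10).until(EC.staleness_of(buttons[0]))
    driver.quit()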


3. Modify the XPath of Pagination

The auto-generated XPath does not work well here, so we modify the XPath of the Pagination action to make sure all the pages get scraped. (A quick check of what this XPath matches is sketched after the steps below.)

  • Click on Pagination

  • Replace the XPath with //a[@name="pagination-button-next"]

  • Click Apply to save
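
If you want to sanity-check this XPath outside Octoparse, here is a small self-contained sketch. The HTML snippet is a simplified stand-in for Trustpilot's markup, not a copy of it.

    # Quick check of what //a[@name="pagination-button-next"] selects.
    from lxml import html

    snippet = """
    <nav>
      <a name="pagination-button-1" href="?page=1">1</a>
      <a name="pagination-button-2" href="?page=2">2</a>
      <a name="pagination-button-next" href="?page=2">Next page</a>
    </nav>
    """
    tree = html.fromstring(snippet)
    next_link = tree.xpath('//a[@name="pagination-button-next"]')[0]
    print(next_link.text, "->", next_link.get("href"))  # Next page -> ?page=2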


4. Set up a loop item - to extract reviews in a loop

  • Select the first review block

    Make sure the whole review block is selected: it should be highlighted in green, with all the sub-elements, such as the title, username, and date, highlighted in red. This ensures precise positioning in the following steps.

  • Select the second review block

  • Click Text after the entire review section is selected

  • After the Loop Item is created, drag it into the Pagination loop. The workflow should look like this:


5. "Extract Data" - Select the data needed

  • Click the data needed (e.g., username) from the first block of the review section

  • Choose Text on the Tips panel

  • Do the same to scrape other information such as the review content and review title (a sketch of this loop-and-extract pattern follows)
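
For reference, the Loop Item from Step 4 plus the Extract Data actions in this step boil down to the pattern sketched below. The review-block markup and the selectors here are hypothetical placeholders, not Trustpilot's real HTML; Octoparse generates the real XPaths for you when you click the elements.

    # Illustration only: loop over review blocks and extract text fields.
    # The markup and selectors are hypothetical placeholders.
    from lxml import html

    snippet = """
    <div>
      <article><span class="username">Alice</span><h2>Great bank</h2><p>Fast service.</p></article>
      <article><span class="username">Bob</span><h2>Mixed feelings</h2><p>Long wait times.</p></article>
    </div>
    """
    page = html.fromstring(snippet)

    rows = []
    for block in page.xpath('//article'):  # the loop item: one iteration per review block
        rows.append({
            "username": block.xpath('string(.//span[@class="username"])'),
            "title": block.xpath('string(.//h2)'),
            "content": block.xpath('string(.//p)'),
        })
    print(rows)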


6. Modify data fields - to rename, delete and refine data

  • Delete unwanted fields by clicking on the More button and choosing Delete field

  • Rename the fields by double-clicking their headers

Scraping the ratings is a little more complicated, but you can follow the steps below.

  • Click on the rating info and choose OuterHtml

After extracting the HTML code from the rating, we have to change its XPath since the auto-generated one is not working properly.

  • Click More on the rating data field and choose Customize XPath

  • Click Relative XPath to the loop item and paste //img[contains(@alt,"Rated")]

  • Click Apply

  • Click Customize field - Select other attributes - alt to extract the alt attribute from the HTML code (see the sketch below)
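
In effect, these steps grab the star image whose alt text contains "Rated" and keep that alt text as the rating. The sketch below shows the same idea; the sample alt value is an assumed example and may not match Trustpilot's wording exactly.

    # Illustration only: read the rating from the star image's alt attribute.
    import re
    from lxml import html

    snippet = '<div class="rating"><img alt="Rated 4 out of 5 stars" src="stars-4.svg"></div>'
    tree = html.fromstring(snippet)

    alt_text = tree.xpath('//img[contains(@alt,"Rated")]/@alt')[0]
    match = re.search(r"Rated (\d+)", alt_text)
    rating = int(match.group(1)) if match else None
    print(alt_text, "->", rating)  # Rated 4 out of 5 stars -> 4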

You may notice the Date Posted field is shown as "X days ago", which makes it hard to know the exact date. In this case, we want it in the "year/month/day" format, so we use Customize field and Clean data to modify the extracted content.

  • Click Customize field - Extract attribute - datetime to extract the datetime attribute from the HTML code

  • Click Clean data - Add step - Reformat extracted date/time to reformat the content (see the sketch below)

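Behind the scenes, the posting date is typically stored in a machine-readable datetime attribute (an ISO-style timestamp), which is why extracting that attribute and then reformatting it gives a clean date. Here is a minimal sketch of the same reformatting, assuming an ISO-style sample value:

    # Illustration only: parse the extracted datetime attribute and print year/month/day.
    from datetime import datetime

    raw = "2021-09-07T15:04:40.000Z"  # assumed example of an extracted datetime value
    parsed = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    print(parsed.strftime("%Y/%m/%d"))  # 2021/09/07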

Tip: To learn more about data cleaning, please check out the related data cleaning tutorials.


7. "Run extraction" - Run your task and collect data

  • Click Save and Run in the top right corner

  • Select Run on your device to run the task locally, or Run in the Cloud to run it in the cloud (for premium users only)


Here is the output sample for your information:
