You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend upgrading, as the latest version is faster, easier to use, and more robust. Download and upgrade here if you haven't already done so!
Bol is a leading e-commerce platform in the Netherlands. As the strongest retail brand in the country, the website attracts large numbers of both shoppers and merchants.
In this tutorial, we will scrape product info from Bol with Octoparse; the resulting data can be useful to both buyers and sellers. We use the AirPods search results page as an example.
To follow along with the tutorial, please use the following URL for reference:
The main steps are shown in the menu on the right, and you can download the sample task file here.
1. Create a Go to Web Page - to open the target website
To start scraping, first enter the URL of the target website.
Enter the search URL into the search box at the center of the home screen, then click Start to create a new task
Click Alles accepteren ("Accept all") to accept the cookies
Click "Click button" on the Tips panel to finish the cookie settings
2. Set up Pagination Loop - to scrape data from multiple pages
Click > at the bottom of the page
Then click Loop click to set the pagination
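Conceptually, a pagination loop keeps clicking the next-page button and extracting data until no such button is left. The following is only a minimal sketch of that idea, not Octoparse's implementation: the `PAGES` dictionary is a hypothetical in-memory stand-in for the search result pages, and `scrape_all` is an illustrative name.

```python
# Hypothetical in-memory "site": each page lists some items and names the
# page reached by clicking ">" (None means there is no next-page button).
PAGES = {
    "page1": {"items": ["AirPods A", "AirPods B"], "next": "page2"},
    "page2": {"items": ["AirPods C"], "next": "page3"},
    "page3": {"items": ["AirPods D"], "next": None},
}


def scrape_all(start):
    """Collect items page by page until there is no next page."""
    results = []
    visited = set()  # guards against a loop of pages pointing at each other
    current = start
    while current is not None and current not in visited:
        visited.add(current)
        page = PAGES[current]
        results.extend(page["items"])  # "extract data" step
        current = page["next"]         # "click next page" step
    return results


all_items = scrape_all("page1")
```

The loop ends on the last page because its next-page entry is `None`, which mirrors a page where the > button is no longer present.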
3. Start auto-detection - to generate a workflow
Octoparse's built-in auto-detect function can generate a workflow quickly, which you can then modify as needed.
Click Auto-detect web page data and wait for the detection to complete
Delete unwanted fields
Untick Click on a "Load More" button because there is no load more button on this page
Click Create workflow
The workflow is then generated as shown below:
4. Modify XPath for Pagination - to locate the next page button accurately
For the pagination to work reliably, the XPath must locate the next-page button accurately.
Click Pagination in the workflow
Choose General settings
Input //a[@aria-label='volgende']
Click Apply to apply the settings
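To see what this XPath selects, here is a small sketch using Python's standard library. The markup in `SAMPLE` is a made-up approximation of a pagination bar (the real Bol page structure, and the "vorige" label on the previous-page link, are assumptions), and `find_next_links` is an illustrative helper. ElementTree supports only a subset of XPath, so the expression is written as `.//a[...]` rather than `//a[...]`.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination markup; only the aria-label='volgende' attribute
# is taken from the tutorial, everything else is assumed for illustration.
SAMPLE = """
<div>
  <a aria-label="vorige" href="/page/1">&lt;</a>
  <a href="/page/2">2</a>
  <a aria-label="volgende" href="/page/3">&gt;</a>
</div>
"""


def find_next_links(markup):
    """Return all <a> elements whose aria-label is exactly 'volgende'."""
    root = ET.fromstring(markup.strip())
    # './/a' is ElementTree's form of '//a': any <a> below the root.
    return root.findall(".//a[@aria-label='volgende']")


next_hrefs = [a.get("href") for a in find_next_links(SAMPLE)]
```

Because the expression matches on the `aria-label` attribute rather than on the link's position or visible text, it keeps pointing at the next-page link even when the number of page links changes.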
5. Clean data - to get the correct format of the number
As shown in the data preview, the price extracted from the page is missing its ".", so we add a clean data step to fix it.
Click on More (...) of the Price column
Click Clean Data
Click Add Step
Click Trim spaces
Click Trim Both to trim the spaces both before and after the number
Click Confirm
Click Add Step again
Click Replace with Regular Expression
Enter \n in the Regular Expression column
Enter . in the With column
Click Confirm to save the settings
Click Apply to apply the formula
NOTE: The RegEx entered here replaces each line break (\n) with ".". For more tutorials on RegEx, please check here.
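The two clean data steps can be approximated outside Octoparse as follows. This is a sketch under assumptions: the raw value `"  179\n99 "` is a made-up example of a price whose euros and cents were extracted on separate lines, and `clean_price` is an illustrative name. Python's `str.strip()` removes all leading and trailing whitespace, which may differ slightly from Octoparse's Trim spaces option.

```python
import re


def clean_price(raw):
    """Trim surrounding whitespace, then replace line breaks with '.'."""
    trimmed = raw.strip()              # step 1: trim both ends
    return re.sub(r"\n", ".", trimmed)  # step 2: replace \n with "."


# Hypothetical raw value as it might appear in the data preview.
cleaned = clean_price("  179\n99 ")
```

Trimming first matters: a stray line break at either end of the raw value would otherwise also be turned into a ".".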
6. Run the task - to get the desired data
Click the Save button first to save all the settings you have made
Then click Run to run your task either locally or in the cloud
Below is sample data from a local run. Excel, CSV, HTML, and JSON formats are available for export.
TIP: Local runs are great for quick runs and small amounts of data. For more complicated tasks or large volumes of data, Run in the Cloud is recommended for higher speed. You are welcome to try this premium feature by signing up for the 14-day free trial here. Tasks can be scheduled hourly, daily, or weekly, with data delivered regularly.