When we scrape product information from e-commerce websites, we usually need data not only from the search result page but also from each product's detail page. This tutorial shows how to build a custom crawler that does both.
Let's say we need to scrape blog information from Octoparse. See the sample URL below:
In this case, we want to extract the basic information of each blog from the listing page first and then go to its detail page to get the full content. There are two ways to do this.
1. Use the Auto-detect feature to create a workflow
The smart detection feature in Octoparse 8.X is more powerful than ever. We can use it to detect the data on the web page automatically, which saves us some time.
Click Auto-detect web page data in the Tips panel and wait for it to complete
A listing like the one in this example usually spans multiple pages, so chances are that we need to navigate through all of them and extract data from each one.
Click on the Check button to see if Octoparse has successfully located a next page button
Uncheck Add a page scroll and click Create workflow
Octoparse has now created a Loop Item in the workflow, which scrapes each item on the listing page. Next, we will build the steps that go to the detail pages.
Select Select subpage URL
Now Octoparse has taken us to the detail page, where we can extract the additional information we want.
Click on any web element you want to extract
Click Text from the Tips panel
Modify the data field names in the Data Preview section by double-clicking on the field header
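For readers who think in code, the workflow built above (a Loop Item over the listing, a click into each subpage, then text extraction) can be sketched in Python. This is only an illustration of the logic, not how Octoparse works internally. The `PAGES` dictionary, its URLs, and the class and function names are all hypothetical stand-ins for a real site.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "site": one listing page and two detail pages.
PAGES = {
    "/blog": '<ul><li><a href="/blog/a">Post A</a></li>'
             '<li><a href="/blog/b">Post B</a></li></ul>',
    "/blog/a": "<article><h1>Post A</h1><p>Full text of A.</p></article>",
    "/blog/b": "<article><h1>Post B</h1><p>Full text of B.</p></article>",
}


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs, playing the role of the Loop Item."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


class ParagraphText(HTMLParser):
    """Collects the text inside <p> elements of a detail page."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.parts.append(data)


def crawl(listing_url, fetch=PAGES.get):
    """Visit the listing, then open each linked detail page in turn."""
    collector = LinkCollector()
    collector.feed(fetch(listing_url))
    rows = []
    for href, title in collector.links:   # one iteration per list item
        detail = ParagraphText()
        detail.feed(fetch(href))          # the "click" into the subpage
        rows.append({
            "title": title,
            "url": href,
            "content": " ".join(detail.parts).strip(),
        })
    return rows


rows = crawl("/blog")
for row in rows:
    print(row)
```

Each dictionary in `rows` corresponds to one row in the Data Preview: fields from the listing page sit next to fields taken from that item's detail page.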
2. Manually create the workflow
If the auto-detect function fails for some websites, we can also set up the workflow manually. See the steps below:
Select the first item on the list page
Continue selecting the second item
Click Text
A Loop Item has now been added to the workflow, but only one field has been scraped. We can add other fields.
Repeat the steps above to add more fields
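As a rough code analogy, adding more fields to a Loop Item means reading several values from every repeated element instead of one. The sketch below assumes a small, well-formed listing snippet. The markup, the class names, and the `FIELDS` mapping are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing markup: every repeated "post" block is one list item.
LISTING = """
<div>
  <div class="post">
    <h2>Post A</h2><span class="date">2021-01-05</span>
    <p class="summary">About A.</p>
  </div>
  <div class="post">
    <h2>Post B</h2><span class="date">2021-02-10</span>
    <p class="summary">About B.</p>
  </div>
</div>
"""

# Field name -> path relative to each list item (one entry per added field).
FIELDS = {
    "title": "h2",
    "date": "span[@class='date']",
    "summary": "p[@class='summary']",
}


def extract_items(markup):
    """Return one dict per list item, with one key per configured field."""
    root = ET.fromstring(markup.strip())
    items = []
    for node in root.findall(".//div[@class='post']"):
        items.append({
            name: (node.findtext(path) or "").strip()
            for name, path in FIELDS.items()
        })
    return items


items = extract_items(LISTING)
for item in items:
    print(item)
```

Adding another field is then a matter of adding one more entry to `FIELDS`, much like repeating the select-and-extract steps for one more element.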
Next, we need to add an action that clicks the title link of each item to open its detail page.
Select the first title on the list page
Click Click element
Once we are taken to the detail page, we can extract the information we need from it.
Click on any web element you want to extract
Click Text from the Tips panel
Modify the data field names in the Data Preview section by double-clicking on the field header
Click on Loop Item to go back to the listing page
Click on the next page button and select Loop click to create a pagination step, so the task can go through all the pages
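The pagination step amounts to a loop that keeps following the next page link until there is none left. Here is a minimal sketch of that idea, assuming a hypothetical three-page listing held in memory. The `PAGES` data, the URLs, and the `max_pages` safety limit are assumptions for illustration, not Octoparse settings.

```python
# Hypothetical paginated listing: each page has items and an optional next link.
PAGES = {
    "/blog?page=1": {"items": ["Post A", "Post B"], "next": "/blog?page=2"},
    "/blog?page=2": {"items": ["Post C", "Post D"], "next": "/blog?page=3"},
    "/blog?page=3": {"items": ["Post E"], "next": None},
}


def crawl_all_pages(start_url, fetch=PAGES.get, max_pages=100):
    """Follow the next page link until it is missing or a page repeats."""
    url = start_url
    seen = set()
    items = []
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = fetch(url)
        if page is None:              # page could not be loaded
            break
        items.extend(page["items"])   # scrape the current page
        url = page.get("next")        # the "click" on the next page button
    return items


all_items = crawl_all_pages("/blog?page=1")
print(all_items)
```

The `seen` set and `max_pages` guard against a next page button that keeps pointing at a page already visited, which would otherwise loop forever.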
The final workflow should look like this:
Note: If the website uses infinite scrolling to load more items, you can add a scroll step manually. To do this, click the "+" button in the workflow and select Loop. Then switch the loop mode to Scroll Page and click Apply.
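Conceptually, a scroll loop keeps scrolling until a scroll no longer loads anything new. The sketch below illustrates that stopping rule with a hypothetical `FakeFeed` class that stands in for an infinitely scrolling page. Neither the class nor the `max_scrolls` limit comes from Octoparse.

```python
class FakeFeed:
    """Hypothetical stand-in for a page that loads more items on each scroll."""

    def __init__(self, total, batch):
        self.total = total
        self.batch = batch
        self.loaded = min(batch, total)   # items visible before any scroll

    def scroll(self):
        self.loaded = min(self.loaded + self.batch, self.total)

    def visible_items(self):
        return [f"item-{i}" for i in range(1, self.loaded + 1)]


def scroll_until_done(page, max_scrolls=50):
    """Scroll repeatedly; stop when a scroll adds no new items."""
    count = len(page.visible_items())
    scrolls = 0
    while scrolls < max_scrolls:
        page.scroll()
        scrolls += 1
        new_count = len(page.visible_items())
        if new_count == count:        # nothing new appeared, so we are done
            break
        count = new_count
    return page.visible_items(), scrolls


items, scrolls = scroll_until_done(FakeFeed(total=25, batch=10))
print(len(items), scrolls)
```

The last scroll is the one that confirms nothing more is loading, so the loop always performs one scroll beyond what was needed to reveal every item.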