You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade because it is faster, easier and more robust! Download and upgrade here if you haven't already done so!
DuckDuckGo is a search engine that provides instant answers based on the keywords people search for. Its focus on privacy protection has won it hundreds of millions of users, and that number is still growing. In this tutorial, we will show you how to scrape its search results in bulk with Octoparse.
To follow along with the tutorial, please use the following URL for reference:
The main steps are shown in the menu on the right and you can download the demo task file here.
1. Create a Go to Web Page - to open the target page
The target URL must be entered first to start building the scraping task.
Enter the Covid search URL into the search box at the center of the home screen
Click Start to create a new task
2. Auto-detect the web page data - to generate a workflow
Octoparse's built-in auto-detect function generates a workflow automatically. You can then modify that workflow as needed.
Click on Auto-detect web page data and wait for the detection to complete
Delete unwanted fields
Double-click on the field header to rename it
3. Modify Loop Item XPath - to locate the Load more button accurately
To make sure that additional results load correctly, the XPath used for pagination needs to be modified.
Click on Loop Item
Enter the XPath //button[@id="more-results"] in the Matching XPath box under the General tab
Click Apply
Click Loop Item 1
Paste the updated XPath //ol[@class="react-results--main"]/li[@data-layout="organic"] in the Matching XPath box under the General tab
Click Apply
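To see what these two expressions select, here is a minimal Python sketch that uses only the standard library. The markup in SAMPLE is a hand-written, simplified stand-in (an assumption, not DuckDuckGo's real page), and because ElementTree supports only a subset of XPath, the leading // is written as .// here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the results page (not the real markup).
SAMPLE = """
<html><body>
  <ol class="react-results--main">
    <li data-layout="organic"><a>Result one</a></li>
    <li data-layout="ad"><a>Sponsored</a></li>
    <li data-layout="organic"><a>Result two</a></li>
  </ol>
  <button id="more-results">More results</button>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# XPath for the load-more button used by the pagination step.
load_more = root.findall('.//button[@id="more-results"]')

# XPath for the loop items: only organic results inside the main list.
items = root.findall(
    './/ol[@class="react-results--main"]/li[@data-layout="organic"]'
)

item_texts = [li.find("a").text for li in items]
print(len(load_more), item_texts)
```

In this sample the item XPath skips the entry whose data-layout is "ad", which suggests why the attribute filter is useful for keeping only organic results in the loop.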
4. Modify XPath of data fields - to get the data correctly
Enter the updated XPaths below for the Title and Summary data fields
Title: //a[@data-testid="result-title-a"]
Summary: /article/div[3]/div[1]/span[last()]
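The sketch below illustrates how the two field XPaths could pick values out of a single loop item. It assumes that the Summary path is evaluated relative to each loop item, and the ITEM markup is a hypothetical fragment written for illustration, not the site's actual structure. ElementTree's limited XPath support means the paths are prefixed with . here.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for one loop item (an assumption for illustration).
ITEM = """
<li data-layout="organic">
  <article>
    <div>example.com</div>
    <div><h2><a data-testid="result-title-a">Example title</a></h2></div>
    <div>
      <div><span>Jan 1</span><span>Example summary text.</span></div>
    </div>
  </article>
</li>
"""

item = ET.fromstring(ITEM)

# Title: match the link by its data-testid attribute, anywhere under the item.
title = item.find('.//a[@data-testid="result-title-a"]').text

# Summary: third div under article, its first div, then the last span in it.
summary = item.find('./article/div[3]/div[1]/span[last()]').text

print(title, "|", summary)
```

The last() predicate selects the final span, so a leading date span (as in this sample) is skipped and only the summary text is returned.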
5. Modify the workflow - to extract data after all results have loaded
To avoid scraping duplicate data, it is safer to move the Extract Data loop out of the pagination loop, so that extraction runs only once all results have loaded.
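Why the position of the extraction step matters can be sketched with a small simulation. This is a simplified model, not Octoparse's actual engine: it assumes each click on the load-more button appends a new batch to the same page, so anything extracted before the last click is still on screen afterwards. The function names and batch data are hypothetical.

```python
# Each batch stands for the results revealed by one load-more click (made-up data).
batches = [["r1", "r2"], ["r3", "r4"], ["r5"]]


def extract_inside_pagination(batches):
    """Extract after every click: earlier rows are read again each time."""
    visible, rows = [], []
    for batch in batches:
        visible.extend(batch)  # the click appends results to the same page
        rows.extend(visible)   # extraction re-reads everything now visible
    return rows


def extract_after_pagination(batches):
    """Load everything first, then extract once."""
    visible = []
    for batch in batches:
        visible.extend(batch)
    return list(visible)


inside = extract_inside_pagination(batches)
after = extract_after_pagination(batches)
print(len(inside), len(after))  # prints: 11 5
```

Under this model, extracting inside the pagination loop yields 11 rows for only 5 distinct results, while extracting after it yields each result exactly once.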
6. Run the task - to get the desired data
Click the Save button first to save all the settings you have made
Then click Run to run your task either locally or in the cloud
Select Run on your device and click Standard Mode to run the task on your local device
Wait for the task to complete
Below is sample data from a local run. Excel, CSV, HTML, and JSON formats are available for export.