Scrape data from DuckDuckGo search results

You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade, as the latest version is faster, easier to use, and more robust. Download and upgrade here if you haven't already done so!

DuckDuckGo is a search engine that provides instant answers to users' search queries. Its focus on privacy protection has won it hundreds of millions of users, and that number is still growing. To extract the information you want in batches, this tutorial shows you how to scrape search results from the site with Octoparse.

target.jpg

To follow along with this tutorial, please use the following URL for reference:

The main steps are shown in the menu on the right and you can download the demo task file here.


1. Create a Go to Web Page step - to open the target page

Enter the target URL first to start the scraping task.

  • Enter the Covid search URL into the search box at the center of the home screen

  • Click Start to create a new task
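If you prefer to assemble the search URL for a keyword programmatically, a minimal sketch with Python's standard library looks like this (the keyword "covid" is just the example search term used in this tutorial; any term works, and the exact query parameters DuckDuckGo accepts beyond `q` are not covered here):

```python
from urllib.parse import urlencode

def build_ddg_url(keyword: str) -> str:
    """Build a DuckDuckGo search-results URL for a given keyword."""
    # DuckDuckGo takes the search term in the "q" query parameter.
    return "https://duckduckgo.com/?" + urlencode({"q": keyword})

print(build_ddg_url("covid"))  # https://duckduckgo.com/?q=covid
```

`urlencode` handles spaces and special characters in the keyword, so multi-word searches are escaped safely.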


2. Auto-detect the web page data - to generate a workflow

Octoparse's built-in auto-detect feature can generate a workflow automatically. You can then modify the workflow as needed.

  • Click on Auto-detect web page data and wait for the detection to complete

detec.jpg
  • Click Create workflow

  • Delete unwanted fields

  • Double-click on the field header to rename it


3. Modify Loop Item XPath - to locate the Load more button accurately

To make sure the Load more button is clicked correctly, it is important to modify the XPath for the pagination loop.

  • Click on Loop Item

  • Input the XPath //button[@id="more-results"] into the Matching XPath box under the General settings

  • Click Apply

  • Click Loop Item 1

  • Paste the updated XPath //ol[@class="react-results--main"]/li[@data-layout="organic"] into the Matching XPath box under the General tab

  • Click Apply
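To see what these two XPaths actually match, here is a small sketch using Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. The HTML below is a simplified mock of DuckDuckGo's result markup, not the real page (the live page is JavaScript-rendered, which is why Octoparse's built-in browser is used rather than plain HTTP requests):

```python
import xml.etree.ElementTree as ET

# Simplified mock of DuckDuckGo's result markup (assumed structure for
# illustration only; the real page is more complex).
html = """
<div>
  <ol class="react-results--main">
    <li data-layout="organic"><article>Result 1</article></li>
    <li data-layout="ad"><article>Sponsored</article></li>
    <li data-layout="organic"><article>Result 2</article></li>
  </ol>
  <button id="more-results">More results</button>
</div>
"""
root = ET.fromstring(html)

# Pagination XPath from the tutorial: //button[@id="more-results"]
buttons = root.findall('.//button[@id="more-results"]')
print(len(buttons))  # 1

# Loop-item XPath: //ol[@class="react-results--main"]/li[@data-layout="organic"]
items = root.findall('.//ol[@class="react-results--main"]/li[@data-layout="organic"]')
print(len(items))  # 2
```

Note how the `[@data-layout="organic"]` predicate filters out the sponsored row, so the loop iterates over organic results only.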


4. Modify XPath of data fields - to get the data correctly

  • Switch to Vertical View

  • Input the updated XPaths below for the Title and Summary data fields

Title: //a[@data-testid="result-title-a"]

Summary: /article/div[3]/div[1]/span[last()]
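The Summary XPath is evaluated relative to each loop item, which is why it starts from /article rather than //. The sketch below illustrates both field XPaths against a mock result item (an assumed, simplified structure; the live page's markup differs and is handled inside Octoparse's browser):

```python
import xml.etree.ElementTree as ET

# Mock of a single result item, shaped to match the tutorial's XPaths
# (assumption: div[3]/div[1] holds the summary spans, the last span
# being the summary text).
item_html = """
<li data-layout="organic">
  <article>
    <div>icon</div>
    <div><a data-testid="result-title-a">Example title</a></div>
    <div>
      <div><span>date</span><span>Example summary text</span></div>
    </div>
  </article>
</li>
"""
item = ET.fromstring(item_html)

# Title XPath from the tutorial: //a[@data-testid="result-title-a"]
title = item.find('.//a[@data-testid="result-title-a"]')
print(title.text)  # Example title

# Summary XPath, relative to the loop item: /article/div[3]/div[1]/span[last()]
summary = item.find('./article/div[3]/div[1]/span[last()]')
print(summary.text)  # Example summary text
```

The `span[last()]` step is what skips the leading date span and picks the summary text itself.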


5. Modify the workflow - to extract data after all results loaded

To avoid scraping duplicate data, it is safer to move the extraction loop out of the pagination loop, so data is extracted only after all results have loaded.

  • Drag Loop Item 1 out of the pagination loop and place it below the pagination step


6. Run the task - to get the desired data

  • Click the Save button first to save all the settings you have made

  • Then click Run to run your task either locally or in the cloud

  • Select Run on your device and click Standard Mode to run the task on your local device

  • Wait for the task to complete

Below is sample data from a local run. The data can be exported in Excel, CSV, HTML, or JSON format.
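Once exported, the data is straightforward to work with downstream. Here is a minimal sketch of what the CSV and JSON exports look like, using hypothetical rows with the Title and Summary fields from this tutorial (the exact column layout of a real export depends on the fields you kept in step 2):

```python
import csv
import io
import json

# Hypothetical rows mirroring the tutorial's renamed fields.
rows = [
    {"Title": "Result 1", "Summary": "First summary"},
    {"Title": "Result 2", "Summary": "Second summary"},
]

# CSV-style export: one header row, then one row per result.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["Title", "Summary"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue().splitlines()[0])  # Title,Summary

# JSON-style export: a list of objects with the same fields.
print(json.dumps(rows, indent=2))
```

Reading an exported file back is the same pattern in reverse, with `csv.DictReader` or `json.load`.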

data_overview.jpg