Scraping data from a search engine is a good way to collect information about a single topic. In this tutorial, we will show you how to scrape search result data from Google Search.
You can go to "Task Templates" on the Octoparse home screen and start with the ready-to-use Google Search Template to save time. With this template, you do not need to configure a scraping task yourself. For further details, see: Task Templates
If you want to build your own task as a Custom Task, you can use this tutorial as a reference. We will scrape the title, URL, and description of each result on the search results page with Octoparse.
You may need this link to follow through: https://www.google.com/
1. Create a Go to Web Page - to open the target website
Enter the URL on the home page and click Start
2. Enter text - to start the search
Click the search box and then choose Enter text on the tips panel
Enter the keywords you need to search for in Textbox 1
This is what the workflow looks like:
If you want to search for a list of keywords, choose Enter text in the loop
A Loop Item with an Enter Text inside it will be created in the workflow:
If you need a click to submit the search, you can add it under the Enter text action
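The keyword loop above can also be pictured outside Octoparse. The following is a minimal Python sketch of the same idea: one search per keyword in a list. It does not show how Octoparse works internally. The function names and the keyword list are hypothetical, and the "/search?q=" URL pattern is an assumption about how Google Search addresses a query.

```python
from urllib.parse import urlencode

# Assumed search endpoint; the tutorial itself only gives https://www.google.com/
BASE_URL = "https://www.google.com/search"


def build_search_url(keyword):
    """Return a search URL for one keyword, with the query safely encoded."""
    return f"{BASE_URL}?{urlencode({'q': keyword})}"


def build_search_urls(keywords):
    """Mirror the Loop Item: produce one URL per keyword, in order."""
    return [build_search_url(keyword) for keyword in keywords]


if __name__ == "__main__":
    # Hypothetical keyword list, standing in for the list entered in the loop.
    for url in build_search_urls(["web scraping", "octoparse tutorial"]):
        print(url)
```

Each entry in the list plays the role of one pass through the Loop Item, with the Enter Text action replaced by URL encoding of the keyword.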
3. Auto-detect the web page - to scrape the search result page
Choose Auto-detect the page data
Untick Add a page scroll and choose Create workflow
Double-click to rename the fields or delete the fields you don't want
Tips!
If the auto-detect function scrapes several fields you don't want, it is more convenient to switch to the vertical view to delete them in batch.
4. Modify element XPaths - to locate the elements accurately
Click Loop Item and then enter //h1[contains(text(),'Page Navigation')]/following-sibling::a[1] under Matching XPath.
Click Loop item1 and then enter //H3[@class='LC20lb MBeuO DKV0Md']/../../../../../../.. under Matching XPath. Remember to click Apply in both settings.
Click Extract data
Change to the Vertical View
Enter the XPaths for the fields you need
Here are some examples:
Title: //H3[1]
Title_URL: //div[@class='yuRUbf']//a[1]
Description: /div/div[2]
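To see how these XPaths relate to each other, here is a small Python sketch that applies the same pattern to a made-up page: first find each result container by climbing up from its title with "..", then read the fields relative to that container. The sample markup is hypothetical and far simpler than a real Google results page, so it climbs four levels rather than seven. It uses Python's built-in ElementTree, which supports only a subset of XPath and is case-sensitive, so tag names are written in lowercase and relative paths start with a dot.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a search results page.
SAMPLE_PAGE = """
<div id="results">
  <div class="result">
    <div>
      <div class="yuRUbf">
        <a href="https://example.com/one"><h3 class="LC20lb MBeuO DKV0Md">First result</h3></a>
      </div>
      <div>Description of the first result.</div>
    </div>
  </div>
  <div class="result">
    <div>
      <div class="yuRUbf">
        <a href="https://example.com/two"><h3 class="LC20lb MBeuO DKV0Md">Second result</h3></a>
      </div>
      <div>Description of the second result.</div>
    </div>
  </div>
</div>
"""


def extract_rows(page_xml):
    """Return one dict per result with Title, Title_URL, and Description."""
    root = ET.fromstring(page_xml)
    # Loop items: start at each title and climb to the enclosing result block.
    items = root.findall(".//h3[@class='LC20lb MBeuO DKV0Md']/../../../..")
    rows = []
    for item in items:
        # Field XPaths are evaluated relative to the current loop item.
        title = item.find(".//h3[1]")
        link = item.find(".//div[@class='yuRUbf']//a[1]")
        description = item.find("./div/div[2]")
        rows.append({
            "Title": title.text,
            "Title_URL": link.get("href"),
            "Description": description.text,
        })
    return rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE_PAGE):
        print(row)
```

The point of the sketch is the split between the loop XPath, which selects one block per result, and the field XPaths, which only need to be correct inside that block.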
Tips!
Check out more details about XPath here: What is XPath and how to use it in Octoparse
5. Add a page scroll manually
The load more button only appears after you scroll down the page a little.
Click + and choose Loop to create a page scroll
Click Loop Item 2 and choose Scroll Page in the loop mode
Set the scroll to go to the bottom of the page and set the number of repeats to 5
Click Apply
6. Set up wait time - to slow down the scraping speed
Google Search uses anti-scraping measures and may show a reCAPTCHA challenge when pages are requested too quickly. We need to slow down the scraping by setting a wait time.
Click on Extract Data action
Select Options
Tick Wait before action
Set the wait time to 1s-3s and click Apply to confirm
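For readers who want to picture what a 1s-3s wait amounts to, here is a minimal Python sketch of a randomized pause before an action. It is an illustration of the general idea only, under the assumption that the wait is a random value within the chosen range. How Octoparse picks the actual delay is not described in this tutorial. The function name and its parameters are hypothetical.

```python
import random
import time


def wait_before_action(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Pause for a random time within the range and return the chosen delay.

    The sleep parameter is injectable so the pause can be skipped in tests.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


if __name__ == "__main__":
    # Hypothetical usage: pause before each of three extraction passes.
    for page in range(1, 4):
        waited = wait_before_action()
        print(f"page {page}: waited {waited:.2f}s before extracting")
```

A varying delay spaces out page loads less regularly than a fixed one, which is the usual reason for choosing a range instead of a single value.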
7. Run the task - to get your target data
Click Save
Click Run on the upper left side
Select a run mode: either on your device or in the Cloud (for premium users only)
Here is the sample output.