Scrape search results from Google Scholar
Updated over a week ago

You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade because it is faster, easier, and more robust! Download and upgrade here if you haven't already done so!

Google Scholar is a freely accessible web search engine for scholarly literature, which makes it a convenient source of academic data to scrape.

In this tutorial, we will show you how to scrape search results from Google Scholar with Octoparse.

Before you build a crawler on your own, you may want to check out the pre-built Google Scholar template for an easier way to get data. Enter your keywords to get the data extracted within minutes!

If the template falls short of your needs and you want to build the crawler from scratch, continue with this tutorial. We will use this sample URL: https://scholar.google.com/ncr

We will search with multiple keywords and scrape each article's title, author, and description information from the search results pages.

The tutorial consists of the five steps below.


1. Create a Go to Web Page - to open the target web page

Every workflow in Octoparse starts with the web page to open first.

  • Enter the sample URL into the search bar at the top of the home screen and click Start

Check if a Go to Web Page action has been generated in your workflow. If you have more than one URL, check this article to see how Octoparse handles a list of URLs.

Now we have reached the target web page.


2. Create a Loop Item - to enter multiple keywords

If we want to search for multiple keywords on Google Scholar, we need to create a loop search action for our keyword list.

  • Click on the search box

  • Choose Enter Text

  • Choose Enter text in loop

  • Input your search term list (one search term per line)

  • Click Confirm to save the settings

To check that the steps are set up correctly, click the Loop Item and then Enter Text in the workflow, and confirm that the text is entered into the search box on the web page.

  • Click the Google Scholar search button on the web page

  • Select Click element on the Tips panel, and you will notice the Click Item action is added to the workflow

  • Open the settings of the Click Item and increase the AJAX timeout


Octoparse will automatically enter every search term in the list in the search box and click the search icon.
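
For readers who want to see what the keyword loop amounts to outside Octoparse, here is a minimal Python sketch. It assumes that a Google Scholar results page can be addressed as `https://scholar.google.com/scholar?q=<keywords>`; that URL pattern, the function name `build_search_urls`, and the sample keywords are illustrative assumptions and are not part of the Octoparse task.

```python
from urllib.parse import urlencode

# Assumed address of a Google Scholar results page (not taken from the tutorial).
SCHOLAR_SEARCH_URL = "https://scholar.google.com/scholar"


def build_search_urls(keyword_text):
    """Turn a keyword list (one search term per line) into one search URL per term."""
    urls = []
    for line in keyword_text.splitlines():
        term = line.strip()
        if not term:  # skip blank lines, since an empty term has nothing to search for
            continue
        urls.append(f"{SCHOLAR_SEARCH_URL}?{urlencode({'q': term})}")
    return urls


if __name__ == "__main__":
    sample_terms = "web scraping\n\ndata mining\n"  # hypothetical keywords
    for url in build_search_urls(sample_terms):
        print(url)
```

Like the Loop Item, the sketch handles one search term per line and produces one search per term; it only builds the URLs and does not send any requests.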


3. Auto-detect the search result page to scrape data

  • Click Auto-detect the web page data and wait for it to complete

  • Check the Paginate to scrape more pages option to see if Octoparse has detected the correct next page button

  • Uncheck Add a page scroll, as the web page does not need to be scrolled to load its results

  • Click Create workflow


Octoparse will go to each result page and scrape the data we want.

  • Turn to the Data Preview section to either rename or delete the auto-captured data fields

    • Delete unwanted data fields directly by clicking More and Delete field

    • Modify the data field names by double-clicking the headers
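
The same clean-up can also be applied to data after it has been exported. The following is a minimal Python sketch of renaming and deleting fields; the function `tidy_rows` and the field names `Field2` and `Title_URL` are hypothetical and may not match what Auto-detect produces for your task.

```python
def tidy_rows(rows, rename, drop):
    """Return new rows with the `drop` fields removed and the `rename` fields renamed."""
    tidied = []
    for row in rows:
        tidied.append({
            rename.get(name, name): value  # keep the old name unless a new one is given
            for name, value in row.items()
            if name not in drop
        })
    return tidied


if __name__ == "__main__":
    # Hypothetical scraped rows; real field names depend on what Auto-detect captured.
    rows = [{"Title": "Paper A", "Field2": "J. Doe", "Title_URL": "https://example.com/a"}]
    print(tidy_rows(rows, rename={"Field2": "Author"}, drop={"Title_URL"}))
```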


4. Set up Wait before action - to slow down the scraping speed

This step is mandatory as Google Scholar applies anti-scraping measures and may ask us to pass a reCAPTCHA test if we scrape too fast.

  • Click Extract Data action

  • Tick Wait before action in the Options tab and set the wait time to 3s

  • Click Apply to save the settings

Octoparse will wait 3 seconds every time it executes the Extract Data action.
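
The effect of Wait before action can be pictured as a fixed pause in front of every extraction. The following is a minimal Python sketch of that idea, not Octoparse's implementation; `run_with_wait`, its parameters, and the stand-in pages are hypothetical names chosen for illustration.

```python
import time


def run_with_wait(pages, extract, wait_seconds=3.0, sleep=time.sleep):
    """Call `extract` on each page, pausing `wait_seconds` before every call."""
    results = []
    for page in pages:
        sleep(wait_seconds)  # the pause comes first, mirroring "Wait before action"
        results.append(extract(page))
    return results


if __name__ == "__main__":
    # Hypothetical stand-ins: page labels instead of real pages, str.upper as the extractor.
    print(run_with_wait(["page 1", "page 2"], str.upper, wait_seconds=0.1))
```

Passing `sleep` as a parameter keeps the pause easy to replace when experimenting, for example with a function that only records the requested delays.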


5. Run the task - to get your target data

  • Click Save on the upper right to save your task

  • Click Run next to it and wait for a Run task window to pop up

  • Select Standard mode under the Run on your device section to run the task on your local device

  • Wait for the task to complete

Here is the sample output from a local run.

Tip: Local runs are great for task troubleshooting and quick runs. If you are dealing with more complicated tasks, it is recommended that you select Run in the Cloud to run the task in Octoparse's cloud-based platform for higher speed. Try out this premium feature by signing up for the 14-day free trial here. You can also schedule your task to run hourly, daily, or weekly and get data delivered to you regularly.
