You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade because it is faster, easier, and more robust! Download and upgrade here if you haven't already done so!
Google Scholar provides a simple way to search broadly for scholarly literature. Because it is a freely accessible web search engine, it is a convenient source of academic data to scrape.
In this tutorial, we will show you how to scrape search results from Google Scholar with Octoparse.
Before you build a crawler on your own, you may want to check out the pre-built Google Scholar template for an easier way to get data. Enter your keywords to get the data extracted within minutes!
If the template falls short of your needs and you want to build the crawler from scratch, continue with this tutorial. We will use this sample URL: https://scholar.google.com/ncr
We will search with multiple keywords and scrape each article's title, author, and description information from the search results pages.
1. Create a Go to Web Page - to open the target web page
Every workflow in Octoparse starts by telling Octoparse which web page to start from.
Enter the sample URL into the search bar at the top of the home screen and click Start
Check if a Go to Web Page action has been generated in your workflow. If you have more than one URL, check this article to see how Octoparse handles a list of URLs.
Now we have reached the target web page.
2. Create a Loop Item - to enter multiple keywords
If we want to search for multiple keywords on Google Scholar, we need to create a loop search action for our keyword list.
Click on the search box
Choose Enter Text
Choose Enter text in loop
Input your search term list (one search term per line)
Click Confirm to save the settings
To check that the steps are set up correctly, click the Loop Item and then Enter Text in the workflow, and confirm that the text is entered into the search box on the web page.
Click the Google Scholar search button on the web page
Select Click element on the Tips panel, and you will notice the Click Item action is added to the workflow
Open the settings of the Click Item action and increase the AJAX timeout
Octoparse will automatically enter every search term in the list in the search box and click the search icon.
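Conceptually, the Loop Item behaves like a simple for-loop over the keyword list. The Python sketch below is not Octoparse code and is not how Octoparse works internally; it only illustrates the idea by turning a one-term-per-line list into one search URL per keyword. The `/scholar?q=` search path and the helper name `build_search_urls` are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Assumed Google Scholar search endpoint (not taken from the tutorial).
BASE_URL = "https://scholar.google.com/scholar"


def build_search_urls(keyword_text):
    """Turn a one-term-per-line keyword list into a list of search URLs."""
    # Strip whitespace and skip blank lines, like a cleaned-up keyword list.
    keywords = [line.strip() for line in keyword_text.splitlines() if line.strip()]
    # urlencode escapes spaces and special characters in each search term.
    return [f"{BASE_URL}?{urlencode({'q': keyword})}" for keyword in keywords]


if __name__ == "__main__":
    for url in build_search_urls("web scraping\ndata mining\n"):
        print(url)
```

Each URL corresponds to one pass through the loop: one keyword entered, one search submitted.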
3. Auto-detect the search result page to scrape data
Click Auto-detect the web page data and wait for it to complete
Check the Paginate to scrape more pages option to confirm that Octoparse has detected the correct next-page button
Uncheck Add a page scroll, since the web page does not need to be scrolled to load its results
Click Create workflow
Octoparse will go to each result page and scrape the data we want.
Turn to the Data Preview section to either rename or delete the auto-captured data fields
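If you prefer to tidy the fields after exporting instead, the same rename-and-delete cleanup can be sketched in a few lines of Python. This is only an illustration under assumptions: the field names (`Title`, `Field2`, `Field3`) and the helper `tidy_rows` are hypothetical, and the actual names in your export depend on what auto-detection captured.

```python
def tidy_rows(rows, rename, drop):
    """Rename and drop fields in a list of row dictionaries.

    rows   -- list of dicts, one per scraped search result
    rename -- mapping of old field name -> new field name
    drop   -- collection of (old) field names to delete
    """
    tidy = []
    for row in rows:
        tidy.append(
            {rename.get(name, name): value
             for name, value in row.items()
             if name not in drop}
        )
    return tidy


if __name__ == "__main__":
    # Hypothetical auto-generated field names, for illustration only.
    sample = [{"Title": "A paper", "Field2": "J. Doe - 2020", "Field3": "Cited by 5"}]
    print(tidy_rows(sample, rename={"Field2": "Author"}, drop={"Field3"}))
```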
4. Set up Wait before action - to slow down the scraping speed
This step is mandatory as Google Scholar applies anti-scraping measures and may ask us to pass a reCAPTCHA test if we scrape too fast.
Click the Extract Data action
Tick Wait before action in the Options tab and set the wait time to 3s
Click Apply to save the settings
Octoparse will wait 3 seconds every time it executes the Extract Data action.
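The effect of Wait before action can be pictured as a fixed pause placed in front of every extraction. The sketch below is a generic Python illustration of that pacing pattern, not Octoparse's implementation; `run_with_wait` and its parameters are hypothetical names, and the `sleep` argument exists only so the pause can be replaced when testing.

```python
import time


def run_with_wait(actions, wait_seconds=3.0, sleep=time.sleep):
    """Run each action in order, pausing wait_seconds before each one."""
    results = []
    for action in actions:
        sleep(wait_seconds)       # pause first, then act
        results.append(action())  # e.g. extract data from one result page
    return results


if __name__ == "__main__":
    # Two stand-in "extract" actions; a short wait keeps the demo quick.
    pages = [lambda: "page 1 data", lambda: "page 2 data"]
    print(run_with_wait(pages, wait_seconds=0.1))
```

With a 3-second wait, a run over N result pages takes at least 3 × N seconds, which is the trade-off for a lower chance of triggering the reCAPTCHA check.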
5. Run the task - to get your target data
Click Save on the upper right to save your task
Click Run next to it and wait for a Run task window to pop up
Select Standard mode under the Run on your device section to run the task on your local device
Wait for the task to complete
Here is the sample output from a local run.
Tip: Local runs are great for task troubleshooting and quick runs. If you are dealing with more complicated tasks, it is recommended that you select Run in the Cloud to run the task in Octoparse's cloud-based platform for higher speed. Try out this premium feature by signing up for the 14-day free trial here. You can also schedule your task to run hourly, daily, or weekly and get data delivered to you regularly.