
Scrape data from JSON links

What is JSON? Why and how to use JSON in Octoparse?

Updated over 11 months ago

What is JSON?

JSON (JavaScript Object Notation) is a lightweight text-based data-interchange format. It is not only easy for humans to read and write but also easy for machines to parse and generate. As a result, it is widely used by websites to improve network transmission efficiency.
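To make the format concrete, here is a minimal Python sketch that parses a small JSON text and turns it back into a string. The job record in it is invented for illustration and does not come from any real website.

```python
import json

# A made-up JSON document; the keys and values are illustrative only.
raw = '{"title": "Data Engineer", "remote": false, "tags": ["python", "sql"], "salary": null}'

# json.loads turns JSON text into plain Python objects:
# object -> dict, array -> list, true/false -> bool, null -> None.
job = json.loads(raw)

# json.dumps goes the other way and produces JSON text again.
text_again = json.dumps(job)

print(job["title"], job["tags"])
```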

Why extract from JSON links?

Extracting from JSON links offers a faster and safer way to convert JSON data into a structured format. It can help you:

  1. Achieve faster data extraction, because images and other page resources are not loaded

  2. Bypass anti-scraping restrictions on many websites

  3. Deal with "Load more" buttons and infinite scrolling more easily

In what situations may I need to use JSON links to scrape data?

You can try to use the JSON links when the following issues are encountered:

  1. The web page does not load properly in the Octoparse built-in browser

  2. The scraping results are missing a lot of data

  3. A cloud run does not scrape the data properly

How to use JSON extraction in Octoparse?

For demonstration purposes, we will scrape the data from a job listing page using JSON extraction. Check out the sample URL: https://jobs.booking.com/booking/jobs


1. Inspect the webpage in a regular browser - to identify the JSON URL containing data we need

  • Open the sample URL in Chrome

  • Right-click on the webpage and select Inspect to open the DevTools

  • Reload the web page

  • Select the Network tab in the DevTools

  • Press Ctrl+F to open the search box

  • Enter the title of the first job and press Enter

Several results may be shown. Click each one and check its preview data to find the link that contains the list of items you want.

  • Go to the Headers tab and copy the JSON request URL


Note: Some websites may display all the information with one JSON link, so you don't need to batch generate the URLs.

2. Batch generate JSON URL list in Octoparse - to extract from a list of JSON file links

As you can see from the preview, only the 10 jobs on the first page are shown. If you need all 81 jobs, you will have to get the links for the other pages.

By observing the request URL, you can see there is a "page" parameter in the URL:


https://jobs.booking.com/api/jobs?page=1&sortBy=relevance&descending=false&internal=false&tags1=Booking.com%20Company%20Hierarchy%7CTransport%20

We can generate the URLs for other pages by changing this parameter value.
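Outside Octoparse, the same idea can be sketched in a few lines of Python. This is only an illustration of how the page parameter drives the URL list: the template is the request URL above with the page number replaced by a placeholder, and the function name `build_page_urls` is hypothetical. The page count assumes 81 jobs at 10 per page, as seen in the preview.

```python
import math

# Request URL copied from the article, with the page number swapped for a placeholder.
URL_TEMPLATE = (
    "https://jobs.booking.com/api/jobs?page={page}&sortBy=relevance"
    "&descending=false&internal=false"
    "&tags1=Booking.com%20Company%20Hierarchy%7CTransport%20"
)

def build_page_urls(first_page, last_page):
    """Return one URL per page number, from first_page to last_page inclusive."""
    return [URL_TEMPLATE.format(page=n) for n in range(first_page, last_page + 1)]

# 81 jobs at 10 jobs per page need 9 pages (the last page is partly filled).
last_page = math.ceil(81 / 10)
urls = build_page_urls(1, last_page)

for url in urls:
    print(url)
```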

  • Open Octoparse and start a new Custom Task

  • Choose Batch generate

  • Paste the request URL

  • Select the parameter that you want to change in the URL and click Add Parameter

  • Set the Initial value to 1, Every time to +1, and End value to 9 (81 jobs at 10 per page means 9 pages), then click Confirm to save

  • Click the Go to Web Page action and tick the JSON box in the General tab

  • Click Apply to save your settings

Octoparse will load the link in a tree structure.


3. Select the data for extraction - to get the data we need

  • Expand the nodes of the structure tree until you see the data you want to scrape. In this case, you need to open the data node.

  • Click on all the elements you want to scrape

  • Choose Element data from Tips

Octoparse will create a Loop Item to automatically capture all the elements from all the jobs.

  • Save the task and run it to get the data
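Conceptually, the Loop Item walks over every entry under the data node and pulls out the same fields from each one. The Python sketch below imitates that on a tiny made-up response. The field names (`title`, `location`) and the helper `extract_rows` are assumptions for illustration; the real response of the sample site may use different keys.

```python
import json

# Invented response with the same overall shape as described above:
# a top-level "data" node holding one object per job.
sample_response = """
{
  "totalCount": 2,
  "data": [
    {"title": "Data Engineer", "location": "Amsterdam"},
    {"title": "Product Manager"}
  ]
}
"""

def extract_rows(payload, fields):
    """Return one row (dict) per item under "data", keeping only the given fields."""
    rows = []
    for item in payload.get("data", []):
        # A field missing from an item becomes None, similar to an empty cell.
        rows.append({field: item.get(field) for field in fields})
    return rows

rows = extract_rows(json.loads(sample_response), ["title", "location"])
for row in rows:
    print(row)
```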

Useful Tips

1. A JSON link is loaded with one of two HTTP methods: GET or POST.

You can find the correct method in the Headers tab, under Request Method.
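The difference between the two methods can be seen in a small Python sketch that builds, but does not send, one request of each kind with the standard library. The `example.com` URLs, the payload, and the helper name `build_request` are placeholders, not values taken from the sample site.

```python
import json
import urllib.request

def build_request(url, method="GET", payload=None):
    """Build (without sending) a GET or POST request for a JSON endpoint."""
    data = None
    headers = {}
    if method == "POST" and payload is not None:
        # POST carries its parameters in the request body rather than in the URL.
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=data, headers=headers, method=method)

# GET: parameters travel in the URL query string.
get_req = build_request("https://example.com/api/jobs?page=1")

# POST: the same parameter travels in the JSON body.
post_req = build_request("https://example.com/api/jobs", method="POST", payload={"page": 1})

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

Passing either object to `urllib.request.urlopen` would actually send it; that step is left out so the sketch runs without network access.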

2. Some JSON links require additional parameters to load. You can find the parameters in the Request Headers section of the Headers tab.

You may need to enter these parameters in Octoparse if the link does not load properly without them.
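If you copy the request headers from DevTools as plain text, a short script can turn them into name/value pairs that are easier to review before entering them anywhere. This is a rough sketch: the header names and values below are illustrative examples, not the headers the sample site actually requires, and `parse_header_block` is a hypothetical helper.

```python
def parse_header_block(text):
    """Turn 'name: value' lines copied from DevTools into a dict."""
    headers = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, HTTP/2 pseudo-headers (":authority" etc.), and malformed lines.
        if not line or line.startswith(":") or ":" not in line:
            continue
        # Split on the first colon only, so values such as URLs stay intact.
        name, value = line.split(":", 1)
        headers[name.strip()] = value.strip()
    return headers

# Illustrative text, as it might look when copied from the Request Headers section.
copied = """
:authority: example.com
accept: application/json
referer: https://example.com/jobs
x-requested-with: XMLHttpRequest
"""

headers = parse_header_block(copied)
for name, value in headers.items():
    print(f"{name} = {value}")
```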
