What is JSON?
JSON (JavaScript Object Notation) is a lightweight, text-based data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. Because it is compact, many websites use it to transfer data between the server and the browser.
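To make the format concrete, here is a minimal Python sketch of parsing and generating JSON with the standard library. The JSON text and its keys (count, items, name) are made up for illustration and do not come from any real website.

```python
import json

# Hypothetical JSON text; the keys are illustrative only.
raw = '{"count": 2, "items": [{"name": "A"}, {"name": "B"}]}'

# Parse: JSON text becomes nested Python dicts and lists.
data = json.loads(raw)
names = [item["name"] for item in data["items"]]

# Generate: the Python object is serialized back to JSON text.
round_trip = json.dumps(data)
```

The same nesting of objects and lists is what appears in the browser's Preview tab when a website returns JSON.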
Why extract from JSON links?
Extracting from JSON links converts data from JSON format into a structured format faster and more safely than scraping the rendered page. It can help us:
Achieve faster data extraction without loading images and other page resources
Bypass anti-scraping restrictions on many websites
Deal with load more buttons and infinite scrolling more easily
How to use JSON extraction in Octoparse?
For demonstration purposes, we will scrape data from a job listing page on the Booking.com careers site using JSON extraction. Check out the sample URL: https://jobs.booking.com/careers?location=netherlands&query=&domain=booking.com
1. Inspect the webpage in a browser - to identify the URL containing the JSON file we need
Open the sample URL in Chrome
Right-click on the webpage and select Inspect to open the DevTools
Select Fetch/XHR from the Network tab in the DevTools
Click the clear icon to clear all the loaded information
Scroll down the job listing in the scrollable column to load more jobs and trigger new requests
Check the newly loaded requests under Fetch/XHR to see if any of them return JSON
Click on the name of a request and check its Headers info. We will see that the Content-Type under Response Headers contains JSON.
Switch to the Preview tab and see how much data we are talking about. We can see the total count is 363 for this demo.
Scroll down a bit more and compare the request URLs to find a pattern
By comparing the request URLs, we find that the parameter start= in the URL increases by 10 each time.
Copy the URL containing the JSON file (Request URL in Headers), which is https://jobs.booking.com/api/apply/v2/jobs?domain=booking.com&start=10&num=10&location=netherlands&domain=booking.com
Note: Some websites may display all the information with one JSON link, so you don't need to batch generate the URLs.
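The URL comparison described above can also be done programmatically. The following is a rough sketch using only Python's standard library. The helper name differing_params is hypothetical, and the second URL is an assumption derived from the "increases by 10" pattern rather than copied from the browser.

```python
from urllib.parse import urlsplit, parse_qs

def differing_params(url_a, url_b):
    """Return the query parameters whose values differ between two URLs."""
    qa = parse_qs(urlsplit(url_a).query)
    qb = parse_qs(urlsplit(url_b).query)
    keys = sorted(set(qa) | set(qb))
    return {k: (qa.get(k), qb.get(k)) for k in keys if qa.get(k) != qb.get(k)}

# Request URL copied from the Headers tab in the steps above.
first = ("https://jobs.booking.com/api/apply/v2/jobs?domain=booking.com"
         "&start=10&num=10&location=netherlands&domain=booking.com")

# Assumed next request, following the +10 pattern on the start parameter.
second = first.replace("start=10", "start=20")

diff = differing_params(first, second)
```

Here diff contains only the start parameter, which confirms that it is the one value to vary when generating the URL list.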
2. Batch generate JSON URL list in Octoparse - to extract from a list of JSON file links
Next, we need to batch generate the JSON URL list in Octoparse.
Open Octoparse and start a new Custom Task that batch generates input URLs
Paste the copied URL into the URL format box
Select the part of the URL that should change (here, the value after start=) and click Add Parameter
Set Initial value to 0, Every time to +10, and End value to 363, then click Confirm to save
Note: The End value changes as jobs are added or removed. Enter the actual total count you find in Chrome.
Click the Go to Web Page action and tick the JSON box in the General tab
Click Apply to save your settings
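For readers who want to see what the batch generation produces, here is a minimal Python sketch that mimics the settings above (initial value 0, step +10, end value 363). It is not how Octoparse works internally. The names URL_FORMAT and build_urls are hypothetical, and the sketch assumes the generated values stop at the last multiple of 10 that does not exceed the end value.

```python
# URL copied from Chrome, with the start value replaced by a placeholder.
URL_FORMAT = ("https://jobs.booking.com/api/apply/v2/jobs?domain=booking.com"
              "&start={start}&num=10&location=netherlands&domain=booking.com")

def build_urls(total, step=10, initial=0):
    """Build one URL per page of results: start=0, 10, 20, ... below total."""
    return [URL_FORMAT.format(start=s) for s in range(initial, total, step)]

# 363 was the total count at the time of the demo; use the current value.
urls = build_urls(363)
```

With a total of 363 and 10 results per request, this yields 37 URLs, the last one using start=360.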
3. Select the data for extraction - to get the data we need
Toggle the structure tree and select the page elements we want in the positions node
Extract data fields like name, display_job id, business unit, and location by clicking on the information and choosing Element data
Save the task and run it to get the data we need
Here is the sample data output.
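As a rough illustration of what selecting fields in the positions node amounts to, the sketch below flattens a JSON payload into rows and writes them as CSV. Everything in the sample payload is a placeholder: the key names display_job_id and business_unit are assumptions modeled on the field names mentioned above, and the values are invented, not real output from the site.

```python
import csv
import io
import json

# Hypothetical payload; key names are assumed and values are placeholders.
sample = json.loads("""
{
  "count": 2,
  "positions": [
    {"name": "Example Role A", "display_job_id": "0001",
     "business_unit": "Example Unit", "location": "Amsterdam, Netherlands"},
    {"name": "Example Role B", "display_job_id": "0002",
     "business_unit": "Example Unit", "location": "Amsterdam, Netherlands"}
  ]
}
""")

FIELDS = ["name", "display_job_id", "business_unit", "location"]

def rows_from(payload):
    """Keep only the chosen fields from each entry in the positions list."""
    return [{f: p.get(f, "") for f in FIELDS}
            for p in payload.get("positions", [])]

def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

rows = rows_from(sample)
table = to_csv(rows)
```

Missing keys become empty strings instead of raising an error, which keeps the rows aligned if some positions lack a field.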