What is JSON?
JSON (JavaScript Object Notation) is a lightweight, text-based data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. As a result, many websites use it to send data between the server and the page efficiently.
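To make the format concrete, here is a minimal Python sketch that parses a small JSON text and then generates it again. The job records and their field names are made up for illustration and do not come from any real website.

```python
import json

# A made-up JSON text, similar in shape to what a job listing site might return.
raw = (
    '{"total": 2, "data": ['
    '{"title": "Data Engineer", "location": "Amsterdam"}, '
    '{"title": "Product Designer", "location": "London"}]}'
)

# Parsing turns the text into ordinary Python dicts and lists.
parsed = json.loads(raw)
titles = [job["title"] for job in parsed["data"]]
print(titles)

# Generating goes the other way: Python objects back to JSON text.
round_trip = json.dumps(parsed)
```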
Why extract from JSON links?
Extracting from JSON links allows faster and safer conversion of data from JSON into a structured format. It can help us:
Extract data faster, because images and other page resources are not loaded
Bypass anti-scraping restrictions on many websites
Deal with "Load more" buttons and infinite scrolling more easily
In what situations may I need to use JSON links to scrape data?
You can try JSON links when you run into any of the following issues:
The web page does not load properly in the Octoparse built-in browser
The scraped results are missing a lot of data
A cloud run does not scrape the data properly
How to use JSON extraction in Octoparse?
For demonstration purposes, we will scrape the data from a job listing page using JSON extraction. Check out the sample URL: https://jobs.booking.com/booking/jobs
1. Inspect the webpage in a regular browser - to identify the JSON URL containing data we need
Open the sample URL in Chrome
Right-click on the webpage and select Inspect to open the DevTools
Reload the web page
Select the Network tab in the DevTools
Press Ctrl+F to open the search box
Enter the title of the first job and press Enter
Several results may be shown. Click on each one and check its preview data to find out which link contains the list of items we want.
Go to the Headers tab and copy the JSON request URL
Note: Some websites may display all the information with one JSON link, so you don't need to batch generate the URLs.
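If you prefer to double-check a candidate response outside the browser, the search-and-preview steps above can be sketched in Python. This is only an illustration: the two response bodies are made-up stand-ins for text you would copy from the Network tab, and the helper names are our own.

```python
import json


def contains_text(node, needle):
    """Return True if any string value inside parsed JSON contains needle."""
    if isinstance(node, str):
        return needle.lower() in node.lower()
    if isinstance(node, dict):
        return any(contains_text(value, needle) for value in node.values())
    if isinstance(node, list):
        return any(contains_text(value, needle) for value in node)
    return False  # numbers, booleans and null cannot match


def is_json_with_text(body, needle):
    """Check that a response body is valid JSON and mentions needle."""
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        return False  # HTML pages and scripts end up here
    return contains_text(parsed, needle)


# Hypothetical response bodies copied from the Network tab.
html_body = "<html><body>Data Engineer</body></html>"
json_body = '{"data": [{"title": "Data Engineer"}, {"title": "Product Designer"}]}'

print(is_json_with_text(html_body, "Data Engineer"))  # False
print(is_json_with_text(json_body, "Data Engineer"))  # True
```

A response that passes this check is a good candidate for the JSON request URL, but you should still look at the preview to confirm it holds the full list of items.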
2. Batch generate a JSON URL list in Octoparse - to extract from a list of JSON links
As you can see from the preview, only the 10 jobs on the first page are shown. If you need all 81 jobs, you will have to get the links for the other pages as well.
By observing the request URL, you can see there is a "page" parameter in the URL:
https://jobs.booking.com/api/jobs?page=1&sortBy=relevance&descending=false&internal=false&tags1=Booking.com%20Company%20Hierarchy%7CTransport%20
We can generate the URLs for other pages by changing this parameter value.
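To show what changing this parameter produces, here is a rough Python sketch that builds the URLs for pages 1 to 9 from the request URL above. The function name and its arguments are our own; inside Octoparse, the Batch generate steps that follow do the same job.

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit, urlunsplit

REQUEST_URL = (
    "https://jobs.booking.com/api/jobs?page=1&sortBy=relevance"
    "&descending=false&internal=false"
    "&tags1=Booking.com%20Company%20Hierarchy%7CTransport%20"
)


def page_urls(url, first_page, last_page, param="page"):
    """Return one URL per page by rewriting a single query parameter."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    urls = []
    for page in range(first_page, last_page + 1):
        # Replace only the chosen parameter and keep the others unchanged.
        new_query = [(k, str(page) if k == param else v) for k, v in query]
        encoded = urlencode(new_query, quote_via=quote)
        urls.append(urlunsplit(parts._replace(query=encoded)))
    return urls


urls = page_urls(REQUEST_URL, 1, 9)
for u in urls:
    print(u)
```

With 10 jobs per page and 81 jobs in total, nine pages are enough to cover the whole list.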
Open Octoparse and start a new Custom Task
Choose Batch generate
Paste the request URL
Select the parameter that you want to change in the URL and click Add Parameter
Set the Initial value to 1, Every time to +1, and End value to 9 and click Confirm to save
Click the Go to Web Page action and tick the JSON box in the General tab
Click Apply to save your settings
Octoparse will load the link in a tree structure.
3. Select the data for extraction - to get the data we need
Expand the nodes of the structure tree until you see the data you want to scrape. In this case, you need to open the data node.
Click on all the elements you want to scrape
Choose Element data from Tips
Octoparse will create a Loop Item to automatically capture all the elements from all the jobs.
Save the task and run it to get the data
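For readers who want to see what this selection amounts to, here is a minimal Python sketch of the same idea: open the data node, loop over every job, and keep only the chosen fields. The sample response and the field names (title, location, id) are assumptions made for illustration; the real response may use different names.

```python
import json

# Made-up response with a "data" node that holds one object per job.
sample_body = """
{
  "total": 81,
  "data": [
    {"id": 1, "title": "Data Engineer", "location": "Amsterdam"},
    {"id": 2, "title": "Product Designer", "location": "London"}
  ]
}
"""


def extract_rows(body, fields, list_node="data"):
    """Return one dict per item under list_node, limited to the given fields."""
    parsed = json.loads(body)
    rows = []
    for item in parsed.get(list_node, []):  # one iteration per job
        # Missing fields become None instead of raising an error.
        rows.append({field: item.get(field) for field in fields})
    return rows


rows = extract_rows(sample_body, ["title", "location"])
for row in rows:
    print(row)
```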
Useful Tips
1. JSON links can be loaded with one of two methods: GET or POST.
You can find the correct method in the Headers tab of the request.
2. Some JSON links require additional parameters to load. You can find these parameters in the request headers.
You may need to enter these parameters into Octoparse for the link to load properly.
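To check outside Octoparse which method and parameters a link really needs, you can rebuild the request in Python. The sketch below is only an illustration: the example.com URLs, the header value, and the payload are hypothetical placeholders for what you see in the Headers tab, and nothing is sent until you call urllib.request.urlopen yourself.

```python
import json
import urllib.request


def build_request(url, method="GET", headers=None, payload=None):
    """Build (but do not send) a request with the given method and headers."""
    headers = dict(headers or {})
    data = None
    if payload is not None:
        # A POST body is sent as JSON text encoded to bytes.
        data = json.dumps(payload).encode("utf-8")
        headers.setdefault("Content-Type", "application/json")
    return urllib.request.Request(url, data=data, headers=headers, method=method)


# Hypothetical GET link that needs one extra request header.
get_req = build_request(
    "https://example.com/api/jobs?page=1",
    headers={"Accept": "application/json"},
)

# Hypothetical POST link that takes its parameters in the request body.
post_req = build_request(
    "https://example.com/api/search",
    method="POST",
    payload={"page": 1},
)

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)

# To send a request, uncomment the next two lines:
# with urllib.request.urlopen(get_req) as response:
#     print(response.read()[:200])
```

If the response only comes back correctly with certain headers or a POST body, those are the values to carry over into Octoparse.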