In some cases, you may have a list of similarly structured URLs (such as a batch of product URLs) on hand and want to extract data from them directly. In this tutorial, we will introduce an easy and powerful way to extract data from multiple web pages by using a list of URLs.
How to start a task with a list of URLs
To extract data from a list of URLs, the extraction process can generally be broken down into 3 simple steps:
You may need the links below to follow through:
In Octoparse, there are two ways to create a "List of URLs" loop. You can choose either way that is suitable for your use case.
1. Start a new task with a list of URLs
Select +New and click Custom Task to create a new task
Paste the list of URLs in the textbox and click Save
After clicking Save, the Loop URLs action (which loops through each URL in the list) is automatically created in the workflow. If you click Loop URLs, you can see that the URLs you entered have been added to the Loop Items.
After the URLs are saved, the first page will open automatically, and you can select the data on the page to extract.
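If it helps to picture what the Loop URLs action does, here is a rough Python sketch of the same idea (not code that Octoparse generates): visit each URL in the list and run the same extraction step on every page. The URLs and CSS selectors below are placeholders.

```python
# Conceptual sketch of a "Loop URLs" workflow; URLs and selectors are placeholders.
import requests
from bs4 import BeautifulSoup

urls = [
    "https://example.com/product/1",
    "https://example.com/product/2",
]

rows = []
for url in urls:  # the "Loop URLs" action: visit each URL in turn
    html = requests.get(url, timeout=30).text
    page = BeautifulSoup(html, "html.parser")
    # the same extraction step runs on every page, so all pages must share one layout
    title = page.select_one("h1")
    price = page.select_one(".price")  # hypothetical selector
    rows.append({
        "url": url,
        "title": title.get_text(strip=True) if title else None,
        "price": price.get_text(strip=True) if price else None,
    })

print(rows)
```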
2. Create a "List of URLs" loop in workflow
This applies to scenarios where you have started a task, and you can directly make a loop for URLs in the task.
Add a Loop in the workflow
Go to Loop Mode and select List of URLs. Click the edit button to paste the list of URLs. Don’t forget to click Apply to save the settings.
Add an Open Page inside the Loop Item, then tick Load URLs in the loop and click Apply to confirm
After the URLs are saved, the first page will open automatically, and you can select the data on the page to extract.
Note:
1. If Octoparse works through the pages too quickly, a page may not finish loading before the data extraction step runs, which can lead to missing or incomplete data. To avoid this, we can set up a "Wait before execution".
Click on the "Options" settings for the "Extract Data" step and set a wait time before the action is executed (2-3 seconds will usually work).
2. If you want the exported data to line up with the original URL list you entered, you can add the current page URL as a data field here:
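As a made-up illustration of why recording the page URL helps: if every extracted row carries its URL, you can re-align the output with the original list even when rows come back in a different order or some pages fail. The URLs and prices below are invented for the example.

```python
# Sketch: re-align extracted rows with the original URL list using the
# recorded page URL. All data here is made up for illustration.
original_urls = [
    "https://example.com/product/1",
    "https://example.com/product/2",
    "https://example.com/product/3",
]

# pretend these rows came out of the scrape, in a different order
extracted = [
    {"url": "https://example.com/product/3", "price": "$12"},
    {"url": "https://example.com/product/1", "price": "$8"},
]

by_url = {row["url"]: row for row in extracted}
aligned = [by_url.get(url, {"url": url, "price": None}) for url in original_urls]

for row in aligned:
    print(row)
```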
How to update the URLs in the list
After the task is created, if you want to update the URLs, you can go to the Loop Item and click on the Edit button. Check out more details here.
When should you consider scraping using a list of URLs?
Here are some cases where starting the task with a list of URLs works well.
All the URLs are under the same domain, sharing the same webpage structure (Most Important).
Example: I have a list of product URLs, and I want to start a task with a list of URLs directly to scrape updated pricing data regularly.
Some websites use infinite scrolling or a "Load more" button to load content. If you need to collect data by clicking each sub-page URL to scrape details on a deeper layer, you'll need to split the task into two: one task loads the page and scrapes the sub-page URLs, and the other uses the extracted URL list to scrape the detailed info (see the sketch after this list).
Example: Zara's search result page uses infinite scrolling to keep loading new items. If the data you need is on the item page, you need to set the scrolling times and collect enough product URLs first for the next task.
Some websites use AJAX (see Deal with AJAX) to load new content, which means that after clicking the first sub-page URL, Octoparse cannot automatically go back to the listing page and click the second product page from there. In this case, extract the detail page URLs first and then scrape the data you want with the URL list.
Some websites tend to load pages quite slowly while paginating, which might affect the data scraping of our scheduled tasks, so it's better to loop through page URLs directly to avoid the issue.
With a "List of URLs" loop, Octoparse doesn't need to handle extra steps such as "Click to paginate" or "Click Item" to enter the sub-pages. As a result, extraction is faster, especially for Cloud Extraction. Check Speed up scraping by using URL list
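For the infinite-scrolling case mentioned above, here is a rough Python sketch of the two-task split (purely illustrative, not Octoparse's internals): the first stage scrolls the listing page a set number of times and collects item URLs, and the second stage loops that URL list to scrape the detail pages. The site, scroll count, and selectors are placeholders.

```python
# Sketch of the two-task pattern: task 1 collects sub-page URLs from a listing,
# task 2 loops that URL list and scrapes detail fields. Placeholders throughout.
import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

# ---- task 1: load the listing, scroll to load more items, collect item URLs ----
driver = webdriver.Chrome()
driver.get("https://example.com/search?q=jackets")  # placeholder listing page
for _ in range(5):  # "set scrolling times"
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)
item_urls = [a.get_attribute("href")
             for a in driver.find_elements(By.CSS_SELECTOR, "a.product-link")]
driver.quit()

# ---- task 2: loop the collected URL list and scrape the detail pages ----
details = []
for url in item_urls:
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    name = soup.select_one("h1")
    details.append({"url": url, "name": name.get_text(strip=True) if name else None})
```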
More FAQs:
1. How do I know if the pages have the same structure?
If you are scraping news articles from a particular website, the article pages will most likely share the same page structure, like:
Another example is Google Maps: every business page looks like this:
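One quick, informal way to sanity-check this is to test whether the fields you plan to extract can be found with the same selectors on a couple of sample pages. The sketch below is illustrative only; the URLs and selectors are placeholders.

```python
# Quick check (illustrative only): do two pages expose the same key elements?
# If the selectors you plan to extract exist on both pages, the layouts most
# likely match well enough for a "List of URLs" task.
import requests
from bs4 import BeautifulSoup

selectors = ["h1", ".price", ".description"]  # fields you intend to extract
urls = [
    "https://example.com/product/1",
    "https://example.com/product/2",
]

for url in urls:
    page = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    present = {sel: page.select_one(sel) is not None for sel in selectors}
    print(url, present)
```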
2. Can I use URLs that do not share the same page layout?
Unfortunately, no. Only URLs that share the same page structure can be extracted with a "List of URLs" loop, because the same extraction steps run on every page. To get consistent and accurate data, make sure the pages share the same layout.
To learn more about the "List of URLs" mode, you can check out the following article: Loop Item
3. Is there a limit to the number of URLs that I can add at a time?
Yes. We suggest adding no more than 10,000 URLs if you copy and paste the URLs directly into Octoparse. However, using the Batch URL input feature, you can input up to 1 million URLs.
4. Can Octoparse automatically collect and add URLs?
Yes. Octoparse can take URLs extracted by another task: use one task to collect the URLs, then configure a second task to run on that URL list.
The Octoparse API also lets you update a task's URL list without opening the app.
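As a rough illustration of what updating a URL list over an API could look like, the sketch below uses a hypothetical base URL, endpoint, and payload, not the actual Octoparse OpenAPI; please refer to the official API documentation for the real calls and authentication flow.

```python
# Illustrative only: the base URL, endpoint path, and payload fields below are
# hypothetical placeholders, NOT the real Octoparse OpenAPI. Check the API docs.
import requests

API_BASE = "https://openapi.example.com"   # placeholder base URL
ACCESS_TOKEN = "your-access-token"          # obtained per the API documentation

new_urls = [
    "https://example.com/product/101",
    "https://example.com/product/102",
]

resp = requests.post(
    f"{API_BASE}/task/update-url-list",     # hypothetical endpoint
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json={"taskId": "your-task-id", "urls": new_urls},
    timeout=30,
)
print(resp.status_code, resp.text)
```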