In some cases, you may have a list of similarly structured URLs (such as a batch of product URLs) on hand and want to extract data from them directly. In this tutorial, we will introduce an easy and powerful way to extract data from multiple web pages by using a list of URLs.
How to start a task with a list of URLs
To extract data from a list of URLs, the extraction process can generally be broken down into 3 simple steps:
You may need the links below to follow through:
In Octoparse, there are two ways to create a "List of URLs" loop. You can choose either way that is suitable for your use case.
1. Start a new task with a list of URLs
Select +New and click Custom Task to create a new task
Paste the list of URLs in the textbox and click Save
After clicking Save, the Loop URLs action (which loops through each URL in the list) is automatically created in the workflow. If you click Loop URLs, you can see that the URLs you entered have been added to the Loop Items.
After the URLs are saved, the first page will open automatically, and you can select the data on the page to extract.
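If it helps to picture what the Loop URLs action does, here is a rough Python sketch of the same idea (not code that Octoparse generates): visit each URL in the list and run the same extraction step on every page. The URLs and CSS selectors below are placeholders.

```python
# Conceptual sketch of a "Loop URLs" workflow; URLs and selectors are placeholders.
import requests
from bs4 import BeautifulSoup

urls = [
    "https://example.com/product/1",
    "https://example.com/product/2",
]

rows = []
for url in urls:  # the "Loop URLs" action: visit each URL in turn
    html = requests.get(url, timeout=30).text
    page = BeautifulSoup(html, "html.parser")
    # the same extraction step runs on every page, so all pages must share one layout
    title = page.select_one("h1")
    price = page.select_one(".price")  # hypothetical selector
    rows.append({
        "url": url,
        "title": title.get_text(strip=True) if title else None,
        "price": price.get_text(strip=True) if price else None,
    })

print(rows)
```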
2. Create a "List of URLs" loop in workflow
This applies to scenarios where you have started a task, and you can directly make a loop for URLs in the task.
Add a Loop in the workflow
Go to Loop Mode and select List of URLs. Click the edit button to paste the list of URLs. Don’t forget to click Apply to save the settings.
Add an Open Page inside the Loop Item, then tick Load URLs in the loop and click Apply to confirm
After the URLs are saved, the first page will open automatically, and you can select the data on the page to extract.
Note:
1. If Octoparse works through the pages too quickly, a page may not finish loading before the data extraction step runs, which can lead to missing or incomplete data. To avoid this, we can set up a "Wait before execution".
Click on the "Options" settings for the "Extract Data" step and set a wait time before the action is executed (2-3 seconds will usually work).
2. If you want the exported data to line up with the original URL list you entered, you can add the current page URL as a data field here:
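As a made-up illustration of why recording the page URL helps: if every extracted row carries its URL, you can re-align the output with the original list even when rows come back in a different order or some pages fail. The URLs and prices below are invented for the example.

```python
# Sketch: re-align extracted rows with the original URL list using the
# recorded page URL. All data here is made up for illustration.
original_urls = [
    "https://example.com/product/1",
    "https://example.com/product/2",
    "https://example.com/product/3",
]

# pretend these rows came out of the scrape, in a different order
extracted = [
    {"url": "https://example.com/product/3", "price": "$12"},
    {"url": "https://example.com/product/1", "price": "$8"},
]

by_url = {row["url"]: row for row in extracted}
aligned = [by_url.get(url, {"url": url, "price": None}) for url in original_urls]

for row in aligned:
    print(row)
```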
How to update the URLs in the list
After the task is created, if you want to update the URLs, you can go to the Loop Item and click on the Edit button. Check out more details here.
When should you consider scraping using a list of URLs?
Here are some cases where starting the task with a list of URLs works well.
All the URLs are under the same domain, sharing the same webpage structure (Most Important).
Example: I have a list of product URLs, and I want to start a task with a list of URLs directly to scrape updated pricing data regularly.
Some websites use infinite scrolling or a "Load more" button to load content. If you need to collect data by clicking each sub-page URL to scrape details on a deeper layer, you'll need to split the task into two: one task loads the page and scrapes the sub-page URLs, and the other uses the extracted URL list to scrape the detailed info (see the sketch after this list).
Example: Zara's search result page uses infinite scrolling to keep loading new items. If the data you need is on the item page, you need to set the scrolling times and collect enough product URLs first for the next task.
Some websites use AJAX (see Deal with AJAX) to load new content, which means that after clicking the first sub-page URL, Octoparse cannot automatically go back to the listing page and click the second product page from there. In this case, extract the detail page URLs first and then scrape the data you want with the URL list.
Some websites tend to load pages quite slowly while paginating, which might affect the data scraping of our scheduled tasks, so it's better to loop through page URLs directly to avoid the issue.
With a "List of URLs" loop, Octoparse doesn't need to handle extra steps such as "Click to paginate" or "Click Item" to enter the sub-pages. As a result, extraction is faster, especially for Cloud Extraction. Check Speed up scraping by using URL list
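For the infinite-scrolling case mentioned above, here is a rough Python sketch of the two-task split (purely illustrative, not Octoparse's internals): the first stage scrolls the listing page a set number of times and collects item URLs, and the second stage loops that URL list to scrape the detail pages. The site, scroll count, and selectors are placeholders.

```python
# Sketch of the two-task pattern: task 1 collects sub-page URLs from a listing,
# task 2 loops that URL list and scrapes detail fields. Placeholders throughout.
import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

# ---- task 1: load the listing, scroll to load more items, collect item URLs ----
driver = webdriver.Chrome()
driver.get("https://example.com/search?q=jackets")  # placeholder listing page
for _ in range(5):  # "set scrolling times"
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)
item_urls = [a.get_attribute("href")
             for a in driver.find_elements(By.CSS_SELECTOR, "a.product-link")]
driver.quit()

# ---- task 2: loop the collected URL list and scrape the detail pages ----
details = []
for url in item_urls:
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    name = soup.select_one("h1")
    details.append({"url": url, "name": name.get_text(strip=True) if name else None})
```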
More FAQs:
1. How do I know if the pages have the same structure?
If you are scraping news articles from a particular website, the article pages will most likely share the same page structure, like:
Another example is Google Maps: every business page looks like this:
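One quick, informal way to sanity-check this is to test whether the fields you plan to extract can be found with the same selectors on a couple of sample pages. The sketch below is illustrative only; the URLs and selectors are placeholders.

```python
# Quick check (illustrative only): do two pages expose the same key elements?
# If the selectors you plan to extract exist on both pages, the layouts most
# likely match well enough for a "List of URLs" task.
import requests
from bs4 import BeautifulSoup

selectors = ["h1", ".price", ".description"]  # fields you intend to extract
urls = [
    "https://example.com/product/1",
    "https://example.com/product/2",
]

for url in urls:
    page = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    present = {sel: page.select_one(sel) is not None for sel in selectors}
    print(url, present)
```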
2. Can I use URLs that do not share the same page layout?
Unfortunately, no. Only URLs that share the same page structure can be extracted with a "List of URLs" loop, because the same extraction steps run on every page. To get consistent and accurate data, make sure the pages share the same layout.
To learn more about the "List of URLs" mode, you can check out the following article: Loop Item
3. Is there a limit to the number of URLs that I can add at a time?
Yes. We suggest adding no more than 10,000 URLs if you copy and paste the URLs directly into Octoparse. However, using the Batch URL input feature, you can input up to 1 million URLs.
4. Can Octoparse automatically collect and add URLs?
Yes. Octoparse can take URLs extracted by another task: use one task to collect the URLs, then configure a second task to run on that URL list.
The Octoparse API also lets you update a task's URL list without opening the app.
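As a rough illustration of what updating a URL list over an API could look like, the sketch below uses a hypothetical base URL, endpoint, and payload, not the actual Octoparse OpenAPI; please refer to the official API documentation for the real calls and authentication flow.

```python
# Illustrative only: the base URL, endpoint path, and payload fields below are
# hypothetical placeholders, NOT the real Octoparse OpenAPI. Check the API docs.
import requests

API_BASE = "https://openapi.example.com"   # placeholder base URL
ACCESS_TOKEN = "your-access-token"          # obtained per the API documentation

new_urls = [
    "https://example.com/product/101",
    "https://example.com/product/102",
]

resp = requests.post(
    f"{API_BASE}/task/update-url-list",     # hypothetical endpoint
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json={"taskId": "your-task-id", "urls": new_urls},
    timeout=30,
)
print(resp.status_code, resp.text)
```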