Create a task with a list of URLs
Updated over a week ago

In some cases, you may have a list of similarly structured URLs (such as a batch of product URLs) on hand and want to extract data from them directly. This tutorial introduces an easy and powerful way to extract data from multiple web pages by using a list of URLs.

When should you consider scraping using a list of URLs?

Here are some cases where you can start the task with a list of URLs for extraction.

  1. All the URLs are under the same domain, sharing the same webpage structure (Most Important).

    • Example: I have a list of product URLs and want to start a task directly from that list to scrape updated pricing data regularly.

  2. Some websites use infinite scrolling or a "Load more" button to load content. If you need to open each URL to scrape details from a deeper layer, split the task into two: one task loads the page and scrapes the URLs, and the other uses the list of extracted URLs to scrape the detailed info.

    • Example: Zara's search result page uses infinite scrolling to keep loading new items. If the data you need is on the item page, set the number of scrolls and collect enough product URLs first for the next task.

  3. The website uses AJAX to load new content (see Deal with AJAX), which means that after opening the first product page, Octoparse cannot go back to the listing page automatically and click the second product from there. In this case, extract the detail page URLs first and then scrape the data you want with the URL list.

  4. Some websites tend to load pages quite slowly while paginating, which might affect the data scraping of our scheduled tasks, so it's better to loop through page URLs directly to avoid the issue.
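
As a quick pre-check for the first condition above, the URLs can be grouped by domain before they are pasted into Octoparse. This is a minimal Python sketch that runs outside Octoparse; the example URLs and the function name `group_by_domain` are hypothetical. Sharing a domain does not by itself guarantee the same page structure, so this only catches obvious outliers.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_domain(urls):
    """Group URLs by host name, ignoring case and a leading 'www.'."""
    groups = defaultdict(list)
    for url in urls:
        url = url.strip()
        host = (urlsplit(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        groups[host].append(url)
    return dict(groups)


# Hypothetical input list; replace with your own URLs.
urls = [
    "https://www.example.com/product/101",
    "https://example.com/product/102",
    "https://shop.example.org/item/7",
]

groups = group_by_domain(urls)
for host, members in groups.items():
    print(f"{host}: {len(members)} URL(s)")
```

If more than one domain shows up, it is probably safer to put each group in its own task.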

How do I know if the pages have the same structure?

If you are scraping news articles from a particular website, the article pages will most likely share the same page structure, with the same kinds of elements appearing in the same places on every page.

Another example is Google Maps, where every business page follows the same layout.

To scrape using a list of URLs, simply set up a loop over all the URLs you need to scrape, then add a data extraction action right after it to get the data you need. Octoparse will load the URLs one by one and scrape the data from each page.

With the "List of URLs" loop mode, Octoparse does not need extra steps like "Click to paginate" or "Click Item" to enter the item page. As a result, extraction is faster, especially for Cloud Extraction. See: Speed up scraping by using URL list
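
Conceptually, the loop described above behaves like the following Python sketch. It is only an illustration of the idea, not Octoparse's internal implementation: `scrape_url_list`, `extract_title`, and the stand-in pages are all hypothetical, and the pages are kept in a dictionary so the example runs without network access.

```python
import re


def scrape_url_list(urls, fetch, extract):
    """Visit each URL in order and collect one record per page."""
    rows = []
    for url in urls:
        page = fetch(url)          # load the page for this URL
        record = extract(page)     # apply the same extraction to every page
        record["page_url"] = url   # remember which URL the record came from
        rows.append(record)
    return rows


# Stand-in pages (hypothetical data) so the sketch needs no network access.
fake_pages = {
    "https://example.com/a": "<title>Page A</title>",
    "https://example.com/b": "<title>Page B</title>",
}


def extract_title(html):
    """Pull the <title> text; this only works if every page has the same structure."""
    match = re.search(r"<title>(.*?)</title>", html)
    return {"title": match.group(1) if match else None}


rows = scrape_url_list(list(fake_pages), fake_pages.get, extract_title)
for row in rows:
    print(row)
```

Because the same extraction function is applied to every page, the sketch also shows why the pages need to share one structure.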

FAQs:

1. Can I use URLs that do not share the same page layout?

Unfortunately, only URLs that share the same page structure can be extracted using "List of URLs". To extract data consistently and accurately, these pages must share the same page layout.

To learn more about the "List of URLs" mode, you can check out the following article: Loop Item

2. Is there a limit to the number of URLs that I can add at a time?

Yes. We suggest adding no more than 10,000 URLs if you copy and paste the URLs directly into Octoparse. However, using the Batch URL input feature, you can input up to 1 million URLs.
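
If you plan to copy and paste a long list, it may help to clean it and split it into batches that stay within the suggested 10,000-URL limit. The following Python sketch runs outside Octoparse; the function names and the generated example URLs are hypothetical.

```python
def clean_urls(lines):
    """Strip whitespace, drop blank lines, and remove duplicates while keeping order."""
    seen = set()
    cleaned = []
    for line in lines:
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned


def split_into_batches(urls, batch_size=10_000):
    """Split a list into consecutive batches of at most batch_size items."""
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]


# Hypothetical list of 25,000 product URLs.
raw_lines = [f"https://example.com/product/{n}\n" for n in range(25_000)]

urls = clean_urls(raw_lines)
batches = split_into_batches(urls)
print([len(batch) for batch in batches])
```

Each batch can then be pasted separately, or the full list can go through the Batch URL input feature instead.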

3. Can Octoparse automatically collect and add URLs?

Octoparse can input URLs from another task. You can use one task to extract the URLs and then configure another task to use the URLs.

The Octoparse API also lets you modify the list of URLs without opening the app.
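
If you follow the two-task approach by hand, the URLs exported by the first task usually need to be pulled out of the export file before they are pasted into the second task. Below is a minimal Python sketch assuming a CSV export; the column name `product_url` and the sample data are hypothetical, and the example makes no calls to the Octoparse API.

```python
import csv
import io


def urls_from_export(csv_file, column):
    """Read one column of URLs from an exported CSV, skipping empty cells."""
    reader = csv.DictReader(csv_file)
    urls = []
    for row in reader:
        value = (row.get(column) or "").strip()
        if value:
            urls.append(value)
    return urls


# In-memory stand-in for an exported file (hypothetical data).
sample_export = io.StringIO(
    "title,product_url\n"
    "Item A,https://example.com/product/1\n"
    "Item B,\n"
    "Item C,https://example.com/product/3\n"
)

url_list = urls_from_export(sample_export, "product_url")
print("\n".join(url_list))
```

With a real export, pass a file opened with `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.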

To extract data from a list of URLs, the extraction process can generally be broken down into 3 simple steps:

web scraping with octoparse - scraping with a list of urls

In Octoparse, there are two ways to create a "List of URLs" loop. Choose whichever suits your use case:


Method 1. Start a new task with a list of URLs

1). Select +New and click Custom Task to create a new task

2). Paste the list of URLs in the textbox and click Save

After you click Save, the Loop URLs action (which loops through each URL in the list) is automatically created in the workflow. If you click Loop URLs, you can see that the URLs you entered have been added to the Loop Items.

3). After the URLs are saved, the first page will open automatically, and you can select the data on the page to extract.


Method 2. Create a "List of URLs" loop in workflow

Use this method when you have already started a task and want to add a loop of URLs directly in it.

1). Add a Loop in the workflow

2). Go to Loop mode and select List of URLs. Click the edit button to paste the list of URLs. Don’t forget to click Apply to save the settings.

3). Add an Open Page action inside the Loop Item, then tick Load URLs in the loop and click Apply to confirm

Note: If scraping stops right after the extraction starts, try setting a longer timeout for the Open Page step so that Octoparse waits longer for the webpage to load fully.

4). After the URLs are saved, the first page will open automatically, and you can select the data on the page to extract.


Here are some additional tips for the above two scenarios.

Note:

1. If Octoparse runs too fast, a page may not finish loading before the data extraction step is executed, which can lead to missing or incomplete data. To avoid this, set up a "Wait before execution".

Click on the "Options" settings for the "Extract Data" step and set a wait time before the action is executed (2-3 seconds will usually work).

2. If you want the exported data to line up with the original URL list you entered, add the current page URL to the extracted data fields.
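
Once the page URL is part of each exported row, the rows can be matched back to the original list outside Octoparse. This Python sketch assumes the exported field is named `page_url` (a hypothetical name; use whatever your export calls it) and that the exported URL matches the input URL exactly, which may not hold if a site redirects.

```python
def align_to_input_order(input_urls, exported_rows, url_field="page_url"):
    """Return one entry per input URL, in input order (None where no row matched)."""
    by_url = {row[url_field]: row for row in exported_rows}
    return [by_url.get(url) for url in input_urls]


# Hypothetical input list and exported rows.
input_urls = [
    "https://example.com/p/1",
    "https://example.com/p/2",
    "https://example.com/p/3",
]
exported_rows = [
    {"page_url": "https://example.com/p/3", "price": "30"},
    {"page_url": "https://example.com/p/1", "price": "10"},
]

aligned = align_to_input_order(input_urls, exported_rows)
for url, row in zip(input_urls, aligned):
    print(url, "->", row["price"] if row else "no data")
```

URLs that come back as "no data" are the ones worth re-running or checking manually.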

Once the task is set up as described above, run it. After Octoparse finishes scraping one page, it moves on to the next URL in the list automatically.