With the List of URLs loop mode, Octoparse can skip steps such as Click to Paginate or Click Item to open a details page.
As a result, extraction is faster, especially with Cloud Extraction. When a task built using List of URLs is set to run in the Cloud, it is split into sub-tasks that run on multiple cloud nodes simultaneously.
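To make the idea of splitting concrete, here is a minimal Python sketch of how a URL list could be divided into sub-tasks. It is only an illustration: the function name `split_into_subtasks`, the round-robin strategy, the node count, and the `example.com` URLs are all assumptions, and Octoparse's actual splitting logic is not documented here.

```python
from typing import List


def split_into_subtasks(urls: List[str], n_nodes: int) -> List[List[str]]:
    """Distribute URLs round-robin across n_nodes buckets (illustrative only)."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    # Bucket i takes every n_nodes-th URL, starting at index i.
    buckets = [urls[i::n_nodes] for i in range(n_nodes)]
    # Drop empty buckets when there are fewer URLs than nodes.
    return [bucket for bucket in buckets if bucket]


# Hypothetical input: 10 paginated URLs shared across 4 nodes.
urls = [f"https://example.com/list?page={n}" for n in range(1, 11)]
subtasks = split_into_subtasks(urls, 4)

for index, bucket in enumerate(subtasks, start=1):
    print(f"sub-task {index}: {len(bucket)} URLs")
```

Because every URL in the list can be opened on its own, each bucket can be processed without waiting for the others, which is why a List of URLs task lends itself to parallel runs.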
1. Speed up scraping by using paginated URLs
A paginated URL is a type of URL that includes a parameter (such as page=1) to indicate the specific page number being viewed in a sequence. These URLs are commonly used for websites with search results or content spread across multiple pages, allowing users to navigate through large sets of information efficiently.
If your scraping task needs to extract data from thousands of pages, you can enter a list of paginated URLs in the task instead of having the task click the 'Next' button for each page. The task then opens each page directly rather than clicking through the pages one by one, which significantly speeds up the process.
Let's take the URLs below as an example:
This website has 1,011 pages in total. If you compare the URLs of the pages, you will find that they share the same structure and differ only in the page number. In this case, you can use the Batch-generate feature to auto-generate the URLs for each page.
Here are the steps you can follow:
Click New+ from the sidebar menu and select Custom Task
Note: If you’ve already set up a task, click the 'Edit URLs' button in the top-right corner to batch-generate the URLs.
Select Batch generate
Paste one of the paginated URLs for batch-generating
Highlight the page number ("1" in this case) and click Add parameter
Enter the total number of pages ("1,011" in this case) in the 'Repeat' box
Click on Save
Once the paginated URLs are generated, the task can navigate to each page directly. Therefore, you can remove the Pagination step.
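If you would rather prepare the list outside Octoparse, the same idea can be sketched in a few lines of Python. This is a rough equivalent of what batch generation produces, not Octoparse's own implementation: the function name `generate_paginated_urls` and the `example.com` URL template are hypothetical, and you should substitute the real URL pattern of your target site.

```python
from typing import List


def generate_paginated_urls(template: str, total_pages: int, start: int = 1) -> List[str]:
    """Fill the {page} placeholder with consecutive page numbers."""
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    return [template.format(page=n) for n in range(start, start + total_pages)]


# Hypothetical template; {page} marks the part that changes from page to page.
urls = generate_paginated_urls("https://example.com/search?page={page}", 1011)

print(len(urls))   # number of generated URLs
print(urls[0])     # first page
print(urls[-1])    # last page
```

The placeholder plays the same role as the highlighted page number in the steps above, and `total_pages` corresponds to the value entered in the 'Repeat' box.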
Tip: There are three ways to batch import URLs to any single task/crawler (up to a million URLs):
Batch import URLs from local files
Batch import URLs from another task
Manually Enter
Please check the Batch URL input tutorial for more details.
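For the local-file option, a list of URLs can be written to a file with a short script. The sketch below assumes a plain text file with one URL per line; that layout, the file name, the helper `save_urls`, and the `example.com` URLs are assumptions for illustration, so check the Batch URL input tutorial for the file formats Octoparse actually accepts.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def save_urls(urls: Iterable[str], path: Path) -> int:
    """Write non-empty URLs to path, one per line, and return how many were written."""
    lines = [url.strip() for url in urls if url.strip()]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


# Hypothetical URLs and output location (the system temp directory).
demo_urls = [f"https://example.com/search?page={n}" for n in range(1, 4)]
out_file = Path(tempfile.gettempdir()) / "paginated_urls.txt"
count = save_urls(demo_urls, out_file)

print(f"Wrote {count} URLs to {out_file}")
```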
2. Speed up scraping by using details page URLs
When you need to click through the items on the list and scrape their corresponding detail pages, it takes some time to click all the items one by one. In this case, it is wise to scrape the URLs of all the listed items first. After you get all the URLs of detail pages, you can start a new task by inputting all the scraped URLs from the previous task.
Note: Once you've entered the URLs as mentioned above, please run your task in boost mode. This will allow the task to be divided into multiple subtasks, speeding up the overall process.
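Before feeding the scraped detail-page URLs into the second task, it can help to tidy them up. The following Python sketch resolves relative links against the list page, drops `#fragment` suffixes, and removes duplicates while keeping the original order. The function name `clean_detail_urls` and all URLs are hypothetical, and whether your exported data needs this cleanup depends on the site.

```python
from typing import Iterable, List
from urllib.parse import urldefrag, urljoin


def clean_detail_urls(base_url: str, links: Iterable[str]) -> List[str]:
    """Return absolute, fragment-free, de-duplicated URLs in first-seen order."""
    seen = set()
    cleaned = []
    for link in links:
        link = link.strip()
        if not link:
            continue  # skip blank cells from the export
        # Resolve relative links, then strip any #fragment part.
        absolute, _fragment = urldefrag(urljoin(base_url, link))
        if absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned


# Hypothetical links as they might appear in the first task's output.
scraped = ["/item/1", "https://example.com/item/2", "/item/1#reviews", "  ", "/item/3"]
detail_urls = clean_detail_urls("https://example.com/list?page=1", scraped)

for url in detail_urls:
    print(url)
```

Removing duplicates up front means the second task does not spend time opening the same details page twice.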