How can I scrape data faster in Cloud?
Updated over a week ago

You are browsing a tutorial guide for Octoparse's latest version. If you are running an older version of Octoparse, we strongly recommend you upgrade because it is faster, easier and more robust! Download and upgrade here if you haven't already done so!

Octoparse offers Cloud nodes that run tasks 24/7 and can extract data 4 to 20 times faster than local extraction. Fast extraction is one of the highlights of Octoparse Cloud extraction, but sometimes the speed in the Cloud falls short of expectations. In this tutorial, we explain how Octoparse speeds up tasks in the Cloud and how to revise a task so that it runs faster.


The logic of speeding up tasks in the Cloud

Octoparse Cloud speeds up extraction by splitting one task into multiple sub-tasks and running the sub-tasks on multiple Cloud nodes. Each sub-task requires one Cloud node to run, so the speed depends on how many Cloud nodes your account has and whether the task is splittable.

The Standard plan has up to 6 Cloud nodes, while the Professional plan has up to 20. Upgrading to a higher plan is the simplest way to speed up a task. If you don't want to change your plan, making the task splittable is essential.
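The relationship between node count and splitting can be illustrated with a short sketch. This is not Octoparse's actual scheduling algorithm, which this guide does not describe; the function name `split_into_subtasks`, the round-robin strategy, and the `example.com` URLs are all assumptions made for illustration.

```python
from typing import List, Sequence


def split_into_subtasks(items: Sequence[str], node_count: int) -> List[List[str]]:
    """Distribute loop items round-robin into at most node_count sub-tasks.

    Hypothetical helper: it only illustrates that one sub-task needs one
    node, so there are never more sub-tasks than nodes or than items.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    buckets = [list(items[i::node_count]) for i in range(node_count)]
    # Drop empty buckets: a node with nothing to do runs no sub-task.
    return [bucket for bucket in buckets if bucket]


# Placeholder URLs standing in for a splittable List of URLs loop.
urls = [f"https://example.com/page/{n}" for n in range(1, 13)]

standard = split_into_subtasks(urls, 6)        # up to 6 nodes
professional = split_into_subtasks(urls, 20)   # up to 20 nodes

print(len(standard), [len(s) for s in standard])
print(len(professional), [len(s) for s in professional])
```

With 12 items, 6 nodes each handle 2 items, while 20 nodes can use at most 12 sub-tasks of 1 item each. Extra nodes only help when the task has enough splittable items to fill them.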


What kind of tasks are splittable?

When you create a loop item in Octoparse, Octoparse automatically assigns a loop mode to it based on the items selected and how they relate to the overall webpage structure.


Specifically, there are three types of splittable loop modes in Octoparse:

1. List of URLs

A URL loop is used when you start an extraction task with more than one URL. This is especially handy if the desired data spans multiple web pages that share the same page structure. You can easily set up a loop of URLs to go through each of these pages. Octoparse loads the URLs one by one and executes the same set of extraction actions on each page.

A URL loop is splittable. Hence, when a task built with a list of URLs is set to run in the Cloud, Octoparse splits it into multiple sub-tasks for faster and more effective extraction.

To learn more about the List of URLs, please refer to the Batch URL input.


2. Text List

A Text list loop works like a URL list loop, but instead of looping through a list of URLs, it loops through a list of predefined text values.

For more about the Text list loop, please refer to Enter Text.

3. Fixed List

Many web pages, such as those on e-commerce websites, organize their content (i.e., product information) as a collection of recurring elements with a shared HTML pattern.

When you capture such elements, for example product titles, Octoparse detects all the elements sharing the same HTML pattern and generates a collection of XPaths to locate all elements of the same kind.


Besides these 3 splittable loop modes, there are 2 loop modes that are not splittable: the single element loop and the variable list loop. Because both loop modes involve only a single XPath, they can't be split further into sub-tasks to speed up extraction.
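The difference between a single XPath and a collection of XPaths can be sketched as follows. This is only an illustration under stated assumptions: the helper `to_fixed_list`, the example XPath, and the element count are hypothetical, and the XPaths Octoparse generates for a Fixed List may look different.

```python
from typing import List


def to_fixed_list(variable_xpath: str, element_count: int) -> List[str]:
    """Expand one XPath that matches many elements into one XPath per element.

    Uses the standard XPath positional form "(expr)[n]", where n starts at 1.
    Hypothetical helper for illustration only.
    """
    if element_count < 0:
        raise ValueError("element_count must not be negative")
    return [f"({variable_xpath})[{i}]" for i in range(1, element_count + 1)]


# Assumed example: a page that lists 4 products with the same HTML pattern.
variable_xpath = "//div[@class='product']/a"
fixed_list = to_fixed_list(variable_xpath, 4)

for xpath in fixed_list:
    print(xpath)
```

The single XPath is one indivisible loop item, whereas the expanded list contains 4 separate entries that could each be handed to a different sub-task.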


How do I make my task splittable?

1. For a task that uses a Variable List to click through a list of elements

  • Modify it to a Fixed List by listing the XPaths for every element on the page

  • Scrape only the element URLs first without clicking into the pages, and then create another task with the URLs to get the detailed data. Here is an example: Scrape property data from Realtor.com

2. For tasks that scrape from multiple pages

Use the URLs for each page to build the workflow: Speed up scraping by using a URL list
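When the page number appears in the URL, the list of page URLs can be generated rather than typed by hand. The sketch below assumes a hypothetical URL pattern (`https://example.com/products?page=N`) and a hypothetical helper `build_page_urls`; check how the target site actually paginates before relying on it.

```python
from typing import List


def build_page_urls(template: str, first_page: int, last_page: int) -> List[str]:
    """Fill a URL template containing "{page}" for each page number in range.

    Hypothetical helper; the template format is an assumption.
    """
    if first_page > last_page:
        raise ValueError("first_page must not exceed last_page")
    return [template.format(page=n) for n in range(first_page, last_page + 1)]


# Assumed pagination pattern, used only as a placeholder.
page_urls = build_page_urls("https://example.com/products?page={page}", 1, 5)

# One URL per line, ready to copy into a List of URLs loop.
print("\n".join(page_urls))
```

The resulting list can then be used as the input for a List of URLs loop, which is one of the splittable loop modes.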
