Octoparse offers Cloud nodes that run tasks 24/7 and can extract data 4 to 20 times faster than local extraction. Fast extraction is one of the highlights of Octoparse Cloud extraction, but sometimes the speed of the Cloud falls short of expectations. In this tutorial, we explain how Octoparse speeds up tasks in the Cloud and how to revise a task so that it runs faster.
The logic of speeding up tasks in the Cloud
Octoparse Cloud speeds up extraction by splitting one task into multiple subtasks and running the subtasks on multiple Cloud nodes at the same time. Each subtask requires one Cloud node to run, so the speed depends on how many Cloud nodes your account has and whether the task is splittable.
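The idea of splitting can be pictured with a small Python sketch. This is only an illustration of dividing a list of loop items among nodes, not Octoparse's actual partitioning logic, which is not described here. The function name `split_into_subtasks`, the round-robin strategy, and the `example.com` URLs are all assumptions made for the example.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_subtasks(items: Sequence[T], node_count: int) -> List[List[T]]:
    """Distribute loop items round-robin across at most node_count subtasks.

    Hypothetical helper: it only illustrates the concept of splitting.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Never create more subtasks than there are items to process.
    buckets: List[List[T]] = [[] for _ in range(min(node_count, len(items)))]
    for index, item in enumerate(items):
        buckets[index % len(buckets)].append(item)
    return buckets


# Ten placeholder URLs shared among four nodes.
urls = [f"https://example.com/page/{n}" for n in range(1, 11)]
subtasks = split_into_subtasks(urls, node_count=4)
for number, subtask in enumerate(subtasks, start=1):
    print(f"subtask {number}: {len(subtask)} URLs")
```

A task whose loop holds only one item (or one XPath) gives a splitter like this nothing to divide, which is why the loop mode matters.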
The Standard plan has up to 6 Cloud nodes, while the Professional plan has up to 20. Upgrading to a higher plan is the simplest way to speed up a task. If you don't want to change your plan, you need to modify the task so that it is splittable.
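To see why the node count matters, here is a rough back-of-the-envelope estimate in Python. It assumes that every subtask takes about the same time, that each node runs one subtask at a time, and that there is no scheduling overhead. Real runs will vary, and the figure of 40 subtasks is a made-up example.

```python
import math


def estimated_rounds(subtask_count: int, node_count: int) -> int:
    """Number of sequential rounds needed when each node runs one subtask at a time."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return math.ceil(subtask_count / node_count)


# Hypothetical task split into 40 subtasks, run with each plan's maximum node count.
for plan, nodes in (("Standard", 6), ("Professional", 20)):
    print(f"{plan}: {estimated_rounds(40, nodes)} rounds")
```

Under these assumptions, the same 40 subtasks finish in 7 rounds on 6 nodes and in 2 rounds on 20 nodes.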
Note: If your task is splittable, the Boost Mode option will be available to click. If not, this option will be disabled.
Running with Boost Mode means that Octoparse splits the task when it runs. If you choose Standard Mode, Octoparse will not split the task.
What kind of tasks are splittable?
When you create a loop item in Octoparse, Octoparse automatically assigns a loop mode to it based on the items selected and how they fit into the overall webpage structure.
Specifically, there are three types of splittable loop modes in Octoparse:
1. List of URLs
A URL loop is used when you start an extraction task with more than one URL. This is especially handy if the desired data spans multiple web pages that share the same page structure. You can easily set up a loop of URLs to go through each of these pages. Octoparse will load the URLs one by one and execute the same set of extraction actions on each page.
A URL loop is splittable. Hence, when a task built with a list of URLs is set to run in the Cloud, Octoparse splits it into multiple subtasks for faster and more effective extraction.
To learn more about the List of URLs, please refer to the Batch URL input.
2. Text List
A Text list loop works like the URL list loop, but instead of looping through a list of URLs, it loops through a list of predefined text values.
For more about the Text list loop, please refer to Enter Text.
3. Fixed List
Many web pages, such as e-commerce websites, organize their content (e.g. product information) as a collection of recurring elements with a shared HTML pattern.
When you capture such elements, such as the product titles, Octoparse detects all the elements sharing the same HTML pattern and generates a collection of XPaths, one for each element of the same kind. Because a Fixed List holds a separate XPath for each element, the loop can be split into subtasks.
Besides these 3 splittable loop modes, there are 2 loop modes that are not splittable: the single element loop and the variable list loop. As both loop modes involve only a single XPath, they cannot be split into subtasks to speed up the run.
How do I make my task splittable?
1. For a task that uses a Variable List to click through a list of elements
Change it to a Fixed List by listing the XPath of every element on the page.
Scrape only the element URLs first without clicking into the pages, and then create another task with the URLs to get the detailed data.
2. For tasks that scrape from multiple pages
Use the URL of each page to build the workflow. For details, please refer to Speed up scraping by using a URL list.
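If the pages follow a predictable numbering pattern, the URL list can be prepared outside Octoparse and then pasted in. The following Python sketch shows one way to do that. The URL template, the `page` parameter name, and the helper `build_page_urls` are hypothetical, so adjust them to match the pagination pattern of the site you are scraping.

```python
from typing import List


def build_page_urls(template: str, first_page: int, last_page: int) -> List[str]:
    """Fill a URL template with each page number from first_page to last_page."""
    if last_page < first_page:
        raise ValueError("last_page must not be smaller than first_page")
    return [template.format(page=n) for n in range(first_page, last_page + 1)]


# Placeholder pattern: replace it with the real pagination URL of the target site.
page_urls = build_page_urls("https://example.com/products?page={page}", 1, 5)
print("\n".join(page_urls))
```

The printed lines, one URL per line, can then be used as the list of URLs for the task.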
How do I know how many subtasks my task is split into?
After you run the task in the Cloud, you can check how many subtasks it was split into in the event log of the Cloud run window.