Loop Item (Loop URLs/Texts/Pagination)
Updated over 3 weeks ago

When you are building a scraping task in Octoparse, you will almost certainly use a Loop Item at some point. A Loop Item is most often used for capturing a list of elements or for paginating through the pages of a website. In this article, I will explain exactly how a Loop Item works in Octoparse.


1. What is a Loop Item

In programming, a loop is a construct that repeats an instruction until a certain condition is met. The Loop Item (also named Loop URLs/Pagination) in Octoparse works in a similar way.

A Loop Item is usually created from a list of URLs, text strings, or page elements (or from a single element, as in pagination), and one or more actions are added inside it. Once a Loop Item is created, Octoparse repeats the looped actions either a set number of times or until they can no longer be repeated, for example, when there is no next page to go to because you have reached the last page.

Let's look at an example. Suppose we have a list of URLs to extract data from. First, we'll create a Loop Item using the list of URLs, then we'll insert a Go to Web Page action and an Extract Data action inside the Loop Item. The workflow would look like this:

555555.png

This workflow translates to a set of instructions telling Octoparse to take the first URL of the URL list, load the page with the Go to Web Page action, then scrape the data with the Extract Data action. The same set of actions will be repeated for all the URLs in the list until the last URL is taken, and then the loop stops.
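If it helps to think of this in code, the workflow above behaves roughly like the Python sketch below. This is only an analogy, not how Octoparse is implemented: `run_url_loop`, `fake_go_to_web_page`, and `fake_extract_data` are hypothetical names, and the two "fake" actions are stand-ins so the sketch runs without network access.

```python
from typing import Callable, Dict, List


def run_url_loop(
    urls: List[str],
    go_to_web_page: Callable[[str], str],
    extract_data: Callable[[str], Dict[str, str]],
) -> List[Dict[str, str]]:
    """Repeat the two looped actions once for every URL in the list."""
    rows = []
    for url in urls:                     # take the next URL from the list
        page = go_to_web_page(url)       # "Go to Web Page" action
        rows.append(extract_data(page))  # "Extract Data" action
    return rows                          # the loop stops after the last URL


# Hypothetical stand-in actions (assumptions for illustration only).
def fake_go_to_web_page(url: str) -> str:
    # Pretend the page source is just a title tag containing the URL.
    return f"<title>{url}</title>"


def fake_extract_data(page: str) -> Dict[str, str]:
    # Strip the surrounding <title>...</title> tags to get the "data".
    return {"title": page[len("<title>"):-len("</title>")]}


rows = run_url_loop(
    ["https://example.com/a", "https://example.com/b"],
    fake_go_to_web_page,
    fake_extract_data,
)
print(rows)
```

Each pass through the `for` loop corresponds to one run of the actions inside the Loop Item, and the loop ends on its own once the list is exhausted.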


2. Loop Item settings

If you click on a Loop Item and select General, you'll be taken to its settings panel. Let's take a look at the options available.

mmii.png

2.1. Action name: This is where you can change the name of the specific Loop Item. Giving each Loop Item a unique name helps you tell them apart when you have more than one Loop Item in your workflow.

2.2. Loop Mode: For a Loop Item to work correctly, it is critical that you have the correct loop mode selected. There are six loop modes, and each of them is explained in section 3 below.

2.3. Exit Loop when: Besides letting the loop end automatically, you can also end the loop early by specifying the number of times the looped actions should be repeated.

2.4. Wait before action: You can use this feature to set up a wait time before Octoparse executes this loop action.

001111.png
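To make "Exit Loop when" and "Wait before action" more concrete, here is a small Python sketch of a pagination loop with both settings. It is a simplified model under stated assumptions, not Octoparse's actual behavior: `SITE`, `run_pagination_loop`, `max_repeats`, and `wait_seconds` are hypothetical names, and the site is simulated with a dictionary.

```python
import time
from typing import Dict, List, Optional

# Hypothetical site: each page has some rows and the name of the page its
# "Next page" button leads to (None on the last page).
SITE: Dict[str, dict] = {
    "page1": {"rows": ["a", "b"], "next": "page2"},
    "page2": {"rows": ["c"], "next": "page3"},
    "page3": {"rows": ["d", "e"], "next": None},
}


def run_pagination_loop(
    start: str,
    max_repeats: Optional[int] = None,
    wait_seconds: float = 0.0,
) -> List[str]:
    rows: List[str] = []
    current: Optional[str] = start
    repeats = 0
    while current is not None:               # ends automatically on the last page
        if max_repeats is not None and repeats >= max_repeats:
            break                            # "Exit Loop when": repeat limit reached
        time.sleep(wait_seconds)             # "Wait before action"
        rows.extend(SITE[current]["rows"])   # looped action: extract this page
        current = SITE[current]["next"]      # follow "Next page" (None if absent)
        repeats += 1
    return rows


print(run_pagination_loop("page1"))                 # runs until there is no next page
print(run_pagination_loop("page1", max_repeats=2))  # stops early after two repeats
```

Without a limit, the loop ends only when there is no next page; with `max_repeats=2`, it stops after the second page even though a third one exists.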

3. The 6 loop modes

There are 6 loop modes: Single Element, Variable List, Fixed List, List of URLs, Text List, and Scroll Page.

dffdf.png
  • Single Element is used to locate a specific element on the page. Octoparse performs the looped actions on the same element over and over again until the element is no longer found on the page.

    One typical use for Single Element is pagination, when you want Octoparse to click the "Next page" button repeatedly until it reaches the last page.

88741.png
  • Variable List is used to locate a list of items that can be matched with a single XPath query. Octoparse performs the looped actions on the elements in the list one by one until the last element is reached. A Variable List should be used when the number of elements you'd like to loop through varies from page to page.

8000.png
  • Fixed List, similar to Variable List, also locates a list of items, but Fixed List is a list of XPath queries with each XPath locating a unique element on the page. It is used when the number of elements on the page is consistent across all pages.

9884.png
  • List of URLs is used for looping through a list of URLs, in which case Octoparse would open the URLs one by one. There are three ways to input the URLs. Check out the different ways to input the URL here.

85996.png
  • Text List is a list of text strings. When a Text List is used, Octoparse enters the strings into the page one by one.

87705.png
87706.png
  • Scroll Page is a newer loop mode. It is designed specifically for websites that use infinite scrolling to load more content, like Twitter. With this mode, Octoparse can scrape data while scrolling instead of waiting until the scrolling has finished.

102323.png
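The difference between Variable List and Fixed List is easier to see with actual XPath queries. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath, on two made-up pages. The page sources, XPaths, and function names are all assumptions for illustration; Octoparse itself evaluates XPath against the page loaded in its built-in browser.

```python
import xml.etree.ElementTree as ET

# Two hypothetical pages: the number of list items differs between them.
PAGE_A = "<html><body><ul><li>A1</li><li>A2</li><li>A3</li></ul></body></html>"
PAGE_B = "<html><body><ul><li>B1</li><li>B2</li></ul></body></html>"

# Variable List: ONE XPath that matches every item, however many there are.
VARIABLE_XPATH = ".//ul/li"

# Fixed List: SEVERAL XPaths, each locating exactly one element.
FIXED_XPATHS = [".//ul/li[1]", ".//ul/li[2]", ".//ul/li[3]"]


def loop_variable_list(page_source):
    """Loop over whatever the single XPath matches on this page."""
    root = ET.fromstring(page_source)
    return [item.text for item in root.findall(VARIABLE_XPATH)]


def loop_fixed_list(page_source):
    """Loop over the fixed XPaths; a missing element is recorded as None."""
    root = ET.fromstring(page_source)
    results = []
    for xpath in FIXED_XPATHS:
        element = root.find(xpath)
        results.append(element.text if element is not None else None)
    return results


print(loop_variable_list(PAGE_B))  # adapts to the two items on the page
print(loop_fixed_list(PAGE_B))     # the third fixed XPath finds nothing
```

On the shorter page, the single Variable List query simply returns fewer items, while the third Fixed List query has nothing to match. This is why Fixed List suits pages where the number of elements stays the same.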

TIPS:

  • When Fixed List, List of URLs, or Text List is used, the task can be split further into subtasks that run concurrently in the Cloud for faster data capturing.

  • Variable List can be changed to Fixed List for faster extraction.
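As a rough illustration of the first tip, the Python sketch below splits a list of loop items into chunks and processes the chunks concurrently. This is my own simplified model, not a description of how the Octoparse Cloud schedules work: `split_into_subtasks`, `run_subtask`, and `run_concurrently` are hypothetical names, and I am assuming that splitting is possible with these three modes because every item in the list is known before the task starts.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_into_subtasks(items: List[str], subtask_count: int) -> List[List[str]]:
    """Cut a list that is known up front into contiguous, roughly equal chunks."""
    if not items:
        return []
    size = -(-len(items) // subtask_count)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_subtask(urls: List[str]) -> List[str]:
    """Hypothetical stand-in for scraping every URL in one subtask."""
    return [f"scraped {url}" for url in urls]


def run_concurrently(items: List[str], subtask_count: int) -> List[str]:
    subtasks = split_into_subtasks(items, subtask_count)
    with ThreadPoolExecutor(max_workers=subtask_count) as pool:
        # map() returns results in the same order as the subtasks were given.
        chunks = list(pool.map(run_subtask, subtasks))
    return [row for chunk in chunks for row in chunk]


urls = [f"https://example.com/{n}" for n in range(5)]
print(split_into_subtasks(urls, 2))
print(run_concurrently(urls, 2))
```

A loop whose items are only discovered while the task runs, such as clicking "Next page" until it disappears, cannot be divided this way in the sketch, because there is no list to cut up in advance.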


4. How to create a Loop Item

Which loop mode you need depends on the data you are trying to fetch and on the structure of the webpage. Check out the tutorials below on how to create a Loop Item for various use cases.

In most cases when you are creating a workflow, you don't need to pay much attention to which loop mode to use. Octoparse will automatically select the correct mode based on your selections.


5. Loop Item troubleshooting

Several kinds of issues can come up with a Loop Item, such as missing elements or skipped pages. The most frequently asked questions about Loop Items are listed below:

5.1 Pagination:

5.2 Missing elements:

5.3 Others:
