This tutorial applies to the latest version of Octoparse. If you are running an older version, we strongly recommend upgrading, because the latest version is faster, easier to use, and more robust.
Extracting data from multiple pages with pagination is very common, since most projects need more than one page of data.
If you're encountering issues where Octoparse keeps scraping the last page and doesn't stop, it's likely because Octoparse is still detecting and clicking the 'Next' button even after reaching the final page. This is commonly referred to as an 'endless loop' issue.
There are two ways to solve it. You can choose either way according to your use case.
1. Set up loop ending condition - Exit loop when
The Exit loop when option lets you end the pagination loop after it has repeated a set number of times. For example, if you'd like to scrape the first 50 pages of data, you can set the number of repeats to 50. Octoparse then repeats the pagination loop 50 times and exits instead of continuing to click the "Next" button.
This is an easy and effective way to resolve the issue if you know the exact number of pages you'd like to fetch data from. Follow the steps below to set up end-loop conditions:
Go to the settings of the Pagination
Find the Exit loop when option at the bottom of the settings
Check the box and enter a number for the number of repeats
Click Apply to save the new settings
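The effect of a repeat limit can be illustrated with a small simulation. This is only a sketch of the idea, not Octoparse's implementation: the three-page site, the click_next function, and the way one repeat is counted (scrape a page, then click "Next") are all assumptions made for illustration.

```python
LAST_PAGE = 3  # hypothetical site with three pages


def click_next(page: int) -> int:
    """Simulate a "Next" button that stays clickable on the last page."""
    return min(page + 1, LAST_PAGE)


def scrape_with_repeat_limit(start_page: int, max_repeats: int) -> list[int]:
    """Scrape a page, click "Next", and stop after max_repeats iterations."""
    scraped = []
    page = start_page
    for _ in range(max_repeats):
        scraped.append(page)      # stand-in for extracting data from the page
        page = click_next(page)   # stand-in for clicking the "Next" button
    return scraped


print(scrape_with_repeat_limit(1, 3))  # [1, 2, 3]
print(scrape_with_repeat_limit(1, 5))  # [1, 2, 3, 3, 3]
```

With a limit equal to the real page count, every page is scraped once. With a larger limit, the last page is scraped repeatedly until the limit is reached, so the loop still ends, but the data contains duplicates.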
2. Modify XPath
If the issue cannot be resolved by setting up a loop-ending condition, you may need to modify the XPath of the pagination loop. Octoparse uses XPath to locate any elements on the page, including the "Next" button. In most cases, Octoparse can generate the XPath automatically and accurately; however, you may still need to revise the XPath manually from time to time. For example, in the case of an endless loop, you'll need to write an XPath that can precisely locate the "Next" button on all pages except the last page.
Tip: We suggest that you use the Chrome extension XPath Helper to write the XPath. You can check out how to write an XPath in the tutorial: What is XPath and how to use it in Octoparse.
Let's use an example to show you how to write an XPath that works for this purpose.
As the two screenshots below show, the auto-generated XPath, tested with XPath Helper, locates the "Next" button on both the first page and the last page.
On the first page:
On the last page:
Now, we need to find out the difference between the buttons on the first and the last pages and utilize the difference to write the XPath. We can right-click on the button in Chrome to inspect the HTML code of the button.
On the first page:
On the last page:
Notice how the HTML code for the two buttons differs: the button on the last page has an "aria-disabled" attribute, while the button on the first page does not.
We will then make use of this observation and write a new XPath to locate the "Next" button only when it is NOT on the last page. The new XPath is: //a[@class="pagination__next icon-link"][not(@aria-disabled)]
Enter the new XPath into XPath Helper to verify that it locates the "Next" button on the first page but not on the last page.
On the first page:
On the last page:
Great! The XPath returns no matching nodes on the last page, which is exactly what we want: an XPath that selects the "Next" button on the first page but not on the last page. For extra confidence, you can also check that the XPath still selects "Next" on page 2, page 3, and so on.
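The same check can be reproduced outside the browser. The sketch below uses Python's standard xml.etree.ElementTree module, which supports only a subset of XPath and has no not() function, so the [not(@aria-disabled)] condition is applied as a filter in Python instead. The two HTML snippets are hypothetical stand-ins for the buttons on the first and last pages; the real markup may contain other attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for the "Next" button; the real pages may differ.
FIRST_PAGE = (
    '<nav><a class="pagination__next icon-link" href="/page/2">Next</a></nav>'
)
LAST_PAGE = (
    '<nav><a class="pagination__next icon-link" aria-disabled="true" '
    'href="#">Next</a></nav>'
)


def find_active_next(html: str) -> list:
    """Return "Next" links that do not carry the aria-disabled attribute."""
    root = ET.fromstring(html)
    # Equivalent of //a[@class="pagination__next icon-link"]
    candidates = root.findall(".//a[@class='pagination__next icon-link']")
    # Equivalent of [not(@aria-disabled)], done in Python because
    # ElementTree's XPath subset does not include not().
    return [a for a in candidates if "aria-disabled" not in a.attrib]


print(len(find_active_next(FIRST_PAGE)))  # 1
print(len(find_active_next(LAST_PAGE)))   # 0
```

One match on the first page and none on the last page mirrors the XPath Helper result: once the selector finds nothing to click, a pagination loop has no "Next" button left to follow.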
Once you have the new XPath ready, follow the steps below to apply the new XPath to the pagination loop:
Go to the settings of the Pagination
Replace the original XPath with the revised XPath
Click Apply to save the new settings
To sum up, the endless loop issue is not daunting to deal with. Depending on your scraping requirements, you may set up conditions to end the loop or revise the XPath to fix it.