Websites such as news portals and forums typically have new content added frequently, often dynamically. To keep up with such websites, Octoparse’s Incremental Extraction lets you extract updated data much more efficiently by skipping pages that have already been extracted and scraping only the new ones.
When to use Incremental Extraction?
When you need the latest data from any website frequently
When the new information shows up as new web pages with new URLs (as opposed to new information being added to, or updated on, existing web pages)
A good example is CNN.com. Suppose you need news feeds from CNN.com in near real time. You would schedule the task to run as often as needed so that anything added to the site is extracted promptly, which satisfies the first criterion above.
Each news article on CNN.com has its own URL that is easy to identify, so the second criterion is also met.
Assuming you have a task set up for the job, it makes little sense to re-scrape articles that were already captured in previous runs. With Incremental Extraction, Octoparse checks the URLs first and captures only the pages that have not been extracted before.
How does Incremental Extraction identify "new" data?
Incremental Extraction only works if newly added data can be identified by new URLs. During extraction, Octoparse checks the URL of each page it opens to determine whether that page was crawled before. If a URL is recognized from a previous crawl, the page is skipped automatically when the task runs with Incremental Extraction.
This also means that Incremental Extraction cannot be used when you scrape only from a listing page, because the listing page URL does not change.
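The idea can be illustrated with a minimal Python sketch. This is not Octoparse's actual implementation; the function names (incremental_run, fake_extract) and the example.com URLs are hypothetical, and the sketch assumes that "crawled before" simply means the URL is stored in a set kept between runs.

```python
from typing import Callable, Iterable, List, Set


def incremental_run(urls: Iterable[str], seen: Set[str],
                    extract: Callable[[str], dict]) -> List[dict]:
    """Extract only pages whose URL is not in `seen`, then remember them."""
    rows = []
    for url in urls:
        if url in seen:
            # URL was handled in an earlier run: skip the page entirely.
            continue
        rows.append(extract(url))
        seen.add(url)
    return rows


def fake_extract(url: str) -> dict:
    """Hypothetical stand-in for the real page extraction."""
    return {"url": url}


seen_urls: Set[str] = set()  # would be persisted between runs in practice

# First run: both article pages are new, so both are extracted.
first_run = incremental_run(
    ["https://example.com/news/1", "https://example.com/news/2"],
    seen_urls, fake_extract)

# Second run: only the article with a new URL is extracted.
second_run = incremental_run(
    ["https://example.com/news/1", "https://example.com/news/2",
     "https://example.com/news/3"],
    seen_urls, fake_extract)

# A listing page keeps the same URL, so a second run skips it
# even if the content on the page has changed.
listing_first = incremental_run(["https://example.com/news"], seen_urls, fake_extract)
listing_second = incremental_run(["https://example.com/news"], seen_urls, fake_extract)
```

In this sketch the second run returns only the row for the third article, and the second visit to the listing page returns nothing, which mirrors why a task that scrapes only a listing page is a poor fit for this feature.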
How to set up Incremental Extraction?
Go to task settings
Go to Run Settings and tick Enable incremental extraction
Select either Match the entire URL or Match by part of the URL
Click Save
Note:
With the "Match the entire URL" option, Octoparse uses the entire URL to determine whether a URL is new. Even the slightest difference makes it count as a "new" URL.
With the "Match by part of URL" option, Octoparse detects attributes in the URL automatically and makes them available as parameters. By selecting one or more attributes as parameters for the match, you tell Octoparse to compare URLs based on the selected attributes only. If any of them are the same as in a previously extracted URL, Octoparse skips the page; otherwise, it scrapes the page.
Only tasks with one Extract Data action can be run with Incremental Extraction because Octoparse scans the page URL for differences as soon as the Extract Data action is executed.
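To make the difference between the two matching options concrete, here is a rough Python sketch. It is only an illustration under stated assumptions, not how Octoparse is implemented: the helper names are hypothetical, the "attributes" are assumed to be query-string parameters, and the partial match follows the wording above by treating two URLs as the same page if any selected parameter has the same value in both. The tool's exact rule may differ.

```python
from typing import List
from urllib.parse import parse_qs, urlsplit


def same_by_entire_url(url_a: str, url_b: str) -> bool:
    """Entire-URL matching: any difference at all means a different page."""
    return url_a == url_b


def same_by_parameters(url_a: str, url_b: str, params: List[str]) -> bool:
    """Partial matching (assumed rule): the URLs count as the same page
    if any of the selected query parameters has the same value in both."""
    query_a = parse_qs(urlsplit(url_a).query)
    query_b = parse_qs(urlsplit(url_b).query)
    return any(
        name in query_a and query_a[name] == query_b.get(name)
        for name in params
    )


# Hypothetical URLs: the same article reached from two referrers,
# and a different article.
a = "https://example.com/article?id=42&ref=home"
b = "https://example.com/article?id=42&ref=newsletter"
c = "https://example.com/article?id=43&ref=home"

entire_ab = same_by_entire_url(a, b)            # URLs differ in "ref"
partial_ab = same_by_parameters(a, b, ["id"])   # same "id" value
partial_ac = same_by_parameters(a, c, ["id"])   # different "id" value
```

Here the entire-URL comparison treats a and b as different pages because of the ref parameter, while matching on id alone treats them as the same article. Matching by part of the URL is therefore the safer choice when URLs carry tracking or session parameters that change between visits.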