Duplicate data is a common challenge in web scraping, particularly when dealing with dynamic or frequently updated websites. This document explains why duplicates occur, how to remove them with Octoparse's deduplication tools, and how to prevent them in your workflows.
Causes of Duplicate Data
Task Configuration Issues
Failure to Clear Data Before Reruns: Octoparse appends the results of each run to the same dataset, so rerunning a task without clearing previously collected data produces duplicate rows. Always clear the dataset before starting a new run.
Navigation Challenges on Websites: Many websites have features like “Load more” or “See more jobs” buttons to display additional content. If these buttons are not correctly handled in your workflow configuration, the scraper might repeatedly collect the same data from the initial content load.
Dynamic Website Updates
Some websites update their content or reorder entries, such as product listings or job postings, while a scrape is in progress. These updates can cause the same items to appear on different pages or in different positions, producing duplicate entries.
In short, duplicate data can occur when:
The source website contains repeated entries
Your scraping task captures the same data multiple times
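If reruns have already left duplicates in an exported file, they can also be cleaned up after export. Below is a minimal sketch using pandas (a post-processing workaround, not an Octoparse feature); the file names are placeholders for your own export.

```python
# Minimal sketch: clean an already-exported CSV outside Octoparse.
# Assumes pandas is installed; "export.csv" is a placeholder file name.
import pandas as pd

df = pd.read_csv("export.csv")

# Drop rows where every field matches an earlier row, keeping the first.
deduped = df.drop_duplicates(keep="first")

deduped.to_csv("export_deduped.csv", index=False)
print(f"Removed {len(df) - len(deduped)} duplicate rows")
```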
Deduplication Methods in Octoparse
Octoparse provides two deduplication methods to clean your dataset. These tools address the common causes of duplicate data described in the previous section.
1. Remove Full-Line Duplicates (Default)
Octoparse automatically treats rows as duplicates only if all fields match exactly.
✅ Best for:
Ensuring completely unique records.
Example:
| Line | Field 1 | Field 2 | Field 3 | Result |
| --- | --- | --- | --- | --- |
| #1 | A | B | C | Kept |
| #2 | X | Y | Z | Kept |
| #3 | A | B | C | Removed (matches Line #1) |
How it works:
Only the first occurrence is kept.
Subsequent identical rows are removed.
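Conceptually, the behavior matches the plain-Python sketch below (an illustration of the logic only, not Octoparse's internal code):

```python
# Full-line deduplication: a row is dropped only when every field
# matches an earlier row; the first occurrence is kept.
rows = [
    ("A", "B", "C"),  # Line #1 — kept
    ("X", "Y", "Z"),  # Line #2 — kept
    ("A", "B", "C"),  # Line #3 — removed, matches Line #1
]

seen = set()
unique_rows = []
for row in rows:
    if row not in seen:  # first occurrence wins
        seen.add(row)
        unique_rows.append(row)

print(unique_rows)  # [('A', 'B', 'C'), ('X', 'Y', 'Z')]
```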
2. Remove Duplicates Based on Selected Fields (v8.1.16+)
You can customize deduplication by selecting specific fields to compare.
✅ Best for:
Removing duplicates based on unique identifiers (e.g., product ID, URL).
Keeping variations in other fields (e.g., price changes, updated reviews).
Example 1: Dedupe by One Field
🔹 Selected Field: Field 2
| Line | Field 1 | Field 2 | Field 3 | Result |
| --- | --- | --- | --- | --- |
| #1 | A | B | C | Kept |
| #2 | X | B | Z | Removed (same Field 2 as Line #1) |
| #3 | P | Q | R | Kept |
Example 2: Dedupe by Multiple Fields
🔹 Selected Fields: Field 3 + Field 4
| Line | Field 3 | Field 4 | Result |
| --- | --- | --- | --- |
| #1 | C | D | Kept |
| #2 | E | F | Kept |
| #3 | C | D | Removed (matches Line #1) |
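Both examples follow the same logic, sketched below in plain Python (an illustration only, not Octoparse's internal code): rows are compared only on the selected fields, and the first occurrence of each key is kept.

```python
# Field-based deduplication: compare rows only on the selected fields;
# keep the first occurrence of each key, drop later repeats.
def dedupe_by_fields(rows, key_fields):
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

rows = [
    {"Field 1": "A", "Field 2": "B", "Field 3": "C"},
    {"Field 1": "X", "Field 2": "B", "Field 3": "Z"},  # same Field 2 as row 1
    {"Field 1": "P", "Field 2": "Q", "Field 3": "R"},
]

# Example 1: dedupe by a single field — row 2 is removed.
print(dedupe_by_fields(rows, ["Field 2"]))
# Multiple fields work the same way, e.g. ["Field 3", "Field 4"].
```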
How to Set Up Custom Deduplication
1. Build your task and define data fields.
2. In the Data Preview, click the ⚙️ (gear icon).
3. Select the field(s) to compare.
4. Click Apply to save the settings.
Important Notes
For Cloud Runs:
Deduplication only applies to batches with the same settings.
Changing settings mid-task? Previous batches won’t be compared.
Cloud duplicates can also appear when the same task is rerun without new data being available, causing previously collected items to resurface in the results. Clearing data before rerunning cloud tasks is essential to prevent this.
Example Workflow:
1st Run: Dedupe by Field 1 → Batch A collected.
2nd Run: Dedupe by Field 2 → Batch B collected (no comparison to Batch A).
3rd Run: Revert to Field 1 → Batch C compared to Batch A (not Batch B).
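One way to picture the batch rule is the hypothetical sketch below (not Octoparse's code): collected keys are tracked separately per dedupe setting, so a new batch is only compared against earlier batches that used identical settings.

```python
# Hypothetical sketch of the cloud batch rule: a new batch is compared
# only against earlier batches collected with the same dedupe settings.
from collections import defaultdict

seen_by_settings = defaultdict(set)  # settings -> keys already collected

def collect_batch(rows, key_fields):
    settings = tuple(key_fields)
    seen = seen_by_settings[settings]  # shared only with same-settings batches
    kept = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

batch_a = collect_batch([{"Field 1": "A", "Field 2": "B"}], ["Field 1"])
batch_b = collect_batch([{"Field 1": "A", "Field 2": "B"}], ["Field 2"])  # kept: different settings
batch_c = collect_batch([{"Field 1": "A", "Field 2": "X"}], ["Field 1"])  # removed: matches Batch A
print(len(batch_a), len(batch_b), len(batch_c))  # 1 1 0
```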
Pro Tips
🔹 For reliable deduplication, use a unique identifier (e.g., product ID, URL).
🔹 Cloud tasks? Stick to one dedupe setting for consistency.
🔹 Need partial duplicates kept? Adjust field selections instead of using full-line dedupe.
🔹 Thoroughly test workflows to ensure configurations consistently collect data without duplicates, especially on dynamic websites.
🔹 Regularly update scraper settings to adapt to changes in website structures.
🔹 Leverage pre-tested templates from the Template Gallery for standard use cases like job listings or product pages.
🔹 Always clear data before initiating task reruns to prevent duplicate entries.
By fine-tuning deduplication, you can ensure clean, accurate datasets without manual cleanup! 🚀