Remove duplicates

A dataset can contain duplicates because the website itself shows the same data more than once, or because the task was set up in a way that captures the same data two or more times. When this happens, there are two ways to remove the duplicates, depending on your data requirements:

1. Remove duplicates when the entire data lines are the same (default setting)

By default, when a run is completed, Octoparse treats data lines as duplicates only when the entire lines are the same (every data field has the same value). You can then remove the duplicates and keep only the unique lines.

Example: Lines #1 and #4 below have the same value for every data field, so they are duplicates. After de-dup, Octoparse keeps only the first data line extracted, which is line #1 in this case.

mceclip1.png
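
The default behavior described above can be sketched in a few lines of Python. This is only an illustration of the rule (keep the first fully identical line, drop later copies), not Octoparse's actual implementation. The sample values and the function name `dedupe_full_rows` are made up for this example.

```python
# Hypothetical sample data: field values are invented for illustration.
rows = [
    {"Field1": "A", "Field2": "x", "Field3": "1", "Field4": "p"},  # line #1
    {"Field1": "B", "Field2": "x", "Field3": "2", "Field4": "q"},  # line #2
    {"Field1": "C", "Field2": "y", "Field3": "3", "Field4": "r"},  # line #3
    {"Field1": "A", "Field2": "x", "Field3": "1", "Field4": "p"},  # line #4, same as #1
]


def dedupe_full_rows(rows):
    """Keep the first occurrence of each line whose fields are all identical."""
    seen = set()
    kept = []
    for row in rows:
        # The whole line is the comparison key: every field name and value.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


unique_rows = dedupe_full_rows(rows)
print(len(unique_rows))  # 3: line #4 is dropped, line #1 is kept
```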

2. Remove duplicates when selected data fields are the same

Note: This feature is available in Octoparse 8.1.16 and above.

When building the task workflow, you can further customize the task to remove data lines that share the same values for one or more data fields. Data lines are treated as duplicates as long as the values of the selected data fields are the same. Unselected data fields are not considered.

Example 1: Suppose we select "Field2" to compare for deduplication. Lines #1, #2, and #4 all have the same value for "Field2", so they are considered duplicates. After de-dup, Octoparse keeps only the first data line extracted, which is line #1 in this case, and removes lines #2 and #4.

mceclip2.png

Example 2: Suppose we select "Field3" and "Field4" to compare for deduplication. Lines #1 and #4 have the same value for "Field3" and the same value for "Field4", so they are considered duplicates. After de-dup, Octoparse keeps only the first data line extracted, which is line #1 in this case, and removes line #4 automatically.

mceclip3.png
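
The two examples above can be reproduced with a small Python sketch. It assumes made-up sample values chosen to match the examples, and the helper name `dedupe_by_fields` is hypothetical. It illustrates the comparison rule only and does not show how Octoparse works internally.

```python
# Hypothetical sample data: field values are invented to match the examples.
rows = [
    {"Field1": "A", "Field2": "x", "Field3": "1", "Field4": "p"},  # line #1
    {"Field1": "B", "Field2": "x", "Field3": "2", "Field4": "q"},  # line #2
    {"Field1": "C", "Field2": "y", "Field3": "3", "Field4": "r"},  # line #3
    {"Field1": "A", "Field2": "x", "Field3": "1", "Field4": "p"},  # line #4
]


def dedupe_by_fields(rows, fields):
    """Keep the first line for each distinct combination of the selected fields."""
    seen = set()
    kept = []
    for row in rows:
        # Only the selected fields form the comparison key;
        # unselected fields are ignored.
        key = tuple(row[name] for name in fields)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Example 1: compare on "Field2" only, so lines #2 and #4 are removed.
by_field2 = dedupe_by_fields(rows, ["Field2"])

# Example 2: compare on "Field3" and "Field4", so only line #4 is removed.
by_field3_and_4 = dedupe_by_fields(rows, ["Field3", "Field4"])
```

Note that in Example 1 line #2 is removed even though its "Field1" value differs from line #1, because "Field1" was not selected for comparison.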

Follow the steps below to customize de-dup settings:

  • Set up the task and the data fields you need to collect

  • Click the icon in the top-right corner of the Data Preview section

11.png
  • Select the data field(s) you'd like to compare for deduplication. After selection, click Apply to save the settings.

10.png

Tip:

For Cloud runs, new data is compared and de-duplicated on a continuous basis only against earlier data that was collected with the same de-dup setting.

For example, let's say you set the 1st de-dup setting as A (e.g., select "Field1" to compare) and get the 1st batch of Cloud data.

Then, you return to your task and modify the de-dup setting to B (e.g., select "Field2" to compare) and get the 2nd batch of Cloud data. This second batch of data will not be compared against the 1st batch of data for deduplication.

After that, suppose you change the setting back to A (e.g., select "Field1" to compare) and get the 3rd batch of Cloud data. This third batch of data will be compared and de-dup'ed against the 1st batch of Cloud data.
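
One way to picture the Cloud behavior in this tip is a separate history of seen values for each de-dup setting. The Python sketch below is a simplified model built on that assumption. The class `CloudDedupModel`, its method, and the sample values are hypothetical and do not reflect Octoparse's internal design.

```python
class CloudDedupModel:
    """Toy model: each de-dup setting keeps its own history of seen values."""

    def __init__(self):
        # Maps a setting (the selected field names) to the keys seen under it.
        self._seen_by_setting = {}

    def add_batch(self, fields, rows):
        """Return the lines of a new batch that survive de-dup under `fields`."""
        setting = tuple(sorted(fields))
        seen = self._seen_by_setting.setdefault(setting, set())
        kept = []
        for row in rows:
            key = tuple(row[name] for name in setting)
            if key not in seen:
                seen.add(key)
                kept.append(row)
        return kept


model = CloudDedupModel()

# 1st batch with setting A (compare on "Field1").
batch1 = model.add_batch(["Field1"], [{"Field1": "A", "Field2": "x"}])

# 2nd batch with setting B (compare on "Field2"): not compared with batch 1,
# so a line with Field1 == "A" is kept.
batch2 = model.add_batch(["Field2"], [{"Field1": "A", "Field2": "y"}])

# 3rd batch with setting A again: compared against batch 1,
# so the line with Field1 == "A" is dropped.
batch3 = model.add_batch(
    ["Field1"],
    [{"Field1": "A", "Field2": "z"}, {"Field1": "B", "Field2": "x"}],
)
```

In this model, switching settings never erases the history of an earlier setting, which is why the third batch is still checked against the first.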
