During a web scraping project, you may want to clean the data fields while they are being scraped. Octoparse offers 10 data cleaning options for turning the extracted data into the format you need.
When should I refine the extracted data?
If you need a specific field in a particular format, you can use the "Clean Data" function to refine the field within Octoparse. Octoparse scrapes and refines the field directly during the scraping process, so there is no need to reformat it after exporting the data to an Excel file.
How to refine the extracted data in Octoparse?
To access these features in Octoparse, follow the 4 steps below:
Select the data field to refine
Click on the "..." icon and select Clean data
Click Add Step
Select an operation to reformat the data
Tip:
In programming, a "string" is a sequence of characters such as letters, numerals, symbols, and punctuation marks. For example, " " (space) is a string; "Octoparse" is a string; and "Hello 2 *% World!" is also a string. A string can also contain no characters at all, in which case it is an empty string. Replacing a word with an empty string is the same as deleting the word.
The word "string" appears in the descriptions of many of Octoparse's data cleaning options. Wherever it appears, the option can work on any kind of character in the extracted data, such as letters, words, sentences, numbers, spaces, symbols, and punctuation marks.
10 data cleaning options
1. Replace
Function: Replace specific strings in the extracted data with the new strings you want.
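To make the effect of Replace concrete, here is a minimal Python sketch of the same transformation. It is not Octoparse's implementation, and the sample value is hypothetical; it only illustrates what replacing a string, or replacing it with an empty string, does to a field.

```python
# Illustration only: mimics the "Replace" option with Python's str.replace.
raw_price = "Price: $1,299.00"  # hypothetical extracted value

# Replace one string with another.
relabeled = raw_price.replace("Price:", "Cost:")

# Replacing with an empty string deletes the matched text.
no_commas = raw_price.replace(",", "")

print(relabeled)  # Cost: $1,299.00
print(no_commas)  # Price: $1299.00
```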
2. Replace with regular expression
Function: Use a regular expression to replace the matched strings in the extracted data with the strings you want.
You can learn more about regular expressions on W3Schools.
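As a rough illustration of regex-based replacement, the sketch below uses Python's re.sub on a hypothetical extracted value. The regular expression syntax Octoparse accepts may differ in details from Python's, so treat the patterns as examples rather than values to copy.

```python
import re

# Illustration only: mimics "Replace with regular expression" using re.sub.
raw_text = "In   stock:\t12   units"  # hypothetical extracted value

# Collapse every run of whitespace into a single space.
collapsed = re.sub(r"\s+", " ", raw_text)

# Delete every character that is not a digit.
digits_only = re.sub(r"\D", "", raw_text)

print(collapsed)    # In stock: 12 units
print(digits_only)  # 12
```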
3. Match with regular expression
Function: Use a regular expression to extract the matched strings from the extracted data.
You can learn more about regular expressions on W3Schools.
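Matching keeps only the part of the field that fits the pattern. The Python sketch below shows the idea with re.search and re.findall; the sample text and pattern are hypothetical, and Octoparse's own matching behavior (for example, whether it returns the first match or all matches) is not specified here.

```python
import re

# Illustration only: mimics "Match with regular expression".
raw_text = "Rated 4.5 out of 5 stars (1,204 reviews)"  # hypothetical value

number_pattern = r"\d+(?:\.\d+)?"  # an integer or a decimal number

# Keep only the first match.
first = re.search(number_pattern, raw_text)
rating = first.group(0) if first else ""

# Or collect every match.
all_numbers = re.findall(number_pattern, raw_text)

print(rating)       # 4.5
print(all_numbers)  # ['4.5', '5', '1', '204']
```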
4. Trim spaces
Function: Remove unwanted spaces from the start and/or the end of the extracted data.
If you want to delete spaces within the data, you can use Replace or Replace with regular expression.
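The difference between trimming and removing inner spaces can be sketched in Python as follows. The sample value is hypothetical, and note one assumption: Python's strip methods remove all leading or trailing whitespace (including tabs and line breaks), which may be broader than what Octoparse trims.

```python
# Illustration only: mimics "Trim spaces" with Python's strip methods.
raw = "  Octoparse  Web Scraping  "  # hypothetical extracted value

both_ends = raw.strip()    # trim start and end
start_only = raw.lstrip()  # trim start only
end_only = raw.rstrip()    # trim end only

# Spaces inside the data are untouched by trimming; use a replace instead.
no_spaces = raw.replace(" ", "")

print(repr(both_ends))  # 'Octoparse  Web Scraping'
print(repr(no_spaces))  # 'OctoparseWebScraping'
```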
5. Add a prefix
Function: Add a string to the front of the extracted data.
6. Add suffix
Function: Add a string to the end of the extracted data.
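Adding a prefix or a suffix is plain string concatenation. The Python sketch below shows two common uses; the domain, path, and currency label are hypothetical examples, not values taken from Octoparse.

```python
# Illustration only: prefix and suffix are simple concatenation.
relative_url = "/products/123"  # hypothetical extracted value
price = "19.99"                 # hypothetical extracted value

# Prefix: turn a relative link into a full URL.
full_url = "https://example.com" + relative_url

# Suffix: append a unit or label.
labeled_price = price + " USD"

print(full_url)       # https://example.com/products/123
print(labeled_price)  # 19.99 USD
```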
7. Reformat extracted date/time
Function:
Convert the extracted date/time into one of the built-in formats or into your own custom format.
For example, you can reformat "2024-01-01" to "2024/01/01".
Convert a relative date and time to a specific date and time.
For example, you can convert "2 days ago" into "2024/01/01". This is useful when you are scraping the posting time of jobs, articles, or videos.
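Both behaviors described above can be sketched in Python. This is only an illustration under stated assumptions: the helper resolve_days_ago and the scrape time are hypothetical, the helper handles only the "N days ago" pattern, and Octoparse's own format codes and supported relative expressions may differ.

```python
import re
from datetime import datetime, timedelta

# 1) Reformat an absolute date: parse it, then print it in a new format.
parsed = datetime.strptime("2024-01-01", "%Y-%m-%d")
reformatted = parsed.strftime("%Y/%m/%d")
print(reformatted)  # 2024/01/01

# 2) Resolve a relative date against the time the page was scraped.
def resolve_days_ago(text: str, now: datetime) -> datetime:
    """Hypothetical helper: handles only strings like '2 days ago'."""
    match = re.fullmatch(r"(\d+) days? ago", text.strip())
    if match is None:
        raise ValueError(f"unsupported relative date: {text!r}")
    return now - timedelta(days=int(match.group(1)))

scrape_time = datetime(2024, 1, 3, 9, 30)  # assumed scrape time
posted = resolve_days_ago("2 days ago", scrape_time)
posted_text = posted.strftime("%Y/%m/%d")
print(posted_text)  # 2024/01/01
```

The relative result depends on when the scrape runs, which is why the scrape time is passed in explicitly here.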
8. Timestamp conversion
Function: Convert a Unix timestamp into your own custom format.
A Unix timestamp is a number that represents a specific date and time, typically counted in seconds since 1970-01-01 00:00:00 UTC. This function converts Unix time to a format that is easy to read.
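As a small worked example, the Python sketch below converts a timestamp expressed in seconds into a readable string. The sample value and output format are chosen for illustration; if a site provides timestamps in milliseconds, the value would need to be divided by 1000 first.

```python
from datetime import datetime, timezone

# Illustration only: convert a Unix timestamp (in seconds) to readable text.
unix_ts = 1704067200  # sample value

as_utc = datetime.fromtimestamp(unix_ts, tz=timezone.utc)
readable = as_utc.strftime("%Y-%m-%d %H:%M:%S")

print(readable)  # 2024-01-01 00:00:00
```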
9. Timezone conversion
Function: Convert the date and time to your target timezone.
For some websites, the date and time shown on the page follow the timezone of the country where the website is based. If you want to convert them to the timezone of your own country, this feature does it easily.
Tip:
This is useful if you are recording the time at which data was scraped in a Cloud run. The Cloud run timezone is based on UTC+0, so you can convert the time to your target timezone to avoid confusion.
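The conversion from UTC+0 to a target timezone can be sketched in Python as below. The scrape time and the UTC+8 target are assumptions chosen for illustration, and a fixed offset like this one does not account for daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Illustration only: shift a UTC+0 time to a fixed-offset target timezone.
scraped_at_utc = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # assumed

target_zone = timezone(timedelta(hours=8))  # assumed target: UTC+8
scraped_at_local = scraped_at_utc.astimezone(target_zone)

local_text = scraped_at_local.strftime("%Y-%m-%d %H:%M %z")
print(local_text)  # 2024-01-01 20:00 +0800
```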
10. HTML transcoding
Function: Convert specific HTML entities into plain text automatically. For example, transcode "&amp;" into "&".
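For comparison, Python's standard library performs the same kind of decoding with html.unescape. The sample text below is hypothetical, and the exact set of entities Octoparse converts is not listed here.

```python
import html

# Illustration only: decode HTML entities into plain characters.
raw = "Tom &amp; Jerry &lt;3 &quot;cartoons&quot;"  # hypothetical value

decoded = html.unescape(raw)

print(decoded)  # Tom & Jerry <3 "cartoons"
```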
Octoparse RegEx Tool
Octoparse also offers a RegEx Tool that generates the regular expression you need. Let's have a quick look at how to use it to generate and apply a regular expression. In this example, we want to extract the star-rating number from the extracted outer HTML.
Click Try RegEx Tool
Enter the match criteria: start with src=" and end with "
Click Generate to produce the regular expression
Click Match to pick up the matched strings
Click Apply
Click Confirm to save the settings
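The idea behind "start with" and "end with" criteria can be sketched in Python. Everything here is an assumption for illustration: the helper build_between_pattern is hypothetical, the criteria are read as src=" and ", the sample HTML is invented, and the expression Octoparse's tool actually generates may look different.

```python
import re

def build_between_pattern(start: str, end: str) -> str:
    """Hypothetical helper: match the text between two literal delimiters."""
    # Lookbehind and lookahead keep the delimiters out of the match itself.
    return f"(?<={re.escape(start)}).*?(?={re.escape(end)})"

# Invented outer HTML for a star-rating image.
outer_html = '<img class="rating" src="/img/stars-4.5.png" alt="rating">'

pattern = build_between_pattern('src="', '"')
matches = re.findall(pattern, outer_html)
print(matches)  # ['/img/stars-4.5.png']

# A second match step pulls the number out of the captured value.
number = re.search(r"\d+(?:\.\d+)?", matches[0])
star_rating = number.group(0) if number else ""
print(star_rating)  # 4.5
```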
Click the link here for more information about using the RegEx Tool.