In this lesson, we will walk you through some practical tips for refining your dataset after extraction when the data does not look exactly the way you want it to.
Rename/move/duplicate/delete a field
Once the data is extracted and shown in Data Preview, you can look through the dataset and start organizing it. Typical refinements include renaming fields, reordering columns, duplicating data fields, and deleting fields that your project does not need.
To rename a field, double-click the field name, then type in the new name directly.
Note:
Field names should only contain numbers, letters, and "_".
Field names should not start with a number.
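If you prepare field names in advance (for example, in a spreadsheet), you can check them against the two rules above before typing them in. The following is a minimal Python sketch, not Octoparse's own validation. It assumes "letters" means ASCII letters, and the function name is_valid_field_name is hypothetical.

```python
import re

# Assumption: "letters" means ASCII letters. Octoparse's actual check may differ.
# The first character must be a letter or "_"; the rest may also include digits.
FIELD_NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_field_name(name: str) -> bool:
    """Return True if the name follows both rules from the note above."""
    return FIELD_NAME_PATTERN.fullmatch(name) is not None


for candidate in ["product_title", "price2", "2nd_price", "unit price"]:
    print(candidate, is_valid_field_name(candidate))
```

Names such as "2nd_price" (starts with a number) and "unit price" (contains a space) fail the check.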
To move a field, place your cursor at the front of the field. When the hand icon appears, drag and drop the field to the position you want.
To delete a field, click on the More icon and select Delete field.
Tip:
If you have multiple fields to delete, you can switch to Vertical View and select the fields to delete.
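If you would rather reshape the data after exporting it, the same four operations are easy to reproduce in code. The sketch below is only an illustration in plain Python. It assumes each exported row is a dictionary keyed by field name, and all helper names and sample values are hypothetical.

```python
def rename_field(rows, old, new):
    """Rename a field in every row while keeping the column order."""
    return [{(new if key == old else key): value for key, value in row.items()}
            for row in rows]


def move_field(rows, name, index):
    """Move a field to the given column position."""
    moved = []
    for row in rows:
        keys = [key for key in row if key != name]
        keys.insert(index, name)
        moved.append({key: row[key] for key in keys})
    return moved


def duplicate_field(rows, name, copy_name):
    """Append a copy of an existing field under a new name."""
    return [{**row, copy_name: row[name]} for row in rows]


def delete_fields(rows, names):
    """Drop every field whose name is in `names`."""
    return [{key: value for key, value in row.items() if key not in names}
            for row in rows]


# Hypothetical sample row standing in for an exported dataset.
rows = [{"Title": "Blue Widget", "Price": "$9.99", "Image": "w.png"}]
rows = rename_field(rows, "Title", "product_title")
rows = move_field(rows, "Price", 0)
rows = duplicate_field(rows, "Price", "price_raw")
rows = delete_fields(rows, {"Image"})
print(rows)
```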
Clean data
Octoparse provides many ways to clean your data. For example, you can replace a text string, trim extra spaces, add a prefix or suffix, replace a string with RegEx, reformat date/time, and more. You can apply one or more cleaning steps to any single data field until the data meets your requirements. Some of these steps require a regular expression; the Octoparse RegEx tool can help you write one.
In Data Preview, click the More icon for the data field you'd like to refine and select Clean data.
Select Add Step, and then select what you would like to do with the data. You can keep working with the data by adding more steps until the data meets your requirements.
Replace: replace the specific string(s) in the extracted data with the new string(s) that you want.
Replace with Regular Expression: use a specific regular expression to replace the matched string(s) in the extracted data with the string(s) that you want.
Match with Regular Expression: use a specific regular expression to pick up the matched string(s) from the extracted data.
Trim spaces: remove the unwanted space(s) from the start and/or the end of the extracted data.
Add a prefix: add a string to the front of the extracted data.
Add a suffix: add a string to the end of the extracted data.
Reformat extracted date/time: convert the extracted date/time into one of the 14 built-in formats, or into your own customized format.
Timestamp conversion: a timestamp is an encoded value that identifies a recorded date and time. Use timestamp conversion to convert such a value into a readable time format.
Timezone conversion: convert the date & time to your target timezone.
HTML transcoding: convert certain HTML entities into plain text automatically. For example, transcode "&gt;" into ">" and "&nbsp;" into a space.
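To get a feel for how several text-cleaning steps stack up, here is a rough Python sketch that mirrors the steps listed above using the standard library. It is an illustration only, not how Octoparse implements them. The sample value, the patterns, the prefix and suffix, and the choice to return an empty string when nothing matches are all assumptions.

```python
import html
import re


def clean(value: str) -> str:
    """Apply a chain of cleaning steps, one after another."""
    # HTML transcoding: "&gt;" becomes ">"; "&nbsp;" decodes to a
    # non-breaking space, which is then turned into a regular space.
    value = html.unescape(value).replace("\xa0", " ")
    # Trim spaces: remove whitespace from both ends.
    value = value.strip()
    # Replace: swap a fixed string for another (here, remove the label).
    value = value.replace("Price: ", "")
    # Replace with Regular Expression: remove "$" and thousands separators.
    value = re.sub(r"[,$]", "", value)
    # Match with Regular Expression: keep only the first number found.
    match = re.search(r"\d+(?:\.\d+)?", value)
    value = match.group(0) if match else ""  # assumption: empty when no match
    # Add a prefix and a suffix.
    return "USD " + value + " per unit"


raw = "  Price:&nbsp;$1,299.00 &gt; list  "
print(clean(raw))  # USD 1299.00 per unit
```

The order matters: trimming before the fixed-string replacement, for instance, is what lets "Price: " match at the start of the value.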
Note: To learn more about reformatting data and the Regular Expression tool in Octoparse, check here.
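The three date and time steps in the list above can be pictured with Python's datetime module. This is a sketch under stated assumptions: the input format, the sample values, and the function names are hypothetical, the timestamp is assumed to be a Unix timestamp in seconds, and a fixed UTC offset stands in for a real timezone (so daylight saving time is ignored).

```python
from datetime import datetime, timedelta, timezone

OUTPUT_FORMAT = "%Y-%m-%d %H:%M:%S"


def reformat_datetime(text: str, input_format: str) -> str:
    """Reformat extracted date/time: parse one layout, emit another."""
    return datetime.strptime(text, input_format).strftime(OUTPUT_FORMAT)


def convert_timestamp(text: str) -> str:
    """Timestamp conversion: assumes a Unix timestamp in seconds, shown in UTC."""
    return datetime.fromtimestamp(int(text), tz=timezone.utc).strftime(OUTPUT_FORMAT)


def convert_timezone(text: str, from_offset_hours: int, to_offset_hours: int) -> str:
    """Timezone conversion between two fixed UTC offsets (no DST handling)."""
    source = timezone(timedelta(hours=from_offset_hours))
    target = timezone(timedelta(hours=to_offset_hours))
    moment = datetime.strptime(text, OUTPUT_FORMAT).replace(tzinfo=source)
    return moment.astimezone(target).strftime(OUTPUT_FORMAT)


print(reformat_datetime("03/15/2024 14:30", "%m/%d/%Y %H:%M"))  # 2024-03-15 14:30:00
print(convert_timestamp("1710513000"))                          # 2024-03-15 14:30:00
print(convert_timezone("2024-03-15 14:30:00", 0, 8))            # 2024-03-15 22:30:00
```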
Capture HTML code
When auto-detection is used to capture data from a web page, Octoparse automatically extracts the text and the URL of the elements. However, you can manually customize the data field and tell Octoparse to extract the element's HTML code instead.
In the Data Preview section, click the More icon and select Customize field, then choose how you'd like to capture the selected data. You can also select other attributes from the HTML code.
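To see what the different capture choices mean for a single element, consider the small Python sketch below. It is only an illustration: the snippet and its values are made up, the terms "inner HTML" and "outer HTML" are used here for explanation and may not match the option names in Octoparse, and the standard-library XML parser is used on the assumption that the snippet is well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed element standing in for a link on a web page.
snippet = '<a href="https://example.com/item/1" title="Item 1">Blue <b>Widget</b></a>'
element = ET.fromstring(snippet)

# Text: all visible text inside the element, with nested tags removed.
text = "".join(element.itertext())

# URL and other attributes: values read from the opening tag.
url = element.get("href")
title = element.get("title")

# Inner HTML: the markup between the opening and closing tags.
inner_html = (element.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in element
)

# Outer HTML: the element itself, including its own tags.
outer_html = ET.tostring(element, encoding="unicode")

print(text)        # Blue Widget
print(url)         # https://example.com/item/1
print(title)       # Item 1
print(inner_html)  # Blue <b>Widget</b>
print(outer_html)
```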
Add pre-defined data fields
Octoparse offers a number of pre-defined data fields that make it easy to capture page-level data, the current date & time, a fixed value, or the original input URL.
Click the + sign at the upper-right corner of Data Preview. Select any pre-defined data fields that you would like to add to the dataset.
Current date & time: the date and time when the data is extracted from the web page
Page-level data: page URL, page title, meta keyword, meta description, and HTML source code
Fixed value: any fixed value you define
Original input URL: the URL you originally input for scraping
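Conceptually, each pre-defined field adds one more column with the same kind of value on every row. The Python sketch below shows that idea on an already-extracted dataset. It is an illustration only: the function name, the column names, and the sample values are hypothetical, and each row is assumed to be a dictionary.

```python
from datetime import datetime


def add_predefined_fields(rows, page_url, input_url, fixed_value, extracted_at):
    """Return new rows with four extra columns appended to each one."""
    stamp = extracted_at.strftime("%Y-%m-%d %H:%M:%S")
    return [
        {
            **row,
            "current_time": stamp,       # current date & time
            "page_url": page_url,        # page-level data
            "fixed_value": fixed_value,  # fixed value
            "original_url": input_url,   # original input URL
        }
        for row in rows
    ]


rows = [{"product_title": "Blue Widget", "price": "9.99"}]
enriched = add_predefined_fields(
    rows,
    page_url="https://example.com/list?page=2",
    input_url="https://example.com/list",
    fixed_value="spring_campaign",
    extracted_at=datetime(2024, 3, 15, 14, 30),
)
print(enriched[0])
```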
Continue to >> Lesson 4: Test-run the task