Regular Expression Tool

Updated this week

A regular expression (RegEx) is a special text string that defines a search pattern. String-searching algorithms use it for "find" or "find and replace" operations on strings. You can learn the basics of regular expressions here.

In Octoparse, you can use RegEx to match/replace characters in a field value to refine the extracted data directly.
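As a rough illustration of what "match" and "replace" mean for a field value, here is a minimal Python sketch. It runs outside Octoparse, and the sample field value and both patterns are assumptions chosen for the example, not something Octoparse produces.

```python
import re

# Hypothetical field value, as it might come out of an extraction step.
raw_price = "Price: $1,299.00 (incl. tax)"

# Match: keep only the numeric part of the field.
match = re.search(r"[\d,]+\.\d{2}", raw_price)
amount = match.group(0) if match else ""

# Replace: drop the thousands separator so the value parses as a number.
clean_amount = re.sub(r",", "", amount)

print(amount)        # 1,299.00
print(clean_amount)  # 1299.00
```

The same two operations, matching a part of the text and replacing unwanted characters, are what the RegEx step applies to each extracted value.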

The Octoparse RegEx tool is a built-in tool that generates regular expressions automatically from criteria you set. It is especially helpful if you know little about regular expression syntax.


How to Access the RegEx Tool

There are two ways to access the RegEx tool in Octoparse:

1. Via the Clean Data Menu

  • Select the data field you wish to customize.

  • Click the "..." button and choose Clean Data.

  • Click Add step and select the RegEx option.

2. Via the Sidebar

  • Locate and click the Tools icon in the left-hand sidebar navigation.


Understanding the RegEx Tool Interface

Version 8.8.0 and later

1. RegEx Patterns
This is a library of pre-built, commonly used regular expressions. You can browse or search for a pattern that fits your needs (e.g., matching emails, phone numbers, URLs, or specific date formats). This is the fastest way to apply a RegEx without building it yourself.

2. AI RegEx Generator
This feature generates a regular expression from natural language. Describe in plain text what you want to match or find, and the AI will attempt to create an appropriate RegEx pattern for you.

  • ⬆️ Example Input: "Find all prices in the format like $10.99 or €5.50"

  • ⬇️ Example Output: The AI would generate a pattern similar to [$€]\d+\.\d{2}, which accepts either currency symbol before the amount.
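To check a generated pattern against the request, it helps to run it on a sample string. The Python sketch below compares two candidate patterns for the price example; the sample text is made up, and the patterns are only plausible outputs, since the AI generator may return something different.

```python
import re

# Made-up sample text containing both currency symbols.
text = "Coffee $10.99, tea €5.50, shipping 3 days, total $16.49"

# Accepts either symbol; inside [...] the $ is a literal character.
either_symbol = re.findall(r"[$€]\d+\.\d{2}", text)

# Dollar-only variant with an optional symbol: the euro amount is
# still found, but the € sign is dropped from the match.
dollar_only = re.findall(r"\$?\d+\.\d{2}", text)

print(either_symbol)  # ['$10.99', '€5.50', '$16.49']
print(dollar_only)    # ['$10.99', '5.50', '$16.49']
```

Whichever pattern the generator returns, testing it on text that covers every format you described is a quick way to catch gaps like the missing € above.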

💡Tip:

In the Test String box, describe in plain English what you want to find or extract.

  • Good Example: "Extract all 10-digit US phone numbers in the format (123) 456-7890."

  • Good Example: "Find all words that start with a capital letter and are at least 5 letters long."
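For reference, the two descriptions above could correspond to patterns like the ones in this Python sketch. These are hand-written assumptions about a reasonable result, not actual output from the AI RegEx Generator, and the sample sentence is invented.

```python
import re

# Invented sample sentence for testing both patterns.
sample = "Call Maria at (123) 456-7890 or Office Manager Bob at (987) 654-3210."

# US phone numbers in the exact format (123) 456-7890.
phone_pattern = r"\(\d{3}\) \d{3}-\d{4}"

# Words starting with a capital letter, at least 5 letters long in total.
capitalized_pattern = r"\b[A-Z][a-zA-Z]{4,}\b"

phones = re.findall(phone_pattern, sample)
words = re.findall(capitalized_pattern, sample)

print(phones)  # ['(123) 456-7890', '(987) 654-3210']
print(words)   # ['Maria', 'Office', 'Manager']
```

Note how a precise description (exact format, minimum length) maps directly onto a precise pattern; "Call" and "Bob" are skipped because they are shorter than 5 letters.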

3. RegEx Builder
This is the successor to the classic "Generate" tab. It provides a form-based interface for building a custom regular expression by selecting options and filling in parameters (e.g., "Starts with," "Ends with," "Contains"). It automatically translates your choices into the correct RegEx syntax, which suits anyone who is still learning RegEx or prefers a visual approach.

Before version 8.8.0

The main interface of the RegEx tool consists of four parts:

1. Original Text

  • If opened from the Clean Data menu, this area automatically displays the extracted text from your selected field.

  • If opened from the sidebar, you can manually type or paste a sample text string here to test your expressions.

2. Configuration Tabs (Generate/Reference/Sample)

  • Generate: This is the main tab for creating expressions. You can check various options and fill in parameters to have Octoparse build a RegEx for you automatically.

  • Reference & Sample: These tabs are reserved for future tutorials and guides.

3. Regular Expression

  • This box displays the auto-generated RegEx code based on your selections in the "Generate" tab.

  • Check the "Match All" box to find every occurrence that matches the pattern, then click the "Match" button to test the expression.

4. Matches

  • This box shows the results of the RegEx operation. The first match is displayed by default; if "Match All" is checked, all matches will be listed in order.
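The difference between the default behavior and "Match All" is similar to the difference between finding the first match and finding every match in Python. The sketch below is an analogy only; the sample text and pattern are assumptions, and Octoparse's internal matching is not shown in this article.

```python
import re

# Invented sample text with three order IDs.
original_text = "Order A-1001 shipped; order A-1002 pending; order A-1003 cancelled."
pattern = r"A-\d{4}"

# Comparable to the default: only the first match is returned.
first = re.search(pattern, original_text)
first_match = first.group(0) if first else None

# Comparable to "Match All": every match, in order of appearance.
all_matches = re.findall(pattern, original_text)

print(first_match)  # A-1001
print(all_matches)  # ['A-1001', 'A-1002', 'A-1003']
```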


How to Use the Octoparse RegEx Builder

STEP 1:

Check the options and fill in the needed parameters (1), then click Generate (2) to produce a Regular Expression (3).

  • "Start/End with": Matches the content that starts or ends with the character(s) you enter in the box, excluding those characters from the result.

  • "Include Start/End": This option can only be used when "Start/End with" is checked. Once you check "Include Start/End", the match result will include the text string you've entered.

  • "Contain One": Matches the content that contains the character(s) you've entered.
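To make the "Start/End with" and "Include Start/End" options more concrete, here is a Python sketch of how such options could translate into RegEx syntax. The helper `build_pattern` and its parameters are hypothetical, and the expression Octoparse actually generates may look different. "Contain One" is left out because its exact behavior is not specified here.

```python
import re

def build_pattern(start="", end="", include_start=False, include_end=False):
    """Hypothetical helper: build a pattern from start/end delimiters."""
    prefix = suffix = ""
    if start:
        s = re.escape(start)
        # Lookbehind requires the delimiter but keeps it out of the match.
        prefix = s if include_start else f"(?<={s})"
    if end:
        e = re.escape(end)
        # Lookahead requires the delimiter but keeps it out of the match.
        suffix = e if include_end else f"(?={e})"
    # Lazy match up to the end delimiter; greedy to end of line without one.
    body = ".*?" if end else ".*"
    return prefix + body + suffix

text = "SKU: [AB-123] in stock"

excluded = re.search(build_pattern("[", "]"), text).group(0)
included = re.search(build_pattern("[", "]", True, True), text).group(0)

print(excluded)  # AB-123
print(included)  # [AB-123]
```

With the delimiters excluded, only the text between them is returned; with both "include" flags set, the delimiters become part of the match.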

STEP 2:

Click the Match button (4). Check the Match All box first if you'd like to see all matches.

STEP 3:

Once you are satisfied with the previewed matches, click the "Apply" button to confirm and implement the changes.