What Is a Regular Expression?
Regular expressions (RegEx) are special sequences of characters used to search for patterns within text. They work like a “search tool” that lets you find specific strings or patterns, which makes working with text data easier.
For example, if you want to search for the word “tagumpay” (“success” in Filipino) in a text file, you can write a RegEx pattern that matches “tagumpay” and use it to find every occurrence.
Why Use RegEx?
RegEx can be really helpful in many use-cases:
- Text Searching: Find specific words or patterns within a string.
- Data Validation: Validate specific formats such as email addresses, phone numbers, and postal codes.
- Data Cleaning: Remove unwanted characters or whitespace from text.
- Data Extraction: Extract specific information from large datasets such as dates, prices, or keywords.
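The four use cases above can be sketched in a few lines with Python's built-in re module. The sample text and the email pattern are made up for illustration; the email check in particular is deliberately simplified and is not a complete validator.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Order #123:   2 mangoes at P45.50, shipped 2024-01-15 to juan@example.com"

# Text searching: is the word "mangoes" present anywhere?
found = re.search(r"mangoes", text) is not None

# Data validation: a simplified email check (an assumption, not a full validator).
email_ok = re.fullmatch(r"[\w.+-]+@[\w-]+\.[\w.]+", "juan@example.com") is not None

# Data cleaning: collapse runs of whitespace into a single space.
cleaned = re.sub(r"\s+", " ", text)

# Data extraction: pull out dates written as YYYY-MM-DD.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

print(found, email_ok)
print(cleaned)
print(dates)
```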
Quick Start Guide to Regular Expressions: Metacharacters, Special Sequences, and Sets
Metacharacters: Your “Pattern-Making Tools”
In RegEx, metacharacters are special symbols that help define the pattern we are looking for. They work as shortcuts for specifying the “rules” of your pattern.
Character | Description | Example | Explanation |
[ ] | A set of specific characters | [aeiou] | Matches any vowel (a, e, i, o, or u). |
\ | Signals a special sequence (or “escapes” special characters) | \d | Matches any digit (0-9), or lets you use symbols like \. to match a literal period. |
. | Any character except a newline | a.b | Matches “a” followed by any character and then “b” (e.g., “acb,” “a1b,” “a-b”). |
^ | Matches if the text starts with this pattern | ^Hello | Matches any string that starts with “Hello”. |
$ | Matches if the text ends with this pattern | world$ | Matches any string that ends with “world”. |
* | Zero or more occurrences of the previous character | ba* | Matches “b,” “ba,” “baa,” and so on. |
+ | One or more occurrences of the previous character | ba+ | Matches “ba,” “baa,” “baaa,” etc., but not just “b”. |
? | Zero or one occurrence of the previous character | colou?r | Matches “color” or “colour”. |
{ } | Matches a specified number of occurrences | \d{3} | Matches exactly 3 digits (e.g., “123” but not “12”). |
`|` | Acts as an “or” between patterns | `cat|dog` | Matches either “cat” or “dog”. |
( ) | Groups patterns together | (abc)+ | Matches “abc,” “abcabc,” etc., treating “abc” as a unit. |
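To see the table in action, here is a minimal sketch that checks a handful of the metacharacters with re.fullmatch(), which succeeds only if the whole sample string matches the pattern. The sample strings are chosen for illustration.

```python
import re

# (pattern, sample string, should the whole string match?)
checks = [
    (r"[aeiou]", "e", True),        # a set matches one listed character
    (r"a.b", "a-b", True),          # . matches any single character
    (r"ba*", "b", True),            # * allows zero "a"s
    (r"ba+", "b", False),           # + needs at least one "a"
    (r"colou?r", "colour", True),   # ? makes the "u" optional
    (r"\d{3}", "12", False),        # {3} needs exactly three digits
    (r"cat|dog", "dog", True),      # | accepts either alternative
    (r"(abc)+", "abcabc", True),    # ( ) repeats "abc" as a unit
]

results = []
for pattern, sample, expected in checks:
    matched = re.fullmatch(pattern, sample) is not None
    results.append(matched)
    print(f"{pattern:10} {sample:8} matched={matched} expected={expected}")

# ^ and $ anchor a search to the start or end of the text.
starts_with_hello = re.search(r"^Hello", "Hello world") is not None
ends_with_world = re.search(r"world$", "Hello world") is not None
print(starts_with_hello, ends_with_world)
```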
Special Sequences: Quick Pattern Shortcuts
Special sequences start with \ and let you search for particular types of characters.
Character | Description | Example | Explanation |
\A | Matches if the text starts with this pattern | \AOnce | Matches any string that starts with “Once”. |
\b | Matches at the beginning or end of a word | r"\bhero\b" | Matches “hero” as a whole word, not inside words like “superhero”. |
\B | Matches inside a word (not at boundaries) | r"\Bhero" | Matches “hero” inside other words, like “superhero”. |
\d | Matches any digit (0–9) | r"\d+" | Matches any sequence of digits (e.g., “123”). |
\D | Matches any non-digit | r"\D+" | Matches any sequence of non-digits (e.g., “abc”). |
\s | Matches any whitespace (space, tab, newline) | r"\s" | Useful for finding spaces or line breaks. |
\S | Matches any non-whitespace | r"\S+" | Useful for matching words or characters with no spaces. |
\w | Matches any “word” character (letters, digits, underscore) | r"\w+" | Useful for matching words or variable names. |
\W | Matches any non-word character | r"\W+" | Matches spaces, punctuation, etc. |
\Z | Matches if the text ends with this pattern | r"end\Z" | Matches any string ending with “end”. |
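The difference between \b and \B is easiest to see by looking at where each pattern matches. The following sketch uses a made-up sentence and records the starting position of each match; the other sequences are shown alongside for comparison.

```python
import re

# Sample sentence invented for this illustration.
text = "Our hero met a superhero at 9 AM."

# \bhero\b: "hero" as a standalone word (starts at index 4).
whole_word = [m.start() for m in re.finditer(r"\bhero\b", text)]

# \Bhero: "hero" that does NOT begin a word (inside "superhero", index 20).
inside_word = [m.start() for m in re.finditer(r"\Bhero", text)]

digits = re.findall(r"\d+", text)   # runs of digits
words = re.findall(r"\w+", text)    # runs of word characters

# \A and \Z anchor to the very start and very end of the string.
starts = re.search(r"\AOur", text) is not None
ends = re.search(r"AM\.\Z", text) is not None

print(whole_word, inside_word)
print(digits)
print(words)
print(starts, ends)
```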
Sets: Picking Specific Characters
Sets help you specify a group of characters that you want to match. They’re wrapped in square brackets [ ].
Character | Description | Example | Explanation |
[aeiou] | Matches any of the specified characters | [aeiou] | Matches any lowercase vowel. |
[a-z] | Matches any character in this range | [a-z] | Matches any lowercase letter. |
[^aeiou] | Matches any character except those specified | [^aeiou] | Matches any character that is not a lowercase vowel. |
[0-9] | Matches any digit from 0 to 9 | [0-9] | Matches any single digit. |
[A-Za-z] | Matches any uppercase or lowercase letter | [A-Za-z] | Matches any letter, regardless of case. |
[+] | In sets, most special characters lose their special meaning | [+] | Matches the plus sign literally. |
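As a quick check of the sets above, this sketch runs each one through re.findall() on short sample strings (chosen here for illustration) and prints what each set picks out.

```python
import re

# Sample text invented for this illustration.
text = "Mabuhay 2024! C++ rocks."

vowels = re.findall(r"[aeiou]", text)        # lowercase vowels only
digits = re.findall(r"[0-9]", text)          # single digits
plus_signs = re.findall(r"[+]", text)        # + is literal inside a set
letters = re.findall(r"[A-Za-z]+", text)     # runs of letters, any case

# A leading ^ negates the set: every character that is not a lowercase vowel.
non_vowels = re.findall(r"[^aeiou]", "tagumpay")

print(vowels)
print(digits)
print(plus_signs)
print(letters)
print(non_vowels)
```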
How to Use RegEx in Python
Python has a built-in module called re that includes many functions for working with RegEx patterns.
Import the re module: Before using RegEx, import the re module.
import re
Defining Patterns: Define a pattern, then use the functions provided by the re module to match, search, or replace text.
Function 1: re.match()
The re.match() function checks if the pattern matches only at the start of the string.
    import re

    # Example text mentioning tinola
    text = "Tinola ang paboritong ulam ng ating pambansang bayani na si Dr. Jose Rizal"

    # Pattern to check if "tinola" appears at the start of the text
    pattern = r"^tinola"

    # re.IGNORECASE makes the match case-insensitive
    result = re.match(pattern, text, re.IGNORECASE)

    if result:
        print("Match found:", result.group())  # Output: Match found: Tinola
    else:
        print("No match at the start.")
Function 2: re.search()
The re.search() function scans through the entire text to find the first occurrence of a pattern.

    import re

    # Text with multiple mentions of Katipunan
    text = """Ang Katipunan ay isang lihim na samahan noong panahon ng rebolusyon.
    Isa sa mga layunin ng Katipunan ay ang kasarinlan ng Pilipinas."""

    # Pattern to find the word "Katipunan" anywhere in the text
    pattern = r"Katipunan"

    result = re.search(pattern, text)

    if result:
        print("Found:", result.group())  # Output: Found: Katipunan
    else:
        print("No match found.")
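To make the difference between re.match() and re.search() concrete, here is a small sketch that applies the same pattern with both functions. It also uses the match object's span() method to show where the match was found; the shortened sentence is adapted from the example above.

```python
import re

text = "Ang Katipunan ay isang lihim na samahan."

# re.match() only looks at the start of the string, and the text starts with "Ang".
at_start = re.match(r"Katipunan", text)

# re.search() scans the whole string and returns the first occurrence.
anywhere = re.search(r"Katipunan", text)

print(at_start)          # None
print(anywhere.group())  # Katipunan
print(anywhere.span())   # (4, 13): start and end index of the match
```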
Function 3: re.findall()
The re.findall() function is used to find all occurrences of a pattern in the text.
    import re

    # Text with honorifics
    text = "Si Ginoo Juan at Aling Maria ay nagtutulungan sa bayan."

    # Pattern to find the Filipino honorifics "Ginoo" and "Aling" plus the name that follows.
    # (?:...) is a non-capturing group, so findall() returns the full match.
    pattern = r"\b(?:Ginoo|Aling) \w+\b"

    result = re.findall(pattern, text)
    print("Honorifics found:", result)  # Output: Honorifics found: ['Ginoo Juan', 'Aling Maria']
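One detail worth knowing about re.findall(): when the pattern contains capturing groups, it returns the contents of the groups rather than the whole match. The sketch below compares three variants of the honorifics pattern on the same sentence; the variable names are only for illustration.

```python
import re

text = "Si Ginoo Juan at Aling Maria ay nagtutulungan sa bayan."

# One capturing group: findall() returns only what the group captured.
grouped = re.findall(r"\b(Ginoo|Aling) \w+\b", text)

# Non-capturing group (?:...): findall() returns the whole match.
ungrouped = re.findall(r"\b(?:Ginoo|Aling) \w+\b", text)

# Two capturing groups: findall() returns one tuple per match.
pairs = re.findall(r"\b(Ginoo|Aling) (\w+)\b", text)

print(grouped)    # ['Ginoo', 'Aling']
print(ungrouped)  # ['Ginoo Juan', 'Aling Maria']
print(pairs)      # [('Ginoo', 'Juan'), ('Aling', 'Maria')]
```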
Function 4: re.sub()
The re.sub() function replaces all matches of a pattern with a specified replacement string.
    import re

    # Sample text with polite terms
    text = "Maraming salamat po! Ako po si Ginoo Rizal. Pumunta po ako para magbayad ng sedula."

    # Pattern to replace "po" and "opo" with "Please"
    pattern = r"\b(po|opo)\b"

    new_text = re.sub(pattern, "Please", text)
    print("Adjusted text:\n", new_text)
    # Output: "Maraming salamat Please! Ako Please si Ginoo Rizal. Pumunta Please ako para magbayad ng sedula."
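re.sub() also accepts an optional count argument and lets the replacement refer back to captured groups. Here is a short sketch of both, using a shortened version of the sample text; the replacement strings are arbitrary choices for illustration.

```python
import re

text = "Maraming salamat po! Ako po si Ginoo Rizal."

# count=1 limits the replacement to the first match only.
first_only = re.sub(r"\bpo\b", "PO", text, count=1)

# \1 and \2 in the replacement refer back to the captured groups,
# here used to move the honorific after the name.
swapped = re.sub(r"\b(Ginoo) (\w+)", r"\2 (\1)", text)

print(first_only)  # Maraming salamat PO! Ako po si Ginoo Rizal.
print(swapped)     # Maraming salamat po! Ako po si Rizal (Ginoo).
```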
Try this out!
    import re

    text = """Ang galing ng mga Pilipino! Isang malaking tagumpay ang parangal na natanggap nila.
    Ang parangal na ito ay tunay na karapat-dapat."""

    # Step 1: Define a pattern to match success-related words
    success_pattern = r"\b(galing|tagumpay|parangal)\b"

    # Step 2: Find all success-related words
    matches = re.findall(success_pattern, text)
    print("Filipino excellence words found:", matches)
    # Output: ['galing', 'tagumpay', 'parangal', 'parangal']

    # Step 3: Replace "parangal" with "prestihiyosong parangal" for emphasis
    text_with_emphasis = re.sub(r"\bparangal\b", "prestihiyosong parangal", text)
    print("Text with emphasized achievement:\n", text_with_emphasis)
    # Output: "Ang galing ng mga Pilipino! Isang malaking tagumpay ang prestihiyosong parangal na natanggap nila.
    # Ang prestihiyosong parangal na ito ay tunay na karapat-dapat."

    # Step 4: Check if the text starts with "Ang galing"
    if re.match(r"^Ang galing", text):
        print("The text begins with 'Ang galing' - a phrase highlighting excellence.")
    else:
        print("The text does not start with 'Ang galing'.")

    # Step 5: Search for the first occurrence of any success-related word
    first_success_word = re.search(success_pattern, text)
    if first_success_word:
        print("First success-related word found:", first_success_word.group())
        # Output: First success-related word found: galing