Python Regular Expressions (RegEx)

What is a Regular Expression?

Regular expressions, or RegEx, are special sequences of characters used to search for patterns within text. They work like a “search tool” that lets you find specific strings or patterns, which makes working with text data easier.

For example, if you want to find the word “tagumpay” (“success” in Filipino) in a text file, you can write a RegEx pattern that matches “tagumpay” and search the file with it.

Why Use RegEx?

RegEx is helpful in many use cases:

Text Searching: Find specific words or patterns within a string.
Data Validation: Validate specific formats such as email addresses, phone numbers, and postal codes.
Data Cleaning: Remove unwanted characters or whitespace from text.
Data Extraction: Extract specific information from large datasets such as dates, prices, or keywords.
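
The four use cases above can be sketched with Python's built-in re module. This is a minimal illustration, not a production recipe: the sample text, the simplified 11-digit mobile number rule, and the variable names are all assumptions made up for this example. re.fullmatch is not discussed elsewhere in this article, but it is part of the standard library.

```python
import re

# Hypothetical sample text, made up for this illustration
text = "  Order #123 placed on 2024-01-15 for P250.00 by juan@example.com  "

# Text searching: is the word "Order" present anywhere?
has_order = re.search(r"Order", text) is not None

# Data validation: does the whole string look like an 11-digit mobile number?
# (simplified rule assumed for this sketch: "09" followed by 9 digits)
is_valid_number = re.fullmatch(r"09\d{9}", "09171234567") is not None

# Data cleaning: collapse repeated whitespace, then trim both ends
cleaned = re.sub(r"\s+", " ", text).strip()

# Data extraction: pull out dates written as YYYY-MM-DD
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

print(has_order)        # True
print(is_valid_number)  # True
print(cleaned)          # Order #123 placed on 2024-01-15 for P250.00 by juan@example.com
print(dates)            # ['2024-01-15']
```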

Quick Start Guide to Regular Expressions: Metacharacters, Special Sequences, and Sets

Metacharacters: Your “Pattern-Making Tools”

In RegEx, metacharacters are special symbols that help define the pattern we are looking for. They work as shortcuts for specifying the “rules” of your pattern.

Character	Description	Example	Explanation
[ ]	A set of specific characters	[aeiou]	Matches any vowel (a, e, i, o, or u).
\	Signals a special sequence (or “escapes” special characters)	\d	Matches any digit (0-9), or lets you use symbols like `\.` to match a literal period.
.	Any character except a newline	a.b	Matches “a” followed by any character and then “b” (e.g., “acb,” “a1b,” “a-b”).
^	Matches if the text starts with this pattern	^Hello	Matches any string that starts with “Hello”.
$	Matches if the text ends with this pattern	world$	Matches any string that ends with “world”.
*	Zero or more occurrences of the previous character	ba*	Matches “b,” “ba,” “baa,” and so on.
+	One or more occurrences of the previous character	ba+	Matches “ba,” “baa,” “baaa,” etc., but not just “b”.
?	Zero or one occurrence of the previous character	colou?r	Matches “color” or “colour”.
{ }	Matches an exact number of occurrences	\d{3}	Matches exactly 3 digits (e.g., “123” but not “12”).
|	Acts as an “or” between patterns	cat|dog	Matches either “cat” or “dog”.
( )	Groups patterns together	(abc)+	Matches “abc,” “abcabc,” etc., treating “abc” as a unit.

Quick guide to some of the most useful metacharacters
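
To see the metacharacters from the table in action, here is a small sketch using Python's re module. The patterns follow the table; the test strings, the cat|dog alternation, and the variable names are assumptions chosen only for illustration.

```python
import re

vowels = re.findall(r"[aeiou]", "tagumpay")          # set of characters
dot_matches = re.findall(r"a.b", "acb a1b a-b ab")   # "." needs exactly one character between a and b
starts_with_hello = re.search(r"^Hello", "Hello world") is not None
ends_with_world = re.search(r"world$", "Hello world") is not None
star_matches_b = re.fullmatch(r"ba*", "b") is not None   # zero "a" is allowed
plus_matches_b = re.fullmatch(r"ba+", "b") is not None   # at least one "a" is required
spellings = re.findall(r"colou?r", "color colour")   # optional "u"
three_digits = re.findall(r"\d{3}", "12 123")        # exactly three digits
either = re.findall(r"cat|dog", "cat, dog, bird")    # alternation
repeated = re.findall(r"(?:abc)+", "abc abcabc")     # group repeated as a unit

print(vowels)             # ['a', 'u', 'a']
print(dot_matches)        # ['acb', 'a1b', 'a-b']
print(starts_with_hello)  # True
print(ends_with_world)    # True
print(star_matches_b)     # True
print(plus_matches_b)     # False
print(spellings)          # ['color', 'colour']
print(three_digits)       # ['123']
print(either)             # ['cat', 'dog']
print(repeated)           # ['abc', 'abcabc']
```

The last pattern uses (?:...) instead of (...) because re.findall returns only the group's content when a pattern contains a capturing group.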

Special Sequences: Quick Pattern Shortcuts

Special sequences start with \ and let you search for particular types of characters.

Character	Description	Example	Explanation
\A	Matches if the text starts with this pattern	\AOnce	Matches any string that starts with “Once”.
\b	Matches at the beginning or end of a word	r"\bhero\b"	Matches “hero” as a whole word, not inside words like “superhero”.
\B	Matches inside a word (not at boundaries)	r"\Bhero"	Matches “hero” inside other words, like “superhero”.
\d	Matches any digit (0–9)	r"\d+"	Matches any sequence of digits (e.g., “123”).
\D	Matches any non-digit	r"\D+"	Matches any sequence of non-digits (e.g., “abc”).
\s	Matches any whitespace (space, tab, newline)	r"\s"	Useful for finding spaces or line breaks.
\S	Matches any non-whitespace	r"\S+"	Useful for matching words or characters with no spaces.
\w	Matches any “word” character (letters, digits, underscore)	r"\w+"	Useful for matching words or variable names.
\W	Matches any non-word character	r"\W+"	Matches spaces, punctuation, etc.
\Z	Matches if the text ends with this pattern	r"end\Z"	Matches any string ending with “end”.

Quick guide to some of the most useful special sequences
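
The special sequences in the table can be tried out with a short sketch like the one below. The test strings and variable names are made up for illustration; only the patterns come from the table.

```python
import re

text = "superhero at hero"

# \b requires a word boundary, so only the standalone "hero" qualifies
whole_word_start = re.search(r"\bhero\b", text).start()
# \B requires a position inside a word, so only the "hero" in "superhero" qualifies
inside_word_start = re.search(r"\Bhero", text).start()

numbers = re.findall(r"\d+", "Taong 1896, buwan 8")
non_digits = re.findall(r"\D+", "abc123def")
chunks = re.findall(r"\S+", "isa  dalawa\ttatlo")
identifiers = re.findall(r"\w+", "user_name = 42")
starts_with_once = re.search(r"\AOnce", "Once upon a time") is not None
ends_with_end = re.search(r"end\Z", "the end") is not None

print(whole_word_start)   # 13
print(inside_word_start)  # 5
print(numbers)            # ['1896', '8']
print(non_digits)         # ['abc', 'def']
print(chunks)             # ['isa', 'dalawa', 'tatlo']
print(identifiers)        # ['user_name', '42']
print(starts_with_once)   # True
print(ends_with_end)      # True
```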

Sets: Picking Specific Characters

Sets help you specify a group of characters that you want to match. They’re wrapped in square brackets [ ].

Character	Description	Example	Explanation
[aeiou]	Matches any of the specified characters	[aeiou]	Matches any lowercase vowel.
[a-z]	Matches any character in this range	[a-z]	Matches any lowercase letter.
[^aeiou]	Matches any character except those specified	[^aeiou]	Matches any character that is not a lowercase vowel.
[0-9]	Matches any digit from 0 to 9	[0-9]	Matches any single digit.
[A-Za-z]	Matches any uppercase or lowercase letter	[A-Za-z]	Matches any letter, regardless of case.
[+]	In sets, most special characters lose their special meaning	[+]	Matches the plus sign literally.

Quick guide to some of the most useful sets
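
As a quick check of the sets in the table, the sketch below runs each one against made-up sample strings (the strings and variable names are assumptions for illustration).

```python
import re

word = "Katipunan-1892+"

vowels = re.findall(r"[aeiou]", word)          # any listed character
digits = re.findall(r"[0-9]", word)            # any character in the range 0-9
letters = re.findall(r"[A-Za-z]+", word)       # runs of letters, either case
plus_signs = re.findall(r"[+]", word)          # "+" is literal inside a set
non_vowels = re.findall(r"[^aeiou]", "bayani") # "^" at the start negates the set

print(vowels)      # ['a', 'i', 'u', 'a']
print(digits)      # ['1', '8', '9', '2']
print(letters)     # ['Katipunan']
print(plus_signs)  # ['+']
print(non_vowels)  # ['b', 'y', 'n']
```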

How to use RegEx in Python?

Python has a built-in module called re that includes many functions to work with RegEx patterns.

Import the re module: Before using RegEx, import the re module.

import re

Defining Patterns: Define a pattern, then use the functions provided by the re module to match, search, or replace text.
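
The patterns in this article are written as raw strings (with an r prefix). The sketch below shows one reason this matters, using a made-up sample string: without the prefix, Python itself turns "\b" into a backspace character before the re module ever sees the pattern.

```python
import re

plain = "\bpo\b"    # Python converts each \b into a backspace character
raw = r"\bpo\b"     # backslashes are kept, so re sees the \b word boundary

text = "Salamat po"

plain_length = len(plain)            # backspace, "p", "o", backspace
raw_length = len(raw)                # "\", "b", "p", "o", "\", "b"
plain_result = re.search(plain, text)
raw_result = re.search(raw, text)

print(plain_length)         # 4
print(raw_length)           # 6
print(plain_result)         # None
print(raw_result.group())   # po
```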

Function 1: re.match()

The re.match() function checks whether the pattern matches at the start of the string only. It does not look for matches later in the text.

import re

# Example text mentioning tinola
text = "Tinola ang paboritong ulam ng ating pambansang bayani na si Dr. Jose Rizal"

# Pattern to check if "tinola" appears at the start of the text
pattern = r"^tinola"

result = re.match(pattern, text, re.IGNORECASE) # Using re.IGNORECASE to make it case-insensitive
if result: 
  print("Match found: ", result.group())  # Output: "Tinola"
else: 
  print("No match at the start.")
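
To make the "start of the string" rule concrete, here is a small sketch with a made-up sentence where "tinola" appears at the end. re.match() returns None, while re.search() still finds the word. Because re.match() already anchors at the start, a leading ^ in the pattern is optional.

```python
import re

text = "Paboritong ulam ang tinola"

at_start = re.match(r"tinola", text)    # "tinola" is not at index 0
anywhere = re.search(r"tinola", text)   # scans the whole string

print(at_start)           # None
print(anywhere.group())   # tinola
print(anywhere.start())   # 20
```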

Function 2: re.search()

The re.search() function scans through the entire text to find the first occurrence of a pattern.

import re

# Text with multiple mentions of Katipunan
text = """Ang Katipunan ay isang lihim na samahan noong panahon ng rebolusyon. 
Isa sa mga layunin ng Katipunan ay ang kasarinlan ng Pilipinas."""

# Pattern to find the word "Katipunan" anywhere in the text
pattern = r"Katipunan"

result = re.search(pattern, text)

if result: 
  print("Found: ", result.group()) # Output: "Katipunan"
else:
  print("No match found.")
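
The match object returned by re.search() also records where the match was found. The sketch below uses a shortened, made-up version of the sample text to show that only the first occurrence is reported, even when the word appears twice.

```python
import re

text = "Ang Katipunan ay lihim. Layunin ng Katipunan ang kasarinlan."

result = re.search(r"Katipunan", text)
first_span = result.span()                           # (start, end) of the first match
matched_text = text[result.start():result.end()]     # slice the original text
total_mentions = len(re.findall(r"Katipunan", text)) # count every occurrence

print(first_span)      # (4, 13)
print(matched_text)    # Katipunan
print(total_mentions)  # 2
```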

Function 3: re.findall()

The re.findall() function is used to find all occurrences of a pattern in the text.

import re

# Text with honorifics
text = "Si Ginoo Juan at Aling Maria ay nagtutulungan sa bayan."

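# Pattern to find the Filipino honorifics "Ginoo" and "Aling" followed by a name.
# (?:...) is a non-capturing group, so findall returns the full match, not just the honorific.
pattern = r"\b(?:Ginoo|Aling) \w+\b"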

result = re.findall(pattern, text)
print("Honorifics found:", result)  # Output: ["Ginoo Juan", "Aling Maria"]
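
One detail worth knowing about re.findall() is that its return value depends on the groups in the pattern. The sketch below compares three variants of the honorifics pattern on the same sentence; the variable names are chosen only for this illustration.

```python
import re

text = "Si Ginoo Juan at Aling Maria ay nagtutulungan sa bayan."

# One capturing group: findall returns only what the group captured
with_group = re.findall(r"\b(Ginoo|Aling) \w+\b", text)

# Non-capturing group: findall returns the whole match
without_group = re.findall(r"\b(?:Ginoo|Aling) \w+\b", text)

# Two capturing groups: findall returns a tuple per match
two_groups = re.findall(r"\b(Ginoo|Aling) (\w+)\b", text)

print(with_group)     # ['Ginoo', 'Aling']
print(without_group)  # ['Ginoo Juan', 'Aling Maria']
print(two_groups)     # [('Ginoo', 'Juan'), ('Aling', 'Maria')]
```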

Function 4: re.sub()

The re.sub() function replaces all matches of a pattern with a specified replacement string.

import re

# Sample text with polite terms
text = "Maraming salamat po! Ako po si Ginoo Rizal. Pumunta po ako para magbayad ng sedula."

# Pattern to replace "po" and "opo" with "Please"
pattern = r"\b(po|opo)\b"
new_text = re.sub(pattern, "Please", text)

print("Adjusted text:\n", new_text)
# Output: "Maraming salamat Please! Ako Please si Ginoo Rizal. Pumunta Please ako para magbayad ng sedula."
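
re.sub() has a few options beyond replacing every match. The following sketch, which uses a shortened version of the sample text, shows the count argument, the related re.subn() function (which also reports how many replacements were made), and backreferences such as \1 in the replacement string. The variable names are assumptions for illustration.

```python
import re

text = "Maraming salamat po! Ako po si Ginoo Rizal."

# Replace only the first match
first_only = re.sub(r"\bpo\b", "Please", text, count=1)

# subn returns the new text together with the number of replacements
all_replaced, replacements = re.subn(r"\bpo\b", "Please", text)

# \1 and \2 refer to the first and second captured groups
swapped = re.sub(r"\b(Ginoo) (\w+)", r"\2 (\1)", text)

print(first_only)    # Maraming salamat Please! Ako po si Ginoo Rizal.
print(replacements)  # 2
print(swapped)       # Maraming salamat po! Ako po si Rizal (Ginoo).
```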

Try this out!

import re

text = """Ang galing ng mga Pilipino! 
Isang malaking tagumpay ang parangal na natanggap nila. 
Ang parangal na ito ay tunay na karapat-dapat."""

# Step 1: Define a pattern to match success-related words
success_pattern = r"\b(galing|tagumpay|parangal)\b"

# Step 2: Find all success-related words
matches = re.findall(success_pattern, text)
print("Filipino excellence words found:", matches)
# Output: ["galing", "tagumpay", "parangal", "parangal"]

# Step 3: Replace "parangal" with "prestihiyosong parangal" for emphasis
text_with_emphasis = re.sub(r"\bparangal\b", "prestihiyosong parangal", text)
print("Text with emphasized achievement:\n", text_with_emphasis)
# Output (shown on one line here; the printed text keeps its original line breaks): "Ang galing ng mga Pilipino! Isang malaking tagumpay ang prestihiyosong parangal na natanggap nila. Ang prestihiyosong parangal na ito ay tunay na karapat-dapat."

# Step 4: Check if the text starts with "Ang galing"
if re.match(r"^Ang galing", text):
    print("The text begins with 'Ang galing' - a phrase highlighting excellence.")
else:
    print("The text does not start with 'Ang galing'.")

# Step 5: Search for the first occurrence of any success-related word
first_success_word = re.search(success_pattern, text)
if first_success_word:
    print("First success-related word found:", first_success_word.group())
# Output: "galing"
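
As an optional extension of the exercise, the sketch below precompiles the pattern with re.compile() so it can be reused, counts each word, and reports the line on which each match occurs. The use of collections.Counter and the line-number calculation are additions made for illustration and are not part of the exercise itself.

```python
import re
from collections import Counter

text = """Ang galing ng mga Pilipino!
Isang malaking tagumpay ang parangal na natanggap nila.
Ang parangal na ito ay tunay na karapat-dapat."""

# Compile once, reuse many times; (?:...) keeps findall returning whole words
success_pattern = re.compile(r"\b(?:galing|tagumpay|parangal)\b")

# Count how often each success-related word appears
counts = Counter(success_pattern.findall(text))

# For each match, count the newlines before it to get a 1-based line number
line_numbers = [text.count("\n", 0, m.start()) + 1
                for m in success_pattern.finditer(text)]

print(counts["parangal"])  # 2
print(line_numbers)        # [1, 2, 2, 3]
```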
