Python Regular Expressions (RegEx)

What Is a Regular Expression?

Regular expressions, or RegEx, are special sequences of characters used to search for patterns within text. They work like a “search tool” that lets you find specific strings or patterns, which makes working with text data easier.

For example, if you want to find the word “tagumpay” (“success” in Filipino) in a text file, you can write a RegEx pattern that matches every occurrence of “tagumpay”.
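
As a minimal sketch of that idea, the snippet below searches a short string rather than a file. The sample sentence is invented for illustration, and re.findall() is one of the functions of Python's re module covered later in this guide.

```python
import re

# Hypothetical sample text; in practice this could be the contents of a file
text = "Ang tagumpay ng koponan ay tagumpay ng buong bayan."

# A plain word is already a valid pattern: it matches itself literally
pattern = r"tagumpay"

matches = re.findall(pattern, text)
print(matches)       # ['tagumpay', 'tagumpay']
print(len(matches))  # 2
```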

Why Use RegEx?

RegEx is helpful in many use cases:

  1. Text Searching: Find specific words or patterns within a string.
  2. Data Validation: Validate specific formats such as email addresses, phone numbers, and postal codes.
  3. Data Cleaning: Remove unwanted characters or whitespace from text.
  4. Data Extraction: Extract specific information from large datasets such as dates, prices, or keywords.
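
The four use cases above can be sketched in a few lines. This is only an illustration: the sample record, the variable names, and the helper looks_like_iso_date() are hypothetical, and the date check only tests the shape of the value, not whether it is a real calendar date.

```python
import re

# Hypothetical sample record, invented for this illustration
raw = "  Order   #1042 placed on 2024-03-15 for  PHP 1,250.00  "

# 1. Text searching: does the word "Order" appear anywhere?
has_order = re.search(r"Order", raw) is not None

# 2. Data validation: a deliberately simple shape check (YYYY-MM-DD)
def looks_like_iso_date(value):
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", value) is not None

# 3. Data cleaning: collapse runs of whitespace, then trim the ends
cleaned = re.sub(r"\s+", " ", raw).strip()

# 4. Data extraction: pull the date out of the surrounding text
date = re.search(r"\d{4}-\d{2}-\d{2}", raw).group()

print(has_order)                          # True
print(looks_like_iso_date("2024-03-15"))  # True
print(looks_like_iso_date("15/03/2024"))  # False
print(cleaned)  # Order #1042 placed on 2024-03-15 for PHP 1,250.00
print(date)     # 2024-03-15
```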

Quick Start Guide to Regular Expressions: Metacharacters, Special Sequences, and Sets

Metacharacters: Your “Pattern-Making Tools”

In RegEx, metacharacters are special symbols that help define the pattern we are looking for. They work as shortcuts for specifying the “rules” of your pattern.

| Character | Description | Example | Explanation |
| --- | --- | --- | --- |
| [ ] | A set of specific characters | [aeiou] | Matches any vowel (a, e, i, o, or u). |
| \ | Signals a special sequence (or “escapes” special characters) | \d | Matches any digit (0-9), or lets you use symbols like \. to match a literal period. |
| . | Any character except a newline | a.b | Matches “a” followed by any character and then “b” (e.g., “acb,” “a1b,” “a-b”). |
| ^ | Matches if the text starts with this pattern | ^Hello | Matches any string that starts with “Hello”. |
| $ | Matches if the text ends with this pattern | world$ | Matches any string that ends with “world”. |
| * | Zero or more occurrences of the previous character | ba* | Matches “b,” “ba,” “baa,” and so on. |
| + | One or more occurrences of the previous character | ba+ | Matches “ba,” “baa,” “baaa,” etc., but not just “b”. |
| ? | Zero or one occurrence of the previous character | colou?r | Matches “color” or “colour”. |
| { } | Matches an exact number of occurrences | \d{3} | Matches exactly 3 digits (e.g., “123” but not “12”). |
| \| | Acts as an “or” between patterns | cat\|… | Matches “cat” or the alternative written after the bar. |
| ( ) | Groups patterns together | (abc)+ | Matches “abc,” “abcabc,” etc., treating “abc” as a unit. |

Quick guide to some of the most useful metacharacters
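
To see these metacharacters in action, the sketch below runs a handful of patterns through re.search() and reports whether each one finds a match. The sample strings are invented for illustration, and “dog” as the second alternative in cat|dog is an assumption chosen for this example.

```python
import re

# Each entry: (pattern, text to search, whether a match is expected)
checks = [
    (r"a.b", "a-b", True),              # . stands for any single character
    (r"^Hello", "Hello, world", True),  # ^ anchors to the start
    (r"world$", "Hello, world", True),  # $ anchors to the end
    (r"ba*", "b", True),                # * allows zero "a" characters
    (r"ba+", "b", False),               # + needs at least one "a"
    (r"colou?r", "color", True),        # ? makes the "u" optional
    (r"\d{3}", "12", False),            # {3} needs exactly three digits
    (r"cat|dog", "hotdog", True),       # | matches either alternative
    (r"(abc)+", "abcabc", True),        # ( ) repeats "abc" as a unit
    (r"1\.5", "125", False),            # \. is a literal period
]

results = [re.search(pattern, text) is not None for pattern, text, _ in checks]

for (pattern, text, expected), found in zip(checks, results):
    print(f"{pattern!r} in {text!r}: {found}")
```

Every line printed should agree with the expected value in the third column of the list.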

Special Sequences: Quick Pattern Shortcuts

Special sequences start with \ and let you search for particular types of characters.

| Character | Description | Example | Explanation |
| --- | --- | --- | --- |
| \A | Matches if the text starts with this pattern | r"\AOnce" | Matches any string that starts with “Once”. |
| \b | Matches at the beginning or end of a word | r"\bhero\b" | Matches “hero” as a whole word, not inside words like “superhero”. |
| \B | Matches inside a word (not at boundaries) | r"\Bhero" | Matches “hero” inside other words, like “superhero”. |
| \d | Matches any digit (0–9) | r"\d+" | Matches any sequence of digits (e.g., “123”). |
| \D | Matches any non-digit | r"\D+" | Matches any sequence of non-digits (e.g., “abc”). |
| \s | Matches any whitespace (space, tab, newline) | r"\s" | Useful for finding spaces or line breaks. |
| \S | Matches any non-whitespace | r"\S+" | Useful for matching words or characters with no spaces. |
| \w | Matches any “word” character (letters, digits, underscore) | r"\w+" | Useful for matching words or variable names. |
| \W | Matches any non-word character | r"\W+" | Matches spaces, punctuation, etc. |
| \Z | Matches if the text ends with this pattern | r"end\Z" | Matches any string ending with “end”. |

Quick guide to some of the most useful special sequences
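
The following sketch tries several of these special sequences on one sentence. The sentence and the variable names are invented for illustration; the patterns are written as raw strings (r"...") so that Python passes the backslashes to the re module unchanged.

```python
import re

# Hypothetical sample sentence, invented for this illustration
text = "Ang superhero at ang hero ay may 2 kapa at 10 maskara."

# \b needs a word boundary on both sides, so only the standalone word qualifies
whole_word = re.search(r"\bhero\b", text)
print(whole_word.start())   # 21

# \B requires that "hero" does NOT begin at a word boundary (as in "superhero")
inside_word = re.search(r"\Bhero", text)
print(inside_word.start())  # 9

# \d+ collects runs of digits, \w+ collects runs of word characters
numbers = re.findall(r"\d+", text)
words = re.findall(r"\w+", text)
print(numbers)     # ['2', '10']
print(len(words))  # 12

# \A and \Z anchor to the very start and very end of the string
starts_with_ang = re.search(r"\AAng", text) is not None
ends_with_maskara = re.search(r"maskara\.\Z", text) is not None
print(starts_with_ang, ends_with_maskara)  # True True
```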

Sets: Picking Specific Characters

Sets help you specify a group of characters that you want to match. They’re wrapped in square brackets [ ].

| Character | Description | Example | Explanation |
| --- | --- | --- | --- |
| [aeiou] | Matches any of the specified characters | [aeiou] | Matches any lowercase vowel. |
| [a-z] | Matches any character in this range | [a-z] | Matches any lowercase letter. |
| [^aeiou] | Matches any character except those specified | [^aeiou] | Matches any character that is not a lowercase vowel. |
| [0-9] | Matches any digit from 0 to 9 | [0-9] | Matches any single digit. |
| [A-Za-z] | Matches any uppercase or lowercase letter | [A-Za-z] | Matches any letter, regardless of case. |
| [+] | In sets, special characters lose their meaning | [+] | Matches the plus sign literally. |

Quick guide to some of the most useful sets
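
As a quick illustration of sets, the sketch below applies a few of them with re.findall(). The sample strings and variable names are invented for this example.

```python
import re

word = "Katipunan"

# [aeiou] picks out the lowercase vowels; [^aeiou] picks out everything else
vowels = re.findall(r"[aeiou]", word)
others = re.findall(r"[^aeiou]", word)
print(vowels)  # ['a', 'i', 'u', 'a']
print(others)  # ['K', 't', 'p', 'n', 'n']

# [0-9] matches one digit at a time
digits = re.findall(r"[0-9]", "Taong 1896")
print(digits)  # ['1', '8', '9', '6']

# [A-Za-z]+ matches a run of letters and stops at the hyphen
letters = re.findall(r"[A-Za-z]+", "Rizal-1896")
print(letters)  # ['Rizal']

# Inside a set, + is an ordinary character
plus_signs = re.findall(r"[+]", "1+1=2")
print(plus_signs)  # ['+']
```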

How to Use RegEx in Python

Python has a built-in module called re that includes many functions to work with RegEx patterns.

Import the re module: Before using RegEx, import the re module.

    import re

Defining Patterns: Define a pattern, then use the functions provided by the re module to match, search, or replace text.

    Function 1: re.match()

    The re.match() function checks if the pattern matches only at the start of the string.

    import re
    
    # Example text mentioning tinola
    text = "Tinola ang paboritong ulam ng ating pambansang bayani na si Dr. Jose Rizal"
    
    # Pattern to check if "tinola" appears at the start of the text
    pattern = r"^tinola"
    
    result = re.match(pattern, text, re.IGNORECASE) # Using re.IGNORECASE to make it case-insensitive
    if result: 
      print("Match found: ", result.group())  # Output: "Tinola"
    else: 
      print("No match at the start.")
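
To make the “only at the start” behavior concrete, here is a small sketch in which the word appears at the end of the text instead. The sentence is invented for illustration; re.search() is used purely as a contrast, since it looks through the whole string.

```python
import re

# Hypothetical sentence where "tinola" is NOT at the start
text = "Paborito ni Rizal ang tinola"

# re.match() only tries the beginning of the string, so it returns None here
at_start = re.match(r"tinola", text, re.IGNORECASE)
print(at_start)  # None

# re.search() scans the whole string and finds the word
anywhere = re.search(r"tinola", text, re.IGNORECASE)
print(anywhere.group())  # tinola
```

Because re.match() returns None when nothing matches, always test the result before calling .group() on it.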

    Function 2: re.search()

    The re.search() function scans through the entire text to find the first occurrence of a pattern.

    import re
    
    # Text with multiple mentions of Katipunan
    text = """Ang Katipunan ay isang lihim na samahan noong panahon ng rebolusyon. 
    Isa sa mga layunin ng Katipunan ay ang kasarinlan ng Pilipinas."""
    
    # Pattern to find the word "Katipunan" anywhere in the text
    pattern = r"Katipunan"
    
    result = re.search(pattern, text)
    
    if result: 
      print("Found: ", result.group()) # Output: "Katipunan"
    else:
      print("No match found.")
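
The object returned by re.search() also records where the match was found. The sketch below, using a shortened sample sentence, reads that position with the match object's start(), end(), and span() methods.

```python
import re

text = "Ang Katipunan ay isang lihim na samahan."

result = re.search(r"Katipunan", text)

# start() and end() are string indexes; span() returns both as a tuple
print(result.start())  # 4
print(result.end())    # 13
print(result.span())   # (4, 13)

# Slicing the text with those indexes gives back the matched word
print(text[result.start():result.end()])  # Katipunan
```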

    Function 3: re.findall()

    The re.findall() function is used to find all occurrences of a pattern in the text.

    import re
    
    # Text with honorifics
    text = "Si Ginoo Juan at Aling Maria ay nagtutulungan sa bayan."
    
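    # Pattern to find the honorifics "Ginoo" and "Aling" together with the name that follows.
    # (?:...) is a non-capturing group, so re.findall() returns each full match
    # rather than only the text captured by the group.
    pattern = r"\b(?:Ginoo|Aling) \w+\b"
    
    result = re.findall(pattern, text)
    print("Honorifics found:", result)  # Output: ['Ginoo Juan', 'Aling Maria']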
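
What re.findall() returns depends on whether the pattern contains capturing groups. The sketch below compares three variants of the honorific pattern on the same sentence. The variable names are chosen for illustration.

```python
import re

text = "Si Ginoo Juan at Aling Maria ay nagtutulungan sa bayan."

# One capturing group: findall returns only what the group captured
one_group = re.findall(r"\b(Ginoo|Aling) \w+\b", text)
print(one_group)  # ['Ginoo', 'Aling']

# Two capturing groups: findall returns a tuple per match
two_groups = re.findall(r"\b(Ginoo|Aling) (\w+)\b", text)
print(two_groups)  # [('Ginoo', 'Juan'), ('Aling', 'Maria')]

# Non-capturing group (?:...): findall returns the full matched text
no_groups = re.findall(r"\b(?:Ginoo|Aling) \w+\b", text)
print(no_groups)  # ['Ginoo Juan', 'Aling Maria']
```

Use (?:...) when the parentheses are only there for grouping, and capturing groups when you want the pieces separately.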

    Function 4: re.sub()

    The re.sub() function replaces all matches of a pattern with a specified replacement string.

    import re
    
    # Sample text with polite terms
    text = "Maraming salamat po! Ako po si Ginoo Rizal. Pumunta po ako para magbayad ng sedula."
    
    # Pattern to replace "po" and "opo" with "Please"
    pattern = r"\b(po|opo)\b"
    new_text = re.sub(pattern, "Please", text)
    
    print("Adjusted text:\n", new_text)
    # Output: "Maraming salamat Please! Ako Please si Ginoo Rizal. Pumunta Please ako para magbayad ng sedula."
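
If only some of the matches should be replaced, re.sub() accepts an optional count argument, and the related function re.subn() also reports how many replacements were made. A short sketch, using a shortened version of the sample text:

```python
import re

text = "Maraming salamat po! Ako po si Ginoo Rizal."

# count=1 replaces only the first match and leaves the rest untouched
first_only = re.sub(r"\bpo\b", "Please", text, count=1)
print(first_only)  # Maraming salamat Please! Ako po si Ginoo Rizal.

# re.subn() returns a tuple: (new string, number of replacements)
new_text, replacements = re.subn(r"\bpo\b", "Please", text)
print(new_text)      # Maraming salamat Please! Ako Please si Ginoo Rizal.
print(replacements)  # 2
```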

    Try this out!

    import re
    
    text = """Ang galing ng mga Pilipino! 
    Isang malaking tagumpay ang parangal na natanggap nila. 
    Ang parangal na ito ay tunay na karapat-dapat."""
    
    # Step 1: Define a pattern to match success-related words
    success_pattern = r"\b(galing|tagumpay|parangal)\b"
    
    # Step 2: Find all success-related words
    matches = re.findall(success_pattern, text)
    print("Filipino excellence words found:", matches)
    # Output: ["galing", "tagumpay", "parangal", "parangal"]
    
    # Step 3: Replace "parangal" with "prestihiyosong parangal" for emphasis
    text_with_emphasis = re.sub(r"\bparangal\b", "prestihiyosong parangal", text)
    print("Text with emphasized achievement:\n", text_with_emphasis)
    # Output (the text keeps its original line breaks):
    # Ang galing ng mga Pilipino!
    # Isang malaking tagumpay ang prestihiyosong parangal na natanggap nila.
    # Ang prestihiyosong parangal na ito ay tunay na karapat-dapat.
    
    # Step 4: Check if the text starts with "Ang galing"
    if re.match(r"^Ang galing", text):
        print("The text begins with 'Ang galing' - a phrase highlighting excellence.")
    else:
        print("The text does not start with 'Ang galing'.")
    
    # Step 5: Search for the first occurrence of any success-related word
    first_success_word = re.search(success_pattern, text)
    if first_success_word:
        print("First success-related word found:", first_success_word.group())
    # Output: "galing"
    
