Python Regular Expression

TechVidvan Team

3 years ago

Regular expressions, often called regex, are sequences of characters used to determine whether a pattern is present in a given text (string). They have long been used in word processors, text editors, and search-and-replace functions. They can be used to parse text files to find, alter, or delete specific strings, to validate the format of email addresses or passwords on the server side during registration, and more. They also help with text manipulation, which is a frequent requirement in data science work such as text mining.

This article guides you through the key principles of regular expressions in Python. You will begin by importing ‘re’, the Python module that supports regular expressions. Following that, you’ll see how ordinary characters match themselves and how wild or special characters are used to match more general patterns. You’ll then discover how to use repetition in regular expressions. Next, you’ll learn how to organize your search into groups and named groups for quick access to matches. You’ll then become acquainted with the idea of greedy vs non-greedy matching.

This already seems like a lot, so we’ve provided a helpful summary table with brief definitions to make it easier for you to recall what you’ve already seen. Do take a look!

The ‘re’ library’s many helpful methods, including compile(), search(), findall(), sub() for search and replace, split(), and others, are also covered in this course. Additionally, you will discover compilation flags, which you can utilize to improve your regex.

Use of Regular Expressions in Python

The ‘re’ module in Python supports regular expressions. To begin using them in your Python scripts, you must first import this module with the statement import re.

The Python ‘re’ library offers many functions, which makes it well worth knowing. We will look at some of them closely.

Input:

import re

pattern = r"Cookie"
sequence = "Cookie"
# re.match() returns a match object on success and None otherwise
if re.match(pattern, sequence):
    print("Matched")
else:
    print("Not a match")

Output:

Matched

As you can see in the sample, most letters and characters simply match themselves.

If the text matches the pattern, the match() function returns a match object. Otherwise, it returns None. There are several additional functions in the ‘re’ module that you will learn about later.
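
One detail worth keeping in mind is that re.match() only looks for the pattern at the beginning of the string, whereas re.search() scans the whole string. The following is a minimal sketch of that behaviour; the sample strings and variable names are made up for illustration.

```python
import re

# re.match() only succeeds if the pattern is found at the very start.
at_start = re.match(r"Cookie", "Cookie jar")     # match object
not_at_start = re.match(r"Cookie", "A Cookie")   # None

# re.search() scans the whole string for the first occurrence.
anywhere = re.search(r"Cookie", "A Cookie")      # match object

for result in (at_start, not_at_start, anywhere):
    if result:
        print("Matched:", result.group())
    else:
        print("Not a match")
```

Because a failed match gives None, it is usually safest to test the result with an if statement before calling group() on it.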

Let’s concentrate on ordinary characters for the time being!

Do you see the r at the beginning of the pattern Cookie?

This is called a raw string literal. It changes how the string literal is interpreted: raw literals are stored exactly as they appear.

For instance, when the string is prefixed with r, \ is read as a simple backslash rather than as the start of an escape sequence. You will see what this means when we get to special characters. Regular expression syntax often involves backslash-escaped characters, and the raw r prefix prevents them from being interpreted as string escape sequences.
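
To make the effect of the r prefix concrete, here is a small sketch comparing a normal string with a raw string. The variable names and sample text are only illustrative.

```python
import re

# In a normal string "\n" is one character (a newline);
# in a raw string r"\n" is two characters: a backslash and the letter n.
normal = "\n"
raw = r"\n"
print(len(normal), len(raw))

# "\b" in a normal string is a backspace character, so this pattern
# looks for literal backspaces and finds nothing in "a cat".
without_raw = re.search("\bcat\b", "a cat")

# With the r prefix, the regex engine receives \b and treats it
# as a word boundary.
with_raw = re.search(r"\bcat\b", "a cat")
print(without_raw, with_raw.group())
```

Writing every pattern as a raw string is a simple habit that avoids this kind of surprise.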

Basic Python Regular Expression

Regular expressions’ ability to specify patterns rather than just fixed characters gives them power. The simplest patterns that match single characters are listed below:

a, X, 9: ordinary characters match themselves exactly. The meta-characters that do not match themselves, because they have special meanings, are: . ^ $ * + ? { [ ] \ | ( ) (see details below).
. (a period): matches any single character except newline '\n'.
\w (lowercase w): matches a “word” character, meaning a letter, digit, or underbar [a-zA-Z0-9_]. Although “word” is the mnemonic for this, it matches only a single word character, not an entire word. \W (uppercase W) matches any non-word character.
\b: the boundary between word and non-word.
\s (lowercase s): matches a single whitespace character, such as space, newline, return, tab, or form feed [ \n\r\t\f]. \S (uppercase S) matches any non-whitespace character.
\t, \n, \r: tab, newline, return.
\d: a decimal digit [0-9]. (Some older regex utilities do not support \d, but they all support \w and \s.)
^ = start, $ = end: match the start or end of the string.
\: removes the “specialness” of a character. For instance, use \. to match a period or \\ to match a backslash. If you are unsure whether a character such as “@” has a special meaning, try putting a backslash in front of it, as in \@. If the escape sequence is not valid, such as \c, your Python program will stop with an error.
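
The patterns listed above can be tried out one at a time. The snippet below is a rough sketch with made-up sample strings; the comments show what each call is expected to return.

```python
import re

word = re.search(r'\w+', '@@hello_42!!').group()       # 'hello_42'
boundary = re.search(r'\bcat\b', 'concat cat').span()  # (7, 10): the standalone word
spaces = re.search(r'\s+', 'a \t\nb').group()          # ' \t\n'
digits = re.search(r'\d+', 'room 101').group()         # '101'
at_start = re.search(r'^abc', 'xabc')                  # None: 'abc' is not at the start
at_end = re.search(r'abc$', 'xabc').group()            # 'abc'
any_char = re.search(r'a.b', 'axb').group()            # 'axb': . matches any character
literal_dot = re.search(r'a\.b', 'axb')                # None: \. matches only a period

# An invalid escape such as \c is rejected when the pattern is compiled.
try:
    re.compile(r'\c')
    bad_escape_error = None
except re.error as exc:
    bad_escape_error = str(exc)

print(word, boundary, digits, at_end, any_char)
print(at_start, literal_dot, bad_escape_error)
```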

Basic Examples of Python Regular Expression

Regular expressions’ fundamental guidelines for looking for patterns in strings are as follows:

The search proceeds through the string from beginning to end, stopping at the first match found.
All of the pattern must be matched, but not all of the string.
If match = re.search(pat, text) is successful, match is not None, and in particular match.group() is the matching text.

import re

## Search for the pattern 'i' in the string 'pig'
match = re.search(r'i', 'pig')   # found, match.group() == "i"
match = re.search(r'is', 'pig')  # not found, match is None

## . = any char but \n
match = re.search(r'.g', 'pig')  # found, match.group() == "ig"

## \d = digit char, \w = word char
match = re.search(r'\d\d\d', 'p123g')     # found, match.group() == "123"
match = re.search(r'\w\w\w', '@@abcd!!')  # found, match.group() == "abc"

Repetition in Python Regular Expression

When you indicate repetition in the pattern using + and *, things become more interesting.

+ : one or more occurrences of the pattern to its left, for example, “i+” = one or more i’s.

* : zero or more occurrences of the pattern to its left.

? : zero or one occurrence of the pattern to its left.
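
The three repetition operators can be sketched as follows. The sample strings are invented for illustration, and the last two lines show the greedy vs non-greedy distinction mentioned in the introduction: by default a repetition takes as much text as it can, while adding ? after it makes it take as little as possible.

```python
import re

# + : one or more i's after the p
one_or_more = re.search(r'pi+', 'piiig').group()                  # 'piii'

# The search stops at the first (leftmost) match, even if a longer
# run of i's appears later in the string.
leftmost = re.search(r'i+', 'piigiiii').group()                   # 'ii'

# * : zero or more whitespace characters between the digits
zero_or_more = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx').group()  # '1 2   3'

# ? : the u is optional, so both spellings match
optional = [re.search(r'colou?r', w).group() for w in ('color', 'colour')]

# Greedy vs non-greedy repetition
greedy = re.search(r'<.*>', '<b>bold</b>').group()                # '<b>bold</b>'
lazy = re.search(r'<.*?>', '<b>bold</b>').group()                 # '<b>'

print(one_or_more, leftmost, zero_or_more, optional, greedy, lazy)
```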

Working with Group Extraction

The “group” feature of a regular expression lets you pick out specific parts of the matching text. Suppose that, for an email address, we want to extract the username and host separately. To accomplish this, surround the username and host in the pattern with parentheses (), as in r'([\w.-]+)@([\w.-]+)'. The parentheses do not change what the pattern will match; instead, they create logical “groups” within the match text. When a search is successful, match.group(1) is the match text corresponding to the first left parenthesis and match.group(2) is the text corresponding to the second left parenthesis. The entire match text is still available through the plain match.group().

# 'text' is used instead of 'str' to avoid shadowing the built-in str type
text = 'purple welcome@techvidvan.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', text)
if match:
  print(match.group())   ## 'welcome@techvidvan.com' (the whole match)
  print(match.group(1))  ## 'welcome' (the username, group 1)
  print(match.group(2))  ## 'techvidvan.com' (the host, group 2)

A standard workflow with regular expressions is to write a pattern for the thing you are looking for and then add parenthesis groups to extract the parts you want.
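
Groups can also be given names, which the introduction refers to as named groups. The sketch below reuses the same email example; the group names user and host are chosen here purely for illustration.

```python
import re

text = 'purple welcome@techvidvan.com monkey dishwasher'

# (?P<name>...) creates a group that can be read back by name.
named_match = re.search(r'(?P<user>[\w.-]+)@(?P<host>[\w.-]+)', text)
if named_match:
    user = named_match.group('user')   # same text as group(1)
    host = named_match.group('host')   # same text as group(2)
    parts = named_match.groupdict()    # all named groups as a dict
    print(user, host, parts)
```

Named groups still count as numbered groups, so group(1) and group(2) keep working alongside the names.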

Working with findall

findall() is probably the most powerful function in the ‘re’ module. In the example above, we used re.search() to locate the first match for a pattern. findall() finds all the matches and returns them as a list of strings, with each string representing one match.

## Suppose we have a text with many email addresses
text = 'purple welcome@techvidvan.com, blah monkey Hey@techvidvan.com blah dishwasher'

## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', text) ## ['welcome@techvidvan.com', 'Hey@techvidvan.com']
for email in emails:
  print(email)

For files, you may be in the habit of writing a loop that iterates over the lines of the file and calls findall() on each line. It is better to let findall() handle the iteration for you. If you pass the entire file text to findall(), it returns a list of all the matches in a single step; keep in mind that f.read() returns the entire contents of a file as a single string.
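
The file-based approach might look like the sketch below. To keep it self-contained, it first writes a small temporary file; the file name sample.txt and its contents are hypothetical.

```python
import os
import re
import tempfile

# Create a small sample file so the example can run on its own.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('purple welcome@techvidvan.com\nblah monkey Hey@techvidvan.com\n')

    # f.read() returns the whole file as one string,
    # so a single findall() call covers every line.
    with open(path, encoding='utf-8') as f:
        file_emails = re.findall(r'[\w.-]+@[\w.-]+', f.read())

print(file_emails)
```

For very large files, reading everything into memory at once may not be practical, in which case the line-by-line loop is still a reasonable choice.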

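The parenthesis group mechanism can be combined with findall(). When the pattern contains two or more parenthesis groups, findall() returns a list of tuples instead of a list of strings; each tuple holds the group(1), group(2), and so on for one match.

text = 'purple welcome@techvidvan.com, blah monkey Hey@techvidvan.com blah dishwasher'
pairs = re.findall(r'([\w\.-]+)@([\w\.-]+)', text)
print(pairs)  ## [('welcome', 'techvidvan.com'), ('Hey', 'techvidvan.com')]
for pair in pairs:
  print(pair[0])  ## username
  print(pair[1])  ## host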

Once you have the list of tuples, you can loop over it to do some computation for each tuple. If the pattern has no parentheses, findall() returns a list of found strings, as in the earlier examples. If the pattern contains a single set of parentheses, findall() returns a list of strings corresponding to that single group. (Obscure optional feature: sometimes the pattern contains parenthesis groups that you do not want to extract. If so, start the group with ?:, as in (?: ). That left parenthesis will then not count as a group result.)
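
The different return shapes of findall() can be compared side by side. This is a small sketch that reuses the sample text from above; the variable names are illustrative.

```python
import re

text = 'purple welcome@techvidvan.com, blah monkey Hey@techvidvan.com blah dishwasher'

# No groups: a list of whole matches.
whole = re.findall(r'[\w.-]+@[\w.-]+', text)

# One group: a list of strings for that group only.
hosts = re.findall(r'[\w.-]+@([\w.-]+)', text)

# Two groups: a list of tuples.
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)

# (?:...) groups without capturing, so only the username group counts
# and the result is again a plain list of strings.
users = re.findall(r'([\w.-]+)@(?:[\w.-]+)', text)

print(whole)
print(hosts)
print(pairs)
print(users)
```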

Conclusion:

In this article, we learned about regular expressions and how effectively they match text patterns. The article gave a basic overview of regular expressions and showed how they work in Python, where support for them comes from the ‘re’ module.