Learning Perl - Regular Expressions


Regular expressions, also known as regex or regexp, are a powerful way to match patterns in text. Most programming languages support them, but the syntax varies slightly between languages because there is more than one standard. Perl uses its own regular expression engine, which is very powerful and flexible. It is often confused with Perl Compatible Regular Expressions (PCRE), but PCRE is a separate C library that imitates Perl's syntax and is what many other languages and tools embed. Perl was designed with text processing in mind, and regular expressions are a key part of that design, so its regex syntax is highly expressive and has influenced many other languages. Compared with older standards such as POSIX, Perl's regexes support advanced features such as non-greedy quantifiers, lookahead/lookbehind, and named captures.

You would typically use regular expressions in programming for tasks such as:

  • Validating input (e.g., checking if an email address is well-formed)
  • Searching for specific patterns in text (e.g., finding all occurrences of a word)
  • Replacing text (e.g., changing all instances of a word to another)
  • Splitting strings based on patterns (e.g., breaking a sentence into words)
  • Extracting specific information from text (e.g., pulling out dates or phone numbers)

Regular expressions can be quite complex, and mastering them takes time and practice. They can match simple patterns like specific words or characters, as well as more complex patterns involving character classes, quantifiers, anchors, capturing groups, alternation, and more.

As stated, Perl has its own regex engine, which many other languages try to emulate. It allows for complex pattern matching and manipulation, and supports a wide range of features, including:

Feature               Syntax/Example                        Description
--------------------  ------------------------------------  -----------
Literal Match         foo                                   Matches the exact string "foo"
Character Class       [a-z], \d, \w                         Matches any character in the set (e.g., lowercase letters, digits, word chars)
Negated Class         [^a-z], \D, \W                        Matches any character not in the set
Quantifiers           *, +, ?, {n,m}                        Specify how many times to match (zero/more, one/more, optional, specific counts)
Anchors               ^, $, \b, \B                          Match positions (start/end of line, word boundaries)
Grouping/Capturing    (abc)                                 Groups patterns and captures matched text
Non-capturing Group   (?:abc)                               Groups patterns without capturing
Alternation           foo|bar                               Matches "foo" or "bar"
Lookahead/Lookbehind  (?=...), (?!...), (?<=...), (?<!...)  Asserts patterns ahead/behind without consuming characters
Modifiers             /i, /g, /s, /m, /x                    Change regex behavior (case-insensitive, global, dot matches newline, multiline, extended)
Named Capture         (?<name>...)                          Captures matched text into a named variable
Substitution          s/foo/bar/                            Replaces matched text
Split                 split /,/, $str                       Splits a string using a regex pattern
Compiled              qr/(abc)/                             Compiles a regular expression so it can be used many times
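
Three of the rows above (split, qr// and lookahead) are easier to picture with a small example. The following is only a minimal sketch; the sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical sample data, chosen only for illustration.
my $csv = "apple, banana,cherry";

# split: the pattern describes the separator (a comma plus optional spaces).
my @fruits = split /,\s*/, $csv;
say join '|', @fruits;              # apple|banana|cherry

# qr//: compile a pattern once and reuse it inside other patterns.
my $word = qr/[a-z]+/;
say "two words" if "hello world" =~ /^$word $word$/;

# Lookahead: match "foo" only when "bar" follows, without consuming "bar".
(my $masked = "foobar foobaz") =~ s/foo(?=bar)/XXX/g;
say $masked;                        # XXXbar foobaz
```

Only the first "foo" is replaced, because the lookahead requires "bar" to follow it but does not make "bar" part of the match.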

In Perl, regular expressions are typically used with the =~ operator to apply a regex pattern to a string, or !~ to check if a pattern does not match. The basic syntax for using regex in Perl is as follows:

$string =~ type/pattern/modifiers;

Where:

  • $string is the variable containing the text you want to match against.
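  • type is the operator you want to apply: m for match (the default, so it can be left out when the delimiters are slashes) or s for substitution, which also takes a replacement part, as in s/pattern/replacement/modifiers.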
  • pattern is the regex pattern you want to match.
  • modifiers are optional flags that change the behavior of the regex (e.g., i for case-insensitive matching, g for global matching).
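
As a rough illustration of the pieces listed above, the sketch below applies =~ and !~ with both the m and s operators. The sample text and the variable name $greeting are assumptions made for this example.

```perl
use strict;
use warnings;
use feature 'say';

my $greeting = "Hello, world!";     # sample text, an assumption for this sketch

# m/pattern/ : the leading "m" is optional when the delimiters are slashes.
say "matches"        if $greeting =~ /world/;
say "also matches"   if $greeting =~ m{world};   # with "m" you can pick other delimiters

# !~ is true when the pattern does NOT match.
say "no digits here" if $greeting !~ /\d/;

# s/pattern/replacement/ : substitution changes $greeting in place.
$greeting =~ s/world/Perl/;
say $greeting;                      # Hello, Perl!
```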

Modifiers are placed after the closing delimiter of the pattern and change how the regex behaves. For example, in /pattern/modifiers, the modifiers can include flags like i, g, m, etc. The following are some common modifiers used in Perl regex:

Modifier  Example     Description
--------  ----------  -----------
i         /pattern/i  Case-insensitive matching
g         /pattern/g  Global matching (find all matches, not just the first)
m         /pattern/m  Multiline mode (^ and $ match start/end of each line)
s         /pattern/s  Single line mode (. matches newline as well)
x         /pattern/x  Extended mode (ignore whitespace and allow comments in pattern)
o         /pattern/o  Compile pattern only once (largely obsolete in modern Perl; qr// is the usual way to precompile a pattern)
e         s///e       Evaluate replacement as Perl code in substitution
r         s///r       Return the result of substitution without modifying the variable (Perl 5.14 and later)
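
The x, s and r modifiers are not used in the walkthrough that follows, so here is a small sketch of what they do. The date string and the variable names are hypothetical.

```perl
use strict;
use warnings;
use feature 'say';

# /x: whitespace and comments inside the pattern are ignored.
my $date = "2024-05-17";            # hypothetical sample value
if ($date =~ / ^ (\d{4})    # year
               - (\d{2})    # month
               - (\d{2}) $  # day
             /x) {
    say "year=$1 month=$2 day=$3";  # year=2024 month=05 day=17
}

# /s: "." also matches a newline, so the match can span lines.
my $two_lines = "start\nend";
say "spans lines" if $two_lines =~ /start.end/s;

# /r: return the changed copy and leave the original untouched.
my $original = "I love regex";
my $shouted  = $original =~ s/love/LOVE/r;
say $original;                      # I love regex
say $shouted;                       # I LOVE regex
```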

This all may sound a bit overwhelming, but don't worry! Regular expressions can be learned step by step. The key is to start with simple patterns and gradually build up to more complex ones. Here are some tips to get started:

  1. Start Simple: Begin with basic patterns like matching specific words or characters. For example, try matching the word "cat" in a string.
  2. Use Online Tools: There are many online regex testers and visualizers that can help you see how your patterns match against sample text. These tools often provide explanations for each part of the regex. I personally use https://regexr.com/
  3. Practice Regularly: Regular expressions are a skill that improves with practice. Try solving small problems or challenges that require regex.
  4. Read Documentation: Familiarise yourself with the regex documentation for the language you're using. Perl's documentation is extensive and provides many examples. https://perldoc.perl.org/perlre
  5. Experiment: Don't be afraid to experiment with different patterns and modifiers. Try changing quantifiers, character classes, or adding anchors to see how it affects the matches.
  6. Learn Common Patterns: There are many common regex patterns for tasks like validating email addresses, phone numbers, or URLs. Learning these can save you time and effort.
  7. Break Down Complex Patterns: If you encounter a complex regex, break it down into smaller parts. Understand each component before trying to grasp the whole pattern.

Okay, let's get started with some basic examples of regular expressions in Perl. First, we will look at how to match simple strings and characters. Create a new file 'regex.pl' and insert the following code:

#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
# Simple regex match
my $string = "Hello, world!";
if ($string =~ m/world/) {
    say "Found 'world' in the string!";
} else {
    say "'world' not found.";
}

This code checks if the string "Hello, world!" contains the substring "world". The =~ operator is used to apply the regex pattern /world/ to the variable $string. If a match is found, it prints a message indicating that "world" was found; otherwise, it indicates that it was not found.

If we save and run this script, we should see the output:

Found 'world' in the string!

Now let's explore character classes, which allow us to match any character from a set. For example, we can match any lowercase letter:

if ($string =~ m/[a-z]/) {
    say "Found a lowercase letter in the string!";
} else {
    say "No lowercase letters found.";
}

This code checks if the string contains any lowercase letter from 'a' to 'z'. The character class [a-z] matches any single lowercase letter. If a match is found, it prints a message indicating that a lowercase letter was found; otherwise, it indicates that no lowercase letters were found.

If we run this code, we should see the additional output:

Found a lowercase letter in the string!

Next, let's look at character classes with negation, which allows us to match any character that is not in a specified set. For example, we can match any character that is not a lowercase letter:

if ($string =~ m/[^a-z]/) {
    say "Found a character that is not a lowercase letter!";
} else {
    say "All characters are lowercase letters.";
}

This code checks if the string contains any character that is not a lowercase letter. The negated character class [^a-z] matches any single character that is not in the range 'a' to 'z'. If a match is found, it prints a message indicating that a non-lowercase letter was found; otherwise, it indicates that all characters are lowercase letters.

If we run this code, we should see the additional output:

Found a character that is not a lowercase letter!

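The match succeeds because the string also contains an uppercase letter, a comma, a space, and an exclamation mark. To look only for characters that are not word characters, whitespace, or punctuation, we can widen the negated class like this: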

if ($string =~ m/[^\w\s\p{Punct}]/) {
    say "Found a character that is not a letter, space, or punctuation!";
} else {
    say "All characters are letters, spaces, or punctuation.";
}

We use the character class \w to match any word character (letters, digits, and underscores), \s to match whitespace characters (spaces, tabs, etc.), and \p{Punct} to match punctuation characters. \p{Punct} is a Unicode property, so it covers Unicode punctuation as well as the ASCII marks. The negated character class [^\w\s\p{Punct}] therefore matches any character that is not a word character, whitespace, or punctuation. If a match is found, the code prints a message saying that such a character was found; otherwise, it indicates that all characters are letters, spaces, or punctuation.

If we run this code, we should see the additional output:

All characters are letters, spaces, or punctuation.

Now let's check if we have any digits in the string. We can use the \d shorthand character class, which matches any digit (0-9). Add the following code to your script:

if ($string =~ m/\d/) {
    say "Found a digit in the string!";
} else {
    say "No digits found.";
}

This code checks if the string contains any digit. The \d matches any single digit character. If a match is found, it prints a message indicating that a digit was found; otherwise, it indicates that no digits were found.

If we run this code, we should see the additional output:

No digits found.

Next we will explore capture groups, which allow us to extract specific parts of a string. We can use parentheses () to create a capture group:

if ($string =~ /(Hello), (world)!/) {
    say "Found a capture group: $1 and $2";
} else {
    say "No match found.";
}

This code checks if the string matches the pattern with two capture groups: one for "Hello" and one for "world". The matched text is stored in $1 and $2, which correspond to the first and second capture groups, respectively. If a match is found, it prints the captured values; otherwise, it indicates that no match was found.

If we run this code, we should see the additional output:

Found a capture group: Hello and world
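
The capture example above reads the groups from $1 and $2. Two alternatives are worth knowing about: a match in list context returns the captures directly, and named captures are read from the %+ hash. The sketch below reuses the same sample string; the names greeting and target are chosen only for this example.

```perl
use strict;
use warnings;
use feature 'say';

my $string = "Hello, world!";

# In list context a match returns the captured groups directly.
my ($first, $second) = $string =~ /(\w+), (\w+)!/;
say "$first / $second";                 # Hello / world

# Named captures are read from the %+ hash instead of $1 and $2.
my ($greeting, $target);
if ($string =~ /(?<greeting>\w+), (?<target>\w+)!/) {
    ($greeting, $target) = ($+{greeting}, $+{target});
    say "$greeting -> $target";         # Hello -> world
}
```

Named captures tend to pay off in longer patterns, where counting parentheses to work out which group is $3 or $4 gets error-prone.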

With that working, let's explore non-capturing groups, which allow us to group patterns without capturing the matched text. We can use (?:...) for non-capturing groups:

if ($string =~ m/(?:Hello), (world)!/) {
    say "Found a non-capturing group match: $1";
} else {
    say "No match found.";
}

The code checks if the string matches the pattern with a non-capturing group for "Hello" and a capturing group for "world". Because the first group does not capture, the matched text for "world" is stored in $1. If a match is found, it prints the captured value; otherwise, it indicates that no match was found. Non-capturing means the group still has to match, but its text is not stored in a numbered variable such as $1.

If we run this code, we should see the additional output:

Found a non-capturing group match: world

Next, let's explore the . (dot) character, which matches any single character except a newline. This can be useful for matching patterns where you don't care about the specific character:

if ($string =~ /(H.llo)/) {
    say "Found a match with a dot: $1";
} else {
    say "No match found.";
}

This code checks if the string contains "H" followed by any character (represented by .) and then "llo". The matched text is stored in $1. If a match is found, it prints the captured value; otherwise, it indicates that no match was found.

If we run this code, we should see the additional output:

Found a match with a dot: Hello

Next we will explore quantifiers, which allow us to specify how many times a pattern should match. For example, we can use the * quantifier to match zero or more occurrences of a character:

if ($string =~ m/(l*)/) {
    say "Found zero or more 'l' characters in the string! $1";
} else {
    say "No 'l' characters found.";
}

This code checks if the string contains zero or more occurrences of the character 'l'. The * quantifier matches zero or more occurrences of the preceding character. Note that the else branch here will never be hit: because * is allowed to match zero occurrences, the pattern always succeeds, even if all it matches is an empty string. That is exactly what happens here. The regex engine reports the leftmost match, and l* already succeeds at the very start of the string by matching nothing in front of the "H", so $1 is empty.

If we run this code, we should see the additional output:

Found zero or more 'l' characters in the string!

To ensure you have one or more occurrences of a character, you can use the + quantifier. Extend your code with the following:

if ($string =~ m/(l+)/) {
    say "Found one or more 'l' characters in the string! $1";
} else {
    say "No 'l' characters found.";
}

Here we check if the string contains one or more occurrences of the character 'l'. The + quantifier matches one or more occurrences of the preceding character. If a match is found, it prints a message indicating that 'l' characters were found, followed by the captured text; otherwise, it indicates that no 'l' characters were found.

If we run this code, we should see the additional output:

Found one or more 'l' characters in the string! ll

The difference between * and + is that * can match zero occurrences, while + requires at least one occurrence. On its own, l* can succeed without matching any 'l' at all, whereas l+ cannot. This difference becomes more significant when you build more complex regular expressions with optional parts.
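
To see where each quantifier actually matches, the sketch below records the captured text together with the match offset (taken from the built-in @- array). It also touches on {n,m} counts and non-greedy matching, which appear in the feature table. The extra sample strings are made up for illustration.

```perl
use strict;
use warnings;
use feature 'say';

my $string = "Hello, world!";

# l* succeeds at the very first position by matching nothing,
# so the capture is empty and the match starts at offset 0.
$string =~ /(l*)/;
my ($star, $star_at) = ($1, $-[0]);
say "star: '$star' at offset $star_at";     # star: '' at offset 0

# l+ needs at least one "l", so the engine moves on to "ll" in "Hello".
$string =~ /(l+)/;
my ($plus, $plus_at) = ($1, $-[0]);
say "plus: '$plus' at offset $plus_at";     # plus: 'll' at offset 2

# {n,m} bounds the repeat count; a trailing ? makes a quantifier non-greedy.
my ($bounded) = "aaaaa"    =~ /(a{2,3})/;   # aaa
my ($greedy)  = "<b>x</b>" =~ /(<.+>)/;     # <b>x</b>
my ($lazy)    = "<b>x</b>" =~ /(<.+?>)/;    # <b>
say "$bounded $greedy $lazy";
```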

Next, let's look at alternation, which allows us to match one of several patterns. We can use the | operator for alternation:

if ($string =~ m/earth|world/) {
    say "Found either 'earth' or 'world' in the string!";
} else {
    say "Neither 'earth' nor 'world' found.";
}

This code checks if the string contains either "earth" or "world". The alternation operator | allows us to specify multiple patterns to match. If a match is found, it prints a message indicating that one of the words was found; otherwise, it indicates that neither word was found.

If we run this code, we should see the additional output:

Found either 'earth' or 'world' in the string!

Finally, for matching, let's explore anchors, which allow us to match patterns at specific positions in the string. We can use ^ to match the start of the string and $ to match the end of the string (or just before a trailing newline):

if ($string =~ m/^Hello/) {
    say "The string starts with 'Hello'!";
} else {
    say "The string does not start with 'Hello'.";
}
if ($string =~ m/world!$/) {
    say "The string ends with 'world!'!";
} else {
    say "The string does not end with 'world!'.";
}

This code checks if the string starts with "Hello" and ends with "world!". The ^ anchor matches the start of the string, and the $ anchor matches the end of the string. If a match is found for either condition, it prints a corresponding message; otherwise, it indicates that the conditions were not met.

If we run this code, we should see the additional output:

The string starts with 'Hello'!
The string ends with 'world!'!
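
The feature table also lists \b, which anchors a match to a word boundary rather than to the start or end of the string. A small sketch, using a made-up sentence, shows the difference it makes:

```perl
use strict;
use warnings;
use feature 'say';

my $sentence = "The cat sat in the category.";   # made-up sample text

# Without \b, "cat" is also found inside "category".
# (With /g in list context, a match returns every matched substring.)
my @loose  = $sentence =~ /cat/g;

# \b requires a word boundary on both sides, so only the whole word counts.
my @strict = $sentence =~ /\bcat\b/g;

say scalar(@loose);    # 2
say scalar(@strict);   # 1
```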

This covers the basics of pattern matching using regular expressions. You should now be able to apply these concepts when performing search and replace operations. In Perl, the s/// operator is used to search for a pattern and replace it within a string. To implement your first basic search and replace, add the following lines at the end of your file:

# Search and replace example
my $text = "I love JavaScript programming!";
$text =~ s/JavaScript/Perl/;
say "After replacement: $text";

This code searches for the word "JavaScript" in the string $text and replaces it with "Perl". The s/// operator is used for substitution, where the first part is the pattern to search for, the second part is the replacement text, and the third part is optional modifiers (not used here). After the replacement, it prints the modified string.

If we run this code, we should see the output:

After replacement: I love Perl programming!

Now, let's explore how to use modifiers with search and replace operations. Modifiers can change the behavior of regex matching, allowing for case-insensitive matching, global replacements, and more. We will first use the i modifier for case-insensitive matching:

# Case-insensitive search and replace
$text = "I love JAVASCRIPT programming!";
$text =~ s/javascript/Perl/i;
say "After case-insensitive replacement: $text";

This code searches for the word "javascript" in the string $text and replaces it with "Perl", ignoring case due to the i modifier. If a match is found, it performs the replacement regardless of the case of the letters. After the replacement, it prints the modified string.

If we run this code, we should see the output:

After case-insensitive replacement: I love Perl programming!

Next, let's explore the g modifier, which allows us to perform a global search and replace, replacing all occurrences of the pattern in the string:

# Global search and replace
$text = "I love JAVASCRIPT and javascript is great!";
$text =~ s/javascript/Perl/ig;
say "After global replacement: $text";

This code searches for all occurrences of the word "javascript" in the string $text and replaces them with "Perl", ignoring case due to the i modifier and applying the replacement globally due to the g modifier. After the replacement, it prints the modified string.

If we run this code, we should see the output:

After global replacement: I love Perl and Perl is great!
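
The g modifier is not limited to substitutions. With a plain match it lets you collect or loop over every occurrence. The sketch below assumes a made-up log line and is only meant to show the two common forms.

```perl
use strict;
use warnings;
use feature 'say';

my $log = "error at 10:42, error at 11:07, ok at 11:30";  # hypothetical input

# List context: /g returns every capture at once.
my @times = $log =~ /(\d\d:\d\d)/g;
say "@times";                       # 10:42 11:07 11:30

# Scalar context: each iteration continues where the previous match stopped.
my @error_times;
while ($log =~ /error at (\d\d:\d\d)/g) {
    push @error_times, $1;
}
say "@error_times";                 # 10:42 11:07
```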

Next, let's explore the m modifier, which allows us to treat the string as multiple lines, enabling the ^ and $ anchors to match the start and end of each line:

# Multiline search and replace
$text = "Hello World!\nThis is a test.\nGoodbye World!";
$text =~ s/^Hello/Hi/m;
$text =~ s/^Goodbye/See you/m;
say "After multiline replacement: $text";

The first substitution replaces "Hello" at the start of the first line with "Hi". The second replaces "Goodbye" at the start of the third line with "See you". The m modifier allows the ^ anchor to match at the start of each line; without it, ^ would only match at the very start of the string, so the second substitution would do nothing. After the replacements, it prints the modified string.

If we run this code, we should see the output:

After multiline replacement: Hi World!
This is a test.
See you World!

You can also use matches in your replacement text by using $1, $2, and so on to refer to captured groups. For example, if we want to swap the words "Hello" and "World" in the string:

# Using captured groups in replacement
$text = "Hello World!";
$text =~ s/(Hello) (World)/$2 $1/;
say "After swapping words: $text";

This code captures "Hello" and "World" in separate groups and then swaps their positions in the replacement text. The $1 refers to the first captured group ("Hello"), and $2 refers to the second captured group ("World"). After the replacement, it prints the modified string.

If we run this code, we should see the output:

After swapping words: World Hello!

Finally, you can also evaluate Perl code within your regex replacement using the e modifier. This allows you to compute the replacement text instead of writing it out literally:

# Using Perl code in replacement
$text = "The price is 100 pound.";
$text =~ s/(\d+)/$1 * 1.2/e;  # Increase price by 20%
say "After applying Perl code in replacement: $text";

This code captures a number in the string and then uses Perl code to multiply it by 1.2 in the replacement part. The e modifier allows the replacement text to be evaluated as Perl code. After the replacement, it prints the modified string.

If we run this code, we should see the output:

After applying Perl code in replacement: The price is 120 pound.
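
Because the replacement is ordinary Perl code, it can call functions as well. The following sketch combines e with g and uses sprintf to keep two decimal places; the menu text and the 20% increase are assumptions made for the example.

```perl
use strict;
use warnings;
use feature 'say';

my $menu = "Tea 2.50, cake 3.99";   # made-up prices for illustration

# /e evaluates the replacement as code; /g applies it to every number.
# sprintf formats each new price with exactly two decimal places.
(my $raised = $menu) =~ s/(\d+\.\d+)/sprintf("%.2f", $1 * 1.2)/ge;

say $menu;      # Tea 2.50, cake 3.99
say $raised;    # Tea 3.00, cake 4.79
```

Copying into $raised first keeps the original string intact, which makes it easy to compare the before and after values.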

Regular expressions in Perl are a powerful tool for text processing, allowing you to search, match, and manipulate strings with great flexibility. In this lesson, we covered the basics of regex syntax, including character classes, quantifiers, anchors, capturing groups, alternation, and modifiers.

Regular expressions can be complex, but with practice, you can become proficient in using them to solve various text processing tasks. Remember to start with simple patterns and gradually build up to more complex ones. Use online tools and resources to test and visualize your regex patterns, and don't hesitate to experiment with different features and modifiers.

This concludes our introduction to regular expressions in Perl. In the next post, we will explore how to create modules and packages in Perl, which will allow you to organise your code into reusable components. This is an essential skill for writing maintainable and modular Perl applications.
