Lnation

Learning Perl - Regular Expressions

Regular expressions, also known as regex or regexp, are a powerful way to match patterns in text. Most programming languages support them, although the syntax varies slightly between languages because there is more than one standard. Perl uses its own regex engine, which is very powerful and flexible. It inspired the separate Perl Compatible Regular Expressions (PCRE) library, which brings Perl-style syntax to other languages and tools. Perl was designed with text processing in mind, and regular expressions are a key part of that design. Compared with older standards such as POSIX, Perl's regexes add advanced features such as non-greedy quantifiers, lookahead/lookbehind, and named captures. You would typically use regular expressions in programming for tasks such as:

Validating input (e.g., checking if an email address is well-formed)
Searching for specific patterns in text (e.g., finding all occurrences of a word)
Replacing text (e.g., changing all instances of a word to another)
Splitting strings based on patterns (e.g., breaking a sentence into words)
Extracting specific information from text (e.g., pulling out dates or phone numbers)

Regular expressions can be quite complex, and mastering them takes time and practice. They can match simple patterns like specific words or characters, as well as more complex patterns involving character classes, quantifiers, anchors, capturing groups, alternation, and more.

As stated, Perl has its own regex engine, which many other languages try to emulate. It is known for its flexibility and power, and it supports a wide range of features, including:

Feature	Syntax/Example	Description
Literal Match	`foo`	Matches the exact string "foo"
Character Class	`[a-z]`, `\d`, `\w`	Matches any character in the set (e.g., lowercase letters, digits, word chars)
Negated Class	`[^a-z]`, `\D`, `\W`	Matches any character not in the set
Quantifiers	`*`, `+`, `?`, `{n,m}`	Specify how many times to match (zero/more, one/more, optional, specific counts)
Anchors	`^`, `$`, `\b`, `\B`	Match positions (start/end of line, word boundaries)
Grouping/Capturing	`(abc)`	Groups patterns and captures matched text
Non-capturing Group	`(?:abc)`	Groups patterns without capturing
Alternation	`foo|bar`	Matches "foo" or "bar"
Lookahead/Lookbehind	`(?=...)`, `(?!...)`, `(?<=...)`, `(?<!...)`	Asserts patterns ahead/behind without consuming characters
Modifiers	`/i`, `/g`, `/s`, `/m`, `/x`	Change regex behavior (case-insensitive, global, dot matches newline, multiline, extended)
Named Capture	`(?<name>...)`	Captures matched text under a name, available through the `%+` hash
Substitution	`s/foo/bar/`	Replaces matched text
Split	`split /,/, $str`	Splits a string using a regex pattern
Compiled	`qr/(abc)/`	Compiles a regular expression so it can be used many times
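
Several of the features in this table are not revisited later in the post, so here is a minimal sketch of named captures, a lookahead, split and qr//, written as a standalone script. The sample strings and variable names are made up for illustration, and the =~ operator it relies on is explained in the next section.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

# Named captures: the matched text is available through the %+ hash
my $date = "2024-05-17";
if ($date =~ /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/) {
    say "Year: $+{year}, month: $+{month}, day: $+{day}";
}

# Lookahead: match "foo" only when it is followed by "bar"
my $words = "foobar foobaz";
my @hits  = $words =~ /foo(?=bar)/g;
say "Lookahead matches: " . scalar @hits;

# split with a regex: break on a comma plus any surrounding whitespace
my @fields = split /\s*,\s*/, "red , green,blue";
say join "|", @fields;

# qr// compiles a pattern once so it can be reused or interpolated
my $digits  = qr/\d+/;
my @numbers = "a1 b22 c333" =~ /($digits)/g;
say "@numbers";
```

The lookahead only asserts that "bar" follows, so the matched text itself is just "foo".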

In Perl, regular expressions are typically used with the =~ operator to apply a regex pattern to a string, or !~ to check if a pattern does not match. The basic syntax for using regex in Perl is as follows:

$string =~ type/pattern/modifiers;

Where:

$string is the variable containing the text you want to match against.
type is the type of match you want to perform (e.g., m for match, s for substitution).
pattern is the regex pattern you want to match.
modifiers are optional flags that change the behavior of the regex (e.g., i for case-insensitive matching, g for global matching).
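
To make the syntax concrete, here is a small sketch of =~ and !~ as a standalone script. The $path value is just an illustrative string; the point is that the leading m is optional with slash delimiters, and that an explicit m lets you choose other delimiters.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

my $path = "/usr/local/bin/perl";

# The leading m is optional when the delimiters are slashes
say "contains 'perl'" if $path =~ /perl/;

# With an explicit m you can pick other delimiters, which avoids escaping slashes
say "under /usr/local" if $path =~ m{^/usr/local/};

# !~ is true when the pattern does NOT match (here: no backslash in the path)
say "no backslashes" if $path !~ /\\/;
```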

Modifiers are placed after the closing delimiter of the pattern, as in /pattern/modifiers, and change how the regex behaves. The following are some common modifiers used in Perl regex:

Modifier	Example	Description
i	/pattern/i	Case-insensitive matching
g	/pattern/g	Global matching (find all matches, not just the first)
m	/pattern/m	Multiline mode (`^` and `$` match start/end of each line)
s	/pattern/s	Single line mode (`.` matches newline as well)
x	/pattern/x	Extended mode (ignore whitespace and allow comments in pattern)
o	/pattern/o	Compile pattern only once (largely obsolete in modern Perl; prefer `qr//`)
e	s///e	Evaluate replacement as Perl code in substitution
r	s///r	Return the result of substitution without modifying the variable
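
As a rough illustration of how modifiers combine, the sketch below uses x, i and g together on a match, and g, i and r together on a substitution. The $log string is invented for the example, and the r modifier assumes Perl 5.14 or newer.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

my $log = "error at 10:42, ERROR at 11:07";

# x allows whitespace and comments in the pattern, i ignores case, g finds every match
my @times = $log =~ /
    error \s+ at \s+   # the word "error" in any case, then "at"
    (\d{2}:\d{2})      # capture an HH:MM time
/xig;
say "Times: @times";

# r returns the changed copy and leaves the original untouched
my $quiet = $log =~ s/error/warning/gir;
say $log;
say $quiet;
```

In list context a match with g returns every captured value, which is why @times ends up holding both times.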

This all may sound a bit overwhelming, but don't worry! Regular expressions can be learned step by step. The key is to start with simple patterns and gradually build up to more complex ones. Here are some tips to get started:

Start Simple: Begin with basic patterns like matching specific words or characters. For example, try matching the word "cat" in a string.
Use Online Tools: There are many online regex testers and visualizers that can help you see how your patterns match against sample text. These tools often provide explanations for each part of the regex. I personally use https://regexr.com/
Practice Regularly: Regular expressions are a skill that improves with practice. Try solving small problems or challenges that require regex.
Read Documentation: Familiarise yourself with the regex documentation for the language you're using. Perl's documentation is extensive and provides many examples. https://perldoc.perl.org/perlre
Experiment: Don't be afraid to experiment with different patterns and modifiers. Try changing quantifiers, character classes, or adding anchors to see how it affects the matches.
Learn Common Patterns: There are many common regex patterns for tasks like validating email addresses, phone numbers, or URLs. Learning these can save you time and effort.
Break Down Complex Patterns: If you encounter a complex regex, break it down into smaller parts. Understand each component before trying to grasp the whole pattern.

Okay, let's get started with some basic examples of regular expressions in Perl. First, we will look at how to match simple strings and characters. Create a new file 'regex.pl' and insert the following code:

#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
# Simple regex match
my $string = "Hello, world!";
if ($string =~ m/world/) {
    say "Found 'world' in the string!";
} else {
    say "'world' not found.";
}

This code checks if the string "Hello, world!" contains the substring "world". The =~ operator is used to apply the regex pattern /world/ to the variable $string. If a match is found, it prints a message indicating that "world" was found; otherwise, it indicates that it was not found.

If we save and run this script, we should see the output:

Found 'world' in the string!

Now let's explore character classes, which allow us to match any character from a set. For example, we can match any lowercase letter:

if ($string =~ m/[a-z]/) {
    say "Found a lowercase letter in the string!";
} else {
    say "No lowercase letters found.";
}

This code checks if the string contains any lowercase letter from 'a' to 'z'. The character class [a-z] matches any single lowercase letter. If a match is found, it prints a message indicating that a lowercase letter was found; otherwise, it indicates that no lowercase letters were found.

If we run this code, we should see the additional output:

Found a lowercase letter in the string!

Next, let's look at character classes with negation, which allows us to match any character that is not in a specified set. For example, we can match any character that is not a lowercase letter:

if ($string =~ m/[^a-z]/) {
    say "Found a character that is not a lowercase letter!";
} else {
    say "All characters are lowercase letters.";
}

This code checks if the string contains any character that is not a lowercase letter. The negated character class [^a-z] matches any single character that is not in the range 'a' to 'z'. If a match is found, it prints a message indicating that such a character was found; otherwise, it indicates that all characters are lowercase letters.

If we run this code, we should see the additional output:

Found a character that is not a lowercase letter!

The match succeeds because the string contains an uppercase 'H', a comma, a space and an exclamation mark. To check instead for anything outside word characters, whitespace and punctuation, we can widen the negated class:

if ($string =~ m/[^\w\s\p{Punct}]/) {
    say "Found a character that is not a letter, space, or punctuation!";
} else {
    say "All characters are letters, spaces, or punctuation.";
}

We use the character class \w to match any word character (letters, digits, and underscores), \s to match whitespace characters (spaces, tabs, etc.), and \p{Punct} to match punctuation characters, including their Unicode variants. The negated character class [^\w\s\p{Punct}] therefore matches any character that is not a word character, whitespace, or punctuation. If a match is found, it prints a message indicating that such a character was found; otherwise, it indicates that all characters are letters, spaces, or punctuation.

If we run this code, we should see the additional output:

All characters are letters, spaces, or punctuation.

Now let's check if we have any digits in the string. We can use the \d shorthand character class, which matches any digit (0-9). Add the following code to your script:

if ($string =~ m/\d/) {
    say "Found a digit in the string!";
} else {
    say "No digits found.";
}

This code checks if the string contains any digit. The \d matches any single digit character. If a match is found, it prints a message indicating that a digit was found; otherwise, it indicates that no digits were found.

If we run this code, we should see the additional output:

No digits found.

Next we will explore capture groups, which allow us to extract specific parts of a string. We can use parentheses () to create a capture group:

if ($string =~ /(Hello), (world)!/) {
    say "Found a capture group: $1 and $2";
} else {
    say "No match found.";
}

This code checks if the string matches the pattern with two capture groups: one for "Hello" and one for "world". The matched text is stored in $1 and $2, which correspond to the first and second capture groups, respectively. If a match is found, it prints the captured values; otherwise, it indicates that no match was found.

If we run this code, we should see the additional output:

Found a capture group: Hello and world
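
Captures do not have to be read from $1 and $2. In list context a match returns its captured values, so they can be assigned straight to named variables. A minimal standalone sketch, with variable names chosen purely for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

my $input = "Hello, world!";

# In list context the match returns its captures; a failed match returns an empty list
my ($greeting, $target) = $input =~ /(\w+), (\w+)!/;

if (defined $greeting) {
    say "greeting=$greeting target=$target";
} else {
    say "No match found.";
}
```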

With that working, let's explore non-capturing groups, which allow us to group patterns without capturing the matched text. We can use (?:...) for non-capturing groups:

if ($string =~ m/(?:Hello), (world)!/) {
    say "Found a non-capturing group match: $1";
} else {
    say "No match found.";
}

The code checks if the string matches the pattern with a non-capturing group for "Hello" and a capturing group for "world". The matched text for "world" is stored in $1. If a match is found, it prints the captured value; otherwise, it indicates that no match was found. Non-capturing means the group still has to match, but its text is not stored in a numbered variable such as $1.

If we run this code, we should see the additional output:

Found a non-capturing group match: world

Next, let's explore the . (dot) character, which matches any single character except a newline. This can be useful for matching patterns where you don't care about the specific character:

if ($string =~ /(H.llo)/) {
    say "Found a match with a dot: $1";
} else {
    say "No match found.";
}

This code checks if the string contains "H" followed by any character (represented by .) and then "llo". The matched text is stored in $1. If a match is found, it prints the captured value; otherwise, it indicates that no match was found.

If we run this code, we should see the additional output:

Found a match with a dot: Hello

Next we will explore quantifiers, which allow us to specify how many times a pattern should match. For example, we can use the * quantifier to match zero or more occurrences of a character:

if ($string =~ m/(l*)/) {
    say "Found zero or more 'l' characters in the string! $1";
} else {
    say "No 'l' characters found.";
}

This code checks if the string contains zero or more occurrences of the character 'l'. The * quantifier matches zero or more occurrences of the preceding character. Note that the else branch here will never be hit: because * accepts zero occurrences, the pattern always matches, even if only an empty string. That is exactly what happens here. The regex engine tries the leftmost position first, finds zero 'l' characters before the "H", and succeeds immediately, so $1 is empty.

If we run this code, we should see the additional output:

Found zero or more 'l' characters in the string! 

To ensure you have one or more occurrences of a character, you can use the + quantifier. Extend your code with the following:

if ($string =~ m/(l+)/) {
    say "Found one or more 'l' characters in the string! $1";
} else {
    say "No 'l' characters found.";
}

Here we check if the string contains one or more occurrences of the character 'l'. The + quantifier matches one or more occurrences of the preceding character. If a match is found, it prints a message followed by the 'l' characters that were matched; otherwise, it indicates that no 'l' characters were found.

If we run this code, we should see the additional output:

Found one or more 'l' characters in the string! ll

The difference between * and + is that * can match zero occurrences, while + requires at least one occurrence. The difference looks small in these examples, but it becomes more significant when you build more complex regular expressions with optional parts.
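
Two other quantifier forms from the feature table are worth a quick sketch: counted repetition with {n,m}, and non-greedy matching with a trailing ?. The strings below are invented for illustration, and the block is a standalone script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

my $html = "<b>bold</b> and <i>italic</i>";

# Greedy: .+ grabs as much as it can, so the match runs to the last '>'
my ($greedy) = $html =~ /(<.+>)/;
say "greedy: $greedy";

# Non-greedy: .+? stops at the first '>' it can
my ($lazy) = $html =~ /(<.+?>)/;
say "lazy:   $lazy";

# {4,6}: between four and six digits, anchored so nothing else is allowed
for my $pin ("12", "1234", "1234567") {
    say "$pin: ", ($pin =~ /^\d{4,6}$/ ? "ok" : "rejected");
}
```

Because the sample string starts with '<' and ends with '>', the greedy version captures the whole string, while the non-greedy version captures only the first tag.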

Next, let's look at alternation, which allows us to match one of several patterns. We can use the | operator for alternation:

if ($string =~ m/earth|world/) {
    say "Found either 'earth' or 'world' in the string!";
} else {
    say "Neither 'earth' nor 'world' found.";
}

This code checks if the string contains either "earth" or "world". The alternation operator | allows us to specify multiple patterns to match. If a match is found, it prints a message indicating that one of the words was found; otherwise, it indicates that neither word was found.

If we run this code, we should see the additional output:

Found either 'earth' or 'world' in the string!

Finally for matching, let's explore anchors, which allow us to match patterns at specific positions in the string. We can use ^ to match the start of the string and $ to match the end of the string:

if ($string =~ m/^Hello/) {
    say "The string starts with 'Hello'!";
} else {
    say "The string does not start with 'Hello'.";
}
if ($string =~ m/world!$/) {
    say "The string ends with 'world!'!";
} else {
    say "The string does not end with 'world!'.";
}

This code checks if the string starts with "Hello" and ends with "world!". The ^ anchor matches the start of the string, and the $ anchor matches the end of the string. If a match is found for either condition, it prints a corresponding message; otherwise, it indicates that the conditions were not met.

If we run this code, we should see the additional output:

The string starts with 'Hello'!
The string ends with 'world!'!
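
The feature table also lists \b, the word-boundary anchor, which matches a position rather than a character. A minimal standalone sketch with an invented sentence shows why it matters when you want whole words only:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

my $sentence = "The cat sat in the category aisle.";

# Without \b, "cat" also matches inside "category"
my $loose = () = $sentence =~ /cat/g;

# With \b on both sides, only the standalone word matches
my $whole = () = $sentence =~ /\bcat\b/g;

say "loose: $loose, whole: $whole";
```

The `= () =` construct forces list context so that the number of matches is counted.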

This covers the basics of pattern matching using regular expressions. You should now be able to apply these concepts when performing search and replace operations. In Perl, the s/// operator is used to search for a pattern and replace it within a string. To implement your first basic search and replace, add the following lines at the end of your file:

# Search and replace example
my $text = "I love JavaScript programming!";
$text =~ s/JavaScript/Perl/;
say "After replacement: $text";

This code searches for the word "JavaScript" in the string $text and replaces it with "Perl". The s/// operator is used for substitution, where the first part is the pattern to search for, the second part is the replacement text, and the third part is optional modifiers (not used here). After the replacement, it prints the modified string.

If we run this code, we should see the output:

After replacement: I love Perl programming!

Now, let's explore how to use modifiers with search and replace operations. Modifiers can change the behavior of regex matching, allowing for case-insensitive matching, global replacements, and more. We will first use the i modifier for case-insensitive matching:

# Case-insensitive search and replace
$text = "I love JAVASCRIPT programming!";
$text =~ s/javascript/Perl/i;
say "After case-insensitive replacement: $text";

This code searches for the word "javascript" in the string $text and replaces it with "Perl", ignoring case due to the i modifier. If a match is found, it performs the replacement regardless of the case of the letters. After the replacement, it prints the modified string.

If we run this code, we should see the output:

After case-insensitive replacement: I love Perl programming!

Next, let's explore the g modifier, which allows us to perform a global search and replace, replacing all occurrences of the pattern in the string:

# Global search and replace
$text = "I love JAVASCRIPT and javascript is great!";
$text =~ s/javascript/Perl/ig;
say "After global replacement: $text";

This code searches for all occurrences of the word "javascript" in the string $text and replaces them with "Perl", ignoring case due to the i modifier and applying the replacement globally due to the g modifier. After the replacement, it prints the modified string.

If we run this code, we should see the output:

After global replacement: I love Perl and Perl is great!
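
The g modifier is also useful on a plain match. In scalar context, each call continues from where the previous match ended, which lets you walk through every occurrence in a loop. A minimal standalone sketch, using an invented string and the built-in pos function to report where each match stopped:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

my $rhyme = "one fish two fish red fish";
my @found;

# Each iteration resumes the search after the previous match
while ($rhyme =~ /(\w+) fish/g) {
    push @found, $1;
    say "$1 fish ends at offset ", pos($rhyme);
}
say "Found: @found";
```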

Next, let's explore the m modifier, which allows us to treat the string as multiple lines, enabling the ^ and $ anchors to match the start and end of each line:

# Multiline search and replace
$text = "Hello World!\nThis is a test.\nGoodbye World!";
$text =~ s/^Hello/Hi/m;
$text =~ s/^Goodbye/See you/m;
say "After multiline replacement: $text";

This code replaces "Hello" at the start of the first line with "Hi", and "Goodbye" at the start of the third line with "See you". Without the m modifier, ^ matches only at the very start of the string, so the second substitution would not match. With m, ^ matches at the start of every line. After the replacements, it prints the modified string.

If we run this code, we should see the output:

After multiline replacement: Hi World!
This is a test.
See you World!

You can also use matches in your replacement text by using $1, $2 and so on to refer to captured groups. For example, if we want to swap the words "Hello" and "World" in the string:

# Using captured groups in replacement
$text = "Hello World!";
$text =~ s/(Hello) (World)/$2 $1/;
say "After swapping words: $text";

This code captures "Hello" and "World" in separate groups and then swaps their positions in the replacement text. The $1 refers to the first captured group ("Hello"), and $2 refers to the second captured group ("World"). After the replacement, it prints the modified string.

If we run this code, we should see the output:

After swapping words: World Hello!
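
The same idea scales to reordering structured text. The sketch below is a standalone example, with an invented $line, that rewrites every YYYY-MM-DD date as DD/MM/YYYY. It uses braces as delimiters so the slashes in the replacement need no escaping, and the r modifier (Perl 5.14 or newer) to keep the original string intact.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

my $line = "Released 2024-05-17, patched 2024-06-02";

# $1 = year, $2 = month, $3 = day; g applies the swap to every date
my $reordered = $line =~ s{(\d{4})-(\d{2})-(\d{2})}{$3/$2/$1}gr;

say $reordered;
```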

Finally, you can also evaluate Perl code within your regex replacement using the e modifier. This allows you to use Perl code in the replacement part of the substitution:

# Using Perl code in replacement
$text = "The price is 100 pound.";
$text =~ s/(\d+)/$1 * 1.2/e;  # Increase price by 20%
say "After applying Perl code in replacement: $text";

This code captures a number in the string and then uses Perl code to multiply it by 1.2 in the replacement part. The e modifier allows the replacement text to be evaluated as Perl code. After the replacement, it prints the modified string.

If we run this code, we should see the output:

After applying Perl code in replacement: The price is 120 pound.
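
The e modifier combines with the others. As a sketch, the standalone example below applies the same 20% increase to every number in an invented string and uses sprintf to keep two decimal places. It assumes Perl 5.14 or newer for the r modifier.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

my $prices = "Tea 2.5, cake 3.75";

# g: every number, e: run the replacement as code, r: return a modified copy
my $increased = $prices =~ s/(\d+(?:\.\d+)?)/sprintf("%.2f", $1 * 1.2)/ger;

say $increased;
```

The non-capturing group (?:\.\d+)? makes the decimal part optional, so whole numbers are handled too.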

Regular expressions in Perl are a powerful tool for text processing, allowing you to search, match, and manipulate strings with great flexibility. In this lesson, we covered the basics of regex syntax, including character classes, quantifiers, anchors, capturing groups, alternation, and modifiers.

Regular expressions can be complex, but with practice, you can become proficient in using them to solve various text processing tasks. Remember to start with simple patterns and gradually build up to more complex ones. Use online tools and resources to test and visualize your regex patterns, and don't hesitate to experiment with different features and modifiers.

This concludes our introduction to regular expressions in Perl. In the next post, we will explore how to create modules and packages in Perl, which will allow you to organise your code into reusable components. This is an essential skill for writing maintainable and modular Perl applications.