Regular Expressions 101, by Matt H.
"A Regular Expression is a pattern to be matched against a string. Matching a regular expression against a string either succeeds or fails. Sometimes, that success or failure may be all you are concerned about. At other times, you will want to take a matched pattern and replace it with another string, parts of which may depend on exactly how and where the regular expression matched."
- Schwartz/Christiansen, Learning Perl
To see if $_ contains "Hello", we'd build a construct in Perl something like this:
if (/Hello/) {
    do_stuff();
}
The regular expression is "/Hello/", and this kind of searching will work for any alphanumeric data.
In addition to matching, regular expressions can replace text with the "s" (substitute) operator. Compare:
if (/Hello/) { do_stuff(); } #Matches "Hello"
s/Hello/Goodbye/; #Replaces "Hello" with "Goodbye"
s/Hello/Goodbye/g; #Replaces all instances of "Hello" with "Goodbye"
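To make the difference between the last two lines concrete, here is a small sketch. The sample text is made up for illustration, and both substitutions work on $_ just like the examples above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample text is an assumption, chosen so "Hello" appears three times.
$_ = "Hello, Hello, Hello";
s/Hello/Goodbye/;               # without g: only the first "Hello" changes
my $once = $_;

$_ = "Hello, Hello, Hello";
my $count = s/Hello/Goodbye/g;  # with g: every "Hello" changes; s///g returns how many
my $all = $_;

print "$once\n";    # Goodbye, Hello, Hello
print "$all\n";     # Goodbye, Goodbye, Goodbye
print "$count\n";   # 3
```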
The caret ^ matches the beginning of a string, and \b matches a word boundary.
/^Matt\b/ #Matches strings that begin with the word "Matt", such as "Matt " or "Matt,". Would not match "Matthew"
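A quick way to convince yourself is to run the pattern over a few strings. This is only a sketch; the test strings are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Test strings are assumptions made up for this example.
my @names = ("Matt is here", "Matthew is here", "Hi Matt", "Matt");
my @matched;
for (@names) {
    # ^ pins the match to the start; \b requires the word to end after "Matt"
    push @matched, $_ if /^Matt\b/;
}
print "$_\n" for @matched;   # prints "Matt is here" and "Matt"
```

"Matthew is here" fails because there is no word boundary between "t" and "h", and "Hi Matt" fails because "Matt" is not at the beginning of the string.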
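Several characters and positions that are not plain alphanumerics can also be matched. Typically, they are written as a backslash followed by a letter, like "\b" (word boundary), "\n" (newline), "\d" (any digit), or "\s" (whitespace).
We can start to build some more powerful regular expressions now. Square brackets match any one of the characters listed inside them: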
| /[aeiou]letters/ | Matches "aletters", "eletters", "iletters", etc |
| /[0-9]/ | Equivalent to /\d/ |
| /\d\d\d\-\d\d\d-\d\d\d\d/ | Matches 616-554-1022, 517-111-0102, etc. |
| /\(\d\d\d\)\-\d\d\d\-\d\d\d\d/ | Matches (616)-554-1022, (517)-111-0102, etc. |
| s/\(\d\d\d\)\-\d\d\d\-\d\d\d\d/(555)-555-1212/g; | Replaces phone numbers with (555)-555-1212 |
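
The two phone patterns from the table can be tried side by side. This is a sketch only; the candidate strings and the %kind hash are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Candidate strings are assumptions; the two patterns come from the table above.
my @candidates = ("616-554-1022", "(616)-554-1022", "616 554 1022");
my %kind;
for (@candidates) {
    if (/\(\d\d\d\)\-\d\d\d\-\d\d\d\d/) {
        $kind{$_} = "parenthesized";   # area code wrapped in literal ( )
    } elsif (/\d\d\d\-\d\d\d-\d\d\d\d/) {
        $kind{$_} = "plain";           # digits separated by dashes only
    } else {
        $kind{$_} = "no match";        # e.g. spaces instead of dashes
    }
}
print "$_: $kind{$_}\n" for @candidates;
```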
We can also add alternation, as in 1|2|3. This means to match exactly one of the alternatives (1 or 2 or 3 in this case). This works for words as well, so /blue\s(jay|bird)/ would match "blue jay" or "blue bird". The parentheses are needed because alternation has the lowest precedence of any regular expression operator: without them, /blue\sjay|bird/ would mean "blue jay" or just "bird".
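The precedence point is easy to check with and without the parentheses. A minimal sketch, with test phrases chosen for the example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my (@grouped, @ungrouped);
# The three phrases are assumptions picked to show the difference.
for ("blue jay", "blue bird", "bird") {
    push @grouped,   $_ if /blue\s(jay|bird)/;  # "blue " followed by jay or bird
    push @ungrouped, $_ if /blue\sjay|bird/;    # "blue jay", or "bird" anywhere
}
print "grouped:   @grouped\n";     # blue jay blue bird
print "ungrouped: @ungrouped\n";   # blue jay blue bird bird
```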
So far, we've only talked about matching against the $_ variable. This is very easy if all we're going to do is while (<>) {} loops, but there may be times when we'd like to match other variables. The =~ operator is used in this case. The following snippet of code should explain it:
$team = "ReD Sox";
if ($team=~/^[Rr][Ee][Dd]/) {
print "This string began with some form of \"red\"\n";
print "That Caret symbol means 'begins with'\n";
}
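We can also build regexps from variables. Here's an example:
$phone = "(123)-456-789";
$search = "[123]";
$replace = "9";
$phone =~ s/$search/$replace/g;
print $phone . "\n";
Because "[123]" is treated as a pattern and not as literal text, every 1, 2, or 3 is replaced, and this prints "(999)-456-789".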
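One thing to watch with variables: whatever is in the variable is read as a pattern, brackets and all. If the text should be taken literally instead, Perl's \Q...\E escapes can be wrapped around the variable. A short sketch, where the second sample string is an assumption made up for the example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $search  = "[123]";
my $replace = "9";

# Read as a pattern: [123] means "any one of 1, 2, or 3".
my $as_class = "(123)-456-789";
$as_class =~ s/$search/$replace/g;

# Read literally: \Q...\E turns off the special meaning of [ and ].
my $as_literal = "see [123] here";
$as_literal =~ s/\Q$search\E/$replace/g;

print "$as_class\n";     # (999)-456-789
print "$as_literal\n";   # see 9 here
```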
Parentheses can be used to remember what was actually matched. The matched text is stored in special variables: $1 for the first pair of parentheses, $2 for the second, $3 for the third, and so on.
Here's an Example -
$_ = "(616)-555-1212";
if (/(\d\d\d-\d\d\d\d)/) {
print "$1\n";
}
Would print "555-1212"
s/(blue)jay/${1}bird/g;
Would replace "bluejay" with "bluebird": the parentheses remember "blue", and ${1} puts it back in front of "bird".
Of course, if we wanted to match literal parentheses, like ( ), we'd put a backslash in front of each one: \( \).
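Remembering parentheses and literal parentheses can be mixed in one pattern. The sketch below pulls a phone number apart into three pieces; the variable names $area, $exchange, and $line are just names chosen for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

$_ = "(616)-555-1212";
my ($area, $exchange, $line);

# \( and \) match the literal parentheses; the plain ( ) pairs remember text.
if (/\((\d\d\d)\)-(\d\d\d)-(\d\d\d\d)/) {
    ($area, $exchange, $line) = ($1, $2, $3);
}
print "$area $exchange $line\n";   # 616 555 1212
```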
This concludes the tutorial section on regular expressions. It is not even a complete introduction to regular expressions; instead, this whitepaper should simply make large regexps a bit less intimidating.
We've got a file, "data.txt", that has sales figures in this format - (state)(salesman's Phone #)($ Amount)(LF)
It looks something like this -
MI616-554-1036100.00
MI616-673-0011200.00
MI616-394-1133125.00
MI616-392-1025100.00
WI517-112-2020300.00
WI517-230-2030300.00
MD301-371-5233600.00
MD410-543-1010800.00
Of course, the COBOL group in your IS department gives it to you in that format. When you ask for any other format, they say "No, we're busy with Y2K stuff."
First, we'll write a Perl script called out.pl to output the file on the screen:
#!/usr/bin/perl
# - Hash/Bang Address will change based on system
while (<>) {
#while loop will pull any filenames off @ARGV, loop
#through them and assign the current line to $_
print $_;
#This will print the current line
}
And run it by typing:
perl out.pl data.txt
Problem #1 - Make 5 different reports, one for each state, totaling and listing sales.
Step 1 -
#!/usr/bin/perl
$state = shift;
#Pulls state off the @ARGV array
while (<>) {
    #Loop through any remaining files
    if (/^$state/) {
        #If the line begins with the state code
        print $_;
        #Print Me
    }
}
Step 2 -
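#!/usr/bin/perl
$state = shift;
$date = shift;
$total = 0.00;
print "Sales totals for " . $state . ", " . $date;
print "\n\n\n\n";
print "Phone Sales\n";
print "---------------------\n";
while (<>) {
    if (/^$state/) {
        #The phone number is the 12 characters after the 2-letter state
        $phone = substr($_, 2, 12);
        #The amount is everything after the phone number, newline included
        $sales = substr($_, 14);
        $total = $total + $sales;
        print $phone . " \$" . $sales;
    }
}
print "---------------------\n";
printf "Total \$%.2f\n", $total;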
The next step would be to count the number of lines printed, and add a form feed around line 55, new headers on every page, and a page-count. Then, format the output so it looks "pretty." But we'll leave that to you. The program would be invoked something like this -
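(The script name report.pl below is an assumption; use whatever name you saved the Step 2 script under.)
perl report.pl WI 06/01/99 data.txt > WI-06-01-99rpt.txt
perl report.pl MI 06/01/99 data.txt > MI-06-01-99rpt.txt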
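Since this paper is about regexps, the substr() calls in Step 2 could also be written as one pattern with three sets of parentheses. This is only a sketch, assuming every line is a two-letter uppercase state, a phone number, and an amount with two decimal places; parse_line is a hypothetical helper name, and the sample lines are copied from data.txt above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_line is a hypothetical helper: returns (state, phone, amount),
# or an empty list if the line does not fit the assumed layout.
sub parse_line {
    my ($line) = @_;
    if ($line =~ /^([A-Z][A-Z])(\d\d\d-\d\d\d-\d\d\d\d)(\d+\.\d\d)$/) {
        return ($1, $2, $3);
    }
    return;
}

my @sample = ("MI616-554-1036100.00\n",
              "MI616-673-0011200.00\n",
              "WI517-112-2020300.00\n");
my $total = 0;
for my $line (@sample) {
    my ($state, $phone, $sales) = parse_line($line);
    next unless defined $state && $state eq "MI";   # keep only the MI lines
    $total += $sales;
    print "$phone \$$sales\n";
}
printf "Total \$%.2f\n", $total;   # Total $300.00
```

The advantage over counting character positions is that a line that does not fit the layout is skipped instead of being silently chopped in the wrong place.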
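Problem #2 - The COBOL guys made a mistake - all the "MI" records should have been "KS". Write a program to reproduce the database with the data changed.
#!/usr/bin/perl
while (<>) {
    #Only change the state code at the start of the line
    s/^MI/KS/;
    print $_;
}
Run it like this (assuming the script is saved as fix.pl):
perl fix.pl data.txt > fixed.txt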
Problem #3 - Match the Account #'s and $ amounts and put them into a hash -
#!/usr/bin/perl
while (<>) {
    #Remember the phone number (used as the account #) in $1
    #and the full dollar amount, decimals included, in $2
    if (/(\d\d\d-\d\d\d-\d\d\d\d)(\d+\.\d\d)/) {
        $HashAct{$1} = $2;
    }
}
while (($account, $sales) = each(%HashAct)) {
    print "The dollar amount for $account is $sales\n";
}
For more information about Regular Expressions, check out http://www.oreilly.com and pick up a perl book - or read the man pages. This code was tested using ActiveState Perl for Win32 Systems. No warranty of any kind is expressed or implied, including a warranty of suitability or fitness for a particular purpose. If the code above does not work on your build of Perl 3.0 for the Commodore Vic 20 Operating System, I apologize, but I'm not going to re-write it for you.