Practical Perl 6 Regexes


Brian Duggan

bduggan@matatu.org


DC Baltimore Perl Workshop, April 6, 2019

Why?

Perl 5 set the standard for regexes BUT

Perl 6 regexes

Outline
  • Characters
  • Groups
  • Quantifiers
  • Capturing
  • Composing
Characters

Question:

Which of these print True? (in Perl 6)

say so 'abc' =~ /b/        
===SORRY!=== Error while compiling example.p6
Unsupported use of =~ to do pattern matching; in Perl 6 please use ~~
at example.p6:1
------> say so 'abc' =~<HERE> /b/        
say so 'abc' ~~ /b/        
True        
say so 'abc' ~~ / 'b' /        
True        
say so 'abc' ~~ regex { b }        
True        
my regex letter-b {
   b
}
say so 'abc' ~~ / <letter-b> /        
True        

Use / or regex to make a regex.

Characters

Literals

How about these?

say so 'good' ~~ / good /        
True        
say so 'not-good' ~~ / not-good /        
===SORRY!===
Unrecognized regex metacharacter - (must be quoted to match literally)
at example.p6:1
------> say so 'not-good' ~~ / not<HERE>-good /
Unable to parse regex; couldn't find final '/'
at example.p6:1
------> say so 'not-good' ~~ / not-<HERE>good /        
say so 'not-good' ~~ / 'not-good' /        
True        
say so 'schőn' ~~ / schőn /        
True        

Use quotes inside a regex. Everything except alphanumeric characters and underscores must be quoted.

Characters

Spaces

say so 'abc' ~~ / abc /        
True        
say so 'abc' ~~ / a b c /        
Potential difficulties:
    Space is not significant here; please use quotes or :s (:sigspace) modifier (or, to suppress this warning, omit the space, or otherwise change the spacing)
    at example.p6:1
    ------> say so 'abc' ~~ / a<HERE> b c /
    Space is not significant here; please use quotes or :s (:sigspace) modifier (or, to suppress this warning, omit the space, or otherwise change the spacing)
    at example.p6:1
    ------> say so 'abc' ~~ / a b<HERE> c /
True        
say so 'a b c' ~~ / a b c /        
Potential difficulties:
    Space is not significant here; please use quotes or :s (:sigspace) modifier (or, to suppress this warning, omit the space, or otherwise change the spacing)
    at example.p6:1
    ------> say so 'a b c' ~~ / a<HERE> b c /
    Space is not significant here; please use quotes or :s (:sigspace) modifier (or, to suppress this warning, omit the space, or otherwise change the spacing)
    at example.p6:1
    ------> say so 'a b c' ~~ / a b<HERE> c /
False        
say so 'a b c' ~~ / 'a b c' /        
True        
say so 'a b c' ~~ / a ' ' b ' ' c /        
True        
say so 'a b c' ~~ / a \s+ b \s+ c /        
True        
say so 'a b c' ~~ / a 
    # hey, this is a comment
    \s+ b \s+ c /        
True        

Spaces are not significant. Neither are comments.

Characters

Adverbs

say so 'a b c' ~~ /a \s* b \s* c/;
say so 'a b c' ~~ /a <ws> b <ws> c/;
say so 'a b c' ~~ /:s a b c/;
say so 'a b c' ~~ /:sigspace a b c/;        
True
True
True
True        
say so 'ABC' ~~ /:i b/;
say so 'ABC' ~~ /:ignorecase b/;        
True
True        
say so 'abc' ~~ /:r b/;
say so 'abc' ~~ /:ratchet b/;        
True
True        

Adverbs start with :.

Ratcheting makes matching much faster -- no backtracking.

Sigspace improves readability.
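
What "no backtracking" means in practice can be sketched roughly as follows. The input 'aaa' and the variable names are illustrative assumptions, not part of the talk.

```perl6
# Without :r, a+ first takes all of 'aaa', then gives one 'a' back
# so that the trailing a can match.
my $backtracking = so 'aaa' ~~ / a+ a /;

# With :r, a+ keeps whatever it matched, so the trailing a never matches.
my $ratcheting = so 'aaa' ~~ / :r a+ a /;

say $backtracking;   # True
say $ratcheting;     # False
```

So ratcheting is not only faster, it can change what matches.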

Characters

Tokens and rules

say so 'abc' ~~ regex { :r abc }
say so 'abc' ~~ token { abc }        
True
True        
say so 'a b c' ~~ token { :s a b c }
say so 'a b c' ~~ rule { a b c }        
True
True        

A token is a regex with ratcheting.

A rule is a token with sigspace.

These are deep concepts! Tokens and rules are building blocks for grammars.
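
A small sketch of what sigspace does inside a rule, assuming the usual behavior of the built-in ws token: whitespace is required between two word characters and optional elsewhere. The rule names two-words and comma-pair are made up for this example.

```perl6
my rule two-words  { a b }
my rule comma-pair { a ',' b }

my $spaced   = so 'a b'   ~~ / ^ <two-words> $ /;
my $unspaced = so 'ab'    ~~ / ^ <two-words> $ /;
my $loose    = so 'a , b' ~~ / ^ <comma-pair> $ /;
my $tight    = so 'a,b'   ~~ / ^ <comma-pair> $ /;

say $spaced;     # True
say $unspaced;   # False: whitespace is needed between two word characters
say $loose;      # True
say $tight;      # True: whitespace around punctuation is optional
```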

Characters

Back to basics

Vehicle Identification Numbers

my $vin = '1FAHP3GNXBW107581';
if $vin ~~ / I | O | Q / {
  say "Invalid VIN"
} else {
  say "Maybe it's okay";
}        
Maybe it's okay        

For alternation, use |.

Character classes

TMTOWTDI

Alternation

say so 'QUIT' ~~ / I | O | Q /        
True        
say so 'QUIT' ~~ / | I | O | Q /        
True        
say so 'QUIT' ~~ / | I
                   | O
                   | Q /        
True        
say so 'QUIT' ~~ / <[IOQ]> /        
True        

You can put an extra | at the beginning.

Construct character classes using <[ and ]>.

Character classes

Character classes

say so 'e' ~~ / <[a e i o u]> /        
True        
say so 'b' ~~ / <[a..e]> /        
True        
my regex vowels { <[a e i o u]> }
say so 'e' ~~  / <vowels> /;        
True        

Put lists of characters or ranges in character classes.

Spaces can be in character classes.
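
One consequence, shown as a quick sketch: because spaces inside a character class are only layout, they do not make the class match a space. Adding \s is one way to include whitespace; the variable names here are illustrative.

```perl6
my $vowel      = so 'e' ~~ / <[ a e i o u ]> /;
my $space      = so ' ' ~~ / <[ a e i o u ]> /;
my $with-space = so ' ' ~~ / <[ a e i o u \s ]> /;

say $vowel;        # True
say $space;        # False: the spaces in the class are ignored
say $with-space;   # True
```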

Character classes

Negate Character classes

my regex not-vowels {
  <-[aeiou]>
}
say so 'x' ~~ / <not-vowels> /;
say so '!' ~~ / <not-vowels> /;        
True
True        
my regex consonants {
 <[a..z] - [aeiou]>
}
say so '!' ~~ / <consonants> /;
say so 'x' ~~ / <consonants> /;        
False
True        

Take the complement of a character class with <-[ ... ]>.

Or use - to take the set difference.

Groups

Grouping

Brackets make a non-capturing group.

say so 'sat, apr 6' ~~ /
  [ sat | sun ] ', '
  [ mar | apr | may ]
  <[0..9]> 
 /        
False        

Like (?:...) from Perl 5.
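
To see the "non-capturing" part, compare brackets with parentheses (which do capture; captures are covered later in the talk). This is a minimal sketch and the variable names are assumptions.

```perl6
'sat' ~~ / [ sat | sun ] /;
my $bracket-captures = $/.list.elems;

'sat' ~~ / ( sat | sun ) /;
my $paren-captures = $/.list.elems;

say $bracket-captures;   # 0
say $paren-captures;     # 1
```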

Groups

Grouping

Digression -- why did that not match?
say so 'sat, apr 6' ~~ / 'sat, apr 6' /        
True        
say so 'sat, apr 6' ~~ / 'sat, ' 'apr ' '6' /        
True        
say so 'sat, apr 6' ~~ / sat ', ' apr ' ' 6 /        
True        
say so 'sat, apr 6' ~~ /
   [ sat | sun ] ', '
   [ mar | apr | may ]
   ' '
   <[0..9]> /        
True        
Groups

Grouping

Spot the difference
say so 'sat, apr 6' ~~ /
  [ sat | sun ] ', '
  [ mar | apr | may ]
  <[0..9]> 
 /        
False        
say so 'sat, apr 6' ~~ /
   [ sat | sun ] ', '
   [ mar | apr | may ]
   ' '
   <[0..9]>
 /        
True        
Groups

Anyway, back to groups

say so 'sat, apr 6' ~~ /
   < sat sun> ', '
   < mar apr may>
   ' '
   <[0..9]>
 /        
True        

Start < > with a space to make a word list.

Groups

As usual, TMTOWTDI

my @days = <sat sun>;
my @months = <mar apr may>;
say so 'sat, apr 6' ~~ /
   @days ', ' @months ' '
   <[0..9]>
 /        
True        

Or use an array. Scalars are interpolated too, by the way.
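
A short sketch of scalar interpolation. The point to notice is that the contents are matched literally, so a '.' in the variable is not a wildcard. The variable $sep is made up for this example.

```perl6
my $sep = '.';

my $literal  = so 'a.b' ~~ / a $sep b /;
my $wildcard = so 'axb' ~~ / a $sep b /;

say $literal;    # True
say $wildcard;   # False: $sep matches a literal dot only
```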

How about two digit days?

Quantifiers

Quantifiers

say so 'a' ~~ / a? /; # 0 or 1        
True        
say so 'a' ~~ / a* /; # 0 or more        
True        
say so 'a' ~~ / a+ /; # 1 or more        
True        
say so 'a' ~~ / a**2 /; # exactly 2        
False        
say so 'a' ~~ / a**1..5 /; # 1 to 5        
True        

Use ?, *, and + as usual.

Use ** (the exponentiation operator) for an exact count or a range.

Quantifiers

Quantifiers

my @days = <sat sun>;
my @months = <mar apr may>;
say so 'sat, apr 6' ~~ /
   @days ', ' @months ' '
   <[0..9]> ** 1..2
 /        
True        
Quantifiers

Modified Quantifiers

my regex part { <-[/]>+ }
my regex path { '/' [ <part> '/' ]* <part> }
say so '/home/brian/talk.txt' ~~ / <path> /        
True        
my regex part { <-[/]>+ }
my regex path { '/' <part>* % '/' }
say so '/home/brian/talk.txt' ~~ / <path> /;        
True

"Separated by": A* % B is shorthand for [ A [ B A ]* ]?.

Works for other quantifiers too (+, **).

Useful with ,.

See also %%.
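
A rough sketch of the difference between % and %% on a comma-separated list: % allows separators only between items, while %% also accepts one trailing separator. The names $strict and $loose are illustrative.

```perl6
my $strict = / ^ [ \d+ ]+ %  ',' $ /;
my $loose  = / ^ [ \d+ ]+ %% ',' $ /;

my $plain           = so '1,2,3'  ~~ $strict;
my $trailing-strict = so '1,2,3,' ~~ $strict;
my $trailing-loose  = so '1,2,3,' ~~ $loose;

say $plain;             # True
say $trailing-strict;   # False: % does not allow a trailing comma
say $trailing-loose;    # True:  %% does
```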

Capturing

Capturing

say 'abc' ~~ / abc /;        
「abc」        
my $match = 'abc' ~~ / abc /;
say $match.WHAT;        
(Match)        

A match returns a match object.

'abc' ~~ / abc/;
say $/.WHAT;
say $/;        
(Match)
「abc」        

The most recent match is stored in $/.

Use say to print $/.gist, which provides the match tree.

Capturing

Captures

'hello, world' ~~ /^ [ <-[,]>+ ] ', ' (.*) $/;
say $/;        
「hello, world」
 0 => 「world」        

Parentheses will capture.

'hello, world' ~~ /^ [ <-[,]>+ ] ', ' (.*) $/;
say $/[0];
say ~$/[0];        
「world」
world        

You can get positional captures by treating $/ like an array.

Stringify with ~.

Capturing

Named Captures

my regex word { <-[,]>+ }
'hello, world' ~~
  /^ <word> ', ' (.*) $/;
say $/;        
「hello, world」
 word => 「hello」
 0 => 「world」        

Named captures use the names of embedded regexes.

The match tree can help.

Capturing

Named Captures

my regex word { <-[,]>+ }
'hello, world' ~~
  /^ <word> ', ' (.*) $/;
say $/{'word'};
say $/<word>;
say $<word>;   # all the same        
「hello」
「hello」
「hello」        

When accessing named captures in $/, you can omit the /.

Capturing

Named Captures

my regex word { <-[,]>+ }
'hello, world' ~~
  /^ <word> ', ' <word> $/;
say $<word>;
say $<word>[0];        
[「hello」 「world」]
「hello」        

It's matches all the way down.
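
Since every element is itself a Match, the usual Match methods work at each level. A minimal sketch using the same regex as above; $second is an illustrative name.

```perl6
my regex word { <-[,]>+ }
'hello, world' ~~ /^ <word> ', ' <word> $/;

my $second = $<word>[1];
say $second.WHAT;   # (Match)
say $second.Str;    # world
say $second.from;   # 7
say $second.to;     # 12
```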

Capturing

Named Captures

my regex word { <-[,]>+ }
say 'new york, new york' ~~
  /^ <word> ', ' $<word> $/;        
「new york, new york」
 word => 「new york」        

You can interpolate the match variable in the regex to be clever.

Capturing

Named Captures

my regex word { <-[,]>+ }
say 'oh, ho' ~~
/^ <word> ', ' <{ $<word>.flip }> $/;        
「oh, ho」
 word => 「oh」        

You can even put code in the regex if you want to be very clever.

Capturing

Restricted Captures

my regex char {
 <-["]> | '\"' 
}
my regex quoted {
   '"' <char>* '"'
}
'a "good" program' ~~ / <quoted> /;
say ~$<quoted>;        
"good"        
Capturing

Restricted Captures

my regex char {
 <-["]> | '\"' 
}
my regex quoted {
   '"' <( <char>* )> '"'
}
'a "good" program' ~~ / <quoted> /;
say ~$<quoted>;        
good        

Pro tip: use <( and )> to restrict the entire match.

Composing

Grammars

grammar G {
  regex TOP { 'a ' <quoted> ' program' }
  regex letters { <[a..z]>+ }
  regex quoted {
     '"' <( <letters> )> '"'
  }
}
say G.parse('a "good" program');        
「a "good" program」
 quoted => 「good」
  letters => 「good」        

Put regexes together into grammars.

Composing

Grammars

grammar G {
  rule TOP { a <quoted> program }
  token letters { <[a..z]>+ }
  token quoted {
     '"' <( <letters> )> '"'
  }
}
say G.parse('a "good" program');        
「a "good" program」
 quoted => 「good」
  letters => 「good」        

Reminders --

Use token for regexes that don't need backtracking.

Use rule for tokens with sigspace.
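
A short sketch of working with the result of parse, using the same grammar as above. The second input (with its closing quote missing) and the variable names are assumptions added for illustration.

```perl6
grammar G {
  rule TOP { a <quoted> program }
  token letters { <[a..z]>+ }
  token quoted {
     '"' <( <letters> )> '"'
  }
}

# A successful parse returns a match tree you can index by name.
my $ok = G.parse('a "good" program');
say ~$ok<quoted><letters>;   # good

# A failed parse returns an undefined value.
my $bad = G.parse('a "good program');
say $bad.defined;            # False
```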

Composing

Grammars

Examples on modules.perl6.org and docs.perl6.org.

Also JSON-Tiny

Or Protobuf (EBNF).

Have fun!

The End