Practical Perl 6 Regexes

Brian Duggan
bduggan@matatu.org

DC Baltimore Perl Workshop, April 6, 2019

Why?

Perl 5 set the standard for regexes BUT

terse
too many special cases
"write only"
not composable

Perl 6 regexes

not as terse
consistent
more readable
first class objects. Building blocks for grammars.

Outline

Characters
Groups
Quantifiers
Capturing
Composing

Characters

Question:

Which of these print True (in Perl 6)?

say so 'abc' =~ /b/

===SORRY!=== Error while compiling example.p6
Unsupported use of =~ to do pattern matching; in Perl 6 please use ~~
at example.p6:1
------> say so 'abc' =~<HERE> /b/

say so 'abc' ~~ /b/

True

say so 'abc' ~~ / 'b' /

True

say so 'abc' ~~ regex { b }

True

my regex letter-b {
   b
}
say so 'abc' ~~ / <letter-b> /

True

Use / ... / or regex { ... } to make a regex.

Characters

Literals

How about these?

say so 'good' ~~ / good /

True

say so 'not-good' ~~ / not-good /

===SORRY!===
Unrecognized regex metacharacter - (must be quoted to match literally)
at example.p6:1
------> say so 'not-good' ~~ / not<HERE>-good /
Unable to parse regex; couldn't find final '/'
at example.p6:1
------> say so 'not-good' ~~ / not-<HERE>good /

say so 'not-good' ~~ / 'not-good' /

True

say so 'schőn' ~~ / schőn /

True

Use quotes inside a regex. Everything except alphanumeric characters and underscores must be quoted.
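A backslash escape is an alternative to quotes for a single non-alphanumeric character. A small sketch (the sample strings are made up for illustration):

```raku
# Quoting and backslash-escaping both match punctuation literally.
say so 'not-good' ~~ / 'not-good' /;   # True
say so 'not-good' ~~ / not \- good /;  # True
say so 'a.b'      ~~ / a '.' b /;      # True
say so 'axb'      ~~ / a '.' b /;      # False: a quoted dot is not a wildcard
```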

Characters

Spaces

say so 'abc' ~~ / abc /

True

say so 'abc' ~~ / a b c /

Potential difficulties:
    Space is not significant here; please use quotes or :s (:sigspace) modifier (or, to suppress this warning, omit the space, or otherwise change the spacing)
    at example.p6:1
    ------> say so 'abc' ~~ / a<HERE> b c /
    Space is not significant here; please use quotes or :s (:sigspace) modifier (or, to suppress this warning, omit the space, or otherwise change the spacing)
    at example.p6:1
    ------> say so 'abc' ~~ / a b<HERE> c /
True

say so 'a b c' ~~ / a b c /

Potential difficulties:
    Space is not significant here; please use quotes or :s (:sigspace) modifier (or, to suppress this warning, omit the space, or otherwise change the spacing)
    at example.p6:1
    ------> say so 'a b c' ~~ / a<HERE> b c /
    Space is not significant here; please use quotes or :s (:sigspace) modifier (or, to suppress this warning, omit the space, or otherwise change the spacing)
    at example.p6:1
    ------> say so 'a b c' ~~ / a b<HERE> c /
False

say so 'a b c' ~~ / 'a b c' /

True

say so 'a b c' ~~ / a ' ' b ' ' c /

True

say so 'a b c' ~~ / a \s+ b \s+ c /

True

say so 'a b c' ~~ / a 
    # hey, this is a comment
    \s+ b \s+ c /

True

Spaces are not significant. Neither are comments.

Characters

Adverbs

say so 'a b c' ~~ /a \s* b \s* c/;
say so 'a b c' ~~ /a <ws> b <ws> c/;
say so 'a b c' ~~ /:s a b c/;
say so 'a b c' ~~ /:sigspace a b c/;

True
True
True
True

say so 'ABC' ~~ /:i b/;
say so 'ABC' ~~ /:ignorecase b/;

True
True

say so 'abc' ~~ /:r b/;
say so 'abc' ~~ /:ratchet b/;

True
True

Adverbs start with :.

Ratcheting makes matching much faster -- no backtracking.

Sigspace improves readability.
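A short sketch of what :s changes, with the adverb placed in front of the regex using the m form (the sample strings are illustrative assumptions):

```raku
# With :s, whitespace in the pattern matches whitespace in the string.
say so 'a b c'   ~~ m:s/ a b c /;   # True
say so 'a   b c' ~~ m:s/ a b c /;   # True: any amount of whitespace
say so 'ab c'    ~~ m:s/ a b c /;   # False: whitespace is required between word characters

# :i can go in front of the regex too.
say so 'ABC' ~~ m:i/ b /;           # True
```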

Characters

Tokens and rules

say so 'abc' ~~ regex { :r abc }
say so 'abc' ~~ token { abc }

True
True

say so 'a b c' ~~ token { :s a b c }
say so 'a b c' ~~ rule { a b c }

True
True

A token is a regex with ratcheting.

A rule is a token with sigspace.

These are deep concepts! Tokens and rules are building blocks for grammars.
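To see where ratcheting changes the result, here is a minimal sketch using a pattern that only matches if the engine gives back a character it has already consumed:

```raku
# \w+ first consumes all of 'abc'; the trailing c then has nothing left to match.
say so 'abc' ~~ regex { \w+ c };   # True:  regex backtracks, \w+ gives back the 'c'
say so 'abc' ~~ token { \w+ c };   # False: token never gives anything back
```

Patterns written as tokens have to be phrased so that each piece consumes only what belongs to it.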

Characters

Back to basics

Vehicle Identification Numbers

my $vin = '1FAHP3GNXBW107581';
if $vin ~~ / I | O | Q / {
  say "Invalid VIN"
} else {
  say "Maybe it's okay";
}

Maybe it's okay

For alternation, use |.
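One difference from Perl 5 worth a quick sketch: | picks the longest matching alternative, while || tries the alternatives from left to right and takes the first that matches. The sample string is an assumption for illustration.

```raku
say ~('foobar' ~~ / foo | foobar /);    # foobar  (longest alternative wins)
say ~('foobar' ~~ / foo || foobar /);   # foo     (first alternative wins)
```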

Character classes

TMTOWTDI

Alternation

say so 'QUIT' ~~ / I | O | Q /

True

say so 'QUIT' ~~ / | I | O | Q /

True

say so 'QUIT' ~~ / | I
                   | O
                   | Q /

True

say so 'QUIT' ~~ / <[IOQ]> /

True

You can put an extra | at the beginning.

Construct character classes using <[ and ]>.

Character classes

say so 'e' ~~ / <[a e i o u]> /

True

say so 'b' ~~ / <[a..e]> /

True

my regex vowels { <[a e i o u]> }
say so 'e' ~~  / <vowels> /;

True

Put lists of characters or ranges in character classes.

Spaces inside a character class are ignored.

Character classes

Negate Character classes

my regex not-vowels {
  <-[aeiou]>
}
say so 'x' ~~ / <not-vowels> /;
say so '!' ~~ / <not-vowels> /;

True
True

my regex consonants {
 <[a..z] - [aeiou]>
}
say so '!' ~~ / <consonants> /;
say so 'x' ~~ / <consonants> /;

False
True

Take the complement of a character class with <-[ ... ]>.

Or use - to take the set difference.

Outline

Characters
Groups
Quantifiers
Capturing
Composing

Groups

Grouping

Brackets make a non-capturing group.

say so 'sat, apr 6' ~~ /
  [ sat | sun ] ', '
  [ mar | apr | may ]
  <[0..9]> 
 /

False

Like (?:...) from Perl 5.

Groups

Grouping

Digression -- why did that not match?

say so 'sat, apr 6' ~~ / 'sat, apr 6' /

True

say so 'sat, apr 6' ~~ / 'sat, ' 'apr ' '6' /

True

say so 'sat, apr 6' ~~ / sat ', ' apr ' ' 6 /

True

say so 'sat, apr 6' ~~ /
   [ sat | sun ] ', '
   [ mar | apr | may ]
   ' '
   <[0..9]> /

True

Groups

Grouping

Spot the difference

say so 'sat, apr 6' ~~ /
  [ sat | sun ] ', '
  [ mar | apr | may ]
  <[0..9]> 
 /

False

say so 'sat, apr 6' ~~ /
   [ sat | sun ] ', '
   [ mar | apr | may ]
   ' '
   <[0..9]>
 /

True

The first regex has no ' ' between the month and the day. Whitespace in the pattern is not significant, so nothing matches the space before the 6.

Groups

Anyway, back to groups

say so 'sat, apr 6' ~~ /
   < sat sun> ', '
   < mar apr may>
   ' '
   <[0..9]>
 /

True

Start < > with a space to make a word list.

Groups

As usual, tmtowtdi

my @days = <sat sun>;
my @months = <mar apr may>;
say so 'sat, apr 6' ~~ /
   @days ', ' @months ' '
   <[0..9]>
 /

True

Or use an array. Scalars are interpolated too, btw.
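A minimal sketch of scalar interpolation. The variable name $sep is made up for this example; the point is that the contents of an interpolated scalar are matched literally, not as regex syntax.

```raku
my $sep = '.';
say so 'a.b' ~~ / a $sep b /;   # True
say so 'axb' ~~ / a $sep b /;   # False: the '.' in $sep is a literal dot, not a wildcard
```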

How about two digit days?

Outline

Characters
Groups
Quantifiers
Capturing
Composing

Quantifiers

say so 'a' ~~ / a? /; # 0 or 1

True

say so 'a' ~~ / a* /; # 0 or more

True

say so 'a' ~~ / a+ /; # 1 or more

True

say so 'a' ~~ / a**2 /; # exactly 2

False

say so 'a' ~~ / a**1..5 /; # 1 to 5

True

Use ?, *, and + as usual.

Use ** (it looks like exponentiation) for exact counts or ranges.

Quantifiers

my @days = <sat sun>;
my @months = <mar apr may>;
say so 'sat, apr 6' ~~ /
   @days ', ' @months ' '
   <[0..9]> ** 1..2
 /

True
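
Quantifiers and character classes combine nicely. As a sketch, the VIN check from earlier can be tightened, assuming a VIN is exactly 17 characters drawn from A-Z and 0-9 without I, O, or Q. The token names vin-char and vin are made up for this example, and no check digit is verified.

```raku
my token vin-char { <[A..Z 0..9] - [IOQ]> }
my token vin      { ^ <vin-char> ** 17 $ }

say so '1FAHP3GNXBW107581' ~~ / <vin> /;   # True
say so '1FAHP3GNXBW10758O' ~~ / <vin> /;   # False: contains the letter O
say so '1FAHP3GNXBW10758'  ~~ / <vin> /;   # False: only 16 characters
```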

Quantifiers

Modified Quantifiers

my regex part { <-[/]>+ }
my regex path { '/' [ <part> '/' ]* <part> }
say so '/home/brian/talk.txt' ~~ / <path> /

True

my regex part { <-[/]>+ }
my regex path { '/' <part>* % '/' }
say so '/home/brian/talk.txt' ~~ / <path> /;

True

"Separated by": A* % B is shorthand for [ A [ B A ]* ]?.

Works for other quantifiers too (+, **).

Useful with ',' as the separator.
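
A small sketch of a comma-separated list. The token name number is made up for this example. The related %% form also accepts one trailing separator.

```raku
my token number { \d+ }

say so '1,22,333' ~~ / ^ <number>+ % ',' $ /;    # True
say so '1,22,'    ~~ / ^ <number>+ % ',' $ /;    # False: trailing comma
say so '1,22,'    ~~ / ^ <number>+ %% ',' $ /;   # True: %% allows a trailing separator
```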

Capturing

say 'abc' ~~ / abc /;

｢abc｣

my $match = 'abc' ~~ / abc /;
say $match.WHAT;

(Match)

A match returns a match object.

'abc' ~~ / abc/;
say $/.WHAT;
say $/;

(Match)
｢abc｣

The most recent match is stored in $/.

say prints $/.gist, which shows the match tree.

Capturing

Captures

'hello, world' ~~ /^ [ <-[,]>+ ] ', ' (.*) $/;
say $/;

｢hello, world｣
 0 => ｢world｣

Parentheses will capture.

'hello, world' ~~ /^ [ <-[,]>+ ] ', ' (.*) $/;
say $/[0];
say ~$/[0];

｢world｣
world

You can get positional captures by treating $/ like an array.

Stringify with ~.

Capturing

Named Captures

my regex word { <-[,]>+ }
'hello, world' ~~
  /^ <word> ', ' (.*) $/;
say $/;

｢hello, world｣
 word => ｢hello｣
 0 => ｢world｣

Named captures use the names of embedded regexes.

The match tree can help.

Capturing

Named Captures

my regex word { <-[,]>+ }
'hello, world' ~~
  /^ <word> ', ' (.*) $/;
say $/{'word'};
say $/<word>;
say $<word>;   # all the same

｢hello｣
｢hello｣
｢hello｣

When accessing named captures in $/, you can omit the /.
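
A named capture does not need a separate named regex. As a sketch, $<name>= in front of a group names it inline. The names greeting and target are made up for this example.

```raku
'hello, world' ~~ / ^ $<greeting>=[ <-[,]>+ ] ', ' $<target>=[ .* ] $ /;
say ~$<greeting>;   # hello
say ~$<target>;     # world
```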

Capturing

Named Captures

my regex word { <-[,]>+ }
'hello, world' ~~
  /^ <word> ', ' <word> $/;
say $<word>;
say $<word>[0];

[｢hello｣ ｢world｣]
｢hello｣

It's matches all the way down.

Capturing

Named Captures

my regex word { <-[,]>+ }
say 'new york, new york' ~~
  /^ <word> ', ' $<word> $/;

｢new york, new york｣
 word => ｢new york｣

You can interpolate the match variable in the regex to be clever.

Capturing

Named Captures

my regex word { <-[,]>+ }
say 'oh, ho' ~~
/^ <word> ', ' <{ $<word>.flip }> $/;

｢oh, ho｣
 word => ｢oh｣

You can even put code in the regex if you want to be very clever.

Capturing

Restricted Captures

my regex char {
 <-["]> | '\"' 
}
my regex quoted {
   '"' <char>* '"'
}
'a "good" program' ~~ / <quoted> /;
say ~$<quoted>;

"good"

Capturing

Restricted Captures

my regex char {
 <-["]> | '\"' 
}
my regex quoted {
   '"' <( <char>* )> '"'
}
'a "good" program' ~~ / <quoted> /;
say ~$<quoted>;

good

Pro tip: use <( and )> to restrict the entire match.
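
The same markers work directly in a regex literal. A sketch showing that the surrounding quotes must still match but are left out of the result, which is visible in the match positions:

```raku
'a "good" program' ~~ / '"' <( <-["]>* )> '"' /;
say ~$/;       # good
say $/.from;   # 3
say $/.to;     # 7
```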

Outline

Characters
Groups
Quantifiers
Capturing
Composing

Composing

Grammars

grammar G {
  regex TOP { 'a ' <quoted> ' program' }
  regex letters { <[a..z]>+ }
  regex quoted {
     '"' <( <letters> )> '"'
  }
}
say G.parse('a "good" program');

｢a "good" program｣
 quoted => ｢good｣
  letters => ｢good｣

Put regexes together into grammars.

Composing

Grammars

grammar G {
  rule TOP { a <quoted> program }
  token letters { <[a..z]>+ }
  token quoted {
     '"' <( <letters> )> '"'
  }
}
say G.parse('a "good" program');

｢a "good" program｣
 quoted => ｢good｣
  letters => ｢good｣

Reminders --

Use token for regexes that don't need backtracking.

Use rule for tokens with sigspace.
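
One more sketch that puts the earlier pieces together: a grammar for key=value pairs separated by commas, followed by a walk over the match tree. The grammar name KeyValue and its token names are made up for this example.

```raku
grammar KeyValue {
    token TOP   { <pair>+ % [ ',' \s* ] }   # pairs separated by a comma and optional space
    token pair  { <key> '=' <value> }
    token key   { <[a..z]>+ }
    token value { \d+ }
}

my $match = KeyValue.parse('a=1, b=22');
for $match<pair>.list -> $pair {
    say ~$pair<key>, ' => ', ~$pair<value>;
}
# a => 1
# b => 22

# parse must match the whole string; otherwise there is no match object.
say KeyValue.parse('a=') // 'no parse';    # no parse
```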

Composing

Grammars

Examples on modules.perl6.org and docs.perl6.org.

Also JSON-Tiny

Or Protobuf (EBNF).

Have fun!
