Brian Duggan
bduggan@matatu.org
Perl 5 set the standard for regexes BUT
Perl 6 regexes
Which of these print True? (in Perl 6)
say so 'abc' =~ /b/
===SORRY!=== Error while compiling example.p6
Unsupported use of =~ to do pattern matching; in Perl 6 please use ~~
at example.p6:1
------> say so 'abc' =~<HERE> /b/
say so 'abc' ~~ /b/
True
say so 'abc' ~~ / 'b' /
True
say so 'abc' ~~ regex { b }
True
my regex letter-b { b }
say so 'abc' ~~ / <letter-b> /
True
Use / or regex to make a regex.
How about these?
say so 'good' ~~ / good /
True
say so 'not-good' ~~ / not-good /
===SORRY!===
Unrecognized regex metacharacter - (must be quoted to match literally)
at example.p6:1
------> say so 'not-good' ~~ / not<HERE>-good /
Unable to parse regex; couldn't find final '/'
at example.p6:1
------> say so 'not-good' ~~ / not-<HERE>good /
say so 'not-good' ~~ / 'not-good' /
True
say so 'schőn' ~~ / schőn /
True
Use quotes inside a regex. Everything except alphanumeric characters and underscores must be quoted.
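A small sketch of the quoting rule. It assumes a backslash escape is an acceptable alternative to quotes for a single non-alphanumeric character; the strings are illustrative only.

```raku
# Two ways to match a literal hyphen: quote it, or escape it with a backslash.
say so 'not-good' ~~ / 'not-good' /;   # quoted
say so 'not-good' ~~ / not\-good /;    # escaped
# Alphanumerics and underscores need no quoting.
say so 'a_1' ~~ / a_1 /;
```

All three lines should print True.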
say so 'abc' ~~ / abc /
True
say so 'abc' ~~ / a b c /
Potential difficulties:
Space is not significant here; please use quotes or :s (:sigspace) modifier (or, to suppress this warning, omit the space, or otherwise change the spacing)
at example.p6:1
------> say so 'abc' ~~ / a<HERE> b c /
Space is not significant here; please use quotes or :s (:sigspace) modifier (or, to suppress this warning, omit the space, or otherwise change the spacing)
at example.p6:1
------> say so 'abc' ~~ / a b<HERE> c /
True
say so 'a b c' ~~ / a b c /
Potential difficulties:
Space is not significant here; please use quotes or :s (:sigspace) modifier (or, to suppress this warning, omit the space, or otherwise change the spacing)
at example.p6:1
------> say so 'a b c' ~~ / a<HERE> b c /
Space is not significant here; please use quotes or :s (:sigspace) modifier (or, to suppress this warning, omit the space, or otherwise change the spacing)
at example.p6:1
------> say so 'a b c' ~~ / a b<HERE> c /
False
say so 'a b c' ~~ / 'a b c' /
True
say so 'a b c' ~~ / a ' ' b ' ' c /
True
say so 'a b c' ~~ / a \s+ b \s+ c /
True
say so 'a b c' ~~ / a # hey, this is a comment
                    \s+ b \s+ c /
True
Spaces are not significant. Neither are comments.
say so 'a b c' ~~ /a \s* b \s* c/;
say so 'a b c' ~~ /a <ws> b <ws> c/;
say so 'a b c' ~~ /:s a b c/;
say so 'a b c' ~~ /:sigspace a b c/;
True
True
True
True
say so 'ABC' ~~ /:i b/;
say so 'ABC' ~~ /:ignorecase b/;
True
True
say so 'abc' ~~ /:r b/;
say so 'abc' ~~ /:ratchet b/;
True
True
Adverbs start with a colon (:).
Ratcheting makes matching much faster -- no backtracking.
Sigspace improves readability.
say so 'abc' ~~ regex { :r abc }
say so 'abc' ~~ token { abc }
True
True
say so 'a b c' ~~ token { :s a b c }
say so 'a b c' ~~ rule { a b c }
True
True
A token is a regex with ratcheting.
A rule is a token with sigspace.
These are deep concepts! Tokens and rules are building blocks for grammars.
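To make the difference concrete, here is a minimal sketch. The names greedy-r, greedy-t and spaced are made up for illustration.

```raku
my regex greedy-r { a* a }   # a regex may backtrack: a* can give back one 'a'
my token greedy-t { a* a }   # a token ratchets: a* keeps everything it matched
my rule  spaced   { a b c }  # a rule adds sigspace: spaces in the pattern become <.ws>

say so 'aaa'   ~~ /^ <greedy-r> $/;  # True
say so 'aaa'   ~~ /^ <greedy-t> $/;  # False
say so 'a b c' ~~ /^ <spaced> $/;    # True
say so 'abc'   ~~ /^ <spaced> $/;    # False
```

The token fails because a* swallows all three letters and never gives one back. The rule fails on 'abc' because <.ws> does not match between two word characters.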
Vehicle Identification Numbers
my $vin = '1FAHP3GNXBW107581';
if $vin ~~ / I | O | Q / {
    say "Invalid VIN"
} else {
    say "Maybe it's okay";
}
Maybe it's okay
For alternation, use |.
Alternation
say so 'QUIT' ~~ / I | O | Q /
True
say so 'QUIT' ~~ / | I | O | Q /
True
say so 'QUIT' ~~ /
    | I
    | O
    | Q
/
True
say so 'QUIT' ~~ / <[IOQ]> /
True
You can put an extra | at the beginning.
Construct character classes using <[ and ]>.
say so 'e' ~~ / <[a e i o u]> /
True
say so 'b' ~~ / <[a..e]> /
True
my regex vowels { <[a e i o u]> }
say so 'e' ~~ / <vowels> /;
True
Put lists of characters or ranges in character classes.
Spaces can be in character classes.
my regex not-vowels { <-[aeiou]> }
say so 'x' ~~ / <not-vowels> /;
say so '!' ~~ / <not-vowels> /;
True
True
my regex consonants { <[a..z] - [aeiou]> }
say so '!' ~~ / <consonants> /;
say so 'x' ~~ / <consonants> /;
False
True
Take the complement of a character class with <-[ ... ]>.
Or use - to take the set difference.
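One more sketch that combines several ranges in a single class. The regex name hex-digit and the test strings are invented for this example.

```raku
# Several ranges in one class; the spaces inside <[ ]> are ignored.
my regex hex-digit { <[0..9 a..f A..F]> }

say so 'DEADbeef' ~~ /^ <hex-digit>+ $/;  # True
say so 'xyz'      ~~ /^ <hex-digit>+ $/;  # False
```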
Brackets make a non-capturing group.
say so 'sat, apr 6' ~~ / [ sat | sun ] ', ' [ mar | apr | may ] <[0..9]> /
False
Like (?:...) from Perl 5.
say so 'sat, apr 6' ~~ / 'sat, apr 6' /
True
say so 'sat, apr 6' ~~ / 'sat, ' 'apr ' '6' /
True
say so 'sat, apr 6' ~~ / sat ', ' apr ' ' 6 /
True
say so 'sat, apr 6' ~~ / [ sat | sun ] ', ' [ mar | apr | may ] ' ' <[0..9]> /
True
say so 'sat, apr 6' ~~ / [ sat | sun ] ', ' [ mar | apr | may ] <[0..9]> /
False
say so 'sat, apr 6' ~~ / < sat sun> ', ' < mar apr may> ' ' <[0..9]> /
True
Start < > with a space to make a word list.
my @days = <sat sun>; my @months = <mar apr may>; say so 'sat, apr 6' ~~ / @days ', ' @months ' ' <[0..9]> /
True
Or use an array. Scalars are interpolated too, by the way.
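A short sketch of scalar interpolation. The variable name $literal is illustrative. A bare scalar is matched as literal text, while the <$var> form treats the string as a regex.

```raku
my $literal = 'a+';

say so 'a+'  ~~ / $literal /;       # True: the string is matched as literal text
say so 'aaa' ~~ / $literal /;       # False: 'aaa' does not contain the text 'a+'
say so 'aaa' ~~ /^ <$literal> $/;   # True: <$literal> treats the string as a regex
```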
How about two digit days?
say so 'a' ~~ / a? /; # 0 or 1
True
say so 'a' ~~ / a* /; # 0 or more
True
say so 'a' ~~ / a+ /; # 1 or more
True
say so 'a' ~~ / a**2 /; # exactly 2
False
say so 'a' ~~ / a**1..5 /; # 1 to 5
True
Use ?, *, and + as usual.
Use ** (the exponentiation symbol) for exact counts or ranges.
my @days = <sat sun>; my @months = <mar apr may>; say so 'sat, apr 6' ~~ / @days ', ' @months ' ' <[0..9]> ** 1..2 /
True
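The same quantifier can tighten the earlier VIN check. This is only a sketch: it assumes a VIN is exactly 17 uppercase letters or digits with I, O and Q excluded, and the name vin-char is made up.

```raku
# One VIN character: an uppercase letter or digit, minus I, O and Q.
my regex vin-char { <[A..Z 0..9] - [IOQ]> }

say so '1FAHP3GNXBW107581' ~~ /^ <vin-char> ** 17 $/;  # True
say so '1FAHP3GNXBW1O7581' ~~ /^ <vin-char> ** 17 $/;  # False: contains an O
say so '1FAHP3GNXBW10758'  ~~ /^ <vin-char> ** 17 $/;  # False: only 16 characters
```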
my regex part { <-[/]>+ }
my regex path { '/' [ <part> '/' ]* <part> }
say so '/home/brian/talk.txt' ~~ / <path> /
True
my regex part { <-[/]>+ }
my regex path { '/' <part>* % '/' }
say so '/home/brian/talk.txt' ~~ / <path> /;
True
% means "separated by".
A* % B is a shorthand for [ A [ B A ]* ]?.
Works for other quantifiers too (+, **).
Useful with ',' as the separator.
See also %%, which also allows a trailing separator.
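A quick sketch of the difference between the two separators, using made-up comma-separated input:

```raku
say so '1,2,3'  ~~ /^ \d+ %  ',' $/;   # True
say so '1,2,3,' ~~ /^ \d+ %  ',' $/;   # False: % does not accept a trailing separator
say so '1,2,3,' ~~ /^ \d+ %% ',' $/;   # True: %% accepts an optional trailing separator
```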
say 'abc' ~~ / abc /;
「abc」
my $match = 'abc' ~~ / abc /; say $match.WHAT;
(Match)
A match returns a match object.
'abc' ~~ / abc/;
say $/.WHAT;
say $/;
(Match)
「abc」
The most recent match is stored in $/.
say prints $/.gist, which shows the match tree.
'hello, world' ~~ /^ [ <-[,]>+ ] ', ' (.*) $/;
say $/;
「hello, world」
 0 => 「world」
Parentheses will capture.
'hello, world' ~~ /^ [ <-[,]>+ ] ', ' (.*) $/;
say $/[0];
say ~$/[0];
「world」
world
You can get positional captures by treating $/ like an array.
Stringify with ~.
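A sketch with several captures at once. It uses $0, $1 and $2, which are shorthand for $/[0], $/[1] and $/[2]; the date string is illustrative.

```raku
'sat, apr 16' ~~ / (\w+) ', ' (\w+) ' ' (\d+) /;

say ~$0;       # sat
say ~$1;       # apr
say +$2 + 1;   # 17  (prefix + turns the capture into a number)
```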
my regex word { <-[,]>+ }
'hello, world' ~~ /^ <word> ', ' (.*) $/;
say $/;
「hello, world」
 word => 「hello」
 0 => 「world」
Named captures use the names of embedded regexes.
The match tree can help.
my regex word { <-[,]>+ }
'hello, world' ~~ /^ <word> ', ' (.*) $/;
say $/{'word'};
say $/<word>;
say $<word>; # all the same
「hello」
「hello」
「hello」
When accessing named captures in $/, you can omit the /.
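Named captures do not have to come from a named regex. As a sketch, a group can be named inline with $<name>=[ ... ]; the names day, month and date are invented here.

```raku
'sat, apr 6' ~~ / $<day>=[ \w+ ] ', ' $<month>=[ \w+ ] ' ' $<date>=[ \d+ ] /;

say ~$<day>;     # sat
say ~$<month>;   # apr
say +$<date>;    # 6
```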
my regex word { <-[,]>+ }
'hello, world' ~~ /^ <word> ', ' <word> $/;
say $<word>;
say $<word>[0];
[「hello」 「world」]
「hello」
It's matches all the way down.
my regex word { <-[,]>+ }
say 'new york, new york' ~~ /^ <word> ', ' $<word> $/;
「new york, new york」
 word => 「new york」
You can interpolate the match variable in the regex to be clever.
my regex word { <-[,]>+ }
say 'oh, ho' ~~ /^ <word> ', ' <{ $<word>.flip }> $/;
「oh, ho」
 word => 「oh」
You can even put code in the regex if you want to be very clever.
my regex char { <-["]> | '\"' }
my regex quoted { '"' <char>* '"' }
'a "good" program' ~~ / <quoted> /;
say ~$<quoted>;
"good"
my regex char { <-["]> | '\"' }
my regex quoted { '"' <( <char>* )> '"' }
'a "good" program' ~~ / <quoted> /;
say ~$<quoted>;
good
Pro tip: use <( and )> to restrict what counts as the overall match.
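A sketch of the same trick outside a named regex, with a made-up input string. The text before <( and after )> must still be present, but it is left out of the result.

```raku
'price: 42 USD' ~~ / 'price: ' <( \d+ )> ' USD' /;

say ~$/;       # 42
say $/.from;   # 7  (the match starts after the seven characters of 'price: ')
```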
grammar G {
    regex TOP { 'a ' <quoted> ' program' }
    regex letters { <[a..z]>+ }
    regex quoted { '"' <( <letters> )> '"' }
}
say G.parse('a "good" program');
「a "good" program」
 quoted => 「good」
  letters => 「good」
Put regexes together into grammars.
grammar G {
    rule TOP { a <quoted> program }
    token letters { <[a..z]>+ }
    token quoted { '"' <( <letters> )> '"' }
}
say G.parse('a "good" program');
「a "good" program」
 quoted => 「good」
  letters => 「good」
Reminders:
Use token for regexes that don't need backtracking.
Use rule for tokens with sigspace.
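Putting the pieces together, here is a minimal sketch of a grammar built only from tokens, with % for the separator. The grammar name KeyValue, its token names and the input are all invented for illustration.

```raku
grammar KeyValue {
    token TOP   { <pair>+ % ';' }        # one or more pairs separated by ';'
    token pair  { <key> '=' <value> }
    token key   { <[a..z]>+ }
    token value { <[a..z 0..9]>+ }
}

my $m = KeyValue.parse('a=1;b=2');

# Walk the match tree and collect the pairs into a hash.
my %h = $m<pair>.map(-> $p { $p<key>.Str => $p<value>.Str });

say %h<a>;                          # 1
say %h<b>;                          # 2
say so KeyValue.parse('a=1;');      # False: % rejects the trailing ';'
```

Because parse must match the whole string, the trailing separator makes the last call fail.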
Examples on modules.perl6.org and docs.perl6.org.
Also JSON-Tiny, or Protobuf (EBNF).
Have fun!
The End