Brian Duggan
bduggan@matatu.org
Perl 5 set the standard for regexes BUT
Perl 6 regexes
Which of these print True? (in Perl 6)
say so 'abc' =~ /b/
===SORRY!=== Error while compiling example.p6
Unsupported use of =~ to do pattern matching; in Perl 6 please use ~~
at example.p6:1
------> say so 'abc' =~<HERE> /b/
say so 'abc' ~~ /b/
True
say so 'abc' ~~ / 'b' /
True
say so 'abc' ~~ regex { b }
True
my regex letter-b {
b
}
say so 'abc' ~~ / <letter-b> /
True
Use / or regex to make a regex.
How about these?
say so 'good' ~~ / good /
True
say so 'not-good' ~~ / not-good /
===SORRY!===
Unrecognized regex metacharacter - (must be quoted to match literally)
at example.p6:1
------> say so 'not-good' ~~ / not<HERE>-good /
Unable to parse regex; couldn't find final '/'
at example.p6:1
------> say so 'not-good' ~~ / not-<HERE>good /
say so 'not-good' ~~ / 'not-good' /
True
say so 'schőn' ~~ / schőn /
True
Use quotes inside a regex. Everything except alphanumeric characters and underscores must be quoted.
say so 'abc' ~~ / abc /
True
say so 'abc' ~~ / a b c /
Potential difficulties:
Space is not significant here; please use quotes or :s (:sigspace) modifier (or, to suppress this warning, omit the space, or otherwise change the spacing)
at example.p6:1
------> say so 'abc' ~~ / a<HERE> b c /
Space is not significant here; please use quotes or :s (:sigspace) modifier (or, to suppress this warning, omit the space, or otherwise change the spacing)
at example.p6:1
------> say so 'abc' ~~ / a b<HERE> c /
True
say so 'a b c' ~~ / a b c /
Potential difficulties:
Space is not significant here; please use quotes or :s (:sigspace) modifier (or, to suppress this warning, omit the space, or otherwise change the spacing)
at example.p6:1
------> say so 'a b c' ~~ / a<HERE> b c /
Space is not significant here; please use quotes or :s (:sigspace) modifier (or, to suppress this warning, omit the space, or otherwise change the spacing)
at example.p6:1
------> say so 'a b c' ~~ / a b<HERE> c /
False
say so 'a b c' ~~ / 'a b c' /
True
say so 'a b c' ~~ / a ' ' b ' ' c /
True
say so 'a b c' ~~ / a \s+ b \s+ c /
True
say so 'a b c' ~~ / a
# hey, this is a comment
\s+ b \s+ c /
True
Whitespace is not significant by default. Neither are comments.
say so 'a b c' ~~ /a \s* b \s* c/;
say so 'a b c' ~~ /a <ws> b <ws> c/;
say so 'a b c' ~~ /:s a b c/;
say so 'a b c' ~~ /:sigspace a b c/;
True True True True
say so 'ABC' ~~ /:i b/;
say so 'ABC' ~~ /:ignorecase b/;
True True
say so 'abc' ~~ /:r b/;
say so 'abc' ~~ /:ratchet b/;
True True
Adverbs start with :.
Ratcheting can make matching much faster, because it turns off backtracking.
Sigspace improves readability.
say so 'abc' ~~ regex { :r abc }
say so 'abc' ~~ token { abc }
True True
say so 'a b c' ~~ token { :s a b c }
say so 'a b c' ~~ rule { a b c }
True True
A token is a regex with ratcheting.
A rule is a token with sigspace.
These are deep concepts! Tokens and rules are building blocks for grammars.
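To see what "no backtracking" means in practice, here is a minimal sketch. The names backtracking and ratcheting are made up for this illustration; the two declarations have the same body and differ only in regex versus token.

```raku
# Hypothetical names, for illustration only.
my regex backtracking { ^ 'a'+ 'ab' $ }   # regex: allowed to backtrack
my token ratcheting   { ^ 'a'+ 'ab' $ }   # token: regex with :ratchet

# 'a'+ first grabs both a's, then gives one back so that 'ab' can match.
say so 'aab' ~~ / <backtracking> /;   # True

# The token never gives a character back, so 'ab' has nothing left to match.
say so 'aab' ~~ / <ratcheting> /;     # False
```

A token fails here where a regex succeeds, so use token when the pattern never needs to give characters back.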
Vehicle Identification Numbers
my $vin = '1FAHP3GNXBW107581';
if $vin ~~ / I | O | Q / {
say "Invalid VIN"
} else {
say "Maybe it's okay";
}
Maybe it's okay
For alternation, use |.
Alternation
say so 'QUIT' ~~ / I | O | Q /
True
say so 'QUIT' ~~ / | I | O | Q /
True
say so 'QUIT' ~~ / | I
| O
| Q /
True
say so 'QUIT' ~~ / <[IOQ]> /
True
You can put an extra | at the beginning.
Construct character classes using <[ and ]>.
say so 'e' ~~ / <[a e i o u]> /
True
say so 'b' ~~ / <[a..e]> /
True
my regex vowels { <[a e i o u]> }
say so 'e' ~~ / <vowels> /;
True
Put lists of characters or ranges in character classes.
Spaces can be in character classes.
my regex not-vowels {
<-[aeiou]>
}
say so 'x' ~~ / <not-vowels> /;
say so '!' ~~ / <not-vowels> /;
True True
my regex consonants {
<[a..z] - [aeiou]>
}
say so '!' ~~ / <consonants> /;
say so 'x' ~~ / <consonants> /;
False True
Take the complement of a character class with <-[ ... ]>.
Or use - to take the set difference.
Brackets make a non-capturing group, like (?:...) from Perl 5.
say so 'sat, apr 6' ~~ / [ sat | sun ] ', ' [ mar | apr | may ] <[0..9]> /
False
This one is False only because nothing matches the space before the day.
say so 'sat, apr 6' ~~ / 'sat, apr 6' /
True
say so 'sat, apr 6' ~~ / 'sat, ' 'apr ' '6' /
True
say so 'sat, apr 6' ~~ / sat ', ' apr ' ' 6 /
True
say so 'sat, apr 6' ~~ / [ sat | sun ] ', ' [ mar | apr | may ] ' ' <[0..9]> /
True
say so 'sat, apr 6' ~~ / < sat sun> ', ' < mar apr may> ' ' <[0..9]> /
True
Start < > with a space to make a word list.
my @days = <sat sun>;
my @months = <mar apr may>;
say so 'sat, apr 6' ~~ / @days ', ' @months ' ' <[0..9]> /
True
Or use an array. Scalars are interpolated too, by the way.
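A small sketch of both kinds of interpolation. The variables $sep and $dot are invented for this example. A scalar that holds a string is matched literally, so a '.' in it is not a wildcard.

```raku
my @days = <sat sun>;   # an array matches any one of its elements
my $sep  = ', ';        # hypothetical scalar, matched as a literal string

say so 'sun, may 5' ~~ / @days $sep may /;   # True

my $dot = '.';          # hypothetical scalar holding a single dot
say so 'a.b' ~~ / a $dot b /;   # True
say so 'axb' ~~ / a $dot b /;   # False: $dot matches a literal dot only
```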
How about two digit days?
say so 'a' ~~ / a? /; # 0 or 1
True
say so 'a' ~~ / a* /; # 0 or more
True
say so 'a' ~~ / a+ /; # 1 or more
True
say so 'a' ~~ / a**2 /; # exactly 2
False
say so 'a' ~~ / a**1..5 /; # 1 to 5
True
Use ?, *, and + as usual.
Use ** (it looks like exponentiation) for exact counts or ranges.
my @days = <sat sun>;
my @months = <mar apr may>;
say so 'sat, apr 6' ~~ / @days ', ' @months ' ' <[0..9]> ** 1..2 /
True
my regex part { <-[/]>+ }
my regex path { '/' [ <part> '/' ]* <part> }
say so '/home/brian/talk.txt' ~~ / <path> /
True
my regex part { <-[/]>+ }
my regex path { '/' <part>* % '/' }
say so '/home/brian/talk.txt' ~~ / <path> /;
True
% means "separated by".
A* % B is shorthand for [ A [ B A ]* ]?.
It works with other quantifiers too (+, **).
It is useful with ',' as the separator.
See also %%, which additionally allows a trailing separator.
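A quick sketch of the difference between % and %% on a comma-separated list. The regex names number, strict-list and relaxed-list are invented for this example.

```raku
# Hypothetical names, for illustration only.
my token number       { \d+ }
my regex strict-list  { ^ <number>+ %  ',' $ }   # separators only between items
my regex relaxed-list { ^ <number>+ %% ',' $ }   # a trailing separator is allowed

say so '1,22,333'  ~~ / <strict-list> /;    # True
say so '1,22,333,' ~~ / <strict-list> /;    # False
say so '1,22,333,' ~~ / <relaxed-list> /;   # True
```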
say 'abc' ~~ / abc /;
「abc」
my $match = 'abc' ~~ / abc /;
say $match.WHAT;
(Match)
A match returns a match object.
'abc' ~~ / abc /;
say $/.WHAT;
say $/;
(Match) 「abc」
The most recent match is stored in $/.
say prints $/.gist, which shows the match tree.
'hello, world' ~~ /^ [ <-[,]>+ ] ', ' (.*) $/;
say $/;
「hello, world」 0 => 「world」
Parentheses will capture.
'hello, world' ~~ /^ [ <-[,]>+ ] ', ' (.*) $/;
say $/[0];
say ~$/[0];
「world」 world
You can get positional captures by treating $/ like an array.
Stringify with ~.
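A minimal sketch with two positional captures. The input string 'width=640' is made up. $0 and $1 are the usual shorthand for $/[0] and $/[1], and a prefix + turns a match into a number.

```raku
# The input string is an invented example.
if 'width=640' ~~ / (\w+) '=' (\d+) / {
    say ~$0;       # width  ($0 is short for $/[0])
    say +$1;       # 640    (prefix + numifies the match)
    say ~$/[1];    # 640    (same capture, via the array form)
}
```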
my regex word { <-[,]>+ }
'hello, world' ~~
/^ <word> ', ' (.*) $/;
say $/;
「hello, world」 word => 「hello」 0 => 「world」
Named captures use the names of embedded regexes.
The match tree can help.
my regex word { <-[,]>+ }
'hello, world' ~~
/^ <word> ', ' (.*) $/;
say $/{'word'};
say $/<word>;
say $<word>; # all the same
「hello」 「hello」 「hello」
When accessing named captures in $/, you can omit the /.
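Putting named captures to work on the date example from earlier, as a rough sketch. The regex names weekday, month-name and day-number are invented here.

```raku
# Hypothetical names, for illustration only.
my regex weekday    { sat | sun }
my regex month-name { mar | apr | may }
my regex day-number { <[0..9]> ** 1..2 }

if 'sat, apr 16' ~~ / <weekday> ', ' <month-name> ' ' <day-number> / {
    say ~$<weekday>;      # sat
    say ~$<month-name>;   # apr
    say +$<day-number>;   # 16
}
```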
my regex word { <-[,]>+ }
'hello, world' ~~
/^ <word> ', ' <word> $/;
say $<word>;
say $<word>[0];
[「hello」 「world」] 「hello」
It's matches all the way down.
my regex word { <-[,]>+ }
say 'new york, new york' ~~
/^ <word> ', ' $<word> $/;
「new york, new york」 word => 「new york」
You can interpolate the match variable in the regex to be clever.
my regex word { <-[,]>+ }
say 'oh, ho' ~~
/^ <word> ', ' <{ $<word>.flip }> $/;
「oh, ho」 word => 「oh」
You can even put code in the regex if you want to be very clever.
my regex char {
<-["]> | '\"'
}
my regex quoted {
'"' <char>* '"'
}
'a "good" program' ~~ / <quoted> /;
say ~$<quoted>;
"good"
my regex char {
<-["]> | '\"'
}
my regex quoted {
'"' <( <char>* )> '"'
}
'a "good" program' ~~ / <quoted> /;
say ~$<quoted>;
good
Pro tip: use <( and )> to limit what the overall match contains.
grammar G {
regex TOP { 'a ' <quoted> ' program' }
regex letters { <[a..z]>+ }
regex quoted {
'"' <( <letters> )> '"'
}
}
say G.parse('a "good" program');
「a "good" program」 quoted => 「good」 letters => 「good」
Put regexes together into grammars.
grammar G {
rule TOP { a <quoted> program }
token letters { <[a..z]>+ }
token quoted {
'"' <( <letters> )> '"'
}
}
say G.parse('a "good" program');
「a "good" program」 quoted => 「good」 letters => 「good」
Reminders --
Use token for regexes that don't need backtracking.
Use rule for tokens with sigspace.
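One more small grammar that follows both reminders, as a sketch rather than a recipe. The grammar name Setting and the token names label and digits are invented. The rule lets whitespace around '=' match for free, and the tokens never backtrack.

```raku
# Hypothetical grammar, for illustration only.
grammar Setting {
    rule  TOP    { <label> '=' <digits> }   # rule: sigspace between the parts
    token label  { <[a..z]>+ }              # token: no backtracking needed
    token digits { <[0..9]>+ }
}

my $m = Setting.parse('answer = 42');
say ~$m<label>;    # answer
say +$m<digits>;   # 42

# parse must match the whole string, so this one fails.
say so Setting.parse('answer = forty-two');   # False
```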
Examples on modules.perl6.org and docs.perl6.org.
Also JSON-Tiny
Or Protobuf (EBNF).
Have fun!
The End