@r_compile(key:value, r:string)
@r_match(key:value, s:string)
@r_match(r:string, s:string)
@r_search(key:value, s:string)
@r_search(r:string, s:string)
@r_findall(key:value, s:string)
@r_findall(r:string, s:string)
The function @r_compile(k, r) compiles the regular expression r (a pattern given as a string) and stores the result under the key k (any value) in an internal dictionary. This key can then be used as the first argument to the three functions @r_match(), @r_search() and @r_findall():
- Function @r_match() checks whether a pattern matches the whole string.
- Function @r_search() finds a single occurrence of a pattern within a string.
- Function @r_findall() reports all the occurrences of a pattern within the whole string.
A regular expression r can also be given directly to these functions as a string, but using @r_compile() avoids recompiling the pattern r at each call. This is useful if the pattern is reused often.
Function @r_match(p, s) returns false if the pattern does not match the entire string. If the string s is an instance of the pattern p, a non-empty tab is returned. This tab contains either the submatches of the pattern (the substrings of s matching the groups in the pattern) or the entire string if there is no group. See below for groups in a pattern.
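The shape of the value returned by @r_match can be modeled in Python as follows. This is a sketch of the behavior described above, not the Antescofo implementation: the Python function `r_match` is a stand-in, Python lists stand for tabs, and Python's `re` syntax is used in place of the ECMAScript notation (the two agree on the simple patterns shown here).

```python
import re

def r_match(pattern, s):
    """Model of the described result: False, the submatches, or [s] without groups."""
    m = re.fullmatch(pattern, s)   # the pattern must cover the entire string
    if m is None:
        return False
    if m.re.groups == 0:
        return [m.group(0)]        # no group: report the whole string
    # One entry per group. A group that did not take part in the match is
    # None in Python; the source does not say what Antescofo reports then.
    return list(m.groups())

whole = r_match("[a-e]*", "abcde")
failed = r_match("[a-e]*", "ab.cde")
groups = r_match("([a-d]+)[0-9]+([a-z]+)", "ab12cde")
```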
Function @r_search(p, s) returns false if there is no occurrence of the pattern p in the string s. If an occurrence is found, the returned value is a non-empty tab providing the matched substrings and the characters before and after the match:
[ prefix, m[1], …, m[n], suffix ]
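The strings m[1], …, m[n] are the substrings matched by the groups of the regular expression. If the pattern contains no group, the matched substring itself is reported in their place, as in the examples below.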
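The [ prefix, …, suffix ] layout can be modeled in Python as follows. This is an illustrative sketch, not the Antescofo implementation: the Python function `r_search` is a stand-in, lists stand for tabs, and reporting the whole match when the pattern has no group follows the examples given later on this page.

```python
import re

def r_search(pattern, s):
    """Model of the described result: False or [prefix, m[1], ..., m[n], suffix]."""
    m = re.search(pattern, s)      # first occurrence anywhere in s
    if m is None:
        return False
    # Submatches of the groups, or the whole match when there is no group.
    middle = list(m.groups()) if m.re.groups else [m.group(0)]
    return [s[:m.start()], *middle, s[m.end():]]

no_group = r_search("[a-z]+", "888ab12cde999")
two_groups = r_search("([a-d]+)[0-9]+([a-z]+)", "88ab12cde99")
```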
Function @r_findall(p, s) returns false if there is no occurrence of the pattern p in the string s. If occurrences are found, the returned value is a non-empty tab that alternates the text surrounding the occurrences with, for each occurrence, a tab of its submatches:
[
prefix,
[ m[1,1], …, m[1,n] ],
sep_1,
[ m[2,1], …, m[2,n] ],
sep_2,
…,
[ m[p,1], …, m[p,n] ],
suffix
]
The strings m[i, 1], …, m[i, n] are the substrings matched by the groups of the ith occurrence of the regular expression in the string s. The string sep_i is the substring between the ith occurrence of the pattern and occurrence i+1. The string prefix is the part of s before the first occurrence and suffix is the part of s after the last occurrence.
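The alternating layout can be modeled in Python as follows. This is a sketch under stated assumptions, not the Antescofo implementation: the Python function `r_findall` is a stand-in, lists stand for tabs, and reporting the whole match for a pattern without groups is an assumption carried over from @r_search.

```python
import re

def r_findall(pattern, s):
    """Model of the described result: False or [prefix, [..], sep_1, [..], ..., suffix]."""
    result = []
    last_end = 0
    found = False
    for m in re.finditer(pattern, s):          # successive non-overlapping occurrences
        found = True
        result.append(s[last_end:m.start()])   # prefix, then sep_1, sep_2, ...
        # Assumption: without groups, the whole match is reported.
        result.append(list(m.groups()) if m.re.groups else [m.group(0)])
        last_end = m.end()
    if not found:
        return False
    result.append(s[last_end:])                # suffix after the last occurrence
    return result

found_all = r_findall("([a-z]+)[0-9]+", "a1.b2::c3")
```

Here `found_all` holds the empty prefix, one single-element list per occurrence, the separators "." and "::", and an empty suffix.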
Regular expression notation
A regular expression (RE) is a matching engine constructed from a string. Several notations can be used to specify the RE: ECMAScript (the default), POSIX, awk, grep, and egrep notation. The notation is selected by setting the global variable $regexp_syntax_option. The recognized values are:
"ECMAScript" ; ECMAScript notation (aka JavaScript)
"ECMA" ; ECMAScript (alias)
"default" ; ECMAScript (alias)
"basic" ; POSIX basic RE
"extended" ; POSIX extended RE
"awk" ; awk RE
"grep" ; grep RE
"egrep" ; egrep RE
If the value of $regexp_syntax_option is not valid, the default notation (ECMAScript) is used. A change of notation affects only subsequent compilations: REs that are already compiled are not affected.
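The alias, fallback and compile-time rules can be summarized by a small Python model. It is only a sketch: `NOTATIONS`, `settings`, `resolve_notation` and `r_compile` are hypothetical names, and the model records which notation a pattern was compiled with rather than implementing the notations themselves.

```python
# Recognized option values, mapped to the notation they select.
NOTATIONS = {
    "ECMAScript": "ECMAScript", "ECMA": "ECMAScript", "default": "ECMAScript",
    "basic": "basic", "extended": "extended",
    "awk": "awk", "grep": "grep", "egrep": "egrep",
}

# Stand-in for the global variable $regexp_syntax_option.
settings = {"regexp_syntax_option": "default"}

def resolve_notation(option):
    """Return the selected notation; any invalid value falls back to ECMAScript."""
    if isinstance(option, str):
        return NOTATIONS.get(option, "ECMAScript")
    return "ECMAScript"

compiled = {}

def r_compile(key, pattern):
    """Record the pattern with the notation in force at compile time."""
    compiled[key] = (resolve_notation(settings["regexp_syntax_option"]), pattern)

r_compile("before", "x+")
settings["regexp_syntax_option"] = "extended"   # affects later compilations only
r_compile("after", "x+")
```

After these calls, the entry stored under "before" still carries the ECMAScript notation, while the one under "after" carries the POSIX extended notation.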
We do not present here the various RE notations. However, we recall some features of ECMAScript:
- A suffix ? after any of the repetition notations makes the pattern matcher "lazy" or "non-greedy": when looking for a pattern, it looks for the shortest match rather than the longest. By default, the pattern matcher always looks for the longest match. For example, when searching in ababab, the pattern (ab)+ matches all of ababab, whereas (ab)+? matches only the first ab.
- The most common character classifications have names xxx that can be used in a character class [ … ] using the notation [:xxx:]. For example, \$[[:alpha:]_][[:alnum:]_]* matches an Antescofo variable identifier: it starts with a dollar sign (escaped, because an unescaped $ is an end-of-input anchor), the second character must be an alphabetic character (class [:alpha:]) or an underscore, and the remaining characters are alphanumeric characters or underscores. Some of these classes are also available through the @char_is_xxx() predicates.
- A group (a subpattern), potentially represented by a submatch, is delimited by parentheses. If you need parentheses that should not define a subpattern, use (?: rather than plain (.
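These three features can be tried out in Python, whose `re` module shares lazy quantifiers and non-capturing groups with the ECMAScript notation. This is an illustration only, not Antescofo code. Python's `re` does not support the [:xxx:] class names, so the identifier pattern below substitutes explicit ASCII ranges, which is an approximation of [:alpha:] and [:alnum:].

```python
import re

text = "ababab"
greedy = re.search(r"(ab)+", text).group(0)    # longest match
lazy = re.search(r"(ab)+?", text).group(0)     # shortest match

# A capturing group produces a submatch; (?: ... ) only groups.
# A repeated capturing group reports its last iteration.
capturing = re.fullmatch(r"(ab)+(c)", "ababc").groups()
non_capturing = re.fullmatch(r"(?:ab)+(c)", "ababc").groups()

# Identifier pattern with explicit ranges substituted for [:alpha:] and [:alnum:].
identifier = re.compile(r"\$[A-Za-z_][A-Za-z0-9_]*")
is_variable = identifier.fullmatch("$tempo_1") is not None
```

With these inputs, `greedy` is the whole string, `lazy` is the first "ab", `capturing` holds two submatches and `non_capturing` holds only "c".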
Examples
In these examples, the RE is given directly to the matching functions:
@r_match("[a-e]*", "abcde") -> [ "abcde" ]
@r_match("[a-e]*", "ab.cde") -> false
@r_search("[a-e]*", "ab.cde") -> ["", "ab", ".cde"]
The RE is compiled at each call. To avoid this recompilation, function
@r_compile
can be used:
_ := @r_compile(1, "[a-z]+")
The key 1
can then be an argument of the matching functions:
@r_match(1, "abcde") -> [ "abcde" ]
@r_search(1, "888ab12cde999") -> ["888", "ab", "12cde999"]
Any value can be used as a key, even the string specifying the RE:
_ := @r_compile("[a-z]+", "[a-z]+")
@r_search("[a-z]+", "888ab12cde999") -> ["888", "ab", "12cde999"]
In this last example, the occurrence of the pattern is ab, the prefix (the substring preceding the match) is 888 and the suffix is 12cde999.
The pattern "[a-z]+" does not contain groups. The pattern "([a-d]+)[0-9]+([a-z]+)" contains two groups, "([a-d]+)" and "([a-z]+)". When groups are present, the submatches are reported in the returned tab:
@r_match("([a-d]+)[0-9]+([a-z]+)", "ab12cde") -> [ "ab", "cde" ]
@r_search("([a-d]+)[0-9]+([a-z]+)", "88ab12cde99") -> ["88", "ab", "cde", "99"]
Nota Bene: only groups are reported. In the previous example the
substring matched by [0-9]+
is not reported.
Function @r_findall can be used to report all the occurrences of the pattern:
@r_findall("([a-z]+)[0-9]+",
"a1.b2::c3 dd44____abcdefghi12345678-------")
The pattern describes a sequence of lowercase letters followed by a sequence of digits. Only the alphabetic part is enclosed in a group, so only that part is reported. The call returns:
[
"", ; prefix
["a"], ; occurrence 1
".", ; sep_1
["b"], ; occurrence 2
"::", ; sep_2
["c"], ; occurrence 3
" ", ; sep_3
["dd"], ; occurrence 4
"____", ; sep_4
["abcdefghi"], ; occurrence 5
"-------" ; suffix
]
The prefix is an empty string because s starts with an occurrence of the pattern. Each occurrence is reported as a tab. Each of these tabs contains only one element, corresponding to the single group in the pattern.
See also: @char_is_alnum, @char_is_alpha, @char_is_ascii, @char_is_blank, @char_is_cntrl, @char_is_digit, @char_is_graph, @char_is_lower, @char_is_print, @char_is_punct, @char_is_space, @char_is_upper, @char_is_xdigit
See also String Management: @car, @cdr, @char_is_alnum, @char_is_alpha, @char_is_ascii, @char_is_blank, @char_is_cntrl, @char_is_digit, @char_is_graph, @char_is_lower, @char_is_print, @char_is_punct, @char_is_space, @char_is_upper, @char_is_xdigit, @copy, @count, @drop, @dump, @dumpvar, @empty, @explode, @find, @is_prefix, @is_string, @is_subsequence, @is_suffix, @last, @member, @occurs, @parse, @permute, @push_back, @r_compile, @r_findall, @r_match, @r_search, @remove, @remove_duplicate, @replace, @scramble, @slice, @sort, @sputter, @string2fun, @strip_path, @stutter, @string2proc, @system, @take, @to_num, @Tracing, @UnTracing