parse - AntescofoDoc

@parse(format:string, input:string)
@parse(format:string, input:tab)

The @parse function provides a simple, fixed-format way to read an input and convert it into a set of Antescofo values according to a format string. Compared to more sophisticated and flexible parsers or template engines, @parse provides only fixed formats dedicated to a few relevant data types, but they are sufficient for many purposes. Its functionality is similar to that of the sscanf() function in C. Compared to sscanf():

- the format specification is oriented toward conversion to Antescofo values;
- the format may include the scanning of tabs and nested tabs;
- the input data is read from a string or from a tab (or a mix of the two);
- the parsing process can be seen as scanning a sequence (which is the behavior of sscanf) or as scanning a multiset (in which case the order of the elements in the input becomes irrelevant).

Format, patterns, sequence and set matching¶

A format string is a sequence or a set of pattern specifications that define how to read and convert the input argument into an arbitrary number of values of various types. The decoded values are collected in a tab returned by the function.

A pattern is a directive which describes how to process the sequence of characters (when the input is a string) or the sequence of elements (when the input is a tab) and what kind of value must be retrieved. The input is not altered by the parsing and is always processed left-to-right.

The order of pattern declarations in the format can be taken into account, which leads to sequence matching, or can be ignored, which leads to set matching. In addition, the input argument may be a string or a tab, which gives four parsing behaviors:

Sequence matching:
- If the input is a string, the first pattern is used to parse the beginning of the input string and to decode the Antescofo value specified by the pattern. We say that the pattern matches the (beginning of the) input string. The matched value is appended to the returned tab. The parsing then proceeds with the rest of the format string and the rest of the input string.
- If the input is a tab, the first pattern is used to check the type of the first element of the tab. This element $e$ is converted into (another) Antescofo value only if $e$ is a string. The matched value is appended to the resulting tab. The parsing then proceeds with the rest of the format string and the rest of the input tab.
Set matching:
- If the input is a string, the different patterns in the format are tried sequentially (in a predefined order) until one is found that matches the (beginning of the) input string. The match is appended to the result. The parsing then proceeds with the entire format and the rest of the input string.
- Similarly, if the input is a tab, the patterns in the format are tried sequentially until one is found that matches the first element in the tab. The match is added to the result. The parsing then proceeds with the entire format and the rest of the input tab.
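
The difference between the two strategies on a tab input can be modeled outside Antescofo. The Python sketch below is an illustration only and not the Antescofo implementation: patterns are reduced to hypothetical predicates (is_int, is_str) over already-typed elements, None stands in for <undef>, and the length check in match_sequence is an assumption that mirrors the default length rule for top-level sequence matching.

```python
from typing import Any, Callable, List, Optional

Pattern = Callable[[Any], bool]  # hypothetical stand-in for an explicit pattern


def is_int(v: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(v, int) and not isinstance(v, bool)


def is_str(v: Any) -> bool:
    return isinstance(v, str)


def match_sequence(patterns: List[Pattern], elems: List[Any]) -> Optional[List[Any]]:
    """Sequence matching: pattern i must match element i, each pattern is consumed once."""
    if len(patterns) != len(elems):      # assumed default length rule
        return None
    result = []
    for pat, elem in zip(patterns, elems):
        if not pat(elem):
            return None                  # the current pattern fails: parsing fails
        result.append(elem)
    return result


def match_set(patterns: List[Pattern], elems: List[Any]) -> Optional[List[Any]]:
    """Set matching: every element must be matched by some pattern, in any order."""
    result = []
    for elem in elems:
        if not any(pat(elem) for pat in patterns):
            return None                  # no pattern matches the current element
        result.append(elem)
    return result
```

With these definitions, match_sequence([is_int, is_str], ["a", 1]) fails because the order is wrong, while match_set([is_int, is_str], ["a", 1, "b"]) succeeds because each element is accepted by one of the patterns.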

The @parse function returns a tab which contains the data parsed from the input. The value <undef> is returned if the parsing fails. In sequence matching, a failure occurs if the current pattern fails to match the current input. In set matching, a failure occurs if no pattern is able to match the current input.

Format specification¶

There are two kinds of patterns:

- explicit patterns, which start with a % character in the format string;
- implicit patterns, which correspond to data in the input but do not produce a value in the output.

A pattern specification ends at the beginning of the next pattern, at the end of the format string, or at a terminator. There are two terminators: ',' is used for a sequence of patterns and ';' is used for a set of patterns. Terminators do not match anything in the input data. They are present to make a format easier to read.

The terminators, the space character and the % character are called special characters.

Implicit patterns¶

These patterns are simply ignored when the data is presented as a tab.

Space¶

A sequence of white-space characters (space, tab, newline, etc.; see isspace) in the format matches any amount of white-space, including none, in an input string. The matched part does not appear in the result. If you don't want any space in the input, replace the spaces in the format by a terminator.

Space patterns are ignored while matching a tab input.

Literals¶

A sequence of characters that are neither spaces nor terminators, and that are not part of the specification of another pattern, matches itself in the input string. These patterns are called literals.

Literals are simply ignored if the input is a tab. If the input is a string, a literal matches a substring verbatim, but the matched part does not appear in the result.

To include a special character in a literal, escape it by preceding it with a % character.

Explicit Patterns for scalar values¶

Explicit patterns are used to match a scalar value or a (possibly nested) tab. An explicit pattern:

- matches some part of the input sequence,
- implies some conversion (if the input is a string),
- and stores the result as an Antescofo value in the returned tab.

We first review the scalar patterns and then the tab pattern.

Matching a boolean¶

%b matches a boolean, i.e. a boolean value in a tab. In an input string it matches the substrings 'true', 'false', '0', '1'. The matching is case-insensitive.

Note that any Antescofo value can be interpreted as a boolean. So in a tab, %b matches any value, and this value is converted into a boolean in the result.

Here is an example:

    @parse("boolean %b", "boolean        0")

will return [ false ]. Indeed, "boolean" is a literal and matches the corresponding input. The space that follows in the format matches a sequence of spaces in the input. Literals and spaces are implicit patterns, and what they match does not appear in the result. The pattern "%b" is an explicit pattern and matches a boolean. The character '0' is matched and converted into false.

Notice that a space may match zero occurrences of a space, so @parse("boolean %b", "boolean1") will return [ true ].

Here are some additional examples of successful matching for "boolean %b":

    @parse("boolean %b", "boolean 0") --> [ false ]
    @parse("boolean %b", "boolean false") --> [ false ]
    @parse("boolean %b", "booleanfalse") --> [ false ]
    @parse("boolean %b", "boolean FaLsE") --> [ false ]

    @parse("boolean %b", "boolean 1") --> [ true ]
    @parse("boolean %b", "boolean true") --> [ true ]
    @parse("boolean %b", "booleanTrue") --> [ true ]
    @parse("boolean %b", "boolean TRUE") --> [ true ]

Notice that

    @parse("boolean%b", "boolean   0")

will return <undef> because the spaces in the input are not matched.
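
The string side of %b can be modeled with a small Python sketch. This is an illustration of the rules stated above, not the Antescofo implementation; the function name match_bool and the convention of returning the remaining input are assumptions made for the example.

```python
import re
from typing import Optional, Tuple

# Accepted spellings according to the description of %b (case-insensitive).
_BOOL = re.compile(r"true|false|0|1", re.IGNORECASE)


def match_bool(text: str) -> Optional[Tuple[bool, str]]:
    """Match a boolean at the very start of text; leading spaces are NOT skipped."""
    m = _BOOL.match(text)
    if m is None:
        return None                      # plays the role of <undef>
    token = m.group().lower()
    return token in ("true", "1"), text[m.end():]
```

In this model, match_bool("   0") fails, which corresponds to the "boolean%b" example: once the literal has been consumed, nothing in the format accounts for the spaces that precede the 0.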

Matching an integer¶

%i matches an integer, i.e. either an integer value in a tab, or a substring of the input that represents an integer. The substring may begin with an arbitrary amount of white space (as determined by isspace) followed by a single optional + or - sign and then a sequence of decimal digits stopping at the first character which is not a decimal digit.

Leading white-space characters in the substring (as defined by the isspace function) are skipped, but white space that follows the integer is not consumed.

Additional forms, like hexadecimal constants, are accepted. The full syntax accepted by %i is the same as that of the POSIX strtol function.
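
The decimal case of %i on an input string can be sketched in Python as follows. This is a simplified model, not the Antescofo implementation: it handles only the decimal form described above (the hexadecimal and other strtol forms are left out), Python's \s is used as an approximation of isspace, and the name match_int is hypothetical.

```python
import re
from typing import Optional, Tuple

# Optional leading white space, one optional sign, then decimal digits.
_INT = re.compile(r"\s*([+-]?[0-9]+)")


def match_int(text: str) -> Optional[Tuple[int, str]]:
    """Return (value, remaining input), or None when no integer starts the text."""
    m = _INT.match(text)
    if m is None:
        return None
    return int(m.group(1)), text[m.end():]
```

Note that the white space after the number stays in the remaining input: match_int("  -42 rest") yields -42 and leaves " rest" for the next pattern.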

Matching a float¶

%d matches a floating point value. The expected form of the matched string is an optional + (plus) or - (minus) sign, followed by a decimal significand, consisting of a sequence of decimal digits (optionally containing a decimal-point character).

The significand may be optionally followed by an exponent. An exponent consists of an E or e character followed by an optional plus or minus sign, followed by a sequence of decimal digits. The exponent indicates the power of 10 by which the significand should be scaled.

Alternatively, if the portion of the string following the optional plus or minus sign begins with INFINITY or NAN, ignoring case, it is interpreted as an infinity or a quiet NaN, respectively.

In any of the above cases, leading white-space characters in the string (as defined by the isspace function) are skipped.

Additional forms, like hexadecimal constants, are accepted. The full syntax accepted by %d is the same as that of the POSIX strtod function.
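
The decimal, exponent and INFINITY/NAN forms described above can be modeled with a regular expression. The Python sketch below is only an approximation of the strtod syntax (hexadecimal floats are not covered), and match_float is a hypothetical name chosen for the example.

```python
import re
from typing import Optional, Tuple

_FLOAT = re.compile(
    r"\s*("                                   # leading white space is skipped
    r"[+-]?"                                  # optional sign
    r"(?:"
    r"(?:[0-9]+\.?[0-9]*|\.[0-9]+)"           # significand, optional decimal point
    r"(?:[eE][+-]?[0-9]+)?"                   # optional exponent
    r"|infinity|nan"                          # special values, case-insensitive
    r")"
    r")",
    re.IGNORECASE,
)


def match_float(text: str) -> Optional[Tuple[float, str]]:
    """Return (value, remaining input), or None when no float starts the text."""
    m = _FLOAT.match(text)
    if m is None:
        return None
    return float(m.group(1)), text[m.end():]
```

An incomplete exponent is not consumed in this model: for "1e" the match stops after the digit 1 and the letter e remains in the input.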

Matching an arithmetic expression¶

%a matches an arithmetic expression, as parsed by @to_num for an input string, and a numeric value (an integer or a float) in a tab.

Matching a string¶

%s matches a string. The string is either a whole string in a tab or a substring in the input string. The substring stops at the first white-space character.

Two alternatives can be used to define where the substring stops:

- by giving a string length with the pattern %snnn: the decimal digits nnn specify the length of the expected substring in the input string. In a tab, %snnn matches a string value of length nnn.
- by giving a set of ending characters with the pattern %s=xxxx. The presence in the input of one character belonging to xxxx signals the end of the matched string. xxxx is a sequence of characters that ends with a special character. To include a special character in xxxx, escape it with %. When matching a tab, this constraint is ignored and the pattern %s=xxxx becomes equivalent to %s.

To match a string followed by an equal sign, use a terminator in the format: "%s,=". Including an equal sign in the characters set used to signal the end of the string is not a problem: "%s==".
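
The three ways of delimiting a string in an input string can be compared with a short Python sketch. It is a model under stated assumptions, not the Antescofo implementation: match_string and its parameters length and stops are hypothetical names, the ending character is assumed to stay in the remaining input, and an empty match is assumed to be accepted.

```python
from typing import Optional, Tuple


def match_string(text: str,
                 length: Optional[int] = None,
                 stops: Optional[str] = None) -> Optional[Tuple[str, str]]:
    """Model of %s (default), %snnn (length=nnn) and %s=xxxx (stops='xxxx')."""
    if length is not None:               # %snnn: exactly nnn characters
        if len(text) < length:
            return None
        return text[:length], text[length:]

    def is_stop(ch: str) -> bool:
        # %s=xxxx stops at any character of xxxx; plain %s stops at white space
        return ch in stops if stops is not None else ch.isspace()

    end = 0
    while end < len(text) and not is_stop(text[end]):
        end += 1
    # Assumption: the stop character is left in the remaining input.
    return text[:end], text[end:]
```

For instance, match_string("key=value", stops="=") yields "key" and leaves "=value", whereas match_string("abcdef", length=3) yields "abc" and leaves "def".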

The tab pattern¶

The pattern matching a tab takes the following form:

    "... %[ length, delim=(, sep=%, p1, p2, ...  %] ..."

%[ ... %] matches a tab. The token %] closes the pattern opened by %[ and does not introduce a new pattern. The items between %[ and %] specify the tab content. The element specifications p1, p2, ... are patterns which may in turn contain tab patterns, allowing the parsing of nested sequences. The format given here corresponds to the parsing of a comma-separated list of elements (the sep directive) enclosed in a pair of parentheses (the delim directive). Elements of the tab are matched by p1, p2, etc.

The tab content specification is a list of optional items, given in the following order:

- the optional specification of the tab length;
- the optional specification of the tab delimiters delim= and of the element separators sep=;
- a sequence or a set of patterns used to parse the tab elements.

Items are separated by terminators or spaces. Beware that the terminator chosen in the tab specification determines how the tab content is matched (set or sequence matching). The decision is local to the current tab: the use of ',' or ';' does not affect the matching mode of the enclosing structure or of nested structures.

Specifying the length of the matched tab¶

The first (optional) item controls the length of the expected sequence:

- * denotes a tab of any size, possibly empty;
- + matches a non-empty tab of any size;
- nnn refers to a tab with nnn elements.

If there is no length specification:

* is assumed for set matching
$N$ is expected for sequence matching where
- $N$ is the number of patterns in the format when matching an input string
- $N$ is the number of explicit patterns when matching an input tab

Specifying the tab delimiters¶

Tab delimiters and element separators are ignored when the input is a tab (the structure of the tab speaks for itself).

In an input string, a tab starts with an opening delimiter specified using the (optional) item delim=c, where c is a single character. The tab closes with a closing delimiter which is computed from the opening delimiter. There are two kinds of delimiters:

nesting delimiters: in the input string, the tab closes with a character which is paired with the opening delimiter:
- opening [ is paired with closing ]
- opening ( is paired with closing )
- opening { is paired with closing }
- opening < is paired with closing >
simple delimiters are any other characters. The same character is used for the opening and the closing delimiter. For instance, ", ', or a space. If the character is a special character, it must be escaped.

By default, delim=[ is assumed. It is possible to have no delimiters by specifying delim=, or delim=;. This case is not of much use: all the remaining input is used to match the tab elements, and the patterns that follow are then ignored.

If a tab uses a nesting delimiter, then the elements of the matched tab can themselves be tabs. This requires the delimiters to be well balanced in the substring that represents the tab in the input:

- there is an opening delimiter to the left of every closing delimiter,
- and for any prefix of the substring, there are at least as many opening delimiters as closing delimiters.
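
The balancing condition amounts to tracking a nesting depth while scanning the substring. The Python sketch below illustrates that idea for both kinds of delimiters. It is not the Antescofo implementation; closing_index and PAIRS are hypothetical names, and escaped delimiters inside elements are not handled.

```python
from typing import Optional

# Nesting delimiters and their paired closing character.
PAIRS = {"[": "]", "(": ")", "{": "}", "<": ">"}


def closing_index(text: str, opener: str = "[") -> Optional[int]:
    """Return the index of the delimiter that closes text[0], or None if unbalanced."""
    if not text or text[0] != opener:
        return None
    closer = PAIRS.get(opener, opener)
    if closer == opener:
        # Simple delimiter: the next occurrence of the same character closes the tab.
        idx = text.find(closer, 1)
        return idx if idx != -1 else None
    depth = 0
    for i, ch in enumerate(text):
        if ch == opener:
            depth += 1
        elif ch == closer:
            depth -= 1
            if depth == 0:               # the outermost tab is closed here
                return i
    return None                          # more openings than closings
```

On "[1 [2 3] 4] tail" the function returns 10, the position of the bracket closing the outer tab; the inner pair is skipped because the depth is still positive when it closes.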

Specifying element separators¶

In the input string, the matched elements are substrings delimited by a separator. The possible separators are specified using the keyword sep= followed by the list of characters that can be used as separators: a separator in the input is a single character, but several different characters can serve this purpose. The list ends with a special character, and if a separator is itself a special character, it must be escaped.

The default separator is a space or a comma ','.

It is possible to have no separator at all:

    @parse("%[sep=,%i%s=% ,%i%]", "[123toto 456]") --> [123, "toto", 456]

The first element is an integer, which matches "123" in the input string; the second element is a string that ends at the space, which yields "toto"; and the last integer matches " 456" (leading white-space characters are skipped by integer patterns).

Tab pattern elements:¶

The last item in the tab content specification is the list of patterns used to match the elements of the tab in the input. If the element patterns are separated only by spaces or commas, the patterns must match the input elements sequentially (sequence matching). If a semicolon is used, the format defines a set matching: each element in the input must be matched by one of the element patterns, but the order of the patterns does not matter.

There is no direct relationship between the number of patterns and the number of elements in the matched tab:

- If there is no element pattern, any value is accepted in the input tab.
- Element patterns that are not used are simply ignored.
- When matching a sequence, if there are too few element patterns, the remaining elements in the input are not constrained.

So, to constrain the size of the matched input, a size specification must be used.

The implicit tab in top-level matching¶

Implicitly, the top-level format string specifies a tab, but:

- the enclosing brackets %[ and %] are not specified;
- the size of the input is computed as the number of element patterns when the input is a string, and as the number of explicit patterns when the input is a tab;
- the delimiter is set to “nothing”;
- the separators are set to “nothing”.

The optional size, delimiter and separator items can be specified explicitly to override these defaults.

For example,

    @parse("sep=,%i%s=% ,%i", "123toto 456") --> [123, "toto", 456]

Sequence or Set Matching¶

When matching a sequence, the element patterns are considered in order and are consumed by the matching: the first element in the input is matched by the first pattern, the second element in the input is matched by the second pattern, etc. (modulo, when the input is a tab, the distinction between implicit and explicit patterns). Recall that:

- if there are too few patterns, the remaining elements in the input are neither constrained nor converted;
- if there are too few elements in the input, the remaining patterns are ignored.

When matching a set, the patterns are not consumed and their order does not matter. An element in the input is matched if there is a pattern in the format able to match it.

In set matching, patterns can be ambiguous, meaning that the same element in the input can be matched by distinct patterns. This is a problem because distinct patterns mean distinct conversions. For example, "false" can be matched by %b (and the result is a boolean) but also by %s (and the result is a string). So, irrespective of the order in which they are defined, patterns in the format are tried during set matching in the following order:

    Literal < tab < arith < bool < space < string

Integer patterns and double patterns are implicitly transformed into arithmetic patterns in set matching, because the notation of an int is also a valid notation for a double and the arithmetic pattern subsumes the two.

A consequence is that when both a numeric pattern and a boolean pattern are present in the format string, "0" and "1" in the input are always interpreted as numeric.
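
The effect of the fixed trial order can be illustrated with a short Python sketch. It is a model of the rule stated above, not the Antescofo implementation: the kind names ("int", "float", "arith", ...) and the functions normalize and trial_order are hypothetical, and the order is read from left (tried first) to right (tried last).

```python
from typing import List

# Trial order used in set matching, from first tried to last tried.
PRIORITY = ["literal", "tab", "arith", "bool", "space", "string"]


def normalize(kind: str) -> str:
    # Integer and float patterns become arithmetic patterns in set matching.
    return "arith" if kind in ("int", "float") else kind


def trial_order(declared: List[str]) -> List[str]:
    """Order in which the declared pattern kinds are tried, whatever their declaration order."""
    kinds = {normalize(k) for k in declared}
    return [k for k in PRIORITY if k in kinds]
```

For a format that declares a string, a boolean and an integer pattern, trial_order(["string", "bool", "int"]) gives ["arith", "bool", "string"]: the arithmetic pattern is tried before the boolean one, which is why "0" and "1" end up as numeric values.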

Examples¶

The @parse function can be used to “typecheck” a data structure. For instance

    @parse("%s;", t)

checks that t is a tab of strings. The terminator ; in the format induces a set matching, but there is only one pattern, which means that all elements in the tab must match %s. Beware that

    @parse("%i;", t)
    @parse("%d;", t)

both check that t is a tab of numeric values, i.e. integers or floats, because in set matching the %i and %d patterns are rewritten as %a. To specify a given size, we can use the length item in the format:

    @parse("3;%i", t)

will return <undef> if t does not consist of three numeric values. Notice that checking whether a tab contains only the booleans true or false is not possible, because every Antescofo value is convertible into a boolean.
