The notation uses [] (optional) and | (choices). Text parsing with a grammar begins with a start rule, in this case <Sentence>.
Given this example, sort of a diagram of a very small part of the English language, we can examine a sentence to see if it fits the grammar. A sentence fits if it matches the start rule from beginning to end; it will not fit if it uses text that is not in the grammar, or does not fit any of the declarations.
How is text like Look at the dog. compared to a <Sentence>? The technique is essentially to keep two pointers, like using one finger of each hand. One hand points to where you are in the text, and the other points to where you are in the rules. You can try it with your own hands! Do programmers really use their hands? Of course we do! Whatever works!
Text: Look at the dog.
Rule: <Sentence> = <Verb> the <Noun>.
Start with your pointers at the beginning of Look at the dog. and the beginning of <Sentence>. At all times, the text is compared to the rule. At the beginning of <Sentence>, we have <Verb>. So the rule pointer makes a side trip to <Verb>, finds that Look at is a correct match, and scans over it. Now that <Verb> is done, the rule pointer returns to the next part of <Sentence> with the remaining text the dog. The second part of <Sentence> is the, making a match and leaving the text dog., and the rule pointer takes another side trip to parse <Noun>. The first choice, ball, is not a match, so the parser moves over it to the next choice, dog, which does work, and <Noun> is complete. Then the last token in <Sentence> is the final period. The start rule was parsed successfully, so the parse was successful.
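Pieced together, the whole toy grammar might look like this (the choice lists for <Verb> and <Noun> are assumptions beyond Look at, ball, and dog, which appear in the walkthrough):

```text
<Sentence> = <Verb> the <Noun>.
<Verb> = Look at | Walk to
<Noun> = ball | dog
```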
The common way of formatting the digits of United States telephone numbers, such as 867-5309, or 415-5551212, or 1 (123)-123-4242, can be described by a grammar. Written on a single line, though, such a grammar is difficult to read, which helps conceal a tiny error. We could break it out into a slightly friendlier description:
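A sketch of what that friendlier version might look like, reconstructed from the description that follows (the exact rule shapes and spacing are assumptions):

```text
<PhoneNumber> = [<AreaCode>] <Digit><Digit><Digit> [-] <Digit><Digit><Digit><Digit>
<AreaCode> = [<Prefix>] ( <Digit><Digit><Digit> ) [-]
<AreaCode> = [<Prefix>] <Digit><Digit><Digit> [-]
<Prefix> = 1 [-]
<Digit> = 0|1|2|3|4|5|6|7|8|9
```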
What we see now is more readable. A <PhoneNumber> optionally starts with an <AreaCode> before the seven digits. The seven digits may be separated in the right place with an optional hyphen. The <AreaCode> has an optional <Prefix>, which is 1 and an optional hyphen. The optional nature of some features is interesting. You can see that the seven digits can be broken up by a hyphen, but it has to be in the right place: 86753-09 is not a successful parse. The 1- prefix is optional, and you can have just the 1 (1800-555-1222), but you can't have just the hyphen (-800-555-1222), because the hyphen alone would not successfully parse the prefix, and although the prefix is optional, the area code has no rule to accept the hyphen.
In this more readable version, we see there are two versions of <AreaCode>, one after the other. The first lets you parenthesize the area code, and the second lets you leave off the parentheses. Writing alternate versions of the same rule is similar to the choice operator |. The parser will try the versions in the order they're provided, and always accepts the first one that works.
With BNF, the grammar has some room for interpretation in the case that two versions could both fit the next piece of text. In context-free grammar (CFG), this is allowed. BNFToLPEG uses parsing expression grammar (PEG), which resolves this ambiguity. When using the choice rule |, you're expressing a preference. The earlier rule in the list is tried first. If it parses correctly, that's the one that the parse uses. This simple rule, first parse wins, simplifies everything.
Putting part of a rule inside [] makes it optional, which is another kind of choice: this or nothing. The parser tries to parse the contents first. If that does not succeed, it backtracks and tries the second choice: nothing. Nothing always works and succeeds, so [] always succeeds by itself, even if what is inside it doesn't. Note that [] won't accept some of an incompletely successful parse; it's all or nothing. It's also called the one-or-zero operator.
There are variations on [] that aren't always necessary to use, but can still make your grammar a little more tidy. The first is putting an asterisk after the end bracket. []* is more than one-or-zero; sometimes called the star or repetition operator, it matches as many repetitions as it can get. If there are ten repetitions, it matches all of them. If at any point the next attempt fails, the whole operator doesn't fail: it returns as many as it could get and leaves the next one alone, and it will also succeed if there were zero matches.
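For instance, repetition makes a comma-separated list easy to express; <NounList> here is an illustrative rule, not one from the grammar above:

```text
<NounList> = <Noun> [, <Noun>]*
```

A single <Noun> matches with zero repetitions, and a longer list like dog, ball, dog matches with two.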
Instead of square brackets, you can substitute curly brackets before the asterisk. {}* accepts multiple matches, but with a minimum of one: zero matches won't succeed. Of course, if you wanted to accept zero matches, you could put the curly brackets in optional brackets, but then it would be the same as using the optional brackets alone: [ {}* ] and []* are identical operations.
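A typical use is matching a number, where at least one digit is required; this <Number> rule is an illustrative assumption:

```text
<Number> = {<Digit>}*
<Digit> = 0|1|2|3|4|5|6|7|8|9
```

An empty string won't parse as a <Number>, but 0 will, and so will 867.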
Curly braces without the star are also valid. {} accepts exactly one of what's inside it, and needs the inside to succeed for the curly braces to succeed. Okay, you might ask, how is that better than just writing the inside without the curly braces? Good point: the curly braces and no braces at all are identical operations! But they can make the grammar clearer in one very specific case: when you want to put choice operators inside a part of the rule, not the whole rule.
Without braces, a rule could end up offering the choices "Hey, let's you and me eat", "drink", or "go dancing after work", and while the computer would have no reason to doubt you, we can see that's not the intended rule.
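A sketch of both versions (the rule name <Suggestion> is an assumption): in the first, the | choices stretch across the whole rule; in the second, curly braces confine the choice to the middle.

```text
<Suggestion> = Hey, let's you and me eat | drink | go dancing after work
<Suggestion> = Hey, let's you and me { eat | drink | go dancing } after work
```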
You can even write [], a pair of brackets with nothing inside. It will always succeed; there was just nothing to do. A rule like <EmptyRule> = [] always succeeds, and always consumes no text.
There are some fine points to know about the empty parse and why it is important. First of all, the empty parse consumes none of the text, so what happens when you use []*? Would it successfully consume no text, then repeat successfully to consume no text, and continue until the universe ends? Nope! Brackets will not accept an empty parse. []* has nothing inside, so every repetition fails immediately, but []* will still succeed as a whole, because success is optional.
Given the text Abraham Lincoln had a goldfish., <NeverSuccessfulRule> cannot ever succeed, because brackets require their inner rule to consume text.
{}@ only succeeds if the inner rule succeeds, like {}, but it won't consume any of the text; the pointer will still be at the same place afterward. This gives your grammar a chance to confirm something in two different ways. You can also use negative lookahead, succeeding only if the text does not match the inner rule, with []@. Negative lookahead succeeds when the parse fails, and fails if the parse succeeds.
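A sketch of both lookahead operators; <NegativeLookaheadRule> is named in the text, while the positive-lookahead rule name is an assumption:

```text
<PositiveLookaheadRule> = {Sir,}@ <Sentence>
<NegativeLookaheadRule> = [But,]@ <Sentence>
```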
One rule can require a <Sentence> to begin with Sir, (in case you insist on being referred to as Sir). <NegativeLookaheadRule> accepts a <Sentence>, but only if it does not begin with But, (so if you don't accept excuses, at least the ones starting with But, will fail to parse).
If <InnerRule> succeeds, the inner []@ fails, but since it does fail, the outer []@ succeeds. If <InnerRule> fails, the inner []@ succeeds, and the outer []@ fails. Nesting []@ twice is equivalent to {}@.
In addition to that, you have a more casual option. You can tell the parser what text you want the rule to return with the => operator. This is put after the rule, on the same line or on the next line. After =>, write the rule output, which can include the rules used by that rule. The rules used in the output will be replaced by their parsed text, or by any output set for them.
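A sketch of a rule with an output, reusing the toy sentence grammar (the extra <Verb> and <Noun> choices are assumptions):

```text
<Sentence> = <Verb> the <Noun>. => The <Noun> is supposed to have me <Verb> it now.
<Verb> = Walk to | Look at
<Noun> = Guitar | dog | ball
```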
For example, Walk to the Guitar. will produce the output The Guitar is supposed to have me Walk to it now.
Suppose we want to parse arithmetic like 10 + 20 - 30 + 40. Look at this example, which uses recursion:
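The rule shapes here are reconstructed from the outputs described next, with <Number> assumed to match a run of digits:

```text
<Sum> = <Number> + <Sum> => addition( <Number>, <Sum> )
<Sum> = <Number> - <Sum> => subtraction( <Number>, <Sum> )
<Sum> = <Number>
```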
When given just 10, the output will be just 10.
When given 10 + 20, the rule will recurse (call itself) once after parsing +. The second inner call will produce 20, and then the whole output will be addition( 10, 20 ).
10 + 20 - 33 + 44 will recurse three times, producing the nested output addition( 10, subtraction( 20, addition( 33, 44 ) ) ). The parser produced exactly what we asked it to, but there's one unfortunate problem: this grammar doesn't get the right answer! Sums are evaluated from left to right: we calculate ((10 + 20) - 33) + 44. But this parser drilled in from the right: 10 + (20 - (33 + 44)).
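The fix is to put the recursion on the left instead of the right. A sketch, under the same assumptions:

```text
<Sum> = <Sum> + <Number> => addition( <Sum>, <Number> )
<Sum> = <Sum> - <Number> => subtraction( <Sum>, <Number> )
<Sum> = <Number>
```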
Now 10 + 20 - 33 + 44 will correctly recurse from the left side, producing the nested output addition( subtraction( addition( 10, 20 ), 33 ), 44 ).
Parsing expression grammars have traditionally had caveats about left recursion. Using the two-pointer system mentioned earlier, the problem with the previous rule is that the first step of parsing <Sum> is to parse <Sum>, of which the first step is to parse <Sum>; the attempt would never end. BNFToLPEG is equipped to handle the situation. When it detects left recursion (defined as: attempting to parse a rule at the same location in the text where that rule is already being parsed), it performs a two-step process. First, the original attempt to parse the rule is allowed to continue, with any attempts to left-recurse blocked. On the condition that this succeeds in producing a non-left-recursive parse (the first 10), it is used as a seed for the next steps.
The next steps are the same routine, repeated. The left-recursive rule is parsed again, but when it encounters its left recursion, the pre-parsed seed is considered to be the inner result. At each round, the resulting parse is required to succeed, and to consume more text doing so, to become the new pre-parsed seed for the next round. When the final attempt either fails or produces a smaller parse (often just the original seed), the process ends, and the last successful attempt is used as the final result.
Let's try to make the arithmetic parser even better by adding multiplication and division:
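Continuing the left-recursive sketch, adding two more versions might look like this:

```text
<Sum> = <Sum> + <Number> => addition( <Sum>, <Number> )
<Sum> = <Sum> - <Number> => subtraction( <Sum>, <Number> )
<Sum> = <Sum> * <Number> => multiplication( <Sum>, <Number> )
<Sum> = <Sum> / <Number> => division( <Sum>, <Number> )
<Sum> = <Number>
```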
The input 1 * 2 + 3 will produce addition( multiplication( 1, 2 ), 3 ). But the input 1 + 2 * 3 will produce multiplication( addition( 1, 2 ), 3 ), because the parser is just proceeding left to right, and that violates the rule that multiplication must go first. How about reordering the rules, so multiplication and division get first crack?
That doesn't fix it: the parser still proceeds through the text from left to right, regardless of the order of * and +. And if reordering the rules like this could control what order the operators parsed in, then addition would always come before subtraction, and you don't want that either, because 1 - 2 + 3 would turn into subtraction( 1, addition( 2, 3 ) ).
How about putting <Sum> on the right side, instead of <Number>?
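A sketch of that idea, with <Sum> on both sides of each operator (again, an assumption about the grammar's exact shape):

```text
<Sum> = <Sum> + <Sum> => addition( <Sum>, <Sum> )
<Sum> = <Sum> - <Sum> => subtraction( <Sum>, <Sum> )
<Sum> = <Sum> * <Sum> => multiplication( <Sum>, <Sum> )
<Sum> = <Sum> / <Sum> => division( <Sum>, <Sum> )
<Sum> = <Number>
```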
This parses 1 + 2 * 3 as addition( 1, multiplication( 2, 3 ) ) and parses 1 * 2 + 3 as addition( multiplication( 1, 2 ), 3 ). The left side <Sum> takes the first version available to it, but on the right side, <Sum> has to take a later version. Then the left side, which contains the right-side parse so far, encloses it with its own right side again. And again, until the <Sum> is finished parsing.
Putting the arithmetic parsing techniques so far all together, this grammar uses common names in arithmetic: Expr, short for expression (a series of values and operators), and Factor (one number, or an expression in parentheses). Expressions are parsed and rewritten respecting operator precedence.
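A sketch of such a grammar, assuming literal parentheses and the same output convention as before:

```text
<Expr> = <Expr> + <Expr> => addition( <Expr>, <Expr> )
<Expr> = <Expr> - <Expr> => subtraction( <Expr>, <Expr> )
<Expr> = <Expr> * <Expr> => multiplication( <Expr>, <Expr> )
<Expr> = <Expr> / <Expr> => division( <Expr>, <Expr> )
<Expr> = <Factor>
<Factor> = <Number>
<Factor> = ( <Expr> ) => <Expr>
```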
The grammar is editable! You can just select with the mouse and keyboard. There is no way to hurt the page by typing; errors can stop a parser, but they cannot freeze or crash the page no matter what you enter. One fun trick is to add more kinds of factors to the end of the list.
BNFToLPEGParser(grammar).parse(input).output will construct a parser, parse the input, and produce the output. Using the success parameter instead of output will access an object containing { index, endIndex, subrules[], rule, version }, or null if the input didn't fit the grammar. See the text of the script for documentation.
The techniques in this document are just ideas and you are always free to use them wherever you can. You're also invited to use the code in personal noncommercial projects and change or improve it as necessary.
BNFToLPEG : (c) 2023 Hypervariety Custom Programming