Hypervariety BNFToLPEG

To use, you can download or link at BNFToLPEG.js.

What is it?

BNFToLPEG is a compact JavaScript library that will read in a description of a parsing expression grammar. A grammar defines a set of named rules and how they are combined. Given a grammar, you could parse text and diagram it automatically, link the result to other functions, or even convert the result into a similar grammar. BNF (Backus–Naur form) is a good choice for simple grammars because it is easy to read the declarations (rules, written out).

Simple example

In this simple example, we denote rules like this: <Rule>, by enclosing their names in angle brackets. Everything after the name and the equals sign is considered the text of the rule itself except for the special [] (optional) and | (choices). Text parsing with a grammar begins with a start rule, in this case <Sentence>. <Sentence> = <Verb> the <Noun>. <Sentence> = <Person>, [would you] [please] <Verb> the <Noun>? <Verb> = look at | walk to | throw | eat | learn about <Noun> = ball | dog | guitar | turkey sandwich <Person> = you | Dad | Mr. President | Dog

Given this example, sort of a diagram of a very small part of the English language, we can examine a sentence to see if it fits the grammar. These sentences will fit:

Look at the ball. Eat the turkey sandwich. Dad, please walk to the guitar? Mr. President, would you please learn about the turkey sandwich? Eat the guitar.

These sentences will not fit, because they use text that is not in the grammar, or they do not fit any of the declarations:

Eat the chocolate. Mr. President, dog guitar. Dad, would you throw the you? Eat the dad.

How it works

So how, exactly, is Look at the dog. compared to a <Sentence>? The technique is to essentially have two pointers, which is like using one finger of your hand. One hand points to where you are in the text, and one hand points to where you are in the rules. You can try it with your own hands! Do programmers really use their hands? Of course we do! Whatever works!

Look at the dog.<Sentence> = <Verb> the <Noun>.
<Verb> = look at | walk to | throw | eat | learn about
<Noun> = ball | dog | guitar | turkey sandwich

Start with your pointers at the beginning of Look at the dog. and the beginning of <Sentence>. At all times, the text is compared to the rule. At the beginning of <Sentence>, we have <Verb>. So the rule pointer makes a side trip to <Verb>, finds that Look at is a correct match, and scans over it. Now that <Verb> is done, the rule pointer returns to the next part of <Sentence> with the remaining text the dog.

The second part of <Sentence> is the, making a match, leaving the text dog. and the rule pointer takes another side trip to parse <Noun>. The first test, ball, is not a match, so the parser moves over it to the next choice, dog, which does work, and <Noun> is complete. Then the last token in <Sentence> is the final period. The start rule was parsed successfully, so the parse was successful.

Another example

Parsing text in a grammar is a very mechanical process, but describing a grammar needs a human being (for now). Since a computer does not understand anything, the grammar needs to be designed to be appropriate and expressive for the situation it's to be used in. Here's another grammar: <PhoneNumber> = [ [ 1 [-] ] [(] [<D><D><D>] [)] ] <D><D><D> [-] <D><D><D><D> <D> = 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0

The common way of formatting the digits of United States telephone numbers, such as 867-5309, or 415-5551212, or 1 (123)-123-4242, fits this grammar. However, it's difficult to read on a single line, which helps conceal a tiny error. We could break it out into a slightly friendlier description:

<PhoneNumber> = [<AreaCode>] <SevenDigits> <AreaCode> = [<Prefix>] (<D><D><D>) <AreaCode> = [<Prefix>] <D><D><D> <Prefix> = 1 [-] <SevenDigits> = <D><D><D> [-] <D><D><D><D> <D> = 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0

What we see now is more readable. A <PhoneNumber> optionally starts with an <AreaCode> before the seven digits. The seven digits may be separated in the right place with an optional hyphen. The <AreaCode> has an optional <Prefix>, which is 1 and an optional hyphen. The optional nature of some features are interesting. You can see that the seven digits can be broken up by a hyphen, but it has to be in the right place: 86753-09 is not a successful parse. The 1- prefix is optional, and you can have just the 1 (1800-555-1222), but you can't have just the hyphen (-800-555-1222), because the hyphen alone would not succesfully parse the prefix, and although the prefix is optional, the area code has no rule to accept the hyphen.

In this more readable version, we see there are two versions of <AreaCode> one after the other. The first lets you parenthesize the area code, and the second lets you leave off the parentheses. Writing alternate versions of the same rule is similar to the choice operator |. The parser will try the versions in the order they're provided, and always accepts the first one that works.

With BNF, the grammar has some room for interpretation in the case that two versions could both fit the next piece of text. In context-free grammar (CFG), this is allowed. BNFToLPEG uses parsing expression grammar (PEG), which resolves this ambiguity. When using the choice rule |, you're expressing a preference. The earlier rule in the list is tried first. If it parses correctly, that's the one that the parse uses. This simple rule – first parse wins – simplifies everything.

Brackets

Enclosing rules or pieces of text in [] makes them optional, which is another kind of choice: this or nothing. The parser tries to parse the contents first. If it does not succeed, it backtracks and tries the second choice: nothing. Nothing always works and succeeds, so [] always succeeds by itself, even if what is inside it doesn't. Note that [] won't accept some of an incompletely successful parse; it's all or nothing. It's also called the one-or-zero operator.

There are variations on [] that aren't always necessary to use, but can still make your grammar a little more tidy. The first is putting an asterisk after the end bracket. []* is more than one-or-zero; sometimes called the star or repetition, it matches as many as it can get. If there are ten repetitions, it matches all of them. If at any time when parsing it, the next attempt fails, the whole operator doesn't fail. It returns as many as it could get, and leaves the next one, and will also succeed if there were zero matches.

Instead of square brackets, you can substitute curly brackets before the asterisk. {}* accepts multiple matches, but with a minimum of one. Zero matches won't succeed. Of course, if you wanted to accept zero matches, you could put them in optional brackets, but then it would be the same as using the optional brackets alone: [ {}* ] and []* are actually identical operations.

Curly braces without the star is also valid. {} accepts exactly one of what's inside it, and needs the inside to succeed for the curly braces to succeed. Okay, you might ask, how is that better than just writing the inside without the curly braces? Good point – the curly braces and no braces at all are identical operations! But, it can make the grammar clearer in one very specific case: when you want to put choice operators inside a part of the rule, not the whole rule.

<Sentence> = Hey, let's you and me { eat | drink | go dancing } after work. As you can see, that wouldn't make much sense without the curly braces: <Sentence> = Hey, let's you and me eat | drink | go dancing after work. Without the curly braces, that rule would match the three possibilities: "Hey, let's you and me eat", "drink", or "go dancing after work", and while the computer would have no reason to doubt you, we can see that's not the intended rule.

About the empty parse

<EmptyRule> = [] The empty parse is when a rule allows no text to be consumed. You can express the empty parse with [], a pair of brackets with nothing inside. It will always succeed. There was just nothing to do. In the above sample, <EmptyRule> always succeeds, and always consumes no text.

There are some fine points to know about the empty parse and why it is important. First of all, the empty parse consumes none of the text, so what happens when you use []*? Would it successfully consume no text, and then repeat successfully to consume no text, and then continue until the universe ends? Nope! Brackets will not accept an empty parse. []* has nothing inside. But []* will succeed because success is optional.

<RepeatedEmptyRule> = [ ]* <UsuallyEmptyRule> = [ Abraham Lincoln had a goldfish ] <NeverSuccessfulRule> = { } In the above sample, <RepeatedEmptyRule> always succeeds with zero matches, and always consumes no text. <UsuallyEmptyRule> always succeeds as well, and will consume no text except Abraham Lincoln had a goldfish. <NeverSuccessfulRule> cannot ever succeed, because brackets require their inner rule to consume text.

Lookahead and negative lookahead

The lookahead operation is one more variation on bracket syntax. {}@ only succeeds if the inner rule succeeds like {}, but it won't consume any of the text. The pointer will still be at the same place afterward. This gives your grammar a chance to confirm something in two different ways. You can also use negative lookahead, succeeding only if the text does not match the inner rule, with []@. Negative lookahead succeeds when the parse fails, and fails if the parse succeeds. <LookaheadRule> = { Sir, }@ <Sentence> <NegativeLookaheadRule> = [ But, ]@ <Sentence> In the above, <LookaheadRule> accepts a <Sentence>, but only if it begins with Sir, (in case you insist on being referred to as Sir). <NegativeLookaheadRule> accepts a <Sentence>, but only if it does not begins with But, (so if you don't accept excuses, at least ones starting with But, they will fail to parse). <DoubleNegativeLookaheadRule> = [ [ <InnerRule> ]@ ]@ Can you see what <DoubleNegativeLookaheadRule> does? If <InnerRule> is successful, the inner []@ fails, but since it does fail, the outer []@ succeeds. If <InnerRule> fails, the inner []@ succeeds, and the outer []@ fails. Nesting []@ twice is equivalent to {}@.

Output from the rule

Beyond whether a parse succeeds or fails, there is the question of what choices were made in it. Was Dad supposed to eat the turkey sandwich, or the dog? The parse will return the selected choices in a JavaScript object, so a scripter can easily drill in to find exactly which way the parse went. The parser can also find the longest attempt, which is the one that consumed the most text, even if it failed.

In addition to that, you have a more casual option. You can tell the parser what text you want the rule to return with the => operator. This is put after the rule on the same line or on the next line. After =>, write the rule output, which can include the rules used by that rule. The rules used in the output will be replaced by their parsed text or by any output set for them.

<Sentence> = <Verb> the <Noun>. => The <Noun> is supposed to have me <Verb> it now. <Verb> = look at | walk to | throw | eat | learn about <Noun> = ball | dog | guitar | turkey sandwich The above rule <Sentence> if given the successful parse Walk to the Guitar. will produce the output The Guitar is supposed to have me Walk to it now.

Recursion

Sometimes it's convenient for a grammar to be recursive. This is when a rule uses itself directly (direct recursion), or when it uses a rule that uses it (indirect recursion). The most common example is to parse simple arithmetic, like 10 + 20 - 30 + 40. Look at this example, without no recursion: <Sum> = <Number> [ <PlusOrMinus> <Number> ]* <Number> = { <Digit> }* <PlusOrMinus> = + | - The shortcoming with this is that, although you'll get a success if the rule parses, you don't know how many terms are in the expression and so the <Sum> rule isn't helpful writing an output. The insight here is that if you write the rule using the rule, by being recursive, you can have an output. See the next example, which has recursion. Each version of <Sum> grabs a number, but the first two versions allow the number to be followed by another operation or number. <Sum> = <Number> + <Sum> => addition( <Number> , <Sum> ) | <Number> - <Sum> => subtraction( <Number> , <Sum> ) <Sum> = <Number> When given a single number 10, the output will be just 10.

When given 10 + 20, the rule will recurse (call itself) once after parsing +. The second inner call will produce 20 and then the whole output will be addition( 10, 20 ).

10 + 20 - 33 + 44 will recurse three times, producing the nested output addition( 10, subtraction( 20, addition ( 33, 44 ) ) ). The parser produced exactly what we asked it to, but there's one unfortunate problem: this grammar doesn't get the right answer! Sums are evaluated from the left to the right: we calculate ((10 + 20) - 33) + 44. But this parser drilled in from the right: 10 + (20 - (33 + 44)).

Left recursion

We need a rule that looks more like this: <Sum> = <Sum> + <Number> => addition( <Sum> , <Number> ) | <Sum> - <Number> => subtraction( <Sum> , <Number> ) | <Number> Now 10 + 20 - 33 + 44 will correctly recurse from the left side, producing the nested output addition( subtraction( addition( 10, 20 ), 33 ), 44).

Parsing expression grammars have traditionally had caveats about left recursion. Using the two-pointer system mentioned earlier, the problem with the previous rule that the first step of parsing <Sum> is to parse <Sum>, of which the first step is to parse <Sum>; the attempt would never end. BNFToLPEG is equipped to handle the situation. When it detects left recursion (defined as: attempting to parse a rule at the same location in the text that rule is already being parsed) it performs a two-step process. First, the original attempt to parse the rule is allowed to continue with any attempts to left-recurse blocked. On the condition this succeeds in producing a non-left-recursive parse (the first 10), it uses it as a seed for the next steps.

The next steps are the same repeated routine. The left-recursive rule is parsed again, but when it encounters its left recursion, the pre-parsed seed is considered to be the inner result. The resulting parse at each point is required to succeed, and to consume more text doing so, to become the new pre-parsed seed for the next round. At the end of this process, the final attempt either having failed or having produced a smaller parse (often the original seed), the last successful attempt is used as the final result.

Let's try to make the arithmetic parser even better by adding multiplication and division:

<Sum> = <Sum> + <Number> => addition( <Sum> , <Number> ) | <Sum> - <Number> => subtraction( <Sum> , <Number> ) | <Sum> * <Number> => multiplication( <Sum> , <Number> ) | <Sum> / <Number> => division( <Sum> , <Number> ) | <Number> Again, this will correctly parse, but it isn't quite what we want. The input 1 * 2 + 3 will produce addition( multiplication ( 1, 2 ), 3 ). But the input 1 + 2 * 3 will produce multiplication( addition ( 1, 2 ), 3 ), because the parser is just proceeding left to right, and that violates the rule that the multiplication must go first. How about reordering the rules, so multiplication and division gets first crack? <Sum> = <Sum> * <Number> => multiplication( <Sum> , <Number> ) | <Sum> / <Number> => division( <Sum> , <Number> ) | <Sum> + <Number> => addition( <Sum> , <Number> ) | <Sum> - <Number> => subtraction( <Sum> , <Number> ) | <Number> This actually makes no difference to the parse, because although earlier versions of a rule have preference, there isn't any ambiguity between * and +. And if reordering the rules like this could control what order the operators parsed in, then addition would always come before subtraction and you don't want that either because 1 - 2 + 3 would turn into subtraction( 1, addition( 2, 3 ) ).

How about putting <Sum> on the right side, instead of <Number>?

<Sum> = <Sum> + <Sum> => addition( <Sum> , <Sum> ) | <Sum> - <Sum> => subtraction( <Sum> , <Sum> ) | <Sum> * <Sum> => multiplication( <Sum> , <Sum> ) | <Sum> / <Sum> => division( <Sum> , <Sum> ) | <Number> That's almost it. Now we can even define operator precedence. <Sum> = <Sum> + <Sum> => addition( <Sum> , <Sum> ) | <Sum> - <Sum> => subtraction( <Sum> , <Sum> ) <Sum> = <Sum> * <Sum> => multiplication( <Sum> , <Sum> ) | <Sum> / <Sum> => division( <Sum> , <Sum> ) <Sum> = <Number> Now BNFToLPEG will parse 1 + 2 * 3 as addition( 1, multiplication( 2, 3 ) ) and parse 1 * 2 + 3 as addition( multiplication( 1, 2 ), 3 ). The left side <Sum> takes the first version available to it, but then on the right side, <Sum> has to take a later version. Then the left side, which contains the right side parse so far, is enclosing it with its own right side again. And again, until the <Sum> is finished parsing.

Right recursion

Exponentiation recurses right to left: 4^5^6^7 is 4^(5^(6^7)). This is probably because, if you're really raising them left to right, it's just multiplying the exponents: ((1^2)^3)^4 equals 1^(2*3*4) and you would have just typed that if it was what you had meant, so right recursion is assumed. This one-line rule will collect the exponents from the right side: <Exponent> = <Number>^{ <Exponent> | <Number> } If we want to rewrite the output in the two different forms, we do have to break the choice into another line. <Exponent> = <Number>^<Power> => power(<Number>,<Power>) <Power> = <Exponent> | <Number> Another right-recursive operator is negation, the minus sign before a value, which means to multiply by -1. You could use extra minus signs in a row to negate twice or thrice, which should read as -(-(-(value))) of course. <Negation> = - <Negated> => multiplication( -1, <Negated> ) <Negated> = <Negation> | <Number>

Editable grammar with <Expr> and <Factor>

Putting the arithmetic parsing techniques so far all together, this grammar uses common names in arithmetic: Expr, short for expression (a series of values and operators) and Factor (one number, or an expression in parenthesis). Expressions are parsed and rewritten respecting operator precedence.

The grammar is editable! You can just select with the mouse and keyboard. There is no way to hurt the page by typing; errors can stop a parser but they cannot freeze or crash the page no matter what you entered. One fun trick is to add more kinds of factors to the end of the list.

<Expr> = <Expr> + <Expr> => add( <Expr> , <Expr> ) | <Expr> - <Expr> => sub( <Expr> , <Expr> ) <Expr> = <Expr> * <Expr> => mult( <Expr> , <Expr> ) | <Expr> / <Expr> => div( <Expr> , <Expr> ) <Expr> = <Exponent> <Exponent> = <Factor> ^ <Exponent> => pow( <Factor>, <Exponent> ) <Exponent> = <Factor> ^ <Factor> => pow( <Factor>, <Factor> ) <Expr> = <Factor> <Factor> = (<Expr>) <Factor> = - <Factor> => mult( -1, <Factor> ) <Factor> = { <Digit> }*
12 * (3+1)^4^1 + 5 * -(6/7+8)
...

How to use BNFToLPEG

The BNFToLPEG library has no options. First, download or link the JavaScript at BNFToLPEG.js

BNFToLPEGParser(grammar).parse(input).output will construct a parser, parse input, and produce output. Using the success parameter instead of output will access a object containing { index, endIndex, subrules[], rule, version }, or null if the input didn't fit the grammar. See the text of the script for documentation.

The techniques in this document are just ideas and you are always free to use them wherever you can. You're also invited to use the code in personal noncommercial projects and change or improve it as necessary.

BNFToLPEG : (c) 2023 Hypervariety Custom Programming