
I want to parse user-defined domain specific languages. These languages are typically close to mathematical notations (I am not parsing a natural language). Users define their DSL in a BNF notation, like this:

expr ::= LiteralInteger
       | ( expr )
       | expr + expr
       | expr * expr

Input like 1 + ( 2 * 3 ) must be accepted, while input like 1 + must be rejected as incorrect, and input like 1 + 2 * 3 must be rejected as ambiguous.

A central difficulty here is coping with ambiguous grammars in a user-friendly way. Restricting the grammar to be unambiguous is not an option: that's the way the language is; the idea is that writers prefer to omit parentheses when they are not needed to avoid ambiguity. As long as an expression isn't ambiguous, I need to parse it, and if it is ambiguous, I need to reject it.

My parser must work on any context-free grammar, even ambiguous ones, and must accept all unambiguous input. I need the parse tree for all accepted input. For invalid or ambiguous input, I ideally want good error messages, but to start with I'll take what I can get.

I will typically invoke the parser on relatively short inputs, with the occasional longer input. So the asymptotically faster algorithm may not be the best choice. I would like to optimize for a distribution of around 80% inputs less than 20 symbols long, 19% between 20 and 50 symbols, and 1% rare longer inputs. Speed for invalid inputs is not a major concern. Furthermore, I expect a modification of the DSL around every 1000 to 100000 inputs; I can spend a couple of seconds preprocessing my grammar, not a couple of minutes.

What parsing algorithm(s) should I investigate, given my typical input sizes? Should error reporting be a factor in my selection, or should I concentrate on parsing unambiguous inputs and possibly run a completely separate, slower parser to provide error feedback?

(In the project where I needed that (a while back), I used CYK, which wasn't too hard to implement and worked adequately for my input sizes but didn't produce very nice errors.)

Gilles 'SO- stop being evil'

3 Answers


Probably the ideal algorithm for your needs is Generalized LL parsing, or GLL. This is a very new algorithm (the paper was published in 2010). In a way, it is the Earley algorithm augmented with a graph-structured stack (GSS) and LL(1) lookahead.

The algorithm is quite similar to plain old LL(1), except that it doesn't reject grammars that are not LL(1): it just tries out all possible LL(1) parses. It maintains a directed graph for every point in the parse, so that if a parse state is encountered that has already been dealt with, the two vertices are simply merged. This makes it suitable even for left-recursive grammars, unlike plain LL. For exact details on its inner workings, read the paper (it's quite readable, though the label soup requires some perseverance).
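
As a rough illustration (my own sketch in Python, not GLL itself and not code from the paper): since GLL is essentially Earley plus a GSS and LL(1) lookahead, a minimal Earley recognizer gives a feel for the chart/worklist style and for the merging of states already seen, which below shows up as the set-membership checks. The grammar encoding is just an illustrative transcription of the question's grammar, with INT standing for LiteralInteger.

    GRAMMAR = {
        "expr": [["INT"], ["(", "expr", ")"],
                 ["expr", "+", "expr"], ["expr", "*", "expr"]],
    }

    def recognizes(tokens, start="expr"):
        n = len(tokens)
        # chart[k] holds Earley items (lhs, rhs, dot, origin) ending at position k
        chart = [set() for _ in range(n + 1)]
        for rhs in GRAMMAR[start]:
            chart[0].add((start, tuple(rhs), 0, 0))
        for k in range(n + 1):
            worklist = list(chart[k])
            while worklist:
                lhs, rhs, dot, origin = worklist.pop()
                if dot < len(rhs):
                    sym = rhs[dot]
                    if sym in GRAMMAR:                      # predict a nonterminal
                        for prod in GRAMMAR[sym]:
                            item = (sym, tuple(prod), 0, k)
                            if item not in chart[k]:        # merge: already-seen items are not redone
                                chart[k].add(item); worklist.append(item)
                    elif k < n and tokens[k] == sym:        # scan a terminal
                        chart[k + 1].add((lhs, rhs, dot + 1, origin))
                else:                                       # complete a finished item
                    for l2, r2, d2, o2 in list(chart[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            item = (l2, r2, d2 + 1, o2)
                            if item not in chart[k]:
                                chart[k].add(item); worklist.append(item)
        return any(lhs == start and dot == len(rhs) and origin == 0
                   for lhs, rhs, dot, origin in chart[n])

    print(recognizes(["INT", "+", "(", "INT", "*", "INT", ")"]))   # True
    print(recognizes(["INT", "+"]))                                # False

Roughly speaking, full GLL replaces this flat per-position bookkeeping with GSS nodes and uses LL(1) lookahead to prune the predictions, which is what lets it run at LL(1) speed on LL(1) grammars.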

The algorithm has a number of clear advantages relevant to your needs over the other general parsing algorithms that I know of. Firstly, implementation is very easy: I think only Earley is easier to implement. Secondly, performance is quite good: in fact, it is just as fast as LL(1) on grammars that are LL(1). Thirdly, recovering the parse is quite easy, and so is checking whether there is more than one possible parse.

The main advantage of GLL is that it is based on LL(1) and is therefore very easy to understand and debug: when implementing it, when designing grammars, and when diagnosing parses of particular inputs. It also makes error handling easier: you know exactly where each possible parse got stuck and how it might have continued. You can easily report the possible parses at the point of the error and, say, the last 3 points where parses got stuck. Alternatively, you could try to recover from the error by marking the production that the furthest-reaching parse was working on as 'complete' for that parse, and seeing whether parsing can continue after that (say someone forgot a parenthesis). You could even do that for, say, the 5 parses that got the furthest.

The only downside to the algorithm is that it's new, which means there are no well-established implementations readily available. This may not be a problem for you: I've implemented the algorithm myself, and it was quite easy to do.

Alex ten Brink

My company (Semantic Designs) has used GLR parsers very successfully to do exactly what the OP suggests, parsing both domain-specific languages and "classic" programming languages, with our DMS Software Reengineering Toolkit. DMS supports source-to-source program transformations used for large-scale program restructuring, reverse engineering, and forward code generation, including automatic repair of syntax errors in a pretty practical way. Using GLR as a foundation, plus some other changes (semantic predicates, token-set input rather than just token input, ...), we've managed to build parsers for some 40 languages.

As important as the ability to parse full language instances, GLR has also proven extremely useful for parsing source-to-source rewrite rules. These are program fragments with a lot less context than a full program, and thus generally have more ambiguity. We use special annotations (e.g., insisting that a phrase correspond to a specific grammar nonterminal) to help resolve those ambiguities during/after parsing of the rules. By organizing the GLR parsing machinery and the tools around it, we get parsers for rewrite rules "for free" once we have a parser for their language. The DMS engine has a built-in rewrite-rule applier that can then be used to apply these rules to carry out the desired code changes.

Probably our most spectacular result is the ability to parse full C++14, in spite of all the ambiguities, using a context-free grammar as a basis. I note that all the classic C++ compilers (GCC, Clang) have given up the ability to do this and use hand-written parsers (which IMHO makes them much harder to maintain, but then, they're not my problem). We have used this machinery to carry out massive changes to the architecture of large C++ systems.

Performance-wise, our GLR parsers are reasonably fast: tens of thousands of lines per second. This is well below the state of the art, but we have made no serious attempt to optimize it, and some of the bottlenecks are in the character-stream processing (full Unicode). To build such parsers, we preprocess the context-free grammars using something quite close to an LR(1) parser generator; this normally runs on a modern workstation in ten seconds for big grammars the size of C++. Surprisingly, for very complex languages like modern COBOL and C++, the generation of lexers turns out to take about a minute; some of the DFAs defined over Unicode get pretty hairy. I just did Ruby (with a full subgrammar for its incredible regexps) as a finger exercise; DMS can process its lexer and grammar together in about 8 seconds.

Ira Baxter

There are many general context-free parsers that can parse ambiguous sentences (according to an ambiguous grammar). They come under various names, notably dynamic-programming or chart parsers. The best known one, and nearly the simplest, is probably the CYK parser that you have been using. That generality is needed because you have to handle multiple parses and may not know till the end whether you are dealing with an ambiguity or not.
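
As a rough sketch of that (mine, not from the original post): a CYK chart can count derivations per span instead of merely recording derivability, which gives the accept / reject / ambiguous decision directly at the end. The CNF encoding of the question's grammar (E, LP, X1, ..., with INT standing for LiteralInteger) is my own illustrative choice.

    from collections import defaultdict

    # Terminal rules of an illustrative CNF encoding of the question's grammar
    terminal_rules = {"INT": ["E"], "(": ["LP"], ")": ["RP"], "+": ["PLUS"], "*": ["TIMES"]}
    # Binary rules A -> B C, indexed by (B, C)
    binary_rules = defaultdict(list)
    for a, b, c in [("E", "LP", "X1"), ("X1", "E", "RP"),     # E -> ( E )
                    ("E", "E", "X2"), ("X2", "PLUS", "E"),    # E -> E + E
                    ("E", "E", "X3"), ("X3", "TIMES", "E")]:  # E -> E * E
        binary_rules[(b, c)].append(a)

    def parse_count(tokens, start="E"):
        """Number of distinct parse trees of `tokens`: 0 = reject, 1 = ok, >1 = ambiguous."""
        n = len(tokens)
        if n == 0:
            return 0
        # chart[i][l][A] = number of derivations of tokens[i:i+l] from nonterminal A
        chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n)]
        for i, tok in enumerate(tokens):
            for a in terminal_rules.get(tok, []):
                chart[i][1][a] += 1
        for length in range(2, n + 1):
            for i in range(n - length + 1):
                for split in range(1, length):
                    left, right = chart[i][split], chart[i + split][length - split]
                    for b, nb in left.items():
                        for c, nc in right.items():
                            for a in binary_rules.get((b, c), []):
                                chart[i][length][a] += nb * nc
        return chart[0][n][start]

    for s in ["1 + ( 2 * 3 )", "1 +", "1 + 2 * 3"]:
        toks = ["INT" if t.isdigit() else t for t in s.split()]
        print(s, "->", parse_count(toks))   # 1, 0 and 2 respectively

On the question's three examples this yields 1, 0 and 2 derivations, i.e. accept, reject as incorrect, and reject as ambiguous. Counting rather than enumerating trees keeps the work polynomial even when the number of parses blows up, and you can cap the counts at 2 if all you need is the ambiguity test.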

From what you say, I would think that CYK is not such a bad choice. You probably do not have much to gain by adding predictiveness (LL or LR), and it may actually have a cost by keeping apart computations that should be merged (especially in the LR case). Predictive parsers may also have a corresponding cost in the size of the parse forest produced (which may play a role in ambiguity errors). Actually, while I am not sure how to formally compare the adequacy of the more sophisticated algorithms, I do know that CYK gives good computation sharing.

Now, I do not believe there is much literature on general CF parsers for ambiguous grammars that should only accept unambiguous input. I do not recall seeing any, probably because, even for technical documents or programming languages, syntactic ambiguity is acceptable as long as it can be resolved by other means (e.g. ambiguity in Ada expressions).

I am actually wondering why you want to change your algorithm, rather than stick to what you have. That might help me understand what kind of change could best help you. Is it a speed issue, is it the representation of parses, or is it the error detection and recovery?

The best way to represent multiple parses is with a shared forest, which is simply a context-free grammar that generates only your input, but with exactly the same parse trees as the DSL grammar. That makes it very easy to understand and process. For more details, I suggest you look at this answer I gave on the Linguistics site. I do understand that you are not interested in getting a parse forest as such, but a proper representation of the parse forest can help you give better messages as to what the ambiguity problem is. It could also help you decide that the ambiguity does not matter in some cases (associativity), if you wanted to do that.
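
As an illustrative sketch (mine, not from the answer above), such a shared forest can be read off a CYK-style chart: each forest nonterminal is a (symbol, start, end) triple, and each way of building a span contributes one production. The CNF encoding (E, LP, X1, ...) is the same illustrative one as in the counting sketch earlier.

    from collections import defaultdict

    terminal_rules = {"INT": ["E"], "(": ["LP"], ")": ["RP"], "+": ["PLUS"], "*": ["TIMES"]}
    binary_rules = defaultdict(list)
    for a, b, c in [("E", "LP", "X1"), ("X1", "E", "RP"),
                    ("E", "E", "X2"), ("X2", "PLUS", "E"),
                    ("E", "E", "X3"), ("X3", "TIMES", "E")]:
        binary_rules[(b, c)].append(a)

    def shared_forest(tokens, start="E"):
        """Grammar whose nonterminals are (symbol, i, j) triples and which generates only `tokens`."""
        n = len(tokens)
        chart = [[set() for _ in range(n + 1)] for _ in range(n)]
        forest = defaultdict(list)                   # (A, i, j) -> list of right-hand sides
        for i, tok in enumerate(tokens):
            for a in terminal_rules.get(tok, []):
                chart[i][1].add(a)
                forest[(a, i, i + 1)].append([tok])
        for length in range(2, n + 1):
            for i in range(n - length + 1):
                for split in range(1, length):
                    for b in chart[i][split]:
                        for c in chart[i + split][length - split]:
                            for a in binary_rules.get((b, c), []):
                                chart[i][length].add(a)
                                forest[(a, i, i + length)].append(
                                    [(b, i, i + split), (c, i + split, i + length)])
        return forest if (start, 0, n) in forest else None

    forest = shared_forest(["INT", "+", "INT", "*", "INT"])   # the ambiguous "1 + 2 * 3"
    for lhs in sorted(forest):
        for rhs in forest[lhs]:
            print(lhs, "->", rhs)

Here the ambiguity is visible directly: the root nonterminal ('E', 0, 5) ends up with two productions, one per way of bracketing the expression, which is a convenient hook for explaining to the user what the ambiguity is. (A real implementation would also prune forest nonterminals unreachable from the root.)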

You mention the processing-time constraints for your DSL grammar, but give no hint as to its size (which does not mean that I could answer with figures if you did).

Some error processing can be integrated into these general CF algorithms in simple ways. But I would need to understand what kind of error processing you expect before being more affirmative. Do you have some examples?

I am a bit ill at ease saying more, because I do not understand what your motivations and constraints really are. On the basis of what you say, I would stick with CYK (and I do know the other algorithms and some of their properties).

babou