
Well, as the author of ANTLR, I will disagree with your assessment of the tool, as you can imagine. It never claimed to be a general tool. Specifically, it cannot handle indirect left recursion; hence, it is not a general context-free grammar parser. It is, however, the sweet spot between generality and speed. That quote was having a bit of fun using a tiny bit of hyperbole. If you take a look at our academic paper, you will see that using the general parsers is like walking through a minefield. One never knows when they will hit a landmine and the parser takes overnight to parse a file. It took me 30 years to find the sweet spot in power and speed (with the help of Sam Harwell). I welcome the introduction of new tools, but I'm not sure your assessment is accurate, nor your understanding of the parsing landscape.

Also as the paper says, "GLR return multiple parse trees (forests) for ambiguous grammars because they were designed to handle natural language grammars, which are often intentionally ambiguous. For computer languages, ambiguity is almost always an error." When was the last time you wanted to parse a language where the same syntax meant two different things? C++ and a few other cases, sure, but most languages are specifically designed to be unambiguous.

You are welcome to use a tool that handles all context free grammars, but the speed and ambiguity issues are not something I care to deal with.

You also mischaracterize ANTLR's handling of left recursion. It handles immediate left recursion automatically through a rewrite, such as for expression rules. It does not handle indirect left recursion. You may not have seen ANTLR 4, and are basing your assessment on ANTLR 3? There is no aborting at compile time for left recursion, except in the unusual case of indirect left recursion.
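
The rewrite mentioned above can be illustrated with a minimal hand-written sketch (this is not ANTLR's actual implementation, just the standard transformation): a rule like `expr : expr '+' term | term` would recurse forever in naive recursive descent, but the equivalent `expr : term ('+' term)*` parses one term and then loops.

```python
# Standard elimination of immediate left recursion, sketched by hand.
# Grammar:  expr : expr '+' term | term   (left-recursive, would loop)
# Rewrite:  expr : term ('+' term)*       (parse a term, then iterate)

def parse_expr(tokens, pos=0):
    """Parse a left-associative '+' expression; returns (tree, next_pos)."""
    tree, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == '+':
        right, pos = parse_term(tokens, pos + 1)
        tree = ('+', tree, right)        # rebuild left-associative nesting
    return tree, pos

def parse_term(tokens, pos):
    tok = tokens[pos]
    assert tok.isdigit(), f"expected number at position {pos}"
    return int(tok), pos + 1

tree, _ = parse_expr(['1', '+', '2', '+', '3'])
# left-associative result: ('+', ('+', 1, 2), 3)
```

Note that the loop rebuilds the left-associative tree the original left-recursive rule would have produced, which is why the rewrite is transparent to the grammar author.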

http://www.antlr.org/papers/allstar-techreport.pdf



> That quote was […] using a tiny bit of hyperbole.

Then you certainly won't have a problem changing the wording to make it true, right? The Owl author explains the limitations of his software up-front, why don't you?

> It never claimed to be a general tool.

The documentation does give that impression.

> If you take a look at our academic paper

Consider that HN is populated by craftsmen and businessmen, not scientists. We judge the tools on how well they work in practical terms.

> One never knows when they will hit a landmine and the parser takes overnight to parse a file.

No wonder, I see it's O(n⁴) for ALL(*)/ANTLR. You need to keep up with the current state of the art, which is O(n³).

> It took me 30 years

Time and effort do not count for anything; the quality of the end result does.

> I'm not sure […] your understanding of the parsing landscape [is accurate]

I can see the practical results of the various software I tried. Does the software get the task done? Yes, fine, that's a candidate for further consideration. No, fine, I can eliminate it. Documentation does not say…? Now one has to come up with experiments to find out the flaws and limitations, multiplied by every single user. That's a waste of people's time.

> When was the last time you wanted to parse a language where the same syntax meant two different things? […] ambiguity is almost always an error.

That was 2014, and I did not design that language. It does contain ambiguities, it can't be helped. No one gains anything by simply wishing it weren't so; there was a task to be done. There exists software that can cope, you should take that as an incentive to improve yours.

> but the speed and ambiguity issues are not something I care to deal with.

… or don't improve. But why would you refuse to? Is not your reputation at stake?

> You also mischaracterize ANTLR's handling of left recursion. […] It does not handle indirect left recursion.

That's what I meant. I was incorrect when I wrote "any grammar that's left-recursive".


It always amazes me that so many "famous" people still take time to read HN and respond to questions (even insulting ones)! I used Antlr4 in the past and have to say it's a well-designed system, though at the time (2016) the Python runtime which I was most interested in was not very stable.

I've implemented several parser generators myself (nothing as polished as Antlr) and worked on a "rapid prototyping" parser generator tool a while ago (https://github.com/adewes/parsejoy). The idea was a tool that could create parsers on the fly without requiring compilation of any code, and let the user control and modify parsing logic using Lua if necessary. I started writing it in Go but switched to C++ later in order to get more predictable performance.

In the tool I implemented several parsing strategies, notably parsing expression grammars (PEGs) as well as a GLR parser (based on the Elkhound paper). And while I find GLR parsing powerful, it's not without problems, as e.g. errors are hard to debug. That said, it's pretty amazing that you can throw almost any grammar at a GLR parser and it will (usually) be able to parse it. As you said, though, writing a "bad" grammar can yield exponential time complexity on some input strings.

PEGs are also nice in general but require you to put more work into the grammar, as you are basically required to resolve ambiguities yourself using the "and" or "not" operators. Also, a naive implementation of a PEG parser will have horrid performance, as it basically evaluates all possible alternatives via backtracking until it hits a valid one; for real-world languages (I've implemented e.g. a Python parser for testing) this will ruin your performance due to the large nesting depth of the rules (Python has around 8-12 levels of rules, I think). Using a prefix tree can alleviate this, as can packrat parsing, though that comes with its own cost, as memoizing parse results requires memory allocation.
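
The packrat idea is small enough to sketch: memoize each (rule, position) pair so every alternative is tried at most once per position, turning exponential backtracking into linear time. The grammar below is a made-up two-rule example, not from any real tool; `functools.lru_cache` stands in for the memo table.

```python
from functools import lru_cache

def make_parser(text):
    """Build a tiny packrat parser for  expr <- term '+' expr / term
    where term is a single digit (PEG-style ordered choice).
    Returns a function mapping a start position to the end position
    of the match, or None on failure."""

    @lru_cache(maxsize=None)          # packrat: memoize result per position
    def expr(pos):
        p = term(pos)
        if p is not None and p < len(text) and text[p] == '+':
            q = expr(p + 1)
            if q is not None:
                return q              # first alternative succeeded
        return term(pos)              # ordered choice: fall back to 'term'

    @lru_cache(maxsize=None)
    def term(pos):
        if pos < len(text) and text[pos].isdigit():
            return pos + 1            # consume one digit
        return None

    return expr

parse = make_parser('1+2+3')
parse(0)  # returns 5: the whole 5-character input was consumed
```

Without the caches the same code still works, but on pathological inputs each failed alternative re-parses the same suffixes over and over; the memo table is exactly the memory-allocation cost mentioned above.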

Anyway, in terms of popularity and number of actually implemented grammars, Antlr4 is probably by far the most relevant OS parser generator out there. Thanks for writing it :)

One suggestion I have is to consider improving the tooling for tokenization, as it's often a problem that can be quite challenging in itself. For example, tokenizing Python code is not easy, as the language is not context-free (due to the indentation that indicates nesting) and there are several types of newline characters to distinguish (those that occur inside bracketed expressions vs. those that occur outside of them).
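
The indentation part can be handled with a small stack in the lexer, the way Python's own tokenizer emits INDENT/DEDENT tokens. A minimal sketch (names and token shapes are illustrative, not from ANTLR or CPython):

```python
# Sketch of context-sensitive indentation tokenization: a stack of
# indentation widths turns leading whitespace into INDENT/DEDENT tokens,
# which lets the parser itself stay context-free.

def indent_tokens(lines):
    """Yield (kind, value) tokens; kind is 'INDENT', 'DEDENT', or 'LINE'."""
    stack = [0]                        # open indentation levels
    for line in lines:
        if not line.strip():
            continue                   # blank lines carry no indent info
        width = len(line) - len(line.lstrip(' '))
        if width > stack[-1]:
            stack.append(width)
            yield ('INDENT', width)
        while width < stack[-1]:       # one DEDENT per closed block
            stack.pop()
            yield ('DEDENT', width)
        yield ('LINE', line.strip())
    while stack[-1] > 0:               # close any blocks still open at EOF
        stack.pop()
        yield ('DEDENT', 0)

src = ['if x:', '    y = 1', '        z = 2', 'done = True']
toks = list(indent_tokens(src))
# kinds: LINE, INDENT, LINE, INDENT, LINE, DEDENT, DEDENT, LINE
```

A real lexer would also track the bracket nesting depth and suppress INDENT/DEDENT (and logical newlines) inside parentheses, which is the other context-sensitive wrinkle mentioned above.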


Howdy. Agreed. It'd be nice to have a simpler "match x if NOT followed by y" and something to handle context-sensitive lexical stuff like Python. I often just send all chars to the parser and do scannerless parsing. :)



