
Veering a bit off-topic, but does anyone have any pointers to important recent work on parsing? There are a lot of papers out there and I guess I don't know how to sift through them. I've heard the "parsing is solved" line before, but so much of my time is spent doing some kind of parsing that even incremental improvements are extremely interesting.


It really depends on the kind of parsing that you're doing.

Full context-free grammars are supported by "generalised parsing". The older stuff is GLR by Tomita, Earley parsing, and the CYK algorithm. The newer stuff based on Tomita is the particular rabbit hole I stuck with for a while. I read about SGLR, which eliminates separate lexers, and GLL, which is like GLR but based on the LL algorithm. The people who do GLL research also improved on SGLR, with better speed on right-nulled grammars. Then there are the SGLR improvements with disambiguation filters, automatic derivation of error recovery with island grammars, etc. The disambiguation filters include a kind of negation, making the current implementation of SGLR for SDF capable of parsing Boolean grammars, which are a superset of context-free grammars.

Anyway, there's more than context-free grammars. Definitely look into data-dependent grammars! They can express a lot of interesting things, like network protocol formats where you parse the length of the payload, then based on that length you know how much to interpret as the payload before you read the footer of the message. And you write all of that in a more declarative way and get a nice parser generated from it.
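To make the length-prefix idea concrete, here's a rough hand-written sketch (not generated from a grammar) of parsing a hypothetical message format: a 2-byte big-endian length, then that many payload bytes, then a 1-byte footer. The point is that the size of one field depends on a value parsed earlier, which is exactly what data-dependent grammars let you state declaratively.

```python
import struct

def parse_message(data: bytes) -> dict:
    """Parse a hypothetical length-prefixed message:
    2-byte big-endian length, `length` payload bytes, 1-byte footer."""
    (length,) = struct.unpack_from(">H", data, 0)  # data-dependent: length is parsed first
    payload = data[2:2 + length]                   # how much to consume depends on `length`
    footer = data[2 + length]                      # footer sits right after the payload
    return {"length": length, "payload": payload, "footer": footer}

# e.g. parse_message(b"\x00\x03abc\xff")
```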

There is so much more, but I think I should stop now :)


I can probably help if you can narrow things down a bit. Are you parsing deterministic languages (e.g. for programming languages), general context-free languages, binary file formats, or something else?

Because parsing is a field of little active research interest where most of the work happened more than 20 years ago, there are a lot of techniques from the 70s, 80s, and 90s that are relatively unknown.


I'm most interested in deterministic languages. Non-deterministic context-free languages would be extremely interesting as well but more out of curiosity than an applicable need. Thanks!


For deterministic languages, hand written recursive descent is usually the choice even today because of simplicity and ease of error reporting.

There are exceptions, but relatively few production compilers use anything else, and most of the innovations in parsing provide relatively little value in this space because they tend to be focused on better expressing complex languages rather than on improved ways of expressing / capturing errors, and it's the latter that would provide most benefit for deterministic languages.
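For anyone who hasn't written one: a minimal recursive descent parser for a toy grammar (my own illustration, not from any particular compiler) shows why error reporting is easy here: at every point you know exactly which rule you're in, so the error message can say precisely what was expected.

```python
import re

class ParseError(Exception):
    pass

def parse_expr(src: str) -> int:
    """Recursive descent for: expr -> term ('+' term)* ; term -> NUMBER | '(' expr ')'"""
    tokens = re.findall(r"\d+|[()+]|\S", src)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(tok):
        nonlocal pos
        if peek() != tok:
            raise ParseError(f"expected {tok!r} at token {pos}, got {peek()!r}")
        pos += 1

    def term():
        nonlocal pos
        t = peek()
        if t is not None and t.isdigit():
            pos += 1
            return int(t)
        if t == "(":
            pos += 1
            v = expr()
            expect(")")   # the message pinpoints the missing paren
            return v
        raise ParseError(f"expected number or '(' at token {pos}, got {t!r}")

    def expr():
        v = term()
        while peek() == "+":
            expect("+")
            v += term()
        return v

    v = expr()
    if peek() is not None:
        raise ParseError(f"trailing input at token {pos}: {peek()!r}")
    return v
```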


> Veering a bit offtopic, but does anyone have any pointers to important recent work on parsing?

Parser combinators are a relatively modern topic in parsing. The first research on the topic appeared in the late 1980s, but parser combinators became popular after Parsec [0], a practical implementation of parser combinators in Haskell. These ideas have since been ported to many implementations in different languages.

[0] Daan Leijen: "Parsec, a fast combinator parser", 2001 http://research.microsoft.com/en-us/um/people/daan/download/...
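The core idea is small enough to sketch in a few lines (this is my own toy illustration, not Parsec's API): a parser is just a function from input to a (value, remaining input) pair, or a failure, and combinators build bigger parsers out of smaller ones.

```python
# A parser is: str -> (value, rest) on success, or None on failure.

def char(c):
    """Parser for a single literal character."""
    def p(s):
        return (c, s[1:]) if s.startswith(c) else None
    return p

def seq(p1, p2):
    """Run p1 then p2 on the remaining input; pair up their results."""
    def p(s):
        r1 = p1(s)
        if r1 is None:
            return None
        v1, rest = r1
        r2 = p2(rest)
        if r2 is None:
            return None
        v2, rest2 = r2
        return ((v1, v2), rest2)
    return p

def alt(p1, p2):
    """Try p1; if it fails, try p2 on the same input."""
    def p(s):
        return p1(s) or p2(s)
    return p

ab = seq(char("a"), char("b"))   # parses "ab", yielding ('a', 'b')
```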


Check out ANTLR which is probably the best modern parser generator. ANTLR 4 uses "adaptive LL(*)" (aka "all star") for its underlying theory. http://www.antlr.org/papers/allstar-techreport.pdf


PEG (Packrat) is hot now.


It is important to distinguish between PEG and packrat.

PEG is a model for recognizers, distinct from traditional grammars whose theoretical model is usually based on generating strings rather than recognising them. (Yes, this is a bit backwards.) The most distinctive feature of PEGs is the ordered choice operator: traditional context-free grammars and regexes use unordered choice, but ordered choice is a closer model of how recursive descent parsers work in practice. Ordered choice naturally leads to implementations that do less backtracking than common regex matchers, e.g. Lpeg.
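A tiny illustration of why ordered choice matters (my own sketch, nothing Lpeg-specific): the first alternative that succeeds wins and the parser commits to it, so the order of alternatives changes the result, unlike the unordered `|` of a CFG.

```python
def lit(word):
    """Match a literal prefix; return (word, rest) or None."""
    def p(s):
        return (word, s[len(word):]) if s.startswith(word) else None
    return p

def ordered_choice(parsers, s):
    """PEG-style '/': try alternatives in order, commit to the first success."""
    for p in parsers:
        r = p(s)
        if r is not None:
            return r
    return None

# 'if' / 'ifdef': the first alternative wins even though 'ifdef' also matches,
# so 'if' / 'ifdef' and 'ifdef' / 'if' recognise different things.
r1 = ordered_choice([lit("if"), lit("ifdef")], "ifdef x")
r2 = ordered_choice([lit("ifdef"), lit("if")], "ifdef x")
```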

Packrat parsers are based on a clever memoised PEG matching algorithm that avoids exponential backtracking. They spend O(n) memory to get O(n) parsing time, so they are reliably fast but not good for long inputs. They are also effective in a pure, lazy setting like Haskell.
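The memoisation trick is easy to demonstrate (again a toy sketch of mine, not any particular implementation): cache each (rule, position) result so no rule is ever re-run at the same position, which is where both the linear time and the linear memory come from.

```python
from functools import lru_cache

def parse(src: str):
    """Packrat sketch for: E -> N '+' E / N ; N -> digit+.
    Each @lru_cache acts as the memo table keyed on position, so every
    rule runs at most once per position: O(n) time, O(n) memo entries."""

    @lru_cache(maxsize=None)          # memo table for rule N
    def number(pos):
        end = pos
        while end < len(src) and src[end].isdigit():
            end += 1
        return (int(src[pos:end]), end) if end > pos else None

    @lru_cache(maxsize=None)          # memo table for rule E
    def expr(pos):
        r = number(pos)
        if r is None:
            return None
        v, p = r
        if p < len(src) and src[p] == "+":   # ordered choice: try 'N + E' first
            r2 = expr(p + 1)
            if r2 is not None:
                v2, p2 = r2
                return (v + v2, p2)
        return (v, p)                        # fall back to plain N

    return expr(0)   # returns (value, characters consumed) or None
```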

Lpeg is a PEG parser but not a packrat parser.


Interesting paper on left-recursion in recursive descent parsers such as PEG and packrat parsers:

http://web.cs.ucla.edu/~todd/research/pub.php?id=pepm08


If you're interested in left-recursion in PEGS then, at the risk of gross immodesty, you may be interested in http://tratt.net/laurie/research/pubs/html/tratt__direct_lef...

With less risk of immodesty you may also find http://arxiv.org/pdf/1207.0443.pdf?ref=driverlayer.com/web interesting.

There's probably more recent work than these two papers, but I'm a little out of date when it comes to the PEG world.


Thanks, looks like you've put a lot of work into it and I'll enjoy reading it.


Yes, I implemented exactly this approach. But for binary expressions I am using Pratt parsing, which is faster and makes it easier to construct left-associative nodes.
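For readers who haven't seen it: a minimal Pratt-style (precedence-climbing) sketch of my own showing how binding powers replace left-recursive grammar rules, and how the left-associative nodes fall out of the loop naturally.

```python
def pratt(tokens):
    """Minimal Pratt / precedence-climbing parser over a pre-tokenised list.
    Atoms are bare tokens; binary operators come from PREC. Builds
    left-associative trees as nested (op, left, right) tuples."""
    PREC = {"+": 1, "-": 1, "*": 2, "/": 2}
    pos = 0

    def next_tok():
        nonlocal pos
        t = tokens[pos]
        pos += 1
        return t

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def parse_expr(min_prec=0):
        left = next_tok()                      # assume an atom (number / identifier)
        while peek() in PREC and PREC[peek()] >= min_prec:
            op = next_tok()
            right = parse_expr(PREC[op] + 1)   # the +1 makes the operator left-associative
            left = (op, left, right)           # node built as we loop, left to right
        return left

    return parse_expr()
```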



