One of my first jobs was parsing XML with regular expressions. Like TFA, the goal was not to construct the entire document tree, but rather to extract data. It worked great!
For years and years I ran a web service that scraped another site's HTML to extract data. There were other APIs doing the same thing. They used a proper HTML parser; I just used the moral equivalent of String.IndexOf() to walk a cursor through the text, locating the start and end of the strings I wanted, and String.Substring() to extract them. Theirs were slow and sometimes broke when unrelated structural HTML changes were made. Mine was a straight linear scan over the text and didn't care at all about the HTML in between the parts I was scraping. The data structure I was parsing was even arbitrarily recursive: by counting the start and end tags, I could tell at each step how many levels up or down I had moved, without building any tree structures in memory. Worked great, reliably, and I'd do it again.
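A minimal sketch of that approach, transposed to Python (the original was .NET, and the tag names here are invented for illustration):

```python
# Cursor-walking scrape: find the next structural marker, slice out the
# text we want, and track nesting depth by counting start/end tags.
# No tree is ever built; markup between the interesting parts is skipped.
html = "<ul><li>alpha<ul><li>beta</li></ul></li><li>gamma</li></ul>"

def scrape(text):
    """Return (depth, item_text) pairs from a single linear scan."""
    pos, depth, items = 0, 0, []
    while True:
        hits = [(text.find(tag, pos), tag)
                for tag in ("<ul>", "</ul>", "<li>")]
        hits = [(i, t) for i, t in hits if i != -1]
        if not hits:
            break
        i, tag = min(hits)  # whichever marker comes next
        if tag == "<ul>":
            depth += 1
        elif tag == "</ul>":
            depth -= 1
        else:  # "<li>": the item text runs up to the next tag
            j = text.find("<", i + len(tag))
            items.append((depth, text[i + len(tag):j]))
        pos = i + len(tag)
    return items
```

Here scrape(html) yields [(1, 'alpha'), (2, 'beta'), (1, 'gamma')]: the nesting level comes out of a simple counter, never an in-memory tree.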
It really is a central example of the bell curve meme, isn't it?
The reason we tell people not to parse HTML/XML/whatever with regular expressions isn't so much that you can't use regular (CS sense) patterns to extract information from regular (CS sense) strings* that happen to be drawn from a language that can also express non-regular strings; it's that when you let the median programmer try, he'll screw it up.
So we tell people you "can't" parse XML with regular expressions, even though the claim is nonsense if you think about it, so that the ones who aren't smart and independent-minded enough to see through the false impossibility claim don't create messes the rest of us have to clean up.
(* That is, strings belonging to some regular language L_r (which you can parse with a state machine), L_r being a subset of the L you really want to parse (which you can't). L_r can be a surprisingly large subset of L, e.g. all XML with nesting depth of at most 1,000. The result isn't necessarily a practical engineering solution, but it's a CS possibility, and sometimes more practical than you think, especially because in many cases nesting depth is schema-limited.)
Concrete example: "JSON" in general isn't a regular language, but JavaScript-ecosystem package.json, constrained by its schema, IS.
Likewise, XML isn't a regular language in general, but AndroidManifest.xml specifically is!
Is it a good idea to use "regex" (whatever that means in your language) to parse either kind of file? No, probably not. But it's just not honest to tell people it can't be done. It can be.
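For instance, with a hypothetical schema-constrained manifest (Python's re used purely for illustration):

```python
import re

# With the schema pinning things down (flat object, plain string values,
# no exotic escapes), the field we want lives in a regular language.
manifest = '{"name": "left-pad", "version": "1.3.0", "license": "WTFPL"}'
version = re.search(r'"version"\s*:\s*"([^"]*)"', manifest).group(1)
```

This breaks on general JSON (escaped quotes, nested objects), but the schema is exactly what rules those out.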
Can regular expressions parse the subset of XML that I need to pull something out of a document? Maybe.
We have enough library "ergonomics" now that it's no more difficult to use a regex than a full XML parser in dynlangs. Back when this wasn't the case, it really did mean the difference between a one- or two-line solution and about 300 lines of SAX boilerplate.
The less like 'random' XML the document is, the better the extraction will work. As soon as something oddball gets tossed in that drifts from the expected pattern, things will break.
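Both halves of that are easy to see with a hypothetical machine-generated feed:

```python
import re

# When the producer is a machine emitting the same shape every time,
# extraction is a one-liner.
feed = "<items><item id='1'>foo</item><item id='2'>bar</item></items>"
names = re.findall(r"<item id='\d+'>([^<]*)</item>", feed)
# ...but a hand-edited oddball like <item  id="3"> (extra space, double
# quotes) silently falls out of the match -- the breakage described above.
```
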
Of course. But the mathematical, computer-science level truth is that you can make a regular pattern that recognizes a string in any context-free language so long as you're willing to place a bound on the length (or equivalently, the nesting depth) of that string. Everything else is a lie-to-children (https://en.wikipedia.org/wiki/Lie-to-children).
Our discourse does a terrible job of distinguishing impossible things from things that are merely ill-advised. Intellectual honesty requires us to be up front about the difference.
Yeah, I'd almost certainly reject a code review using, say, Python's re module to extract stuff from XML, but while doing so, I would give every reason except "you can't do that".
If I’m not mistaken, even JSON couldn’t be parsed by a regex due to the recursive nature of nested objects.
But in general we aren’t trying to parse arbitrary documents, we are trying to parse a document with a somewhat-known schema. In this sense, we can parse them so long as the input matches the schema we implicitly assumed.
> If I’m not mistaken, even JSON couldn’t be parsed by a regex due to the recursive nature of nested objects.
You can parse ANY context-free language with regex so long as you're willing to put a cap on the maximum nesting depth and length of constructs in that language. You can't parse "JSON" but you can, absolutely, parse "JSON with up to 1000 nested brackets" or "JSON shorter than 10GB". The lexical complexity is irrelevant. Mathematically, whether you have JSON, XML, sexps, or whatever is irrelevant: you can describe any bounded-nesting context-free language as a regular language and parse it with a state machine.
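A sketch of that construction in Python, with nested lists of `x` atoms standing in for JSON:

```python
import re

def bounded(depth):
    """Regex for nested lists of `x` atoms at most `depth` brackets deep.

    Built by plain string substitution: each round allows one more level
    of nesting, so the result is an ordinary finite pattern -- i.e. the
    bounded language is regular.
    """
    pat = "x"  # depth 0: a bare atom
    for _ in range(depth):
        # one more level: an atom, or a bracketed comma-list of the level below
        pat = rf"(?:x|\[{pat}(?:,{pat})*\])"
    return pat

deep3 = re.compile(bounded(3))
```

deep3.fullmatch("[x,[x,x],[[x]]]") succeeds, while a depth-4 input like "[[[[x]]]]" is rejected: exactly the bounded-nesting trade-off. Note the pattern roughly doubles in length per level, which is part of why this is a CS possibility rather than an engineering recommendation.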
It is dangerous to tell the wrong people this, but it is true.
(Similarly, you can use a context-free parser to understand a context-sensitive language provided you bound that language in some way: one example is the famous C "lexer hack" that allows a simple LALR(1) parser to understand C, which, properly understood, is a context-sensitive language in the Chomsky sense.)
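A toy illustration of that feedback loop (not real C tooling; names invented):

```python
# The "lexer hack" in miniature: the parser feeds typedef names back to
# the lexer, so the token stream itself becomes context-dependent while
# the grammar the parser sees stays simple.
typedef_names = set()

def classify(name):
    return "TYPE" if name in typedef_names else "IDENT"

first = classify("foo")    # before any typedef: just an identifier
typedef_names.add("foo")   # parser just reduced: typedef int foo;
second = classify("foo")   # now `foo * bar;` lexes as a declaration
```
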
The best experience for the average programmer is describing their JSON declaratively in something like Zod and having their language runtime either build the appropriate state machine (or "regex") to match that schema or, if it truly is recursive, using something else to parse --- all transparently to the programmer.
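A toy sketch of that idea (all names invented; real schema libraries like Zod are far richer): compile a flat declared schema straight into one regex.

```python
import re

def compile_schema(fields):
    """Turn {field: value-pattern} into one regex for the whole object.

    Assumes a flat object with fields in a fixed order -- i.e. a schema
    whose language really is regular, so a state machine suffices.
    """
    parts = [rf'"{name}"\s*:\s*{pat}' for name, pat in fields.items()]
    return re.compile(r"\{\s*" + r"\s*,\s*".join(parts) + r"\s*\}")

schema = compile_schema({"name": r'"[^"]*"', "major": r"\d+"})
```

A runtime could pick this path when the schema is non-recursive and fall back to a real parser otherwise, transparently to the programmer.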
What everyone forgets is that regexes as implemented in most programming languages are a strict superset of mathematical regular expressions. E.g., PCRE has "subroutine references" that can be used to match balanced brackets, and .NET has "balancing groups" that can similarly be used to do so. In general, most programming languages can recognize at least the context-free languages.
It's impossible to parse arbitrary XML with regex. But it's perfectly reasonable to parse a subset of XML with regex, which is a very important distinction.
Oh, no, I didn't mean regex or not; I meant regex over XML vs. regex over a string. The first invites the illusion everyone is bringing up, that XML is not regular; seeing clearly that the input is ultimately just a string gives you the correct set of assumptions.