#1
  1. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,959
    Rep Power
    1014

    You think a regex solves your problem? Think twice!


    Hi,

    Since regexes are probably the most abused feature in the whole IT industry, I thought it would make sense to point out what they can do, what they can't do, and why they're not appropriate most of the time.

    Many people think that regexes are some kind of all-powerful text manipulation tool which can parse any expression, be it an email address, a JSON file or a complete HTML document. That's not the case. Regexes are very primitive grammars. In fact, they describe regular languages, the least expressive class of formal grammars (the bottom of the Chomsky hierarchy). While they work well for simple expressions like dates or telephone numbers, they're completely and utterly unsuitable for anything more complex, let alone a full-blown language like HTML.

    I repeat: You cannot parse HTML with a regex.

    Regexes can only express simple strings consisting of a straight sequence of patterns. A date falls into this category, because it's a digit group followed by a separator followed by another digit group etc. An HTML document, however, is far more complex. It's not just a series of tags. It has nested expressions consisting of tag pairs. A regex isn't even remotely powerful enough to parse this. Matching each closing tag to its opening tag means keeping track of how many tags are still open, and a regex has no way to count. It simply cannot parse arbitrarily nested expressions.*
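
To make the nesting problem concrete, here is a minimal sketch in Python (chosen only for illustration; the thread itself is language-agnostic). The sample markup and the `DepthTracker` class are made up for this example. The regex stops at the first closing tag it sees, while a parser can keep a counter of open tags.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: one <div> nested inside another.
html = "<div>outer <div>inner</div> tail</div>"

# Naive regex: the non-greedy match ends at the FIRST </div>,
# so the content of the outer element is cut off in the middle.
naive = re.search(r"<div>(.*?)</div>", html).group(1)


class DepthTracker(HTMLParser):
    """Counts open <div> tags, which is exactly what a classic regex cannot do."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1


tracker = DepthTracker()
tracker.feed(html)
tracker.close()

print(repr(naive))        # 'outer <div>inner'
print(tracker.max_depth)  # 2
```

Switching to a greedy `.*` only moves the failure elsewhere: with two sibling elements it would swallow both of them in one match.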

    So whenever you need to parse HTML, regexes are not the solution. This also applies to XML, JSON, YAML and any other complex language.

    Why would you want to write your own parser, anyway? Every mainstream programming language already has parsers for all common data formats. They're either built-in or available as third-party libraries. If you're using PHP, for example, there's the DOMDocument class for parsing HTML. And JSON can be parsed with json_decode().
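
As a rough illustration of how little code an existing parser needs, here is a sketch using Python's standard `json` module, which plays the role that json_decode() plays in PHP. The JSON payload is invented for the example; note the escaped quotes and the nested structure, both of which tend to trip up hand-written regexes.

```python
import json

# Hypothetical payload with nested objects and escaped quotes inside a string.
raw = '{"user": {"name": "Ann \\"The Parser\\" Lee", "tags": ["a", "b"]}}'

data = json.loads(raw)          # one call, full JSON grammar handled
name = data["user"]["name"]     # escapes are already decoded
tags = data["user"]["tags"]     # nested arrays become ordinary lists

print(name)  # Ann "The Parser" Lee
print(tags)  # ['a', 'b']
```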

    Using an existing tool is almost always better than writing your own regex. It saves you time, it keeps the code readable and flexible, it's much more likely to work correctly, and it's usually more robust (many regexes choke on simple variations like additional whitespace).

    If you need to parse, say, a URL, you could do that with a regex. You can post your question and then wait for somebody to write the regex for you (assuming you can't do it yourself). This may take 15 minutes or several hours. In the end, you'll have some cryptic regex which may or may not work. But you might as well spend 2 minutes looking up the right function or class and use that instead. Which do you choose?
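
For the URL case, the "2 minutes" option might look like the following sketch. It uses Python's standard `urllib.parse` module (PHP has parse_url() for the same job); the URL itself is a made-up example.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL, used only for illustration.
url = "https://example.com:8080/forum/thread.php?id=42&page=2#post3"

parts = urlsplit(url)          # splits scheme, host, port, path, query, fragment
query = parse_qs(parts.query)  # decodes the query string into a dict of lists

print(parts.scheme)    # https
print(parts.hostname)  # example.com
print(parts.port)      # 8080
print(parts.path)      # /forum/thread.php
print(query)           # {'id': ['42'], 'page': ['2']}
print(parts.fragment)  # post3
```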

    Regexes are great tools in some situations. But most of the time, they're being overused and misused for applications far beyond their scope. If you need to do text manipulations, don't just ask for a regex. Think about the problem and then choose the right tool. 90% of the time, it's not gonna be a regex.



    *For the sake of completeness: Some languages like Perl have extended pseudo-regexes which actually can express nested patterns. But don't even think about using them unless you really, truly know what you're doing.
    The 6 worst sins of security • How to (properly) access a MySQL database with PHP

    Why can't I use certain words like "drop" as part of my Security Question answers?
    There are certain words used by hackers to try to gain access to systems and manipulate data; therefore, the following words are restricted: "select," "delete," "update," "insert," "drop" and "null".
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    836
    Rep Power
    496
    I definitely agree with what you are saying.

    With just one limitation, not in what you personally are saying but in the way some people sometimes apply this idea. I regularly see questions like this on forums: "I have this 500-million-line XML (or HTML or JSON, whatever) file, and there is one single place in this file which says '10 million dollars' and I need to replace it with '10 million dollars or more'. Can you give me the regex to do that?" And immediately, some well-trained fanatics come up and warn against parsing XML, HTML or whatever other format with a regex. Wait a minute. Do you really want to use DOM or whatever to load 30 GB of data into a parse tree in memory? Good luck, it is probably doomed to fail. Using SAX is probably better, but still complete overkill: what is needed is character recognition, nothing more. The OP did not ask to parse XML, HTML or whatever; she or he just wanted to replace one occurrence of a sentence with another. Whether the file is HTML, XML, JSON, CSV or whatever is irrelevant; a regex might be the best way to do it in that context. Or maybe it is not. But it might just be. Let us not rule it out on quasi-religious considerations.
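
A minimal sketch of that kind of streaming replacement, in Python for illustration. The function name `replace_literal` and the sample data are hypothetical, and the sketch assumes the target sentence never spans a line break. Since the target is a fixed string, a plain string replacement is enough here; no parse tree is built and only one line is in memory at a time.

```python
import io


def replace_literal(src, dst, old, new):
    """Copy text from src to dst line by line, replacing a literal string.

    Assumption: `old` never spans a line break. Returns the number of hits.
    """
    count = 0
    for line in src:
        count += line.count(old)
        dst.write(line.replace(old, new))
    return count


# In-memory streams stand in for the huge input and output files;
# with real files you would pass objects returned by open().
src = io.StringIO("<a>1</a>\n<price>10 million dollars</price>\n<b>2</b>\n")
dst = io.StringIO()

hits = replace_literal(src, dst, "10 million dollars", "10 million dollars or more")
print(hits)  # 1
```

The file format never enters the picture, which is the point being made: the task is string replacement, not parsing.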

    But, again, at the deeper level of what you are saying, I fully agree with you.
  4. #3
  5. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,959
    Rep Power
    1014
    Let me put it this way: If you know what you're doing, and if you actively choose a regex over a "real" parser, there's nothing wrong with that. That's not what I'm arguing against (why should I?).

    The trouble starts when people choose regexes simply because that's all they know. And I think that's the vast majority. Beginners especially use regexes for everything. Need to replace one string with another? Regex! Need to parse a date? Regex! Need to analyze a file path? Regex! Need to transform an HTML document? Regex!

    Regular expressions become some kind of answer to everything and crowd out all other (better) solutions. That's what I'm arguing against.

    I think for an average programmer (who usually doesn't have to deal with 500 million lines of XML), the legitimate use cases of regexes are few and far between. Sure, oftentimes you can come up with a quick-and-dirty regex hack to kinda sorta solve the problem. But why waste your time writing some messy regex (or have somebody else do it for you) when you might as well implement a real solution with a single function call?

    Regexes are cryptic, fragile and hard to maintain and reuse. Even a seemingly simple task like fetching an HTML attribute already proves difficult for a regex. No, it's not /attr="[^"]*"/. This may work 80% of the time, but it doesn't account for whitespace around the equals sign, single-quoted attributes or unquoted attributes. It's a quick hack, nothing more. An HTML parser, on the other hand, will give you a short piece of correct and readable code you can change and reuse easily. Need to look for an additional attribute? No problem. Try doing that with a regex.
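
The attribute problem can be sketched as follows, using Python's standard `html.parser` as a stand-in for whatever HTML parser your language offers. The four sample tags and the `HrefCollector` class are invented for this example; the regex is the quick hack from above, applied to `href`.

```python
import re
from html.parser import HTMLParser

# Four legal ways to write the same attribute (hypothetical samples).
samples = [
    '<a href="x.html">',    # double-quoted
    "<a href='x.html'>",    # single-quoted
    '<a href = "x.html">',  # whitespace around "="
    "<a href=x.html>",      # unquoted
]

# The quick-hack regex only knows the first form.
pattern = re.compile(r'href="([^"]*)"')


class HrefCollector(HTMLParser):
    """Collects every href value the parser reports."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href":
                self.hrefs.append(value)


def hrefs_via_parser(fragment):
    collector = HrefCollector()
    collector.feed(fragment)
    collector.close()
    return collector.hrefs


regex_hits = [bool(pattern.search(s)) for s in samples]
parser_hits = [hrefs_via_parser(s) for s in samples]

print(regex_hits)   # [True, False, False, False]
print(parser_hits)  # [['x.html'], ['x.html'], ['x.html'], ['x.html']]
```

Looking for an additional attribute means changing one comparison in `handle_starttag`, not rewriting a pattern.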

    You said that regexes are fine for simply replacing characters without taking care of the context. I think that's exactly the problem. XML, JSON etc. are meaningful data. They're not just characters, so they shouldn't be treated like that. Replacing, say, all b elements of an HTML document with strong elements is a semantic operation. When people treat it as a search-and-replace task for strings, they kind of miss the point.
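
Here is a small sketch of the b-to-strong example as a semantic operation, again in Python for illustration. It assumes the input is well-formed XML (XHTML), because the standard `xml.etree.ElementTree` module is an XML parser; real-world HTML would need a proper HTML parser such as PHP's DOMDocument. The sample markup is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed fragment. A plain search-and-replace of "<b"
# would also hit <br/>, and "<b>" alone would miss the tag with an attribute.
source = '<p>A <b class="x">bold</b> move, <br/>not a <b>hack</b>.</p>'

doc = ET.fromstring(source)

# Rename every b element; attributes, text and children stay untouched.
for element in list(doc.iter("b")):
    element.tag = "strong"

result = ET.tostring(doc, encoding="unicode")
print(result)
```

The code talks about elements, not about characters, so the br element and the class attribute are never at risk.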

    You talked about "well-trained fanatics". I agree. But on the other end of the spectrum, there are the people who just hand out some regex without actually analyzing the problem and trying to find a proper solution. Sure, the OP will be happy for the nonce. But in the long run, I think it's better to go a bit deeper than just "I need a regex for this and that", "Here you go".
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    836
    Rep Power
    496
    I think that we really agree. I was just pointing to special cases or behaviors that are, in my view, a form of exaggeration. Not very common, but happening once in a while.
