#1
  1. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2001
    Posts
    52
    Rep Power
    13

    Need help with re module


    Could anyone write a quick regular expression script to extract whatever sits between a pair of tags? For example, given <title>this needs to become a python var</title>, the text between the tags should end up in a Python variable. Thanks!
  2. #2
  3. Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Feb 2004
    Location
    London, England
    Posts
    1,585
    Rep Power
    1373
    There are two answers to this question: (a) yes, it is trivial, and (b) no, it is virtually impossible.

    If you do not care about nested tags then the answer is trivial:

    regex = r'<title>(.*?)</title>'

    will match a <title> element, and group 1 of the match holds the text between the tags. Pass re.DOTALL if that text can span several lines, since '.' does not match a newline by default. However, if tags may be nested inside tags of the same name, it cannot be done with a single regex (as far as I know) and needs more complex parsing.
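As a minimal sketch of the non-nested case, the snippet below pulls the text out of the first <title> pair and stores it in a variable. The names TITLE_RE, extract_title and page are made up for illustration, and the IGNORECASE flag is an assumption about the input (it also accepts <TITLE>).

```python
import re

# Non-greedy capture group: group 1 is the text between the tags.
# DOTALL lets "." span line breaks; IGNORECASE also accepts <TITLE>.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.DOTALL | re.IGNORECASE)


def extract_title(html):
    """Return the text between the first <title> pair, or None if absent."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


# Hypothetical input, modelled on the example in the question.
page = "<html><head><title>this needs to become a python var</title></head></html>"
title = extract_title(page)
print(title)
```

Use TITLE_RE.findall(html) instead if the input may hold several such elements and all of them are wanted.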

    Dave - The Developers' Coach
  4. #3
  5. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2001
    Location
    Houston, TX
    Posts
    383
    Rep Power
    13
    And another caveat: if you're going to be doing heavy HTML/XML/SGML parsing, you should use the standard Python libraries for it. Parsing via regular expressions is often not the best solution, simply because the regexes themselves tend to get pretty ugly. Simple string processing and/or the *ML parsing libs in the standard library are often better solutions.
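For comparison, here is a rough sketch of the same extraction done with a standard-library parser. It assumes Python 3's html.parser module; the post does not say which library was meant, and TitleParser and extract_title_with_parser are hypothetical names.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text that appears inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling the handlers.
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_title_with_parser(html):
    """Return the concatenated <title> text, or "" if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.title_parts).strip()


print(extract_title_with_parser("<html><head><TITLE>My page</TITLE></head></html>"))
```

This is longer than the regex, but it copes with attributes on the tag and mixed-case markup without extra flags.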
    Debian - because life's too short for worrying.
    Best. (Python.) IRC bot. ever.
