Thread: re.sub help

    #1
    I'm having problems with a Python regex. I've got a load of pages containing links to /pages/code/ (where code varies and is made up of letters and numbers). I can match these strings, but I can't rewrite them. This is the code I've got:

    Code:
    new = re.sub(r'''"/pages/[a-z0-9]*/"''', r'''"http://intranet/pages/\1"''', text)
    But that outputs http://intranet/pages/SOH (SOH being the non-printable control character 0x01) instead of the code. How can I get it to output the correct links?
    #2
    The \1 in the replace string should be replaced with the first group that matched... but your regex does not have any groups in it!

    To create a group, put round brackets (parentheses) around the part of the pattern that you want inserted into the replace string, e.g.
    Code:
    new = re.sub(r'''"/pages/([a-z0-9]*)/"''', r'''"http://intranet/pages/\1"''', text)
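    As a quick check of the fix above, here is a minimal sketch run against a made-up sample string (the `text` value is an assumption for illustration, not the poster's real pages). It also shows one way an SOH character can turn up: in a non-raw string literal, '\1' is an octal escape for chr(1). Whether that is what happened in the original code is a guess.

```python
import re

# Hypothetical sample input; the real pages are not shown in the thread.
text = '<a href="/pages/ab12/">one</a> <a href="/pages/xyz9/">two</a>'

# The parentheses create group 1, which \1 refers to in the replacement.
pattern = r'"/pages/([a-z0-9]*)/"'
# As in the thread, the replacement drops the trailing slash.
replacement = r'"http://intranet/pages/\1"'

new = re.sub(pattern, replacement, text)
print(new)

# Without the r prefix, '\1' is the single character chr(1), i.e. SOH.
print('\1' == '\x01')
```

    With the sample above, both links come out as http://intranet/pages/ab12 and http://intranet/pages/xyz9, and the last line prints True.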
    Dave - The Developers' Coach
    #3
    For something like this it may be better just to split the string on '/' and pick the appropriate one. Using regexps where you don't have to is kind of ugly.

    Code:
    >>> url = 'http://intranet/pages/foobar'
    >>> path = url[7:]  # drop the leading 'http://' (7 characters)
    >>> path
    'intranet/pages/foobar'
    >>> # maxsplit=2 keeps everything after the second '/' in one piece
    >>> (host, directory, rest) = path.split('/', 2)
    >>> print(host, directory, rest)
    intranet pages foobar
    >>> url = 'http://intranet/pages/foobar/even/more/directories'
    >>> path = url[7:]
    >>> (host, directory, rest) = path.split('/', 2)
    >>> print(host, directory, rest)
    intranet pages foobar/even/more/directories
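    A variation on the same idea, sketched under the assumption that Python 3's standard urllib.parse module is available: let urlsplit separate the host from the path instead of slicing off a fixed seven characters. The helper name split_page_url is made up for this example.

```python
from urllib.parse import urlsplit

def split_page_url(url):
    """Return (host, first directory, remainder of the path) for a URL."""
    parts = urlsplit(url)
    # parts.path starts with '/', e.g. '/pages/foobar/even/more'
    directory, _, rest = parts.path.lstrip('/').partition('/')
    return parts.netloc, directory, rest

print(split_page_url('http://intranet/pages/foobar'))
print(split_page_url('http://intranet/pages/foobar/even/more/directories'))
```

    This should not depend on the scheme being exactly 'http://', so an https URL would split the same way.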
    Debian - because life's too short for worrying.
    Best. (Python.) IRC bot. ever.
