#1
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2004
    Posts
    45
    Rep Power
    20

    Python Regex - Replace all occurrences of multiple spaces with dashes


    Here is my scenario:
    May 24th 2009 10:10:00 PM something something 10.10.10.10 url uri port something something "something 80 info" 10.10.10.10 something

    I need the above string to have all occurrences of "   " (3 spaces) replaced with " - - " (space dash space dash space) and "  " (2 spaces) replaced with " - " (space dash space).

    The trick is if the spaces are in a set of double quotes I need them to be ignored.

    I've tried lots of things and I'm just not seeing what to do. I really have nothing to start with for a regex, sorry.

    I've got code I use now to take care of it, but it's more than one regex, my goal is to use one regex if it can be done.

    Anyone got some ideas?
  2. #2
  3. Contributed User
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    Jun 2005
    Posts
    4,417
    Rep Power
    1871
    > I've got code I use now to take care of it, but it's more than one regex, my goal is to use one regex if it can be done.
    So use it, and get on with solving more pressing problems.

    You're not going to buy anything by making it one hellishly complicated regex. Even if you manage to create it, you'll need half a page of comment describing it in nauseating detail if you want to have any hope of changing it in the future (as early as next week say).

    If it's a few simple steps, you'll be able to change any one of them in a few seconds with minimal effort.

    Some hairy monster of the kind you're trying to get would take you another week to figure out.
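
For comparison, the "few simple steps" route might look like the sketch below. It is not the original poster's code (which is not shown in the thread); the name `dash_fill_steps` is made up for illustration, and it assumes double quotes are balanced and never escaped.

```python
def dash_fill_steps(text):
    """Replace runs of 3 and 2 spaces outside double quotes (hypothetical helper)."""
    # Splitting on '"' leaves unquoted text at even indexes and quoted text at
    # odd indexes, provided the quotes are balanced and never escaped.
    parts = text.split('"')
    for i in range(0, len(parts), 2):
        # Step 1: three spaces become " - - ".
        parts[i] = parts[i].replace("   ", " - - ")
        # Step 2: any remaining pair of spaces becomes " - ".
        parts[i] = parts[i].replace("  ", " - ")
    return '"'.join(parts)


print(dash_fill_steps('abc  def   ghi "ab   cd" pkm'))
```

Each step can be changed on its own, which is the trade-off being argued for here. Runs longer than three spaces are outside what this sketch is meant to handle.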
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2004
    Posts
    45
    Rep Power
    20
    Originally Posted by salem
    > I've got code I use now to take care of it, but it's more than one regex, my goal is to use one regex if it can be done.
    So use it, and get on with solving more pressing problems.

    You're not going to buy anything by making it one hellishly complicated regex. Even if you manage to create it, you'll need half a page of comment describing it in nauseating detail if you want to have any hope of changing it in the future (as early as next week say).

    If it's a few simple steps, you'll be able to change any one of them in a few seconds with minimal effort.

    Some hairy monster of the kind you're trying to get would take you another week to figure out.
    That's all well and good, but I was just curious if someone had an actual solution possibly, not a paragraph trying to talk me out of it haha, no offense.
    You tell me to "get on with solving more pressing problems", as if you know what problems I have?
    As far as "hellishly complicated regex" goes, who says it has to be complicated? In my experience, some of the crazier-sounding regexes end up looking very simple, followed by a moment of "Ah....". I would gain something from the regex: fewer lines and lower resource use than what I have now.
    Trust me when I say this, I don't need a lot of comments with my code and when I do use comments, clear and succinct is my goal... not long and nauseating.
    Thank you for your words.
  6. #4
  7. No Profile Picture
    User 165270
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2005
    Posts
    497
    Rep Power
    938
    Originally Posted by xyexz
    Here is my scenario:
    May 24th 2009 10:10:00 PM something something 10.10.10.10 url uri port something something "something 80 info" 10.10.10.10 something

    I need the above string to have all occurrences of "   " (3 spaces) replaced with " - - " (space dash space dash space) and "  " (2 spaces) replaced with " - " (space dash space).

    The trick is if the spaces are in a set of double quotes I need them to be ignored.

    I've tried lots of things and I'm just not seeing what to do. I really have nothing to start with for a regex, sorry.

    I've got code I use now to take care of it, but it's more than one regex, my goal is to use one regex if it can be done.

    Anyone got some ideas?
    I concur with the previous poster: doing this in one regex isn't advisable. Especially when your strings are large, it will lead to poor performance.

    But this will do the trick:

    python Code:
    import re
    text = 'abc  def   ghi "ab   cd" pkm'
    text = re.sub(' (?= )(?=([^"]*"[^"]*")*[^"]*$)', " -", text)
    print(text)
     
    # output: 
    #           abc - def - - ghi "ab   cd" pkm


    Note that it generalizes to longer runs: four successive spaces (outside quotes) become " - - - ", five become " - - - - ", and so on.
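
To make the pattern easier to revisit later, here is a sketch of the same regex written with `re.VERBOSE` so each part carries a comment. The name `OUTSIDE_QUOTES_SPACE` is made up for illustration, and the inner group is written as non-capturing, which does not change what `sub` replaces.

```python
import re

# In verbose mode unescaped whitespace is ignored, so a literal space is
# written as the character class [ ].
OUTSIDE_QUOTES_SPACE = re.compile(r'''
    [ ]                     # match one space ...
    (?=[ ])                 # ... only if another space follows it
    (?=                     # and only if the rest of the string is made of
        (?:[^"]*"[^"]*")*   #   zero or more complete quoted pairs
        [^"]*$              #   followed by text containing no quotes,
    )                       # i.e. an even number of quotes remains ahead
''', re.VERBOSE)

text = 'abc  def   ghi "ab   cd" pkm'
print(OUTSIDE_QUOTES_SPACE.sub(" -", text))
```

An even number of quotes ahead means the current space is not inside a quoted section, as long as the quotes in the string are balanced.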
  8. #5
  9. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2004
    Posts
    45
    Rep Power
    20
    prometheuzz, thanks so much for this. I was on the right track with the positive lookaheads but lacked the extra positive lookahead at the beginning.
    I know this regex wouldn't be highly efficient on large strings given the greedy (*) matches, but sometimes you can't get around using a greedy regex, and this is one of those times.
    My string data doesn't usually get over 300 chars in length.

    Thanks again!
  10. #6
  11. No Profile Picture
    User 165270
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2005
    Posts
    497
    Rep Power
    938
    Originally Posted by xyexz
    prometheuzz, thanks so much for this, I was on the right track with the positive lookaheads but lacked the extra positive lookahead at the beginning. I know that this regex wouldn't be highly efficient on large strings given the (*) greedy matches but sometimes you can't get around using greedy regex,
    It's more the lookaheads and their contents that make this regex inefficient: for every candidate space (one followed by another space), the regex scans ahead to the end of the string. That gives quadratic running time in Big-O terms in the worst case, while a linear-time algorithm is easily crafted "by hand".
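
A hand-written linear-time version could look like the sketch below. The name `dash_fill_linear` is made up for illustration. It tracks quote state from the left instead of counting quotes to the right, so it agrees with the regex only when the quotes in the string are balanced.

```python
def dash_fill_linear(text):
    """Single pass: turn each space that is followed by a space into ' -',
    except inside double quotes (hypothetical helper)."""
    out = []
    in_quotes = False
    n = len(text)
    for i, ch in enumerate(text):
        if ch == '"':
            # Toggle on every double quote; escaped quotes are not handled.
            in_quotes = not in_quotes
        if ch == " " and not in_quotes and i + 1 < n and text[i + 1] == " ":
            out.append(" -")
        else:
            out.append(ch)
    return "".join(out)


print(dash_fill_linear('abc  def   ghi "ab   cd" pkm'))
```

Every character is visited once and only its right-hand neighbour is inspected, so the work grows linearly with the length of the string.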

    Originally Posted by xyexz
    this would be one of those times.
    My string data doesn't usually get over 300 chars in length.

    Thanks again!
    Ah, 300 characters is indeed peanuts.
    Last edited by prometheuzz; June 11th, 2009 at 09:11 AM.
  12. #7
  13. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2006
    Posts
    177
    Rep Power
    234
    Originally Posted by xyexz
    Here is my scenario:
    May 24th 2009 10:10:00 PM something something 10.10.10.10 url uri port something something "something 80 info" 10.10.10.10 something

    I need the above string to have all occurrences of "   " (3 spaces) replaced with " - - " (space dash space dash space) and "  " (2 spaces) replaced with " - " (space dash space).

    The trick is if the spaces are in a set of double quotes I need them to be ignored.

    I've tried lots of things and I'm just not seeing what to do. I really have nothing to start with for a regex, sorry.

    I've got code I use now to take care of it, but it's more than one regex, my goal is to use one regex if it can be done.

    Anyone got some ideas?
    Why make it so complicated? There's no need for a regular expression. Use the csv module.
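    Code:
    import csv

    filename = "file"  # path to the input file
    # A single-space delimiter keeps each double-quoted section together as one field.
    with open(filename) as f:
        reader = csv.reader(f, delimiter=" ")
        for row in reader:
            for item in row:
                if "   " in item:  # 3 spaces
                    item = item.replace("   ", " - - ")
                if "  " in item:  # 2 spaces
                    item = item.replace("  ", " - ")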
  14. #8
  15. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2004
    Posts
    45
    Rep Power
    20
    I'll take a look at your solution but to be honest one line of regex looks less complicated than what you just posted. Also you can't use a single space as your delimiter because spaces inside of double quotes must not be touched. If this doesn't touch those spaces then great!
    I have the regex inside of a loop (regex compiled outside of the loop), so I'm not sure which would be more efficient, something like this or the regex.
    Also, I have no file reference, it's just a string. Can you supply this function with just a string?
    Thanks for the suggestion!
  16. #9
  17. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2006
    Posts
    177
    Rep Power
    234
    Originally Posted by xyexz
    I'll take a look at your solution but to be honest one line of regex looks less complicated than what you just posted.
    I will give you an analogy: reading an essay written in English versus reading an essay full of numbers, where the numbers represent the letters of the alphabet. Which is more complicated? That's what regex does: it uses symbols to represent logic. It's OK for short expressions, but if your string manipulation requirements get more complex, being more verbose will help you a lot. How much time have you wasted coming up with that regular expression?

    Also you can't use a single space as your delimiter because spaces inside of double quotes must not be touched. If this doesn't touch those spaces then great!
    The module is there for you to experiment with and find out for yourself what's best. I am only providing an example of how you can do it the easier way.
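
One way to experiment with the csv module on a plain string rather than a file is sketched below. `csv.reader` accepts any iterable of lines, so a one-element list is enough. The name `dash_fill_csv` is made up for illustration. Note the assumptions: quotes are only treated specially at the start of a field, and a quoted field that contains no space comes back without its quotes.

```python
import csv
import io


def dash_fill_csv(text):
    """Use csv parsing to protect quoted sections (hypothetical helper)."""
    # With a single-space delimiter, a quoted section stays one field and
    # every extra space in a run produces an empty field.
    row = next(csv.reader([text], delimiter=" "))
    # Each empty field stands for one extra space, so it becomes a dash.
    row = [field if field else "-" for field in row]
    # The writer's default QUOTE_MINIMAL re-quotes fields containing a space.
    buf = io.StringIO()
    csv.writer(buf, delimiter=" ").writerow(row)
    # Drop the line terminator the writer appends.
    return buf.getvalue().rstrip("\r\n")


print(dash_fill_csv('abc  def   ghi "ab   cd" pkm'))
```

This shows that the single-space delimiter does not touch spaces inside double quotes: the quoted text arrives as one field and is written back unchanged.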
