#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2008
    Posts
    28
    Rep Power
    0

    RE help needed - split string with punctuation except when enclosed with quotations


    Hi Everyone,

    I have a problem that calls for regular expressions, and I haven't used them often enough to know how to do this. I'm using Python to process strings in a file, and I want to split each string on commas, like so:

    ' item1, "item2", "item3", "item4, item4 more stuff", null, "states: IL, AL, CO, etc. item5, (more extraneous punctuation)" '

    The enclosing single quotes delimit the string I mentioned. I want to split that string on commas, but not on the commas that are inside quotation marks ("states: IL, AL, CO, etc ...").

    Basically what I'd like to get back is an array with the following 6 items from the example string above:

    item1, "item2", "item3", "item4, item4 more stuff", null,
    "states: IL, AL, CO, etc. item5, (more extraneous punctuation)"

    So essentially it's just the commas inside the quotation marks that are ignored; only the commas outside them are used as separators. Any help is greatly appreciated, thanks.
  2. #2
  3. kill 9, $$;
    Devshed Supreme Being (6500+ posts)

    Join Date
    Sep 2001
    Location
    Shanghai, An tSín
    Posts
    6,897
    Rep Power
    3887
    I wouldn't recommend using regexps for this type of task. Rather, a proper CSV parser would be preferable. I'm no Python programmer, but this looks useful.
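
For reference, Python's standard library ships a `csv` module that can do this kind of split. The following is only a minimal sketch, assuming the sample string from the first post and assuming it is acceptable for the parser to drop the enclosing double quotes; `split_fields` is a hypothetical helper name, not something from the thread.

```python
import csv

def split_fields(text):
    """Split one comma-separated line, ignoring commas inside double quotes."""
    # skipinitialspace=True drops the blank after each comma, so a field
    # that begins with a double quote is still recognised as quoted.
    reader = csv.reader([text.strip()], skipinitialspace=True)
    return next(reader)

sample = ' item1, "item2", "item3", "item4, item4 more stuff", null, "states: IL, AL, CO, etc. item5, (more extraneous punctuation)" '
fields = split_fields(sample)

print(len(fields))
for field in fields:
    print(field)
```

Note that the parser returns the field contents without the surrounding double quotes, so they would need to be re-added if they matter for a later comparison.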

    Comments on this post

    • prometheuzz agrees
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2008
    Posts
    28
    Rep Power
    0
    ishnid:

    I'm using the CSV module for the Excel side of things.

    Basically I have two files, a .csv file and a text file -- I'm parsing the csv file just fine - but I need to parse the text file. That's what my original post is referring to.

    The Excel file is parsed into key/value pairs. Now I need to do the same with the text file so that I can compare them: for each key in the CSV, does the same key in the text file have the same value? To do this I need to parse the text file properly, which is where I'm currently stuck.
  6. #4
  7. kill 9, $$;
    Devshed Supreme Being (6500+ posts)

    Join Date
    Sep 2001
    Location
    Shanghai, An tSín
    Posts
    6,897
    Rep Power
    3887
    The sample data you posted looks exactly like a CSV file to me: values separated by commas, with commas being embedded within fields by using quotes. Perhaps I'm missing something.
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2009
    Posts
    18
    Rep Power
    0
    The text file I need to parse isn't comma-separated throughout; only part of each line is, so I'll have something like this:

    1:random text here, random text here. there's a list following here, the list is what I need to parse, everything in the single quotes: ' item1, "item2", "item3", "item4, item4 more stuff", null, "states: IL, AL, CO, etc. item5, (more extraneous punctuation)" '

    2:now I've got another line here with another list: 'mehmeh, "yes, here we go", etc'

    I'm able to get the single-quoted list substring from each line easily enough; it's just parsing those lists that I'm having trouble with. I tried the following pattern, which I found online:

    [code='python']
    import re
    pattern = re.compile(“/,(?=(?:[^\"]*\”[^\"]*\”)*(?![^\"]*\”))/”)
    myarray = re.split(pattern, string)
    [/code]

    but that didn't work.
  10. #6
  11. No Profile Picture
    User 165270
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2005
    Posts
    497
    Rep Power
    937
    Python regexes don't need the surrounding / delimiters, and be sure to use straight quotes (") rather than curly ones:

    python Code:
    p = re.compile(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))")
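
As a rough illustration of how that pattern behaves with `re.split`, here is a small sketch run on the second example line's list. The raw-string spelling of the pattern and the helper name `split_outside_quotes` are my own choices for the example, not part of the original post.

```python
import re

# Split on a comma only when an even number of double quotes follows it,
# i.e. when the comma does not sit inside a quoted field.
pattern = re.compile(r',(?=(?:[^"]*"[^"]*")*(?![^"]*"))')

def split_outside_quotes(text):
    """Split on commas outside double quotes and trim each piece."""
    return [part.strip() for part in pattern.split(text.strip())]

items = split_outside_quotes('mehmeh, "yes, here we go", etc')
print(items)
```

Unlike a CSV parser, this keeps the double quotes around each quoted item, which matches the output asked for in the first post.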


    But I agree with ishnid: I don't see why you couldn't use a CSV parser for this. AFAIK, the example data you posted will work just fine with a CSV parser.
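
A possible way to combine both steps on a whole line, as a sketch only. It assumes the list is the last single-quoted span on the line and contains no single quotes itself (the free text before it may contain apostrophes, as in "I've"). `extract_list` and `parse_line` are hypothetical names.

```python
import csv

def extract_list(line):
    """Return the text between the last pair of single quotes on the line."""
    end = line.rindex("'")            # closing quote of the list
    start = line.rindex("'", 0, end)  # nearest single quote before it
    return line[start + 1:end]

def parse_line(line):
    """Extract the quoted list from a line and split it with the csv module."""
    inner = extract_list(line).strip()
    return next(csv.reader([inner], skipinitialspace=True))

line = "2:now I've got another line here with another list: 'mehmeh, \"yes, here we go\", etc'"
print(parse_line(line))
```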

    What do you think happens when you get a string like this:

    Code:
    item1, "item2", "item3", "item4, item4 a quote \"more stuff", null
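
To make the question concrete, here is a small sketch comparing the two approaches on that string. It assumes the backslash is meant as an escape character for the quote, which the thread does not actually specify. With the regex, the escaped quote makes the quote count odd, so the split lands in the wrong places and produces three pieces instead of five; the `csv` reader, told about the escape character, returns five fields.

```python
import csv
import re

pattern = re.compile(r',(?=(?:[^"]*"[^"]*")*(?![^"]*"))')
tricky = r'item1, "item2", "item3", "item4, item4 a quote \"more stuff", null'

# The regex only counts quote characters; it has no notion of escaping.
regex_parts = [part.strip() for part in pattern.split(tricky)]
print(len(regex_parts), regex_parts)

# Assumption: a backslash escapes the following quote inside a field.
reader = csv.reader([tricky], skipinitialspace=True, escapechar="\\")
csv_parts = next(reader)
print(len(csv_parts), csv_parts)
```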
    Last edited by prometheuzz; July 7th, 2009 at 06:09 AM.
