#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2008
    Posts
    2
    Rep Power
    0

    Breaking text into matching chunks


    Hi all,

    I'm trying to break up a string into quoted strings, commented strings, brackets (just the character) and other strings. It happens to be SQL, but it's not SQL specific.

    For example, if I have:

    create view "My View Name" -- this is where I define the name
    as select
    myFirstColumn || (select me from mine)
    , ' a static literal' as mySecondColumn /* weird */

    Then I want to break it up into successive chunks:

    1: create view
    2: "My View Name"
    3: -- this is where I define the name\n
    4: as select\n myFirstColumn ||
    5: (
    6: select me from mine
    7: )\n ,
    8: ' a static literal'
    9: as mySecondColumn
    10: /* weird */

    I tried this regex:

    m/(?xism)
    ( # one of the following
    ".*?"|'.*?'|\[.*?\] # quoted
    | --.*?(?:\r\n|\n|\r)|\/[*].*?[*]\/ # comments
    | \(|\) # brackets
    | .+? # other
    )
    /g

    but it's not quite right. Any ideas?

    Thanks,
    Tom
    BareFeet
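    A minimal sketch of why the regex above is "not quite right", assuming a translation to Python's re module (the thread itself uses Perl; the name ORIGINAL and the shortened sample are illustrative only). A lazy .+? with nothing after it is satisfied by a single character, so every piece of "other" text comes back one character at a time, while the quoted and comment alternatives work as intended:

```python
import re

# The regex from the post, translated to Python: the inline (?xis) flags
# become VERBOSE, IGNORECASE and DOTALL (the m flag has no effect here).
ORIGINAL = re.compile(r"""
    (                                    # one of the following
      ".*?" | '.*?' | \[.*?\]            # quoted
    | --.*?(?:\r\n|\n|\r) | /[*].*?[*]/  # comments
    | \( | \)                            # brackets
    | .+?                                # other
    )
""", re.VERBOSE | re.IGNORECASE | re.DOTALL)

sample = 'create view "My View Name" -- name\nas select'
chunks = ORIGINAL.findall(sample)

# The lazy "other" alternative stops after one character each time.
print(chunks[:8])   # ['c', 'r', 'e', 'a', 't', 'e', ' ', 'v']
```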
  2. #2
  3. Did you steal it?
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    13,965
    Rep Power
    9397
    Having had time to think about it, I don't think I'd use a regular expression for this.

    What programming language are you working with?
  4. #3
  5. Sarcky
    Devshed Supreme Being (6500+ posts)

    Join Date
    Oct 2006
    Location
    Pennsylvania, USA
    Posts
    10,846
    Rep Power
    6351
    I agree; I started working on this Friday and didn't get anywhere. He appears to be using Perl. I would walk through the string one character at a time and set flags for things like comments and quoted strings.

    -Dan
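    The character-at-a-time idea could look roughly like this. It is a sketch in Python rather than Perl, and the function name split_chunks, the state names, and the choice to end a -- comment only at "\n" are assumptions made for illustration, not something specified in the thread:

```python
def split_chunks(text):
    """Walk the text one character at a time, tracking what we are inside."""
    chunks = []      # finished chunks, in order
    start = 0        # index where the current chunk began
    state = "other"  # "other", "quote", "line_comment" or "block_comment"
    closer = ""      # character that ends the current quoted string
    i = 0

    def flush(end):
        # Emit text[start:end] as a chunk (if non-empty) and start a new one.
        nonlocal start
        if end > start:
            chunks.append(text[start:end])
        start = end

    while i < len(text):
        ch, two = text[i], text[i:i + 2]
        if state == "other":
            if ch in "()":
                flush(i)          # pending "other" text
                flush(i + 1)      # the bracket by itself
            elif ch in "\"'[":
                flush(i)
                state, closer = "quote", "]" if ch == "[" else ch
            elif two == "--":
                flush(i)
                state = "line_comment"
                i += 1            # skip the second "-"
            elif two == "/*":
                flush(i)
                state = "block_comment"
                i += 1            # skip the "*"
        elif state == "quote" and ch == closer:
            flush(i + 1)
            state = "other"
        elif state == "line_comment" and ch == "\n":
            flush(i + 1)          # the newline stays with the comment
            state = "other"
        elif state == "block_comment" and two == "*/":
            i += 1
            flush(i + 1)
            state = "other"
        i += 1
    flush(len(text))              # trailing text, possibly unterminated
    return chunks


sample = ('create view "My View Name" -- this is where I define the name\n'
          'as select\n'
          'myFirstColumn || (select me from mine)\n'
          ", ' a static literal' as mySecondColumn /* weird */")
for n, chunk in enumerate(split_chunks(sample), 1):
    print(n, repr(chunk))
```

    Unlike a single regex, this form makes it easy to add rules later (escaped quotes, nested comments), at the cost of more code.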
  6. #4
  7. Did you steal it?
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    13,965
    Rep Power
    9397
    Maybe you could use a regex. Multiple, actually.

    One generic regex reads the string up to the next delimiter: a quote, --, (, ), or /*.
    Code:
    /(.*?)(['"()]|--|\/\*)/s
    Depending on that second group, you use another regex for the next part:
    Code:
    '  -> /(.*?)'(?<!\\')/s
    "  -> /(.*?)"(?<!\\")/s
    -- -> /(.*)/
    /* -> ~(.*?)\*/~s
    ( needs no follow-up regex (it's just the character), and after ) you go straight back to the generic regex.
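    A rough sketch of how the multi-regex approach above might be driven, again assuming Python instead of Perl. The patterns are carried over as posted; the names NEXT_DELIM, AFTER and split_chunks, and the handling of unterminated quotes or comments, are assumptions added for illustration. Note that, as posted, the -- pattern stops before the newline and [ ... ] is not treated as a quote:

```python
import re

# Generic step: lazily read "other" text up to the next delimiter.
NEXT_DELIM = re.compile(r"""(.*?)(['"()]|--|/\*)""", re.DOTALL)

# Follow-up pattern, chosen by the delimiter that was just consumed.
AFTER = {
    "'":  re.compile(r"(.*?)'(?<!\\')", re.DOTALL),  # to an unescaped '
    '"':  re.compile(r'(.*?)"(?<!\\")', re.DOTALL),  # to an unescaped "
    "--": re.compile(r"(.*)"),                       # to the end of the line
    "/*": re.compile(r"(.*?)\*/", re.DOTALL),        # to the closing */
}


def split_chunks(text):
    chunks, pos = [], 0
    while pos < len(text):
        m = NEXT_DELIM.match(text, pos)
        if m is None:                  # no delimiter left: the rest is "other"
            chunks.append(text[pos:])
            break
        if m.group(1):
            chunks.append(m.group(1))
        delim, pos = m.group(2), m.end()
        follow = AFTER.get(delim)
        if follow is None:             # "(" or ")": just the character
            chunks.append(delim)
            continue
        m2 = follow.match(text, pos)
        if m2 is None:                 # unterminated quote or comment
            chunks.append(delim + text[pos:])
            break
        chunks.append(delim + m2.group(0))
        pos = m2.end()
    return chunks


sample = ('create view "My View Name" -- this is where I define the name\n'
          'as select\n'
          'myFirstColumn || (select me from mine)\n'
          ", ' a static literal' as mySecondColumn /* weird */")
for n, chunk in enumerate(split_chunks(sample), 1):
    print(n, repr(chunk))
```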
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2008
    Posts
    2
    Rep Power
    0


    Originally Posted by requinix
    Having had time to think about it, I don't think I'd use a regular expression for this.
    I managed to do it by modifying my regular expression to add a lookahead after the "other" alternative that checks for any of the known groups. It basically says "find one of the following, or any other text up to just before the next occurrence of one of them".

    m/(?xis)
    ( # one of the following
    ".*?"|'.*?'|\[.*?\] # quoted
    | --.*?(?:\r\n|\n|\r) # comments at end of line
    | \/[*].*?[*]\/ # comments in line
    | \(|\) # brackets
    | .+? # other
    (?= # look ahead for one of the following
    ".*?"|'.*?'|\[.*?\] # quoted
    | --.*?(?:\r\n|\n|\r) # comments at end of line
    | \/[*].*?[*]\/ # comments in line
    | \(|\) # brackets
    )
    )/g

    It gives me the result I needed:

    1: create view
    2: "My View Name"
    3: <space>
    4: -- this is where I define the name\n
    5: as select\n myFirstColumn ||
    6: (
    7: select me from mine
    8: )
    9: \n ,
    10: ' a static literal'
    11: as mySecondColumn
    12: /* weird */
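    For anyone who wants to try this outside Perl, here is a hedged translation of the lookahead regex above to Python's re module. The names TOKEN and CHUNK are illustrative, and the sample has no leading indentation, so the whitespace inside chunks may differ slightly from the listing above while the chunk boundaries stay the same:

```python
import re

# Alternatives shared by the capture group and the lookahead.
TOKEN = r"""
      ".*?" | '.*?' | \[.*?\]      # quoted
    | --.*?(?:\r\n|\n|\r)          # comments at end of line
    | /[*].*?[*]/                  # comments in line
    | \( | \)                      # brackets
"""

# Group = any token, or "other" text up to just before the next token.
CHUNK = re.compile(
    "(" + TOKEN + r"| .+? (?=" + TOKEN + "))",
    re.VERBOSE | re.IGNORECASE | re.DOTALL,
)

sample = ('create view "My View Name" -- this is where I define the name\n'
          'as select\n'
          'myFirstColumn || (select me from mine)\n'
          ", ' a static literal' as mySecondColumn /* weird */")
chunks = CHUNK.findall(sample)
for n, chunk in enumerate(chunks, 1):
    print(n, repr(chunk))
```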

    Obviously this RegEx scans some of the text twice (once in the lookahead and again on the next match), which is slightly inefficient. I had thought that with the /g option a trailing .+? would extend up to just before the next match of the group, but a lazy quantifier with nothing after it stops after a single character, so I have to include the lookahead to say so explicitly.

    And of course I can simplify the lookahead to only look for the starting characters of the quotes and comments, so the RegEx is:

    m/(?xis)
    ( # one of the following
    ".*?"|'.*?'|\[.*?\] # quoted
    | --.*?(?:\r\n|\n|\r) # comments at end of line
    | \/[*].*?[*]\/ # comments in line
    | \(|\) # brackets
    | .+? # other
    (?= # look ahead for one of the following
    "|'|\[ # quoted
    | -- # comments at end of line
    | \/[*] # comments in line
    | \(|\) # brackets
    )
    )/g
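    One caveat with the lookahead form, sketched here in Python under the same translation assumptions (BODY, AS_POSTED and WITH_END are illustrative names): "other" text is only matched when a delimiter follows it, so trailing text after the last delimiter is silently dropped. Adding an end-of-string alternative to the lookahead is one possible fix. Python spells that anchor \Z; in Perl and PCRE the strict equivalent is \z.

```python
import re

FLAGS = re.VERBOSE | re.IGNORECASE | re.DOTALL

# The simplified regex; %s is a slot for an optional extra lookahead branch.
BODY = r"""
      ".*?" | '.*?' | \[.*?\]      # quoted
    | --.*?(?:\r\n|\n|\r)          # comments at end of line
    | /[*].*?[*]/                  # comments in line
    | \( | \)                      # brackets
    | .+?                          # other, up to just before ...
      (?= " | ' | \[ | -- | /[*] | \( | \) %s )
"""
AS_POSTED = re.compile("(" + BODY % "" + ")", FLAGS)
WITH_END = re.compile("(" + BODY % r"| \Z" + ")", FLAGS)  # ... or end of text

text = "select a from (select b from t) where a > 1"
print(AS_POSTED.findall(text))  # the trailing " where a > 1" is missing
print(WITH_END.findall(text))   # the trailing text is kept as a final chunk
```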

    Originally Posted by requinix
    What programming language are you working with?
    I'm using AppleScript, currently calling Perl to do the actual regex matching, but I may move that function into a scripting addition that does regex, or even some Cocoa subroutine. So I'm after a generic PCRE regex that I can use in any environment.

    Thanks,
    Tom
    BareFeet
