#1
  1. No Profile Picture

    Capturing Groups with greedy "+"


    Hey all,
    I'm having trouble with a regular expression. I'm using PHP's preg_match to parse a filename. Here is my regexp:
    PHP Code:
    /^(\d+)_([a-zA-Z_]+)_([\d_]+)_([a-zA-Z]{1}[a-zA-Z\d]+_?)+(_[a-zA-Z]+)*\.csv$/ 
    Here is a filename that is giving me problems:
    Code:
    100613_CAT_100522_23_NO3_PO4_JG.csv
    The capture looks like this:
    Code:
    Array ( [0] => 100613_CAT_100522_23_NO3_PO4_JG.csv [1] => 100613 [2] => CAT [3] => 100522_23 [4] => JG )
    So what I am missing there is the NO3_PO4 part. I found that JG was being captured by this part:
    ([a-zA-Z]{1}[a-zA-Z\d]+_?)+
    That was not what I intended. I intended for it to be captured by the last capture group: (_[a-zA-Z]+)*

    So I assume that the + at the end of the other capture group is matching both, but each repetition overwrites the previous capture or something?

    Any ideas?

    Here's another capture to show some of the variation explaining my choices in the regexp:
    Code:
    Array ( [0] => 100405_WordsHere_CSA_100326_DIT.csv [1] => 100405 [2] => WordsHere_CSA [3] => 100326 [4] => DIT )
    Filenames have been modified as well as captures so I'm sorry if I missed something when I changed them and you see a mismatch of a letter/number.
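
    The behaviour asked about above can be reproduced in isolation. The following is a minimal sketch in Python rather than PHP (an assumption made for illustration; a repeated capturing group behaves the same way in PCRE). It shows that a group followed by + keeps only its last iteration, and that wrapping the repetition in an outer group keeps the whole run. The names original and wrapped are hypothetical.

```python
import re

filename = "100613_CAT_100522_23_NO3_PO4_JG.csv"

# The pattern from the post, unchanged.
original = re.compile(
    r"^(\d+)_([a-zA-Z_]+)_([\d_]+)_([a-zA-Z]{1}[a-zA-Z\d]+_?)+(_[a-zA-Z]+)*\.csv$"
)
m = original.match(filename)

# Group 4 is repeated three times (NO3_, PO4_, JG); only the last
# iteration is kept, so the earlier ones are overwritten.
last_iteration = m.group(4)

# Group 5 never takes part in the match, because group 4 already
# consumed JG. Python reports it as None.
unused_group = m.group(5)

# Wrapping the repetition in an outer capturing group, with a
# non-capturing group (?:...) inside, keeps the whole repeated run.
wrapped = re.compile(
    r"^(\d+)_([a-zA-Z_]+)_([\d_]+)_((?:[a-zA-Z][a-zA-Z\d]+_?)+)(_[a-zA-Z]+)*\.csv$"
)
whole_run = wrapped.match(filename).group(4)

print(last_iteration, unused_group, whole_run)
```

    Note that the wrapped version still swallows JG into group 4, since nothing in the pattern distinguishes the initials from a chemical name. In PHP, preg_match by default drops trailing groups that did not participate, which would explain why the array in the post stops at index 4.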
  2. #2
  3. A94528C464D168DC82FE4933E9DF37
    Do you need to essentially match everything separated by the underscores?
  4. #3
  5. No Profile Picture
    Originally Posted by jalucas
    Do you need to essentially match everything separated by the underscores?
    No, some of the data separated by underscores should be grouped together. The first group of numbers is never separated by underscores, then the second group of letters can be separated by underscores. Then the next group or groups of letters and/or numbers can be separated by underscores. Then finally there is an optional group of two letters before the ".csv".
  6. #4
  7. A94528C464D168DC82FE4933E9DF37
    Do you have an example of a filename that fills all optional conditions? Also, can the groups that allow underscores contain multiple underscores?

    So far, as far as the two examples go, and assuming I grouped them correctly, this is the closest I have been able to come up with:

    Code:
    ^(\d+)_([^_]+_?[^_]+)_([^_]+_?[^_]+)_([^_]+)(_[A-Za-z]{2})?\.csv$
    Which matches "100613_CAT_100522_23_NO3_PO4_JG.csv" as:

    1) 100613
    2) CAT_100522
    3) 23_NO3
    4) PO4
    5) _JG
  8. #5
  9. No Profile Picture
    Originally Posted by jalucas
    Do you have an example of a filename that fills all optional conditions? Also, can the groups that allow underscores contain multiple underscores?

    So far, as far as the two examples go, and assuming I grouped them correctly, this is the closest I have been able to come up with:

    Code:
    ^(\d+)_([^_]+_?[^_]+)_([^_]+_?[^_]+)_([^_]+)(_[A-Za-z]{2})?\.csv$
    Which matches "100613_CAT_100522_23_NO3_PO4_JG.csv" as:

    1) 100613
    2) CAT_100522
    3) 23_NO3
    4) PO4
    5) _JG
    Thanks Jalucas,
    The match would want to look like this:
    1) 100613
    2) CAT
    3) 100522_23
    4) NO3_PO4
    5) JG or _JG

    Can I ask what the "^" means inside your character lists? I've never seen that unless it just means "start" like it does at the start of a regexp.

    A name with all the optionals would look like: "100613_CAT_TMP_100522_23_NO3_PO4_JG.csv"

    I can't give "all" the optionals because the first groups of letters can be repeated infinitely many times. Like "100613_CAT_TMP_SHOE_DOG_BAT_BASEBALL_100522_23_NO3_PO4_JG.csv". Same goes for the 2nd groups of numbers and the chemical names (NO3, PO4), etc. But the idea is that anything but the first date, and the last 2 initials can be repeated, and the initials can be excluded.
  10. #6
  11. A94528C464D168DC82FE4933E9DF37
    Ok I've revised it a bit here:

    Code:
    ^([0-9]+)_([A-Za-z]+(?:_[A-Za-z]+)*)_([0-9]+(?:_[0-9]+)*)_([A-Za-z]+[0-9]+(?:_[A-Za-z]+[0-9]+)*)(_[A-Za-z]{2})?\.csv$
    A ^ at the start of a character class negates it, so [^_] matches any single character except the underscore. I removed those expressions because they are too permissive: they would also allow other special characters to be present.

    Using the above expression, I can catch "100613_CAT_TMP_SHOE_DOG_BAT_BASEBALL_100522_23_NO3_PO4_JG.csv" as

    1) 100613
    2) CAT_TMP_SHOE_DOG_BAT_BASEBALL
    3) 100522_23
    4) NO3_PO4
    5) _JG
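
    The point about [^_] being too permissive can be checked with a small sketch. It is written in Python as an illustration (an assumption; the character-class semantics shown are the same in PCRE), and the sample strings are made up.

```python
import re

negated = re.compile(r"[^_]+")          # any run of characters except underscore
explicit = re.compile(r"[A-Za-z0-9]+")  # only ASCII letters and digits

# Both classes accept an ordinary field such as a chemical name.
negated_accepts_plain = negated.fullmatch("NO3") is not None
explicit_accepts_plain = explicit.fullmatch("NO3") is not None

# Only the negated class lets punctuation, spaces or path separators through.
negated_accepts_odd = negated.fullmatch("NO3;../x") is not None
explicit_accepts_odd = explicit.fullmatch("NO3;../x") is not None

print(negated_accepts_plain, explicit_accepts_plain)
print(negated_accepts_odd, explicit_accepts_odd)
```

    Listing the allowed characters explicitly is the stricter choice when the filename comes from user input.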
  12. #7
  13. No Profile Picture
    Originally Posted by jalucas
    Ok I've revised it a bit here:

    Code:
    ^([0-9]+)_([A-Za-z]+(?:_[A-Za-z]+)*)_([0-9]+(?:_[0-9]+)*)_([A-Za-z]+[0-9]+(?:_[A-Za-z]+[0-9]+)*)(_[A-Za-z]{2})?\.csv$
    A ^ at the start of a character class negates it, so [^_] matches any single character except the underscore. I removed those expressions because they are too permissive: they would also allow other special characters to be present.

    Using the above expression, I can catch "100613_CAT_TMP_SHOE_DOG_BAT_BASEBALL_100522_23_NO3_PO4_JG.csv" as

    1) 100613
    2) CAT_TMP_SHOE_DOG_BAT_BASEBALL
    3) 100522_23
    4) NO3_PO4
    5) _JG
    Thanks again, Jalucas.

    Unfortunately it doesn't work for some of them like this one:
    "100506_Aloha_Chow_100522_DIY"

    What are the sections with the "?:"?
    I haven't seen those before and don't know what they mean.
    I can tell we're getting closer, though, because it caught the one I was having trouble with.
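
    On the "?:" question: a group written as (?:...) is a non-capturing group. It groups part of the pattern so a quantifier can apply to it, but it does not produce an entry in the match results. A minimal sketch in Python (an assumption for illustration; PCRE uses the same syntax), with a made-up input string:

```python
import re

text = "CAT_TMP_SHOE"

# Inner group is capturing: it adds a second entry, holding only its
# last iteration.
capturing = re.match(r"^([A-Za-z]+(_[A-Za-z]+)*)$", text)

# Inner group is non-capturing: only the outer group is reported.
non_capturing = re.match(r"^([A-Za-z]+(?:_[A-Za-z]+)*)$", text)

capturing_groups = capturing.groups()
non_capturing_groups = non_capturing.groups()

print(capturing_groups)
print(non_capturing_groups)
```

    Both patterns match the same text; the difference is only in which groups are numbered and returned.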
  14. #8
  15. No Profile Picture
    I think I see one of the problems: the regexp doesn't provide for chemicals with no numbers in their names, like my example DIY (a fake chemical), which is close to one I need to allow.
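
    One possible way to allow chemicals without digits is sketched below, as a matter of academic interest only. It is in Python (an assumption; the same pattern syntax is valid in PCRE) and makes two changes to the pattern from post #6: the digits after a chemical name become optional ([0-9]*), and the chemical repetition becomes lazy (*?) so that a trailing two-letter field is tried as initials first. The parse helper and the ".csv" appended to the DIY example are assumptions, not taken from the thread.

```python
import re

PATTERN = re.compile(
    r"^([0-9]+)"                                   # 1: leading date
    r"_([A-Za-z]+(?:_[A-Za-z]+)*)"                 # 2: one or more words
    r"_([0-9]+(?:_[0-9]+)*)"                       # 3: one or more numbers
    r"_([A-Za-z]+[0-9]*(?:_[A-Za-z]+[0-9]*)*?)"    # 4: chemicals, digits optional, lazy
    r"(?:_([A-Za-z]{2}))?"                         # 5: optional two-letter initials
    r"\.csv$"
)

def parse(name):
    """Return the five fields as a tuple, or None if the name does not match."""
    m = PATTERN.match(name)
    return m.groups() if m else None

with_initials = parse("100613_CAT_100522_23_NO3_PO4_JG.csv")
without_digits = parse("100506_Aloha_Chow_100522_DIY.csv")
ambiguous = parse("100506_Aloha_100522_NO3_DI.csv")

print(with_initials)
print(without_digits)
print(ambiguous)
```

    The last call shows the remaining ambiguity: a two-letter chemical with no digits at the end of the name cannot be told apart from initials, so it lands in group 5. Resolving that would need a rule that is not stated in the thread, such as a list of known chemical names.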
  16. #9
  17. No Profile Picture
    Thankfully, my boss has freed me from having to do this. We are going to make the user do more checking themselves and manually enter stuff. Phew!

    We can continue as a matter of academic interest if anyone likes though!
