#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    5
    Rep Power
    0

    Trouble parsing CSV file


    I have a log file. Below is a single line from the log:

    2012-10-10 ; 09:56:28 ; W3SVC ; OnPreprocHeaders ; 192.168.1.10 ; ; ; site.test.acme.com ; Mozilla/5.0 (X11; Linux x86_64; rv:11.0) Gecko/20100101 Firefox/11.0 ; GET ; /internet/default/acmecomunication/p/Arch/Arch/index.html ;

    I would like to parse the line using ; as a separator:



    2012-10-10 field #1
    09:56:28 field #2
    W3SVC field #3
    OnPreprocHeaders field #4
    192.168.1.10 field #5
    blank field #6
    blank field #7
    site.test.acme.com field #8
    Mozilla/5.0 field #9
    (X11; Linux x86_64; rv:11.0) field #10
    Gecko/20100101 Firefox/11.0 field #11
    GET field #12
    /internet/default/acmecomunication/p/Arch/Arch/index.html field #13

    I have tried this:

    (\d+\-\d+\-\d+) ; (\d+\:\d+\:\d+.) ; (\S+) ; (\S+) ; (\d+\.\d+\.\d+\.\d+) ; ; ; (\S+) ; (\S+)\/(\d+)\.(\d+) (\(.*?\)) (.*\;)

    but it doesn't work well.

    Below is the result:

    1: 2012-10-10
    2: 09:56:28
    3: W3SVC
    4: OnPreprocHeaders
    5: 192.168.1.10
    6: site.test.acme.com
    7: Mozilla
    8: 5
    9: 0
    10: (X11; Linux x86_64; rv:11.0)
    11: Gecko/20100101 Firefox/11.0 ; GET ; /internet/default/acmecomunication/p/Arch/Arch/index.html ;

    Any suggestion?

    Thank you
  2. #2
  3. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,959
    Rep Power
    1014
    Hi,

    Originally Posted by vgy_regex
    Any suggestion?
    Yes, don't use regular expressions. Whatever language you're using, it most certainly has a function/method that will split a string at a separator character.

    The only situation where a regex might make sense is when the application that writes these logs is broken and you need to debug it by validating each line. But I guess that's not what you want.
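For illustration, here is a minimal sketch of the split approach in Python. Python is only an assumption, since the thread never names a language, and `LINE` and `split_fields` are made-up names. The sketch splits on a space followed by `;` rather than on a bare `;`, because in the sample line the semicolons inside the parenthesised browser details are not preceded by a space. That observation comes from this single line and may not hold for the whole log.

```python
# The sample log line from the first post.
LINE = ("2012-10-10 ; 09:56:28 ; W3SVC ; OnPreprocHeaders ; 192.168.1.10 ; ; ; "
        "site.test.acme.com ; Mozilla/5.0 (X11; Linux x86_64; rv:11.0) "
        "Gecko/20100101 Firefox/11.0 ; GET ; "
        "/internet/default/acmecomunication/p/Arch/Arch/index.html ;")


def split_fields(line):
    # Assumption: a field separator is always " ;" (space, then semicolon),
    # while semicolons inside the browser details have no space before them.
    return [part.strip() for part in line.split(" ;")]


fields = split_fields(LINE)
for number, value in enumerate(fields, start=1):
    print(number, repr(value))
```

Because the line ends with a separator, the last element of `fields` is an empty string. Note that this keeps the whole browser string as one field; cutting it further would need a second step.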
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    5
    Rep Power
    0
    Originally Posted by Jacques1
    Hi,



    Yes, don't use regular expressions. Whatever language you're using, it most certainly has a function/method that will split a string at a separator character.

    The only situation where a regex might make sense is when the application that writes these logs is broken and you need to debug it by validating each line. But I guess that's not what you want.
    I can only use regular expressions.
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Location
    Paris area, France
    Posts
    843
    Rep Power
    496
    I was also just going to advise you not to use regexes, but rather a split function (whatever its name in the language you are using).

    If you insist on regexes, one of the problems I can see is that you are using capturing parentheses in places where you probably don't want them.

    If you want to capture: "Mozilla/5.0", then you should not use:

    Code:
     (\S+)\/(\d+)\.(\d+)
    which will capture that part into 3 fields, but rather something like:

    Code:
     (\S+\/\d+\.\d+)
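The difference between the two patterns can be checked quickly. This is a small sketch assuming Python's `re` module (the thread does not say which regex engine is in use); the variable names are only for illustration.

```python
import re

text = "Mozilla/5.0"

# Three pairs of parentheses produce three separate captures.
split_up = re.match(r"(\S+)\/(\d+)\.(\d+)", text)
# One pair of parentheses around the whole thing produces a single capture.
combined = re.match(r"(\S+\/\d+\.\d+)", text)

print(split_up.groups())   # ('Mozilla', '5', '0')
print(combined.groups())   # ('Mozilla/5.0',)
```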
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    5
    Rep Power
    0
    The problem is in this part: "Gecko/20100101 Firefox/11.0 ; GET ; /internet/default/acmecomunication/p/Arch/Arch/index.html ;"
    The regex doesn't split the line on the ";" character.
  10. #6
  11. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Location
    Paris area, France
    Posts
    843
    Rep Power
    496
    At the end of your regex, replace:

    Code:
    (\(.*?\)) (.*\;)
    with:

    Code:
    (\([^)]+\)) (Ge[^;]+;)

    which should capture:
    Code:
    (X11; Linux x86_64; rv:11.0)
    and
    Code:
    Gecko/20100101 Firefox/11.0 ;
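As a quick check of the replacement, here is a sketch assuming Python's `re` module. `tail` is a made-up name holding only the end of the sample line, starting at the opening parenthesis.

```python
import re

# End of the sample line, starting at the parenthesised browser details.
tail = ("(X11; Linux x86_64; rv:11.0) Gecko/20100101 Firefox/11.0 ; GET ; "
        "/internet/default/acmecomunication/p/Arch/Arch/index.html ;")

# [^)]+ cannot run past the closing parenthesis,
# and [^;]+ cannot run past the next semicolon.
m = re.search(r"(\([^)]+\)) (Ge[^;]+;)", tail)

print(m.group(1))   # (X11; Linux x86_64; rv:11.0)
print(m.group(2))   # Gecko/20100101 Firefox/11.0 ;
```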
  12. #7
  13. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    5
    Rep Power
    0
    Thank you for your rapid response.
    Maybe I have not explained the problem clearly.
    I would like to split the line at every ";" character.

    So the part "Gecko/20100101 Firefox/11.0 ; GET ; /internet/default/acmecomunication/p/Arch/Arch/index.html ;"

    should be split like this:
    Gecko/20100101 Firefox/11.0 ;
    GET ;
    /internet/default/acmecomunication/p/Arch/Arch/index.html ;
  14. #8
  15. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Location
    Paris area, France
    Posts
    843
    Rep Power
    496
    The problem is "greedy matching". When you have this regex:

    Code:
    (.*\;)
    it will try to match anything from your starting point to the last ';'. This is why I have added negated character classes in the previous regexes I suggested. Just do the same, or use a non-greedy match (something like (.*?;)).
    Last edited by Laurent_R; November 25th, 2012 at 02:21 PM.
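The three variants can be compared side by side. This is a sketch assuming Python's `re` module; `tail` is a made-up name for the end of the sample line.

```python
import re

tail = ("Gecko/20100101 Firefox/11.0 ; GET ; "
        "/internet/default/acmecomunication/p/Arch/Arch/index.html ;")

# Greedy: .* grabs as much as it can, so the match ends at the LAST ';'.
greedy = re.match(r"(.*\;)", tail).group(1)
# Non-greedy: .*? grabs as little as it can, so the match ends at the FIRST ';'.
lazy = re.match(r"(.*?;)", tail).group(1)
# Negated class: [^;]* cannot cross a ';' at all, so it also ends at the first one.
negated = re.match(r"([^;]*;)", tail).group(1)

print(greedy == tail)   # True: the greedy version swallowed the whole tail
print(lazy)             # Gecko/20100101 Firefox/11.0 ;
print(negated)          # Gecko/20100101 Firefox/11.0 ;
```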
  16. #9
  17. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    5
    Rep Power
    0
    Originally Posted by Laurent_R
    The problem is "greedy matching". When you have this regex:

    Code:
    (.*\;)
    it will try to match anything from your starting point to the last ';'. This is why I have added negated character classes in the previous regexes I suggested. Just do the same, or use non greedy match (something like (.*?;)).
    Thank you for your patience.

    I have tried the entire regex (\d+\-\d+\-\d+) ; (\d+\:\d+\:\d+.) ; (\S+) ; (\S+) ; (\d+\.\d+\.\d+\.\d+) ; ; ; (\S+) ; (\S+\/\d+\.\d+) (\([^)]+\)) ((.*?;)*)

    on the website http://www.regextester.com/

    Below are the results:

    1: (2012-10-10)
    2: (09:56:28)
    3: (W3SVC)
    4: (OnPreprocHeaders)
    5: (192.168.1.10)
    6: (site.test.acme.com)
    7: (Mozilla/5.0)
    8: ((X11; Linux x86_64; rv:11.0))
    9: (Gecko/20100101 Firefox/11.0 ; GET ; /internet/default/acmecomunication/p/Arch/Arch/index.html ;)
    10: ( /internet/default/acmecomunication/p/Arch/Arch/index.html ;)


    Item 9 is not split correctly.
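Results 9 and 10 above are what a repeated capturing group normally gives: the outer group holds everything the repetition consumed, and the inner group keeps only its last iteration. The sketch below reproduces this, assuming Python's `re` module (other engines may differ in detail). It also shows one possible way to collect every piece with `re.findall`, which only helps where the environment allows more than a single match call. `tail` and `pieces` are made-up names.

```python
import re

tail = ("Gecko/20100101 Firefox/11.0 ; GET ; "
        "/internet/default/acmecomunication/p/Arch/Arch/index.html ;")

m = re.match(r"((.*?;)*)", tail)
print(m.group(1))   # everything the repetition consumed (the whole tail)
print(m.group(2))   # only the last iteration of the inner group

# One match per ';'-terminated piece instead of one repeated group.
pieces = [piece.strip() for piece in re.findall(r"[^;]*;", tail)]
print(pieces)
```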
  18. #10
  19. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Location
    Paris area, France
    Posts
    843
    Rep Power
    496
    OK, let's go through it again.

    At the end of your regex,

    Code:
    (\([^)]+\)) ((.*?;)*)
    the "(\([^)]+\))" part matches the "(X11; Linux ...)" part. Set the rest of it aside.

    Now we want to match "Gecko/20100101 Firefox/11.0 ; "

    We want to match everything to the semi-colon.

    This can be done again with a negated character class: some word characters, then as many characters other than ';' as possible, followed by a space and then a ';'. This can be:

    Code:
    (\w+[^;]+ ;)
    Then again, for the GET part: some word characters, a space and a ';':

    Code:
    (\w+ ;)
    then a '/', followed by a whole bunch of characters other than ';', followed by ';':

    Code:
    (\/\w+[^;]+;)
    So, altogether, the end of your regex could look like this:

    Code:
    (\([^)]+\)) (\w+[^;]+ ;) (\w+ ;) (\/\w+[^;]+;)
    # X11;Li... Gecko...     GET     /internet/...
    which should work if I did not miss any part or make a mistake on spaces...

    The point is that I do not really want to give you the final solution (although now you probably have it); I want to teach you how to use this language and to understand it. That way, if I missed something, you will be able to correct the regex when testing it.
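A way to test the whole thing against the sample line is sketched below, assuming Python's `re` module. `HEAD` is the beginning of the regex as posted earlier in the thread, unchanged. `TAIL` follows the negated-character-class approach described above, written with exactly one space before each ';' and one between groups to mirror the sample line. `LINE`, `HEAD` and `TAIL` are made-up names.

```python
import re

LINE = ("2012-10-10 ; 09:56:28 ; W3SVC ; OnPreprocHeaders ; 192.168.1.10 ; ; ; "
        "site.test.acme.com ; Mozilla/5.0 (X11; Linux x86_64; rv:11.0) "
        "Gecko/20100101 Firefox/11.0 ; GET ; "
        "/internet/default/acmecomunication/p/Arch/Arch/index.html ;")

# Beginning of the pattern as posted in the thread, up to the browser name/version.
HEAD = (r"(\d+\-\d+\-\d+) ; (\d+\:\d+\:\d+.) ; (\S+) ; (\S+) ; "
        r"(\d+\.\d+\.\d+\.\d+) ; ; ; (\S+) ; (\S+\/\d+\.\d+) ")
# End of the pattern: one group per remaining field, each stopped by a negated class.
TAIL = r"(\([^)]+\)) (\w+[^;]+ ;) (\w+ ;) (\/\w+[^;]+;)"

match = re.match(HEAD + TAIL, LINE)
if match:
    for number, value in enumerate(match.groups(), start=1):
        print(f"{number}: {value}")
else:
    print("no match: check the spaces around each ';'")
```

With the sample line this yields eleven groups, the last three being the Gecko/Firefox part, GET and the path, each still carrying its trailing " ;". Moving the closing parenthesis of a group to just before the space and ';' would drop the separator from the capture.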
