Trouble parsing CSV file

November 25th, 2012, 07:38 AM
Registered User
Join Date: Nov 2012
Posts: 5
Time spent in forums: 1 h 9 m 50 sec
Reputation Power: 0
I have a log file. Below is a single line from it:
2012-10-10 ; 09:56:28 ; W3SVC ; OnPreprocHeaders ; 192.168.1.10 ; ; ; site.test.acme.com ; Mozilla/5.0 (X11; Linux x86_64; rv:11.0) Gecko/20100101 Firefox/11.0 ; GET ; /internet/default/acmecomunication/p/Arch/Arch/index.html ;
I would like to parse the line using ; as a separator:
2012-10-10 field #1
09:56:28 field #2
W3SVC field #3
OnPreprocHeaders field #4
192.168.1.10 field #5
blank field #6
blank field #7
blank field #8
site.test.acme.com field #9
Mozilla/5.0 field #10
(X11; Linux x86_64; rv:11.0) field #11
Gecko/20100101 Firefox/11.0 field #12
GET field #13
/internet/default/acmecomunication/p/Arch/Arch/index.html field #14
I have tried this:
(\d+\-\d+\-\d+) ; (\d+\:\d+\:\d+.) ; (\S+) ; (\S+) ; (\d+\.\d+\.\d+\.\d+) ; ; ; (\S+) ; (\S+)\/(\d+)\.(\d+) (\(.*?\)) (.*\;)
but it doesn't work well.
Below is the result:
1: 2012-10-10
2: 09:56:28
3: W3SVC
4: OnPreprocHeaders
5: 192.168.1.10
6: site.test.acme.com
7: Mozilla
8: 5
9: 0
10: (X11; Linux x86_64; rv:11.0)
11: Gecko/20100101 Firefox/11.0 ; GET ; /internet/default/acmecomunication/p/Arch/Arch/index.html ;
Any suggestion?
Thank you

November 25th, 2012, 08:08 AM
pollyanna
Join Date: Jul 2012
Location: Germany
Hi,
Quote: Originally Posted by vgy_regex: Any suggestion?
Yes, don't use regular expressions. Whatever language you're using, it most certainly has a function/method that will split a string at a separator character.
The only situation where a regex might make sense is when the application that writes these logs is broken and you need to debug it by validating each line. But I guess that's not what you want.
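As a minimal sketch of that advice in Python (an assumption, since the thread never names a language): in the sample line every field separator is a space followed by `;`, while the semicolons inside the user-agent string have no space before them. Splitting on the two-character sequence `" ;"` therefore leaves the user agent in one piece. The helper name `split_fields` is hypothetical, and the approach only holds as long as no field value contains a space directly followed by a semicolon.

```python
def split_fields(record):
    """Split one log record on ' ;' and trim whitespace from each field."""
    parts = [part.strip() for part in record.split(" ;")]
    # The record ends with a separator, so the last piece is empty; drop it.
    if parts and parts[-1] == "":
        parts.pop()
    return parts


# Sample line copied from the first post of the thread.
line = (
    "2012-10-10 ; 09:56:28 ; W3SVC ; OnPreprocHeaders ; 192.168.1.10 ; ; ; "
    "site.test.acme.com ; Mozilla/5.0 (X11; Linux x86_64; rv:11.0) "
    "Gecko/20100101 Firefox/11.0 ; GET ; "
    "/internet/default/acmecomunication/p/Arch/Arch/index.html ;"
)

fields = split_fields(line)
for number, field in enumerate(fields, start=1):
    print(number, repr(field))
```

With this split the whole user agent comes back as a single field; breaking it further into "Mozilla/5.0", the parenthesised part, and the remainder would be a separate second step.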

November 25th, 2012, 10:07 AM
Quote: Originally Posted by Jacques1
Hi,
Yes, don't use regular expressions. Whatever language you're using, it most certainly has a function/method that will split a string at a separator character.
The only situation where a regex might make sense is when the application that writes these logs is broken and you need to debug it by validating each line. But I guess that's not what you want.
I can only use regular expressions.

November 25th, 2012, 12:07 PM
I was also just going to advise you not to use regexes, but rather a split function (whatever its name in the language you are using).
If you insist on regexes, one of the problems I can see is that you are using capturing parentheses in places where you probably don't want them.
If you want to capture: "Mozilla/5.0", then you should not use:
Code:
(\S+)\/(\d+)\.(\d+)
which will capture that part into 3 fields, but rather something like:
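As a rough illustration of the difference (Python's `re` module is an assumption here, since the thread never names a language), the three-group pattern can be compared with a single-group form such as `(\S+\/\d+\.\d+)`, which is what the full regex posted later in this thread uses:

```python
import re

# Start of the user-agent field from the sample log line.
ua_start = "Mozilla/5.0 (X11; Linux x86_64; rv:11.0)"

# Three capturing groups: name, major and minor version land in separate fields.
three = re.match(r"(\S+)\/(\d+)\.(\d+)", ua_start)
print(three.groups())  # ('Mozilla', '5', '0')

# One capturing group around the whole token keeps it together.
one = re.match(r"(\S+\/\d+\.\d+)", ua_start)
print(one.groups())  # ('Mozilla/5.0',)
```

Every pair of capturing parentheses produces its own numbered result, so the number of groups should match the number of fields wanted.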

November 25th, 2012, 12:21 PM
The problem is with this part: "Gecko/20100101 Firefox/11.0 ; GET ; /internet/default/acmecomunication/p/Arch/Arch/index.html ;"
The regex doesn't split the line on the ";" character.

November 25th, 2012, 12:42 PM
Replace at the end of your regex:
by:
Code:
(\([^)]+\)) (Ge[^;]+;)
which should capture:
Code:
(X11; Linux x86_64; rv:11.0)
and
Code:
Gecko/20100101 Firefox/11.0 ;

November 25th, 2012, 12:49 PM
Thank you for your rapid response.
Maybe I haven't explained the problem clearly.
I would like to split the line at every ";" character.
So the part "Gecko/20100101 Firefox/11.0 ; GET ; /internet/default/acmecomunication/p/Arch/Arch/index.html ;"
should be split this way:
Gecko/20100101 Firefox/11.0 ;
GET ;
/internet/default/acmecomunication/p/Arch/Arch/index.html ;

November 25th, 2012, 01:02 PM
The problem is "greedy matching". When you have this regex:
it will try to match anything from your starting point to the last ';'. This is why I added negated character classes in the previous regexes I suggested. Just do the same, or use a non-greedy match (something like (.*?;)).
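A small sketch of the three behaviours, using Python's `re` module as an assumed test bed (the thread does not say which engine is in use). The patterns `(.*;)`, `(.*?;)` and `([^;]*;)` are illustrative stand-ins for a greedy match, a non-greedy match and a negated character class:

```python
import re

# Tail of the sample log line, starting at the "Gecko" part.
tail = ("Gecko/20100101 Firefox/11.0 ; GET ; "
        "/internet/default/acmecomunication/p/Arch/Arch/index.html ;")

# Greedy: .* runs to the end and backtracks only to the LAST ';'.
greedy = re.match(r"(.*;)", tail).group(1)

# Non-greedy: .*? stops at the FIRST ';' that lets the match succeed.
lazy = re.match(r"(.*?;)", tail).group(1)

# Negated class: [^;]* cannot cross a ';', so it also stops at the first one.
negated = re.match(r"([^;]*;)", tail).group(1)

print(greedy)
print(lazy)
print(negated)
```

The greedy version swallows the whole tail, while the other two stop after "Gecko/20100101 Firefox/11.0 ;".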
Last edited by Laurent_R : November 25th, 2012 at 01:21 PM.

November 25th, 2012, 01:39 PM
Quote: Originally Posted by Laurent_R
The problem is "greedy matching". When you have this regex:
it will try to match anything from your starting point to the last ';'. This is why I have added negated character classes in the previous regexes I suggested. Just do the same, or use non greedy match (something like (.*?;)).
Thank you for your patience.
I have tried the entire regex (\d+\-\d+\-\d+) ; (\d+\:\d+\:\d+.) ; (\S+) ; (\S+) ; (\d+\.\d+\.\d+\.\d+) ; ; ; (\S+) ; (\S+\/\d+\.\d+) (\([^)]+\)) ((.*?;)*)
on the website http://www.regextester.com/
Below are the results:
1: (2012-10-10)
2: (09:56:28)
3: (W3SVC)
4: (OnPreprocHeaders)
5: (192.168.1.10)
6: (site.test.acme.com)
7: (Mozilla/5.0)
8: ((X11; Linux x86_64; rv:11.0))
9: (Gecko/20100101 Firefox/11.0 ; GET ; /internet/default/acmecomunication/p/Arch/Arch/index.html ;)
10: ( /internet/default/acmecomunication/p/Arch/Arch/index.html ;)
Item 9 is not split correctly.
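The result above is consistent with how a repeated capturing group behaves in Python's `re` module and in many other engines: the inner group is overwritten on each repetition, so only its last match survives, while the outer group holds everything. A hedged sketch (Python is an assumption, and `re.findall` with `[^;]*;` is shown only as one possible way to collect every piece where the host environment allows more than a single match):

```python
import re

# Tail of the sample log line, starting at the "Gecko" part.
tail = ("Gecko/20100101 Firefox/11.0 ; GET ; "
        "/internet/default/acmecomunication/p/Arch/Arch/index.html ;")

m = re.match(r"((.*?;)*)", tail)
outer = m.group(1)      # everything matched by all repetitions together
last_only = m.group(2)  # only the final repetition of the inner group

# Collect each ';'-terminated piece as a separate match instead.
pieces = re.findall(r"[^;]*;", tail)

print(outer)
print(last_only)
print(pieces)
```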

November 25th, 2012, 03:44 PM
OK, let's go through it again.
At the end of your regex,
Code:
(\([^)]+\)) ((.*?;)*)
we have "(\([^)]+\))" matching the "X11;Linux..." part. Forget the end of it.
Now we want to match "Gecko/20100101 Firefox/11.0 ; "
We want to match everything to the semi-colon.
This can be done again with a negated character class: some word characters, as many characters different from ';' as possible, followed by a space and then a ';'. This can be:
Then again, for the GET part: some word characters, a space and a ';' :
then a whole bunch of characters different from ;, followed by ';':
So, altogether the end of your regex could look like this:
Code:
(\([^)]+\)) (\w+[^;]+ ;) (\w+ ;) (\/\w+[^;]+;)
# X11;Li... Gecko...     GET     /internet/...
which should work, unless I missed a part or made a mistake with the spaces...
The point is that I do not really want to give you the final solution (although now you probably have it), but I want to try to teach you how to use this language and to understand it. So that if I missed something out, you will be able to correct the regex when testing it.
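One way to do that testing is a short script. The sketch below uses Python's `re` module (an assumption, as the thread never names a language) and a tail pattern written with a single space before each ';', as in the sample line:

```python
import re

# Sample line copied from the first post of the thread.
line = (
    "2012-10-10 ; 09:56:28 ; W3SVC ; OnPreprocHeaders ; 192.168.1.10 ; ; ; "
    "site.test.acme.com ; Mozilla/5.0 (X11; Linux x86_64; rv:11.0) "
    "Gecko/20100101 Firefox/11.0 ; GET ; "
    "/internet/default/acmecomunication/p/Arch/Arch/index.html ;"
)

# Tail of the regex: parenthesised part, Gecko part, method, path.
tail_pattern = r"(\([^)]+\)) (\w+[^;]+ ;) (\w+ ;) (\/\w+[^;]+;)"

match = re.search(tail_pattern, line)
if match is None:
    print("no match: check the spaces around each ';'")
else:
    for number, value in enumerate(match.groups(), start=1):
        print(number, repr(value))
```

If a pattern fails, removing groups from the right-hand end one at a time and re-running shows which piece stops matching.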