Regex Programming
 
  #1  
November 25th, 2012, 07:38 AM
vgy_regex
Trouble parsing CSV file

I have a log file. Below is a single line from it:

2012-10-10 ; 09:56:28 ; W3SVC ; OnPreprocHeaders ; 192.168.1.10 ; ; ; site.test.acme.com ; Mozilla/5.0 (X11; Linux x86_64; rv:11.0) Gecko/20100101 Firefox/11.0 ; GET ; /internet/default/acmecomunication/p/Arch/Arch/index.html ;

I would like to parse the line using ; as a separator:



2012-10-10 field #1
09:56:28 field #2
W3SVC field #3
OnPreprocHeaders field #4
192.168.1.10 field #5
blank field #6
blank field #7
blank field #8
site.test.acme.com field #9
Mozilla/5.0 field #10
(X11; Linux x86_64; rv:11.0) field #11
Gecko/20100101 Firefox/11.0 field #12
GET field #13
/internet/default/acmecomunication/p/Arch/Arch/index.html field #14

I have tried this:

(\d+\-\d+\-\d+) ; (\d+\:\d+\:\d+.) ; (\S+) ; (\S+) ; (\d+\.\d+\.\d+\.\d+) ; ; ; (\S+) ; (\S+)\/(\d+)\.(\d+) (\(.*?\)) (.*\;)

but it doesn't work well.

Below is the result:

1: 2012-10-10
2: 09:56:28
3: W3SVC
4: OnPreprocHeaders
5: 192.168.1.10
6: site.test.acme.com
7: Mozilla
8: 5
9: 0
10: (X11; Linux x86_64; rv:11.0)
11: Gecko/20100101 Firefox/11.0 ; GET ; /internet/default/acmecomunication/p/Arch/Arch/index.html ;

Any suggestion?

Thank you

  #2  
November 25th, 2012, 08:08 AM
Jacques1
Hi,

Quote:
Originally Posted by vgy_regex
Any suggestion?


Yes, don't use regular expressions. Whatever language you're using, it most certainly has a function/method that will split a string at a separator character.

The only situation where a regex might make sense is when the application that writes these logs is broken and you need to debug it by validating each line. But I guess that's not what you want.
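For readers following along, here is a minimal sketch of the split approach in Python. The thread never names a language, so Python is an assumption, and `split_fields` is a hypothetical helper. It splits on a space followed by a semicolon rather than on a bare `;`, on the assumption (taken only from the sample line) that real separators are always preceded by a space while the semicolons inside the user-agent string are not.

```python
# Sample log line from the first post.
LINE = (
    "2012-10-10 ; 09:56:28 ; W3SVC ; OnPreprocHeaders ; 192.168.1.10 ; ; ; "
    "site.test.acme.com ; Mozilla/5.0 (X11; Linux x86_64; rv:11.0) "
    "Gecko/20100101 Firefox/11.0 ; GET ; "
    "/internet/default/acmecomunication/p/Arch/Arch/index.html ;"
)


def split_fields(line, sep=" ;"):
    """Split a log line on `sep` and trim whitespace around each field.

    Assumption: every real separator is ' ;' (space, then semicolon),
    so the ';' characters inside the user-agent string are left alone.
    """
    parts = [part.strip() for part in line.split(sep)]
    # The line ends with a separator, which leaves one empty trailing item.
    if parts and parts[-1] == "":
        parts.pop()
    return parts


fields = split_fields(LINE)
for number, field in enumerate(fields, start=1):
    print(f"{number}: {field!r}")
```

On the sample line this yields 11 fields: two of them are empty, and the whole user-agent string stays together as one field. If a log ever contains a value with " ;" inside it, this assumption breaks and a stricter parser would be needed.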

  #3  
November 25th, 2012, 10:07 AM
vgy_regex
Quote:
Originally Posted by Jacques1
Hi,



Yes, don't use regular expressions. Whatever language you're using, it most certainly has a function/method that will split a string at a separator character.

The only situation where a regex might make sense is when the application that writes these logs is broken and you need to debug it by validating each line. But I guess that's not what you want.


I can only use regular expressions.

  #4  
November 25th, 2012, 12:07 PM
Laurent_R
I was also just going to advise you not to use regexes, but rather a split function (whatever its name in the language you are using).

If you insist on regexes, one of the problems I can see is that you are using capturing parentheses in places where you probably don't want them.

If you want to capture: "Mozilla/5.0", then you should not use:

Code:
 (\S+)\/(\d+)\.(\d+) 


which will capture that part into 3 fields, but rather something like:

Code:
 (\S+\/\d+\.\d+) 
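The difference between the two patterns can be checked with a short Python sketch. Python's `re` module is an assumption here (the thread does not say which engine is in use), and the variable names are illustrative only.

```python
import re

# Start of the user-agent field from the sample log line.
user_agent = "Mozilla/5.0 (X11; Linux x86_64; rv:11.0)"

# Three pairs of parentheses produce three separate captures.
three_groups = re.match(r"(\S+)\/(\d+)\.(\d+)", user_agent)

# One pair of parentheses around the whole token produces one capture.
one_group = re.match(r"(\S+\/\d+\.\d+)", user_agent)

print(three_groups.groups())  # ('Mozilla', '5', '0')
print(one_group.groups())     # ('Mozilla/5.0',)
```

Every pair of capturing parentheses adds one numbered result, which is why the original output listed "Mozilla", "5" and "0" as fields 7, 8 and 9.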

  #5  
November 25th, 2012, 12:21 PM
vgy_regex
The problem is in this part: "Gecko/20100101 Firefox/11.0 ; GET ; /internet/default/acmecomunication/p/Arch/Arch/index.html ;"
The regex doesn't split the line on the ";" character.

  #6  
November 25th, 2012, 12:42 PM
Laurent_R
Replace at the end of your regex:

Code:
(\(.*?\)) (.*\;)


by:

Code:
(\([^)]+\)) (Ge[^;]+;)

which should capture:
Code:
(X11; Linux x86_64; rv:11.0)

and
Code:
Gecko/20100101 Firefox/11.0 ;

  #7  
November 25th, 2012, 12:49 PM
vgy_regex
Thank you for your rapid response.
Maybe I didn't explain the problem clearly.
I would like to split the line at every ";" character.

So the part "Gecko/20100101 Firefox/11.0 ; GET ; /internet/default/acmecomunication/p/Arch/Arch/index.html ;"

should be split this way:
Gecko/20100101 Firefox/11.0 ;
GET ;
/internet/default/acmecomunication/p/Arch/Arch/index.html ;

  #8  
November 25th, 2012, 01:02 PM
Laurent_R
The problem is "greedy matching". When you have this regex:

Code:
(.*\;)


it will try to match everything from your starting point up to the last ';'. This is why I added negated character classes in the previous regexes I suggested. Just do the same, or use a non-greedy match (something like (.*?;)).
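To see the three behaviours side by side, here is a small Python sketch. Python's `re` module is an assumption (the thread does not name an engine), and the variable names are illustrative.

```python
import re

# Tail of the sample log line, starting at the Gecko token.
tail = (
    "Gecko/20100101 Firefox/11.0 ; GET ; "
    "/internet/default/acmecomunication/p/Arch/Arch/index.html ;"
)

# Greedy: .* grabs as much as it can, so the match ends at the LAST ';'.
greedy = re.match(r"(.*;)", tail).group(1)

# Non-greedy: .*? grabs as little as it can, so the match ends at the FIRST ';'.
lazy = re.match(r"(.*?;)", tail).group(1)

# Negated class: [^;]* cannot cross a ';' at all, so it also stops at the first one.
negated = re.match(r"([^;]*;)", tail).group(1)

print(greedy == tail)  # True
print(lazy)            # Gecko/20100101 Firefox/11.0 ;
print(negated)         # Gecko/20100101 Firefox/11.0 ;
```

The negated character class is often preferred because it states the intent directly (anything except a semicolon) and cannot be stretched by backtracking the way `.*?` can when the rest of a longer pattern fails to match.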

Last edited by Laurent_R : November 25th, 2012 at 01:21 PM.

  #9  
November 25th, 2012, 01:39 PM
vgy_regex
Quote:
Originally Posted by Laurent_R
The problem is "greedy matching". When you have this regex:

Code:
(.*\;)


it will try to match anything from your starting point to the last ';'. This is why I have added negated character classes in the previous regexes I suggested. Just do the same, or use non greedy match (something like (.*?;)).


Thank you for your patience.

I have tried the entire regex (\d+\-\d+\-\d+) ; (\d+\:\d+\:\d+.) ; (\S+) ; (\S+) ; (\d+\.\d+\.\d+\.\d+) ; ; ; (\S+) ; (\S+\/\d+\.\d+) (\([^)]+\)) ((.*?;)*)

on http://www.regextester.com/

Below are the results:

1: (2012-10-10)
2: (09:56:28)
3: (W3SVC)
4: (OnPreprocHeaders)
5: (192.168.1.10)
6: (site.test.acme.com)
7: (Mozilla/5.0)
8: ((X11; Linux x86_64; rv:11.0))
9: (Gecko/20100101 Firefox/11.0 ; GET ; /internet/default/acmecomunication/p/Arch/Arch/index.html ;)
10: ( /internet/default/acmecomunication/p/Arch/Arch/index.html ;)


Result number 9 is not split correctly.
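Results 9 and 10 are consistent with how a repeated capturing group usually behaves: the outer group captures everything the repetition consumed, while the inner group keeps only its last iteration. A short Python sketch of that behaviour follows; Python's `re` module is an assumption, since the online tester may run a different engine, and `re.findall` is shown only as one possible way to collect every piece when the host language allows it.

```python
import re

# Tail of the sample log line, starting at the Gecko token.
tail = (
    "Gecko/20100101 Firefox/11.0 ; GET ; "
    "/internet/default/acmecomunication/p/Arch/Arch/index.html ;"
)

# The inner group (.*?;) is repeated three times by the trailing *.
repeated = re.match(r"((.*?;)*)", tail)
outer = repeated.group(1)  # everything consumed by the repetition
inner = repeated.group(2)  # only the LAST iteration of the inner group

# findall returns every non-overlapping match instead of just the last one.
pieces = re.findall(r"[^;]+;", tail)

print(outer == tail)  # True
print(repr(inner))    # ' /internet/default/acmecomunication/p/Arch/Arch/index.html ;'
print(len(pieces))    # 3
```

A single match with a fixed pattern can only return as many captures as it has pairs of parentheses, so each field needs its own group if the tool accepts nothing but one regex.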

  #10  
November 25th, 2012, 03:44 PM
Laurent_R
OK, let's go through it again.

At the end of your regex,

Code:
(\([^)]+\)) ((.*?;)*)


we have "(\([^)]+\))" matching the "(X11; Linux..." part. Discard the rest of it.

Now we want to match "Gecko/20100101 Firefox/11.0 ; "

We want to match everything up to the semicolon.

This can be done again with a negated character class: some word characters, as many characters different from ';' as possible, followed by a space and then a ';'. This can be:

Code:
(\w+[^;]+ ;)


Then again, for the GET part: some word characters, a space and a ';' :

Code:
(\w+ ;)


then a whole bunch of characters different from ;, followed by ';':

Code:
(\/\w+[^;]+;)


So, altogether the end of your regex could look like this:

Code:
(\([^)]+\)) (\w+[^;]+ ;) (\w+ ;) (\/\w+[^;]+;)
# X11;Li... Gecko...     GET     /internet/...


which should work, unless I missed a part or made a mistake with the spaces...
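One way to check the spacing is to run the whole pattern against the sample line. The sketch below is in Python, which is an assumption because the thread never names an engine. It joins the front of the pattern from the earlier posts to the tail described here, with single spaces placed to match the sample line exactly; `LINE`, `PATTERN` and `fields` are illustrative names.

```python
import re

# Sample log line from the first post.
LINE = (
    "2012-10-10 ; 09:56:28 ; W3SVC ; OnPreprocHeaders ; 192.168.1.10 ; ; ; "
    "site.test.acme.com ; Mozilla/5.0 (X11; Linux x86_64; rv:11.0) "
    "Gecko/20100101 Firefox/11.0 ; GET ; "
    "/internet/default/acmecomunication/p/Arch/Arch/index.html ;"
)

# Front part as posted earlier in the thread, then one group per remaining field.
# Spaces are literal here, so each one must line up with a space in the input.
PATTERN = re.compile(
    r"(\d+\-\d+\-\d+) ; (\d+\:\d+\:\d+.) ; (\S+) ; (\S+) ; "
    r"(\d+\.\d+\.\d+\.\d+) ; ; ; (\S+) ; "
    r"(\S+\/\d+\.\d+) (\([^)]+\)) "
    r"(\w+[^;]+ ;) (\w+ ;) (\/\w+[^;]+;)"
)

match = PATTERN.match(LINE)
fields = match.groups() if match else ()

for number, field in enumerate(fields, start=1):
    print(f"{number}: {field}")
```

With these spaces the pattern matches the sample line and returns 11 captures, with the Gecko, GET and path parts as separate fields 9, 10 and 11. Because every space is literal, a line with different spacing will not match; replacing the literal spaces with `\s*` is one possible way to loosen that.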

The point is that I do not really want to give you the final solution (although now you probably have it); I want to teach you how to use this language and understand it. That way, if I missed something, you will be able to correct the regex when testing it.
