Python Programming
 
#1  theperfectsoup, August 3rd, 2003, 03:48 PM
fix regex and flatten array?

Incoming Python newbie question...

Given a string, I'd like to write a regular expression that finds all substrings made up only of characters from this alphabet: lowercase letters, digits, and the two characters apostrophe and period. Each substring must also meet one of the following criteria:


1. The first character of the substring that adheres to the alphabet is a lowercase letter. I think this is:

Code:
[a-z][a-z0-9.\']*


2. The first character of the substring is a digit, but the entire substring is not entirely composed of digits (i.e., 10 would not match, but 10.2 would, or 10a10.a would). I think this is:

Code:
[0-9]+[a-z.\'][a-z0-9.\']*


Combining the two, I thought the regex would be:

Code:
regex = re.compile('([a-z]|[0-9]+[a-z.\'])[a-z0-9.\']*')


But that doesn't work...

For example, given the string:

100 xyz jk-10abcdef

The RE should match xyz by rule 1, jk by rule 1, and 10abcdef by rule 2. Note that 100 would not be matched: even though it starts with a digit, it violates rule 2 because it is composed entirely of digits. Also, abcdef would not be matched on its own, because the expression should be greedy and match 10abcdef first (?).

I was then hoping to get all the results via findall().
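
One likely reason the combined pattern seems not to work with findall(): the pattern itself compiles (the inner quotes are backslash-escaped), but when a pattern contains a single capturing group, re.findall() returns the text of that group instead of the whole match. The sketch below assumes that is the symptom being described; the names capturing and non_capturing are only illustrative.

```python
import re

text = "100 xyz jk-10abcdef"

# Same pattern as in the post, written as a raw string with double quotes.
capturing = re.compile(r"([a-z]|[0-9]+[a-z.'])[a-z0-9.']*")
# Identical except that the group is non-capturing: (?:...)
non_capturing = re.compile(r"(?:[a-z]|[0-9]+[a-z.'])[a-z0-9.']*")

group_only = capturing.findall(text)       # contents of group 1 for each match
whole_match = non_capturing.findall(text)  # full text of each match

print(group_only)   # ['x', 'j', '10a']
print(whole_match)  # ['xyz', 'jk', '10abcdef']
```

Switching the group to (?:...) is the smallest change; wrapping the whole pattern in one outer capturing group would give the same list.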

Any help would be much appreciated.

Thanks,
theperfectsoup

#2  percivall, August 3rd, 2003, 04:47 PM
My first reaction is that you really should consider using a custom solution, i.e. some sort of state-machine. I'll look into this, though.

#3  theperfectsoup, August 3rd, 2003, 04:58 PM
I thought that a regex engine was built on a finite state machine underneath, i.e., when you call compile() it translates the provided pattern string into an FSM, which you can then use on successive strings by calling findall().

And the two regular expressions, when used separately, work. I'm just having trouble combining them into a single regular expression that I can use. I think it's just a syntax issue, really, but I can't figure it out for the life of me.
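
To make "work separately" concrete, here is a small sketch that compiles each rule once and reuses it with findall(). The variable names and the second sample string are assumptions for illustration. Note that rule 1 on its own also picks up abcdef from inside 10abcdef, which is why both rules need to be tried together at each position.

```python
import re

# Rule 1: starts with a lowercase letter.
rule1 = re.compile(r"[a-z][a-z0-9.']*")
# Rule 2: starts with digits, followed by at least one non-digit from the alphabet.
rule2 = re.compile(r"[0-9]+[a-z.'][a-z0-9.']*")

rule1_hits = rule1.findall("100 xyz jk-10abcdef")
rule2_hits = rule2.findall("100 10.2 10a10.a")

print(rule1_hits)  # ['xyz', 'jk', 'abcdef']
print(rule2_hits)  # ['10.2', '10a10.a']
```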

Thanks again.

#4  percivall, August 3rd, 2003, 05:31 PM
Well, the problem with regular expressions is that they provide very limited state-switching concepts. Python REs don't even support if-then-else constructs; even with if-then-else support, it's very hard to switch state and discard already captured groups if the match fails further ahead.

Anyway, I think I solved your problem. The RE I got is:
Code:
(?![\d]*?[^a-z\d.'])[a-z\d][a-z\d.']*?(?![a-z\d.'])
The RE passed a very brief test, you'll have to check further for yourself.

In advanced cases, I really would recommend a custom state machine. It's easier for another programmer to understand; it's also easier to extend and support.
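
For comparison, a hand-written scanner along these lines might look like the sketch below. It is not code from this thread: the function name find_tokens and the constants are hypothetical, and it assumes the rules as stated in the first post (a run over the alphabet that starts with a letter or digit and is not made up of digits only).

```python
LETTERS = set("abcdefghijklmnopqrstuvwxyz")
DIGITS = set("0123456789")
ALPHABET = LETTERS | DIGITS | {".", "'"}


def find_tokens(text):
    """Return runs over ALPHABET that start with a letter or digit and are not all digits."""
    tokens = []
    current = []  # characters of the run being built; empty means "outside a run"

    def flush():
        # Keep the run unless it is empty or made up of digits only.
        if current and not all(ch in DIGITS for ch in current):
            tokens.append("".join(current))
        current.clear()

    for ch in text:
        if current and ch in ALPHABET:
            current.append(ch)   # inside a run: keep extending it
        elif ch in LETTERS or ch in DIGITS:
            current.append(ch)   # outside a run: a letter or digit starts one
        else:
            flush()              # anything else ends the current run, if any
    flush()                      # the string may end in the middle of a run
    return tokens


print(find_tokens("100 xyz jk-10abcdef"))  # ['xyz', 'jk', '10abcdef']
print(find_tokens("103 45bc abd 67"))      # ['45bc', 'abd']
```

The two states are simply "inside a run" and "outside a run"; the digits-only check is deferred until the run ends, which avoids any lookahead.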

Also, the RE you provided uses single quotes both around the RE and inside it. That only works because the inner quotes are escaped with backslashes; wrapping the pattern in double quotes would be easier to read.

Last edited by percivall : August 3rd, 2003 at 05:34 PM.

#5  theperfectsoup, August 3rd, 2003, 07:56 PM
Wow, thank you so much! The regular expression you made works almost perfectly... The only time I think it messes up is when the string ends in a number, e.g.:

>>> regexp.findall('103 45bc abd 67')
['45bc', 'abd', '67']

Is there a quick way to fix it? I'd fix it myself, but I'm new to regular expressions and don't understand it as a whole. Could you possibly give me a quick run-down on how it works?

If that's too much of a bother, don't worry. I'll keep trying and I'm sure one day I'll get it...

Thanks,
theperfectsoup

#6  percivall, August 3rd, 2003, 09:02 PM
Code:
(?![\d]*?[^a-z\d.'])(?![\d]*?$)[a-z\d][a-z\d.']*?(?=[^a-z\d.']|$)

The above RE is exactly why I would recommend a custom state-machine solution. Anyway. I think this works. It took me a while. Python seems not to support certain constructs you'd think it would support; it made it much harder. On the other hand, I might be mistaken.

Simply put, I've added a negative look-ahead after the first one, to check if we're dealing with only numbers leading to the end of the string, which is where the previous version failed.

It's too complicated to explain exactly how this works if you're new to regular expressions (that's why a custom solution is better; I'm not sure I understand it completely myself).
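
One way to read the pattern piece by piece is to write it in verbose mode with a comment per part. The comments below are an interpretation of the pattern, not the author's own explanation, and the variable name pattern is only illustrative.

```python
import re

pattern = re.compile(
    r"""
    (?![\d]*?[^a-z\d.'])   # not here if digits (possibly none) lead straight to a character outside the alphabet
    (?![\d]*?$)            # not here if only digits remain before the end of the string
    [a-z\d]                # first character: a lowercase letter or a digit
    [a-z\d.']*?            # then as few alphabet characters as possible ...
    (?=[^a-z\d.']|$)       # ... until the next character is outside the alphabet, or the string ends
    """,
    re.VERBOSE,
)

print(pattern.findall("103 45bc abd 67"))      # ['45bc', 'abd']
print(pattern.findall("100 xyz jk-10abcdef"))  # ['xyz', 'jk', '10abcdef']
```

The two negative lookaheads together reject any start position that sits in a digits-only run, whether that run is followed by an outside character or by the end of the string.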

Have fun.

#7  sacrilege, August 4th, 2003, 08:29 AM
I think this pattern should do what you want, plus it's a little bit cleaner/smaller than the other suggested one.
((?:[a-z]|\d+(?=[a-z.']))[a-z0-9.']*)

e.g.
Code:
import re
# Raw string so that \d reaches the regex engine unchanged.
rex = re.compile(r"((?:[a-z]|\d+(?=[a-z.']))[a-z0-9.']*)")
re.findall(rex, "3r35* r 578 moo,moo hi'there7 eTc. 56.7 84e9")

would return: ['3r35', 'r', 'moo', 'moo', "hi'there7", 'e', 'c.', '56.7', '84e9']

and for your example:
Code:
re.findall(rex, "100 xyz jk-10abcdef")

returns: ['xyz', 'jk', '10abcdef']

It's possible I misunderstood what you wanted. In which case feel free to disregard this post entirely.

#8  theperfectsoup, August 4th, 2003, 04:40 PM
Thanks guys!

percivall, sacrilege, both of your regular expressions work for me... You guys are my heroes! I can't tell you enough times how thankful I am!

Thanks again,
theperfectsoup
(humble python newbie)
