fix regex and flatten array?
August 3rd, 2003, 03:48 PM
Incoming Python newbie question...
Given a string, I'd like to write a regular expression that finds all substrings made up only of characters from this alphabet: lowercase letters, digits, and the two characters apostrophe and period. Each substring must also meet either of the following criteria:
1. The first character of the substring is a lowercase letter. I think this is:
Code:
[a-z][a-z0-9.\']*
2. The first character of the substring is a digit, but the substring is not composed entirely of digits (i.e., 10 would not match, but 10.2 and 10a10.a would). I think this is:
Code:
[0-9]+[a-z.\'][a-z0-9.\']*
Combining the two, I thought the regex would be:
Code:
regex = re.compile('([a-z]|[0-9]+[a-z.\'])[a-z0-9.\']*')
But that doesn't work...
For example, given the string:
100 xyz jk-10abcdef
The RE would match xyz by rule 1, jk by rule 1, and 10abcdef by rule 2. Note that 100 would not be matched: even though it starts with a digit, it violates rule 2 because it is composed entirely of digits. Also, abcdef would not match on its own, because the expression should be greedy and would match 10abcdef first (?).
I was then hoping to get all the results via findall().
Any help would be much appreciated.
Thanks,
theperfectsoup
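
One likely reason the combined pattern appears not to work with findall() is the capturing group: when a pattern contains exactly one group, findall() returns only the text captured by that group rather than the whole match. The sketch below assumes the pattern was run through findall() as described in the post; the names `capturing` and `non_capturing` are illustrative and do not come from the thread.

```python
import re

text = "100 xyz jk-10abcdef"

# Pattern from the post: the parentheses form a capturing group, so
# findall() returns only what that group captured, not the whole match.
capturing = re.compile(r"([a-z]|[0-9]+[a-z.'])[a-z0-9.']*")
print(capturing.findall(text))      # ['x', 'j', '10a']

# Same pattern with a non-capturing group (?:...): findall() now returns
# the full text of each match.
non_capturing = re.compile(r"(?:[a-z]|[0-9]+[a-z.'])[a-z0-9.']*")
print(non_capturing.findall(text))  # ['xyz', 'jk', '10abcdef']
```

Writing the pattern as a raw string in double quotes also removes the need to escape the apostrophe.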

August 3rd, 2003, 04:47 PM
My first reaction is that you really should consider using a custom solution, i.e. some sort of state-machine. I'll look into this, though.

August 3rd, 2003, 04:58 PM
I thought that a regex engine was built on a finite state machine, i.e., when you call compile() it translates the provided string pattern into an FSM which you can then apply to successive strings with findall().
And the two regular expressions, when used separately, work. I'm just having trouble combining them into a single regular expression which I can use. I think it's just a syntax issue, really, but I can't figure it out for the life of me.
Thanks again.

August 3rd, 2003, 05:31 PM
Well, the problem with regular expressions is that they offer very limited ways to switch state. Python REs don't even support if-then-else constructs; even with if-then-else support, it's very hard to switch state and discard already captured groups if the match fails further ahead.
Anyway, I think I solved your problem. The RE I got is:
Code:
(?![\d]*?[^a-z\d.'])[a-z\d][a-z\d.']*?(?![a-z\d.'])
The RE passed a very brief test, you'll have to check further for yourself.
In advanced cases, I really would recommend a custom state machine. It's easier for another programmer to understand; it's also easier to extend and support.
Also, the RE you provided should fail because you use single quotes both around the RE and inside, but I suspect that's just for the post.
Last edited by percivall : August 3rd, 2003 at 05:34 PM.
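
The custom state machine recommended above is not shown in the thread. As a rough sketch of what such a solution could look like, the hypothetical function below (`find_tokens`, along with `ALPHABET` and `flush`, are names invented here) scans the string one character at a time and tracks two pieces of state: the token being built, and whether it has seen anything other than a digit so far.

```python
import string

# Characters allowed inside a token: lowercase letters, digits, '.' and "'".
ALPHABET = set(string.ascii_lowercase + string.digits + ".'")

def find_tokens(text):
    tokens = []
    current = []           # characters of the token being built
    has_non_digit = False  # has the token seen a letter, '.' or "'" yet?

    def flush():
        # Emit the token unless it is empty or made up only of digits.
        nonlocal current, has_non_digit
        if current and has_non_digit:
            tokens.append("".join(current))
        current = []
        has_non_digit = False

    for ch in text:
        if ch not in ALPHABET:
            flush()                      # a token (if any) ends here
        elif current:
            current.append(ch)           # inside a token: keep extending
            has_non_digit = has_non_digit or not ch.isdigit()
        elif ch.islower() or ch.isdigit():
            current.append(ch)           # a letter or digit starts a token
            has_non_digit = not ch.isdigit()
        # otherwise: a leading '.' or "'" cannot start a token, so skip it
    flush()
    return tokens

print(find_tokens("100 xyz jk-10abcdef"))  # ['xyz', 'jk', '10abcdef']
```

The all-digits rule becomes a single boolean checked when a token ends, which is the kind of state the regular expressions have to express with look-aheads.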

August 3rd, 2003, 07:56 PM
Wow, thank you so much! The regular expression you made works almost perfectly... The only time I think it messes up is when the string ends in a number, e.g.:
>>> regexp.findall('103 45bc abd 67')
['45bc', 'abd', '67']
Is there a quick way to fix it? I'd fix it myself, but I'm new to regular expressions and don't understand it as a whole. Could you possibly give me a quick run-down on how it works?
If that's too much of a bother, don't worry. I'll keep trying and I'm sure one day I'll get it...
Thanks,
theperfectsoup

August 3rd, 2003, 09:02 PM
Code:
(?![\d]*?[^a-z\d.'])(?![\d]*?$)[a-z\d][a-z\d.']*?(?=[^a-z\d.']|$)
The above RE is exactly why I would recommend a custom state-machine solution. Anyway. I think this works. It took me a while. Python seems not to support certain constructs you'd think it would support; it made it much harder. On the other hand, I might be mistaken.
Simply put, I've added a negative look-ahead after the first one, to check if we're dealing with only numbers leading to the end of the string, which is where the previous version failed.
It's too complicated to explain exactly how this works (that's why a custom solution is better; I'm not sure I understand it completely myself).
Have fun.
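
For anyone who wants the run-down that was asked for earlier, here is one possible reading of the pattern above, written with re.VERBOSE so that each piece can carry a comment. The comments are an interpretation of the pattern, not the poster's own explanation, and the name `pattern` is illustrative.

```python
import re

pattern = re.compile(r"""
    (?![\d]*?[^a-z\d.'])   # reject a start where digits run straight into a non-alphabet character
    (?![\d]*?$)            # reject a start where digits run straight into the end of the string
    [a-z\d]                # first character: a lowercase letter or a digit
    [a-z\d.']*?            # lazily extend through further alphabet characters
    (?=[^a-z\d.']|$)       # stop only where a non-alphabet character or the end follows
""", re.VERBOSE)

print(pattern.findall("103 45bc abd 67"))      # ['45bc', 'abd']
print(pattern.findall("100 xyz jk-10abcdef"))  # ['xyz', 'jk', '10abcdef']
```

The two negative look-aheads together implement the "not entirely digits" rule: a run of digits is rejected whether it is followed by a character outside the alphabet or by the end of the string.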

August 4th, 2003, 08:29 AM
I think this pattern should do what you want, plus it's a little bit cleaner/smaller than the other suggested one.
((?:[a-z]|\d+(?=[a-z.']))[a-z0-9.']*)
e.g.
Code:
import re
rex = re.compile(r"((?:[a-z]|\d+(?=[a-z.']))[a-z0-9.']*)")
re.findall(rex, "3r35* r 578 moo,moo hi'there7 eTc. 56.7 84e9")
would return: ['3r35', 'r', 'moo', 'moo', "hi'there7", 'e', 'c.', '56.7', '84e9']
and for your example:
Code:
re.findall(rex, "100 xyz jk-10abcdef")
returns: ['xyz', 'jk', '10abcdef']
It's possible I misunderstood what you wanted. In which case feel free to disregard this post entirely.
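
A quick way to compare this pattern with the earlier look-ahead version is to run both over the sample strings used in this thread. The sketch below does only that; the names `lookahead_rx`, `compact_rx`, `samples` and `results` are made up for illustration, and agreement on three strings is not proof that the two patterns are equivalent in general.

```python
import re

# The second look-ahead pattern posted earlier in the thread.
lookahead_rx = re.compile(
    r"(?![\d]*?[^a-z\d.'])(?![\d]*?$)[a-z\d][a-z\d.']*?(?=[^a-z\d.']|$)"
)
# The shorter pattern from this post.
compact_rx = re.compile(r"((?:[a-z]|\d+(?=[a-z.']))[a-z0-9.']*)")

samples = [
    "100 xyz jk-10abcdef",
    "103 45bc abd 67",
    "3r35* r 578 moo,moo hi'there7 eTc. 56.7 84e9",
]

# Map each sample to the pair of result lists produced by the two patterns.
results = {s: (lookahead_rx.findall(s), compact_rx.findall(s)) for s in samples}

for s, (a, b) in results.items():
    print(repr(s), "same" if a == b else "different")
```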

August 4th, 2003, 04:40 PM
Thanks guys!
percivall, sacrilege, both of your regular expressions work for me... You guys are my heroes! I can't tell you enough times how thankful I am!
Thanks again,
theperfectsoup
(humble python newbie)