August 3rd, 2003, 04:48 PM
fix regex and flatten array?
Incoming Python newbie question...
Given a string, I'd like to write a regular expression that finds all substrings made up only of this alphabet: lowercase letters, digits, and the two characters apostrophe and period, and which meet one of the following criteria:
1. The first character of the substring that adheres to the alphabet is a lowercase letter. I think this is:
2. The first character of the substring is a digit, but the entire substring is not entirely composed of digits (i.e., 10 would not match, but 10.2 would, or 10a10.a would). I think this is:
Combining the two, I thought the regex would be:
regex = re.compile('([a-z]|[0-9]+[a-z.\'])[a-z0-9.\']*')
But that doesn't work...
For example, given the string:
100 xyz jk-10abcdef
The RE would match xyz by rule 1, jk by rule 1, and 10abcdef by rule 2. Note that 100 would not be matched: even though it starts with a digit, it violates rule 2 because it is composed entirely of digits. Also, abcdef would not match on its own because the expression should be greedy and would match 10abcdef first (?).
I was then hoping to get all the results via findall().
Any help would be much appreciated.
August 3rd, 2003, 05:47 PM
My first reaction is that you really should consider using a custom solution, i.e. some sort of state-machine. I'll look into this, though.
August 3rd, 2003, 05:58 PM
I thought that underlying a regex engine was a finite state machine, i.e., when you call compile() it translates the provided string pattern into an FSM which you could use on successive strings by using findall().
And the two regular expressions, when used separately, work. I'm just having trouble combining them into a single regular expression which I can use. I think it's just a syntax issue, really, but I can't figure it out for the life of me.
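One possible reason the combined pattern appears not to work with findall(): when a pattern contains a single capturing group, findall() returns only what that group captured, not the whole match. The sketch below compares the pattern from the first post with a variant whose group is made non-capturing using (?:...). The variable names are illustrative only, and this is a guess at the cause rather than a confirmed diagnosis.

```python
import re

text = "100 xyz jk-10abcdef"

# Pattern from the first post: the parentheses form a capturing group,
# so findall() reports only the text captured by that group.
capturing = re.compile(r"([a-z]|[0-9]+[a-z.'])[a-z0-9.']*")

# Same pattern with the group made non-capturing via (?:...),
# so findall() reports each whole match instead.
non_capturing = re.compile(r"(?:[a-z]|[0-9]+[a-z.'])[a-z0-9.']*")

group_only = capturing.findall(text)
whole_matches = non_capturing.findall(text)

print(group_only)     # ['x', 'j', '10a']
print(whole_matches)  # ['xyz', 'jk', '10abcdef']
```

Alternatively, finditer() with match.group(0) gives the full matches without changing the pattern.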
August 3rd, 2003, 06:31 PM
Well, the problem with regular expressions is that they provide very limited state-switching concepts. Python REs don't even support if-then-else constructs; even with if-then-else support, it's very hard to switch state and discard already captured groups if the match fails further ahead.
Anyway, I think I solved your problem. The RE I got is:
The RE passed a very brief test, you'll have to check further for yourself.
In advanced cases, I really would recommend a custom state machine. It's easier for another programmer to understand; it's also easier to extend and support.
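As a rough illustration of what such a custom solution might look like, here is a minimal hand-written scanner for the rules in the first post. It is only a sketch under the assumption that a token is a maximal run of alphabet characters starting at a letter or digit and is rejected when it consists of digits only. The function name find_tokens and the constant names are hypothetical and do not come from this thread.

```python
import string

# Hypothetical names, chosen for this sketch only.
LETTERS = set(string.ascii_lowercase)
DIGITS = set(string.digits)
ALPHABET = LETTERS | DIGITS | set(".'")


def find_tokens(text):
    """Return runs of alphabet characters that start with a letter or a
    digit and are not made up of digits only."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        # State 1: skip characters that cannot start a token.
        if text[i] not in LETTERS and text[i] not in DIGITS:
            i += 1
            continue
        # State 2: consume the longest run of alphabet characters.
        start = i
        while i < n and text[i] in ALPHABET:
            i += 1
        run = text[start:i]
        # Reject runs composed entirely of digits (rule 2).
        if not all(ch in DIGITS for ch in run):
            tokens.append(run)
    return tokens


print(find_tokens("100 xyz jk-10abcdef"))  # ['xyz', 'jk', '10abcdef']
```

The explicit loop is longer than a regular expression, but each rule sits on its own line, which may make it easier to extend.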
Also, the RE you provided should fail because you use single quotes both around the RE and inside, but I suspect that's just for the post.
Last edited by percivall; August 3rd, 2003 at 06:34 PM.
August 3rd, 2003, 08:56 PM
Wow, thank you so much! The regular expression you made works almost perfectly... The only time I think it messes up is when the string ends in a number, e.g.:
>>> regexp.findall('103 45bc abd 67')
['45bc', 'abd', '67']
Is there a quick way to fix it? I'd fix it myself, but I'm new to regular expressions and don't understand it as a whole. Could you possibly give me a quick run-down on how it works?
If that's too much of a bother, don't worry. I'll keep trying and I'm sure one day I'll get it...
August 3rd, 2003, 10:02 PM
The above RE is exactly why I would recommend a custom state-machine solution. Anyway. I think this works. It took me a while. Python seems not to support certain constructs you'd think it would support; it made it much harder. On the other hand, I might be mistaken.
Simply put, I've added a negative look-ahead after the first one, to check if we're dealing with only numbers leading to the end of the string, which is where the previous version failed.
It's too complicated to explain exactly how this works if you aren't already comfortable with regular expressions (that's why a custom solution is better; I'm not sure I understand it completely myself).
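The RE referred to in this post is not reproduced in the thread, so the following is only a sketch of the general negative look-ahead idea, not percivall's actual pattern. It assumes the goal is to reject any start position where a run of digits is followed either by a character outside the alphabet or by the end of the string.

```python
import re

# (?![0-9]+(?![a-z0-9.'])) fails at any position where a run of digits
# is followed by a non-alphabet character or by the end of the string,
# i.e. where the candidate substring would be all digits.
pattern = re.compile(r"(?![0-9]+(?![a-z0-9.']))[a-z0-9][a-z0-9.']*")

ends_in_number = pattern.findall("103 45bc abd 67")
original_example = pattern.findall("100 xyz jk-10abcdef")

print(ends_in_number)    # ['45bc', 'abd']
print(original_example)  # ['xyz', 'jk', '10abcdef']
```

Because the inner look-ahead also succeeds at the end of the string, a trailing number such as 67 is rejected in the same way as a number followed by a space.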
August 4th, 2003, 09:29 AM
I think this pattern should do what you want, plus it's a little bit cleaner/smaller than the other suggested one.
import re
rex = re.compile(r"((?:[a-z]|\d+(?=[a-z.']))[a-z0-9.']*)")
re.findall(rex, "3r35* r 578 moo,moo hi'there7 eTc. 56.7 84e9")
would return: ['3r35', 'r', 'moo', 'moo', "hi'there7", 'e', 'c.', '56.7', '84e9']
and for your example:
re.findall(rex, "100 xyz jk-10abcdef")
returns: ['xyz', 'jk', '10abcdef']
It's possible I misunderstood what you wanted. In which case feel free to disregard this post entirely.
August 4th, 2003, 05:40 PM
percivall, sacrilege, both of your regular expressions work for me... You guys are my heroes! I can't tell you enough times how thankful I am!
(humble python newbie)