#1

    fix regex and flatten array?


    Incoming Python newbie question...

    Given a string, I'd like to write a regular expression that finds all substrings made up only of characters from this alphabet: lowercase letters, digits, and the two characters apostrophe and period. Each substring must meet one of the following two criteria:


    1. The first character of the substring is a lowercase letter. I think this is:

    Code:
    [a-z][a-z0-9.\']*
    2. The first character of the substring is a digit, but the substring is not composed entirely of digits (i.e., 10 would not match, but 10.2 would, and so would 10a10.a). I think this is:

    Code:
    [0-9]+[a-z.\'][a-z0-9.\']*
    Combining the two, I thought the regex would be:

    Code:
    regex = re.compile('([a-z]|[0-9]+[a-z.\'])[a-z0-9.\']*')
    But that doesn't work...

    For example, given the string:

    100 xyz jk-10abcdef

    The RE would match xyz by rule 1, jk by rule 1, and 10abcdef by rule 2. Note that 100 would not be matched: even though it starts with a digit, it violates rule 2 because it is composed entirely of digits. Also, abcdef would not be matched on its own, because the expression should be greedy and would match 10abcdef first (?).

    I was then hoping to get all the results via findall().

    Any help would be much appreciated.

    Thanks,
    theperfectsoup
#2
    My first reaction is that you really should consider using a custom solution, i.e. some sort of state-machine. I'll look into this, though.
#3
    I thought that underlying a regex engine was a finite state machine, i.e., when you call compile() it translates the provided string pattern into an FSM, which you can then use on successive strings by calling findall().

    And the two regular expressions, when used separately, work. I'm just having trouble combining them into a single regular expression that I can use. I think it's just a syntax issue, really, but I can't figure it out for the life of me.
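
    One likely reason the two patterns work separately but the combined one does not: when a pattern contains a capturing group, findall() returns the text of that group instead of the whole match. Below is a minimal sketch of the symptom and one possible fix, a non-capturing group (?:...), run on the example string from the first post. The variable names are only for illustration.

```python
import re

text = "100 xyz jk-10abcdef"

# Combined pattern from the first post: the parentheses form a capturing
# group, so findall() returns only what that group matched.
capturing = re.compile(r"([a-z]|[0-9]+[a-z.'])[a-z0-9.']*")
print(capturing.findall(text))      # ['x', 'j', '10a']

# Same pattern with a non-capturing group: findall() returns whole matches.
non_capturing = re.compile(r"(?:[a-z]|[0-9]+[a-z.'])[a-z0-9.']*")
print(non_capturing.findall(text))  # ['xyz', 'jk', '10abcdef']
```

    Wrapping the whole pattern in one outer capturing group and making the inner one non-capturing would give the same list.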

    Thanks again.
#4
    Well, the problem with regular expressions is that they provide very limited state-switching concepts. Python REs don't even support if-then-else constructs; even with if-then-else support, it's very hard to switch state and discard already captured groups if the match fails further ahead.

    Anyway, I think I solved your problem. The RE I got is:
    Code:
    (?![\d]*?[^a-z\d.'])[a-z\d][a-z\d.']*?(?![a-z\d.'])
    The RE passed a very brief test, you'll have to check further for yourself.

    In advanced cases, I really would recommend a custom state machine. It's easier for another programmer to understand; it's also easier to extend and support.
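
    For comparison, here is a rough sketch of what such a hand-written scanner could look like. It is not code from this thread: the names ALPHABET and find_tokens are made up for illustration, and it assumes that a leading period or apostrophe is skipped rather than starting a match.

```python
ALPHABET = set("abcdefghijklmnopqrstuvwxyz0123456789.'")

def find_tokens(text):
    """Collect maximal runs of alphabet characters, then apply the two
    rules from the first post to each run."""
    tokens = []
    run = []
    # The trailing space is outside the alphabet, so it flushes the last run.
    for ch in text + " ":
        if ch in ALPHABET:
            run.append(ch)   # state: inside a run
            continue
        # State: run just ended (or we were between runs).
        # Assumption: a match may not start with '.' or "'", so drop those.
        candidate = "".join(run).lstrip(".'")
        run = []
        # Rule 1 or rule 2: non-empty and not made up of digits only.
        if candidate and not candidate.isdigit():
            tokens.append(candidate)
    return tokens

print(find_tokens("100 xyz jk-10abcdef"))  # ['xyz', 'jk', '10abcdef']
print(find_tokens("103 45bc abd 67"))      # ['45bc', 'abd']
```

    The rules live in two plain if statements, which is the readability argument being made here; the cost is more code than a one-line pattern.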

    Also, the RE you provided uses single quotes both around the pattern and inside it. That only works because the inner quotes are escaped as \'; wrapping the pattern in double quotes is easier to read.
    Last edited by percivall; August 3rd, 2003 at 05:34 PM.
#5
    Wow, thank you so much! The regular expression you made works almost perfectly... The only time I think it messes up is when the string ends in a number, e.g.:

    >>> regexp.findall('103 45bc abd 67')
    ['45bc', 'abd', '67']

    Is there a quick way to fix it? I'd fix it myself, but I'm new to regular expressions and don't understand it as a whole. Could you possibly give me a quick run-down on how it works?

    If that's too much of a bother, don't worry. I'll keep trying and I'm sure one day I'll get it...

    Thanks,
    theperfectsoup
#6
    Code:
    (?![\d]*?[^a-z\d.'])(?![\d]*?$)[a-z\d][a-z\d.']*?(?=[^a-z\d.']|$)
    The above RE is exactly why I would recommend a custom state-machine solution. Anyway. I think this works. It took me a while. Python seems not to support certain constructs you'd think it would support; it made it much harder. On the other hand, I might be mistaken.

    Simply put, I've added a negative look-ahead after the first one, to check if we're dealing with only numbers leading to the end of the string, which is where the previous version failed.

    It's too complicated to explain exactly how this works if you're new to regular expressions (that's why a custom solution is better; I'm not sure I understand it completely myself).
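
    As a rough run-down, the pattern above can be laid out with re.VERBOSE, which ignores whitespace in the pattern and allows comments. The comments are one reading of each piece, not an authoritative explanation; the variable names previous and pattern are only for illustration.

```python
import re

# First version (post #4), kept for comparison.
previous = re.compile(r"(?![\d]*?[^a-z\d.'])[a-z\d][a-z\d.']*?(?![a-z\d.'])")

# Second version (post #6), unchanged except for layout and comments.
pattern = re.compile(r"""
    (?![\d]*?[^a-z\d.'])   # reject: digits (if any), then a character outside the alphabet
    (?![\d]*?$)            # reject: digits (if any) running to the end of the string
    [a-z\d]                # first character is a lowercase letter or a digit
    [a-z\d.']*?            # then as few alphabet characters as possible
    (?=[^a-z\d.']|$)       # stop where the next character is outside the alphabet, or at the end
""", re.VERBOSE)

text = "103 45bc abd 67"
print(previous.findall(text))  # ['45bc', 'abd', '67']
print(pattern.findall(text))   # ['45bc', 'abd']
```

    The first look-ahead already rejects digit-only runs followed by a space or other outside character; the second one covers the remaining case where the digits are the last thing in the string.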

    Have fun.
#7
    I think this pattern should do what you want, plus it's a little bit cleaner/smaller than the other suggested one.
    ((?:[a-z]|\d+(?=[a-z.']))[a-z0-9.']*)

    e.g.
    Code:
    import re
    # raw string, so that \d reaches the regex engine unchanged
    rex = re.compile(r"((?:[a-z]|\d+(?=[a-z.']))[a-z0-9.']*)")
    re.findall(rex, "3r35* r 578 moo,moo hi'there7 eTc. 56.7 84e9")
    would return: ['3r35', 'r', 'moo', 'moo', "hi'there7", 'e', 'c.', '56.7', '84e9']

    and for your example:
    Code:
    re.findall(rex, "100 xyz jk-10abcdef")
    returns: ['xyz', 'jk', '10abcdef']
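
    A quick way to cross-check this pattern against the one from post #6 is to run both over the sample strings used in this thread. This is only a spot check on three inputs, not a proof that the two patterns always agree; the names pattern_a, pattern_b and samples are made up for illustration.

```python
import re

# pattern_a is copied from post #6, pattern_b from post #7.
pattern_a = re.compile(r"(?![\d]*?[^a-z\d.'])(?![\d]*?$)[a-z\d][a-z\d.']*?(?=[^a-z\d.']|$)")
pattern_b = re.compile(r"((?:[a-z]|\d+(?=[a-z.']))[a-z0-9.']*)")

samples = [
    "100 xyz jk-10abcdef",
    "103 45bc abd 67",
    "3r35* r 578 moo,moo hi'there7 eTc. 56.7 84e9",
]

for s in samples:
    a = pattern_a.findall(s)
    b = pattern_b.findall(s)
    # Report whether the two patterns produced the same list for this input.
    print(repr(s), "->", a, "(same)" if a == b else "(differs: %r)" % (b,))
```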

    It's possible I misunderstood what you wanted. In which case feel free to disregard this post entirely.
#8

    Thanks guys!


    percivall, sacrilege, both of your regular expressions work for me... You guys are my heroes! I can't tell you enough times how thankful I am!

    Thanks again,
    theperfectsoup
    (humble python newbie)
