#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2017
    Posts
    3
    Rep Power
    0

    Extracting integers


    Hi folks, my first post, hoping it starts off well.
    I program in Python and use regex to keep my 75yr old brain active.

    I am trying to extract integers from a mixed string including floats:

    '334ght5.89abc567'
    The closest I have so far is:

    [^A-z.]([0-9]+[0-9]+)

    which gives '34' and '67'
    Thanks in advance for any advice / critique
  2. #2
  3. Lazy Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    16,325
    Rep Power
    9645
    First step is list out all the forms of number you want to match. I count four. How about you?
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2017
    Posts
    3
    Rep Power
    0
    Originally Posted by requinix
    First step is list out all the forms of number you want to match. I count four. How about you?
    As an integer I was thinking unsigned but I could think of:
    100
    -100
    +100
    but basic unsigned would suffice.
  6. #4
  7. Lazy Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    16,325
    Rep Power
    9645
    ...oh, when you said floats you meant the string could contain them, not that you wanted to find them. That certainly clears up my "all integers including floats" confusion. It also makes my question far less relevant.

    So from the example you want 334 and 567 but not 5 or 89?

    First, let's see what you started with:
    Code:
    [^A-z.]([0-9]+[0-9]+)
    1. Saying A-z does not mean uppercase and lowercase letters. It means uppercase letters, lowercase letters, and [ \ ] ^ _ ` because those characters exist in the 'A' (ASCII 65) - 'z' (ASCII 122) range. If you want just letters then you need A-Za-z or just one of those two and a case-insensitive flag on the regex.
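    2. The ^ for negation throws the regex off. A negated set like [^A-z.] still has to consume one character, and it will match the leading "3" quite happily, so the $1 capture will only get 34. What you need is a way to say that there must not be a certain character before the match, without consuming anything.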
    3. [0-9]+[0-9]+ will only match integers with two or more digits. So the regex won't match anything in the string "abc1def" (or, as things stand now, in "abc12def" either).
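
    To see those three points in action, here is a minimal Python sketch (Python because that is what the original poster is using). The variable names are my own, purely for illustration.

```python
import re

sample = '334ght5.89abc567'

# The original pattern from the first post, unchanged.
original = re.compile(r'[^A-z.]([0-9]+[0-9]+)')

# Point 2: [^A-z.] consumes the leading digit, so the capture loses it.
original_hits = original.findall(sample)
print(original_hits)  # ['34', '67']

# Point 1: the A-z range also covers the six punctuation characters
# that sit between 'Z' and 'a' in ASCII.
extras = [c for c in '[\\]^_`' if re.fullmatch(r'[A-z]', c)]
print(extras)

# Point 3: numbers this short are missed entirely.
print(original.findall('abc1def'))   # []
print(original.findall('abc12def'))  # []
```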


    Let's start from the ground up.

    - Use \d+ to match runs of digits. \d is short for [0-9] (in Python 3 it also matches other Unicode digits unless you pass the re.ASCII flag). That will get all numbers: 334, 5, 89, 567

    - Exclude the 5 by saying that the number cannot be followed by a period, or perhaps by a period with more digits, because that identifies a float: \d+[^.]

    That drops the 5, but it creates two new problems. The match now also includes the "g" after "334", and at the end of the string the \d+ part only gets "56" instead of "567" because that '7' is needed to satisfy the [^.]. What you need now is a negative lookahead assertion, which in English means "at this point the following regex must not match". It looks ahead but does not consume or capture anything. The syntax is (?!...).
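
    A quick way to watch the consuming character set misbehave, sketched in Python with the same sample string. The capture group in the last pattern is my addition, only there to expose what \d+ itself ends up with.

```python
import re

sample = '334ght5.89abc567'

# Plain \d+ finds every run of digits, floats' halves included.
all_numbers = re.findall(r'\d+', sample)
print(all_numbers)  # ['334', '5', '89', '567']

# \d+[^.] drops the 5, but the extra character becomes part of the match.
consuming = re.findall(r'\d+[^.]', sample)
print(consuming)  # ['334g', '89a', '567']

# Capturing only the \d+ part shows the final '7' was given up to [^.].
digits_only = re.findall(r'(\d+)[^.]', sample)
print(digits_only)  # ['334', '89', '56']
```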

    - Turn the [^.] into an assertion. Note that since the assertion itself is already in the negative, the character set should now be positive. Or not even a set since it's just one character: \d+(?!\.)

    Now the regex matches 334, not 5 (excluded by the assertion), 89, and properly matches 567 (because after the 567 the \. does not match).
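
    The same check in Python, as a small sketch. It also looks at where the first match ends, to show that the lookahead inspects the next character without making it part of the match.

```python
import re

sample = '334ght5.89abc567'

# Negative lookahead: digits that are not immediately followed by a period.
lookahead_hits = re.findall(r'\d+(?!\.)', sample)
print(lookahead_hits)  # ['334', '89', '567']

# The assertion consumes nothing: the first match is exactly '334'
# and stops at index 3, right before the 'g'.
first = re.search(r'\d+(?!\.)', sample)
print(first.group(), first.end())  # 334 3
```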

    To get rid of the 89 you can use the same assertion trick but in the other direction: a negative lookbehind assertion means "at this point the following regex must not match the text immediately before". The syntax is (?<!...).

    - Add another \. assertion at the beginning of the regex: (?<!\.)\d+(?!\.)

    Now the regex matches 334, not 5 (excluded by the ?! assertion), matches 9 (while the entire 89 was excluded just the 9 is still allowed), and matches 567.

    - Tweak the leading assertion so that it cannot be a period or a number. You have to go back to a character set for it. (?<![.\d])\d+(?!\.)

    Now it matches just 334 and 567. But that's not good enough.
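
    Here are the two lookbehind steps side by side in Python, as a rough sketch. One thing worth knowing for Python specifically: the re module only accepts fixed-width patterns inside a lookbehind, which both \. and [.\d] are.

```python
import re

sample = '334ght5.89abc567'

# Lookbehind for a period only: the '8' is rejected, but the '9' is
# preceded by a digit rather than a period, so it slips through.
period_only = re.findall(r'(?<!\.)\d+(?!\.)', sample)
print(period_only)  # ['334', '9', '567']

# Lookbehind for a period or a digit: a match can no longer start
# in the middle of a run of digits.
period_or_digit = re.findall(r'(?<![.\d])\d+(?!\.)', sample)
print(period_or_digit)  # ['334', '567']
```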

    Consider
    Code:
    334ght55.89abc567
    The current regex will match 334 and 567, but also 5 - the first "5" of the two - because the trailing assertion only checks for a period.

    - Adjust that assertion to account for optional numbers between the current position being matched against and the period: (?<![.\d])\d+(?!\d*\.)
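
    Putting the finished pattern to work in Python might look like the sketch below. The names INTEGER_ONLY and extract_integers are hypothetical helpers of mine, not anything from the thread, and converting the matches with int() is an assumption about what the poster ultimately wants.

```python
import re

# Final pattern from the walkthrough above.
INTEGER_ONLY = re.compile(r'(?<![.\d])\d+(?!\d*\.)')


def extract_integers(text):
    """Return the unsigned integers in text, skipping both halves of floats.

    Hypothetical helper for illustration only.
    """
    return [int(m) for m in INTEGER_ONLY.findall(text)]


# The previous pattern still lets a stray '5' through on the second string.
previous = re.findall(r'(?<![.\d])\d+(?!\.)', '334ght55.89abc567')
print(previous)  # ['334', '5', '567']

first_result = extract_integers('334ght5.89abc567')
second_result = extract_integers('334ght55.89abc567')
print(first_result)   # [334, 567]
print(second_result)  # [334, 567]
```

    One thing to watch (my own observation, not something raised in the thread): an integer that ends a sentence, such as the 100 in "I have 100.", is followed by a period and so gets skipped as well. If that matters, the trailing assertion could be tightened to require a digit after the period.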

    Demonstration
    Last edited by requinix; September 6th, 2017 at 10:30 AM.
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2017
    Posts
    3
    Rep Power
    0
    Superb explanation, much to learn here, I shall drum up a few more strings and practice, thanks.
