#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    4
    Rep Power
    0

    Find lowest keyword match in string in a performant way


    hi,

    i have got a raw unicode string, something like

    Code:
    \n\n\t\t\tBOOL MYBOOL { }\n\t\t\tLONG MYLONG { }
    i now need to get the first match out of a keyword list, which for the example would be ['BOOL', 'LONG']. i also need
    to make sure that every match is a valid match, or in other words begins and ends with a whitespace, a \n or a \t.

    str.find is not very well suited for this, as i would have to find all keyword matches and then pick the
    lowest index. i would also have to check first whether each match is a 'valid' one. i also cannot split the string, as
    this would alter the 'result' of the string.

    any hints on some fancy python classes which could help me with that, or would the best way be to crawl manually
    through the string?

    what i want to do:

    1. find the lowest index match out of a keyword list in a string
    2. make sure this match begins and ends with a whitespace, a new line or a tab
    3. as fast as possible
    4. splitting the string is not an option

    i am currently doing this manually, by stepping char by char through the string, which is quite slow.

    thanks for reading and your help
  2. #2
  3. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,894
    Rep Power
    481
    Code:
    import re
    
    #     0 1 2 3 4 567
    S = u'\n\n\t\t\tBOOL MYBOOL { }\n\t\t\tLONG MYLONG { }'
    
    keywords = 'bool long'.upper().split()
    regexp = (r'\b({})\b'.format('|'.join(keywords)))   # for example '\\b(BOOL|LONG)\\b'
    
    search = re.compile(regexp).search
    print(search(S).start())  # prints 5 for example
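Note that `\b` only checks for a word boundary, so a keyword followed by `{` or `;` would still match. If the stricter rule from the question is needed (keyword surrounded by whitespace), a lookaround variant might look like the sketch below. The helper name `first_keyword` is made up for illustration, and treating the start or end of the string as a valid delimiter is an assumption, not something the question states.

```python
import re

S = u'\n\n\t\t\tBOOL MYBOOL { }\n\t\t\tLONG MYLONG { }'
keywords = ['BOOL', 'LONG']

# (?<!\S) : not preceded by a non-whitespace character
# (?!\S)  : not followed by a non-whitespace character
# Both also succeed at the very start / end of the string (assumption: that is acceptable).
alternatives = '|'.join(re.escape(k) for k in keywords)
strict = re.compile(r'(?<!\S)(?:' + alternatives + r')(?!\S)')


def first_keyword(text):
    """Return (index, keyword) of the earliest whitespace-delimited keyword, or None."""
    m = strict.search(text)
    return (m.start(), m.group()) if m else None


print(first_keyword(S))
print(first_keyword(u'x BOOL{ LONG y'))
```

The regex engine scans the string once and reports the leftmost match, so there is no need to search for every keyword separately and compare indices.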
    [code]Code tags[/code] are essential for python code and Makefiles!
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    4
    Rep Power
    0
    hey,

    i will have to do some reading now, as regular expressions are a subject, which
    i have been avoiding until now. i will report back later.

    thanks for your reply.
  6. #4
  7. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    4
    Rep Power
    0
    hey,

    i have made some progress, but while exploring regex i have stumbled upon a question which i could
    not solve with google.

    first of all, with regex i ended up splitting the string using the regex group feature, which preserves
    my braces (and the information of the string). currently i am using this very simple regular
    expression to split my string:

    Code:
    regex = re.compile('\s|(;)')
    with this test string:

    Code:
    GROUP
    {
    	//vir 2000, 3000, 4000
    	LONG MYLONG { FAKEFLAG 2000, 3000, 4000;}
    }
    the cleaned result would be:

    Code:
    ['GROUP',
    '{',
    '//vir',
    '2000, 3000, 4000;',
    'LONG',
    'MYLONG',
    '{',
    'FAKEFLAG', 
    '2000, 3000, 4000;',
    '}',
    '}']
    is it possible to use the group feature of regex expressions to do this ?

    Code:
    ['GROUP',
    '{',
    '//vir',
    ['2000', ',', '3000', ',', '4000', ';'],
    'LONG',
    'MYLONG',
    '{',
    'FAKEFLAG', 
    ['2000', ',', '3000', ',', '4000', ';'],
    '}',
    '}']
    or in other words, define a comma as delimiter, but tell the expression that i want
    these splits moved into a separate group or counted as a single match, so that the split
    result would be this nested list? i hope this makes sense to anybody else

    thank you for reading and your help .

    edit: i forgot to split the semicolons in my example, but i guess
    the idea should be clear
  8. #5
  9. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,894
    Rep Power
    481
    Of the many ways to denote a string,
    you probably mean either
    r'\s|(;)'
    or
    '\\s|(;)'

    regex = re.compile('\s|(;)')

    I've used python re groups about once, and not recently.
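A quick check that the two spellings above denote the same pattern. The plain `'\s|(;)'` form only works because Python leaves unrecognised escapes such as `\s` untouched, and recent Python 3 releases emit a warning for it, so the raw form is the safer habit. The variable names are just for illustration.

```python
import re

raw = r'\s|(;)'       # raw string: backslash is kept literally
escaped = '\\s|(;)'   # normal string: backslash escaped by hand

# Both are the same six characters: \ s | ( ; )
print(raw == escaped, len(raw))

# The compiled pattern behaves identically either way
match = re.compile(raw).search('a;b')
print(match.group(1))
```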

    Originally Posted by zipit
    4. splitting the string is not an option
    ... regex expression to split my string
    Whatever. I prefer to use several expressions rather than a complicated monstrosity.
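As far as I know `re.split` always returns a flat list, so the nested result asked for above cannot come out of a single expression. A minimal sketch of the "several steps" idea: tokenize first, then group runs of numbers and their separators in a second pass. The function names and the grouping rule (a run starts at a number and ends after a `;` or at the first other token) are assumptions made for this example.

```python
import re

# Split on whitespace (dropped) or on single-character delimiters (kept via the group)
TOKEN_SPLIT = re.compile(r'\s+|([,;{}])')


def tokenize(text):
    # re.split yields None and '' around delimiters; filter those out
    return [t for t in TOKEN_SPLIT.split(text) if t]


def nest_value_lists(tokens):
    """Collect runs like 2000 , 3000 , 4000 ; into one sub-list each."""
    result, run = [], []
    for tok in tokens:
        if tok.isdigit() or (run and tok in (',', ';')):
            run.append(tok)
            if tok == ';':          # a semicolon closes the current run
                result.append(run)
                run = []
            continue
        if run:                     # any other token ends an open run
            result.append(run)
            run = []
        result.append(tok)
    if run:
        result.append(run)
    return result


text = u'GROUP\n{\n\t//vir 2000, 3000, 4000\n\tLONG MYLONG { FAKEFLAG 2000, 3000, 4000;}\n}'
nested = nest_value_lists(tokenize(text))
print(nested)
```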
    Last edited by b49P23TIvg; February 8th, 2013 at 06:07 PM. Reason: Smilies disabled.
    [code]Code tags[/code] are essential for python code and Makefiles!
  10. #6
  11. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    4
    Rep Power
    0
    Originally Posted by b49P23TIvg
    Quote:
    Originally Posted by zipit
    4. splitting the string is not an option
    ... regex expression to split my string

    Whatever. I prefer to use several expressions rather than a complicated monstrosity..
    i am aware of the contradiction, this is why i started the sentence with 'with regex i ended up...'.
    using the str.split method i would have lost the passed delimiter in my string, but the braces are
    important for the information. with regex groups i can split the string while keeping the delimiter
    and add it as a token to the result list, which is kind of the best solution.
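The difference described here can be seen in a small sketch: `str.split` throws the delimiter away, while `re.split` with a capturing group returns it as its own token. The sample text is a shortened version of the test string from the earlier post.

```python
import re

text = u'LONG MYLONG { FAKEFLAG 2000;}'

# str.split drops the delimiter it splits on
lost = text.split(';')

# re.split keeps whatever the capturing group (;) matched;
# whitespace is outside the group, so it is dropped.
# Non-participating groups show up as None, hence the filter.
kept = [t for t in re.split(r'\s|(;)', text) if t]

print(lost)
print(kept)
```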

    so my first statement could be read as an expression of my ignorance of the possibilities of regex.
    read it as 'i cannot split the string and lose the splitting delimiter.'

    ps: and yes i meant actually \s, as i do not have to use the raw form of my unicode string anymore
    with regex. for my clunky string crawling approach i forced the string into this representation, because
    str.whitespace had some problems with tabs for whatever reason.
