Find lowest keyword match in string in a performant way
February 8th, 2013, 10:17 AM
Registered User
Join Date: Feb 2013
Posts: 4
Time spent in forums: 37 m 9 sec
Reputation Power: 0
Hi,
I have a raw unicode string, something like
Code:
\n\n\t\t\tBOOL MYBOOL { }\n\t\t\tLONG MYLONG { }
I now need to find the first match from a keyword list, which for this example would be ['BOOL', 'LONG']. I also need
to make sure that every match is a valid match, in other words that it begins and ends with a space, a \n or a \t.
str.find is not well suited for this, as I would have to find all keyword matches and then pick the
lowest index, and I would also have to check each match for validity first. I also cannot split the string, as
this would alter the 'result' of the string.
Any hints on some fancy Python classes which could help me with that, or would the best way be to crawl manually
through the string?
What I want to do:
1. find the lowest-index match from a keyword list in a string
2. make sure this match begins and ends with a space, a newline or a tab
3. as fast as possible
4. splitting the string is not an option
I am currently doing this manually, by stepping char by char through the string, which is quite slow.
Thanks for reading and for your help.

February 8th, 2013, 01:28 PM
Contributing User
Code:
import re
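#     0 1 2 3 4 567   (indices of the leading characters of S)
S = u'\n\n\t\t\tBOOL MYBOOL { }\n\t\t\tLONG MYLONG { }'
keywords = 'bool long'.upper().split()  # ['BOOL', 'LONG']
# One alternation wrapped in word boundaries, for example '\\b(BOOL|LONG)\\b'
regexp = r'\b({})\b'.format('|'.join(keywords))
search = re.compile(regexp).search  # compile once, reuse the bound method
print(search(S).start())  # prints 5, the index of the first 'BOOL'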
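The \b anchors above accept any non-word neighbour, so the BOOL in 'x=BOOL;' would match as well. If the keyword really must be surrounded by whitespace, as the original question asks, lookarounds are one option. A minimal sketch, assuming that the start and end of the string also count as valid boundaries; build_matcher and first_keyword are hypothetical helper names, not part of the re module:

```python
import re

def build_matcher(keywords):
    """Compile one pattern that matches any keyword delimited by whitespace."""
    # Escape each keyword so regex metacharacters are taken literally.
    alternation = '|'.join(re.escape(k) for k in keywords)
    # (?<!\S) : no non-whitespace character directly before the keyword
    # (?!\S)  : no non-whitespace character directly after it
    # Assumption: both checks also succeed at the very start/end of the string.
    return re.compile(r'(?<!\S)(?:{})(?!\S)'.format(alternation))

def first_keyword(pattern, text):
    """Return (index, keyword) for the lowest-index valid match, or None."""
    match = pattern.search(text)
    return (match.start(), match.group(0)) if match else None

S = u'\n\n\t\t\tBOOL MYBOOL { }\n\t\t\tLONG MYLONG { }'
pattern = build_matcher(['BOOL', 'LONG'])
print(first_keyword(pattern, S))                 # (5, 'BOOL')
print(first_keyword(pattern, 'x=BOOL; LONG y'))  # (8, 'LONG')
```

Compiling the pattern once and reusing it means each string is scanned a single time inside the re engine, instead of once per keyword with str.find plus a validity check in Python code.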

February 8th, 2013, 04:15 PM
Registered User
Hey,
I will have to do some reading now, as regular expressions are a subject which
I have been avoiding until now. I will report back later.
Thanks for your reply.

February 8th, 2013, 05:46 PM
Registered User
Hey,
I have made some progress, but while exploring regex I have stumbled upon a question which I could
not solve with Google.
First of all, with regex I ended up splitting the string using the regex group feature, which preserves
my braces (and the information in the string). Currently I am using this very simple regular
expression to split my string:
Code:
regex = re.compile('\s|(;)')
With this test string:
Code:
GROUP
{
//vir 2000, 3000, 4000
LONG MYLONG { FAKEFLAG 2000, 3000, 4000;}
}
The cleaned result would be:
Code:
['GROUP',
'{',
'//vir',
'2000, 3000, 4000;',
'LONG',
'MYLONG',
'{',
'FAKEFLAG',
'2000, 3000, 4000;',
'}',
'}']
Is it possible to use the group feature of regular expressions to produce this instead?
Code:
['GROUP',
'{',
'//vir',
['2000', ',', '3000', ',', '4000', ';'],
'LONG',
'MYLONG',
'{',
'FAKEFLAG',
['2000', ',', '3000', ',', '4000', ';'],
'}',
'}']
Or in other words, define a comma as delimiter, but tell the expression that I want
these splits moved into a separate group or counted as a single match, so that the split
result would be this nested list? I hope this makes sense to somebody else.
Thank you for reading and for your help.
Edit: I forgot to split the semicolons in my example, but I guess
the idea should be clear.
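re.split always returns a flat list, so the nesting asked about here has to be built in a second step. One possible sketch: match each comma-separated run as a single token first, then split that token on its delimiters. The TOKEN and DELIM patterns and the tokenize helper are assumptions made for this illustration, and a trailing semicolon is treated as part of the run only when it is actually present in the input:

```python
import re

# Either a comma-separated run such as "2000, 3000, 4000;" (semicolon optional),
# captured as group 1, or any other run of non-whitespace characters.
TOKEN = re.compile(r'(\w+(?:\s*,\s*\w+)+(?:\s*;)?)|\S+')
# Delimiters inside a run; the capturing group keeps them in the split result.
DELIM = re.compile(r'\s*([,;])\s*')

def tokenize(text):
    """Return a token list in which comma-separated runs become nested lists."""
    tokens = []
    for match in TOKEN.finditer(text):
        run = match.group(1)
        if run is not None:
            # Split the run on ',' and ';' and drop the empty tail, if any.
            tokens.append([part for part in DELIM.split(run) if part])
        else:
            tokens.append(match.group(0))
    return tokens

sample = ('GROUP\n{\n//vir 2000, 3000, 4000\n'
          'LONG MYLONG { FAKEFLAG 2000, 3000, 4000;}\n}')
print(tokenize(sample))
```

For the test string above this yields 'GROUP', '{', '//vir', the list ['2000', ',', '3000', ',', '4000'], then 'LONG', 'MYLONG', '{', 'FAKEFLAG', the list ['2000', ',', '3000', ',', '4000', ';'], and the two closing braces. Tokens glued together without whitespace, such as '{FAKEFLAG', are not separated by this sketch.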

February 8th, 2013, 06:06 PM
Contributing User
Of the many ways to denote a string,
you probably mean either
r'\s|(;)'
or
'\\s|(;)'
rather than the plain literal in
regex = re.compile('\s|(;)')
I've used Python re groups about once, and not recently.
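The raw and the doubled-backslash spellings denote the same pattern. The unprefixed '\s' also happens to work, because Python leaves unrecognized escapes such as \s in place, although recent Python versions warn about it. The difference turns into a real bug with escapes that Python does recognize, such as \b. A small illustration:

```python
import re

# A raw string and a doubled backslash spell the same two-character escape.
assert r'\s|(;)' == '\\s|(;)'

# \b shows why it matters: without the r prefix it is one backspace character,
# not the regex word-boundary anchor.
print(len(r'\b'), len('\b'))                   # 2 1
print(bool(re.search(r'\bBOOL\b', ' BOOL ')))  # True
print(bool(re.search('\bBOOL\b', ' BOOL ')))   # False
```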
Quote:
Originally Posted by zipit
4. splitting the string is not an option
... regex expression to split my string
Whatever. I prefer to use several expressions rather than a complicated monstrosity.
Last edited by b49P23TIvg : February 8th, 2013 at 06:07 PM.
Reason: Smilies disabled.

February 8th, 2013, 06:34 PM
Registered User
Quote:
Originally Posted by b49P23TIvg
Quote:
Originally Posted by zipit
4. splitting the string is not an option
... regex expression to split my string
Whatever. I prefer to use several expressions rather than a complicated monstrosity.
I am aware of the contradiction, which is why I started the sentence with 'with regex I ended up...'.
Using the str.split method I would have lost the delimiter passed to it. But the braces are
important for the information; with regex groups I can split the string while keeping the delimiter
and add it as a token to the result list, which is kind of the best solution.
So my first statement could be read as an expression of my ignorance of the possibilities of regex.
Read it as 'I cannot split the string and lose the splitting delimiter.'
PS: And yes, I actually meant \s, as I no longer have to use the raw form of my unicode string
with regex. For my clunky string-crawling approach I forced the string into this representation, because
str.whitespace had some problems with tabs for whatever reason.
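The difference described here can be seen side by side. A minimal sketch using the pattern from earlier in the thread on a shortened, made-up input line:

```python
import re

text = 'LONG MYLONG { FAKEFLAG 2000;}'

# str.split consumes the delimiter, so the ';' is gone from the result.
print(text.split(';'))
# ['LONG MYLONG { FAKEFLAG 2000', '}']

# re.split returns the text of capturing groups as extra list items.
raw = re.split(r'\s|(;)', text)
# Whitespace matches leave the group unset (None), so filter those out,
# together with any empty strings between adjacent delimiters.
tokens = [t for t in raw if t]
print(tokens)
# ['LONG', 'MYLONG', '{', 'FAKEFLAG', '2000', ';', '}']
```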