  #1  
February 8th, 2013, 10:17 AM
zipit
Find lowest keyword match in string in a performant way

Hi,

I have a raw unicode string, something like

Code:
\n\n\t\t\tBOOL MYBOOL { }\n\t\t\tLONG MYLONG { }


I now need to get the first match out of a keyword list, which for this example would be ['BOOL', 'LONG']. I also need
to make sure that every match is a valid match, in other words that it begins and ends with a whitespace, a \n or a \t.

str.find is not well suited for this, as I would have to find all keyword matches and then pick the
lowest index. I would also have to check first whether each match is a 'valid' one. I also cannot split the string, as
this would alter the 'result' of the string.

Any hints on some fancy Python classes which could help me with that, or would the best way be to crawl manually
through the string?

What I want to do:

1. find the lowest index match out of a keyword list in a string
2. make sure this match begins and ends with a whitespace, a new line or a tab
3. as fast as possible
4. splitting the string is not an option

I am currently doing this manually, by stepping char by char through the string, which is quite slow.

Thanks for reading and for your help.

  #2  
February 8th, 2013, 01:28 PM
b49P23TIvg
Code:
import re

#     0 1 2 3 4 567
S = u'\n\n\t\t\tBOOL MYBOOL { }\n\t\t\tLONG MYLONG { }'

keywords = 'bool long'.upper().split()
regexp = (r'\b({})\b'.format('|'.join(keywords)))   # for example '\\b(BOOL|LONG)\\b'

search = re.compile(regexp).search
print(search(S).start())  # prints 5 for example
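
One caveat about \b: it marks a word boundary, so it would also accept BOOL{ or (BOOL), whereas the question asks for matches surrounded by whitespace. Below is a minimal sketch of a stricter variant. It assumes that the start and end of the string should count as valid boundaries too (the question does not say). first_keyword is a hypothetical helper name, and re.escape is only there in case a keyword ever contains regex metacharacters.

```python
import re

S = u'\n\n\t\t\tBOOL MYBOOL { }\n\t\t\tLONG MYLONG { }'
keywords = ['BOOL', 'LONG']

# (?<!\S) : the match is not preceded by a non-whitespace character
# (?!\S)  : the match is not followed by a non-whitespace character
# Both lookarounds also succeed at the very start or end of the string
# (an assumption, see above).
pattern = re.compile(
    r'(?<!\S)(?:{})(?!\S)'.format('|'.join(re.escape(k) for k in keywords))
)


def first_keyword(text):
    """Return (index, keyword) of the leftmost whitespace-delimited keyword, or None."""
    match = pattern.search(text)
    if match is None:
        return None
    return match.start(), match.group()


print(first_keyword(S))                        # (5, 'BOOL') on Python 3
print(first_keyword(u'MYBOOL BOOL{ LONG x'))   # (13, 'LONG') on Python 3
```

The regex engine scans the string once from left to right and tries all keywords at each position, so the lowest index comes out of a single search call without collecting every match first.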
  #3  
February 8th, 2013, 04:15 PM
zipit
Hey,

I will have to do some reading now, as regular expressions are a subject
I have been avoiding until now. I will report back later.

Thanks for your reply.

  #4  
February 8th, 2013, 05:46 PM
zipit
Hey,

I have made some progress, but while exploring regex I have stumbled upon a question which I could
not solve with Google.

First of all, with regex I ended up splitting the string using the regex group feature, which preserves
my braces (and the information of the string). Currently I am using this very simple regular
expression to split my string:

Code:
regex = re.compile('\s|(;)')


with this test string:

Code:
GROUP
{
	//vir 2000, 3000, 4000
	LONG MYLONG { FAKEFLAG 2000, 3000, 4000;}
}


the cleaned result would be:

Code:
['GROUP',
'{',
'//vir',
'2000, 3000, 4000;',
'LONG',
'MYLONG',
'{',
'FAKEFLAG', 
'2000, 3000, 4000;',
'}',
'}']


Is it possible to use the group feature of regular expressions to do this?

Code:
['GROUP',
'{',
'//vir',
['2000', ',', '3000', ',', '4000', ';'],
'LONG',
'MYLONG',
'{',
'FAKEFLAG', 
['2000', ',', '3000', ',', '4000', ';'],
'}',
'}']


Or in other words: define a comma as delimiter, but tell the expression that I want
these splits moved into a separate group or counted as a single match, so that the split
result would be this nested list? I hope this makes sense to somebody else.

Thank you for reading and for your help.

Edit: I forgot to split the semicolons in my example, but I guess
the idea should be clear.

  #5  
February 8th, 2013, 06:06 PM
b49P23TIvg
Of the many ways to denote a string,
you probably mean either
r'\s|(;)'
or
'\\s|(;)'

regex = re.compile('\s|(;)')
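
To illustrate the point about string notation, here is a small sketch (pattern_raw is just an illustrative name). The raw form and the doubled-backslash form are the same string. The un-prefixed '\s' happens to give the same two characters only because \s is not a recognized string escape, and recent Python 3 releases emit a warning for it. With \b the difference is real, since '\b' in a normal literal is a backspace character.

```python
import re

pattern_raw = r'\s|(;)'

# A raw literal and a literal with a doubled backslash are the same string.
assert pattern_raw == '\\s|(;)'
assert len(r'\s') == 2            # a backslash followed by the letter s

# Where it really matters: '\b' in a normal literal is a backspace character,
# while r'\b' is the two-character regex token for a word boundary.
assert '\b' == '\x08'
assert re.search(r'\bLONG\b', 'LONG MYLONG').start() == 0
assert re.search('\bLONG\b', 'LONG MYLONG') is None
```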

I've used python re groups about once, and not recently.

Quote:
Originally Posted by zipit
4. splitting the string is not an option
... regex expression to split my string

Whatever. I prefer to use several expressions rather than a complicated monstrosity.
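
A rough sketch of what "several expressions" could look like for the nested list asked about above. This is one possible reading of the request, not a definitive tokenizer: it assumes that commas and semicolons should be kept as tokens inside the nested list, and the names COARSE, FINE and tokenize are made up for this example.

```python
import re

TEXT = """GROUP
{
\t//vir 2000, 3000, 4000
\tLONG MYLONG { FAKEFLAG 2000, 3000, 4000;}
}
"""

# Pass 1: coarse tokens. A comma-separated run such as "2000, 3000, 4000"
# (optionally ending in ';') stays in one piece; a brace or a lone ';' is its
# own token; anything else is a run up to the next whitespace, brace or ';'.
COARSE = re.compile(r'[^\s{};,]+(?:\s*,\s*[^\s{};,]+)+;?|[{};]|[^\s{};]+')

# Pass 2: inside a comma-separated run, pick out the values and delimiters.
FINE = re.compile(r'[^\s,;]+|[,;]')


def tokenize(text):
    """Return a flat token list in which comma-separated runs become nested lists."""
    tokens = []
    for coarse in COARSE.findall(text):
        if ',' in coarse:
            tokens.append(FINE.findall(coarse))   # nested list
        else:
            tokens.append(coarse)
    return tokens


for token in tokenize(TEXT):
    print(token)
```

For the test string this gives 'GROUP', '{', '//vir', ['2000', ',', '3000', ',', '4000'], 'LONG', 'MYLONG', '{', 'FAKEFLAG', ['2000', ',', '3000', ',', '4000', ';'], '}', '}'. The first nested list has no ';' because the comment line in the test string does not contain one.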

Last edited by b49P23TIvg : February 8th, 2013 at 06:07 PM. Reason: Smilies disabled.

  #6  
February 8th, 2013, 06:34 PM
zipit
Quote:
Originally Posted by b49P23TIvg
Quote:
Originally Posted by zipit
4. splitting the string is not an option
... regex expression to split my string

Whatever. I prefer to use several expressions rather than a complicated monstrosity.


I am aware of the contradiction, which is why I started the sentence with 'with regex i ended up...'.
Using the str.split method I would have lost the delimiter in my string, but the braces are
important for the information. With regex groups I can split the string while keeping the delimiter
and add it as a token to the result list, which is kind of the best solution.
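
For anyone reading along, the difference described here can be sketched as follows. The sample line is taken from the test string earlier in the thread, shortened to a single value; the variable names are only illustrative.

```python
import re

line = 'LONG MYLONG { FAKEFLAG 2000;}'

# str.split drops the delimiter it splits on.
print(line.split(';'))
# ['LONG MYLONG { FAKEFLAG 2000', '}']

# re.split also drops the delimiter, unless the pattern has a capturing
# group: the captured text is returned as part of the result list.
parts = re.split(r'\s|(;)', line)
print(parts)

# A whitespace match leaves None in the list (the group did not take part),
# and adjacent delimiters can leave empty strings, so filter both out.
tokens = [p for p in parts if p]
print(tokens)
# ['LONG', 'MYLONG', '{', 'FAKEFLAG', '2000', ';', '}']
```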

So my first statement could be read as an expression of my ignorance of the possibilities of regex.
Read it as: 'I cannot split the string and lose the splitting delimiter.'

PS: and yes, I actually meant \s, as I no longer have to use the raw form of my unicode string
with regex. For my clunky string-crawling approach I forced the string into this representation, because
str.whitespace had some problems with tabs for whatever reason.
