The Shed is going Social! Join us on FaceBook and Twitter and chime in on the conversation.
|
 |
|
Dev Shed Forums
> Programming Languages
> Python Programming
|
Longest common substring
Discuss Longest common substring in the Python Programming forum on Dev Shed. Longest common substring Python Programming forum discussing coding techniques, tips and tricks, and Zope related information. Python was designed from the ground up to be a completely object-oriented programming language.
|
|
 |
|
|
|
|
|

Dev Shed Forums Sponsor:
|
|
|

December 26th, 2012, 04:28 AM
|
|
Contributing User
|
|
Join Date: Nov 2012
Posts: 32
Time spent in forums: 8 h 42 m 41 sec
Reputation Power: 1
|
|
|
Longest common substring
Ok let's say I need a very little push with an assignment. For now, I had to read relatively large files and sort them into a list according certain criteria (not relevant). Done that.
Now I need to compare two of these lists and find the longest common sub-string which exists in both lists, in fact, I have to return the length of the longest common sub-string.
This is the code that I have so far, it works well for long enough text but it surely doesn't do the trick with the files i'm trying to deal with. Could anyone please give a hint how I could improve the time complexity?
PHP Code:
def findLongest(proteome1,proteome2):
x = ""
Maximum = 0
Found = False
Poss = []
for D in proteome1:
for M in proteome2:
Found = False
for slice_pat in range(0,len(D)):
if(not Found):
Poss = findNstring(D,len(D)-slice_pat)
for i in Poss:
if i in M:
if len(i)>Maximum:
Maximum = len(i)
Found= True
break
else:
Found=True
break
else:
break
else:
break
return Maximum
def findNstring(txt,length):
### (string,int)--->list , computes all possible sub-strings of length 'length' using txt####
lst = []
start = 0
end = length
while(end!=len(txt)+1):
if not(txt[start:end] in lst):
lst.append(txt[start:end])
start+=1
end+=1
else:
start+=1
end +=1
return lst
|

December 26th, 2012, 10:23 AM
|
 |
Contributing User
|
|
Join Date: May 2012
Location: 39N 104.28W
Posts: 90
Time spent in forums: 1 Day 13 h 41 m 29 sec
Reputation Power: 2
|
|
|
It might (and I stress, might) be marginally faster if you looped backwards: range(len(D),0,-1)
|

December 26th, 2012, 11:22 AM
|
|
Contributing User
|
|
Join Date: Nov 2012
Posts: 32
Time spent in forums: 8 h 42 m 41 sec
Reputation Power: 1
|
|
Quote: | Originally Posted by rrashkin It might (and I stress, might) be marginally faster if you looped backwards: range(len(D),0,-1) |
In fact it is looped backwards... I set it that way so that the first match found is actually the maximal match. My problem occurs due to recurring loops, I'd like a suggestion of any possible conditions where I could simply break the loop and skip to the next one...
|

December 26th, 2012, 01:35 PM
|
 |
Contributing User
|
|
|
|
|
a different algorithm
Code:
import logging
from loopstatus import loopstatus
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s %(levelname)-8s L%(lineno)d %(message)s',
datefmt='%H:%M:%S',
)
def longest_common_string(A,B):
status = loopstatus()
try:
L = 1
RB = range(len(B))
for i in range(len(A)):
a = A[i:i+L]
if status():
logging.info('character %d of %d',i,len(A))
for j in RB:
if a == B[j:j+L]:
while A[i+L] == B[j+L]:
a += A[i+L]
L += 1
print(a,i)
except IndexError:
pass
return L
My test strings are so long that I didn't wait for either of our functions to finish. My inputs were text from public domain pdf versions of
A_Christmas_Carol_NT.txt Anna_Karenina_NT.txt
The result so far starting at character position 33 is the supremely disappointing (but expected, I've looked at these files before today)
This eBook is designed and published by Planet PDF. For more free
eBooks visit our Web site at http://www.planetpdf.com/.
Here's my loopstatus.py which is useful for ocassional reporting during long running, otherwise silent loops.
This eBook is designed and published by Planet PDF. For more free
eBooks visit our Web site at http://www.planetpdf.com/.
A 33
14:22:52 INFO L25 character 63 of 162805
14:23:53 INFO L25 character 127 of 162805
14:25:57 INFO L25 character 255 of 162805
14:30:14 INFO L25 character 511 of 162805
Code:
'''
import logging
from loopstatus import loopstatus
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s %(levelname)-8s L%(lineno)d %(message)s',
datefmt='%H:%M:%S',
)
status = loopstatus()
...
if status():
logging.info('%dnth occurence',status.Value)
'''
import signal
class loopstatus(object):
def __init__(self,base=2,seconds=5*60,target=1):
self.i = 0
self.base = base
self.target = target
self.nonzero = True
self.timesup = False
self.interval = seconds
if 0 < seconds:
self.reset_timer()
signal.signal(signal.SIGALRM,self.trap_alarm)
def trap_alarm(self,*args,**kwargs):
self.timesup = True
def reset_timer(self,interval=None):
if interval is None:
interval = self.interval
if 0 < interval:
self.timesup = False
signal.alarm(interval)
def __call__(self):
self.i += 1
self.nonzero = self.timesup or (self.target <= self.i)
if self:
self.target *= self.base
self.reset_timer()
return self.nonzero
@property
def Value(self):
return self.i
def __bool__(self):
return self.nonzero
def __str__(self):
return str(self.Value)
def __lt__(self,other): return self.Value < other
def __le__(self,other): return self.Value <= other
def __eq__(self,other): return self.Value == other
def __ne__(self,other): return self.Value != other
def __ge__(self,other): return self.Value >= other
def __gt__(self,other): return self.Value > other
def main(base=2,seconds=5*60,target=1):
return loopstatus(base,seconds,target)
'''
I'd include loopstatustest.py, except that I'd also have to post my testing framework. tests are important. Here is an annotated sample use case terminated by keyboard interrupt:
>>> import loopstatus
>>> status = loopstatus.loopstatus(seconds=4)
>>> while True:
... if status():
... print(s.Value)
...
1
2
4
8
16
32
64
128
256
512
1024
2048
4096
8192
16384
32768
65536
131072
262144
524288
1048576
2097152
3592136 # the 4 second timer caused this event
5106735
6616367
8126450
^C
Traceback (most recent call last):
file "<stdin>", line 2, in <module>
file "ls.py", line 44, in __call__
if self.timesup or (self.target <= self.i):
KeyboardInterrupt
>>>
**CONTROL**
To closer pack the geometric sequence---set *base* to a float between 1 and 2.
For decade counting---set *base* to 10. I don't understand the fuss regardinging the usual number of fingers, hence the binary default.
For more frequent initial true tests---set *target* between 0 and 1
0 < target < 1
To always report True---set base to 1.
To test true less often initially---increase *target*.
To disable truth by time---choose 0 for *seconds*. The default is 5*60 seconds == 5 minutes.
To disable truth by count---use a large value of *target*.
**METHODS**
Call the loopstatus object to increment its count, set the object's truth, and manage internals. It returns its truth value.
loopstatus defines __nonzero__ which does not alter the object's truth.
LSObject.Value returns the LSObject() call count.
sample:
from WriteNumber import WriteNumber
unfinished = True
status = loopstatus.loopstatus()
while unfinished:
if status(): # Call updates truth state
info('measure %s.',WriteNumber(status.Value)) # progress
unfinished = symphony(
verbose = not not status) # safely use boolean state
'''
__________________
[code] Code tags[/code] are essential for python code!
|

December 26th, 2012, 01:51 PM
|
|
Contributing User
|
|
Join Date: May 2009
Posts: 294
  
Time spent in forums: 3 Days 18 h 55 m 54 sec
Reputation Power: 7
|
|
Lists are not efficient for any data that is "large". Unfortunately "large" is not a concrete number. To access the lists, let us say you are comparing the 100th element of each, you
start at the beginning of list_1 and go to the 100th element
start at the beginning of list_2 and go to the 100th element
and if they are equal
start at the beginning of list_1 and go to the 101st element
start at the beginning of list_2 and go to the 101st element, etc.
You should try with a set or dictionary since they are hashed. The example below uses a set. Also, you can eliminate around 90% of the lengths used to generate the sub-strings by using the divide by two method. Divide the longest length by two and test that length for sub-strings that match. If not found, divide the lower half by two and test for 1/4 the longest length. If found, divide the upper half, but either way you have eliminated half of the possible lengths. Continue dividing by two until you get to a manageable number of lengths remaining, say 10 or 20, start at the lowest number and increment by one until a match is not found. If you want more help with this, post some test data that we can use along with any problems.
Code:
## do not print any extra messages during an actual run
# because printing is slow comparatively.
def one_group(length, str_in):
""" divide the string into sub-groups of the length passed
and return a set
"""
str_in = str_in.lower()
return_set = set()
start = 0
end = length
print end
while end <= len(str_in):
this_group=str_in[start:end]
return_set.add(this_group)
print " ", start, this_group
start += 1
end += 1
return return_set
def find_matches(str_1, str_2):
for ctr in range(len(str_1)-1, 0, -1):
group_1=one_group(ctr, str_1)
group_2=one_group(ctr, str_2)
for group in group_1:
if group in group_2:
print "*****Found*****", ctr, group
return
str_1 = "abcdef"
str_2 = "cdEfg"
if str_1 == str_2:
print "whole strings are equal"
else:
find_matches(str_1, str_2)
Last edited by dwblas : December 26th, 2012 at 01:58 PM.
|

December 26th, 2012, 02:26 PM
|
 |
Contributing User
|
|
|
|
|
I tried dwblas functions (without the demonstration print statements) for the aforementioned strings of 0.1 and 2.1 million characters, which immediately sent my computer into page fault convulsions.
Still, hashing of some sort might produce a faster algorithm.
Extrapolating, my code would have finished the job in a little under 2 days. No estimates available for the other algorithms.
|

December 26th, 2012, 04:55 PM
|
|
Contributing User
|
|
Join Date: Nov 2012
Posts: 32
Time spent in forums: 8 h 42 m 41 sec
Reputation Power: 1
|
|
Quote: | Originally Posted by b49P23TIvg I tried dwblas functions (without the demonstration print statements) for the aforementioned strings of 0.1 and 2.1 million characters, which immediately sent my computer into page fault convulsions.
Still, hashing of some sort might produce a faster algorithm.
Extrapolating, my code would have finished the job in a little under 2 days. No estimates available for the other algorithms. |
Yeah, I did encounter the same problem as well. The given functions run a little faster than mine does but still it isn't fast enough. My assignment page says that the program must run under 2 minutes to be accepted. I am starting to doubt my self, the text files are immense I could swim in them, I just can't see how this can be achieved as requested...
How could hashing improve the performance of the program?
|

December 26th, 2012, 05:37 PM
|
 |
Contributing User
|
|
|
|
|
What size are the strings please????
You could compute a running hash, and then comparing only 1 byte would tell you if there's a chance the strings could match, only then would you need to use strcmp.
1 2 3 4 2
The running hash, say I use sum, of length 3
1+2+3 starting at position 0
(6-1)+4 starting at position 1
(9-2)+2 starting at position 2
I'd use exclusive or instead of sum, but whatever. It might be part of a faster algorithm.
Of course one can build faster searches, but there's some overhead in constructing the tables.
find "abca" in "ababcax"
"a" matches "a"
"b" matches "b"
"c" does not match "a" but our precomputed table says that there is an "a" two characters before "c" in the pattern, so shift two characters to the right.
etceteras.
Instead, maybe the trick is to remove that inner loop, replace it with built-in (c level) code. I'll try this next.
|

December 26th, 2012, 05:46 PM
|
|
Contributing User
|
|
Join Date: May 2009
Posts: 294
  
Time spent in forums: 3 Days 18 h 55 m 54 sec
Reputation Power: 7
|
|
Quote: | the text files are immense I could swim in them | There is no way anyone can help without something more specific than "immense". Quote: | Now I need to compare two of these lists and find the longest common sub-string which exists in both lists | Are they lists or files? Do they contain one long string, if so how long, or do they contain many records and each record has to be checked against many other records, if so approximately how long is the longest record. Quote: | I had to read relatively large files and sort them into a list according certain criteria (not relevant). Done that. | This implies that they will fit into existing memory, so a solution which holds the entire files/lists in memory is called for. Quote: | I tried dwblas functions (without the demonstration print statements) for the aforementioned strings of 0.1 and 2.1 million characters, which immediately sent my computer into page fault convulsions | Obviously some assumptions are being made which don't apply since the lists have been sorted previously by the OP.
Last edited by dwblas : December 26th, 2012 at 05:50 PM.
|

December 26th, 2012, 06:19 PM
|
 |
Contributing User
|
|
|
|
Much faster, if correct! You can verify that.
Code:
def longestdwl(A,B):
L = 1
for i in range(len(A)):
a = A[i:i+L]
j = 0
while True:
j = B.find(a,j)
if j < 0:
break
while (i+L < len(A)) and (j+L < len(B)) and (A[i+L] == B[j+L]):
a += A[i+L]
L += 1
j += 1
return L
|

December 27th, 2012, 09:07 AM
|
|
Contributing User
|
|
Join Date: Nov 2012
Posts: 32
Time spent in forums: 8 h 42 m 41 sec
Reputation Power: 1
|
|
Quote: | Originally Posted by dwblas There is no way anyone can help without something more specific than "immense". Are they lists or files? Do they contain one long string, if so how long, or do they contain many records and each record has to be checked against many other records, if so approximately how long is the longest record. This implies that they will fit into existing memory, so a solution which holds the entire files/lists in memory is called for. Obviously some assumptions are being made which don't apply since the lists have been sorted previously by the OP. |
Ok it is time to specify more.
The examined text is found inside a file which contains a one piece text formed of approximately 3 million chars without spaces. As a starting point I have to import the text into a LIST inside python and sort the text so that in each index of the list there's a 800+- long sub-string on average.
In total I need to create two of these lists (2 files imported). Basically I need to find the longest sub-string which is in common in both lists, but I think that's clear. So I have no choice but run on all the indexes in both lists and check for longest common string...
My problem is that no matter how hard I try to make my algorithm faster, I always lose against my lecturer's challenge of finishing the computing task in less than 2 minutes...
I am sure there's some sort of trick involving "memoization" and something about Karp-Rabin algorithm of converting text to int and compare... I can post my attempt at that if you want to see? But I repeat, it is nowhere near efficient...
|

December 27th, 2012, 09:10 AM
|
|
Contributing User
|
|
Join Date: Nov 2012
Posts: 32
Time spent in forums: 8 h 42 m 41 sec
Reputation Power: 1
|
|
Quote: | Originally Posted by b49P23TIvg Much faster, if correct! You can verify that.
Code:
def longestdwl(A,B):
L = 1
for i in range(len(A)):
a = A[i:i+L]
j = 0
while True:
j = B.find(a,j)
if j < 0:
break
while (i+L < len(A)) and (j+L < len(B)) and (A[i+L] == B[j+L]):
a += A[i+L]
L += 1
j += 1
return L
|
Tests say that this broke all my previous records... Fast one!
|

December 27th, 2012, 10:51 AM
|
 |
Contributing User
|
|
|
|
|
What I mean by "verify" is
I did not thoroughly test the program I submitted. Does it give correct answer?
Also, I'm pretty sure it's faster if the arguments are in the order of (shorter_string, longer_string)
If so you could sort the arguments by length in the function.
Last edited by b49P23TIvg : December 27th, 2012 at 10:58 AM.
|

December 28th, 2012, 04:07 PM
|
|
Contributing User
|
|
Join Date: Nov 2012
Posts: 32
Time spent in forums: 8 h 42 m 41 sec
Reputation Power: 1
|
|
Quote: | Originally Posted by b49P23TIvg What I mean by "verify" is
I did not thoroughly test the program I submitted. Does it give correct answer?
Also, I'm pretty sure it's faster if the arguments are in the order of (shorter_string, longer_string)
If so you could sort the arguments by length in the function. |
Well I am about to finish the project!
It is still quite inefficient when talking about the 3 millions char string but I am on my way to getting it done. Do you have any suggestions for a halting condition? Maybe some way to stop break the loop and prevent the program from comparing useless sub-strings?
|

December 28th, 2012, 04:26 PM
|
 |
Contributing User
|
|
|
|
What is your current code?
Use the profiler. Maybe there are alternatives to the bad spot.
|
Developer Shed Advertisers and Affiliates
| Thread Tools |
Search this Thread |
|
|
|
| Display Modes |
Rate This Thread |
Linear Mode
|
|
Posting Rules
|
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts
HTML code is Off
|
|
|
|
|