The Shed is going Social! Join us on FaceBook and Twitter and chime in on the conversation.
|
 |
|
Dev Shed Forums
> Programming Languages
> Python Programming
|
Page 2 -
Longest common substring
Page 2 - Discuss Longest common substring in the Python Programming forum on Dev Shed. Longest common substring Python Programming forum discussing coding techniques, tips and tricks, and Zope related information. Python was designed from the ground up to be a completely object-oriented programming language.
|
|
 |
|
|
|
|
|

Dev Shed Forums Sponsor:
|
|
|

December 30th, 2012, 10:30 AM
|
|
Contributing User
|
|
Join Date: Nov 2012
Posts: 32
Time spent in forums: 8 h 42 m 41 sec
Reputation Power: 1
|
|
PHP Code:
def substring_len(f, n):
l = set()
for i in range(len(f)-n+1):
if not (f[i:i+n] in l):
l.add(f[i:i+n])
return l
def BinaryfindLongest(proteome1,proteome2):
proteome1 = sorted(proteome1)
proteome2 = sorted(proteome2)
Maximum = 0
Minimum = 0
Mid = 0
Closer = 4
Hold1 = set()
Hold2 = set()
H = set()
###finds the longest string
if(len(proteome1[-1])>len(proteome2[-1])):
Maximum = len(proteome1[-1])
else:
Maximum = len(proteome1[-1])
###tries to find common substrings of length = Mid
while(Maximum-Minimum>Closer):
Mid = (Maximum+Minimum)//2
for i in proteome1:
if(len(i)<Mid):
continue
else:
Hold1 = Hold1.union(substring_len(i,Mid))
for j in proteome2:
if(len(j)<Mid):
continue
else:
Hold2 = Hold2.union(substring_len(j,Mid))
###performs binary search according "matching" result
if(not Hold1.isdisjoint(Hold2)):
Minimum = (Mid)
else:
Maximum = (Mid)
Hold1 = set()
Hold2 = set()
###Continue...
This is the code that I have so far. I am using sets and a binary search. Basically, I sorted the lists from longest string to shortest. Then I calculated the longest string of both lists and set it to be the Maximum. (Minimum = 0 by default)
Then I look for the common sub-string between the lists which is (Maximum+Minimum)/2 long. If such match is found, then I continue with an upper binary search, otherwise, bottom binary search till I get to a point where Maximum and Minimum are close enough, that's where I will do a final manual search.
The problem is that I have no idea how to make a fast comparison of finding the common string of length "Mid" as described in the code.
Directions?
|

December 30th, 2012, 02:22 PM
|
 |
Contributing User
|
|
|
|
|
Looks like we're solving different problems altogether.
You have
if(len(proteome1[-1])>len(proteome2[-1])):
which might make sense if proteome[12] are lists of strings. I treated them as strings.
You also sort proteome[12], again, which makes sense if these are lists of strings.
However, you (probably) did not sort them by length. What is the class definition of the objects in proteome[12]?
(In other words, what is proteome1[0] )?
Also, you might want the standard bisect module.
__________________
[code] Code tags[/code] are essential for python code!
|

December 30th, 2012, 03:55 PM
|
|
Contributing User
|
|
Join Date: Nov 2012
Posts: 32
Time spent in forums: 8 h 42 m 41 sec
Reputation Power: 1
|
|
PHP Code:
Amino_Acids = ["A","C","D","E","F","G","H","I","K","L","M","N","P","Q","R","S","T","V","W","Y"]
files = ["Vibrio_cholerae_B33.txt","Mycobacterium_leprae_Br4923.txt","Mycobacterium_tuberculosis_H37Ra.txt","Plasmodium_falciparum_malaria.txt","Salmonella_Typhi_AG3.txt"]
def readProteome(filename):
lst = []
F = open(filename)
line = F.readline()
line = F.readline()
PRO = ""
while(line!=""):
for ELEM in line[0:-1]:
if ELEM in Amino_Acids:
PRO += ELEM
line = F.readline()
if (line==""):
lst.append(PRO)
break
if line[0]==">":
line = F.readline()
lst.append(PRO)
PRO = ""
return lst
def substring_len(f, n):
l = set()
for i in range(len(f)-n+1):
if not (f[i:i+n] in l):
l.add(f[i:i+n])
return l
def BinaryfindLongest(proteome1,proteome2):
proteome1.sort(key = lambda s: len(s))
proteome2.sort(key = lambda s: len(s))
Maximum = 0
Minimum = 0
Mid = 0
Closer = 4
H = set()
X = ''.join(proteome1)
Y = ''.join(proteome2)
found = False
count = len(proteome2)
###finds the longest string
if(len(proteome1[-1])>len(proteome2[-1])):
Maximum = len(proteome1[-1])
else:
Maximum = len(proteome2[-1])
while(Maximum-Minimum>Closer):
found = False
Mid = (Maximum+Minimum)//2
for i in proteome2:
if(found):
break
if(len(i)<Mid):
continue
else:
H = substring_len(i,Mid)
for k in H:
if k in X:
Minimum = (Mid)
found = True
break
if(not found):
Maximum = (Mid)
print (Maximum,Minimum,Mid)
Whoa big mistake big mistake!!!...
forget about that one code above, I posted the wrong one!
Here it is, sorting the listing according to the length of strings. You can assume freely that all elements in the list are strings since I have another function called readProteome which makes sure it happens.
Basically, we are back to the point. sets and binary search with two big lists containing a lot of strings. a good algorithm in my opinion but its taking too much time...
|
Developer Shed Advertisers and Affiliates
| Thread Tools |
Search this Thread |
|
|
|
| Display Modes |
Rate This Thread |
Linear Mode
|
|
Posting Rules
|
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts
HTML code is Off
|
|
|
|
|