Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support |

#16
December 30th, 2012, 10:30 AM
 Nightmareix35
Contributing User

Join Date: Nov 2012
Posts: 32
Time spent in forums: 8 h 42 m 41 sec
Reputation Power: 1
PHP Code:
```  def substring_len(f, n):     l = set()     for i in range(len(f)-n+1):         if not (f[i:i+n] in l):             l.add(f[i:i+n])     return l def BinaryfindLongest(proteome1,proteome2):     proteome1 = sorted(proteome1)     proteome2 = sorted(proteome2)     Maximum = 0     Minimum = 0     Mid = 0     Closer = 4     Hold1 = set()     Hold2 = set()     H = set()          ###finds the longest string     if(len(proteome1[-1])>len(proteome2[-1])):         Maximum = len(proteome1[-1])     else:         Maximum = len(proteome1[-1])     ###tries to find common substrings of length = Mid     while(Maximum-Minimum>Closer):                  Mid = (Maximum+Minimum)//2         for i in proteome1:             if(len(i)<Mid):                 continue             else:                 Hold1 = Hold1.union(substring_len(i,Mid))                          for j in proteome2:             if(len(j)<Mid):                 continue             else:                 Hold2 = Hold2.union(substring_len(j,Mid))     ###performs binary search according "matching" result         if(not Hold1.isdisjoint(Hold2)):             Minimum = (Mid)         else:             Maximum = (Mid)                      Hold1 = set()         Hold2 = set()     ###Continue...  ```

This is the code that I have so far. I am using sets and a binary search. Basically, I sorted the lists from longest string to shortest. Then I calculated the longest string of both lists and set it to be the Maximum. (Minimum = 0 by default)

Then I look for the common sub-string between the lists which is (Maximum+Minimum)/2 long. If such match is found, then I continue with an upper binary search, otherwise, bottom binary search till I get to a point where Maximum and Minimum are close enough, that's where I will do a final manual search.

The problem is that I have no idea how to make a fast comparison of finding the common string of length "Mid" as described in the code.

Directions?

#17
December 30th, 2012, 02:22 PM
 b49P23TIvg
Contributing User

Join Date: Aug 2011
Posts: 3,350
Time spent in forums: 1 Month 2 Weeks 3 Days 7 h 38 m 45 sec
Reputation Power: 383
Looks like we're solving different problems altogether.
You have

if(len(proteome1[-1])>len(proteome2[-1])):

which might make sense if proteome[12] are lists of strings. I treated them as strings.

You also sort proteome[12], again, which makes sense if these are lists of strings.

However, you (probably) did not sort them by length. What is the class definition of the objects in proteome[12]?

(In other words, what is proteome1[0] )?

Also, you might want the standard bisect module.
__________________
[code]Code tags[/code] are essential for python code!

#18
December 30th, 2012, 03:55 PM
 Nightmareix35
Contributing User

Join Date: Nov 2012
Posts: 32
Time spent in forums: 8 h 42 m 41 sec
Reputation Power: 1
PHP Code:
```  Amino_Acids = ["A","C","D","E","F","G","H","I","K","L","M","N","P","Q","R","S","T","V","W","Y"] files = ["Vibrio_cholerae_B33.txt","Mycobacterium_leprae_Br4923.txt","Mycobacterium_tuberculosis_H37Ra.txt","Plasmodium_falciparum_malaria.txt","Salmonella_Typhi_AG3.txt"] def readProteome(filename):     lst = []     F = open(filename)     line = F.readline()     line = F.readline()     PRO = ""     while(line!=""):         for ELEM in line[0:-1]:             if ELEM in Amino_Acids:                 PRO += ELEM         line = F.readline()         if (line==""):             lst.append(PRO)             break         if line[0]==">":             line = F.readline()             lst.append(PRO)             PRO = ""     return lst def substring_len(f, n):     l = set()     for i in range(len(f)-n+1):         if not (f[i:i+n] in l):             l.add(f[i:i+n])     return l def BinaryfindLongest(proteome1,proteome2):     proteome1.sort(key = lambda s: len(s))     proteome2.sort(key = lambda s: len(s))     Maximum = 0     Minimum = 0     Mid = 0     Closer = 4     H = set()     X = ''.join(proteome1)     Y = ''.join(proteome2)     found = False     count = len(proteome2)          ###finds the longest string     if(len(proteome1[-1])>len(proteome2[-1])):         Maximum = len(proteome1[-1])     else:         Maximum = len(proteome2[-1])     while(Maximum-Minimum>Closer):         found = False         Mid = (Maximum+Minimum)//2         for i in proteome2:             if(found):                 break             if(len(i)<Mid):                 continue             else:                 H = substring_len(i,Mid)                 for k in H:                     if k in X:                         Minimum = (Mid)                         found = True                         break         if(not found):             Maximum = (Mid)     print (Maximum,Minimum,Mid)  ```

Whoa big mistake big mistake!!!...

forget about that one code above, I posted the wrong one!
Here it is, sorting the listing according to the length of strings. You can assume freely that all elements in the list are strings since I have another function called readProteome which makes sure it happens.

Basically, we are back to the point. sets and binary search with two big lists containing a lot of strings. a good algorithm in my opinion but its taking too much time...

 Viewing: Dev Shed Forums > Programming Languages > Python Programming > Longest common substring