Page 1 of 2 12 Last
  • Jump to page:
    #1
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    32
    Rep Power
    2

    Longest common substring


    Ok let's say I need a very little push with an assignment. For now, I had to read relatively large files and sort them into a list according certain criteria (not relevant). Done that.

    Now I need to compare two of these lists and find the longest common sub-string which exists in both lists, in fact, I have to return the length of the longest common sub-string.

    This is the code that I have so far, it works well for long enough text but it surely doesn't do the trick with the files i'm trying to deal with. Could anyone please give a hint how I could improve the time complexity?




    PHP Code:
    def findLongest(proteome1,proteome2):
        
    ""
        
    Maximum 0
        Found 
    False
        Poss 
    = []
        for 
    D in proteome1:
            for 
    M in proteome2:
                
    Found False
                
    for slice_pat in range(0,len(D)):
                    if(
    not Found):
                        
    Poss findNstring(D,len(D)-slice_pat)   
                        for 
    i in Poss:
                            if 
    i in M:
                                if 
    len(i)>Maximum:
                                    
    Maximum len(i)
                                    
    FoundTrue
                                    
    break
                                else:
                                    
    Found=True
                                    
    break
                            else:
                                break
                    else:
                        break

        return 
    Maximum

    def findNstring
    (txt,length): 
     
    ### (string,int)--->list , computes all possible sub-strings of length 'length' using txt####

        
    lst = []
        
    start 0
        end 
    length
        
    while(end!=len(txt)+1):
            if 
    not(txt[start:endin lst):
                
    lst.append(txt[start:end])
                
    start+=1
                end
    +=1
            
    else:
                
    start+=1
                end 
    +=1
        
    return lst 
  2. #2
  3. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2012
    Location
    39N 104.28W
    Posts
    158
    Rep Power
    3
    It might (and I stress, might) be marginally faster if you looped backwards: range(len(D),0,-1)
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    32
    Rep Power
    2
    Originally Posted by rrashkin
    It might (and I stress, might) be marginally faster if you looped backwards: range(len(D),0,-1)
    In fact it is looped backwards... I set it that way so that the first match found is actually the maximal match. My problem occurs due to recurring loops, I'd like a suggestion of any possible conditions where I could simply break the loop and skip to the next one...
  6. #4
  7. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,931
    Rep Power
    481

    a different algorithm


    Code:
    import logging
    from loopstatus import loopstatus
    
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s %(levelname)-8s L%(lineno)d %(message)s',
        datefmt='%H:%M:%S',
        )
    
    def longest_common_string(A,B):
        status = loopstatus()
        try:
            L = 1
            RB = range(len(B))
            for i in range(len(A)):
                a = A[i:i+L]
                if status():
                    logging.info('character %d of %d',i,len(A))
                for j in RB:
                    if a == B[j:j+L]:
                        while A[i+L] == B[j+L]:
                            a += A[i+L]
                            L += 1
                            print(a,i)
        except IndexError:
            pass
        return L
    My test strings are so long that I didn't wait for either of our functions to finish. My inputs were text from public domain pdf versions of
    A_Christmas_Carol_NT.txt Anna_Karenina_NT.txt

    The result so far starting at character position 33 is the supremely disappointing (but expected, I've looked at these files before today)

    This eBook is designed and published by Planet PDF. For more free
    eBooks visit our Web site at http://www.planetpdf.com/.


    Here's my loopstatus.py which is useful for ocassional reporting during long running, otherwise silent loops.
    This eBook is designed and published by Planet PDF. For more free
    eBooks visit our Web site at http://www.planetpdf.com/.

    A 33
    14:22:52 INFO L25 character 63 of 162805
    14:23:53 INFO L25 character 127 of 162805
    14:25:57 INFO L25 character 255 of 162805
    14:30:14 INFO L25 character 511 of 162805
    Code:
    '''
        import logging
        from loopstatus import loopstatus
    
        logging.basicConfig(
            level=logging.INFO,
            format='%(asctime)s %(levelname)-8s L%(lineno)d %(message)s',
            datefmt='%H:%M:%S',
            )
    
        status = loopstatus()
        ...
        if status():
            logging.info('%dnth occurence',status.Value)
    '''
    
    import signal
    
    class loopstatus(object):
    
        def __init__(self,base=2,seconds=5*60,target=1):
            self.i = 0
            self.base = base
            self.target = target
            self.nonzero = True
            self.timesup = False
            self.interval = seconds
            if 0 < seconds:
                self.reset_timer()
                signal.signal(signal.SIGALRM,self.trap_alarm)
    
        def trap_alarm(self,*args,**kwargs):
            self.timesup = True
    
        def reset_timer(self,interval=None):
            if interval is None:
                interval = self.interval
            if 0 < interval:
                self.timesup = False
                signal.alarm(interval)
    
        def __call__(self):
            self.i += 1
            self.nonzero = self.timesup or (self.target <= self.i)
            if self:
                self.target *= self.base
                self.reset_timer()
            return self.nonzero
    
        @property
        def Value(self):
            return self.i
    
        def __bool__(self):
            return self.nonzero
    
        def __str__(self):
            return str(self.Value)
    
        def __lt__(self,other): return self.Value < other
        def __le__(self,other): return self.Value <= other
        def __eq__(self,other): return self.Value == other
        def __ne__(self,other): return self.Value != other
        def __ge__(self,other): return self.Value >= other
        def __gt__(self,other): return self.Value > other
    
    
    def main(base=2,seconds=5*60,target=1):
        return loopstatus(base,seconds,target)
    
    
    '''
    I'd include loopstatustest.py, except that I'd also have to post my testing framework.  tests are important.  Here is an annotated sample use case terminated by keyboard interrupt:
    
        >>> import loopstatus
        >>> status = loopstatus.loopstatus(seconds=4)
        >>> while True:
        ...     if status():
        ...         print(s.Value)
        ...
        1
        2
        4
        8
        16
        32
        64
        128
        256
        512
        1024
        2048
        4096
        8192
        16384
        32768
        65536
        131072
        262144
        524288
        1048576
        2097152
        3592136       # the 4 second timer caused this event
        5106735
        6616367
        8126450
        ^C
        Traceback (most recent call last):
          file "<stdin>", line 2, in <module>
          file "ls.py", line 44, in __call__
            if self.timesup or (self.target <= self.i):
        KeyboardInterrupt
        >>>
    
    **CONTROL**
    
    To closer pack the geometric sequence---set *base* to a float between 1 and 2.
    
    For decade counting---set *base* to 10.  I don't understand the fuss regardinging the usual number of fingers, hence the binary default.
    
    For more frequent initial true tests---set *target* between 0 and 1
    
        0 < target < 1
    
    To always report True---set base to 1.
    
    To test true less often initially---increase *target*.
    
    To disable truth by time---choose 0 for *seconds*.  The default is 5*60 seconds == 5 minutes.
    
    To disable truth by count---use a large value of *target*.
    
    
    **METHODS**
    
    Call the loopstatus object to increment its count, set the object's truth, and manage internals.  It returns its truth value.
    
    loopstatus defines __nonzero__ which does not alter the object's truth.
    
    LSObject.Value returns the LSObject() call count.
    
    sample:
    
        from WriteNumber import WriteNumber
        unfinished = True
        status = loopstatus.loopstatus()
    
        while unfinished:
    
            if status():                    # Call updates truth state
                info('measure %s.',WriteNumber(status.Value)) # progress
    
            unfinished = symphony(
                verbose = not not status)   # safely use boolean state
    '''
    [code]Code tags[/code] are essential for python code and Makefiles!
  8. #5
  9. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    May 2009
    Posts
    514
    Rep Power
    33
    Lists are not efficient for any data that is "large". Unfortunately "large" is not a concrete number. To access the lists, let us say you are comparing the 100th element of each, you
    start at the beginning of list_1 and go to the 100th element
    start at the beginning of list_2 and go to the 100th element
    and if they are equal
    start at the beginning of list_1 and go to the 101st element
    start at the beginning of list_2 and go to the 101st element, etc.

    You should try with a set or dictionary since they are hashed. The example below uses a set. Also, you can eliminate around 90% of the lengths used to generate the sub-strings by using the divide by two method. Divide the longest length by two and test that length for sub-strings that match. If not found, divide the lower half by two and test for 1/4 the longest length. If found, divide the upper half, but either way you have eliminated half of the possible lengths. Continue dividing by two until you get to a manageable number of lengths remaining, say 10 or 20, start at the lowest number and increment by one until a match is not found. If you want more help with this, post some test data that we can use along with any problems.
    Code:
    ## do not print any extra messages during an actual run 
    # because printing is slow comparatively.
    
    def one_group(length, str_in):
        """ divide the string into sub-groups of the length passed 
              and return a set
        """
        str_in = str_in.lower()
        return_set = set()
        start = 0
        end = length
        print end
        while end <= len(str_in):
            this_group=str_in[start:end]
            return_set.add(this_group)
            print "     ", start, this_group
            start += 1
            end += 1
    
        return return_set
    
    def find_matches(str_1, str_2):
        for ctr in range(len(str_1)-1, 0, -1):
            group_1=one_group(ctr, str_1)
            group_2=one_group(ctr, str_2)
            for group in group_1:
                if group in group_2:
                    print "*****Found*****", ctr, group
                    return
     
    str_1 = "abcdef"
    str_2 = "cdEfg"
    if str_1 == str_2:
        print "whole strings are equal"
    else:
        find_matches(str_1, str_2)
    Last edited by dwblas; December 26th, 2012 at 01:58 PM.
  10. #6
  11. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,931
    Rep Power
    481
    I tried dwblas functions (without the demonstration print statements) for the aforementioned strings of 0.1 and 2.1 million characters, which immediately sent my computer into page fault convulsions.

    Still, hashing of some sort might produce a faster algorithm.

    Extrapolating, my code would have finished the job in a little under 2 days. No estimates available for the other algorithms.
    [code]Code tags[/code] are essential for python code and Makefiles!
  12. #7
  13. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    32
    Rep Power
    2
    Originally Posted by b49P23TIvg
    I tried dwblas functions (without the demonstration print statements) for the aforementioned strings of 0.1 and 2.1 million characters, which immediately sent my computer into page fault convulsions.

    Still, hashing of some sort might produce a faster algorithm.

    Extrapolating, my code would have finished the job in a little under 2 days. No estimates available for the other algorithms.
    Yeah, I did encounter the same problem as well. The given functions run a little faster than mine does but still it isn't fast enough. My assignment page says that the program must run under 2 minutes to be accepted. I am starting to doubt my self, the text files are immense I could swim in them, I just can't see how this can be achieved as requested...

    How could hashing improve the performance of the program?
  14. #8
  15. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,931
    Rep Power
    481
    What size are the strings please????

    You could compute a running hash, and then comparing only 1 byte would tell you if there's a chance the strings could match, only then would you need to use strcmp.

    1 2 3 4 2

    The running hash, say I use sum, of length 3

    1+2+3 starting at position 0
    (6-1)+4 starting at position 1
    (9-2)+2 starting at position 2

    I'd use exclusive or instead of sum, but whatever. It might be part of a faster algorithm.


    Of course one can build faster searches, but there's some overhead in constructing the tables.
    find "abca" in "ababcax"
    "a" matches "a"
    "b" matches "b"
    "c" does not match "a" but our precomputed table says that there is an "a" two characters before "c" in the pattern, so shift two characters to the right.
    etceteras.


    Instead, maybe the trick is to remove that inner loop, replace it with built-in (c level) code. I'll try this next.
    [code]Code tags[/code] are essential for python code and Makefiles!
  16. #9
  17. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    May 2009
    Posts
    514
    Rep Power
    33
    the text files are immense I could swim in them
    There is no way anyone can help without something more specific than "immense".
    Now I need to compare two of these lists and find the longest common sub-string which exists in both lists
    Are they lists or files? Do they contain one long string, if so how long, or do they contain many records and each record has to be checked against many other records, if so approximately how long is the longest record.
    I had to read relatively large files and sort them into a list according certain criteria (not relevant). Done that.
    This implies that they will fit into existing memory, so a solution which holds the entire files/lists in memory is called for.
    I tried dwblas functions (without the demonstration print statements) for the aforementioned strings of 0.1 and 2.1 million characters, which immediately sent my computer into page fault convulsions
    Obviously some assumptions are being made which don't apply since the lists have been sorted previously by the OP.
    Last edited by dwblas; December 26th, 2012 at 05:50 PM.
  18. #10
  19. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,931
    Rep Power
    481
    Much faster, if correct! You can verify that.
    Code:
    def longestdwl(A,B):
        L = 1
        for i in range(len(A)):
            a = A[i:i+L]
            j = 0
            while True:
                j = B.find(a,j)
                if j < 0:
                    break
                while (i+L < len(A)) and (j+L < len(B)) and (A[i+L] == B[j+L]):
                    a += A[i+L]
                    L += 1
                j += 1
        return L
    [code]Code tags[/code] are essential for python code and Makefiles!
  20. #11
  21. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    32
    Rep Power
    2
    Originally Posted by dwblas
    There is no way anyone can help without something more specific than "immense". Are they lists or files? Do they contain one long string, if so how long, or do they contain many records and each record has to be checked against many other records, if so approximately how long is the longest record. This implies that they will fit into existing memory, so a solution which holds the entire files/lists in memory is called for. Obviously some assumptions are being made which don't apply since the lists have been sorted previously by the OP.

    Ok it is time to specify more.
    The examined text is found inside a file which contains a one piece text formed of approximately 3 million chars without spaces. As a starting point I have to import the text into a LIST inside python and sort the text so that in each index of the list there's a 800+- long sub-string on average.

    In total I need to create two of these lists (2 files imported). Basically I need to find the longest sub-string which is in common in both lists, but I think that's clear. So I have no choice but run on all the indexes in both lists and check for longest common string...

    My problem is that no matter how hard I try to make my algorithm faster, I always lose against my lecturer's challenge of finishing the computing task in less than 2 minutes...

    I am sure there's some sort of trick involving "memoization" and something about Karp-Rabin algorithm of converting text to int and compare... I can post my attempt at that if you want to see? But I repeat, it is nowhere near efficient...
  22. #12
  23. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    32
    Rep Power
    2
    Originally Posted by b49P23TIvg
    Much faster, if correct! You can verify that.
    Code:
    def longestdwl(A,B):
        L = 1
        for i in range(len(A)):
            a = A[i:i+L]
            j = 0
            while True:
                j = B.find(a,j)
                if j < 0:
                    break
                while (i+L < len(A)) and (j+L < len(B)) and (A[i+L] == B[j+L]):
                    a += A[i+L]
                    L += 1
                j += 1
        return L



    Tests say that this broke all my previous records... Fast one!
  24. #13
  25. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,931
    Rep Power
    481
    What I mean by "verify" is

    I did not thoroughly test the program I submitted. Does it give correct answer?


    Also, I'm pretty sure it's faster if the arguments are in the order of (shorter_string, longer_string)
    If so you could sort the arguments by length in the function.
    Last edited by b49P23TIvg; December 27th, 2012 at 10:58 AM.
    [code]Code tags[/code] are essential for python code and Makefiles!
  26. #14
  27. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    32
    Rep Power
    2
    Originally Posted by b49P23TIvg
    What I mean by "verify" is

    I did not thoroughly test the program I submitted. Does it give correct answer?


    Also, I'm pretty sure it's faster if the arguments are in the order of (shorter_string, longer_string)
    If so you could sort the arguments by length in the function.

    Well I am about to finish the project!
    It is still quite inefficient when talking about the 3 millions char string but I am on my way to getting it done. Do you have any suggestions for a halting condition? Maybe some way to stop break the loop and prevent the program from comparing useless sub-strings?
  28. #15
  29. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,931
    Rep Power
    481
    What is your current code?

    Use the profiler. Maybe there are alternatives to the bad spot.
    [code]Code tags[/code] are essential for python code and Makefiles!
Page 1 of 2 12 Last
  • Jump to page:

IMN logo majestic logo threadwatch logo seochat tools logo