#1
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2004
    Posts
    5
    Rep Power
    0

    Question pickling long lists (size limit?)


    I am attempting to pickle a long list of strings.
    For instance:

    Code:
    strings_list = list()  #make an empty list                  
    for i in range(100): #then fill it up
         strings_list.append("30 chars......................")
    
    list_pickled = pickle.dumps(obj, -1) #pickle the list (-1 is for HIGHEST_PROTOCOL)
    
    list_unpickled = pickle.loads(list_pickled) #unpickle the list
    The above code works fine. However, when I lengthen the list to 1000 strings:

    Code:
    for i in range(1000): #make the list longer this time       
          strings_list.append("30 chars......................")
    A 1000-item list will not pickle properly. In my particular case, the thread that pickles throws what looks like a stack overflow error, and the unpickling thread throws a KeyError (which makes sense, since the data wasn't pickled properly). It gets even uglier using cPickle, which seems to overflow and overwrite portions of the in-process code. (Can you say exploit?)

    Can anyone help me out here? Is there a size limitation to the pickling process? If so, is there a workaround other than chopping my list into smaller pieces? Are there any alternative serialization libraries without this limitation?

    Thanks,
    Dave Mills
  2. #2
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Feb 2004
    Location
    London, England
    Posts
    1,585
    Rep Power
    1373
    The code works fine on my system, which is Python 2.3 on Windows XP. In the code you posted you pass 'obj' to pickle.dumps instead of 'strings_list', but apart from that it was fine. It still worked when I increased the list length to 1,000,000.

    What version of Python are you using, and what OS?

    BTW, the example you gave will not store 1000 (or 1,000,000) copies of the string. Since every element is the same string object, pickle stores one copy and 1,000,000 references to it. If the code that is actually failing is storing different strings, this may be relevant.
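A minimal sketch of that point, written for Python 3 rather than the Python 2.3 used in this thread. The variable names are made up for illustration. It compares the pickle of a list that repeats one string object with the pickle of a list of 1000 different strings of the same length:

```python
import pickle

text = "30 chars......................"

# The same string object appended 1000 times, as in the original example.
shared = [text for _ in range(1000)]

# 1000 different strings, each 30 characters long (illustrative stand-ins).
distinct = ["%030d" % i for i in range(1000)]

shared_pickle = pickle.dumps(shared, pickle.HIGHEST_PROTOCOL)
distinct_pickle = pickle.dumps(distinct, pickle.HIGHEST_PROTOCOL)

# The first pickle is much smaller: repeats are written as short back-references.
print(len(shared_pickle), len(distinct_pickle))

# After loading, every element of the shared list is again one single object.
restored = pickle.loads(shared_pickle)
print(all(s is restored[0] for s in restored))
```

So a test that appends the same literal over and over exercises much less of pickle than real data with unique strings does.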

    Regards,

    Dave - The Developers' Coach
    Last edited by DevCoach; March 13th, 2004 at 04:03 AM.
  4. #3
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Feb 2004
    Location
    London, England
    Posts
    1,585
    Rep Power
    1373
    I tried modifying the string each time, and it still works fine, both with pickle and cPickle.

    Code:
    import cPickle as pickle
    
    strings_list = list()  #make an empty list                  
    for i in range(1000000): #then fill it up
        strings_list.append("30 chars......................" + str(i))
    
    list_pickled = pickle.dumps(strings_list, -1) #pickle the list (-1 is for HIGHEST_PROTOCOL)
    
    f = open('test.pickle', 'wb')
    f.write(list_pickled)
    f.close()
    
    list_unpickled = pickle.loads(list_pickled) #unpickle the list
    
    print list_unpickled[0:2]
    Dave - The Developers' Coach
  6. #4
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2004
    Posts
    5
    Rep Power
    0
    Ok, a couple things...

    - You're right about pickle only keeping one copy of all the identical strings, but in my case the strings are all SHA-1 hashes in hexadecimal format. Therefore, they are almost certainly all unique. (Also, the hex SHA-1 hashes are 40 characters long, not 30 as in my original post.)

    - I actually tested out my sample code, and you're right, it does work. I just ran back to my program code to see if I was being silly, but it seems I am not. I've been able to condense my problem down to a few lines of code with which you should be able to reproduce the behavior.

    Code:
    hashes = list()
    for i in range(1000):
          chunk = str(i)  #so we have unique strings
          chunk = chunk.zfill(16384)  #pad the string with zeros until <width> is reached...in my program i'm reading 16k chunks and hashing them
          shaobj = sha.new(chunk)  #create a fresh sha object and assign the chunk
          hash = shaobj.hexdigest()  #get the hexadecimal hash
          hashes.append(hash)  #add it to the list
                
    my_custom_print_func('before pickling. hashes=' + repr(hashes))
    list_pickled = pickle.dumps(hashes, -1) 
    
    list_unpickled = pickle.loads(list_pickled) #unpickle the list
    my_custom_print_func('after unpickling. hashes=' + repr(list_unpickled))
    The above code executes successfully for me, but as soon as the number of strings in the list gets too high, it ceases to work properly. There is no error thrown, but my print statements print nothing but empty space. I experimented to see exactly where the cut-off point was. For me, the above code runs properly with 1488 strings, but ceases to function with 1489 strings.
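For anyone trying this on Python 3, where the sha module no longer exists, here is a rough equivalent using hashlib. It is a sketch under that assumption, not the code from my program, and it leaves out my_custom_print_func:

```python
import hashlib
import pickle

hashes = []
for i in range(1489):  # the list length reported as failing above
    # Unique 16 KB chunk: the number padded with zeros, encoded to bytes for hashlib.
    chunk = str(i).zfill(16384).encode("ascii")
    hashes.append(hashlib.sha1(chunk).hexdigest())  # 40-character hex digest

list_pickled = pickle.dumps(hashes, pickle.HIGHEST_PROTOCOL)
list_unpickled = pickle.loads(list_pickled)

# Compare the round-tripped list with the original instead of printing all of it.
print(len(list_unpickled), list_unpickled == hashes)
```

Comparing the two lists directly keeps any display code out of the test.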

    Do you have any idea why this would happen? Can you reproduce the behavior? Can you suggest an alternate serialization module and/or a work-around?

    Thanks,
    Dave Mills

    -edit-
    I almost forgot: I'm running win2k and using Python 2.3
    Last edited by dmills; March 13th, 2004 at 05:25 AM. Reason: forgot...
  8. #5
  9. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
    I tried running your last example but it didn't work, and I'm a little reluctant to rewrite it since I might end up stripping away the problem. Here's what I got:

    C:\Documents and Settings\Mark\Desktop>python p.py
    Traceback (most recent call last):
      File "p.py", line 5, in ?
        shaobj = sha.new(chunk) #create a fresh sha object and assign the chunk
    NameError: name 'sha' is not defined

    Anyway, as for other modules, there are one or two you should look at, although not all of these are strictly serialization.

    http://www.python.org/doc/2.3.3/lib/module-shelve.html - shelve module, which is used for object persistence.
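A minimal sketch of what using shelve could look like, written for Python 3. The path, the key name and the stand-in hashes are all made up for illustration:

```python
import os
import shelve
import tempfile

# Stand-ins for real SHA-1 hex digests (40 characters each).
hashes = ["%040x" % i for i in range(50)]

# Hypothetical location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "hashes_shelf")

# A shelf behaves like a dictionary whose contents live on disk.
with shelve.open(path) as shelf:
    shelf["hashes"] = hashes

# Reopen it later and read the list back.
with shelve.open(path) as shelf:
    restored = shelf["hashes"]

print(restored == hashes)
```

Note that shelve pickles the values it stores, so it is a convenience layer on top of pickle and would not sidestep a genuine pickle limit.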

    After this you should look at Python's dbm modules, like Dev said. Here are a few links.

    http://www.python.org/doc/2.3.3/lib/module-anydbm.html

    http://www.python.org/doc/2.3.3/lib/module-dbhash.html
    http://www.python.org/doc/2.3.3/lib/module-dbm.html
    http://www.python.org/doc/2.3.3/lib/module-dumbdbm.html
    http://www.python.org/doc/2.3.3/lib/module-gdbm.html
    http://www.python.org/doc/2.3.3/lib/module-whichdb.html
    http://www.python.org/doc/2.3.3/lib/module-bsddb.html

    If this isn't what you're looking for, you could write your own little system, i.e. save the values to a simple text file, read them back in, then split on newlines.
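A minimal sketch of that text file idea, written for Python 3. The file name and the stand-in hashes are assumptions for illustration. It only works because hex digests never contain a newline:

```python
import os
import tempfile

# Stand-ins for real SHA-1 hex digests (40 characters each).
hashes = ["%040x" % i for i in range(2000)]

# Hypothetical location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "hashes.txt")

# Save one hash per line.
with open(path, "w") as f:
    for h in hashes:
        f.write(h + "\n")

# Read everything back and split it into lines again.
with open(path) as f:
    restored = f.read().splitlines()

print(restored == hashes)
```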

    Hope this helps,

    Mark.
    programming language development: www.netytan.com Hula

  10. #6
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Feb 2004
    Location
    London, England
    Posts
    1,585
    Rep Power
    1373
    The revised code still works for me - I went up to 1,000,000 with cPickle and 100,000 with pickle with no problems.

    There are a number of possibilities:

    • pickle is innocent, and the problem is elsewhere in your program
    • there is a bug in (c)pickle that only manifests in Win2k. Unlikely, since I cannot see any bug reports raised for this on SourceForge, but possible
    • you are running out of memory. Also unlikely, given that the problem happens at fairly small list lengths


    You could try saving the strings to a regular text file instead of pickle, and see what happens.

    Dave - The Developers' Coach
  12. #7
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Feb 2004
    Location
    London, England
    Posts
    1,585
    Rep Power
    1373
    Another thought: there could be a problem with repr() handling long lists, or with my_custom_print_func handling long strings. Try printing just the first and last few elements of the list.
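A minimal sketch of that kind of check, written for Python 3. The helper name summarize and its edge parameter are made up for illustration:

```python
def summarize(items, edge=3):
    """Return the length plus the first and last few items as a short string."""
    if len(items) <= 2 * edge:
        # Short lists are shown in full.
        return "len=%d %r" % (len(items), items)
    head = ", ".join(repr(x) for x in items[:edge])
    tail = ", ".join(repr(x) for x in items[-edge:])
    return "len=%d [%s, ..., %s]" % (len(items), head, tail)


# Stand-ins for real SHA-1 hex digests (40 characters each).
hashes = ["%040x" % i for i in range(1489)]
print(summarize(hashes))
```

The output stays a few hundred characters long however big the list gets, which takes the display code out of the picture.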

    Dave - The Developers' Coach
  14. #8
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2004
    Posts
    5
    Rep Power
    0
    DevCoach, you've been correct about everything so far :-)

    My most recent example wasn't printing because the strings were too long for the wxListBox in which they were displayed. After removing the example from the rest of my code, the example works fine.

    The fact remains that all of my code works properly until I try to pickle up a long list. I must investigate this further. I'll post here again when I've figured it out (or I'm stumped ;-)

    -edit-
    In response to Netytan: you must "import sha" in order to use that module. Here's that sample converted to a stand-alone program. (and now it works fine for me, so I doubt you'll be able to reproduce my problem...but why not post code, ya know?)

    Code:
    #!/usr/bin/env python
    import sha
    import pickle
    
    def main():
        hashes = list()
        the_file = open("C:\\somepath\\somefile.ext", 'rb') #binary mode, so Windows doesn't translate line endings in the chunks
        while 1:
            chunk = the_file.read(16*1024) #read the file in 16k chunks
            if len(chunk) == 0:
                break
            shaobj = sha.new(chunk)
            hash = shaobj.hexdigest()
            hashes.append(hash)
        the_file.close()
            
        print 'before pickling. len=' + str(len(hashes)) + ' hashes=' + hashes[-1] #print the length and the last element
        list_pickled = pickle.dumps(hashes, -1) #pickle the list (-1 is for HIGHEST_PROTOCOL)
    
        list_unpickled = pickle.loads(list_pickled) #unpickle the list
        print 'after unpickling. len=' + str(len(list_unpickled)) + ' hashes=' + list_unpickled[-1] #report on the unpickled copy, not the original
    
    if __name__ == '__main__':
        main()
    Last edited by dmills; March 13th, 2004 at 04:59 PM. Reason: replying to netytan
  16. #9
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2004
    Posts
    5
    Rep Power
    0
    It took me a little while, but I tracked down the problem with my code. It seems I wasn't properly handling the "leftovers" in my socket recv() code (damn off-by-one errors...).
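For anyone who hits the same thing, here is a minimal Python 3 sketch of one common way to handle recv() leftovers. It is not the code from my program. The 4-byte length prefix and the names send_message and MessageReader are assumptions made up for illustration:

```python
import pickle
import socket
import struct

HEADER = struct.Struct("!I")  # hypothetical framing: 4-byte big-endian length prefix


def send_message(sock, obj):
    """Pickle obj and send it preceded by its length."""
    payload = pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)
    sock.sendall(HEADER.pack(len(payload)) + payload)


class MessageReader:
    """Reassembles length-prefixed messages, keeping leftover bytes between reads."""

    def __init__(self, sock):
        self.sock = sock
        self.buffer = b""  # bytes received but not yet consumed

    def _take(self, size):
        # recv() may return less than one message or parts of several,
        # so keep reading until the buffer holds at least `size` bytes.
        while len(self.buffer) < size:
            data = self.sock.recv(4096)
            if not data:
                raise EOFError("connection closed mid-message")
            self.buffer += data
        # Hand back exactly `size` bytes; the rest stays for the next call.
        chunk, self.buffer = self.buffer[:size], self.buffer[size:]
        return chunk

    def read_message(self):
        (length,) = HEADER.unpack(self._take(HEADER.size))
        # Only unpickle data from a peer you trust.
        return pickle.loads(self._take(length))


# Small demonstration over a local socket pair.
left, right = socket.socketpair()
hashes = ["%040x" % i for i in range(100)]  # stand-ins for SHA-1 hex digests
send_message(left, hashes)
send_message(left, ["second", "message"])

reader = MessageReader(right)
first = reader.read_message()
second = reader.read_message()
left.close()
right.close()
print(first == hashes, second)
```

The point of the buffer is that the unpickler only ever sees exactly one complete message, no matter how the bytes were split up in transit.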

    I still stand by my original claim that I was experiencing buffer and/or stack overflows when feeding the cPickle module bad data. The standard pickle module just generates KeyError(s), but the cPickle module exhibited some truly bizarre behavior and appeared to overwrite in-process code. The behavior was reminiscent of my many hours tracking down similar C/C++ overflows... If I find some spare time I'll attempt to reproduce the behavior and post a code sample. If I manage to do that I'll mail the python-bugs list too.
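In the meantime, a minimal Python 3 sketch of guarding against incomplete pickle data. It does not reproduce the cPickle behavior described above. The helper name try_load is made up, and the exact exception type depends on the Python version and on how the data is damaged:

```python
import pickle

# Stand-ins for real SHA-1 hex digests (40 characters each).
hashes = ["%040x" % i for i in range(100)]
good = pickle.dumps(hashes, pickle.HIGHEST_PROTOCOL)

# Simulate a receiver that lost the tail of the message.
bad = good[: len(good) // 2]


def try_load(data):
    """Return (value, None) on success or (None, exception) on failure."""
    try:
        return pickle.loads(data), None
    except Exception as exc:  # deliberately broad: damaged data can fail in many ways
        return None, exc


value, error = try_load(good)
bad_value, bad_error = try_load(bad)

print(error is None, value == hashes)
print(type(bad_error).__name__)
```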

    Anyway, thanks for all the help :-)

    Peace,
    David Mills
