#1
I am attempting to pickle a long list of strings.
For instance: Code:
strings_list = list() #make an empty list
for i in range(100): #then fill it up
    strings_list.append("30 chars......................")
list_pickled = pickle.dumps(obj, -1) #pickle the list (-1 is for HIGHEST_PROTOCOL)
list_unpickled = pickle.loads(list_pickled) #unpickle the list
The above code works fine. The trouble starts when I lengthen the list to 1000 strings: Code:
for i in range(1000): #make the list longer this time
    strings_list.append("30 chars......................")
A 1000-item list will not pickle properly. In my particular case, the thread that pickles throws what looks like a stack overflow error, and the unpickling thread throws a KeyError (which makes sense, since the data wasn't pickled properly). It gets even uglier using cPickle, which seems to overflow and overwrite portions of the in-process code. (Can you say exploit?)
Can anyone help me out here? Is there a size limitation to the pickling process? If so, is there a workaround other than chopping my list into smaller pieces? Are there any alternative serialization libraries without this limitation?
Thanks,
Dave Mills
|
#2
The code works fine on my system, which is Python 2.3 on Windows XP. In the code you posted you pass 'obj' to pickle.dumps instead of 'strings_list', but apart from that it was fine. It still worked when I increased the list length to 1,000,000.
What version of Python are you using, and what OS?
BTW, the example you gave will not store 1000 (or 1,000,000) copies of the string: since every element is the same string object, pickle stores one copy and 1,000,000 references to it. If the code that is actually failing is storing different strings, this may be relevant.
Regards,
Dave - The Developers' Coach
Last edited by DevCoach : March 13th, 2004 at 04:03 AM.
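Editor's sketch of the point above, written for current Python 3 rather than the Python 2.3 used in this thread. It assumes only the standard pickle module. Pickle's memo works on object identity, so a list that holds the same string object many times is written as one string plus short back-references, while a list of separately built strings is written in full. The variable names are illustrative.

```python
import pickle

text = "30 chars......................"

# 1000 references to one and the same string object
shared = [text] * 1000
# 1000 separately constructed, different strings
distinct = [text + str(i) for i in range(1000)]

shared_size = len(pickle.dumps(shared, pickle.HIGHEST_PROTOCOL))
distinct_size = len(pickle.dumps(distinct, pickle.HIGHEST_PROTOCOL))

# The shared list should come out far smaller than the distinct one.
print(shared_size, distinct_size)
```

If the failing program stores unique strings (as turns out to be the case later in the thread), the pickled data is much larger than this first test suggests, although size alone should not make pickling fail.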
|
#3
I tried modifying the string each time, and it still works fine, both with pickle and cPickle.
Code:
import cPickle as pickle
strings_list = list() #make an empty list
for i in range(1000000): #then fill it up
    strings_list.append("30 chars......................" + str(i))
list_pickled = pickle.dumps(strings_list, -1) #pickle the list (-1 is for HIGHEST_PROTOCOL)
f = open('test.pickle', 'wb')
f.write(list_pickled)
f.close()
list_unpickled = pickle.loads(list_pickled) #unpickle the list
print list_unpickled[0:2]
Dave - The Developers' Coach
|
#4
Ok, a couple things...
- You're right about pickle only keeping 1 reference for all the identical strings, but in my case the strings are all SHA-1 hashes in hexadecimal format. Therefore, they are almost certainly all unique. (Also, the hex SHA-1 hashes are 40 chars long, not 30 as in my original post.)
- I actually tested out my sample code, and you're right, it does work. I just ran back to my program code to see if I was being silly, but it seems I am not.
I've been able to condense my problem down to a few lines of code with which you should be able to reproduce the behavior. Code:
hashes = list()
for i in range(1000):
    chunk = str(i) #so we have unique strings
    chunk = chunk.zfill(16384) #pad the string with zeros until <width> is reached...in my program i'm reading 16k chunks and hashing them
    shaobj = sha.new(chunk) #create a fresh sha object and assign the chunk
    hash = shaobj.hexdigest() #get the hexadecimal hash
    hashes.append(hash) #add it to the list
my_custom_print_func('before pickling. hashes=' + repr(hashes))
list_pickled = pickle.dumps(hashes, -1)
list_unpickled = pickle.loads(list_pickled) #unpickle the list
my_custom_print_func('after unpickling. hashes=' + repr(list_unpickled))
The above code executes successfully for me, but as soon as the number of strings in the list gets too high, it ceases to work properly. There is no error thrown, but my print statements print nothing but empty space. I experimented to see exactly where the cut-off point was. For me, the above code runs properly with 1488 strings, but ceases to function with 1489 strings.
Do you have any idea why this would happen? Can you reproduce the behavior? Can you suggest an alternate serialization module and/or a workaround?
Thanks,
Dave Mills
-edit-
I almost forgot: I'm running win2k and using Python 2.3
Last edited by dmills : March 13th, 2004 at 05:25 AM. Reason: forgot...
|
#5
I tried running your last example but it didn't work, and I'm a little reluctant to rewrite it since I might end up stripping away the problem... here's what I got:
C:\Documents and Settings\Mark\Desktop>python p.py
Traceback (most recent call last):
  File "p.py", line 5, in ?
    shaobj = sha.new(chunk) #create a fresh sha object and assign the chunk
NameError: name 'sha' is not defined
Anyway, as for other modules, there are one or two you should look at, although not all of these are strictly serialization.
http://www.python.org/doc/2.3.3/lib/module-shelve.html - shelve module, which is used for object persistence.
After this you should look at Python's dbm modules like Dev said, but here are a few links.
http://www.python.org/doc/2.3.3/lib/module-anydbm.html
http://www.python.org/doc/2.3.3/lib/module-dbhash.html
http://www.python.org/doc/2.3.3/lib/module-dbm.html
http://www.python.org/doc/2.3.3/lib/module-dumbdbm.html
http://www.python.org/doc/2.3.3/lib/module-gdbm.html
http://www.python.org/doc/2.3.3/lib/module-whichdb.html
http://www.python.org/doc/2.3.3/lib/module-bsddb.html
If this isn't what you're looking for, you could write your own little system, i.e. save the values to a simple text file, read the values back in, then split on newlines.
Hope this helps,
Mark.
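Editor's sketch of the "simple text file" idea above, written for current Python 3. The helper names save_lines and load_lines, the file name, and the stand-in hash values are all made up for illustration. It assumes every string is plain ASCII with no embedded newline, which holds for hex digests but not for arbitrary strings.

```python
import os
import tempfile


def save_lines(path, strings):
    """Write one string per line. Only safe if no string contains a newline."""
    with open(path, "w", encoding="ascii", newline="\n") as f:
        for s in strings:
            f.write(s + "\n")


def load_lines(path):
    """Read the file back and split it on newlines."""
    with open(path, "r", encoding="ascii") as f:
        return f.read().splitlines()


# Stand-ins for 40-character hex digests (hypothetical data).
hashes = ["%040x" % i for i in range(1000)]
path = os.path.join(tempfile.mkdtemp(), "hashes.txt")
save_lines(path, hashes)
restored = load_lines(path)
print(len(restored), restored[0] == hashes[0])
```

A format this simple is also easy to inspect by eye, which helps when the real question is whether the data or the serializer is at fault.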
|
#6
The revised code still works for me - I went up to 1,000,000 with cPickle and 100,000 with pickle with no problems.
There are a number of possibilities:
You could try saving the strings to a regular text file instead of pickle, and see what happens.
Dave - The Developers' Coach
|
#7
Another thought: there could be a problem with repr() handling long lists, or with my_custom_print_func handling long strings. Try printing just the first and last few elements of the list.
Dave - The Developers' Coach
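Editor's sketch of that debugging tip, written for current Python 3. The helper name summarize and its edge parameter are invented for illustration. The idea is to print the length plus a few elements from each end, so that a display widget or logger that chokes on very long strings cannot hide the data.

```python
def summarize(items, edge=3):
    """Return a short description of a list: its length and both ends."""
    n = len(items)
    if n <= 2 * edge:
        # Short enough to show in full.
        return "len=%d %r" % (n, items)
    return "len=%d %r ... %r" % (n, items[:edge], items[-edge:])


print(summarize(list(range(10)), edge=2))
```

Calling it before pickling and again after unpickling gives two short lines that are easy to compare.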
|
#8
DevCoach, you've been correct about everything so far :-)
My most recent example wasn't printing because the strings were too long for the wxListBox in which they were displayed. After removing the example from the rest of my code, the example works fine. The fact remains that all of my code works properly until I try to pickle up a long list. I must investigate this further. I'll post here again when I've figured it out (or I'm stumped ;-)
-edit-
In response to Netytan: you must "import sha" in order to use that module. Here's that sample converted to a stand-alone program. (And now it works fine for me, so I doubt you'll be able to reproduce my problem...but why not post code, ya know?) Code:
#!/usr/bin/env python
import sha
import pickle
modules = {}
def main():
    hashes = list()
    the_file = file("C:\\somepath\\somefile.ext", 'rb') #binary mode, so Windows doesn't translate newlines or stop at Ctrl-Z
    while 1:
        chunk = the_file.read(16*1024) #read the file in 16k chunks
        if len(chunk) == 0:
            break
        shaobj = sha.new(chunk)
        hash = shaobj.hexdigest()
        hashes.append(hash)
    the_file.close()
    print 'before pickling. len=' + str(len(hashes)) + ' hashes=' + hashes[len(hashes)-1] #print the length and the last element
    list_pickled = pickle.dumps(hashes, -1) #pickle the list (-1 is for HIGHEST_PROTOCOL)
    list_unpickled = pickle.loads(list_pickled) #unpickle the list
    print 'after unpickling. len=' + str(len(list_unpickled)) + ' hashes=' + list_unpickled[len(list_unpickled)-1] #report on the unpickled copy, not the original
if __name__ == '__main__':
    main()
Last edited by dmills : March 13th, 2004 at 04:59 PM. Reason: replying to netytan
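Editor's sketch for readers on current Python 3, where the sha module used above no longer exists and hashlib.sha1 takes its place. The function name hash_chunks is made up, and an in-memory io.BytesIO stream stands in for the real file so the example runs anywhere. This is an approximation of the program above, not the poster's code.

```python
import hashlib
import io
import pickle


def hash_chunks(stream, chunk_size=16 * 1024):
    """Return the SHA-1 hex digest of each chunk_size-byte chunk of a binary stream."""
    digests = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        digests.append(hashlib.sha1(chunk).hexdigest())
    return digests


# 40000 bytes of stand-in data: two full 16k chunks and one partial chunk.
data = io.BytesIO(b"x" * 40000)
hashes = hash_chunks(data)
restored = pickle.loads(pickle.dumps(hashes, pickle.HIGHEST_PROTOCOL))
print(len(hashes), restored == hashes)
```

With a real file, the stream would come from open(path, "rb"); binary mode matters because the chunks are hashed byte for byte.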
|
#9
It took me a little while, but I tracked down the problem with my code. It seems I wasn't properly handling the "leftovers" in my socket recv() code (damn 1-off errors...).
I still stand by my original claim that I was experiencing buffer and/or stack overflows when feeding the cPickle module bad data. The standard pickle module just generates KeyError(s), but the cPickle module exhibited some truly bizarre behavior and appeared to overwrite in-process code. The behavior was reminiscent of my many hours tracking down similar C/C++ overflows...
If I find some spare time I'll attempt to reproduce the behavior and post a code sample. If I manage to do that I'll mail the python-bugs list too.
Anyway, thanks for all the help :-)
Peace,
David Mills
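Editor's sketch of one common way to avoid recv() "leftovers", written for current Python 3. The poster's actual socket code was never shown, so this illustrates the general technique and is not a reconstruction. The 4-byte length prefix and the names recv_exact, send_pickled, recv_pickled and roundtrip are all assumptions made for the example. recv() may return fewer bytes than requested, so the reader loops until it has exactly one whole message and never asks for more than that message's remaining bytes.

```python
import pickle
import socket
import struct
import threading

HEADER = struct.Struct("!I")  # assumed framing: 4-byte big-endian payload length


def recv_exact(sock, n):
    """Read exactly n bytes, looping because recv() may return fewer."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)  # never ask for more than this message needs
        if not chunk:
            raise ConnectionError("socket closed with %d bytes still expected" % remaining)
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def send_pickled(sock, obj):
    """Send one pickled object, prefixed with its length."""
    payload = pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_pickled(sock):
    """Receive one complete length-prefixed pickle and load it."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return pickle.loads(recv_exact(sock, length))


def roundtrip(obj):
    """Demo: send obj through a local socket pair and return what arrives."""
    a, b = socket.socketpair()
    with a, b:
        sender = threading.Thread(target=send_pickled, args=(a, obj))
        sender.start()
        result = recv_pickled(b)
        sender.join()
    return result
```

Handing pickle.loads a truncated or misaligned byte string is what produces errors like the ones described in this thread. Separately, unpickling data received from an untrusted peer can execute arbitrary code, so this pattern only suits connections where both ends are trusted.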
| Viewing: Dev Shed Forums > Programming Languages > Python Programming > pickling long lists (size limit?) |