Dev Shed Forums: Python Programming
#1  March 13th, 2004, 02:14 AM
dmills
pickling long lists (size limit?)

I am attempting to pickle a long list of strings.
For instance:

Code:
strings_list = list()  #make an empty list                  
for i in range(100): #then fill it up
     strings_list.append("30 chars......................")

list_pickled = pickle.dumps(obj, -1) #pickle the list (-1 is for HIGHEST_PROTOCOL)

list_unpickled = pickle.loads(list_pickled) #unpickle the list


The above code works fine. However, when I lengthen the list to 1000 strings:

Code:
for i in range(1000): #make the list longer this time       
      strings_list.append("30 chars......................")  


A 1000-item list will not pickle properly. In my particular case, the thread that pickles throws what looks like a stack overflow error, and the unpickling thread throws a KeyError (which makes sense, since the data wasn't pickled properly). It gets even uglier with cPickle, which seems to overflow and overwrite portions of the in-process code. (Can you say exploit?)

Can anyone help me out here? Is there a size limitation to the pickling process? If so, is there a workaround other than chopping my list into smaller pieces? Are there any alternative serialization libraries without this limitation?

Thanks,
Dave Mills

#2  March 13th, 2004, 04:00 AM
DevCoach

The code works fine on my system, which is Python 2.3 on Windows XP. In the code you posted you pass 'obj' to pickle.dumps instead of 'strings_list', but apart from that it was fine. It still worked when I increased the list length to 1,000,000.

What version of Python are you using, and what OS?

BTW, the example you gave will not store 1000 (or 1,000,000) copies of the string. Since every element is the same string object, pickle stores one copy and 1,000,000 references to it. If the code that is actually failing is storing different strings, this may be relevant.
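
A rough sketch of that point, not from the original posts. It assumes a current Python 3 interpreter rather than the Python 2.3 used in this thread, and the names `shared`, `same_object`, and `distinct` are made up for illustration. It compares the pickled size of a list that holds one string object many times with a list of strings that all differ.

```python
import pickle

shared = "30 chars......................"

# 1000 references to one string object: pickle memoizes by object identity,
# so the text is written once and later elements are short back-references.
same_object = [shared] * 1000

# 1000 distinct strings: each one has to be written out in full.
distinct = [shared + str(i) for i in range(1000)]

size_same = len(pickle.dumps(same_object, pickle.HIGHEST_PROTOCOL))
size_distinct = len(pickle.dumps(distinct, pickle.HIGHEST_PROTOCOL))

print("same object:", size_same, "bytes")
print("distinct:   ", size_distinct, "bytes")
```

The first figure should come out much smaller than the second, so a test that repeats one literal is not a fair stand-in for a list of unique hashes.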

Regards,

Dave - The Developers' Coach

Last edited by DevCoach : March 13th, 2004 at 04:03 AM.

#3  March 13th, 2004, 04:26 AM
DevCoach

I tried modifying the string each time, and it still works fine, both with pickle and cPickle.

Code:
import cPickle as pickle

strings_list = list()  #make an empty list                  
for i in range(1000000): #then fill it up
    strings_list.append("30 chars......................" + str(i))

list_pickled = pickle.dumps(strings_list, -1) #pickle the list (-1 is for HIGHEST_PROTOCOL)

f = open('test.pickle', 'wb')
f.write(list_pickled)
f.close()

list_unpickled = pickle.loads(list_pickled) #unpickle the list

print list_unpickled[0:2]


Dave - The Developers' Coach

#4  March 13th, 2004, 05:23 AM
dmills

Ok, a couple things...

- You're right about pickle only keeping 1 copy for all the identical strings, but in my case the strings are all SHA-1 hashes in hexadecimal format. Therefore, they are almost certainly all unique. (Also, the hex SHA-1 hashes are 40 chars long, not 30 as in my original post.)

- I actually tested out my sample code, and you're right, it does work. I just ran back to my program code to see if I was being silly, but it seems I am not. I've been able to condense my problem down to a few lines of code with which you should be able to reproduce the behavior.

Code:
hashes = list()
for i in range(1000):
      chunk = str(i)  #so we have unique strings
      chunk = chunk.zfill(16384)  #pad the string with zeros until <width> is reached...in my program i'm reading 16k chunks and hashing them
      shaobj = sha.new(chunk)  #create a fresh sha object and assign the chunk
      hash = shaobj.hexdigest()  #get the hexadecimal hash
      hashes.append(hash)  #add it to the list
            
my_custom_print_func('before pickling. hashes=' + repr(hashes))
list_pickled = pickle.dumps(hashes, -1) 

list_unpickled = pickle.loads(list_pickled) #unpickle the list
my_custom_print_func('after unpickling. hashes=' + repr(list_unpickled))


The above code executes successfully for me, but as soon as the number of strings in the list gets too high, it ceases to work properly. There is no error thrown, but my print statements print nothing but empty space. I experimented to see exactly where the cut-off point was. For me, the above code runs properly with 1488 strings, but ceases to function with 1489 strings.

Do you have any idea why this would happen? Can you reproduce the behavior? Can you suggest an alternate serialization module and/or a work-around?
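
One way to check whether pickle itself is involved is to compare the round-tripped list directly instead of printing it. The following is only a sketch under assumptions: it targets Python 3, where `hashlib.sha1` stands in for the old `sha` module, and the helper names `make_hashes` and `round_trips` are invented here.

```python
import hashlib
import pickle

def make_hashes(count, chunk_size=16384):
    """Build `count` unique SHA-1 hex digests, mirroring the loop above."""
    hashes = []
    for i in range(count):
        chunk = str(i).zfill(chunk_size).encode("ascii")
        hashes.append(hashlib.sha1(chunk).hexdigest())
    return hashes

def round_trips(count):
    """Return True if a list of `count` digests survives dumps/loads intact."""
    hashes = make_hashes(count)
    restored = pickle.loads(pickle.dumps(hashes, pickle.HIGHEST_PROTOCOL))
    return restored == hashes

# Probe both sides of the reported cut-off, plus a much longer list.
for n in (1488, 1489, 5000):
    print(n, round_trips(n))
```

If every line reports True, the cut-off is more likely in the display or transport code than in pickle.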

Thanks,
Dave Mills

-edit-
I almost forgot: I'm running win2k and using Python 2.3

Last edited by dmills : March 13th, 2004 at 05:25 AM. Reason: forgot...

#5  March 13th, 2004, 08:39 AM
netytan

I tried running your last example but it didn't work, and I'm a little reluctant to rewrite it since I might end up stripping away the problem... here's what I got:

C:\Documents and Settings\Mark\Desktop>python p.py
Traceback (most recent call last):
  File "p.py", line 5, in ?
    shaobj = sha.new(chunk) #create a fresh sha object and assign the chunk
NameError: name 'sha' is not defined

Anyway, as for other modules, there are one or two you should look at, although not all of these are strictly serialization.

http://www.python.org/doc/2.3.3/lib/module-shelve.html - shelve module, which is used for object persistence.
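
A minimal sketch of what using shelve for this might look like, assuming Python 3 (where a shelf can be used in a `with` block). The key name "hashes", the helper names, and the temporary path are all placeholders. Note that shelve pickles its values internally, so it would not sidestep a genuine pickle limit if there were one.

```python
import os
import shelve
import tempfile

def save_hashes(path, hashes):
    """Store the list under one key; shelve pickles the value internally."""
    with shelve.open(path) as db:
        db["hashes"] = hashes

def load_hashes(path):
    """Read the list back from the shelf."""
    with shelve.open(path) as db:
        return db["hashes"]

# Hypothetical location; any writable path works.
shelf_path = os.path.join(tempfile.mkdtemp(), "hashes_shelf")
save_hashes(shelf_path, ["a" * 40, "b" * 40])
print(load_hashes(shelf_path))
```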

After this you should look at Python's dbm modules like Dev said, but here are a few links.

http://www.python.org/doc/2.3.3/lib/module-anydbm.html

http://www.python.org/doc/2.3.3/lib/module-dbhash.html
http://www.python.org/doc/2.3.3/lib/module-dbm.html
http://www.python.org/doc/2.3.3/lib/module-dumbdbm.html
http://www.python.org/doc/2.3.3/lib/module-gdbm.html
http://www.python.org/doc/2.3.3/lib/module-whichdb.html
http://www.python.org/doc/2.3.3/lib/module-bsddb.html

If this isn't what you're looking for, you could write your own little system, i.e. save the values to a simple text file, read the values back in, then split on newlines.
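
Roughly what that home-grown approach could look like, as a sketch only. It assumes Python 3 and values that never contain a newline (true for hex digests); `save_lines`, `load_lines`, and the temporary file name are hypothetical.

```python
import os
import tempfile

def save_lines(path, values):
    """Write one value per line; values must not contain newlines."""
    with open(path, "w", encoding="ascii") as f:
        for value in values:
            f.write(value + "\n")

def load_lines(path):
    """Read the file back and split it on newlines."""
    with open(path, "r", encoding="ascii") as f:
        return f.read().splitlines()

# Hypothetical file location for the demonstration.
text_path = os.path.join(tempfile.mkdtemp(), "hashes.txt")
save_lines(text_path, ["0123abcd", "89ef4567"])
print(load_lines(text_path))
```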

Hope this helps,

Mark.
__________________
programming language development: www.netytan.com Hula


#6  March 13th, 2004, 10:01 AM
DevCoach

The revised code still works for me - I went up to 1,000,000 with cPickle and 100,000 with pickle with no problems.

There are a number of possibilities:
  • pickle is innocent, and the problem is elsewhere in your program.
  • There is a bug in pickle or cPickle that only manifests on Win2k. Unlikely, since I cannot see any bug reports raised for this on SourceForge, but possible.
  • You are running out of memory. Also unlikely, given that the problem happens at fairly small list lengths.

You could try saving the strings to a regular text file instead of pickle, and see what happens.

Dave - The Developers' Coach

#7  March 13th, 2004, 10:07 AM
DevCoach

Another thought: there could be a problem with repr() handling long lists, or with my_custom_print_func handling long strings. Try printing just the first and last few elements of the list.
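
For instance, a small helper along these lines keeps the output short however long the list gets. It is a sketch for Python 3, and `summarize` with its `edge` parameter is a made-up name, not anything from the code under discussion.

```python
def summarize(items, edge=3):
    """Describe a long list by its length plus the first and last few items."""
    if len(items) <= 2 * edge:
        return "len=%d items=%r" % (len(items), items)
    return "len=%d first=%r last=%r" % (len(items), items[:edge], items[-edge:])

print(summarize(list(range(1489))))
# prints: len=1489 first=[0, 1, 2] last=[1486, 1487, 1488]
```

Calling it on the list before pickling and again on the unpickled copy gives two short strings that are easy to compare by eye.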

Dave - The Developers' Coach

#8  March 13th, 2004, 04:51 PM
dmills

DevCoach, you've been correct about everything so far :-)

My most recent example wasn't printing because the strings were too long for the wxListBox in which they were displayed. After removing the example from the rest of my code, the example works fine.

The fact remains that all of my code works properly until I try to pickle up a long list. I must investigate this further. I'll post here again when I've figured it out (or I'm stumped ;-)

-edit-
In response to Netytan: you must "import sha" in order to use that module. Here's that sample converted to a stand-alone program. (and now it works fine for me, so I doubt you'll be able to reproduce my problem...but why not post code, ya know?)

Code:
#!/usr/bin/env python
import sha
import pickle

def main():
    hashes = list()
    #open in binary mode so Windows does not translate line endings or stop at Ctrl-Z
    the_file = file("C:\\somepath\\somefile.ext", 'rb')
    while 1:
        chunk = the_file.read(16*1024) #read the file in 16k chunks
        if len(chunk) == 0:
            break
        shaobj = sha.new(chunk)
        hash = shaobj.hexdigest()
        hashes.append(hash)
    the_file.close()

    print 'before pickling. len=' + str(len(hashes)) + ' hashes=' + hashes[-1] #print the length and the last element
    list_pickled = pickle.dumps(hashes, -1) #pickle the list (-1 is for HIGHEST_PROTOCOL)

    list_unpickled = pickle.loads(list_pickled) #unpickle the list
    print 'after unpickling. len=' + str(len(list_unpickled)) + ' hashes=' + list_unpickled[-1] #report on the unpickled copy, not the original

if __name__ == '__main__':
    main()

Last edited by dmills : March 13th, 2004 at 04:59 PM. Reason: replying to netytan

#9  March 19th, 2004, 12:01 AM
dmills

It took me a little while, but I tracked down the problem with my code. It seems I wasn't properly handling the "leftovers" in my socket recv() code (damn off-by-one errors...).
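
The actual recv() code was never posted, so the following is only a guess at the general shape of the fix, written for Python 3. It assumes a 4-byte length prefix in front of each pickled message, which is an invented framing and not necessarily what the program used. `send_msg` and `MessageReader` are hypothetical names. The point is that bytes received beyond the current message stay in a buffer for the next read instead of being dropped or miscounted.

```python
import pickle
import socket
import struct

HEADER = struct.Struct("!I")  # assumed framing: 4-byte big-endian length prefix

def send_msg(sock, obj):
    """Pickle obj and send it with its length in front."""
    payload = pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)
    sock.sendall(HEADER.pack(len(payload)) + payload)

class MessageReader:
    """Reads length-prefixed messages and keeps leftover bytes between calls."""

    def __init__(self, sock):
        self.sock = sock
        self.buffer = b""  # bytes received but not yet consumed

    def _read_exact(self, n):
        # recv() may return less or more than one message, so accumulate.
        while len(self.buffer) < n:
            data = self.sock.recv(4096)
            if not data:
                raise ConnectionError("socket closed mid-message")
            self.buffer += data
        chunk, self.buffer = self.buffer[:n], self.buffer[n:]
        return chunk

    def recv_msg(self):
        (length,) = HEADER.unpack(self._read_exact(HEADER.size))
        # Only unpickle data from a peer you trust; pickle can run arbitrary code.
        return pickle.loads(self._read_exact(length))

def demo():
    """Send two small messages back to back and read them as two messages."""
    a, b = socket.socketpair()
    try:
        send_msg(a, ["first"] * 10)
        send_msg(a, ["second"] * 10)
        reader = MessageReader(b)
        return reader.recv_msg(), reader.recv_msg()
    finally:
        a.close()
        b.close()

print(demo())
```

Feeding pickle.loads exactly the bytes of one message, no more and no fewer, is what avoids the corrupted input described in this thread.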

I still stand by my original claim that I was experiencing buffer and/or stack overflows when feeding the cPickle module bad data. The standard pickle module just generates KeyError(s), but the cPickle module exhibited some truly bizarre behavior and appeared to overwrite in-process code. The behavior was reminiscent of my many hours tracking down similar C/C++ overflows... If I find some spare time I'll attempt to reproduce the behavior and post a code sample. If I manage to do that I'll mail the python-bugs list too.

Anyway, thanks for all the help :-)

Peace,
David Mills
