Python Programming
 
Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
User Name:
Password:
Remember me

The Shed is going Social! Join us on FaceBook and Twitter and chime in on the conversation.

Go Back   Dev Shed ForumsProgramming LanguagesPython Programming

Reply
Add This Thread To:
  Del.icio.us   Digg   Google   Spurl   Blink   Furl   Simpy   Y! MyWeb 
Thread Tools Search this Thread Rate Thread Display Modes
 
Unread Dev Shed Forums Sponsor:
  #1  
Old September 29th, 2003, 04:36 PM
coremah coremah is offline
Registered User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Aug 2003
Posts: 11 coremah User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 0
Sorting millions of records?

Anyone have any Idea how I might go about sorting millions of records?

I see no reason to bother with list.sort()

I do know a varity of sorting alorithms, but I have never had to sort anythin more then tens of thousands of records.

For this project I need tens of millions.

The info will be a csv file alongs the lines of

545777753,a,b,c,d,e
545777755,d,d,f,a,e

- Thanks

Reply With Quote
  #2  
Old September 29th, 2003, 06:12 PM
netytan's Avatar
netytan netytan is offline
Hello World :)
Dev Shed Frequenter (2500 - 2999 posts)
 
Join Date: Mar 2003
Location: Hull, UK
Posts: 2,537 netytan User rank is Second Lieutenant (5000 - 10000 Reputation Level)netytan User rank is Second Lieutenant (5000 - 10000 Reputation Level)netytan User rank is Second Lieutenant (5000 - 10000 Reputation Level)netytan User rank is Second Lieutenant (5000 - 10000 Reputation Level)netytan User rank is Second Lieutenant (5000 - 10000 Reputation Level)netytan User rank is Second Lieutenant (5000 - 10000 Reputation Level)netytan User rank is Second Lieutenant (5000 - 10000 Reputation Level) 
Time spent in forums: 1 Week 2 Days 18 h 17 m 47 sec
Reputation Power: 68
Send a message via ICQ to netytan Send a message via AIM to netytan Send a message via MSN to netytan Send a message via Yahoo to netytan
Wow thats allot of entries!

Well the first thing your obviously going to need to do is read in the file, then break that data up . Just out of interest, what does it (or is going to) look like? On the subject of sorting that number of entires you should take to look at the bisect module..

http://www.python.org/doc/current/l...ule-bisect.html

Mark.
__________________
programming language development: www.netytan.com Hula


Reply With Quote
  #3  
Old September 29th, 2003, 06:51 PM
coremah coremah is offline
Registered User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Aug 2003
Posts: 11 coremah User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 0
Format

The final format will be as follows

012345678701234567880123456789

And so on for fifty million records.

No line breaks, and nothing to spearate records.
Each record is a set number of digits.
There are 3 records listed above
"""0123456787
0123456788
0123456789"""



I used mmap for searching through a sample database(already sorted). The search function works very well [binary search]; I can bump 26,000 records against 10 million entries in 16 seconds.


Most likley i'll have to break it up into hundreds if not thousands of files and sort each one individually and then stick then back together.

I may be able to avoid that through something simaler to a bubble sort, using mmap. But I think it might take something like O(n^2) to finish. Which may or may not matter.
I don't care if it takes 12 hours to finish (just leave it running overnight), but if it is going to take several days, that would not work.

Reply With Quote
  #4  
Old September 29th, 2003, 09:37 PM
SolarBear's Avatar
SolarBear SolarBear is offline
onCsdfeu
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Jul 2003
Location: Canada
Posts: 100 SolarBear User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 2 h 16 m 21 sec
Reputation Power: 10
Send a message via MSN to SolarBear
First of all, I'm not sure if Python is suited to compute such a huge amount of numbers. There are litterally tons of code about sorting in C/C++/Java, but if like me you're more at ease with Python, this isn't really a choice.

Here's a method that COULD work, but never tried it. Don't blame me it fails I suppose here you're aware of all common sorting algorithms : slection sort, insertion sort, quicksort, heapsort, etc.

Chop your data in blocks of N numbers. Set N according to your tastes, available memory and moon cycles. Use a powerful sorting algorithm on it (quicksort, probably, or heapsort if your data is not random) on your first set, save it to a file, then a second set in a second file, etc. So you'll 50,000,000/N files.

Now you have files of sorted numbers. Now concatenate files 2 by 2 ; you'll have a sequence of sorted numbers followed by another sorted sequence. You can call the entire sequence "semi-sorted" which could be handled by a regular selection or insertion sort. When you're done with all your pairs, saved to half the number of files, group files again by 2 and do the same thing. You'll end up eventually with one gigantic file of semi-sorted data. Which will eventually get sorted, we hope.

I realize that it's far from being a complete solution, even less a succesful implementation ; see this as food for thought.

Good luck.

Reply With Quote
  #5  
Old September 29th, 2003, 11:56 PM
Strike Strike is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Dec 2001
Location: Houston, TX
Posts: 383 Strike User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 1 h 41 m 27 sec
Reputation Power: 12
Send a message via ICQ to Strike Send a message via AIM to Strike Send a message via Yahoo to Strike
This sounds like a job for the itertools module in 2.3. it'll save you a lot on memory (since you won't have to read the whole file in or anything silly like that), and it should be faster in general than any other iteration method. Check out the itertools module.

Reply With Quote
  #6  
Old September 30th, 2003, 10:52 AM
percivall percivall is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Jul 2003
Posts: 133 percivall User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 10
Also, for the parsing, take a look at the CSV module in the Standard Library.

Reply With Quote
  #7  
Old October 2nd, 2003, 02:50 AM
theperfectsoup theperfectsoup is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Jul 2003
Posts: 35 theperfectsoup User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 10
coremah,

If all you're doing is reading values from a file, sorting them, and them dumping them back out to a file, I'd choose a language like C++ to do the job. Every string in Python has a bit more memory overhead than simply the bytes that are allocated for it, and with millions of elements, it can add up to unreasonable amounts (trust me -- I'm rewriting some Python code in C++ right now that is to store up to millions of 10-byte strings, and I plan on cutting memory usage by at least half). In addition to using less memory, C++ will also run a lot faster, provided you pick your sort algorithm intelligently, like quicksort.

Don't get me wrong, I love Python. But right now interpretive languages have their limits, and to me, this isn't one of the things it was meant to do efficiently.

- theperfectsoup

Reply With Quote
  #8  
Old October 2nd, 2003, 08:22 AM
Strike Strike is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Dec 2001
Location: Houston, TX
Posts: 383 Strike User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 1 h 41 m 27 sec
Reputation Power: 12
Send a message via ICQ to Strike Send a message via AIM to Strike Send a message via Yahoo to Strike
... which is why you use a method that is memory efficient, such as the stuff in itertools. It'll still be slower, yeah, but if you aren't overly concerned about that, then Python is fine.

Reply With Quote
  #9  
Old October 2nd, 2003, 08:24 AM
netytan's Avatar
netytan netytan is offline
Hello World :)
Dev Shed Frequenter (2500 - 2999 posts)
 
Join Date: Mar 2003
Location: Hull, UK
Posts: 2,537 netytan User rank is Second Lieutenant (5000 - 10000 Reputation Level)netytan User rank is Second Lieutenant (5000 - 10000 Reputation Level)netytan User rank is Second Lieutenant (5000 - 10000 Reputation Level)netytan User rank is Second Lieutenant (5000 - 10000 Reputation Level)netytan User rank is Second Lieutenant (5000 - 10000 Reputation Level)netytan User rank is Second Lieutenant (5000 - 10000 Reputation Level)netytan User rank is Second Lieutenant (5000 - 10000 Reputation Level) 
Time spent in forums: 1 Week 2 Days 18 h 17 m 47 sec
Reputation Power: 68
Send a message via ICQ to netytan Send a message via AIM to netytan Send a message via MSN to netytan Send a message via Yahoo to netytan
Sadly i have to agree with perfect , where i see a point where interprited lang's will be able to handle these situations i dout it wil be for a while yet..

Python's getting faster and more efficent with every release, i've heard that Python 2.4 is estimated to be about 60% faster than Python 2.2.3, and since you could really see the 30% speed increase in Python 2.3, the 60% should blow you away!

There are a number or systems designed to speed up Python, psyco being my favourate simply because it's soooo easy to use. I've yet to embded Python and C but it doesn't look too fun/easy i have to say..

Mark.

Reply With Quote
  #10  
Old October 3rd, 2003, 12:40 AM
RicInACloak RicInACloak is offline
Junior Member
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Sep 2003
Posts: 2 RicInACloak User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 0
Sorting

For so much data I would probably dump it into a database table (without indexes) and then add a suitable index, it will take a while to re-index, but then you did not expect 50m records to be really fast did you.

Reply With Quote
  #11  
Old October 3rd, 2003, 11:14 AM
coremah coremah is offline
Registered User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Aug 2003
Posts: 11 coremah User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 0
Problem seems to be solved.

I am still waiting for the actual dataset to arrive for the sorting.

However with a sample 5,000,000 record set (built using the random modue), I now have a program built for sorting it.

Here is the method I used.

2 seperate programs.

1st Program
Reads 90,000 records from the file, sorts it using list.sort()
Writes it to a file by the name of #.csv (first file is 0.csv, then 1.csv, until all records are read)
5,000,000 recs yield 56 files.

2nd Program.
I used a variation of the mergesort to recombine all 56 files at once.
1st, load into a list the first record from each file.
Then check to see what record was smallest.
Write that record to the final file.
replace the record in the list with the next one from it's corresponding file.
Repeat process for however many records there are.

for whole process, takes about 20 minutes.

module's used. MMAP and OS.

I still need to double check the file, but testing on smaller sets of data proved perfectly sorted.

Thanks for all the help everyone.

If anyone is interested in actual code I can post it.

Reply With Quote
Reply

Viewing: Dev Shed ForumsProgramming LanguagesPython Programming > Sorting millions of records?

Developer Shed Advertisers and Affiliates



Thread Tools  Search this Thread 
Search this Thread:

Advanced Search
Display Modes  Rate This Thread 
Rate This Thread:


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
View Your Warnings | New Posts | Latest News | Latest Threads | Shoutbox
Forum Jump

Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
  
 


Powered by: vBulletin Version 3.0.5
Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.

© 2003-2013 by Developer Shed. All rights reserved. DS Cluster - Follow our Sitemap