September 29th, 2003, 04:36 PM
Sorting millions of records?
Anyone have any Idea how I might go about sorting millions of records?
I see no reason to bother with list.sort()
I do know a varity of sorting alorithms, but I have never had to sort anythin more then tens of thousands of records.
For this project I need tens of millions.
The info will be a csv file alongs the lines of
September 29th, 2003, 06:12 PM
Wow thats allot of entries!
Well the first thing your obviously going to need to do is read in the file, then break that data up . Just out of interest, what does it (or is going to) look like? On the subject of sorting that number of entires you should take to look at the bisect module..
September 29th, 2003, 06:51 PM
The final format will be as follows
And so on for fifty million records.
No line breaks, and nothing to spearate records.
Each record is a set number of digits.
There are 3 records listed above
I used mmap for searching through a sample database(already sorted). The search function works very well [binary search]; I can bump 26,000 records against 10 million entries in 16 seconds.
Most likley i'll have to break it up into hundreds if not thousands of files and sort each one individually and then stick then back together.
I may be able to avoid that through something simaler to a bubble sort, using mmap. But I think it might take something like O(n^2) to finish. Which may or may not matter.
I don't care if it takes 12 hours to finish (just leave it running overnight), but if it is going to take several days, that would not work.
September 29th, 2003, 09:37 PM
First of all, I'm not sure if Python is suited to compute such a huge amount of numbers. There are litterally tons of code about sorting in C/C++/Java, but if like me you're more at ease with Python, this isn't really a choice.
Here's a method that COULD work, but never tried it. Don't blame me it fails I suppose here you're aware of all common sorting algorithms : slection sort, insertion sort, quicksort, heapsort, etc.
Chop your data in blocks of N numbers. Set N according to your tastes, available memory and moon cycles. Use a powerful sorting algorithm on it (quicksort, probably, or heapsort if your data is not random) on your first set, save it to a file, then a second set in a second file, etc. So you'll 50,000,000/N files.
Now you have files of sorted numbers. Now concatenate files 2 by 2 ; you'll have a sequence of sorted numbers followed by another sorted sequence. You can call the entire sequence "semi-sorted" which could be handled by a regular selection or insertion sort. When you're done with all your pairs, saved to half the number of files, group files again by 2 and do the same thing. You'll end up eventually with one gigantic file of semi-sorted data. Which will eventually get sorted, we hope.
I realize that it's far from being a complete solution, even less a succesful implementation ; see this as food for thought.
September 29th, 2003, 11:56 PM
This sounds like a job for the itertools module in 2.3. it'll save you a lot on memory (since you won't have to read the whole file in or anything silly like that), and it should be faster in general than any other iteration method. Check out the itertools module.
September 30th, 2003, 10:52 AM
Also, for the parsing, take a look at the CSV module in the Standard Library.
October 2nd, 2003, 02:50 AM
If all you're doing is reading values from a file, sorting them, and them dumping them back out to a file, I'd choose a language like C++ to do the job. Every string in Python has a bit more memory overhead than simply the bytes that are allocated for it, and with millions of elements, it can add up to unreasonable amounts (trust me -- I'm rewriting some Python code in C++ right now that is to store up to millions of 10-byte strings, and I plan on cutting memory usage by at least half). In addition to using less memory, C++ will also run a lot faster, provided you pick your sort algorithm intelligently, like quicksort.
Don't get me wrong, I love Python. But right now interpretive languages have their limits, and to me, this isn't one of the things it was meant to do efficiently.
October 2nd, 2003, 08:22 AM
... which is why you use a method that is memory efficient, such as the stuff in itertools. It'll still be slower, yeah, but if you aren't overly concerned about that, then Python is fine.
October 2nd, 2003, 08:24 AM
Sadly i have to agree with perfect , where i see a point where interprited lang's will be able to handle these situations i dout it wil be for a while yet..
Python's getting faster and more efficent with every release, i've heard that Python 2.4 is estimated to be about 60% faster than Python 2.2.3, and since you could really see the 30% speed increase in Python 2.3, the 60% should blow you away!
There are a number or systems designed to speed up Python, psyco being my favourate simply because it's soooo easy to use. I've yet to embded Python and C but it doesn't look too fun/easy i have to say..
October 3rd, 2003, 12:40 AM
For so much data I would probably dump it into a database table (without indexes) and then add a suitable index, it will take a while to re-index, but then you did not expect 50m records to be really fast did you.
October 3rd, 2003, 11:14 AM
Problem seems to be solved.
I am still waiting for the actual dataset to arrive for the sorting.
However with a sample 5,000,000 record set (built using the random modue), I now have a program built for sorting it.
Here is the method I used.
2 seperate programs.
Reads 90,000 records from the file, sorts it using list.sort()
Writes it to a file by the name of #.csv (first file is 0.csv, then 1.csv, until all records are read)
5,000,000 recs yield 56 files.
I used a variation of the mergesort to recombine all 56 files at once.
1st, load into a list the first record from each file.
Then check to see what record was smallest.
Write that record to the final file.
replace the record in the list with the next one from it's corresponding file.
Repeat process for however many records there are.
for whole process, takes about 20 minutes.
module's used. MMAP and OS.
I still need to double check the file, but testing on smaller sets of data proved perfectly sorted.
Thanks for all the help everyone.
If anyone is interested in actual code I can post it.