  #1 - Registered User

    Sorting millions of records?


    Anyone have any idea how I might go about sorting millions of records?

    I see no reason to bother with list.sort()

    I do know a variety of sorting algorithms, but I have never had to sort anything more than tens of thousands of records.

    For this project I need tens of millions.

    The info will be a CSV file along the lines of:

    545777753,a,b,c,d,e
    545777755,d,d,f,a,e

    - Thanks
  #2 - Hello World :)
    Wow, that's a lot of entries!

    Well, the first thing you're obviously going to need to do is read in the file, then break that data up. Just out of interest, what does it look like (or what is it going to look like)? On the subject of sorting that number of entries, you should take a look at the bisect module:

    http://www.python.org/doc/current/li...le-bisect.html
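
    A minimal sketch of what bisect buys you (the sample rows just mimic the format in the first post): it keeps a list sorted as items are added, though each insert still costs O(n), so on its own it won't carry 50 million records.

    Code:
    import bisect

    rows = []    # list of (key, rest-of-line) tuples, kept in sorted order
    for line in ["545777755,d,d,f,a,e", "545777753,a,b,c,d,e"]:
        key, rest = line.split(",", 1)
        bisect.insort(rows, (key, rest))    # binary search for the slot, then insert

    print(rows[0])    # ('545777753', 'a,b,c,d,e') -- smallest key first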

    Mark.
    programming language development: www.netytan.com Hula

  #3 - Registered User

    Format


    The final format will be as follows

    012345678701234567880123456789

    And so on for fifty million records.

    No line breaks, and nothing to separate records.
    Each record is a set number of digits.
    There are 3 records in the example above:
    """0123456787
    0123456788
    0123456789"""



    I used mmap for searching through a sample database (already sorted). The search function works very well [binary search]; I can bump 26,000 records against 10 million entries in 16 seconds.
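
    Not the poster's actual code, but a minimal sketch (modern Python) of that kind of search: fixed-width records with no separators, and the file memory-mapped so nothing is read into RAM up front. The record width follows the sample format above; the function and file handling are made up for illustration.

    Code:
    import mmap, os

    RECORD_WIDTH = 10    # each record is exactly 10 ASCII digits, no separators

    def find_record(path, key):
        # Return True if the 10-digit string `key` is present in the sorted file at `path`.
        nrecords = os.path.getsize(path) // RECORD_WIDTH
        f = open(path, "rb")
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            lo, hi = 0, nrecords - 1
            while lo <= hi:
                mid = (lo + hi) // 2
                rec = mm[mid * RECORD_WIDTH:(mid + 1) * RECORD_WIDTH].decode("ascii")
                if rec == key:
                    return True
                elif rec < key:
                    lo = mid + 1
                else:
                    hi = mid - 1
            return False
        finally:
            mm.close()
            f.close()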


    Most likely I'll have to break it up into hundreds if not thousands of files, sort each one individually, and then stick them back together.

    I may be able to avoid that through something similar to a bubble sort, using mmap. But I think it might take something like O(n^2) to finish. Which may or may not matter.
    I don't care if it takes 12 hours to finish (just leave it running overnight), but if it is going to take several days, that would not work.
  #4 - onCsdfeu
    First of all, I'm not sure if Python is suited to crunching such a huge amount of numbers. There are literally tons of code about sorting in C/C++/Java, but if, like me, you're more at ease with Python, that isn't really a choice.

    Here's a method that COULD work, though I've never tried it. Don't blame me if it fails. I suppose you're aware of all the common sorting algorithms here: selection sort, insertion sort, quicksort, heapsort, etc.

    Chop your data into blocks of N numbers. Set N according to your tastes, available memory and moon cycles. Use a powerful sorting algorithm (quicksort, probably, or heapsort if your data is not random) on your first block, save it to a file, then a second block to a second file, etc. So you'll have 50,000,000/N files.
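
    I haven't tried this at scale either, but a rough sketch of that chunking step (block size and file names are arbitrary choices):

    Code:
    N = 100000    # records per block; pick to suit your memory

    def write_sorted_chunks(path):
        # Split the input into sorted chunk files named chunk0.txt, chunk1.txt, ...
        names = []
        src = open(path)
        i = 0
        while True:
            block = []
            for line in src:
                block.append(line)
                if len(block) == N:
                    break
            if not block:
                break
            block.sort()                     # in-memory sort of one block
            name = "chunk%d.txt" % i
            out = open(name, "w")
            out.writelines(block)
            out.close()
            names.append(name)
            i += 1
        src.close()
        return names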

    Now you have files of sorted numbers. Concatenate the files 2 by 2; each result will be a sorted sequence of numbers followed by another sorted sequence. You can call the whole sequence "semi-sorted", which can be handled by a regular selection or insertion sort. When you're done with all your pairs, you'll have half the number of files; group the files by 2 again and do the same thing. You'll end up eventually with one gigantic file of semi-sorted data. Which will eventually get sorted, we hope.

    I realize that it's far from being a complete solution, even less a successful implementation; see this as food for thought.

    Good luck.
  #5 - Contributing User
    This sounds like a job for the itertools module in 2.3. It'll save you a lot of memory (since you won't have to read the whole file in or anything silly like that), and it should be faster in general than any other iteration method. Check out the itertools module.
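
    For what it's worth, a tiny illustration of that idea: a file object is already a lazy iterator over lines, and itertools.islice pulls a bounded batch from it without reading the rest. The file name and block size here are made up.

    Code:
    from itertools import islice

    f = open("records.csv")              # file name made up for illustration
    while True:
        block = list(islice(f, 90000))   # at most 90,000 lines in memory at once
        if not block:
            break
        block.sort()
        # ... write the sorted block out to its own chunk file here
    f.close()
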
  #6 - Contributing User
    Also, for the parsing, take a look at the CSV module in the Standard Library.
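
    A small sketch of what that looks like on the format from the first post (the file name is made up, and for 50 million records you'd do this block by block rather than all at once):

    Code:
    import csv

    f = open("records.csv")              # format as shown in the first post
    for row in csv.reader(f):
        key, fields = row[0], row[1:]    # e.g. '545777753', ['a', 'b', 'c', 'd', 'e']
        # ... collect (key, fields) pairs here and sort on key
    f.close()
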
  #7 - Contributing User
    coremah,

    If all you're doing is reading values from a file, sorting them, and then dumping them back out to a file, I'd choose a language like C++ to do the job. Every string in Python has a bit more memory overhead than simply the bytes that are allocated for it, and with millions of elements, it can add up to unreasonable amounts (trust me -- I'm rewriting some Python code in C++ right now that stores up to millions of 10-byte strings, and I plan on cutting memory usage by at least half). In addition to using less memory, C++ will also run a lot faster, provided you pick your sort algorithm intelligently, like quicksort.
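
    That per-string overhead is real and easy to see in Pythons that provide sys.getsizeof (newer than the versions discussed here); the exact numbers vary by version and platform.

    Code:
    import sys

    s = "0123456789"              # one 10-character record
    print(len(s))                 # 10 bytes of actual payload
    print(sys.getsizeof(s))       # object overhead on top; typically ~50-60 bytes total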

    Don't get me wrong, I love Python. But right now interpreted languages have their limits, and to me, this isn't one of the things they were meant to do efficiently.

    - theperfectsoup
  #8 - Contributing User
    ... which is why you use a method that is memory efficient, such as the stuff in itertools. It'll still be slower, yeah, but if you aren't overly concerned about that, then Python is fine.
  #9 - Hello World :)
    Sadly I have to agree with perfect. While I can see a point where interpreted languages will be able to handle these situations, I doubt it will be for a while yet.

    Python's getting faster and more efficient with every release. I've heard that Python 2.4 is estimated to be about 60% faster than Python 2.2.3, and since you could really see the 30% speed increase in Python 2.3, the 60% should blow you away!

    There are a number of systems designed to speed up Python, psyco being my favourite simply because it's so easy to use. I've yet to embed Python in C, but it doesn't look too fun/easy, I have to say.

    Mark.
    programming language development: www.netytan.com Hula

  #10 - Junior Member

    Sorting


    For so much data I would probably dump it into a database table (without indexes) and then add a suitable index. It will take a while to index, but then you didn't expect 50m records to be really fast, did you?
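
    Along those lines, a rough sketch using the sqlite3 module that ships with newer Pythons; the file, table, and column names are made up for illustration.

    Code:
    import sqlite3

    con = sqlite3.connect("records.db")
    con.execute("CREATE TABLE records (key TEXT, payload TEXT)")
    # bulk-load with no index, then build the index once at the end
    con.executemany("INSERT INTO records VALUES (?, ?)",
                    (line.rstrip("\n").split(",", 1) for line in open("records.csv")))
    con.execute("CREATE INDEX idx_key ON records (key)")
    con.commit()
    # reading back ORDER BY key can now walk the index
    for row in con.execute("SELECT key, payload FROM records ORDER BY key LIMIT 3"):
        print(row)
    con.close()
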
  #11 - Registered User

    Problem seems to be solved.


    I am still waiting for the actual dataset to arrive for the sorting.

    However, with a sample 5,000,000-record set (built using the random module), I now have a program built for sorting it.

    Here is the method I used.

    2 separate programs.

    1st Program:
    Reads 90,000 records from the file and sorts them using list.sort().
    Writes them to a file named #.csv (the first file is 0.csv, then 1.csv, and so on until all records are read).
    5,000,000 records yield 56 files.

    2nd Program:
    I used a variation of mergesort to recombine all 56 files at once.
    First, load the first record from each file into a list.
    Then check which record is smallest.
    Write that record to the final file.
    Replace it in the list with the next record from its corresponding file.
    Repeat the process for however many records there are.

    The whole process takes about 20 minutes.
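
    Not the actual code from this post, but a minimal sketch of that merge step; the one change is using heapq to find the smallest current record instead of scanning a plain list, and the chunk file names are whatever the first program produced.

    Code:
    import heapq

    def merge_chunks(chunk_names, out_path):
        files = [open(name) for name in chunk_names]
        heap = []
        for i, f in enumerate(files):
            line = f.readline()
            if line:
                heap.append((line, i))      # seed with the first record of each file
        heapq.heapify(heap)
        out = open(out_path, "w")
        while heap:
            line, i = heapq.heappop(heap)   # smallest current record across all files
            out.write(line)
            nxt = files[i].readline()       # pull the next record from the same file
            if nxt:
                heapq.heappush(heap, (nxt, i))
        out.close()
        for f in files:
            f.close()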

    Modules used: mmap and os.

    I still need to double-check the file, but testing on smaller sets of data came out perfectly sorted.

    Thanks for all the help everyone.

    If anyone is interested in actual code I can post it.
