
April 27th, 2008, 06:41 AM
Large File estimation and list/memory optimization
Hi all -
I'm trying to process very large CSV files (300+ MB) and have two questions:
1. Is there a way to accurately estimate (within 5% or so) the number of rows in the file without reading the whole file first? My primitive approach is to estimate the average size of a row in bytes and then divide the file size by that average. This gives me about 85-90% accuracy, but I'd like to do better. Any thoughts would be great.
2. The file contains repeating items that change over time, and since I only want the first and last entry of each item, I have two dictionaries, ITEM_FIRST and ITEM_LAST, that hold that respective data. The keys in those dictionaries are strings built as "<item> - <time>" (with values like "banana - 10:30"), and my method is to check whether an item/time combo is in ITEM_FIRST.keys() or ITEM_LAST.keys() and add it if it's not. The problem is that the list of keys grows very large, and processing really slows down. Is there a way to speed things up or make it more efficient?
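For question 1, the size-divided-by-average-row idea can be sketched roughly as below. This is only an illustration under assumptions: the function name `estimate_row_count` and its parameters `sample_bytes` and `n_samples` are made up for the example, rows are assumed to end in "\n", and no quoted field is assumed to contain an embedded newline. Instead of averaging the first few rows, it counts newlines in several evenly spaced chunks, which should help when row length drifts through the file.

```python
import os


def estimate_row_count(path, sample_bytes=1 << 20, n_samples=8):
    """Estimate the number of lines from newline density in sampled chunks.

    Hypothetical helper: counts b"\n" in n_samples evenly spaced chunks and
    scales that density up to the full file size. The header line, if any,
    is included in the count.
    """
    size = os.path.getsize(path)
    if size == 0:
        return 0

    # Small file: sampling would cover most of it anyway, so count exactly.
    if size <= sample_bytes * n_samples:
        with open(path, "rb") as f:
            data = f.read()
        # A final line without a trailing newline still counts as a row.
        return data.count(b"\n") + (0 if data.endswith(b"\n") else 1)

    newlines = 0
    sampled = 0
    steps = max(n_samples - 1, 1)
    with open(path, "rb") as f:
        for i in range(n_samples):
            # Spread the chunk offsets from the start to the end of the file.
            f.seek(i * (size - sample_bytes) // steps)
            chunk = f.read(sample_bytes)
            newlines += chunk.count(b"\n")
            sampled += len(chunk)

    if newlines == 0:
        return 1  # no newline seen in any sample; treat as one huge row
    return round(size * newlines / sampled)
```

With the defaults this reads about 8 MB of a 300 MB file. Whether it gets within 5% depends on how evenly row lengths are distributed; raising `n_samples` trades a little more I/O for a steadier estimate.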
Thanks!
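For question 2, one likely cause of the slowdown is the `.keys()` check: in the Python 2 releases current at the time, `key in d.keys()` builds a list of every key and scans it, while `key in d` is a single hash lookup. A minimal sketch follows, assuming rows arrive in time order and that keying on the item alone (rather than on "<item> - <time>") is acceptable. The function name `first_and_last` and the `item_col` parameter are hypothetical.

```python
import csv


def first_and_last(rows, item_col=0):
    """Return (item_first, item_last) dicts keyed by item.

    Assumes `rows` is an iterable of sequences already in time order,
    e.g. what csv.reader yields. `item_col` is the index of the item field.
    """
    item_first = {}
    item_last = {}
    for row in rows:
        item = row[item_col]
        # Membership test on the dict itself: a hash lookup, not a list scan.
        if item not in item_first:
            item_first[item] = row
        # Overwrite on every row; the final value is the last entry seen.
        item_last[item] = row
    return item_first, item_last


if __name__ == "__main__":
    # Tiny in-memory example; with a real file, pass csv.reader(open_file).
    sample = [
        "banana,10:30,1.20",
        "apple,10:31,0.80",
        "banana,10:45,1.25",
    ]
    first, last = first_and_last(csv.reader(sample))
    print(first["banana"], last["banana"])
```

Keyed this way, each dictionary holds one entry per distinct item rather than one per item/time combination, so memory use stays bounded by the number of items.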