|
|
|||||||||
|
|||||||||
| |||||||||
|
|
|
| |||||||||
![]() |
|
|
«
Previous Thread
|
Next Thread
»
|
Thread Tools | Search this Thread | Rate Thread | Display Modes |
|
#1
|
||||
|
||||
|
Finding top 10 IDs with the highest occurence
I have a list of random data. these datas can be identified by ID1 and ID2. ID1 is unique but ID2 is not. i need to find the top 10 most frequently occuring ID2. But the max value of ID2 is unknown. number of data is in the order of 10s of millions. any idea how i can filter out the ID2s?
|
|
#2
|
||
|
You want to illustrate the format in some way?
|
|
#3
|
|||
|
|||
|
The basic technique should be independent of the specific format, though.
I assume that the data in ID1 corresponds to the data in ID2, except that ID1 doesn't have repetitions (and the order can be different); otherwise, I don't see the point of ID1. One way to do it would be to have a hash table keyed by ID1 and with values corresponding to how many times each datapoint occurs in ID2. That takes one pass through both datasets to create. (Or you could even ignore ID1, actually.) Then there are various methods to find the top 10 most frequent. Sorting the keys comes to mind, but realize that since you only need the 10 most frequent, sorting could be overkill (but might not make a difference performance-wise). Sorry, I have to run... |
|
#4
|
|||
|
|||
|
The problem of finding the largest N items (in this case, 10 items) in a list is actually a generalization of sorting, but it isn't obvious to me how to efficiently modify any given sorting procedure to do it. One excellent candidate comes to mind, though, and that is heapsort. If you simply limit the size of the heap to N items, then I think you will get exactly what you want. More explicitly:
|
|
#5
|
|||
|
|||
|
Well, I don't understand the relation btn ID1 and ID2, but here's a one liner to identify
the top ten ID2 frequencies (using unix): sort datafile.txt | uniq -c | sort -n -r | head |
|
#6
|
||||
|
||||
|
ID1 n ID2 are not really related in anyway. let's jus say that ID1 is like the ID card number and ID2 is like the name of the person. i mentioned ID1 as i tot it might come in handy in anyway. i am actually trying to sort network traces according to their objectID (the ID of the location the requests are coming from). not sure if this helps. but think u can understand the size of the data i am looking at better.
as a result, i am actually quite concerned about the processing time of running a sort algorithm. cos i may not hv sufficient resources. but likewise i understand that the best and easiest method would b to sort all the data first then list out the top 10. |
|
#7
|
|||
|
|||
|
sketches, see my post #4. What I described isn't a full-blown sort; it's an adaptation of a sorting algorithm to your specific problem, and I think it would be pretty fast. If you're not familiar with heapsort, though, then it probably meant nothing to you. There are other ways: now I'm seeing ways that other good sorting algorithms could be modified similarly. Like in quicksort, for example, you can just "prune" the recursion tree on sections of the array that clearly don't contain any of the top 10. I think that would also lead to a very fast algorithm. I'm not sure whether the heapsort variation or the quicksort variation would be faster on your data (or just in general), but I think you would have a winner with either one.
By the way, just ignore what I said about ID1 in post #3; it doesn't affect anything. |
|
#8
|
||||
|
||||
|
I've done a similar problem to this. To make my final list after I had the list of values, I just made a small list of 10 items, (all set to zero) then for each item in the long list, if it was greater then the smallest item in the short list, I sorted it into the list. That way, it only takes one pass through you large list, and no more then 10 comparisons to items in your short list.
__________________
"Science is constructed of facts as a house is of stones. But a collection of facts is no more a science than a heap of stones is a house." - Henri Poincare |
|
#9
|
||||
|
||||
|
Lux, you might have over looked one point. my IDs are not arranged in any order. for an ID which was not added into the heap of 10, if the same ID was found after it is being thrown out, then the frequency would not be captured. hope you get wat i am trying to say.
|
|
#10
|
|||
|
|||
|
I think that's taken care of by the hash table I mentioned in my first reply: you try to insert each ID into the hash table as a key; either the ID isn't in the hash table, in which case it gets inserted with a value of 1 (since you've just seen it once), or the ID is there already, in which case you increment the value (since you just saw it one more time). The heap thing is done on the keys of the hash table (comparing values), which means the same ID will never come up twice.
There are other different ways to do this, too. For example, you could iterate through ID2 and maintain a sorted list of the distinct ID's, using a hash table to keep track of the positions in the sorted list for each distinct ID. Since the operation of updating the list for a new entry in ID2 can be done in constant time (assuming good hash table performance), this is probably more efficient than my previous suggestions. Anyway, I won't go into more detail, since you're probably finished with this by now. |
![]() |
| Viewing: Dev Shed Forums > Programming Languages - More > Software Design > Finding top 10 IDs with the highest occurence |
| Thread Tools | Search this Thread |
| Display Modes | Rate This Thread |
|
|
|
|