Software Design
 
Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
User Name:
Password:
Remember me
Go Back   Dev Shed ForumsProgramming Languages - MoreSoftware Design

Reply
Add This Thread To:
  Del.icio.us   Digg   Google   Spurl   Blink   Furl   Simpy   Y! MyWeb 
Thread Tools Search this Thread Rate Thread Display Modes
 
Unread Dev Shed Forums Sponsor:
  #1  
Old March 10th, 2004, 09:29 AM
sketches's Avatar
sketches sketches is offline
Registered User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Mar 2004
Posts: 14 sketches User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 0
Finding top 10 IDs with the highest occurence

I have a list of random data. these datas can be identified by ID1 and ID2. ID1 is unique but ID2 is not. i need to find the top 10 most frequently occuring ID2. But the max value of ID2 is unknown. number of data is in the order of 10s of millions. any idea how i can filter out the ID2s?

Reply With Quote
  #2  
Old March 10th, 2004, 09:55 AM
DaWei_M's Avatar
DaWei_M DaWei_M is offline
Permanently Banned
Dev Shed God 5th Plane (7000 - 7499 posts)
 
Join Date: Jan 2004
Location: Central New York. Texan via Arizona, out of his element!
Posts: 7,351 DaWei_M User rank is Sergeant (500 - 2000 Reputation Level)DaWei_M User rank is Sergeant (500 - 2000 Reputation Level)DaWei_M User rank is Sergeant (500 - 2000 Reputation Level)DaWei_M User rank is Sergeant (500 - 2000 Reputation Level)DaWei_M User rank is Sergeant (500 - 2000 Reputation Level) 
Time spent in forums: 2 Weeks 1 Day 19 h 39 m 7 sec
Warnings Level: 10
Number of bans: 1
Reputation Power: 0
You want to illustrate the format in some way?

Reply With Quote
  #3  
Old March 10th, 2004, 10:49 AM
Lux Perpetua Lux Perpetua is offline
Contributing User
Dev Shed Beginner (1000 - 1499 posts)
 
Join Date: Feb 2004
Location: San Francisco Bay
Posts: 1,475 Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level) 
Time spent in forums: 3 Weeks 6 Days 21 m 14 sec
Reputation Power: 377
The basic technique should be independent of the specific format, though.

I assume that the data in ID1 corresponds to the data in ID2, except that ID1 doesn't have repetitions (and the order can be different); otherwise, I don't see the point of ID1.

One way to do it would be to have a hash table keyed by ID1 and with values corresponding to how many times each datapoint occurs in ID2. That takes one pass through both datasets to create. (Or you could even ignore ID1, actually.) Then there are various methods to find the top 10 most frequent. Sorting the keys comes to mind, but realize that since you only need the 10 most frequent, sorting could be overkill (but might not make a difference performance-wise). Sorry, I have to run...

Reply With Quote
  #4  
Old March 10th, 2004, 01:10 PM
Lux Perpetua Lux Perpetua is offline
Contributing User
Dev Shed Beginner (1000 - 1499 posts)
 
Join Date: Feb 2004
Location: San Francisco Bay
Posts: 1,475 Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level) 
Time spent in forums: 3 Weeks 6 Days 21 m 14 sec
Reputation Power: 377
The problem of finding the largest N items (in this case, 10 items) in a list is actually a generalization of sorting, but it isn't obvious to me how to efficiently modify any given sorting procedure to do it. One excellent candidate comes to mind, though, and that is heapsort. If you simply limit the size of the heap to N items, then I think you will get exactly what you want. More explicitly:
  • initialize the heap with the first 10 keys, and assign higher priority to keys with smaller values.
  • Now iterate over the entire set of keys:
    • If a certain key has higher priority than the item at the top of the heap, then don't bother adding it.
    • Otherwise, replace the top element of the heap with that key and heapify the heap.
  • Now the heap contains the ten most frequent items, and you can do what you want with them. If you want to order the top 10, just repeatedly ask the heap for its top element (this is exactly the last stage of heapsort). If you don't care, then you don't need to do anything extra, since the top 10 items are all there in your heap.

Reply With Quote
  #5  
Old March 10th, 2004, 01:12 PM
vanekl vanekl is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Dec 2003
Posts: 229 vanekl User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 5
Well, I don't understand the relation btn ID1 and ID2, but here's a one liner to identify
the top ten ID2 frequencies (using unix):

sort datafile.txt | uniq -c | sort -n -r | head

Reply With Quote
  #6  
Old March 11th, 2004, 01:35 AM
sketches's Avatar
sketches sketches is offline
Registered User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Mar 2004
Posts: 14 sketches User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 0
ID1 n ID2 are not really related in anyway. let's jus say that ID1 is like the ID card number and ID2 is like the name of the person. i mentioned ID1 as i tot it might come in handy in anyway. i am actually trying to sort network traces according to their objectID (the ID of the location the requests are coming from). not sure if this helps. but think u can understand the size of the data i am looking at better.

as a result, i am actually quite concerned about the processing time of running a sort algorithm. cos i may not hv sufficient resources. but likewise i understand that the best and easiest method would b to sort all the data first then list out the top 10.

Reply With Quote
  #7  
Old March 11th, 2004, 02:00 AM
Lux Perpetua Lux Perpetua is offline
Contributing User
Dev Shed Beginner (1000 - 1499 posts)
 
Join Date: Feb 2004
Location: San Francisco Bay
Posts: 1,475 Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level) 
Time spent in forums: 3 Weeks 6 Days 21 m 14 sec
Reputation Power: 377
sketches, see my post #4. What I described isn't a full-blown sort; it's an adaptation of a sorting algorithm to your specific problem, and I think it would be pretty fast. If you're not familiar with heapsort, though, then it probably meant nothing to you. There are other ways: now I'm seeing ways that other good sorting algorithms could be modified similarly. Like in quicksort, for example, you can just "prune" the recursion tree on sections of the array that clearly don't contain any of the top 10. I think that would also lead to a very fast algorithm. I'm not sure whether the heapsort variation or the quicksort variation would be faster on your data (or just in general), but I think you would have a winner with either one.

By the way, just ignore what I said about ID1 in post #3; it doesn't affect anything.

Reply With Quote
  #8  
Old March 11th, 2004, 12:43 PM
dog135's Avatar
dog135 dog135 is offline
Doggie
Dev Shed Novice (500 - 999 posts)
 
Join Date: Jul 2003
Location: Seattle, WA
Posts: 751 dog135 User rank is Corporal (100 - 500 Reputation Level)dog135 User rank is Corporal (100 - 500 Reputation Level)dog135 User rank is Corporal (100 - 500 Reputation Level)dog135 User rank is Corporal (100 - 500 Reputation Level) 
Time spent in forums: 10 h 38 m 25 sec
Reputation Power: 7
I've done a similar problem to this. To make my final list after I had the list of values, I just made a small list of 10 items, (all set to zero) then for each item in the long list, if it was greater then the smallest item in the short list, I sorted it into the list. That way, it only takes one pass through you large list, and no more then 10 comparisons to items in your short list.
__________________
"Science is constructed of facts as a house is of stones. But a collection of facts is no more a science than a heap of stones is a house." - Henri Poincare

Reply With Quote
  #9  
Old March 14th, 2004, 04:06 AM
sketches's Avatar
sketches sketches is offline
Registered User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Mar 2004
Posts: 14 sketches User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 0
Lux, you might have over looked one point. my IDs are not arranged in any order. for an ID which was not added into the heap of 10, if the same ID was found after it is being thrown out, then the frequency would not be captured. hope you get wat i am trying to say.

Reply With Quote
  #10  
Old March 22nd, 2004, 06:26 PM
Lux Perpetua Lux Perpetua is offline
Contributing User
Dev Shed Beginner (1000 - 1499 posts)
 
Join Date: Feb 2004
Location: San Francisco Bay
Posts: 1,475 Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level)Lux Perpetua User rank is Major (30000 - 40000 Reputation Level) 
Time spent in forums: 3 Weeks 6 Days 21 m 14 sec
Reputation Power: 377
I think that's taken care of by the hash table I mentioned in my first reply: you try to insert each ID into the hash table as a key; either the ID isn't in the hash table, in which case it gets inserted with a value of 1 (since you've just seen it once), or the ID is there already, in which case you increment the value (since you just saw it one more time). The heap thing is done on the keys of the hash table (comparing values), which means the same ID will never come up twice.

There are other different ways to do this, too. For example, you could iterate through ID2 and maintain a sorted list of the distinct ID's, using a hash table to keep track of the positions in the sorted list for each distinct ID. Since the operation of updating the list for a new entry in ID2 can be done in constant time (assuming good hash table performance), this is probably more efficient than my previous suggestions. Anyway, I won't go into more detail, since you're probably finished with this by now.

Reply With Quote
Reply

Viewing: Dev Shed ForumsProgramming Languages - MoreSoftware Design > Finding top 10 IDs with the highest occurence


Thread Tools  Search this Thread 
Search this Thread:

Advanced Search
Display Modes  Rate This Thread 
Rate This Thread:


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
View Your Warnings | New Posts | Latest News | Latest Threads | Shoutbox
Forum Jump


Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
  
 





© 2003-2008 by Developer Shed. All rights reserved. DS Cluster 1 hosted by Hostway
Stay green...Green IT