Software Design
 
Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
User Name:
Password:
Remember me
Go Back   Dev Shed ForumsProgramming Languages - MoreSoftware Design

Reply
Add This Thread To:
  Del.icio.us   Digg   Google   Spurl   Blink   Furl   Simpy   Y! MyWeb 
Thread Tools Search this Thread Rate Thread Display Modes
 
Unread Dev Shed Forums Sponsor:
  #1  
Old November 24th, 2005, 12:50 PM
aajiz aajiz is offline
Registered User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Nov 2005
Posts: 3 aajiz User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 1 h 41 m 47 sec
Reputation Power: 0
Find Top 10 Algorithm

Hi all

I have this interesting problem. Assume 20 items (A,B,C,D etc) are owned by a group of 10,000 people in various combinations (permutations are not important). Something like

Owner {Items}
1 {A,C}
2 {A,B,C}
3 {B,D}
4 {G}
and so on.

I need to know which top 10 items account for most number of people?

Thanks.

Reply With Quote
  #2  
Old November 24th, 2005, 02:18 PM
bah26 bah26 is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Jul 2004
Posts: 71 bah26 User rank is Corporal (100 - 500 Reputation Level)bah26 User rank is Corporal (100 - 500 Reputation Level)bah26 User rank is Corporal (100 - 500 Reputation Level)bah26 User rank is Corporal (100 - 500 Reputation Level) 
Time spent in forums: 6 h 59 m 35 sec
Reputation Power: 7
How would you do it by hand?

Reply With Quote
  #3  
Old November 24th, 2005, 02:34 PM
aajiz aajiz is offline
Registered User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Nov 2005
Posts: 3 aajiz User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 1 h 41 m 47 sec
Reputation Power: 0
I will of course use a database and scripts to achieve this. There is always a brute-force method of checking items in each and every combinations but I am hoping someone will suggest a better way.

Reply With Quote
  #4  
Old November 24th, 2005, 03:57 PM
DevCoach DevCoach is offline
Contributing User
Dev Shed Beginner (1000 - 1499 posts)
 
Join Date: Feb 2004
Location: London, England
Posts: 1,254 DevCoach User rank is Captain (20000 - 30000 Reputation Level)DevCoach User rank is Captain (20000 - 30000 Reputation Level)DevCoach User rank is Captain (20000 - 30000 Reputation Level)DevCoach User rank is Captain (20000 - 30000 Reputation Level)DevCoach User rank is Captain (20000 - 30000 Reputation Level)DevCoach User rank is Captain (20000 - 30000 Reputation Level)DevCoach User rank is Captain (20000 - 30000 Reputation Level)DevCoach User rank is Captain (20000 - 30000 Reputation Level)DevCoach User rank is Captain (20000 - 30000 Reputation Level) 
Time spent in forums: 1 Week 6 Days 8 h 10 m 34 sec
Reputation Power: 265
I dont have a definitive answer, but some ideas that come to mind:

1) as bensmyth suggested, try creating some small sample data sets and working it out by hand. Note down the steps you take to solve it.

2) reverse the problem (a): how would you find the bottom ten objects? since there are 20 objects the ones that are left must be the top ten.

3) reverse the problem (b): change the data from people with a set of objects to objects with a set of owners. e.g. (from your example):

A {1,2...}
B {2,3...}
C {1,2...}
D {3...}
G {4...}

Try solving it again by hand for the data set you used in (1). Does this make the problem any easier?

4) Consider Using a language that has a good Set data type that allows for set algebra to be used.

5) from step (3) it seems likely that there will be pairs of objects where one set of owners is a subset of the other. The object with the smaller subset can be eliminated.

Hope this helps,

Dave

BTW, is this a homework assignment, or do you have a real need for this algorithm?
Comments on this post
mvantuyl agrees!
Lux Perpetua agrees!

Reply With Quote
  #5  
Old November 24th, 2005, 04:38 PM
aajiz aajiz is offline
Registered User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Nov 2005
Posts: 3 aajiz User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 1 h 41 m 47 sec
Reputation Power: 0
Hi Dave

Thanks for your ideas. This is not a homework assignment. I need to solve a business problem at work.

Reply With Quote
  #6  
Old November 25th, 2005, 04:22 PM
LyonHaert's Avatar
LyonHaert LyonHaert is offline
Arcane Scribbler
Dev Shed Intermediate (1500 - 1999 posts)
 
Join Date: Jun 2005
Location: Indianapolis, IN
Posts: 1,849 LyonHaert User rank is Colonel (50000 - 60000 Reputation Level)LyonHaert User rank is Colonel (50000 - 60000 Reputation Level)LyonHaert User rank is Colonel (50000 - 60000 Reputation Level)LyonHaert User rank is Colonel (50000 - 60000 Reputation Level)LyonHaert User rank is Colonel (50000 - 60000 Reputation Level)LyonHaert User rank is Colonel (50000 - 60000 Reputation Level)LyonHaert User rank is Colonel (50000 - 60000 Reputation Level)LyonHaert User rank is Colonel (50000 - 60000 Reputation Level)LyonHaert User rank is Colonel (50000 - 60000 Reputation Level)LyonHaert User rank is Colonel (50000 - 60000 Reputation Level)LyonHaert User rank is Colonel (50000 - 60000 Reputation Level)LyonHaert User rank is Colonel (50000 - 60000 Reputation Level) 
Time spent in forums: 1 Month 1 Week 10 h 54 m 59 sec
Reputation Power: 524
I'm all for figuring out problems, but sometimes it takes a series of improved ideas to come to the best one, and that can take a while. With the way I would approach this, it is more of a database design (and SQL) solution than a programming solution.

However, this will only work without modification under the following circumstances:
1. Each owner is unique and has their own ID (usually an auto incrementing primary key) in a database table.
2. Each item is unique and has its own ID (same as above) in a second database table.
3. Each item can only be associated once to any owner. For example, you can't have (using your notation Owner{Items}) 34{A,A,B,C,C}, but only once each, as in 34{A,B,C}.

You create a third table with at least three columns: an id column (auto incrementing, primary key) to identify each association, a column for the owner ID, and a column for the item ID. (Don't forget, you'll want index on these columns.)

Then for each item under an owner, you will insert a record into this association table. So for 34{A,B,C}, where the 34 is the owner ID, and the item IDs for A, B, and C are 101, 102, 103 (respectively), you would do 3 inserts:

insert into owners_to_items (ownerID, itemID) values (34, 101);
insert into owners_to_items (ownerID, itemID) values (34, 102);
insert into owners_to_items (ownerID, itemID) values (34, 103);

Then, to find out the top 10 items, you just need some SQL like this:

select itemID, count(*) as item_cnt from owners_to_items group by itemID order by item_cnt desc limit 10;

That will give you a result set of two columns:
1. The item ID.
2. How many owners that item is associated to.

Reply With Quote
  #7  
Old January 9th, 2006, 09:15 AM
tragen tragen is offline
Registered User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Jan 2006
Posts: 7 tragen User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 44 m 44 sec
Reputation Power: 0
I think the databases are probably the best way to go.. but if you want a really quick and dirty way to do it you could get away with a simple hash table, you are looking at a complexity of about O(n) . Which I would tend to argue would be almost that same as a bunch of DB calls. With a n < 100,000.

so I thought I would just throw that out there.







Quote:
Originally Posted by aajiz
Hi all
I have this interesting problem. Assume 20 items (A,B,C,D etc) are owned by a group of 10,000 people in various combinations (permutations are not important). Something like

Owner {Items}
1 {A,C}
2 {A,B,C}
3 {B,D}
4 {G}
and so on.

I need to know which top 10 items account for most number of people?

Thanks.

Reply With Quote
  #8  
Old January 13th, 2006, 04:40 PM
Adrastea0413's Avatar
Adrastea0413 Adrastea0413 is offline
Contributing User
Dev Shed Intermediate (1500 - 1999 posts)
 
Join Date: Dec 2003
Location: Washington, DC Metro
Posts: 1,742 Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level)Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level)Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level)Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level)Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level)Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level)Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level)Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level)Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level)Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level)Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level)Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level)Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level)Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level) 
Time spent in forums: 1 Month 4 Days 20 h 28 m 9 sec
Reputation Power: 804
Facebook
Quote:
Originally Posted by LyonHaert
You create a third table with at least three columns: an id column (auto incrementing, primary key) to identify each association, a column for the owner ID, and a column for the item ID. (Don't forget, you'll want index on these columns.)


What would be the benefits of creating an auto incrementing PK field? Why not just make two fields, owner_id and item_id, and make them a joint PK? That way the combination 34,101 (for example) could only be entered once.

Reply With Quote
  #9  
Old January 13th, 2006, 05:04 PM
LyonHaert's Avatar
LyonHaert LyonHaert is offline
Arcane Scribbler
Dev Shed Intermediate (1500 - 1999 posts)
 
Join Date: Jun 2005
Location: Indianapolis, IN
Posts: 1,849 LyonHaert User rank is Colonel (50000 - 60000 Reputation Level)LyonHaert User rank is Colonel (50000 - 60000 Reputation Level)LyonHaert User rank is Colonel (50000 - 60000 Reputation Level)LyonHaert User rank is Colonel (50000 - 60000 Reputation Level)LyonHaert User rank is Colonel (50000 - 60000 Reputation Level)LyonHaert User rank is Colonel (50000 - 60000 Reputation Level)LyonHaert User rank is Colonel (50000 - 60000 Reputation Level)LyonHaert User rank is Colonel (50000 - 60000 Reputation Level)LyonHaert User rank is Colonel (50000 - 60000 Reputation Level)LyonHaert User rank is Colonel (50000 - 60000 Reputation Level)LyonHaert User rank is Colonel (50000 - 60000 Reputation Level)LyonHaert User rank is Colonel (50000 - 60000 Reputation Level) 
Time spent in forums: 1 Month 1 Week 10 h 54 m 59 sec
Reputation Power: 524
That would also work. I prefer having a single id to refer to that specific association entry ("WHERE id = <id>") rather than matching by the pairing of the ownerid and the itemid ("WHERE ownerid = <ownerid> AND itemid = <itemid>").

The reason for this is I often have to deal with items displayed in a list box (VB6, C# .NET). With the listbox, there is a built-in way to have an integer associated with that entry, but only one. Even in my occasional PHP work, option tags (inside a select tag, of course) only have one value attribute that I use the same way.

If you dropped the auto_incrementing PK for each association, you wouldn't be able to take advantage of that. You'd have to use the more cumbersome method of having an array (probably of a custom value-type) independent of the listbox.

Any changes in performance with the database between these two is probably marginal at most (but I'm guessing here). I simply find my method easier to code for.

I also do similar assocation tables in some projects, but they often allow multiple 'items' per 'owner'. Having a two-field PK would cause an SQL error, which I would then have to catch so the user doesn't see an ugly, incomprehensible fatal error. If duplicates were not allowed, I'd be checking before the insert to make sure it wasn't a duplicate anyway.

(Kindof when on and on a bit, eh?) Anyway, that's the advantages as I see them.

Your turn. Tell me the advantages of a joint PK, as you see it. I haven't had cause to use one yet, but I always appreciate examples.

Reply With Quote
  #10  
Old January 19th, 2006, 10:14 AM
Adrastea0413's Avatar
Adrastea0413 Adrastea0413 is offline
Contributing User
Dev Shed Intermediate (1500 - 1999 posts)
 
Join Date: Dec 2003
Location: Washington, DC Metro
Posts: 1,742 Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level)Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level)Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level)Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level)Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level)Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level)Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level)Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level)Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level)Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level)Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level)Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level)Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level)Adrastea0413 User rank is Major General (70000 - 90000 Reputation Level) 
Time spent in forums: 1 Month 4 Days 20 h 28 m 9 sec
Reputation Power: 804
Facebook
Quote:
Originally Posted by LyonHaert
That would also work. I prefer having a single id to refer to that specific association entry ("WHERE id = <id>") rather than matching by the pairing of the ownerid and the itemid ("WHERE ownerid = <ownerid> AND itemid = <itemid>").

The reason for this is I often have to deal with items displayed in a list box (VB6, C# .NET). With the listbox, there is a built-in way to have an integer associated with that entry, but only one. Even in my occasional PHP work, option tags (inside a select tag, of course) only have one value attribute that I use the same way.

If you dropped the auto_incrementing PK for each association, you wouldn't be able to take advantage of that. You'd have to use the more cumbersome method of having an array (probably of a custom value-type) independent of the listbox.

Any changes in performance with the database between these two is probably marginal at most (but I'm guessing here). I simply find my method easier to code for.

I also do similar assocation tables in some projects, but they often allow multiple 'items' per 'owner'. Having a two-field PK would cause an SQL error, which I would then have to catch so the user doesn't see an ugly, incomprehensible fatal error. If duplicates were not allowed, I'd be checking before the insert to make sure it wasn't a duplicate anyway.

(Kindof when on and on a bit, eh?) Anyway, that's the advantages as I see them.

Your turn. Tell me the advantages of a joint PK, as you see it. I haven't had cause to use one yet, but I always appreciate examples.


I agree with you regarding your use of dropdown menus and such. It would only make sense to use <option value="1"> rather than <option value="12|3"> and then splitting the value at the | and then saying the "12" is the customer and the "3" is the item or whatever.

However, in this particular problem, I don't see how there will be any instances of a dropdown, it seems to be more of a statistical analysis. And with a joint PK, it seems to just be more logical to me. If you want to count all of user #10's items or all of the users that own item B, you're still going to do a "WHERE item_id='B'" or a "WHERE owner_id=10" statement. There just doesn't appear to be a use for an auto_increment field in this problem. However, I don't know the problem fully, so maybe there is.... I just don't see it.

If the author of this message wanted to display a dropdown to select a row, it still wouldn't be hard to do with a joint PK. The dropdown probably wouldn't contain a list of all possible combinations (e.g. user 10 and item B, user 10 and item C, user 11 and item B)... it would probably contain half of the formula. A dropdown would probably be on a page such as displayitems.php?user=10 and would have a dropdown such as item B, item C, and the other items for user 10. Or a page such as displayusers.php?item=B and would have a dropdown that had user 10, user 11, and the other users that own item B. And at all times, you have 1 value for the option, but both pieces of the ID.

The primary key is meant to define unique records, and the combination of an item_id and an owner_id will define any unique record and I feel they would be a perfectly suitable PK. No need to bloat the table with an additional useless column.

Sorry for the additional rambling :-)

Reply With Quote
  #11  
Old January 20th, 2006, 09:21 AM
jkmyoung jkmyoung is offline
Contributing User
Dev Shed Novice (500 - 999 posts)
 
Join Date: Sep 2004
Posts: 505 jkmyoung User rank is Sergeant Major (2000 - 5000 Reputation Level)jkmyoung User rank is Sergeant Major (2000 - 5000 Reputation Level)jkmyoung User rank is Sergeant Major (2000 - 5000 Reputation Level)jkmyoung User rank is Sergeant Major (2000 - 5000 Reputation Level)jkmyoung User rank is Sergeant Major (2000 - 5000 Reputation Level)jkmyoung User rank is Sergeant Major (2000 - 5000 Reputation Level) 
Time spent in forums: 3 Days 5 h 22 m 31 sec
Reputation Power: 53
First, I hope I'm understanding the question correctly. You want to pick 10 items which maximizes the number of people who have at least one of these items, perhaps more than one. If I misinterpreted, then the rest of what I have to say should be ignored.

hrmm. I doubt with 10000 rows, anyone would want to sort by hand.
I'd have to agree that a 2-key approach looks much better and would definitely save space -> save time with less reading/writing of results.

Quote:
Originally Posted by DevCoach
2) reverse the problem (a): how would you find the bottom ten objects? since there are 20 objects the ones that are left must be the top ten.

This is wrong. Simpler example with 6 items, finding bottom 3
1(A) 2(A) 3(AF) 4(ABF) 5(BE) 6(BDE) 7(CD) 8(C)
Worse is CDE : (5678) 4 people.
Reverse of CDE is ABF (123456) 6 people
Best is ABC (12345678) 8 people

It also looks doubtful that you can eliminate certain subsets from the information you've given us. You could eliminate owners with duplicate information, but then each of the owners would have to be weighted, which is also counter-productive.

Personally, I think a programming approach would be much faster. Anyways, I think the DB approach would be (pseudo)
COUNT(SELECT DISTINCT owner WHERE item=$A OR item=$B OR ....)

where $A, $B, etc are the item id's of the 10 items you want to select.

Quote:
Originally Posted by DevCoach
3) reverse the problem (b): change the data from people with a set of objects to objects with a set of owners. e.g. (from your example):

A {1,2...}
B {2,3...}
C {1,2...}
D {3...}
G {4...}

I like this approach. If you stored each of the 20 queries for each of the items SELECT owner WHERE item=$someitem, then you'd simply merge 10 of those queries, selecting DISTINCT owner each time. Depending on the way you merge, you could save a lot of time, especially if previous results of merges were stored and could be recalled. EG. if A,B,C,D,E,F,G,H,I,J were your 10 items.
First pass, merge A&B, C&D : (AB, CD, E, F, G, H, I, J)
second pass (ABE, CDF, GH, IJ)
third pass (ABEGH, CDDFIJ)
final pass (ABCDEFGHIJ)


Unfortunately, the top 9 items may be totally different from the top 10 items, although this is rarely the case in real life practice. Subsets will not necessarily help. Unfortunately, to get the best answer you'd probably have to use a brute-force algorithm, 20C10 = 184756 I suggest using a 'genetic' algorithm.
1. Pick 10 random items from the 20.
2. Do the query to find out how many owners there are.
3. Out of those pick a fixed sized, (say 3?) items to remove.
4. Pick 3 items from the ones currently not used. (May get some of those you removed)
5. Do query.
6. Compare to previous result. If better, take the new one, otherwise revert back
7. Go to step 3

If you cycle enough times, you should get an answer that approximates the best answer fairly well.

Reply With Quote
Reply

Viewing: Dev Shed ForumsProgramming Languages - MoreSoftware Design > Find Top 10 Algorithm


Thread Tools  Search this Thread 
Search this Thread:

Advanced Search
Display Modes  Rate This Thread 
Rate This Thread:


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
View Your Warnings | New Posts | Latest News | Latest Threads | Shoutbox
Forum Jump