|
|
|||||||||
|
|||||||||
| |||||||||
|
|
|
| |||||||||
![]() |
|
|
«
Previous Thread
|
Next Thread
»
|
Thread Tools | Search this Thread | Rate Thread | Display Modes |
|
#1
|
|||
|
|||
|
Find Top 10 Algorithm
Hi all
I have this interesting problem. Assume 20 items (A,B,C,D etc) are owned by a group of 10,000 people in various combinations (permutations are not important). Something like Owner {Items} 1 {A,C} 2 {A,B,C} 3 {B,D} 4 {G} and so on. I need to know which top 10 items account for most number of people? Thanks. |
|
#2
|
|||
|
|||
|
How would you do it by hand?
|
|
#3
|
|||
|
|||
|
I will of course use a database and scripts to achieve this. There is always a brute-force method of checking items in each and every combinations but I am hoping someone will suggest a better way.
|
|
#4
|
|||
|
|||
|
I dont have a definitive answer, but some ideas that come to mind:
1) as bensmyth suggested, try creating some small sample data sets and working it out by hand. Note down the steps you take to solve it. 2) reverse the problem (a): how would you find the bottom ten objects? since there are 20 objects the ones that are left must be the top ten. 3) reverse the problem (b): change the data from people with a set of objects to objects with a set of owners. e.g. (from your example): A {1,2...} B {2,3...} C {1,2...} D {3...} G {4...} Try solving it again by hand for the data set you used in (1). Does this make the problem any easier? 4) Consider Using a language that has a good Set data type that allows for set algebra to be used. 5) from step (3) it seems likely that there will be pairs of objects where one set of owners is a subset of the other. The object with the smaller subset can be eliminated. Hope this helps, Dave BTW, is this a homework assignment, or do you have a real need for this algorithm? |
|
#5
|
|||
|
|||
|
Hi Dave
Thanks for your ideas. This is not a homework assignment. I need to solve a business problem at work. |
|
#6
|
||||
|
||||
|
I'm all for figuring out problems, but sometimes it takes a series of improved ideas to come to the best one, and that can take a while. With the way I would approach this, it is more of a database design (and SQL) solution than a programming solution.
However, this will only work without modification under the following circumstances: 1. Each owner is unique and has their own ID (usually an auto incrementing primary key) in a database table. 2. Each item is unique and has its own ID (same as above) in a second database table. 3. Each item can only be associated once to any owner. For example, you can't have (using your notation Owner{Items}) 34{A,A,B,C,C}, but only once each, as in 34{A,B,C}. You create a third table with at least three columns: an id column (auto incrementing, primary key) to identify each association, a column for the owner ID, and a column for the item ID. (Don't forget, you'll want index on these columns.) Then for each item under an owner, you will insert a record into this association table. So for 34{A,B,C}, where the 34 is the owner ID, and the item IDs for A, B, and C are 101, 102, 103 (respectively), you would do 3 inserts: insert into owners_to_items (ownerID, itemID) values (34, 101); insert into owners_to_items (ownerID, itemID) values (34, 102); insert into owners_to_items (ownerID, itemID) values (34, 103); Then, to find out the top 10 items, you just need some SQL like this: select itemID, count(*) as item_cnt from owners_to_items group by itemID order by item_cnt desc limit 10; That will give you a result set of two columns: 1. The item ID. 2. How many owners that item is associated to. |
|
#7
|
|||
|
|||
|
I think the databases are probably the best way to go.. but if you want a really quick and dirty way to do it you could get away with a simple hash table, you are looking at a complexity of about O(n) . Which I would tend to argue would be almost that same as a bunch of DB calls. With a n < 100,000.
so I thought I would just throw that out there. Quote:
|
|
#8
|
||||
|
||||
|
Quote:
What would be the benefits of creating an auto incrementing PK field? Why not just make two fields, owner_id and item_id, and make them a joint PK? That way the combination 34,101 (for example) could only be entered once. |
|
#9
|
||||
|
||||
|
That would also work. I prefer having a single id to refer to that specific association entry ("WHERE id = <id>") rather than matching by the pairing of the ownerid and the itemid ("WHERE ownerid = <ownerid> AND itemid = <itemid>").
The reason for this is I often have to deal with items displayed in a list box (VB6, C# .NET). With the listbox, there is a built-in way to have an integer associated with that entry, but only one. Even in my occasional PHP work, option tags (inside a select tag, of course) only have one value attribute that I use the same way. If you dropped the auto_incrementing PK for each association, you wouldn't be able to take advantage of that. You'd have to use the more cumbersome method of having an array (probably of a custom value-type) independent of the listbox. Any changes in performance with the database between these two is probably marginal at most (but I'm guessing here). I simply find my method easier to code for. I also do similar assocation tables in some projects, but they often allow multiple 'items' per 'owner'. Having a two-field PK would cause an SQL error, which I would then have to catch so the user doesn't see an ugly, incomprehensible fatal error. If duplicates were not allowed, I'd be checking before the insert to make sure it wasn't a duplicate anyway. (Kindof when on and on a bit, eh?) Anyway, that's the advantages as I see them. Your turn. Tell me the advantages of a joint PK, as you see it. I haven't had cause to use one yet, but I always appreciate examples. |
|
#10
|
||||
|
||||
|
Quote:
I agree with you regarding your use of dropdown menus and such. It would only make sense to use <option value="1"> rather than <option value="12|3"> and then splitting the value at the | and then saying the "12" is the customer and the "3" is the item or whatever. However, in this particular problem, I don't see how there will be any instances of a dropdown, it seems to be more of a statistical analysis. And with a joint PK, it seems to just be more logical to me. If you want to count all of user #10's items or all of the users that own item B, you're still going to do a "WHERE item_id='B'" or a "WHERE owner_id=10" statement. There just doesn't appear to be a use for an auto_increment field in this problem. However, I don't know the problem fully, so maybe there is.... I just don't see it. If the author of this message wanted to display a dropdown to select a row, it still wouldn't be hard to do with a joint PK. The dropdown probably wouldn't contain a list of all possible combinations (e.g. user 10 and item B, user 10 and item C, user 11 and item B)... it would probably contain half of the formula. A dropdown would probably be on a page such as displayitems.php?user=10 and would have a dropdown such as item B, item C, and the other items for user 10. Or a page such as displayusers.php?item=B and would have a dropdown that had user 10, user 11, and the other users that own item B. And at all times, you have 1 value for the option, but both pieces of the ID. The primary key is meant to define unique records, and the combination of an item_id and an owner_id will define any unique record and I feel they would be a perfectly suitable PK. No need to bloat the table with an additional useless column. Sorry for the additional rambling :-) |
|
#11
|
||||
|
||||
|
First, I hope I'm understanding the question correctly. You want to pick 10 items which maximizes the number of people who have at least one of these items, perhaps more than one. If I misinterpreted, then the rest of what I have to say should be ignored.
hrmm. I doubt with 10000 rows, anyone would want to sort by hand. I'd have to agree that a 2-key approach looks much better and would definitely save space -> save time with less reading/writing of results. Quote:
This is wrong. Simpler example with 6 items, finding bottom 3 1(A) 2(A) 3(AF) 4(ABF) 5(BE) 6(BDE) 7(CD) 8(C) Worse is CDE : (5678) 4 people. Reverse of CDE is ABF (123456) 6 people Best is ABC (12345678) 8 people It also looks doubtful that you can eliminate certain subsets from the information you've given us. You could eliminate owners with duplicate information, but then each of the owners would have to be weighted, which is also counter-productive. Personally, I think a programming approach would be much faster. Anyways, I think the DB approach would be (pseudo) COUNT(SELECT DISTINCT owner WHERE item=$A OR item=$B OR ....) where $A, $B, etc are the item id's of the 10 items you want to select. Quote:
I like this approach. If you stored each of the 20 queries for each of the items SELECT owner WHERE item=$someitem, then you'd simply merge 10 of those queries, selecting DISTINCT owner each time. Depending on the way you merge, you could save a lot of time, especially if previous results of merges were stored and could be recalled. EG. if A,B,C,D,E,F,G,H,I,J were your 10 items. First pass, merge A&B, C&D : (AB, CD, E, F, G, H, I, J) second pass (ABE, CDF, GH, IJ) third pass (ABEGH, CDDFIJ) final pass (ABCDEFGHIJ) Unfortunately, the top 9 items may be totally different from the top 10 items, although this is rarely the case in real life practice. Subsets will not necessarily help. Unfortunately, to get the best answer you'd probably have to use a brute-force algorithm, 20C10 = 184756 I suggest using a 'genetic' algorithm. 1. Pick 10 random items from the 20. 2. Do the query to find out how many owners there are. 3. Out of those pick a fixed sized, (say 3?) items to remove. 4. Pick 3 items from the ones currently not used. (May get some of those you removed) 5. Do query. 6. Compare to previous result. If better, take the new one, otherwise revert back 7. Go to step 3 If you cycle enough times, you should get an answer that approximates the best answer fairly well. |
![]() |
| Viewing: Dev Shed Forums > Programming Languages - More > Software Design > Find Top 10 Algorithm |
| Thread Tools | Search this Thread |
| Display Modes | Rate This Thread |
|
|