#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2007
    Posts
    7
    Rep Power
    0

    Set-Based Search


    Hello -
    I need to efficiently search a large number of sets to try and find a match, or near match, for a given set.
    This is to take place in a web-application. Has anyone got any suggestions on how I might do this while being reasonable fast and efficient? Oh, I'll be doing it in ruby, which is why I posted in here.

    All ideas I've come up with so far seem to be leading nowhere.

    Thanks for any input!
    -tschneit
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2007
    Location
    /usr/bin/ruby
    Posts
    63
    Rep Power
    29
    What do you have so far?
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jan 2004
    Location
    Constant Limbo
    Posts
    989
    Rep Power
    363
    What are these 'sets'? Do you have the option of pushing the search out to a separate thread to offset any delay? Have you investigated what is out there? If so how does your problem differ?
    What you ask is rather generic; there are many existing solutions that I'm sure will fit your problem. The above questions are some items that will help us (and you) to better understand and analyze the problem.
    True happiness is not getting what you want, it's wanting what you've already got.

    My Blog
  6. #4
  7. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2007
    Posts
    7
    Rep Power
    0
    I think this should cover those questions relatively well... I posted in the PHP forum as well, as this is more of a theory question than anything having to do with a particular language. Follow this link:
    http://forums.devshed.com/php-development-5/set-based-search-481432.html


    Thanks!
  8. #5
  9. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jan 2004
    Location
    Constant Limbo
    Posts
    989
    Rep Power
    363
    You speak of a problem closely related to the 'knapsack problem' (google for details) which is known to be NP-Hard.
    Based on your replies in the other thread, I would suggest a Trie. A Trie will allow you (if you sort your sets) to 'tag' node that represent the end of a set. Once you have this, you can do a search from that point to see what comes close.
    For instance, if you have a node in the tree that represents the end of a set (at depth n) you can look at all nodes that are at depth n + 1 and link to the end node mentioned above. That means they all contain only one more item than the set you have. You can also do the same from the parent node of your end point (depth n - 1) and you will have all sets that have an equal number of elements different by only one from your set. You can see that teh search expands exponentially, but with a good sort (*) you can find close matches fast when they exist.
    Realize that this depends on the elements of your set as well. Some things (such as strings and numbers) lend themselves to this rather nicely, but objects in general may not fit the 'ordered' model.

    (*) By sort I mean that the elements will have to be ordered by some standard (meaningful) method in your Trie. If you order elements in an inconvenient way you could end up with the search needing to return to the root of the Trie before following the closest match.
    True happiness is not getting what you want, it's wanting what you've already got.

    My Blog
  10. #6
  11. Banned ;)
    Devshed Supreme Being (6500+ posts)

    Join Date
    Nov 2001
    Location
    Woodland Hills, Los Angeles County, California, USA
    Posts
    9,648
    Rep Power
    4248
    Another way would be to reorganize your data somewhat. Let's say you have the following sets (going by the example on the other thread):
    A = {1, 2, 3}
    B = {4, 3, 7}
    C = {2, 8, 7}
    D = {1, 2, 4}
    E = {6, 2, 3}

    And you want to find a set containing (1, 2, 3) (ideally, this would be A), but will also settle for 2 out of 3 matches (D and E). You could reorganize your data as a hash of arrays, such as:
    1 => [A, D]
    2 => [A, C, D, E],
    3 => [A, B, E],
    4 => [B, D]
    6 => [E]
    7 => [B, C]
    8 => [C]

    Then, when searching, you could find all the sets that 1, 2 and 3 belong to, by quickly looking up the hash. You could also increment counters and figure out which set has the maximum number of hits. After scanning like this, you'd end up with
    Hits for A = 3
    Hits for B = 1
    Hits for C = 1
    Hits for D = 2
    Hits for E = 2

    Hence you can pick A, D or E

    By the way, you aren't trying to write some kind of search engine by any chance, are you?
    Up the Irons
    What Would Jimi Do? Smash amps. Burn guitar. Take the groupies home.
    "Death Before Dishonour, my Friends!!" - Bruce D ickinson, Iron Maiden Aug 20, 2005 @ OzzFest
    Down with Sharon Osbourne

    "I wouldn't hire a butcher to fix my car. I also wouldn't hire a marketing firm to build my website." - Nilpo
  12. #7
  13. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2007
    Posts
    7
    Rep Power
    0
    I like the trie idea. It seems like it could work, once I brush myself up on the operation I'll know a bit better.

    I like the idea of hashing the sets to values. After picking all of the possible sets, an intersection could be performed to find the ones that use the most ingredients.

    Still, I am worried about performance. I am writing in ruby (and rails), and I dont know if there is a way to keep a model loaded up in the background, on which I might run these searches. More pondering must be done

    And yes, I am building some type of search engine, and I know I shouldn't reinvent the wheel. I've looked at what I could find for search implementations and I dont know if I can bend them to my will.

    Thanks for your input folks!
  14. #8
  15. No Profile Picture
    Gödelian monster
    Devshed Regular (2000 - 2499 posts)

    Join Date
    Jul 1999
    Location
    Central Florida, USA
    Posts
    2,307
    Rep Power
    62
    Far be it from me to brutishly intervene, but if performance is a consideration, have you considered using a real set-based search tool, such as an SQL DBMS, perhaps?

    If you want performance, Ruby is almost certainly not the language for such a thing. It excels at making things easy to manipulate, especially for pulling together an informal application, but it was never intended for processing large volumes of data.

    And if you still want to deal with array/set notation, my favorite DBMS happens to have an array meta-type, and a very nice complement of array operators and functions (operators like "overlap" and "is contained by" allow you to treat the arrays like unordered sets). You can also create custom datatypes and operators, although you might just be best off sticking with normal relations and queries. And PostgreSQL is sure to be much faster at searching your data than anything except a custom C application.

    Now, if the whole point of this is an intellectual exercise, or research project or somesuch, then that's another story, and I applaud your effort. But if you need something practical and production-ready, then you might want to consider using tools that are made for the purpose. Ruby happens to have a Postgres module, of course.
    The real n-tier system:

    FreeBSD -> PostgreSQL -> [any_language] -> Apache -> Mozilla/XUL

    Amazon wishlist -- rycamor (at) gmail.com

IMN logo majestic logo threadwatch logo seochat tools logo