|
|
|||||||||
|
|||||||||
| |||||||||
|
|
|
| |||||||||
![]() |
|
|
«
Previous Thread
|
Next Thread
»
|
Thread Tools | Search this Thread | Rate Thread | Display Modes |
|
#1
|
||||
|
||||
|
Optimizing Phrase Searches
I am struggling with a problem in my custom search engine. I need to be able to do phrase searching on a large word set ( ~ 37K words in the lexicon) and a moderately large table of indexed terms (~ 370K records). Here's how I do it now:
The relation mIndex contains three attributes: Code:
-------------------------------------------------- loc | uid | tid | -------------------------------------------------- 9 | 2113368521 | 1065016808 | 10 | 2113368521 | -1174825386 | 11 | 2113368521 | 816805332 | -------------------------------------------------- uid and tid correspond to urls and terms respectively in the table pageInfo and termInfo respectively. They're stored like this because the long ints take up less space and there's additional metadata about the pages and terms that is stored in the other tables (like the count of a term or the page's title, etc.). loc is the short int location of the term tid on page uid. Let's pretend that the three stored records are the terms 'foo', 'bar', and 'blah' stored on the page 'index.htm'. Therefore, to do phrase searching: 1. Look up the tid of the query's first term and use that (via nested SELECT - no VIEWs) to look up all of the unique uids that it appears in and the locations it appears. Code:
SELECT TOP 200 uid FROM mIndex m WHERE m.tid = (SELECT tid FROM termInfo WHERE term='foo') 2. Foreach of the remaining terms, I look through the results to see if that term appears at the next location (loc+1) in the same uid (page). Code:
SELECT loc FROM mIndex m WHERE m.tid = (SELECT tid FROM termInfo WHERE term='bar') AND m.loc = $loc+1 AND m.uid = $uid Effectively, it's saying "ok, go find me as many as 200 places where the first term in the query appears on the site and give me the positions (locations, loc) of each of them and which page we're referring to (e.g. each result will be a unique pairing of a page and a location in that page). Then, for each of the remaining terms, we try to find where the next location (position, loc) on a url is equal to the next term in the query. This continues until the query is exhausted or the length of the phrase is too much and it errs out. All of this looping in the application really hurts. If you understand what I'm doing... I'd love to hear any suggestions on how to improve upon this algorithm (beyond hardware and RDBMS upgrades.. they're already planned). One thing I already plan on doing is putting the lexicon for the system in-memory. This will allow me to query the lexicon for tids instead of doing those godawful compound SELECTs. |
|
#2
|
||||
|
||||
|
1. Why subselect when a simple inner-join will do?
Code:
instead of: SELECT TOP 200 uid FROM mIndex m WHERE m.tid = (SELECT tid FROM termInfo WHERE term='foo') How about using: SELECT TOP 200 m.uid FROM mIndex AS m INNER JOIN termInfo AS t ON t.tid = m.tid WHERE t.term = 'foo' 2. What if the words are out of sequence. i.e. what if "bar" appears before "foo" in the word set, then the page will not show up for your query, since the loc in question is $loc - 1 instead of $loc + 1. I would use something like this and combine all the queries (including the one above) into one query: Code:
SELECT TOP 200 COUNT(*), m.uid
FROM mIndex AS m WITH (NOLOCK)
INNER JOIN termInfo AS t WITH (NOLOCK)
ON t.tid = m.tid
WHERE t.term IN ('foo', 'bar', 'baz')
GROUP BY m.uid
ORDER BY 1 DESC
This will get around the possibility of the words appearing out of sequence. Also, there's only one query to the database instead of 200. Note that I assumed you were using MS SQL, so I tossed in the WITH (NOLOCK) part in to the query -- drop this if you don't have MS SQL. Hope this helps ![]()
__________________
Up the Irons What Would Jimi Do? Smash amps. Burn guitar. Take the groupies home. "Death Before Dishonour, my Friends!!" - Bruce D ickinson, Iron Maiden Aug 20, 2005 @ OzzFest Down with Sharon Osbourne Puzzle of the Month solved by Keath and KevinADC, superior perl programmers of the month |
|
#3
|
||||
|
||||
|
Hmmm... in my attempt to explain what I'm doing, I believe I garbled my actual implemenation.
You are, of course, right - using the INNER JOIN from suggestion 1 is good, and I am doing that. However, I'm actually joining to the results on pageInfo because I'm gathering data from that as well. I'm doing it that way in anticipation of moving the lexicon to an in-memory position where I can grab tids from a fast Perl hash and bypass the database altogether. The second selection, however, looks like a veeeeerrrryyyyyy promising change. There's an extra layer of complexity that I neglected in my attempt to explain what I was doing - restriction by section. I'll have to see about implementing your suggestion along with that. Also, words out of sequence isn't a big issue, it's just a confusing example I posted (again...). The 'loc' is actually just a short int representing at what position on a page a term appears, it's not really a counter in any order (e.g. the three locs listed above could just as easily have been 12, 67, and 22 and all it would mean is that for that tid/uid pairing, the location on the pages was position 12, 67, and 22 respectively - there's no order to them in the database, they're just arbitrary metadata about the tid/uid pairing). Unfortunately, the "database server" is Microsoft Access ATM. I've developed a concerted and hard push to move onto MS-SQL 7 because we're really pushing Access to its (exceptionally low) limit.... Thanks Scorpions! I'll let you know what the result of implementing suggestion 2 is when I get back to work tomorrow. Last edited by Ctb : October 7th, 2003 at 10:16 PM. |
![]() |
| Viewing: Dev Shed Forums > Programming Languages - More > Software Design > Optimizing Phrase Searches |
| Thread Tools | Search this Thread |
| Display Modes | Rate This Thread |
|
|
|
|