Software Design
 
Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
User Name:
Password:
Remember me
Go Back   Dev Shed ForumsProgramming Languages - MoreSoftware Design

Reply
Add This Thread To:
  Del.icio.us   Digg   Google   Spurl   Blink   Furl   Simpy   Y! MyWeb 
Thread Tools Search this Thread Rate Thread Display Modes
 
Unread Dev Shed Forums Sponsor:
  #1  
Old October 7th, 2003, 04:27 PM
Ctb's Avatar
Ctb Ctb is offline
An Ominous Coward
Dev Shed Specialist (4000 - 4499 posts)
 
Join Date: Jan 2002
Posts: 4,425 Ctb User rank is Colonel (50000 - 60000 Reputation Level)Ctb User rank is Colonel (50000 - 60000 Reputation Level)Ctb User rank is Colonel (50000 - 60000 Reputation Level)Ctb User rank is Colonel (50000 - 60000 Reputation Level)Ctb User rank is Colonel (50000 - 60000 Reputation Level)Ctb User rank is Colonel (50000 - 60000 Reputation Level)Ctb User rank is Colonel (50000 - 60000 Reputation Level)Ctb User rank is Colonel (50000 - 60000 Reputation Level)Ctb User rank is Colonel (50000 - 60000 Reputation Level)Ctb User rank is Colonel (50000 - 60000 Reputation Level)Ctb User rank is Colonel (50000 - 60000 Reputation Level)Ctb User rank is Colonel (50000 - 60000 Reputation Level) 
Time spent in forums: 3 Weeks 10 h
Reputation Power: 0
Optimizing Phrase Searches

I am struggling with a problem in my custom search engine. I need to be able to do phrase searching on a large word set ( ~ 37K words in the lexicon) and a moderately large table of indexed terms (~ 370K records). Here's how I do it now:

The relation mIndex contains three attributes:
Code:
--------------------------------------------------
loc    |           uid            |         tid          |
--------------------------------------------------
 9     | 2113368521       | 1065016808  |
10    | 2113368521       | -1174825386 |
11    | 2113368521       | 816805332    |
--------------------------------------------------

uid and tid correspond to urls and terms respectively in the table pageInfo and termInfo respectively. They're stored like this because the long ints take up less space and there's additional metadata about the pages and terms that is stored in the other tables (like the count of a term or the page's title, etc.).

loc is the short int location of the term tid on page uid.

Let's pretend that the three stored records are the terms 'foo', 'bar', and 'blah' stored on the page 'index.htm'.

Therefore, to do phrase searching:

1. Look up the tid of the query's first term and use that (via nested SELECT - no VIEWs) to look up all of the unique uids that it appears in and the locations it appears.
Code:
SELECT TOP 200 uid FROM mIndex m WHERE m.tid = (SELECT tid FROM termInfo WHERE term='foo')

2. Foreach of the remaining terms, I look through the results to see if that term appears at the next location (loc+1) in the same uid (page).
Code:
SELECT loc FROM mIndex m WHERE m.tid = (SELECT tid FROM termInfo WHERE term='bar') AND m.loc = $loc+1 AND m.uid = $uid


Effectively, it's saying "ok, go find me as many as 200 places where the first term in the query appears on the site and give me the positions (locations, loc) of each of them and which page we're referring to (e.g. each result will be a unique pairing of a page and a location in that page). Then, for each of the remaining terms, we try to find where the next location (position, loc) on a url is equal to the next term in the query. This continues until the query is exhausted or the length of the phrase is too much and it errs out.

All of this looping in the application really hurts. If you understand what I'm doing... I'd love to hear any suggestions on how to improve upon this algorithm (beyond hardware and RDBMS upgrades.. they're already planned).

One thing I already plan on doing is putting the lexicon for the system in-memory. This will allow me to query the lexicon for tids instead of doing those godawful compound SELECTs.

Reply With Quote
  #2  
Old October 7th, 2003, 07:25 PM
Scorpions4ever's Avatar
Scorpions4ever Scorpions4ever is offline
Banned ;)
Click here for more information.
 
Join Date: Nov 2001
Location: Glendale, Los Angeles County, California, USA
Posts: 7,713 Scorpions4ever User rank is General 3rd Grade (Above 100000 Reputation Level)Scorpions4ever User rank is General 3rd Grade (Above 100000 Reputation Level)Scorpions4ever User rank is General 3rd Grade (Above 100000 Reputation Level)Scorpions4ever User rank is General 3rd Grade (Above 100000 Reputation Level)Scorpions4ever User rank is General 3rd Grade (Above 100000 Reputation Level)Scorpions4ever User rank is General 3rd Grade (Above 100000 Reputation Level)Scorpions4ever User rank is General 3rd Grade (Above 100000 Reputation Level)Scorpions4ever User rank is General 3rd Grade (Above 100000 Reputation Level)Scorpions4ever User rank is General 3rd Grade (Above 100000 Reputation Level)Scorpions4ever User rank is General 3rd Grade (Above 100000 Reputation Level)Scorpions4ever User rank is General 3rd Grade (Above 100000 Reputation Level)Scorpions4ever User rank is General 3rd Grade (Above 100000 Reputation Level)Scorpions4ever User rank is General 3rd Grade (Above 100000 Reputation Level)Scorpions4ever User rank is General 3rd Grade (Above 100000 Reputation Level)Scorpions4ever User rank is General 3rd Grade (Above 100000 Reputation Level)Scorpions4ever User rank is General 3rd Grade (Above 100000 Reputation Level) 
Time spent in forums: 1 Month 3 Days 10 h 46 m 57 sec
Reputation Power: 1179
1. Why subselect when a simple inner-join will do?
Code:
instead of:
SELECT TOP 200 uid FROM mIndex m WHERE m.tid = (SELECT tid FROM termInfo WHERE term='foo')

How about using:
SELECT TOP 200 m.uid 
FROM mIndex AS m
INNER JOIN termInfo AS t 
ON t.tid = m.tid
WHERE t.term = 'foo'


2. What if the words are out of sequence. i.e. what if "bar" appears before "foo" in the word set, then the page will not show up for your query, since the loc in question is $loc - 1 instead of $loc + 1. I would use something like this and combine all the queries (including the one above) into one query:
Code:
SELECT TOP 200 COUNT(*), m.uid
FROM  mIndex AS m WITH (NOLOCK)
INNER JOIN termInfo AS t WITH (NOLOCK)
ON t.tid = m.tid
WHERE t.term IN ('foo', 'bar', 'baz')
GROUP BY m.uid
ORDER BY 1 DESC


This will get around the possibility of the words appearing out of sequence. Also, there's only one query to the database instead of 200. Note that I assumed you were using MS SQL, so I tossed in the WITH (NOLOCK) part in to the query -- drop this if you don't have MS SQL. Hope this helps
__________________
Up the Irons
What Would Jimi Do? Smash amps. Burn guitar. Take the groupies home.
"Death Before Dishonour, my Friends!!" - Bruce D ickinson, Iron Maiden Aug 20, 2005 @ OzzFest
Down with Sharon Osbourne

Puzzle of the Month solved by Keath and KevinADC, superior perl programmers of the month

Reply With Quote
  #3  
Old October 7th, 2003, 09:57 PM
Ctb's Avatar
Ctb Ctb is offline
An Ominous Coward
Dev Shed Specialist (4000 - 4499 posts)
 
Join Date: Jan 2002
Posts: 4,425 Ctb User rank is Colonel (50000 - 60000 Reputation Level)Ctb User rank is Colonel (50000 - 60000 Reputation Level)Ctb User rank is Colonel (50000 - 60000 Reputation Level)Ctb User rank is Colonel (50000 - 60000 Reputation Level)Ctb User rank is Colonel (50000 - 60000 Reputation Level)Ctb User rank is Colonel (50000 - 60000 Reputation Level)Ctb User rank is Colonel (50000 - 60000 Reputation Level)Ctb User rank is Colonel (50000 - 60000 Reputation Level)Ctb User rank is Colonel (50000 - 60000 Reputation Level)Ctb User rank is Colonel (50000 - 60000 Reputation Level)Ctb User rank is Colonel (50000 - 60000 Reputation Level)Ctb User rank is Colonel (50000 - 60000 Reputation Level) 
Time spent in forums: 3 Weeks 10 h
Reputation Power: 0
Hmmm... in my attempt to explain what I'm doing, I believe I garbled my actual implemenation.

You are, of course, right - using the INNER JOIN from suggestion 1 is good, and I am doing that. However, I'm actually joining to the results on pageInfo because I'm gathering data from that as well. I'm doing it that way in anticipation of moving the lexicon to an in-memory position where I can grab tids from a fast Perl hash and bypass the database altogether.

The second selection, however, looks like a veeeeerrrryyyyyy promising change. There's an extra layer of complexity that I neglected in my attempt to explain what I was doing - restriction by section. I'll have to see about implementing your suggestion along with that.

Also, words out of sequence isn't a big issue, it's just a confusing example I posted (again...). The 'loc' is actually just a short int representing at what position on a page a term appears, it's not really a counter in any order (e.g. the three locs listed above could just as easily have been 12, 67, and 22 and all it would mean is that for that tid/uid pairing, the location on the pages was position 12, 67, and 22 respectively - there's no order to them in the database, they're just arbitrary metadata about the tid/uid pairing).

Unfortunately, the "database server" is Microsoft Access ATM. I've developed a concerted and hard push to move onto MS-SQL 7 because we're really pushing Access to its (exceptionally low) limit....

Thanks Scorpions! I'll let you know what the result of implementing suggestion 2 is when I get back to work tomorrow.

Last edited by Ctb : October 7th, 2003 at 10:16 PM.

Reply With Quote
Reply

Viewing: Dev Shed ForumsProgramming Languages - MoreSoftware Design > Optimizing Phrase Searches


Thread Tools  Search this Thread 
Search this Thread:

Advanced Search
Display Modes  Rate This Thread 
Rate This Thread:


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
View Your Warnings | New Posts | Latest News | Latest Threads | Shoutbox
Forum Jump


Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
  
 





© 2003-2008 by Developer Shed. All rights reserved. DS Cluster 5 hosted by Hostway
Stay green...Green IT