Dev Shed Lounge
 
  #1  
March 11th, 2002, 04:55 PM
Ctb
An Ominous Coward
Dev Shed Specialist (4000 - 4499 posts)
Join Date: Jan 2002
Posts: 4,425
User rank: Colonel (50000 - 60000 Reputation Level)
Time spent in forums: 3 Weeks 10 h
Reputation Power: 0
Search Engine Design

Hi all,

I have just been tasked here at work with creating a search engine for our 2000+ page intranet. We are running IIS 4.0 on a WinNT 4.0 platform. I would like to know if anyone out there knows of any web sites, books, etc. that deal with the theory behind creating an effective search engine. I, of course, first and foremost need to make it work here, but I would also like to make it something worthwhile that can be ported elsewhere to other sites on other OSes, servers, etc.

  #2  
March 11th, 2002, 05:03 PM
M.Hirsch
Contributing User
Dev Shed God 1st Plane (5500 - 5999 posts)
Join Date: Oct 2000
Location: Back in the real world.
Posts: 5,969
User rank: First Lieutenant (10000 - 20000 Reputation Level)
Time spent in forums: 1 Month 1 Day 22 h 39 m 55 sec
Reputation Power: 184
There are good search engine scripts already, so why reinvent the wheel? Look at ht://dig (this is the name of the program, not a URL).

Some theory: build indices of all words on all pages, put them into a binary tree, and build a fault-tolerant search on top of it....
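
A minimal sketch of that idea in Python, under some assumptions: a sorted word list searched with `bisect` stands in for the binary tree, and "fault-tolerant" is read as tolerating misspellings, handled here with the standard library's `difflib.get_close_matches`. The page names, sample text, and the 0.8 similarity cutoff are made up for illustration.

```python
import bisect
import difflib
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page names that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index, vocabulary, term, cutoff=0.8):
    """Look a word up by binary search; fall back to similar spellings."""
    term = term.lower()
    pos = bisect.bisect_left(vocabulary, term)
    if pos < len(vocabulary) and vocabulary[pos] == term:
        return sorted(index[term])
    # Fault tolerance: accept indexed words spelled closely enough to the query.
    hits = set()
    for word in difflib.get_close_matches(term, vocabulary, n=3, cutoff=cutoff):
        hits |= index[word]
    return sorted(hits)


# Hypothetical sample pages, not taken from any real intranet.
pages = {
    "intro.html": "Welcome to the company intranet",
    "hr.html": "Vacation policy and holiday calendar",
}
index = build_index(pages)
vocabulary = sorted(index)  # sorted list used in place of a binary tree

print(search(index, vocabulary, "vacation"))   # exact match
print(search(index, vocabulary, "vaccation"))  # misspelled query
```

The fuzzy fallback compares the query against every indexed word, which is fine for a few thousand pages but would need a smarter structure on a much larger vocabulary.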
__________________
--
Manuel Hirsch - Linux, FreeBSD, programming, administration articles, tutorials and more.

  #3  
March 11th, 2002, 05:53 PM
Bob Loblaw
Contributing User
Dev Shed Newbie (0 - 499 posts)
Join Date: Dec 2001
Posts: 174
User rank: Just a Lowly Private (1 - 20 Reputation Level)
Time spent in forums: < 1 sec
Reputation Power: 7
Yeah, htdig is pretty nice: easy, fast, and robust.
I don't know about it on your platform, though.

Also swish-e.

  #4  
March 11th, 2002, 06:46 PM
Ctb
An Ominous Coward
Dev Shed Specialist (4000 - 4499 posts)
Join Date: Jan 2002
Posts: 4,425
User rank: Colonel (50000 - 60000 Reputation Level)
Time spent in forums: 3 Weeks 10 h
Reputation Power: 0
Before I posted, ht://dig was something I figured I would look into after I found it in a search of the forums. But I also found a paper on the original development of the Google search engine and some of the methodology behind it. I think that building an effective, scalable search engine and database would be a (massively) challenging test for myself and anyone else who wants to help on the side (off company time, that is).

I note that ht://dig is provided under a GNU license... good. Does anyone out there know if anyone has already ported it to Win32?

  #5  
March 12th, 2002, 11:29 PM
Derek Petersen
Do you like PHP like ME?
Dev Shed Newbie (0 - 499 posts)
Join Date: Nov 2001
Location: St. George, Utah, of the USA
Posts: 67
User rank: Just a Lowly Private (1 - 20 Reputation Level)
Time spent in forums: < 1 sec
Reputation Power: 7
Search engines aren't all that difficult... it's just knocking down keywords using eregi(), then pulling the results out of MySQL. But then again... you said you were a Perl person.
__________________
You know you're a web programmer when you see a '$' and think of PHP rather than money.

  #6  
March 13th, 2002, 04:18 AM
binky
Gerbil
Dev Shed Intermediate (1500 - 1999 posts)
Join Date: Oct 2001
Location: In a Rotastak
Posts: 1,763
User rank: Sergeant (500 - 2000 Reputation Level)
Time spent in forums: 22 h 12 m 52 sec
Reputation Power: 18
You're using IIS, why not just use the indexing service? You'll get the search engine running in a day and that'll be that.
__________________
- Sorted!

www.ppfuk.com - Free Photo Sharing

  #7  
March 13th, 2002, 07:58 AM
Hero Zzyzzx
11
Dev Shed Demi-God (4500 - 4999 posts)
Join Date: Jul 2001
Location: Lynn, MA
Posts: 4,632
User rank: Second Lieutenant (5000 - 10000 Reputation Level)
Time spent in forums: 4 Days 23 h 12 m 33 sec
Reputation Power: 76
Quote:
Originally posted by Derek Petersen
Search engines aren't all that difficult... it's just knocking down keywords using eregi(), then pulling the results out of MySQL. But then again... you said you were a Perl person.


You have GOT to be kidding here. Search engines are EXTREMELY difficult to do at any level above extremely basic. How about stemming words, searching for phrases, required/excluded terms and phrases, relevancy ranking and searching different types of files in different locations?
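
To give a feel for just one of those pieces, here is a rough Python sketch of required/excluded terms and phrases. The query syntax (`+term`, `-term`, `"a phrase"`) and the function names are assumptions for illustration, not something specified in this thread, and matching is plain substring matching, so `php` would also hit `phpinfo`.

```python
import re

# Optional +/- sign, then either a quoted phrase or a bare word.
TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')


def parse_query(query):
    """Split a query into required, excluded, and optional terms/phrases."""
    parsed = {"required": [], "excluded": [], "optional": []}
    for sign, phrase, word in TOKEN.findall(query):
        term = (phrase or word).lower()
        if not term:
            continue  # skip empty quotes
        key = {"+": "required", "-": "excluded"}.get(sign, "optional")
        parsed[key].append(term)
    return parsed


def matches(text, parsed):
    """Return True if the text satisfies the parsed query."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    if any(term in text for term in parsed["excluded"]):
        return False
    if not all(term in text for term in parsed["required"]):
        return False
    # With no required terms, at least one optional term must be present.
    return bool(parsed["required"]) or any(
        term in text for term in parsed["optional"]
    )


parsed = parse_query('+perl -php "search engine" index')
print(parsed)
print(matches("A Perl search engine", parsed))
print(matches("Perl and PHP compared", parsed))
```

Stemming, relevancy ranking, and handling different file types would each need their own layer on top of this.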

Perl has FAR better tools for creating search engines than PHP anyway. Do a quick little search of CPAN and you'll see a bunch of different modules that give you a load of functionality.

The one I use is DBIx::FullTextSearch, which is an inverted indexer that uses MySQL for the backend. It's VERY fast, the tables are extremely well optimized, and it has a number of nice features like indexing files on the file system, web pages using LWP, and plain scalars of course. It's kind of bare-bones, but if you want/need to create your own, it's an EXCELLENT module. BTW, it supports all the stuff I mentioned above, and should work perfectly on Win32, given that it's pure Perl.
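
For readers who have not seen an inverted index stored in SQL, the general shape can be sketched as follows. This is not DBIx::FullTextSearch's API or schema; it is a Python illustration that uses the standard `sqlite3` module in place of MySQL, with a hypothetical `postings` table, all query words required, and results ordered by total word count.

```python
import re
import sqlite3


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def make_index():
    """Create an in-memory database with one row per (word, document)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE postings ("
        "word TEXT, doc TEXT, hits INTEGER, PRIMARY KEY (word, doc))"
    )
    return db


def add_document(db, doc, text):
    """Count each word in the text and store the counts for this document."""
    counts = {}
    for word in tokenize(text):
        counts[word] = counts.get(word, 0) + 1
    db.executemany(
        "INSERT OR REPLACE INTO postings VALUES (?, ?, ?)",
        [(word, doc, n) for word, n in counts.items()],
    )


def search(db, query):
    """Return documents containing every query word, best score first."""
    words = sorted(set(tokenize(query)))
    if not words:
        return []
    marks = ",".join("?" * len(words))
    rows = db.execute(
        f"SELECT doc, SUM(hits) AS score FROM postings "
        f"WHERE word IN ({marks}) "
        "GROUP BY doc HAVING COUNT(*) = ? ORDER BY score DESC, doc",
        (*words, len(words)),
    )
    return [doc for doc, _score in rows]


db = make_index()
add_document(db, "a.html", "perl search search engine")
add_document(db, "b.html", "perl search")
add_document(db, "c.html", "php forum")
print(search(db, "perl search"))
```

Because the lookup is by word rather than by scanning page text, query time depends on the number of matching rows, not on the total size of the site.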

That being said, I think you should look at htdig a little closer. It can extract text from different types of files automatically, and I'd be surprised if someone hasn't ported it to Win32. If they haven't, set up a cheap Linux box, figure out Samba, and you should be good to go.

  #8  
March 13th, 2002, 11:51 AM
Ctb
An Ominous Coward
Dev Shed Specialist (4000 - 4499 posts)
Join Date: Jan 2002
Posts: 4,425
User rank: Colonel (50000 - 60000 Reputation Level)
Time spent in forums: 3 Weeks 10 h
Reputation Power: 0
Quote:
You're using IIS, why not just use the indexing service?

Actually, IIS and Indexing Service are what prompted this request to begin with. We're so fed up with the problems we've had with the IIS server that I was basically told it was now my job to come up with a replacement for Indexing Service.

I'm going to check out that CPAN module you mentioned, Hero. I found this paper about the original Google engine, and it gave me a pretty good idea of the sort of things to take into consideration:
Anatomy of a Search Engine

I'm going to jump into the deep end here and start production on our own engine using Perl. I'm sure that I'll be back in the Perl forum asking plenty of questions once I get started.

Thanks to everyone on this thread for pitching in ideas!

  #9  
March 13th, 2002, 04:17 PM
Derek Petersen
Do you like PHP like ME?
Dev Shed Newbie (0 - 499 posts)
Join Date: Nov 2001
Location: St. George, Utah, of the USA
Posts: 67
User rank: Just a Lowly Private (1 - 20 Reputation Level)
Time spent in forums: < 1 sec
Reputation Power: 7
Quote:
You have GOT to be kidding here. Search engines are EXTREMELY difficult to do at any level above extremely basic. How about stemming words, searching for phrases, required/excluded terms and phrases, relevancy ranking and searching different types of files in different locations?


It all depends on how your search engine searches. I've seen a lot of crappy search engines that count the number of times a word appears on a page and then display the results in that order, and a lot of the time that's not what I want at all. So the way I did mine was to load a bunch of keywords for a subject into an eregi() pattern, with a variable at the end. The variable matches any one of those keywords. Then I tell the browser which links to display. Maybe this would be a harder approach for someone with 1000+ pages, because I didn't have that many. Anyway, for my somewhat smaller site, it gave me more control over what showed up and better results. I haven't quite finished it (I've taken a break from it for a while for personal reasons), but I know from testing that it works. And the great thing is that eregi() already skips words like 'of', because it just sorts through the string looking only for the keywords.
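
A rough reconstruction of that approach, as a sketch only: it is written in Python with `re.IGNORECASE` standing in for PHP's case-insensitive eregi(), and the subject table, keywords, and link paths are all hypothetical, since the actual code was not posted.

```python
import re

# Hypothetical hand-maintained table: subject -> (keywords, links to show).
SUBJECTS = {
    "holidays": (["vacation", "holiday", "leave"], ["/hr/holidays.html"]),
    "payroll": (["salary", "paycheck", "payroll"], ["/hr/payroll.html"]),
}


def links_for(query):
    """Return the links of every subject whose keywords appear in the query."""
    links = []
    for keywords, pages in SUBJECTS.values():
        # One alternation per subject; matching is case-insensitive.
        pattern = r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b"
        if re.search(pattern, query, re.IGNORECASE):
            links.extend(pages)
    return links


print(links_for("When is the next HOLIDAY?"))
print(links_for("the state of things"))
```

Words such as 'of' never match because they are not in any keyword list, which is the effect described above. The cost is that every subject and keyword has to be curated by hand.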

  #10  
March 14th, 2002, 10:09 PM
Chansey
Webmaster
Dev Shed Newbie (0 - 499 posts)
Join Date: Mar 2002
Posts: 10
User rank: Just a Lowly Private (1 - 20 Reputation Level)
Time spent in forums: < 1 sec
Reputation Power: 0
OMG, whew *watches as the information flies over his head*

I didn't get any of that, but don't try to explain. I'll analyze it and figure it out...

I hope.

© 2003-2008 by Developer Shed. All rights reserved. DS Cluster 2 hosted by Hostway