FTP Help
 
Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
User Name:
Password:
Remember me

The Shed is going Social! Join us on FaceBook and Twitter and chime in on the conversation.

Go Back   Dev Shed ForumsSystem AdministrationFTP Help

Reply
Add This Thread To:
  Del.icio.us   Digg   Google   Spurl   Blink   Furl   Simpy   Y! MyWeb 
Thread Tools Search this Thread Rate Thread Display Modes
 
Unread Dev Shed Forums Sponsor:
  #1  
Old February 12th, 2002, 12:00 PM
Pda0 Pda0 is offline
OpenBSD fella
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Jan 2002
Location: Chile, South America
Posts: 11 Pda0 User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 0
Send a message via ICQ to Pda0
Lightbulb Efficient indexing of ftp sites

Hi, this problem have been nagging my mind lately. Im not sure if the topic is ok for this forum though

In my university, there are several ftp servers that share all type of stuff, like documents, shareware packages, linux distributions, etc.

I was writing a PHP script that every once in a while, connects to all of the ftp servers, inserting the directory listing information in a mysql database. Then, with the aid of another PHP script, you could do searches upon the database, which turned out to be really cool and useful.

The DB schema is as follows: It has 4 tables, 'paths','files','site_info' and 'ftp_data'. The first three just consist of a unique ID and the info, and 'ftp_data' makes the relation between those three. That way, I can somehow use DB space more efficiently without repeating path and file names.

Then, the query would do a case-insensitive LIKE lookup on the 'path' and/or 'files' tables, get the ID's and JOIN them with the ftp_data table, and VOILA, it works

The bad part of all this, is that the system is very slow. In a pentium 100mhz (Well, i know it sucks but its what ive got to make experiments), a query would take about 5-10 minutes to get itself done. No indexing on the tables would help much, as the LIKE search doesnt have any use of them.

Maybe someone could come up with a more efficient way of indexing the path and file tables? We could work on the basis that we have a lot of HD but 'slow' machines. The use of Mysql isnt mandatory, maybe a text-based DB would be better?

.pd

Reply With Quote
  #2  
Old February 16th, 2002, 09:25 PM
Hero Zzyzzx's Avatar
Hero Zzyzzx Hero Zzyzzx is offline
11
Dev Shed Demi-God (4500 - 4999 posts)
 
Join Date: Jul 2001
Location: Lynn, MA
Posts: 4,635 Hero Zzyzzx User rank is Second Lieutenant (5000 - 10000 Reputation Level)Hero Zzyzzx User rank is Second Lieutenant (5000 - 10000 Reputation Level)Hero Zzyzzx User rank is Second Lieutenant (5000 - 10000 Reputation Level)Hero Zzyzzx User rank is Second Lieutenant (5000 - 10000 Reputation Level)Hero Zzyzzx User rank is Second Lieutenant (5000 - 10000 Reputation Level)Hero Zzyzzx User rank is Second Lieutenant (5000 - 10000 Reputation Level)Hero Zzyzzx User rank is Second Lieutenant (5000 - 10000 Reputation Level) 
Time spent in forums: 4 Days 23 h 44 m 19 sec
Reputation Power: 81
Send a message via AIM to Hero Zzyzzx
So you don't have indeces on the columns you're using in your "where" sections of your SQL?

MySQL uses indeces on "like" statements fine, as long as you don't start the search term with wildcard.
example, assume that "field2" is indexed:

Code:
this statement would use an index:
select field1 from table1 where field2 like 'test%';

this statement WOULDN'T use an index:
select field1 from table1 where field2 like '%test%';

Remember, MySQL uses indeces from left-to-right, so if you think of those two statements above keeping this in mind, it makes perfect sense why MySQL can't use an index for the second statement- It doesn't know where to start looking!

I'd suggest you stick with MySQL for your data and index storage, you're not going to get anything much faster for free.

You could also do an inverted index search- create another table with each unique word and the document ID that contain it, then search on this indexed keyword table. I wouldn't suggest writing this from scratch, however. If you know a little about perl, take a look at DBIx::FullTextSearch, which is a VERY fast and configurable MySQL based inverted index search engine.

Reply With Quote
  #3  
Old March 3rd, 2002, 11:11 PM
Pda0 Pda0 is offline
OpenBSD fella
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Jan 2002
Location: Chile, South America
Posts: 11 Pda0 User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 0
Send a message via ICQ to Pda0
Hero Zzyzzx

Thanks for the reply.

Ive been wondering about the index problem. Yes, you are right... but about 99% of the searches do use a starting and ending wildcard... say "%combustion%" to find all papers/stuff related to combustion.

The word-index would work though... uhmm... i would have to make a reg exp match to find the words inside the path/file names though (Sometimes words are separated with dots, underscores,spaces,etc) Not a bad idea at all.

Ill check out the DBIx thingie, maybe I'll find further info there.

Thanks again,

Reply With Quote
Reply

Viewing: Dev Shed ForumsSystem AdministrationFTP Help > Efficient indexing of ftp sites

Developer Shed Advertisers and Affiliates



Thread Tools  Search this Thread 
Search this Thread:

Advanced Search
Display Modes  Rate This Thread 
Rate This Thread:


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
View Your Warnings | New Posts | Latest News | Latest Threads | Shoutbox
Forum Jump

Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
  
 


Powered by: vBulletin Version 3.0.5
Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.

© 2003-2013 by Developer Shed. All rights reserved. DS Cluster - Follow our Sitemap