February 12th, 2002, 12:00 PM
Efficient indexing of ftp sites
Hi, this problem have been nagging my mind lately. Im not sure if the topic is ok for this forum though
In my university, there are several ftp servers that share all type of stuff, like documents, shareware packages, linux distributions, etc.
I was writing a PHP script that every once in a while, connects to all of the ftp servers, inserting the directory listing information in a mysql database. Then, with the aid of another PHP script, you could do searches upon the database, which turned out to be really cool and useful.
The DB schema is as follows: It has 4 tables, 'paths','files','site_info' and 'ftp_data'. The first three just consist of a unique ID and the info, and 'ftp_data' makes the relation between those three. That way, I can somehow use DB space more efficiently without repeating path and file names.
Then, the query would do a case-insensitive LIKE lookup on the 'path' and/or 'files' tables, get the ID's and JOIN them with the ftp_data table, and VOILA, it works
The bad part of all this, is that the system is very slow. In a pentium 100mhz (Well, i know it sucks but its what ive got to make experiments), a query would take about 5-10 minutes to get itself done. No indexing on the tables would help much, as the LIKE search doesnt have any use of them.
Maybe someone could come up with a more efficient way of indexing the path and file tables? We could work on the basis that we have a lot of HD but 'slow' machines. The use of Mysql isnt mandatory, maybe a text-based DB would be better?
February 16th, 2002, 09:25 PM
So you don't have indeces on the columns you're using in your "where" sections of your SQL?
MySQL uses indeces on "like" statements fine, as long as you don't start the search term with wildcard.
example, assume that "field2" is indexed:
Remember, MySQL uses indeces from left-to-right, so if you think of those two statements above keeping this in mind, it makes perfect sense why MySQL can't use an index for the second statement- It doesn't know where to start looking!
this statement would use an index:
select field1 from table1 where field2 like 'test%';
this statement WOULDN'T use an index:
select field1 from table1 where field2 like '%test%';
I'd suggest you stick with MySQL for your data and index storage, you're not going to get anything much faster for free.
You could also do an inverted index search- create another table with each unique word and the document ID that contain it, then search on this indexed keyword table. I wouldn't suggest writing this from scratch, however. If you know a little about perl, take a look at DBIx::FullTextSearch, which is a VERY fast and configurable MySQL based inverted index search engine.
March 3rd, 2002, 11:11 PM
Thanks for the reply.
Ive been wondering about the index problem. Yes, you are right... but about 99% of the searches do use a starting and ending wildcard... say "%combustion%" to find all papers/stuff related to combustion.
The word-index would work though... uhmm... i would have to make a reg exp match to find the words inside the path/file names though (Sometimes words are separated with dots, underscores,spaces,etc) Not a bad idea at all.
Ill check out the DBIx thingie, maybe I'll find further info there.