#1
  1. No Profile Picture
    OpenBSD fella
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2002
    Location
    Chile, South America
    Posts
    11
    Rep Power
    0

    Lightbulb Efficient indexing of ftp sites


    Hi, this problem have been nagging my mind lately. Im not sure if the topic is ok for this forum though

    In my university, there are several ftp servers that share all type of stuff, like documents, shareware packages, linux distributions, etc.

    I was writing a PHP script that every once in a while, connects to all of the ftp servers, inserting the directory listing information in a mysql database. Then, with the aid of another PHP script, you could do searches upon the database, which turned out to be really cool and useful.

    The DB schema is as follows: It has 4 tables, 'paths','files','site_info' and 'ftp_data'. The first three just consist of a unique ID and the info, and 'ftp_data' makes the relation between those three. That way, I can somehow use DB space more efficiently without repeating path and file names.

    Then, the query would do a case-insensitive LIKE lookup on the 'path' and/or 'files' tables, get the ID's and JOIN them with the ftp_data table, and VOILA, it works

    The bad part of all this, is that the system is very slow. In a pentium 100mhz (Well, i know it sucks but its what ive got to make experiments), a query would take about 5-10 minutes to get itself done. No indexing on the tables would help much, as the LIKE search doesnt have any use of them.

    Maybe someone could come up with a more efficient way of indexing the path and file tables? We could work on the basis that we have a lot of HD but 'slow' machines. The use of Mysql isnt mandatory, maybe a text-based DB would be better?

    .pd
  2. #2
  3. 11
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Jul 2001
    Location
    Lynn, MA
    Posts
    4,635
    Rep Power
    82
    So you don't have indeces on the columns you're using in your "where" sections of your SQL?

    MySQL uses indeces on "like" statements fine, as long as you don't start the search term with wildcard.
    example, assume that "field2" is indexed:

    Code:
    this statement would use an index:
    select field1 from table1 where field2 like 'test%';
    
    this statement WOULDN'T use an index:
    select field1 from table1 where field2 like '%test%';
    Remember, MySQL uses indeces from left-to-right, so if you think of those two statements above keeping this in mind, it makes perfect sense why MySQL can't use an index for the second statement- It doesn't know where to start looking!

    I'd suggest you stick with MySQL for your data and index storage, you're not going to get anything much faster for free.

    You could also do an inverted index search- create another table with each unique word and the document ID that contain it, then search on this indexed keyword table. I wouldn't suggest writing this from scratch, however. If you know a little about perl, take a look at DBIx::FullTextSearch, which is a VERY fast and configurable MySQL based inverted index search engine.
  4. #3
  5. No Profile Picture
    OpenBSD fella
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2002
    Location
    Chile, South America
    Posts
    11
    Rep Power
    0

    Hero Zzyzzx


    Thanks for the reply.

    Ive been wondering about the index problem. Yes, you are right... but about 99% of the searches do use a starting and ending wildcard... say "%combustion%" to find all papers/stuff related to combustion.

    The word-index would work though... uhmm... i would have to make a reg exp match to find the words inside the path/file names though (Sometimes words are separated with dots, underscores,spaces,etc) Not a bad idea at all.

    Ill check out the DBIx thingie, maybe I'll find further info there.

    Thanks again,

IMN logo majestic logo threadwatch logo seochat tools logo