1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2009
    Rep Power

    Dictionary based parser


    I am developping an app where I need to parse out search queries and figure out the Geographic location name (if any) that might be included in the query.

    The various research papers I've been reading on the subject suggest I use a dictionary based approach where I should take the query and look for candidate entries in the dictionary that match any substring of one or more words included in the query.

    Then I should select the entry that has the longest number of consecutively matching words (+ other factors that I won't detail in this first post).

    SO basically how do I implement this in an efficient manner?
    I managed to break down a given query to an array of strings containing all possible combination of successive words such as the query

    "pizza restaurant in Geneva Switzerland" gives me an array sorted by string length containing the strings
    pizza restaurant
    restaurant geneva
    geneva switzerland
    pizza restaurant geneva
    restaurant geneva switzerland
    pizza restaurant geneva switzerland

    For each subsequence of words I then query my dictionary (a database containing location names) to find entries that match the above subsequences.

    As you might have guessed, this is not performing very fast ...

    Are there any know dictionary based substring identification mechanisms / algorithms that I should be aware of ?
    How can I do thing faster ??
    Do you guys have any other suggestion on how I should tackle this problem ?

    Many thanks in advance for your help !


  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Loyal (3000 - 3499 posts)

    Join Date
    May 2004
    Rep Power
    What form of place dictionary do you have? Is it just a list of place names or is it a proper database of places?

    It should be broken down into city, county/burrow, state/territory/province, country, postal code, etc. If it is, then you only need to break the sentences down to words and query for partial matches. So in your example, you'd look for cities named pizza and if none of them are in a county, state/province or country named reestaurant, you move on to the next word. When you get to Geneva, you'll get a hit in cities, then check county, state/province for Switzerland and you should find a match.

    You might also think about using some pattern matching. Words like city, county, state, zip, and alphanumerics (postal codes) etc. can also supply clues as to which words to use to search what fields in your database.

    I recall doing some research on this subject some years ago and found that not all countries have postal codes and the countries that do, have varying formats for them. It helps if you can look-up standard forms of place names by country. In other words, you'll have to compile your own database from multiple sources and then apply a set of heuristics to optimize your search.

    Also, I recommend using an actual database engine and running queries against that rather than trying to do in memory string searches on your own. With considerable effort, you might do a better job than say PostgreSQL could do against a properly designed schema, but it's unlikely. And there's also a large SQL code base out there that you can tap in this particular domain.
    I no longer wish to be associated with this site.

IMN logo majestic logo threadwatch logo seochat tools logo