#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2005
    Posts
    16
    Rep Power
    0

    dictionary, help me please.


    I have a dictionary file with a format like:

    a
    as
    at
    .
    .
    begin
    but
    .
    .
    .
    window
    .
    .
    .


    It has over 100,000 lines, and each line contains one word like that. I must use this dictionary file to find its words in my document and count how many times each word occurs in that document.

    Can you tell me how I can store that dictionary? Thanks so much.
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Nov 2004
    Location
    There where the rabbits jump
    Posts
    556
    Rep Power
    11
    Em, you mean you have one word per line in the dictionary file?

    Go through it with readlines() and then you have a list, and then use a for loop...
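    Something like this, say the dictionary is in dict.txt and your document is in mytext.txt (those names are just placeholders):
    Code:
    # read the dictionary: one word per line -> a list of words
    dictwords = [line.strip() for line in open("dict.txt").readlines()]

    # split the document into words and count each dictionary word in it
    # note: text.count() rescans the document for every word, so this is
    # the slow-but-simple version
    text = open("mytext.txt").read().lower().split()
    for word in dictwords:
        print word, text.count(word)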
    Those people who think they know everything are a great annoyance to those of us who do.
  4. #3
  5. Commie Mutant Traitor
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Jun 2004
    Location
    Alpharetta, GA
    Posts
    1,806
    Rep Power
    1570
    If you need to search through the data at all - and I gather you would, if you need to track the number of uses of each word - I would use a balanced binary tree instead of a flat list. With this much data, it would make a measurable difference in speed.

    Keep in mind that it may take a fair amount of memory, though not a prohibitively large amount if it's only doing a word count - a quick estimate of 128K words of 32 16-bit Unicode characters, plus perhaps an overhead of 64 bytes per list node (worst-case; actual overhead is probably only about 16 bytes), gives a high figure of 16M. The actual usage will almost certainly be less, but keep in mind that that is for just the dictionary - you would still need enough memory for the interpreter, the program, and the data, not to mention the OS itself and any other apps currently running. Modern PCs should have plenty of room, but still, loading that much data entirely into memory is likely to make things a bit crowded.

    Perhaps more importantly, reading in 4 to 8 meg of data from disk generally takes several seconds, which may be a bottleneck in a program whose main operation takes only a few milliseconds. You do not want the program reading in that much data when most of it isn't getting used.

    Assuming that the data is stored in alphabetical order, my overall recommendation is to leave the dictionary data on disk, and have an in-memory balanced binary tree. When you read a new word from the data, you would first search for it in the tree, and if it is found, increment the count of the word. If it isn't in the tree, you would then search for it in the file (matching the first letter first, then the first two, and so on until you either match the word or else determine that it isn't in the dictionary). If you find a match, you would then insert the word into the tree; if not, you would handle that in whatever manner you mean to (asking for a correction or what-have-you). It would be somewhat more work than Monkeyman's approach, but should be much faster.
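    A rough sketch of that idea, with two substitutions: Python's standard library has no balanced binary tree, so a plain dict stands in for the in-memory structure, and the on-disk lookup is done with a byte-offset binary search rather than the letter-by-letter narrowing described above. File names (dict.txt, mytext.txt) are placeholders, and the dictionary file must be sorted, one word per line.
    Code:
    import os

    def in_dictfile(f, size, target):
        # Binary search by byte offset over a sorted one-word-per-line file,
        # so the dictionary itself never has to be loaded into memory.
        f.seek(0)
        if f.readline().strip() == target:   # the loop below can never land on
            return True                      # the first line, so test it here
        lo, hi = 0, size
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()                     # skip the partial line we landed in
            line = f.readline()
            if not line:                     # ran past the end: search lower
                hi = mid
                continue
            word = line.strip()
            if word == target:
                return True
            elif word < target:
                lo = mid + 1
            else:
                hi = mid
        return False

    dictfile = open("dict.txt", "rb")
    dictsize = os.path.getsize("dict.txt")

    counts = {}                              # in-memory counts of words seen so far
    for line in open("mytext.txt"):
        for word in line.lower().split():
            if word in counts:
                counts[word] += 1            # already known: just bump the count
            elif in_dictfile(dictfile, dictsize, word):
                counts[word] = 1             # first sighting of a dictionary word
            # else: not in the dictionary; handle it however you prefer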

    Comments on this post

    • netytan agrees
    Rev First Speaker Schol-R-LEA;2 JAM LCF ELF KoR KCO BiWM TGIF
    #define KINSEY (rand() % 7) λ Scheme is the Red Pill
    Scheme in Short Understanding the C/C++ Preprocessor
    Taming Python A Highly Opinionated Review of Programming Languages for the Novice, v1.1

    FOR SALE: One ShapeSystem 2300 CMD, extensively modified for human use. Includes s/w for anthro, transgender, sex-appeal enhance, & Gillian Anderson and Jason D. Poit clone forms. Some wear. $4500 obo. tverres@et.ins.gov
  6. #4
  7. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
    If you know the number of words in the dictionary file (the number of lines in the file - 1), you should be able to narrow things down by using a good hash function to extract an approximate position.

    Another, simpler alternative would be to have a bunch of dictionary files: one for each starting letter. That would at least cut down the amount of data you have to work with.
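    For example, a quick sketch of that split (dict.txt is a placeholder name; it writes out files like dict_a.txt, dict_b.txt and so on):
    Code:
    # split a one-word-per-line dictionary into one file per starting letter
    outfiles = {}
    for line in open("dict.txt"):
        word = line.strip().lower()
        if not word:
            continue
        first = word[0]
        if first not in outfiles:
            outfiles[first] = open("dict_%s.txt" % first, "w")
        outfiles[first].write(word + "\n")

    for f in outfiles.values():
        f.close()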

    If you wrote a long running program (a daemon in *nix talk) that kept the loaded dictionary in memory for searching then you could cut down on load times by loading it only once.
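    The simplest form of that is just a long-running loop: load the dictionary once, then answer lookups as they arrive. A toy sketch that reads queries from stdin (a real daemon would detach from the terminal and listen on a socket instead; dict.txt is again a placeholder):
    Code:
    import sys

    # load the dictionary once, up front
    words = set(line.strip().lower() for line in open("dict.txt") if line.strip())

    # then answer lookups read from stdin, one word per line, without reloading
    for query in sys.stdin:
        query = query.strip().lower()
        if query:
            sys.stdout.write("%s %s\n" % (query, query in words))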

    There are a number of open-source spell checkers; if you're interested, I would look into how these work, since they tend to be pretty fast.

    Hope this helps,

    Mark.
    programming language development: www.netytan.com Hula

  8. #5
  9. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Nov 2004
    Location
    There where the rabbits jump
    Posts
    556
    Rep Power
    11
    Hey, how would you actually code a daemon, like with threads?
    Those people who think they know everything are a great annoyance to those of us who do.
  10. #6
  11. Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Feb 2005
    Posts
    610
    Rep Power
    65



    Load your dictionary file into a set of words, call it dicSet. Load your text words into a list, call it txtList, and convert it to a set, call it txtSet. Remember that sets contain only unique words.

    Now do a superfast (hashed) intersection:
    Code:
    commonSet = dicSet.intersection(txtSet)
    commonSet now contains all the unique words common to both sets. What's left to do is to count this much smaller set of words against txtList with the old for-loop thingy, and you are all set (no pun intended).

    If you want the words in txtSet that are not in dicSet, do a
    Code:
    diffSet = txtSet.difference(dicSet)
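    Put together, a rough version of the whole thing might look like this (dict.txt and mytext.txt are placeholder names):
    Code:
    # load the dictionary into a set of unique words
    dicSet = set(line.strip().lower() for line in open("dict.txt") if line.strip())

    # load the document as a list of words, then as a set of unique words
    txtList = open("mytext.txt").read().lower().split()
    txtSet = set(txtList)

    # fast, hash-based intersection: the unique words that appear in both
    commonSet = dicSet.intersection(txtSet)

    # count the occurrences of those words in the original word list
    counts = {}
    for word in txtList:
        if word in commonSet:
            counts[word] = counts.get(word, 0) + 1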
  12. #7
  13. Banned ;)
    Devshed Supreme Being (6500+ posts)

    Join Date
    Nov 2001
    Location
    Woodland Hills, Los Angeles County, California, USA
    Posts
    9,616
    Rep Power
    4247
    Why do you need the dictionary at all, if all you want is the word frequency in the file? If you need to keep track of which words are used and update the dictionary that way, it might be more memory efficient to reverse the process. To avoid confusing the issue, I will refer to a Python dictionary object as a hash.

    1. Open the file
    2. Split the lines into words and process each word.
    3. For each word, check whether the hash already has the word. If it doesn't, add it to the hash with a count of 1; otherwise, increment its count.
    4. At this point, you have the word frequency of the words in the file.
    5. Next, open the dictionary file and go through each line.
    6. For each line entry, check the hash and if you find the hash entry, you know how many times it occurred in the file.

    This approach avoids loading the whole dictionary into memory.
    Code:
    #!/usr/bin/env python
    
    # First get the word frequency from the text file
    hash = {}
    lines = open("alice.txt", "r").readlines()
    for line in lines:
        words = line.lower().strip().split()
        # Note: I consider whitespace as a word break.
        # This is not necessarily a good assumption.
        for word in words:
            if hash.has_key(word):
                hash[word] += 1
            else:
                hash[word] = 1
    
    # Now print the word frequency that we have so far
    #for word in hash.keys():
    #    print word, hash[word]
    
    # Now open a dictionary and see what words are matched already
    dictfile = open("/usr/share/dict/words", "r")
    for word in dictfile:
        word = word.lower().strip()
        # If we already have this word in the hash, print out the frequency.
        if hash.has_key(word):
            print word, hash[word]
        else:
            print word, 0
    
    dictfile.close()
    Of course, note that I split the lines into words using whitespace as the delimiter, which is not actually the correct approach. We discussed how to split a line into words a while back, so you can search the forum for a better way to do it.
    Up the Irons
    What Would Jimi Do? Smash amps. Burn guitar. Take the groupies home.
    "Death Before Dishonour, my Friends!!" - Bruce D ickinson, Iron Maiden Aug 20, 2005 @ OzzFest
    Down with Sharon Osbourne

    "I wouldn't hire a butcher to fix my car. I also wouldn't hire a marketing firm to build my website." - Nilpo
  14. #8
  15. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2005
    Posts
    16
    Rep Power
    0
    Thanks so much.

    Please help me, I have a unicode string, and I try to split it:

    s=unicode string # ( s=file.readline() )

    but s.split([",","."]) doesn't give me a list of unicode words. Please help me. Thanks
  16. #9
  17. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2005
    Location
    >>> import pygtk
    Posts
    4
    Rep Power
    0
    Originally Posted by NPT
    s=unicode string # ( s=file.readline() )
    but s.split([",","."]) don't give me a list of unicode word. Please help me. Thanks
    First turn the unicode string into ASCII with the built-in str() function:
    Code:
    s=str(s)
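    If you need to split on several delimiters at once, another option is the re module; re.split() also works directly on unicode strings. A small example (the delimiter set here is just an illustration):
    Code:
    import re

    s = u"one, two. three"
    words = re.split(r"[,.\s]+", s)   # split on commas, periods and whitespace
    # words is now [u'one', u'two', u'three']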
  18. #10
  19. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2004
    Posts
    461
    Rep Power
    25
    I also want to point out that if this is just a little thing that you are not planning on using more than once, like a small tool or something of that sort, you may actually spend more time trying to make it run faster than the time it is actually used. This sorta defeats the ideals of Python and other high-level languages.

    IMO, it is only worth the extra effort of speeding it up if this is going to be a frequently used tool where you truly need the speed and your time couldn't be spent more productively coding something else. If, like I said, it is only a small tool that isn't going to be used more than a few times, then you are wasting your time and lowering your productivity.

    The best approach to optimizing something is to know when the optimizations are actually going to pay off at run time. Remember, the time an optimized app takes to run, plus the amount of time it took you to optimize the app, is what truly tells how much the optimization paid off, not just what happens at run time.
