#1
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2003
    Posts
    74
    Rep Power
    11

    How many words in a txt file?


    Hello,

    What would be the best way to count and display all the words in a text file? Say you have an ebook and you want to show all the unique words and how many times each word occurs.

    Thanks
    Random
  2. #2
  3. Banned ;)
    Devshed Supreme Being (6500+ posts)

    Join Date
    Nov 2001
    Location
    Woodland Hills, Los Angeles County, California, USA
    Posts
    9,616
    Rep Power
    4247
    Best way would be to use a dictionary object to maintain word counts.
    Code:
    #!/usr/bin/env python
    import re
    
    # Read the file into a buffer
    file = open("ebook.txt")
    buf = ""
    for line in file:
        buf += line
    file.close()
    
    # Now split the buffer into words
    reg = re.compile(r'\W+')
    words = reg.split(buf)
    
    # Now count the occurrences of each word
    unique = {}
    for word in words:
        if not word:
            continue  # re.split() can yield empty strings at the edges of the text
        if word in unique:
            unique[word] += 1
        else:
            unique[word] = 1
    
    for word in unique.keys():
        print word, unique[word]
    Of course, the regexp I used to split the words could be improved upon.
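For readers on Python 3, here is a minimal sketch of the same dictionary-counting idea using collections.Counter. The count_words helper name and the lowercasing step are assumptions of this sketch, not part of the code above; the empty strings that re.split() can produce at the edges of the text are dropped before counting.

```python
import re
from collections import Counter


def count_words(text):
    """Count word occurrences, splitting on runs of non-word characters."""
    # re.split() yields empty strings when the text starts or ends with a
    # separator, so filter those out before counting.
    words = [w for w in re.split(r"\W+", text.lower()) if w]
    return Counter(words)


if __name__ == "__main__":
    sample = "The cat and the hat. The end!"
    counts = count_words(sample)
    # most_common() lists words from most to least frequent
    for word, n in counts.most_common():
        print(word, n)
```

To run it on a file, something like count_words(open("ebook.txt").read()) should work, assuming the file fits comfortably in memory.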
    Last edited by Scorpions4ever; December 7th, 2004 at 01:22 PM.
    Up the Irons
    What Would Jimi Do? Smash amps. Burn guitar. Take the groupies home.
    "Death Before Dishonour, my Friends!!" - Bruce D ickinson, Iron Maiden Aug 20, 2005 @ OzzFest
    Down with Sharon Osbourne

    "I wouldn't hire a butcher to fix my car. I also wouldn't hire a marketing firm to build my website." - Nilpo
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2003
    Posts
    74
    Rep Power
    11
    Thanks Scorpi, that's exactly what I was looking for!

    Random
  6. #4
  7. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
    Originally Posted by Scorpions4ever
    ...
    #Read the file into a buffer
    file = open("ebook.txt")
    buf = ""
    for line in file:
        buf += line
    file.close()
    ...
    Unless I'm missing something important here, you should be using the built-in read() method, which returns the contents of the file as a string, so the above code block ends up being something like this:

    Code:
    fileBuffer = open('ebook.txt').read()
    You should also be able to split() the returned string on spaces instead of using regular expressions, assuming of course that the words in the ebook are separated by spaces.

    Hope this helps,

    Mark.
    programming language development: www.netytan.com Hula

  8. #5
  9. Banned ;)
    Devshed Supreme Being (6500+ posts)

    Join Date
    Nov 2001
    Location
    Woodland Hills, Los Angeles County, California, USA
    Posts
    9,616
    Rep Power
    4247
    Yep, I was thinking of using read() after I posted the code. Actually, I wrote the original code to work for each line separately and didn't think anyone would notice.

    As for split(), I debated it, but then decided not to use it. The trouble is that a period or a comma could also be a word separator and split won't work in this case.
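A small Python 3 sketch of the problem described here, using a made-up sample sentence: plain split() separates on whitespace only, so punctuation stays glued to the words and "dog," would be counted separately from "dog".

```python
sample = "The dog barked. The dog, startled, ran."

# split() with no arguments separates on whitespace only
tokens = sample.split()
print(tokens)

# "dog" and "dog," are different strings, so they get separate counts
print(tokens.count("dog"), tokens.count("dog,"))
```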
  10. #6
  11. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
    Originally Posted by Scorpions4ever
    As for split(), I debated it, but then decided not to use it. The trouble is that a period or a comma could also be a word separator and split won't work in this case.
    That's a really good point; I didn't even consider that. I suppose the best place to start would be to replace all punctuation (commas etc.) with whitespace before splitting. Still, the basic idea is there, hey.
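One way the replace-punctuation-with-whitespace idea might look in Python 3 is sketched below. The split_words helper and the PUNCT_TO_SPACE table are names made up for this illustration; the sketch only handles the ASCII characters in string.punctuation.

```python
import string

# Map every ASCII punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def split_words(text):
    """Replace punctuation with spaces, then split on whitespace."""
    return text.translate(PUNCT_TO_SPACE).split()


print(split_words("Stop, look; listen... then go!"))
```

Note that this treats apostrophes and hyphens as separators too, so contractions and hyphenated words get broken into pieces.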

  12. #7
  13. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2003
    Posts
    74
    Rep Power
    11
    But what happens with a word like "don't"? It becomes "don" and "t", so I tried using split() instead and it seems to be working.

    Random
  14. #8
  15. Banned ;)
    Devshed Supreme Being (6500+ posts)

    Join Date
    Nov 2001
    Location
    Woodland Hills, Los Angeles County, California, USA
    Posts
    9,616
    Rep Power
    4247
    The Perl Cookbook has a really good discussion about exactly what constitutes a word (whitespace vs. English word).
    http://www.oreilly.com/catalog/cookb...pter/ch08.html
    Search for the section "Processing Every Word in a File" and you'll see that they use whitespace as the delimiter in the beginning and then use this regexp for "English words" after the discussion:
    \w[\w'-]*

    I've also seen \b([\w'-]+)\b in use. That's why I put the caveat "Of course, the regexp I used to split the words could be improved upon" in my original post in this thread.
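As a rough Python 3 illustration of the difference, the sketch below runs the first pattern quoted above through re.findall() and compares it with splitting on \W+. The sample sentence and the WORD_RE name are made up for this example.

```python
import re

# Pattern quoted above: a word character followed by any run of word
# characters, apostrophes, or hyphens
WORD_RE = re.compile(r"\w[\w'-]*")

sample = "Don't panic: the well-known ebook isn't lost."

# findall() keeps contractions and hyphenated words in one piece
english_words = WORD_RE.findall(sample)
print(english_words)

# Splitting on non-word characters breaks them apart
print(re.split(r"\W+", sample))
```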
    Last edited by Scorpions4ever; December 7th, 2004 at 07:21 PM.
  16. #9
  17. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2005
    Posts
    174
    Rep Power
    11

    don't forget string module


    I have one question at the end of this post. I realize it's an old post but who knows when these ideas hit you?

    First my comment. It struck me that no one mentioned that the string module has a string.punctuation preset that defines all the characters used for ..wait for it.. punctuation.

    You could also just chop off the trailing character with len() telling you the length and then the slice operator:

    Code:
    a[:len(a)-1]
    But this won't work with my 'a' example below. Here's some code for you to play with. BTW, it keeps the inner punctuation intact, i.e. "that's".

    Code:
    # From the Python shell...
    import string
    string.punctuation  # take a look

    a = '()-->hello<--()'
    a = a.strip(string.punctuation)  # strip() returns a new string, so keep the result
    a
    # 'hello'  <- should be this

    f = open("foo.txt", "r")
    z = f.readlines()
    f.close()
    z = map(string.strip, z)  # finally used the map function... and it even works!
    # or
    z = map(string.strip(string.punctuation), z)  # doesn't work
    Question:
    Hmm, that last line doesn't work even though it's also a function call on a preset. However the standard string.strip() in z = map(string.strip,z) works fine!? Any ideas?

    Cheers
    sf2k
  18. #10
  19. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
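    Very impressive idea, I probably never would have come up with this. Anyway, the reason for your problem is that map() works by calling the function you give it with the current element of the sequence as its argument. In your last line, string.strip(string.punctuation) is evaluated straight away, so map() receives a string rather than a function; without using lambda there's no way to pass the extra value in. Here's a working example: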

    Code:
    >>> import string
    >>> a = '()-->hello<--()'
    >>> a.strip(string.punctuation)
    'hello'
    >>> sampleLine = 'hello, I\'m a sample line... anything else?'
    >>> sampleLine
    "hello, I'm a sample line... anything else?"
    >>> map(lambda x: str.strip(x, string.punctuation), sampleLine.split())
    ['hello', "I'm", 'a', 'sample', 'line', 'anything', 'else']
    >>> [str.strip(x, string.punctuation) for x in sampleLine.split()]
    ['hello', "I'm", 'a', 'sample', 'line', 'anything', 'else']
    >>>
    As you can see, in the map() line lambda is used to pass x to str.strip(), called through the class (this is the same as calling string.strip()). Personally I prefer the list comprehension to the map() function, which is why I've included it here; it just looks cleaner than using lambda. The same two lines written with string.strip():

    Code:
    >>> map(lambda x: string.strip(x, string.punctuation), sampleLine.split())
    ['hello', "I'm", 'a', 'sample', 'line', 'anything', 'else']
    >>> [string.strip(x, string.punctuation) for x in sampleLine.split()]
    ['hello', "I'm", 'a', 'sample', 'line', 'anything', 'else']
    >>>
    Nice one!
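For Python 3 readers: the string.strip() function no longer exists in the string module there, and map() returns an iterator rather than a list, so the session above needs small changes. A minimal sketch, assuming Python 3 and reusing the sample line from above:

```python
import string

sample_line = "hello, I'm a sample line... anything else?"

# str.strip called as a method inside a list comprehension
words = [w.strip(string.punctuation) for w in sample_line.split()]
print(words)

# The map() form still works, but needs list() to materialise the result
words_via_map = list(map(lambda w: w.strip(string.punctuation), sample_line.split()))
print(words_via_map)
```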

    Mark.

  20. #11
  21. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2005
    Posts
    174
    Rep Power
    11
    Thank you for your answer.

    I also appreciate your explanation of list comprehensions to boot. I had in fact not realized that they could run stuff in place like that. Very nice... it pre-emptively answers some questions I had about some other code!

    Cheers
    sf2k
