1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Rep Power

    Return common words in two files

    I've been looking around for an answer to this but have had no luck. I need to take two files and print the most frequent words they have in common, along with their combined (summed) frequencies. This might be simple, but I'm pretty new to programming. Any help?

    def mostFrequent(word,frequency,n):
       my_list = zip(word,frequency) #combine the two lists
       my_list.sort(key=lambda x:x[1],reverse=True) #sort by freq
       words,freqs = zip(*my_list[:n]) #take the top n entries and split back to separate lists
       return words, freqs #return our most frequent words in order
    from wordFrequencies import * #gives both the word and its frequency in a file
    L1 = wordFrequencies('file1.txt')
    words1 = L1[0]
    freqs1 = L1[1]
    L2 = wordFrequencies('file2.txt')
    words2 = L2[0]
    freqs2 = L2[1]
    print mostFrequent(words1,freqs1,20)
    L1 = wordFrequencies('file1.txt') #what I tried
    words1 = set(L1[0])
    freqs1 = set(L1[1])
    L2 = wordFrequencies('file2.txt')
    words2 = set(L2[0])
    freqs2 = set(L2[1])
    words3 = words1.intersection(words2)
    freqs3 = freqs1.intersection(freqs2)
    print mostFrequent(words3,freqs3,20)
    It didn't work; it output the wrong words.
  2. #2
  3. Contributing User

    Join Date
    Aug 2011
    Rep Power
    Sets are unordered. These statements

    words1 = set(L1[0])
    freqs1 = set(L1[1])

    break the correlation between words and frequencies in

    words1 and freqs1
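
    To see the problem concretely, here is a small sketch. The two lists are made up and only stand in for what wordFrequencies might return, assuming it gives parallel word and frequency lists as your indexing with L1[0] and L1[1] suggests. It is written for Python 3, so print is a function.

```python
# Hypothetical sample data standing in for the output of wordFrequencies().
words = ['the', 'cat', 'sat', 'mat']
freqs = [5, 2, 2, 1]

word_set = set(words)
freq_set = set(freqs)

# The duplicate count 2 collapses into one element, so the two sets
# no longer even have the same length, let alone matching positions.
print(len(word_set), len(freq_set))  # 4 3

# A dictionary keeps each word attached to its own count.
counts = dict(zip(words, freqs))
print(counts['cat'])  # 2
```

    Intersecting two frequency sets only tells you which counts happen to occur in both files; it says nothing about which word had which count.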

    Stick with dictionaries. I haven't explored your code beyond that. You could save us time by showing a small example of L1, L2, and the expected output. Otherwise we have to implement wordFrequencies ourselves, guessing what its output should be.
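
    A minimal sketch of the dictionary approach follows. It assumes wordFrequencies returns two parallel lists (words, then counts); the function name most_frequent_common and the sample data are invented for illustration, and the code targets Python 3.

```python
def most_frequent_common(words1, freqs1, words2, freqs2, n):
    """Return up to n (word, summed count) pairs for words found in both inputs."""
    counts1 = dict(zip(words1, freqs1))
    counts2 = dict(zip(words2, freqs2))
    # Intersect the words only; the counts stay attached through the dicts.
    common = set(counts1) & set(counts2)
    combined = [(word, counts1[word] + counts2[word]) for word in common]
    # Highest combined count first; ties broken alphabetically.
    combined.sort(key=lambda pair: (-pair[1], pair[0]))
    return combined[:n]


# Made-up data standing in for two calls to wordFrequencies().
words1, freqs1 = ['the', 'cat', 'sat'], [5, 2, 1]
words2, freqs2 = ['the', 'dog', 'sat'], [3, 4, 2]
print(most_frequent_common(words1, freqs1, words2, freqs2, 20))
# [('the', 8), ('sat', 3)]
```

    The set intersection is still useful here, but only on the words. The frequencies are looked up afterwards, so each sum belongs to the right word.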

    For me, this involves looking up the default dictionary (searching for "container", since I didn't recall that the module is named collections), finding some good sample texts, reading the files, removing punctuation, splitting words, counting them, and converting from dictionary to lists. Each step is simple, but there are many of them, plus test code and debugging. After that there's still the guesswork about the result of wordFrequencies. Oh, and converting the text to a common case before splitting would be useful. (Your problem specification didn't indicate "position within a sentence".) What else will I remember after a few tries?
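
    For what it's worth, those steps might be sketched like this. It is only a guess at what wordFrequencies does: the names count_words and word_frequencies are hypothetical, and it uses collections.Counter rather than a default dictionary because Counter does the counting directly. It assumes Python 3 (str.maketrans) and strips ASCII punctuation only, so "don't" becomes "dont".

```python
import string
from collections import Counter


def count_words(text):
    """Lowercase the text, strip ASCII punctuation, split on whitespace, count."""
    lowered = text.lower()
    cleaned = lowered.translate(str.maketrans('', '', string.punctuation))
    return Counter(cleaned.split())


def word_frequencies(path):
    """Guess at wordFrequencies: parallel (words, freqs) lists for one file."""
    with open(path) as handle:
        counts = count_words(handle.read())
    if not counts:
        return [], []
    # most_common() with no argument yields every pair, highest count first.
    words, freqs = zip(*counts.most_common())
    return list(words), list(freqs)


sample = "The cat sat. The dog sat, too!"
counts = count_words(sample)
print(counts['the'], counts['sat'], counts['dog'])  # 2 2 1
```

    If you keep the Counter objects instead of converting to lists, the words common to two files are just the keys present in both.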
    [code]Code tags[/code] are essential for python code and Makefiles!
