Thread: MapReduce/MRjob

    #1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2011
    Posts
    8
    Rep Power
    0

    mrjob


    Okay, so this is my first time learning about mapreduce and mrjob in python, and the simpler stuff I think I am able to understand.

    Right now, I am having the hardest time getting regex to work correctly for some reason. My logic seems to be right...

    So the problem is I can't get all of the output I need from this txt file: http://wikisend.com/download/441620/pg1268.txt
    I'm supposed to get every sequence of two adjacent words in the text.

    This is working code:
    Code:
    from mrjob.job import MRJob
    import re
    
    WORD_RE = re.compile(r"\w+\s\w+")
    
    class BiGramFreqCount(MRJob):
    
      def mapper(self, _, line):
        for word in WORD_RE.findall(line):
          yield(word.lower(), 1)
    
      def reducer(self, key, values):
        yield (key, sum(values))
      
    if __name__ == '__main__':
      BiGramFreqCount.run()
    The code above gives me some of the outputs but not all because it doesn't include the ones that have punctuation in between.

    I think there's a way to do it without complicating regex but since I'm not too familiar with mrjob, I don't really know the approach.

    Any help would be greatly appreciated!
  2. #2
  3. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,837
    Rep Power
    480
    I suggest preprocessing the entire text by converting to lower case and replacing all characters that are not white space, digits, or letters with a space character.
    [code]Code tags[/code] are essential for python code and Makefiles!
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2011
    Posts
    8
    Rep Power
    0
    Hm, I tried stripping and replacing with a space, but that doesn't seem to work properly... I heard that it can be done with regex, but I don't get the impression it can. Would it be easier to maybe have two reducers? Sorry, I'm not really that good at Python yet, so I apologize.
  6. #4
  7. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,837
    Rep Power
    480
    Why is reducer an iterator that returns only one value?

    I didn't mention str.strip.
    Code:
    >>> def preprocess(a):
    ...    return ''.join((' ',c)[c.islower() or c.isspace() or c.isdigit()] for c in a.lower())
    ...
    >>> print(preprocess('This "Teribble, horrid!!!!\nmisspelled SENTENCE.'))
    this  teribble  horrid    
    misspelled sentence 
    >>>
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2011
    Posts
    8
    Rep Power
    0
    I think it's because the values that have the same key go to the one reducer, so even if there are multiple reducers, each reducer has all the data required for one single key?

    I'm far from an expert so don't quote me on this.

    Do you know by any chance if there is a way to get the correct output without using another def (def preprocess)? Or is it impossible unless I do it that way?
  10. #6
  11. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,837
    Rep Power
    480
    If you're using Unix, you could preprocess pg1268.txt with a few passes through tr:

    Code:
    $ tr '[:upper:]' '[:lower:]' < pg1268.txt | tr -c '[:lower:]' ' ' > pg1268.ascii

    Using the definition of preprocess,

    for word in WORD_RE.findall(preprocess(line)):

    other choices get silly.
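    For the question about avoiding a separate preprocess step, here is a minimal sketch of one alternative: pull out the words first, then pair each word with its neighbour. The names TOKEN_RE and bigrams are made up for illustration and are not part of mrjob. It also assumes a "word" is a run of ASCII letters or digits, so "don't" would split into "don" and "t".

```python
import re

# Hypothetical helper names; the token definition is an assumption:
# a "word" is any run of lowercase ASCII letters or digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")

def bigrams(line):
    """Return every pair of adjacent words in line, ignoring punctuation."""
    words = TOKEN_RE.findall(line.lower())
    # zip(words, words[1:]) pairs each word with the one that follows it,
    # so overlapping pairs such as "a b" and "b c" are both produced.
    return [" ".join(pair) for pair in zip(words, words[1:])]

print(bigrams('This "Teribble, horrid!!!!'))
# ['this teribble', 'teribble horrid']
```

    Inside the mapper, the loop would then be something like `for pair in bigrams(line): yield pair, 1`.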
  12. #7
  13. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2011
    Posts
    8
    Rep Power
    0
    Hmm, so I implemented your code and tried it out, but I'm missing a lot of the results...

    Code:
    from mrjob.job import MRJob
    import re

    WORD_RE = re.compile(r"\w+\s\w+")
    
    def preprocess(a):
      return (''.join((' ',c)[c.islower() or c.isspace() or c.isdigit()] for c in a.lower()))
    
    class BiGramFreqCount(MRJob):
     
      def mapper(self, _, line):
        for word in WORD_RE.findall(preprocess(line)):
          yield(word.lower(), 1)
    
      def reducer(self, key, values):
        yield (key, sum(values))
    
    if __name__ == '__main__':
      BiGramFreqCount.run()
    Does the preprocess also consider exclamation/question marks (!?), quotations marks (""), commas, and periods?
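    The preprocess function does turn those punctuation marks into spaces. A possible reason for the missing results, sketched here in plain Python outside mrjob, is the pattern itself: `findall` returns non-overlapping matches, and `\s` matches exactly one whitespace character, so a pair separated by two spaces (as happens after punctuation is replaced) is skipped. The variable names below are only for illustration.

```python
import re

WORD_RE = re.compile(r"\w+\s\w+")  # the pattern used in the post above

# Matches cannot overlap, so "b c" is never reported.
non_overlapping = WORD_RE.findall("a b c d")
print(non_overlapping)  # ['a b', 'c d']

# A comma replaced by a space leaves two spaces between the words,
# and a single \s cannot bridge both of them.
double_space = WORD_RE.findall("horrid  misspelled")
print(double_space)  # []
```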
  14. #8
  15. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,837
    Rep Power
    480
    By processing the data a line at a time you're missing the word pairs that cross line breaks.

    (I don't truly know the content of the variable named "line")

    If this is the issue you could include conversion of newline to space in the preprocessing step.

    You could test your program with a simple file like


    this
    is
    a
    test
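    A small sketch of that test in plain Python, without mrjob, to show the difference between pairing words line by line and pairing them across the whole text. The helper names words and bigrams are hypothetical, and a "word" is assumed to be a run of ASCII letters or digits.

```python
import re

def words(text):
    """Lowercase the text and return its runs of letters or digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bigrams(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))

sample = "this\nis\na\ntest\n"

# One word per line: no line contains a pair, so nothing is found.
per_line = [pair for line in sample.splitlines() for pair in bigrams(words(line))]

# Treating newlines like any other separator recovers the pairs.
whole_text = bigrams(words(sample))

print(per_line)    # []
print(whole_text)  # [('this', 'is'), ('is', 'a'), ('a', 'test')]
```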
