February 21st, 2013, 03:12 AM
-
mrjob
Okay, so this is my first time learning about mapreduce and mrjob in python, and the simpler stuff I think I am able to understand.
Right now, I am having the hardest time getting regex to work correctly for some reason. My logic seems to be right...
So the problem is I can't get all of the output I need from this txt file: http://wikisend.com/download/441620/pg1268.txt
I'm supposed to get every sequence of two adjacent words (bigrams) in the text.
This is working code:
Code:
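from mrjob.job import MRJob
import re

# Two runs of word characters separated by exactly one whitespace character
WORD_RE = re.compile(r"\w+\s\w+")

class BiGramFreqCount(MRJob):

    def mapper(self, _, line):
        # Emit (bigram, 1) for every match found on this line
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def reducer(self, key, values):
        # Sum the counts emitted for each bigram
        yield (key, sum(values))

if __name__ == '__main__':
    BiGramFreqCount.run()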
The code above gives me some of the outputs but not all because it doesn't include the ones that have punctuation in between.
I think there's a way to do it without complicating regex but since I'm not too familiar with mrjob, I don't really know the approach.
Any help would be greatly appreciated!
February 21st, 2013, 10:31 AM
-
I suggest preprocessing the entire text by converting to lower case and replacing all characters that are not white space, digits, or letters with a space character.
[code]
Code tags[/code] are essential for python code and Makefiles!
February 21st, 2013, 11:04 AM
-
Hm, I tried stripping and replacing with a space, but that doesn't seem to work properly... I heard that it can be done with regex, but I don't get the impression it can. Would it be easier to maybe have two reducers? Sorry, I'm not really that good at Python yet, so I apologize.
February 21st, 2013, 11:26 AM
-
Why is reducer an iterator that returns only one value?
I didn't mention str.strip.
Code:
>>> def preprocess(a):
...     return ''.join((' ',c)[c.islower() or c.isspace() or c.isdigit()] for c in a.lower())
...
>>> print(preprocess('This "Teribble, horrid!!!!\nmisspelled SENTENCE.'))
this teribble horrid
misspelled sentence
>>>
February 21st, 2013, 01:32 PM
-
I think it's because the values that have the same key go to the one reducer, so even if there are multiple reducers, each reducer has all the data required for one single key?
I'm far from an expert so don't quote me on this.
Do you know by any chance if there is a way to get the correct output without using another def (def preprocess)? Or is it impossible unless I do it that way?
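To make the reducer question concrete, here is a minimal sketch in plain Python (no mrjob) of how values get grouped by key before each reducer call. The `run_local` helper is hypothetical and only imitates the shuffle step on one machine. The reducer is written as a generator because, as in the job above, a reducer yields its results and may emit zero, one, or several pairs per key; here it happens to emit exactly one.

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for each whitespace-separated word on the line.
    for word in line.lower().split():
        yield word, 1

def reducer(key, values):
    # Called once per distinct key, with every value emitted for that key.
    yield key, sum(values)

def run_local(lines):
    # Hypothetical stand-in for the shuffle step: group mapper output by key.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    # One reducer call per key; collect whatever it yields.
    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results

counts = run_local(["the cat", "the dog"])
print(counts)  # {'cat': 1, 'dog': 1, 'the': 2}
```

Both occurrences of "the" end up in the same reducer call even though they came from different lines, which is why a single `sum(values)` is enough.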
February 21st, 2013, 02:51 PM
-
If you're using unix,
you could preprocess pg1268.txt
with a few passes through tr
Code:
$ tr '[:upper:]' '[:lower:]' < pg1268.txt | tr -c '[:lower:]' ' ' > pg1268.ascii
Using the definition of preprocess,
for word in WORD_RE.findall(preprocess(line)):
other choices get silly.
February 22nd, 2013, 03:21 AM
-
Hmm, so I implemented your code and tried it out, but I'm missing a lot of the results...
Code:
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"\w+\s\w+")

def preprocess(a):
    # Lower-case the text and replace anything that is not a lower-case
    # letter, whitespace, or digit with a single space
    return ''.join((' ',c)[c.islower() or c.isspace() or c.isdigit()] for c in a.lower())

class BiGramFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(preprocess(line)):
            yield (word.lower(), 1)

    def reducer(self, key, values):
        yield (key, sum(values))

if __name__ == '__main__':
    BiGramFreqCount.run()
Does the preprocess also consider exclamation/question marks (!?), quotations marks (""), commas, and periods?
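A quick check outside mrjob suggests two likely reasons for the missing results. First, `re.findall` returns non-overlapping matches, so a word consumed as the second half of one pair cannot start the next pair. Second, `\s` matches exactly one whitespace character, and `preprocess` leaves two spaces wherever punctuation was followed by a space. The sketch below shows both effects and one possible alternative: extract single words, then pair neighbours with `zip`. The `bigrams` helper and the `[a-z0-9]+` word pattern are assumptions for illustration, not something from the assignment.

```python
import re

PAIR_RE = re.compile(r"\w+\s\w+")   # the pattern used in the thread
WORD_RE = re.compile(r"[a-z0-9]+")  # assumption: a word is a run of ASCII letters/digits

def preprocess(a):
    # As posted earlier in the thread.
    return ''.join((' ', c)[c.islower() or c.isspace() or c.isdigit()] for c in a.lower())

def bigrams(text):
    # Hypothetical helper: find the words first, then pair each word with its successor.
    words = WORD_RE.findall(text.lower())
    return [a + " " + b for a, b in zip(words, words[1:])]

sample = "One, two three four."

regex_pairs = PAIR_RE.findall(sample.lower())
after_preprocess = PAIR_RE.findall(preprocess(sample))
all_pairs = bigrams(sample)

print(regex_pairs)       # ['two three']
print(after_preprocess)  # ['two three']
print(all_pairs)         # ['one two', 'two three', 'three four']
```

In the mapper, the same idea would be to loop over `bigrams(line)` instead of `WORD_RE.findall(...)`, with no separate preprocessing step needed.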
February 22nd, 2013, 10:02 AM
-
By processing the data a line at a time you're missing the word pairs that cross line breaks.
(I don't truly know the content of the variable named "line")
If this is the issue you could include conversion of newline to space in the preprocessing step.
You could test your program with a simple file like
this
is
a
test
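The effect can be sketched without mrjob, assuming the mapper receives one line of the input file per call. With the four-line test file above, pairing words within each line finds nothing, while pairing over the whole text finds all three bigrams. The `bigrams` helper and its word pattern are assumptions for illustration.

```python
import re

WORD_RE = re.compile(r"[a-z0-9]+")  # assumption: a word is a run of ASCII letters/digits

def bigrams(text):
    # Pair every word with the word that follows it.
    words = WORD_RE.findall(text.lower())
    return list(zip(words, words[1:]))

text = "this\nis\na\ntest\n"

# Line at a time: each line holds one word, so no pairs are produced.
per_line = [pair for line in text.splitlines() for pair in bigrams(line)]

# Whole text at once: newlines act as ordinary word separators.
whole = bigrams(text)

print(per_line)  # []
print(whole)     # [('this', 'is'), ('is', 'a'), ('a', 'test')]
```

One way to get the second behaviour from a line-based job is to join the lines in a preprocessing pass before running it, which is what converting newlines to spaces achieves.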