MapReduce/MRjob

February 21st, 2013, 03:12 AM
Registered User
Join Date: Apr 2011
Posts: 8
Time spent in forums: 2 h 25 m 18 sec
Reputation Power: 0
mrjob
Okay, so this is my first time learning about MapReduce and mrjob in Python, and I think I understand the simpler stuff.
Right now, I'm having the hardest time getting a regex to work correctly. My logic seems to be right...
The problem is that I can't get all of the output I need from this text file: http://wikisend.com/download/441620/pg1268.txt
I'm supposed to get every sequence of two adjacent words in a string of words.
This is my working code:
Code:
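from mrjob.job import MRJob
import re

# Matches two runs of word characters separated by exactly one whitespace character
WORD_RE = re.compile(r"\w+\s\w+")

class BiGramFreqCount(MRJob):
    def mapper(self, _, line):
        # Emit each matched word pair, lowercased, with a count of 1
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def reducer(self, key, values):
        # Sum the counts for each word pair
        yield (key, sum(values))

if __name__ == '__main__':
    BiGramFreqCount.run()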
The code above gives me some of the output but not all of it, because it misses the pairs that have punctuation between the words.
I think there's a way to do it without complicating the regex, but since I'm not too familiar with mrjob, I don't really know the approach.
Any help would be greatly appreciated!

February 21st, 2013, 10:31 AM
Contributing User
I suggest preprocessing the entire text by converting to lower case and replacing all characters that are not white space, digits, or letters with a space character.

February 21st, 2013, 11:04 AM
Registered User
Hm, I tried stripping and replacing with a space, but that doesn't seem to work properly... I heard that it can be done with regex, but I don't get the impression it can. Would it be easier to have, like, two reducers? Sorry, I'm not really that good at Python yet.

February 21st, 2013, 11:26 AM
Contributing User
Why is reducer an iterator that returns only one value?
I didn't mention str.strip.
Code:
>>> def preprocess(a):
...     return ''.join((' ',c)[c.islower() or c.isspace() or c.isdigit()] for c in a.lower())
...
>>> print(preprocess('This "Teribble, horrid!!!!\nmisspelled SENTENCE.'))
this  teribble  horrid
misspelled sentence
>>>
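The tuple-indexing idiom in the session above, `(' ', c)[condition]`, picks `c` when the condition is true and a space otherwise. A minimal sketch of the same logic written as an explicit loop may be easier to follow; `preprocess_readable` is a hypothetical name used only for this illustration.

```python
def preprocess(text):
    """Same logic as the one-liner in the session above, repeated for comparison."""
    return ''.join((' ', c)[c.islower() or c.isspace() or c.isdigit()]
                   for c in text.lower())


def preprocess_readable(text):
    """Hypothetical rewrite: keep letters, digits and whitespace; blank out the rest."""
    kept = []
    for c in text.lower():
        if c.islower() or c.isspace() or c.isdigit():
            kept.append(c)
        else:
            kept.append(' ')  # one space per replaced character
    return ''.join(kept)


sample = 'This "Teribble, horrid!!!!\nmisspelled SENTENCE.'
cleaned = preprocess_readable(sample)
print(repr(cleaned))
```

Each replaced character becomes its own space, so punctuation next to an existing space leaves a run of two or more spaces in the result.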

February 21st, 2013, 01:32 PM
Registered User
I think it's because all the values that have the same key go to the same reducer, so even if there are multiple reducers, each one has all the data required for any single key it handles?
I'm far from an expert, so don't quote me on this.
Do you know by any chance if there is a way to get the correct output without using another def (def preprocess)? Or is it impossible unless I do it that way?
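The grouping idea described above can be sketched in plain Python. This is a toy simulation, not how mrjob is implemented: `run_local` is a hypothetical stand-in for the shuffle step, and the mapper here counts single words just to keep the example small.

```python
from collections import defaultdict


def mapper(line):
    # Emit (word, 1) for every word in the line
    for word in line.split():
        yield word.lower(), 1


def reducer(key, values):
    # Called once per key with all of that key's values,
    # so it yields a single (key, total) pair per call
    yield key, sum(values)


def run_local(lines):
    """Toy stand-in for the shuffle: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


counts = run_local(["a b a", "b a"])
print(counts)
```

Under this model the reducer yields one pair per call because each call already receives every value for its key.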

February 21st, 2013, 02:51 PM
Contributing User
If you're using Unix, you could preprocess pg1268.txt with a few passes through tr:
Code:
$ tr '[:upper:]' '[:lower:]' < pg1268.txt | tr -c '[:lower:]' ' ' > pg1268.ascii
Using the definition of preprocess, the mapper loop becomes
    for word in WORD_RE.findall(preprocess(line)):
Other choices get silly.

February 22nd, 2013, 03:21 AM
Registered User
Hmm, so I implemented your code and tried it out, but I'm missing a lot of the results...
Code:
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"\w+\s\w+")

def preprocess(a):
    # Lowercase the text; replace everything except letters, digits and whitespace with a space
    return ''.join((' ',c)[c.islower() or c.isspace() or c.isdigit()] for c in a.lower())

class BiGramFreqCount(MRJob):
    def mapper(self, _, line):
        for word in WORD_RE.findall(preprocess(line)):
            yield (word.lower(), 1)

    def reducer(self, key, values):
        yield (key, sum(values))

if __name__ == '__main__':
    BiGramFreqCount.run()
Does preprocess also handle exclamation and question marks (!?), quotation marks (""), commas, and periods?
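Two properties of the pattern may explain some of the missing results, independent of which punctuation is handled. This is an observation about `re`, not something stated in the thread. First, `\s` matches exactly one whitespace character, so the runs of spaces that preprocess leaves behind are not bridged. Second, `findall` returns non-overlapping matches, so a word consumed as the second half of one pair cannot start the next. A minimal sketch, assuming a hypothetical helper `bigrams` that splits on whitespace instead of using the regex:

```python
import re

WORD_RE = re.compile(r"\w+\s\w+")  # pattern from the thread


def preprocess(a):
    # Same function as in the thread
    return ''.join((' ', c)[c.islower() or c.isspace() or c.isdigit()]
                   for c in a.lower())


def bigrams(line):
    """Hypothetical helper: every adjacent word pair within one line."""
    words = preprocess(line).split()  # split() collapses runs of whitespace
    return [first + ' ' + second for first, second in zip(words, words[1:])]


line = 'Well, well! What is this?'
regex_pairs = WORD_RE.findall(preprocess(line))
split_pairs = bigrams(line)
print(regex_pairs)
print(split_pairs)
```

In a mapper, the loop could iterate over `bigrams(line)` in place of `WORD_RE.findall(preprocess(line))`. Pairs that span two input lines are still not covered by this sketch.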

February 22nd, 2013, 10:02 AM
Contributing User
By processing the data a line at a time, you're missing the word pairs that span line breaks.
(I don't actually know the content of the variable named "line".)
If this is the issue, you could convert newlines to spaces in the preprocessing step.
You could test your program with a simple file like
this
is
a
test
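The four-line test file above makes the effect easy to see outside of mrjob. A minimal sketch, assuming the mapper receives one line per call (the post above is not sure of this either); `pairs` is a hypothetical helper.

```python
def pairs(words):
    """Hypothetical helper: adjacent word pairs from a list of words."""
    return [a + ' ' + b for a, b in zip(words, words[1:])]


test_file = "this\nis\na\ntest\n"

# Line at a time: each line holds a single word, so no pair is ever formed
per_line = []
for line in test_file.splitlines():
    per_line.extend(pairs(line.split()))

# Whole text at once: newlines count as ordinary whitespace between words
whole_text = pairs(test_file.split())

print(per_line)
print(whole_text)
```

If the job run on this file produces no output, pairs that cross line breaks are being lost.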