Python Programming
 
#1
February 21st, 2013, 03:12 AM
psk102
mrjob

Okay, so this is my first time learning about MapReduce and mrjob in Python, and I think I'm able to understand the simpler stuff.

Right now, I am having the hardest time getting regex to work correctly for some reason. My logic seems to be right...

So the problem is I can't get all of the output I need from this txt file: http://wikisend.com/download/441620/pg1268.txt
I'm supposed to get every sequence of two adjacent words (each bigram) in the text.

This is working code:
Code:
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"\w+\s\w+")

class BiGramFreqCount(MRJob):

  def mapper(self, _, line):
    for word in WORD_RE.findall(line):
      yield(word.lower(), 1)

  def reducer(self, key, values):
    yield (key, sum(values))
  
if __name__ == '__main__':
  BiGramFreqCount.run()


The code above gives me some of the outputs but not all because it doesn't include the ones that have punctuation in between.

I think there's a way to do it without complicating regex but since I'm not too familiar with mrjob, I don't really know the approach.

Any help would be greatly appreciated!

#2
February 21st, 2013, 10:31 AM
b49P23TIvg
I suggest preprocessing the entire text by converting to lower case and replacing all characters that are not white space, digits, or letters with a space character.
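
A minimal sketch of that preprocessing step using `re.sub`, assuming the input is plain ASCII English text. The names `NON_WORD_RE` and `normalize` are made up for illustration and do not come from the thread:

```python
import re

# Anything that is not a lowercase ASCII letter, a digit, or whitespace.
# Assumption: the text is plain ASCII, so accented letters are not handled.
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")

def normalize(text):
    """Lower-case text and replace punctuation with spaces (hypothetical helper)."""
    return NON_WORD_RE.sub(" ", text.lower())

# Punctuation becomes spaces; letters, digits and newlines are kept.
print(normalize('He said, "Stop!"').split())
```

Replacing punctuation with a space rather than deleting it keeps words such as "horrid!!!!misspelled" from being glued together.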

#3
February 21st, 2013, 11:04 AM
psk102
Hm, I tried stripping and replacing with a space, but that doesn't seem to work properly... I heard that it can be done with regex, but I don't get the impression it can. Would it be easier to maybe have two reducers? Sorry, I'm not really that good at Python yet.

#4
February 21st, 2013, 11:26 AM
b49P23TIvg
Why is reducer an iterator that returns only one value?

I didn't mention str.strip.
Code:
>>> def preprocess(a):
...    return ''.join((' ',c)[c.islower() or c.isspace() or c.isdigit()] for c in a.lower())
...
>>> print(preprocess('This "Teribble, horrid!!!!\nmisspelled SENTENCE.'))
this  teribble  horrid    
misspelled sentence 
>>> 

#5
February 21st, 2013, 01:32 PM
psk102
I think it's because the values that have the same key go to the one reducer, so even if there are multiple reducers, each reducer has all the data required for one single key?

I'm far from an expert so don't quote me on this.
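
To make that idea concrete, here is a plain-Python illustration of the MapReduce model. It is not how mrjob is implemented; `run_local` is a hypothetical stand-in for the shuffle step, and the word-count mapper is only there to give the reducer something to sum:

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every whitespace-separated word.
    for word in line.split():
        yield word, 1

def reducer(key, values):
    # Called once per distinct key with all of that key's values,
    # so a single yield per call is enough.
    yield key, sum(values)

def run_local(lines):
    """Toy stand-in for the shuffle step: group mapper output by key."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    results = {}
    for key in sorted(grouped):
        for out_key, total in reducer(key, grouped[key]):
            results[out_key] = total
    return results

print(run_local(["a b a", "b a"]))  # {'a': 3, 'b': 2}
```

The reducer is written as a generator because the framework allows it to emit zero, one, or many pairs per key; a counting job simply happens to emit one.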

Do you know by any chance if there is a way to get the correct output without using another def (def preprocess)? Or is it impossible unless I do it that way?

#6
February 21st, 2013, 02:51 PM
b49P23TIvg
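If you're using Unix, you could preprocess pg1268.txt with a couple of passes through tr:

Code:
# tr reads standard input, so the file is redirected in rather than passed as an argument.
# Note: the second pass also turns digits and newlines into spaces.
$ tr '[:upper:]' '[:lower:]' < pg1268.txt | tr -c '[:lower:]' ' ' > pg1268.ascii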



Using the definition of preprocess, the mapper loop becomes

for word in WORD_RE.findall(preprocess(line)):

Other choices get silly.

#7
February 22nd, 2013, 03:21 AM
psk102
Hmm, so I implemented your code and tried it out, but I'm missing a lot of the results...

Code:
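from mrjob.job import MRJob
import re

# Two runs of word characters separated by exactly one whitespace character.
WORD_RE = re.compile(r"\w+\s\w+")

def preprocess(a):
  # Lower-case the text, then replace every character that is not a
  # letter, digit, or whitespace with a space.
  return ''.join((' ', c)[c.islower() or c.isspace() or c.isdigit()] for c in a.lower())

class BiGramFreqCount(MRJob):

  def mapper(self, _, line):
    # Emit (bigram, 1) for each match found in the cleaned-up line.
    for word in WORD_RE.findall(preprocess(line)):
      yield (word.lower(), 1)

  def reducer(self, key, values):
    # Sum the counts collected for one bigram.
    yield (key, sum(values))

if __name__ == '__main__':
  BiGramFreqCount.run()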

Does preprocess also handle exclamation/question marks (!?), quotation marks (""), commas, and periods?
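
Two properties of the pattern may explain part of the missing output. `re.findall` returns non-overlapping matches, so once "this is" has matched, "is" cannot start the next pair. And `\s` matches exactly one whitespace character, while preprocess leaves runs of spaces where punctuation used to be. The sketch below is one possible workaround, assuming the text has already been through preprocess; the helper name `bigrams` is hypothetical:

```python
import re

WORD_RE = re.compile(r"\w+\s\w+")

def bigrams(text):
    """Return every adjacent word pair in text, including overlapping ones."""
    words = text.lower().split()          # split() collapses runs of whitespace
    return [" ".join(pair) for pair in zip(words, words[1:])]

sample = "this is a  test"                # note the double space
print(WORD_RE.findall(sample))            # ['this is']
print(bigrams(sample))                    # ['this is', 'is a', 'a test']
```

In the mapper, the loop could then iterate over `bigrams(preprocess(line))` instead of `WORD_RE.findall(...)`.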

#8
February 22nd, 2013, 10:02 AM
b49P23TIvg
By processing the data a line at a time you're missing the word pairs that span line breaks.

(I don't truly know the content of the variable named "line")

If this is the issue you could include conversion of newline to space in the preprocessing step.

You could test your program with a simple file like


this
is
a
test
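
The effect of that test file can be seen without running mrjob at all. This sketch assumes the mapper receives one line of the file per call; `adjacent_pairs` is a hypothetical helper, not part of mrjob:

```python
def adjacent_pairs(words):
    """Pair each word with the word that follows it."""
    return [" ".join(pair) for pair in zip(words, words[1:])]

# The four-line test file, one word per line.
lines = ["this", "is", "a", "test"]

# Line-at-a-time processing: no line holds two words, so nothing is found.
per_line = [bg for line in lines for bg in adjacent_pairs(line.split())]

# Newlines converted to spaces first: pairs that span lines are recovered.
joined = adjacent_pairs(" ".join(lines).split())

print(per_line)   # []
print(joined)     # ['this is', 'is a', 'a test']
```

If the per-line result is what the job produces on this file, converting newlines to spaces before the job runs is the likely fix.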
