#1
  1. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2013
    Posts
    27
    Rep Power
    0

    Sentiment Analysis of the Yelp Academic Dataset


    Hi, I am totally stuck on how to go about this.

    I'm trying to perform an automated customer sentiment analysis of the Yelp Academic Dataset (specifically, analyzing the 'review text' to compute a sentiment score), using the NLTK Python package to stem the words.

    Any help or push in the right direction would be much appreciated.
  2. #2
  3. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,966
    Rep Power
    481
    Dude or dudette, this is easy. EASY!
    Originally Posted by Instruction
    A high sentiment score would mean the review is positive, and a low (negative) sentiment score would mean the review is positive.
    Either way, positive. Just choose a random number (or dispense with even that) and report "positive".
    [code]Code tags[/code] are essential for python code and Makefiles!
  4. #3
  5. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,966
    Rep Power
    481
    But seriously, the instructions give the method in pretty good detail. Have you installed nltk? Start there.

    I don't have your fine instructions for using the Porter stemming algorithm. It looks a little aggressive with regard to trailing "e"s. Perhaps after you find the stem you should pass the words through a spelling improvement routine. Maybe the nltk has one.

    http://norvig.com/spell-correct.html


    Code:
    >>> import nltk
    >>> S = nltk.PorterStemmer()
    >>> S.stem('matting')
    'mat'
    >>> S.stem('reliance')
    'relianc'
    >>> S.stem('reliability')
    'reliabl'
    >>>
  6. #4
  7. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2013
    Posts
    27
    Rep Power
    0
    Okay, so I installed the nltk (pretty sure I did it correctly)...

    The part I'm really confused about is how I'm supposed to get the 'review text' part. Am I supposed to get the other dataset information as well, like the business_id and star rating? Or is there a specific format to follow so that the text will be stemmed correctly later using nltk?
  8. #5
  9. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,966
    Rep Power
    481
    Perhaps you can skip the spelling phase. Your word list already contains the stemmed "roots", for example:

    excruciat,-5

    Expert? I had already installed the nltk. That's about it.

    OK, sentiment_word_list.txt is good but we can ignore it? I'll need to reread the instructions.
    The stemmed .json file includes smilies, how nice.
    I don't see any surprises in the "how to use it" file.
    Then we have data in the other file.

    (written 12 hours ago)
    .........
    I'm not getting to this project quickly.
    ........
    (written now)
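For what it's worth, here is a minimal sketch of how the scoring step might look. It assumes the word list holds one `stem,score` pair per line (like `excruciat,-5` above) and that a review's score is just the sum of the scores of its stemmed words. The function names, the tokenising regex, and the summing rule are my own assumptions, not something taken from the assignment instructions. The stemmer is passed in as a parameter so the scoring logic can be tried out without nltk.

```python
import re


def load_lexicon(lines):
    """Parse 'stem,score' lines into a dict, skipping blank or malformed lines."""
    lexicon = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last comma so the score is always the final field.
        stem, sep, score = line.rpartition(",")
        if not sep:
            continue  # no comma at all: not a stem,score pair
        try:
            lexicon[stem.strip()] = float(score)
        except ValueError:
            continue  # score was not a number
    return lexicon


def score_review(text, lexicon, stem):
    """Sum the lexicon scores of the stemmed words in `text`.

    `stem` is any callable that maps a word to its stem,
    for example nltk.PorterStemmer().stem.
    Words missing from the lexicon contribute 0.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(stem(word), 0) for word in words)
```

With nltk installed, the call would look something like `score_review(review["text"], lexicon, nltk.PorterStemmer().stem)`, where `review` is one parsed entry from the data file.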
  10. #6
  11. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,966
    Rep Power
    481
    How goes it? Scanning through that 1/3 gigabyte data file, we finally find an entry with a text field (see the end of this post).

    So I'd make a sample data set to test my program. The test data should have about 5 entries and 2 unique business IDs; a few of the entries should have no text and a few no business_id, so you can detect and reject bad data.
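Something along these lines could generate such a test file. It is only a sketch: the IDs and review texts are invented for testing, `write_sample` is a hypothetical helper name, and the one-JSON-object-per-line layout is my assumption about how the real data file is organised.

```python
import json

# Made-up records for testing only; none of these come from the real dataset.
SAMPLE_RECORDS = [
    {"type": "review", "business_id": "biz_A", "stars": 5, "text": "Great food."},
    {"type": "review", "business_id": "biz_A", "stars": 1, "text": "Terrible service."},
    {"type": "review", "business_id": "biz_B", "stars": 4, "text": "Pretty good."},
    {"type": "review", "business_id": "biz_B", "stars": 3},           # no text
    {"type": "review", "stars": 2, "text": "Where was this again?"},  # no business_id
]


def write_sample(path, records=SAMPLE_RECORDS):
    """Write one JSON object per line to `path`."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```

A program that handles this file correctly should end up with three usable reviews spread over two businesses and should reject the other two entries.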

    Read the data using the json module. I'd reject bad data as I loaded it, then sort by business_id. A more efficient way is to process the data as it comes, storing results in a dictionary keyed by business_id, then combine the results for each key later. Sorting first helps me think about the problem. Well, good luck. This isn't a trivially small problem.

    {"votes": {"funny": 0, "useful": 2, "cool": 2}, "user_id": "Jho-WZ05VCyPyNOIQMkKkw", "review_id": "uQFqwVDXxCycjJBHzr5gLw", "stars": 5, "date": "2008-07-07", "text": "I went here for lunch with promises from my friend that I would be impressed and really like the place.\nHe was right.\nTheir Turkey Hoagie was so freakin good!\nThe bread was fresh, ingredients also fresh and tasty and the folks here are so friendly and polite. (A great GREAT find nowadays!)\nI will be back for more.", "type": "review", "business_id": "3r9WCjQYKDZeDv1oxLGVwA"}
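The dictionary approach could be sketched roughly as follows. It assumes the data file holds one JSON object per line, like the entry above, and it treats any record without a `business_id` or a non-empty `text` as bad data. `group_reviews` is just a name I picked for illustration.

```python
import json
from collections import defaultdict


def group_reviews(lines):
    """Return {business_id: [review text, ...]}, skipping unusable records."""
    by_business = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            continue
        if not isinstance(record, dict):
            continue
        business_id = record.get("business_id")
        text = record.get("text")
        if not business_id or not text:
            continue  # reject entries missing either field
        by_business[business_id].append(text)
    return dict(by_business)
```

Because an open file yields its lines one at a time, passing the file object straight in (`group_reviews(f)` inside a `with open(...) as f:` block) avoids reading the whole file into memory as a single string, although the collected review texts themselves are still held in memory. The grouped texts can then be scored per business.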