#1
  1. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    12
    Rep Power
    0

    Merging data into tsv file help


    Hi, I just started learning python a few days ago...and I'm already stuck on something easy TT.TT

    I know I need to use BeautifulSoup4 and urllib2 in my code, and I've used them to read through the data on the webpage, but I'm not sure how to add the regions to my tsv file and merge them in. (Hopefully that makes sense.)

    here's a snippet of my current tsv file
    Code:
    country	area	population 
    MACAU	28.2	578025 
    MONACO	2	30510 
    SINGAPORE	697	5353494 
    ..........
    my code:
    Code:
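    import urllib2
    from bs4 import BeautifulSoup

    BASE_URL = 'http://www.indexmundi.com'

    # Fetch the page that lists every region.
    response = urllib2.urlopen(BASE_URL + '/factbook/regions').read()
    soup = BeautifulSoup(response, 'html.parser')
    for item in soup.findAll('li'):
        link = item.find('a')
        if link is None or not link.get('href'):
            continue  # skip list items that do not contain a link
        # Follow the link to that region's own page.
        region_page = urllib2.urlopen(BASE_URL + link['href']).read()
        region_soup = BeautifulSoup(region_page, 'html.parser')
        for cell in region_soup.findAll('td'):
            country_link = cell.find('a')
            # Only table cells that hold a link are of interest.
            if country_link is not None and country_link.get('href'):
                print country_link['href']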
    and what I want my final tsv file to look like:
    Code:
    country	region	area	population
    AFGHANISTAN	Asia	652230	30419928
    ALBANIA	Europe	28748	3002859
    ALGERIA	Africa	2381741	37367226
    AMERICAN SAMOA	Oceania	199	54947
    ANDORRA	Europe	468	85082
    ANGOLA	Africa	1246700	18056072
    .............
    I don't think I need to keep following the links from where I am, right? But I'm not sure how to merge the regions into the file against the correct country and order the columns like the above.

    I'd appreciate any help!
  2. #2
  3. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,710
    Rep Power
    480
    Read your input with

    import csv
    csv.DictReader


    Insert another key: value pair into each dictionary,
    then use csv.DictWriter

    At least I think that's how it works.
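
A minimal sketch of that suggestion, not a definitive implementation. It is written for Python 3 (the thread itself uses Python 2) and uses in-memory text instead of the real files. The `regions` dict, the `add_regions` helper, and the sample rows are assumptions made up for illustration; only `csv.DictReader` and `csv.DictWriter` come from the post above.

```python
import csv
import io

# Illustrative stand-in for the existing TSV file (rows taken from the thread).
tsv_text = (
    "country\tarea\tpopulation\n"
    "MACAU\t28.2\t578025\n"
    "MONACO\t2\t30510\n"
    "SINGAPORE\t697\t5353494\n"
)

# Hypothetical lookup: upper-cased country name -> region.
regions = {"MACAU": "Asia", "MONACO": "Europe", "SINGAPORE": "Asia"}


def add_regions(source, lookup):
    """Read TSV rows as dicts, add a 'region' key, and return the new TSV text."""
    reader = csv.DictReader(source, delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=["country", "region", "area", "population"],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        # Insert the extra key: value pair; fall back to UNKNOWN if missing.
        row["region"] = lookup.get(row["country"], "UNKNOWN")
        writer.writerow(row)
    return out.getvalue()


merged = add_regions(io.StringIO(tsv_text), regions)
print(merged)
```

With real files, the `io.StringIO` objects would be replaced by file objects from `open(...)`. The order of `fieldnames` is what puts the region column second.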
    [code]Code tags[/code] are essential for python code and Makefiles!
  4. #3
  5. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    12
    Rep Power
    0
    Hmm, I looked over it but I'm not exactly sure how to implement it correctly with my current code.

    Also, I was told that what I'm trying to achieve can be done with just beautifulsoup4 and urllib2 and doesn't need to get complicated with other modules?

    Sorry, still trying to understand python -__-;
  6. #4
  7. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,710
    Rep Power
    480
    bs4 isn't part of the standard distribution; I consider it exotic.
  8. #5
  9. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    12
    Rep Power
    0
    Hmm....
  10. #6
  11. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    12
    Rep Power
    0
    So I played around with the code and now get output that lists every region and country in the format below:
    Code:
    [(u, 'Africa', u, 'Algeria'), (u, 'Africa', u, 'Angola')...etc)]
    I've stored it and now am trying to figure out how to merge it to the tsv file.

    I think I'm heading in the right direction with making a dictionary but I'm getting either an attribute error or a values error.

    I hope someone can help me out with this part
  12. #7
  13. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,710
    Rep Power
    480
    As we say, what error do you experience?

    My car doesn't work. Please diagnose it remotely. Thanks. Have a good day.
  14. #8
  15. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    12
    Rep Power
    0
    Oh sorry, I meant the lookup line is giving me the error so the program stops when it runs to that line. The region, country still prints like so
    Code:
    [(u, 'Africa', u, 'Algeria'), (u, 'Africa', u, 'Angola')...etc)]
    But when it gets to make a dict, at the lookup line with lookup = dict((c.upper(), r) for r, c in data.data), I get attribute error: list has no attribute data.

    However, if I change the line to lookup = dict((c.upper(), r) for r, c in data), then the program runs and hits country, area, population = line.split('\t') and gives me the ValueError: too many values to unpack

    Sorry again, I'm still new so I'm trying to understand what's going on as well. Do you know if I'm going in the right direction? I don't think I need to extract anything else from the site right?

    Thanks again for your help.
  16. #9
  17. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,710
    Rep Power
    480
    Python did not print

    [(u, 'Africa', u, 'Algeria'), (u, 'Africa', u, 'Angola')...etc)]

    It's hard to get Python to print that; you'd have to go out of your way.

    Python may have printed this:

    [(u'Africa', u'Algeria'), (u'Africa', u'Angola')...etc]


    This is a python tuple:

    (2, 5)

    It's immutable. It is iterable. This one has length 2.


    This is a tuple assignment

    (a,b,) = ('String to a', None)

    b gets None.
    a is assigned 'String to a'



    (a,b,c,) = (1,2)

    These tuples have different lengths. What is Python to do? Ah! Tell you of the error: the assignment raises a ValueError.
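
The tuple assignments above can be tried out with a small sketch like this one. It is written for Python 3, and the helper name `try_unpack_three` is made up for illustration. The exact wording of the error message differs between Python versions, so the sketch only captures it rather than assuming what it says.

```python
# Unpacking works when both sides have the same number of items.
a, b = ('String to a', None)


def try_unpack_three(values):
    """Return the error text if values cannot fill exactly three names, else None."""
    try:
        x, y, z = values
    except ValueError as err:
        return str(err)
    return None


too_few = try_unpack_three((1, 2))         # two values for three names
too_many = try_unpack_three((1, 2, 3, 4))  # four values for three names
just_right = try_unpack_three((1, 2, 3))

print(too_few)
print(too_many)
print(just_right)
```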


    Here is some code similar to the part you're complaining about.
    Code:
    for (r, c) in data:
        pass
    This will fail as soon as iterating over data yields an item that cannot be unpacked into exactly two values.
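
One way to find out whether `data` contains such an item is to check lengths before unpacking. This is only a sketch for Python 3: the sample `data` (including its malformed last item) and the helper `split_by_length` are hypothetical and not taken from the asker's program.

```python
# Hypothetical data: the last item has three elements instead of two.
data = [
    (u'Africa', u'Algeria'),
    (u'Africa', u'Angola'),
    (u'Africa', u'Benin', u'extra'),
]


def split_by_length(items, expected=2):
    """Separate items of the expected length from the ones that would break unpacking."""
    good, bad = [], []
    for index, item in enumerate(items):
        if len(item) == expected:
            good.append(item)
        else:
            bad.append((index, item))  # remember where the odd item was
    return good, bad


good, bad = split_by_length(data)

# Safe now: every item in good unpacks into exactly (region, country).
lookup = dict((country.upper(), region) for region, country in good)

print(bad)
print(lookup)
```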

    If you happen to supply more information please cut and paste it. Please include the error message. No paraphrasing! No retyping.
    Last edited by b49P23TIvg; March 16th, 2013 at 05:11 PM.
  18. #10
  19. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    12
    Rep Power
    0
    Hi again, sorry about the output, you're totally right, it was a typo on my part. The print out I get is this:
    Code:
    [(u'Africa', u'Algeria'), (u'Africa', u'Angola'), (u'Africa', u'Benin'), (u'Africa', u'Botswana'), (u'Africa', u'Burkina Faso'), (u'Africa', u'Burundi'), (u'Africa', u'Cameroon'), (u'Africa', u'Cape Verde'), (u'Africa', u'Central African Republic'), (u'Africa', u'Chad'), (u'Africa', u'Comoros'), (u'Africa', u'Congo, Democratic Republic of the'), (u'Africa', u'Congo, Republic of the'), (u'Africa', u"Cote d'Ivoire"), (u'Africa', u'Djibouti'), (u'Africa', u'Egypt'), (u'Africa', u'Equatorial Guinea'), (u'Africa', u'Eritrea'), (u'Africa', u'Ethiopia'), (u'Africa', u'Gabon'), (u'Africa', u'Gambia, The'), (u'Africa', u'Ghana'), (u'Africa', u'Guinea'), (u'Africa', u'Guinea-Bissau'), (u'Africa', u'Kenya'), (u'Africa', u'Lesotho'), (u'Africa', u'Liberia')
    With this line of code:
    Code:
    lookup = dict((c.upper(), r) for r, c in data.data)
    I get:
    Code:
    Traceback (most recent call last):
      File "<pyshell#13>", line 1, in <module>
        lookup = dict((c.upper(), r) for r, c in data.data)
    AttributeError: 'list' object has no attribute 'data'
    So I change it to:
    Code:
    lookup = dict((c.upper(), r) for r, c in data)
    and that seems to pass through python to the next part which is this:
    Code:
    for line in open("data.tsv", "r"):
    	country, area, population = line.split('\t')
    	if country in lookup:
    		print "\t".join([country, lookup[country], area, population])
    	else:
    		print "\t".join([country, 'UNKNOWN', area, population])
    But this gives me this error:
    Code:
    Traceback (most recent call last):
      File "<pyshell#20>", line 2, in <module>
        country, area, population = line.split('\t')
    ValueError: too many values to unpack
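
That ValueError means at least one line of the file splits into more than three tab-separated fields. The thread does not show which line, so the following is only a sketch under an assumption: that some line carries an extra trailing tab (the header in the first post's snippet does show trailing whitespace after `population`). It is written for Python 3; the sample `lines`, the `lookup` contents, and the helper `merge_line` are made up for illustration.

```python
# Hypothetical lines: the second one ends with an extra tab before the newline.
lines = [
    "MACAU\t28.2\t578025\n",
    "MONACO\t2\t30510\t\n",
    "SINGAPORE\t697\t5353494\n",
]
lookup = {"MACAU": "Asia", "MONACO": "Europe"}


def merge_line(line, lookup):
    """Return 'country<TAB>region<TAB>area<TAB>population', or None if the line is malformed."""
    fields = line.rstrip("\r\n").split("\t")
    # Drop empty fields left behind by trailing tabs or spaces.
    while fields and fields[-1].strip() == "":
        fields.pop()
    if len(fields) != 3:
        return None  # caller can report repr(line) to see what is wrong
    country, area, population = [field.strip() for field in fields]
    return "\t".join([country, lookup.get(country, "UNKNOWN"), area, population])


for line in lines:
    # repr() makes stray tabs and newlines visible while debugging.
    print(repr(line), len(line.split("\t")))

merged = [merge_line(line, lookup) for line in lines]
print(merged)
```

Checking the field count before unpacking turns the crash into something that can be inspected, and stripping the fields also removes the newline that would otherwise stay attached to the population value.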
  20. #11
  21. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,710
    Rep Power
    480
    Great, now read the rest of my previous post which explains a little bit about tuples.

    As for data not being an attribute of list, try this experiment for the totally unpredictable surprise result:

    >>> hasattr([], 'data')
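
Spelled out as a short sketch (Python 3 syntax; the two-item `data` list is a shortened copy of the output posted earlier in the thread): a list has no attribute called `data`, so the list itself is what gets iterated.

```python
data = [(u'Africa', u'Algeria'), (u'Africa', u'Angola')]

# Lists do not have an attribute named 'data', which is why data.data fails.
has_data_attr = hasattr(data, 'data')

# Iterating over the list directly gives the (region, country) pairs.
lookup = dict((c.upper(), r) for r, c in data)

print(has_data_attr)
print(lookup)
```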
