#1
  1. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2013
    Posts
    27
    Rep Power
    0

    Findall, occurrences, sorted


    Hi, I'm having trouble getting re.findall to produce all my occurrences.

    Right now, it's only producing the very last row of the table.

    This is part of the html code of the table:
    Code:
    <div id="content"><table border='0' style='padding-left:20'><tr><th>Rank</th><th align='left'>Country</th><th align='left' colspan='2'>Exports (Billion $)</th></tr><tr><td align='right'>1</td><td><a href='../china/exports.html'>China</a></td><td align='right'>1,904</td><td><img src='/img/g.gif' height='10' width='350'></td></tr><tr><td align='right'>2</td><td><a href='../united_states/exports.html'>United States</a></td><td align='right'>1,497</td><td><img src='/img/g.gif' height='10' width='275'></td></tr><tr><td align='right'>3</td><td><a href='../germany/exports.html'>Germany</a></td><td align='right'>1,408</td><td><img src='/img/g.gif' height='10' width='259'></td></tr><tr><td align='right'>4</td><td><a href='../japan/exports.html'>Japan</a></td><td align='right'>788</td><td><img src='/img/g.gif' height='10' width='145'></td></tr><tr><td align='right'>5</td><td><a href='../france/exports.html'>France</a></td><td align='right'>587.1</td><td><img src='/img/g.gif' height='10' width='108'></td></tr>
    I know the problem is somewhere in my re.findall() line of code, but I can't figure out what needs to be added to make it print out all the occurrences instead of just one.

    This is part of my code:
    Code:
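    import re

    def extract_data(filename):
      country_export = []

      # Open the file and read its whole contents into one string
      with open(filename, 'rU') as f:
        text = f.read()

      # Try to capture rank, country and export value from each table row
      tuples = re.findall(r'<td .*>(.*)</td><td><a .*>(.*)</a></td><td .*>(.*)</td>', text)
      print tuples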
    So this code is only giving me
    Code:
    [('221', 'Tokelau', '')]
    The other rows don't show up. I'm also not sure why the zero isn't showing up: it appeared when I tried this in IDLE, but not when I ran it from the command line (as seen in the output above).
  2. #2
  3. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,996
    Rep Power
    481
    Instead of .* in your patterns
    [^>\n]*
    might be more appropriate.
    Any character except > or newline.
    I suppose either of us could read the manual about how newlines are treated. Productive for another project.
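    A minimal sketch of the difference, written for Python 3 (the code in the question is Python 2). The sample string is just the first two rows of the table posted above, joined on one line, and the names greedy, bounded, greedy_rows and bounded_rows are made up for this illustration. With everything on one line, each greedy .* runs as far right as it can, so the whole input yields a single match built from the last row, and the final group lands on the empty text after the <img> tag. That would also account for the empty third field in the output above.

```python
import re

# First two rows of the posted table, on a single line as in the page source.
html = (
    "<tr><td align='right'>1</td>"
    "<td><a href='../china/exports.html'>China</a></td>"
    "<td align='right'>1,904</td>"
    "<td><img src='/img/g.gif' height='10' width='350'></td></tr>"
    "<tr><td align='right'>2</td>"
    "<td><a href='../united_states/exports.html'>United States</a></td>"
    "<td align='right'>1,497</td>"
    "<td><img src='/img/g.gif' height='10' width='275'></td></tr>"
)

# Original pattern: every '.*' is greedy and may cross tag boundaries.
greedy = r"<td .*>(.*)</td><td><a .*>(.*)</a></td><td .*>(.*)</td>"

# Same pattern with each '.*' replaced by '[^>\n]*' (anything but '>' or newline),
# so no piece of the pattern can run past the end of the current tag.
bounded = (
    r"<td [^>\n]*>([^>\n]*)</td>"
    r"<td><a [^>\n]*>([^>\n]*)</a></td>"
    r"<td [^>\n]*>([^>\n]*)</td>"
)

greedy_rows = re.findall(greedy, html)
bounded_rows = re.findall(bounded, html)

print(greedy_rows)   # [('2', 'United States', '')]
print(bounded_rows)  # [('1', 'China', '1,904'), ('2', 'United States', '1,497')]
```

    By default . does not match a newline, which is why a file with one row per line can behave differently from one where the whole table sits on a single line.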

    How about you use the html parser? Recently people have complained about the unreadable, impossible to understand python documentation. This bit would be a counterexample.
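    For anyone curious what that might look like, here is a rough sketch. It assumes the parser meant is the standard-library html.parser module (Python 3 naming; in Python 2 the module is called HTMLParser). The class name ExportTableParser, its attributes, and the two-row sample are invented for this example and are not from the original code.

```python
from html.parser import HTMLParser


class ExportTableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect the pieces.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # the header row has only <th> cells, so it stays empty
                self.rows.append(self._row)
            self._row = None


# Header plus the first two rows of the posted table (sample trimmed for brevity).
html = (
    "<table border='0'><tr><th>Rank</th><th align='left'>Country</th>"
    "<th align='left' colspan='2'>Exports (Billion $)</th></tr>"
    "<tr><td align='right'>1</td>"
    "<td><a href='../china/exports.html'>China</a></td>"
    "<td align='right'>1,904</td>"
    "<td><img src='/img/g.gif' height='10' width='350'></td></tr>"
    "<tr><td align='right'>2</td>"
    "<td><a href='../united_states/exports.html'>United States</a></td>"
    "<td align='right'>1,497</td>"
    "<td><img src='/img/g.gif' height='10' width='275'></td></tr></table>"
)

parser = ExportTableParser()
parser.feed(html)
parser.close()

# Keep rank, country and export value; the fourth cell only holds the bar image.
records = [tuple(row[:3]) for row in parser.rows]
print(records)  # [('1', 'China', '1,904'), ('2', 'United States', '1,497')]
```

    The parser tracks tags as it goes, so the result does not depend on where the line breaks fall or on how many attributes each tag carries.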
    [code]Code tags[/code] are essential for python code and Makefiles!
  4. #3
  5. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2013
    Posts
    27
    Rep Power
    0
    Oh my gosh! Thank you so much! It's working correctly now!!
    I changed it a slight bit but I would never have gotten there without your help!!
    Thank you!
  6. #4
  7. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,996
    Rep Power
    481
    So I didn't convince you to use the html parser. Your loss.
  8. #5
  9. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    43
    Rep Power
    3
    Yes, I would suggest using a module built specifically for HTML parsing. BeautifulSoup is another module that people seem to enjoy.
  10. #6
  11. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2013
    Posts
    27
    Rep Power
    0
    Originally Posted by b49P23TIvg
    So I didn't convince you to use the html parser. Your loss.
    Haha, you actually did, in fact. I will aim to implement it in the future.
