Thread: Parsing

    #1
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2013
    Posts
    27
    Rep Power
    0

    Parsing


    I'm about two-thirds of the way there with my code, but I'm stuck on one small part of it.

    The saved file should have: ID, rank, rating, title, year, votes.

    The ID is the part that sits between the last two slashes of the movie URL in the table. For example, if the URL is http://www.imdb.com/title/tt0111161/, the ID is tt0111161.

    My final output should look something like this:

    Code:
    tt0111161 1. 9.2 The Shawshank Redemption (1994) 900,907
    tt0068646 2. 9.2 The Godfather (1972) 650,418
    tt0071562 3. 9.0 The Godfather: Part II (1974) 418,204

    I have the output file and everything else working; the only part I'm stuck on is how to get the ID out of the table.

    my code:
    Code:
    data_table = soup.find_all('table')[1]
    myData = []
    
    for row in data_table.find_all('tr'):
      l = []
      for s in row.strings:
        l.append(unicode(s))    
      
      myData.append(l)
          
    outfile = open('myfile.txt', 'w')
    for l in myData:
      line = '\t'.join(l) + '\n'
      outfile.write(unicode(line).encode('utf8'))
    outfile.close()
    Here's part of the table:
    Code:
                  <table border="1" cellpadding="4" cellspacing="0">
                   <tr bgcolor="#FFFFDB">
                    <td align="center">
                     <font face="Arial, Helvetica, sans-serif" size="-1">
                      <b>
                       Rank
                      </b>
                     </font>
                    </td>
                    <td align="center">
                     <font face="Arial, Helvetica, sans-serif" size="-1">
                      <b>
                       Rating
                      </b>
                     </font>
                    </td>
                    <td>
                     <font face="Arial, Helvetica, sans-serif" size="-1">
                      <b>
                       Title
                      </b>
                     </font>
                    </td>
                    <td align="right">
                     <font face="Arial, Helvetica, sans-serif" size="-1">
                      <b>
                       Votes
                      </b>
                     </font>
                    </td>
                   </tr>
                   <tr bgcolor="#e5e5e5" valign="top">
                    <td align="right">
                     <font face="Arial, Helvetica, sans-serif" size="-1">
                      <b>
                       1.
                      </b>
                     </font>
                    </td>
                    <td align="center">
                     <font face="Arial, Helvetica, sans-serif" size="-1">
                      9.2
                     </font>
                    </td>
                    <td>
                     <font face="Arial, Helvetica, sans-serif" size="-1">
                      <a href="/title/tt0111161/">
                       The Shawshank Redemption
                      </a>
                      (1994)
                     </font>
                    </td>
    Any help would be greatly appreciated!
  2. #2
  3. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,843
    Rep Power
    480
    >>> '/title/tt0111161/'.split('/')[-2]
    'tt0111161'
    [code]Code tags[/code] are essential for python code and Makefiles!
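
The one-liner above relies on the href ending in a slash, so that the last element of the split is an empty string and the ID sits at index -2. As a hedged sketch only, the same idea can be made tolerant of a missing trailing slash by stripping it first. The helper name `movie_id` is hypothetical and not from the thread.

```python
def movie_id(url):
    """Return the last path segment of `url`, ignoring any trailing slash."""
    # Drop trailing slashes, then keep whatever follows the final remaining '/'.
    return url.rstrip("/").rsplit("/", 1)[-1]


# Works for the full URL from the first post and for the relative href in the table.
full_id = movie_id("http://www.imdb.com/title/tt0111161/")
relative_id = movie_id("/title/tt0111161/")
no_slash_id = movie_id("/title/tt0111161")
print(full_id, relative_id, no_slash_id)
```

With `split('/')[-2]` alone, the last call would return `'title'` instead of the ID, which is the only reason to prefer this variant.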
  4. #3
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2013
    Posts
    27
    Rep Power
    0
    I'm getting this error...
    TypeError: 'NoneType' object has no attribute '__getitem__'

    Do you know why?

    Code:
    data_table = soup.find_all('table')[1]
    myData = []
    
    for row in data_table.find_all('tr'):
      l = []
      cells = row.find('a')['href'].split('/')[-2]
      
    outfile = open('step2.txt', 'w')
    for l in myData:
      line = '\t'.join(l) + '\n'
      outfile.write(unicode(line).encode('utf8'))
    outfile.close()
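
A likely cause, judging from the table excerpt in the first post: the header row contains no `<a>` tag, so `row.find('a')` returns `None` for it, and indexing `None` with `['href']` raises exactly this TypeError under Python 2. The loop above also never appends anything to `myData`. The following is a minimal sketch of one way around both issues, assuming Python 3 and bs4 with the built-in `html.parser`. The HTML is a trimmed stand-in for the real page, and `find('table')` replaces `find_all('table')[1]` only because this sample contains a single table.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the table in the thread: one header row, one data row.
html = """
<table>
 <tr><td><b>Rank</b></td><td><b>Rating</b></td><td><b>Title</b></td><td><b>Votes</b></td></tr>
 <tr>
  <td><b>1.</b></td>
  <td>9.2</td>
  <td><a href="/title/tt0111161/">The Shawshank Redemption</a> (1994)</td>
  <td>900,907</td>
 </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for row in soup.find("table").find_all("tr"):
    link = row.find("a")
    if link is None:
        # Header row (or any other row without a link): nothing to extract.
        continue
    movie_id = link["href"].split("/")[-2]
    # stripped_strings yields each text fragment with surrounding whitespace removed.
    rows.append([movie_id] + list(row.stripped_strings))

for fields in rows:
    print("\t".join(fields))
```

Each entry in `rows` starts with the ID followed by the rank, rating, title, year, and votes, so it can be joined with tabs and written to the output file the same way as before.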
  6. #4
  7. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,843
    Rep Power
    480
    I installed bs4. Please post your complete code. If possible, let's start with
    Code:
    data=('''              <table border="1" cellpadding="4" cellspacing="0">
                   <tr bgcolor="#FFFFDB">
                    <td align="center">
                     <font face="Arial, Helvetica, sans-serif" size="-1">
                      <b>
                       Rank
                      </b>
                     </font>
                    </td>
                    <td align="center">
                     <font face="Arial, Helvetica, sans-serif" size="-1">
                      <b>
                       Rating
                      </b>
                     </font>
                    </td>
                    <td>
                     <font face="Arial, Helvetica, sans-serif" size="-1">
                      <b>
                       Title
                      </b>
                     </font>
                    </td>
                    <td align="right">
                     <font face="Arial, Helvetica, sans-serif" size="-1">
                      <b>
                       Votes
                      </b>
                     </font>
                    </td>
                   </tr>
                   <tr bgcolor="#e5e5e5" valign="top">
                    <td align="right">
                     <font face="Arial, Helvetica, sans-serif" size="-1">
                      <b>
                       1.
                      </b>
                     </font>
                    </td>
                    <td align="center">
                     <font face="Arial, Helvetica, sans-serif" size="-1">
                      9.2
                     </font>
                    </td>
                    <td>
                     <font face="Arial, Helvetica, sans-serif" size="-1">
                      <a href="/title/tt0111161/">
                       The Shawshank Redemption
                      </a>
                      (1994)
                     </font>
                    </td>''')
  8. #5
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2013
    Posts
    27
    Rep Power
    0
    Ah, I fixed it! Thanks!
