#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    18
    Rep Power
    0

    When returning results from a page, it crashes


    I've got a problem where the program I'm creating crashes after a certain number of loop iterations.

    Code:
    import urllib.request
    
    data = urllib.request.urlopen('http://www.football-league.co.uk/page/DivisionalScorers/0,,10794~20127,00.html')
    
    e = data.read()
    
    m = e.decode('utf8')
    
    a = m.count('<tr class="rowDark">')
    b = m.count('<tr class="rowLight">')
    
    splitted_page = m.split('<div class="statistics">')
    splitted_page = splitted_page[1].split('</div>')
    
    for x in range(int(a+b)):
        splitted_page2 = splitted_page[0].split('<tr class="rowDark">')
        splitted_page2 = splitted_page2[1].split('</tr>')
        splitted_page3 = splitted_page2[0].split('<td style="text-align:center;">')
        splitted_page3 = splitted_page3[1].split('</td>')
        splitted_page4 = splitted_page2[0].split('<td>')
        splitted_page4 = splitted_page4[1].split('</td>')
        splitted_page5 = splitted_page4[0].split('>')
        splitted_page5 = splitted_page5[1].split('<')
    
        splitted_page6 = splitted_page[0].split('<tr class="rowLight">')
        splitted_page6 = splitted_page6[1].split('</tr>')
        splitted_page7 = splitted_page6[0].split('<td style="text-align:center;">')
        splitted_page7 = splitted_page7[1].split('</td>')
        splitted_page8 = splitted_page6[0].split('<td>')
        splitted_page8 = splitted_page8[1].split('</td>')
        splitted_page9 = splitted_page8[0].split('>')
        splitted_page9 = splitted_page9[1].split('<')
    
        print(splitted_page5[0].upper())
        print(splitted_page3[0])
        print(splitted_page9[0].upper())
        print(splitted_page7[0])
    
        splitted_page[0] = splitted_page[0].replace('<tr class="rowDark">','',1)
        splitted_page[0] = splitted_page[0].replace('<tr class="rowLight">','',1)
    It returns players' names and the number of goals they've scored, but after about 12 players I get this error:

    Traceback (most recent call last):
    File "E:\Matthew\Sixth Form\Computing\Coursework\program\link-to-web.py", line 32, in <module>
    splitted_page9 = splitted_page9[1].split('<')
    IndexError: list index out of range
    Any ideas?

    I've checked the webpage's source and there seems to be no change in pattern to what is being written.
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2009
    Posts
    487
    Rep Power
    33
    splitted_page9 = splitted_page9[1].split('<')
    IndexError: list index out of range
    The message should be self-explanatory. The list appears to contain only one item at that point, so test that its length is greater than one before this statement.
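    As a minimal sketch of that length test (the helper name text_inside_tag is hypothetical and not from the original program), the cell text can be pulled out only when the split really produced a second item. It assumes a cell holds either plain text or text wrapped in a single tag such as a link; the link URL below is a placeholder.

```python
def text_inside_tag(cell):
    """Return the visible text of a cell that is either plain text
    or text wrapped in one tag, e.g. '<a href="...">Name</a>'."""
    parts = cell.split('>')
    if len(parts) > 1:
        # A tag was present: the text sits between the first '>' and the next '<'.
        return parts[1].split('<')[0]
    # No '>' in the cell, so it is plain text and can be used as-is.
    return cell


# Placeholder URL; any link-wrapped cell behaves the same way.
print(text_inside_tag('<a target="_blank" href="http://example.invalid/">Glenn Murray</a>'))
print(text_inside_tag('Scott McDonald'))
```

    The second call is the case that raises IndexError in the original loop, because splitting plain text on '>' yields a list with a single item.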
  4. #3
  5. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,851
    Rep Power
    481
    Study the web page source carefully. I was planning to advocate a finite state machine using the html.parser module, but frankly, I've done a few web page parsing jobs recently and I've found splitting a better approach. I use J from www.jsoftware.com, not Python. The algorithm is independent of programming language. Follow the comments when confused.

    I removed CR and LF. The data contains three groups starting with <tbody>.
    The second of these contains the data you want. I split this portion of the data into groups starting with <tr. Let's examine the first 160 columns of the first 14 rows.
    Code:
        _ 160{.>14{.D
    <tr class="rowDark"><td><a target="_blank" href="http://www.player.cpfc.co.uk/page/ProfilesDetail/0,,10323~32552,00.html">Glenn Murray</a></td><td><a target="_b
    <tr class="rowLight"><td><a target="_blank" href="http://www.player.burnleyfootballclub.com/page/ProfilesDetail/0,,10413~49603,00.html">Charlie Austin</a></td><
    <tr class="rowDark"><td><a target="_blank" href="http://www.player.rovers.co.uk/page/ProfilesDetail/0,,10303~39589,00.html">Jordan Rhodes</a></td><td><a target=
    <tr class="rowLight"><td><a target="_blank" href="http://www.player.watfordfc.com/page/ProfilesDetail/0,,10400~53331,00.html">Matej Vydra</a></td><td><a target=
    <tr class="rowDark"><td><a target="_blank" href="http://www.player.blackpoolfc.co.uk/page/ProfilesDetail/0,,10432~52218,00.html">Tom Ince</a></td><td><a target=
    <tr class="rowLight"><td><a target="_blank" href="http://www.player.lcfc.com/page/ProfilesDetail/0,,10274~46979,00.html">Chris Wood</a></td><td><a target="_blan
    <tr class="rowDark"><td><a target="_blank" href="http://www.leedsunited.com/page/PlayerProfiles/0,,10273~44021,00.html">Luciano Becchio</a></td><td><a target="_
    <tr class="rowLight"><td><a target="_blank" href="http://www.player.lcfc.com/page/ProfilesDetail/0,,10274~22242,00.html">David Nugent</a></td><td><a target="_bl
    <tr class="rowDark"><td><a target="_blank" href="http://www.player.watfordfc.com/page/ProfilesDetail/0,,10400~39532,00.html">Troy Deeney</a></td><td><a target="
    <tr class="rowLight"><td><a target="_blank" href="http://www.player.bcfc.com/page/ProfilesDetail/0,,10412~8699,00.html">Marlon King</a></td><td><a target="_blan
    <tr class="rowDark"><td><a target="_blank" href="http://www.player.bcfc.co.uk/page/ProfilesDetail/0,,10327~34050,00.html">Steven Davies</a></td><td><a target="_
    <tr class="rowLight"><td><a target="_blank" href="http://www.player.seagulls.co.uk/page/ProfilesDetail/0,,10433~23201,00.html">Craig Mackail-Smith</a></td><td><
    <tr class="rowDark"><td><a target="_blank" href="http://www.player.dcfc.co.uk/page/FirstTeamProfilesDetail/0,,10270~36305,00.html">Jamie Ward</a></td><td><a tar
    <tr class="rowLight"><td>Scott McDonald</td><td><a target="_blank" href="http://www.mfc.co.uk">Middlesbrough</a></td><td style="text-align:center;">11</td></tr>
    Scott McDonald, the last of these, is different! I think that's where your code fails, as did my html.parser attempt.

    Next cut on '<td'
    Code:
       E =: (<;.1~ '<td'&E.)@> D  NB. cut on '<td'
       $E         NB. 340 scorers by name, club, goals.
    340 3
       F =: ({.~ '</'&({.@:I.@:E.))&.>E   NB. In each field find the first occurrence of '</'  and retain everything before it.
       [G=: (}.~ >:@:i:&'>')&.> F     NB. In each field keep everything after the rightmost '>'
    ...
    │Sean Murray            │Watford       │1 │
    ├───────────────────────┼──────────────┼──┤
    │Lloyd Doyley           │Watford       │1 │
    ├───────────────────────┼──────────────┼──┤
    │Danny Batth            │Wolves        │1 │
    ├───────────────────────┼──────────────┼──┤
    │Stephen Ward           │Wolves        │1 │
    ├───────────────────────┼──────────────┼──┤
    │Matt Doherty           │Wolves        │1 │
    ├───────────────────────┼──────────────┼──┤
    │Richard Stearman       │Wolves        │1 │
    └───────────────────────┴──────────────┴──┘
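    For readers who do not use J, the same three steps (cut on '<td', keep everything before the first '</', keep everything after the rightmost '>') can be sketched in Python. This is an illustrative translation, not the poster's code; the function name parse_rows is hypothetical, the first sample row is invented, and the second is the Scott McDonald row shown earlier.

```python
def parse_rows(table_html):
    """Split a table body into rows, then each row into cleaned cell texts."""
    rows = []
    # Cut on '<tr'; the piece before the first row is discarded.
    for row in table_html.split('<tr')[1:]:
        fields = []
        # Cut on '<td'; the piece before the first cell is discarded.
        for cell in row.split('<td')[1:]:
            # Retain everything before the first '</'.
            cell = cell.split('</', 1)[0]
            # Keep everything after the rightmost '>'.
            fields.append(cell.rsplit('>', 1)[-1])
        rows.append(fields)
    return rows


sample = (
    '<tbody>'
    # Invented row with a linked player name (placeholder data).
    '<tr class="rowDark"><td><a target="_blank" href="http://example.invalid/p">Example Player</a></td>'
    '<td><a target="_blank" href="http://example.invalid/c">Example FC</a></td>'
    '<td style="text-align:center;">3</td></tr>'
    # Row copied from the page source: the player name has no link.
    '<tr class="rowLight"><td>Scott McDonald</td>'
    '<td><a target="_blank" href="http://www.mfc.co.uk">Middlesbrough</a></td>'
    '<td style="text-align:center;">11</td></tr>'
    '</tbody>'
)

for name, club, goals in parse_rows(sample):
    print(name.upper(), club, goals)
```

    Because each field is trimmed from both ends rather than indexed by position, linked and unlinked names go through the same path.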
    Summary: now that I understand this web page, I could succeed with the html.parser approach. However, I had to solve the problem in a way that made the data easy to visualize, so why should I rewrite the solution in Python? Many web pages are flawed, and missing closing tags are common. Maybe the strict=False keyword argument helps.

    parser = MyHTMLParser(strict=False)
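    A rough sketch of what the html.parser approach could look like follows. The class name CellCollector is hypothetical, and the sketch assumes every cell of interest is a <td> inside a <tr>. Note that the strict argument was deprecated in Python 3.3 and removed in Python 3.5, so the sketch does not pass it.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell texts per <tr>
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag == 'td' and self.rows:
            self._in_cell = True
            self.rows[-1].append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_cell = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the current cell.
        if self._in_cell:
            self.rows[-1][-1] += data


collector = CellCollector()
collector.feed(
    '<tr class="rowLight"><td>Scott McDonald</td>'
    '<td><a target="_blank" href="http://www.mfc.co.uk">Middlesbrough</a></td>'
    '<td style="text-align:center;">11</td></tr>'
)
collector.close()
print(collector.rows)  # [['Scott McDonald', 'Middlesbrough', '11']]
```

    Since the text is gathered from inside the cell regardless of any nested tags, a name with or without a link ends up in the same place.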
    Last edited by b49P23TIvg; March 14th, 2013 at 12:45 PM.
    [code]Code tags[/code] are essential for python code and Makefiles!
  6. #4
  7. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    18
    Rep Power
    0
    Originally Posted by b49P23TIvg
    Scott McDonald, the last of these, is different! I think that's where your code fails, as did my html.parser attempt
    Unfortunately, this is where my code doesn't fail. It fails just after Craig Mackail-Smith. I know it is saying that my list index is out of range, but I'm not sure why that would happen right after assigning the list.

    EDIT: Upon further inspection, I believe that this is the case. Perhaps a try/except statement would be useful?
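    A try/except can work here. The original loop handles one rowDark and one rowLight row per pass and prints only at the end of the pass, so a failure on Scott McDonald's row also appears to suppress Jamie Ward's output. That would explain why Craig Mackail-Smith is the last name printed. The sketch below (the function name extract_names is hypothetical) catches the IndexError per cell and records the cells that did not match, instead of stopping.

```python
def extract_names(cells):
    """Return (names, skipped): names pulled from link-wrapped cells,
    and the raw cells that did not fit the expected pattern."""
    names, skipped = [], []
    for cell in cells:
        try:
            # Same split logic as the original program.
            names.append(cell.split('>')[1].split('<')[0])
        except IndexError:
            # No '>' in the cell, so there is no second item to index.
            skipped.append(cell)
    return names, skipped


cells = [
    '<a target="_blank" href="http://www.player.seagulls.co.uk/page/ProfilesDetail/0,,10433~23201,00.html">Craig Mackail-Smith</a>',
    'Scott McDonald',
]
names, skipped = extract_names(cells)
print(names)    # ['Craig Mackail-Smith']
print(skipped)  # ['Scott McDonald']
```

    Printing the skipped cells shows exactly which rows break the pattern, which is often more useful than silently ignoring the exception.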
