#1
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2012
    Posts
    1
    Rep Power
    0

    Extracting a table from the Web


    Hey!

    I have recently begun learning to program in Python (version 2.5.4) on Mac OS X. As part of my learning, I have been assigned the task of extracting a web table, specified below, with Python and converting it into a format that Microsoft Excel can open with the data placed neatly into cells.

    The link below has one table with statistics about the National Hockey League (NHL).

    nhl.com/ice/gamestats.htm?season=20112012&gameType=2&team=&viewName=summary

    I have been reading about ways to complete the task, but I realize that people with much more experience using Python may be able to help me more than the books.

    If anyone has code that does just this and can be adapted to the particular website I need to work with, can offer guidance on writing the code, or can point me to texts that will help me write it, that would be greatly appreciated! Thanks in advance!
#2
    Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,709
    Rep Power
    480
    Looks like there are several tables on that page. Five tables? (search the page source for "<table")
    The most obvious one says that it shows rows 1-30 of 1230 results. Do you need all 1230?

    You might try Python's csv library module to write a file that Excel can read.
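
A minimal sketch of the csv idea, assuming the table has already been pulled out as a list of rows. The helper name `rows_to_csv` and the sample rows are made up for illustration (they are placeholders, not real NHL numbers). It is written for Python 3; on Python 2.5 the csv module works the same way, but a real output file should be opened in binary mode ("wb").

```python
import csv
import io


def rows_to_csv(rows):
    """Return the given rows as CSV text (hypothetical helper)."""
    buf = io.StringIO()
    writer = csv.writer(buf)  # default dialect is "excel"
    writer.writerows(rows)
    return buf.getvalue()


# Placeholder data standing in for whatever was scraped from the page.
sample_rows = [
    ["Team", "GP", "W"],
    ["Example A", 82, 50],
]

csv_text = rows_to_csv(sample_rows)
print(csv_text)

# To produce a file Excel can open, write to a real file instead, e.g.
# with open("stats.csv", "w", newline="") as f:   # file name is an assumption
#     csv.writer(f).writerows(sample_rows)
```

The writer quotes any cell that contains a comma, so team names or notes with commas in them should still land in a single Excel column.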

    The Python libraries are also packed with HTML functionality. There could be a parser that, as one of its features, identifies tables. I myself would do something stupid like write my own code to parse the page source and find the table rows <tr>blah blah blah</tr>,
    where the stuff in between is table data <td>information</td>,
    but hey, that may account for my being unemployed.
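
The <tr>/<td> approach can be sketched with the standard library's HTML parser rather than hand-rolled string searching. This is only a rough illustration: the class name `TableRowParser` and the sample HTML are invented, fetching the page is left out, and it assumes simple, non-nested tables. It gathers rows from every table it sees, so on a page with several tables you would still have to pick out the one you want. Written for Python 3; on Python 2.5 the same class lives in the `HTMLParser` module.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up page source standing in for the real one.
sample_html = """
<table>
  <tr><th>Team</th><th>GP</th></tr>
  <tr><td>Example A</td><td>82</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(sample_html)
parser.close()
table_rows = parser.rows
print(table_rows)
```

Each entry in `table_rows` is one table row as a list of strings, which is a shape that can be handed straight to a CSV writer.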
    [code]Code tags[/code] are essential for python code and Makefiles!
