November 23rd, 2013, 05:34 PM
Basic web scraping of DOM elements with no ID or Name?
I am looking to parse and pull a few elements from many pages of this website (for record keeping and data tracking/analysis).
I have already written a fully working script that does what I need, but the parsing component is inefficient and not very robust. (Essentially, I open each page manually using IEC.py, dump the entire page HTML into one long string, and then parse out the data with a series of clumsy string searches.)
I've looked into a few other libraries and options for headless browsing, but I've run into a few complications that I don't know how to work around:
1. The DOM elements I'm looking to scrape have no IDs or names. They are almost all just table cells marked up as <td class="View">. (I'm not an expert on navigating the DOM, though, so maybe there's another easy way to identify these?)
I don't need an elaborate solution -- I'm just pulling a few short strings from a simple page -- but I'd at least like a DOM-aware solution that can scrape my target data based on tags instead of by making clumsy string searches.
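To make the goal concrete, here is a rough sketch of the kind of tag-based extraction I have in mind, using only Python 3's standard-library html.parser. The class name ViewCellParser, the function extract_view_cells, and the sample HTML are all made up for illustration; the real pages may be structured differently (for example, the target cells might contain nested markup that needs different handling).

```python
from html.parser import HTMLParser


class ViewCellParser(HTMLParser):
    """Collect the text of every <td> whose class list includes "View".

    Hypothetical helper for illustration, not part of any library.
    """

    def __init__(self):
        super().__init__()
        self.cells = []    # finished cell texts, in document order
        self._depth = 0    # > 0 while inside a matching <td>
        self._buffer = []  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag != "td":
            return
        if self._depth:
            # A <td> nested inside a matching cell: track it so the
            # matching cell only ends at its own closing tag.
            self._depth += 1
        elif "View" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "td" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.cells.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_view_cells(html):
    """Return the stripped text of each <td class="View"> in the HTML string."""
    parser = ViewCellParser()
    parser.feed(html)
    parser.close()
    return parser.cells


# Made-up sample standing in for the page HTML I already dump to a string.
sample = """
<table>
  <tr><td class="Label">Name</td><td class="View">Widget A</td></tr>
  <tr><td class="Label">Price</td><td class="View"> 19.99 </td></tr>
</table>
"""
print(extract_view_cells(sample))
```

This selects cells by tag and class rather than by position in the raw string, so it should survive small layout changes. A third-party parser such as BeautifulSoup or lxml would probably make the same selection shorter, but I haven't settled on one.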
Any specific approaches or library / technique recommendations would be appreciated.