#1
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2012
    Posts
    80
    Rep Power
    3

    Web-Page Data Mining With Python 3.2?


    So, I've been interested for a LONG, LONG time in copying data off of websites and storing it externally en masse. My language of choice, of course, is Python.

    I don't know if there is a good library out there to help me do this, or even if the standard library could do it.

    One particular example, out of about five databases I've been ogling for quite some time, is BibleGateway.

    I wanted to write a program to pull information off their database, effectively creating a little Python Bible applet.

    Dictionary.com, maybe Google... I've also wanted to find some way to make a searchable hardcopy backup of some Facebook data (my own, of course) for quite some time now. I just don't even know what point A would be.

    Does anyone know how I might do this with Python?
  2. #2
  3. Recovering Intellectual
    Devshed Beginner (1000 - 1499 posts)

    Join Date
    Jun 2006
    Location
    Orange County, CA
    Posts
    1,306
    Rep Power
    785
    How do you propose to connect to their database? I assume you mean screen scraping the data out of the HTML, right?
    Bugs that go away by themselves come back by themselves
    Beware - your loyalty will not be rewarded
  4. #3
  5. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,996
    Rep Power
    481
    You will need a computer for your database.

    Comments on this post

    • Matt1776 agrees : lol - holy CRAP!! Google is the matrix
    [code]Code tags[/code] are essential for python code and Makefiles!
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2012
    Posts
    114
    Rep Power
    4
    I haven't used it for anything non-trivial myself, but BeautifulSoup is a well-regarded library for scraping HTML.
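    BeautifulSoup is a third-party package, so as a rough illustration of the same idea using only the standard library, here is a minimal sketch built on html.parser. The names LinkCollector and extract_links and the sample markup are made up for this example; BeautifulSoup's own API is different and is generally more forgiving of broken HTML.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return all link targets found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical sample markup, standing in for a downloaded page
sample = '<p><a href="/one">One</a> and <a href="/two">Two</a></p>'
print(extract_links(sample))
```

    The same pattern (react to the tags you care about, ignore the rest) extends to pulling out table cells or headings by overriding handle_data as well.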
  8. #5
  9. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    43
    Rep Power
    3
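    Code:
    import urllib.request

    itemNo = 1  # placeholder item number; substitute a real listing ID
    itemPage = "http://www.gaiaonline.com/marketplace/itemdetail/" + str(itemNo)
    # In Python 3, urlopen lives in urllib.request (urllib.urlopen is Python 2 only)
    f = urllib.request.urlopen(itemPage)
    s = f.read().decode("utf-8", errors="replace")  # read() returns bytes in Python 3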
    This was the first step of a script I wrote a long time ago to study marketplace listings on GaiaOnline.

    After I opened the webpage and got its data, I just parsed it using some string manipulations. Once I got the data I wanted cleaned up, I saved it into text files that could be further studied by my matplotlib graph plotter scripts.

    This would be a good place to start.
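    As a rough sketch of the "string manipulations" step, assuming the value you want sits between two known markers in the page text: the helper names (extract_between, save_value), the sample markup, and the markers below are all hypothetical, not taken from any real page.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between the first start_marker and the next end_marker, or None."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end].strip()


def save_value(path, value):
    """Append one value per line to a text file for later analysis."""
    with open(path, "a", encoding="utf-8") as out:
        out.write(value + "\n")


# Hypothetical page content, standing in for the string returned by read()
page = '<html><body><span class="price"> 1,250 gold </span></body></html>'
price = extract_between(page, '<span class="price">', "</span>")
print(price)
```

    Calling save_value("prices.txt", price) would then append the cleaned value to a text file. Marker-based slicing like this breaks as soon as the site changes its markup, which is the main reason to move to a real HTML parser once the script grows.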
    Last edited by eliskan; April 19th, 2013 at 05:35 PM.
