Thread: Beautifulsoup

    #1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2011
    Posts
    8
    Rep Power
    0

    I just started learning BeautifulSoup and got stuck on a little problem.

    So I have this HTML code:
    Code:
    <ol>
      <li><h2>Puppy One</h2>
      </li>
      <li><h2>Puppy Two</h2>
      </li>
      <li><h2>Puppy Three</h2>
      </li>
    </ol>
    I need to grab the data inside the <h2> tags, so Puppy One, Puppy Two, Puppy Three. But it needs to be returned as plain strings instead of a list of tags, like this:
    Code:
    ["Puppy One", "Puppy Two", "Puppy Three"]
    Right now, I have this code:
    Code:
    def get_puppies():
      page = BeautifulSoup(html)
      for tags in page.findAll('li'):
        puppy = tags.findAll('h2')
      return puppy
    It returns [<h2>Puppy Three</h2>], with the tags still surrounding the text, which I don't want. I played around with text=True, but I don't think I'm coding it right, because it's not working.
  2. #2
  3. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,897
    Rep Power
    481
    Have you considered using the plain old html.parser module?
    http://docs.python.org/3/library/html.parser.html
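
    For reference, a minimal sketch of what the html.parser route could look like in Python 3, using the markup from post #1. The class name H2Collector and its attributes are made up for illustration; only HTMLParser and its handle_* hooks come from the standard library.

```python
from html.parser import HTMLParser


class H2Collector(HTMLParser):
    """Collect the text found inside every <h2> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False   # True while the parser is inside an <h2>
        self.headings = []   # one string per <h2> seen so far

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.headings.append("")  # start a new, empty heading

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        # Text can arrive in several chunks, so append rather than assign.
        if self.in_h2:
            self.headings[-1] += data


html = """<ol>
  <li><h2>Puppy One</h2>
  </li>
  <li><h2>Puppy Two</h2>
  </li>
  <li><h2>Puppy Three</h2>
  </li>
</ol>"""

collector = H2Collector()
collector.feed(html)
collector.close()
print(collector.headings)  # ['Puppy One', 'Puppy Two', 'Puppy Three']
```

    This is more code than a BeautifulSoup one-liner, but it needs nothing outside the standard library.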
    [code]Code tags[/code] are essential for python code and Makefiles!
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Location
    Ranchos de Taos, NM
    Posts
    3
    Rep Power
    0
    Try the 'string' attribute of the tag. E.g.,

    Code:
    >>> from BeautifulSoup import BeautifulSoup
    >>> buff = "<h2>Hello World!</h2>"
    >>> tree = BeautifulSoup(buff)
    >>> tree
    <h2>Hello World!</h2>
    >>> tree.h2.string
    u'Hello World!'
    Last edited by Joseph8th; February 2nd, 2013 at 02:36 PM. Reason: typo
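
    Note that .string belongs to a single tag, not to the list of tags that a find-all call returns, so it has to be read from each tag in turn. A rough sketch of that idea applied to the markup from post #1, assuming the newer bs4 package (where findAll is spelled find_all) rather than the BeautifulSoup 3 import used above; the markup parameter on get_puppies is also an addition for illustration:

```python
from bs4 import BeautifulSoup  # assumption: bs4, not the older BeautifulSoup 3 module

html = """<ol>
  <li><h2>Puppy One</h2>
  </li>
  <li><h2>Puppy Two</h2>
  </li>
  <li><h2>Puppy Three</h2>
  </li>
</ol>"""


def get_puppies(markup):
    """Return the text of every <h2> as a list of plain strings."""
    page = BeautifulSoup(markup, "html.parser")
    # find_all returns a list of tags; .string is read from each tag in turn.
    # str() turns the NavigableString into an ordinary string. This assumes
    # each <h2> holds only text; .string is None for tags with mixed children.
    return [str(h2.string) for h2 in page.find_all("h2")]


puppies = get_puppies(html)
print(puppies)  # ['Puppy One', 'Puppy Two', 'Puppy Three']
```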
  6. #4
  7. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2011
    Posts
    8
    Rep Power
    0
    @b49P23TIvg - I'm reading up on it now.

    But for this, I've managed to get it to show:
    Code:
    [<h2>Puppy One</h2>, <h2>Puppy Two</h2>, <h2>Puppy Three</h2>]
    However, I haven't been able to get rid of the <h2> tags.
    I know I'm supposed to return it as a JSON string to get the double quotes, but I'm still trying to figure that part out.
    Here's my code now.
    Code:
    def get_puppies():
      page = BeautifulSoup(html)
      puppy = page.findAll('h2')[:3]
      return puppy
    I know this is probably a bad way to do it, because on a longer page with a lot of other tags I would run into problems, but it works for this, so I'll take it for now. I just need to figure out how to return it like this:
    Code:
    ["Puppy One", "Puppy Two", "Puppy Three"]
    @Joseph8th - I tried .string as well as .text and extract(), but I got errors for all of them, which is really weird.

    Do any of you know what I'm doing wrong?
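
    On the JSON part of the question, a small sketch of how the double-quoted form could be produced with the standard json module. It assumes the heading text has already been extracted into plain strings; the list literal below is only a stand-in for that result.

```python
import json

# Assumed input: the <h2> text already extracted as plain strings.
puppies = ["Puppy One", "Puppy Two", "Puppy Three"]

# json.dumps serializes the list to a JSON string, which uses double quotes.
as_json = json.dumps(puppies)
print(as_json)  # ["Puppy One", "Puppy Two", "Puppy Three"]
```

    json.dumps only accepts basic types such as str, so passing it the tag objects themselves would raise a TypeError; the text has to be pulled out first.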
