Thread: regexp help

    #1
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2005
    Posts
    34
    Rep Power
    0

    regexp help


    I want to extract the text between the title tags in an HTML file, but both of the methods below return nothing:

    Code:
    import re
    
    prices = open('index.html').read()
    re.findall('<title>(.+?)</title>', prices)
    
    raw_input("the end")
    Code:
    import re
    
    prices = open('index.html').read()
    re.findall('<.*>(.+?)<.*>', prices)
    
    raw_input("the end")
    I am pretty sure it's a regexp thing, but I really do not know why they are not working.
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2004
    Posts
    40
    Rep Power
    10
    I'm pretty new to Python myself, so I'm not sure about the regexp thing, but this is how I would do it:

    Code:
    prices = open('index.html').read()
    pos = prices.find("<TITLE>")  ##FIND TITLE
    pos1 = prices.find("</TITLE>")  ##FIND TITLE CLOSE TAG
    title = prices[pos:pos1]
    print title
    title = title[7:]  ##REMOVE TITLE OPENING TAG FROM STRING 
    print title
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2005
    Posts
    34
    Rep Power
    0
    The only problem is that the tags could be in lower case or caps,

    i.e. <TITLE> or <title>.

    How could I get round this?
  6. #4
  7. Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Feb 2005
    Posts
    610
    Rep Power
    65



    Change your string (I think it is named prices) to all lower case before you use find, and then search for the lower-case tag, e.g. "<title>".
    Code:
    # change the string to all lower case characters
    prices = prices.lower()
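
One side effect of lower-casing the whole page is that the title text comes out lower-cased too. A minimal sketch of a way round that, assuming ordinary ASCII HTML: search a lower-cased copy, then slice the original string with the same offsets. The helper name extract_between and the sample string are made up for illustration, and the code is written for Python 3 (print is a function), unlike the Python 2 snippets in this thread.

```python
def extract_between(text, open_tag, close_tag):
    """Return the text between open_tag and close_tag, ignoring tag case.

    Returns None if either tag is missing.
    """
    lowered = text.lower()  # search copy only; the original is left alone
    start = lowered.find(open_tag.lower())
    if start == -1:
        return None
    start += len(open_tag)  # skip past the opening tag itself
    end = lowered.find(close_tag.lower(), start)
    if end == -1:
        return None
    # lower() keeps the length for ordinary ASCII HTML, so the offsets line up
    return text[start:end]


html = "<html><head><TITLE>My Page</TITLE></head></html>"
print(extract_between(html, "<title>", "</title>"))
```

Checking for -1 matters because find does not raise an error when the tag is missing; it just returns -1, which would otherwise be used as a slice index.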
  8. #5
  9. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2005
    Posts
    34
    Rep Power
    0
    I also found another problem:

    sometimes the search string has a " in it, which seems to stop it working, e.g.

    pos = f.find('"keywords" content="')
    pos1 = f.find('"<meta name="description"')
  10. #6
  11. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2004
    Posts
    40
    Rep Power
    10
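    If you surround your search phrase with triple quotes, it can contain both kinds of quote character. Don't leave any spaces just inside the quotes, though, because they become part of the search string:
    '''"keywords" content="'''

    Code:
    pos = f.find('''"keywords" content="''')
    pos1 = f.find('"<meta name="description"')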
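
For what it's worth, a single-quoted Python string can already hold double quotes, so the three spellings below should all produce the same search string. This is only a small sketch with a made-up sample line (page), written for Python 3; it also shows how stray spaces inside the quotes stop find from matching.

```python
page = '<meta name="keywords" content="python, regexp">'

single = '"keywords" content="'        # double quotes inside single quotes
triple = '''"keywords" content="'''    # the same string, triple-quoted
escaped = "\"keywords\" content=\""    # the same string, using escapes

# All three literals are the same string.
print(single == triple == escaped)

# find() returns the index where the match starts.
print(page.find(single))

# Trailing spaces are part of the search string, so this one is not found
# and find() returns -1.
padded = '"keywords" content="  '
print(page.find(padded))
```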
  12. #7
  13. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2005
    Posts
    34
    Rep Power
    0
    Ah, I am getting somewhere now, cheers guys. One more question: how do I remove two characters from the end of a string?

    Also, the strings in the find function have to be unique, right?

    i.e. when the string is like </title> it works, but if I try something like "> it will not work.
  14. #8
  15. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2004
    Posts
    40
    Rep Power
    10
    Once you have your data in the string, you can remove characters from the end with a slice, counting either from the beginning or from the end:

    string = string[:9]
    keeps the first 9 characters

    string = string[:-1]
    keeps everything except the last character (use [:-2] to drop the last two)

    hope this helps
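
On the uniqueness question: find does not need a unique string, it simply returns the index of the first match (or -1 if there is none). A rough sketch, with made-up sample strings and Python 3 syntax, of trimming two characters and of passing a start position to find so that a common string like "> is matched after a known anchor:

```python
s = 'content="python, regexp">'

# Drop the last two characters (here the closing quote and bracket).
trimmed = s[:-2]
print(trimmed)

page = '<meta name="a" content="x"><meta name="keywords" content="python">'

# Without a start position, find() reports the first "> on the page.
first = page.find('">')

# To reach a later match, locate something unique first and then
# start the search for "> from that position.
anchor = page.find('name="keywords"')
after_anchor = page.find('">', anchor)

print(first, after_anchor)
print(page[anchor:after_anchor])
```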
  16. #9
  17. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
    Originally Posted by sitepoint
    I want to extract the text between the title tags in an HTML file, but both of the methods below return nothing:

    Code:
    import re
    
    prices = open('index.html').read()
    re.findall('<title>(.+?)</title>', prices)
    
    raw_input("the end")
    Code:
    import re
    
    prices = open('index.html').read()
    re.findall('<.*>(.+?)<.*>', prices)
    
    raw_input("the end")
    I am pretty sure it's a regexp thing, but I really do not know why they are not working.
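    The problem could be down to a number of things. The first is case: your pattern only matches a lower-case <title>. The second is that there may be a newline character somewhere in your title, and by default . does not match a newline, so the match fails. Try this, where re.I ignores case and re.S lets . match newlines: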

    Code:
    >>> import re
    >>> # aString holds the page source, e.g. aString = open('index.html').read()
    >>> titleRegex = re.compile('<title>(.*?)</title>', re.I | re.S)
    >>> titleRegex.findall(aString)
    ['Page']
    >>> re.findall('<title>(.*?)</title>', aString, re.I | re.S)
    ['Page']
    >>>
    Mark.
    programming language development: www.netytan.com Hula
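
One more thing that may be worth checking: the interactive prompt echoes the list that findall returns, but a script does not, and the scripts in the first post never print or store the result. A minimal script-style sketch, assuming Python 3 and a made-up sample page with an upper-case tag and a newline in the title:

```python
import re

html = """<html><head>
<TITLE>My
Page</TITLE>
</head><body></body></html>"""

# re.I ignores case in the tag names; re.S lets "." match newlines too.
title_re = re.compile(r"<title>(.*?)</title>", re.I | re.S)

# findall only returns the list; a script has to print it to see anything.
titles = title_re.findall(html)
print(titles)

# Without the flags the same pattern finds nothing in this sample.
plain = re.findall(r"<title>(.*?)</title>", html)
print(plain)
```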

  18. #10
  19. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2005
    Posts
    78
    Rep Power
    10
    Originally Posted by sitepoint
    I want to extract the text between the title tags in an HTML file, but both of the methods below return nothing:
    There's a better way to get a title:
    Code:
    from htmllib import HTMLParser
    import formatter
    import urllib
    
    PROXY = "http://MYPROXYADDRESS:MYPORT"
    SITE = "http://MYWEBADDRESS"
    
    opener = urllib.FancyURLopener({'http': PROXY})
    page = opener.open(SITE).read()
    parser = HTMLParser(formatter.NullFormatter())
    parser.feed(page)
    print parser.title
    --OH.
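
Note that htmllib only exists in Python 2; it was removed in Python 3. A rough equivalent for Python 3, sketched with the standard html.parser module: the class name TitleParser and the sample HTML are made up for illustration, and it parses a string rather than fetching a page through a proxy.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased, so <TITLE> and <title> both match.
        if tag == "title" and self.title is None:
            self.in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append rather than assign.
        if self.in_title:
            self.title += data


parser = TitleParser()
parser.feed("<html><head><TITLE>My Page</TITLE></head></html>")
parser.close()
print(parser.title)
```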
  20. #11
  21. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2005
    Posts
    34
    Rep Power
    0
    The above code ^ may print the title, but can it also print the meta description and keywords?
  22. #12
  23. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2005
    Posts
    78
    Rep Power
    10
    Originally Posted by sitepoint
    The above code ^ may print the title, but can it also print the meta description and keywords?
    An easyish addition:
    Code:
    from htmllib import HTMLParser
    import formatter
    import urllib
    
    PROXY = "http://MYPROXYADDRESS:MYPORT"
    SITE = "http://MYWEBADDRESS"
    
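    class MyParser(HTMLParser):
        def do_meta(self, attrs):
            # attrs is a list of (name, value) pairs for the <meta> tag
            d = dict(attrs)
            try:
                name = d["name"]
                content = d["content"]
            except KeyError:
                # not a name/content style <meta> tag, so skip it
                return
            # store e.g. <meta name="keywords" ...> as self.meta_keywords
            setattr(self, "meta_" + name, content)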
    
    opener = urllib.FancyURLopener({'http': PROXY})
    page = opener.open(SITE).read()
    parser = MyParser(formatter.NullFormatter())
    parser.feed(page)
    print "Title is", parser.title
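    # getattr with a default avoids an AttributeError if the page lacks the tag
    print "Keywords are", getattr(parser, "meta_keywords", None)
    print "Description is", getattr(parser, "meta_description", None)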
    --OH.
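
For Python 3, where htmllib is no longer available, the same idea can be sketched with html.parser. This is only an illustration: the class name MetaParser, the meta dictionary and the sample page are assumptions, and the values go into a dict instead of being set as attributes so that a missing tag just yields None.

```python
from html.parser import HTMLParser


class MetaParser(HTMLParser):
    """Collect <meta name="..." content="..."> pairs into a dict."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        d = dict(attrs)  # attribute names arrive lower-cased
        name = d.get("name")
        content = d.get("content")
        # Skip <meta> tags that do not have both name and content.
        if name is not None and content is not None:
            self.meta[name.lower()] = content


page = """<head>
<meta name="Keywords" content="python, regexp">
<meta name="description" content="A sample page">
<meta charset="utf-8">
</head>"""

parser = MetaParser()
parser.feed(page)
parser.close()
print(parser.meta.get("keywords"))
print(parser.meta.get("description"))
```

Lower-casing the name means name="Keywords" and name="keywords" end up under the same key.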
