#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2013
    Posts
    2
    Rep Power
    0

    Regex help required


    Hi,

    I am stuck. After trying my luck with regex on my own, I decided to give up and ask in the forum.

    I have HTML pages from which I need to extract the header text.

    =============================================
    E.g.

    The following are samples of the HTML pages that I have:
    Application - First Page
    -------------------------
    <div class="upperForm">
    <span class="upperFormLeft"></span>
    <span class="upperFormRight"></span>
    <div id="divConsentHeader" class="headerLabel" name="divConsentHeader">
    <h2>First Page</h2>
    <br>
    </div>
    </div>

    Application - Second Page
    -------------------------
    <div class="upperForm">
    <span class="upperFormLeft"></span>
    <span class="upperFormRight"></span>
    <h2 class="headerLabel">
    <span id="detailsLabel">Details Label</span>
    </h2>

    <div class="upperForm">
    <span class="upperFormLeft"></span>
    <span class="upperFormRight"></span>
    <h2 class="headerLabel">
    <span id="infoLabel">Information</span>
    </h2>
    </div>

    Application - Third Page
    -------------------------
    <div class="upperForm">
    <span class="upperFormLeft"></span>
    <span class="upperFormRight"></span>
    <div>
    <h2 class="headerLabel">
    <span id="summaryLabel">Summary</span>
    </h2>


    Application - Fourth Page
    -------------------------
    <div class="upperForm">
    <span class="upperFormLeft"></span>
    <span class="upperFormRight"></span>
    <h2 id="heading" class="headerLabel">Start</h2>
    </div>


    Application - Fifth Page
    -------------------------
    <div class="upperForm">
    <span class="upperFormLeft"></span>
    <span class="upperFormRight"></span>
    <h2 class="headerLabel" name="detailsLabel">Details</h2>
    </div>

    =============================================

    I need to fetch the header text out of these HTML pages. The one thing that appears above every header is "<span class="upperFormRight"></span>".

    However, after that each page uses different tags: some have the text directly in an H2, some wrap it in a SPAN, and the attributes differ.

    Is there any way to use a regex to pull just the text out of these pages, that is, Details, Start, Summary, etc.?

    Please help..
  2. #2
  3. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,957
    Rep Power
    1046
    You think a regex solves your problem? Think twice!

    Too long, didn't read: What you want is an HTML parser. Because that's what you wanna do, parse HTML.

    Since all elements have an ID or a name attribute (or a sibling with one), this is a piece of cake for an HTML parser.

    Yeah, I know, everybody loves regexes. But sometimes they're just not the right tool.
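
As a rough illustration of the parser approach, here is a minimal sketch using Python's standard-library `html.parser`. It assumes, based on the five samples in the first post, that the header text always sits inside an element carrying the class `headerLabel`. The names `HeaderLabelParser` and `extract_headers` are made up for this example and do not come from the thread.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not affect the nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class HeaderLabelParser(HTMLParser):
    """Collects the text inside every element with class "headerLabel"."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside the current headerLabel element
        self.parts = []     # text fragments seen inside the current element
        self.headers = []   # finished header strings

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif "headerLabel" in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.parts = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Collapse whitespace and skip elements that held no text.
            text = " ".join("".join(self.parts).split())
            if text:
                self.headers.append(text)

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_headers(html):
    """Return the header texts found in an HTML string, in document order."""
    parser = HeaderLabelParser()
    parser.feed(html)
    parser.close()
    return parser.headers
```

Calling `extract_headers` on the fourth sample page should yield a one-element list containing `Start`. The depth counter is what lets the same code cope with the text being directly in the `h2`, wrapped in a `span`, or nested under a `div`.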
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2013
    Posts
    2
    Rep Power
    0
    Originally Posted by Jacques1

    Too long, didn't read: What you want is an HTML parser. Because that's what you wanna do, parse HTML.

    Since all elements have an ID or a name attribute (or a sibling with one), this is a piece of cake for an HTML parser.

    Yeah, I know, everybody loves regexes. But sometimes they're just not the right tool.
    Thanks Jacques1 for quick response.

    The problem is that the tool which reads those pages only offers "Find/Replace" and "Regex" methods, so I can't use an HTML parser.

    We want that tool to get the header text out of the HTML pages.

    I know regex is a poor fit for this, but I still hoped one of the masters here could help me figure out a solution.
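
For a tool limited to regex, one possible starting point is to anchor on the `headerLabel` class, which appears in all five samples above, rather than on the `upperFormRight` span. The sketch below uses Python's `re` module only to demonstrate the pattern; the regex flavor of the actual tool is not stated in the thread, so the syntax may need adjusting, and the name `find_headers` is hypothetical.

```python
import re

# Capture group 1 holds the header text.
HEADER_TEXT = re.compile(
    r'class="headerLabel"[^>]*>'   # rest of the opening tag that carries the class
    r'(?:\s*<[^>]+>)*'             # skip any tags nested before the text
    r'\s*([^<]+?)\s*<'             # first run of text, without surrounding whitespace
)


def find_headers(html):
    """Return the text following each class="headerLabel" tag."""
    return HEADER_TEXT.findall(html)
```

This is deliberately fragile. It expects the class attribute to be written exactly as `class="headerLabel"`, and if a `headerLabel` element is empty it will skip ahead and pick up the next text on the page. Those are the kinds of edge cases a real parser handles and a pattern like this does not.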
