#1
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jan 2017
    Posts
    845
    Rep Power
    0

    Regex To Extract Content Visible To Webpage Visitor's Browser


    Php Folks,

    What is the regex to extract only the content of a webpage that a user sees in his browser, regardless of what a web crawler (search engine spider or bot) sees?
    That is, it should ignore the following tags and any data in between them. The exception is the <body> and </body> tags: filtering out everything between those would make the extraction useless.

    title
    title tags
    meta keywords
    meta keywords tags
    meta descriptions
    meta descriptions tags
    html tags
    dhtml tags
    xml tags
    javascript tags
    etc.

    If you know of PHP functions other than regex that do what I want, then say so by writing: OFF TOPIC.

    Thanks for your help!
  2. #2
  3. Wiser? Not exactly.
    Devshed God 2nd Plane (6000 - 6499 posts)

    Join Date
    May 2001
    Location
    Bonita Springs, FL
    Posts
    6,274
    Rep Power
    4193
    Originally Posted by UniqueIdeaMan
    What is the regex to extract the content of a webpage that is visible to the user in his browser no matter what is visible to the web crawler (searchengine spider or bot)?
    There isn't one; regex is the wrong tool for this task. You need a DOM parser, such as DOMDocument.
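
    As a rough illustration of the DOMDocument route, here is a minimal sketch. The helper name visibleText() and the list of skipped tags are assumptions made for this example, not part of PHP or of anything posted in this thread. It only drops elements that browsers never render as page text and says nothing about styling.

```php
<?php
// Sketch only: visibleText() is a hypothetical helper, not a built-in.
// Assumes $html is a non-empty string of (possibly sloppy) HTML.
function visibleText(string $html): string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; buffer parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // <title> and <meta> live in <head>, so starting at <body> excludes them.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Elements whose contents are not shown as page text (assumed list).
    foreach (['script', 'style', 'noscript', 'template'] as $tag) {
        // Copy the live node list to an array first; removing nodes while
        // iterating a live DOMNodeList would skip elements.
        foreach (iterator_to_array($body->getElementsByTagName($tag)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    // textContent concatenates the remaining text nodes; collapse whitespace runs.
    return trim(preg_replace('/\s+/u', ' ', $body->textContent));
}
```

    Note that textContent simply joins text nodes, so adjacent block elements with no whitespace between them in the source will run together.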

    If you only want your spider to see what a user sees, then excluding certain HTML tags is only part of the issue anyway. You also need to account for various CSS tricks like:
    Code:
    <div style="display: none;">Spam</div>
    
    <div style="visibility: hidden;">Spam</div>
    
    <div style="color: white; background: white;">Spam</div>
    
    <div style="width: 1px; height: 1px; overflow: hidden;">Spam</div>
    ...
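
    A sketch of how the first two tricks could be caught with DOMDocument and DOMXPath, assuming the hiding is done through an inline style attribute. Both function names are made up for this example. DOMDocument does not apply stylesheets, so rules in external or embedded CSS, the white-on-white case, and the 1px box case are not detected here; those need an actual rendering engine. The check also treats visibility: hidden as hiding all descendants, which is a simplification.

```php
<?php
// Sketch only: isHiddenInline() and visibleTextInline() are hypothetical helpers.

// True if the element's inline style sets display:none or visibility:hidden.
function isHiddenInline(DOMElement $el): bool
{
    $style = strtolower($el->getAttribute('style'));
    $pattern = '/(?:^|;)\s*(?:display\s*:\s*none|visibility\s*:\s*hidden)\b/';
    return preg_match($pattern, $style) === 1;
}

function visibleTextInline(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $parts = [];
    // Every text node under <body> that is not inside <script> or <style>.
    $query = '//body//text()[not(ancestor::script or ancestor::style)]';
    foreach ($xpath->query($query) as $text) {
        // Walk up the tree; a single hidden ancestor hides this text node.
        for ($node = $text->parentNode; $node instanceof DOMElement; $node = $node->parentNode) {
            if (isHiddenInline($node)) {
                continue 2; // move on to the next text node
            }
        }
        $parts[] = $text->nodeValue;
    }

    return trim(preg_replace('/\s+/u', ' ', implode(' ', $parts)));
}
```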

    Comments on this post

    • Catacaustic agrees
  4. #3
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jan 2017
    Posts
    845
    Rep Power
    0
    Originally Posted by kicken
    There isn't one; regex is the wrong tool for this task. You need a DOM parser, such as DOMDocument.

    If you only want your spider to see what a user sees, then excluding certain HTML tags is only part of the issue anyway. You also need to account for various CSS tricks like:
    Code:
    <div style="display: none;">Spam</div>
    
    <div style="visibility: hidden;">Spam</div>
    
    <div style="color: white; background: white;">Spam</div>
    
    <div style="width: 1px; height: 1px; overflow: hidden;">Spam</div>
    ...
    Yeah, I forgot to add the CSS to my list.
    In short, I need to extract only the content that the user would see. Text content only. Someone else also suggested DOM stuff, or a PHP library for handling and parsing HTML pages. I told him I'm gonna look into DOM. You guys really know your stuff. As for me, I'll never really be a real programmer. Just a basic one who manages to learn a little and uses what little talent he has to build his dream websites, earn money and website traffic virally, and help others achieve the same.
    I've given up on the idea of ever being as good as PHP freelancers.
    That said, I'll see how I fare with Python. Django is like cURL, I gathered, or deals with web development. I'm speaking from memory; I heard about Django probably over a year ago.
    But I ain't quitting PHP until I've finished my web proxy, membership site, social network, search engine and forum. The first two are finished.

    Just downloaded a lot of PHP YouTube tutorials on PayPal IPN and Bitcoin so I can integrate their gateways to accept online payments. I wonder how hard that will be. Will be checking out the vids tomorrow.
    In the meanwhile, Kicken and Catacaustic: Good Night!
    Last edited by UniqueIdeaMan; January 17th, 2018 at 04:03 PM.
