#1
  1. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jan 2017
    Posts
    845
    Rep Power
    0

    What Is The Regex To Deal With HTML Tags?


    PHP folks,


    What regex extracts the title, meta keywords, meta description, and the content text (without any tags: HTML tags, DHTML tags, XML tags, JavaScript tags, etc.)?

    I would actually prefer separate regexes: one to extract the title, another for the meta keywords, another for the meta description, and a final one for the content text.

    That way, I can use each one separately when I don't want to extract everything (title, description, etc.).



    If you know of PHP functions other than regex that do what I want, then say so by writing: OFF TOPIC.

    Thanks for your help!
    Last edited by UniqueIdeaMan; January 16th, 2018 at 08:49 AM.
  2. #2
  3. Wiser? Not exactly.
    Devshed God 2nd Plane (6000 - 6499 posts)

    Join Date
    May 2001
    Location
    Bonita Springs, FL
    Posts
    6,276
    Rep Power
    4193
    Recycle your old CD's



  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jan 2017
    Posts
    845
    Rep Power
    0
    Originally Posted by kicken
    The regex or the PHP function?
  6. #4
  7. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2017
    Posts
    10
    Rep Power
    0
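    You could use something like this, which matches every character outside a <meta> tag:

    <meta.*?>(*SKIP)(*F)|.

    Then replace each tag with itself followed by a line break:

    (<meta.*?>) to $1CarriageReturn+LineFeed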
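    In PHP, the two steps above might look roughly like the sketch below. The helper name keepOnlyMetaTags() and the sample page are made up for illustration, and the "s" flag is an added assumption so that "." also matches line breaks. It is a rough sketch, not a robust HTML parser (a ">" inside an attribute value would cut a tag short).

```php
<?php
// Hypothetical helper (not from the thread): keeps only the <meta> tags of an HTML string.
function keepOnlyMetaTags(string $html): string
{
    // Step 1: a <meta ...> tag is matched and then skipped via (*SKIP)(*F),
    // so only the characters outside such tags match "." and get deleted.
    $onlyMeta = preg_replace('/<meta.*?>(*SKIP)(*F)|./s', '', $html);

    // Step 2: put each remaining tag on its own line (CR+LF).
    return preg_replace('/(<meta.*?>)/s', '$1' . "\r\n", $onlyMeta);
}

// Made-up sample page for the demonstration.
$sample = '<html><head><title>Demo</title>'
    . '<meta name="keywords" content="php, regex">'
    . '<meta name="description" content="A demo page">'
    . '</head><body><p>Hello</p></body></html>';

echo keepOnlyMetaTags($sample);
```

    The script prints the two meta tags, one per line, and drops everything else.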

  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2017
    Posts
    10
    Rep Power
    0
    Code:
    .*?(<meta.*?>)|.*           to          $1CarriageReturn+LineFeed
    The regex "s" option must be used so that "." (dot) also matches newline characters (CarriageReturn/LineFeed).
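    As a rough PHP sketch of that single-pass replacement: the helper name listMetaTags() and the sample page are made up for illustration, and splitting the result into an array is an extra step added here to drop the empty lines that the leftover ".*" matches produce.

```php
<?php
// Hypothetical helper (not from the thread): returns every <meta> tag of an HTML string.
function listMetaTags(string $html): array
{
    // "s" lets "." match newlines. Each match is either "text up to and including
    // a meta tag" (the tag lands in $1) or the leftover text (where $1 is empty).
    $lines = preg_replace('/.*?(<meta.*?>)|.*/s', '$1' . "\r\n", $html);

    // Leftover matches produce empty lines; split on line breaks and drop them.
    return preg_split('/\R/', $lines, -1, PREG_SPLIT_NO_EMPTY);
}

// Made-up multi-line sample page, to show why the "s" option matters.
$sample = "<html>\n<head>\n<title>Demo</title>\n"
    . "<meta name=\"keywords\" content=\"php, regex\">\n"
    . "<meta name=\"description\" content=\"A demo page\">\n"
    . "</head>\n<body><p>Hello</p></body>\n</html>";

print_r(listMetaTags($sample));
```

    Without the "s" option, the lazy ".*?" cannot cross the line breaks in front of each tag, so the newlines are left in place and the output no longer matches what is shown here.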

  10. #6
  11. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jan 2017
    Posts
    845
    Rep Power
    0
    Originally Posted by user4589
    Code:
    .*?(<meta.*?>)|.*           to          $1CarriageReturn+LineFeed
    The regex "s" option must be used so that "." (dot) also matches newline characters (CarriageReturn/LineFeed).

    No. I can't use an .exe. It has to be .php, as I need to strip HTML tags on the fly from the pages my web crawler pulls.
    Otherwise, I'm quite capable of weeding out HTML by building my own .exe bot.
  12. #7
  13. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2017
    Posts
    10
    Rep Power
    0
    Originally Posted by UniqueIdeaMan
    No. I can't use an .exe. It has to be .php, as I need to strip HTML tags on the fly from the pages my web crawler pulls.
    Otherwise, I'm quite capable of weeding out HTML by building my own .exe bot.
    You can use the regex code from the tool above in PHP! (I just used the tool to showcase an example!)
  14. #8
  15. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jan 2017
    Posts
    845
    Rep Power
    0
    Originally Posted by user4589
    You can use the regex code from the tool above in PHP! (I just used the tool to showcase an example!)
    Thanks. But how about showing me a code sample?
    For example, write cURL code that fetches a page, and then have the regex extract the following with PHP. It must be a .php file:

    * links
    * meta keywords
    * meta descriptions
    * email addresses

    And no, I'm not trying to build an email extractor. It's just for learning purposes. I already know how to build an email extractor (.exe).
    Thanks for your past samples, and many thanks in advance for your future code samples (cURL and/or PHP).

    Cheers!
  16. #9
  17. Lord of the Dance
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    Oct 2003
    Posts
    4,190
    Rep Power
    2012
    Didn't you look at kicken's link, which pointed to PHP: DOMDocument - Manual?
    Read the comments there and you will find several code samples showing how to use it.
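    For a feel of what the DOMDocument route involves, here is a minimal sketch that pulls out the title, the meta keywords, the meta description, and the tag-free text. The function name extractPageParts() and the returned array keys are made up for illustration, and the sketch assumes PHP's DOM extension is enabled; treat it as a starting point rather than a finished extractor.

```php
<?php
// Hypothetical helper (not from the thread): extracts the title, two meta tags,
// and the visible text from an HTML string using DOMDocument.
function extractPageParts(string $html): array
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $parts = ['title' => '', 'keywords' => '', 'description' => '', 'text' => ''];

    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $parts['title'] = trim($titles->item(0)->textContent);
    }

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        if ($name === 'keywords' || $name === 'description') {
            $parts[$name] = $meta->getAttribute('content');
        }
    }

    // Remove <script> and <style> elements so their code does not end up in the text.
    $xpath = new DOMXPath($doc);
    foreach (iterator_to_array($xpath->query('//script | //style')) as $node) {
        $node->parentNode->removeChild($node);
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body !== null) {
        // Collapse the runs of whitespace left behind by the markup.
        $parts['text'] = trim(preg_replace('/\s+/', ' ', $body->textContent));
    }

    return $parts;
}
```

    Each key of the result can be used on its own, so a caller that only needs the description can ignore the rest.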
  18. #10
  19. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jan 2017
    Posts
    845
    Rep Power
    0
    Originally Posted by MrFujin
    Didn't you look at kicken's link, which pointed to PHP: DOMDocument - Manual?
    Read the comments there and you will find several code samples showing how to use it.
    Two looks at this link on two different days made my head spin, and I was looking for a simpler way:
    PHP: DOMDocument - Manual
    It seems like too much reading. But if you guys say there is no other way, then do I have any choice but to look at the DOM stuff? Dear me!
    Anyway, thanks, MrFujin.
  20. #11
  21. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2017
    Posts
    10
    Rep Power
    0
    Use the same regex code from the tool in PHP's preg_replace() function!

    - Put the regex pattern between the delimiters: /pattern/s

    - Instead of #R#N, use \r\n or <br>!

    Here is an example:

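    PHP Code:
    <?php

    // Sample input: only the "A duplicate code was entered: ..." lines hold the values to keep.
    $string = '
    dwdwdwdw

    A duplicate code was entered: KZKUWRSRPRBU211

    text looks like this

    A duplicate code was entered: 9fgvkwkdprkds11

    (lower case)

    A duplicate code was entered: 4&**()$##$%%%^@@#&

    hfhfhfhhffhfjfjfjf
    ';

    // The "s" flag lets "." match newlines. Each match either captures one code in $1
    // or consumes leftover text, in which case $1 is empty.
    $Serials = preg_replace('/.*?A duplicate code was entered: (.*?)(\R|$)|.*/s', '$1<br><br>', $string);

    echo $Serials;

    ?>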
