#1
    What Is the Regex to Deal With HTML Tags?


    PHP folks,

    What regex would extract the title, the meta keywords, the meta description and the content text (with all tags stripped: HTML tags, DHTML tags, XML tags, JavaScript tags, etc.)?

    I would actually prefer one regex to extract the title, another to extract the meta keywords, another to extract the meta description and a final one to extract the content text.

    That way, I can use each one separately when I don't want to extract everything (title, description, etc.).

    If you know of PHP functions other than regex that do what I want, then say so by writing: OFF TOPIC.

    Thanks for your help!
    Last edited by UniqueIdeaMan; January 16th, 2018 at 08:49 AM.
#2
#3
    Originally Posted by kicken
    The regex or the PHP function?
#4
    You could use something like this:

    <meta.*?>(*SKIP)(*F)|.

    (<meta.*?>) to $1CarriageReturn+LineFeed
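    In PHP, the two steps above might be applied with preg_replace roughly as follows. This is only a sketch: it assumes the first pattern is meant to delete everything that is not a <meta> tag (the post does not say what its matches are replaced with), and the function name keepMetaTags and the sample HTML are made up for illustration.

```php
<?php
// Hypothetical helper: keep only the <meta ...> tags of a page, one per line.
function keepMetaTags(string $html): string
{
    // Step 1: a <meta ...> tag is matched and then skipped via (*SKIP)(*F);
    // every other character is matched by "." and replaced with nothing
    // (assumption: the post does not state the replacement).
    // The "s" modifier lets "." match newlines, "i" ignores case.
    $onlyMeta = preg_replace('/<meta.*?>(*SKIP)(*F)|./si', '', $html);

    // Step 2: append CR+LF after each remaining tag.
    return preg_replace('/(<meta.*?>)/si', '$1' . "\r\n", $onlyMeta);
}

// Made-up sample page for illustration.
$html = "<html><head>\n<title>Demo</title>\n"
      . "<meta name=\"keywords\" content=\"php, regex\">\n"
      . "<meta name=\"description\" content=\"A demo page\">\n"
      . "</head><body><p>Hello</p></body></html>";

echo keepMetaTags($html);
```

    The (*SKIP)(*F) verbs are PCRE backtracking verbs, so this relies on PHP's preg_* functions being built on PCRE.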

#5
    Code:
    .*?(<meta.*?>)|.*           to          $1CarriageReturn+LineFeed
    The regex "s" option must be used so that "." (dot) also matches newline characters (CarriageReturn/LineFeed).
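    As a rough illustration of that single-pattern replacement in PHP, where the "s" option is written as the /s modifier: the function name metaTagsPerLine and the sample page are hypothetical, and the final trim is an addition that is not part of the post.

```php
<?php
// Hypothetical helper applying the single-pattern replacement from the post.
function metaTagsPerLine(string $html): string
{
    // ".*?(<meta.*?>)" consumes the text before a tag and captures the tag;
    // the ".*" alternative consumes whatever follows the last tag.
    // Without the "s" modifier, "." stops at line breaks, so a tag that is
    // split across lines would be missed and the original newlines would stay.
    $result = preg_replace('/.*?(<meta.*?>)|.*/s', '$1' . "\r\n", $html);

    // The ".*" alternative has an empty $1 and so leaves bare line breaks
    // at the end; trim them (an addition, not part of the post).
    return trim($result);
}

// Made-up sample page; the second tag is deliberately split across two lines.
$html = "<html><head>\n<meta name=\"a\" content=\"1\">\n<title>T</title>\n"
      . "<meta name=\"b\"\n content=\"2\">\n</head></html>";

echo metaTagsPerLine($html), "\n";
```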

#6
    Originally Posted by user4589
    Code:
    .*?(<meta.*?>)|.*           to          $1CarriageReturn+LineFeed
    The regex "s" option must be used so that "." (dot) also matches newline characters (CarriageReturn/LineFeed).

    No, I can't use an .exe. It has to be PHP, because I need to strip the HTML tags on the fly after my web crawler pulls the pages.
    Otherwise, I'm quite capable of weeding out the HTML by building my own .exe bot.
#7
    Originally Posted by UniqueIdeaMan
    No, I can't use an .exe. It has to be PHP, because I need to strip the HTML tags on the fly after my web crawler pulls the pages.
    Otherwise, I'm quite capable of weeding out the HTML by building my own .exe bot.
    You can use the regex from the tool above in PHP! (I only used the tool to showcase an example.)
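    To make "use the regex in PHP" concrete for the original question, here is one possible sketch with a separate pattern per field, so each can be called on its own. None of these patterns or function names were posted in this thread; they are illustrative assumptions. In particular, the meta pattern assumes double-quoted attributes with name before content, and regex-based extraction like this can break on unusual markup.

```php
<?php
// Hypothetical helpers, one per field, so each can be used separately.

function extractTitle(string $html): ?string
{
    // Capture the text between <title ...> and </title>.
    return preg_match('/<title[^>]*>(.*?)<\/title>/si', $html, $m) ? trim($m[1]) : null;
}

function extractMeta(string $html, string $name): ?string
{
    // Assumes name="..." comes before content="..." and both use double quotes.
    $pattern = '/<meta\s+name="' . preg_quote($name, '/') . '"\s+content="([^"]*)"/si';
    return preg_match($pattern, $html, $m) ? trim($m[1]) : null;
}

function extractText(string $html): string
{
    // Drop <head>, <script> and <style> blocks together with their contents,
    $html = preg_replace('/<(script|style|head)\b[^>]*>.*?<\/\1>/si', ' ', $html);
    // replace every remaining tag with a space,
    $text = preg_replace('/<[^>]*>/', ' ', $html);
    // then collapse runs of whitespace.
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Made-up sample page for illustration.
$html = '<html><head><title> Demo page </title>'
      . '<meta name="keywords" content="php, regex">'
      . '<meta name="description" content="A small demo">'
      . '<script>var x = 1;</script></head>'
      . '<body><h1>Hello</h1><p>World</p></body></html>';

echo extractTitle($html), "\n";                // Demo page
echo extractMeta($html, 'keywords'), "\n";     // php, regex
echo extractMeta($html, 'description'), "\n";  // A small demo
echo extractText($html), "\n";                 // Hello World
```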
