Thread: Xmlns Regex C#

    #1


    Hi all,

    I am trying to remove all namespace declarations within HTML documents, but I am having trouble with the regular expression.

    I wrote my own regular expression, "xmlns=\"[\\d\\w/\\]*\"", but it didn't work. I also tried one from the web, @"(xmlns:?[^=]*=[""][^""]*[""])", which does match the declarations but gets confused by anything else that has xmlns written in it. Does anyone know a better regular expression?

    It should detect xmlns="" and remove the attribute along with anything inside the quotation marks.

    Thanks for any help!
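
    For anyone reproducing the two symptoms described above, here is a minimal sketch. The class and method names are made up for illustration, and it assumes .NET's System.Text.RegularExpressions. In the first pattern, the C# string "\\]" reaches the regex engine as \], which escapes the closing bracket, so the character class is never terminated and the pattern is rejected. The second pattern parses, but [^=]* is free to run across tag boundaries until it reaches the next "=" in the document.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class XmlnsPatternDemo
{
    // First pattern from the post. After C# unescaping the engine sees
    // xmlns="[\d\w/\]*"  where \] is an escaped bracket, so the set never closes.
    public const string FirstPattern = "xmlns=\"[\\d\\w/\\]*\"";

    // Second pattern from the post, unchanged.
    public const string SecondPattern = @"(xmlns:?[^=]*=[""][^""]*[""])";

    // Returns true when .NET refuses to parse the pattern.
    public static bool FailsToParse(string pattern)
    {
        try
        {
            new Regex(pattern);
            return false;
        }
        catch (ArgumentException)
        {
            return true;
        }
    }

    // Deletes every match of the second pattern, as the post describes.
    public static string StripWithSecondPattern(string html)
    {
        return Regex.Replace(html, SecondPattern, string.Empty);
    }
}
```

    With the second pattern, the word xmlns in ordinary text is enough to start a match: given <p>see xmlns docs</p><a href="x">link</a>, the match begins at "xmlns" in the paragraph and ends at the closing quote of href="x", so the replace eats markup that has nothing to do with namespaces.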

    #2
    jalucas
    try something like this:
    Code:
    <([^>]+)(xmlns:?[^=]*=[""][^""]*[""])
    that'll grab any namespace declaration made within an HTML tag.
    each match ends up with two groups:
    1) the part of the tag before the namespace declaration (not including the tag's "<")
    2) the declaration attribute itself

    simply replace each match with "<" followed by the first group of that match and that should work for you.
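
    The replacement described above could be sketched like this in C#. This is only an illustration: the class name XmlnsStripper is made up, and it assumes the part before the declaration is wrapped in its own capture group so that the two groups listed above actually exist. Because the leading [^>]+ is greedy, each pass removes only the last declaration in a tag, so the sketch repeats the replace until nothing changes.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class XmlnsStripper
{
    // Group 1: everything between "<" and the declaration.
    // Group 2: the xmlns declaration itself.
    private static readonly Regex TagWithXmlns =
        new Regex(@"<([^>]+)(xmlns:?[^=]*=[""][^""]*[""])");

    public static string Strip(string html)
    {
        string previous;
        do
        {
            previous = html;
            // Keep "<" plus group 1, drop group 2.
            html = TagWithXmlns.Replace(html, "<$1");
        } while (html != previous); // repeat for tags with several declarations
        return html;
    }
}
```

    Anchoring on "<" stops plain text such as "see xmlns docs" from matching, since [^>]+ cannot cross the end of a tag. It can still misfire when xmlns appears inside another attribute's value, so treat it as a starting point rather than a complete fix. The whitespace around the removed attribute is left in place.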

    as a side note (unrelated to regexes), you might also want to look into parsing the HTML as an XML document. if your HTML is actually XHTML this is a cinch, otherwise it may be tough. you can then use the XML/HTML DOM to remove the xmlns attribute from every element and write out the altered tree.
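
    A rough sketch of the DOM route, using LINQ to XML (System.Xml.Linq). The class name XmlnsDomStripper is made up, and the sketch assumes the input is well-formed XML without a DOCTYPE. One thing to watch: removing the xmlns attributes alone is not enough with this API, because each element's name still carries its namespace and the writer would declare it again on output, so the sketch also resets every element to its local name.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, for illustration only.
public static class XmlnsDomStripper
{
    public static string Strip(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml);

        // ToList() takes a snapshot so the tree is not enumerated while edited.
        foreach (XElement el in doc.Descendants().ToList())
        {
            // Drop xmlns="..." and xmlns:prefix="..." attributes.
            el.Attributes().Where(a => a.IsNamespaceDeclaration).Remove();

            // Drop the namespace from the element name itself.
            el.Name = el.Name.LocalName;
        }

        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

    Prefixed attributes (for example xlink:href) keep their namespace in this sketch, so they would need similar handling if the document uses them.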

    #3
    I haven't tried jalucas's regex above yet, but I will, thank you. I am in fact already parsing the HTML as an XML document, and I have thought about iterating over each element and removing the xmlns attribute, but that means visiting every element, and it would be far quicker to use the string replace method.
