August 6th, 2010, 08:15 AM
Xmlns Regex C#
I am trying to remove all namespace declarations within HTML documents, but I am having trouble with the regular expression.
I wrote my own regular expression: "xmlns=\"[\\d\\w/\\]*\"" but this didn't work. I also tried a regular expression from the web: @"(xmlns:?[^=]*=[""][^""]*[""])" and this worked, but it also matches any other text that happens to contain xmlns. Does anyone know a good regular expression?
It should detect xmlns="..." and remove the whole attribute, including anything inside the quotation marks.
Thanks for any help!
August 7th, 2010, 12:12 AM
try something like this:
that'll grab any instance of the namespace declaration made within an HTML tag.
this will end up grabbing two groups in each match:
1) the part before the namespace declaration (not including the tag's "<")
2) the actual declaration attribute itself
simply replace each match with "<" followed by the result of the first group in each match and that should work for you.
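The pattern itself did not survive in the post above, so the following is only a sketch of a regex that fits the description (group 1 is the part of the tag before the declaration, group 2 is the declaration), not necessarily the one originally suggested. The class and method names are made up for illustration. Because it only matches after a "<" with no other angle brackets in between, xmlns that appears in ordinary text is left alone. It can still be fooled by a ">" inside a quoted attribute value.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative, not from the thread.
public static class XmlnsRegexSketch
{
    // Group 1: everything in the tag before the declaration (without the "<").
    // Group 2: the xmlns or xmlns:prefix attribute, double- or single-quoted.
    private static readonly Regex XmlnsInTag = new Regex(
        @"<([^<>]*?)\s+(xmlns(?::[\w.-]+)?\s*=\s*(?:""[^""]*""|'[^']*'))");

    public static string RemoveXmlns(string html)
    {
        // One pass removes at most one declaration per tag, so repeat
        // until a pass changes nothing.
        string previous;
        do
        {
            previous = html;
            // Replace each match with "<" followed by group 1.
            html = XmlnsInTag.Replace(html, "<$1");
        } while (html != previous);
        return html;
    }
}
```

Calling XmlnsRegexSketch.RemoveXmlns on the document text returns a copy with the declarations stripped and the rest of each tag untouched.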
as a side-note (unrelated to regexes), you might also want to look into parsing the HTML as an XML doc (if your HTML is actually XHTML, this is a cinch, otherwise it may be tough), using the XML/HTML DOM to remove the xmlns attribute from every element, and then writing out the altered tree.
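A rough sketch of that DOM route, assuming the input is well-formed XHTML and that LINQ to XML (System.Xml.Linq) is available. The class and method names are hypothetical. Note that in LINQ to XML, removing the xmlns attribute alone is not enough: each element's name still carries the namespace, so the sketch also renames every element to its local name. Attributes that are themselves prefixed (for example xlink:href) are left as they are, and the serializer may re-declare their namespaces on output.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper; the names are illustrative, not from the thread.
public static class XmlnsDomSketch
{
    public static string StripNamespaces(string xhtml)
    {
        // Throws XmlException if the input is not well-formed XML.
        XDocument doc = XDocument.Parse(xhtml);

        // Snapshot the elements first so renaming does not disturb the walk.
        foreach (XElement el in doc.Descendants().ToList())
        {
            // Drop xmlns="..." and xmlns:prefix="..." declarations.
            el.Attributes().Where(a => a.IsNamespaceDeclaration).Remove();

            // Move the element itself out of its namespace.
            el.Name = el.Name.LocalName;
        }

        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

This visits every element once, so the cost is linear in the size of the document.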
August 9th, 2010, 03:52 AM
I haven't tried the above regex yet but I will, thank you. I am in fact parsing the HTML as an XML document, and have thought about iterating over each element and removing the xmlns attribute, but that means visiting every element, and it would be far quicker to use the string replace method.