August 1st, 2013, 08:35 AM
XML Formatting with invalid characters
I have a problem which I am uncertain of how to handle.
I have XML Data in a client application which is being sent over HTTP Post / Webservice to a Server application.
Inside my XML Data I have the following problem node:
it is a node containing a value that has the ">" character as part of the text.
Now this becomes a problem when trying to parse the XML at the server side (incorrect close tag found etc. etc.)
In an attempt to solve the problem I have escaped the < > / and other characters for transmission purposes. This works fine as it is converted to
where "ampersand" is the "&" sign. Sorry, I couldn't type it here as the browser seems to convert it into the correctly displayed > or <.
However in hindsight I realized it makes no difference as the escaped characters are unescaped prior to parsing at the server side - leaving you with the same problem originally described.
a) What is the standard for handling such a situation in XML?
b) If escaping is the key to the solution, what am I missing and how should I go about escaping the XML?
Last edited by dve83; August 1st, 2013 at 08:37 AM.
August 1st, 2013, 08:49 AM
It should be obvious that any payload you put into an XML document must first be escaped (or wrapped into a CDATA section) so that it doesn't collide with the markup. So, yeah, you have to replace all special characters with their corresponding entity.
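To make that concrete, here is a minimal Python sketch of both options. The "<ID>" element name is borrowed from this thread; the payload value and the as_cdata helper are made up for illustration, and your client language may offer different helpers.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

payload = "5 > 3 & 2 < 4"  # hypothetical data containing markup characters

# Option 1: replace the special characters with entities (data only).
escaped = escape(payload)
document = "<ID>" + escaped + "</ID>"
print(document)

# Option 2: wrap the data in a CDATA section (hypothetical helper).
def as_cdata(text):
    # "]]>" would end the section early, so split it across two sections.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

cdata_document = "<ID>" + as_cdata(payload) + "</ID>"

# Either way, a parser hands back the original text.
assert ET.fromstring(document).text == payload
assert ET.fromstring(cdata_document).text == payload
```

Note that only the payload goes through escape(); the surrounding tags are written literally.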
If the entities somehow get replaced (making the document invalid) before they reach your script, there's obviously a bug. That's what you need to fix. The server must receive the proper XML document with all entities intact.
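The bug described above can be reproduced in a few lines. This is only a sketch in Python with a made-up value; it assumes the server-side code does something equivalent to unescaping the whole document before handing it to the parser. The sample uses "<" as well as ">", since a raw "<" in text is what a strict parser is certain to reject.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import unescape

# What the client should send: literal tags, escaped data.
good = "<ID>a &lt; b &gt; c</ID>"
assert ET.fromstring(good).text == "a < b > c"

# The bug: something unescapes the document BEFORE it is parsed.
broken = unescape(good)  # now "<ID>a < b > c</ID>"

try:
    ET.fromstring(broken)
    parsed_ok = True
except ET.ParseError as err:
    parsed_ok = False
    print("parse failed:", err)
```

If your server behaves like the second half, look for the step that decodes entities ahead of the parser and remove it.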
August 1st, 2013, 08:55 AM
Hi and thanks for the fast reply. Forgive my ignorance as I ask the following:
Let us assume I escape the XML text to the following (which I am currently doing)
hence: doing as per your reply :
when it reaches the server, the script on the server will then "un"escape the data back to
and as a result the parsing will fail due to the reserved markup character >.
Am I incorrectly escaping the whole XML string data? Should I be escaping this differently? Or is my error on the server side?
I am trying to avoid using CDATA nodes at the moment (if possible).
Your reply is greatly appreciated.
August 1st, 2013, 09:37 AM
OK, I think there's a general misunderstanding of how XML works.
You must not escape "<" and ">" if they belong to the XML markup. This breaks the whole document, because now you've got syntax spaghetti instead of XML. Tags like "<ID>" must stay like they are, you need to write them down literally.
What you do have to escape is the data of the XML document. If you wanna put some piece of text into an XML element, then you need to escape it first.
See the difference? On one hand, there's the XML markup, which describes the structure of the document. On the other hand, there's the data. It's any custom text you put into your document. Since the XML parser must be able to distinguish between, say, "<ID>" in the sense of an XML tag and "<ID>" in the sense of text, you need to escape the latter. You must not escape the former.
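One way to keep the two apart without thinking about it is to let an XML library write the document: you give it element names and text separately, and it escapes only the text. A small Python sketch, assuming a hypothetical "Record" wrapper element and a made-up value:

```python
import xml.etree.ElementTree as ET

# Markup: created through the API, never escaped by hand.
root = ET.Element("Record")          # hypothetical wrapper element
id_element = ET.SubElement(root, "ID")

# Data: plain text, which happens to look like a tag.
id_element.text = "<ID> as text, 1 > 0"

# Serializing escapes the data and leaves the markup alone.
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The real "<ID>" tags come out literally, while the "<ID>" inside the text comes out as entities.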
The transmission process works like this:
You transmit a valid XML document to the server. It must adhere to the syntax rules of XML, and all data has to be escaped.
Your example above would look like this:
On the server, you parse this very document with the markup and all entities intact. During parsing, the markup is transformed into an abstract tree structure, and the entities get replaced with the corresponding characters.
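The server-side step might look like this in Python. It is a sketch only: the "Record" element and the value are invented, and your server language will have its own parser, but the idea is the same. The received string goes into the parser untouched, and the parser does the unescaping.

```python
import xml.etree.ElementTree as ET

# The document exactly as it arrived: tags literal, entities intact.
received = "<Record><ID>1 &gt; 0 &amp; 2 &lt; 3</ID></Record>"

# Parse first. Do not unescape anything yourself.
root = ET.fromstring(received)

# The parser has already turned the entities back into characters.
value = root.find("ID").text
print(value)
```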
August 1st, 2013, 09:52 AM
Thank you very much for the fast and precise response.