#1
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2004
    Location
    South Africa
    Posts
    59
    Rep Power
    11

    XML Formatting with invalid characters


    Dear all,

    I have a problem which I am uncertain of how to handle.

    I have XML Data in a client application which is being sent over HTTP Post / Webservice to a Server application.

    Inside my XML Data I have the following problem node:
    <ID>valuecontaining></ID>

    it is a node containing a value that has the ">" character as part of the text.

    Now this becomes a problem when trying to parse the XML at the server side (incorrect close tag found etc. etc.)

    In an attempt to solve the problem I have escaped the <, >, / and other characters for transmission purposes. This works fine, as it is converted to

    "ampersand"lt;ID"ampersand"gt;valuecontaining"ampersand"gt;"ampersand"lt;/ID"ampersand"gt;

    where "ampersand" is the "&" sign. Sorry, I couldn't type it here as the browser seems to convert it into the correctly displayed > or <.

    However, in hindsight I realized it makes no difference, as the escaped characters are unescaped prior to parsing at the server side, leaving you with the same problem originally described.

    Questions:
    a) What is the standard way of handling such a situation in XML?
    b) If escaping is the key to the solution, what am I missing, and how should I go about escaping the XML?

    Kind Regards
    dve83
    Last edited by dve83; August 1st, 2013 at 08:37 AM.
  2. #2
  3. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,959
    Rep Power
    1014
    Hi,

    it should be obvious that any payload you put into an XML document must first be escaped (or wrapped into a CDATA section) so that it doesn't collide with the markup. So, yeah, you have to replace all special characters with their corresponding entity.

    If the entities somehow get replaced (making the document invalid) before they reach your script, there's obviously a bug. That's what you need to fix. The server must receive the proper XML document with all entities intact.
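
    A minimal sketch of what "escape the payload" means, written in Python purely for illustration (the thread does not say which language the client uses, so the language and the helper name build_id_element are assumptions). Only the value is passed through the escaping function; the tags are written literally.

```python
from xml.sax.saxutils import escape


def build_id_element(value: str) -> str:
    """Wrap a raw value in literal <ID> tags, escaping only the value.

    build_id_element is a hypothetical helper for this example.
    escape() replaces &, < and > with their entities.
    """
    return "<ID>" + escape(value) + "</ID>"


# The value from the original post, which ends in a ">" character.
document = build_id_element("valuecontaining>")
print(document)
```

    The markup stays untouched, and the ">" inside the value is sent as an entity.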
    The 6 worst sins of security • How to (properly) access a MySQL database with PHP

    Why can't I use certain words like "drop" as part of my Security Question answers?
    There are certain words used by hackers to try to gain access to systems and manipulate data; therefore, the following words are restricted: "select," "delete," "update," "insert," "drop" and "null".
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2004
    Location
    South Africa
    Posts
    59
    Rep Power
    11
    Hi and thanks for the fast reply. Forgive my ignorance as I ask the following:

    Let us assume I escape the XML text to the following (which I am currently doing):

    Code:
    "ampersand"lt;ID"ampersand"gt;valuecontaining"ampersand"gt;"ampersand"lt;/ID"ampersand"gt;
    Hence, doing as per your reply:
    "you have to replace all special characters with their corresponding entity."
    When it reaches the server, the script on the server will then unescape the data back to

    Code:
    <ID>valuecontaining></ID>
    and as a result the parsing will fail due to the reserved markup character >.

    Am I incorrectly escaping the whole XML string data? Should I be escaping this differently? Or is my error on the server side?
    I am trying to avoid using CDATA nodes at the moment (if possible).

    Your reply is greatly appreciated.
    Danie
  6. #4
  7. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,959
    Rep Power
    1014
    OK, I think there's a general misunderstanding of how XML works.

    You must not escape "<" and ">" if they belong to the XML markup. This breaks the whole document, because now you've got syntax spaghetti instead of XML. Tags like "<ID>" must stay as they are; you need to write them down literally.

    What you do have to escape is the data of the XML document. If you wanna put some piece of text into an XML element, then you need to escape it first.

    See the difference? On one hand, there's the XML markup, which describes the structure of the document. On the other hand, there's the data. It's any custom text you put into your document. Since the XML parser must be able to distinguish between, say, "<ID>" in the sense of an XML tag and "<ID>" in the sense of text, you need to escape the latter. You must not escape the former.
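
    To make the distinction concrete, here is a small Python sketch (Python and the helper is_well_formed are assumptions for illustration, not something from the original posts). It escapes the same content in both ways and asks a parser whether the result is still a well-formed document.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_value = "valuecontaining>"

# Wrong: the whole document is escaped, so the tags are no longer markup.
whole_document_escaped = escape("<ID>" + raw_value + "</ID>")

# Right: only the data is escaped; the tags are written literally.
data_only_escaped = "<ID>" + escape(raw_value) + "</ID>"


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts text as an XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(whole_document_escaped))
print(is_well_formed(data_only_escaped))
```

    The first string contains no element at all, only text made of entities, so the parser rejects it. The second one still has a real ID element around the escaped value.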

    The transmission process works like this:

    You transmit a valid XML document to the server. It must adhere to the syntax rules of XML, and all data has to be escaped.

    Your example above would look like this:
    Code:
    <ID>valuecontaining&gt;</ID>
    On the server, you parse this very document with the markup and all entities intact. During parsing, the markup is transformed into an abstract tree structure, and the entities get replaced with the corresponding characters.
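
    The round trip described above can be sketched as follows, assuming Python's standard xml.etree.ElementTree on both ends (the real client and server stack is not named in this thread, and the variable names are made up for the example). Letting the XML library serialize the element means the data is escaped for you, and the parser on the other side turns the entities back into characters.

```python
import xml.etree.ElementTree as ET

# Client side: build the element and let the library handle escaping.
element = ET.Element("ID")
element.text = "valuecontaining>"
payload = ET.tostring(element, encoding="unicode")

# ... payload is what would be transmitted over HTTP, unchanged ...

# Server side: parse the document as received, entities intact.
received = ET.fromstring(payload)
value = received.text

print(payload)
print(value)
```

    No manual unescaping step is needed on the server. If something between client and server unescapes the payload before parsing, that step is what has to be removed.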
  8. #5
  9. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2004
    Location
    South Africa
    Posts
    59
    Rep Power
    11
    Hi Jacques,

    thank you very much for the fast and precise response.

    Danie
