#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2015
    Posts
    9
    Rep Power
    0

    Create an XML file given a predefined XSD and raw data?


    I know this is a broad question with a number of different solutions but this is a relatively new domain for me so I thought it couldn't hurt to ask.

    I recently started a job that requires me to pull data / reports of differing formats - a lot of Excel reports along with raw data dumps (.csv or similar) from various HR/timekeeping systems - perform an analysis of the data and then extract those results to a pre-defined XSD format that I will then submit to an external database (manually via web).

    I have experience programming in C/C++ (mostly for embedded applications) and the MS Office productivity suite along with SQL Server 2012 here at work.

    What is the best path for me to accomplish this?

    Much of the work I am doing is with the aim of automating future data submissions. Prior to my arrival the vast majority of this work was being done via manual data entry. My goal is to create a process (be it a program, an existing tool, or some combination thereof) that can be used to import data/reports (of varying formats) and, given an specified XSD, structure an XML batch upload file accordingly.

    A VB script in conjunction with MS Access has been recommended although there seem to be conflicting opinions about this online. I'm considering this, however, as a way to create a functional prototype (I have about 3 weeks to figure this out) that can be revised later if necessary.

    My thanks in advance for any advice!!
  2. #2
  3. Lazy Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    16,382
    Rep Power
    9645
    What kind of programming languages/tools do you have available? With a low-level language like C/C++ you have to do a lot of work, and more to properly read things like Excel files. VBA would make the Excel job easier, but you'd "have" to use it to deal with the other data formats too (or if you don't then you need multiple tools for multiple files).

    I'm a PHP developer so it's easy for me to see how PHP or Python would be good for this. They're highish-level languages with lots of framework support, they work cross-system (which doesn't really matter to you), and they're pretty easy to learn (Python is easy overall, PHP is easier if you know C/C++ stuff).

    For example, you can make something that understands different types of files: .xls/x, .csv, whatever. It loads a particular "module" (code) to handle that file type which reads the file as input and produces the right XML as output.

    After doing some learning, I could see developing a prototype to handle one type of file taking a day or two to understand how to read the input file (CSV being the easiest) and another day or two to write something to produce the XML for it.
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2015
    Posts
    9
    Rep Power
    0
    Thanks for your response!

    I've played around with Python before and can download the interpreter easily enough (or perhaps just use the Cygwin distribution) but I'd like something that I can use to manage both the input data (Excel, .csv, whatnot) as well as the XSD definition files (various sites I will be uploading to have varying data requirements and thus different XML schema definitions) that I need to validate the output XML.

    In truth, while I don't mind hacking something together in the interim, I'd ultimately like to have a tool and a process that is easy to modify and update when the data or XSD requirements change.

    There is some work here that has been done in VB/MSAccess but I've read some pretty harsh criticism of this approach online...

    Does VB handle XML sufficiently that I can import an external XSD instead of having to define the format of the XML in my code?
  6. #4
  7. Lazy Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    16,382
    Rep Power
    9645
    Getting really meta and making something aware of the input types and output schemas is fun but you will not get it done soon. It's exponentially complex, and I promise you there will come a time when it fails on something. The best (and quickest) way would be to do that part of it manually. Meaning you manually write the code which translates whatever input into the specific output you need. Unless the schemas change drastically (unlikely), any changes to them will likely require a minimal amount of work to change the code.

    Right now you should get the system working. Later you can work to try and make it more intelligent.

    The problem with VBA is that it rarely ever evolves into a real tool. It stays in some Excel/Access file somewhere, and as needs grow over time the VBA grows more and more complex to a point where it does things that have nothing to do with Excel anymore. Your CSV inputs are already starting you along that path. What's worse, this VBA would have to reside in a file somewhere just as code because the actual source data is located somewhere else.
    It's faster in the short term for working with one particular input (Excel or Access) but will hurt you in the medium and long term.

    If the data changes, your tool should support that so long as the input and output formats do not change.
    If the input and/or output formats do change, do some development work to address it.
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2015
    Posts
    9
    Rep Power
    0
    Originally Posted by requinix
    Getting really meta and making something aware of the input types and output schemas is fun but you will not get it done soon. It's exponentially complex, and I promise you there will come a time when it fails on something. The best (and quickest) way would be to do that part of it manually. Meaning you manually write the code which translates whatever input into the specific output you need. Unless the schemas change drastically (unlikely), any changes to them will likely require a minimal amount of work to change the code.

    Right now you should get the system working. Later you can work to try and make it more intelligent.

    The problem with VBA is that it rarely ever evolves into a real tool. It stays in some Excel/Access file somewhere, and as needs grow over time the VBA grows more and more complex to a point where it does things that have nothing to do with Excel anymore. Your CSV inputs are already starting you along that path. What's worse, this VBA would have to reside in a file somewhere just as code because the actual source data is located somewhere else.
    It's faster in the short term for working with one particular input (Excel or Access) but will hurt you in the medium and long term.

    If the data changes, your tool should support that so long as the input and output formats do not change.
    If the input and/or output formats do change, do some development work to address it.
    I think my intention will be to prototype an initial implementation in MSAccess 2010. I've reading up on Lync and it seems that in conjunction with VB would be ideal... at least for the short term (as you alluded to in your last response). Additionally, MSAccess provides me with a means to create a quick-and-dirty GUI. I'm a bit concerned about the version of VB available with 2010 (2012 included updates pertaining to lambda expressions and such), however, these are not as important for a standalone prototype so I think I can get by.

    I'm expecting the majority of my raw data to be in some table form, be it Excel or in a format that can be easily converted to a spreadsheet (such as .csv). I've experimented importing a chunk of the data into MSAccess and for the most part have been successful. I'm a bit concerned with the actual process of forming the XML since I'm not quite clear on how I will reference the XSD from within the program (import it into MSAccess along with the rest of my input data or reference it as an external set of files?) however hopefully these will not be intractable problems.
  10. #6
  11. Lazy Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    16,382
    Rep Power
    9645
    Unless you don't know the format of the XML (ie, the XSD) during development, you don't really have to care about it in code. Just make sure that whatever XML the program generates conforms to the schema.
  12. #7
  13. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2015
    Posts
    9
    Rep Power
    0
    Originally Posted by requinix
    Unless you don't know the format of the XML (ie, the XSD) during development, you don't really have to care about it in code. Just make sure that whatever XML the program generates conforms to the schema.
    Brace yourself for an extra dumb question:

    How will the program know how to translate the required data from each table into the proper elements in XML if I'm not specifying it programmatically...

    I know I will be querying the imported tables for the proper data but won't I need to specify what goes where once I am ready to export it to XML?
  14. #8
  15. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2015
    Posts
    9
    Rep Power
    0
    Also, I do have the XSD - one of my concerns also is that it is specified across a couple files - a Report.xsd that contains a reference to separate <xs: schemaLocation = "Enumerations.xsd"> file.
  16. #9
  17. Lazy Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    16,382
    Rep Power
    9645
    The XSD specifies the format of the XML but it doesn't say anything about where that data is coming from, right? It doesn't say "this quantity value is a xs:int and comes from cell J20". That's why you have to do all this work in the first place.

    You, the developer, know the format of the XML. You know because it's there in the schema and you can read the schema? Like, you know that the "quantity" value is supposed to be an integer, and you know that the value is supposed to come from cell J20.
    So what you do is something along the lines of (because I don't remember much of Excel VBA):
    Code:
    XMLString = XMLString & "<quantity>" & ToInt(Sheet.Range("J20").Value) & "</quantity>"
    You're building the XML as a string. Or you can do something more sophisticated, maybe, if VBA has the tools for it, but whatever method you use you're still building the "<quantity>" element using the value from J20.

    And you do that for everything. Eventually you have a complete XML string that you can save to a file. And if you did everything right then that XML will conform to the schema (and if doesn't then you didn't).

IMN logo majestic logo threadwatch logo seochat tools logo