Putting large amounts of XML data into a database
I was given quite an interesting project to work on, and I've got a deadline of 2 weeks for it... Anyway, I need to convert the whole intranet of a large telecom company (about 20000 employees) from an old XML-based system (actually it uses a really weird hierarchical database and we can dump XML out of it, so I have to work on the XML dump) to an Oracle database in order to use a new content management system that has been developed.
It's an XML file that defines the whole directory hierarchy and all articles, images, cross-references, authors etc. This XML file is about 2.1GB, so there's a huge amount of stuff in it. Pictures and other binary data were previously just saved as files on the hard drive, but that's something I have to import into Oracle too.
Now my problem is this: I've never worked with XML before, but no one here cares. This needs to be done in Java, and I was looking at the APIs and noticed that there are two different ones to work with, namely DOM and SAX. Which one do you recommend that I use?
The straightforward implementation I thought about was to do multiple passes over the XML file, i.e. first create the directory structure, then import images etc. that don't have any dependencies, and finally start adding articles, news, updates etc. Can SAX handle this or do I have to build the recursive node traversals myself? Not that that would be a problem, it just takes more time and that's exactly what I don't have. Well, I'm just a summer working student from uni, so they can dump these things on me...
Anyone done anything similar? Hints? Things to watch out for? Another coder here told me that something similar had been done (on a much smaller site) by writing XSLT stylesheets that converted the XML into SQL inserts. But then I would need to learn XSLT too, and I don't think converting a 2.1GB XML file into a probably equally big SQL script would be a clean solution.
July 17th, 2002, 10:09 AM
I think you will have no choice but to use SAX.
A DOM parser will attempt to load the entire 2GB document into memory as a tree of node objects, which usually takes several times the size of the file itself, so the system would just crap out.
I would just register callbacks with a SAX parser and run your SQL queries from the callbacks once you have a full 'record'.
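Roughly what I mean, as a minimal sketch rather than a finished importer. I don't know what your dump looks like, so the element names ("article", "title", "body") and the RecordHandler class are made up for illustration, and I'm assuming each record is a flat list of child elements. The sink is where the INSERT (ideally a batched JDBC PreparedStatement) would go; here it just receives the finished record.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: gathers the child elements of one record element
// into a map and passes the finished record to a sink.
class RecordHandler extends DefaultHandler {
    private final String recordTag;                  // e.g. "article" (assumed name)
    private final Consumer<Map<String, String>> sink;
    private final StringBuilder text = new StringBuilder();
    private Map<String, String> current;             // null while outside a record

    RecordHandler(String recordTag, Consumer<Map<String, String>> sink) {
        this.recordTag = recordTag;
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals(recordTag)) {
            current = new LinkedHashMap<>();          // a new record begins
        }
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one element's text in several chunks, so accumulate.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) {
            return;                                   // not inside a record
        }
        if (qName.equals(recordTag)) {
            sink.accept(current);                     // full record: run the INSERT here
            current = null;
        } else {
            current.put(qName, text.toString().trim());
        }
        text.setLength(0);
    }

    // Demo helper only: parses a small in-memory string and collects the
    // records in a list. For the real dump you would parse a File or
    // InputStream and write to the database from the sink instead.
    static List<Map<String, String>> parse(String xml, String recordTag) {
        List<Map<String, String>> out = new ArrayList<>();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new RecordHandler(recordTag, out::add));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return out;
    }
}
```

Since the handler only ever holds one record at a time, memory use stays flat no matter how big the file is. For the multi-pass idea you would simply run the parser over the file once per pass with a different record tag.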
You are fortunate that you can get an XML dump of everything; normally I have variable-length binary data structures to deal with, lol.