#1
  1. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2013
    Posts
    4
    Rep Power
    0

    Xml Parsing Help


    Hi, I am having a tough time parsing large XML files, say about 100 MB. I used minidom and it worked on the smaller ones, but now I am stuck: I have tried ElementTree and lxml and am not able to get the desired output for the larger XML files.
    XML File:
    Code:
    <File>
    		<TITLE>Empire</TITLE>
    		<ARTIST>Maria</ARTIST>
    		<COUNTRY>USA</COUNTRY>
    		<COMPANY>Microsoft</COMPANY>
    		<YEAR>1985</YEAR>
    	</File>
    	<File>
    		<TITLE>Heart</TITLE>
    		<ARTIST>Tyler</ARTIST>
    		<COUNTRY>UK</COUNTRY>
    		<COMPANY>CBS</COMPANY>
    		<YEAR>1988</YEAR>
    	</File>
    	<File>
    		<TITLE>Greatest Hits</TITLE>
    		<ARTIST>MLTR</ARTIST>
    		<COUNTRY>USA</COUNTRY>
    		<COMPANY>SONY</COMPANY>
    		<YEAR>1982</YEAR>
    	</File>
    Desired Output:
    Empire,Maria,USA,Microsoft,1985
    Heart,Tyler,UK,CBS,1988
    Greatest Hits,MLTR,USA,SONY,1982


    They are 100 MB files... thanks in advance.
  2. #2
  3. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,996
    Rep Power
    481
    You must have used my black hole compression algorithm. Anything goes in but no information comes out.

    What is your question? What program fails? More specifically than "It doesn't work for big files", what, please, is the trouble?

    You provided a small xml file. I expect your minidom program to work with the small file.

    minidom is in a part of the python documentation I've never looked at before now.
    [code]Code tags[/code] are essential for python code and Makefiles!
  4. #3
  5. Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Feb 2005
    Posts
    620
    Rep Power
    65
    So, where is your code that you have tried?
    Real Programmers always confuse Christmas and Halloween because Oct31 == Dec25
  6. #4
  7. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2013
    Posts
    4
    Rep Power
    0
    Code:
    import sys
    import xml.dom.minidom

    # Assumption: the XML path is passed on the command line; the original post
    # never showed how dom was created.
    dom = xml.dom.minidom.parse(sys.argv[1])
    xmlTag = dom.getElementsByTagName('File')
    for node in xmlTag:
        # TITLE, ARTIST, etc. are child elements, not attributes, so the
        # getAttribute() calls are dropped (they would only return '').
        list1 = node.getElementsByTagName('TITLE')
        list2 = node.getElementsByTagName('ARTIST')
        list3 = node.getElementsByTagName('COUNTRY')
        list4 = node.getElementsByTagName('COMPANY')
        list5 = node.getElementsByTagName('YEAR')
        for a in list1:
            TITLE_value = a.childNodes[0].nodeValue
            print(TITLE_value)
        # ............................................
        # ............................................
        # ............................................
    When I run this code on 100 MB files it throws a MemoryError.
    The XML example I gave is just a model of my 100 MB XML file.
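
    A side note on the minidom approach: in the sample XML, TITLE, ARTIST and the rest are child elements of <File>, not attributes, so getAttribute() returns an empty string and the text has to be read from the child nodes. Here is a minimal sketch of the difference. The <Catalog> root and the child_text helper are made up for illustration and are not from the original post.

```python
import xml.dom.minidom

# Hypothetical miniature of the posted data, wrapped in a made-up <Catalog>
# root so that it is a well-formed document.
SAMPLE = """<Catalog>
  <File><TITLE>Empire</TITLE><ARTIST>Maria</ARTIST><YEAR>1985</YEAR></File>
  <File><TITLE>Heart</TITLE><ARTIST>Tyler</ARTIST><YEAR>1988</YEAR></File>
</Catalog>"""


def child_text(node, tag):
    """Return the text of the first <tag> descendant of node, or '' if absent."""
    matches = node.getElementsByTagName(tag)
    if not matches or matches[0].firstChild is None:
        return ''
    return matches[0].firstChild.nodeValue


dom = xml.dom.minidom.parseString(SAMPLE)
first = dom.getElementsByTagName('File')[0]

# <File> carries no attributes, so this lookup comes back empty.
as_attribute = first.getAttribute('TITLE')
# The title lives in a child element's text node instead.
as_child = child_text(first, 'TITLE')

# One comma-separated line per <File> record.
rows = [",".join(child_text(f, t) for t in ('TITLE', 'ARTIST', 'YEAR'))
        for f in dom.getElementsByTagName('File')]
print(rows)
```

    This still loads the whole document into memory, so it does not address the MemoryError on a 100 MB file; it only shows where the values are.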
  8. #5
  9. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2013
    Posts
    4
    Rep Power
    0

    87 views and no reply


    87 views but still no reply regarding a solution
  10. #6
  11. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,996
    Rep Power
    481
    Buy more RAM.
    Install it.
    Use it.
  12. #7
  13. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2013
    Posts
    4
    Rep Power
    0

    Solved


    I have finally written my code for parsing huge files and it's working great. Here it is:


    Code:
    from lxml import etree

    # the_xml_file is the path to (or a file object for) the large XML file
    for event, element in etree.iterparse(the_xml_file):
        if 'TITLE' in element.tag:
            print(element.text)
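
    For anyone who wants the comma-separated output asked for in the first post, the same streaming idea can be extended as in the sketch below. It is only an illustration under some assumptions: it uses the standard library's xml.etree.ElementTree.iterparse instead of lxml, it assumes the real file wraps the <File> records in a single root (the made-up <Catalog> here), and xml_to_csv and FIELDS are hypothetical names.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Column order taken from the desired output in the first post.
FIELDS = ('TITLE', 'ARTIST', 'COUNTRY', 'COMPANY', 'YEAR')


def xml_to_csv(source, out, record_tag='File'):
    """Stream records from source and write one CSV row per record.

    source is a filename or a binary file object; out is a text file object.
    Returns the number of records written.
    """
    writer = csv.writer(out, lineterminator='\n')
    count = 0
    # 'end' fires once an element and all of its children have been parsed.
    for event, element in ET.iterparse(source, events=('end',)):
        if element.tag == record_tag:
            writer.writerow([element.findtext(f, default='') for f in FIELDS])
            count += 1
            # Drop the record's text and children so parsed data does not
            # pile up; the emptied <File> shell stays attached to the root.
            element.clear()
    return count


# Small in-memory stand-in for the 100 MB file (hypothetical <Catalog> root).
SAMPLE = (
    b"<Catalog>"
    b"<File><TITLE>Empire</TITLE><ARTIST>Maria</ARTIST><COUNTRY>USA</COUNTRY>"
    b"<COMPANY>Microsoft</COMPANY><YEAR>1985</YEAR></File>"
    b"<File><TITLE>Heart</TITLE><ARTIST>Tyler</ARTIST><COUNTRY>UK</COUNTRY>"
    b"<COMPANY>CBS</COMPANY><YEAR>1988</YEAR></File>"
    b"</Catalog>"
)

out = io.StringIO()
n = xml_to_csv(io.BytesIO(SAMPLE), out)
print(out.getvalue())
```

    For a real file, pass its path as source and an open output file as out. Clearing each record after it is written is what keeps memory use low; without it, iterparse still builds the whole tree as it goes.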
