#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2013
    Posts
    19
    Rep Power
    0

    Help trying to parse my file


    I have an XML file that I'm trying to parse to pull out some important data. I'm trying to do this in Python, but I'm not having much luck. This is an example of my file:

    ------------------------------------------------------
    <record version="2" event="open(2) - read" modifier="fe" host="computer_one" iso8601="2013-05-21 11:44:06.315 -05:00"><path>/var/ld/ld.config</path><subject audit-uid="root" uid="root" gid="root" ruid="root" rgid="root" pid="976" sid="975" tid="0 0 computer_one"/><return errval="failure: No such file or directory" retval="-1"/></record>

    <record version="2" event="login - local" host="computer_one" iso8601="2013-05-21 11:43:58.239 -05:00"><subject audit-uid="root" uid="root" gid="root" ruid="root" rgid="root" pid="861" sid="861" tid="0 0 computer_one"/><text>invalid password</text><return errval="failure: Interrupted system call" retval="-1"/></record>
    -----------------------------------------------------

    I need to skip lines that contain the word 'open', but I don't know how to do that. This is my code:

    Code:
    #!/usr/bin/python
    
    import re
    
    with open('small_log.xml', 'r') as handle:
      for line in handle:
        if line == 'open':
          continue
        else:
          print line
    #    content = line.strip().split('record')
    
    #print content
    I think I'm checking whether the whole line is exactly 'open', but I only want to check whether the line contains an occurrence of the word 'open' and skip it if so. I'm thinking I may need a regex, so I imported re. I'm more used to Perl regexes. Can anyone help with this?
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2007
    Location
    Joensuu, Finland
    Posts
    436
    Rep Power
    67
    Originally Posted by cspctec
    I have an xml file that I'm trying to parse and pull out some important data.
    Regexes are not really suited to parsing XML files. Try an XML parser, say Expat, which is simple but probably enough for your purposes.
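For reference, a minimal sketch of what the Expat route could look like with the standard library's xml.parsers.expat module. The sample is a shortened copy of a record from the first post; the names `events` and `on_start` are made up for illustration and are not from the thread.

```python
import xml.parsers.expat

# Shortened copy of one record from the first post.
SAMPLE = (
    '<record version="2" event="login - local" host="computer_one">'
    '<text>invalid password</text></record>'
)

events = []  # hypothetical list collecting the event attribute of each record


def on_start(name, attrs):
    # Expat calls this for every opening tag; attrs is a plain dict.
    if name == "record":
        events.append(attrs.get("event", ""))


parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = on_start
parser.Parse(SAMPLE, True)  # True marks this as the final chunk of input

print(events)
```

Expat is callback-based, so any state (here the `events` list) has to be kept by the caller.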
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2009
    Posts
    492
    Rep Power
    33
    You can use the "in" keyword.
    Code:
    if 'open' not in line:
        print line
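To make the difference from the original `line == 'open'` test concrete, here is a small sketch using one shortened line in the style of the log. It uses print() calls so it runs on Python 3; the variable names are only illustrative.

```python
# One shortened line in the style of the log from the first post.
line = '<record version="2" event="open(2) - read" host="computer_one"></record>\n'

# Equality is true only when the whole line is exactly 'open'.
is_exactly_open = (line == 'open')

# The `in` operator is true when 'open' appears anywhere in the line.
contains_open = ('open' in line)

print(is_exactly_open)
print(contains_open)
```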
  6. #4
  7. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2013
    Posts
    19
    Rep Power
    0
    I looked at several of the XML parsers available in Python and decided to use xml.etree.ElementTree.

    The problem is, almost all of my XML lines have the same tag - "record". Only the first XML line has a different tag of "file". It is difficult to find the information I need because they all have the same tag. My output looks like this:

    file {'iso8601': '2013-05-21 11:43:41.199 -05:00'}
    record {'iso8601': '2013-05-21 11:43:12.929 -05:00', 'modifier': 'na', 'version': '2', 'event': 'system booted'}
    record {'host': 'server_one', 'version': '2', 'event': 'init(1m)', 'iso8601': '2013-05-21 11:43:45.249 -05:00'}
    record {'host': 'server_one', 'version': '2', 'event': 'profile command', 'iso8601': '2013-05-21 11:43:50.117 -05:00'}
    record {'host': 'server_one', 'version': '2', 'event': 'login - local', 'iso8601': '2013-05-21 11:43:58.239 -05:00'}
    record {'host': 'server_one', 'version': '2', 'event': 'login - local', 'iso8601': '2013-05-21 11:44:06.209 -05:00'}
    record {'host': 'server_one', 'version': '2', 'event': 'open(2) - read,write', 'iso8601': '2013-05-21 11:44:06.209 -05:00'}
    record {'host': 'server_one', 'version': '2', 'event': 'open(2) - read,write', 'iso8601': '2013-05-21 11:44:06.210 -05:00'}
    record {'host': 'server_one', 'version': '2', 'event': 'open(2) - read,write', 'iso8601': '2013-05-21 11:44:06.210 -05:00'}

    This is the code I'm using:

    Code:
    from xml.etree import ElementTree as ET
    
    count = 0
    
    tree = ET.parse('large_file.xml')
    root = tree.getroot()
    
    for child in root:
        print (child.tag, child.attrib)
    First of all, are these all dictionaries with the same key, "record" (except the first one)?

    I also need to get a count of all of the elements that are the same. The last two elements are the same, so I would need a count of 2 for them. How would I do this?
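One possible way to count identical records, sketched with collections.Counter. This assumes "the same" means the same tag and the same attributes (child elements are ignored), and that the records sit under a single root element, here assumed to be `<audit>`. The helper name `record_key` is made up.

```python
from collections import Counter
from xml.etree import ElementTree as ET

# Trimmed sample: only the attributes matter for this sketch.
SAMPLE = """<audit>
<record version="2" event="login - local" host="server_one" iso8601="2013-05-21 11:43:58.239 -05:00"/>
<record version="2" event="open(2) - read,write" host="server_one" iso8601="2013-05-21 11:44:06.210 -05:00"/>
<record version="2" event="open(2) - read,write" host="server_one" iso8601="2013-05-21 11:44:06.210 -05:00"/>
</audit>"""


def record_key(element):
    # Dicts are not hashable, so turn the attributes into a sorted tuple.
    return (element.tag, tuple(sorted(element.attrib.items())))


root = ET.fromstring(SAMPLE)
counts = Counter(record_key(child) for child in root)

for (tag, attribs), n in counts.items():
    print("{0} x{1}: {2}".format(tag, n, dict(attribs)))
```

If records should count as identical only when their child elements also match, the key would need to include those as well.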

    Also, is there a way I could dig into the attribute of the "records" and search for a keyword, like open, and print only those lines? I have tried something like:

    Code:
    for child in root:
      if 'open' in child.attrib:
        print child
    But I get no output with that. I appreciate any help.
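A likely explanation for the empty output: on a dictionary, `in` tests the keys, not the values, and child.attrib is a dictionary whose keys are attribute names such as 'event'. A small sketch with an attribute dictionary copied by hand from the output above:

```python
# Attribute dictionary copied from the output above.
attrib = {
    'host': 'server_one',
    'version': '2',
    'event': 'open(2) - read,write',
    'iso8601': '2013-05-21 11:44:06.209 -05:00',
}

# `in` on a dict looks at the keys only, so this is False.
key_match = 'open' in attrib

# Looking inside one specific value is an ordinary substring test.
event_match = 'open' in attrib.get('event', '')

print(key_match)
print(event_match)
```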
  8. #5
  9. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,855
    Rep Power
    481
    Try dwblas's solution. It's far and away the simplest thing that could possibly work. It will keep working until the code that writes the XML changes, although on occasion it will omit a record that has "open" in an unexpected place, depending on the information in that record. You could filter by the index position of "open" instead.
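The idea of filtering by position could be sketched like this: instead of asking whether 'open' occurs anywhere in the line, check where it occurs. The helper name `is_open_event` is made up, the second test line is invented to show the false-positive case, and the sketch assumes the event attribute is always written as event="..." with double quotes, as in the sample lines.

```python
def is_open_event(line):
    # Hypothetical helper: True only when the event attribute starts with 'open'.
    marker = 'event="'
    start = line.find(marker)
    if start == -1:
        return False  # no event attribute on this line
    # 'open' must sit directly after the opening quote of the attribute.
    return line.startswith('open', start + len(marker))


# Shortened line from the first post.
open_line = '<record version="2" event="open(2) - read" host="computer_one"><path>/var/ld/ld.config</path></record>'
# Made-up line: 'open' appears in a path, not in the event name.
other_line = '<record version="2" event="login - local" host="computer_one"><path>/tmp/open_me</path></record>'

print(is_open_event(open_line))
print(is_open_event(other_line))
```

A plain `'open' in line` test would skip both lines; the positional check skips only the first.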


    Please explain the output you need.

    dwblas gave you a literal solution:

    if 'open' not in line:
        print line


    SuperOscar told you that regular expressions don't parse XML well. Stack-based parsers do a good job.
    [code]Code tags[/code] are essential for python code and Makefiles!
  10. #6
  11. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2013
    Posts
    19
    Rep Power
    0
    Originally Posted by b49P23TIvg
    Try dwblas's solution. It's far and away the simplest thing that could possibly work.

    SuperOscar told you that regular expressions don't parse xml well. Stack based parsers do a good job.
    I thought I did try dwblas's solution. My code:

    Code:
    for child in root:
      if 'open' in child.attrib:
        print child
    is essentially the same as dwblas's code, except I'm printing when the word 'open' is found. I'm using the "in" keyword. I think the problem is, I'm trying to search through a dictionary. I don't know enough about dictionaries to know if you can look into the key's value and look for a keyword. I'm sure there is a way.

    Basically, I need this output:

    record {'host': 'server_one', 'version': '2', 'event': 'login - local', 'iso8601': '2013-05-21 11:43:58.239 -05:00'}
    record {'host': 'server_one', 'version': '2', 'event': 'login - local', 'iso8601': '2013-05-21 11:44:06.209 -05:00'}
    record {'host': 'server_one', 'version': '2', 'event': 'open(2) - read,write', 'iso8601': '2013-05-21 11:44:06.209 -05:00'}
    record {'host': 'server_one', 'version': '2', 'event': 'open(2) - read,write', 'iso8601': '2013-05-21 11:44:06.210 -05:00'}
    record {'host': 'server_one', 'version': '2', 'event': 'open(2) - read,write', 'iso8601': '2013-05-21 11:44:06.210 -05:00'}

    To turn into this:

    record {'host': 'server_one', 'version': '2', 'event': 'open(2) - read,write', 'iso8601': '2013-05-21 11:44:06.209 -05:00'}
    record {'host': 'server_one', 'version': '2', 'event': 'open(2) - read,write', 'iso8601': '2013-05-21 11:44:06.210 -05:00'}
    record {'host': 'server_one', 'version': '2', 'event': 'open(2) - read,write', 'iso8601': '2013-05-21 11:44:06.210 -05:00'}

    I only want it to print dictionary entries with 'open' in the value.

    I also gave up on using regex to parse the XML file as others suggested earlier. I'm using etree.ElementTree because I read that it is less low-level than Expat for parsing XML.
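A sketch of one way to get that output with ElementTree, assuming the keyword only matters in the event attribute. Element.get returns the attribute's value, or the supplied default when the attribute is missing (the `file` element has no event). The inline sample is trimmed, and the `<audit>` root and the name `open_records` are assumptions for illustration.

```python
from xml.etree import ElementTree as ET

# Trimmed sample with attributes copied from the output above.
SAMPLE = """<audit>
<record version="2" event="login - local" host="server_one" iso8601="2013-05-21 11:44:06.209 -05:00"/>
<record version="2" event="open(2) - read,write" host="server_one" iso8601="2013-05-21 11:44:06.209 -05:00"/>
<record version="2" event="open(2) - read,write" host="server_one" iso8601="2013-05-21 11:44:06.210 -05:00"/>
</audit>"""

root = ET.fromstring(SAMPLE)

# Keep only children whose event attribute contains the keyword.
open_records = [child for child in root if 'open' in child.get('event', '')]

for child in open_records:
    print("{0} {1}".format(child.tag, child.attrib))
```

With a real file, `ET.parse(filename).getroot()` would take the place of `ET.fromstring(SAMPLE)`.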
  12. #7
  13. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,855
    Rep Power
    481
    I'm as certain as I can be that dwblas said
    Code:
    #!/usr/bin/python
    
    import re
    
    with open('small_log.xml', 'r') as handle:
      for line in handle:
        if  'open' in line:
          continue
        else:
          print line
  14. #8
  15. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2013
    Posts
    19
    Rep Power
    0
    Originally Posted by b49P23TIvg
    I'm as certain as I can be that dwblas said
    Code:
    #!/usr/bin/python
    
    import re
    
    with open('small_log.xml', 'r') as handle:
      for line in handle:
        if  'open' in line:
          continue
        else:
          print line
    Okay, maybe I should just start over to make everything clear...

    I have an XML file (Solaris BSM XML file spit out of praudit if you know what that is) that looks like the following:

    <?xml version='1.0' encoding='UTF-8' ?>
    <?xml-stylesheet type='text/xsl' href='file:///usr/share/lib/xml/style/adt_record.xsl.1' ?>

    <!DOCTYPE audit PUBLIC '-//Sun Microsystems, Inc.//DTD Audit V1//EN' 'file:///usr/share/lib/xml/dtd/adt_record.dtd.1'>

    <audit>
    <file iso8601="2013-05-21 11:43:41.199 -05:00"></file>
    <record version="2" event="system booted" modifier="na" iso8601="2013-05-21 11:43:12.929 -05:00"><text>booting kernel</text></record>
    <record version="2" event="init(1m)" host="sv01" iso8601="2013-05-21 11:43:45.249 -05:00"><text>booted</text><return errval="success" retval="0"/></record>
    <record version="2" event="profile command" host="sv01" iso8601="2013-05-21 11:43:50.117 -05:00"><path>/</path><path>/usr/share/webconsole/private/bin/smcwebstart</path><cmd><argv>/var/webconsole/domains/console/conf/start_tomcat</argv></cmd><process audit-uid="-2" uid="noaccess" gid="noaccess" ruid="noaccess" rgid="noaccess" pid="929" sid="0" tid="0 0 0.0.0.0"/><privilege set-type="Inheritable">proc_audit</privilege><return errval="success" retval="0"/></record>
    <record version="2" event="login - local" host="sv01" iso8601="2013-05-21 11:43:58.239 -05:00"><subject audit-uid="root" uid="root" gid="root" ruid="root" rgid="root" pid="861" sid="861" tid="0 0 sv01"/><text>invalid password</text><return errval="failure: Interrupted system call" retval="-1"/></record>
    <record version="2" event="login - local" host="sv01" iso8601="2013-05-21 11:44:06.209 -05:00"><subject audit-uid="root" uid="root" gid="root" ruid="root" rgid="root" pid="975" sid="975" tid="0 0 sv01"/><text>successful login</text><return errval="success" retval="0"/></record>
    <record version="2" event="open(2) - read,write" host="sv01" iso8601="2013-05-21 11:44:06.209 -05:00"><path>/var/adm/utmpx</path><subject audit-uid="root" uid="root" gid="root" ruid="root" rgid="root" pid="976" sid="975" tid="0 0 sv01"/><return errval="success" retval="4"/></record>
    <record version="2" event="open(2) - read,write" host="sv01" iso8601="2013-05-21 11:44:06.210 -05:00"><path>/var/adm/utmpx</path><attribute mode="100644" uid="root" gid="bin" fsid="65538" nodeid="2294" device="18446744073709551615"/><subject audit-uid="root" uid="root" gid="root" ruid="root" rgid="root" pid="976" sid="975" tid="0 0 sv01"/><return errval="success" retval="5"/></record>
    <record version="2" event="open(2) - read,write" host="sv01" iso8601="2013-05-21 11:44:06.210 -05:00"><path>/etc/utmppipe</path><attribute mode="10600" uid="root" gid="root" fsid="65538" nodeid="368715" device="18446744073709551615"/><subject audit-uid="root" uid="root" gid="root" ruid="root" rgid="root" pid="976" sid="975" tid="0 0 sv01"/><return errval="success" retval="6"/></record>
    <record version="2" event="open(2) - write" modifier="sp" host="sv01" iso8601="2013-05-21 11:44:06.210 -05:00"><path>/var/adm/wtmpx</path><attribute mode="100644" uid="adm" gid="adm" fsid="65538" nodeid="2295" device="18446744073709551615"/><subject audit-uid="root" uid="root" gid="root" ruid="root" rgid="root" pid="976" sid="975" tid="0 0 sv01"/><use_of_privilege result="successful use of priv">file_dac_write</use_of_privilege><return errval="success" retval="4"/></record>
    <record version="2" event="auditon(2) - get audit state" modifier="sp" host="sv01" iso8601="2013-05-21 11:44:06.211 -05:00"><subject audit-uid="root" uid="root" gid="root" ruid="root" rgid="root" pid="976" sid="975" tid="0 0 sv01"/><use_of_privilege result="successful use of priv">sys_audit</use_of_privilege><return errval="success" retval="0"/></record></audit>

    and I have used the following code:

    Code:
    #!/usr/bin/python
    
    from xml.etree import ElementTree as ET
    
    tree = ET.parse('small_bsm.xml')
    root = tree.getroot()
    
    for child in root:
        print (child.tag, child.attrib)
    to get the XML file to the following format:

    file {'iso8601': '2013-05-21 11:43:41.199 -05:00'}
    record {'iso8601': '2013-05-21 11:43:12.929 -05:00', 'modifier': 'na', 'version': '2', 'event': 'system booted'}
    record {'host': 'sv01', 'version': '2', 'event': 'init(1m)', 'iso8601': '2013-05-21 11:43:45.249 -05:00'}
    record {'host': 'sv01', 'version': '2', 'event': 'profile command', 'iso8601': '2013-05-21 11:43:50.117 -05:00'}
    record {'host': 'sv01', 'version': '2', 'event': 'login - local', 'iso8601': '2013-05-21 11:43:58.239 -05:00'}
    record {'host': 'sv01', 'version': '2', 'event': 'login - local', 'iso8601': '2013-05-21 11:44:06.209 -05:00'}
    record {'host': 'sv01', 'version': '2', 'event': 'open(2) - read,write', 'iso8601': '2013-05-21 11:44:06.209 -05:00'}
    record {'host': 'sv01', 'version': '2', 'event': 'open(2) - read,write', 'iso8601': '2013-05-21 11:44:06.210 -05:00'}
    record {'host': 'sv01', 'version': '2', 'event': 'open(2) - read,write', 'iso8601': '2013-05-21 11:44:06.210 -05:00'}
    record {'iso8601': '2013-05-21 11:44:06.210 -05:00', 'host': 'sv01', 'modifier': 'sp', 'version': '2', 'event': 'open(2) - write'}
    record {'iso8601': '2013-05-21 11:44:06.211 -05:00', 'host': 'sv01', 'modifier': 'sp', 'version': '2', 'event': 'auditon(2) - get audit state'}

    Now, I'm pretty sure this is a dictionary. The keys are both "file" and "record". If I adapt the code posted earlier (possibly incorrectly, since this is a dictionary and I'm not sure how to handle it):

    Code:
    from xml.etree import ElementTree as ET
    
    tree = ET.parse('small_bsm.xml')
    root = tree.getroot()
    
    for child in root:
        if 'open' in child:
            continue
        else:
            print (child.attrib)
    I get the exact same results. Values in the dictionary that contain the word 'open' are not skipped as they should be. That is the problem. I am not trying to straight-up parse through the raw XML document, so the code that was provided is not working.

    How do I look for a keyword in a value in a dictionary?
  16. #9
  17. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,855
    Rep Power
    481
    Summary: Try the code you quoted (my interpretation of dwblas's thoughts)!

    If you must work from the printed output instead: only part of each line can be interpreted as a Python dictionary. Remove the bad part, make a dictionary from the rest, and use its values() method.

    Let's see if Python can interpret this as a dictionary.
    Code:
    >>> file {'iso8601': '2013-05-21 11:43:41.199 -05:00'}
      File "<stdin>", line 1
        file {'iso8601': '2013-05-21 11:43:41.199 -05:00'}
             ^
    SyntaxError: invalid syntax
    Nope. (You might have converted part of the XML file to JSON. I don't know for sure.)

    Part of each line looks like a dictionary.
    Code:
    >>> LINE = '''file {'iso8601': '2013-05-21 11:43:41.199 -05:00'}\n'''
    >>> eval(LINE[LINE.index('{'):])
    {'iso8601': '2013-05-21 11:43:41.199 -05:00'}
    >>> any('open' in v for v in eval(LINE[LINE.index('{'):]).values())
    False
    >>> LINE = '''record {'host': 'sv01', 'version': '2', 'event': 'open(2) - read,write', 'iso8601': '2013-05-21 11:44:06.210 -05:00'}\n'''
    >>> any('open' in v for v in eval(LINE[LINE.index('{'):]).values())
    True
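Since eval runs whatever expression it is given, ast.literal_eval from the standard library may be a safer swap for text that comes from a file: it only accepts Python literals such as dicts, strings and numbers. A sketch of the same check, with a made-up helper name `has_keyword`:

```python
import ast

LINE = "record {'host': 'sv01', 'version': '2', 'event': 'open(2) - read,write', 'iso8601': '2013-05-21 11:44:06.210 -05:00'}\n"
FILE_LINE = "file {'iso8601': '2013-05-21 11:43:41.199 -05:00'}\n"


def has_keyword(line, keyword):
    # Hypothetical helper: parse the {...} part and search its values.
    text = line[line.index('{'):].strip()
    attrib = ast.literal_eval(text)
    return any(keyword in value for value in attrib.values())


print(has_keyword(LINE, 'open'))
print(has_keyword(FILE_LINE, 'open'))
```

This still round-trips through printed text; reading the attributes directly from the parsed elements avoids the string handling altogether.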
