#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    11
    Rep Power
    0

    Python regex feedback


    Hey;

    I had a task at a client site to identify the IPs used by CUPS printers. Being much more familiar with Perl than Python, I got the job done that way. This being a Python shop, though, I wanted to go back and make sure I could figure out how to do it in Python. The Python script works, so that's all to the good. I'm hoping for some feedback on the script, though, to see how badly I mucked up my first Python regex experiment.

    The input is a standard printers.conf file with multiple printers defined in individual stanzas. Both scripts accept input from stdin and print the output as:

    Code:
    $ cat printers.conf | ./glom_printers.py
    chib120    10.217.65.216   lpd://10.217.65.216/
    crs1ptr    10.217.26.20    lpd://10.217.26.20/
    crs2ptr    10.217.26.16    lpd://10.217.26.16/
    ful3ptr    10.135.240.242  lpd://10.135.240.242/
    [[snip]]
    $ cat printers.conf | ./glom_printers   
    chib120    10.217.65.216   lpd://10.217.65.216/
    crs1ptr    10.217.26.20    lpd://10.217.26.20/
    crs2ptr    10.217.26.16    lpd://10.217.26.16/
    ful3ptr    10.135.240.242  lpd://10.135.240.242/
    [[nuther snip]]
    The Perl code is pretty compact and, at least for me, easily understood:

    Code:
    #!/usr/bin/perl
    
    use strict;
    
    my @lines = <>;
    my $lines = join('', @lines);
    
    while ($lines =~ m{(<printer.*?</printer>)}mgsi)
    {  my $chunk = $1;
       my ($ptr) = $chunk =~ m{<printer (\w+)}i;
       my ($uri) = $chunk =~ m{deviceuri (.*?)$}ims;
       my ($ip) = $uri =~ m{.*://(.*?)$};
       $ip =~ s/\/.*//g;
       printf("%-10s %-15s %s\n", $ptr, $ip, $uri);
       # print "\n$chunk\n";
    }
    Take out all the test code that I have in there and the Python code is at most a line or two longer:

    Code:
    #!/usr/bin/python
    
    import sys
    import re
    
    data = sys.stdin.readlines()
    lines = ''.join(data)
    ptn = re.compile(r"(<printer.*?</printer>)",re.I|re.S|re.M)
    iterator = ptn.finditer(lines)
    for chunk in iterator:
       # print '=' * 70
       # print chunk.group()
       # stanza = chunk.group()
       ptn1 = re.compile(r"<printer\s+([^>]*)>",re.I)
       printer = ptn1.search(chunk.group()).group(1)
       ptn2 = re.compile(r"deviceuri\s+(.*?)$",re.I|re.M)
       uri = ptn2.search(chunk.group()).group(1)
       ip = re.search(r"\w+://([^/]*)/*", uri).group(1)
       print "%-10s %-15s %s" % (printer, ip, uri)
       # print stanza
    It works, which is obviously the important thing. Other than that, though, any feedback on the regex usage? Is there anything that could have been done more efficiently?

    Thanks for any hints/tips/suggestions.

    Doug O'Leary
  2. #2
  3. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Location
    Iran
    Posts
    149
    Rep Power
    139
    Code:
    $ cat printers.conf | ./glom_printers.py
    chib120    10.217.65.216   lpd://10.217.65.216/
    crs1ptr    10.217.26.20    lpd://10.217.26.20/
    crs2ptr    10.217.26.16    lpd://10.217.26.16/
    ful3ptr    10.135.240.242  lpd://10.135.240.242/
    [[snip]]
    $ cat printers.conf | ./glom_printers   
    chib120    10.217.65.216   lpd://10.217.65.216/
    crs1ptr    10.217.26.20    lpd://10.217.26.20/
    crs2ptr    10.217.26.16    lpd://10.217.26.16/
    ful3ptr    10.135.240.242  lpd://10.135.240.242/
    [[nuther snip]]
    As I understand it, the data sample above is the output of your script (correct me if I'm wrong). Could you post a sample of the printer configuration file? Just a few lines would do, to see the format of the file and whether regular expressions are really needed for this problem.


    Regards,
    Dariyoosh
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    11
    Rep Power
    0
    Hey

    Yes, that's the script output. The printers.conf file is standard, but two complete stanzas follow:

    Code:
    <Printer chib120>
    Info HP4250
    Location XXX building - 12th floor
    DeviceURI lpd://10.217.65.216/
    State Idle
    StateTime 1295638458
    Accepting Yes
    Shared Yes
    JobSheets none none
    QuotaPeriod 0
    PageLimit 0
    KLimit 0
    OpPolicy default
    ErrorPolicy stop-printer
    </Printer>
    <Printer crs1ptr>
    Info HP LaserJet 4250
    Location YYY building, 4th floor
    DeviceURI lpd://10.217.26.20/
    State Idle
    StateTime 1328218009
    Accepting Yes
    Shared Yes
    JobSheets none none
    QuotaPeriod 0
    PageLimit 0
    KLimit 0
    OpPolicy default
    ErrorPolicy stop-printer
    </Printer>
    Thanks for the reply.

    Doug O'Leary
  6. #4
  7. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Location
    Iran
    Posts
    149
    Rep Power
    139
    OK, regular expressions are obviously useful here.

    What I understand from the sample you provided (correct me if I'm wrong) is that each line of the script output (supposing it is written to a file, for example) has the format

    <Printer Name> <tabulation> <IP Address> <tabulation> <lpd://IP Address/>

    It seems to me that, right at the beginning of your script, you read the whole file at once by using readlines(). As long as the file is not huge, that is OK. Otherwise it is more efficient to read the file line by line within a loop, though this may also mean changing the way you search for tokens with your regular expressions.

    So what I'm writing down here is just another way of doing this (and I don't claim it is the best or most efficient way):

    Code:
    import re
    
    def extrPrintersInfo(param_filePath):
        with open(param_filePath, "r") as printersFileCfg, open("output.txt", "w") as outputFile:
            
            # The pattern for a printer name
            progPrinterName = re.compile("^<Printer (?P<printerName>.+)>$")
            
            # The pattern for a printer IP address
            progPrinterIP = re.compile(
                r"^DeviceURI lpd://(?P<IPAddr>(\d{1,3}\.){3}\d{1,3})/$")
                
            printerName = None
            printerIPAddr = None
            
            for line in printersFileCfg:
                token = progPrinterName.match(line)
                
                if token:
                    printerName =  token.group("printerName")
                    continue
                
                token = progPrinterIP.match(line)
                if token:
                    printerIPAddr = token.group("IPAddr")
                    # So now you have both name and IP address
                    # So you can write them in the file
                    outputFile.write("%s\t%s\tlpd://%s/\n"
                                     % (printerName, printerIPAddr, printerIPAddr))
    
    
    extrPrintersInfo("data.txt")
    So if we run the script:

    Code:
    $ python -tt myscript.py
    $ cat output.txt
    chib120 10.217.65.216   lpd://10.217.65.216/
    crs1ptr 10.217.26.20    lpd://10.217.26.20/
    $
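The line-by-line idea described above can also be sketched as a generator that accepts any iterable of lines, so it could be fed sys.stdin the way the original script is. This is only a minimal sketch, not the code from this post: the names PRINTER_NAME, PRINTER_IP, iter_printers and sample are made up for the illustration, the two patterns are copied from the function above, and it uses the print() function form.

```python
import re

# Same two patterns as in the function above, compiled once at module level.
PRINTER_NAME = re.compile(r"^<Printer (?P<printerName>.+)>$")
PRINTER_IP = re.compile(
    r"^DeviceURI lpd://(?P<IPAddr>(\d{1,3}\.){3}\d{1,3})/$")


def iter_printers(lines):
    """Yield (name, ip) pairs from an iterable of printers.conf lines."""
    name = None
    for line in lines:
        token = PRINTER_NAME.match(line)
        if token:
            # Remember the current stanza's printer name.
            name = token.group("printerName")
            continue
        token = PRINTER_IP.match(line)
        if token and name is not None:
            yield name, token.group("IPAddr")


# Hypothetical sample input: a shortened stanza from the file shown earlier.
sample = [
    "<Printer chib120>\n",
    "Info HP4250\n",
    "DeviceURI lpd://10.217.65.216/\n",
    "</Printer>\n",
]

for name, ip in iter_printers(sample):
    print("%s\t%s\tlpd://%s/" % (name, ip, ip))
```

Passing sys.stdin (or an open file object) instead of sample should work the same way, since both yield one line at a time.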
    Regards,
    Dariyoosh
  8. #5
  9. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,892
    Rep Power
    481
    Building the regular expression state tables costs a lot of resources, which is why the re module offers the separate re.compile step. You don't have to use it; in fact, the re implementation I reviewed cached the hundred or so most recently used expressions.

    Sounds like I oppose compile? Anything but! Your regular expressions don't change, therefore factor them out of the loop.
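As a rough illustration of the two styles (a sketch only; SAMPLE, DEVICE_URI and the result variable names are invented for the example), a module-level re.search call and a pattern object compiled once up front find the same match. The difference is only where the compile, or the cache lookup, happens.

```python
import re

# Hypothetical one-line input taken from the sample stanza in this thread.
SAMPLE = "DeviceURI lpd://10.217.65.216/\n"

# Module-level call: re compiles the pattern, or fetches it from its
# internal cache, every time this line runs.
m1 = re.search(r"deviceuri\s+(.*?)$", SAMPLE, re.I | re.M)

# Compiled once, outside any loop; each use is then just a method call.
DEVICE_URI = re.compile(r"deviceuri\s+(.*?)$", re.I | re.M)
m2 = DEVICE_URI.search(SAMPLE)

uri_from_module_call = m1.group(1)
uri_from_compiled = m2.group(1)
print(uri_from_module_call == uri_from_compiled)
```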

    Where you had ''.join(readlines()), I used read().
    I compiled the regular expressions used in the loop and didn't bother compiling the expression used just once.

    Code:
    import sys
    import re
    
    data = sys.stdin.read()
    
    ptn1 = re.compile(r"<printer\s+([^>]*)>",re.I)
    ptn2 = re.compile(r"deviceuri\s+(.*?)$",re.I|re.M)
    ptn3 = re.compile(r"\w+://([^/]*)/*")
    
    for chunk in re.finditer(r"(<printer.*?</printer>)",data,re.I|re.S|re.M):
       printer = ptn1.search(chunk.group()).group(1)
       uri = ptn2.search(chunk.group()).group(1)
       ip = ptn3.search(uri).group(1)
       print "%-10s %-15s %s" % (printer, ip, uri)
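The read() substitution mentioned above can be checked with a small sketch. It assumes Python 3 and uses io.StringIO as a stand-in for sys.stdin; the text, joined and whole names are made up for the example.

```python
import io

# Hypothetical input: a shortened stanza in the printers.conf format.
text = "<Printer chib120>\nDeviceURI lpd://10.217.65.216/\n</Printer>\n"

# Original approach: read a list of lines, then glue them back together.
joined = "".join(io.StringIO(text).readlines())

# Revised approach: read the whole stream as one string.
whole = io.StringIO(text).read()

print(joined == whole)
```

Both give the same string, so read() simply skips building the intermediate list.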
    [code]Code tags[/code] are essential for python code and Makefiles!
  10. #6
  11. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    11
    Rep Power
    0
    Originally Posted by b49P23TIvg
    Your regular expressions don't change, therefore factor them out of the loop.
    Excellent point! Thanks for the feedback and the responses.

    I appreciate it.

    Doug O'Leary
