Thread: file processing

    #1
  1. No Profile Picture
    Junior Member
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2003
    Posts
    6
    Rep Power
    0

    file processing


    Hello,

    I am a student in linguistics who needs to do some file and string processing. I've been using Perl for some time, but I'm looking for a more proper language.

    I've been playing with Python, and it might be the pearl I'm looking for... (ok, easy joke)

    I wrote a basic script in Python which reads a 2MB file and loads it into a string. I must confess that I am disappointed, because the script is amazingly slow (about 5 minutes on an Athlon XP 1500+, against less than 1 second with Perl). Performance is not my main concern, but I want my scripts to perform decently.

    Could someone please tell me whether the way I do it in Python is the best way to go? Is it me, or is Python definitely slow with large files?
    Thank you very much,

    Julien.

    #!/usr/bin/python

    import string

    my_string = ''
    my_file = open('large_file', 'r')
    for line in my_file.readlines():
        line.strip()
        my_string = my_string + line
    my_file.close()

    ### end ###

    And the equivalent file in Perl :

    #!/usr/bin/perl

    open(TXT, "my_file") or die "pb TXT : $!\n";
    while(<TXT>) {
        chomp;
        $my_string .= $_;
    }
    close(TXT);
  2. #2
  3. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
    Mmmm, OK dude, you're a little scary: you managed to write a piece of Python with nearly exactly the same format as Perl.

    Python isn't as fast at string processing as Perl, but then that's what Perl was made for, so I don't expect it to be.

    Anyway, assuming you're using Python 2 or later, this should work better. I noticed you imported the string module and didn't use it, which is bad practice in itself.

    Code:
    #!/usr/bin/env python
    
    my_string = open('large_file', 'r').read().split('\n')
    This will give you a list containing all the lines in the file. Assuming you want it as a single string, you can call join on it as below:

    Code:
    #!/usr/bin/env python
    
    my_string = ''.join(open('large_file', 'r').read().split('\n'))
    readlines() reads in every line but keeps the '\n' on each one. Rather than looping through and removing each '\n', you can split the string returned by read() on each newline and skip that whole step, all in a single line (although you could easily pull this out into several lines).
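To make the difference concrete, here is a small sketch. It assumes a Python 3 interpreter (the thread itself predates it) and uses an in-memory string, `sample`, as a made-up stand-in for the contents of 'large_file'.

```python
import io

# Hypothetical stand-in for the contents of 'large_file'.
sample = "first line\nsecond line\nthird line\n"

# readlines() keeps the trailing newline on every line.
kept = io.StringIO(sample).readlines()

# split('\n') drops the newlines, but text ending in '\n'
# yields one extra empty string at the end of the list.
split_lines = sample.split('\n')

# splitlines() drops the newlines without the trailing empty entry.
clean_lines = sample.splitlines()

# Joining either list with '' gives the same single string.
joined = ''.join(split_lines)

print(kept)
print(split_lines)
print(clean_lines)
print(joined)
```

The extra empty string from split('\n') is harmless when everything is joined back together, but it matters if the list is used as "the lines of the file".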

    Hope this helps,

    Mark.
  4. #3
  5. No Profile Picture
    Junior Member
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2003
    Posts
    6
    Rep Power
    0
    Originally posted by netytan
    Mmmm, OK dude, you're a little scary: you managed to write a piece of Python with nearly exactly the same format as Perl.
    You're right. I felt I was doing it the perlish way...

    Python isn't as fast at string processing as Perl, but then that's what Perl was made for, so I don't expect it to be.
    OK, but as I mentioned, performance is not my main concern.

    snip...

    Code:
    #!/usr/bin/env python
    
    my_string = ''.join(open('large_file', 'r').read().split('\n'))
    readlines() reads in every line but keeps the '\n' on each one. Rather than looping through and removing each '\n', you can split the string returned by read() on each newline and skip that whole step, all in a single line (although you could easily pull this out into several lines).
    This is exactly what I needed, and it performs MUCH better.

    Thank you very much, I think I'm going to learn Python seriously now.

    Julien.
  6. #4
  7. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
    Glad to help. I'm sure you'll have a lot of fun learning Python! And once you've caught the Py'bug, it'll be a long time until you touch Perl again.

    Take care,
    Mark.
  8. #5
  9. No Profile Picture
    Junior Member
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2003
    Location
    Alexandria, VA
    Posts
    5
    Rep Power
    0

    Re: file processing


    Code:
    for line in my_file.readlines():
        line.strip()
        my_string = my_string + line
    It's that third line above that was probably killing your performance. Strings in Python are immutable, so each time you hit that line you were creating a new string and copying the old one into it along with the new line. When your strings get really long and you do this a lot it can really crush performance as these 2MB strings get copied back and forth in memory.

    (The other ramification of strings being immutable is that the 2nd line above wasn't doing what you probably intended. Calling line.strip() doesn't alter line, it returns a new string that is stripped leaving line itself unchanged.)
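A tiny sketch of the immutability point, assuming Python 3; the sample values are made up for illustration:

```python
line = "  some text \n"

# strip() returns a new string; the original is left untouched.
result = line.strip()
unchanged = line

# To keep the stripped value, rebind the name to the result.
line = line.strip()

# Concatenation likewise builds a new object instead of extending the old one.
a = "abc"
b = a            # b refers to the same string object as a
a = a + "def"    # a now names a new string; b still names "abc"

print(repr(unchanged), repr(result), repr(line))
print(a, b)
```

Note that strip() removes all leading and trailing whitespace, not just the newline, so it is not an exact match for Perl's chomp; rstrip('\n') is closer.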

    Your original script would probably work just fine if you save all the concatenation until the very end:
    Code:
    import string

    my_file = open('large_file', 'r')
    lineList = []
    for line in my_file.readlines():
        # strip() returns a new string, so append its result
        lineList.append(line.strip())
    my_file.close()
    # Concatenate once at the end (Python 2 string module function)
    my_string = string.join(lineList, '')
    Execution time for this is probably in the same ballpark as the solutions in the other replies; I think it's mostly a stylistic choice. Some folks like to get lots done with one line of code, some prefer to list out each step more explicitly. Go with whichever you think is the most readable.
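For anyone reading this with a newer interpreter, here is a hedged sketch of the same "collect, then join once" idea, assuming Python 3, where the string method ''.join(...) is used in place of the string module function. The function name `read_without_newlines` and the temporary demo file are hypothetical, not part of the original script.

```python
import os
import tempfile

def read_without_newlines(path):
    """Return the file's contents as one string with newlines removed."""
    parts = []
    with open(path, 'r') as handle:
        for line in handle:                  # iterate one line at a time
            parts.append(line.rstrip('\n'))  # like Perl's chomp: drop only the newline
    return ''.join(parts)                    # a single concatenation at the end

# Small demonstration with a temporary file (made-up data).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("alpha\nbeta\ngamma\n")
    demo_path = tmp.name

demo_result = read_without_newlines(demo_path)
os.remove(demo_path)
print(demo_result)
```

Using rstrip('\n') rather than strip() keeps leading and trailing spaces inside each line, which may or may not be what is wanted for a given corpus.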
  10. #6
  11. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
    Hi five,

    I like writing pieces of code that use as little space as possible. From my point of view, the less you have to type, the faster your dev time and the smaller your apps will be.

    On occasion I do use longer versions, simply because they tend to be slightly easier to read than some of the one-liners, but even these are quite readable in Python. I like to use the built-in string methods, i.e. 'string object'.split(), instead of importing the string module and using string.split('string object').

    It's just a matter of opinion: whatever suits your needs and/or tastes, I guess.

    Oh, I just thought of another way to do it: use replace on the file content, skipping the whole loop, strip, append, join.

    Code:
    #!/usr/bin/env python
    
    file = open('large_file', 'r').read()
    file = file.replace('\n', '')
    Have fun guys,
    Mark.
  12. #7
  13. No Profile Picture
    Junior Member
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2003
    Posts
    6
    Rep Power
    0
    Hello,

    FiveGrainJa > Thank you for this very instructive explanation. I now understand why it was so slow.

    netytan > As far as I can see, there's more than one way to do it. Your last example (with the replace method) is the fastest, and performs as fast as the Perl equivalent.
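A rough way to check this kind of timing claim on one's own machine is sketched below, assuming Python 3 and the standard timeit module. The test data and the four function names are made up for illustration; absolute figures will vary with machine and interpreter.

```python
import timeit

# Hypothetical test data: about 2 MB of short lines, standing in for 'large_file'.
text = "the quick brown fox jumps over the lazy dog\n" * 45000
lines = text.splitlines(True)   # lines with their newlines kept, like readlines()

def naive_concat():
    # The pattern from the original script: grow the string inside the loop.
    result = ''
    for line in lines:
        result = result + line.rstrip('\n')
    return result

def list_join():
    # Collect the pieces, then concatenate once.
    return ''.join([line.rstrip('\n') for line in lines])

def split_join():
    return ''.join(text.split('\n'))

def replace_newlines():
    return text.replace('\n', '')

candidates = [naive_concat, list_join, split_join, replace_newlines]
timings = {}
for func in candidates:
    # number=3 keeps the run short.
    timings[func.__name__] = timeit.timeit(func, number=3)

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

Current CPython can often optimize the grow-in-a-loop pattern, so the naive version may look far less bad today than the five minutes reported at the start of the thread; the join and replace forms remain the more dependable choice.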

    Thank you both for your help, you have convinced me to move away from Perl.

    Julien.
  14. #8
  15. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
    You're very welcome, Julien,

    There's usually more than one way to do something in most languages (Perl is famous for this), but one way is usually better than another, and if it isn't, it comes down to personal choice which one you use.

    Usually you should take into account the performance, size, and readability of your code block.

    Any other factors, five? Those are the main ones I can see.

    Mark.
  16. #9
  17. Wacky hack
    Devshed Novice (500 - 999 posts)

    Join Date
    Apr 2001
    Location
    London, England
    Posts
    513
    Rep Power
    14
    I have to admit I'm confused as to why you say you want to move from Perl to a more proper language. For processing strings, Perl is the proper language, it's designed exactly for that. Python just isn't as good when it comes to doing more complicated string manipulation. It's only worth moving to Python if the script does a lot more than string manipulation, and the string manipulation is fairly simple, because then you might find it easier to extend/embed your code in Python.
  18. #10
  19. No Profile Picture
    Junior Member
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2003
    Posts
    6
    Rep Power
    0
    Hi,

    Originally posted by telex4
    I have to admit I'm confused as to why you say you want to move from Perl to a more proper language. For processing strings, Perl is the proper language, it's designed exactly for that.
    Oops, I didn't want to open a Perl vs Python debate.

    Perl IS nice at string manipulation, for sure. But the point is that I'm looking for a *clear* (and not *proper*; I was biased by my mother tongue) general-purpose language, which Perl is certainly not, in my opinion.

    There are things I'll continue doing with Perl, but I'll try to do as much as possible ("use once" scripts, or personal and non-critical stuff) without it.

    Mark > I'll keep in mind your remarks concerning performance/readability/compactness.

    Cheers,

    Julien.
  20. #11
  21. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
    Oops, I didn't want to open a Perl vs Python debate
    Oh, I don't think you did, Julien. Telex is a big Python user; have you seen QuickRip? All he's saying is that Perl is better for string manipulation than Python, but you already know that is what Perl was made to do, whereas Python is more general purpose (imo).

    Mark.
  22. #12
  23. Wacky hack
    Devshed Novice (500 - 999 posts)

    Join Date
    Apr 2001
    Location
    London, England
    Posts
    513
    Rep Power
    14
    Heh, yes, I definitely prefer Python as a language. I'm in the middle of coding some extra features into a large Perl CGI solution I did for a client last summer, and blimey, is it a pain having to use Perl again
  24. #13
  25. No Profile Picture
    Junior Member
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2003
    Posts
    6
    Rep Power
    0
    OK, thank you for the link. I must confess I didn't know it (but I'm about to try it).

    Julien.
