1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Rep Power

    Huge text file reader. Need some help!

    I have a problem with a program where I have to process a "huge" text file. The file contains the letters that represent a proteome attached to a description.

    Basically, I have to separate the letters (the protein) from the description of the proteome itself and insert only the proteome representation into a list, which is returned once processing is done. I've done this, and it works well on a relatively small text file.

    With a huge file, the program runs up to a point and then crashes with the usual Windows message that the program has stopped working. As I understand it, the time complexity and RAM usage of my approach are the problem...

    What alternative methods are there for a light and fast way to do the task described above? In case I wasn't clear enough:

    this is what the text contains:


    (>g:123124[121] - this is the description which I want to discard)
    the function should return a list like this :


    Any ideas? :]
  2. #2
  3. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2012
    Rep Power
    If you execute the program within the terminal/command prompt you will see the error output

    Define "huge": how large is the file?
    Stop reading the file in one go; you're consuming all the memory on the system. Read in 16MB or so chunks instead.

    data = f.read(16 * 1024 * 1024)  # f is a file object returned by open()
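A minimal sketch of how the chunked read above might be wrapped in a loop, so that only one chunk is held in memory at a time. The name `read_in_chunks` and the throwaway demo file are illustrative assumptions, not part of the original program.

```python
import os
import tempfile

def read_in_chunks(path, chunk_size=16 * 1024 * 1024):
    """Yield successive pieces of the file, each at most chunk_size characters."""
    with open(path, "r") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # an empty string means end of file
                break
            yield chunk

# Demonstration on a throwaway file (hypothetical data), with a tiny chunk size
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("ABCDEFGHIJ")
    demo_path = tmp.name

chunks = list(read_in_chunks(demo_path, chunk_size=4))
os.remove(demo_path)
```

Note that a description or a sequence can straddle a chunk boundary, so whatever parses the chunks has to carry its state from one chunk to the next.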

    If each line has the same format as your example, where the first part always comes after the first ] and before the next >, and the second part always comes after the second ], then this would work.

    However, if the format were to change, or the sequence characters themselves contained either ] or >, then it would break.

    s = '>g:1212ladassda[1212]ASSGDSGDJFGJFTDFGHNDF>g:12124[121]SAFSDSGSGDF'
    a = s.split(']')
    two = a[-1]
    one = a[1].split('>')[0]
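The same split-based idea can be extended to any number of records. This is only a sketch under the same format assumption as above (every sequence follows a `]` and ends at the next `>`); the function name `extract_sequences` is made up for illustration.

```python
def extract_sequences(text):
    """Return the text that follows each ']' up to the next '>'."""
    pieces = text.split(']')[1:]  # drop everything before the first ']'
    return [piece.split('>')[0] for piece in pieces]

s = '>g:1212ladassda[1212]ASSGDSGDJFGJFTDFGHNDF>g:12124[121]SAFSDSGSGDF'
sequences = extract_sequences(s)
# sequences == ['ASSGDSGDJFGJFTDFGHNDF', 'SAFSDSGSGDF']
```

It still needs the whole text in memory, so on its own it does not solve the large-file problem.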
  4. #3
  5. Contributing User
    Devshed God 1st Plane (5500 - 5999 posts)

    Join Date
    Aug 2011
    Rep Power
    How many lines are there in the input file?

    1 line?
    10 million lines?

    Why do you want to use python?

    Is this description of a record separator?
    '>' followed by any number of characters that are not ']' followed by ']'
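If that description of the separator does hold, it maps directly onto a regular expression. The sketch below assumes it holds and reuses the sample string posted earlier in the thread; `HEADER` is just an illustrative name.

```python
import re

# '>' followed by any number of characters that are not ']', followed by ']'
HEADER = re.compile(r'>[^\]]*\]')

s = '>g:1212ladassda[1212]ASSGDSGDJFGJFTDFGHNDF>g:12124[121]SAFSDSGSGDF'

# Splitting on the separator leaves only the sequences; the leading empty
# string (before the first header) is filtered out.
sequences = [part for part in HEADER.split(s) if part]
```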

    If the file is too large to fit in memory, then process it 1 character at a time, writing the result to another file as you go and storing almost nothing in memory.
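A rough sketch of that streaming approach in Python. It reads in modest chunks instead of literally one character per read call, but examines each character once and keeps only two flags of state. It assumes headers look like `>...]`, that line breaks in the input can be ignored, and that one sequence per output line is wanted; `strip_headers` is a hypothetical name.

```python
import io

def strip_headers(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst, dropping every '>...]' header.

    Sequences are written one per line; state survives chunk boundaries.
    """
    in_header = False
    wrote_sequence = False  # True once the current record has emitted a character
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # end of input
            break
        out = []
        for ch in chunk:
            if in_header:
                if ch == ']':
                    in_header = False  # header ends, sequence follows
            elif ch == '>':
                in_header = True
                if wrote_sequence:
                    out.append('\n')  # close the previous sequence
                    wrote_sequence = False
            elif ch not in '\r\n':  # assumption: input line breaks are ignored
                out.append(ch)
                wrote_sequence = True
        dst.write(''.join(out))

# In-memory demonstration; real use would pass two open file objects
src = io.StringIO('>g:1212ladassda[1212]ASSGDSGDJFGJFTDFGHNDF>g:12124[121]SAFSDSGSGDF')
dst = io.StringIO()
strip_headers(src, dst, chunk_size=8)
result = dst.getvalue()
```

With real files, `src` would be opened for reading and `dst` for writing, so memory use stays bounded by `chunk_size` regardless of the file size.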

    flex would create the fastest program with sufficiently little programmer time (if I were the programmer).
    [code]Code tags[/code] are essential for python code and Makefiles!
