#1
  1. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    32
    Rep Power
    2

    Huge text file reader. Need some help!


    I have a problem with a program where I have to process a "huge" text file. The file contains the letters that represent a proteome attached to a description.

    Basically, I have to separate the letters (the protein) from the description of the proteome itself, and insert only the proteome representation into a list, which is returned once processing is done. I've done this, and it works pretty well on a relatively small text file.

    With a huge file, the program runs for a while and then crashes with the usual Windows message that the program has stopped working. As far as I can tell, the running time and RAM usage of my approach are the problem.

    What lighter and faster ways are there to do the task described above? In case I wasn't clear enough:


    this is what the text contains:

    >g:1212ladassda[1212]ASSGDSGDJFGJFTDFGHNDF>g:12124[121]SAFSDSGSGDF

    (>g:123124[121] - this is the description which I want to discard)
    the function should return a list like this :

    [ 'ASSGDSGDJFGJFTDFGHNDF' , 'SAFSDSGSGDF' ]

    Any ideas? :]
  2. #2
  3. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2012
    Posts
    10
    Rep Power
    0
    If you run the program from the terminal/command prompt, you will see the actual error output.

    Define "large": how big is the file?
    Stop reading the file in one go; you're consuming all the memory on the system. Read in 16MB or so chunks instead.

    Code:
    # f is a file object you have already opened, e.g. f = open(path)
    data = f.read(16 * 1024 * 1024)
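
    The chunked read above can be wrapped in a small generator so the rest of the program never holds more than one chunk at a time. This is only a sketch: the name read_chunks is made up for illustration, and the demo uses an in-memory io.StringIO in place of a real file.

```python
import io

def read_chunks(f, chunk_size=16 * 1024 * 1024):
    """Yield successive pieces of at most chunk_size characters from f."""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:  # an empty result means end of file
            break
        yield chunk

# Small in-memory demo; a real run would pass an open file object instead.
demo = io.StringIO('abcdefghij')
pieces = list(read_chunks(demo, chunk_size=4))
print(pieces)  # ['abcd', 'efgh', 'ij']
```

    Note that a record can straddle two chunks, so whatever parses the chunks has to carry its state from one chunk to the next.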


    If every line has the same format as your example, where the first sequence sits between the first ] and the following >, and the second sequence comes after the second ], then the code below will work.

    However, if the format changes, or if the sequences themselves contain ] or >, then it will break.

    Code:
    s = '>g:1212ladassda[1212]ASSGDSGDJFGJFTDFGHNDF>g:12124[121]SAFSDSGSGDF'
    a = s.split(']')
    two = a[-1]
    one = a[1].split('>')[0]
    print([one,two])
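
    For more than two records, the same idea can be generalized with a regular expression. This sketch assumes every description starts with '>' and ends at the next ']', and that the sequences never contain either character; the names HEADER and sequences are hypothetical.

```python
import re

# A description: '>' followed by any characters that are not ']', then ']'.
HEADER = re.compile(r'>[^\]]*\]')

def sequences(text):
    """Return the sequence strings in text, with the descriptions dropped."""
    # Splitting on the descriptions leaves the sequences; the empty string
    # before the first description is filtered out.
    return [part for part in HEADER.split(text) if part]

s = '>g:1212ladassda[1212]ASSGDSGDJFGJFTDFGHNDF>g:12124[121]SAFSDSGSGDF'
print(sequences(s))  # ['ASSGDSGDJFGJFTDFGHNDF', 'SAFSDSGSGDF']
```

    This still needs the whole text in memory, so it suits files that fit in RAM.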
  4. #3
  5. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,839
    Rep Power
    480
    How many lines are there in the input file?

    1 line.
    10 million lines.

    Why do you want to use python?

    Is this a correct description of the record separator?
    '>' followed by any number of characters that are not ']' followed by ']'


    If the file is too large to fit in memory, then process it 1 character at a time, writing the result to another file as you go and storing almost nothing in memory.
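
    A rough Python sketch of that streaming approach follows. It assumes a description runs from '>' to the next ']' and that everything else is sequence data; the function name extract_sequences and the one-sequence-per-line output format are choices made for illustration. It reads in chunks but examines one character at a time, so only a couple of flags and the current chunk are held in memory.

```python
import io

def extract_sequences(src, dst, chunk_size=1 << 20):
    """Copy only the sequence letters from src to dst, one sequence per line."""
    in_header = False  # True while between '>' and the next ']'
    wrote_any = False  # True once the current sequence has produced output
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        for ch in chunk:
            if ch == '>':
                # A new description starts: close the previous sequence, if any.
                if wrote_any:
                    dst.write('\n')
                    wrote_any = False
                in_header = True
            elif in_header:
                if ch == ']':
                    in_header = False
            elif not ch.isspace():
                dst.write(ch)
                wrote_any = True
    if wrote_any:
        dst.write('\n')

# In-memory demo with a tiny chunk size to exercise records that span chunks.
src = io.StringIO('>g:1212ladassda[1212]ASSGDSGDJFGJFTDFGHNDF>g:12124[121]SAFSDSGSGDF')
dst = io.StringIO()
extract_sequences(src, dst, chunk_size=5)
print(dst.getvalue())
```

    With real files, src and dst would be two objects returned by open(), one for reading and one for writing. A per-character loop in pure Python is slow on very large inputs, so treat this as a correctness baseline rather than a tuned solution.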

    flex would create the fastest program with sufficiently little programmer time (if I were the programmer).
    [code]Code tags[/code] are essential for python code and Makefiles!
