December 24th, 2012, 04:34 PM
Huge text file reader. Need some help!
I have a problem with a program where I have to process a "huge" text file. The file contains the letters that represent a proteome attached to a description.
Basically, I have to separate the letters (the protein) from the description of the proteome itself and insert only the proteome representation into a list, which is returned once processing is done. I've done this, and it works well on relatively small text files.
With a huge file, the program runs for a while and then crashes with the usual message that the Windows program has stopped working. As I understand it, the time complexity and RAM usage are the problem.
What lighter, faster approaches are there for the task described above? In case I wasn't clear enough:
this is what the text contains:
(>g:123124 - this is the description which I want to discard)
the function should return a list like this :
[ 'ASSGDSGDJFGJFTDFGHNDF' , 'SAFSDSGSGDF' ]
Any ideas? :]
December 24th, 2012, 09:00 PM
If you execute the program within the terminal/command prompt you will see the error output
define a large file?
Stop reading the file in one go; you're consuming all the memory on the system. Read in 16MB or so chunks instead.
data = f.read(16 * 1024 * 1024)  # f is an already-open file object
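A minimal sketch of the chunked read suggested above, assuming the file is opened in text mode. The helper name `read_in_chunks` is hypothetical, and the demo uses an in-memory file with a tiny chunk size so it runs anywhere. Note that a record can straddle two chunks, so real parsing code has to carry the leftover tail into the next chunk.

```python
import io

CHUNK_SIZE = 16 * 1024 * 1024  # 16 MB, as suggested above


def read_in_chunks(f, chunk_size=CHUNK_SIZE):
    """Yield successive chunks from an open file object until end of file."""
    while True:
        data = f.read(chunk_size)
        if not data:  # an empty result means end of file
            break
        yield data


# Demonstration on an in-memory file with a deliberately small chunk size.
demo = io.StringIO("abcdefghij")
chunks = list(read_in_chunks(demo, chunk_size=4))
```

With a real file, the same generator would be used as `for data in read_in_chunks(open(path)):`, so only one chunk is held in memory at a time.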
If each line has the same format as your example, where the first part always comes after the first ] and before the next >, and the second part always comes after the second ], then this would work.
However, if the format were to change, or the sequences themselves contained either ] or >, then it would break.
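# The ']' delimiters in this sample are assumed from the format described above.
s = '>g:1212ladassda]ASSGDSGDJFGJFTDFGHNDF>g:12124]SAFSDSGSGDF'
a = s.split(']')            # pieces between the ']' characters
two = a[-1]                 # second sequence: everything after the last ']'
one = a[1].split('>')[0]    # first sequence: after the first ']', before the next '>'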
December 24th, 2012, 09:15 PM
How many lines are there in the input file?
10 million lines.
Why do you want to use python?
Is this description of a record separator?
'>' followed by any number of characters that are not ']' followed by ']'
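If that description of the separator is right, it can be written as a regular expression. This is only a sketch under that assumption: the sample string and the name `split_sequences` are made up for illustration, and the real file may differ.

```python
import re

# '>' followed by any number of characters that are not ']', followed by ']'
SEPARATOR = re.compile(r'>[^\]]*\]')


def split_sequences(text):
    """Return the non-empty pieces of text that lie between separators."""
    return [part.strip() for part in SEPARATOR.split(text) if part.strip()]


# Hypothetical sample in the assumed format.
sample = '>g:1 first description]ASSGDSGDJFGJFTDFGHNDF>g:2 second]SAFSDSGSGDF'
sequences = split_sequences(sample)
```

This still needs the text in memory, so for a huge file it would have to be applied per chunk or per record rather than to the whole file at once.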
If the file is too large to fit in memory, then process it one character at a time, writing the result to another file as you go and storing almost nothing in memory.
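A rough Python sketch of that character-at-a-time idea, assuming the '>...]' description format above. The function name `strip_descriptions` is hypothetical, and the demo uses in-memory files in place of real ones. Only two flags are kept in memory, whatever the input size.

```python
import io


def strip_descriptions(src, dst):
    """Copy src to dst one character at a time, dropping '>...]' descriptions.

    Each sequence is written on its own line; whitespace inside a sequence
    (such as line breaks) is skipped.
    """
    in_description = False
    wrote_sequence = False  # True while the current output line has content
    while True:
        ch = src.read(1)
        if not ch:  # end of input
            break
        if in_description:
            if ch == ']':
                in_description = False
        elif ch == '>':
            in_description = True
            if wrote_sequence:  # finish the previous sequence's line
                dst.write('\n')
                wrote_sequence = False
        elif not ch.isspace():
            dst.write(ch)
            wrote_sequence = True
    if wrote_sequence:
        dst.write('\n')


# Demonstration with in-memory files; the sample data is made up.
src = io.StringIO('>g:1 desc]ASSG\nDSG>g:2 desc]SAF\n')
dst = io.StringIO()
strip_descriptions(src, dst)
result = dst.getvalue()
```

Python's file objects are buffered, so `read(1)` does not hit the disk for every character, but the per-character loop is still slow compared with a generated scanner.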
flex would create the fastest program with sufficiently little programmer time (if I were the programmer).
[code] and [/code] tags are essential for Python code and Makefiles!