Python Programming
 
  #1  
December 24th, 2012, 04:34 PM
Nightmareix35
Huge text file reader. Need some help!

I have a problem with a program that has to process a "huge" text file. The file contains the letters that represent a proteome, with each stretch of letters attached to a description.

Basically I have to separate the letters (the protein) from the description, and insert only the protein representation into a list, which is returned at the end. I've done this, and it works pretty well on relatively small text files.

With a huge file, the program runs for a while and then crashes with the usual Windows message that the program has stopped working. As far as I understand, my approach uses too much time and RAM.

What lighter and faster methods are there for the task described above? In case I wasn't clear enough:


This is what the text contains:

>g:1212ladassda[1212]ASSGDSGDJFGJFTDFGHNDF>g:12124[121]SAFSDSGSGDF

(>g:123124[121] is an example of the description, which I want to discard.)
The function should return a list like this:

[ 'ASSGDSGDJFGJFTDFGHNDF' , 'SAFSDSGSGDF' ]

Any ideas? :]

  #2  
December 24th, 2012, 09:00 PM
metulburr
If you run the program from a terminal/command prompt, you will see the error output.

How large is the file?
Stop reading the file in one go; you're consuming all the memory on the system. Read it in chunks of 16 MB or so instead.

Code:
with open('proteome.txt') as f:  # file name is a placeholder
    data = f.read(16 * 1024 * 1024)  # read at most 16 MB worth of characters
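A single read() only gets you the first chunk, and a description or sequence can be cut in half at a chunk boundary. Here is a rough sketch of how the chunked reading could be wrapped in a loop that remembers its state between chunks. The name iter_sequences is made up for this post, and it assumes the format from the first post (description runs from '>' to the next ']', sequence runs until the next '>'). It does not strip newlines, so adjust if your sequences span lines.

```python
import io

def iter_sequences(stream, chunk_size=16 * 1024 * 1024):
    """Yield each sequence in turn, reading `stream` in fixed-size chunks."""
    in_header = False   # True while between '>' and the closing ']'
    parts = []          # pieces of the sequence currently being collected
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pos = 0
        while pos < len(chunk):
            if in_header:
                end = chunk.find(']', pos)
                if end == -1:
                    break                 # description continues in next chunk
                in_header = False
                pos = end + 1
            else:
                start = chunk.find('>', pos)
                if start == -1:
                    parts.append(chunk[pos:])
                    break                 # sequence continues in next chunk
                parts.append(chunk[pos:start])
                if any(parts):            # skip empty sequences
                    yield ''.join(parts)
                parts = []
                in_header = True
                pos = start + 1
    if any(parts):                        # last sequence has no trailing '>'
        yield ''.join(parts)

# Tiny chunk size here only to show that boundaries are handled.
sample = '>g:1212ladassda[1212]ASSGDSGDJFGJFTDFGHNDF>g:12124[121]SAFSDSGSGDF'
print(list(iter_sequences(io.StringIO(sample), chunk_size=5)))
# ['ASSGDSGDJFGJFTDFGHNDF', 'SAFSDSGSGDF']
```

With a real file you would pass the object returned by open() instead of the StringIO. Since it is a generator, you can also process each sequence as it arrives rather than building the whole list, which is what keeps memory use low.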




If every line has the same format as your example, where the first sequence always sits between the first ] and the next >, and the second sequence always follows the second ], then this would work.

However, if the format changed, or if the sequences themselves contained either ] or >, then it would break.

Code:
s = '>g:1212ladassda[1212]ASSGDSGDJFGJFTDFGHNDF>g:12124[121]SAFSDSGSGDF'
a = s.split(']')
two = a[-1]
one = a[1].split('>')[0]
print([one,two])
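The snippet above is hard-wired to exactly two records. A small variation on the same split idea handles any number of them, still assuming the format never changes. The function name sequences_by_split is just something I picked for illustration.

```python
def sequences_by_split(text):
    """Return every sequence in `text` using only str.split."""
    # Each piece after a ']' starts with a sequence, which ends at the
    # next '>' (or at the end of the text for the last record).
    return [piece.split('>')[0] for piece in text.split(']')[1:]]

s = '>g:1212ladassda[1212]ASSGDSGDJFGJFTDFGHNDF>g:12124[121]SAFSDSGSGDF'
print(sequences_by_split(s))
# ['ASSGDSGDJFGJFTDFGHNDF', 'SAFSDSGSGDF']
```

This still needs the whole text in memory at once, so it only helps with correctness, not with the crash on huge files.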

  #3  
December 24th, 2012, 09:15 PM
b49P23TIvg
How many lines are there in the input file?

One line?
Ten million lines?

Why do you want to use Python?

Is this a correct description of a record separator?
'>' followed by any number of characters that are not ']', followed by ']'
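If that description is right, the separator translates directly into a regular expression, and re.split does the rest. This is only a sketch under that assumption; SEPARATOR and sequences_by_separator are names invented for the example.

```python
import re

# '>' followed by any number of characters that are not ']', followed by ']'
SEPARATOR = re.compile(r'>[^\]]*\]')

def sequences_by_separator(text):
    """Split `text` on the record separator and drop empty pieces."""
    return [piece for piece in SEPARATOR.split(text) if piece]

s = '>g:1212ladassda[1212]ASSGDSGDJFGJFTDFGHNDF>g:12124[121]SAFSDSGSGDF'
print(sequences_by_separator(s))
# ['ASSGDSGDJFGJFTDFGHNDF', 'SAFSDSGSGDF']
```

The empty-piece filter is there because the text starts with a separator, so the first piece from the split is an empty string.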


If the file is too large to fit in memory, then process it one character at a time, writing the result to another file as you go and storing almost nothing in memory.
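In Python, the character-at-a-time idea could look roughly like this. It is a sketch, not tuned for speed: strip_descriptions is a made-up name, the input format is assumed to be the one from the first post, and writing one sequence per output line is my own choice.

```python
import io

def strip_descriptions(src, dst):
    """Copy only the sequences from src to dst, one sequence per line."""
    in_header = False   # True while between '>' and the closing ']'
    wrote_any = False   # True once the current sequence has a character
    while True:
        ch = src.read(1)          # one character at a time
        if not ch:
            break                 # end of input
        if in_header:
            if ch == ']':
                in_header = False
        elif ch == '>':
            if wrote_any:         # finish the previous sequence's line
                dst.write('\n')
                wrote_any = False
            in_header = True
        else:
            dst.write(ch)
            wrote_any = True
    if wrote_any:
        dst.write('\n')

# In-memory streams stand in for real files here.
src = io.StringIO('>g:1212ladassda[1212]ASSGDSGDJFGJFTDFGHNDF>g:12124[121]SAFSDSGSGDF')
dst = io.StringIO()
strip_descriptions(src, dst)
print(dst.getvalue(), end='')
```

Only two flags are kept in memory, whatever the file size. With real files you would pass two objects from open(), one for reading and one opened with mode 'w'.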

flex (the lexer generator) would produce the fastest program for the least programmer time (if I were the programmer).
