Python Programming
 
#1
December 6th, 2012, 04:11 PM
thalakos
Merging 2 text files under a few very difficult conditions

Hi all,

I have these two files:

A)

Code:
K00001	32
K00001	177
K00001	189
K00001	212
K00001	232
K00001	233
K00001	234
K00001	346
K00002	182
K00002	189
K00002	273
K00003	146
K00003	193
K00003	240
K00004	176
K00005	273
K00006	192
K00007	51
K00007	184
K00008	51
K00009	51
........


B)

Code:
0	BR:ko01002	Metabolism	Enzyme Families	Peptidases
1	PATH:ko04142	Cellular Processes	Transport and Catabolism	Lysosome
2	PATH:ko04612	Organismal Systems	Immune System	Antigen processing and presentation
3	BR:ko03110	Genetic Information Processing	Folding, Sorting and Degradation	Chaperones and folding catalysts
4	PATH:ko04145	Cellular Processes	Transport and Catabolism	Phagosome
5	PATH:ko05152	Human Diseases	Infectious Diseases	Tuberculosis
6	PATH:ko05323	Human Diseases	Immune System Diseases	Rheumatoid arthritis
7	PATH:ko04141	Genetic Information Processing	Folding, Sorting and Degradation	Protein processing in endoplasmic reticulum
8	PATH:ko04210	Cellular Processes	Cell Growth and Death	Apoptosis
9	PATH:ko05010	Human Diseases	Neurodegenerative Diseases	Alzheimer's disease
10	None	Unclassified	Metabolism	Amino acid metabolism
11	BR:ko03000	Genetic Information Processing	Transcription	Transcription factors
12	PATH:ko04111	Cellular Processes	Cell Growth and Death	Cell cycle - yeast
13	PATH:ko00600	Metabolism	Lipid Metabolism	Sphingolipid metabolism
14	PATH:ko04020	Environmental Information Processing	Signal Transduction	Calcium signaling pathway
15	PATH:ko04974	Organismal Systems	Digestive System	Protein digestion and absorption
16	BR:ko04030	Environmental Information Processing	Signaling Molecules and Interaction	G protein-coupled receptors
17	PATH:ko04080	Environmental Information Processing	Signaling Molecules and Interaction	Neuroactive ligand-receptor interaction
18	BR:ko01003	Metabolism	Glycan Biosynthesis and Metabolism	Glycosyltransferases
19	PATH:ko04620	Organismal Systems	Immune System	Toll-like receptor signaling pathway
20	PATH:ko05133	Human Diseases	Infectious Diseases	Pertussis
21	PATH:ko05164	Human Diseases	Infectious Diseases	Influenza A
22	PATH:ko05160	Human Diseases	Infectious Diseases	Hepatitis C
23	PATH:ko05142	Human Diseases	Infectious Diseases	Chagas disease (American trypanosomiasis)
24	BR:ko00535	Metabolism	Glycan Biosynthesis and Metabolism	Proteoglycans
25	BR:ko03009	Genetic Information Processing	Translation	Ribosome Biogenesis
26	BR:ko02000	Environmental Information Processing	Membrane Transport	Transporters
27	PATH:ko02010	Environmental Information Processing	Membrane Transport	ABC transporters
28	PATH:ko00591	Metabolism	Lipid Metabolism	Linoleic acid metabolism
29	BR:ko01004	Metabolism	Lipid Metabolism	Lipid biosynthesis proteins
30	PATH:ko00590	Metabolism	Lipid Metabolism	Arachidonic acid metabolism
31	PATH:ko00380	Metabolism	Amino Acid Metabolism	Tryptophan metabolism


The files are linked by the numbers: every KOOOX in file A) is associated with a function in file B). The association is given by the same number appearing in the row of the KOOOX and in the row of the function. I need to produce a new file like this:

Code:
KOOOX       BR:...the rest of the line
KOOOx1     PATH:...the rest of the line
....
....


So the new file lists all the KOOOX IDs with their associated function (BR: or PATH:, depending on the related number). What makes it difficult is that file B) has only 300 defined, unique lines (that is, 300 different functions), while file A) has thousands of KOOOX IDs, many of them appearing in multiple entries. The new file must contain each ID only once, and if an ID is associated with several numbers, and therefore several functions, its line should look like this:
Code:
K00001	function (related to the 1st number)	function (related to the 2nd number)	function (related to the 3rd number) ...


Every KOOOX should be on its own line with its associated functions, TAB separated.

I know it's very tricky, but I hope some of the experts here have an idea of how to do it. With a dictionary, for example.

Thanks in advance for any input.
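
The dictionary idea can be sketched roughly as follows. This is only a minimal sketch, assuming file B) is tab separated with the number in the first column; `load_functions` and `sample_b` are names made up for this illustration.

```python
def load_functions(lines):
    """Map each leading number in file B to the rest of its line."""
    functions = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split only at the first tab so the tab-separated description stays intact.
        number, _, rest = line.partition("\t")
        functions[number.strip()] = rest
    return functions


# Two lines taken from the file B excerpt above.
sample_b = [
    "0\tBR:ko01002\tMetabolism\tEnzyme Families\tPeptidases\n",
    "1\tPATH:ko04142\tCellular Processes\tTransport and Catabolism\tLysosome\n",
]
functions = load_functions(sample_b)
print(functions["1"])
```

The numbers are kept as strings, so the second column of file A) can be used as a lookup key without any conversion.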

#2
December 6th, 2012, 05:34 PM
MrFujin
Which numbers are related to each other?
Is it the number in second column of A) that should be matched to the numbers in the first column of B)?

Can you post a complete example of a row for one of the K000X?

#3
December 6th, 2012, 07:03 PM
thalakos


Thanks for your reply.

Exactly: the number in the second column of A) should be matched to the number in the first column of B). By "linked" I mean that each KOOOX needs to be reported in the new file together with the function that has the same number. For example, for KOOOO1 (you cannot see its functions because the excerpt of file B) stops at 31, but suppose the file were complete, with all 300 number-function pairs, one per line) the result should be something like this:
Code:
K00001     function32  function177  fun..189  fun..212 fun..232  func..233 func...234 func..346
...
...
K00005   function273


The example shows a multiple entry (K00001) and a single entry (K00005). A multiple entry should be reported only once, but since it has several numbers associated with it, it also has several functions, and these should be reported on the same line, tab separated. A single entry (like K00005) is reported with its one function, since it has only one number. In the end, the new file will list every KOOOOX in order, one line per ID, with its function or functions on that line, tab separated.

Does that make sense?
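
The grouping described here could look something like the sketch below. It assumes the (ID, number) pairs from file A) and the number-to-function lookup from file B) have already been read; `group_by_id` and `format_rows` are hypothetical helper names, and the function names are the same stand-ins used in the example above. It relies on dictionaries keeping insertion order (Python 3.7 or later).

```python
def group_by_id(pairs):
    """Collect the numbers belonging to each ID, in first-seen order."""
    grouped = {}
    for ko_id, number in pairs:
        grouped.setdefault(ko_id, []).append(number)
    return grouped


def format_rows(grouped, functions):
    """Build one tab-separated output line per ID."""
    rows = []
    for ko_id, numbers in grouped.items():
        # Numbers without an entry in the lookup table are skipped in this sketch.
        names = [functions[n] for n in numbers if n in functions]
        rows.append("\t".join([ko_id] + names))
    return rows


# Stand-in function names; the real values would come from file B.
functions = {"32": "function32", "177": "function177", "273": "function273"}
pairs = [("K00001", "32"), ("K00001", "177"), ("K00005", "273")]
rows = format_rows(group_by_id(pairs), functions)
print("\n".join(rows))
```

Because the numbers are collected per ID in a dictionary first, this version does not need the lines of file A) to be sorted or grouped by ID.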

#4
December 6th, 2012, 09:33 PM
b49P23TIvg
It troubles me that thalakos does not seem to know the difference between capital OH and the digit zero.
0O0O0O0O0O0O
Code:
0O0O0O0O0O
Maybe I haven't read the post with sufficient care.
__________________
[code]Code tags[/code] are essential for python code!

#5
December 6th, 2012, 10:23 PM
thalakos
I'm sorry, I was careless about that; I thought that in this context it was not so important. Anyway, the samples of file A) and file B) in my first post are precise: the K IDs contain only zeros (digits), as in K00001, and in file B) entries such as BR:ko0.. or PATH:ko04142 are "ko" followed by the digit 0.

I hope this clarification makes it much easier for someone to answer this seemingly insoluble problem.

Thanks

#6
December 13th, 2012, 12:37 AM
thalakos
[SOLVED]

I know this is not the right place for it, but the problem is finally solved, in Perl:
Code:
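use strict;
use warnings;

# Read file B: map each leading number to the rest of its line (the function).
open my $B, '<', 'path to file_B.txt' or die "Cannot open file_B: $!";
my %function_for;
while (my $line = <$B>) {
    chomp $line;
    # Anchor at the line start so digits inside the description are never matched.
    next unless $line =~ s/^(\d+)\s+//;
    $function_for{$1} = $line;
}
close $B;

# Read file A (entries for the same ID must be consecutive) and write one line per ID.
open my $A, '<', 'path to file_A.txt' or die "Cannot open file_A: $!";
open my $C, '>', 'file_C.txt' or die "Cannot open file_C: $!";
my $prev_id;
while (my $line = <$A>) {
    my ($id, $code) = split ' ', $line;
    next unless defined $code;    # skip blank or malformed lines
    if (!defined $prev_id or $id ne $prev_id) {
        print {$C} "\n" if defined $prev_id;
        print {$C} $id;
        $prev_id = $id;
    }
    # Each function is preceded by a tab, so the ID and all functions are tab separated.
    # Numbers missing from file B are written as unknown(N) instead of raising a warning.
    print {$C} "\t", $function_for{$code} // "unknown($code)";
}
print {$C} "\n" if defined $prev_id;
close $C;
close $A;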


It works great, so anyone with the same task can use it, or someone could "convert" it to Python.
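
One possible Python conversion is sketched below; it is not a line-by-line translation. It assumes file B) is tab separated with the number in the first column, and that all entries for one ID are on consecutive lines in file A), which the Perl script also relies on. The function name `merge` and the file names in the usage note are placeholders, and writing `unknown(N)` for a number that is missing from file B) is a choice made for this sketch.

```python
import itertools


def merge(path_a, path_b, path_out):
    """Write one line per ID from file A, followed by its functions from file B."""
    # File B: leading number -> rest of the line.
    functions = {}
    with open(path_b) as file_b:
        for line in file_b:
            number, _, rest = line.rstrip("\n").partition("\t")
            if number.strip():
                functions[number.strip()] = rest

    with open(path_a) as file_a, open(path_out, "w") as out:
        # Keep only lines with at least two whitespace-separated fields.
        split_lines = (line.split() for line in file_a)
        records = (fields for fields in split_lines if len(fields) >= 2)
        # groupby merges consecutive lines with the same ID,
        # which is the job of $prev_id in the Perl script.
        for ko_id, group in itertools.groupby(records, key=lambda fields: fields[0]):
            names = [functions.get(fields[1], "unknown(%s)" % fields[1]) for fields in group]
            out.write("\t".join([ko_id] + names) + "\n")
```

It would be called as `merge("file_A.txt", "file_B.txt", "file_C.txt")`, with the real paths substituted.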
