December 6th, 2012, 04:11 PM
-
Merging 2 text files under a few very difficult conditions
Hi all,
I have these two files:
A)
Code:
K00001 32
K00001 177
K00001 189
K00001 212
K00001 232
K00001 233
K00001 234
K00001 346
K00002 182
K00002 189
K00002 273
K00003 146
K00003 193
K00003 240
K00004 176
K00005 273
K00006 192
K00007 51
K00007 184
K00008 51
K00009 51
........
B)
Code:
0 BR:ko01002 Metabolism Enzyme Families Peptidases
1 PATH:ko04142 Cellular Processes Transport and Catabolism Lysosome
2 PATH:ko04612 Organismal Systems Immune System Antigen processing and presentation
3 BR:ko03110 Genetic Information Processing Folding, Sorting and Degradation Chaperones and folding catalysts
4 PATH:ko04145 Cellular Processes Transport and Catabolism Phagosome
5 PATH:ko05152 Human Diseases Infectious Diseases Tuberculosis
6 PATH:ko05323 Human Diseases Immune System Diseases Rheumatoid arthritis
7 PATH:ko04141 Genetic Information Processing Folding, Sorting and Degradation Protein processing in endoplasmic reticulum
8 PATH:ko04210 Cellular Processes Cell Growth and Death Apoptosis
9 PATH:ko05010 Human Diseases Neurodegenerative Diseases Alzheimer's disease
10 None Unclassified Metabolism Amino acid metabolism
11 BR:ko03000 Genetic Information Processing Transcription Transcription factors
12 PATH:ko04111 Cellular Processes Cell Growth and Death Cell cycle - yeast
13 PATH:ko00600 Metabolism Lipid Metabolism Sphingolipid metabolism
14 PATH:ko04020 Environmental Information Processing Signal Transduction Calcium signaling pathway
15 PATH:ko04974 Organismal Systems Digestive System Protein digestion and absorption
16 BR:ko04030 Environmental Information Processing Signaling Molecules and Interaction G protein-coupled receptors
17 PATH:ko04080 Environmental Information Processing Signaling Molecules and Interaction Neuroactive ligand-receptor interaction
18 BR:ko01003 Metabolism Glycan Biosynthesis and Metabolism Glycosyltransferases
19 PATH:ko04620 Organismal Systems Immune System Toll-like receptor signaling pathway
20 PATH:ko05133 Human Diseases Infectious Diseases Pertussis
21 PATH:ko05164 Human Diseases Infectious Diseases Influenza A
22 PATH:ko05160 Human Diseases Infectious Diseases Hepatitis C
23 PATH:ko05142 Human Diseases Infectious Diseases Chagas disease (American trypanosomiasis)
24 BR:ko00535 Metabolism Glycan Biosynthesis and Metabolism Proteoglycans
25 BR:ko03009 Genetic Information Processing Translation Ribosome Biogenesis
26 BR:ko02000 Environmental Information Processing Membrane Transport Transporters
27 PATH:ko02010 Environmental Information Processing Membrane Transport ABC transporters
28 PATH:ko00591 Metabolism Lipid Metabolism Linoleic acid metabolism
29 BR:ko01004 Metabolism Lipid Metabolism Lipid biosynthesis proteins
30 PATH:ko00590 Metabolism Lipid Metabolism Arachidonic acid metabolism
31 PATH:ko00380 Metabolism Amino Acid Metabolism Tryptophan metabolism
The files are linked by the numbers: every KOOOX in file A) is associated with a function in file B). The association is given by a shared number: the number in the row of a KOOOX in A) is the same as the number at the start of the function's row in B). I need to produce a new file like this:
Code:
KOOOX BR:...the rest of the line
KOOOx1 PATH:...the rest of the line
....
....
So the file should contain all the KOOOX IDs and their associated functions (BR: or PATH:, depending on the related number). What makes it difficult is that file B) has only 300 defined, unique lines (that is, distinct functions), while file A) has thousands of KOOOX IDs, many of them appearing in multiple entries. The new file must contain each ID only once, and if several numbers (and therefore several functions) are associated with a KOOOX ID, I should get something like this:
Code:
K00001 function (related to the 1st number) function (related to the 2nd number) function (related to the 3rd number) ...
Every KOOOX should be on its own line, followed by its associated functions, all TAB-separated.
I know it's very tricky, but I hope some of the experts here have an idea of how to do it. With a dictionary, for example.
Thanks in advance for any input
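A dictionary does fit this problem. As a minimal sketch in Python, assuming every line of file B) starts with its number, followed by whitespace and the function text, file B) can be loaded into a lookup table keyed by that number. The name `load_functions` and the choice to skip lines without a leading number are illustrative assumptions, not part of the original files.

```python
def load_functions(lines):
    """Map the leading number of each file-B line to the rest of the line."""
    functions = {}
    for line in lines:
        parts = line.split(None, 1)  # split once, on any whitespace
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # assumption: skip blank lines and lines without a number
        functions[int(parts[0])] = parts[1].strip()
    return functions


# Three lines copied from the file B) excerpt above.
sample_b = [
    "0 BR:ko01002 Metabolism Enzyme Families Peptidases",
    "1 PATH:ko04142 Cellular Processes Transport and Catabolism Lysosome",
    "10 None Unclassified Metabolism Amino acid metabolism",
]
functions = load_functions(sample_b)
print(functions[1])
```

Looking up a number from file A) is then a single dictionary access, which stays fast even with thousands of rows in A).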
December 6th, 2012, 05:34 PM
-
Which numbers are related to each other?
Is it the number in the second column of A) that should be matched to the number in the first column of B)?
Can you post a complete example of a row for one of the K000X?
December 6th, 2012, 07:03 PM
-
Originally Posted by MrFujin
Which numbers are related to each other?
Is it the number in the second column of A) that should be matched to the number in the first column of B)?
Can you post a complete example of a row for one of the K000X?
Thanks for your reply.
Exactly: the number in the second column of A) should be matched to the number in the first column of B). By "linked" I mean that each KOOOX needs to be reported in the new file together with the function that has the same number. For example, take KOOOO1 (you cannot see its functions because the excerpt of B) stops at 31, but suppose the file were complete, with all 300 numbered functions, one per line). The result should be something like this:
Code:
K00001 function32 function177 fun..189 fun..212 fun..232 func..233 func...234 func..346
...
...
K00005 function273
The example shows a multiple entry (K00001) and a single entry (K00005). A multiple entry should be reported only once, but since it has several numbers associated with it, it also has several functions, which should be reported on the same line, tab-separated. A single entry (like K00005) is reported with its one function, since it has only one number. In the end, the new file will list all the KOOOOX IDs in order, one per line, each with its function or functions on the same line, tab-separated.
Does that make sense?
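The layout described above (one line per ID, functions joined by tabs) can be sketched in Python as follows. This is only an illustration: `group_by_id`, `format_rows`, and the `function32`-style placeholder texts are hypothetical stand-ins, since the real function texts for numbers above 31 are not shown in the thread. It relies on dictionaries keeping insertion order (Python 3.7 and later).

```python
def group_by_id(lines):
    """Collect the numbers of file A per ID, keeping first-seen order."""
    groups = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2 or not fields[1].isdigit():
            continue  # skip blank or malformed lines such as "........"
        key, number = fields
        groups.setdefault(key, []).append(int(number))
    return groups


def format_rows(groups, functions):
    """Build one tab-separated row per ID: the ID, then each function text."""
    rows = []
    for key, numbers in groups.items():
        # Assumption: numbers with no entry in the lookup are left out.
        texts = [functions[n] for n in numbers if n in functions]
        rows.append("\t".join([key] + texts))
    return rows


sample_a = ["K00001 32", "K00001 177", "K00005 273"]
# Placeholder texts; the real ones would come from file B).
placeholders = {32: "function32", 177: "function177", 273: "function273"}
for row in format_rows(group_by_id(sample_a), placeholders):
    print(row)
```

Because the grouping uses a dictionary, repeated IDs are merged even if their lines are not adjacent in file A).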
December 6th, 2012, 09:33 PM
-
It troubles me that thalakos does not seem to know the difference between capital OH and the digit zero.
0O0O0O0O0O0O
Maybe I haven't read the post with sufficient care.
[code]Code tags[/code] are essential for python code and Makefiles!
December 6th, 2012, 10:23 PM
-
I'm sorry, I was careless about that; I thought it was not so important in this context. Anyway, I was precise in my first post for file A) and file B): the K IDs are all written with 000... (the digit zero), and in file B) entries such as BR:ko0... or PATH:ko04142 are "ko" followed by 0 (the digit).
I hope this clarification makes it much easier for someone to answer this seemingly insoluble problem.
Thanks
December 13th, 2012, 12:37 AM
-
[SOLVED]
I know this is not the right place for it, but the problem is finally solved, in Perl:
Code:
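use strict;
use warnings;

# Read file B: each line is "<number> <function text>".
open my $B, '<', 'path to file_B.txt' or die "Cannot open file_B: $!";
my @functions;   # function texts, in file order
my %toc;         # number from file B -> index into @functions
my $i = 0;
while (my $line = <$B>) {
    chomp $line;
    # Strip the leading number; skip lines that do not start with one.
    next unless $line =~ s/^(\d+)\s+//;
    $functions[$i] = $line;
    $toc{$1} = $i++;
}
close $B;

# Read file A ("<ID> <number>", lines grouped by ID) and write one line per ID.
open my $A, '<', 'path to file_A.txt' or die "Cannot open file_A: $!";
open my $C, '>', 'file_C.txt' or die "Cannot open file_C: $!";
my $prev_id;
while (my $line = <$A>) {
    my ($id, $code) = split ' ', $line;
    # Skip blank or malformed lines and numbers that are missing from file B.
    next unless defined $code && exists $toc{$code};
    if (!defined $prev_id || $id ne $prev_id) {
        print {$C} "\n" if defined $prev_id;   # finish the previous ID's line
        print {$C} $id;
        $prev_id = $id;
    }
    # Each function is preceded by a tab, so every field is tab-separated.
    print {$C} "\t", $functions[$toc{$code}];
}
print {$C} "\n" if defined $prev_id;
close $C;
close $A;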
It works great, so if someone has the same task to do, it can be used as is, or someone could "convert" it to Python.
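For anyone who wants the Python version, here is a rough port, offered as a sketch rather than a drop-in replacement. Like the Perl script, it assumes that lines for the same ID are adjacent in file A). It separates every field, including the ID, with a tab, as the original request asked, and it skips numbers that have no entry in file B). The function name `merge_files` and the file names in the usage comment are placeholders.

```python
from itertools import groupby


def merge_files(path_a, path_b, path_out):
    """Write one line per ID from file A with its functions from file B."""
    # Number at the start of each file-B line -> rest of that line.
    functions = {}
    with open(path_b) as file_b:
        for line in file_b:
            parts = line.split(None, 1)
            if len(parts) == 2 and parts[0].isdigit():
                functions[parts[0]] = parts[1].strip()

    with open(path_a) as file_a, open(path_out, "w") as out:
        # Keep only well-formed "ID number" pairs.
        pairs = (fields for fields in map(str.split, file_a) if len(fields) == 2)
        # groupby merges consecutive lines that share the same ID.
        for key, group in groupby(pairs, key=lambda fields: fields[0]):
            texts = [functions[num] for _, num in group if num in functions]
            out.write("\t".join([key] + texts) + "\n")


# Usage (placeholder file names):
# merge_files("file_A.txt", "file_B.txt", "file_C.txt")
```

The numbers are compared as strings here, as in the Perl hash, so a number written with leading zeros in one file would not match the plain form in the other.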