#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    11
    Rep Power
    0

    Merging 2 text files under few very difficult conditions


    Hi all,

    I have these two files:

    A)

    Code:
    K00001	32
    K00001	177
    K00001	189
    K00001	212
    K00001	232
    K00001	233
    K00001	234
    K00001	346
    K00002	182
    K00002	189
    K00002	273
    K00003	146
    K00003	193
    K00003	240
    K00004	176
    K00005	273
    K00006	192
    K00007	51
    K00007	184
    K00008	51
    K00009	51
    ........
    B)

    Code:
    0	BR:ko01002	Metabolism	Enzyme Families	Peptidases
    1	PATH:ko04142	Cellular Processes	Transport and Catabolism	Lysosome
    2	PATH:ko04612	Organismal Systems	Immune System	Antigen processing and presentation
    3	BR:ko03110	Genetic Information Processing	Folding, Sorting and Degradation	Chaperones and folding catalysts
    4	PATH:ko04145	Cellular Processes	Transport and Catabolism	Phagosome
    5	PATH:ko05152	Human Diseases	Infectious Diseases	Tuberculosis
    6	PATH:ko05323	Human Diseases	Immune System Diseases	Rheumatoid arthritis
    7	PATH:ko04141	Genetic Information Processing	Folding, Sorting and Degradation	Protein processing in endoplasmic reticulum
    8	PATH:ko04210	Cellular Processes	Cell Growth and Death	Apoptosis
    9	PATH:ko05010	Human Diseases	Neurodegenerative Diseases	Alzheimer's disease
    10	None	Unclassified	Metabolism	Amino acid metabolism
    11	BR:ko03000	Genetic Information Processing	Transcription	Transcription factors
    12	PATH:ko04111	Cellular Processes	Cell Growth and Death	Cell cycle - yeast
    13	PATH:ko00600	Metabolism	Lipid Metabolism	Sphingolipid metabolism
    14	PATH:ko04020	Environmental Information Processing	Signal Transduction	Calcium signaling pathway
    15	PATH:ko04974	Organismal Systems	Digestive System	Protein digestion and absorption
    16	BR:ko04030	Environmental Information Processing	Signaling Molecules and Interaction	G protein-coupled receptors
    17	PATH:ko04080	Environmental Information Processing	Signaling Molecules and Interaction	Neuroactive ligand-receptor interaction
    18	BR:ko01003	Metabolism	Glycan Biosynthesis and Metabolism	Glycosyltransferases
    19	PATH:ko04620	Organismal Systems	Immune System	Toll-like receptor signaling pathway
    20	PATH:ko05133	Human Diseases	Infectious Diseases	Pertussis
    21	PATH:ko05164	Human Diseases	Infectious Diseases	Influenza A
    22	PATH:ko05160	Human Diseases	Infectious Diseases	Hepatitis C
    23	PATH:ko05142	Human Diseases	Infectious Diseases	Chagas disease (American trypanosomiasis)
    24	BR:ko00535	Metabolism	Glycan Biosynthesis and Metabolism	Proteoglycans
    25	BR:ko03009	Genetic Information Processing	Translation	Ribosome Biogenesis
    26	BR:ko02000	Environmental Information Processing	Membrane Transport	Transporters
    27	PATH:ko02010	Environmental Information Processing	Membrane Transport	ABC transporters
    28	PATH:ko00591	Metabolism	Lipid Metabolism	Linoleic acid metabolism
    29	BR:ko01004	Metabolism	Lipid Metabolism	Lipid biosynthesis proteins
    30	PATH:ko00590	Metabolism	Lipid Metabolism	Arachidonic acid metabolism
    31	PATH:ko00380	Metabolism	Amino Acid Metabolism	Tryptophan metabolism
    The files are linked by the numbers: every KOOOX in file A) is associated with a function in file B). The association is given by the same number appearing in the row of the KOOOX and in the row of the function. I need to produce a new file like this:

    Code:
    KOOOX       BR:...the rest of the line
    KOOOx1     PATH:...the rest of the line
    ....
    ....
    So the new file lists all the KOOOX IDs and their associated functions (BR: or PATH:, depending on the related number). What makes it difficult is that file B) has only 300 defined, unique lines (so 300 different functions), but file A) has thousands of KOOOX IDs, many with multiple entries. The new file must have only a single entry per ID, and if more than one number is associated with a KOOOX ID, the line should look something like this:
    Code:
    K00001	function (related to the 1st number)  function (related to the 2nd number)  function (related to the 3rd number) ...
    Every KOOOX should be on its own line with its associated functions, TAB-separated.

    I know it's very tricky, but I hope some of the experts here have an idea of how to do it. With a dictionary, for example.

    Thanks in advance for any input
  2. #2
  3. Lord of the Dance
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Oct 2003
    Posts
    3,614
    Rep Power
    1945
    Which number is related to each other?
    Is it the number in second column of A) that should be matched to the numbers in the first column of B)?

    Can you post a complete example of a row for one of the K000X?
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    11
    Rep Power
    0
    Originally Posted by MrFujin
    Which number is related to each other?
    Is it the number in second column of A) that should be matched to the numbers in the first column of B)?

    Can you post a complete example of a row for one of the K000X?
    Thanks for your reply;

    Exactly: the numbers in the second column of A) should be matched to those in the first column of B). By "linked" I mean that each KOOOX needs to be reported in the new file together with its function (the two share the same number). For example, for KOOOO1 (you cannot see the functions because the sample of file B) stops at 31, but let's suppose the file were complete, with all 300 number-function pairs, one per line) the result should be something like this:
    Code:
    K00001     function32  function177  fun..189  fun..212 fun..232  func..233 func...234 func..346
    ...
    ...
    K00005   function273
    The example shows a multiple entry (K00001) and a single entry (K00005). A multiple entry should be reported only once, but because it has several numbers associated with it, it also has several functions, which should be reported on the same line, tab-separated. A single entry (like K00005) is reported with its one function (since it has only one number). In the end, the new file will have all the KOOOOX IDs in order, with each multiple entry collapsed onto one line together with its function or functions, tab-separated.

    Does that make sense?
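
    The dictionary idea raised in the first post could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name `merge` is made up, both files are taken to be tab-separated as in the samples, repeated numbers for one ID are reported once, and numbers that do not appear in file B) are silently skipped.

```python
from collections import defaultdict


def merge(a_lines, b_lines):
    """Return one tab-separated output line per ID found in file A."""
    # number -> rest of the line in file B (everything after the first tab)
    functions = {}
    for line in b_lines:
        number, sep, rest = line.rstrip("\n").partition("\t")
        if sep:  # ignore blank lines and lines without a tab
            functions[number.strip()] = rest

    # ID -> list of its numbers, kept in order of first appearance
    numbers_by_id = defaultdict(list)
    for line in a_lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        kid, number = fields
        if number not in numbers_by_id[kid]:
            numbers_by_id[kid].append(number)

    merged = []
    for kid, numbers in numbers_by_id.items():
        # Assumption: a number with no entry in file B is skipped.
        found = [functions[n] for n in numbers if n in functions]
        merged.append("\t".join([kid] + found))
    return merged
```

    Because the rows are collected per ID in a dictionary, this variant does not need the rows for one ID to be next to each other in file A).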
  6. #4
  7. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,843
    Rep Power
    480
    It troubles me that thalakos does not seem to know the difference between capital OH and the digit zero.
    0O0O0O0O0O0O
    Code:
    0O0O0O0O0O
    Maybe I haven't read the post with sufficient care.
    [code]Code tags[/code] are essential for python code and Makefiles!
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    11
    Rep Power
    0
    I'm sorry, I was careless about that, but I thought that in this context it was not so important. Anyway, I was precise in the file A) and file B) samples in my first post. To be clear, the K IDs are all 000... (digits), and in file B) the entries such as BR:ko0.. or PATH:ko04142 are "ko" followed by 0 (the digit).

    I hope that with this clarification it will be much easier for someone to answer this seemingly insoluble problem.

    Thanks
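
    For anyone worried about a letter O slipping into the real data, a quick check along these lines might help. It is only a sketch: the patterns assume that every ID looks like the samples in this thread (a K followed by five digits, and BR: or PATH: followed by "ko" and five digits), and the function names are made up for illustration.

```python
import re

# Assumed formats, inferred only from the samples posted in this thread.
K_ID = re.compile(r"K[0-9]{5}")
KO_REF = re.compile(r"(?:BR|PATH):ko[0-9]{5}")


def is_valid_k_id(text):
    """True if text is a K followed by exactly five digits."""
    return K_ID.fullmatch(text) is not None


def is_valid_ko_ref(text):
    """True if text is BR: or PATH: followed by ko and five digits."""
    return KO_REF.fullmatch(text) is not None
```

    Note that file B) also contains a row whose second column is `None`, so a failed check there is not necessarily a typo.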
  10. #6
  11. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    11
    Rep Power
    0
    [SOLVED]

    I know this is not the right place for it, but the problem is finally solved, in Perl:
    Code:
    use strict;
    use warnings;

    # Read file B: map each leading number to the rest of its line.
    open my $B, '<', 'path to file_B.txt' or die "Cannot open file_B: $!";
    my @functions;
    my %toc;
    my $i = 0;
    while (my $line = <$B>) {
        chomp $line;
        $line =~ s/^(\d+)\s// or next;    # strip the leading number; skip lines without one
        $functions[$i] = $line;
        $toc{$1} = $i++;
    }
    close $B;

    # Read file A (rows for the same ID must be adjacent) and write one line per ID.
    open my $A, '<', 'path to file_A.txt' or die "Cannot open file_A: $!";
    open my $C, '>', 'file_C.txt' or die "Cannot open file_C: $!";
    my $prev_id = 'foo';
    while (my $line = <$A>) {
        my ($id, $code) = split /\s+/, $line;
        if ($id ne $prev_id) {
            print {$C} "\n" if $prev_id ne 'foo';    # end the previous ID's line
            print {$C} "$id\t";                      # tab, since the output must be tab-separated
            $prev_id = $id;
        }
        else {
            print {$C} "\t";
        }
        print {$C} $functions[$toc{$code}];
    }
    print {$C} "\n";
    close $C;
    close $A;
    It works great, so anyone who has the same task can use it, or someone could "convert" it to Python.
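
    One possible Python conversion is sketched below. It is not a line-for-line port: the function names and the UTF-8 encoding are assumptions, the file paths are passed in rather than hard-coded, and a number that is missing from file B) is skipped instead of triggering a warning. Like the Perl script, it streams file A) and relies on the rows for one ID being adjacent.

```python
from itertools import groupby


def load_functions(path_b):
    """Map each leading number in file B to the rest of its line."""
    functions = {}
    with open(path_b, encoding="utf-8") as handle:
        for line in handle:
            number, sep, rest = line.rstrip("\n").partition("\t")
            if sep:  # ignore lines without a tab
                functions[number] = rest
    return functions


def merge_files(path_a, path_b, path_out):
    """Write one tab-separated line per run of identical IDs in file A."""
    functions = load_functions(path_b)
    with open(path_a, encoding="utf-8") as src, \
            open(path_out, "w", encoding="utf-8") as out:
        rows = (line.split() for line in src)
        rows = (row for row in rows if len(row) == 2)  # drop malformed lines
        # groupby only merges neighbouring rows, so file A must be grouped by ID.
        for kid, group in groupby(rows, key=lambda row: row[0]):
            found = [functions[num] for _, num in group if num in functions]
            out.write("\t".join([kid] + found) + "\n")
```

    It could be called as `merge_files("file_A.txt", "file_B.txt", "file_C.txt")`, with the three names standing in for the real paths.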
