#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    11
    Rep Power
    0

    Join data in columns with more than one condition


    Hi all,

    I hope someone here can help me. I have this type of file:

    PHP Code:
    K00001    32        0    BR:ko01002    Metabolism    Enzyme Families    Peptidases
    K00001    177        1    PATH:ko04142    Cellular Processes    Transport and Catabolism    Lysosome
    K00001    189        2    PATH:ko04612    Organismal Systems    Immune System    Antigen processing and presentation
    K00001    212        3    BR:ko03110    Genetic Information Processing    Folding, Sorting and Degradation    Chaperones and folding catalysts
    K00001    232        4    PATH:ko04145    Cellular Processes    Transport and Catabolism    Phagosome
    K00001    233        5    PATH:ko05152    Human Diseases    Infectious Diseases    Tuberculosis
    K00001    234        6    PATH:ko05323    Human Diseases    Immune System Diseases    Rheumatoid arthritis
    K00001    346        7    PATH:ko04141    Genetic Information Processing    Folding, Sorting and Degradation    Protein processing in endoplasmic reticulum
    K00002    182        8    PATH:ko04210    Cellular Processes    Cell Growth and Death    Apoptosis
    K00002    189        9    PATH:ko05010    Human Diseases    Neurodegenerative Diseases    Alzheimer's disease
    K00002    273        10    None    Unclassified    Metabolism    Amino acid metabolism
    K00003    146        11    BR:ko03000    Genetic Information Processing    Transcription    Transcription factors
    K00003    193        12    PATH:ko04111    Cellular Processes    Cell Growth and Death    Cell cycle - yeast
    K00003    240        13    PATH:ko00600    Metabolism    Lipid Metabolism    Sphingolipid metabolism
    K00004    176        14    PATH:ko04020    Environmental Information Processing    Signal Transduction    Calcium signaling pathway
    K00005    273        15    PATH:ko04974    Organismal Systems    Digestive System    Protein digestion and absorption
    K00006    192        16    BR:ko04030    Environmental Information Processing    Signaling Molecules and Interaction    G protein-coupled receptors
    K00007    51        17    PATH:ko04080    Environmental Information Processing    Signaling Molecules and Interaction    Neuroactive ligand-receptor interaction
    K00007    184        18    BR:ko01003    Metabolism    Glycan Biosynthesis and Metabolism    Glycosyltransferases
    K00008    51        19    PATH:ko04620    Organismal Systems    Immune System    Toll-like receptor signaling pathway
    K00009    51        20    PATH:ko05133    Human Diseases    Infectious Diseases    Pertussis
    K00010    43        21    PATH:ko05164    Human Diseases    Infectious Diseases    Influenza A
    K00010    257        22    PATH:ko05160    Human Diseases    Infectious Diseases    Hepatitis C 

    Column A contains a list of IDs (often repeated), and every ID is associated with a number in column B.
    The numbers in column B correspond to the numbers in column D, which are associated with functions (columns E, F, ...).
    In other words, the IDs in column A should be associated with their functions (columns E, F, G, ...). The ID column and the function columns are linked by the numbers in columns B and D; in the example there does not seem to be any relationship, but that is because the file is huge and has hundreds of lines not shown here.
    In the end, column A should contain a single entry per ID, without losing the associated functions. As you can see, some IDs appear only once and so have only one associated function; others appear several times (like K00001) and so have several functions, which need to be reported in the same row, separated by tabs.

    At the end I need a new file like this:

    PHP Code:
    K00001 Metabolism Enzyme Families Peptidases <tab> Organismal Systems Immune System Antigen processing and presentation <tab>
    K0009 Metabolism Enzyme Families Peptidases

    From the original file I need to keep only column A (the IDs) and the associated functions (columns E, F, G, ...).
    When an ID (like K00001) has more than one associated function, all of its functions should be reported in the same row, separated by tabs.

    I wonder if anyone could suggest a script to do that.
    I know it is a little tricky and complicated; I hope I have been clear and that someone can help me.

    Thanks a lot in advance,
    Giorgio
  2. #2
  3. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,894
    Rep Power
    481
    I don't begin to understand what to do with the second, third and fourth columns. Ignoring them, and using your input as a file named d, this gawk program merges the data of rows that share the same first column onto a single line.
    Code:
    $ gawk '$1!=A{A=$1;printf"\n%s",A;c=" "}$1==A{$1=$2=$3=$4="";printf"%c%s",c,$0;c="\t"}' d
    
    K00001     Metabolism Enzyme Families Peptidases	    Cellular Processes Transport and Catabolism Lysosome	    Organismal Systems Immune System Antigen processing and presentation	    Genetic Information Processing Folding, Sorting and Degradation Chaperones and folding catalysts	    Cellular Processes Transport and Catabolism Phagosome	    Human Diseases Infectious Diseases Tuberculosis	    Human Diseases Immune System Diseases Rheumatoid arthritis	    Genetic Information Processing Folding, Sorting and Degradation Protein processing in endoplasmic reticulum
    K00002     Cellular Processes Cell Growth and Death Apoptosis	    Human Diseases Neurodegenerative Diseases Alzheimer's disease	    Unclassified Metabolism Amino acid metabolism
    K00003     Genetic Information Processing Transcription Transcription factors	    Cellular Processes Cell Growth and Death Cell cycle - yeast	    Metabolism Lipid Metabolism Sphingolipid metabolism
    K00004     Environmental Information Processing Signal Transduction Calcium signaling pathway
    K00005     Organismal Systems Digestive System Protein digestion and absorption
    K00006     Environmental Information Processing Signaling Molecules and Interaction G protein-coupled receptors
    K00007     Environmental Information Processing Signaling Molecules and Interaction Neuroactive ligand-receptor interaction	    Metabolism Glycan Biosynthesis and Metabolism Glycosyltransferases
    K00008     Organismal Systems Immune System Toll-like receptor signaling pathway
    K00009     Human Diseases Infectious Diseases Pertussis
    K00010     Human Diseases Infectious Diseases Influenza A	    Human Diseases Infectious Diseases Hepatitis C
    $
    [code]Code tags[/code] are essential for python code and Makefiles!
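    For readers who prefer Python, the same streaming idea (merge adjacent rows that share the first column) can be sketched with itertools.groupby. This is a rough equivalent, not a translation of the gawk one-liner: the function name merge_consecutive is made up for illustration, the sketch assumes rows with the same ID are adjacent in the input, and it separates the ID from the first function with a tab rather than a space.

```python
from itertools import groupby


def merge_consecutive(lines):
    """Yield one tab-separated output line per run of rows sharing an ID."""
    rows = (line.split() for line in lines)
    rows = (row for row in rows if len(row) >= 5)  # skip blank or short lines
    for key, group in groupby(rows, key=lambda row: row[0]):
        # Fields 1-3 (the two numbers and the BR:/PATH: tag) are dropped.
        functions = [" ".join(row[4:]) for row in group]
        yield key + "\t" + "\t".join(functions)


sample = [
    "K00007    51    17    PATH:ko04080    Environmental Information Processing    Signaling Molecules and Interaction    Neuroactive ligand-receptor interaction",
    "K00007    184    18    BR:ko01003    Metabolism    Glycan Biosynthesis and Metabolism    Glycosyltransferases",
    "K00008    51    19    PATH:ko04620    Organismal Systems    Immune System    Toll-like receptor signaling pathway",
]
for out in merge_consecutive(sample):
    print(out)
```

    Like the gawk version, this only merges rows that sit next to each other, so an ID that reappears later in the file would start a new output line.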
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    11
    Rep Power
    0
    Hi, thanks for your answer,

    the second, third and fourth columns should be deleted from the new file. Anyway, I've tried your command on the entire file and written the output to a new file, and the result is:

    PHP Code:
      K00001     Metabolism Enzyme Families Peptidases
    K00001 177 1 PATH:ko04142 Cellular Processes Transport and Catabolism Lysosome
    K00001 189 2 PATH:ko04612 Organismal Systems Immune System Antigen processing and presentation
    K00001 212 3 BR:ko03110 Genetic Information Processing Folding, Sorting and Degradation Chaperones and folding catalysts
    K00001 232 4 PATH:ko04145 Cellular Processes Transport and Catabolism Phagosome
    K00001 233 5 PATH:ko05152 Human Diseases Infectious Diseases Tuberculosis
    K00001 234 6 PATH:ko05323 Human Diseases Immune System Diseases Rheumatoid arthritis
    K00001 346 7 PATH:ko04141 Genetic Information Processing Folding, Sorting and Degradation Protein processing in endoplasmic reticulum
    K00002 182 8 PATH:ko04210 Cellular Processes Cell Growth and Death Apoptosis
    K00002 189 9 PATH:ko05010 Human Diseases Neurodegenerative Diseases Alzheimer's disease
    K00002 273 10 None Unclassified Metabolism Amino acid metabolism
    K00003 146 11 BR:ko03000 Genetic Information Processing Transcription Transcription factors
    K00003 193 12 PATH:ko04111 Cellular Processes Cell Growth and Death Cell cycle - yeast
    K00003 240 13 PATH:ko00600 Metabolism Lipid Metabolism Sphingolipid metabolism
    K00004 176 14 PATH:ko04020 Environmental Information Processing Signal Transduction Calcium signaling pathway
    K00005 273 15 PATH:ko04974 Organismal Systems Digestive System Protein digestion and absorption
    K00006 192 16 BR:ko04030 Environmental Information Processing Signaling Molecules and Interaction G protein-coupled receptors
    K00007 51 17 PATH:ko04080 Environmental Information Processing Signaling Molecules and Interaction Neuroactive ligand-receptor interaction
    K00007 184 18 BR:ko01003 Metabolism Glycan Biosynthesis and Metabolism Glycosyltransferases
    K00008 51 19 PATH:ko04620 Organismal Systems Immune System Toll-like receptor signaling pathway
    K00009 51 20 PATH:ko05133 Human Diseases Infectious Diseases Pertussis
    K00010 43 21 PATH:ko05164 Human Diseases Infectious Diseases Influenza A
    K00010 257 22 PATH:ko05160 Human Diseases Infectious Diseases Hepatitis C
    K00011 51 23 PATH:ko05142 Human Diseases Infectious Diseases Chagas disease (American trypanosomiasis)
    K00011 184 24 BR:ko00535 Metabolism Glycan Biosynthesis and Metabolism Proteoglycans
    K00011 208 25 BR:ko03009 Genetic Information Processing Translation Ribosome Biogenesis
    K00011 273 26 BR:ko02000 Environmental Information Processing Membrane Transport Transporters
    K00011 326 27 PATH:ko02010 Environmental Information Processing Membrane Transport ABC transporters 
    There are still multiple entries per ID, and the numbers and the PATH:XXXXX or BR:XXXXX fields (2nd, 3rd and 4th columns) should be deleted.

    I don't know if I explained it correctly.
  6. #4
  7. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2012
    Location
    39N 104.28W
    Posts
    158
    Rep Power
    3
    It seems like the best method, perhaps only as an intermediate step if that's not how you want to keep the data, would be to have a dictionary where the ID is the key and the data you want is the value. So I'd initialize the dictionary:
    Code:
    dctBio={}
    Then I'd read each line and split it up on whitespace:
    Code:
    lstBio=fileobject.readline().split()  # fileobject: your open input file
    Now, you know that the first (index 0) element is the ID. Then the data is in elements 5 and beyond (indices 4 and beyond). So if the key already exists you append the new data; if not, you set it.
    Code:
    try: dctBio[lstBio[0]]+='\t'+" ".join(lstBio[4:])
    except KeyError: dctBio[lstBio[0]]=" ".join(lstBio[4:])  # first time this ID is seen
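    Put together, the dictionary steps above might look like the sketch below. It is only an illustration: the function name merge_functions and the inline sample are not from the thread, and it assumes every data row has at least five whitespace-separated fields. Because the split is on any whitespace, the boundaries between the function columns (E, F, G, ...) are not preserved; each row's functions come out as one space-joined string.

```python
import io


def merge_functions(lines):
    """Group the function text (fields 5 and beyond) by the ID in field 1."""
    merged = {}  # ID -> list of function descriptions, in first-seen order
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or incomplete lines
        merged.setdefault(fields[0], []).append(" ".join(fields[4:]))
    return merged


# A few rows from the first post stand in for the real file here.
sample = io.StringIO(
    "K00003    146    11    BR:ko03000    Genetic Information Processing    Transcription    Transcription factors\n"
    "K00003    193    12    PATH:ko04111    Cellular Processes    Cell Growth and Death    Cell cycle - yeast\n"
    "K00004    176    14    PATH:ko04020    Environmental Information Processing    Signal Transduction    Calcium signaling pathway\n"
)

for key, functions in merge_functions(sample).items():
    # One output row per ID; multiple functions are separated by tabs.
    print(key + "\t" + "\t".join(functions))
```

    To run it on a real file, pass an open file object instead of the sample, for example the file named d used earlier in the thread. Python 3 opens text files with universal newlines by default, so LF, CRLF and CR line endings are all treated as line breaks. Unlike the streaming approach, the dictionary also merges rows for an ID that are not adjacent in the file.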
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    11
    Rep Power
    0

    Thanks rrashkin!

    Unfortunately I'm pretty much a newbie to scripting; I'm afraid I haven't grasped your advice, and I'm unable to apply it.
  10. #6
  11. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2012
    Location
    39N 104.28W
    Posts
    158
    Rep Power
    3
    What have you written so far?
  12. #7
  13. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,894
    Rep Power
    481
    I didn't concoct a phony answer. If your result differs, either you did not copy the program correctly, your version of gawk is completely wrong, or your line endings are different from mine. Do you have a nasty DOS file on a unix system? Try again with and without the BEGIN{RS="\r\n"}
    Code:
    gawk 'BEGIN{RS="\r\n"}$1!=A{A=$1;printf"\n%s",A;c=" "}$1==A{$1=$2=$3=$4="";printf"%c%s",c,$0;c="\t"}' d
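    Before guessing at a record separator, it may help to check which line endings the file actually contains. The sketch below is a hypothetical helper (count_line_endings is not from the thread) that counts each kind in the raw bytes. A file with only CR endings would be read by awk as a single record, which is one possible explanation for output where only the first row is stripped.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF (DOS), bare CR (classic Mac) and bare LF (Unix) endings."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "CR": data.count(b"\r") - crlf,  # carriage returns not followed by LF
        "LF": data.count(b"\n") - crlf,  # line feeds not preceded by CR
    }


# Tiny demonstration with one ending of each kind.
print(count_line_endings(b"a\r\nb\rc\n"))
```

    For a real file, read it in binary mode first, for example open("d", "rb").read(), and pass the bytes to the function. If the CR count dominates, RS="\r" would be the separator to try in gawk; if CRLF dominates, RS="\r\n".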
  14. #8
  15. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    11
    Rep Power
    0
    I never meant to suggest that. I know you are right; I'm only trying to figure out why it does not work for me.

    I'm using GNU Awk 3.1.8 on Mac OS X, and that could be the reason. I would love to attach the file here, but I'm not sure how, or whether I can.
  16. #9
  17. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    11
    Rep Power
    0
    Originally Posted by rrashkin
    What have you written so far?
    Very little: a few loops and other very basic tutorial material.