Join data in columns with more than one condition

November 26th, 2012, 02:53 PM
Registered User
Join Date: Nov 2012
Posts: 11
Time spent in forums: 1 h 56 m 52 sec
Reputation Power: 0
Hi all,
I guess someone here could help me. I have this type of file:
Code:
K00001 32 0 BR:ko01002 Metabolism Enzyme Families Peptidases
K00001 177 1 PATH:ko04142 Cellular Processes Transport and Catabolism Lysosome
K00001 189 2 PATH:ko04612 Organismal Systems Immune System Antigen processing and presentation
K00001 212 3 BR:ko03110 Genetic Information Processing Folding, Sorting and Degradation Chaperones and folding catalysts
K00001 232 4 PATH:ko04145 Cellular Processes Transport and Catabolism Phagosome
K00001 233 5 PATH:ko05152 Human Diseases Infectious Diseases Tuberculosis
K00001 234 6 PATH:ko05323 Human Diseases Immune System Diseases Rheumatoid arthritis
K00001 346 7 PATH:ko04141 Genetic Information Processing Folding, Sorting and Degradation Protein processing in endoplasmic reticulum
K00002 182 8 PATH:ko04210 Cellular Processes Cell Growth and Death Apoptosis
K00002 189 9 PATH:ko05010 Human Diseases Neurodegenerative Diseases Alzheimer's disease
K00002 273 10 None Unclassified Metabolism Amino acid metabolism
K00003 146 11 BR:ko03000 Genetic Information Processing Transcription Transcription factors
K00003 193 12 PATH:ko04111 Cellular Processes Cell Growth and Death Cell cycle - yeast
K00003 240 13 PATH:ko00600 Metabolism Lipid Metabolism Sphingolipid metabolism
K00004 176 14 PATH:ko04020 Environmental Information Processing Signal Transduction Calcium signaling pathway
K00005 273 15 PATH:ko04974 Organismal Systems Digestive System Protein digestion and absorption
K00006 192 16 BR:ko04030 Environmental Information Processing Signaling Molecules and Interaction G protein-coupled receptors
K00007 51 17 PATH:ko04080 Environmental Information Processing Signaling Molecules and Interaction Neuroactive ligand-receptor interaction
K00007 184 18 BR:ko01003 Metabolism Glycan Biosynthesis and Metabolism Glycosyltransferases
K00008 51 19 PATH:ko04620 Organismal Systems Immune System Toll-like receptor signaling pathway
K00009 51 20 PATH:ko05133 Human Diseases Infectious Diseases Pertussis
K00010 43 21 PATH:ko05164 Human Diseases Infectious Diseases Influenza A
K00010 257 22 PATH:ko05160 Human Diseases Infectious Diseases Hepatitis C
Column A holds a list of IDs (often repeated several times), and every ID is associated with a number in column B.
The numbers in column B correspond to the numbers in column D, which are associated with a function (columns E, F, ...).
In other words, the IDs in column A should be associated with their functions (columns E, F, G, ...). The ID column and the function columns are linked by the numbers in columns B and D; in the example there does not seem to be any relationship, but that is because the file is huge and has hundreds of lines not shown here.
Actually, column A should end up with a single entry per ID, without losing the associated functions. As you can see, sometimes an ID appears only once, so it has only one function associated; other times the ID appears several times (like K00001), so it has several functions, and they need to be reported in the same row, tab-separated.
At the end I need a new file like this:
Code:
K00001 Metabolism Enzyme Families Peptidases <tab> Organismal Systems Immune System Antigen processing and presentation <tab>
K0009 Metabolism Enzyme Families Peptidasesss
Of the original file I have to retain only column A (the IDs) and the associated functions (columns E, F, G, ...).
When an ID (like K00001) has more than one function associated, all its functions should be reported in the same row, separated by tabs.
I wonder if anyone could suggest a script to do that.
I know it is a little bit tricky and complicated; I really hope I have been clear and that someone can help me.
Thanks a lot in advance,
Giorgio

November 26th, 2012, 03:35 PM
Contributing User
I don't begin to understand what to do with the second, third and fourth columns. Ignoring them, and using your input as a file named d, this gawk program merges lines that share the same first column onto a single line.
Code:
$ gawk '$1!=A{A=$1;printf"\n%s",A;c=" "}$1==A{$1=$2=$3=$4="";printf"%c%s",c,$0;c="\t"}' d
K00001 Metabolism Enzyme Families Peptidases Cellular Processes Transport and Catabolism Lysosome Organismal Systems Immune System Antigen processing and presentation Genetic Information Processing Folding, Sorting and Degradation Chaperones and folding catalysts Cellular Processes Transport and Catabolism Phagosome Human Diseases Infectious Diseases Tuberculosis Human Diseases Immune System Diseases Rheumatoid arthritis Genetic Information Processing Folding, Sorting and Degradation Protein processing in endoplasmic reticulum
K00002 Cellular Processes Cell Growth and Death Apoptosis Human Diseases Neurodegenerative Diseases Alzheimer's disease Unclassified Metabolism Amino acid metabolism
K00003 Genetic Information Processing Transcription Transcription factors Cellular Processes Cell Growth and Death Cell cycle - yeast Metabolism Lipid Metabolism Sphingolipid metabolism
K00004 Environmental Information Processing Signal Transduction Calcium signaling pathway
K00005 Organismal Systems Digestive System Protein digestion and absorption
K00006 Environmental Information Processing Signaling Molecules and Interaction G protein-coupled receptors
K00007 Environmental Information Processing Signaling Molecules and Interaction Neuroactive ligand-receptor interaction Metabolism Glycan Biosynthesis and Metabolism Glycosyltransferases
K00008 Organismal Systems Immune System Toll-like receptor signaling pathway
K00009 Human Diseases Infectious Diseases Pertussis
K00010 Human Diseases Infectious Diseases Influenza A Human Diseases Infectious Diseases Hepatitis C
$
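For anyone who would rather stay in Python, here is a rough sketch of the same idea as the gawk one-liner above. Like the one-liner, it assumes that lines with the same ID are adjacent and that the columns are separated by whitespace. The function name merge_adjacent and the choice of a tab after the ID are my own assumptions, not something from the thread.

```python
from itertools import groupby


def merge_adjacent(lines):
    """Merge consecutive lines that share the same first column.

    Columns 2-4 are dropped; the remaining words of each line are
    kept as one function description, and the descriptions of one
    ID are joined with tabs. (Hypothetical helper, a sketch only.)
    """
    # Split every non-blank line into whitespace-separated fields.
    rows = (line.split() for line in lines if line.strip())
    merged = []
    # groupby only groups *adjacent* rows with an equal key,
    # which mirrors how the gawk program compares $1 with A.
    for key, group in groupby(rows, key=lambda fields: fields[0]):
        functions = [" ".join(fields[4:]) for fields in group]
        merged.append(key + "\t" + "\t".join(functions))
    return merged
```

If the same ID can reappear later in the file after other IDs, this version (like the gawk one) would emit it twice, so the input would need sorting first.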

November 26th, 2012, 03:54 PM
Registered User
Hi, thanks for your answer.
The second, third and fourth columns should be deleted from the new file. Anyway, I've tried your command on the entire file and written the output to a new file, and this is the result:
Code:
K00001 Metabolism Enzyme Families Peptidases
K00001 177 1 PATH:ko04142 Cellular Processes Transport and Catabolism Lysosome
K00001 189 2 PATH:ko04612 Organismal Systems Immune System Antigen processing and presentation
K00001 212 3 BR:ko03110 Genetic Information Processing Folding, Sorting and Degradation Chaperones and folding catalysts
K00001 232 4 PATH:ko04145 Cellular Processes Transport and Catabolism Phagosome
K00001 233 5 PATH:ko05152 Human Diseases Infectious Diseases Tuberculosis
K00001 234 6 PATH:ko05323 Human Diseases Immune System Diseases Rheumatoid arthritis
K00001 346 7 PATH:ko04141 Genetic Information Processing Folding, Sorting and Degradation Protein processing in endoplasmic reticulum
K00002 182 8 PATH:ko04210 Cellular Processes Cell Growth and Death Apoptosis
K00002 189 9 PATH:ko05010 Human Diseases Neurodegenerative Diseases Alzheimer's disease
K00002 273 10 None Unclassified Metabolism Amino acid metabolism
K00003 146 11 BR:ko03000 Genetic Information Processing Transcription Transcription factors
K00003 193 12 PATH:ko04111 Cellular Processes Cell Growth and Death Cell cycle - yeast
K00003 240 13 PATH:ko00600 Metabolism Lipid Metabolism Sphingolipid metabolism
K00004 176 14 PATH:ko04020 Environmental Information Processing Signal Transduction Calcium signaling pathway
K00005 273 15 PATH:ko04974 Organismal Systems Digestive System Protein digestion and absorption
K00006 192 16 BR:ko04030 Environmental Information Processing Signaling Molecules and Interaction G protein-coupled receptors
K00007 51 17 PATH:ko04080 Environmental Information Processing Signaling Molecules and Interaction Neuroactive ligand-receptor interaction
K00007 184 18 BR:ko01003 Metabolism Glycan Biosynthesis and Metabolism Glycosyltransferases
K00008 51 19 PATH:ko04620 Organismal Systems Immune System Toll-like receptor signaling pathway
K00009 51 20 PATH:ko05133 Human Diseases Infectious Diseases Pertussis
K00010 43 21 PATH:ko05164 Human Diseases Infectious Diseases Influenza A
K00010 257 22 PATH:ko05160 Human Diseases Infectious Diseases Hepatitis C
K00011 51 23 PATH:ko05142 Human Diseases Infectious Diseases Chagas disease (American trypanosomiasis)
K00011 184 24 BR:ko00535 Metabolism Glycan Biosynthesis and Metabolism Proteoglycans
K00011 208 25 BR:ko03009 Genetic Information Processing Translation Ribosome Biogenesis
K00011 273 26 BR:ko02000 Environmental Information Processing Membrane Transport Transporters
K00011 326 27 PATH:ko02010 Environmental Information Processing Membrane Transport ABC transporters
There are still multiple entries per ID, and the numbers and the PATH:XXXXX or BR:XXXXX codes (2nd, 3rd and 4th columns) should be deleted.
I don't know if I explained it correctly.

November 26th, 2012, 03:58 PM
Contributing User
Join Date: May 2012
Location: 39N 104.28W
Posts: 90
Time spent in forums: 1 Day 13 h 39 m 14 sec
Reputation Power: 2
It seems like the best method, perhaps even as an intermediate step if it's not the way you want to keep the data, would be to have a dictionary where the ID is the key and the data you want is the value. So I'd initialize an empty dictionary, dctBio = {}.
Then I'd read each line and split it up on whitespace:
Code:
lstBio=<fileobject>.readline().split()
Now, you know that the first element (index 0) is the ID. Then the data is in elements 5 and beyond (indices 4 and beyond). So if the key already exists you append the new data; if not, you set it.
Code:
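try:  # ID already seen: append the next function after a tab
    dctBio[lstBio[0]] += '\t' + " ".join(lstBio[4:])
except KeyError:  # first time this ID appears: start its entry
    dctBio[lstBio[0]] = " ".join(lstBio[4:])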
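Putting the pieces above together, a complete version might look roughly like the sketch below. It keeps a list of functions per ID instead of one growing string, which is a small variation on the approach just described, not the original code. The names merge_functions and format_rows and the output file name merged.txt are made up for illustration, and the columns are assumed to be whitespace-separated.

```python
def merge_functions(lines):
    """Map each ID (column A) to the list of its functions (column E onward)."""
    merged = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank lines and lines with no function columns
        # fields[0] is the ID; fields[1:4] (columns B-D) are discarded.
        merged.setdefault(fields[0], []).append(" ".join(fields[4:]))
    return merged


def format_rows(merged):
    """Build one output row per ID: the ID, then its functions, tab-separated."""
    return [key + "\t" + "\t".join(funcs) for key, funcs in merged.items()]


# Example use (file names are assumptions):
# with open("d") as src, open("merged.txt", "w") as dst:
#     for row in format_rows(merge_functions(src)):
#         dst.write(row + "\n")
```

Unlike the merge-adjacent-lines approach, this does not need lines with the same ID to be next to each other. In Python 3, open() in text mode also translates \r\n and lone \r line endings to \n by default, so files saved on other systems should still split into lines correctly.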

November 26th, 2012, 04:03 PM
Registered User
Thanks rrashkin!
Unfortunately I'm pretty much a newbie to scripting, and I'm afraid I haven't grasped your advice and am unable to apply it.

November 26th, 2012, 04:08 PM
Contributing User
What have you written so far?

November 26th, 2012, 04:44 PM
Contributing User
I didn't concoct a phony answer. If your result differs, either you did not copy the program correctly, your version of gawk is completely wrong, or your line endings are different from mine. Do you have a nasty DOS file on a unix system? DOS lines end in carriage return followed by line feed, so try again with and without the BEGIN{RS="\r\n"}
Code:
gawk 'BEGIN{RS="\r\n"}$1!=A{A=$1;printf"\n%s",A;c=" "}$1==A{$1=$2=$3=$4="";printf"%c%s",c,$0;c="\t"}' d
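If it is unclear which line endings the file actually uses, a quick check along these lines may help. It is only a sketch: the function name count_line_endings is invented here, and the file name d is the one used earlier in the thread. Besides DOS-style \r\n and Unix-style \n, it also counts lone \r, the old Mac convention, which tools expecting \n would treat as one very long line.

```python
def count_line_endings(data):
    """Count the three common line terminators in raw bytes."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,                     # DOS/Windows: \r\n
        "lf": data.count(b"\n") - crlf,   # Unix: \n not preceded by \r
        "cr": data.count(b"\r") - crlf,   # classic Mac: \r not followed by \n
    }


# Example use: read the file in binary mode so nothing is translated.
# with open("d", "rb") as fh:
#     print(count_line_endings(fh.read()))
```

Whichever count is non-zero tells you what the record separator should be.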

November 26th, 2012, 05:17 PM
Registered User
I would never say that; I know you are right. I'm only trying to figure out why it does not work for me.
I'm using GNU Awk 3.1.8 on Mac OS X, and that could be the reason. I would love to attach the file here, but I'm not sure how, or whether I can.

November 26th, 2012, 05:19 PM
Registered User
Quote: Originally Posted by rrashkin: What have you written so far?
Very little: a few loops, very basic tutorial stuff.