#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    11
    Rep Power
    0

    Merge data in columns B that have same entry in column A


    Hi all,

    I have this text file with two columns A (ID) and B (Gene):

    Code:
     
    ID             Gene 
     
    hsa-let-7a	KRAS 
    hsa-let-7a	HMGA2 
    hsa-let-7a	integrin beta(3) 
    hsa-let-7a	caspase-3 
    hsa-let-7a	PRDM1/Blimp-1 
    hsa-let-7a	HMGA2 
    hsa-let-7a	IGF-II 
    hsa-let-7a	HMGA2 
    hsa-let-7a	HMGA2 
    hsa-let-7a	RAS 
    hsa-let-7a	BCL2 
    hsa-let-7a	RAS 
    hsa-let-7a	MYC 
    hsa-let-7a	CDC25A 
    hsa-let-7a	CDK6 
    hsa-let-7a	NF2 
    hsa-let-7a	c-myc 
    hsa-let-7a	RAS 
    hsa-let-7a	RAS 
    hsa-let-7a	NIRF 
    hsa-let-7b	Cdc34 
    hsa-let-7b	Dicer 
    hsa-let-7b	KRAS 
    hsa-let-7b	CCND1 
    hsa-let-7b	CDC25A 
    hsa-let-7b	CDK6 
    hsa-let-7b	HMGA2 
    hsa-let-7c	HMGA2 
    hsa-let-7c	HMGA2 
    hsa-let-7c	HMGA2 
    hsa-let-7c	BCL2 
    hsa-let-7c	RAS 
    hsa-let-7c	CDC25A 
    hsa-let-7c	CDK6 
    hsa-let-7c	RAS 
    hsa-let-7d	KRAS 
    hsa-let-7d	HMGA2 
    hsa-let-7d	BCL2 
    hsa-let-7d	RAS 
    hsa-let-7d	CDC25A 
    hsa-let-7d	CDK6 
    hsa-let-7d 	BDNF 
    hsa-let-7d 	D3R 
    hsa-let-7e	HMGA2 
    hsa-let-7g	KRAS 
    hsa-let-7g	HMGA2 
    hsa-let-7g	Ras 
    hsa-let-7g	HMGA2 
    hsa-let-7g	CDC25A 
    hsa-let-7g	CDK6 
    hsa-miR-1	c-Met 
    hsa-miR-1	calmodulin 
    hsa-miR-1	Gata4 
    hsa-miR-1	Mef2a 
    hsa-miR-1	BCL2 
    hsa-miR-1	Gata4 
    hsa-miR-1	calmodulin 
    hsa-miR-1	Mef2a 
    hsa-miR-1	C/EBPa 
    hsa-miR-1	FoxP1 
    hsa-miR-1	HDAC4 
    hsa-miR-1	MET 
    hsa-miR-1	HCN4 
    hsa-miR-1	FoxP1 
    hsa-miR-1	HDAC4 
    hsa-miR-1	MET 
    hsa-miR-1	Cdk9 
    hsa-miR-1	fibronectin 
    hsa-miR-1	RasGAP 
    hsa-miR-1	Rheb 
    hsa-miR-1	MEF-2 
    hsa-miR-1	nAChR 
    hsa-miR-1	GAJ1 
    hsa-miR-1	KCNJ2 
    hsa-miR-1	HSP60 
    hsa-miR-1	HSP70 
    hsa-miR-1	Hand2 
    hsa-miR-1	Kir2.1 
    hsa-miR-100	Plk1 
    ...... 
    (line cut)
    I would like to have for column A a single entry and in column B the respective associated name comma separated, like that:
    Code:
     	
    ID                     Gene 
       
    hsa-let-7a    KRAS,HMGA2,integrin beta(3),caspase-3,PRDM1/Blimp-1,HMGA2,IGF-II,HMGA2,HMGA2,RAS,BCL2,RAS,MYC,CDC25A,CDK6,NF2,c-myc,RAS,RAS,NIRF 
    hsa-let-7b    Cdc34,Dicer,KRAS,CCND1,CDC25A,CDK6,HMGA2 
    hsa-let-7c    HMGA2,HMGA2,HMGA2,BCL2,RAS,CDC25A,CDK6,RAS 
    .........
    Do you know any way to do that automatically?

    Thanks in advance,
    Giorgio
  2. #2
  3. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,905
    Rep Power
    481
    Use a defaultdict from collections module. Use a default value factory of list.

    import collections
    d = collections.defaultdict(list)

    The key is first field of a line
    The value is the remainder of the line

    d[key].append(value)

    You'll have a dictionary with a list for each key which is then easy to display as you want. If you know the data is sorted by ID then a more memory-efficient algorithm is possible.

    gawk has associative arrays and the problem would be easier to solve in gawk.
    [code]Code tags[/code] are essential for python code and Makefiles!
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    11
    Rep Power
    0
    Thanks a lot for your asnwer!

IMN logo majestic logo threadwatch logo seochat tools logo