#1
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2012
    Posts
    63
    Rep Power
    3

    Find common entries in first column and fetch whatever is in front of it


    Hi all

    I have 2 files

    one is like this

    PTGS2
    IL2RB
    IGF1R
    CALR
    ABCC1
    RET
    ABCB4
    MMP2
    ERBB4
    TP53
    IL7R
    PIK3CG
    SYK
    IL9
    CNTFR
    SLC6A2
    PDGFRA
    PRLR
    Second is like this
    CALR Antigen processing and presentation CPSab CALR Antigen processing and presentation CPSab tttt n 19p13.13c
    KIR2DL5A Antigen processing and presentation CPSab KIR2DL5A tttt n 19p13.13
    KIR2DS1 Antigen processing and presentation CPSab KIR2DS1 tttt n 19q13.4
    KIR2DS2 Antigen processing and presentation CPSab KIR2DS2 tttt n 19q13.4
    KIR2DS3 Antigen processing and presentation CPSab KIR2DS3 tttt n 19q13.4
    KIR2DS5 Antigen processing and presentation CPSab KIR2DS5 tttt n 19q13.4
    PSME1 Antigen processing and presentation CPSab PSME1 tttt n 14q12a
    PSME2 Antigen processing and presentation CPSab PSME2 tttt n 14q12a
    PTK2 Aspirin Blocks Signaling Pathway Involved in Platelet Activation CPSab PTK2 Aspirin Blocks Signaling Pathway Involved in Platelet Activation CPSab tt n 8q24.3c
    SYK Aspirin Blocks Signaling Pathway Involved in Platelet Activation CPSab SYK tt n 9q22.2b
    PIK3C2G CCR3 signaling in Eosinophils CPS CPSab PIK3C2G CCR3 signaling in Eosinophils CPSab t n 12p12.3b
    PTK2 CCR3 signaling in Eosinophils CPS CPSab PTK2 t n 8q24.3c
    CHUK CD40L Signaling Pathway CPSab CHUK CD40L Signaling Pathway CPSab tttt n 10q24.2c
    DUSP1 CD40L Signaling Pathway CPSab DUSP1 tttt n 5q35.1e
    IKBKAP CD40L Signaling Pathway CPSab IKBKAP tttt n 9q31.3a
    MAP3K1 CD40L Signaling Pathway CPSab MAP3K1 tttt n 5q11.2f
    TRAF6 CD40L Signaling Pathway CPSab TRAF6 tttt n 11p12d
    CCNE1 CDK Regulation of DNA Replication C CPSab CCNE1 CDK Regulation of DNA Replication CPSab tttt n 19q12c
    KITLG CDK Regulation of DNA Replication C CPSab KITLG tttt n 12q21.32a
    MCM5 CDK Regulation of DNA Replication C CPSab MCM5 tttt n 22q12.3c
    ORC4L CDK Regulation of DNA Replication C CPSab ORC4L tttt n 2q23.1a
    PIK3C2G CXCR4 Signaling Pathway CPS CPSab PIK3C2G CXCR4 Signaling Pathway CPSab
    I have to check whether any entry in the first file also appears in the first column of the second file; if it does, I have to fetch whatever is present in front of it (the rest of that line) from the second file

    so if CALR is common then output is

    CALR Antigen processing and presentation CPSab CALR Antigen processing and presentation CPSab tttt n 19p13.13c
    Please let me know how to script this in Perl; it is to help one of my friends.
#2
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Location
    Paris area, France
    Posts
    843
    Rep Power
    496
    Hi,

    The way to do it really depends on the size (in terms of number of lines) of each file. Depending on whether one has many more lines than the other, the algorithm may differ.

    In almost all cases, though, I think that the first thing to do with this type of problem is to read the first file line by line, chomp each line and store each line as a key in a hash (the associated values don't really matter; they could be 1 for each hash entry).

    Then you read the second file line by line and, for each line, test it against the hash entries. The way to do it may differ depending on various factors pertaining to the data: the relative size of the files, the size of each line in the second file, the volume of data (i.e. whether you want to optimize for code simplicity or for speed and performance), etc., and also the Perl version you are using. You could use:
    - Regular expressions to find a match and capture whatever is before the match in the line
    - The index and substr functions
    - Possibly the smart match (if your Perl version allows it)

    Another possible approach may be to use the List::Util (and/or possibly List::MoreUtils) modules to compare the list of words in the first file and the list of words in each line of the second file.
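
    To make the hash idea concrete, here is a minimal sketch, not a definitive solution. It assumes the two files are called file1 and file2 (the names used in the awk attempt in this thread), that the first column of the second file is separated from the rest by whitespace (spaces or tabs), and that the sub names load_keys and matching_lines are just names made up for this example. Because only the first column has to match, it looks that column up directly in the hash rather than testing each line against every hash entry.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read the first file (one name per line) into a hash.
# The values do not matter, so each one is simply 1.
sub load_keys {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my %wanted;
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/^\s+|\s+$//g;    # trim stray blanks and carriage returns
        next if $line eq '';
        $wanted{$line} = 1;
    }
    close $fh;
    return \%wanted;
}

# Return every line of the second file whose first
# whitespace-separated field is a key of the hash.
sub matching_lines {
    my ($wanted, $path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my @hits;
    while (my $line = <$fh>) {
        chomp $line;
        my ($first) = split ' ', $line;    # first column only
        next unless defined $first;        # skip blank lines
        push @hits, $line if exists $wanted->{$first};
    }
    close $fh;
    return @hits;
}

# Hypothetical command-line use: perl common.pl file1 file2
if (@ARGV == 2) {
    my ($file1, $file2) = @ARGV;
    print "$_\n" for matching_lines(load_keys($file1), $file2);
}
```

    With the sample data from the first post, this should print the CALR line and the SYK line. Duplicate names in the first file collapse into a single hash key, and every matching line of the second file is printed, so a name that occurs several times in the first column gives several output lines.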
#3
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2012
    Posts
    63
    Rep Power
    3
    Hi

    Thanks for the reply.

    Yes, the second file is larger, but it follows the same pattern as the sample presented here.

    The first file is small; what I presented here is all of it.

    Initially I tried some code in Unix, which worked to a certain extent on a very small sample but not on the large original data.

    Here is the shell code which I tried:

    awk 'NR==FNR{X[$1]=$0;next}{n=split($1,P," ");sub($1,"",$0);for(i=1;i<=n;i++){if(X[P[i]]){print P[i],$0}}}' file1 FS="\t" file2
    Now I am looking for help doing it in Perl!
#4
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Apr 2009
    Posts
    1,970
    Rep Power
    1225
    What have you tried?

    How big are the files?

    Is there a possibility of one or both of the files having duplicate entries in the first column?
