November 12th, 2012, 11:25 PM
-
Find common entries in the first column and fetch whatever is in front of them
Hi all
I have 2 files
one is like this
PTGS2
IL2RB
IGF1R
CALR
ABCC1
RET
ABCB4
MMP2
ERBB4
TP53
IL7R
PIK3CG
SYK
IL9
CNTFR
SLC6A2
PDGFRA
PRLR
Second is like this
CALR Antigen processing and presentation CPSab CALR Antigen processing and presentation CPSab ü tttt n 19p13.13c
KIR2DL5A Antigen processing and presentation CPSab KIR2DL5A ü tttt n 19p13.13
KIR2DS1 Antigen processing and presentation CPSab KIR2DS1 ü tttt n 19q13.4
KIR2DS2 Antigen processing and presentation CPSab KIR2DS2 ü tttt n 19q13.4
KIR2DS3 Antigen processing and presentation CPSab KIR2DS3 ü tttt n 19q13.4
KIR2DS5 Antigen processing and presentation CPSab KIR2DS5 ü tttt n 19q13.4
PSME1 Antigen processing and presentation CPSab PSME1 ü tttt n 14q12a
PSME2 Antigen processing and presentation CPSab PSME2 ü tttt n 14q12a
PTK2 Aspirin Blocks Signaling Pathway Involved in Platelet Activation CPSab PTK2 Aspirin Blocks Signaling Pathway Involved in Platelet Activation CPSab ü tt n 8q24.3c
SYK Aspirin Blocks Signaling Pathway Involved in Platelet Activation CPSab SYK ü tt n 9q22.2b
PIK3C2G CCR3 signaling in Eosinophils CPS CPSab PIK3C2G CCR3 signaling in Eosinophils CPSab ü t n 12p12.3b
PTK2 CCR3 signaling in Eosinophils CPS CPSab PTK2 ü t n 8q24.3c
CHUK CD40L Signaling Pathway CPSab CHUK CD40L Signaling Pathway CPSab ü ü tttt n 10q24.2c
DUSP1 CD40L Signaling Pathway CPSab DUSP1 ü ü ü tttt n 5q35.1e
IKBKAP CD40L Signaling Pathway CPSab IKBKAP ü ü ü ü tttt n 9q31.3a
MAP3K1 CD40L Signaling Pathway CPSab MAP3K1 ü ü ü ü ü tttt n 5q11.2f
TRAF6 CD40L Signaling Pathway CPSab TRAF6 ü ü ü ü ü tttt n 11p12d
CCNE1 CDK Regulation of DNA Replication C CPSab CCNE1 CDK Regulation of DNA Replication CPSab ü ü ü tttt n 19q12c
KITLG CDK Regulation of DNA Replication C CPSab KITLG ü ü ü tttt n 12q21.32a
MCM5 CDK Regulation of DNA Replication C CPSab MCM5 ü ü tttt n 22q12.3c
ORC4L CDK Regulation of DNA Replication C CPSab ORC4L ü ü ü tttt n 2q23.1a
PIK3C2G CXCR4 Signaling Pathway CPS CPSab PIK3C2G CXCR4 Signaling Pathway CPSab
I have to check whether any entry in the first file also appears in the first column of the second file; if it does, I have to fetch whatever is present in front of it from the second file.
So if CALR is common, then the output is
CALR Antigen processing and presentation CPSab CALR Antigen processing and presentation CPSab ü tttt n 19p13.13c
Please let me know how to script this in Perl; it is to help one of my friends.
November 13th, 2012, 02:56 AM
-
Hi,
The way to do it really depends on the size (in terms of number of lines) of each file. Depending on whether one has many more lines than the other, the algorithm may differ.
In almost all cases, though, I think the first thing to do with this type of problem is to read the first file line by line, chomp each line, and store it as a key in a hash (the associated values don't really matter; they could be 1 for each hash entry).
Then, you read the second file line by line and for each line test it against each hash entry. The way to do it may differ depending on various factors pertaining to the data: relative size of the files, size of each line in the second file, volume of data (i.e. do you want to optimize for code simplicity or for speed and performance), etc. and also the Perl version you are using. You could use:
- Regular expressions to find a match and capture whatever is before the match in the line
- The index and substr functions
- Possibly the smart match (if your Perl version allows it)
Another possible approach may be to use the List::Util (and/or possibly List::MoreUtils) modules to compare the list of words in the first file and the list of words in each line of the second file.
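To make the hash idea above concrete, here is a minimal sketch. It assumes the columns of the second file are separated by whitespace and that only an exact match on the first column counts. The helper name matching_lines and the inline sample data are made up for illustration; they are not from the original files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (the name is illustrative): return the lines whose
# first whitespace-separated field is a key of the lookup hash.
sub matching_lines {
    my ($wanted, $lines) = @_;
    my @hits;
    for my $line (@$lines) {
        my ($first) = split ' ', $line;   # first column, leading blanks ignored
        next unless defined $first;       # skip empty lines
        push @hits, $line if exists $wanted->{$first};
    }
    return @hits;
}

# Lookup hash built from a few entries of the first file; the values are
# irrelevant, so each one is simply 1.
my %wanted = map { $_ => 1 } qw(PTGS2 CALR SYK PIK3CG);

# A few shortened lines in the shape of the second file.
my @second = (
    'CALR Antigen processing and presentation CPSab',
    'PTK2 Aspirin Blocks Signaling Pathway CPSab',
    'SYK Aspirin Blocks Signaling Pathway CPSab',
    'PIK3C2G CCR3 signaling in Eosinophils CPSab',
);

print "$_\n" for matching_lines(\%wanted, \@second);
```

With this sample data only the CALR and SYK lines are printed; PIK3C2G is not reported because the hash lookup is exact and the list contains PIK3CG. With real files you would fill the hash by reading the first file line by line (with chomp) and then loop over the second file with while (<$fh>), so the large file never has to be held in memory.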
November 13th, 2012, 03:07 AM
-
Hi
Thanks for the reply.
Yes, the second file is larger, but it follows the same pattern as the sample presented here.
The first file is small; what I presented is all of it.
Initially I tried some code in Unix, which worked to a certain extent on a very small sample but not on the large original data.
Here is the shell code I tried:
awk 'NR==FNR{X[$1]=$0;next}{n=split($1,P," ");sub($1,"",$0);for(i=1;i<=n;i++){if(X[P[i]]){print P[i],$0}}}' file1 FS="\t" file2
Now I am looking for help with Perl!
November 13th, 2012, 08:37 AM
-
What have you tried?
How big are the files?
Is there a possibility of one or both of the files having duplicate entries in the first column?