#1
  1. strongbad dance now
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2004
    Location
    Bermuda
    Posts
    324
    Rep Power
    62

    AWK: rm duplicates based on muli fields


    Ok, long winded title

    I'm trying to use awk to remove rows that are duplicates based on 3 fields, and I want to keep the on that has the higher value in another field. I'm working in C-Shell. For example the below is greped out of a larger data set to use in here as example:

    Input (Field separator is a comma:
    Code:
    4180,-6999,MA,BARNSTABLE,BOURNE,1,1.7,1700,PM,1/26
    4180,-6999,MA,BARNSTABLE,BOURNE,1,3.5,2025,PM,1/26
    4180,-6999,MA,BARNSTABLE,BOURNE,1,1.0,1511,PM,1/26
    4180,-6999,MA,BARNSTABLE,BOURNE,1,5.7,0540,AM,1/27
    I want to identify duplicates based on fields 1, 2, and 10, and the one with the highest value in field 7. I want the output to be:
    Output
    Code:
    4180,-6999,MA,BARNSTABLE,BOURNE,1,3.5,2025,PM,1/26
    4180,-6999,MA,BARNSTABLE,BOURNE,1,5.7,0540,AM,1/27
    I can sort the data, it won't cause issues. So I tried
    Code:
    sort -r -t, master.txt | awk -F, '{ keys=$1$2$10 ; if ( data[keys]++ == 0 ) lines[++count] = $0 } END {for ( i = 1 ; i <= count ; i++ ) print lines[i] }'
    When I try the above command piping "grep BOURNE" onto the end of it, it can't find any BOURNE. Howeverm, if instead of reverse sorting, I just use:
    Code:
    sort -t, master.txt | awk -F, '{ keys=$1$2$10 ; if ( data[keys]++ == 0 ) lines[++count] = $0 } END {for ( i = 1 ; i <= count ; i++ ) print lines[i] }'
    and grep BOURNE on the above, output, I get the lowest values of field seven to return:
    Code:
    4180,-6999,MA,BARNSTABLE,BOURNE,1,1.0,1511,PM,1/26
    4180,-6999,MA,BARNSTABLE,BOURNE,1,5.7,0540,AM,1/27
    I'd really appreciate any help!
  2. #2
  3. strongbad dance now
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2004
    Location
    Bermuda
    Posts
    324
    Rep Power
    62

    Solved


    Following up with the solution I came up with:

    Code:
    sort -t, master.txt | awk -F, '{ s = $1 $2 $10; if(s != prevs) { if (FNR > 1 ) print prevline; preval = $7; prevline = $0;} else if ($7 > preval) { preval = $7; prevline = $0;} prevs =s;} END {print prevline }'
    Hope someone finds it helpful.

IMN logo majestic logo threadwatch logo seochat tools logo