#1
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2012
    Posts
    6
    Rep Power
    0

    Identify duplicates and update the last 2 digits to 0 for both the Orig and Dup


    Hi,

    I need to identify duplicate rows in a file based on the first 6 characters (it is a fixed-width file with 12-character rows). Whenever a duplicate is found, the last 2 characters of both the original and the duplicate row should be set to 00 if they differ. In other words, the last 2 digits of the original and the duplicate must match; if they do not, default both to 00, otherwise keep them as they are.


    I thought of using the uniq command to redirect non-duplicates to one file and duplicates to another, and then looping over the duplicates, but considering the data volumes I would prefer to do it in awk or sed.


    Here is the sample input and output:


    input:
    Code:
    1251233Y1234
    1221249N8821
    1231116Y9945
    1231113Y2123
    1231109Y3212
    1231123N1214
    1231126N1214
    output should be:
    Code:
    1251233Y1234
    1221249N8821
    1231116Y9900
    1231113Y2100
    1231109Y3212
    1231123N1214
    1231126N1214 (since the last 2 digits are the same, nothing changed)
    Any help in achieving the above result using either awk/sed will be greatly appreciated.

    Thanks,
    Faraway
#2
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Sep 2006
    Posts
    834
    Rep Power
    387



    Try this:
    Code:
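    awk '{k=substr($0,1,6); if(!(k in a)) keys[++nk]=k; a[k]=a[k]","$0}
    END {
      # walk the keys in first-seen order; for (k in a) has no defined order
      for (j=1;j<=nk;j++){
        k=keys[j];
        n=split(a[k],o,",");   # o[1] is empty because of the leading comma
        if(n>2){               # two or more rows share this key
          for(i=2;i<=n;i++) {d=substr(o[i],11,2);  m[k,d]+=1;}
          for(i=2;i<=n;i++) {d=substr(o[i],11,2);
            if (m[k,d]>1) print o[i];            # suffix shared with another row: keep
            else print substr(o[i],1,10)"00";    # suffix unique within the group: zero it
          }
        }
        else print o[2];       # unique key: print unchanged
      }
    }' < input.txt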
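
    The approach above buffers every row in memory until END. For large files, a two-pass variant may be worth trying: it reads the file twice, keeps only one entry per key, and prints rows in their original line order. This is only a sketch. The function name fix_dup_suffix is made up for illustration, and it assumes that if any two rows in a group have different last 2 digits, every row in that group gets 00 (the code above zeros only the rows whose suffix is unique within the group).

```shell
# Hypothetical helper: fix_dup_suffix FILE
# Assumes fixed-width 12-character rows, key = chars 1-6, suffix = chars 11-12.
fix_dup_suffix() {
  awk '
    # Pass 1: per key, remember the first suffix seen and flag any mismatch.
    NR == FNR {
      k = substr($0, 1, 6); s = substr($0, 11, 2)
      if (!(k in first)) first[k] = s
      else if (first[k] != s) differ[k] = 1
      next
    }
    # Pass 2: print rows in input order, zeroing the suffix for flagged keys.
    {
      k = substr($0, 1, 6)
      if (k in differ) print substr($0, 1, 10) "00"
      else print
    }
  ' "$1" "$1"
}
```

    Called as fix_dup_suffix input.txt, it writes the corrected rows to standard output. It needs a real file rather than a pipe, since the same file is read twice.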
