#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2011
    Posts
    7
    Rep Power
    0

    Script to edit a column in CSV file


    Hi,

    Iím very new to batch files and scripting (though I do genuinely want to learn, and would be happy to take some suggestions on where to learn more.)

    What I need to do is open a .csv, and update a column of numbers. Right now, that column has the same number for the entire file but instead, I need to take the 1st number and add 1 to it for the following row, all the way to the end. Eventually Iíll need to do this for all .csv files in a directory, but Iíll be happy to just get the 1st part working for now.

    I promise, Iíve spent many hours researching this but it seems most of what Iíve discovered is way too complicated and I have little idea how to decipher what Iíve seen so far. I admit, I hoped scripting would be similar to .net or .cfm or at least close enough that I could pick it up, but embarrassed to say Iím pretty lost.

    Any help would be greatly appreciated.

    Thanks in advance!
  2. #2
  3. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2011
    Posts
    7
    Rep Power
    0
    Ok, I've broken it down to tasks so I wouldn't be so overwhelmed.

    As far as I can tell, I'll need to do the following:

    1. Open a .csv
    2. locate a batesnumber (it's in the 5th column)
    3. set the 1st 6 characters as var_first
    4. set the last 3 as var_last
    5. the remaining 8 in the middle as var_middle
    6. Go to the next row
    7. set var_new = var_middle +1
    8. replace batesnumber in this row with "var_first & var_new & var_last"
    9. set var_middle = var_new
    10. repeat 6-9 to the end of file.

    Again, any help at all is appreciated. Even how to get started would be great as I honestly want to learn.

    Thanks.
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2011
    Posts
    7
    Rep Power
    0
    Ok, I got it to open. Now I'm stuck trying to figure out how to get to the 5th column of text. I'm assuming I have to use the commas to somehow distinguish it, but I'm at a loss for how to proceed. Most of the things online show how to do a find and replace, especially for symbols, but I haven't seen how to locate a set of text and make it a var.

    Any help please?
  6. #4
  7. No Profile Picture
    Lost in code
    Devshed Supreme Being (6500+ posts)

    Join Date
    Dec 2004
    Posts
    8,316
    Rep Power
    7170
    I have no idea how to do this from a batch script, but it would be pretty simple to do from something like PHP.
    PHP FAQ

    Originally Posted by Spad
    Ah USB, the only rectangular connector where you have to make 3 attempts before you get it the right way around
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2011
    Posts
    7
    Rep Power
    0

    Unhappy


    Originally Posted by E-Oreo
    I have no idea how to do this from a batch script, but it would be pretty simple to do from something like PHP.
    Unfortunately, I don't think PHP is an option for me, unless it was written on a site where I could point it to the csv files and it just did all the edits for me. I normally use vb.net or CF for most of my work, but this assignment has me pulling my hair out! I'm feeling pretty frustrated at this point.
  10. #6
  11. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,928
    Rep Power
    481

    Use gawk


    I think you're asking to replace column 5 with its cumulative sum, all other columns remain the same. Note that the output values of column 5 are 4, 17, 39, ...

    Code:
    $ gawk -F, 'BEGIN{OFS=","}{$5=a+=$5;print}'<<EOF
    > 0,1,2,3,4,5,6,7,8
    > 9,10,11,12,13,14,15,16,17
    > 18,19,20,21,22,23,24,25,26
    > 27,28,29,30,31,32,33,34,35
    > 36,37,38,39,40,41,42,43,44
    > EOF
    0,1,2,3,4,5,6,7,8
    9,10,11,12,17,14,15,16,17
    18,19,20,21,39,23,24,25,26
    27,28,29,30,70,32,33,34,35
    36,37,38,39,110,41,42,43,44
    Last edited by b49P23TIvg; September 20th, 2011 at 09:02 PM. Reason: Shorten the code by a statement. I should really change the data outside column 5 as well to make it more obvious. Oh well.
  12. #7
  13. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,928
    Rep Power
    481

    for all files in the directory


    I've always used this strategy. Maybe there's a more direct way but um, this works. I haven't bothered to think of a more direct method since I've never encountered more than thousands of files with need.

    Code:
    $ for f in * ; do gawk 'one line gawk program' $f >unused_file_name; mv unused_file_name $f ; done
    Note that the * means some "glob" pattern. Usually one doesn't want to process every file. I mean, xargs is a clever program. As are other codes. I suspect this is more difficult with a gui interface. So you can do this directly in many unix shells, or you can install cygwin onto MS-whatever and use the bash shell there to use the same command. That's what I'd do.
    Last edited by b49P23TIvg; September 20th, 2011 at 08:52 PM.
  14. #8
  15. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,928
    Rep Power
    481

    sample input/output


    If you want a specific gawk program, please show sample input and sample output. I can say that I don't know what is a "Bate's" number. gawk has functions built in to split fields based on length, on patterns, on Turing machine like instructions, and can lines or whatnot. Easily. Tersely.

    You gave instructions in with unknown syntax, I can't guess the meaning of

    "var_first & var_new & var_last"
  16. #9
  17. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2011
    Posts
    7
    Rep Power
    0

    Smile


    First I want to thank everyone for your responses! Most of them helped get me on the right track and Iím now pretty close to getting it solved, though I do have a couple of questions.

    In response to b49P23TIvg, a Bates number is whatís used in the legal system in the U.S. to number a document and no 2 documents can have the same Bates number when those documents are submitted to the courts. In this case, the format for these Bates numbers are like this:
    ALAB-M00000001-IPC

    The 00000001 is the number Iím trying to increment by 1.

    Since I was having a hard time editing the number within the csv, I just created a new csv with the updated number. The problem Iím now having is that it keeps replacing the very 1st record with the plus one value and Iím trying to get it to start on the 2nd row. Does anyone have any advice on how I can use the first one as a starting point but then only update from the 2nd one on?

    Also, after the 1st directory I point it to, all subsequent csvís are incremented by 1000, instead of by 1, though I donít see why thatís happening.

    As for my code, here is what I have so far:


    'script
    On Error Resume Next
    Set fileObject = CreateObject("Scripting.FileSystemObject")
    Set folder = fileObject.GetFolder("\\nas002\USPS_Nas2\output_dannytest") ' directory to point to
    Set subfolders = folder.SubFolders
    Set fso = CreateObject("Scripting.FileSystemObject")

    For Each folders In subfolders
    counter = 0
    Set files = folders.Files
    For Each file In files
    'if we got the text file then do the rest
    If lcase(fileObject.getExtensionName(file.path))="txt" Then
    Set objTextFile = fileObject.OpenTextFile(file.path, 1)
    'create our file
    Set MyFile = fso.CreateTextFile(file.Path & "-new.csv", True)
    'now we have the data and where to save it

    Do While Not objTextFile.AtEndOfStream 'spilt it into an array of strings
    arrStr = split(objTextFile.ReadLine,""",""")
    strResults = arrStr(5)
    If counter = 1 Then
    theMid = Mid(strResults, 7, 8)
    End If
    newBatesNum = Left(strResults, 6) & Right("00000000" & theMid + counter, 8) & Right(strResults, 6)
    MyFile.WriteLine(arrStr(0) & """,""" & arrStr(1) & """,""" & arrStr(2) & """,""" & arrStr(3) & """,""" & arrStr(4) & """,""" & newBatesNum & """,""" & newBatesNum & """,""" & arrStr(7) & """,""" & arrStr(8) & """,""" & arrStr(9) & """,""" & arrStr(10) & """,""" & arrStr(11) & """,""" & arrStr(12) & """,""" & arrStr(13))
    counter = counter + 1
    Loop
    End if

    Next
    Next
    Wscript.Echo("Done")

    Thanks again for everyoneís contributions and if anyone could help me figure out why this is happening, Iíd appreciate it.
  18. #10
  19. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,928
    Rep Power
    481
    Assuming the txt files contain commas only as field separators (otherwise you need sophisticated parsing):
    <apparently the column numbers are incorrect, and perhaps the alignment within the sixth field is off, but the original poster did not mention its index origin.>

    Code:
    $ find . -name '*.[tT][xX][tT]' -exec script.sh {} \;
    script.sh is executable and contains

    Code:
    gawk -F, 'BEGIN{OFS=","}1==NR{first=substr($5,1,6);bates=substr($5,1+6,8);last=substr($5,1+6+8)}1<NR{$5=first sprintf("%08d",++bates) last}{print}' $1 > $1.modified
    Next verify the *.txt.modified files are correct, then rename them with a command like this, which you can execute in each subdirectory. Again, I'd stick this command into a script and use find as in find . -type d -exec renaming.script.sh {} \;

    Code:
    cd $1
    for f in *.modified;do echo mv $f ${f#.modified};done
    cd - # superfluous, but I have not tested
    Last edited by b49P23TIvg; September 21st, 2011 at 03:34 PM. Reason: missing % in format string.
  20. #11
  21. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2011
    Posts
    7
    Rep Power
    0
    Ok, I got some of the issues resolved but it's still chopping the 1st row of data and also screwing up the header by adding the 0000's to that row. This is what the data looks like before I run it:


    "Claim Number","Authenticity","Box Number","File Type","Case Number","Beginning Bates","Ending Bates","FolderNumber","DocumentNote","DocumentPageCount","BatchCreateDate","Parent ID","District","Document Path"
    "M-901000","Original","MCCON06971","NWRKS","ALBERTA","ALAB-M00001133-NWRKS","ALAB-M00001136-NWRKS","1","","4","1/26/2011","06971_01","ALAB","ALAB_MCCON06971_NWRKS\00000001\000001.tif"
    "M-901000","Original","MCCON06971","NWRKS","ALBERTA","ALAB-M00001133-NWRKS","ALAB-M00001136-NWRKS","1","","4","1/26/2011","06971_01","ALAB","ALAB_MCCON06971_NWRKS\00000001\000002.tif"
    "M-901000","Original","MCCON06971","NWRKS","ALBERTA","ALAB-M00001133-NWRKS","ALAB-M00001136-NWRKS","1","","4","1/26/2011","06971_01","ALAB","ALAB_MCCON06971_NWRKS\00000001\000003.tif"
    "M-901000","Original","MCCON06971","NWRKS","ALBERTA","ALAB-M00001133-NWRKS","ALAB-M00001136-NWRKS","1","","4","1/26/2011","06971_01","ALAB","ALAB_MCCON06971_NWRKS\00000001\000004.tif"

    And this is what it's doing after I run it:

    "Claim Number","Authenticity","Box Number","File Type","Case Number","Beginn00001133g Bates","Beginn00001133g Bates","FolderNumber","DocumentNote","DocumentPageCount","BatchCreateDate","Parent ID","District","Document Path"
    "M-048326","Original","MCCON06983","IC","062105507","ALAB-M00000002-IC","ALAB-M00000002-IC","1","","283","2/14/2011","06983_01","ALAB","ALAB_MCCON06983_IC\00000001\000001.tif"
    "M-048326","Original","MCCON06983","IC","062105507","ALAB-M00000003-IC","ALAB-M00000003-IC","1","","283","2/14/2011","06983_01","ALAB","ALAB_MCCON06983_IC\00000001\000002.tif"
    "M-048326","Original","MCCON06983","IC","062105507","ALAB-M00000004-IC","ALAB-M00000004-IC","1","","283","2/14/2011","06983_01","ALAB","ALAB_MCCON06983_IC\00000001\000003.tif"
    "M-048326","Original","MCCON06983","IC","062105507","ALAB-M00000005-IC","ALAB-M00000005-IC","1","","283","2/14/2011","06983_01","ALAB","ALAB_MCCON06983_IC\00000001\000004.tif"
    "M-048326","Original","MCCON06983","IC","062105507","ALAB-M00000006-IC","ALAB-M00000006-IC","1","","283","2/14/2011","06983_01","ALAB","ALAB_MCCON06983_IC\00000001\000005.tif"
    "M-048326","Original","MCCON06983","IC","062105507","ALAB-M00000007-IC","ALAB-M00000007-IC","1","","283","2/14/2011","06983_01","ALAB","ALAB_MCCON06983_IC\00000001\000006.tif"


    I know it's hard to see, but only the 1st 2 lines matter, specifically at the Bates Number column. The heading is messed up and it starts at 00002.

    I was wondering if there's a simple way to just copy the 1st 2 rows into the new .csv file and then run the rest of it? I thought that could kill 2 birds with one stone, but I'm just not sure how to implement that. Ideas?
  22. #12
  23. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,928
    Rep Power
    481

    blah


    Code:
    $ head -2 x.csv > mod && tail -2 x.csv | perl_or_whatnot >> mod
  24. #13
  25. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,928
    Rep Power
    481

    test cases help!


    For the input you showed, stored in file a, this one line gawk program produces as shown. My bad, I didn't understand that your csv file had header lines.

    Code:
    $ gawk -F, 'BEGIN{OFS=","}2==NR{first=substr($6,1,7);bates=0+substr($6,1+7,8);last=substr($6,1+7+8)}2<NR{$6=first sprintf("%08d",++bates) last}{print}' a
    "Claim Number","Authenticity","Box Number","File Type","Case Number","Beginning Bates","Ending Bates","FolderNumber","DocumentNote","DocumentPageCount","BatchCreateDate","Parent ID","District","Document Path"
    "M-901000","Original","MCCON06971","NWRKS","ALBERTA","ALAB-M00001133-NWRKS","ALAB-M00001136-NWRKS","1","","4","1/26/2011","06971_01","ALAB","ALAB_MCCON06971_NWRKS\00000001\000001.tif"
    "M-901000","Original","MCCON06971","NWRKS","ALBERTA","ALAB-M00001134-NWRKS","ALAB-M00001136-NWRKS","1","","4","1/26/2011","06971_01","ALAB","ALAB_MCCON06971_NWRKS\00000001\000002.tif"
    "M-901000","Original","MCCON06971","NWRKS","ALBERTA","ALAB-M00001135-NWRKS","ALAB-M00001136-NWRKS","1","","4","1/26/2011","06971_01","ALAB","ALAB_MCCON06971_NWRKS\00000001\000003.tif"
    "M-901000","Original","MCCON06971","NWRKS","ALBERTA","ALAB-M00001136-NWRKS","ALAB-M00001136-NWRKS","1","","4","1/26/2011","06971_01","ALAB","ALAB_MCCON06971_NWRKS\00000001\000004.tif"
    Last edited by b49P23TIvg; September 21st, 2011 at 03:41 PM. Reason: code tags
  26. #14
  27. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2011
    Posts
    7
    Rep Power
    0

    Smile


    Update:

    Thanks for all the help! Had some issues with the counters but I was finally able to get it using some VBScript. Here's the code I used:

    'script
    On Error Resume Next
    Set fileObject = CreateObject("Scripting.FileSystemObject")

    'change this to the directory
    Set folder = fileObject.GetFolder("H:\output")

    Set subfolders = folder.SubFolders
    Set fso = CreateObject("Scripting.FileSystemObject")
    For Each folders In subfolders
    Set files = folders.Files
    counter = -1
    For Each file In files
    'if we got the text file then do the rest
    If lcase(fileObject.getExtensionName(file.path))="txt" Then
    Set objTextFile = fileObject.OpenTextFile(file.path, 1)
    'create our file
    Set MyFile = fso.CreateTextFile(file.Path & "-new.csv", True)
    'now we have the data and where to save it
    Do While Not objTextFile.AtEndOfStream 'spilt it into an array of strings
    arrStr = split(objTextFile.ReadLine,""",""")
    strResults = arrStr(5)

    If counter = 1 Then
    theMid = Mid(strResults, 7, 8)
    End If

    If counter < 1 Then
    MyFile.WriteLine(arrStr(0) & """,""" & arrStr(1) & """,""" & arrStr(2) & """,""" & arrStr(3) & """,""" & arrStr(4) & """,""" & arrStr(5) & """,""" & arrStr(6) & """,""" & arrStr(7) & """,""" & arrStr(8) & """,""" & arrStr(9) & """,""" & arrStr(10) & """,""" & arrStr(11) & """,""" & arrStr(12) & """,""" & arrStr(13))
    counter = counter + 1


    Else
    newBatesNum = Left(strResults, 6) & Right("00000000" & theMid + counter, 8) & Right(strResults, 7 - InStr(Right(strResults, 6), "-"))
    'Wscript.Echo(newBatesNum)
    MyFile.WriteLine(arrStr(0) & """,""" & arrStr(1) & """,""" & arrStr(2) & """,""" & arrStr(3) & """,""" & arrStr(4) & """,""" & newBatesNum & """,""" & newBatesNum & """,""" & arrStr(7) & """,""" & arrStr(8) & """,""" & arrStr(9) & """,""" & arrStr(10) & """,""" & arrStr(11) & """,""" & arrStr(12) & """,""" & arrStr(13))
    counter = counter + 1
    End If

    Loop
    End if

    Next
    Next
    Wscript.Echo("Task Complete")


    Thanks again!

IMN logo majestic logo threadwatch logo seochat tools logo