The Shed is going Social! Join us on FaceBook and Twitter and chime in on the conversation.
|
 |
|
Dev Shed Forums
> Programming Languages - More
> Other Programming Languages
|
Other Language - Script to edit a column in CSV file
Discuss Script to edit a column in CSV file in the Other Programming Languages forum on Dev Shed. Script to edit a column in CSV file A place for discussing programming languages not covered in specific forums such as Assembler, COBOL, etc. - you get the idea.
|
|
 |
|
|
|
|

Dev Shed Forums Sponsor:
|
|
|

September 20th, 2011, 02:36 PM
|
|
Registered User
|
|
Join Date: Sep 2011
Posts: 7
Time spent in forums: 1 h 22 m 27 sec
Reputation Power: 0
|
|
|
Other Language - Script to edit a column in CSV file
Hi,
I’m very new to batch files and scripting (though I do genuinely want to learn, and would be happy to take some suggestions on where to learn more.)
What I need to do is open a .csv, and update a column of numbers. Right now, that column has the same number for the entire file but instead, I need to take the 1st number and add 1 to it for the following row, all the way to the end. Eventually I’ll need to do this for all .csv files in a directory, but I’ll be happy to just get the 1st part working for now.
I promise, I’ve spent many hours researching this but it seems most of what I’ve discovered is way too complicated and I have little idea how to decipher what I’ve seen so far. I admit, I hoped scripting would be similar to .net or .cfm or at least close enough that I could pick it up, but embarrassed to say I’m pretty lost.
Any help would be greatly appreciated.
Thanks in advance!
|

September 20th, 2011, 02:45 PM
|
|
Registered User
|
|
Join Date: Sep 2011
Posts: 7
Time spent in forums: 1 h 22 m 27 sec
Reputation Power: 0
|
|
|
Ok, I've broken it down to tasks so I wouldn't be so overwhelmed.
As far as I can tell, I'll need to do the following:
1. Open a .csv
2. locate a batesnumber (it's in the 5th column)
3. set the 1st 6 characters as var_first
4. set the last 3 as var_last
5. the remaining 8 in the middle as var_middle
6. Go to the next row
7. set var_new = var_middle +1
8. replace batesnumber in this row with "var_first & var_new & var_last"
9. set var_middle = var_new
10. repeat 6-9 to the end of file.
Again, any help at all is appreciated. Even how to get started would be great as I honestly want to learn.
Thanks.
|

September 20th, 2011, 04:16 PM
|
|
Registered User
|
|
Join Date: Sep 2011
Posts: 7
Time spent in forums: 1 h 22 m 27 sec
Reputation Power: 0
|
|
|
Ok, I got it to open. Now I'm stuck trying to figure out how to get to the 5th column of text. I'm assuming I have to use the commas to somehow distinguish it, but I'm at a loss for how to proceed. Most of the things online show how to do a find and replace, especially for symbols, but I haven't seen how to locate a set of text and make it a var.
Any help please?
|

September 20th, 2011, 05:51 PM
|
 |
Lost in code
|
|
|
|
|
I have no idea how to do this from a batch script, but it would be pretty simple to do from something like PHP.
|

September 20th, 2011, 06:08 PM
|
|
Registered User
|
|
Join Date: Sep 2011
Posts: 7
Time spent in forums: 1 h 22 m 27 sec
Reputation Power: 0
|
|
Quote: | Originally Posted by E-Oreo I have no idea how to do this from a batch script, but it would be pretty simple to do from something like PHP. |
Unfortunately, I don't think PHP is an option for me, unless it was written on a site where I could point it to the csv files and it just did all the edits for me. I normally use vb.net or CF for most of my work, but this assignment has me pulling my hair out! I'm feeling pretty frustrated at this point.
|

September 20th, 2011, 08:38 PM
|
 |
Contributing User
|
|
|
|
|
Use gawk
I think you're asking to replace column 5 with its cumulative sum, all other columns remain the same. Note that the output values of column 5 are 4, 17, 39, ...
Code:
$ gawk -F, 'BEGIN{OFS=","}{$5=a+=$5;print}'<<EOF
> 0,1,2,3,4,5,6,7,8
> 9,10,11,12,13,14,15,16,17
> 18,19,20,21,22,23,24,25,26
> 27,28,29,30,31,32,33,34,35
> 36,37,38,39,40,41,42,43,44
> EOF
0,1,2,3,4,5,6,7,8
9,10,11,12,17,14,15,16,17
18,19,20,21,39,23,24,25,26
27,28,29,30,70,32,33,34,35
36,37,38,39,110,41,42,43,44
Last edited by b49P23TIvg : September 20th, 2011 at 09:02 PM.
Reason: Shorten the code by a statement. I should really change the data outside column 5 as well to make it more obvious. Oh well.
|

September 20th, 2011, 08:45 PM
|
 |
Contributing User
|
|
|
|
|
for all files in the directory
I've always used this strategy. Maybe there's a more direct way but um, this works. I haven't bothered to think of a more direct method since I've never encountered more than thousands of files with need.
Code:
$ for f in * ; do gawk 'one line gawk program' $f >unused_file_name; mv unused_file_name $f ; done
Note that the * means some "glob" pattern. Usually one doesn't want to process every file. I mean, xargs is a clever program. As are other codes. I suspect this is more difficult with a gui interface. So you can do this directly in many unix shells, or you can install cygwin onto MS-whatever and use the bash shell there to use the same command. That's what I'd do.
Last edited by b49P23TIvg : September 20th, 2011 at 08:52 PM.
|

September 20th, 2011, 09:15 PM
|
 |
Contributing User
|
|
|
|
|
sample input/output
If you want a specific gawk program, please show sample input and sample output. I can say that I don't know what is a "Bate's" number. gawk has functions built in to split fields based on length, on patterns, on Turing machine like instructions, and can lines or whatnot. Easily. Tersely.
You gave instructions in with unknown syntax, I can't guess the meaning of
"var_first & var_new & var_last"
|

September 21st, 2011, 02:56 AM
|
|
Registered User
|
|
Join Date: Sep 2011
Posts: 7
Time spent in forums: 1 h 22 m 27 sec
Reputation Power: 0
|
|
First I want to thank everyone for your responses! Most of them helped get me on the right track and I’m now pretty close to getting it solved, though I do have a couple of questions.
In response to b49P23TIvg, a Bates number is what’s used in the legal system in the U.S. to number a document and no 2 documents can have the same Bates number when those documents are submitted to the courts. In this case, the format for these Bates numbers are like this:
ALAB-M00000001-IPC
The 00000001 is the number I’m trying to increment by 1.
Since I was having a hard time editing the number within the csv, I just created a new csv with the updated number. The problem I’m now having is that it keeps replacing the very 1st record with the plus one value and I’m trying to get it to start on the 2nd row. Does anyone have any advice on how I can use the first one as a starting point but then only update from the 2nd one on?
Also, after the 1st directory I point it to, all subsequent csv’s are incremented by 1000, instead of by 1, though I don’t see why that’s happening.
As for my code, here is what I have so far:
'script
On Error Resume Next
Set fileObject = CreateObject("Scripting.FileSystemObject")
Set folder = fileObject.GetFolder("\\nas002\USPS_Nas2\output_dannytest") ' directory to point to
Set subfolders = folder.SubFolders
Set fso = CreateObject("Scripting.FileSystemObject")
For Each folders In subfolders
counter = 0
Set files = folders.Files
For Each file In files
'if we got the text file then do the rest
If lcase(fileObject.getExtensionName(file.path))="txt" Then
Set objTextFile = fileObject.OpenTextFile(file.path, 1)
'create our file
Set MyFile = fso.CreateTextFile(file.Path & "-new.csv", True)
'now we have the data and where to save it
Do While Not objTextFile.AtEndOfStream 'spilt it into an array of strings
arrStr = split(objTextFile.ReadLine,""",""")
strResults = arrStr(5)
If counter = 1 Then
theMid = Mid(strResults, 7, 8)
End If
newBatesNum = Left(strResults, 6) & Right("00000000" & theMid + counter, 8) & Right(strResults, 6)
MyFile.WriteLine(arrStr(0) & """,""" & arrStr(1) & """,""" & arrStr(2) & """,""" & arrStr(3) & """,""" & arrStr(4) & """,""" & newBatesNum & """,""" & newBatesNum & """,""" & arrStr(7) & """,""" & arrStr(8) & """,""" & arrStr(9) & """,""" & arrStr(10) & """,""" & arrStr(11) & """,""" & arrStr(12) & """,""" & arrStr(13))
counter = counter + 1
Loop
End if
Next
Next
Wscript.Echo("Done")
Thanks again for everyone’s contributions and if anyone could help me figure out why this is happening, I’d appreciate it.
|

September 21st, 2011, 09:13 AM
|
 |
Contributing User
|
|
|
|
Assuming the txt files contain commas only as field separators (otherwise you need sophisticated parsing):
<apparently the column numbers are incorrect, and perhaps the alignment within the sixth field is off, but the original poster did not mention its index origin.>
Code:
$ find . -name '*.[tT][xX][tT]' -exec script.sh {} \;
script.sh is executable and contains
Code:
gawk -F, 'BEGIN{OFS=","}1==NR{first=substr($5,1,6);bates=substr($5,1+6,8);last=substr($5,1+6+8)}1<NR{$5=first sprintf("%08d",++bates) last}{print}' $1 > $1.modified
Next verify the *.txt.modified files are correct, then rename them with a command like this, which you can execute in each subdirectory. Again, I'd stick this command into a script and use find as in find . -type d -exec renaming.script.sh {} \;
Code:
cd $1
for f in *.modified;do echo mv $f ${f#.modified};done
cd - # superfluous, but I have not tested
Last edited by b49P23TIvg : September 21st, 2011 at 03:34 PM.
Reason: missing % in format string.
|

September 21st, 2011, 02:51 PM
|
|
Registered User
|
|
Join Date: Sep 2011
Posts: 7
Time spent in forums: 1 h 22 m 27 sec
Reputation Power: 0
|
|
|
Ok, I got some of the issues resolved but it's still chopping the 1st row of data and also screwing up the header by adding the 0000's to that row. This is what the data looks like before I run it:
"Claim Number","Authenticity","Box Number","File Type","Case Number","Beginning Bates","Ending Bates","FolderNumber","DocumentNote","DocumentPageCount","BatchCreateDate","Parent ID","District","Document Path"
"M-901000","Original","MCCON06971","NWRKS","ALBERTA","ALAB-M00001133-NWRKS","ALAB-M00001136-NWRKS","1","","4","1/26/2011","06971_01","ALAB","ALAB_MCCON06971_NWRKS\00000001\000001.tif"
"M-901000","Original","MCCON06971","NWRKS","ALBERTA","ALAB-M00001133-NWRKS","ALAB-M00001136-NWRKS","1","","4","1/26/2011","06971_01","ALAB","ALAB_MCCON06971_NWRKS\00000001\000002.tif"
"M-901000","Original","MCCON06971","NWRKS","ALBERTA","ALAB-M00001133-NWRKS","ALAB-M00001136-NWRKS","1","","4","1/26/2011","06971_01","ALAB","ALAB_MCCON06971_NWRKS\00000001\000003.tif"
"M-901000","Original","MCCON06971","NWRKS","ALBERTA","ALAB-M00001133-NWRKS","ALAB-M00001136-NWRKS","1","","4","1/26/2011","06971_01","ALAB","ALAB_MCCON06971_NWRKS\00000001\000004.tif"
And this is what it's doing after I run it:
"Claim Number","Authenticity","Box Number","File Type","Case Number","Beginn00001133g Bates","Beginn00001133g Bates","FolderNumber","DocumentNote","DocumentPageCount","BatchCreateDate","Parent ID","District","Document Path"
"M-048326","Original","MCCON06983","IC","062105507","ALAB-M00000002-IC","ALAB-M00000002-IC","1","","283","2/14/2011","06983_01","ALAB","ALAB_MCCON06983_IC\00000001\000001.tif"
"M-048326","Original","MCCON06983","IC","062105507","ALAB-M00000003-IC","ALAB-M00000003-IC","1","","283","2/14/2011","06983_01","ALAB","ALAB_MCCON06983_IC\00000001\000002.tif"
"M-048326","Original","MCCON06983","IC","062105507","ALAB-M00000004-IC","ALAB-M00000004-IC","1","","283","2/14/2011","06983_01","ALAB","ALAB_MCCON06983_IC\00000001\000003.tif"
"M-048326","Original","MCCON06983","IC","062105507","ALAB-M00000005-IC","ALAB-M00000005-IC","1","","283","2/14/2011","06983_01","ALAB","ALAB_MCCON06983_IC\00000001\000004.tif"
"M-048326","Original","MCCON06983","IC","062105507","ALAB-M00000006-IC","ALAB-M00000006-IC","1","","283","2/14/2011","06983_01","ALAB","ALAB_MCCON06983_IC\00000001\000005.tif"
"M-048326","Original","MCCON06983","IC","062105507","ALAB-M00000007-IC","ALAB-M00000007-IC","1","","283","2/14/2011","06983_01","ALAB","ALAB_MCCON06983_IC\00000001\000006.tif"
I know it's hard to see, but only the 1st 2 lines matter, specifically at the Bates Number column. The heading is messed up and it starts at 00002.
I was wondering if there's a simple way to just copy the 1st 2 rows into the new .csv file and then run the rest of it? I thought that could kill 2 birds with one stone, but I'm just not sure how to implement that. Ideas?
|

September 21st, 2011, 03:23 PM
|
 |
Contributing User
|
|
|
|
|
blah
Code:
$ head -2 x.csv > mod && tail -2 x.csv | perl_or_whatnot >> mod
|

September 21st, 2011, 03:40 PM
|
 |
Contributing User
|
|
|
|
|
test cases help!
For the input you showed, stored in file a, this one line gawk program produces as shown. My bad, I didn't understand that your csv file had header lines.
Code:
$ gawk -F, 'BEGIN{OFS=","}2==NR{first=substr($6,1,7);bates=0+substr($6,1+7,8);last=substr($6,1+7+8)}2<NR{$6=first sprintf("%08d",++bates) last}{print}' a
"Claim Number","Authenticity","Box Number","File Type","Case Number","Beginning Bates","Ending Bates","FolderNumber","DocumentNote","DocumentPageCount","BatchCreateDate","Parent ID","District","Document Path"
"M-901000","Original","MCCON06971","NWRKS","ALBERTA","ALAB-M00001133-NWRKS","ALAB-M00001136-NWRKS","1","","4","1/26/2011","06971_01","ALAB","ALAB_MCCON06971_NWRKS\00000001\000001.tif"
"M-901000","Original","MCCON06971","NWRKS","ALBERTA","ALAB-M00001134-NWRKS","ALAB-M00001136-NWRKS","1","","4","1/26/2011","06971_01","ALAB","ALAB_MCCON06971_NWRKS\00000001\000002.tif"
"M-901000","Original","MCCON06971","NWRKS","ALBERTA","ALAB-M00001135-NWRKS","ALAB-M00001136-NWRKS","1","","4","1/26/2011","06971_01","ALAB","ALAB_MCCON06971_NWRKS\00000001\000003.tif"
"M-901000","Original","MCCON06971","NWRKS","ALBERTA","ALAB-M00001136-NWRKS","ALAB-M00001136-NWRKS","1","","4","1/26/2011","06971_01","ALAB","ALAB_MCCON06971_NWRKS\00000001\000004.tif"
Last edited by b49P23TIvg : September 21st, 2011 at 03:41 PM.
Reason: code tags
|

September 26th, 2011, 06:47 PM
|
|
Registered User
|
|
Join Date: Sep 2011
Posts: 7
Time spent in forums: 1 h 22 m 27 sec
Reputation Power: 0
|
|
Update:
Thanks for all the help! Had some issues with the counters but I was finally able to get it using some VBScript. Here's the code I used:
'script
On Error Resume Next
Set fileObject = CreateObject("Scripting.FileSystemObject")
'change this to the directory
Set folder = fileObject.GetFolder("H:\output")
Set subfolders = folder.SubFolders
Set fso = CreateObject("Scripting.FileSystemObject")
For Each folders In subfolders
Set files = folders.Files
counter = -1
For Each file In files
'if we got the text file then do the rest
If lcase(fileObject.getExtensionName(file.path))="txt" Then
Set objTextFile = fileObject.OpenTextFile(file.path, 1)
'create our file
Set MyFile = fso.CreateTextFile(file.Path & "-new.csv", True)
'now we have the data and where to save it
Do While Not objTextFile.AtEndOfStream 'spilt it into an array of strings
arrStr = split(objTextFile.ReadLine,""",""")
strResults = arrStr(5)
If counter = 1 Then
theMid = Mid(strResults, 7, 8)
End If
If counter < 1 Then
MyFile.WriteLine(arrStr(0) & """,""" & arrStr(1) & """,""" & arrStr(2) & """,""" & arrStr(3) & """,""" & arrStr(4) & """,""" & arrStr(5) & """,""" & arrStr(6) & """,""" & arrStr(7) & """,""" & arrStr(8) & """,""" & arrStr(9) & """,""" & arrStr(10) & """,""" & arrStr(11) & """,""" & arrStr(12) & """,""" & arrStr(13))
counter = counter + 1
Else
newBatesNum = Left(strResults, 6) & Right("00000000" & theMid + counter, 8) & Right(strResults, 7 - InStr(Right(strResults, 6), "-"))
'Wscript.Echo(newBatesNum)
MyFile.WriteLine(arrStr(0) & """,""" & arrStr(1) & """,""" & arrStr(2) & """,""" & arrStr(3) & """,""" & arrStr(4) & """,""" & newBatesNum & """,""" & newBatesNum & """,""" & arrStr(7) & """,""" & arrStr(8) & """,""" & arrStr(9) & """,""" & arrStr(10) & """,""" & arrStr(11) & """,""" & arrStr(12) & """,""" & arrStr(13))
counter = counter + 1
End If
Loop
End if
Next
Next
Wscript.Echo("Task Complete")
Thanks again!
|
Developer Shed Advertisers and Affiliates
| Thread Tools |
Search this Thread |
|
|
|
| Display Modes |
Rate This Thread |
Linear Mode
|
|
Posting Rules
|
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts
HTML code is Off
|
|
|
|
|