#1
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2004
    Posts
    64
    Rep Power
    17

    Regexp handling in CSV


    My script currently is processing various csv files. The top row/header resembles this format:

    Device ID,1) S31 Which best describes how you answered the online reading comprehension quiz?,2) S32 Which best describes how you answered the online timed retrieval quiz?,3) B19. If you want your product to be easy to find in the supermarket then you should make its container,"4) C19. So that he can shift attention between the radio and his incessantly talking girl friend when she is in the car, Joe adjusts his radio",5) B20. Early selection is most likely to occur for,6) C20. Early selection for a red target is most likely to occur when there is,"7) B21. In a lexical decision task, when the target is a bird name, e.g. robin, it is usually preceded by the prime BODY but is sometimes preceded by the prime BIRD."
    Most of the headers begin '1)', '5)', etc. I need to remove this from the csv files. Another problem I've encountered while doing this is that some of the headers are encased in double quotes like, '"4)4) C19. So that he can shift attention between the radio and his incessantly talking girl friend when she is in the car, Joe adjusts his radio", 5) B20'

    I have tried connveting the top row from an array to a string and then
    Code:
    gsub(/[\d]+\)/,'').
    This kinda works. It is unable to deal with the double quote problem. It also replaces with whitespace, which I don't want. Also, I can't figure out how to put it back in the array as it was then write it back to the csv.

    The devshed community has been very helpful thus far with helping me write this script. So, thanks.
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jan 2004
    Location
    Constant Limbo
    Posts
    989
    Rep Power
    363
    If you know that the field (as a string) will have the quotes, then you can require that the match begin at the first character.
    Code:
    gsub(/[\d]+\)/,'')
    Would need to become
    Code:
    gsub(/^[\d]+\)/,'')
    True happiness is not getting what you want, it's wanting what you've already got.

    My Blog
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2004
    Posts
    64
    Rep Power
    17
    I would not know if the field has quotes.
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jan 2004
    Location
    Constant Limbo
    Posts
    989
    Rep Power
    363
    What I'm saying is that if you want to distinguish between two fields that look like this:
    Code:
    42) This is one field
    "42) This is another field"
    And the one you want wont ever have the quotes then you can require that the match not include quotes explicitly. My previous post shows how to do that.
    If you want to support both fields, then you can do something like:
    Code:
    gsub(/"?\[d+]\)/,'')
    True happiness is not getting what you want, it's wanting what you've already got.

    My Blog
  8. #5
  9. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2004
    Posts
    64
    Rep Power
    17
    well that gives the error:
    warning: regexp has `]' without escape
    and spits it out with the quotes. I tryed escaping the double-quote, but it just spits out still
  10. #6
  11. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jan 2004
    Location
    Constant Limbo
    Posts
    989
    Rep Power
    363
    typo on my part.
    Code:
    gsub(/"?[\d+]\)/,'')
    True happiness is not getting what you want, it's wanting what you've already got.

    My Blog
  12. #7
  13. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2004
    Posts
    64
    Rep Power
    17
    still doesn't seem to work.
  14. #8
  15. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jan 2004
    Location
    Constant Limbo
    Posts
    989
    Rep Power
    363
    you are going to have to be less vague about what works and what doesn't. For instance
    Code:
    irb(main):001:0> a = "24) Example 1"
    => "24) Example 1"
    irb(main):002:0> b = "\"25) Example 2\""
    => "\"25) Example 2\""
    irb(main):003:0> r = /"?[\d+]\)/
    => /"?[\d+]\)/
    irb(main):004:0> r.match a
    => #<MatchData "4)">
    irb(main):005:0> r.match b
    => #<MatchData "5)">
    irb(main):006:0> r.match "foo"
    => nil
    So you can see it actually finds what you are looking for. By 'not works' do you mean that you are not getting the correct strings after gsub? Do you mean that there is a trailing " character in the quoted string case? I can not see you terminal; it would be nice if you can provide some context to this problem.
    You should provide a minimal set of input you wish to process, the command you are using, the output you expect to get, and the output you are getting. These will greatly increase your chance of getting a helpfu answer.
    True happiness is not getting what you want, it's wanting what you've already got.

    My Blog
  16. #9
  17. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2004
    Posts
    64
    Rep Power
    17
    Don't worry. I got it.

    Code:
    #!/usr/bin/env ruby
    
    require 'rubygems'
    require 'roo'
    require 'csv'
    require 'fileutils'
    require 'rio'
    require 'fastercsv'
    
    FileUtils.mkdir_p "/Users/pshapiro/Desktop/Excel/xls"
    FileUtils.mkdir_p "/Users/pshapiro/Desktop/Excel/tmp"
    FileUtils.mkdir_p "/Users/pshapiro/Desktop/Excel/csv"
    
    @filesxls = Dir["/Users/pshapiro/Desktop/Excel/*.xls"]
    for file in @filesxls
    	FileUtils.move(file,"/Users/pshapiro/Desktop/Excel/xls")
    end
    
    @filesxls = Dir["/Users/pshapiro/Desktop/Excel/xls/*.xls"]
    @filetmp = Dir["/Users/pshapiro/Desktop/Excel/xls/*.xls_tmp"]
    
    for file in @filesxls
    	convert = Excel.new(file)
    	convert.default_sheet = convert.sheets[0]
    	convert.to_csv(file+"_tmp")
    end
    
    @filestmp = Dir["/Users/pshapiro/Desktop/Excel/xls/*.xls_tmp"]
    
    for file in @filestmp
    	FileUtils.move(file,"/Users/pshapiro/Desktop/Excel/tmp")
    end
    
    dir = "/Users/pshapiro/Desktop/Excel/tmp/"  
    files = Dir.entries(dir)  
    files.each do |f|  
    next if f == "." or f == ".."  
    oldFile = dir + "/" + f 
    newFile = dir + "/" + File.basename(f, '.*')  
    File.rename(oldFile, newFile)  
    end 
    
    files = Dir.entries(dir)
    files.each do |f|  
    next if f == "." or f == ".."  
    oldFile = dir + "/" + f 
    newFile = dir + "/" + f + ".csv"
    File.rename(oldFile, newFile)  
    end 
    
    @filescsv = Dir["/Users/pshapiro/Desktop/Excel/tmp/*.csv"]
    
    for file in @filescsv
    	FileUtils.move(file,"/Users/pshapiro/Desktop/Excel/csv")
    end
    
    FileUtils.rm_rf("/Users/pshapiro/Desktop/Excel/tmp")
    
    @filescsv = Dir["/Users/pshapiro/Desktop/Excel/csv/*.csv"]
    
    for file in @filescsv
    	5.times {
    	text=""
    	File.open(file,"r"){|f|f.gets;text=f.read}
    	File.open(file,"w+"){|f| f.write(text)}
    	}
    end
    
    dir = "/Users/pshapiro/Desktop/Excel/csv/"  
    files = Dir.entries(dir)  
    files.each do |f|  
    next if f == "." or f == ".."  
    oldFile = dir + "/" + f 
    newFile = dir + "/" + File.basename(f, '.*') + ".tmp" 
    File.rename(oldFile, newFile)  
    end 
    
    @filescsv = Dir["/Users/pshapiro/Desktop/Excel/csv/*.tmp"]
    
    for file in @filescsv
    	csv = FasterCSV.read(file, :headers => true)
    	lastc = csv.headers.length-1
    #	puts lastc
    	rio(file).csv.skipcolumns(1..2,lastc) > rio(file+".csv").csv(',')
    end
    
    @filescsv = Dir["/Users/pshapiro/Desktop/Excel/csv/*.tmp"]
    
    for file in @filescsv
    	FileUtils.remove(file)
    end
    
    dir = "/Users/pshapiro/Desktop/Excel/csv"  
    files = Dir.entries(dir)  
    files.each do |f|  
    next if f == "." or f == ".."  
    oldFile = dir + "/" + f
    newFile = dir + "/" + File.basename(f, '.*') 
    File.rename(oldFile, newFile)  
    end 
    
    2.times {
    files = Dir.entries(dir)  
    files.each do |f|  
    next if f == "." or f == ".."  
    oldFile = dir + "/" + f
    newFile = dir + "/" + File.basename(f, '.*') 
    File.rename(oldFile, newFile)  
    end 
    }
    
    files = Dir.entries(dir)
    files.each do |f|  
    next if f == "." or f == ".."  
    oldFile = dir + "/" + f 
    newFile = dir + "/" + f + ".csv"
    File.rename(oldFile, newFile)  
    end
    
    #####################################
    
    @filescsv = Dir["/Users/pshapiro/Desktop/Excel/csv/*.csv"]
    
    for file in @filescsv
    	csv = FasterCSV.read(file, :headers => true)
    	csv = csv.to_s
    	fields = FCSV.parse_line(csv)
    
    	fields.each do |f|
    		f.sub!(/[\d]+\)+[\s]/,'')
    	end
    
    #	puts fields
    
    	wline = FCSV.generate_line(fields)
    	astring = rio(file).contents
    	rio(file).csv.print(astring).close
    
    	text=""
    	File.open(file,"r"){|f|f.gets;text=f.read}
    	File.open(file,"w+"){|f| f.write(text)}
    
    	astring = rio(file).contents
    	rio(file).csv.print(wline+astring).close
    end
    
    puts "Successfully fixed Microsoft Excel documents!"
    #puts " "
    #teststr = 'Device ID,S31 Which best describes how you answered the online reading comprehension quiz?,S32 Which best describes how you answered the online timed retrieval quiz?,B19. If you want your product to be easy to find in the supermarket then you should make its container,"C19.  So that he can shift attention between the radio and his incessantly talking girl friend when she is in the car, Joe adjusts his radio",B20.  Early selection is most likely to occur for,C20.  Early selection for a red target is most likely to occur when there is,"B21. In a lexical decision task, when the target is a bird name, e.g. robin, it is usually preceded by the prime BODY but is sometimes preceded by the prime BIRD.","C21.  In a lexical decision task, when the target is a dog name, e.g. collie, it is usually preceded by the prime CAR but is sometimes preceded by the prime DOG.","B23. Suppose that that you see a brief display with 12 colored letters: 4 red, 4 white, and 4 blue.  At the offset of the display you hear tone.  A tone instructs you to report only the letters of a particular color: high for red, medium for white, and low for blue.  About how many letters do you report?",B22.  Sperling (1960) found that partial report produced the highest estimate of the number of available letters when the tone occurred ,"C23.  According to the logic of Sperling’s (1960) partial report method, an observer who reports three letters from a row in a 4 x 4 display that was cued at the display’s offset must have seen at least",C22.  Sperling (1960) found that the greatest difference between full and partial report in the number available of letters was when the tone occurred ____ milliseconds after the offset of the visual display'
    #puts teststr.gsub(/!(Device ID)|([BC\B0-9]+\.)\s/,'')
    
    #puts "(Device ID)|[BC\B][0-9]+\."
    #puts "(Device ID)|([BC\B0-9]+\.)"
    #puts "!(Device ID)|([BC]\B[0-9]+\.\s)"
    
    #####################################
    
    @filescsv = Dir["/Users/pshapiro/Desktop/Excel/csv/*.csv"]
    
    for file in @filescsv
    	csv = FasterCSV.read(file, :headers => true)
    	csv = csv.to_s
    	fields = FCSV.parse_line(csv)
    
    	fields.each do |f|
    		f.sub!(/!(Device ID)|([BC\B0-9]+\.)[\s]*/,'')
    	end
    
    #	puts fields
    
    	wline = FCSV.generate_line(fields)
    	astring = rio(file).contents
    	rio(file).csv.print(astring).close
    
    	text=""
    	File.open(file,"r"){|f|f.gets;text=f.read}
    	File.open(file,"w+"){|f| f.write(text)}
    
    	astring = rio(file).contents
    	rio(file).csv.print(wline+astring).close
    end

IMN logo majestic logo threadwatch logo seochat tools logo