#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2013
    Posts
    4
    Rep Power
    0

    Problem with multiline regex in Ruby


    I've set up a Ruby script to do some grep find/replaces on a text file. Two of them seem to work fine, and the third works as expected when I test it on Rubular, but not when I run the script. I'm just learning Ruby and programming in general, so I'm guessing there's something wrong with the script itself, not just the regex.

    See the script and sample input text below. The purpose of the script is to examine a large quantity of text that will be copied from a web page, remove unnecessary parts, and add and remove tabs in certain places to set it up for use as a spreadsheet. The clear_stuff regex represents a unique set of table headers that immediately precedes the desired content. The script should match this regex with all content on the page up to and including this set of headers and remove it. At the moment, it only deletes the headers themselves, so it appears that multiline mode isn't working. I tried using the Regexp.new syntax instead of the one shown below as well, but that didn't work either.

    Any ideas what's happening here?

    Code:
    #regexes for find/replace
    add_tabs = /^(term1)/
    remove_tab = /\t(\tterm2)/
    clear_stuff = /.*hat\.goat\sthis thing\sthat thing\sstuff\scheese/m
    
    
    #read file and replace
    fileObj = File.new("input.txt", "r")
    while (line = fileObj.gets)
      substitute_line = line.gsub(clear_stuff, "")	
      substitute_line1 = substitute_line.gsub(add_tabs, "\t\t\\1")
      substitute_line2 = substitute_line1.gsub(remove_tab, "\\1")
      print(substitute_line2)
    end
    fileObj.close
    Input text:

    Code:
    [Stuff that should get deleted]
    hat.goat	this thing	that thing	stuff	cheese
    [header that needs to stay]			
    term1 [random text - this line should be preceded by 2 tabs after output]
    		term2 [random text - this line should be preceded by a single tab after output]
    [random text - this line should remain as is after output]
  2. #2
  3. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,934
    Rep Power
    1045
    Hi,

    You forgot that you're reading the file line by line. The multiline flag can't help when the string the regex sees is only ever a single line.
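
A minimal sketch of the difference, assuming a made-up three-line sample in place of the real input.txt (the constant and method names here are only for illustration):

```ruby
# Hypothetical sample text standing in for input.txt
SAMPLE_TEXT = "junk line\nHEADERS\ncontent\n"
SAMPLE_PATTERN = /.*HEADERS\n/m

# Line by line: each gsub call only ever sees one line,
# so ".*" has nothing in front of the headers to match.
def strip_linewise(text, pattern)
  text.each_line.map { |line| line.gsub(pattern, "") }.join
end

# Whole string: with /m, ".*" can span the earlier lines too.
def strip_whole(text, pattern)
  text.sub(pattern, "")
end

p strip_linewise(SAMPLE_TEXT, SAMPLE_PATTERN)  # junk line survives
p strip_whole(SAMPLE_TEXT, SAMPLE_PATTERN)     # only the content is left
```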

    You should get rid of this File.new stuff completely. That's pretty much the most cumbersome way of dealing with files in Ruby, so you should avoid it if possible. There are much cleverer methods. If you wanna loop through the lines of a file, use File.foreach. If you wanna get the whole content of a file, use File.read. If you wanna write to a file, use File.open with a block, because this will automatically close the file at the end.
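
Roughly, the three calls look like this. It's only a sketch: the helper names and the scratch Tempfile are made up so the snippet runs on its own.

```ruby
require "tempfile"

# Loop over the lines without loading the whole file
def count_lines(path)
  count = 0
  File.foreach(path) { |_line| count += 1 }
  count
end

# Get the whole content as one string
def slurp(path)
  File.read(path)
end

# Block form: the file is closed when the block ends
def overwrite(path, text)
  File.open(path, "w") { |f| f.write(text) }
end

# Hypothetical scratch file for the demo
tmp = Tempfile.new("demo")
tmp.write("first\nsecond\n")
tmp.close

puts count_lines(tmp.path)
overwrite(tmp.path, slurp(tmp.path).upcase)
puts slurp(tmp.path)
```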

    The regex is also very inefficient because of the leading ".*" pattern, which you should avoid where you can. The regex will first consume the whole input (because "." in multiline mode matches any character, including newlines), then it will go back character by character, and each time it will check whether your "clear stuff" headers follow. On large text files this adds up to a huge number of steps.

    I'm not even sure if a regex is the right tool for this. What exactly will your input look like? Is it HTML?
    The 6 worst sins of security • How to (properly) access a MySQL database with PHP

    Why can't I use certain words like "drop" as part of my Security Question answers?
    There are certain words used by hackers to try to gain access to systems and manipulate data; therefore, the following words are restricted: "select," "delete," "update," "insert," "drop" and "null".
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2013
    Posts
    4
    Rep Power
    0
    Thanks for the reply; I got the regex working using File.read. I see what you're saying about the inefficiency of this regex, though. What would you recommend instead? I'll give a little more background info.

    I'm using a web-based tool that generates some blocks of text based on a variety of settings. The generated text appears in a table on the web page. I'll be copying the content of the page directly from the browser, not the source code, so no HTML markup will be included. Many elements on the page are dynamic, so there aren't many items I can rely on having present every time I run this script. The set of headers I used for the regex, however, will always immediately precede the content I need, which is why I set it up the way I did. Given this info, is there another way I can select everything preceding the content I need (the length of which is indeterminate) for removal?
  6. #4
  7. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,934
    Rep Power
    1045
    Don't remove the part before the headers, simply start after the headers:

    Code:
    input = <<INP
    some
    garbage
    [start after this]
    actual
    content
    INP
    
    start_headers = "[start after this]\n"
    # calculate index after the headers
    start_index = input.index(start_headers) + start_headers.length
    # get substring
    puts input[start_index .. -1]
    Since the headers are a fixed string (as far as I can tell), you don't even need a regex.
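
One thing to watch: index returns nil when the headers aren't in the input, so the addition above would raise an error. A possible variant using String#partition, sketched with a made-up helper name (content_after) and the same sample text:

```ruby
# Returns the text after the first occurrence of header,
# or nil when the header is missing.
def content_after(text, header)
  _before, found, rest = text.partition(header)
  found.empty? ? nil : rest
end

sample = "some\ngarbage\n[start after this]\nactual\ncontent\n"

puts content_after(sample, "[start after this]\n")
p content_after(sample, "[no such header]\n")
```

partition splits on the first match only, so anything after the headers is left untouched even if the same string shows up again later.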

    Note that for very large files, you might have to go back to the line-wise reading to avoid loading the whole content into memory.
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2013
    Posts
    4
    Rep Power
    0
    Hi again,

    I messed around with File.open and came up with the code below. It looks through each line of the source file and does nothing until it finds the unique header (which is what the original regex was looking for), and after finding it, prints the subsequent lines. Do you think this is a better way to handle this situation?

    Thanks.

    Code:
    header = false
    File.open("input.txt", "r").each_line do |line|
      if header == false
        if line.include?("string") == false
          next
        else
          header = true
        end
      else
        sub1 = line.gsub(add_tabs, "\t\t\\1")
        sub2 = sub1.gsub(remove_tab, "\\1")
        print sub2
      end
    end
    
    =begin
    "string" in the include statement above is part of the 
    unique set of headers I previously used for the regex
    =end
  10. #6
  11. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,934
    Rep Power
    1045
    Originally Posted by mightypants
    Do you think this is a better way to handle this situation?
    Yes. But use File.foreach for looping through the lines (as explained above).
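
Something along these lines, as a sketch only: each_line_after is a made-up helper, "hat.goat" stands in for part of the unique header row, and the Tempfile replaces the real input.txt so the snippet runs by itself.

```ruby
require "tempfile"

# Yields every line that follows the first line containing marker.
def each_line_after(path, marker)
  seen = false
  File.foreach(path) do |line|
    if seen
      yield line
    elsif line.include?(marker)
      seen = true
    end
  end
end

# Hypothetical scratch file in place of input.txt
tmp = Tempfile.new("input")
tmp.write("garbage\nhat.goat\tcheese\nterm1 text\n\t\tterm2 text\n")
tmp.close

each_line_after(tmp.path, "hat.goat") do |line|
  # same two substitutions as in the original script
  print line.gsub(/^(term1)/, "\t\t\\1").gsub(/\t(\tterm2)/, "\\1")
end
```

File.foreach opens and closes the file itself, so there is no handle left open as with File.open(...).each_line.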

    By the way, if you're using the brand new Ruby 2.0, you can do this elegantly and efficiently at the same time with lazy Enumerables:

    Code:
    input = <<INP
    some
    garbage
    [start after this]
    actual
    content
    INP
    
    start_headers = "[start after this]\n"
    
    content_lines =
      input.each_line.lazy.drop_while{|line| line != start_headers}.drop 1
    content_lines.each do |line|
      puts line
    end
    That's pretty much the literal translation of your approach: You drop all lines until you reach the headers, which you also drop. The rest is your content.
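
Without lazy (which was added in Ruby 2.0), the same chain should still work in eager form, as far as I know. The catch is that drop_while then builds an array of all remaining lines instead of streaming them. A sketch with a made-up helper name:

```ruby
# Eager variant: returns an array of the lines after header_line.
def lines_after(text, header_line)
  text.each_line.drop_while { |line| line != header_line }.drop(1)
end

sample = "some\ngarbage\n[start after this]\nactual\ncontent\n"
puts lines_after(sample, "[start after this]\n")
```

If the header never shows up, every line is dropped and the result is an empty array.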
  12. #7
  13. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2013
    Posts
    4
    Rep Power
    0
    Thanks again for your help. I'm currently running Ruby 1.8.7, but I'll look into the newer technique you mentioned.
  14. #8
  15. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,934
    Rep Power
    1045
    You should at least switch to 1.9. Ruby 1.8 is dead. It lacks many important features, it has a bad interpreter making it very slow, and it will reach its official end of life next month (which means it will be abandoned and won't get any security or bug patches).
