#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2012
    Posts
    1
    Rep Power
    0

    Controlled string substitution with sed


    I'm trying to write a bash script that converts an HTML file into a LaTeX file by processing each line with sed. I'm stuck on the following issue:
    I need to replace
    Code:
    <a name="something">
    with
    Code:
    \index{something_else}
    I have an associative array with key-value pairs like this:
    Code:
    "something" => "something_else"
    However, there are two catches:

    • Not every name that appears in an <a> tag is present in the array. If there is no array element with the key "something", I need to skip creating \index{}.
    • Some lines contain more than one occurrence of '<a name="...">'.


    Any suggestions?
  2. #2
  3. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2011
    Posts
    29
    Rep Power
    0
    You should use a scripting language (AWK, Perl, Python, etc.) instead of sed. As Matt Might said in his recent article (http://matt.might.net/articles/sculpting-text/): "If you find yourself tempted to use these more advanced [sed] constructs, it's a sign that you want to use a tool like awk or Perl instead."

    For example, in Python:

    Code:
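    import re

    # Map each <a name="..."> value to its LaTeX index entry.
    links = {'something': 'something_else', 'something2': 'something_else2'}

    def replace_links(m):
        # Unknown name: drop the tag without creating an index entry.
        if m.group(1) not in links:
            return ''
        return r'\index{' + links[m.group(1)] + '}'

    # re.sub replaces every match, so lines with several <a> tags are handled.
    with open('file.htm') as f:
        for line in f:
            # Each line already ends with a newline, so suppress print's own.
            print(re.sub(r'<a name="([^"]+)">', replace_links, line), end='')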
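    To see how this copes with both catches from the question, here is a small self-contained check. The sample line and the "unknown" name are made up for illustration; the dictionary and the callback follow the code above.

```python
import re

# Same mapping as above; 'unknown' is deliberately absent.
links = {'something': 'something_else', 'something2': 'something_else2'}

def replace_links(m):
    name = m.group(1)
    if name not in links:
        return ''  # no entry in the mapping: emit nothing
    return r'\index{' + links[name] + '}'

# Hypothetical input line with three anchors, one of them unknown.
line = '<a name="something">A <a name="unknown">B <a name="something2">C'
result = re.sub(r'<a name="([^"]+)">', replace_links, line)
print(result)  # \index{something_else}A B \index{something_else2}C
```

    Note that an unknown name is removed entirely here. If the original tag should be kept instead, returning m.group(0) from the callback would leave it untouched.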
  4. #3
  5. kill 9, $$;
    Devshed Supreme Being (6500+ posts)

    Join Date
    Sep 2001
    Location
    Shanghai, An tSín
    Posts
    6,897
    Rep Power
    3887
    Perl (and others) will also offer proper HTML parsing libraries, which will be much better than using home-made regexps.
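
    Python's standard library has one too. As a rough sketch of the parser-based approach (the class name AnchorIndexer, the sample input and the mapping are assumptions for illustration, not something from the original posts), html.parser can pick out the anchor names without any regexp:

```python
from html.parser import HTMLParser

# Hypothetical mapping, mirroring the one described in the question.
links = {'something': 'something_else'}

class AnchorIndexer(HTMLParser):
    # Collects an index entry for every <a name="..."> whose name is known.
    def __init__(self):
        super().__init__()
        self.entries = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # attrs is a list of (attribute, value) pairs.
        name = dict(attrs).get('name')
        if name in links:
            self.entries.append(r'\index{' + links[name] + '}')

parser = AnchorIndexer()
parser.feed('<p><a name="something">x</a> <a name="missing">y</a> <a href="#">z</a></p>')
print(parser.entries)
```

    This only gathers the index entries. Producing the full LaTeX output would also need handlers for text and the other tags, which is where a parser starts to pay off over line-by-line substitution.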
