#1

    Please help me with linguistic typology using Flex


    Hello everybody,

    On a friend's advice, I have started learning LaTeX to replace MS Word for my writing. He said it would take me a little time at the beginning but save me a lot of time later. I am starting to find that he's right :)

    Today I tried to convert some of my work-in-progress documents from Word to LaTeX, but the converted documents contain a lot of useless whitespace. On a French-speaking Ubuntu users' forum, one member suggested that I use Lex, a much more powerful text-processing tool, to deal with it, but nobody knew exactly how to do the following tasks:

    1) Remove all useless whitespace (spaces and tabs) at the end of each line.
    2) Replace every tab with a fixed number of spaces, 4 for example.
    3) Put a single space after each punctuation mark [,;:!?].
    4) Restore the forgotten upper-case letters (convert the letter found after a punctuation mark [.!?] to a capital letter).

    So I wonder if you would be kind enough to give me a few lines of code that could do that. Thank you in advance for any help you can provide.
#2
    You can do it all with one sed command :)
    Code:
    $ cat baz.txt
    This line has trailing spaces.          
    			This line has leading tabs.
    This line!has broken punctuation.
    
    $ odx baz.txt
    000000 54 68 69 73 20 6c 69 6e 65 20 68 61 73 20 74 72  >This line has tr<
    000010 61 69 6c 69 6e 67 20 73 70 61 63 65 73 2e 20 20  >ailing spaces.  <
    000020 20 20 20 20 20 20 20 20 0a 09 09 09 54 68 69 73  >        ....This<
    000030 20 6c 69 6e 65 20 68 61 73 20 6c 65 61 64 69 6e  > line has leadin<
    000040 67 20 74 61 62 73 2e 0a 54 68 69 73 20 6c 69 6e  >g tabs..This lin<
    000050 65 21 68 61 73 20 62 72 6f 6b 65 6e 20 70 75 6e  >e!has broken pun<
    000060 63 74 75 61 74 69 6f 6e 2e 0a 0a                 >ctuation...<
    00006b
    $ sed -e 's/ \+$//' -e 's/\t/    /g' -e 's/\([.!?]\)\(.\)/\1\u\2/g' -e 's/\([,;:!?]\)/\1 /g' baz.txt
    This line has trailing spaces.
                This line has leading tabs.
    This line! Has broken punctuation.
    
    $ sed -e 's/ \+$//' -e 's/\t/    /g' -e 's/\([.!?]\)\(.\)/\1\u\2/g' -e 's/\([,;:!?]\)/\1 /g' baz.txt | odx
    000000 54 68 69 73 20 6c 69 6e 65 20 68 61 73 20 74 72  >This line has tr<
    000010 61 69 6c 69 6e 67 20 73 70 61 63 65 73 2e 0a 20  >ailing spaces.. <
    000020 20 20 20 20 20 20 20 20 20 20 20 54 68 69 73 20  >           This <
    000030 6c 69 6e 65 20 68 61 73 20 6c 65 61 64 69 6e 67  >line has leading<
    000040 20 74 61 62 73 2e 0a 54 68 69 73 20 6c 69 6e 65  > tabs..This line<
    000050 21 20 48 61 73 20 62 72 6f 6b 65 6e 20 70 75 6e  >! Has broken pun<
    000060 63 74 75 61 74 69 6f 6e 2e 0a 0a                 >ctuation...<
    00006b
    The hex dumps just show you where all the invisible characters are - 20 is space, 0a is newline, 09 is tab.
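
A possible refinement, offered as a sketch rather than a definitive fix: the command above strips only trailing spaces (not trailing tabs), adds a space after punctuation even where one already exists, and can leave a new trailing space when a line ends in one of [,;:!?]. The variant below reorders the steps to avoid that. It assumes GNU sed (`\t`, `\+` and `\u` are GNU extensions), and `cleanup` is a hypothetical helper name, not something from the original post.

```shell
# Sketch only: assumes GNU sed. "cleanup" is a made-up helper name.
cleanup() {
    # 1. Replace every tab with 4 spaces (so only spaces remain afterwards).
    # 2. Force exactly one space after each of , ; : ! ?
    # 3. Capitalise a lowercase letter that follows . ! or ? and at least one
    #    space, collapsing the gap to a single space.
    # 4. Strip trailing spaces last, so step 2 cannot leave any behind.
    sed -e 's/\t/    /g' \
        -e 's/\([,;:!?]\) */\1 /g' \
        -e 's/\([.!?]\) \+\([a-z]\)/\1 \u\2/g' \
        -e 's/ *$//' \
        "$@"
}
```

Usage would be `cleanup baz.txt`, or `some_command | cleanup` to filter standard input. Note that step 2 is purely mechanical: it will also turn `12:30` into `12: 30`, and step 3 will capitalise after abbreviations such as "e.g. ", so the result is worth checking by eye.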
