#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2010
    Posts
    3
    Rep Power
    0

    Many forms of a word


    Hello, everyone. I'm a beginner, or not even a proper beginner yet, with regular expressions (as the title of the thread probably indicates). I'm not very familiar with how regex syntax differs from language to language, but the language I'm supposed to use is AWK. Now, the specific task in question:

    I need to search a text for the following words: on, ona, ono, nxega, nxemu, nxim, nxe, nxoj, nxu, nxom, ga, mu, nx, je, joj, ju.
    (For the curious, these are the forms of the third person singular personal pronoun, declined through three genders and six cases.)

    Would the following regex find all the right words and nothing else:

    /on(a|o)?|(nx((e(ga|mu))|(im|om|oj))?)|j(e|u|oj)/

    I hope this isn't outrageously wrong and laughable, and that someone is willing to help me out. Many thanks!
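One way to answer the "all the right words" half of the question is to test the pattern against each form in turn. The following is a minimal sketch in Python rather than AWK, on the assumption that the constructs used here (grouping, alternation, `?`) behave the same in both when the whole word must match. The names `FORMS`, `ORIGINAL` and `missed_forms` are made up for this illustration.

```python
import re

# The sixteen forms listed in the post.
FORMS = ("on ona ono nxega nxemu nxim nxe nxoj nxu nxom "
         "ga mu nx je joj ju").split()

# The pattern from the post, without the surrounding slashes.
ORIGINAL = r"on(a|o)?|(nx((e(ga|mu))|(im|om|oj))?)|j(e|u|oj)"


def missed_forms(pattern, forms):
    """Return the forms that the pattern does not match in full."""
    compiled = re.compile(pattern)
    return [word for word in forms if compiled.fullmatch(word) is None]


print(missed_forms(ORIGINAL, FORMS))  # ['nxe', 'nxu', 'ga', 'mu']
```

So the pattern as posted covers twelve of the sixteen forms: `ga` and `mu` have no branch of their own, and inside the `nx` group the `e` must be followed by `ga` or `mu`, while `u` is not listed at all.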
  2. #2
  3. Did you steal it?
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    13,962
    Rep Power
    9397
    Tip: [abc] is a character set and means the same as (a|b|c).
    Code:
    ga|j(e|oj|u)|mu|nx(e(ga|mu)?|im|o[jm]|u)?|on[ao]?
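The pattern above can be checked against the sixteen forms from the first post. This is a small sketch in Python, not AWK, assuming the two agree on these basic constructs when the whole word must match. The list `NON_FORMS` is a hypothetical handful of near-misses chosen for this illustration, not an exhaustive test.

```python
import re

# The sixteen forms from the first post.
FORMS = ("on ona ono nxega nxemu nxim nxe nxoj nxu nxom "
         "ga mu nx je joj ju").split()

# The suggested pattern, copied as posted.
SUGGESTED = r"ga|j(e|oj|u)|mu|nx(e(ga|mu)?|im|o[jm]|u)?|on[ao]?"
compiled = re.compile(SUGGESTED)

# Forms the pattern fails to match in full (expected: none).
unmatched = [word for word in FORMS if compiled.fullmatch(word) is None]

# Hypothetical near-misses that should NOT match in full.
NON_FORMS = ["o", "nxeg", "jo", "onu", "nxi", "g"]
false_hits = [word for word in NON_FORMS if compiled.fullmatch(word)]

print(unmatched)   # []
print(false_hits)  # []
```

Note that this only tests whole words. Used as an unanchored search, the same pattern will also hit these letter sequences inside longer words.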
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2010
    Posts
    3
    Rep Power
    0
    Originally Posted by requinix
    Tip: [abc] is a character set and means the same as (a|b|c).
    Code:
    ga|j(e|oj|u)|mu|nx(e(ga|mu)?|im|o[jm]|u)?|on[ao]?
    Thanks so much, requinix; that's really elegant. And thanks for the tip; I was aware of the possibility of using a character set in this context, but I wasn't confident enough about that. I have two questions: what purpose might the tabulator serve in tasks such as this one? And is my use of the slash wrong or unnecessary?

    Many thanks once again!
  6. #4
  7. Did you steal it?
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    13,962
    Rep Power
    9397
    What "tabulator"?

    The slash is a per-language thing (notably used in Perl and PHP's PCRE functions) that delimits the regular expression, much like how quotes delimit strings in programming.
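To illustrate that the slashes are delimiters and not part of the pattern itself, here is a tiny sketch in Python, where a pattern is passed as an ordinary string with no delimiters. The sample text and variable names are invented for the example.

```python
import re

text = "ona je"

# Without slashes: the pattern is just the expression itself.
bare = re.search(r"on[ao]?", text)

# With slashes inside the string, Python treats them as literal "/"
# characters that must appear in the text, so nothing matches here.
delimited = re.search(r"/on[ao]?/", text)

print(bare.group())  # ona
print(delimited)     # None
```

In a language that uses slash delimiters, the slashes play the role that the string quotes play here.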
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2010
    Posts
    3
    Rep Power
    0
    Originally Posted by requinix
    What "tabulator"?
    I meant the tab character, sorry. All the examples I'm dealing with start with [ \t], but I'm not sure what the purpose of that is if the result would omit the words at the beginning of the line, unless there is some default space at the beginning, such as a margin.
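The concern about words at the beginning of the line can be tried out directly. Below is a rough sketch in Python (not AWK) that puts `[ \t]` in front of the pattern from earlier in the thread, then compares it with a variant that also accepts the start of the line via `(^|[ \t])`. The sample line and the variable names are assumptions made for this example, and the sketch does not check what follows each word.

```python
import re

# The pattern suggested earlier in the thread, wrapped in one group.
WORD = r"(ga|j(e|oj|u)|mu|nx(e(ga|mu)?|im|o[jm]|u)?|on[ao]?)"

line = "ona mu je rekla"

# Requiring a leading space or tab skips the word that starts the line.
with_prefix = [m.group(1) for m in re.finditer(r"[ \t]" + WORD, line)]

# Accepting start-of-line as an alternative to the blank picks it up.
# Group 1 is now the (^|[ \t]) part, so the word is group 2.
with_anchor = [m.group(2) for m in re.finditer(r"(^|[ \t])" + WORD, line)]

print(with_prefix)  # ['mu', 'je']
print(with_anchor)  # ['ona', 'mu', 'je']
```

So a bare `[ \t]` prefix does drop a line-initial word unless the input happens to be indented. The alternation with `^` is one way around that.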
