#1
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2012
    Posts
    2
    Rep Power
    0

    Need to use REGEX to catch all variations of 4 bad words


    I have the ability to parse chat in realtime in a public space with REGEX, and I have some issues with a handful of folks using racial and homophobic slurs, which need to be blocked.

    I have made the following REGEX strings, but I need a better trained eye to review them, and optimize them for efficiency and thoroughness.

    I have 4 chat patterns specifically that I am trying to catch and block:

    Code:
    "chatpattern"		"[nN]+[!1iI]+[Gg]+[3ueaUEA]+[rR]+([sS])?+([:punct:])?+"
    "chatpattern"		"[fF]+[@a4A]+[gG]+[aAoOuU0]+[tT]+([sS])?+([:punct:])?+"
    "chatpattern"		"[fF]+[@a4A]+[gG]+([sS])?+([:punct:])?+"
    "chatpattern"		"[Nn]+[!1iI]+[gG]+([sS])?+([:punct:])?+"
  2. #2
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2012
    Posts
    2
    Rep Power
    0
    I have revised my filter as follows:

    Code:
    		"chatpattern"		"[nN]+[!1iI]+[Gg]+[3ueaUEA]+[rRsS]+"
    		"chatpattern"		"[fF]+[@a4A]+[gG]+[aAoOuU0]+[tTsS]+"
    		"chatpattern"		"[fF]+[@a4A]+[gG]+([sS])?+"
    		"chatpattern"		"[Nn]+[!1iI]+[gG]+([sS])?+"
  4. #3
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2011
    Posts
    29
    Rep Power
    0
    You should check for word boundaries with \b to prevent false positives. The last pattern matches "nig" in "night".
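A minimal sketch of the false positive and the \b fix, assuming Python's `re` module as a stand-in for whatever engine the chat filter uses (the thread does not say, and boundary syntax may differ there). The helper name `is_blocked` is hypothetical, and the pattern is the fourth one from the revised filter with its optional group simplified to `[sS]?`.

```python
import re

# Fourth pattern from the revised filter, optional group simplified.
UNBOUNDED = re.compile(r"[Nn]+[!1iI]+[gG]+[sS]?")
# Same pattern wrapped in \b so it can only match a whole token.
BOUNDED = re.compile(r"\b[Nn]+[!1iI]+[gG]+[sS]?\b")


def is_blocked(pattern, message):
    """Return True if the pattern matches anywhere in the message."""
    return pattern.search(message) is not None


# Without boundaries the pattern fires inside an innocent word.
print(is_blocked(UNBOUNDED, "good night all"))
# With boundaries the same message passes through.
print(is_blocked(BOUNDED, "good night all"))
print(is_blocked(BOUNDED, "a knight errant"))
```

Note that \b only sits between a word character and a non-word character, so it works here because the pattern starts and ends on letters. A pattern that can begin or end with a symbol such as ! or @ would need a different kind of boundary check.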
  6. #4
  7. Turn left at the third duck
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2011
    Location
    Nelson, NZ
    Posts
    112
    Rep Power
    93
    Hi FOBioPatel,
    Welcome to the forum!

    In addition to abareplace's comment (you would use \b around some words or the equivalent syntax for your regex engine), here are some suggestions on your revision.

    - each of your character classes contains the same letter in upper and lower case; maybe your regex engine has a "case-insensitive" mode that would let you list each letter once?
    - you have a + after each character class, meaning "one or more" of the characters in the class. This makes sense to me for repeated letters such as the G, but are you sure you want that everywhere?
    - your last two patterns have (parentheses) which (i) capture the s in Group 1 (unneeded, I assume) and (ii) are not needed for the regex to function.
    - your last two patterns have a "?+" modifier, which I am fairly sure is not what you intend. In engines that support possessive quantifiers, the ? makes the s optional and the + makes that optional match possessive (it is never given back on backtracking); other engines may read the + differently or reject it. Guessing that your intent is to make the s optional, a simple ? would be enough.
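To make the points about case-insensitive mode and the unneeded group concrete, here is a small sketch using Python's `re` module, which is an assumption since the thread does not name the filter's engine. The word "cat" is a harmless stand-in that does not come from the thread, and the sketch leaves out "?+" because support for it varies between engines.

```python
import re

# Stand-in word (not from the thread): "cat" with an optional plural "s".
# Version 1: explicit upper/lower classes and a capturing group.
grouped = re.compile(r"[cC][aA][tT]([sS])?")
# Version 2: case-insensitive flag, no group, plain "?" for the optional s.
plain = re.compile(r"cats?", re.IGNORECASE)

grouped_match = grouped.search("two cats")
plain_match = plain.search("two CATS")

# Both find the word; only the grouped version stores the "s" in Group 1.
print(grouped_match.group(0), grouped_match.groups())
print(plain_match.group(0), plain_match.groups())
```

If the engine has no case-insensitive switch, the explicit classes still work; the group can be dropped either way.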

    By way of example, in case-insensitive mode (if available), your first regex could be simplified to this:
    Code:
    n[!1i]g+[3uea]rs?
    Here, we're not using word boundaries because you'd be happy to match that pattern even when embedded in more characters.
    Also, you can drop the s?, because once you're past the R, you already know you have a match:
    Code:
    n[!1i]g+[3uea]r
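The claim that a trailing optional character never changes whether a message is flagged can be checked with a short sketch. It assumes Python's `re` module and uses a hypothetical stand-in pattern ("w", then "o" or "0", then "rd") rather than the patterns from the thread.

```python
import re

# Hypothetical stand-in patterns, identical except for the optional "s".
with_suffix = re.compile(r"w[o0]rds?", re.IGNORECASE)
without_suffix = re.compile(r"w[o0]rd", re.IGNORECASE)

samples = ["a W0RD", "many words", "sword", "no match here", "wrd"]

# For each sample, record whether each pattern finds a match anywhere.
verdicts = [
    (text,
     with_suffix.search(text) is not None,
     without_suffix.search(text) is not None)
    for text in samples
]

for text, long_hit, short_hit in verdicts:
    print(f"{text!r}: with s? -> {long_hit}, without -> {short_hit}")
```

The two patterns differ only in how much text the match covers, which matters if the filter replaces the matched text rather than blocking the whole message.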
    Let us know if you need more help with this.
    Wishing you a fun week.
