#1
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2012
    Posts
    153
    Rep Power
    3

    Regular Expression - Unicode Characters


    Found this (or something like it) on a forum:

    Code:
    trim(preg_replace('/#[^\p{L}\p{N}]+#u/','',$_POST["something"]));
    Well, I have no problem with it and it seems to work... but what the heck do the #...# bits do? Some forums indicate that the u should be used in the form '/.../u', which is what I would have expected.

    Outside of that, I have worked out that the u is a switch, similar to i, only for matching Unicode characters rather than for case insensitivity. While on the subject, is it OK to use more than one switch, e.g. '/.../iu'?


    ...following further testing... still do not know the purpose of the hash signs, but...

    Code:
    trim(preg_replace('/[^\p{L}\p{N}]+u/i','',$_POST["something"]));
    ...seems to work [the u switch will not work if placed outside of the pattern (with the i switch)].

    ...and for anyone else trying to match unicode characters this page is a good reference: http://www.regular-expressions.info/unicode.html.
  2. #2
  3. Jealous Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,302
    Rep Power
    9400
    As it is right now, the #...#u is literal. Literal hash signs, literal letter "u".

    Judging by context, the //s should not be there.
    PHP Code:
    trim(preg_replace('#[^\p{L}\p{N}]+#u','',$_POST["something"])); 
    In that case the #s are delimiters and there's a /u flag (which means UTF-8).
    Or you can remove the #s. Same difference.
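
To make the delimiter point concrete, here is a minimal sketch. The sample string and variable names are hypothetical, not from the thread. It also shows that more than one modifier can follow the closing delimiter, as in 'iu'.

```php
<?php
// Hypothetical sample input, not taken from the thread.
$input = "Héllo, Wörld! 123";

// '#' as the delimiter, 'u' as the modifier.
$hash = preg_replace('#[^\p{L}\p{N}]+#u', '', $input);

// '/' as the delimiter, same pattern and modifier.
$slash = preg_replace('/[^\p{L}\p{N}]+/u', '', $input);

// Both at once: '/' is the delimiter, so '#' and 'u' are literal pattern text.
// The input contains no '#', so nothing matches and nothing is removed.
$both = preg_replace('/#[^\p{L}\p{N}]+#u/', '', $input);

// Several modifiers can follow the closing delimiter.
$multi = preg_replace('#[^a-z]+#iu', '', $input);

echo $hash, "\n", $slash, "\n", $both, "\n", $multi, "\n";
```

With the source file saved as UTF-8, the first two calls should give the same result, and the third should return the input unchanged.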
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2012
    Posts
    153
    Rep Power
    3
    Originally Posted by requinix
    As it is right now, the #...#u is literal. Literal hash signs, literal letter "u".

    Judging by context, the //s should not be there.
    PHP Code:
    trim(preg_replace('#[^\p{L}\p{N}]+#u','',$_POST["something"])); 
    In that case the #s are delimiters and there's a /u flag (which means UTF-8).
    Or you can remove the #s. Same difference.
    I see what you are saying, thank you, although I have made a major mess of this.

    I was using the code to filter out all but lowercase letters (including Unicode), hyphens, and digits for domain names, but I have found that adding the u switch anywhere, including using those literals, breaks the expression. I am not sure why, and I am still trying to find examples and documentation to work out what is going wrong.

    The code shown below is what I am using at the moment ...and this truly does seem to be fine for all but some of the really weird foreign letters and character constructs, which might not necessarily be in the unicode table - hence the reason for their not working.

    Code:
    trim(preg_replace('/[^\p{Ll}\p{N}-]/','',$_POST["something"]));
  6. #4
  7. Jealous Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,302
    Rep Power
    9400
    \p is only available in UTF-8 mode.
    Originally Posted by C M Stafford
    I have found that adding the u switch anywhere, including using those literals, breaks the expression.
    "Breaks" how?
  8. #5
  9. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2012
    Posts
    153
    Rep Power
    3
    Originally Posted by requinix
    \p is only available in UTF-8 mode.

    "Breaks" how?
    Code:
    trim(preg_replace('/[^\p{L}\p{N}]+u/i','',$_POST["something"]));
    ...or...

    Code:
    trim(preg_replace('/[^\p{L}\p{N}]/iu','',$_POST["something"]));
    ...both allow any characters, including punctuation, to be accepted in the $_POST["something"] value. In other words, the addition of the u switch either breaks the functionality of the \p codes or overrides them. Either way, without the u everything works fine for at least the more common foreign glyphs, but adding the u anywhere lets even the above code through with absolutely nothing omitted.
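
One way to narrow down this kind of failure is to check the return value. With the u modifier, preg_replace() returns null when the subject is not valid UTF-8, and preg_last_error() reports the reason. The sketch below assumes, purely for illustration, that the input arrived in ISO-8859-1. Whether that is what happened in this case is not known.

```php
<?php
// Hypothetical input: the byte 0xE9 is "é" in ISO-8859-1 but is not valid UTF-8 here.
$latin1 = "caf\xE9!";

$result = preg_replace('/[^\p{L}\p{N}]+/u', '', $latin1);

// On failure preg_replace() returns null; preg_last_error() says why.
$failed  = ($result === null);
$badUtf8 = (preg_last_error() === PREG_BAD_UTF8_ERROR);

// An empty pattern with the u modifier is a cheap validity check:
// preg_match() returns false when the subject is not valid UTF-8.
$isValid = (preg_match('//u', $latin1) === 1);

var_dump($failed, $badUtf8, $isValid);
```

If $failed is true, the filter produced nothing at all, which can look like "the u switch breaks the pattern" once the null is cast to a string further down.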
  10. #6
  11. Jealous Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,302
    Rep Power
    9400
    The first one is wrong because it has the 'u' in the wrong place. If it's inside the // delimiters then it's just a normal character. It has to be on the outside like in the second example. Also, \pL includes both uppercase and lowercase so the /i isn't necessary.
    Code:
    /[^\p{L}\p{N}]+/u
    That above works for me. Using that in your code, what does it all look like?
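
A small sketch of the difference between the two placements, using a made-up input string:

```php
<?php
// Hypothetical sample input, not taken from the thread.
$input = "ab--Ucd!!";

// 'u' inside the delimiters: one or more non-alphanumerics followed by a
// literal "u" (matched case-insensitively because of the i modifier).
$inside = preg_replace('/[^\p{L}\p{N}]+u/i', '', $input);

// 'u' after the closing delimiter: UTF-8 mode, and every run of
// non-alphanumeric characters is removed.
$outside = preg_replace('/[^\p{L}\p{N}]+/u', '', $input);

echo $inside, "\n";   // abcd!!
echo $outside, "\n";  // abUcd
```

The first pattern only removes "--U" and leaves the trailing "!!" alone, because no "u" follows it.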

  12. #7
  13. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2012
    Posts
    153
    Rep Power
    3
    Originally Posted by requinix
    The first one is wrong because it has the 'u' in the wrong place. If it's inside the // delimiters then it's just a normal character. It has to be on the outside like in the second example. Also, \pL includes both uppercase and lowercase so the /i isn't necessary.
    Code:
    /[^\p{L}\p{N}]+/u
    That above works for me. Using that in your code, what does it all look like?
    That works for me, too, but fails to filter the inputs. The following is the code which I am currently using, and which filters everything as expected.

    Code:
    $domain = trim(preg_replace('/[^\p{L}\p{N}-]/','',$_POST["dmn"]));
    The above processing is the first processing step and is followed by further processing with mysql_real_escape_string().
  14. #8
  15. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2012
    Posts
    153
    Rep Power
    3
    I am still having problems with this. As I have said previously, the u switch breaks the functionality of the patterns...

    $val_domain = preg_replace('/[^\p{L}\p{N}\-]/','',$_POST["domain"]);
    // works

    $val_domain = preg_replace('/[^\p{L}\p{N}\-]/u','',$_POST["domain"]);
    // does not work

    ...and now I have hit a problem trying to get Russian characters to display.

    This should just involve modifying the pattern like so:

    $val_domain = preg_replace('/[^\p{L}\p{N}\p{Cyrillic}\-]/','',$_POST["domain"]);
    // ...but this does not work - the Russian characters are displayed as a long string of numbers.

    $val_domain = preg_replace('/[^\p{L}\p{N}\p{Cyrillic}\-]/u','',$_POST["domain"]);
    // ...ditto (for the Russian characters, anyway)

    ...so I ran pcretest -C, which returns the following:

    UTF-8 support
    Unicode properties support
    Newline sequence is LF
    \R matches all Unicode newlines
    Internal link size = 2
    POSIX malloc threshold = 10
    Default match limit = 10000000
    Default recursion depth limit = 10000000
    Match recursion uses stack


    ...thus, PCRE is definitely compiled and ready to go.

    ...and I added mb_internal_encoding("UTF-8");, along with AddDefaultCharset UTF-8 in Apache.


    Does anyone have any further insights on this, please? I am not sure where to go with this now. I have even tried \p{IsBlock} and \p{InBlock}, but neither of those works either. Could I possibly need to recompile PHP with Apache mod charset?
  16. #9
  17. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2012
    Posts
    153
    Rep Power
    3
    Got it! - Sort of...

    Code:
    mb_internal_encoding("UTF-8");
    $val_domain = preg_replace('/[^\p{L}\p{N}\p{Cyrillic}\-]+$/u','',$_POST["domain"]);
    The problem, now, is that even if I were to enter the above code block into the input box ...it would not be filtered...
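
A possible explanation, offered as an assumption rather than a confirmed diagnosis: the $ at the end of the pattern anchors the match to the end of the string, so only a trailing run of disallowed characters is removed. A minimal sketch with a hypothetical input:

```php
<?php
// Hypothetical input with disallowed characters in the middle and at the end.
$input = "ex<am>ple!!";

// Anchored with $: only a disallowed run at the very end is removed.
$anchored = preg_replace('/[^\p{L}\p{N}\-]+$/u', '', $input);

// Without the anchor: every disallowed run is removed.
$unanchored = preg_replace('/[^\p{L}\p{N}\-]+/u', '', $input);

echo $anchored, "\n";    // ex<am>ple
echo $unanchored, "\n";  // example
```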
  18. #10
  19. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2012
    Posts
    153
    Rep Power
    3
    Sorted!

    Code:
    	mb_internal_encoding("UTF-8");
    	$val_domain = preg_replace('/[^\p{Common}\p{Arabic}\p{Armenian}\p{Bengali}\p{Bopomofo}\p{Braille}\p{Buhid}\p{Cherokee}\p{Cyrillic}\p{Devanagari}\p{Ethiopic}\p{Georgian}\p{Greek}\p{Gujarati}\p{Gurmukhi}\p{Han}\p{Hangul}\p{Hanunoo}\p{Hebrew}\p{Hiragana}\p{Inherited}\p{Kannada}\p{Katakana}\p{Khmer}\p{Lao}\p{Latin}\p{Limbu}\p{Malayalam}\p{Mongolian}\p{Myanmar}\p{Ogham}\p{Oriya}\p{Runic}\p{Sinhala}\p{Syriac}\p{Tagalog}\p{Tagbanwa}\p{Tamil}\p{Telugu}\p{Thaana}\p{Thai}\p{Tibetan}\p{Yi}\p{Nd}\-]+$/','',$_POST["domain"]);
    Note: A u switch is not required for the above. The u switch is only required when the input takes the form of UTF-8 encoded data.
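
For anyone weighing that note, the sketch below shows what the u switch changes when the input is UTF-8. Without it the subject is processed byte by byte; with it the subject is processed as code points. The sample character is chosen for illustration only.

```php
<?php
// U+0434 (Cyrillic small letter de), two bytes in UTF-8: 0xD0 0xB4.
$cyrillic = "\u{0434}";

// Without u, '.' matches one byte at a time; with u, one code point at a time.
$bytes = preg_match_all('/./', $cyrillic);
$chars = preg_match_all('/./u', $cyrillic);

// The script property only sees a single Cyrillic character in UTF-8 mode.
$withU    = preg_match('/^\p{Cyrillic}$/u', $cyrillic);
$withoutU = preg_match('/^\p{Cyrillic}$/', $cyrillic);

echo $bytes, " ", $chars, "\n";     // 2 1
echo $withU, " ", $withoutU, "\n";  // 1 0
```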


    ...but don't expect to be able to use \p{CanadianAboriginal} or \p{TaiLe} ...'cause you won't be able to:

    Compilation failed: unknown property name after \\P or \\p at offset...

    Why? Who knows? I'm not a unicode.org geek, just a geek needing some Greek... or Cyrillic, in this case...
    Last edited by C M Stafford; June 1st, 2012 at 09:43 PM.
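
On the unknown property names: as far as I know, PCRE documents those two scripts with an underscore, as Canadian_Aboriginal and Tai_Le. The sketch below only checks whether each spelling compiles on the local build. The loop and variable names are hypothetical.

```php
<?php
// Spellings as listed in the PCRE documentation (with underscores).
$names = ['Canadian_Aboriginal', 'Tai_Le'];

$results = [];
foreach ($names as $name) {
    // preg_match() returns false if the pattern fails to compile;
    // the @ suppresses the compilation warning in that case.
    $results[$name] = (@preg_match('/\p{' . $name . '}/u', '') !== false);
    echo $name, ': ', $results[$name] ? "compiles" : "rejected", "\n";
}
```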
  20. #11
  21. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2012
    Posts
    153
    Rep Power
    3
    The following works for practically all unicode characters, including that heinous Swedish letter :
    Code:
    		$domain = preg_replace('/[^\p{Common}\p{Arabic}\p{Armenian}\p{Bengali}\p{Bopomofo}\p{Braille}\p{Buhid}\p{Cherokee}\p{Cyrillic}\p{Devanagari}\p{Ethiopic}\p{Georgian}\p{Greek}\p{Gujarati}\p{Gurmukhi}\p{Han}\p{Hangul}\p{Hanunoo}\p{Hebrew}\p{Hiragana}\p{Inherited}\p{Kannada}\p{Katakana}\p{Khmer}\p{Lao}\p{Latin}\p{Limbu}\p{Malayalam}\p{Mongolian}\p{Myanmar}\p{Ogham}\p{Oriya}\p{Runic}\p{Sinhala}\p{Syriac}\p{Tagalog}\p{Tagbanwa}\p{Tamil}\p{Telugu}\p{Thaana}\p{Thai}\p{Tibetan}\p{Yi}\p{L}\p{M}\p{Nd}\-]+$/','',$_POST['domain']);
    The following string can also be used with preg_replace(), but filters out Cyrillic characters:
    Code:
    		$pattern[0] = '/[^\p{L}\p{M}\p{Nd}\-]+$/';
    Last edited by C M Stafford; June 1st, 2012 at 10:28 PM.
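
One caveat on the long script list, sketched below with a hypothetical input: ASCII punctuation belongs to the Common script, so a negated class that contains \p{Common} leaves punctuation in place. The shorter class keeps Cyrillic letters as long as the input is valid UTF-8 and the u switch is used.

```php
<?php
// ASCII punctuation belongs to the Common script.
$punct = '<>!?;()';
$allCommon = preg_match('/^\p{Common}+$/u', $punct);

// Hypothetical input: six Cyrillic letters, a hyphen, a digit, then punctuation.
$input = "\u{043F}\u{0440}\u{0438}\u{043C}\u{0435}\u{0440}-1<>!";

// Letters, marks, decimal digits and hyphen only: punctuation is stripped.
$strict = preg_replace('/[^\p{L}\p{M}\p{Nd}\-]+/u', '', $input);

// Adding \p{Common} to the allowed set lets the punctuation through.
$withCommon = preg_replace('/[^\p{Common}\p{L}\p{M}\p{Nd}\-]+/u', '', $input);

echo $allCommon, "\n";
echo $strict, "\n";
echo $withCommon, "\n";
```

Neither pattern here uses the trailing $ anchor, so every disallowed run is considered, not just the last one.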
