Regular Expression - Unicode Characters
May 11th, 2012, 05:44 PM
Contributing User
Join Date: Apr 2012
Posts: 153
Time spent in forums: 1 Day 14 h 40 m 46 sec
Reputation Power: 2
Found this (or something like it) on a forum:
Code:
trim(preg_replace('/#[^\p{L}\p{N}]+#u/','',$_POST["something"]));
It seems to work, and I have no problem with it, but what the heck do the #...# bits do? Some forums indicate that the u should be used in the form '/.../u', which is what I would have expected.
Outside of that, I have worked out that the u is a switch, similar to i, only for matching Unicode characters rather than for case-insensitive matching. While on the subject, is it OK to use more than one switch, e.g. '/.../iu'?
After further testing, I still do not know the purpose of the hash signs, but...
Code:
trim(preg_replace('/[^\p{L}\p{N}]+u/i','',$_POST["something"]));
...seems to work. (The u switch does not work for me if it is placed outside of the pattern, alongside the i switch.)
For anyone else trying to match Unicode characters, this page is a good reference: http://www.regular-expressions.info/unicode.html

May 11th, 2012, 07:01 PM
Still alive
Join Date: Mar 2007
Location: Washington, USA
As it is right now, the #...#u is literal. Literal hash signs, literal letter "u".
Judging by context, the //s should not be there.
PHP Code:
trim(preg_replace('#[^\p{L}\p{N}]+#u','',$_POST["something"]));
In that case the #s are delimiters and there's a /u flag (which means UTF-8).
Or you can remove the #s. Same difference.
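A minimal sketch of the difference, for anyone who wants to try it. The first call uses the pattern from the original post, where the slashes are the delimiters and everything between them (hash signs and "u" included) is matched literally. The other two use either "#" or "/" as delimiters with a real u modifier. The sample string and variable names are made up for illustration, and the "\u{...}" escapes assume PHP 7 or later.

```php
<?php
// Made-up sample input: "héllo, wörld! #--#u 42", built with \u{...} escapes
// so that the bytes are UTF-8 regardless of how this file is saved.
$input = "h\u{e9}llo, w\u{f6}rld! #--#u 42";

// Slashes are the delimiters, so "#", "#" and "u" are ordinary pattern
// characters: only "#", non-alphanumerics, "#", "u" in sequence is removed.
$literal = preg_replace('/#[^\p{L}\p{N}]+#u/', '', $input);

// Hashes are the delimiters and the trailing "u" is the UTF-8 modifier:
// every run of characters that are not letters or digits is removed.
$hashDelims = preg_replace('#[^\p{L}\p{N}]+#u', '', $input);

// Same pattern and modifier, with slashes as the delimiters instead.
$slashDelims = preg_replace('/[^\p{L}\p{N}]+/u', '', $input);

var_dump($literal, $hashDelims, $slashDelims);
```

Only the first call leaves the punctuation in place, which would match the behaviour described in the opening post.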

May 11th, 2012, 07:22 PM
Contributing User
Quote: Originally Posted by requinix
As it is right now, the #...#u is literal. Literal hash signs, literal letter "u".
Judging by context, the //s should not be there.
PHP Code:
trim(preg_replace('#[^\p{L}\p{N}]+#u','',$_POST["something"]));
In that case the #s are delimiters and there's a /u flag (which means UTF-8).
Or you can remove the #s. Same difference.
I see what you are saying, thank you, although I have made a major mess of this.
I was using the code to filter out everything but lower-case letters (including Unicode ones), hyphens, and digits for domain names, but I have found that adding the u switch anywhere, including using those literals, breaks the expression. I am not sure why, and I am still trying to find examples and documentation to work out what is going wrong.
The code shown below is what I am using at the moment, and it truly does seem to be fine for all but some of the really weird foreign letters and character constructs. Those might not be in the Unicode table at all, which would explain why they do not work.
Code:
trim(preg_replace('/[^\p{Ll}\p{N}-]/','',$_POST["something"]));

May 11th, 2012, 09:27 PM
Still alive
\p is only available in UTF-8 mode.
Quote: Originally Posted by C M Stafford
I have found that adding the u switch anywhere, including using those literals, breaks the expression.
"Breaks" how?

May 12th, 2012, 09:42 AM
Contributing User
Quote: Originally Posted by requinix
\p is only available in UTF-8 mode.
"Breaks" how?
Code:
trim(preg_replace('/[^\p{L}\p{N}]+u/i','',$_POST["something"]));
...or...
Code:
trim(preg_replace('/[^\p{L}\p{N}]/iu','',$_POST["something"]));
...both allow any characters, including punctuation, through in the $_POST["something"] value. In other words, adding the u switch either breaks the functionality of the \p codes or overrides them. Either way, without the u everything works fine, certainly for the more common foreign glyphs, e.g. à, but with the u added anywhere, even the code above could be submitted with absolutely nothing stripped out.

May 12th, 2012, 03:51 PM
Still alive
The first one is wrong because it has the 'u' in the wrong place. If it's inside the // delimiters then it's just a normal character. It has to be on the outside like in the second example. Also, \pL includes both uppercase and lowercase so the /i isn't necessary.
That above works for me. Using that in your code, what does it all look like?
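To make the point about modifier placement concrete, here is a small sketch. It assumes PHP 7 or later (for the "\u{...}" escapes), and the input strings and variable names are invented for the example. It shows a "u" inside the delimiters acting as a literal letter, a "u" after the closing delimiter acting as a modifier, and the i and u modifiers being combined.

```php
<?php
// Made-up ASCII input for the first case, so that byte-versus-character
// handling cannot affect the result.
$ascii = 'abc -- up, 42!';

// "u" inside the delimiters is a literal letter: a run of non-alphanumerics
// is removed only when it is followed by "u" (or "U", because of i).
$inside = preg_replace('/[^\p{L}\p{N}]+u/i', '', $ascii);

// Made-up UTF-8 input: "Straße-42, Été!"
$utf8 = "Stra\u{df}e-42, \u{c9}t\u{e9}!";

// "u" after the closing delimiter is a modifier, and modifiers can be
// combined; \p{L} already covers both cases, so i changes nothing here.
$outside  = preg_replace('/[^\p{L}\p{N}]+/u', '', $utf8);
$combined = preg_replace('/[^\p{L}\p{N}]+/iu', '', $utf8);

var_dump($inside, $outside, $combined);
```

In the first call the letter "u" of "up" disappears along with the punctuation before it, which is a quick way to spot a modifier that has ended up inside the pattern.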

May 12th, 2012, 07:11 PM
Contributing User
Quote: Originally Posted by requinix
The first one is wrong because it has the 'u' in the wrong place. If it's inside the // delimiters then it's just a normal character. It has to be on the outside like in the second example. Also, \pL includes both uppercase and lowercase so the /i isn't necessary.
That above works for me. Using that in your code, what does it all look like?
That works for me, too, but fails to filter the inputs. The following is the code which I am currently using, and which filters everything as expected.
Code:
$domain = trim(preg_replace('/[^\p{L}\p{N}-]/','',$_POST["dmn"]));
The above is the first processing step; it is followed by further processing with mysql_real_escape_string().

May 27th, 2012, 07:28 PM
Contributing User
I am still having problems with this. As I said previously, the u switch breaks the functionality of the patterns...
$val_domain = preg_replace('/[^\p{L}\p{N}\-]/','',$_POST["domain"]);
// works
$val_domain = preg_replace('/[^\p{L}\p{N}\-]/u','',$_POST["domain"]);
// does not work
...and now I have hit a problem trying to get Russian characters to display.
This should just involve modifying the pattern like so:
$val_domain = preg_replace('/[^\p{L}\p{N}\p{Cyrillic}\-]/','',$_POST["domain"]);
// ...but this does not work - the Russian characters are displayed as a long string of numbers.
$val_domain = preg_replace('/[^\p{L}\p{N}\p{Cyrillic}\-]/u','',$_POST["domain"]);
// ...ditto (for the Russian characters, anyway)
...so I ran pcretest -C, which returns the following:
UTF-8 support
Unicode properties support
Newline sequence is LF
\R matches all Unicode newlines
Internal link size = 2
POSIX malloc threshold = 10
Default match limit = 10000000
Default recursion depth limit = 10000000
Match recursion uses stack
...thus, PCRE is definitely compiled with UTF-8 and Unicode property support and ready to go.
...and I added mb_internal_encoding("UTF-8"); along with AddDefaultCharset UTF-8 in Apache.
Does anyone have any further insights on this, please? I am not sure where to go with this now. I have even tried \p{IsBlock} and \p{InBlock}, but neither of those works either. Could I possibly need to recompile PHP with Apache mod charset?
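Two things that may be worth ruling out here, sketched under assumptions the thread does not confirm. First, with the u modifier, preg_replace() returns null when the subject is not valid UTF-8, which can look like the pattern "not working". Second, a long string of numbers is what remains when a browser submits numeric character references such as &#1076; and the filter then strips the "&", "#" and ";". The helper name filter_domain() and the sample values are hypothetical, and the "\xE4" byte stands in for input that arrives in ISO-8859-1.

```php
<?php
// Hypothetical helper: the thread's pattern with the u modifier.
// Returns null when PCRE rejects the subject as invalid UTF-8.
function filter_domain(string $raw): ?string
{
    return preg_replace('/[^\p{L}\p{N}\-]/u', '', $raw);
}

// (1) "ä" as a single ISO-8859-1 byte is not valid UTF-8.
$latin1  = "dom\xE4ne";
$result  = filter_domain($latin1);
$badUtf8 = ($result === null && preg_last_error() === PREG_BAD_UTF8_ERROR);

// (2) Numeric character references for the Cyrillic letters "дом".
$entities   = '&#1076;&#1086;&#1084;';
$digitsOnly = filter_domain($entities);   // "&", "#" and ";" are stripped
$decoded    = filter_domain(html_entity_decode($entities, ENT_QUOTES, 'UTF-8'));

var_dump($badUtf8, $digitsOnly, $decoded);
```

If either case applies, the fix would lie in how the form page and the request are encoded rather than in the pattern itself.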

May 27th, 2012, 07:59 PM
Contributing User
Got it! Sort of...
Code:
mb_internal_encoding("UTF-8");
$val_domain = preg_replace('/[^\p{L}\p{N}\p{Cyrillic}\-]+$/u','',$_POST["domain"]);
The problem now is that even if I were to enter the above code block into the input box, it would not be filtered...
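A possible reason, shown as a sketch rather than a confirmed diagnosis: the "+$" at the end of the pattern ties the match to the end of the string, so only a trailing run of disallowed characters is removed. Dropping "+$" removes disallowed characters wherever they occur. The input string and variable names are made up, and the "\u{...}" escapes assume PHP 7 or later.

```php
<?php
// Made-up input: "ex<am>ple.рф!!"
$input = "ex<am>ple.\u{440}\u{444}!!";

// Pattern from the post: "+$" anchors the match at the end of the string,
// so only the trailing "!!" is removed.
$anchored = preg_replace('/[^\p{L}\p{N}\p{Cyrillic}\-]+$/u', '', $input);

// Without "+$", each disallowed character is removed wherever it occurs.
$everywhere = preg_replace('/[^\p{L}\p{N}\p{Cyrillic}\-]/u', '', $input);

var_dump($anchored, $everywhere);
```

Cyrillic letters are already covered by \p{L}, so the \p{Cyrillic} entry is kept here only to mirror the post.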

May 27th, 2012, 08:41 PM
Contributing User
Sorted!
Code:
mb_internal_encoding("UTF-8");
$val_domain = preg_replace('/[^\p{Common}\p{Arabic}\p{Armenian}\p{Bengali}\p{Bopomofo}\p{Braille}\p{Buhid}\p{Cherokee}\p{Cyrillic}\p{Devanagari}\p{Ethiopic}\p{Georgian}\p{Greek}\p{Gujarati}\p{Gurmukhi}\p{Han}\p{Hangul}\p{Hanunoo}\p{Hebrew}\p{Hiragana}\p{Inherited}\p{Kannada}\p{Katakana}\p{Khmer}\p{Lao}\p{Latin}\p{Limbu}\p{Malayalam}\p{Mongolian}\p{Myanmar}\p{Ogham}\p{Oriya}\p{Runic}\p{Sinhala}\p{Syriac}\p{Tagalog}\p{Tagbanwa}\p{Tamil}\p{Telugu}\p{Thaana}\p{Thai}\p{Tibetan}\p{Yi}\p{Nd}\-]+$/','',$_POST["domain"]);
Note: A u switch is not required for the above. The u switch is only required when the input takes the form of UTF-8 encoded data.
...but don't expect to be able to use \p{CanadianAboriginal} or \p{TaiLe}, 'cause you won't be able to:
Compilation failed: unknown property name after \\P or \\p at offset...
Why? Who knows? I'm not a unicode.org geek, just a geek needing some Greek ...or Cyrillic, in this case...
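A likely explanation, offered as an assumption rather than something the thread confirms: the PCRE documentation lists multi-word script names with underscores (Canadian_Aboriginal, Tai_Le), so the run-together spellings may simply not be recognised. The sketch below uses a hypothetical helper, pattern_compiles(), to check whether a given property name is accepted by the installed PCRE.

```php
<?php
// Hypothetical helper: true if PCRE can compile the pattern.
// preg_match() returns false (and raises a warning, silenced here by @)
// when compilation fails; 0 or 1 means the pattern compiled.
function pattern_compiles(string $pattern): bool
{
    return @preg_match($pattern, '') !== false;
}

// Script names spelled with underscores, as in the PCRE documentation.
$underscored = pattern_compiles('/\p{Canadian_Aboriginal}\p{Tai_Le}/u');

// A made-up property name, to show what a rejected name looks like.
$unknown = pattern_compiles('/\p{NoSuchScript}/u');

// The spelling from the post; the outcome depends on the PCRE version
// bundled with PHP, so no particular result is assumed here.
$unspaced = pattern_compiles('/\p{CanadianAboriginal}/u');

var_dump($underscored, $unknown, $unspaced);
```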
Last edited by C M Stafford : June 1st, 2012 at 08:43 PM.

June 1st, 2012, 08:57 PM
Contributing User
The following works for practically all Unicode characters, including that heinous Swedish letter å:
Code:
$domain = preg_replace('/[^\p{Common}\p{Arabic}\p{Armenian}\p{Bengali}\p{Bopomofo}\p{Braille}\p{Buhid}\p{Cherokee}\p{Cyrillic}\p{Devanagari}\p{Ethiopic}\p{Georgian}\p{Greek}\p{Gujarati}\p{Gurmukhi}\p{Han}\p{Hangul}\p{Hanunoo}\p{Hebrew}\p{Hiragana}\p{Inherited}\p{Kannada}\p{Katakana}\p{Khmer}\p{Lao}\p{Latin}\p{Limbu}\p{Malayalam}\p{Mongolian}\p{Myanmar}\p{Ogham}\p{Oriya}\p{Runic}\p{Sinhala}\p{Syriac}\p{Tagalog}\p{Tagbanwa}\p{Tamil}\p{Telugu}\p{Thaana}\p{Thai}\p{Tibetan}\p{Yi}\p{L}\p{M}\p{Nd}\-]+$/','',$_POST['domain']);
The following string can also be used with preg_replace(), but filters out Cyrillic characters:
Code:
$pattern[0] = '/[^\p{L}\p{M}\p{Nd}\-]+$/';
Last edited by C M Stafford : June 1st, 2012 at 09:28 PM.
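One reason \p{M} may matter for letters like å, as a hedged aside: the same letter can arrive either as a single code point (U+00E5) or as "a" followed by U+030A COMBINING RING ABOVE, and the combining mark belongs to \p{M}, not \p{L}. The sample strings and variable names are made up, the "\u{...}" escapes assume PHP 7 or later, and the u modifier is used so that PCRE treats the input as UTF-8 characters rather than bytes.

```php
<?php
// "ångström" with a precomposed å (U+00E5).
$precomposed = "\u{e5}ngstr\u{f6}m";

// The same word with å written as "a" + U+030A COMBINING RING ABOVE.
$decomposed = "a\u{30a}ngstr\u{f6}m";

// Without \p{M}, the combining ring is stripped and å degrades to "a".
$withoutMarks = preg_replace('/[^\p{L}\p{Nd}\-]/u', '', $decomposed);

// With \p{M}, the combining mark is allowed and the text is left intact.
$withMarks = preg_replace('/[^\p{L}\p{M}\p{Nd}\-]/u', '', $decomposed);

// The precomposed form is a single letter and survives either way.
$single = preg_replace('/[^\p{L}\p{Nd}\-]/u', '', $precomposed);

var_dump($withoutMarks, $withMarks === $decomposed, $single === $precomposed);
```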