Regex Programming
 
  #1  
May 11th, 2012, 05:44 PM
C M Stafford
Regular Expression - Unicode Characters

Found this (or something like it) on a forum:

Code:
trim(preg_replace('/#[^\p{L}\p{N}]+#u/','',$_POST["something"]));


Well, I have no problem with it and it seems to work, but what the heck do the #...# bits do? Some forums indicate that the u should be used in the form '/.../u', which is what I would have expected.

Outside of that, I have worked out that the u is a switch, similar to i, only for matching Unicode characters rather than for case insensitivity. While on the subject, is it OK to use more than one switch, e.g. '/.../iu'?


Following further testing, I still do not know the purpose of the hash signs, but...

Code:
trim(preg_replace('/[^\p{L}\p{N}]+u/i','',$_POST["something"]));


...seems to work (the u switch does not work for me if placed outside of the pattern, alongside the i switch).

For anyone else trying to match Unicode characters, this page is a good reference: http://www.regular-expressions.info/unicode.html

  #2
May 11th, 2012, 07:01 PM
requinix
As it is right now, the #...#u is literal. Literal hash signs, literal letter "u".

Judging by context, the //s should not be there.
PHP Code:
 trim(preg_replace('#[^\p{L}\p{N}]+#u','',$_POST["something"])); 

In that case the #s are delimiters and there's a /u flag (which means UTF-8).
Or you can remove the #s. Same difference.
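
A minimal sketch of the difference described above, assuming a made-up sample string. With `/` as the delimiter, the hashes and the `u` are part of the pattern; with `#` as the delimiter, the `u` becomes a flag. The last call also shows that several flags can be combined after the closing delimiter.

```php
<?php
// Made-up sample value, not taken from the thread.
$input = "ab#!!#ucd-1";

// '/' delimiters: '#' and 'u' are ordinary pattern characters here, so only
// the sequence '#', one or more non-letters/non-digits, '#u' is matched.
$literal = preg_replace('/#[^\p{L}\p{N}]+#u/', '', $input);

// '#' delimiters: the pattern sits between the hashes and 'u' is a flag.
$hashDelims = preg_replace('#[^\p{L}\p{N}]+#u', '', $input);

// '/' delimiters with two flags combined after the closing delimiter.
$slashDelims = preg_replace('/[^\p{L}\p{N}]+/iu', '', $input);

var_dump($literal, $hashDelims, $slashDelims);
```

The first call removes only the literal `#!!#u` sequence, while the other two strip every run of characters that are neither letters nor digits.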

  #3
May 11th, 2012, 07:22 PM
C M Stafford
Quote:
Originally Posted by requinix
As it is right now, the #...#u is literal. Literal hash signs, literal letter "u".

Judging by context, the //s should not be there.
PHP Code:
 trim(preg_replace('#[^\p{L}\p{N}]+#u','',$_POST["something"])); 

In that case the #s are delimiters and there's a /u flag (which means UTF-8).
Or you can remove the #s. Same difference.


I see what you are saying, thank you, although I have made a major mess of this.

I was using the code to filter out everything but lowercase letters (including Unicode), hyphens, and digits for domain names, but I have found that adding the u switch anywhere, including using those literals, breaks the expression. I am not sure why, and I am still trying to find examples and documentation to work out what is going wrong.

The code shown below is what I am using at the moment, and it truly does seem to be fine for all but some of the really weird foreign letters and character constructs, which might not necessarily be in the Unicode table, hence their not working.

Code:
trim(preg_replace('/[^\p{Ll}\p{N}-]/','',$_POST["something"]));

  #4
May 11th, 2012, 09:27 PM
requinix
\p is only available in UTF-8 mode.
Quote:
Originally Posted by C M Stafford
I have found that adding the u switch anywhere, including using those literals, breaks the expression.

"Breaks" how?

  #5
May 12th, 2012, 09:42 AM
C M Stafford
Quote:
Originally Posted by requinix
\p is only available in UTF-8 mode.

"Breaks" how?


Code:
trim(preg_replace('/[^\p{L}\p{N}]+u/i','',$_POST["something"]));


...or...

Code:
trim(preg_replace('/[^\p{L}\p{N}]/iu','',$_POST["something"]));


...both allow any characters, including punctuation, to be accepted in the $_POST["something"] value. In other words, the addition of the u switch either breaks the functionality of the \p codes or overrides them. Either way, without the u everything works fine for at least the more common foreign glyphs, e.g. à, but adding the u anywhere would allow even the above code to be submitted with absolutely nothing removed.

  #6
May 12th, 2012, 03:51 PM
requinix
The first one is wrong because it has the 'u' in the wrong place. If it's inside the // delimiters then it's just a normal character. It has to be on the outside like in the second example. Also, \pL includes both uppercase and lowercase so the /i isn't necessary.
Code:
/[^\p{L}\p{N}]+/u


That above works for me. Using that in your code, what does it all look like?
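
A small sketch for testing the pattern in isolation, outside of any form handling. The sample strings are made up, and the file is assumed to be saved as UTF-8. It shows the `u` flag after the closing delimiter, and what happens when the `u` sits inside the delimiters instead.

```php
<?php
// Assumption: this file is saved as UTF-8; the sample values are made up.
$input = "Ünïcode, ТЕСТ & test_123!";

// 'u' after the closing delimiter is a flag: pattern and subject are read
// as UTF-8, and \p{L} covers uppercase and lowercase letters of any script.
$clean = preg_replace('/[^\p{L}\p{N}]+/u', '', $input);

// 'u' before the closing delimiter is a literal letter, so a match needs a
// run of disallowed characters that is followed by an actual "u".
$misplaced = preg_replace('/[^\p{L}\p{N}]+u/i', '', "ab!!ucd");

var_dump($clean, $misplaced);
```

If the first call behaves as expected on a literal string but not on form data, the encoding of the submitted value is worth checking before the pattern itself.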

  #7
May 12th, 2012, 07:11 PM
C M Stafford
Quote:
Originally Posted by requinix
The first one is wrong because it has the 'u' in the wrong place. If it's inside the // delimiters then it's just a normal character. It has to be on the outside like in the second example. Also, \pL includes both uppercase and lowercase so the /i isn't necessary.
Code:
/[^\p{L}\p{N}]+/u


That above works for me. Using that in your code, what does it all look like?


That runs for me, too, but it fails to filter the inputs. The following is the code I am currently using, which filters everything as expected.

Code:
$domain = trim(preg_replace('/[^\p{L}\p{N}-]/','',$_POST["dmn"]));


The above processing is the first processing step and is followed by further processing with mysql_real_escape_string().

  #8
May 27th, 2012, 07:28 PM
C M Stafford
I am still having problems with this. As I have said previously, the u switch breaks the functionality of the patterns...

$val_domain = preg_replace('/[^\p{L}\p{N}\-]/','',$_POST["domain"]);
// works

$val_domain = preg_replace('/[^\p{L}\p{N}\-]/u','',$_POST["domain"]);
// does not work

...and now I have hit a problem trying to get Russian characters to display.

This should just involve modifying the pattern like so:

$val_domain = preg_replace('/[^\p{L}\p{N}\p{Cyrillic}\-]/','',$_POST["domain"]);
// ...but this does not work - the Russian characters are displayed as a long string of numbers.

$val_domain = preg_replace('/[^\p{L}\p{N}\p{Cyrillic}\-]/u','',$_POST["domain"]);
// ...ditto (for the Russian characters, anyway)

...so I ran pcretest -C, which returns the following:

UTF-8 support
Unicode properties support
Newline sequence is LF
\R matches all Unicode newlines
Internal link size = 2
POSIX malloc threshold = 10
Default match limit = 10000000
Default recursion depth limit = 10000000
Match recursion uses stack


...thus, PCRE is definitely compiled with UTF-8 and Unicode property support.

I also added mb_internal_encoding("UTF-8");, along with AddDefaultCharset UTF-8 in Apache.


Does anyone have any further insights on this, please? I am not sure where to go with this now. I have even tried \p{IsBlock} and \p{InBlock}, but neither of those works either. Could I possibly need to recompile PHP with Apache mod charset?
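
One way to narrow this down is to make the failure visible instead of guessing. The sketch below is not from the thread: `filter_domain_label` is a hypothetical helper name, and the explanation it tests for (the submitted value not being valid UTF-8, in which case `preg_replace()` with the `u` flag returns `null`) is only one possible cause. Another possibility, also unconfirmed here, is that the browser sent the Russian characters as numeric HTML entities, which would explain the strings of numbers.

```php
<?php
// Hypothetical helper: strip everything except letters, digits and hyphens,
// and report why the call failed instead of silently returning nothing.
function filter_domain_label(string $raw): string
{
    // preg_match('//u', ...) returns false when the subject is not valid UTF-8.
    if (preg_match('//u', $raw) === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    $clean = preg_replace('/[^\p{L}\p{N}\-]+/u', '', $raw);
    if ($clean === null) {
        // preg_last_error() returns a PREG_* constant describing the failure.
        throw new RuntimeException('preg_replace failed, code ' . preg_last_error());
    }

    return $clean;
}

// Made-up sample value; assumes this file is saved as UTF-8.
var_dump(filter_domain_label("пример-сайт.рф!"));
```

Passing a Latin-1 encoded value such as "caf\xE9" to the helper raises the first exception, which is the situation where the unchecked call would have produced an empty result.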

  #9
May 27th, 2012, 07:59 PM
C M Stafford
Got it! - Sort of...

Code:
mb_internal_encoding("UTF-8");
$val_domain = preg_replace('/[^\p{L}\p{N}\p{Cyrillic}\-]+$/u','',$_POST["domain"]);


The problem, now, is that even if I were to enter the above code block into the input box ...it would not be filtered...
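
A likely reason nothing gets filtered is the `$` at the end of the pattern: `+$` only matches a run of disallowed characters at the very end of the input, so anything earlier in the string is left alone. A short sketch with a made-up input:

```php
<?php
// Made-up sample value.
$input = "bad!chars<here>!!";

// Anchored with '$': only the trailing run of disallowed characters goes.
$anchored = preg_replace('/[^\p{L}\p{N}\-]+$/u', '', $input);

// No anchor: every run of disallowed characters is removed.
$everywhere = preg_replace('/[^\p{L}\p{N}\-]+/u', '', $input);

var_dump($anchored, $everywhere);
```

Dropping the `$` is usually what is wanted for a "strip everything not on the whitelist" filter.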

  #10
May 27th, 2012, 08:41 PM
C M Stafford
Sorted!

Code:
	mb_internal_encoding("UTF-8");
	$val_domain = preg_replace('/[^\p{Common}\p{Arabic}\p{Armenian}\p{Bengali}\p{Bopomofo}\p{Braille}\p{Buhid}\p{Cherokee}\p{Cyrillic}\p{Devanagari}\p{Ethiopic}\p{Georgian}\p{Greek}\p{Gujarati}\p{Gurmukhi}\p{Han}\p{Hangul}\p{Hanunoo}\p{Hebrew}\p{Hiragana}\p{Inherited}\p{Kannada}\p{Katakana}\p{Khmer}\p{Lao}\p{Latin}\p{Limbu}\p{Malayalam}\p{Mongolian}\p{Myanmar}\p{Ogham}\p{Oriya}\p{Runic}\p{Sinhala}\p{Syriac}\p{Tagalog}\p{Tagbanwa}\p{Tamil}\p{Telugu}\p{Thaana}\p{Thai}\p{Tibetan}\p{Yi}\p{Nd}\-]+$/','',$_POST["domain"]);

Note: A u switch is not required for the above. The u switch is only required when the input takes the form of UTF-8 encoded data.
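
One caveat with the pattern above: the Unicode script Common covers ASCII punctuation and symbols, so listing `\p{Common}` inside the negated class whitelists those characters as well. A quick sketch with made-up sample strings, using the `u` flag so that the subject is read as UTF-8:

```php
<?php
// ASCII punctuation and symbols; all of them belong to the script "Common".
$punctuation = '!"#$%&()*+,./:;<=>?@';
$commonCount = preg_match_all('/\p{Common}/u', $punctuation);

// Because Common is whitelisted, markup passes through this filter untouched.
$withCommon = preg_replace('/[^\p{Common}\p{Latin}]+/u', '', "<b>hi!</b>");

// General categories (letters, marks, decimal digits) do not include punctuation.
$byCategory = preg_replace('/[^\p{L}\p{M}\p{Nd}\-]+/u', '', "<b>hi!</b>");

var_dump($commonCount, $withCommon, $byCategory);
```

Filtering by general category rather than by script avoids having to enumerate scripts at all.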


...but don't expect to be able to use \p{CanadianAboriginal} or \p{TaiLe}, 'cause you won't be able to:

Compilation failed: unknown property name after \\P or \\p at offset...

Why? Who knows? I'm not a unicode.org geek, just a geek needing some Greek... or Cyrillic, in this case.
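
A likely explanation is the spelling: the PCRE documentation lists multi-word script names with underscores, e.g. `Canadian_Aboriginal` and `Tai_Le`. Whether the unseparated spellings are accepted appears to depend on the PCRE version, so the sketch below only checks the underscored forms. `pattern_compiles` is a hypothetical helper name, and `NoSuchScript` is a deliberately invalid name.

```php
<?php
// Hypothetical helper: true when PCRE accepts the pattern, false when it
// fails to compile.
function pattern_compiles(string $pattern): bool
{
    // '@' hides the "Compilation failed" warning; preg_match() returns
    // false (not 0) when the pattern cannot be compiled.
    return @preg_match($pattern, '') !== false;
}

$canadian = pattern_compiles('/\p{Canadian_Aboriginal}/u');
$taiLe    = pattern_compiles('/\p{Tai_Le}/u');
$bogus    = pattern_compiles('/\p{NoSuchScript}/u');

var_dump($canadian, $taiLe, $bogus);
```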

Last edited by C M Stafford : June 1st, 2012 at 08:43 PM.

  #11
June 1st, 2012, 08:57 PM
C M Stafford
The following works for practically all Unicode characters, including that heinous Swedish å letter:
Code:
		$domain = preg_replace('/[^\p{Common}\p{Arabic}\p{Armenian}\p{Bengali}\p{Bopomofo}\p{Braille}\p{Buhid}\p{Cherokee}\p{Cyrillic}\p{Devanagari}\p{Ethiopic}\p{Georgian}\p{Greek}\p{Gujarati}\p{Gurmukhi}\p{Han}\p{Hangul}\p{Hanunoo}\p{Hebrew}\p{Hiragana}\p{Inherited}\p{Kannada}\p{Katakana}\p{Khmer}\p{Lao}\p{Latin}\p{Limbu}\p{Malayalam}\p{Mongolian}\p{Myanmar}\p{Ogham}\p{Oriya}\p{Runic}\p{Sinhala}\p{Syriac}\p{Tagalog}\p{Tagbanwa}\p{Tamil}\p{Telugu}\p{Thaana}\p{Thai}\p{Tibetan}\p{Yi}\p{L}\p{M}\p{Nd}\-]+$/','',$_POST['domain']);

The following string can also be used with preg_replace(), but filters out Cyrillic characters:
Code:
		$pattern[0] = '/[^\p{L}\p{M}\p{Nd}\-]+$/';
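
Cyrillic letters are themselves in `\p{L}`, so the short pattern should keep them once the subject is read as UTF-8. Without the `u` flag PCRE works on single bytes, which would explain why multi-byte Cyrillic characters get damaged; that explanation is an assumption about this setup, not something confirmed in the thread. A sketch with a made-up sample string, without the `$` anchor so that every disallowed run is removed:

```php
<?php
// Assumption: this file is saved as UTF-8; the sample value is made up.
$input = "пример-å-test_01!";

// Letters, combining marks, decimal digits and hyphens are kept; the 'u'
// flag makes PCRE read the subject as UTF-8 characters rather than bytes.
$clean = preg_replace('/[^\p{L}\p{M}\p{Nd}\-]+/u', '', $input);

var_dump($clean);
```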

Last edited by C M Stafford : June 1st, 2012 at 09:28 PM.

© 2003-2013 by Developer Shed. All rights reserved.