#1
    Question \p{} in regex not working as I expect it to


    I have the following line with regex code in my PHP script:
    PHP Code:
    return preg_match('/^[\p{L}\p{M}\p{N}\p{Pd}\p{Pc}\'.]{2,50}$/u', trim(utf8_encode($_POST[$field])));
    Here's how I'm expecting this to work:
    \p{L} any letter in any language, including accented letters
    \p{M} any mark, such as an accent, that is encoded as a separate character
    \p{N} any numeric character
    \p{Pd} any hyphen or dash
    \p{Pc} any underscore or other connector punctuation
    \' a single quote
    . a literal period (it has no special meaning inside a character class)

    {2,50} any combination of the above with a length of 2-50 characters
    /u treat the pattern and the subject string as UTF-8

    My reference is regular dash expressions dot info slash unicode.html (had to spell it out - blocked).

    However, given a string like "Dubé", it returns false. Does anybody know why this is, or how I can fix it?
    Last edited by ZeroCrash; January 1st, 2013 at 03:20 PM. Reason: Forgot to write the word "dash"
#2
    Hi,

    I can't reproduce the error; this snippet works as it should. So try to debug the problem on your side: make a minimal example with only the necessary code and try different variants, such as a hard-coded string or other letters like umlauts.
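
    A minimal test along those lines might look like the sketch below. The function name is_valid_name() and the sample strings are made up for illustration; the pattern is the one from the first post. The samples are written with \u{...} escapes (PHP 7 or later) so the result does not depend on how the script file itself is saved.

```php
<?php
// Hypothetical helper: wraps the pattern from the first post.
function is_valid_name(string $input): bool
{
    $pattern = '/^[\p{L}\p{M}\p{N}\p{Pd}\p{Pc}\'.]{2,50}$/u';
    return preg_match($pattern, trim($input)) === 1;
}

// Hard-coded UTF-8 samples, no form input and no re-encoding involved.
$samples = [
    "Dub\u{00E9}",      // "Dubé" with a precomposed e-acute (\p{L})
    "Dube\u{0301}",     // "Dubé" as "e" plus a combining acute accent (\p{M})
    "M\u{00FC}ller",    // umlaut
    "O'Brien-Smith_2",  // quote, hyphen, underscore, digit
    "a",                // too short for {2,50}
    "bad name!",        // space and "!" are not in the character class
];

foreach ($samples as $sample) {
    printf("%s => %s\n", $sample, is_valid_name($sample) ? 'true' : 'false');
}
```

    If the hard-coded strings pass but the same text from $_POST fails, the pattern is fine and the problem lies in how the input is encoded before it reaches preg_match().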

    Comments on this post

    • ZeroCrash agrees : Thanks for all the help so far, I appreciate it!
#3
    I found the culprit:
    <meta charset="utf-8" />

    Ideally, I want to be able to take UTF-8 content right out of the database and put it onto an HTML page, so I'd imagine I want to keep that tag there. Perhaps I could just take it off for the register page, but I don't think I should need to. Here's what I've done in my testing script, which seems to work with the tag still there:
    PHP Code:
    $enc = mb_detect_encoding($_POST['word']);
    print($enc);
    if ($enc != "UTF-8") {
        $new_comp = utf8_encode($_POST['word']);
    } else {
        print("Not encoding...");
        $new_comp = $_POST['word'];
    }
    $var = preg_match('/^[\pL\pM\p{Pd}\p{Pc}\'\.]{2,50}$/u', trim($new_comp));
    if ($var) {
        print("true");
    } else {
        print("false");
    }
    This also works, however, if I never use utf8_encode() at all. Should I rely on the user's browser to always send data in UTF-8 when possible, given that meta tag?
#4
    The utf8_encode() function requires its input to be ISO-8859-1, so it makes no sense if your strings are already UTF-8: it would encode them a second time, turning "é" into the two characters "Ã©". Just leave it out.
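
    The effect can be seen in a small sketch like the one below, which probably also explains the original "Dubé" failure. It assumes the mbstring extension is loaded, and it uses mb_convert_encoding() from ISO-8859-1 as a stand-in for utf8_encode(), since the two perform the same conversion and utf8_encode() is deprecated in recent PHP versions.

```php
<?php
$pattern = '/^[\p{L}\p{M}\p{N}\p{Pd}\p{Pc}\'.]{2,50}$/u';

// Already UTF-8: "é" is the two bytes C3 A9.
$utf8 = "Dub\u{00E9}";

// Stand-in for utf8_encode(): treats every input byte as one ISO-8859-1
// character, so C3 and A9 are each encoded again as separate characters.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";    // 447562c3a9
echo bin2hex($double), "\n";  // 447562c383c2a9

// The second "character" produced is U+00A9 (the copyright sign), which is
// a symbol and not a letter, mark, or number, so the match fails.
var_dump(preg_match($pattern, $utf8));    // int(1)
var_dump(preg_match($pattern, $double));  // int(0)
```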

    You should, however, set the document encoding with an HTTP header rather than the meta tag. If your webserver doesn't already do that, you can use PHP for it:
    PHP Code:
    header('Content-type: text/html; charset=utf-8'); 
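
    On the question of trusting the browser: a form on a page served as UTF-8 is normally submitted as UTF-8, but nothing forces a client to comply. One cautious option, sketched here under the assumption that mbstring is available, is to reject input that is not valid UTF-8 instead of guessing its encoding and converting it. The function name clean_name() is made up for this example.

```php
<?php
// Hypothetical helper: returns the trimmed value if it is valid UTF-8 and
// matches the name pattern from this thread, or null otherwise.
function clean_name(string $raw): ?string
{
    if (!mb_check_encoding($raw, 'UTF-8')) {
        return null; // not valid UTF-8: reject rather than guess and re-encode
    }
    $value = trim($raw);
    $pattern = '/^[\p{L}\p{M}\p{N}\p{Pd}\p{Pc}\'.]{2,50}$/u';
    return preg_match($pattern, $value) === 1 ? $value : null;
}

var_dump(clean_name("Dub\u{00E9}")); // valid UTF-8, accepted
var_dump(clean_name("Dub\xE9"));     // lone ISO-8859-1 byte, rejected (null)
```

    Checking validity with mb_check_encoding() avoids the guesswork of mb_detect_encoding(), which can only estimate which encoding a byte string is in.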
