Thread: Japanese

    #1
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2011
    Posts
    9
    Rep Power
    0

    Japanese


    How can I find Japanese characters using regular expressions?
    #2
    Sarcky
    Devshed Supreme Being (6500+ posts)

    Join Date
    Oct 2006
    Location
    Pennsylvania, USA
    Posts
    10,908
    Rep Power
    6352
    You can use Unicode code points directly in regular expressions in many languages. Putting those code points in a character-class range will let you match the entire Japanese character set... maybe.
    #3
    Jealous Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,302
    Rep Power
    9400
    Depends on the programming language and what its regular expressions support, but you'd be looking for character ranges. Wikipedia lists the Unicode ranges as U+3040-30FF (hiragana and katakana) and U+4E00-9FBF (kanji), so in essence
    Code:
    /[\u3040-\u30ff\u4e00-\u9fbf]/
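    For comparison, here is a minimal sketch of the same character class in Python. It assumes Python 3, where strings are already Unicode; the names JAPANESE_RE and find_japanese are made up for illustration and are not from any library.

```python
import re

# Hiragana and katakana (U+3040-U+30FF) plus the kanji range quoted above
# (U+4E00-U+9FBF). JAPANESE_RE is an illustrative name only.
JAPANESE_RE = re.compile("[\u3040-\u30ff\u4e00-\u9fbf]")


def find_japanese(text):
    """Return every character in text that falls inside the ranges above."""
    return JAPANESE_RE.findall(text)


print(find_japanese("I ate 食べる today"))
```

    Note that punctuation such as the ideographic comma (U+3001) sits outside these ranges, so it is not matched.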
    You also have to take into account text encoding: UTF-8, JIS, Shift_JIS, UCS-2... they all use different byte sequences to represent the same characters. Most regex implementations I know support UTF-8, but if your string is in Shift_JIS then you'll have to do some extra work, typically converting it to Unicode before matching.

    So questions:
    - What programming language?
    - What text encoding?
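    To illustrate the encoding point, here is a rough Python sketch: decode the raw bytes with whatever encoding they were written in, then run the regex on the decoded text. The function name contains_japanese and the sample string are assumptions for the example; the codec names "utf-8" and "shift_jis" come from Python's standard codecs.

```python
import re

# Same ranges as the pattern quoted in the thread.
JAPANESE_RE = re.compile("[\u3040-\u30ff\u4e00-\u9fbf]")


def contains_japanese(raw, encoding):
    """Decode raw bytes with the given encoding, then search the decoded text."""
    text = raw.decode(encoding)
    return JAPANESE_RE.search(text) is not None


sample = "食べる"
utf8_bytes = sample.encode("utf-8")
sjis_bytes = sample.encode("shift_jis")

# Same characters, different byte sequences.
print(utf8_bytes != sjis_bytes)

# Once decoded, the same pattern works for both inputs.
print(contains_japanese(utf8_bytes, "utf-8"))
print(contains_japanese(sjis_bytes, "shift_jis"))
```

    Matching against the undecoded bytes would not work with this pattern, because the ranges describe code points rather than bytes.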
    #4
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    May 2007
    Posts
    765
    Rep Power
    929
    Your regex engine might support matching by Unicode script property.

    Code:
    Contents of jp.txt
    食べる、 タベル
    Code:
    C:\temp>cat re.pl
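    use strict;
    use warnings;

    # Read the file as UTF-8 so that each character is one code point.
    open my $fh, '<:encoding(UTF-8)', 'jp.txt' or die "Cannot open jp.txt: $!";

    chomp(my $line = <$fh>);
    close $fh;

    # Report which Unicode script each character belongs to.
    for my $ch (split //, $line) {
        printf "Char %X:", ord $ch;
        print " Han"      if $ch =~ /\p{Han}/;
        print " Hiragana" if $ch =~ /\p{Hiragana}/;
        print " Katakana" if $ch =~ /\p{Katakana}/;
        print "\n";
    }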
    
    C:\temp>perl re.pl
    Char 98DF: Han
    Char 3079: Hiragana
    Char 308B: Hiragana
    Char 3001:
    Char 20:
    Char 30BF: Katakana
    Char 30D9: Katakana
    Char 30EB: Katakana
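
    Not every engine has script properties. Python's standard re module, for one, does not accept \p{Han}. As a rough stand-in, the sketch below guesses the script from each character's Unicode name using the standard unicodedata module. The classify helper is hypothetical, and name prefixes are only an approximation of the real Script property (for example, the prolonged sound mark U+30FC is named KATAKANA-HIRAGANA but its script is Common).

```python
import unicodedata


def classify(ch):
    # Guess a script label from the character's Unicode name.
    # This approximates Perl's \p{Han}, \p{Hiragana} and \p{Katakana};
    # it is not a full Script property lookup.
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "Han"
    if name.startswith("HIRAGANA"):
        return "Hiragana"
    if name.startswith("KATAKANA"):
        return "Katakana"
    return ""


# Same sample line as jp.txt above.
for ch in "食べる、 タベル":
    print("Char %X: %s" % (ord(ch), classify(ch)))
```

    The third-party regex package for Python does accept \p{Han}-style properties, if installing it is an option.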
