#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2013
    Posts
    2
    Rep Power
    0

    Read text and print each (byte) character in separate line


    I am using this code to read a file and print each character (byte) on a separate line.

    It works well with ASCII:

    #include <stdio.h>   /* getc, printf, EOF */
    #include <stdlib.h>  /* exit */

    void
    preprocess_file (FILE *fp)

    {
    int cc;

    for ( ; ; )
    {
    cc = getc (fp);
    if (cc == EOF)
    break;
    printf ("%c\n", cc);
    }
    }

    int
    main(int argc, char *argv [])
    {
    preprocess_file (stdin);

    exit (0);
    }
    But when I use it with UTF-8 encoded text, it shows unreadable characters such as:



    ؟
    ط

    ظ

    ظ

    ط

    ط

    ط
    Any advice?
  2. #2
  3. Contributing User
    Devshed Supreme Being (6500+ posts)

    Join Date
    Jan 2003
    Location
    USA
    Posts
    7,174
    Rep Power
    2222
    Please use code tags to preserve your code's indentation.

    Your original formatting retrieved via the Reply button:
    Code:
    void
    preprocess_file (FILE *fp)
    
    {
      int cc;
    
        for ( ; ;  )
          { 
        cc = getc (fp);
        if (cc == EOF)
            break;
        printf ("%c\n", cc);
          }
    }
    
    int
    main(int argc, char *argv [])
    {
        preprocess_file (stdin);
    
        exit (0);
    }
    To see exactly which characters you're reading in, display the character code for cc alongside each one, e.g.:
    Code:
        printf ("%c [%d]\n", cc, cc);
    There are control characters and characters outside the normal 7-bit ASCII range (0 to 127). Printing each character's numeric code shows exactly what you're getting, which can help you debug.

    You could also use your debugger to see exactly what you're getting.
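A slightly expanded version of that printf can be sketched as follows. This is only an illustration: the helper names describe_byte and dump_bytes are made up for this example, and it assumes you also want the value in hex and want non-printable bytes shown as '.' so they don't disturb the terminal.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: format one byte value (0..255) into buf.
   Printable ASCII is shown as itself, anything else as '.'.
   Returns the number of characters written (as snprintf does). */
int describe_byte(int cc, char *buf, size_t size)
{
    assert(cc >= 0 && cc <= 255);
    int shown = (cc < 128 && isprint(cc)) ? cc : '.';
    return snprintf(buf, size, "%c [%3d] [0x%02X]", shown, cc, cc);
}

/* Hypothetical helper: print one described byte per line until EOF. */
void dump_bytes(FILE *fp)
{
    int cc;
    char line[32];

    while ((cc = getc(fp)) != EOF) {
        describe_byte(cc, line, sizeof line);
        puts(line);
    }
}
```

Calling dump_bytes (stdin) in place of preprocess_file (stdin) prints each byte with its decimal and hex code.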
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2013
    Posts
    2
    Rep Power
    0
    This was the output:
    Code:
     [239]
     [187]
    ؟ [191]
    ط [216]
     [167]
    ظ [217]
     [132]
    ظ [217]
     [133]
    ط [216]
     [164]
    ط [216]
     [180]
    ط [216]
     [177]
      [32]
    In UTF-8, a character can take more than one byte, which is why printing one byte per line shows strange characters.

    Any help?
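One way to keep the bytes of each character together is to look at the first byte of every sequence, since in UTF-8 it encodes how many bytes follow. In the output above, the first three bytes (239 187 191) are the UTF-8 byte order mark, and each later pair such as 216 167 is one two-byte character. The sketch below is a minimal illustration, not a full decoder: the names utf8_seq_len and print_utf8_chars are made up here, and it assumes the terminal itself expects UTF-8 and does not check that continuation bytes are well formed.

```c
#include <assert.h>
#include <stdio.h>

/* Number of bytes in a UTF-8 sequence, judged from its first byte.
   Returns 0 for a continuation byte (10xxxxxx) or an invalid lead byte. */
int utf8_seq_len(int lead)
{
    assert(lead >= 0 && lead <= 255);
    if (lead < 0x80)           return 1;  /* 0xxxxxxx: plain ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Copy in to out, writing each UTF-8 character on its own line.
   A stray or invalid byte is written on a line by itself. */
void print_utf8_chars(FILE *in, FILE *out)
{
    int cc;

    while ((cc = getc(in)) != EOF) {
        int len = utf8_seq_len(cc);

        putc(cc, out);
        /* Pull in the remaining bytes of this character, if any. */
        for (int i = 1; i < len; i++) {
            cc = getc(in);
            if (cc == EOF)
                break;
            putc(cc, out);
        }
        putc('\n', out);
    }
}
```

Calling print_utf8_chars (stdin, stdout) from main would replace the byte-by-byte loop. The byte order mark, if present, still comes out as a (normally invisible) character on the first line.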
  6. #4
  7. Contributing User
    Devshed Supreme Being (6500+ posts)

    Join Date
    Jan 2003
    Location
    USA
    Posts
    7,174
    Rep Power
    2222
    I haven't worked with it much, but C does define wide characters. I've seen references to the data type wchar_t, and in a Windows Unicode app I've dealt with the TCHAR type. Often when I look up a string-handling function in Visual Studio help, I also see a wchar_t version of the function.

    That might provide you with leads for your research. Plus others here who have worked with UTF-8 can contribute.
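As a starting point for that research, here is a rough sketch of the wide-character route using the standard fgetwc (the wide counterpart of getc). It rests on assumptions that may not hold on your system: the environment's locale must be a UTF-8 one (for example en_US.UTF-8) for the decoding to work, and on Windows wchar_t is only 16 bits wide, so results there can differ. The function names are invented for this example.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: adopt the locale from the environment so the
   wide-character functions know the input encoding.
   Returns 1 on success, 0 on failure. */
int use_environment_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Hypothetical helper: read wide characters from in and print each one
   on its own line. Returns the number of characters read. */
long print_wide_chars(FILE *in)
{
    assert(in != NULL);
    long count = 0;
    wint_t wc;

    while ((wc = fgetwc(in)) != WEOF) {
        printf("%lc\n", wc);
        count++;
    }
    return count;
}
```

In main you would call use_environment_locale () once at startup and then print_wide_chars (stdin). Note that a stream should be read either with the byte functions or with the wide functions, not a mix of both.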
