#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2004
    Posts
    18
    Rep Power
    0

    Parsing in a string character by character in ANSI C


    Hi All,

    I intend to solve the following task (coding in ANSI C):

    Let "01011001|00100101|\n" be a line of a file encoding a binary vector of 16 dimensions.
    I want to parse the vector into a char array (char *) by reading this line character by character, so that I can skip the separator characters '|'. (I am not interested in any solution that removes the '|' characters first.)

    What I came up with so far is as follows (just a simple sample code):

    Code:
    FILE *fp;
    char *bitvector;
    int bitcounter = 0;	
    int act_char = 0;
    const int endofline = 10;
    
    fp = fopen ("foo.txt", "r"); 
    
    bitvector = (char *) calloc (17, sizeof(char)); // 16 + 1 for ending \0 char
    
    while (act_char != endofline)
    {
           act_char = fgetc (fp);
           if ((act_char != *"|") && (act_char != endofline))
           {
                  bitvector[bitcounter] = act_char;
                  bitcounter++;
           }
    }
    
    free(bitvector);
    bitvector = NULL;
    
    fclose(fp);

    So the code seems to be working, but here is the catch: I don't fully understand what happens in this line:

    Code:
    bitvector[bitcounter] = act_char;
    It looks to me like an implicit type conversion (or cast).
    Interestingly, the g++ compiler (g++ -v showing: gcc version 4.3.2 (Debian 4.3.2-1.1)) does not complain about assigning an int (act_char) to a char element of a char array. From what I know, a char is represented in 8 bits and an int in 32 bits. I would like to emphasize again that the code produces seemingly correct results. I never get a segmentation fault, which suggests that everything fits into the allocated memory.

    It would be great if someone could confirm that what I am doing in the code is correct, or suggest a correct solution.

    Thanks for your help!

    Best wishes,

    Zahoo
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2009
    Posts
    194
    Rep Power
    71
    Despite the "bit" in the name, bitvector is just an array of char's, and nothing special.

    In your if statements, you have double quotes, which are correct for strings, but not for char's. Use single quotes, instead:

    if ((act_char != '|') ...

    On my (very old) compiler, I use type short int when I want to match an int with a char, because on my compiler, they are the same size. Otherwise I use the char - '0' trick to translate char's to their number equivalents. My newer compiler will automatically change my int's that get matched up with char's, into shorts, when it compiles.

    If you have included stdlib.h, then there is no need in C to cast the pointer from calloc() into a char *. BUT there is a need to do that in C++. Sometimes people believe they are compiling in C, when they are really compiling in C++, so check on that.
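
    For reference, here is a minimal sketch of the reading loop with the single-quote fix applied, assuming one vector per line as in the question. The helper name parse_bitvector and its cap parameter are made up for illustration and are not from the original code. It also stops at end of file (fgetc() signals that with EOF, which is why its result is kept in an int) and never writes past the buffer.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: reads one line from fp and copies every character
 * except '|' into out. Stops at '\n' or at end of file, stores at most
 * cap - 1 characters, and always terminates out with '\0'.
 * Returns the number of characters stored. */
size_t parse_bitvector(FILE *fp, char *out, size_t cap)
{
    size_t n = 0;
    int c; /* int, not char: fgetc() must be able to return EOF */

    assert(fp != NULL && out != NULL && cap > 0);

    while ((c = fgetc(fp)) != EOF && c != '\n') {
        if (c != '|' && n + 1 < cap) {
            out[n++] = (char) c; /* the cast is optional; the conversion is implicit */
        }
    }
    out[n] = '\0';
    return n;
}
```

    Called with the sample line and a 17-byte buffer, this should leave "0101100100100101" in the buffer and return 16.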
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2004
    Posts
    18
    Rep Power
    0
    Hi Adak,

    thank you for your answer!

    Originally Posted by Adak

    In your if statements, you have double quotes, which are correct for strings, but not for char's. Use single quotes, instead:

    if ((act_char != '|') ...
    Thanks for that! What is the reason for the distinction? The missing terminating null character in the case of the char?

    Originally Posted by Adak
    If you have included stdlib.h, then there is no need in C to cast the pointer from calloc() into a char *. BUT there is a need to do that in C++. Sometimes people believe they are compiling in C, when they are really compiling in C++, so check on that.
    I included these headers:

    Code:
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    So am I right that with a C compiler, gcc (version 4.3.2 (Debian 4.3.2-1.1)), it is safe to use this line:

    Code:
    bitvector[bitcounter] = act_char;
    or would this make more sense (talking about ANSI C):

    Code:
    bitvector[bitcounter] = (char) act_char;
    or

    Code:
    bitvector[bitcounter] = (short int) act_char;
    ?

    Thanks!

    Bests,

    Zahoo
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2009
    Posts
    194
    Rep Power
    71
    Yes, a char can only hold ONE character, so by definition it can't be a string: a string holding even one character needs space for at least two chars, because of the end-of-string char '\0'.

    If you want to work with shorts, just declare them as short int's, right from the start.

    Programs that use a lot of casting are nearly always using some of it needlessly, because of a poor design decision.

    print out your sizeof(short int), sizeof(char), sizeof(int), and see what you have there, pilgrim. :)

    Yowzerr! :p
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2004
    Posts
    18
    Rep Power
    0
    Thanks for the reply!

    Originally Posted by Adak
    Programs that use a lot of casting are nearly always using some of it needlessly, because of a poor design decision.
    I agree. However, fgetc() returns an int anyway /int fgetc(FILE *)/. So I must use casting in the case I want to use the result as a char or as a short. Am I right?


    Originally Posted by Adak
    print out your sizeof(short int), sizeof(char), sizeof(int), and see what you have there, pilgrim. :)
    I wrote this toy code:

    Code:
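    #include <stdlib.h>
    #include <stdio.h>
    
    int main (void) {
    
      /* sizeof yields a size_t; cast to unsigned long so that %lu matches in ANSI C */
      fprintf (stdout, "char: %lu\n", (unsigned long) sizeof(char));
      fprintf (stdout, "short int: %lu\n", (unsigned long) sizeof(short int));
      fprintf (stdout, "int: %lu\n", (unsigned long) sizeof(int));
    
      return EXIT_SUCCESS; /* a nonzero return value would signal failure */
    }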
    and here it is the result:

    char: 1
    short int: 2
    int: 4


    I am on a 64-bit architecture, if that matters. I tested the code by compiling it with both gcc and g++; the results are the same.


    Originally Posted by Adak
    If you want to work with shorts, just declare them as short int's, right from the start.

    Well, actually I intend to use a char array. I know I could use an array of shorts, but then every time I wanted to output them as chars I would have to go through a conversion, which I want to avoid. So the most convenient way for me would be to store the characters I read in an array of chars.
    But I have some concerns about using shorts as chars. They seem to be represented by 16 bits (see the result of the toy code above). If this holds, then I still don't understand what is really going on in the line:

    Code:
    bitvector[bitcounter] = act_char;
    Now I am starting to feel that this would be the correct line:

    Code:
    bitvector[bitcounter] = (char) act_char;
    Neither gcc nor g++ complains about this, and it produces correct results with no memory fault at all.

    Do you agree with this solution?

    Thanks!

    Best regards,

    Zahoo
  10. #6
  11. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2009
    Posts
    194
    Rep Power
    71
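    Yes, I agree with the cast: your int and char data types are not the same size, so the cast makes the narrowing explicit. (The plain assignment converts implicitly and gives the same result.) :cool: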

    Chars are like small ints in C. They print differently (%c versus %d in printf()), but if you take a char and print it with %d, you'll see what I'm talking about. As long as you're within the smaller range of the char, it can be used much like an int.

    But that range is so very small! :D

    Adak
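
    A small sketch of the "chars are small ints" point, assuming the values involved fit in a char. Both helper names are hypothetical. The first is the char - '0' trick mentioned earlier in the thread; the second shows that assigning an int to a char converts implicitly, so an explicit (char) cast does not change the stored value.

```c
#include <assert.h>

/* Hypothetical helper: maps a digit character '0'..'9' to its value 0..9.
 * Works because the digit characters are guaranteed to be contiguous in C. */
int digit_value(char ch)
{
    assert(ch >= '0' && ch <= '9');
    return ch - '0';
}

/* Hypothetical helper: stores an int (for example, a value returned by
 * fgetc()) in a char. The conversion happens implicitly on assignment. */
char store_in_char(int c)
{
    char ch = c; /* same result as: char ch = (char) c; */
    return ch;
}
```

    For the characters in this thread ('0', '1', '|') the value is preserved exactly. If the int does not fit in a char (possible when char is signed and fgetc() returns a value above CHAR_MAX), the result is implementation-defined, with or without the cast.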
  12. #7
  13. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2004
    Posts
    18
    Rep Power
    0
    Thanks for the great help!

    Zahoo
  14. #8
  15. No Profile Picture
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Feb 2004
    Location
    San Francisco Bay
    Posts
    1,939
    Rep Power
    1313
    Originally Posted by zahoo
    Thanks for that! What is the reason for the distinction? The missing terminating null character in the case of the char?
    The syntax is different because a character is conceptually distinct from a one-character string. A string is a sequence of characters; a character is a constituent of a string. A one-character string is a sequence of characters; the one character is the element of the sequence, but it isn't the sequence itself.

    Not all languages make the distinction, but in C it's very important, since characters and strings have completely different representations. A character is an integral type: the integer value of a character is the numerical value of the character itself (usually in ASCII). Strings, on the other hand, are a pointer type: the literal value of a "string" variable is actually the address of the first character of the string. Thus, it's important to know whether you're dealing with a character itself or a string, hence the different syntax for characters vs. string literals.
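
    To make the difference concrete, here is a short sketch; the helper names are made up for illustration. It also suggests why the *"|" in the original code happened to work: the literal "|" is an array of two chars ('|' and '\0') that decays to a pointer to its first element, and dereferencing that pointer yields the character '|'.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the first character of a string. */
char first_char(const char *s)
{
    assert(s != NULL);
    return *s; /* equivalent to s[0] */
}

/* Hypothetical helper: size in bytes of the string literal "|",
 * which holds the character itself plus the terminating '\0'. */
size_t bar_literal_size(void)
{
    return sizeof "|";
}
```

    So first_char("|") compares equal to '|', and bar_literal_size() returns 2, whereas a char always has size 1.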
