#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    7
    Rep Power
    0

    N00bish C Pointers/Strings Question


    Hope this hasn't been covered before, didn't see it in a quick skim for other similar threads.

    Was hoping someone could help me clear up my understanding of pointers and strings in C. I'm a beginning programmer - kinda exposed to a lot of programming concepts informally before, but chose C to be my first "real" programming language, so may be around more in the future. I've been going through Sams Teach Yourself C in 24 Hrs the past few months and so far have found it to be really good at anticipating my questions and explaining things adequately. However, I'm in Hour 13 (Manipulating Strings) and while I think I've understood everything so far, I just hit a major road block in my understanding and could use some counseling.

    Since this turned out long, here's the executive summary: How does C store/compile strings that are only attached to pointers?

    Here's the excerpt that threw me:
    Another important thing is that a string is interpreted as a char pointer. Therefore, you can assign a character string to a pointer variable directly, like this:

    char *ptr_str;
    ptr_str = "A character string.";
    And the code that implements it:

    /* assign a string to a pointer */
    ptr_str = "Assign a string to a pointer.";
    for (i=0; *ptr_str; i++)
    printf("%c", *ptr_str++);
    Here's my problem, I get that pointers contain memory addresses. Say if I said:
    int x=7, *ptr_x;
    ptr_x=&x;
    printf("%d",*ptr_x);
    I understand that that would define a memory address/lvalue for x (over however many bytes), store the value 7 in that memory location/rvalue, create a memory address for ptr_x, store the memory address of x in the memory location of ptr_x, then print onscreen the contents of the memory location at the memory address stored in ptr_x's memory location (7)...

    I also understand that when you assign a pointer to an array, the array's name is interpreted as the memory address of the first element in the array. And I get that a string is an array of characters (with null character at end) and therefore I'm guessing when you assign a string to a pointer, you're really putting the address of the first character of the string in the pointer's memory location.

    What is confusing me in this example is, WHAT memory address? This is not like the int x=7 case, where x was assigned to a memory address beforehand. As far as I can tell, the string is just drifting along in lala land. And suppose you did just pick a random address and assign it to the string constant. That's mostly okay by me (although I wish I knew how they chose it, but I don't know how they choose the other memory locations either :p), but how does the compiler keep track of it? My (rough) understanding was there was a symbol table it generated, say with int x=7: it would associate the symbol "x" with the memory address/lvalue of whatever. What symbol would it use to keep track of the entire string?

    So, I'm confused. :p Sure I can just memorize how to use strings assigned to pointers (and I will for the time being), but it drives me crazy not to understand how it works underneath (that's why I picked C to start, to get a better feel for the lower level workings). Any help here would be great.
  2. #2
  3. Contributed User
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    Jun 2005
    Posts
    4,387
    Rep Power
    1871
    Code:
     100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116
    +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
    | 9 |100| 1 | 2 | 3 | 4 | 5 |108| H | e | l | l | o |\0 |   |   |   |
    +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
      a  *b  c[]                  t
    
    Where we have
    int a = 9;
    int *b = &a;
    int c[5] = { 1, 2, 3, 4, 5};
    char *t = "Hello";
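    Memory is just a row of numbered 'bins' in which to put things. In this example, the numbering starts at 100 (could be anything, doesn't matter so long as it's sequential). To keep the picture simple, each value gets one bin; on a real machine an int or a pointer spans several bytes.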

    The compiler will internally allocate such bins according to the variable symbols you write in your code.
    So
    a being an integer is placed in bin 100 and initialised to 9
    b is a pointer to an integer, it is placed in bin 101. Because it is pointing to a, the content is initialised to 100 (the bin number for the variable a).
    c is an array starting at bin 102. When we say things like c[3] to index an array, what the compiler is doing is looking in bin 102 + 3, which would give us the value 4.

    When it comes to string constants, the compiler picks some unused bins and puts the text there. The string pointer variable itself then points to the start of this chosen memory.

    When you do char *t = "Hello", what the compiler is doing is roughly
    static char anon[] = "Hello"; // an unnamed array of chars; the terminating \0 is added automatically
    char *t = &anon[0]; // point to the start of it
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper
  4. #3
  5. Contributing User
    Devshed Supreme Being (6500+ posts)

    Join Date
    Jan 2003
    Location
    USA
    Posts
    7,162
    Rep Power
    2222
    You actually do have most everything right. I can't really see where you have anything wrong; it's just some small holes in your knowledge, but you're asking the right questions about them.

    As salem says, the compiler just finds somewhere to put that string literal and then inserts that address into the code that it generates. Exactly where and how is left up to the compiler designer, since the language standard only really specifies what the end result must be. The way that the compiler keeps track of where that string literal is is the exact same way it keeps track of where all the variables are: it inserts those addresses into the object code that it generates.

    In my experience with our development systems, the memory image of the program when it's loaded into memory for execution is mapped out into distinct segments. One segment, which our system calls DATA, contains the global and static local variables. Additionally, our particular system has STACK for the stack, HEAP for the heap (used with malloc), TEXT for the executable code and string literals, etc. Some segments are read/write, but TEXT is read-only. As a result, you cannot safely change a string literal; on our system the attempt crashes the program with an access error of some kind. (The C standard simply calls modifying a string literal undefined behavior, so other systems may react differently.)

    When you initialize an array -- eg
    char anon[] = "Hello";
    -- then that array is created in DATA or STACK and then that string literal is copied into it. If you wanted to change it to "Herro" (like Korean-American comedian Johnny Yune would do), you could, because it's stored in read/write memory so you have permission to write to it.

    Now if you point a char pointer to a string literal -- e.g.
    char *t = "Hello";
    -- then that string typically lives in read-only memory, and if you try to change it you will most likely crash because you don't have permission to write to it.

    Keep up the good work in your studies. You demonstrate a good grasp of the concepts.
  6. #4
  7. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    7
    Rep Power
    0
    OK, I've read all the posts so far and also experimented a bit with a sample program set up similarly to what salem posted. If I follow, when a string constant (say *s="Blah") is created, a pointer s is created and a character array {'B','l','a','h','\0'} is created (which I guess is more resource-intensive than just a character array since the pointer is stored too). The pointer has the symbol *s associated with it and points to the address of the character 'B'. The character array doesn't necessarily have an associated symbol in the compiler(? I'm not sure what is meant by "object code" in dwise1_aol's post), but it does have a memory address put aside for each element, albeit in a different portion of the memory from where variables are normally defined(?). The string constant called this way is also read-only, whereas if a character array is used, it can be edited without crashing the program. (Experimentation with my own setup supports this.)

    If I'm right so far, then if the pointer redirects itself from the string constant "Blah", I think "Blah" itself remains in memory, but is inaccessible without hardcoding the address back in unless the memory address is first stored in another pointer. Is this accurate? (I have experimented some with this, and this is how it looks from my experimenting.)

    I've only heard about the stack so far and haven't even heard of heaps yet - guessing these are more advanced topics but hopefully will get to them eventually. I have heard of malloc() but it's in later chapters than where I've reached so far.

    Thanks for the help! Do think I'm getting a better grasp of this now.
  8. #5
  9. Contributed User
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    Jun 2005
    Posts
    4,387
    Rep Power
    1871
    Originally Posted by Elro
    If I'm right so far, then if the pointer redirects itself from the string constant "Blah", I think "Blah" itself remains in memory, but is inaccessible without hardcoding the address back in unless the memory address is first stored in another pointer. Is this accurate? (I have experimented some with this, and this is how it looks from my experimenting.)
    Consider these three assignments.
    char *t = "Hello";
    t = "World";
    t = "Hello";

    On most modern systems, you will only have one copy of the "Hello" string in memory. So whilst string constants themselves have no symbol name (at the source code level), the compiler is smart enough to see that it has seen a string before, and re-use the previously assigned anonymous symbol.

    This might be instructive, if you're investigating at this low level.
    Use "gcc -S" to show the generated assembler code for your C code.
    Code:
    $ cat foo.c
    #include <stdio.h>
    
    int a = 9;
    int *b = &a;
    int c[5] = { 1, 2, 3, 4, 5};
    char *t = "Hello";
    
    int main ( ) {
      printf("%d %d %d %s\n",a,*b,c[3],t);
      return 0;
    }
    $ gcc -S foo.c
    $ head -24 foo.s
    	.file	"foo.c"
    	.globl	a		; a is a global symbol
    	.data			; it's in the data section
    	.align 4		; align the memory address for efficient access to a
    	.type	a, @object	; its type (for the debugger)
    	.size	a, 4		; this is how big a is (in other words, sizeof(a))
    a:				; the symbol itself
    	.long	9		; and the initial value
    	.globl	b
    	.align 8
    	.type	b, @object
    	.size	b, 8
    b:
    	.quad	a
    	.globl	c
    	.align 16
    	.type	c, @object
    	.size	c, 20
    c:
    	.long	1
    	.long	2
    	.long	3
    	.long	4
    	.long	5
  10. #6
  11. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    7
    Rep Power
    0
    Originally Posted by salem
    On most modern systems, you will only have one copy of the "Hello" string in memory. So whilst string constants themselves have no symbol name (at the source code level), the compiler is smart enough to see that it has seen a string before, and re-use the previously assigned anonymous symbol.
    Okay, that makes sense. The compiler does actually have a symbol for string constants then? At the time I thought you might have been just saying it had an equivalent effect as anon[]="Blah"; t=&anon[0];

    Originally Posted by salem
    This might be instructive, if you're investigating at this low level.
    Use "gcc -S" to show the generated assembler code for your C code.
    Hmm, interesting. I can't get this to work on my setup (MinGW for Windows, which does support cat and gcc but apparently not gcc -S) but I'm planning to get a dual-boot Linux setup going in the foreseeable future, will remember this command for then. :)
  12. #7
  13. Contributing User
    Devshed Supreme Being (6500+ posts)

    Join Date
    Jan 2003
    Location
    USA
    Posts
    7,162
    Rep Power
    2222
    Originally Posted by Elro
    Hmm, interesting. I can't get this to work on my setup (MinGW for Windows, which does support cat and gcc but apparently not gcc -S) but I'm planning to get a dual-boot Linux setup going in the foreseeable future, will remember this command for then. :)
    gcc -S does indeed work for MinGW gcc. What it does is to generate a .s file, so you have to cat or the like to read it; note that salem used $ head -24 foo.s ... and there you see the command using the .s file that gcc -S had generated.
  14. #8
  15. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    7
    Rep Power
    0
    Originally Posted by dwise1_aol
    gcc -S does indeed work for MinGW gcc. What it does is to generate a .s file, so you have to cat or the like to read it; note that salem used $ head -24 foo.s ... and there you see the command using the .s file that gcc -S had generated.
    Ahh, I see now, durr. Thanks! :)
  16. #9
  17. Contributing User
    Devshed Supreme Being (6500+ posts)

    Join Date
    Jan 2003
    Location
    USA
    Posts
    7,162
    Rep Power
    2222
    Now, what you will get will be an assembly listing, the assembly code that the compiler translated the C source to. Since you were confused by the term "object code", I was wanting to ask how familiar you are with assembly code or even the concept of it as well as your understanding of what the compiler generates. I didn't want to assume that you knew too little.
  18. #10
  19. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    7
    Rep Power
    0
    Originally Posted by dwise1_aol
    Now, what you will get will be an assembly listing, the assembly code that the compiler translated the C source to. Since you were confused by the term "object code", I was wanting to ask how familiar you are with assembly code or even the concept of it as well as your understanding of what the compiler generates. I didn't want to assume that you knew too little.
    It's fine to err on the safe side, I wouldn't be insulted, just trying to learn.

    As I understand it, the English-like C source code I write is taken by the compiler and translated into binary (i.e., 1s and 0s, I take it in some sort of code that associates certain combinations of 1s and 0s with certain primitive computery actions). I haven't dealt much with object files so far, have just been compiling .exes, but I think they're the stage in-between source code and executables. I guess object files use assembly language? I know OF assembly language, in that it's pretty low-level programming (and runs very quickly) but a lot more difficult to learn/program in and I think it's machine-specific - so I would have to compile a file on my processor (or download a version pre-compiled for my processor) to use it. I actually would like to learn assembly language one day, I'm just starting with C since it sounds less intimidating and I don't want to scare myself off programming right at the start. :p
  20. #11
  21. Contributing User
    Devshed Supreme Being (6500+ posts)

    Join Date
    Jan 2003
    Location
    USA
    Posts
    7,162
    Rep Power
    2222
    You have the general idea. I'll try to refine that a bit.

    The idea behind C and other higher-level languages (HLLs) is to provide the programmer with something that is human readable. In reality there exists no computer that can actually read C or any other HLL, but rather every HLL program has to be translated into a form that the computer can understand, machine code. Even in the case of systems like the 70's/80's home computers running BASIC as their operating system, a translation program called an interpreter still had to be used to read each BASIC command and carry it out by running the corresponding machine code.

    It is true that machine code is binary numbers, ones and zeros, but there's much more to it than that. Since memory is organized into words, when you access a memory location you cannot access just one single bit, but rather the entire memory word. Word sizes used to vary from one processor family to another, but nowadays the size of a single memory location is one byte, which is 8 bits. (Since byte size has also been known to vary between processor families, though rarely nowadays, network programming uses the term octet for exactly 8 bits.)

    Machine code is organized into instructions, which consist of an operation code (AKA "op-code") and a variable number of operands. Each op-code represents a single operation that the processor can perform, such as moving data from one location to another, adding, negating, logical operations (AND, OR, XOR, NOT), jumping to another instruction either unconditionally or based on a condition, jumping to another location but saving the return address, loading a memory address into an index register, etc. In hardware, the processor quite literally decodes the op-code in order to generate all the necessary control signals at the necessary times during the execution cycle in order to perform that operation; in tech school, we would trace through the logic diagrams of our training computer, a COMTRAN-10, and follow all the signals generated and used in executing an instruction.

    All the instructions that a given processor can execute, taken together, form that processor's instruction set. Each processor has a different instruction set and the op-code for ADD is different in each processor, which is why the only kind of program you can run on a given processor is one that was generated for that particular processor. At the same time, rather than come up with completely new instruction sets all the time, families of processors, such as the Intel 80x86 family, will use the same instruction set, though expanded by each new processor in the family, so that some degree of compatibility can be maintained within that family.

    A quick hardware aside here. Among the hardware that processors contain are special temporary memory circuits called registers, each of which can contain one value. One of the characteristics of a processor is the width of its data bus, which is the largest number of bits it can read from or write to memory in one memory access operation. Data busses started out only 8 bits wide, but then grew to 16 bits, then 32 bits, and now 64 bits wide. The width of the data bus is usually the size of the general-purpose registers, though each general-purpose register can usually be carved up into smaller registers (e.g., subdividing the 16-bit AX register into two 8-bit registers, AH and AL). Some registers are special purpose (e.g., the Instruction Pointer, which contains the address of the next instruction; the Stack Pointer, which points to the current top of the stack; the Flag Register, whose bits get set and cleared based on the outcome of each operation and which can be tested), while others are general-purpose registers that the programmer is free to use. Even some of the general-purpose registers have special uses, though: the accumulator is commonly used for most arithmetic, logical, and shift operations; the counter register is used by some looping instructions as a down-counter; the index register can be loaded with an address and then be used to indirectly access a memory location (this is the basis for pointer operations in C). Each register has a name and a uniquely identifying number.

    As I said, each instruction consists of an op-code and its operands. An operand can be an immediate value, a register, or a memory address. When a new instruction is read in during the instruction fetch cycle, the op-code is immediately decoded so that the processor will know how many and what kinds of operands it has so that it can fetch those too and place them where they need to be.

    The three basic kinds of translation programs are compilers, interpreters, and assemblers. Assemblers were the first translation programs meant to provide programmers with something human readable rather than having to deal with ones and zeros all the time. Each line in an assembly program corresponds with one instruction. Each op-code has assigned to it a mnemonic, a two-to-five letter abbreviation for that instruction; eg, ADD, MOV, JMP, JNE (jump if not equal), CALL. Furthermore, you use symbolic names for the registers and you can use symbolic names for memory locations. When you write machine code directly, you have to explicitly determine exactly where in memory every variable will be stored and you have to calculate exactly where that instruction is located that you want to jump to. In assembly, you can give symbolic names to variables and locations in the code and the assembler will keep track of those symbolic names, calculate where each location associated with a symbolic name is, and then insert those proper location values into the object code it generates. And, yes, it builds a symbol table in order to do that, but once it has finished generating the object file then that symbol table no longer exists.

    OK, so in an assembly program each line represents one machine instruction. That means that in order to write an assembly program, you have to write down each and every step of the process. When you compile a HLL program, you effectively convert it into an assembly program which can then be assembled into an object file. A simple line of C like x = a + b; could easily generate 6 instructions (e.g., load effective address (LEA) of a, move from a to the accumulator, load effective address of b, add b to the accumulator, load effective address of x, move accumulator to x). Plain gcc -S emits only the assembly; depending on your gcc version, adding -fverbose-asm includes extra comments (recent versions interleave the C statements), so you can compare the C with the assembly that it generates.

    Now to try to explain what I mean by "object code" -- CAVEAT: different schools and authors may use this terminology differently, so what I am presenting is what I remember from school. Machine code is what the computer actually executes. But neither the compiler nor the assembler generates machine code, but rather an intermediate form called object code. Object code is very close to machine code -- all the instructions are there and converted to binary -- , but it cannot run yet. The problem is with the addresses. To explain that, I'll switch back to C now.

    The C compiler compiles each source file separately and independently of the others. That means that it knows nothing of what's in the other source files nor in the library files (special pre-compiled object files). That is why you include the header files, which contain information of what's in those other source and library files. So when the compiler reads a function prototype, it knows that somewhere there exists or will exist a function by that name and with that return type and parameter list, but nothing more. When the compiler finds that function within this source file, then it can place that function's location in the function call code, but if that function is in another file then the compiler cannot complete the object code for that function call. Instead, the compiler makes room for an address there, but inserts a marker to indicate that this address needs to be resolved and what it's referencing. The same is true of a global variable that has been extern'd in a header file; the compiler knows all about it except for where it's actually located, so all references to that address are marked for address resolution to be performed later on.

    The compiler generates an object file. In Windows, you will find the object files bearing the same names as the source files but with an .OBJ file extension; in Linux, that file extension is .o -- MinGW gcc, although a Windows port, uses the Linux naming conventions. The actual format of an object file will vary from compiler to compiler, but basically they contain the object code (i.e., the almost machine code), a table of the function and variable names that it contains along with their locations in the object code, and a table of what addresses it needs to have resolved. These object file tables are then used by the linker to generate the executable file. The linker concatenates each object file's code together into one body of code, and then, now knowing where in that amalgamation each function and variable is, it goes through and resolves all the addresses; if you left out one called function, then you get an error from the linker along the lines of "undefined reference" or "unresolved external symbol" (the exact wording depends on the linker).

    Now, the executable is a lot closer to machine code, but it's still not quite there. The problem is that when you load that program into memory to execute it, you have no way of knowing ahead of time exactly where in memory that will be. So all those addresses in the executable code are not actual memory addresses, but rather offsets relative to the beginning of the program or to some other common point. What the loader must do is to perform address fixing to convert all those relative addresses into actual memory addresses. It is only after address fixing has been performed that you finally have machine code that the processor can actually execute.

    Like the object file, the executable file has a specific format that will vary from one operating system to another; you should be able to find these formats on-line. The file header contains information such as where the code starts (ie, the offset into this file), how much memory will be needed to hold the code, how much read/write memory the program needs, how much heap it needs (the heap is extra memory that you can allocate dynamically with malloc), and a list of the addresses that need to be fixed by the loader.


    So then back to your original question of how the address of that string literal is known:
    Code:
        char *ptr_str;
        ptr_str = "A character string.";
    As the compiler is compiling that source file, it builds tables to store for its own use information about each identifier. It then uses that information to generate the object code that goes into the object file, plus the other tables that go into the object file for use by the linker. But once the compiler has finished generating that object file, its own tables then go away and no longer exist. The information that they contained is now implicitly contained in the object code. The object code handles the data in accordance with their data types and as required by C; that is implicit in the assembly code generated by the compiler and hence in the object code generated by the assembly step and hence eventually in the loaded executable's machine code. Literals (eg, char, int, float, pointer values) are incorporated into the code as immediate operands, though the pointers would be subject to address resolution and fixing. Specific to your question, code assigning the address of that string literal to ptr_str was generated, but with the address marked for resolution. Then the linker resolved that address, which was now marked for fixing by the loader. Then when you run the executable, the loader fixed that address and the machine code stored that address in ptr_str and from that point on your program used ptr_str to access that string literal.

    Now as for exactly how this C compiler does it, you can investigate with gcc -S.

    Other interesting questions:

    1. Is ptr_str a global or a local variable? If global, I would expect it to be initialized only once in the startup initialization code that is run before main is called. But if local, then I would expect it to be initialized every time the function is called, in which case the initialization code should be in the function's code. Examination of the assembly listing should tell us that.

    2. salem mentioned that the compiler should be smart enough to reuse a string literal, meaning that if you initialize more than one char pointer to the same string, then they should all point to the same memory location. You could test that with printf using the conversion specifier for a pointer, "%p". However, since each source file is compiled separately and independently, it wouldn't seem that the compiler would know to reuse a string literal already used in another source file. Unless the linker somehow takes care of that; I don't know. (The C standard leaves it unspecified whether identical string literals share storage, so either result is allowed.)


    I hope all that makes sense.

    Comments on this post

    • MrFujin agrees : Great explanation
  22. #12
  23. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    7
    Rep Power
    0
    Hi dwise1_aol:

    I'm sorry for my lapse in replying - half of it was getting busy in real life and half of it was just digesting all of that, but in a good way, that was awesome. :p Must've reread it 20 times just committing the information to my long-term memory/understanding. I think I mostly understand what you were saying. I wasn't aware of the hardware details before, still a little gray there, but that's probably an entire course worth of material (and I do intend to track down a course eventually). And I think I get the general concepts much better than before now. I guess the only major open question from that in my head (beyond some details like why bits have to be in bytes) is: what is the loader? Is it in the OS, or built into the exe? Don't feel you have to answer that if you don't have the time though - I'm sure I'll stumble across it eventually and that was already lots of good info. I think I understand the answer to the original question I asked now.

    Thanks!
