July 21st, 2013, 02:38 AM
What compilers and linkers do
From what I've learned on header files, a compiler does not need the definition of a function, but it does need the prototype.
What exactly is it that compilers do?
Also, I feel my knowledge on compilers/linkers/loaders is insufficient. What I know of how a program is built is that you write the source file with a text editor, translate the source code to machine language with a compiler creating an object file, link relevant files together with the linker, and then execute the .exe file made by the linker with the loader.
I want to know more about what each of the compiler, linker, and loader does, because they help me understand errors better.
For example, I learned last time here that the preprocessor replaces the directives, so if I had the directive
#define SIZE 100
, all occurrences of SIZE in the code body would get replaced by 100. Therefore, I understood clearly what was going on when I forgot to include the value of 100, writing
#define SIZE
and received an expected expression error for the line
; the compiler was reading
Another thing I want to know is how files are linked together in Projects.
Please point me to a good starting point reference.
Last edited by 046; July 21st, 2013 at 02:56 AM.
July 21st, 2013, 03:04 AM
> Please point me to a good starting point reference.
Seriously, you've got past 100 posts and so far I've failed to see any initiative on your part to use a search engine.
I mean, if you'd posted something like
"I've read <link> and <link>, but I'm still not clear what foo means", that would be something.
July 21st, 2013, 09:44 AM
I do admit that sometimes I want answers quick and I don't spend enough time searching.
But I will disagree on your point that I don't show any effort at all.
For example, in this thread (http://forums.devshed.com/c-programming-42/difference-between-c-and-c-948745.html) I asked how C and C++ are different after searching "c c++ difference", not understanding the replies on stackoverflow, and finding that other articles did not give me the answer I wanted.
Similarly, I did look up carriage returns before starting this thread (http://forums.devshed.com/c-programming-42/carriage-return-948751.html), as can be seen in my first post there.
The reason I asked for tutorials is because there were many articles when I searched "compiler linker tutorial".
I thought rather than going through each one, it would be more efficient if I asked people who knew which choice would be better.
Last edited by 046; July 21st, 2013 at 09:48 AM.
July 21st, 2013, 02:46 PM
One problem is that many if not all of the experienced programmers here had not learned from those tutorials. For example, I learned C and C++ back around 1990 (yes, I was a late-comer), a couple years before the Internet became available for the public and before the Web had caught on. By the time any on-line tutorials came along, I was way past the point of their being of any use to me. I cannot even recommend any beginning books because the ones I learned from are out of print.
My recommendation is that you Google for the tutorials you want, pick one that looks likely, and start reading it. If you don't like it or find it too confusing, then pick another one.
Do a search on my replies here. On a few occasions I've explained how compilers and linkers work.
Last edited by dwise1_aol; July 21st, 2013 at 02:49 PM.
July 21st, 2013, 03:37 PM
My draft of one such explanation I posted here:
The idea behind C and other higher-level languages (HLLs) is to provide the programmer with something that is human readable. In reality there exists no computer that can actually read C or any other HLL, but rather every HLL program has to be translated into a form that the computer can understand, machine code. Even in the case of systems like the 70's/80's home computers running BASIC as their operating system, a translation program called an interpreter still had to be used to convert each BASIC command to machine code.
It is true that machine code is binary numbers, ones and zeros, but there's much more to it than that. Since memory is organized into words, when you access a memory location you cannot access just one single bit, but rather the entire memory word. Word sizes would vary from one processor family to another, but nowadays the size of a single memory location is one byte, which is 8 bits -- since byte size has also been known to vary between processor families (though rarely nowadays), in network programming the term octet is used. Machine code is organized into instructions which consist of an operation code (AKA "op-code") and a variable number of operands. Each op-code represents a single operation that the processor can perform, such as moving data from one location to another, adding, negating, logical operations (AND, OR, XOR, NOT), jumping to another instruction either unconditionally or based on a condition, jumping to another location but saving the return address, loading a memory address into an index register, etc. In hardware, the processor quite literally decodes the op-code in order to generate all the necessary control signals at the necessary times during the execution cycle in order to perform that operation; in tech school, we would trace through the logic diagrams of our training computer, a COMTRAN-10, and follow all the signals generated and used in executing an instruction. All the instructions that a given processor can execute taken together form that processor's instruction set. Each processor has a different instruction set and the op-code for ADD is different in each processor, which is why the only kind of program you can run on a given processor is one that was generated for that particular processor. 
At the same time, rather than come up with completely new instruction sets all the time, families of processors, such as the Intel 80x86 family, will use the same instruction set, though expanded by each new processor in the family, so that some degree of compatibility can be maintained within that family.
A quick hardware aside here. Among the hardware that processors contain are special temporary memory circuits called registers, each of which can contain one value. One of the characteristics of a processor is the width of its data bus, which is the largest number of bits it can read from or write to memory in one memory access operation. Data buses started out only 8 bits wide, but then grew to 16 bits, then 32 bits, and now 64 bits wide. The width of the data bus is usually the size of the general-purpose registers, though each general-purpose register can usually be carved up into smaller registers (eg, subdividing the 16-bit AX register into two 8-bit registers, AH and AL). Some registers are special purpose (eg, the Instruction Pointer which contains the address of the next instruction, the Stack Pointer which points to the current top of the stack, the Flag Register whose bits get set and cleared based on the outcomes of each operation and which can be tested), while the others are general-purpose registers that the programmer is free to use. Even so, some of the general-purpose registers have customary uses: the accumulator is commonly used for most arithmetic, logical, and shift operations, the counter register is used by some looping instructions as a down-counter, and the index register can be loaded with an address and then be used to indirectly access a memory location (this is the basis for pointer operations in C). Each register has a name and a uniquely identifying number.
As I said, each instruction consists of an op-code and its operands. An operand can be an immediate value, a register, or a memory address. When a new instruction is read in during the instruction fetch cycle, the op-code is immediately decoded so that the processor will know how many and what kinds of operands it has so that it can fetch those too and place them where they need to be.
The three basic kinds of translation programs are compilers, interpreters, and assemblers. Assemblers were the first translation programs meant to provide programmers with something human readable rather than having to deal with ones and zeros all the time. Each line in an assembly program corresponds with one instruction. Each op-code has assigned to it a mnemonic, a two-to-five letter abbreviation for that instruction; eg, ADD, MOV, JMP, JNE (jump if not equal), CALL. Furthermore, you use symbolic names for the registers and you can use symbolic names for memory locations. When you write machine code directly, you have to explicitly determine exactly where in memory every variable will be stored and you have to calculate exactly where that instruction is located that you want to jump to. In assembly, you can give symbolic names to variables and locations in the code and the assembler will keep track of those symbolic names, calculate where each location associated with a symbolic name is, and then insert those proper location values into the object code it generates. And, yes, it builds a symbol table in order to do that; that working table is discarded once the object file has been generated, although the names that other files need to see are recorded in the object file itself.
OK, so in an assembly program each line represents one machine instruction. That means that in order to write an assembly program, you have to write down each and every step of the process. When you compile a HLL program, you effectively convert it into an assembly program which can then be assembled into an object file. A simple line of C like x = a + b; could easily generate 6 instructions (eg, load effective address (lea) of a, move from a to the accumulator, load effective address of b, add b to the accumulator, load effective address of x, move accumulator to x). You can use gcc -S to generate an assembly file for your C program and compare the C with the assembly that it generates; adding -fverbose-asm makes gcc annotate the assembly with comments, and recent versions of gcc also include the C source lines in those comments.
Now to try to explain what I mean by "object code" -- CAVEAT: different schools and authors may use this terminology differently, so what I am presenting is what I remember from school. Machine code is what the computer actually executes. Neither the compiler nor the assembler generates finished machine code, but rather an intermediate form called object code. Object code is very close to machine code (all the instructions are there and converted to binary), but it cannot run yet. The problem is with the addresses. To explain that, I'll switch back to C now.
The C compiler compiles each source file separately and independently of the others. That means that it knows nothing of what's in the other source files nor in the library files (special pre-compiled object files). That is why you include the header files, which contain information of what's in those other source and library files. So when the compiler reads a function prototype, it knows that somewhere there exists or will exist a function by that name and with that return type and parameter list, but nothing more. When the compiler finds that function within this source file, then it can place that function's location in the function call code, but if that function is in another file then the compiler cannot complete the object code for that function call. Instead, the compiler makes room for an address there, but inserts a marker to indicate that this address needs to be resolved and what it's referencing. The same is true of a global variable that has been extern'd in a header file; the compiler knows all about it except for where it's actually located, so all references to that address are marked for address resolution to be performed later on.
The compiler generates an object file. In Windows, you will find the object files bearing the same names as the source files but with an .OBJ file extension; in Linux, that file extension is .o -- MinGW gcc, although a Windows port, uses the Linux naming conventions. The actual format of an object file will vary from compiler to compiler, but basically they contain the object code (ie, the almost machine code), a table of the function and variable names that it contains along with their locations in the object code, and a table of what addresses it needs to have resolved. These object file tables are then used by the linker to generate the executable file. The linker concatenates each object file's code together into one body of code, and then, now knowing where in that amalgamation each function and variable is, it goes through and resolves all the addresses; if you left out one called function, then you get an error from the linker such as "undefined reference" (GNU ld) or "unresolved external symbol" (Microsoft's linker).
Now, the executable is a lot closer to machine code, but it's still not quite there. The problem is that when you load that program into memory to execute it, you have no way of knowing ahead of time exactly where in memory that will be. So all those addresses in the executable code are not actual memory addresses, but rather offsets relative to the beginning of the program or to some other common point. What the loader must do is to perform address fixing to convert all those relative addresses into actual memory addresses. It is only after address fixing has been performed that you finally have machine code that the processor can actually execute.
Like the object file, the executable file has a specific format that will vary from one operating system to another; you should be able to find these formats on-line. The file header contains information such as where the code starts (ie, the offset into this file), how much memory will be needed to hold the code, how much read/write memory the program needs, how much heap it needs (the heap is extra memory that you can allocate dynamically with malloc), and a list of the addresses that need to be fixed by the loader.
July 22nd, 2013, 03:38 PM
So to summarize,
- machine code = op-code + operands (value / memory address)
- register: hardware; temporary memory circuit
1. compiler: .c -> .o
- converts instructions to binary
- places markers for addresses that need to be resolved
2. linker: .o -> .exe
- performs address resolution
3. loader: .exe -> program in memory
- replaces relative addresses with actual addresses, depending on where the program is located in memory
One question I had was why compilers needed declarations but not necessarily definitions.
Your post answered that it had to do with address resolutions, so I'll go read a bit more about it.
Thank you very much for the post.