August 18th, 2012, 04:51 AM
Making of C Compiler
I want to know how exactly C code gets executed. What exactly happens when we give the command to compile a C program?
August 18th, 2012, 06:05 AM
August 19th, 2012, 07:17 AM
C code is converted to assembler, which is then assembled into binary instructions (usually position independent). Nothing more happens until the linker/loader is invoked when you run the program: the linker resolves any calls into other object (binary) files, and the loader puts the executable image into RAM at a specific location and then launches the startup routine.
This isn't really a C/C++ question. The compiler converts the human readable text into a binary representation specific to the OS/hardware the program is expected to run on.
I have a couple of examples of code written directly in binary instructions if you are curious.
August 21st, 2012, 12:35 AM
Originally Posted by sagarkamble

Broadly, the compiler performs these stages:

1. Preprocess (insert #included files, replace macros by their definition, etc.)
2. Compile to assembly code
3. Assemble to machine code

It's helpful to view the result of each stage. I'm going to use gcc for my examples because that's what I'm familiar with. It's most instructive to do this with a small "hello world"-type program. Suppose the source file you want to compile is called program.c.

1. Preprocess:

gcc -E program.c > program.i

Now, you can peruse program.i and see all the header files spliced right in and all your macros expanded. If you've never read stdio.h, you might be surprised by how big the preprocessed file is.

2. Compile:

gcc -S program.c

This creates an assembly source file called program.s, which you can view in a text editor. This part really benefits from using a very simple "hello world" program; you might be surprised how short the assembly source is, especially compared to how long the preprocessed file was! With a little effort, you can probably even understand the assembly code.

3. Assemble:

gcc -c program.c

You now have an object file, program.o. This is one small step away from being an executable program. The main difference has to do with any external functions or variables (for example, printf) you use in your source code: at this stage of the process, GCC has not tried to track those down for you, so they remain unresolved references in the object file. The object file is a binary file and thus not human readable, but there are utilities for extracting information from object files. A pretty useful one is nm, which can show all the external references in an object file:

nm program.o

You should see all the non-static functions in your code listed, as well as any standard library functions like printf or scanf.

4. Link:

gcc program.o -o program

Not much to do now but run the program! (Of course, the compiled program is not human-readable. You can still get information from it in various ways, but I don't think that's relevant to your question any more.)
August 21st, 2012, 01:43 AM
You're asking two different questions there, so we're a bit confused as to what you want.
I'll assume that you're asking about the C build process:
When the build involves multiple source files, each source file is compiled separately: the compiler starts each compilation with absolutely no knowledge of what it found in any other source file. That is why you place type definitions, macros, extern variable declarations, and function prototypes in a header file associated with a source file, so that the other source files can #include it to learn what's in that source file.
With each source file, the compiler first runs the preprocessor, which executes the commands that start with #, such as #define, #include, and #ifdef. The preprocessor inserts the files indicated by the #include commands, expands macros (which are defined by #define), interprets conditional compilation commands by including or excluding the indicated code, etc. Basically, the preprocessor creates the final compilable form of the source file. In many compilers, you can command the compiler to output a file containing that final compilable form; how to do that differs from one compiler to another.
Then the compiler does its thing: parsing the source code, building symbol tables, translating the source code to assembly (or to an intermediate form that is then converted to assembly), and converting the assembly to object code, which is mostly but not quite machine code. That object code goes into an object file (e.g., .obj in Microsoft, .o in Linux), which also marks up the object code wherever it accesses external resources (these are called unresolved symbols) and contains tables for the linker to use in resolving those unresolved symbols. The actual tables and file format depend on the compiler, etc.
When all the source files have been compiled, the linker is invoked to generate the executable. The linker takes all the object files and all the referenced libraries (special object files designed for reuse; .LIB in Microsoft and .a in Linux, the Standard C Library is an example, though you could create your own libraries) and links them all together in the executable, generating location tables in the process. Then it uses those tables to go through each object file and replace the "unresolved symbol" markers with the actual address of each symbol, AKA "resolving the addresses".
Each step depends on what can be known at that time -- these times being known as "compile-time", "link-time", and "run-time" -- and it is absolutely vital to know which "time" you are in. At compile-time, compiling a source file depends on header files to tell the compiler what should exist in other source files or in libraries being linked in, so the object code contains markers, AKA "place holders", for address information to be inserted later. At link-time, linking handles all of that, but it still does not know exactly where in memory the program will be loaded, so the linker has no idea of the exact memory location of each variable and function, information which is absolutely necessary for the code to actually execute.
For that reason, all addresses in the executable are resolved relative to a common starting address, and the location of each address is marked either in the object code or in a relocation table in the executable. Then when you execute the program, you do so through a loader, which obtains a block of memory for the program and then performs relocation of all the addresses. That creates in memory an image that is executable as-is; note that in embedded programming, the end result is a memory image that can be loaded into some kind of PROM (programmable read-only memory).
If you can get your hands on it, read The MS-DOS Encyclopedia (Microsoft Press, 1988 -- decades out-of-print by now, obviously). It not only explains the process excellently (it's where I learned what the loader does), but also provides and explains the file formats. Of course, that's the only reason for you to read it and those formats are obsolete. If you can find a similar description of the OS and compiler that you're using, then get that description and read it.
August 26th, 2012, 09:23 PM
Sorry for a somewhat late reply, but if you are interested in executable formats and how they are used, you would do well to check out Linkers and Loaders by John Levine. While it is now somewhat dated, most of the information is still relevant, and an early version of the book is available for reading online on that page.
August 29th, 2012, 03:02 PM
C code is not executed; it is compiled into a machine code executable. It is usually the operating system that is responsible for loading the code and starting its execution. Embedded systems or bootstrap code (where there is no OS) may be started by other mechanisms, but that is probably not what you are asking about here?
Originally Posted by sagarkamble
Compilation of C comprises a number of stages, primarily:
Originally Posted by sagarkamble
- Pre-processing - any line beginning with # is a preprocessor directive. The preprocessor outputs C code with all the #include'd code inserted, all the #define macro instances replaced, and any #if... conditional code included or removed as directed.
- Compilation - the compiler proper generates "object" code. Some compilers generate assembler and then have an assembler pass to generate machine code; others generate machine code directly. The object code output by the compiler does not include the code behind external references to library code or to separately compiled object code - the object file contains unresolved links to such code.
- Linking - the linker is responsible for assembling separate modules and library code into a single executable, and resolving all unresolved references with references to the linked code.
The body of your post does not contain any clear and specific questions, and your title seems to be asking something else altogether; does it mean you want to build a compiler?
Compilers can be complex things. First of all, they are required to generate assembler or machine code, so to create a compiler you must be familiar with the target instruction set and architecture. Moreover, to create a linker you need to know how the OS loads an executable, and the format of the executable file that supports loading. Luckily, C is a rather small and simple language for the most part, but still not insignificant. Modern compilers perform many complex optimisations requiring deep analysis of code flow and instruction execution.
One way to start studying a simple compiler implementation is perhaps to look at the source and documentation for the Tiny C Compiler.