Everything you were afraid to ask about how a compiler/interpreter works.
Let's start by discussing a very simple form of a compiler/interpreter:
Source File ---> Scanner ---> Lexer ---> Parser ---> Interpreter/Code Generator
So what do all the terms here mean. We'll work on that as we go:
Source File: This is the program that is read by the compiler or interpreter. This is the text that needs to be compiled or interpreted.
Scanner: This is the first module in a compiler or interpreter. Its job is to read the source file one character at a time. It can also keep track of which line number and character is currently being read. A typical scanner can be instructed to move backwards and forwards through the source file. Why do we need to move backwards? We will see why in just a little bit when we examine the lexer. For now, assume that each time the scanner is called, it returns the next character in the file.
Lexer: This module serves to break up the source file into chunks (called tokens). It calls the scanner to get characters one at a time and organizes them into tokens and token types. For instance, if the source file read something like this:
Code:
cx = cy + 324;
print "value of cx is ", cx;
a lexer would perhaps break it like this:
Code:
cx --> Identifier (variable)
= --> Symbol (assignment operator)
cy --> Identifier (variable)
+ --> Symbol (addition operator)
324 --> Numeric constant (integer)
; --> Symbol (end of statement)
print --> Identifier (keyword)
"value of cx is " --> String constant
, --> Symbol (string concatenation operator)
cx --> Identifier (variable)
; --> Symbol (end of statement)
Thus, the lexer calls the scanner to pass it one character at a time and groups them together and identifies them up as tokens for the language parser (which is the next stage). It also identifies the type of token (variable vs. keyword, assignment operator vs. addition operator vs. string concatenation operator etc.) Occasionally, the lexer has to tell the scanner to back up though. Consider a language that has operators that may be more than one character long (! vs. !=, < vs. <=, + vs. ++ etc.). Assume that the lexer has requested the scanner for a character and it has returned '<'. The lexer needs to determine whether the operator is a < or a <=. So it requests the scanner for another character. If the next character is a '=', it changes the token to '<=' and passes it to the parser. If not, it tells the scanner to back up one character and hold it in the buffer, while it passes the '<' to the parser.
Parser: This is the part of the compiler that really understands the syntax of the language. It calls the lexer to get tokens and processes the tokens per the syntax of the language. For instance, taking the example from the lexer above, the hypothetical interaction between the lexer and parser could go like this:
Code:
Parser: Give me the next token
Lexer: Next token is "cx" which is a variable.
Parser: Ok, I have "cx" as a declared integer variable. Give me next token
Lexer: Next token is "=", the assignment operator.
Parser: Ok, the program wants me to assign something to "cx". Next token
Lexer: The next token is "cy" which is a variable.
Parser: Ok, I know "cy" is an integer variable. Next token please
Lexer: The next token is '+', which is an addition operator.
Parser: Ok, so I need to add something to the value in "cy". Next token please.
Lexer: The next token is "324", which is an integer.
Parser: Ok, both "cy" and "324" are integers, so I can add them. Next token please:
Lexer: The next token is ";" which is end of statement.
Parser: Ok, I will evaluate "cy + 324" and get the answer
Parser: I'll take the answer from "cy + 324" and assign it to "cx"
In the above, the indenting shows a subprocess that the parser enters, to evaluate "cy + 324". This gives you a decent idea about how the parser operates. Also note that the parser is checking types and syntax rules (for instance, it checked whether cy and 324 were both integer types before adding them). If the parser gets a token that it was not expecting, it will stop processing and complain to the user about an error. The Scanner holds the current line number and character, so the Parser can inform the user approximately where the error occurred.
Interpreter/Code Generator: This is the part that actually takes the action that is specified by a program statement. In some cases, this is actually part of the parser (especially for interpreters) and the parser interprets and takes action directly. In other cases, the parser converts the statements into byte-code (intermediate language). In case of a compiler, it then hands them to the Code Generator to convert into machine code instructions. If you want a compiler for a different CPU or architecture, all you have to do is put a new code generator unit to translate the byte code into machine code for the new CPU.
This is about the simplest form for an interpreter or compiler. In the next few threads, we will look in some more detail at the interaction between the Parser and Code Generator.
Feedback about this post will be greatly appreciated.