#1
  Commie Mutant Traitor
    Devshed Intermediate (1500 - 1999 posts)

    Join Date: Jun 2004
    Location: Alpharetta, GA
    Posts: 1,806
    Rep Power: 1570

    Compiler writing and new language design


    One of the most common questions that comes up in the Other Languages forum is how to write a compiler for a new language. This is actually two separate questions: a) how do you design and define a language? and b) how do you write a language translator? Both are significant projects in their own right, and of the two, learning about compilers is the one you usually want to tackle first - with some subset of an existing language - before going ahead with anything novel. Many of the skills needed to implement an existing language are the same ones needed to design a new one.

    Writing a simple compiler is a common project for an undergraduate CS course on compilers. There are several textbooks on the topic, with Compilers: Principles, Techniques, and Tools (the 'Dragon Book') being the one used by most courses. In the US at least, most public and university libraries have at least one book on the subject.

    There are also several online tutorials and pages on the subject floating around, of varying quality. Fortunately, there is more than one reasonable way to approach the problem, so even a bad tutorial can be insightful.


    Language design is both easier in some ways and harder in others: easier, because you generally need only describe the language's properties and define its grammar; harder, because it is an open-ended problem that can easily overwhelm you, and because, to make the language practical, you still need to write a translator once the design is finished. There are no clear guidelines; the limits of the project are computability, practicality, and your own patience. Ironically, a simpler language is usually both more flexible and more useful, on the one hand, and harder to design effectively, on the other.

    In order to design a language, you need to choose certain overall principles first. Is it a special-purpose or a general-purpose language, and if it is special-purpose, what is it for? If it is general-purpose, does it fit a particular paradigm (procedural programming, OOP, functional programming, relational/constraint programming), or is it meant to be multi-paradigm? What kind of notation does it use - is it expression-structured (that is, every statement has some sort of value or return) or statement-structured (i.e., individual lines of code form statements which are executed but have no value of their own)? How are statements or expressions organized into functions or subroutines? How are data structures defined and combined? Even taking an existing, familiar language as a pattern, there are a surprisingly large number of decisions to make.

    Last edited by Schol-R-LEA; August 28th, 2009 at 11:06 PM.
    Rev First Speaker Schol-R-LEA;2 JAM LCF ELF KoR KCO BiWM TGIF
    #define KINSEY (rand() % 7) λ Scheme is the Red Pill
    Scheme in Short Understanding the C/C++ Preprocessor
    Taming Python A Highly Opinionated Review of Programming Languages for the Novice, v1.1

    FOR SALE: One ShapeSystem 2300 CMD, extensively modified for human use. Includes s/w for anthro, transgender, sex-appeal enhance, & Gillian Anderson and Jason D. Poit clone forms. Some wear. $4500 obo. tverres@et.ins.gov
  #2
    Commie Mutant Traitor
    To continue:

    As you presumably know, in order to run any computer program, you first need a translator that can convert the program into a form the computer can run. There are three basic classes of translators: assemblers, which convert a (more or less) direct representation of the CPU's operations into the actual running form; interpreters, which read the source and perform its operations directly; and compilers, which convert one language into another, usually translating a high-level language into a low-level one (assembly language or machine code). There is some overlap between these; for example, most interpreters partially compile the source into an intermediate representation which can be interpreted more easily, while many compilers target an intermediate language (variously called 'p-code', 'bytecode', or 'CIL') which is then interpreted on a pseudo-machine emulator. Because interpreting and assembling are generally seen as subsets of compiling (that is to say, you need more or less the same skills for all three, but compiling is usually the most complicated), most discussions of the subject assume you'll be writing a compiler.

    Compiling generally has three main stages: lexical analysis (breaking the source code into a stream of tokens), parsing (processing the token stream for its grammatical structure), and code generation (producing the output, whether as an executable file or as source code in a different language). Most compilers use some sort of intermediate stage between the parser and the code generator (to separate the parts specific to the language from the parts specific to the target system, making it easier to re-target the compiler), and many have an optimizer between the parser and the code generator, after the code generator, or both. It is possible to write a compiler in some other way, but this is by far the best-known approach, and almost always the easiest.

    The first step (whether you are implementing an old language or a new one) is to write a grammar for the language. Almost all modern programming languages are designed so that they can be defined with a context-free grammar, often written in some variant of Backus-Naur Form. While this only defines the syntax of the language, it serves as the 'skeleton' for the semantics, and defines the tokens which the lexer will need to accept. For almost all modern programming languages, the grammar is the core of the design, especially since much of what might otherwise be 'semantic' can be pushed off into the syntax.
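
    For example, a minimal grammar for arithmetic expressions might look like this in an EBNF-style notation (a toy example for illustration, not taken from any particular language):
    Code:
    expr   ::= term { ( "+" | "-" ) term }
    term   ::= factor { ( "*" | "/" ) factor }
    factor ::= NUMBER | IDENTIFIER | "(" expr ")"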

    You will then need to decide on the semantics of the grammar, that is to say, what the language should do with a given parsed statement. This is one of the hardest parts, as there is no generally accepted formalism for defining semantics the way EBNF defines syntax (the formalisms that do exist are often hard to work with, and/or specific to certain types of languages). Fortunately, if you were careful about the grammar design, the semantics should not be too difficult to decide upon. While I am describing this as a linear process, it is usually an iterative one; you may need to go back and forth between the grammar and the semantics as you design them.
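
    To make that concrete: suppose the parser builds tree nodes as tuples like ('binop', '+', left, right). A sketch of an evaluator in Python (hypothetical code, not from the attached compiler) can then state the meaning of each production directly:
    Code:
    import operator

    # One meaning per production: numbers denote themselves, identifiers
    # denote their binding in the environment, and a binary operation
    # denotes its operator applied to the meanings of its subtrees.
    OPS = {'+': operator.add, '-': operator.sub,
           '*': operator.mul, '/': operator.truediv}

    def eval_node(node, env):
        kind = node[0]
        if kind == 'num':
            return node[1]
        if kind == 'var':
            return env[node[1]]
        if kind == 'binop':
            return OPS[node[1]](eval_node(node[2], env), eval_node(node[3], env))
        raise ValueError('unknown node kind: %r' % kind)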

    To write the lexer, you generally want to define a deterministic finite automaton (DFA) which can 'recognize' the tokens of the language. It is not too difficult to write a lexical analyzer by hand, but the task is simple enough that it can be automated with tools such as lex. However you write it, it should turn the source text into a stream of tokens, with each token composed of a type (e.g., IF_STATEMENT, FLOAT_NUMBER, IDENTIFIER, etc.) and the actual text of the lexeme. The lexer usually also builds a symbol table of all of the identifiers (words not otherwise recognized, such as variable names), which gives the parser and code generator a 'placeholder' to use later.
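
    As a sketch of how a hand-rolled lexer might look (hypothetical Python, with made-up token names), a table of regular expressions can stand in for the DFA, since each regular expression is equivalent to one:
    Code:
    import re

    # Each (type, regex) pair describes one token class; order matters,
    # since keywords must be tried before the general IDENTIFIER rule.
    TOKEN_SPEC = [
        ('FLOAT_NUMBER', r'\d+\.\d+'),
        ('INTEGER',      r'\d+'),
        ('IF_STATEMENT', r'\bif\b'),
        ('IDENTIFIER',   r'[A-Za-z_]\w*'),
        ('OPERATOR',     r'[-+*/()=]'),
        ('SKIP',         r'\s+'),
    ]
    MASTER = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC))

    def tokenize(source):
        # Yield (type, lexeme) pairs, skipping whitespace.
        pos = 0
        while pos < len(source):
            m = MASTER.match(source, pos)
            if m is None:
                raise SyntaxError('unrecognized character at position %d' % pos)
            pos = m.end()
            if m.lastgroup != 'SKIP':
                yield (m.lastgroup, m.group())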

    As with the lexer, the parser can either be written by hand or generated with a tool such as yacc or ANTLR. Generally speaking, the parser is written based directly on the grammar. However, there is a complication: there is often more than one way to define a language's grammar, and most approaches to parsing only work with certain classes of grammars. Most hand-coded parsers use a method called 'recursive descent', which is easy to write by hand; recursive-descent parsers need what is called an LL(1) grammar (left-to-right scan, leftmost derivation, one token of lookahead), while many parser generators produce table-driven parsers that require an LALR(1) grammar (left-to-right scan, rightmost derivation in reverse). Two such grammars may define the same language, but they represent it in different ways, which affects how it is processed.
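
    Here is a minimal recursive-descent parser for the toy expression grammar above, again as a hypothetical Python sketch: one method per nonterminal, each consuming tokens from the lexer and returning a tuple-shaped tree node.
    Code:
    class Parser:
        def __init__(self, tokens):
            self.tokens = list(tokens)
            self.pos = 0

        def peek(self):
            return self.tokens[self.pos] if self.pos < len(self.tokens) else ('EOF', '')

        def advance(self):
            tok = self.peek()
            self.pos += 1
            return tok

        def expr(self):    # expr ::= term { ("+" | "-") term }
            node = self.term()
            while self.peek()[1] in ('+', '-'):
                op = self.advance()[1]
                node = ('binop', op, node, self.term())
            return node

        def term(self):    # term ::= factor { ("*" | "/") factor }
            node = self.factor()
            while self.peek()[1] in ('*', '/'):
                op = self.advance()[1]
                node = ('binop', op, node, self.factor())
            return node

        def factor(self):  # factor ::= NUMBER | IDENTIFIER | "(" expr ")"
            kind, text = self.advance()
            if kind in ('INTEGER', 'FLOAT_NUMBER'):
                return ('num', float(text))
            if kind == 'IDENTIFIER':
                return ('var', text)
            if text == '(':
                node = self.expr()
                self.advance()  # consume the closing ')'
                return node
            raise SyntaxError('unexpected token: %r' % text)

    # Example: Parser(tokenize('a * (b + 2)')).expr() returns
    # ('binop', '*', ('var', 'a'), ('binop', '+', ('var', 'b'), ('num', 2.0)))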

    Unlike lexical analysis and parsing, there really is no established method for defining a code generator; by its nature, defining the semantics of a Turing-complete language is harder than defining a context-free grammar. There are a number of approaches to formal semantics around, but each of them has weaknesses which limit its acceptance. Furthermore, the code generator itself relies on a detailed understanding of the target language, and mixing the details of the target into the semantics of the source language makes portability nearly impossible. For these reasons, among others, it is common to separate the parser from the code generator with an intermediate stage, often in the form of an intermediate language which describes the grammar productions. Recursive-descent compilers often skip this and emit code directly, but table-driven compilers almost always have some sort of intermediate code.
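
    To sketch the idea with the same hypothetical tuple-shaped trees: a code generator can walk the tree and flatten it into a simple intermediate form such as three-address code, leaving real instruction selection to a later, target-specific stage.
    Code:
    import itertools

    _temps = itertools.count(1)

    def gen(node, code):
        # Append three-address instructions to `code` and return the name
        # of the variable or temporary holding the node's value.
        kind = node[0]
        if kind == 'num':
            return repr(node[1])
        if kind == 'var':
            return node[1]
        left = gen(node[2], code)
        right = gen(node[3], code)
        result = 't%d' % next(_temps)
        code.append('%s = %s %s %s' % (result, left, node[1], right))
        return result

    # Example: for a * (b + 2.0) this emits
    #   t1 = b + 2.0
    #   t2 = a * t1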

    Optimization is a subject unto itself, and something of a black art; but generally speaking, there are two places where optimization can be performed: on the intermediate language or parse tree, and on the code that is emitted. The latter ('peephole optimization') is generally more limited, as much of the semantic information has already been discarded, but it can apply optimizations that are specific to the target. Parse-tree optimizations tend to be more abstract, such as constant folding or common subexpression elimination, and usually apply to any kind of target.
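
    A tiny example of a tree-level optimization, again in the hypothetical tuple representation: constant folding replaces any subtree whose operands are all known at compile time with its computed value (OPS is the operator table from the evaluator sketch above).
    Code:
    def fold(node):
        # Bottom-up constant folding on the tuple-shaped tree.
        if node[0] != 'binop':
            return node
        left, right = fold(node[2]), fold(node[3])
        if left[0] == 'num' and right[0] == 'num':
            return ('num', OPS[node[1]](left[1], right[1]))
        return ('binop', node[1], left, right)

    # ('binop', '+', ('num', 2.0), ('num', 3.0)) folds to ('num', 5.0),
    # so no code is emitted for the addition at run time.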

    This thread includes an attachment with the source code for a simple compiler for a subset of an older language named Algol. The compiler is written in Python, and while it is an unusual design in some respects, it should serve as an example of a working (if very limited) compiler.
  #3
    Commie Mutant Traitor
    One option that is often overlooked when writing a compiler for a new language is to use an existing compiler suite as a back end. While I would not recommend it as a place to learn how compilers work or how to write one, for someone looking to create a production-quality open source compiler for a new language in a short time, the GNU Compiler Collection represents a fast way to get a compiler that runs on a wide variety of platforms. While it is not the best compiler in the world, nor the easiest to work with, it has the advantage that it is highly portable, well understood by programmers around the world, and allows you to leverage millions of man-hours of design and debugging work for your project.


    This is not a project for a beginner, but for an experienced programmer working on a new language design, it does present a way of getting a compiler up and running.

    Comments on this post

    • drgroove agrees: You should write an article on compilers and submit it to Devshed, and get paid for it! :D
  #4
    Redpill
    Devshed Intermediate (1500 - 1999 posts)

    Join Date: Nov 2005
    Posts: 1,660
    Rep Power: 151
    The Low Level Virtual Machine (LLVM) is much younger (and less developed) than GCC, but this also means that its code is saner to work with. It is mostly written in C++, while GCC is all C.

    Its existing C front end is partly based on GCC, but there is also a side project (Clang) to implement a new C and C++ compiler using LLVM as a back end.
    Code:
    #include <stdio.h>
    int main(int o,char**O){return o>-1?o-2||!main(-1,1+O)?!!fprintf(stderr,"%s [0-"
    "9]{81}\n",*O):main(-83,++O):o>-83?(*O)[-1-o]?81==(o=-o-1)||o[*O]<'0'||'9'<o[*O]
    ?0:main(-2-o,O):o==-82:o>-164?(*O)[-83-o]<'1'?main(o-82,O):main(--o,O):o+164?o>-
    246?(*O)[-165-o]<'1'?main(o-82,O):main(--o,O):o+246?o>-328?(*O)[o=-o-247]<='8'?(
    main(-328-o,(++o[*O],O)),main(-247-o,O)):!(o[*O]='0'):(o=-o-328)<729?(o%9/3*3-o%
    27+o/243*9+o/81%3&&(*O)[o%81]==(*O)[o%81-o%27+o%9/3*3+o/243*9+o/81%3])||(o%81-o%
    9-o/81*9&&(*O)[o%81]==(*O)[o%9+o/81*9])||(o/81-o%9&&(*O)[o%81]==(*O)[o%81-o%9+o/
    81])?0:main(-409-o,O):main(-165-o%81,O):!puts(*O):0                           ;}
