#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2010
    Posts
    7
    Rep Power
    0

    [Assembly] Developing an Assembler in Theory


    I'm just curious how an assembler converts its source code into machine code.

    I'd assume it compares the character values of the source code to a table from which it'd generate its binary instruction counterparts... and write them to a file, of course.

    Can anyone explain this process and show me any references if they exist?
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Feb 2004
    Location
    San Francisco Bay
    Posts
    1,939
    Rep Power
    1313
    Writing an assembler for a RISC architecture is relatively straightforward, since converting assembly language to machine code is a pretty direct translation. (Assembling for a CISC architecture is in theory fundamentally similar, just more complicated due to, well, the complexity of the instruction set.) However, it's going to seem really tedious if you have to think about it on the level of "compares the character values of the source code to a table." To make it seem less tedious, you should break it down into multiple steps: first the source file is parsed into its logical structure, and then the structure is translated into ones and zeroes.

    That probably isn't clear, so let me give some pseudo-C++.
    Code:
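    #include <vector>

    enum instr_type {
        ADD, SUB, MOV, LDR, STR, // ...
    };

    // One operand: a register, an immediate value, a label reference, etc.
    struct operand {
        // ...
    };

    struct instruction {
        instr_type type;
        std::vector<operand> ops;  // array of operands, in source order
        // etc. (whatever is logically needed to specify a machine instruction)
    };

    // Writes the machine code for i starting at dest.
    // Returns pointer to byte after last written byte
    unsigned char *assemble_instruction(const instruction & i, unsigned char *dest) {
        switch (i.type) {
            // ...
        }
        return dest;
    }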
    That struct instruction is the key element of the logical structure of an assembly program. (It's not the only element: there's also the global data and variables, and the division of the program into sections. For simplicity, I'll ignore those, but you will have to deal with them if you want to write an assembler.) The main process would then roughly be to parse the source code into a sequence of struct instruction values and then to call assemble_instruction once per instruction to translate the sequence into machine code.
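
    To make the translation step concrete, here is a minimal sketch of what encoding might look like for a made-up 16-bit instruction format. The format, the reg_instruction struct, and the encode helper are all assumptions invented for illustration; a real architecture's manual defines the actual bit layout.

```cpp
#include <cstdint>

// Hypothetical 16-bit format, invented for illustration (not a real ISA):
//   bits 15-12 opcode | bits 11-8 rd | bits 7-4 rs1 | bits 3-0 rs2
enum instr_type : std::uint8_t { ADD = 0, SUB = 1, MOV = 2 };

// Simplified instruction with three register operands (hypothetical).
struct reg_instruction {
    instr_type type;
    std::uint8_t rd, rs1, rs2;  // register numbers, each 0-15
};

// Packs the fields into one 16-bit instruction word.
std::uint16_t encode(const reg_instruction &i) {
    return static_cast<std::uint16_t>((i.type << 12) | ((i.rd & 0xF) << 8) |
                                      ((i.rs1 & 0xF) << 4) | (i.rs2 & 0xF));
}

// Stores the word at dest, low byte first (little-endian is also an assumption).
// Returns pointer to byte after last written byte.
std::uint8_t *assemble_instruction(const reg_instruction &i, std::uint8_t *dest) {
    const std::uint16_t word = encode(i);
    *dest++ = static_cast<std::uint8_t>(word & 0xFF);
    *dest++ = static_cast<std::uint8_t>(word >> 8);
    return dest;
}
```

    With this layout, "SUB r1, r2, r3" would become the word 0x1123. The point is only that once the instruction is in structured form, encoding is mostly shifting and masking fields into place.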

    I haven't yet told you how to parse the source file into a sequence of those structs. That itself is usually divided into two steps: first perform lexical analysis to produce a sequence of tokens, and then interpret the tokens into logical units. Roughly speaking, tokens (the output of lexical analysis) are to computer language as words are to natural language. Continuing the analogy, the logical units (the output of parsing) are to computer language what phrases, clauses, sentences, and paragraphs are to natural language. On the other hand, the machine code (the output of assembling) has no counterpart in natural language; the closest thing would be a brain's internal representation of information.
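
    As a rough sketch of the lexical-analysis step, the function below splits one source line into tokens. The syntax it assumes (';' starts a comment, ',' and ':' are tokens of their own, everything else is separated by whitespace) is hypothetical; real assemblers each define their own rules.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Splits one source line into tokens under the assumed (hypothetical) syntax:
// ';' begins a comment, ',' and ':' are single-character tokens,
// and all other tokens are runs of non-whitespace characters.
std::vector<std::string> tokenize(const std::string &line) {
    std::vector<std::string> tokens;
    std::string current;
    // Emits the token collected so far, if any.
    auto flush = [&] {
        if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    };
    for (char c : line) {
        if (c == ';') break;  // the rest of the line is a comment
        if (std::isspace(static_cast<unsigned char>(c))) {
            flush();
        } else if (c == ',' || c == ':') {
            flush();
            tokens.push_back(std::string(1, c));
        } else {
            current += c;
        }
    }
    flush();
    return tokens;
}
```

    For the line "loop: add r1, r2, r3 ; sum" this yields the tokens loop, :, add, r1, ",", r2, ",", r3. A parser would then turn that token sequence into a label definition plus one instruction.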

    There are a few other things that make the whole process more complicated than the above might suggest. One complication in the machine-code-generation step is labels: to assemble an instruction that refers to a label (jumps being the canonical example), you have to know what address that label points to. A common technique to handle this is to make two passes: first go through and figure out where all the labels point and put that information in a symbol table, and then go through it again to do the assembling normally, now that you know what the labels refer to.
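
    The two-pass idea can be sketched as follows. The line struct and the fixed 4-byte instruction size are simplifying assumptions for illustration; with variable-length instructions, pass 1 would need to compute each instruction's real size.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One parsed source line (hypothetical, simplified): an optional label defined
// on the line, plus an optional label that the instruction jumps to.
struct line {
    std::string label;   // label defined here, or empty
    std::string target;  // label referenced by the instruction, or empty
};

// Assumption: every instruction occupies exactly 4 bytes.
constexpr std::uint32_t kInstrSize = 4;

// Pass 1: walk the program once, recording the address of every label.
std::map<std::string, std::uint32_t> build_symbol_table(const std::vector<line> &prog) {
    std::map<std::string, std::uint32_t> symbols;
    std::uint32_t address = 0;
    for (const line &l : prog) {
        if (!l.label.empty()) symbols[l.label] = address;
        address += kInstrSize;
    }
    return symbols;
}

// Pass 2: look up each referenced label in the table built by pass 1.
// Lines without a target yield 0 here; a real assembler would emit the
// encoded instruction at this point. Throws std::out_of_range if a label
// is used but never defined.
std::vector<std::uint32_t> resolve_targets(const std::vector<line> &prog) {
    const auto symbols = build_symbol_table(prog);
    std::vector<std::uint32_t> resolved;
    for (const line &l : prog) {
        resolved.push_back(l.target.empty() ? 0 : symbols.at(l.target));
    }
    return resolved;
}
```

    Note that a forward reference (a jump to a label defined later in the file) works here only because the whole table is built before any target is looked up, which is the reason for making two passes.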

    Of course, after you've completed all that, you still need to produce a valid object file, whose format depends on the operating system. I don't know much about this part, but if you've done everything up to this point, you can probably finish it by relying heavily on documentation.
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2010
    Posts
    7
    Rep Power
    0
    Parsing and tokenizing this won't be too much of a hassle for me. I'm more concerned with how to produce the form of a binary executable, since this language is not too hard to translate over.

    Things like the addressing are what I have to worry about, but what you recommended is simple enough: collect the labels and look them up whenever needed.

    But my concern lies in the addressing itself. Is it relative to the beginning of the program? Or perhaps assemblers pass the difference between the addresses of the 'jmp' and the label, so the instruction pointer register thingy can be changed relative to the current instruction. I'll see if my reading clues me in on this tomorrow, maybe.

    But yeah, thanks for the reply Lux. It was pretty quick and informative. I've sort of confirmed what I wanted to know before, and now I'm going to wander into the more specific procedures.
  6. #4
  7. Contributing User
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Sep 2007
    Location
    outside Washington DC
    Posts
    2,642
    Rep Power
    3700
    Standard practice is for a compiler to generate its output, together with a symbol table, as input to a link program (sometimes called a linkloader).

    The compiler generates binary that corresponds to something like

    LOAD A, #age

    and the linker fills in the proper address for where the "age" variable is located in memory.

    This happens for most addresses, things like function entry points, arguments, constants, etc.
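
    A minimal sketch of that fill-in step, assuming a made-up relocation record that says "the 4 bytes at this offset should hold the address of this symbol". The relocation struct and the little-endian 32-bit address are assumptions; real object formats define their own relocation entries.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// A relocation entry (hypothetical layout): the 4 bytes at `offset` in the
// code should be overwritten with the final address of `symbol`.
struct relocation {
    std::size_t offset;
    std::string symbol;
};

// Patches every relocation site with its symbol's address, low byte first.
// Throws std::out_of_range if a symbol is undefined or a site lies outside the code.
void apply_relocations(std::vector<std::uint8_t> &code,
                       const std::vector<relocation> &relocs,
                       const std::map<std::string, std::uint32_t> &addresses) {
    for (const relocation &r : relocs) {
        const std::uint32_t addr = addresses.at(r.symbol);
        for (std::size_t b = 0; b < 4; ++b) {
            code.at(r.offset + b) = static_cast<std::uint8_t>(addr >> (8 * b));
        }
    }
}
```

    In this picture, the compiler or assembler would emit the LOAD instruction with placeholder bytes for the address plus one relocation entry naming "age", and the linker would call something like apply_relocations once it has decided where "age" lives.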

    Any good Computer Science book on compilers will have tons of detail on how this is done.
