#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    7
    Rep Power
    0

    Need tool where to specify indentation block grammar


    I'm trying to write a compiler for a language that uses indentation only to indicate code blocks, and newline to indicate end of statement.

    I have been using bison, but I cannot figure out how to specify such a grammar in it.

    The trouble is with empty blocks after statements that optionally start a block, such as while loops. An empty block is a newline followed by an indentation equal or less than the line that starts the block. But that indentation may also signal that the outer block is closed.

    In order to keep the lexer simple, a newline followed by whitespace can emit only one type of token per pattern match, so it emits either INDENT, DEDENT, or NODENT, with the DEDENT possibly emitted multiple times. But emitting multiple different tokens for the same pattern is complexity I am trying to avoid.

    Is there a tool that would allow me to specify such a grammar without complicating the lexer?

    I am using C++.
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    May 2007
    Posts
    765
    Rep Power
    928
    Could you show the grammar you're using and the specific token string that's causing a problem? I don't see the problem offhand--the while production should only care about a following INDENT token, if that's not present then the next NODENT or DEDENT doesn't belong to the while anyway.

    E.g. I would use a grammar something like this, where the while production ends if it doesn't get an INDENT leaving whatever token is next for the containing block to deal with.

    Code:
    while-loop  ::= WHILE block
    
    block       ::= INDENT statement block-lines
                  | <empty>
    
    block-lines ::= NODENT statement block-lines
                  | DEDENT
    sub{*{$::{$_}}{CODE}==$_[0]&& print for(%:: )}->(\&Meh);
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    7
    Rep Power
    0
    Sure, here's what I'm trying to do in grammar terms (using bison syntax).

    Code:
    while_statement
        : WHILE condition block {}
        ;
    
    block
        : INDENT statements DEDENT {/*non empty block is straight forward*/}
        | NODENT {/*this doesn't work if while loop is last statement in block*/}
        | {/*empty block being empty doesn't work, because it allows while not to be followed by newline*/}
       ;
    There's also some awkwardness with the fact that statements that end in non empty blocks don't need an extra newline (NODENT) to separate them. That's resolvable for the non empty case, but I can't differentiate based on whether the block is empty or not to determine if the newline is needed.

    I could post more grammar to explain that last paragraph more, if it helps.
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    May 2007
    Posts
    765
    Rep Power
    928
    I think I understand now. I don't think you're going to get out of this without a lexer change (unless you're willing to do something weird like pushing tokens back on to the lexer stream).

    The problem is INDENT/DEDENT/NODENT are conflating two tokens, the newline terminating the statement and the indentation change of the next line. If you separate these so that you have an explicit NEWLINE token plus the INDENT/DEDENT/NODENT. That way the while-statement can take the NEWLINE to terminate itself. (I think you can dispense with the NODENTs in this case.) It may also solve your problem if your lexer emits a NODENT following every INDENT or string of DEDENTs (and a statement list requires NODENTs between each statement)--though my brains a bit slow at the moment so I have not puzzled out all of the corner cases.

    (So maybe that's why Python doesn't permit empty blocks...)
    sub{*{$::{$_}}{CODE}==$_[0]&& print for(%:: )}->(\&Meh);
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    7
    Rep Power
    0
    Well, this strikes me as something that should be specified in the grammar, with the right tool. I can't possibly be the first person writing a parser like this, and it's easy enough to specify the correct behaviour in english. In so far as your analysis is true, and I'm conflating two tokens, then it's similar to C++'s problem with closing templates with >>.

    The trouble with treating this as two tokens is that DEDENT could be a newline followed by no whitespace. So it's a single character, split into two tokens.

    Thinking about it more, I did come up with an idea that should work in bison, but it's ugly, and could hide bugs in the grammar. I could define a block like this:
    Code:
    block
        : INDENT statements DEDENT {/*non empty block is straight forward*/}
        | {/*empty block*/}
        | statement {error}
       ;
    That makes the grammar ambiguous as bison sees it, but as long as the rules are specified in the right order, it should prefer to parse "while (true) 1+1" as a statement in the loop, instead of a loop followed by a statement. But bison will complain for every statement, and It might not tell me effectively when two statement specifications conflict for real.

    I'm leaning towards changing the lexer, but I would prefer a tool that would allow me to specify the grammar cleanly as it is.
  10. #6
  11. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    May 2007
    Posts
    765
    Rep Power
    928
    Originally Posted by King Mir 2
    Well, this strikes me as something that should be specified in the grammar, with the right tool. I can't possibly be the first person writing a parser like this, and it's easy enough to specify the correct behaviour in english.
    It's easy enough to specify in english because you can use the concept of a newline. There's no NEWLINE token available to your grammer though since the lexer isn't preserving that information. That's why there's so much difficulty describing the behavior you want in the parser.

    Originally Posted by King Mir 2
    The trouble with treating this as two tokens is that DEDENT could be a newline followed by no whitespace. So it's a single character, split into two tokens.
    Your lexer is already able to turn a newline into many DEDENT tokens, so it should be able to handle a newline into a NEWLINE + *DENT tokens.

    Originally Posted by King Mir 2
    Thinking about it more, I did come up with an idea that should work in bison, but it's ugly, and could hide bugs in the grammar. I could define a block like this...That makes the grammar ambiguous as bison sees it, but as long as the rules are specified in the right order, it should prefer to parse "while (true) 1+1" as a statement in the loop, instead of a loop followed by a statement.
    Here are two cases you might want to check against your parser--they caused trouble with some of the solutions I was contemplating.

    Code:
    # good
    while 1
       2 + 2
    3 + 3
    
    # bad
    while 1
       2 + 2 3 + 3
    Originally Posted by King Mir 2
    I'm leaning towards changing the lexer, but I would prefer a tool that would allow me to specify the grammar cleanly as it is.
    I'm not sure what type of tool could help you. An LALR(1) grammar is an LALR(1) grammar regardless of whether bison or yapp is generating the parser.
    sub{*{$::{$_}}{CODE}==$_[0]&& print for(%:: )}->(\&Meh);
  12. #7
  13. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    7
    Rep Power
    0
    Originally Posted by OmegaZero
    Your lexer is already able to turn a newline into many DEDENT tokens, so it should be able to handle a newline into a NEWLINE + *DENT tokens.
    Token repetition is simpler, because it sends the same token multiple times, with a counter. Sending 2 different tokens is not the same, and the documentation of the lexer generator (quex) says that it's prohibited under the same settings. It would need a token queue.
    Here are two cases you might want to check against your parser--they caused trouble with some of the solutions I was contemplating.

    Code:
    # good
    while 1
       2 + 2
    3 + 3
    
    # bad
    while 1
       2 + 2 3 + 3
    So I have been able to write a grammar that has some statements that require a newline after them, and some that don't. It looks something like this:
    Code:
    file_blocks
        : function_def
        | class_def
        | namespace_def
        ;
    
    file_statement
        : variable_def
        | using_decl
        ;
    
    file_blocks_or_statement
        : file_blocks
        | file_statement 
        | file_blocks file_statement {}
        ;
    
    file_scope
        : file_blocks_or_statement
        | file_scope '\n' m_nls file_blocks_or_statement {}
        ;
    
    file:  
        | m_nls file_scope m_nls
        ;
    There's a bit more to handle syntax errors smoothly, but that's the idea. Works great with non-empty blocks.

    I'm not sure what type of tool could help you. An LALR(1) grammar is an LALR(1) grammar regardless of whether bison or yapp is generating the parser.
    Is there another kind of grammar that makes this easy to do?
  14. #8
  15. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    May 2007
    Posts
    765
    Rep Power
    928
    I have not heard of quex before. It seems a bit odd that you can't send several different tokens as an action--many of these lexer generators permit you to attach arbitrary C code to a pattern to manipulate the output. Maybe a different lexer would work better for you.

    Is that code something new you're trying? If not I'm not sure what the trouble is. If you can already parse for the '\n' in bison can't you check for that after the while condition?

    There are other types of parsers out there. I think your situation could be resolved by one that can do look-ahead without consuming input or if the parser could push tokens back onto the lexer stream--I don't know what you're call a parser with that feature though. It seems much simpler to me to make sure you retain all the information you need to perform the parse.


    Thinking a bit more, the information is there--just not in the most convenient format since the statement separator is mixed up with the tokens that start and end a block. Every method I try to work around that runs into the problem that in the token string:
    Code:
    WHILE x INDENT y DEDENT z
    the block ends up eating the DEDENT that separates the while statement from the "z" statement. All I can think of is hoisting a lot of ugly work into the statements rule like

    Code:
    statements ::= line-statement NODENT statements
                 | block-statement DEDENT statements
                 | line-statement
                 | block-statement DEDENT
    
    line-statement ::= expression
                     | WHILE expression
    
    block-statement ::= WHILE expression INDENT statements
    so that DEDENT can do double the work.
    sub{*{$::{$_}}{CODE}==$_[0]&& print for(%:: )}->(\&Meh);
  16. #9
  17. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2010
    Posts
    13
    Rep Power
    0
    Originally Posted by OmegaZero
    I have not heard of quex before. It seems a bit odd that you can't send several different tokens as an action--many of these lexer generators permit you to attach arbitrary C code to a pattern to manipulate the output. Maybe a different lexer would work better for you.
    In my research for different lexers, I found quex to have the best unicode support. That's why I chose it. There are several possibilities with messing with the lexer to get this to work. Including token queues, which it supports natively. It just seems awkward.

    And by the way, flex (and lex), which is the lexer bison was designed to work with, would also have the same problem with sending multiple tokens per pattern. Worse, it doesn't have support for token repetition.

    Is that code something new you're trying? If not I'm not sure what the trouble is. If you can already parse for the '\n' in bison can't you check for that after the while condition?
    That is basically the code I was alluding to before, specifying that blocks, which end in DEDENT, don't need a newline after them. It doesn't solve the problem of the empty block, but it works for the cases you listed.

    There are other types of parsers out there. I think your situation could be resolved by one that can do look-ahead without consuming input or if the parser could push tokens back onto the lexer stream--I don't know what you're call a parser with that feature though. It seems much simpler to me to make sure you retain all the information you need to perform the parse.
    Exactly. Just yesterday I was taking a deeper look at the bison documentation, and found that it does, sometimes, provide the lookahead token in the match action. So that'll do the trick. Probably.

    For namespaces, the not quite working code looks like this:
    Code:
    namespace_block
        : {
                if(yychar != '\n' && yychar != token::T_DEDENT
                        && yychar != token::T_TERMINATION){
                    error(@$,"syntax error, namespace declaration must be followed "
                            "by newline or end of file");
                    YYERROR; /*starts error recovery*/
                }
                $$=ParseNode();/*empty, valid block*/
            }
        | T_INDENT file_scope T_DEDENT {
                /*handle regular block*/
            }
        ;
    This looks like a promising solution. It's what I'm trying now. It's a hack though.

    Thinking a bit more, the information is there--just not in the most convenient format since the statement separator is mixed up with the tokens that start and end a block. Every method I try to work around that runs into the problem that in the token string:
    Code:
    WHILE x INDENT y DEDENT z
    the block ends up eating the DEDENT that separates the while statement from the "z" statement. All I can think of is hoisting a lot of ugly work into the statements rule like

    Code:
    statements ::= line-statement NODENT statements
                 | block-statement DEDENT statements
                 | line-statement
                 | block-statement DEDENT
    
    line-statement ::= expression
                     | WHILE expression
    
    block-statement ::= WHILE expression INDENT statements
    so that DEDENT can do double the work.
    I didn't think this would work, but I didn't actually try it. You actually want my code above, but with the exta kind of statement:

    Code:
    file_statement
        : variable_def
        | using_decl
        | NAMESPACE namespace_id 
        ;
    I'll try this too.
  18. #10
  19. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2010
    Posts
    13
    Rep Power
    0
    Having a while statement as both a line statement and a block statement doesn't work, as I expected. If there is a too that allows this, that could be a strait forward solution.

    Manually reporting a syntax error doesn't do very good error recovery.

    The worst is allowing statements on the same line as the the block, because this causes every token that can start a statement to be ambiguous. And you still have to look at the lookahead token to determine if whatever follows that is a DEDENT, INDEND, or NODENT. Which is why it feels like I'm using the wrong tool.

IMN logo majestic logo threadwatch logo seochat tools logo