DamnBASIC: Lexical Analyzer

About

The main idea behind building DamnBASIC was to have a tool that would help me build games using a higher language definition and then generate code for platforms that had a MOS 6502 chip (Commodore 64, NES, Atari 800, Apple II, etc…).

The language

The first thing I did was create a BNF document that would allow me to define the syntax structure of the language. I wasn’t looking to build a new type of semantic, I only wanted an easy to use and already known syntax. That’s why I decided to go for a BASIC-ish / C-ish type of syntax. This is how I wanted the language to look:

let z = 0

func fibonacci(n)
    if (n = 0)
        return 0
    elseif (n = 1)
        return 1
    end
    return fibonacci(n - 1) + fibonacci(n - 2)
end

func main()
    z = fibonacci(10)
end

This looks like something simple and straight forward if you’ve ever programmed in a C based language.

Building the thing

The good thing about having a BNF document is that you can use is as a reference for each step on the building of the compiler. The first thing I looked for were terminals. In this case I had digits [0-9], alphabetic characters [a-zA-Z], multiple symbols [+|-|*|/|…] and finally keywords like “if”, “else”, “func”, etc. Once I’ve read the symbols I was able to construct tokens. Tokens are used by the parser to generate a syntax tree.

Here you can see the whole list of tokens I had.

enum token_type
{
    LINEFEED,
    ENDOFFILE,
    WORD_OR,
    WORD_AND,
    WORD_NOT,
    WORD_FUNC,
    WORD_IF,
    WORD_ELSEIF,
    WORD_ELSE,
    WORD_LET,
    WORD_END,
    WORD_RET,
    WORD_WHILE,
    WORD_BAND,
    WORD_XOR,
    WORD_BOR,
    WORD_SHL,
    WORD_SHR,
    WORD_MOD,
    SYM_NOTEQU,
    SYM_EQU,
    SYM_LT,
    SYM_GT,
    SYM_LTEQU,
    SYM_GTEQU,
    SYM_PLUS,
    SYM_MINUS,
    SYM_SLASH,
    SYM_ASTERISK,
    SYM_COMMA,
    SYM_LPAREN,
    SYM_RPAREN,
    IDENTIFIER,
    NUMBER,
    BOOLEAN
};

The token was a very simple structure which only had a type, raw value and the line of the scan.

struct token
{
    token_type type;
    string raw_value;
    int line;
};

The scanning routine is very simple and it’s broken in multiple parts in a hierarchical way. For example the first thing I want to scan is white spaces and comments then it would be symbols after that keywords and finally terminals like boolean values, numbers and identifiers. The main loop looks something like this.

while (!at_eof())
{
    token* current_token = nullptr;
    if (has_empty_space(current_token)) continue;
    else if(has_symbols(current_token));
    else if(has_keywords(current_token));
    else if(has_terminal(current_token));
    else
    {
        error("Invalid Symbol");
    }
    if (token) 
    {
        stream.push_back(token);
    }
}

Lets break apart the “has_terminal” function so we can see how we can process string of characters and translate that into a usable token. The body of the function is very similar to our main loop. You’ll see this a lot through the compiler because it’s highly based on the BNF document which breaks each step into smaller parts.

bool has_terminal(token*& current_token) 
{
    if (has_number(current_token)) return true;
    if (has_boolean(current_token)) return true;
    if (has_identifier(current_token)) return true;
    return false;
}

Lets finish disecting “has_number”. First we want to check if the current character is a valid number. There are multiple ways we can do that. We could check if the characters ASCII value is in the range of [0-9] values or we could just compare it to all the characters in a string “0123456789”. So what we’ll do is loop until we stop getting a match or we reach the end of the file/string we are reading.

bool has_number(token*& current_token) 
{
    if (current_char >= '0' && current_char <= '9') 
    {
        current_token = new token(NUMBER);
        current_token->raw_value = current_char;
        current_char = next_char();
        while (current_char >= '0' && current_char <= '9')
        {
            current_token->raw_value += current_char;
            current_char = next_char();
        }
        return true;
    }
    return false;
}

Here you can see a very nice video example explaining more about state machines and the use on language design. Basically the BNF document I wrote is a less “academic” representation of a finite state automata.