DamnBASIC: Lexical Analyzer
Experimenting with writing my own language
About
The main idea behind building DamnBASIC was to have a tool that would help me build games using a higher language definition and then generate code for platforms that had a MOS 6502 chip (Commodore 64, NES, Atari 800, Apple II, etcā¦).
The language
The first thing I did was create a BNF document that would allow me to define the syntax structure of the language. I wasnāt looking to build a new type of semantic, I only wanted an easy to use and already known syntax. Thatās why I decided to go for a BASIC-ish / C-ish type of syntax. This is how I wanted the language to look:
This looks like something simple and straight forward if youāve ever programmed in a C based language.
Building the thing
The good thing about having a BNF document is that you can use is as a reference for each step on the building of the compiler. The first thing I looked for were terminals. In this case I had digits [0-9], alphabetic characters [a-zA-Z], multiple symbols [+|-|*|/|ā¦] and finally keywords like āifā, āelseā, āfuncā, etc. Once Iāve read the symbols I was able to construct tokens. Tokens are used by the parser to generate a syntax tree.
Here you can see the whole list of tokens I had.
The token was a very simple structure which only had a type, raw value and the line of the scan.
The scanning routine is very simple and itās broken in multiple parts in a hierarchical way. For example the first thing I want to scan is white spaces and comments then it would be symbols after that keywords and finally terminals like boolean values, numbers and identifiers. The main loop looks something like this.
Lets break apart the āhas_terminalā function so we can see how we can process string of characters and translate that into a usable token. The body of the function is very similar to our main loop. Youāll see this a lot through the compiler because itās highly based on the BNF document which breaks each step into smaller parts.
Lets finish disecting āhas_numberā. First we want to check if the current character is a valid number. There are multiple ways we can do that. We could check if the characters ASCII value is in the range of [0-9] values or we could just compare it to all the characters in a string ā0123456789ā. So what weāll do is loop until we stop getting a match or we reach the end of the file/string we are reading.
Here you can see a very nice video example explaining more about state machines and the use on language design. Basically the BNF document I wrote is a less āacademicā representation of a finite state automata.