# ⌈LUNAR GARDEN⌋

#### toddgaunt@protonmail.ch

## C2 Language Grammar

#### 2018-11-12

## Introduction Over the past two years I have been working on various compilers that were developed to various stages of incompletion. Each language designed has been a stepping stone towards the next language, an iterative process, that has landed me at the current language I am developing now, C2. This may be the language I stick with and develop a fully-featured compiler for, or it may not be. Only time will tell. I am, however, much more satisfied by the design and planned feature-set of this language than all my prior ones, as I believe it hits a sweet spot between comfort, ease-of-adoption, having a modern feature set, and the manual control of a language like C. This is where it gets its name, C2, from reflecting how I wish I could improve C, my favorite programming language. Similar to C++ in spirit, but much different in method, C2 aims to heal warts of the C language, and add modern features found in other languages that complement what C does best: manual memory-management and performance. Rather than pretending to maintain backwards compatibility, while the syntax should look familiar to any ALGOL programmer, the language is quite visually distinct to accomodate the new features and to remove language ambiguity. Features on the roadmap include but are not limited to: lambdas, explicit and implicit variable capture, function overloading, automatic currying, prefix and infix operators, manual memory management, pointers and pointer arithmetic, basic type inference, generic functions, and parameterized types. For all of these features, the language syntax was designed to keep refactoring code easy, be unambiguous, and be fast to parse. ## Grammar and Tokens The grammar for C2 is very simple, and was designed with making the writing of a hand-made recursive descent parser in mind. It should be easy to bootstrap a programming language. A hand-written recursive-descent parser is always easy to write. Each token type is also rather simple, and can be expressed with a regular expression. This allows for the classic lexer/parser split when writing an implementation of the language, since token matching is independent of grammar context. The grammar: THE C2 LANGUAGE GRAMMAR decimal_digit = "0" ... "9" hex_digit = "0" ... "9" | "A" ... "F" | "a" ... "f" . binary_lit = "0" ("b" | "B") {"0" | "1"} . octal_lit = "0" {"0" ... "7"} . decimal_lit = ("1" ... "9") { decimal_digit } . hex_lit = "0" ("x" | "X") hex_digit {hex_digit} . integer_lit = binary_lit | octal_lit | decimal_lit | hex_lit . float_lit = {decimal_digit} "." decimal_digit {decimal_digit} . string_lit = `"` {UNICODE_LETTER} `"` . normal_ident = ASCII_LETTER { ASCII_LETTER | "0" ... "9" } . prefix_ident = "`" {UNICODE_LETTER} "`" . infix_ident = "'" {UNICODE_LETTER} "'" . ident = normal_ident | prefix_ident | infix_ident . type_primary = "(" type ")" | normal_ident [":" type_unary ] . type_unary = {"*" | "[" integer_lit "]"} type_primary . type_function = type_unary {"->" type_unary} . type_union = type_function {"|" type_function} . type_cons = type_union {"," type_union} . type = type_cons . block = [ "[" [ident { "," ident }] "]" ] "{" { ";" | expr ";" } "}" . primary = integer_lit | float_lit | string_lit | ident [":"] | "(" expr ")" | "type" ident type | "if" "(" expr ")" expr ";" {"else" "if" "(" expr ")" expr ";"} ["else" expr ";"] | "for" ["(" expr [";" expr] [";" expr] ")"] expr | "let" ident {"(" [ident {"," ident}] ")"} [":" type] "=" expr | "goto" ident | "type" ident type | "lambda" {"(" [ident {"," ident}] ")"} [":" type] "=" expr | block prefix = { "*" | "&" | "!" | "~" | "+" | "-" } primary . postfix = prefix {"[" expr "]" | "(" expr ")" | "as" type} . infix = postfix {infix_ident postfix} . multiplicative = infix {("*" | "/" | "%" | "<<" | ">>" | "&") infix} . additive = multiplicative {("+" | "-" | "|" | "^") multiplicative} . comparative = additive {("==" | "!=" | "<=" | ">=" | "<" | ">") additive} . logical_and = comparative {"&&" comparative} . logical_or = logical_and {"||" logical_and} . cons = logical_or {"," logical_or} . assignment = cons {("=" | "*=" | "/=" | "%=" | "<<=" | ">>=" | "&=" | "+=" | "-=" | "|=" | "^=") cons} . expr = assignment . start = {";" | expr ";"} EOF . THE C2 LANGUAGE OPERTOR PRECEDENCE (EXCLUDING PRIME) Category | Operators ---------------+---------------------------------- prime | ... unary | * & ! ~ + - infix | INFIX_IDENT multiplicative | * / % << >> & additive | + - | ^ comparative | == != <= >= < > logical_and | && logical_or | || cons | , assignment | = += -= |= ^= *= /= %= <<= >>= &= The operator precedence for the language is baked into the grammar for the language if written using a recursive descent parser. C2 breaks some precedence rules established from C, but for good reason. Some precedence rules broken here are moving the bitwise '|' (or) and '&' (and) into the additive and multiplicative categories respectively. The bit-shift operators have been moved to the multiplicative category as well. Otherwise precedence remains identical for pre-defined infix operators. A special level of precedence was added for custom infix operators just below unary operators, yet higher than all other binary operators. This precedence level is for the special infix identifer form, which allows users to enclose any identifier with single quotes to use it as a binary operator e.g. f(5)(6) and 5 'f' 6 are equivalent excepting precedence. To complement this infix form, a prefix identifier form is also in the language, allowing any identifiers enclosed by backticks to be used in the standard prefix calling form e.g. 5 + 6 and `+`(5)(6) are equivalent in all ways except precedence. These special identification forms work because in C2, operators and functions are equivalent. It is simply that certain symbols are designated as infix by default, such as + - * and /, and the rest are designated as prefix by default. The benefit of such a detail is that a user can define his own infix notation, and is able to use what are traditionally thought of as special operators in the same way as functions. The lowest precedence operator that is still above assignment is the 'cons' operator, or tuple construction operator, represented with , (comma). This comma is one of the basic building blocks for how expressions in C2 are formed, and adds a very useful data structure to the language for building data aggregates, otherwise known as tuples. Otherwise, most operators behave the same as they would in C, and are by default implemented for all primitive types as appropriate. ## Examples Below are some examples of the language grammar in action. They won't really be explained, as that is for a later post. But it should be pretty straight- forward to understand for someone familiar with C, C++, or Rust syntax and semantics. /* Variable Declaration */ let a : u8 = 5; let b : u16 = 5; let c : i32 = b as i32; let d : f32 = 3.14; let e : f64 = 2.71 + (d as f64); /* Function Declaration */ let f(x, y) : (u32, u32) -> u32 = { x / y + y / x; } let g(a)(b)(c) : f32 -> f32 -> f32 -> f32 = { a * b * c; } f(2, 4); g(1)(2)(3); /* Defining Aggregate and Sum Types */ type Vec3 (x : i32, y : i32, z : i32); type u8_or_f32 (u : u8 | f : f32); /* Defining and Using an Infix Function */ let 'cross'(a)(b) : (Vec3, Vec3) -> Vec3 = (a.x * b.x, a.y * b.y, a.z * b.z); let a : Vec3 = (1, 2, 3); let b : Vec3 = (2, 2, 2); let c = a 'cross' b; /* Using an Operator as a Prefix Function */ let x = 5; let y = 2; let z = `+`(x)(y); let t = `,`(20)(30);