The fact that blocks may end in a semicolon but this semicolon is not
counted by recursive lastToken() evaluation on the sub expression causes
off-by-one errors for lastToken() on blocks currently.
To fix this, introduce BlockSemicolon and BlockTwoSemicolon following
the pattern used for trailing commas in e.g. builtin function arguments.
rename PtrType => PtrTypeBitRange, SliceType => PtrType
This rename was done as the current SliceType is used for non-bitrange
pointers as well as slices and because PtrTypeSentinel/PtrTypeAligned
are also used for slices. Therefore using the same Ptr prefix for all
these pointer/slice nodes is an improvement.
This is a proof-of-concept of switching to a new memory layout for
tokens and AST nodes. The goal is threefold:
* smaller memory footprint
* faster performance for tokenization and parsing
* most importantly, a proof-of-concept that can be also applied to ZIR
and TZIR to improve the entire compiler pipeline in this way.
I had a few key insights here:
* Underlying premise: using less memory will make things faster, because
of fewer allocations and better cache utilization. Also using less
memory is valuable in and of itself.
* Using a Struct-Of-Arrays for tokens and AST nodes, saves the bytes of
padding between the enum tag (which kind of token is it; which kind
of AST node is it) and the next fields in the struct. It also improves
cache coherence, since one can peek ahead in the tokens array without
having to load the source locations of tokens.
* Token memory can be conserved by only having the tag (1 byte) and byte
offset (4 bytes) for a total of 5 bytes per token. It is not necessary
to store the token ending byte offset because one can always re-tokenize
later, but also most tokens the length can be trivially determined from
the tag alone, and for ones where it doesn't, string literals for
example, one must parse the string literal again later anyway in
astgen, making it free to re-tokenize.
* AST nodes do not actually need to store more than 1 token index because
one can poke left and right in the tokens array very cheaply.
So far we are left with one big problem though: how can we put AST nodes
into an array, since different AST nodes are different sizes?
This is where my key observation comes in: one can have a hash table for
the extra data for the less common AST nodes! But it gets even better than
that:
I defined this data that is always present for every AST Node:
* tag (1 byte)
- which AST node is it
* main_token (4 bytes, index into tokens array)
- the tag determines which token this points to
* struct{lhs: u32, rhs: u32}
- enough to store 2 indexes to other AST nodes, the tag determines
how to interpret this data
You can see how a binary operation, such as `a * b` would fit into this
structure perfectly. A unary operation, such as `*a` would also fit,
and leave `rhs` unused. So this is a total of 13 bytes per AST node.
And again, we don't have to pay for the padding to round up to 16 because
we store in struct-of-arrays format.
I made a further observation: the only kind of data AST nodes need to
store other than the main_token is indexes to sub-expressions. That's it.
The only purpose of an AST is to bring a tree structure to a list of tokens.
This observation means all the data that nodes store are only sets of u32
indexes to other nodes. The other tokens can be found later by the compiler,
by poking around in the tokens array, which again is super fast because it
is struct-of-arrays, so you often only need to look at the token tags array,
which is an array of bytes, very cache friendly.
So for nearly every kind of AST node, you can store it in 13 bytes. For the
rarer AST nodes that have 3 or more indexes to other nodes to store, either
the lhs or the rhs will be repurposed to be an index into an extra_data array
which contains the extra AST node indexes. In other words, no hash table needed,
it's just 1 big ArrayList with the extra data for AST Nodes.
Final observation, no need to have a canonical tag for a given AST. For example:
The expression `foo(bar)` is a function call. Function calls can have any
number of parameters. However in this example, we can encode the function
call into the AST with a tag called `FunctionCallOnlyOneParam`, and use lhs
for the function expr and rhs for the only parameter expr. Meanwhile if the
code was `foo(bar, baz)` then the AST node would have to be `FunctionCall`
with lhs still being the function expr, but rhs being the index into
`extra_data`. Then because the tag is `FunctionCall` it means
`extra_data[rhs]` is the "start" and `extra_data[rhs+1]` is the "end".
Now the range `extra_data[start..end]` describes the list of parameters
to the function.
Point being, you only have to pay for the extra bytes if the AST actually
requires it. There's no limit to the number of different AST tag encodings.
Preliminary results:
* 15% improvement on cache-misses
* 28% improvement on total instructions executed
* 26% improvement on total CPU cycles
* 22% improvement on wall clock time
This is 1/4 items on the checklist before this can actually be merged:
* [x] parser
* [ ] render (zig fmt)
* [ ] astgen
* [ ] translate-c
This is part of an ongoing effort to reduce size of in-memory AST. This
enum flattening pattern is widespread throughout the self-hosted
compiler.
This is a API breaking change for consumers of the self-hosted parser.
* AST: flatten ControlFlowExpression into Continue, Break, and Return.
* AST: unify identifiers and literals into the same AST type: OneToken
* AST: ControlFlowExpression uses TrailerFlags to optimize storage
space.
* astgen: support `var` as well as `const` locals, and support
explicitly typed locals. Corresponding Module and codegen code is not
implemented yet.
* astgen: support result locations.
* ZIR: add the following instructions (see the corresponding doc
comments for explanations of semantics):
- alloc
- alloc_inferred
- bitcast_result_ptr
- coerce_result_block_ptr
- coerce_result_ptr
- coerce_to_ptr_elem
- ensure_result_used
- ensure_result_non_error
- ret_ptr
- ret_type
- store
- param_type
* the skeleton structure for result locations is set up. It's looking
pretty clean so far.
* add compile error for unused result and compile error for discarding
errors.
* astgen: split builtin calls up to implemented manually, and implement
`@as`, `@bitCast` (and others) with respect to result locations.
* add CLI support for hex and raw object formats. They are not
supported by the self-hosted compiler yet, and emit errors.
* rename `--c` CLI to `-ofmt=[objectformat]` which can be any of the
object formats. Only ELF and C are supported so far. Also added missing
help to the help text.
* Remove hard tabs from C backend test cases. Shame on you Noam, you
are grounded, you should know better, etc. Bad boy.
* Delete C backend code and test case that relied on comptime_int
incorrectly making it all the way to codegen.
InfixOp is flattened out so that each operator is an independent AST
node tag. The two kinds of structs are now Catch and SimpleInfixOp.
Beginning implementation of supporting codegen for const locals.
ast.Node.Id => ast.Node.Tag, matching recent style conventions.
Now multiple different AST node tags can map to the same AST node data
structures. In this commit, simple prefix operators now all map top
SimplePrefixOp.
`ast.Node.castTag` is now preferred over `ast.Node.cast`.
Upcoming: InfixOp flattened out.
These AST nodes now have a flags field and then a bunch of optional
trailing objects. The end result is lower memory usage and consequently
better performance. This is part of an ongoing effort to reduce the
amount of memory parsed ASTs take up.
Running `zig fmt` on the std lib:
* cache-misses: 2,554,321 => 2,534,745
* instructions: 3,293,220,119 => 3,302,479,874
* peak memory: 74.0 MiB => 73.0 MiB
Holding the entire std lib AST in memory at the same time:
93.9 MiB => 88.5 MiB
This is part of a larger effort to improve the memory layout of AST
nodes of the self-hosted parser to reduce wasted memory. Reduction of
wasted memory also translates to improved performance because of fewer
memory allocations, and fewer cache misses.
Compared to master, when running `zig fmt` on the std lib:
* cache-misses: 801,829 => 768,624
* instructions: 3,234,877,167 => 3,232,075,022
* peak memory: 81480 KB => 75964 KB
To prevent cache misses, token ids go in their own array, and the
start/end offsets go in a different one.
perf measurement before:
2,667,914 cache-misses:u
2,139,139,935 instructions:u
894,167,331 cycles:u
perf measurement after:
1,757,723 cache-misses:u
2,069,932,298 instructions:u
858,105,570 cycles:u
The DocComment AST node now only points to the first doc comment token.
API users are expected to iterate over the following tokens directly.
After this commit there are no more linked lists in use in the
self-hosted AST API.
Performance impact is negligible. Memory usage slightly reduced.
* Extract Call ast node tag out of SuffixOp; parameters go in memory
after Call.
* Demote AsmInput and AsmOutput from AST nodes to structs inside the
Asm node.
* The following ast nodes get their sub-node lists directly following
them in memory:
- ErrorSetDecl
- Switch
- BuiltinCall
* ast.Node.Asm gets slices for inputs, outputs, clobbers instead of
singly linked lists
Performance changes:
throughput: 72.7 MiB/s => 74.0 MiB/s
maxrss: 72 KB => 69 KB (nice)
block statements are now directly following the Block AST node rather
than a singly linked list. This had negligible impact on performance:
throughput: 72.3 MiB/s => 72.7 MiB/s
however it greatly improves the API since the statements are laid out in
a flat array in memory.