77 Commits

Author SHA1 Message Date
Veikka Tuominen
349d78a443 validate number literals in AstGen 2022-09-13 20:26:04 -04:00
r00ster91
83909651ea test: simplify testTokenize
`expectEqual` already does what this did.
The trace is now shorter and more concise, so the errors should be easier to read.
2022-08-16 00:20:19 +02:00
r00ster91
5490688d65 refactor: use std.ascii functions 2022-08-16 00:20:19 +02:00
r00ster91
e3b3eab840 test(names): some renamings 2022-08-16 00:20:19 +02:00
r00ster91
f07cba10a3 test(names): remove unnecessary "tokenizer - " prefix 2022-08-16 00:20:19 +02:00
zooster
8fd20a5eb0 fix: disallow newline in char literal 2022-08-10 16:13:56 -04:00
Ali Chraghi
a4df443f96 Update Tokenizer Dump Function
fix missed `loc` field
2022-02-20 17:47:42 -05:00
Veikka Tuominen
9c36cf92f0 parser: make some errors point to end of previous token
For some errors if the found token is not on the same line as
the previous token, point to the end of the previous token.
This usually results in more helpful errors.
2022-02-17 14:23:35 +02:00
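The rule described in 9c36cf92f0 can be sketched in a few lines. This is a minimal Python illustration, not the parser's actual code; `error_offset` and its parameters are hypothetical names, and offsets are plain indexes into the source text.

```python
def error_offset(source: str, prev_token_end: int, found_token_start: int) -> int:
    """Pick the offset an "expected X, found Y" error should point to.

    If the found token is on a later line than the previous token, point
    to the end of the previous token; otherwise point to the found token.
    """
    if "\n" in source[prev_token_end:found_token_start]:
        return prev_token_end
    return found_token_start


# Missing semicolon after `1`: the error lands right after the `1`,
# not at the `const` on the following line.
source = "const a = 1\nconst b = 2;"
offset = error_offset(source, prev_token_end=11, found_token_start=12)
```

Pointing at the end of the previous token keeps the report on the line where the fix is usually needed.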
Andrew Kelley
902df103c6 std lib API deprecations for the upcoming 0.9.0 release
See #3811
2021-11-30 00:13:07 -07:00
Travis Staloch
4870595352 sat-arithmetic: add additional tokenizer tests 2021-09-28 17:03:43 -07:00
Travis Staloch
dcbc52ec85 sat-arithmetic: correctly tokenize <<|, <<|=
- set state rather than result.tag in tokenizer.zig
- add test to tokenizer.zig for <<, <<|, <<|=
2021-09-28 17:03:43 -07:00
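The note about setting state rather than the result tag in dcbc52ec85 comes down to longest-match tokenizing: after `<<` the tokenizer must keep going, because `|` and then `=` may still follow. The following is a minimal Python sketch of that idea, not the code in tokenizer.zig; `lex_angle_left` and the state names are hypothetical.

```python
def lex_angle_left(src: str, start: int = 0):
    """Return (token_text, end_index) for the operator starting at src[start].

    Each state records what has been consumed so far; a token is only
    finalized once the next character can no longer extend it.
    """
    if not src.startswith("<", start):
        raise ValueError("expected '<'")
    state = "<"
    i = start + 1
    while True:
        c = src[i] if i < len(src) else "\0"  # "\0" stands in for end of input
        if state == "<":
            if c == "<":
                state = "<<"
            elif c == "=":
                return "<=", i + 1
            else:
                return "<", i
        elif state == "<<":
            if c == "|":
                state = "<<|"  # keep going: "<<|=" may still follow
            elif c == "=":
                return "<<=", i + 1
            else:
                return "<<", i
        else:  # state == "<<|"
            if c == "=":
                return "<<|=", i + 1
            return "<<|", i
        i += 1
```

Finalizing the token as soon as `<<` is seen would split `<<|=` into several tokens, which is the kind of mistake the added tests guard against.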
Travis Staloch
29f41896ed sat-arithmetic: add operator support
- adds initial support for the operators +|, -|, *|, <<|, +|=, -|=, *|=, <<|=
- uses operators in addition to builtins in behavior test
- adds binOpExt() and assignBinOpExt() to AstGen.zig. these need to be audited
2021-09-28 17:02:43 -07:00
Ryan Liptak
3b09262c12 tokenizer: Fix index-out-of-bounds on unfinished unicode escapes before EOF 2021-09-22 14:33:33 -04:00
Andrew Kelley
1ad905c71e Merge pull request #9649 from Snektron/address-space
Address Spaces
2021-09-20 20:37:04 -04:00
Ryan Liptak
2a728f6e5f tokenizer: Fix index-out-of-bounds on string_literal_backslash right before EOF 2021-09-20 20:16:14 -04:00
Robin Voetter
ccc7f9987d Address spaces: addrspace(A) parsing
The grammar for function prototypes, (global) variable declarations, and
pointer types now accepts an optional addrspace(A) modifier.
2021-09-20 02:29:03 +02:00
Andrew Kelley
05cf44933d stage2: delete keywords true, false, undefined, null
The grammar does not need these as keywords; they are merely primitives
provided by the language the same as `void`, `u32`, etc.
2021-08-28 12:10:55 -07:00
Andrew Kelley
d29871977f remove redundant license headers from zig standard library
We already have a LICENSE file that covers the Zig Standard Library. We
no longer need to remind everyone that the license is MIT in every single
file.

Previously this was introduced to clarify the situation for a fork of
Zig that made Zig's LICENSE file harder to find, and replaced it with
their own license that required annual payments to their company.
However that fork now appears to be dead. So there is no need to
reinforce the copyright notice in every single file.
2021-08-24 12:25:09 -07:00
Andrew Kelley
ec63411905 Revert "Skip over CRs at the end of multiline literals"
This reverts commit 9de452f9a69d5590743a194bc2d0817d26d66a0b.

No CRs allowed in multiline string literals - this is intentional.
2021-07-07 18:00:04 -07:00
Daniele Cocca
9de452f9a6 Skip over CRs at the end of multiline literals
Fixes #9257.
This is needed when tokenizing input containing DOS line endings, i.e.
the CRLF sequence.
2021-07-07 20:03:19 +03:00
Andrew Kelley
c5c23db627 tokenizer: clean up invalid token error
It now displays the byte with proper printability handling. This makes
the relevant compile error test case no longer a regression in quality
from stage1 to stage2.
2021-07-02 13:28:31 -07:00
Andrew Kelley
7a2e0d9810 AstGen: cleanups to pass more compile error test cases 2021-07-02 13:28:29 -07:00
Andrew Kelley
24c432608f stage2: improve compile errors from tokenizer
In order to not regress the quality of compile errors, some improvements
had to be made.

 * std.zig.parseCharLiteral is improved to return more detailed parse
   failure information.
 * tokenizer is improved to handle null bytes in the middle of strings,
   character literals, and line comments.
 * validating how many unicode escape digits in string literals is moved
   to std.zig.parseStringLiteral rather than handled in the tokenizer.
 * when a tokenizer error occurs, if the reported token is the 'invalid'
   tag, an error note is added to point to the invalid byte location.
   Further improvements would be:
   - Mention the expected set of allowed bytes at this location.
   - Display the invalid byte (if printable, print it, otherwise
     escape-print it).
2021-07-02 13:27:35 -07:00
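The last suggested improvement in 24c432608f (print the invalid byte if it is printable, escape it otherwise) could look roughly like this. It is a Python sketch under assumptions: `describe_invalid_byte` is a hypothetical helper, and the quoting and `\xNN` escape format are chosen for illustration, not taken from the compiler's output.

```python
def describe_invalid_byte(byte: int) -> str:
    """Render a byte for an error note: as-is if printable ASCII, escaped otherwise."""
    if 0x20 <= byte <= 0x7E:
        return f"'{chr(byte)}'"
    return f"'\\x{byte:02x}'"
```

For example, `describe_invalid_byte(ord("$"))` yields `'$'`, while a null byte is rendered as `'\x00'`.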
Andrew Kelley
3f680abbe2 stage2: tokenizer: require null terminated source
By requiring the source file to be null-terminated, we avoid extra
branching while simplifying the logic at the same time.

Running ast-check on a large zig source file (udivmodti4_test.zig),
master branch compared to this commit:
 * 4% faster wall clock
 * 7% fewer cache misses
 * 1% fewer branches
2021-07-02 13:27:35 -07:00
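The branching that 3f680abbe2 removes is the per-byte end-of-input check. With a guaranteed terminating 0 byte, a scanning loop can stop on the sentinel because it never matches the byte class being scanned. Below is a minimal Python sketch of that structure; `scan_identifier` and `IDENT_BYTES` are hypothetical names, and Python still bounds-checks internally, so this shows only the shape of the logic, not the performance effect.

```python
import string

# ASCII letters, digits, and underscore, as a set of byte values.
IDENT_BYTES = frozenset((string.ascii_letters + string.digits + "_").encode("ascii"))


def scan_identifier(source: bytes, start: int) -> int:
    """Return the index one past the identifier that begins at `start`."""
    if not source.endswith(b"\x00"):
        raise ValueError("source must be null-terminated")
    i = start
    # No `i < len(source)` test here: the terminating 0 byte is not an
    # identifier byte, so it always ends the loop.
    while source[i] in IDENT_BYTES:
        i += 1
    return i
```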
Jacob G-W
641ecc260f std, src, doc, test: remove unused variables 2021-06-21 17:03:03 -07:00
Dmitry Matveyev
00982f75e9 stage2: Remove special double ampersand parsing case (#9114)
* Remove parser error on double ampersand

* Add failing test for double ampersand case

* Add error when encountering double ampersand in AstGen

"Bit and" operator should not make sense when one of its operands
is an address.

* Check that 2 ampersands are adjacent to each other in source string

* Remove cases of unused variables in tests
2021-06-20 21:04:14 +03:00
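The adjacency check mentioned in 00982f75e9 can be sketched as follows. This is a Python illustration only: it assumes tokens are available as parallel lists of tags and start offsets, and `is_double_ampersand` is a hypothetical helper, not a function in AstGen.

```python
def is_double_ampersand(token_tags, token_starts, i: int) -> bool:
    """True if tokens i and i+1 are both `&` and touch in the source text."""
    return (
        i + 1 < len(token_tags)
        and token_tags[i] == "&"
        and token_tags[i + 1] == "&"
        and token_starts[i + 1] == token_starts[i] + 1
    )


# `a && b`: the two ampersands start at offsets 2 and 3, so they are adjacent.
adjacent = is_double_ampersand(["a", "&", "&", "b"], [0, 2, 3, 5], 1)
# `a & &b`: the ampersands start at offsets 2 and 4, so they are not.
separated = is_double_ampersand(["a", "&", "&", "b"], [0, 2, 4, 5], 1)
```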
Isaac Freund
608bc1cbd5 stage2: disallow 1.e9 and 0x1.p9 as float literals
Instead require `1e9` and `0x1p9`, disallowing the trailing dot.

This change to the grammar is consistent with forbidding `1.` and `0x1.`
as float literals and ensures there is only one way to do things here.
2021-05-31 19:51:11 +00:00
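The rule from 608bc1cbd5 and ec10595b65 is that a dot in a number literal must be followed by at least one digit. Here is a simplified Python sketch of that rule using regular expressions. It is an approximation, not the Zig grammar: it ignores `_` digit separators and binary or octal literals, and `is_valid_number_literal` is a hypothetical name.

```python
import re

# A fractional part, when present, needs at least one digit after the dot.
_DEC = r"[0-9]+(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?"
_HEX = r"0x[0-9a-fA-F]+(?:\.[0-9a-fA-F]+)?(?:[pP][+-]?[0-9]+)?"
_NUMBER = re.compile(rf"(?:{_HEX}|{_DEC})")


def is_valid_number_literal(text: str) -> bool:
    """Check a literal against the simplified decimal and hex patterns above."""
    return _NUMBER.fullmatch(text) is not None
```

Under these patterns `1e9`, `1.0e9`, and `0x1p9` are accepted, while `1.`, `0x1.`, `1.e9`, and `0x1.p9` are rejected.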
Isaac Freund
ec10595b65 stage2: disallow trailing dot on float literals
This disallows e.g. `1.` or `0x1.` as a float literal, which is
consistent with the grammar.
2021-05-27 21:00:44 -04:00
Matthew Borkowski
f750618846 stop tokenizer from recognizing lone @ or @ followed by a digit as a builtin 2021-05-26 04:13:04 -04:00
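What f750618846 describes amounts to checking the character that follows `@`. A minimal Python sketch of that decision, with hypothetical names (`classify_at` and the returned tag strings are not the tokenizer's own):

```python
import string

_IDENT_START = frozenset(string.ascii_letters + "_")


def classify_at(src: str, i: int) -> str:
    """Classify the token that begins with the `@` at src[i]."""
    if src[i] != "@":
        raise ValueError("expected '@'")
    nxt = src[i + 1] if i + 1 < len(src) else ""
    if nxt == '"':
        return "quoted_identifier"  # the @"..." identifier syntax
    if nxt in _IDENT_START:
        return "builtin"            # e.g. @import
    return "invalid"                # lone @, or @ followed by a digit
```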
Andrew Kelley
44de884980 tokenizer: fix crash on multiline string with only 1 backslash
Closes #8904
2021-05-25 18:13:10 -07:00
Veikka Tuominen
fd77f2cfed std: update usage of std.testing 2021-05-08 15:15:30 +03:00
Andrew Kelley
8e6c2b7a47 Merge remote-tracking branch 'origin/master' into ast-memory-layout 2021-02-24 15:08:23 -07:00
Josh Wolfe
8b9434871e Avoid concept of a "Unicode character" in documentation and error messages (#8059) 2021-02-24 08:26:13 -05:00
Isaac Freund
f3ee10b454 zig fmt: fix comments ending with EOF after decls
Achieve this by reducing the amount of special casing to handle EOF so
that the already correct logic for normal comments does not need to be
duplicated.
2021-02-22 18:32:37 +01:00
Veikka Tuominen
e2289961c6 snake_case Token.Tag 2021-02-12 02:12:00 +02:00
Andrew Kelley
272a0ab359 zig fmt: implement "line comment followed by top-level comptime" 2021-02-01 20:11:55 -07:00
Andrew Kelley
20554d32c0 zig fmt: start reworking with new memory layout
 * start implementation of ast.Tree.firstToken and lastToken
 * clarify some ast.Node doc comments
 * reimplement renderToken
2021-02-01 17:23:49 -07:00
Andrew Kelley
bf8fafc37d stage2: tokenizer does not emit line comments anymore
only std.zig.render cares about these, and it can find them in the
original source easily enough.
2021-01-31 21:57:48 -07:00
Andrew Kelley
4dca99d3f6 stage2: rework AST memory layout
This is a proof-of-concept of switching to a new memory layout for
tokens and AST nodes. The goal is threefold:

 * smaller memory footprint
 * faster performance for tokenization and parsing
 * most importantly, a proof-of-concept that can be also applied to ZIR
   and TZIR to improve the entire compiler pipeline in this way.

I had a few key insights here:

 * Underlying premise: using less memory will make things faster, because
   of fewer allocations and better cache utilization. Also using less
   memory is valuable in and of itself.
 * Using a Struct-Of-Arrays for tokens and AST nodes saves the bytes of
   padding between the enum tag (which kind of token is it; which kind
   of AST node is it) and the next fields in the struct. It also improves
   cache locality, since one can peek ahead in the tokens array without
   having to load the source locations of tokens.
 * Token memory can be conserved by storing only the tag (1 byte) and byte
   offset (4 bytes), for a total of 5 bytes per token. It is not necessary
   to store the token's ending byte offset because one can always
   re-tokenize later. For most tokens the length can be trivially
   determined from the tag alone, and for those where it cannot (string
   literals, for example), one must parse the string literal again later
   anyway in astgen, making it free to re-tokenize.
 * AST nodes do not actually need to store more than 1 token index because
   one can poke left and right in the tokens array very cheaply.
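
As a rough illustration of the token layout in the list above, here is a Python sketch (not the Zig implementation). `TokenList` and its fields are hypothetical names; the `struct` module is used only to compare a padded record with the packed per-token cost.

```python
import struct
from array import array

# Array-of-structs: a C-style record {u8 tag; u32 start;} with native
# alignment, which inserts padding after the 1-byte tag.
AOS_BYTES_PER_TOKEN = struct.calcsize("@BI")

# Struct-of-arrays: 1 byte in the tags array plus 4 bytes in the offsets
# array, with no padding in between.
SOA_BYTES_PER_TOKEN = struct.calcsize("=B") + struct.calcsize("=I")


class TokenList:
    """Hypothetical struct-of-arrays token storage (parallel arrays)."""

    def __init__(self) -> None:
        self.tags = bytearray()   # which kind of token, 1 byte each
        self.starts = array("I")  # start offsets (unsigned int, typically 4 bytes)

    def append(self, tag: int, start: int) -> None:
        self.tags.append(tag)
        self.starts.append(start)

    def next_tag_is(self, index: int, tag: int) -> bool:
        # Peeking ahead touches only the compact tags array.
        return index + 1 < len(self.tags) and self.tags[index + 1] == tag
```

`SOA_BYTES_PER_TOKEN` is 5, matching the figure in the message; on common platforms the padded record is typically 8 bytes.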

So far we are left with one big problem though: how can we put AST nodes
into an array, since different AST nodes are different sizes?

This is where my key observation comes in: one can have a hash table for
the extra data for the less common AST nodes! But it gets even better than
that:

I defined this data that is always present for every AST Node:

 * tag (1 byte)
   - which AST node is it
 * main_token (4 bytes, index into tokens array)
   - the tag determines which token this points to
 * struct{lhs: u32, rhs: u32}
   - enough to store 2 indexes to other AST nodes, the tag determines
     how to interpret this data

You can see how a binary operation, such as `a * b` would fit into this
structure perfectly. A unary operation, such as `*a` would also fit,
and leave `rhs` unused. So this is a total of 13 bytes per AST node.
And again, we don't have to pay for the padding to round up to 16 because
we store in struct-of-arrays format.
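
The fixed per-node data described above can be sketched like this in Python (an illustration, not the Zig implementation). `NodeList` and the tag values are hypothetical, and using 0 for an unused `lhs` or `rhs` is an assumption of the sketch.

```python
import struct
from array import array

# Field widths taken from the message: u8 tag, u32 main_token, u32 lhs, u32 rhs.
PACKED_NODE_BYTES = struct.calcsize("=BIII")  # no padding: 1 + 4 + 4 + 4
PADDED_NODE_BYTES = struct.calcsize("@BIII")  # native alignment pads after the tag

# Hypothetical tag values, for illustration only.
IDENTIFIER, MUL = 0, 1


class NodeList:
    """Struct-of-arrays node storage: one parallel array per field."""

    def __init__(self) -> None:
        self.tags = bytearray()
        self.main_tokens = array("I")
        self.lhs = array("I")
        self.rhs = array("I")

    def add(self, tag: int, main_token: int, lhs: int = 0, rhs: int = 0) -> int:
        self.tags.append(tag)
        self.main_tokens.append(main_token)
        self.lhs.append(lhs)
        self.rhs.append(rhs)
        return len(self.tags) - 1  # index of the new node


# Tokens of `a * b` are: 0 = `a`, 1 = `*`, 2 = `b`.
nodes = NodeList()
a = nodes.add(IDENTIFIER, main_token=0)
b = nodes.add(IDENTIFIER, main_token=2)
mul = nodes.add(MUL, main_token=1, lhs=a, rhs=b)
```

A unary node would be stored the same way, filling `lhs` and leaving `rhs` unused.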

I made a further observation: the only kind of data AST nodes need to
store other than the main_token is indexes to sub-expressions. That's it.
The only purpose of an AST is to bring a tree structure to a list of tokens.
This observation means all the data that nodes store are only sets of u32
indexes to other nodes. The other tokens can be found later by the compiler,
by poking around in the tokens array, which again is super fast because it
is struct-of-arrays, so you often only need to look at the token tags array,
which is an array of bytes, very cache friendly.

So for nearly every kind of AST node, you can store it in 13 bytes. For the
rarer AST nodes that have 3 or more indexes to other nodes to store, either
the lhs or the rhs will be repurposed to be an index into an extra_data array
which contains the extra AST node indexes. In other words, no hash table needed,
it's just 1 big ArrayList with the extra data for AST Nodes.

Final observation: there is no need to have a canonical tag for a given AST node. For example:
The expression `foo(bar)` is a function call. Function calls can have any
number of parameters. However in this example, we can encode the function
call into the AST with a tag called `FunctionCallOnlyOneParam`, and use lhs
for the function expr and rhs for the only parameter expr. Meanwhile if the
code was `foo(bar, baz)` then the AST node would have to be `FunctionCall`
with lhs still being the function expr, but rhs being the index into
`extra_data`. Then because the tag is `FunctionCall` it means
`extra_data[rhs]` is the "start" and `extra_data[rhs+1]` is the "end".
Now the range `extra_data[start..end]` describes the list of parameters
to the function.
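
The two call encodings described above can be sketched as follows. This is a Python illustration of the scheme as the message states it, not the Zig code; `Node`, `encode_call`, `call_params`, and the tag strings are hypothetical.

```python
from typing import List, NamedTuple


class Node(NamedTuple):
    tag: str
    lhs: int  # index of the function expression node
    rhs: int  # the only parameter, or an index into extra_data


def encode_call(extra_data: List[int], fn_expr: int, params: List[int]) -> Node:
    if len(params) == 1:
        # Compact form: nothing is written to extra_data.
        return Node("call_one", fn_expr, params[0])
    start = len(extra_data)
    extra_data.extend(params)          # the parameter node indexes
    end = len(extra_data)
    rhs = len(extra_data)
    extra_data.extend([start, end])    # extra_data[rhs], extra_data[rhs + 1]
    return Node("call", fn_expr, rhs)


def call_params(extra_data: List[int], node: Node) -> List[int]:
    if node.tag == "call_one":
        return [node.rhs]
    start, end = extra_data[node.rhs], extra_data[node.rhs + 1]
    return extra_data[start:end]
```

With an empty `extra_data`, encoding `foo(bar)` leaves the list empty, while encoding `foo(bar, baz)` appends the two parameter indexes followed by their start and end positions.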

Point being, you only have to pay for the extra bytes if the AST actually
requires it. There's no limit to the number of different AST tag encodings.

Preliminary results:

 * 15% improvement on cache-misses
 * 28% improvement on total instructions executed
 * 26% improvement on total CPU cycles
 * 22% improvement on wall clock time

This is 1/4 items on the checklist before this can actually be merged:

 * [x] parser
 * [ ] render (zig fmt)
 * [ ] astgen
 * [ ] translate-c
2021-01-30 20:16:59 -07:00
LemonBoy
dd973fb365 std: Use {s} instead of {} when printing strings 2021-01-02 17:12:57 -07:00
Frank Denis
6c2e0c2046 Year++ 2020-12-31 15:45:24 -08:00
Travis
d7f9128b5d add error message to zig side of tokenizing/parsing 2020-10-29 12:03:45 -05:00
Travis
960b5b518f updated zig tokenizer to handle .*** and added tests 2020-10-29 12:03:45 -05:00
Tadeo Kondrak
069fbb3c01 Add opaque type syntax 2020-10-06 22:08:24 -06:00
LemonBoy
5c6cd5e2c9 stage{1,2}: Fix parsing of range literals
stage1 was unable to parse ranges whose starting point was written in
binary/octal as the first dot in '...' was incorrectly interpreted as
decimal point.

stage2 forgot to reset the literal type to IntegerLiteral when it
discovered the dot was not a decimal point.

I've only stumbled across this bug because zig fmt keeps formatting the
ranges without any space around the ...
2020-09-28 14:16:26 -04:00
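The ambiguity fixed in 5c6cd5e2c9 is whether a `.` after digits starts a fraction or a `...` range. One way to resolve it is a one-character lookahead, sketched below in Python. This is not how either stage implements it: `lex_number` is a hypothetical helper, it handles decimal digits only, and the tag names are chosen for illustration.

```python
_DIGITS = "0123456789"


def lex_number(src: str, start: int):
    """Return (tag, end_index) for the number literal starting at src[start]."""
    i = start
    while i < len(src) and src[i] in _DIGITS:
        i += 1
    # A dot is a decimal point only when a digit follows it; in `1...3`
    # the dots belong to the range operator and the literal stays an integer.
    if i + 1 < len(src) and src[i] == "." and src[i + 1] in _DIGITS:
        i += 1
        while i < len(src) and src[i] in _DIGITS:
            i += 1
        return "float_literal", i
    return "integer_literal", i
```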
Vexu
1174cb1517 stage2: fix tokenizer float bug 2020-09-03 15:05:47 +03:00
Andrew Kelley
4a69b11e74 add license header to all std lib files
add SPDX license identifier
copyright ownership is zig contributors
2020-08-20 16:07:04 -04:00
Vexu
c2fb4bfff3 add 'anytype' to self-hosted parser 2020-07-11 17:41:16 +03:00
Alexandros Naskos
aa1a727284 Allow carriage return in comments 2020-06-02 00:56:05 -04:00
Vexu
a47257d9b0 fix std.zig rejecting literal tabs in comments 2020-06-01 14:37:36 +03:00