mirror/zig

mirror of https://github.com/ziglang/zig.git synced 2026-01-02 11:33:21 +00:00

Author	SHA1	Message	Date
Veikka Tuominen	349d78a443	validate number literals in AstGen	2022-09-13 20:26:04 -04:00
r00ster91	83909651ea	test: simplify testTokenize What this does is already done by `expectEqual`. Now the trace seems to be shorter and more concise so the errors should be easier to read now.	2022-08-16 00:20:19 +02:00
r00ster91	5490688d65	refactor: use std.ascii functions	2022-08-16 00:20:19 +02:00
r00ster91	e3b3eab840	test(names): some renamings	2022-08-16 00:20:19 +02:00
r00ster91	f07cba10a3	test(names): remove unnecessary "tokenizer - " prefix	2022-08-16 00:20:19 +02:00
zooster	8fd20a5eb0	fix: disallow newline in char literal	2022-08-10 16:13:56 -04:00
Ali Chraghi	a4df443f96	Update Tokenizer Dump Function fix missed `loc` field	2022-02-20 17:47:42 -05:00
Veikka Tuominen	9c36cf92f0	parser: make some errors point to end of previous token For some errors if the found token is not on the same line as the previous token, point to the end of the previous token. This usually results in more helpful errors.	2022-02-17 14:23:35 +02:00
Andrew Kelley	902df103c6	std lib API deprecations for the upcoming 0.9.0 release See #3811	2021-11-30 00:13:07 -07:00
Travis Staloch	4870595352	sat-arithmetic: add additional tokenizer tests	2021-09-28 17:03:43 -07:00
Travis Staloch	dcbc52ec85	sat-arithmetic: correctly tokenize <<|, <<|= - set state rather than result.tag in tokenizer.zig - add test to tokenizer.zig for <<, <<|, <<|=	2021-09-28 17:03:43 -07:00
Travis Staloch	29f41896ed	sat-arithmetic: add operator support - adds initial support for the operators +|, -|, *|, <<|, +|=, -|=, *|=, <<|= - uses operators in addition to builtins in behavior test - adds binOpExt() and assignBinOpExt() to AstGen.zig. these need to be audited	2021-09-28 17:02:43 -07:00
Ryan Liptak	3b09262c12	tokenizer: Fix index-out-of-bounds on unfinished unicode escapes before EOF	2021-09-22 14:33:33 -04:00
Andrew Kelley	1ad905c71e	Merge pull request #9649 from Snektron/address-space Address Spaces	2021-09-20 20:37:04 -04:00
Ryan Liptak	2a728f6e5f	tokenizer: Fix index-out-of-bounds on string_literal_backslash right before EOF	2021-09-20 20:16:14 -04:00
Robin Voetter	ccc7f9987d	Address spaces: addrspace(A) parsing The grammar for function prototypes, (global) variable declarations, and pointer types now accepts an optional addrspace(A) modifier.	2021-09-20 02:29:03 +02:00
Andrew Kelley	05cf44933d	stage2: delete keywords `true`, `false`, `undefined`, `null` The grammar does not need these as keywords; they are merely primitives provided by the language the same as `void`, `u32`, etc.	2021-08-28 12:10:55 -07:00
Andrew Kelley	d29871977f	remove redundant license headers from zig standard library We already have a LICENSE file that covers the Zig Standard Library. We no longer need to remind everyone that the license is MIT in every single file. Previously this was introduced to clarify the situation for a fork of Zig that made Zig's LICENSE file harder to find, and replaced it with their own license that required annual payments to their company. However that fork now appears to be dead. So there is no need to reinforce the copyright notice in every single file.	2021-08-24 12:25:09 -07:00
Andrew Kelley	ec63411905	Revert "Skip over CRs at the end of multiline literals" This reverts commit 9de452f9a69d5590743a194bc2d0817d26d66a0b. No CRs allowed in multiline string literals - this is intentional.	2021-07-07 18:00:04 -07:00
Daniele Cocca	9de452f9a6	Skip over CRs at the end of multiline literals Fixes #9257. This is needed when tokenizing input containing DOS line endings, i.e. the CRLF sequence.	2021-07-07 20:03:19 +03:00
Andrew Kelley	c5c23db627	tokenizer: clean up invalid token error It now displays the byte with proper printability handling. This makes the relevant compile error test case no longer a regression in quality from stage1 to stage2.	2021-07-02 13:28:31 -07:00
Andrew Kelley	7a2e0d9810	AstGen: cleanups to pass more compile error test cases	2021-07-02 13:28:29 -07:00
Andrew Kelley	24c432608f	stage2: improve compile errors from tokenizer In order to not regress the quality of compile errors, some improvements had to be made. * std.zig.parseCharLiteral is improved to return more detailed parse failure information. * tokenizer is improved to handle null bytes in the middle of strings, character literals, and line comments. * validating how many unicode escape digits in string literals is moved to std.zig.parseStringLiteral rather than handled in the tokenizer. * when a tokenizer error occurs, if the reported token is the 'invalid' tag, an error note is added to point to the invalid byte location. Further improvements would be: - Mention the expected set of allowed bytes at this location. - Display the invalid byte (if printable, print it, otherwise escape-print it).	2021-07-02 13:27:35 -07:00
Andrew Kelley	3f680abbe2	stage2: tokenizer: require null terminated source By requiring the source file to be null-terminated, we avoid extra branching while simplifying the logic at the same time. Running ast-check on a large zig source file (udivmodti4_test.zig), master branch compared to this commit: * 4% faster wall clock * 7% fewer cache misses * 1% fewer branches	2021-07-02 13:27:35 -07:00
Jacob G-W	641ecc260f	std, src, doc, test: remove unused variables	2021-06-21 17:03:03 -07:00
Dmitry Matveyev	00982f75e9	stage2: Remove special double ampersand parsing case (#9114) * Remove parser error on double ampersand * Add failing test for double ampersand case * Add error when encountering double ampersand in AstGen "Bit and" operator should not make sense when one of its operands is an address. * Check that 2 ampersands are adjacent to each other in source string * Remove cases of unused variables in tests	2021-06-20 21:04:14 +03:00
Isaac Freund	608bc1cbd5	stage2: disallow `1.e9` and `0x1.p9` as float literals Instead require `1e9` and `0x1p9`, disallowing the trailing dot. This change to the grammar is consistent with forbidding `1.` and `0x1.` as float literals and ensures there is only one way to do things here.	2021-05-31 19:51:11 +00:00
Isaac Freund	ec10595b65	stage2: disallow trailing dot on float literals This disallows e.g. `1.` or `0x1.` as a float literal, which is consistent with the grammar.	2021-05-27 21:00:44 -04:00
Matthew Borkowski	f750618846	stop tokenizer from recognizing lone `@` or `@` followed by a digit as a builtin	2021-05-26 04:13:04 -04:00
Andrew Kelley	44de884980	tokenizer: fix crash on multiline string with only 1 backslash Closes #8904	2021-05-25 18:13:10 -07:00
Veikka Tuominen	fd77f2cfed	std: update usage of std.testing	2021-05-08 15:15:30 +03:00
Andrew Kelley	8e6c2b7a47	Merge remote-tracking branch 'origin/master' into ast-memory-layout	2021-02-24 15:08:23 -07:00
Josh Wolfe	8b9434871e	Avoid concept of a "Unicode character" in documentation and error messages (#8059)	2021-02-24 08:26:13 -05:00
Isaac Freund	f3ee10b454	zig fmt: fix comments ending with EOF after decls Achieve this by reducing the amount of special casing to handle EOF so that the already correct logic for normal comments does not need to be duplicated.	2021-02-22 18:32:37 +01:00
Veikka Tuominen	e2289961c6	snake_case Token.Tag	2021-02-12 02:12:00 +02:00
Andrew Kelley	272a0ab359	zig fmt: implement "line comment followed by top-level comptime"	2021-02-01 20:11:55 -07:00
Andrew Kelley	20554d32c0	zig fmt: start reworking with new memory layout * start implementation of ast.Tree.firstToken and lastToken * clarify some ast.Node doc comments * reimplement renderToken	2021-02-01 17:23:49 -07:00
Andrew Kelley	bf8fafc37d	stage2: tokenizer does not emit line comments anymore only std.zig.render cares about these, and it can find them in the original source easily enough.	2021-01-31 21:57:48 -07:00
Andrew Kelley	4dca99d3f6	stage2: rework AST memory layout This is a proof-of-concept of switching to a new memory layout for tokens and AST nodes. The goal is threefold: * smaller memory footprint * faster performance for tokenization and parsing * most importantly, a proof-of-concept that can be also applied to ZIR and TZIR to improve the entire compiler pipeline in this way. I had a few key insights here: * Underlying premise: using less memory will make things faster, because of fewer allocations and better cache utilization. Also using less memory is valuable in and of itself. * Using a Struct-Of-Arrays for tokens and AST nodes, saves the bytes of padding between the enum tag (which kind of token is it; which kind of AST node is it) and the next fields in the struct. It also improves cache coherence, since one can peek ahead in the tokens array without having to load the source locations of tokens. * Token memory can be conserved by only having the tag (1 byte) and byte offset (4 bytes) for a total of 5 bytes per token. It is not necessary to store the token ending byte offset because one can always re-tokenize later, but also most tokens the length can be trivially determined from the tag alone, and for ones where it doesn't, string literals for example, one must parse the string literal again later anyway in astgen, making it free to re-tokenize. * AST nodes do not actually need to store more than 1 token index because one can poke left and right in the tokens array very cheaply. So far we are left with one big problem though: how can we put AST nodes into an array, since different AST nodes are different sizes? This is where my key observation comes in: one can have a hash table for the extra data for the less common AST nodes! 
But it gets even better than that: I defined this data that is always present for every AST Node: * tag (1 byte) - which AST node is it * main_token (4 bytes, index into tokens array) - the tag determines which token this points to * struct{lhs: u32, rhs: u32} - enough to store 2 indexes to other AST nodes, the tag determines how to interpret this data You can see how a binary operation, such as `a * b` would fit into this structure perfectly. A unary operation, such as `*a` would also fit, and leave `rhs` unused. So this is a total of 13 bytes per AST node. And again, we don't have to pay for the padding to round up to 16 because we store in struct-of-arrays format. I made a further observation: the only kind of data AST nodes need to store other than the main_token is indexes to sub-expressions. That's it. The only purpose of an AST is to bring a tree structure to a list of tokens. This observation means all the data that nodes store are only sets of u32 indexes to other nodes. The other tokens can be found later by the compiler, by poking around in the tokens array, which again is super fast because it is struct-of-arrays, so you often only need to look at the token tags array, which is an array of bytes, very cache friendly. So for nearly every kind of AST node, you can store it in 13 bytes. For the rarer AST nodes that have 3 or more indexes to other nodes to store, either the lhs or the rhs will be repurposed to be an index into an extra_data array which contains the extra AST node indexes. In other words, no hash table needed, it's just 1 big ArrayList with the extra data for AST Nodes. Final observation, no need to have a canonical tag for a given AST. For example: The expression `foo(bar)` is a function call. Function calls can have any number of parameters. However in this example, we can encode the function call into the AST with a tag called `FunctionCallOnlyOneParam`, and use lhs for the function expr and rhs for the only parameter expr. 
Meanwhile if the code was `foo(bar, baz)` then the AST node would have to be `FunctionCall` with lhs still being the function expr, but rhs being the index into `extra_data`. Then because the tag is `FunctionCall` it means `extra_data[rhs]` is the "start" and `extra_data[rhs+1]` is the "end". Now the range `extra_data[start..end]` describes the list of parameters to the function. Point being, you only have to pay for the extra bytes if the AST actually requires it. There's no limit to the number of different AST tag encodings. Preliminary results: 15% improvement on cache-misses * 28% improvement on total instructions executed * 26% improvement on total CPU cycles * 22% improvement on wall clock time This is 1/4 items on the checklist before this can actually be merged: * [x] parser * [ ] render (zig fmt) * [ ] astgen * [ ] translate-c	2021-01-30 20:16:59 -07:00
LemonBoy	dd973fb365	std: Use {s} instead of {} when printing strings	2021-01-02 17:12:57 -07:00
Frank Denis	6c2e0c2046	Year++	2020-12-31 15:45:24 -08:00
Travis	d7f9128b5d	add error message to zig side of tokenizing/parsing	2020-10-29 12:03:45 -05:00
Travis	960b5b518f	updated zig tokenizer to handle .*** and added tests	2020-10-29 12:03:45 -05:00
Tadeo Kondrak	069fbb3c01	Add opaque type syntax	2020-10-06 22:08:24 -06:00
LemonBoy	5c6cd5e2c9	stage{1,2}: Fix parsing of range literals stage1 was unable to parse ranges whose starting point was written in binary/octal as the first dot in '...' was incorrectly interpreted as decimal point. stage2 forgot to reset the literal type to IntegerLiteral when it discovered the dot was not a decimal point. I've only stumbled across this bug because zig fmt keeps formatting the ranges without any space around the ...	2020-09-28 14:16:26 -04:00
Vexu	1174cb1517	stage2: fix tokenizer float bug	2020-09-03 15:05:47 +03:00
Andrew Kelley	4a69b11e74	add license header to all std lib files add SPDX license identifier copyright ownership is zig contributors	2020-08-20 16:07:04 -04:00
Vexu	c2fb4bfff3	add 'anytype' to self-hosted parser	2020-07-11 17:41:16 +03:00
Alexandros Naskos	aa1a727284	Allow carriage return in comments	2020-06-02 00:56:05 -04:00
Vexu	a47257d9b0	fix std.zig rejecting literal tabs in comments	2020-06-01 14:37:36 +03:00

77 Commits