100 Commits

Author SHA1 Message Date
Andrew Kelley
d789f1e5cf fuzzer: write inputs to shared memory before running
breaking change to the fuzz testing API; it now passes a type-safe
context parameter to the fuzz function.

libfuzzer is reworked to select inputs from the entire corpus.

I tested that it's roughly as good as it was before in that it can find
the panics in the simple examples, as well as achieve decent coverage on
the tokenizer fuzz test.

however I think the next step here will be figuring out why so many
points of interest are missing from the tokenizer in both Debug and
ReleaseSafe modes.

does not quite close #20803 yet since there are some more important
things to be done, such as opening the previous corpus, continuing
fuzzing after finding bugs, storing the length of the inputs, etc.
2025-02-11 13:39:20 -08:00
Igor Stojković
0676c04681
tokenizer: fix 0 byte following invalid (#21482)
closes #21481
2024-09-23 13:06:30 -07:00
Andrew Kelley
892ce7ef52 rework fuzzing API
The previous API used `std.testing.fuzzInput(.{})` however that has the
problem that users call it multiple times incorrectly, and there might
be work happening to obtain the corpus which should not be included in
coverage analysis, and which must not slow down iteration speed.

This commit restructures it so that the main loop lives in libfuzzer and
directly calls the "test one" function.

In this commit I was a little too aggressive because I made the test
runner export `fuzzer_one` for this purpose. This was motivated by
performance, but it causes "exported symbol collision: fuzzer_one" to
occur when more than one fuzz test is provided.

There are three ways to solve this:

1. libfuzzer needs to be passed a function pointer instead. Possible
   performance downside.

2. build runner needs to build a different process per fuzz test.
   Potentially wasteful and unclear how to isolate them.

3. test runner needs to perform a relocation at runtime to point the
   function call to the relevant unit test. Portability issues and
   dubious performance gains.
2024-09-11 13:41:29 -07:00
Eric Petersen
36b89101df tokenizer: use labeled switch statements 2024-09-10 16:09:37 -07:00
Ian Johnson
9007534551 std.zig.tokenizer: simplify line-based tokens
Closes #21358
Closes #21360

This commit modifies the `multiline_string_literal_line`, `doc_comment`,
and `container_doc_comment` tokens to no longer include the line ending
as part of the token. This makes it easier to handle line endings (which
may be LF, CRLF, or in edge cases possibly nonexistent) consistently.

In the two issues linked above, Autodoc was already assuming this for
doc comments, and yielding incorrect results when handling files with
CRLF line endings (both in Markdown parsing and source rendering).

Applying the same simplification for multiline string literals also
brings `zig fmt` into conformance with
https://github.com/ziglang/zig-spec/issues/38 regarding formatting of
multiline strings with CRLF line endings: the spec says that `zig fmt`
should remove the CR from such line endings, but this was not previously
the case.
2024-09-10 13:34:33 +03:00
Andrew Kelley
e0ffac4e3c introduce a web interface for fuzzing
* new .zig-cache subdirectory: 'v'
  - stores coverage information with filename of hash of PCs that want
    coverage. This hash is a hex encoding of the 64-bit coverage ID.
* build runner
  * fixed bug in file system inputs when a compile step has an
    overridden zig_lib_dir field set.
  * set some std lib options optimized for the build runner
    - no side channel mitigations
    - no Transport Layer Security
    - no crypto fork safety
  * add a --port CLI arg for choosing the port the fuzzing web interface
    listens on. it defaults to choosing a random open port.
  * introduce a web server, and serve a basic single page application
    - shares wasm code with autodocs
    - assets are created live on request, for convenient development
      experience. main.wasm is properly cached if nothing changes.
    - sources.tar comes from file system inputs (introduced with the
      `--watch` feature)
  * receives coverage ID from test runner and sends it on a thread-safe
    queue to the WebServer.
* test runner
  - takes a zig cache directory argument now, for where to put coverage
    information.
  - sends coverage ID to parent process
* fuzzer
  - puts its logs (in debug mode) in .zig-cache/tmp/libfuzzer.log
  - computes coverage_id and makes it available with
    `fuzzer_coverage_id` exported function.
  - the memory-mapped coverage file is now namespaced by the coverage id
    in hex encoding, in `.zig-cache/v`
* tokenizer
  - add a fuzz test to check that several properties are upheld
2024-08-07 00:48:32 -07:00
Andrew Kelley
c2b8afcac9 tokenizer: tabs and carriage returns spec conformance 2024-07-31 16:57:42 -07:00
Andrew Kelley
377e8579f9 std.zig.tokenizer: simplify
I pointed a fuzzer at the tokenizer and it crashed immediately. Upon
inspection, I was dissatisfied with the implementation. This commit
removes several mechanisms:
* Removes the "invalid byte" compile error note.
* Dramatically simplifies tokenizer recovery by making recovery always
  occur at newlines, and never otherwise.
* Removes UTF-8 validation.
* Moves some character validation logic to `std.zig.parseCharLiteral`.

Removing UTF-8 validation is a regression of #663, however, the existing
implementation was already buggy. When adding this functionality back,
it must be fuzz-tested while checking the property that it matches an
independent Unicode validation implementation on the same file. While
we're at it, fuzzing should check the other properties of that proposal,
such as no ASCII control characters existing inside the source code.

Other changes included in this commit:

* Deprecate `std.unicode.utf8Decode` and its WTF-8 counterpart. This
  function has an awkward API that is too easy to misuse.
* Make `utf8Decode2` and friends use arrays as parameters, eliminating a
  runtime assertion in favor of using the type system.

After this commit, the crash found by fuzzing, which was
"\x07\xd5\x80\xc3=o\xda|a\xfc{\x9a\xec\x91\xdf\x0f\\\x1a^\xbe;\x8c\xbf\xee\xea"
no longer causes a crash. However, I did not feel the need to add this
test case because the simplified logic eradicates most crashes of this
nature.
2024-07-31 16:57:42 -07:00
gooncreeper
c50f300387 Tokenizer bug fixes and improvements
Fixes many error messages corresponding to invalid bytes displaying the
wrong byte. Additionaly improves handling of UTF-8 in some places.
2024-07-15 11:31:19 +03:00
Michael Bradshaw
02b3d5b58a Rename isASCII to isAscii 2024-07-02 16:31:15 +02:00
Travis Staloch
8af59d1f98 ComptimeStringMap: return a regular struct and optimize
this patch renames ComptimeStringMap to StaticStringMap, makes it
accept only a single type parameter, and return a known struct type
instead of an anonymous struct.  initial motivation for these changes
was to reduce the 'very long type names' issue described here
https://github.com/ziglang/zig/pull/19682.

this breaks the previous API.  users will now need to write:
`const map = std.StaticStringMap(T).initComptime(kvs_list);`

* move `kvs_list` param from type param to an `initComptime()` param
* new public methods
  * `keys()`, `values()` helpers
  * `init(allocator)`, `deinit(allocator)` for runtime data
  * `getLongestPrefix(str)`, `getLongestPrefixIndex(str)` - i'm not sure
     these belong but have left in for now incase they are deemed useful
* performance notes:
  * i posted some benchmarking results here:
    https://github.com/travisstaloch/comptime-string-map-revised/issues/1
  * i noticed a speedup reducing the size of the struct from 48 to 32
    bytes and thus use u32s instead of usize for all length fields
  * i noticed speedup storing KVs as a struct of arrays
  * latest benchmark shows these wall_time improvements for
    debug/safe/small/fast builds: -6.6% / -10.2% / -19.1% / -8.9%. full
    output in link above.
2024-04-22 15:31:41 -07:00
Jacob Young
509be7cf1f x86_64: fix std test failures 2023-11-03 23:18:21 -04:00
Jacob Young
fe93332ba2 x86_64: implement enough to pass unicode tests
* implement vector comparison
 * implement reduce for bool vectors
 * fix `@memcpy` bug
 * enable passing std tests
2023-10-23 22:42:18 -04:00
Jacob Young
27fe945a00 Revert "Revert "Merge pull request #17637 from jacobly0/x86_64-test-std""
This reverts commit 6f0198cadbe29294f2bf3153a27beebd64377566.
2023-10-22 15:46:43 -04:00
Andrew Kelley
6f0198cadb Revert "Merge pull request #17637 from jacobly0/x86_64-test-std"
This reverts commit 0c99ba1eab63865592bb084feb271cd4e4b0357e, reversing
changes made to 5f92b070bf284f1493b1b5d433dd3adde2f46727.

This caused a CI failure when it landed in master branch due to a
128-bit `@byteSwap` in std.mem.
2023-10-22 12:16:35 -07:00
Jacob Young
32e85d44eb x86_64: disable failing tests, enable test-std testing 2023-10-21 10:55:41 -04:00
Jacob Young
2e6e39a700 x86_64: fix bugs and disable erroring tests 2023-10-21 10:55:41 -04:00
mlugg
f26dda2117 all: migrate code to new cast builtin syntax
Most of this migration was performed automatically with `zig fmt`. There
were a few exceptions which I had to manually fix:

* `@alignCast` and `@addrSpaceCast` cannot be automatically rewritten
* `@truncate`'s fixup is incorrect for vectors
* Test cases are not formatted, and their error locations change
2023-06-24 16:56:39 -07:00
Tom Read Cutting
346ec15c50
Correctly handle carriage return characters according to the spec (#12661)
* Scan from line start when finding tag in tokenizer

This resolves a crash that can occur for invalid bytes like carriage
returns that are valid characters when not parsed from within literals.

There are potentially other edge cases this could resolve as well, as
the calling code for this function didn't account for any potential
'pending_invalid_tokens' that could be queued up by the tokenizer from
within another state.

* Fix carriage return crash in multiline string

Follow the guidance of #38:

> However CR directly before NL is interpreted as only a newline and not part of the multiline string. zig fmt will delete the CR.

Zig fmt already had code for deleting carriage returns, but would still
crash - now it no longer does so. Carriage returns encountered before
line-feeds are now appropriately removed on program compilation as well.

* Only accept carriage returns before line feeds

Previous commit was much less strict about this, this more closely
matches the desired spec of only allow CR characters in a CRLF pair, but
not otherwise.

* Fix CR being rejected when used as whitespace

Missed this comment from ziglang/zig-spec#83:

> CR used as whitespace, whether directly preceding NL or stray, is still unambiguously whitespace. It is accepted by the grammar and replaced by the canonical whitespace by zig fmt.

* Add tests for carriage return handling
2023-02-19 14:14:03 +02:00
Techatrix
c63be507cf don't tokenize an invalid string literal 2023-02-11 14:25:25 +02:00
Veikka Tuominen
841b38aae8 tokenizer: detect null as non-first byte of a line comment
Line comments do not produce actual tokens so they need special
handling for null bytes.

Closes #14346
2023-01-17 20:39:19 +02:00
r00ster91
6b7d9b34e8 api(std.ascii): remove deprecated decls 2022-12-09 21:57:17 +01:00
Veikka Tuominen
6039554b26 tokenizer: detect null bytes before EOF
Closes #13811
2022-12-08 00:16:30 +02:00
Veikka Tuominen
349d78a443 validate number literals in AstGen 2022-09-13 20:26:04 -04:00
r00ster91
83909651ea test: simplify testTokenize
What this does is already done by `expectEqual`.
Now the trace seems to be shorter and more concise so the errors should be easier to read now.
2022-08-16 00:20:19 +02:00
r00ster91
5490688d65 refactor: use std.ascii functions 2022-08-16 00:20:19 +02:00
r00ster91
e3b3eab840 test(names): some renamings 2022-08-16 00:20:19 +02:00
r00ster91
f07cba10a3 test(names): remove unnecessary "tokenizer - " prefix 2022-08-16 00:20:19 +02:00
zooster
8fd20a5eb0 fix: disallow newline in char literal 2022-08-10 16:13:56 -04:00
Ali Chraghi
a4df443f96 Update Tokenizer Dump Function
fix missed `loc` field
2022-02-20 17:47:42 -05:00
Veikka Tuominen
9c36cf92f0 parser: make some errors point to end of previous token
For some errors if the found token is not on the same line as
the previous token, point to the end of the previous token.
This usually results in more helpful errors.
2022-02-17 14:23:35 +02:00
Andrew Kelley
902df103c6 std lib API deprecations for the upcoming 0.9.0 release
See #3811
2021-11-30 00:13:07 -07:00
Travis Staloch
4870595352 sat-arithmetic: add additional tokenizer tests 2021-09-28 17:03:43 -07:00
Travis Staloch
dcbc52ec85 sat-arithmetic: correctly tokenize <<|, <<|=
- set state rather than result.tag in tokenizer.zig
- add test to tokenizer.zig for <<, <<|, <<|=
2021-09-28 17:03:43 -07:00
Travis Staloch
29f41896ed sat-arithmetic: add operator support
- adds initial support for the operators +|, -|, *|, <<|, +|=, -|=, *|=, <<|=
- uses operators in addition to builtins in behavior test
- adds binOpExt() and assignBinOpExt() to AstGen.zig. these need to be audited
2021-09-28 17:02:43 -07:00
Ryan Liptak
3b09262c12 tokenizer: Fix index-out-of-bounds on unfinished unicode escapes before EOF 2021-09-22 14:33:33 -04:00
Andrew Kelley
1ad905c71e
Merge pull request #9649 from Snektron/address-space
Address Spaces
2021-09-20 20:37:04 -04:00
Ryan Liptak
2a728f6e5f tokenizer: Fix index-out-of-bounds on string_literal_backslash right before EOF 2021-09-20 20:16:14 -04:00
Robin Voetter
ccc7f9987d Address spaces: addrspace(A) parsing
The grammar for function prototypes, (global) variable declarations, and
pointer types now accepts an optional addrspace(A) modifier.
2021-09-20 02:29:03 +02:00
Andrew Kelley
05cf44933d stage2: delete keywords true, false, undefined, null
The grammar does not need these as keywords; they are merely primitives
provided by the language the same as `void`, `u32`, etc.
2021-08-28 12:10:55 -07:00
Andrew Kelley
d29871977f remove redundant license headers from zig standard library
We already have a LICENSE file that covers the Zig Standard Library. We
no longer need to remind everyone that the license is MIT in every single
file.

Previously this was introduced to clarify the situation for a fork of
Zig that made Zig's LICENSE file harder to find, and replaced it with
their own license that required annual payments to their company.
However that fork now appears to be dead. So there is no need to
reinforce the copyright notice in every single file.
2021-08-24 12:25:09 -07:00
Andrew Kelley
ec63411905 Revert "Skip over CRs at the end of multiline literals"
This reverts commit 9de452f9a69d5590743a194bc2d0817d26d66a0b.

No CRs allowed in multiline string literals - this is intentional.
2021-07-07 18:00:04 -07:00
Daniele Cocca
9de452f9a6 Skip over CRs at the end of multiline literals
Fixes #9257.
This is needed when tokenizing input containing DOS line endings, i.e.
the CRLF sequence.
2021-07-07 20:03:19 +03:00
Andrew Kelley
c5c23db627 tokenizer: clean up invalid token error
It now displays the byte with proper printability handling. This makes
the relevant compile error test case no longer a regression in quality
from stage1 to stage2.
2021-07-02 13:28:31 -07:00
Andrew Kelley
7a2e0d9810 AstGen: cleanups to pass more compile error test cases 2021-07-02 13:28:29 -07:00
Andrew Kelley
24c432608f stage2: improve compile errors from tokenizer
In order to not regress the quality of compile errors, some improvements
had to be made.

 * std.zig.parseCharLiteral is improved to return more detailed parse
   failure information.
 * tokenizer is improved to handle null bytes in the middle of strings,
   character literals, and line comments.
 * validating how many unicode escape digits in string literals is moved
   to std.zig.parseStringLiteral rather than handled in the tokenizer.
 * when a tokenizer error occurs, if the reported token is the 'invalid'
   tag, an error note is added to point to the invalid byte location.
   Further improvements would be:
   - Mention the expected set of allowed bytes at this location.
   - Display the invalid byte (if printable, print it, otherwise
     escape-print it).
2021-07-02 13:27:35 -07:00
Andrew Kelley
3f680abbe2 stage2: tokenizer: require null terminated source
By requiring the source file to be null-terminated, we avoid extra
branching while simplifying the logic at the same time.

Running ast-check on a large zig source file (udivmodti4_test.zig),
master branch compared to this commit:
 * 4% faster wall clock
 * 7% fewer cache misses
 * 1% fewer branches
2021-07-02 13:27:35 -07:00
Jacob G-W
641ecc260f std, src, doc, test: remove unused variables 2021-06-21 17:03:03 -07:00
Dmitry Matveyev
00982f75e9
stage2: Remove special double ampersand parsing case (#9114)
* Remove parser error on double ampersand

* Add failing test for double ampersand case

* Add error when encountering double ampersand in AstGen

"Bit and" operator should not make sense when one of its operands
is an address.

* Check that 2 ampersands are adjacent to each other in source string

* Remove cases of unused variables in tests
2021-06-20 21:04:14 +03:00
Isaac Freund
608bc1cbd5
stage2: disallow 1.e9 and 0x1.p9 as float literals
Instead require `1e9` and `0x1p9`, disallowing the trailing dot.

This change to the grammar is consistent with forbidding `1.` and `0x1.`
as float literals and ensures there is only one way to do things here.
2021-05-31 19:51:11 +00:00