Comptime code can't execute assembly code, so we need some way to
force comptime code to use the generic path. This should be replaced
with whatever is implemented for #868, when that day comes.
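Here is a minimal sketch of the dispatch this implies. The `@inComptime()`
builtin is an assumption standing in for whatever #868 eventually provides,
and `roundGeneric`/`roundIntrinsic` are hypothetical names for the two paths:

```zig
const std = @import("std");
const builtin = @import("builtin");

// Hypothetical stand-ins for the two implementations.
fn roundGeneric(state: *[8]u32, block: *const [64]u8) void {
    _ = state;
    _ = block;
    // portable software path, usable at comptime
}

fn roundIntrinsic(state: *[8]u32, block: *const [64]u8) void {
    _ = state;
    _ = block;
    // hand-written inline assembly path, runtime only
}

fn round(state: *[8]u32, block: *const [64]u8) void {
    // Force the generic path whenever we are executing at comptime,
    // since the compile-time interpreter can't run inline assembly.
    if (!@inComptime() and comptime (builtin.cpu.arch == .x86_64 and
        std.Target.x86.featureSetHas(builtin.cpu.features, .sha)))
    {
        return roundIntrinsic(state, block);
    }
    return roundGeneric(state, block);
}
```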
I am seeing incorrect hash results in stage1 and a crash in stage2, so
presumably this never worked correctly. I will follow up on that soon.
This gets us most of the way back to the performance I had when
I was using the LLVM intrinsics:
- Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz:
  190.67 MB/s (w/o intrinsics) -> 1285.08 MB/s
- AMD EPYC 7763 (VM) @ 2.45 GHz:
  240.09 MB/s (w/o intrinsics) -> 1360.78 MB/s
- Apple M1:
  216.96 MB/s (w/o intrinsics) -> 2133.69 MB/s
Minor changes to this source can swing performance from 400 MB/s to
1400 MB/s or... 20 MB/s, depending on how it interacts with the
optimizer. I have a sneaking suspicion that despite LLVM inheriting
GCC's extremely strict inline assembly semantics, its passes are
rather skittish around inline assembly (and almost certainly, its
instruction cost models can assume nothing).
There's probably plenty of room to optimize these further in the
future, but for the moment this gives a ~3x improvement on Intel
x86-64 processors, ~5x on AMD, and ~10x on M1 Macs.
These extensions are very new; most processors prior to 2020 do not
support them.
AVX-512 is a slightly older alternative that we could use on Intel
for a much bigger performance bump, but it's been fused off on
Intel's latest hybrid architectures and it relies on computing
independent SHA hashes in parallel. In contrast, these SHA intrinsics
provide the usual single-threaded, single-stream interface, and should
continue working on new processors.
AArch64 also has SHA-512 intrinsics that we could take advantage
of in the future.
Packed memory has a well-defined layout that doesn't require
conversion from an integer to read from. Let's use it :-)
This change means that for bitcasting to/from a packed value that
is N layers deep, we no longer have to create N temporary big-ints
and perform N copies.
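For illustration, a sketch of the nested case this speeds up, using made-up
types and the single-argument `@bitCast` syntax of recent Zig versions:

```zig
const std = @import("std");

// Hypothetical nested packed types: Inner sits one layer inside Outer.
const Inner = packed struct { x: u3, y: u5 };
const Outer = packed struct { inner: Inner, z: u8 };

test "bitcast a nested packed struct" {
    const o: Outer = .{ .inner = .{ .x = 5, .y = 9 }, .z = 0xAB };
    // One direct reinterpretation of the packed bits, rather than one
    // temporary big-int and one copy per layer of nesting.
    const raw: u16 = @bitCast(o);
    try std.testing.expectEqual(@as(u16, 0xAB4D), raw);
}
```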
Other miscellaneous improvements:
- Adds support for casting to packed enums and vectors
- Fixes bitcasting to/from vectors outside of a packed struct (see the
  sketch after this list)
- Adds a fast path for bitcasting <= u/i64
- Fixes a bug when bitcasting f80 that would clear the following fields
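As a sketch of the vector fix above, with arbitrary values and an
endian-agnostic round-trip check:

```zig
const std = @import("std");

test "bitcast a vector outside of a packed struct" {
    const v: @Vector(4, u8) = .{ 0x11, 0x22, 0x33, 0x44 };
    // Reinterpret the vector's bits as an integer and back again.
    const x: u32 = @bitCast(v);
    const v2: @Vector(4, u8) = @bitCast(x);
    try std.testing.expect(@reduce(.And, v == v2));
}
```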
This also changes the bitcast memory layout of exotic integers on
big-endian systems to match what's empirically observed on our targets.
Technically, this layout is not guaranteed by LLVM, so we should probably
ban bitcasts that reveal these padding bits, but for now this is an
improvement.
Similar to what was done for EdDSA, allow incremental creation
and verification of ECDSA signatures.
Doing so for ECDSA is trivial, and can be useful for TLS as well
as the future package manager.
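A sketch of what the incremental interface looks like, assuming it mirrors
the Ed25519 `Signer`/`Verifier` pattern; exact names and signatures may
differ between Zig versions:

```zig
const std = @import("std");

test "incremental ECDSA signing and verification" {
    const Ecdsa = std.crypto.sign.ecdsa.EcdsaP256Sha256;
    const key_pair = try Ecdsa.KeyPair.create(null);

    // Feed the message in pieces instead of all at once.
    var signer = try key_pair.signer(null);
    signer.update("hello, ");
    signer.update("world");
    const sig = try signer.finalize();

    var verifier = try sig.verifier(key_pair.public_key);
    verifier.update("hello, ");
    verifier.update("world");
    try verifier.verify();
}
```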
* Old cmake option: `-DZIG_SKIP_INSTALL_LIB_FILES=ON`
* New cmake option: `-DZIG_NO_LIB=ON`
* Old build.zig option: `-Dskip-install-lib-files`
* New build.zig option: `-Dno-lib`
The motivation is to make build commands easier to type.
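For example, `cmake .. -DZIG_NO_LIB=ON` replaces the old cmake flag, and
`zig build -Dno-lib` replaces `zig build -Dskip-install-lib-files`.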
This definition communicates to libcxxabi that the libc will provide the
`__cxa_thread_atexit_impl` symbol. This is true for glibc but not
for other libcs, such as musl.
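A sketch of the conditional this implies, assuming the macro is libcxxabi's
`HAVE___CXA_THREAD_ATEXIT_IMPL` and that a `cflags` list is how the defines
are passed (both are assumptions here):

```zig
const std = @import("std");

// Hypothetical helper: append the define only when the target libc
// actually provides __cxa_thread_atexit_impl.
fn addLibCxxAbiDefines(cflags: *std.ArrayList([]const u8), target: std.Target) !void {
    if (target.isGnuLibC()) {
        // glibc provides __cxa_thread_atexit_impl; musl and others do not.
        try cflags.append("-DHAVE___CXA_THREAD_ATEXIT_IMPL");
    }
}
```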