BLAKE2 includes the expected output length in the initial state.
This length is distinct from the actual output length used at
finalization.
BLAKE2b-256/128 (BLAKE2b-256 truncated to 128 bits) is thus not the
same as BLAKE2b-128.
This behavior can be a little bit surprising, and has been "fixed"
in BLAKE3.
In order to support this, we may want to provide an option to set the
length used for domain separation.
In Zig, there is another reason to allow this: we assume that the
output length is defined at comptime.
But BLAKE2 doesn't have a fixed output length. For an output length
that is not known at comptime, we can't simply compute a full-size
digest and truncate it, for the reason above.
What we can do now is set that length as an option to get the correct
initial state, and truncate the output if necessary.
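A rough sketch of how that option might be used, assuming it is exposed
as `expected_out_bits` on the init options (the exact field name may
differ):

    const std = @import("std");
    const Blake2b256 = std.crypto.hash.blake2.Blake2b256;

    // Sketch: derive a 128-bit digest from the Blake2b256 type by setting the
    // domain-separation length at init time and truncating the output.
    // Assumes the option is named `expected_out_bits`.
    fn blake2b128(msg: []const u8) [16]u8 {
        var full: [32]u8 = undefined;
        var h = Blake2b256.init(.{ .expected_out_bits = 128 });
        h.update(msg);
        h.final(&full);
        return full[0..16].*;
    }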
Leverage result location semantics for X25519 like we do everywhere
else in 25519/*
Also add the edwards25519->curve25519 map, since many applications
seem to use it to share the same key pair for encryption and
signatures.
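A minimal sketch of what the added map enables, assuming it is exposed
as `Curve25519.fromEdwards25519` (the underlying birational map is
u = (1 + y) / (1 - y)):

    const std = @import("std");
    const Edwards25519 = std.crypto.ecc.Edwards25519;
    const Curve25519 = std.crypto.ecc.Curve25519;

    // Sketch: convert an Ed25519 public point into its X25519 representation
    // so the same key pair can back both signatures and key exchange.
    // Assumes a Curve25519.fromEdwards25519 helper.
    fn edwardsToMontgomery(ed_pub: [32]u8) ![32]u8 {
        const p = try Edwards25519.fromBytes(ed_pub);
        const q = try Curve25519.fromEdwards25519(p);
        return q.toBytes();
    }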
Intel keeps changing the latency & throughput of the aes* and clmul
instructions every time they release a new model.
Adjust `optimal_parallel_blocks` accordingly, keeping 8 as a safe
default for unknown CPU models.
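Roughly the shape of the tuning, with placeholder model names and
counts (how the target CPU model is queried varies between Zig
versions):

    const std = @import("std");
    const builtin = @import("builtin");

    // Illustrative only: pick a per-model block count, with 8 as the safe
    // default for unknown models. The values below are placeholders, not the
    // numbers actually used in std.
    pub const optimal_parallel_blocks: usize = blk: {
        const model = builtin.cpu.model;
        if (model == &std.Target.x86.cpu.westmere) break :blk 6;
        if (model == &std.Target.x86.cpu.znver2) break :blk 12;
        break :blk 8;
    };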
It seems that Apple has finally got rid of the 32-bit versions of
`fstat` and `fstatat`; instead, only the 64-bit versions are available
on Big Sur and Apple Silicon.
The tweak in this commit is required to make Zig stage1 compile on
Big Sur + aarch64.
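The gist of the tweak, sketched with illustrative declarations (the
actual symbol selection in std may be structured differently):

    const std = @import("std");
    const builtin = @import("builtin");

    // Sketch: on x86_64 Darwin the 64-bit-inode variants live behind the
    // "$INODE64" symbols, while on Apple Silicon only the plain names exist
    // and are already 64-bit, so the symbol has to be picked per architecture.
    extern "c" fn fstat(fd: std.c.fd_t, buf: *std.c.Stat) c_int;
    extern "c" fn @"fstat$INODE64"(fd: std.c.fd_t, buf: *std.c.Stat) c_int;

    pub const fstat64 = if (builtin.target.cpu.arch == .aarch64)
        fstat
    else
        @"fstat$INODE64";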
std.os.accept is already meant to allow null, which matches `man 3p accept`:
> address Either a null pointer, or a pointer to a sockaddr structure
> where the address of the connecting socket shall be returned.
>
> address_len Either a null pointer, if address is a null pointer, or a
> pointer to a socklen_t object which on input specifies the
> length of the supplied sockaddr structure, and on output
> specifies the length of the stored address.
Fixes ziglang#6832.
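A minimal usage sketch under that interpretation, assuming the
four-argument form that takes flags (namespaces may differ across Zig
versions):

    const std = @import("std");

    // Sketch: accept a connection without retrieving the peer address by
    // passing null for both the address and the address length.
    fn acceptAnonymous(listener: std.os.socket_t) !std.os.socket_t {
        return std.os.accept(listener, null, null, 0);
    }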
Gives a ~40% speedup on x86_64.
However, the generic code remains faster on aarch64.
This is still processing only one block at a time for now.
I'm pretty confident that processing more blocks per round
will eventually give a substantial performance improvement on
all platforms with vector units.
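To illustrate the dependency-breaking idea behind processing several
blocks per round (plain wrapping integer arithmetic stands in for the
actual field arithmetic; the helper below is purely hypothetical):

    // Horner form: acc = (acc + m[i]) * h, where each multiplication waits on
    // the previous one. Precomputing h2 = h * h lets two independent
    // multiplications be issued per round:
    //   acc = (acc + m[i]) * h2 + m[i + 1] * h
    fn absorbPairs(acc: u64, h: u64, h2: u64, blocks: []const u64) u64 {
        var a = acc;
        var i: usize = 0;
        while (i + 2 <= blocks.len) : (i += 2) {
            a = (a +% blocks[i]) *% h2 +% blocks[i + 1] *% h;
        }
        if (i < blocks.len) a = (a +% blocks[i]) *% h;
        return a;
    }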
The bcrypt function intentionally requires quite a lot of CPU cycles
to complete.
In addition to that, performance drops massively whenever its full
state is not resident in the CPU's L1 cache.
These properties slow down brute-force attacks against low-entropy
inputs (typically passwords), and GPU-based attacks gain little to no
advantage over CPUs.
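For a sense of scale, the state that has to stay hot is essentially
Blowfish's key-dependent tables, roughly:

    // bcrypt keeps rewriting Blowfish's key-dependent tables: four 256-entry
    // 32-bit S-boxes (4 KiB) plus an 18-entry P-array. Keeping all of this
    // resident in L1-like memory for every guess is what GPUs struggle to do.
    const ExpensiveState = struct {
        sboxes: [4][256]u32, // 4 KiB
        subkeys: [18]u32,
    };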