This is a reimplementation of Io.Threaded that fixes the issues
highlighted in the recent Zulip discussion. It's poorly tested but it
does successfully run to completion the litmust test example that I
offered in the discussion.
This implementation has the following key design decisions:
- `t.cpu_count` is used as the threadpool size.
- `t.concurrency_limit` is used as the maximum number of
"burst, one-shot" threads that can be spawned by `io.concurrent` past
`t.cpu_count`.
- `t.available_thread_count` is the number of threads in the pool that
is not currently busy with work (the bookkeeping happens in the worker
function).
- `t.one_shot_thread_count` is the number of active threads that were
spawned by `io.concurrent` past `t.cpu_count`.
In this implementation:
- `io.async` first tries to decrement `t.available_thread_count`. If
there are no threads available, it tries to spawn a new one if possible,
otherwise it runs the task immediately.
- `io.concurrent` first tries to use a thread in the pool same as
`io.async`, but on failure (no available threads and pool size limit
reached) it tries to spawn a new one-shot thread. One shot threads
run a different main function that just executes one task, decrements
the number of active one shot threads, and then exits.
A relevant future improvement is to have one-shot threads stay on for a
few seconds (and potentially pick up a new task) to amortize spawning
costs.
It was not obvious that the KT128/KT256 customization string can be
used to set a key, or what it was designed to be used for at all.
Also properly use key_length and not digest_length for the BLAKE3
key length (no practical changes as they are both 32, but that was
confusing).
Remove unneeded simd_degree copies by the way, and that doesn't need
to be in the public interface.
68d2f68ed introduced special handling for StructInit fields
containing multiline strings to prevent inserting whitespace after =.
However, this logic didn't handle cases without a trailing comma,
which resulted in unwanted trailing whitespace.
* threaded K12: separate context computation from thread spawning
Compute all contexts and store them in a pre-allocated array,
then spawn threads using the pre-computed contexts.
This ensures each context is fully materialized in memory with the
correct values before any thread tries to access it.
* kt128: unroll the permutation rounds only twice
This appears to deliver the best performance thanks to improved cache
utilization, and it’s consistent with what we already do for SHA3.
KT128 and KT256 are fast, secure cryptographic hash functions based on Keccak (SHA-3).
They can be seen as the modern version of SHA-3, and evolution of SHAKE, with better performance.
After the SHA-3 competition, the Keccak team proposed these variants in 2016, and the constructions underwent 8 years of public scrutiny before being standardized in October 2025 as RFC 9861.
They uses a tree-hashing mode on top of TurboSHAKE, providing both high security and excellent performance, especially on large inputs.
They support arbitrary-length output and optional customization strings.
Hashing of very large inputs can be done using multiple threads, for high throughput.
KT128 provides 128-bit security strength, equivalent to AES-128 and SHAKE128, which is sufficient for virtually all applications.
KT256 provides 256-bit security strength, equivalent to SHA-512. For virtually all applications, KT128 is enough (equivalent to SHA-256 or BLAKE3).
For small inputs, TurboSHAKE128 and TurboSHAKE256 (which KT128 and KT256 are based on) can be used instead as they have less overhead.
Also remove the example implementation from the file doc comment; it's
better to just link to `defaultLog` as an example, since this avoids
writing the example implementation twice and prevents the example from
bitrotting.
`std.Io.tty.Config.detect` may be an expensive check (e.g. involving
syscalls), and doing it every time we need to print isn't really
necessary; under normal usage, we can compute the value once and cache
it for the whole program's execution. Since anyone outputting to stderr
may reasonably want this information (in fact they are very likely to),
it makes sense to cache it and return it from `lockStderrWriter`. Call
sites who do not need it will experience no significant overhead, and
can just ignore the TTY config with a `const w, _` destructure.