* threaded K12: separate context computation from thread spawning
Compute all contexts and store them in a pre-allocated array,
then spawn threads using the pre-computed contexts.
This ensures each context is fully materialized in memory with the
correct values before any thread tries to access it.
* kt128: unroll the permutation rounds only twice
This appears to deliver the best performance thanks to improved cache
utilization, and it’s consistent with what we already do for SHA3.
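
The two-phase pattern described in the threaded K12 entry can be sketched
roughly as follows. This is a minimal illustration using POSIX threads, not
the project's actual code: `worker_ctx`, `worker_main`, `process_parallel`,
and `NUM_WORKERS` are hypothetical names, and a byte sum stands in for the
real per-leaf hashing.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_WORKERS 4u  /* hypothetical fixed worker count */

/* Hypothetical per-thread context: one slice of the input per worker. */
typedef struct {
    const uint8_t *data;  /* start of this worker's slice */
    size_t len;           /* slice length in bytes */
    uint64_t result;      /* written by the worker, read after join */
} worker_ctx;

/* Stand-in for the real work: sums the bytes of the slice. */
static void *worker_main(void *arg) {
    worker_ctx *ctx = arg;
    uint64_t acc = 0;
    for (size_t i = 0; i < ctx->len; i++) {
        acc += ctx->data[i];
    }
    ctx->result = acc;
    return NULL;
}

/* Returns 0 on success, -1 if a thread could not be created. */
static int process_parallel(const uint8_t *data, size_t len, uint64_t *total) {
    worker_ctx ctxs[NUM_WORKERS];    /* pre-allocated context array */
    pthread_t threads[NUM_WORKERS];
    size_t chunk = len / NUM_WORKERS;
    size_t spawned = 0;
    int rc = 0;

    assert(total != NULL);

    /* Phase 1: compute every context before any thread exists. */
    for (size_t i = 0; i < NUM_WORKERS; i++) {
        ctxs[i].data = data + i * chunk;
        /* The last worker also takes the remainder. */
        ctxs[i].len = (i == NUM_WORKERS - 1) ? len - i * chunk : chunk;
        ctxs[i].result = 0;
    }

    /* Phase 2: spawn threads, each pointing at an already complete context. */
    for (; spawned < NUM_WORKERS; spawned++) {
        if (pthread_create(&threads[spawned], NULL, worker_main,
                           &ctxs[spawned]) != 0) {
            rc = -1;
            break;
        }
    }

    /* Join only the threads that were actually started. */
    *total = 0;
    for (size_t i = 0; i < spawned; i++) {
        pthread_join(threads[i], NULL);
        *total += ctxs[i].result;
    }
    return rc;
}
```

Keeping the two loops separate means no thread can observe a context that is
still being filled in, and the context array stays alive until every spawned
thread has been joined.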