* std.crypto.chacha: support larger vectors on AVX2 and AVX512 targets
Ryzen 7 7700, ChaCha20/8 stream, long outputs:
Generic: 3268 MiB/s
AVX2   : 6023 MiB/s
AVX512 : 8086 MiB/s
Slightly increase the rand.chacha buffer to take advantage of this.
Going beyond 8 blocks doesn't seem to make any measurable difference.
ChaChaPoly also gets a small performance boost from this, although
Poly1305 remains the bottleneck.
Generic:  707 MiB/s
AVX2   :  981 MiB/s
AVX512 : 1202 MiB/s
aarch64 appears to generally benefit from 4-way vectorization.
Verified on Apple Silicon and on a Cortex-A72.
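
For illustration only, a minimal sketch of how the number of blocks
processed in parallel could be picked at compile time from the target's
CPU features. The name parallel_blocks and the exact widths per target
are assumptions made for this example, not necessarily what
std.crypto.chacha uses. The reasoning is that one ChaCha row is 4 x u32
(128 bits), so a 256-bit vector can hold the same row of 2 blocks and a
512-bit vector the same row of 4 blocks.

```zig
const std = @import("std");
const builtin = @import("builtin");

// Hypothetical constant: how many ChaCha blocks are computed side by side.
// The mapping below is an assumption for illustration.
const parallel_blocks: usize = switch (builtin.cpu.arch) {
    .x86_64 => blk: {
        const has = std.Target.x86.featureSetHas;
        // 512-bit vectors: one row (4 x u32) of 4 blocks per vector.
        if (has(builtin.cpu.features, .avx512f)) break :blk 4;
        // 256-bit vectors: one row of 2 blocks per vector.
        if (has(builtin.cpu.features, .avx2)) break :blk 2;
        // 128-bit vectors only: a single block at a time.
        break :blk 1;
    },
    // 4-way vectorization, as reported above for aarch64.
    .aarch64 => 4,
    else => 1,
};

// Vector type holding the same 4-word row of every parallel block.
const Row = @Vector(4 * parallel_blocks, u32);

pub fn main() void {
    std.debug.print("blocks per iteration: {d}, u32 lanes per row: {d}\n", .{
        parallel_blocks,
        4 * parallel_blocks,
    });
}
```

Building this with -mcpu=x86_64_v3 or -mcpu=x86_64_v4 should select the
AVX2 or AVX512 width respectively, since the choice depends only on the
features enabled for the compilation target.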