ChaCha implementation

The ChaCha stream cipher, a variant of the successful Salsa20 cipher, uses the rotation constants 16, 12, 8 and 7 in its quarter round (for a complete description of ChaCha, refer to http://cr.yp.to/chacha.html). Two of these constants, 16 and 8, are multiples of 8, so those rotations only move whole bytes within each 32-bit word. On Core 2 and later Intel CPUs, such a rotation can be done in a single instruction using pshufb, with the byte permutation held in a second register. Thus, the main improvement here is to switch from the instruction sequence:

movdqa %xmm15,%xmm6
psrld $16,%xmm15
pslld $16,%xmm6
pxor %xmm6,%xmm15

to

pshufb %xmm6, %xmm15
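
To make the equivalence concrete, here is a minimal C sketch that models pshufb in portable code and checks it against the usual shift-based rotation. It assumes little-endian 32-bit lanes (as on x86) and that the mask register (%xmm6 above) holds a constant byte permutation; the names rotl32, pshufb_model, ROT16_MASK, ROT8_MASK and the helper functions are my own, for illustration only, and are not taken from the amd64-xmm6 source.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-based rotation: models the movdqa/psrld/pslld/pxor sequence per lane. */
static uint32_t rotl32(uint32_t x, unsigned n) {
    assert(n > 0 && n < 32);
    return (x << n) | (x >> (32 - n));
}

/* Portable model of pshufb on one 16-byte register:
   out[i] = in[mask[i] & 15], or 0 if the high bit of mask[i] is set. */
static void pshufb_model(uint8_t out[16], const uint8_t in[16],
                         const uint8_t mask[16]) {
    for (int i = 0; i < 16; i++)
        out[i] = (mask[i] & 0x80) ? 0 : in[mask[i] & 0x0F];
}

/* Illustrative mask constants (assumed, little-endian lanes).
   Rotate left by 16: new bytes 0..3 of a lane come from old bytes 2,3,0,1.
   Rotate left by 8:  new bytes 0..3 of a lane come from old bytes 3,0,1,2. */
static const uint8_t ROT16_MASK[16] = { 2, 3, 0, 1,   6, 7, 4, 5,
                                       10,11, 8, 9,  14,15,12,13};
static const uint8_t ROT8_MASK[16]  = { 3, 0, 1, 2,   7, 4, 5, 6,
                                       11, 8, 9,10,  15,12,13,14};

/* Rotate four 32-bit lanes in place using only the byte shuffle. */
static void rotl_lanes_by_shuffle(uint32_t v[4], const uint8_t mask[16]) {
    uint8_t in[16], out[16];
    for (int i = 0; i < 4; i++)          /* store lanes little-endian */
        for (int j = 0; j < 4; j++)
            in[4 * i + j] = (uint8_t)(v[i] >> (8 * j));
    pshufb_model(out, in, mask);
    for (int i = 0; i < 4; i++)          /* load lanes back */
        v[i] = (uint32_t)out[4 * i]
             | (uint32_t)out[4 * i + 1] << 8
             | (uint32_t)out[4 * i + 2] << 16
             | (uint32_t)out[4 * i + 3] << 24;
}

/* Returns 1 if the shuffle agrees with rotl32(., n) on all four lanes. */
static int shuffle_matches_shift(const uint32_t v[4], const uint8_t mask[16],
                                 unsigned n) {
    uint32_t s[4];
    for (int i = 0; i < 4; i++) s[i] = v[i];
    rotl_lanes_by_shuffle(s, mask);
    for (int i = 0; i < 4; i++)
        if (s[i] != rotl32(v[i], n)) return 0;
    return 1;
}
```

Calling shuffle_matches_shift(v, ROT16_MASK, 16) or shuffle_matches_shift(v, ROT8_MASK, 8) on any four words should return 1. The rotations by 12 and 7 do not fall on byte boundaries, so they still need the shift-and-xor sequence.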

On the original Core 2 (Conroe), pshufb takes 4 uops to complete; Penryn introduced a dedicated shuffle unit that lets it complete in only 1 uop. The Core i7 (Nehalem) has 2 of these shuffle units, allowing 2 pshufb instructions to execute per cycle.

On an Intel E8400 CPU (Penryn), the following timings were obtained using the eSTREAM benchmarking tool for the amd64-xmm6 implementation of ChaCha (20 rounds):

Encrypted 46 blocks of 4096 bytes (under 1 keys, 46 blocks/key)
Total time: 735723 clock ticks (245.24 usec)
Encryption speed (cycles/byte): 3.90
Encryption speed (Mbps): 6146.31

My implementation, running under the same conditions:

Encrypted 50 blocks of 4096 bytes (under 1 keys, 50 blocks/key)
Total time: 650520 clock ticks (216.84 usec)
Encryption speed (cycles/byte): 3.18
Encryption speed (Mbps): 7555.80
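
As a sanity check on the two runs above, cycles/byte is just the total clock ticks divided by the number of bytes encrypted. The short C sketch below recomputes it from the reported tick and block counts; the function names are my own and are not part of the eSTREAM tool.

```c
#include <assert.h>
#include <stdint.h>

/* cycles/byte = clock ticks / (blocks * bytes per block) */
static double cycles_per_byte(uint64_t ticks, uint64_t blocks,
                              uint64_t block_bytes) {
    assert(blocks > 0 && block_bytes > 0);
    return (double)ticks / (double)(blocks * block_bytes);
}

/* Ratio of baseline cost to improved cost; values above 1 mean a speedup. */
static double speedup(double baseline_cpb, double improved_cpb) {
    assert(improved_cpb > 0.0);
    return baseline_cpb / improved_cpb;
}
```

With the figures above, cycles_per_byte(735723, 46, 4096) and cycles_per_byte(650520, 50, 4096) come out at roughly 3.90 and 3.18, which is a speedup of about 1.23x, or a little under 19% fewer cycles per byte.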

The BLAKE SHA-3 candidate uses ChaCha as its underlying core function, so the same technique can be applied to speed BLAKE up as well. The speedup there also appears to be noticeable.

The ChaCha source code can be downloaded here, ready for the eSTREAM benchmarking tool.

The source code for the improved BLAKE can be downloaded here.