powerpc32: optimise csum_partial() loop
author Christophe Leroy <christophe.leroy@c-s.fr>
Tue, 22 Sep 2015 14:34:32 +0000 (16:34 +0200)
committer Scott Wood <oss@buserror.net>
Sat, 5 Mar 2016 05:03:45 +0000 (23:03 -0600)
commit f867d556dd8525fe6ff0d22a34249528e590f994
tree 32ebba9cfc1b00d1f394b480d5cfab443382864e
parent 48821a34b1bdc5d89505cb814b3f7c166940f200

On the 8xx, load latency is 2 cycles and a taken branch also costs
2 cycles. So let's unroll the loop, which leaves room to hide the load
latency and spreads the branch cost over more words.
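
The idea can be illustrated with a minimal C sketch. This is not the
code from the patch, which is PowerPC assembly in
arch/powerpc/lib/checksum_32.S. The function names, the unroll factor
of four and the 64-bit accumulator are assumptions chosen for
illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 64-bit accumulator into a 32-bit ones' complement sum. */
static uint32_t fold64(uint64_t acc)
{
    acc = (acc & 0xffffffffu) + (acc >> 32);
    acc = (acc & 0xffffffffu) + (acc >> 32);
    return (uint32_t)acc;
}

/* Baseline (hypothetical): one load, one add, one taken branch per word. */
static uint32_t sum_words_simple(const uint32_t *p, size_t nwords)
{
    uint64_t acc = 0;

    for (size_t i = 0; i < nwords; i++)
        acc += p[i];
    return fold64(acc);
}

/*
 * Unrolled by four (the factor is an assumption): the four loads are
 * issued before the adds that consume them, and the loop branch is
 * taken once per four words instead of once per word.
 */
static uint32_t sum_words_unrolled(const uint32_t *p, size_t nwords)
{
    uint64_t acc = 0;
    size_t i = 0;

    for (; i + 4 <= nwords; i += 4) {
        uint32_t a = p[i];
        uint32_t b = p[i + 1];
        uint32_t c = p[i + 2];
        uint32_t d = p[i + 3];

        acc += a;
        acc += b;
        acc += c;
        acc += d;
    }
    /* Handle the remaining 0 to 3 words. */
    for (; i < nwords; i++)
        acc += p[i];
    return fold64(acc);
}
```

Both functions return the same value for any input. Only the number of
taken branches and the distance between each load and its use differ.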

This patch improves csum_partial() speed by around 10% on both:
* 8xx (single issue processor with parallel execution)
* 83xx (superscalar 6xx processor with dual instruction fetch
and parallel execution)

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
arch/powerpc/lib/checksum_32.S