* Re-format the test code for MSVC bug 981648.
* Improved generated x86 code for SSE4.1 and later targets.
Prefer movdqu to lddqu on CPUs supporting SSE4.1 and later. lddqu has one
extra cycle of latency on Skylake and later Intel CPUs, and with AVX vlddqu
is not merged into the following instructions as a memory operand, which makes
the code slightly larger. Legacy SSE3 lddqu is still preferred when SSE4.1
is not enabled, because it is faster on Prescott and equivalent to movdqu on
AMD CPUs. Using lddqu there also doesn't affect code size, because movdqu
cannot be folded into a memory operand either: SSE requires memory operands
to be aligned.
Closes https://github.com/boostorg/uuid/issues/137.
* Use movdqu universally for loading UUIDs.
This effectively drops the optimization for NetBurst CPUs and instead
prefers code that is slightly better on Skylake and later Intel CPUs,
even when the code is compiled for SSE3 rather than SSE4.1.