Description
int src1 = kiss99(prog_rnd) % PROGPOW_REGS;
int src2 = kiss99(prog_rnd) % PROGPOW_REGS;
If src1 == src2 and the operation is XOR, the result will be 0. This 0 will most likely spread, because many of the other operations return 0 whenever one of their inputs is 0:
0 * b = 0, a * 0 = 0
mul_hi(0, b) = 0, mul_hi(a, 0) = 0
ROTL32(0, b) = 0
ROTR32(0, b) = 0
0 & b = 0, a & 0 = 0
min(0, b) = 0, min(a, 0) = 0
The fix is to never perform an operation that cancels both arguments out to 0. As far as I can see, XOR is currently the only one. Once zeros do appear, an ASIC can add optimizations for the case when one of the operands is 0.
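As a minimal sketch of the zero-spreading behaviour listed above, the following C snippet asserts each identity for arbitrary inputs. The helper names (rotl32, rotr32, mul_hi, min_u32, check_zero_spreads) are assumptions made for this illustration and only mirror the usual 32-bit semantics of the operations; they are not taken from the ProgPoW source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers with the usual 32-bit unsigned semantics. */
static uint32_t rotl32(uint32_t x, uint32_t n) { n &= 31; return (x << n) | (x >> ((32 - n) & 31)); }
static uint32_t rotr32(uint32_t x, uint32_t n) { n &= 31; return (x >> n) | (x << ((32 - n) & 31)); }
static uint32_t mul_hi(uint32_t a, uint32_t b) { return (uint32_t)(((uint64_t)a * b) >> 32); }
static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/* XORs register value r with itself, then checks that the resulting 0
   is absorbed by every operation from the list. Returns the XOR result. */
uint32_t check_zero_spreads(uint32_t r, uint32_t b)
{
    uint32_t z = r ^ r;                 /* src1 == src2 with XOR: always 0 */

    assert(z * b == 0 && b * z == 0);   /* multiplication */
    assert(mul_hi(z, b) == 0 && mul_hi(b, z) == 0);
    assert(rotl32(z, b) == 0);          /* rotating 0 by any amount gives 0 */
    assert(rotr32(z, b) == 0);
    assert((z & b) == 0 && (b & z) == 0);
    assert(min_u32(z, b) == 0 && min_u32(b, z) == 0);
    return z;
}
```

Calling check_zero_spreads with any two values passes all assertions, which is the point: the outcome does not depend on the register contents.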
Moreover, the case when src1 == src2 allows many other optimizations for an ASIC that OpenCL/GPU won't do. It can use a squarer instead of a full multiplier for multiplication (more energy efficient), MIN/AND/OR simply become NOPs, ADD becomes SHL by 1, and CLZ/POPCOUNT become 2 times simpler and more energy efficient. The OpenCL compiler, on the other hand, is not guaranteed to take advantage of this. The compiler will be able to remove MIN/AND/OR from the generated code if src1 == src2, but it's unlikely to do more.
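The same-register identities mentioned above can be written down as a short C sketch. The function names (min_u32, check_same_operand) are hypothetical and exist only for this illustration; the checks cover the simple cases (MIN, AND, OR, ADD, XOR) and say nothing about how real hardware would exploit them.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: unsigned 32-bit minimum. */
static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/* Checks what each operation degenerates to when both operands are
   the same register value a. Returns a unchanged. */
uint32_t check_same_operand(uint32_t a)
{
    assert(min_u32(a, a) == a);                        /* MIN is a no-op */
    assert((a & a) == a);                              /* AND is a no-op */
    assert((a | a) == a);                              /* OR is a no-op */
    assert((uint32_t)(a + a) == (uint32_t)(a << 1));   /* ADD is a shift left by 1 */
    assert((a ^ a) == 0);                              /* XOR is the constant 0 */
    return a;
}
```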
Again, the fix for all this is simple: never do math on the same register twice; always use two different registers:
int src_index = kiss99(prog_rnd) % (PROGPOW_REGS * (PROGPOW_REGS - 1));
int src1 = src_index % PROGPOW_REGS; // 0 <= src1 < PROGPOW_REGS
int src2 = src_index / PROGPOW_REGS; // 0 <= src2 < PROGPOW_REGS - 1
// src2 is not the final index yet
// Example: if we have 5 registers and src1 = 1, src2 = 3
// src1: 0 _1_ 2 3 4
// src2 = 3, but it's an index in the list of remaining registers: 0 2 3 _4_
// so the final index for src2 will be 4 = 3 + 1
if (src2 >= src1) ++src2; // 0 <= src2 < PROGPOW_REGS and src2 != src1
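To check the selection logic above, here is a small self-contained C sketch that runs it over every possible src_index and verifies that each ordered pair of distinct registers comes out exactly once. PROGPOW_REGS is set to 32 purely as an assumption for this example, and pick_distinct and mapping_is_bijective are hypothetical names introduced here.

```c
#include <assert.h>
#include <stdint.h>

#define PROGPOW_REGS 32 /* assumed value for this sketch only */

/* Maps src_index in [0, REGS * (REGS - 1)) to two distinct register
   indices, using the same arithmetic as the proposed fix. */
static void pick_distinct(uint32_t src_index, int *src1, int *src2)
{
    assert(src_index < (uint32_t)(PROGPOW_REGS * (PROGPOW_REGS - 1)));
    *src1 = (int)(src_index % PROGPOW_REGS);
    *src2 = (int)(src_index / PROGPOW_REGS); /* index among the remaining registers */
    if (*src2 >= *src1) ++*src2;             /* skip over src1 */
}

/* Returns 1 if every ordered pair (src1, src2) with src1 != src2 is
   produced by exactly one src_index, 0 otherwise. */
int mapping_is_bijective(void)
{
    unsigned char seen[PROGPOW_REGS][PROGPOW_REGS] = {{0}};
    const uint32_t total = (uint32_t)(PROGPOW_REGS * (PROGPOW_REGS - 1));

    for (uint32_t i = 0; i < total; ++i) {
        int s1, s2;
        pick_distinct(i, &s1, &s2);
        if (s1 == s2 || s2 < 0 || s2 >= PROGPOW_REGS || seen[s1][s2])
            return 0;
        seen[s1][s2] = 1;
    }
    return 1; /* total distinct valid pairs were seen, so none is missing */
}
```

Because the number of inputs equals the number of ordered pairs of distinct registers, seeing no duplicate and no invalid pair means the mapping covers all of them, so a uniform src_index gives a uniform choice of register pair.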