Nice compact implementation. However, I don't see the mask-bit "spacing" from the FastCDC paper implemented, which I think degrades the chunking quality (as explained in the paper).
So instead of a mask like this (4 bits set, spread across the word):
maskS = 0b100100010001000
your code creates this (4 bits set, all contiguous at the low end):
maskS = 0b000000000001111
Some additional math to spread out the one bits would be needed.
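
To illustrate what I mean, here is a minimal sketch in Python. The names `contiguous_mask` and `spread_mask` are made up for this comment, and the even-spacing rule is just one simple way to distribute the bits; it is not the exact set of masks used in the paper. As I understand it, with a Gear-style rolling hash the low bits depend only on the most recent bytes, so pushing mask bits toward higher positions makes the cut decision depend on a larger window.

```python
def contiguous_mask(bits: int) -> int:
    """Mask with `bits` one-bits packed at the low end (what the code does now)."""
    return (1 << bits) - 1


def spread_mask(bits: int, width: int) -> int:
    """Mask with `bits` one-bits spaced roughly evenly over `width` bit positions.

    Assumption for illustration only: each one-bit sits at the top of its own
    equal-sized segment, so the highest bit (width - 1) is always set.
    """
    if not 0 < bits <= width:
        raise ValueError("need 0 < bits <= width")
    mask = 0
    for i in range(bits):
        # Top position of the i-th of `bits` segments covering [0, width).
        pos = ((i + 1) * width) // bits - 1
        mask |= 1 << pos
    return mask


if __name__ == "__main__":
    print(format(contiguous_mask(4), "016b"))  # 0000000000001111
    print(format(spread_mask(4, 16), "016b"))  # 1000100010001000
```

Both masks have the same number of one-bits, so the expected chunk size should stay the same; only the positions of the tested bits change.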