(n * 2-1)% p: avoid using 64 bits when n and p are 32 bits

Consider the following function:

inline unsigned int f(unsigned int n, unsigned int p) 
{
    return (n*2-1)%p;
}

      

Now suppose n

(and p

) is greater than std::numeric_limits<int>::max()

.

For example f(4294967295U, 4294967291U)

.

Mathematical result 7

, but the function will return 2

because it n*2

will overflow.

Then the solution is simple: instead you just need to use a 64-bit integer. Assuming the function declaration must remain the same:

inline unsigned int f(unsigned int n, unsigned int p) 
{
    return (static_cast<unsigned long long int>(n)*2-1)%p;
}

      

Everything is good. At least in principle. The problem is that this function will be called millions of times in my code (I mean the overflow version) and the 64 bit module is slower than the 32 bit version (see here ).

The question is: is there any trick (mathematical or algorithmic) to avoid doing the 64-bit version of the module operation. And what will the new version be f

with this trick? (keeping the same declaration).

  • Note 1: n > 0

  • Note 2: p > 2

  • Note 3: n

    may be lower p

    : n=4294967289U

    ,p=4294967291U

  • Note 4: the less the modulation operation is used, the better (3 32 bits modulo is too large, 2 is interesting, and 1 will definitely outperform)
  • Note 5: of course the result will be processor dependent. Let's assume the latest supercomputers are using the latest available xeon.
+3


source to share


3 answers


Even though I don't like dealing with AT&T syntax and GCC extended asm constraints, I think it works (it worked in my reputedly limited tests)

uint32_t f(uint32_t n, uint32_t p)
{
    uint32_t res;
    asm (
      "xorl %%edx, %%edx\n\t"
      "addl %%eax, %%eax\n\t"
      "adcl %%edx, %%edx\n\t"
      "subl $1, %%eax\n\t"
      "sbbl $0, %%edx\n\t"
      "divl %1"
      : "=d"(res)
      : "S"(p), "a"(n)
      : 
      );
  return res;
}

      



The restrictions can be overly strict or wrong, I don't know. It seemed to work.

The idea here is to do regular 32-bit division, which actually accepts a 64-bit dividend. It only works if the ratio is 32 bits (aka overflow signaling), which is always true under circumstances ( p

at least 2, n

not zero). Material before division processes 2 times (with overflow in edx

, "high half"), then "subtract 1" with potential borrowing. The exit function "=d"

makes it take the result as a result. "a"(n)

puts n

in eax

(letting it select a different register does not help, division will take input in edx:eax

). "S"(p)

maybe "r"(p)

(seems to work), but I'm not sure to trust it.

+1


source


We know that p

less max

, then n % p

less max. They are both unsigned, which means n % p

positive and less p

. The unsigned overflow is well defined, so if it n % p * 2

does p

, we can compute it as n % p - p + n % p

not overflowing, so it looks like this:

unsigned m = n % p;
unsigned r;
if (p - m < m) // m * 2 > p
    r = m - p + m;
else // m * 2 <= p
    r = m * 2;

// subtract 1, account for the fact that r can be 0
if (r == 0) r = p - 1;
else r = r - 1;
return r % p;

      

Note that you can avoid the last module because we know that it r

does not exceed p * 2

(it does not exceed m * 2

, it m

does not exceed p

), so the last line can be rewritten as



return r >= p ? r - p : r

      

Which brings the number of operations of the module to 1.

+3


source


FWIW, this version seems to avoid overflow:

std::uint32_t f(std::uint32_t n, std::uint32_t p) 
{
    auto m = n%p;
    if (m <= p/2) {
        return (m==0)*p+2*m-1;
    }
    return p-2*(p-m)-1;
}

      

Demo . The idea is that if2*m-1

an overflow occurs, we can work withp-2*(p-m)-1

, which avoids this by multiplying2

by modular additive inverse.

+1


source







All Articles