Binary division

From Hamsterworks Wiki!

Jump to: navigation, search

Division is slow. Very slow compared to addition, subtraction or even multiplication. It also consumes a lot of CPU die space and power - so much so that some processor designers decide not to have a divide instruction at all!

If you find a better binary division algorithm than SRT Division (http://cs.uns.edu.ar/~jechaiz/arquitectura/division/SRT.html) and will be a rich man...

PLEASE NOTE - MOST OF THIS IS JUNK - THE __udivsi3() ROUTINE IN LIBGCC IS PRETTY SIMPLE AND FAST -

In a lot of cases it is much faster than radix 4 division, but as it's latency is variable it is unsuited for implementation in an FPGA.

Contents

Division on an FPGA

I've got 16 bit Radix 2 Division up and running on an FPGA - 350MHz with a 16 stage pipleline, 29MHz fully combinatorial.

Faster division

Pretty much everybody is familiar with using right shifts to perform division by powers of two, but can you do better when the divisor isn't a power of two, or isn't known at compile time? Well, it turns out you can - especially on AMD processors!

By exploiting knowledge of the range of numbers involved you can be much quicker.

(As pointed out on hackaday, if you do a lot of division by constants: http://www.hackersdelight.org/divcMore.pdf. Thanks Kukuta!)

Radix 4 division

I'm playing around with radix-4 division to look at implementing it in an FPGA.

Here's unsigned 16 bit division in C, and a test bench main() routine:

#include <stdio.h>
#include <stdlib.h>

/* This code assumes that unsigned short is 16 bits, and unsigned int is 32 bits */

unsigned short radix4_division_u16(unsigned short divisor, unsigned short dividend)
{
  unsigned short quotient = 0;
  unsigned char i = 16;  /* Number of bits to process */

  if(dividend == 0)
  {
    printf("Divide by zero");
    exit(0);
  }

  while(i > 0)
  {
    unsigned char j;
    unsigned int d1,d2,d3;

    i -= 2;

    d1 = dividend << i;
    d2 = dividend << (i+1);
    d3 = d1+d2;

    if(divisor < d1)
      j = 0;
    else if(divisor < d2) {
      j = 1;
      divisor -= d1;
    } else if(divisor < d3) {
      j = 2;
      divisor -= d2;
    } else {
      j = 3;
      divisor -= d3;
    }
    quotient = (quotient << 2)+j;
  }
  return quotient;
}

int main(int c, char *v[])
{
  int i,j;
  for(i = 0; i < 256*256; i++)
  {
    for(j = 1; j < 256*256; j++)
    {
      unsigned k = radix4_division_u16(i,j);
      if(i/j != k)
      {
        printf("Error with %i/%i != %i\n",i,j,k);
        exit(0);
      }
    }
    if(i%650 == 0)
      printf("%2.2f%% tested\n",i*100.0/256/256);
  }
  return 0;
}

Performance on my AMD laptop

Strangely this is surprisingly fast on an AMD Athlon II P320 - or should that be that division is very slow on AMD?

To use the CPU's divide instruction to test the full 16 bit range for both dividend and divisor took 2:51 (two integer divisions * 2^16 * 2^16) = approx 8 billion divides.

With one of the divides replaced to call the Radix 4 code it now took 3:47 - only 58 seconds longer!

This works out to an extra 13.5ns per 16 bit divide (or around 27 clock cycles on a 2.1GHz AMD).

Here's the same test program used for comparison:

#include <stdio.h>
#include <stdlib.h>

/* This code assumes that unsigned short is 16 bits, and unsigned int is 32 bits */

unsigned short radix4_division_u16(unsigned short divisor, unsigned short dividend)
{
  return divisor / dividend;
}

int main(int c, char *v[])
{
  int i,j;
  for(i = 0; i < 256*256; i++)
  {
    for(j = 1; j < 256*256; j++)
    {
      unsigned k = radix4_division_u16(i,j);
      if(i/j != k)
      {
        printf("Error with %i/%i != %i\n",i,j,k);
        exit(0);
      }
    }
    if(i%650 == 0)
      printf("%2.2f%% tested\n",i*100.0/256/256);
  }
  return 0;
}

But now lets turn on optimisaitons. Compiling with -O3 turns the tables - radix4 now takes 2:26 vs a 2:47 - the Radix-4 version was quicker than the CPU's IDIV!?

Interesting. This definitely warrants further investigation...

So what gives?

Initial guess - maybe the divide is using a 64 bit instruction, where as the code above can use the multiple integer cores in parallel. It may also be benefiting from the CPU's branch prediction.

My new test case

To investigate this more I've changed my code to maximise the time spent doing divides - and taking a nod to IC testing methodology I'm just adding up the results and checking that the totals match.

Remember to compile with -O3!

idiv source

#include <stdio.h>
#include <stdlib.h>

int main(int c, char *v[])
{
  int i,j;
  int total=0;
  for(i = 0; i < 256*256; i++)
    for(j = 1; j < 256*256; j++)
      total += i/j;

  printf("Total = %i\n",total);
  return 0;
}

This now takes 1m38.019s

#include <stdio.h>
#include <stdlib.h>

/* This code assumes that unsigned short is 16 bits, and unsigned int is 32 bits */

/* (C) 2011 Mike Field (hamster@snap.net.nz */

static unsigned short radix4_division_u16(unsigned short divisor, unsigned short dividend)
{
  unsigned short quotient = 0;
  unsigned char i = 16;  /* Number of bits to process */

  if(dividend == 0)
  {
    printf("Divide by zero");
    exit(0);
  }

  while(i > 0)
  {
    unsigned char j;
    unsigned int d1,d2,d3;

    i -= 2;

    d1 = dividend << i;
    d2 = dividend << (i+1);
    d3 = d1+d2;

    if(divisor < d1)
      j = 0;
    else if(divisor < d2) {
      j = 1;
      divisor -= d1;
    } else if(divisor < d3) {
      j = 2;
      divisor -= d2;
    } else {
      j = 3;
      divisor -= d3;
    }
    quotient = (quotient << 2)+j;
  }
  return quotient;
}

int main(int c, char *v[])
{
  int i,j;
  int total=0;
  for(i = 0; i < 256*256; i++)
    for(j = 1; j < 256*256; j++)
      total += radix4_division_u16(i,j);

  printf("Total = %i\n",total);
  return 0;
}

The Radix4 version now takes 1m00.364s. Wow! radix4 is 40% faster!

Taking it to the next level

Ok, so 40% faster than an AMD is is pretty good - I can do 66% more divisions in the given time period. Can I actually do any better than that?

Sure can - I've got down to 36 seconds for the same test. My radix4_division_u16() function is now 64% faster - 175% more work in the same time.

This is done by avoiding 4 passes of the loop when the dividend is greater than 255, halving the work for these divides

static unsigned short radix4_division_u16(unsigned short divisor, unsigned short dividend)
{
  unsigned short quotient = 0;
  unsigned char i;

  if(dividend == 0)
  {
    printf("Divide by zero");
    exit(0);
  }
  if( dividend < 256)
  {
    i = 16;  /* Number of bits to process */
    while(i > 8)
    {
      unsigned char j;
      unsigned int d1,d2,d3;

      i -= 2;

      d1 = dividend << i;
      d2 = dividend << (i+1);
      d3 = d1+d2;

      if(divisor < d1)
        j = 0;
      else if(divisor < d2) {
        j = 1;
        divisor -= d1;
      } else if(divisor < d3) {
        j = 2;
        divisor -= d2;
      } else {
        j = 3;
        divisor -= d3;
      }
      quotient = (quotient << 2)+j;
    }
  }

  i = 8;  /* Number of bits to process */
  while(i > 0)
  {
    unsigned char j;
    unsigned int d1,d2,d3;

    i -= 2;

    d1 = dividend << i;
    d2 = dividend << (i+1);
    d3 = d1+d2;

    if(divisor < d1)
      j = 0;
    else if(divisor < d2) {
      j = 1;
      divisor -= d1;
    } else if(divisor < d3) {
      j = 2;
      divisor -= d2;
    } else {
      j = 3;
      divisor -= d3;
    }
    quotient = (quotient << 2)+j;
  }
  return quotient;
}

Performance on Intel

So, AMD can be improved on, what about Intel? Well, on Intel Core i5 things are dramatically different.

[root@mikef-acer ~]# time ./radix4; time ./idiv 
Total = 1599432336

real	0m26.835s
user	0m26.696s
sys	0m0.014s
Total = 1599432336

real	0m16.249s
user	0m16.170s
sys	0m0.004s

My test that uses "idiv" takes only 16.2 seconds, the radix4 version takes 26 seconds. A bit of a loss there.

Maybe it is just branch prediction?

Changing the order of the index variables changes the effectiveness of the branch prediction - for example, when dividing by 2^16-1 the same path is followed in all but one case.

On the AMD P320:

[hamster@linuxdev radix4]$ time ./radix4; time ./idiv
Total = 1599432336

real 0m34.167s
user 0m34.111s
sys 0m0.016s
Total = 1599432336

real 1m28.819s
user 1m28.600s
sys 0m0.018s

On the Intel:

[root@mikef-acer ~]# time ./radix4; time ./idiv
Total = 1599432336

real	0m18.520s
user	0m18.363s
sys	0m0.022s
Total = 1599432336

real	0m16.581s
user	0m16.453s
sys	0m0.020s
 

It is closer for Intel, but the AMD is really really slow,

What about modulo?

When we only need the remainder (aka modulo) the loop gets tighter - the 'quotient' and 'j' variables are no longer needed:

static unsigned short radix4_modulo_u16(unsigned short divisor, unsigned short dividend)
{
  unsigned char i;

  if(dividend == 0)
  {
    printf("Divide by zero");
    exit(0);
  }
  i = 16;  /* Number of bits to process */
  while(i > 0)
  {
    unsigned int d1,d2,d3;
    i -= 2;
    d1 = dividend << i;
    d2 = dividend << (i+1);
    d3 = d1+d2;

    if(divisor < d1)
      ;
    else if(divisor < d2)
      divisor -= d1;
    else if(divisor < d3) 
      divisor -= d2;
    else
      divisor -= d3;
  }
  return divisor;
}

Although the code has changed quote a bit the timing is pretty much the same:

[hamster@linuxdev radix4]$ time ./radix4m; time ./idivm
Total = 788240730

real 0m33.396s
user 0m33.347s
sys 0m0.009s
Total = 788240730

real 1m27.861s
user 1m27.773s
sys 0m0.010s

A few ideas on Radix 4 division on an FPGA

  • d1,d2,d3 can be calculated each time, or can be put into shift-by-two registers. Using shift registers will greatly reduce the critical path between pipeline stages and allowing higher clock rates.
  • for a divisor of n bits d1,d2,d3 need to be either 2n-2 bits long (or at least n+2 bits long if '1's are not shifted out bits n and n+1).
  • this method allows you to generate 2 bits of the result per iteration (which will be equivalent to the FPGA clock cycle time).
  • the radix of 4 is perfect for implimentation in the 4 input lookup tables in an FPGA.
  • two of these stages could be put back to back to achieve 4 bits per iteration, but at only half the clock speed.
  • the comparison and subtraction could be the same (using the "borrow bit" as the result of the comparison). A single stage would be an adder for calculating (2d+d), three N bit subtractors, a priority encoder and a n-bit 4-to-1 mux to select the output.
  • The remainder is accessible at the end of the process.

Back to Main Page

Personal tools