How can I speed up my math operations in VHDL?

I have some calculations currently running on the rising edge of a 75 MHz clock to output 720p video. Some of the math (a few modulus operations, for example) takes too long (20+ ns, whereas the 75 MHz period is 13.3 ns), so my timing constraints are not being met. I am new to FPGAs, and I am wondering whether there is, for example, a way to run the computations at a faster rate than the current pixel clock so that they finish by the next 75 MHz tick. By the way, I am using VHDL.

+3




3 answers


Here are some methods:

  • Pipelining: split the logic so that it works across multiple clock cycles.
  • Multi-cycle paths: if you don't need an answer every cycle, you can tell the tools that it is OK for the path to take longer. Take care not to tell the tools the wrong thing, though!
  • Think again: for example, do you really need to do x mod 3 on a very wide x, or can you use a continuously updated modulo-3 counter?
  • Use better tools. I've had cases where an expensive synthesizer met timing across a deep logic path, while the vendor synthesizer failed to meet timing for the same code.


More extreme solutions involve changing the silicon: a faster device, a newer device, or a newer, faster device.

+9




75 MHz is already pretty slow by FPGA standards today.

The problem is the modulus operation, which effectively involves a division, and division is slow.

Think about which operations you actually need and whether there is any way to restructure the calculation. If you are clocking pixels, you are not dealing with 32-bit integers; limited value ranges are easier to handle.

Martin hinted at one option: strength reduction. If you have 1280 pixels per line and need to operate on every third one, you don't need to compute a mod 3 of the pixel position! Count 0, 1, 2, 0, ... instead.
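As a rough illustration of the counting approach, here is a minimal sketch of a modulo-3 counter that follows the pixel column. The entity name `mod3_counter` and the ports `line_start`, `pixel_en` and `phase` are assumptions made for this example; they are not from the original design.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: tracks (pixel column mod 3) without any division.
entity mod3_counter is
    port (
        clk        : in  std_logic;
        line_start : in  std_logic;  -- assumed: high for one clock before column 0
        pixel_en   : in  std_logic;  -- assumed: high whenever the column advances
        phase      : out natural range 0 to 2
    );
end entity;

architecture rtl of mod3_counter is
    signal count : natural range 0 to 2 := 0;
begin
    process (clk)
    begin
        if rising_edge(clk) then
            if line_start = '1' then
                count <= 0;              -- restart at the beginning of each line
            elsif pixel_en = '1' then
                if count = 2 then
                    count <= 0;          -- wrap: 0, 1, 2, 0, ...
                else
                    count <= count + 1;
                end if;
            end if;
        end if;
    end process;

    phase <= count;
end architecture;
```

The logic between registers is a two-bit compare and increment, so it should be nowhere near the 13.3 ns limit.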

Another option: if you need the modulo 3 of an 8-bit (or 12-bit) number, you can store all possible results in a lookup table, which is fast enough.
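A lookup table of this kind could be sketched as follows for an 8-bit input. The entity name `mod3_lut` and its ports are hypothetical; the `mod` inside `init_table` is evaluated once at elaboration to fill the constant, so no divider should end up in the hardware.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: x mod 3 for an 8-bit x, read from a 256-entry table.
entity mod3_lut is
    port (
        clk : in  std_logic;
        x   : in  unsigned(7 downto 0);
        r   : out unsigned(1 downto 0)  -- x mod 3, one clock later
    );
end entity;

architecture rtl of mod3_lut is
    type table_t is array (0 to 255) of unsigned(1 downto 0);

    -- Fills the table at elaboration time.
    function init_table return table_t is
        variable t : table_t;
    begin
        for i in table_t'range loop
            t(i) := to_unsigned(i mod 3, 2);
        end loop;
        return t;
    end function;

    constant TABLE : table_t := init_table;
begin
    process (clk)
    begin
        if rising_edge(clk) then
            r <= TABLE(to_integer(x));  -- registered table read
        end if;
    end process;
end architecture;
```

For a 12-bit input the same pattern applies with a 4096-entry table.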

Or sometimes you can multiply by 1/3 (X"5555") instead of dividing by 3, then multiply the quotient by 3 (which costs one addition) and subtract it from the input to get the modulus. This pipelines really well, but since X"5555" is only an approximation of 1/3, you need to check in simulation that it gives the correct output for every input. (For 16-bit inputs this is not a big simulation!) Extending this to modulo 9 is easy.
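The multiply-by-reciprocal idea might look like the sketch below for a 16-bit input. Note one deliberate difference from the text: it assumes the rounded-up constant X"AAAB" (43691, roughly 2**17 / 3) with a shift by 17 instead of the truncated X"5555". Because 3 * 43691 = 2**17 + 1, the quotient error stays below 1/3 for every 16-bit input, but exhaustive simulation is still worth running as suggested above. The entity name `mod3_recip` and all signal names are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: x mod 3 via multiplication by a scaled reciprocal.
entity mod3_recip is
    port (
        clk : in  std_logic;
        x   : in  unsigned(15 downto 0);
        r   : out unsigned(1 downto 0)  -- x mod 3, three clocks later
    );
end entity;

architecture rtl of mod3_recip is
    -- Assumption: 1/3 rounded up and scaled by 2**17, not X"5555".
    constant RECIP : unsigned(15 downto 0) := X"AAAB";

    signal prod       : unsigned(31 downto 0);
    signal q, q3      : unsigned(15 downto 0);
    signal x_d1, x_d2 : unsigned(15 downto 0);
begin
    -- Quotient x / 3 is the product shifted right by 17 bits (wiring only).
    q <= resize(prod(31 downto 17), 16);

    process (clk)
    begin
        if rising_edge(clk) then
            -- cycle 1: multiply by the scaled reciprocal, keep a copy of x
            prod <= x * RECIP;
            x_d1 <= x;
            -- cycle 2: 3*q as one shift and one addition, copy x again
            q3   <= shift_left(q, 1) + q;
            x_d2 <= x_d1;
            -- cycle 3: remainder = x - 3*q, which always fits in two bits
            r    <= resize(x_d2 - q3, 2);
        end if;
    end process;
end architecture;
```

Every term used in a stage comes from the stage immediately before it, so a new input can be accepted on every clock.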

EDIT:

Two points arising from your comments: another possibility is to create a 2x clock (150 MHz) using the Spartan's clock generators, which gives you two cycles per pixel. Well-pipelined code should be able to handle 150 MHz without too much trouble.

But not if it is pipelined like this!

PROCESS(Clk)
BEGIN
    if(rising_edge(Clk)) then
        for i in 0 to 2 loop
            case i is
                when 0 => temp1 <= a*data;
                when 1 => temp2 <= temp1*b;
                when 2 => result <= temp2*c;
                when others => null;
            end case;
        end loop;
    end if;
END PROCESS;

      



The first thing to understand is that the loop and the case statement cancel each other out, so this simplifies to:

PROCESS(Clk)
BEGIN
    if rising_edge(Clk) then
        temp1 <= a*data;
        temp2 <= temp1*b;
        result <= temp2*c;
    end if;
END PROCESS;

      

which is buggy! The testbench is also buggy, which hides the problem.

Cycle 1 presents data, a, b and c, and computes temp1 = data * a.
In cycle 2, temp1 is multiplied by the new value of b instead of the correct one!
The same thing happens again with c in cycle 3!

Since the testbench sets the inputs once and leaves them constant, it won't catch the problem!

PROCESS(Clk)
BEGIN
    if rising_edge(Clk) then
        -- cycle 1
        temp1   <= a*data;
        b_copy  <= b;
        c_copy1 <= c;
        -- cycle 2
        temp2   <= temp1*b_copy;
        c_copy2 <= c_copy1;
        -- cycle 3
        result  <= temp2*c_copy2;
    end if;
END PROCESS;

      

I like to comment each cycle; every term I use in a cycle must come from the immediately preceding cycle, either by calculation or from a copy.

At least this works, but it can be reduced to two clock cycles and fewer copy registers, because in this example the four inputs are independent (and I assume no measures are needed to prevent overflow). So:

PROCESS(Clk)
BEGIN
    if rising_edge(Clk) then
        -- cycle 1
        temp1   <= a * data;
        temp2   <= b * c;
        -- cycle 2
        result  <= temp1 * temp2;
    end if;
END PROCESS;
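To catch the kind of bug described above, a testbench has to change the inputs on every clock cycle and compare each result with the inputs that produced it. The following is only a sketch under stated assumptions: the signals are plain integers, the stimulus values are arbitrary, and the two-cycle pipeline is copied inline as the design under test so that the example is self-contained.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity pipeline_tb is
end entity;

architecture sim of pipeline_tb is
    type int_vec is array (natural range <>) of integer;
    -- Hypothetical stimulus: every input changes on every clock cycle.
    constant A_V    : int_vec(0 to 3) := (1, 2, 3, 4);
    constant DATA_V : int_vec(0 to 3) := (5, 6, 7, 8);
    constant B_V    : int_vec(0 to 3) := (2, 3, 5, 7);
    constant C_V    : int_vec(0 to 3) := (11, 13, 17, 19);

    signal clk  : std_logic := '0';
    signal done : boolean   := false;
    signal a, b, c, data        : integer := 0;
    signal temp1, temp2, result : integer := 0;
begin
    -- 10 ns clock that stops once the stimulus has finished.
    clk <= not clk after 5 ns when not done else '0';

    -- Design under test: the two-cycle pipeline.
    dut : process (clk)
    begin
        if rising_edge(clk) then
            temp1  <= a * data;
            temp2  <= b * c;
            result <= temp1 * temp2;
        end if;
    end process;

    stimulus : process
    begin
        -- One extra iteration flushes the last input set through the pipeline.
        for i in 0 to A_V'length loop
            if i < A_V'length then
                a <= A_V(i); data <= DATA_V(i); b <= B_V(i); c <= C_V(i);
            end if;
            wait until rising_edge(clk);
            wait until falling_edge(clk);
            -- After this rising edge, result belongs to input set i-1.
            if i >= 1 then
                assert result = A_V(i-1) * DATA_V(i-1) * B_V(i-1) * C_V(i-1)
                    report "mismatch for input set " & integer'image(i-1)
                    severity failure;
            end if;
        end loop;
        done <= true;
        wait;
    end process;
end architecture;
```

Checking on the falling edge avoids reading `result` in the same delta cycle in which the pipeline updates it. A pipeline that mixes values from different input sets would fail these assertions, because no two consecutive input sets share a value.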

      

+13




Complex mathematical operations in FPGAs are usually pipelined. Pipelining means that you divide an operation into stages. Let's say you have a multiplier that takes too long for your clock speed. You divide the multiplier into three stages, so that it is made up of three separate parts (each with its own clock input) connected one after another. Each of these parts is smaller than the original single-stage multiplier, so each has a shorter delay, and you can therefore clock them faster.

The drawback is latency. Your pipelined system will produce its result with a delay. In the multiplier example above, you have to wait for the input to pass through all three stages before the correct output appears. But this delay is usually very small (depending on your design) and can be ignored.

Here's a good (!) article about it: http://vhdlguru.blogspot.com/2011/01/what-is-pipelining-explanation-with.html EDIT: see Brian's post instead.

Also, vendors typically supply optimized and pipelined versions of math operations as IP cores in their design software. Look for them.

+2

