Was there a P4 model with double-pumped 64-bit operations?

I recall that one of the interesting features of the original P4 microarchitecture was the double-pumped ALU. I think Intel called it something like the Rapid Execution Engine, but it basically meant that each execution unit in the ALU was effectively running at twice the core frequency and could handle two simple ALU operations in one cycle, even if they were dependent.

This feature disappeared at some point (before or at the same time as the P4 itself), but was there ever a 64-bit P4 with the double-pumped ALU? The 64-bit variants of the P4 appeared in 2004, about four years after the original 32-bit version, but it's not clear to me whether the double-speed ALU was gone by then. It seems the width-pipelined approach used to double the speed would be trickier for 64-bit operations, which is what piqued my curiosity.

Since there is still some (admittedly quite old) 64-bit P4 hardware to support, knowing the ALU behavior is useful for optimization.

1 answer


Figure 7 of the original paper (1) on the Intel Pentium 4 Willamette (2) processor shows how the double-pumped ALU (3) works in some detail (at the logic design level).

The figure shows one 32-bit staggered ALU. It confirms that the ALU can perform two fully dependent (both input operands are dependent) simple ALU operations in three fast cycles (where a fast cycle is half a main clock cycle). The result of an operation is available after 2 fast cycles (1 main cycle), but the new flags are available only after the third fast cycle (1.5 main cycles). Note that there are two such ALUs, on ports 0 and 1, and the two ALUs are not staggered relative to each other.
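
To make the timing above concrete, here is a minimal Python sketch of a chain of fully dependent simple ALU operations on one staggered ALU. It assumes, following my reading of the Hinton et al. paper, that each operation is split into a low 16-bit half in one fast cycle, a high 16-bit half in the next, and flags in the one after that. The function names and dictionary fields are hypothetical and exist only for illustration.

```python
def staggered_chain_schedule(n_ops):
    """Completion times, in fast cycles, for a chain of n_ops dependent ops.

    Assumption: op i can start its low half one fast cycle after op i-1
    started its own, because it only needs the previous low half.
    """
    schedule = []
    for i in range(n_ops):
        low = i + 1        # low half of the result is ready
        high = low + 1     # high half (full result) is ready
        flags = high + 1   # flags are ready
        schedule.append({"op": i, "low": low, "high": high, "flags": flags})
    return schedule


def to_main_cycles(fast_cycles):
    """A fast cycle is half a main clock cycle."""
    return fast_cycles / 2


if __name__ == "__main__":
    for row in staggered_chain_schedule(2):
        print(row["op"], to_main_cycles(row["high"]), to_main_cycles(row["flags"]))
```

Under these assumptions the second dependent operation has its full result after three fast cycles, which matches the figure, and each further dependent operation adds one fast cycle (half a main cycle) to the chain.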

That paper was published in 2001. In 2005, Intel published another paper (4) that discusses in detail, at the circuit level, the staggered integer core of the Intel Pentium 4 Prescott (5) processor. It is not clear to me whether the paper is discussing the 64-bit version of Prescott or the 32-bit version. However, it clearly states that the staggered ALUs can only perform addition, boolean operations, shifts, and rotates (the other paper does not specify exactly which operations the staggered ALUs can perform). Another important difference is this statement from the paper:

There are two distinct 32-bit FCLK execution data paths staggered by one clock to implement 64-bit operations.

So it seems that the two fast ALUs on ports 0 and 1 are staggered, providing fast 64-bit integer operations such as addition. Unfortunately, the paper does not include a timing diagram, so it is unclear how this affects the time it takes to complete two dependent 64-bit additions.
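
As a purely functional illustration of what staggering a 64-bit addition across two 32-bit data paths means, the following Python sketch adds the low 32 bits first and feeds the carry into the high 32 bits. It is only a sketch of the arithmetic: it says nothing about the real Prescott timing, which the paper does not give, and the function name is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def staggered_add64(a, b):
    """Add two 64-bit values as two 32-bit slices linked by a carry."""
    # Step 1 (first data path): add the low 32-bit halves.
    low_sum = (a & MASK32) + (b & MASK32)
    carry = low_sum >> 32
    # Step 2 (second data path, assumed to run one clock later):
    # add the high 32-bit halves plus the carry from step 1.
    high_sum = ((a >> 32) & MASK32) + ((b >> 32) & MASK32) + carry
    # Recombine, wrapping around at 64 bits like a hardware register.
    return ((high_sum & MASK32) << 32) | (low_sum & MASK32)


if __name__ == "__main__":
    print(hex(staggered_add64(0xFFFFFFFF, 1)))
```

The point of the split is that a dependent operation only needs the low slice of its input to start, so the high slice can lag behind by one clock without stalling the low slice of the next operation.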

Another (6) paper (7) from Intel confirms that Intel was indeed able to design a double-pumped 64-bit ALU. Quoting from the paper:

In this paper, we describe a single-cycle integer ALU fabricated in 90nm dual-Vt CMOS technology, operating at 4GHz in the 64b mode with a 32b mode latency of 7GHz (measured at 1.3V, 25°C).

Then again, the paper does not mention whether this design is actually used in any Intel processor. Given that it was published in 2004, there is a good chance that one of the 64-bit Pentium 4 processors used the design. Also note that it says nothing about performing fully dependent simple 64-bit integer operations in one main cycle, so this may not be possible on 64-bit Pentium 4 processors.

In 2002, Intel filed a patent on the general design of staggered ALUs. It is general in the sense that it is not tied to any specific ALU operation, number of cycles, or clocks. Interestingly, one of the figures shows a staggered 64-bit ALU design, and that was already in 2002. The patent also discusses some of the challenges of designing staggered ALUs.



The patent record says it was granted and abandoned on the same day in 2006, which is confusing to me. Another, identical patent application was then filed a few months later.

There is another patent, also filed by Intel, in 1998 and granted in 2001, on staggered execution of an instruction: any instruction in general, not just ALU operations. This patent is still active. It contains a lot of discussion of how staggered execution can be useful for 128-bit SIMD instructions.


(1) In case the link goes down, the paper is titled "The Microarchitecture of the Pentium® 4 Processor" and is authored by Glenn Hinton et al.

(2) Also known as the first-generation Pentium 4.

(3) Also known as a staggered ALU.

(4) In case the link goes down, the paper is titled "Low-Voltage Swing Logic Circuits for a Pentium® 4 Processor Integer Core" and is authored by Daniel J. Deleganes et al.

(5) Also known as the third-generation Pentium 4.

(6) In case the link goes down, the paper is titled "A 4GHz 300mW 64b Integer Execution ALU with Dual Supply Voltages in 90nm CMOS" and is authored by Sanu K. Mathew et al.

(7) In case the link goes down, the paper is titled "High-Performance Energy-Efficient Dual-Supply ALU Design" and is authored by Sanu K. Mathew et al.
