Instruction Latency

The latency associated with execution of each instruction is given below.

Instructions	Latency (in clock cycles)
`add`	3
`dp3`	5
`dp4`	5
`dph`	5
`dst`	3
`exp`	4
`flr`	2
`litp`	4
`log`	4
`mad`	4
`max`	2
`min`	2
`mov`	2
`mova`	4
`mul`	3
`nop`	1
`rcp`	4
`rsq`	4
`sge`	2
`slt`	2
`cmp`	4
Other Branch Instructions	1 – 3

The clock cycle counts given in the table above for arithmetic instructions and the cmp instruction are approximate. Actual latency depends on the instructions that come before and after each instruction. Latency can be reduced by placing instructions that use unrelated registers between an instruction and a later instruction that depends on its result.

Example where latency increases:

mul     r0, r1, r2
mad     r1, r0, r4, r5
add     r8, r9, r10
mad     r9, r8, r10, r11
add     r7, c1, r12

	1	2	3	4	5	6	7	8	9	10	11	12	13	14
mul	read	MUL	post	write
mad		STALL		read	MUL	ADD	post	write
add					read	STALL	ADD	post	write
mad							STALL		read	MUL	ADD	post	write
add										read	STALL	ADD	post	write

Example where latency decreases:

mul     r0, r1, r2
add     r8, r9, r10
add     r7, c1, r12
mad     r1, r0, r4, r5
mad     r9, r8, r10, r11

	1	2	3	4	5	6	7	8	9
mul	read	MUL	post	write
add		read	ADD	post	write
add			read	ADD	post	write
mad				read	MUL	ADD	post	write
mad					read	MUL	ADD	post	write
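The difference between the two orderings above can be checked mechanically. The following Python sketch is a simplified model rather than a description of the hardware: it only flags an instruction that reads a register written by the instruction immediately before it, and it ignores arithmetic unit conflicts. The helper names `parse` and `adjacent_dependencies` are hypothetical and exist only for this illustration.

```python
def parse(line):
    """Split 'op dst, src, ...' into (op, dst, [srcs]), dropping component suffixes."""
    op, operands = line.split(None, 1)
    regs = [r.strip().split(".")[0] for r in operands.split(",")]
    return op, regs[0], regs[1:]

def adjacent_dependencies(listing):
    """Return (producer, consumer) index pairs where an instruction reads the
    register written by the instruction immediately before it."""
    program = [parse(line) for line in listing]
    pairs = []
    for i in range(1, len(program)):
        _, prev_dst, _ = program[i - 1]
        _, _, srcs = program[i]
        if prev_dst in srcs:
            pairs.append((i - 1, i))
    return pairs

# The two listings from the examples above.
slow = ["mul r0, r1, r2", "mad r1, r0, r4, r5", "add r8, r9, r10",
        "mad r9, r8, r10, r11", "add r7, c1, r12"]
fast = ["mul r0, r1, r2", "add r8, r9, r10", "add r7, c1, r12",
        "mad r1, r0, r4, r5", "mad r9, r8, r10, r11"]

print(adjacent_dependencies(slow))  # [(0, 1), (2, 3)]
print(adjacent_dependencies(fast))  # []
```

The first ordering contains two back-to-back dependent pairs, which correspond to the two STALL entries before a read in its timing chart. The reordered listing contains none.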

The table above shows a latency of one to three clock cycles for branch instructions. A branch instruction has a latency of 1 if it causes the program counter to change by +1, 2 if it causes the program counter to change by +2, and 3 if it causes the program counter to change by +3.
This is because if the program counter changes by other than +1, the previously read instruction and the instruction scheduled to execute next are canceled and an instruction check is performed again. Note, however, that if the program counter changes by +2, only the instruction scheduled to execute next is canceled; the previously read instruction is unaffected.

Example:

ifb     b0
  add     r0, r1, r2
endif
mul     r0, r1, r2
mul     r1, r2, r3

If b0 is true

	1	2	3	4	5	6	7
ifb	ifb
add		read	ADD	post	write
mul			read	MUL	post	write
mul				read	MUL	post	write

If b0 is false

	1	2	3	4	5	6	7
ifb	ifb	STALL
mul			read	MUL	post	write
mul				read	MUL	post	write
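The branch rule can be written as a small lookup. This is a minimal Python sketch covering only the three cases the text states; the function name `branch_latency` is hypothetical. It assumes, as the timing charts suggest, that endif does not occupy an instruction slot, so a false ifb in the example moves the program counter by +2.

```python
def branch_latency(pc_delta):
    """Latency in clock cycles of a branch that moves the program counter
    by pc_delta. Only the cases stated in the text (+1, +2, +3) are modeled."""
    latencies = {1: 1, 2: 2, 3: 3}
    if pc_delta not in latencies:
        raise ValueError("program counter change not covered by the text")
    return latencies[pc_delta]

# ifb example: when b0 is true, execution continues at add (+1);
# when b0 is false, add is skipped and execution continues at mul (+2).
print(branch_latency(1))  # 1
print(branch_latency(2))  # 2
```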

The result of an instruction that executes later never affects the result of an instruction that executes earlier. This is guaranteed without stalling, regardless of the latencies of the surrounding instructions, because the registers used by the earlier instruction are always read before those used by the later instruction.

For example, if a write is made by a later instruction to a source register used by an earlier instruction, the result of the write by the later instruction will never be used as input to the earlier instruction.

Example 1

exp     r0, r1.x
mov     r1, c0

In the above example, the source register of the earlier instruction, which has high latency, and the destination register of the later instruction, which has low latency, are the same register. (The latency of exp is 4 clock cycles and that of mov is 2 clock cycles.)
When code of this type is executed, the output result of mov is not used as input to exp. (The result of mov therefore does not affect the calculation made by exp.)

In addition, calculated results are guaranteed to be output in the order that instructions execute.

Example 2

exp     r0, r1.x
mov     r0, c0

In the above example, the same register is the destination of both the earlier instruction, which has high latency, and the later instruction, which has low latency. When code of this type is executed, the output of exp is never written after the output of mov.
When this type of register dependency is present, correct operation is guaranteed without stalling because the write by the earlier instruction is canceled when the later instruction is decoded and found to write to the same register. Detection and cancellation are performed separately for each component. In other words, a write by the earlier instruction is canceled only for the components that both instructions write.

	1	2	3	4	5
exp	read	EXP		post	cancel
mov		read	mov	write

Stalling will not occur when consecutively writing to the same register, as in the above example, even when the register components are the same.

Example 3

exp     r0.x, r1.x
mov     r0.y, c0

In the above example, r0 is written to twice in a row, but the components written to (r0.x and r0.y) are different, so there is no overlap. The write to r0.x by exp is therefore not canceled, and execution does not stall.

	1	2	3	4	5
exp	read	EXP		post	write
mov		read	mov	write

Example 4

exp     r0.xyz,  r1.x
mov     r0.xyzw, c0

In the example above, the write to r0.xyz by exp is cancelled. Execution does not stall.

	1	2	3	4	5
exp	read	EXP		post	cancel
mov		read	mov	write
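The per-component cancellation in Examples 2 to 4 amounts to intersecting the two write masks. The Python sketch below illustrates that reading of the text. The function name `canceled_components` is hypothetical, both instructions are assumed to write the same register, and an omitted mask is assumed to mean all four components.

```python
def canceled_components(earlier_mask, later_mask):
    """Components of the earlier write that are canceled when a later
    instruction writes the same register. An empty mask is assumed to
    mean all four components (xyzw)."""
    all_components = "xyzw"
    earlier = set(earlier_mask or all_components)
    later = set(later_mask or all_components)
    return "".join(c for c in all_components if c in earlier and c in later)

print(canceled_components("", ""))          # xyzw (Example 2: whole write canceled)
print(canceled_components("x", "y"))        # prints an empty line (Example 3: nothing canceled)
print(canceled_components("xyz", "xyzw"))   # xyz  (Example 4)
```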

If a low-latency instruction is executed after a high-latency instruction and the two would complete in the same clock cycle, the result of the later instruction is output 1 clock cycle late, because more than one register cannot be written at the same time.

Example:

exp     r0, r1.x
mul     r2, c3, r4

If the above code is executed, the result calculated for r0 and r2 would appear to be output at the same time, but output to r2 will actually be delayed by 1 clock cycle.

	1	2	3	4	5	6
exp	read	EXP		post	write
mul		read	MUL	post	STALL	write
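The one-cycle delay in the chart above can be reproduced with a small model. This Python sketch assumes that instruction i is read in cycle i + 1, that its write would normally land in cycle read + latency (which matches the timing charts), and that no other kind of stall occurs. The function name `write_cycles` is hypothetical.

```python
LATENCY = {"exp": 4, "mul": 3}  # values taken from the latency table

def write_cycles(opcodes):
    """Return the cycle in which each instruction writes its result.
    A write that collides with an earlier write slips to the next free cycle."""
    taken = set()
    cycles = []
    for index, op in enumerate(opcodes):
        cycle = (index + 1) + LATENCY[op]  # read cycle + latency
        while cycle in taken:
            cycle += 1
        taken.add(cycle)
        cycles.append(cycle)
    return cycles

print(write_cycles(["exp", "mul"]))  # [5, 6]
```

Both writes would fall in cycle 5, so the mul write is moved to cycle 6, as in the chart.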

The mad, dp3, dp4, dph, and add instructions contend for access to the arithmetic unit.

When these instructions are executed consecutively and the periods in which they use the arithmetic unit overlap, latency may increase because a later instruction must wait for the earlier instruction to finish using the unit.
The arithmetic unit is used for the first cycle of an add instruction, the second cycle of a mad instruction, and the second and third cycles of a dp3, dp4, or dph instruction. (Cycles are counted from the cycle after the register read; the contended stage is shown as ADD in the timing charts.)

	1	2	3	4	5	6	7	8
dp3	read	MUL	ADD	ADD	post	write
mad		read	MUL	STALL	ADD	post	write
add			read	STALL		ADD	post	write

Note, however, that consecutive execution of the dp3, dp4, and dph instructions is an exception. Stalling due to arithmetic unit conflicts does not occur when multiple dp3 (or dp4 or dph) instructions are called consecutively, or when any combination of dp3, dp4, and dph instructions is called consecutively.

	1	2	3	4	5	6	7	8
dp3	read	MUL	ADD	ADD	post	write
dp4		read	MUL	ADD	ADD	post	write
dph			read	MUL	ADD	ADD	post	write
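Both charts above can be reproduced by tracking which cycles the contended ADD stage is busy. The Python sketch below is a simplified model and not a hardware description. It assumes that instruction i is read in cycle i + 1, that the result is written two cycles after the last ADD cycle (post, then write), and that only arithmetic unit conflicts cause stalls. The names `ADDER_USE`, `conflicts`, and `adder_schedule` are hypothetical.

```python
# (first execution cycle that needs the unit, number of cycles), per the text
ADDER_USE = {"add": (1, 1), "mad": (2, 1), "dp3": (2, 2), "dp4": (2, 2), "dph": (2, 2)}
DOT_PRODUCTS = {"dp3", "dp4", "dph"}

def conflicts(busy, op, start, length):
    """True if op cannot use the unit during [start, start + length)."""
    for cycle in range(start, start + length):
        owner = busy.get(cycle)
        if owner is None:
            continue
        if op in DOT_PRODUCTS and owner in DOT_PRODUCTS:
            continue  # stated exception: dp3/dp4/dph do not stall each other
        return True
    return False

def adder_schedule(opcodes):
    """Return (opcode, first ADD cycle, write cycle) for each instruction."""
    busy = {}
    schedule = []
    for index, op in enumerate(opcodes):
        offset, length = ADDER_USE[op]
        start = (index + 1) + offset  # read cycle + offset
        while conflicts(busy, op, start, length):
            start += 1  # stall until the unit is free
        for cycle in range(start, start + length):
            busy[cycle] = op
        schedule.append((op, start, start + length + 1))
    return schedule

print(adder_schedule(["dp3", "mad", "add"]))
# [('dp3', 3, 6), ('mad', 5, 7), ('add', 6, 8)]
print(adder_schedule(["dp3", "dp4", "dph"]))
# [('dp3', 3, 6), ('dp4', 4, 7), ('dph', 5, 8)]
```

The write cycles 6, 7, and 8 match both charts: in the first sequence mad and add each wait for the unit, while in the second no instruction waits.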

Sometimes execution stalls due to dependency relationships among the instructions being invoked. This problem occurs when the register storing the calculation result of a given instruction is used as a source register by the instruction that immediately follows.

Example:

add     r0, r1, r2
mul     r4, r0, r3

If this type of code is executed, execution will stall because the result output to r0 is being used as a source register by the instruction that immediately follows.

	1	2	3	4	5	6	7
add	read	ADD	post	write
mul		STALL		read	MUL	post	write

Execution will stall if the registers are the same, even if the components differ.

Example:

add     r0.x, r1,   r2
mul     r4,   r0.y, r3

With code of this type, the result output to r0.x by the earlier instruction is not accessed by the next instruction, but execution stalls because r0 itself is being accessed (through the use of r0.y).

	1	2	3	4	5	6	7
add	read	ADD	post	write
mul		STALL		read	MUL	post	write
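The length of a dependency stall can be estimated from the charts. In them, a dependent instruction performs its read in the same cycle that the producing instruction writes its result. Under that assumption, which is inferred from the charts and not stated in the text, the stall is the producer's latency minus the distance in instructions between producer and consumer. The function name `dependency_stall` is hypothetical.

```python
def dependency_stall(producer_latency, distance):
    """Estimated stall cycles for an instruction that reads a register written
    by an instruction `distance` positions earlier (1 = immediately before).
    Assumes the consumer can read in the cycle the producer writes."""
    return max(0, producer_latency - distance)

print(dependency_stall(3, 1))  # 2: add (latency 3) followed immediately by a dependent mul
print(dependency_stall(3, 3))  # 0: two unrelated instructions in between hide the latency
```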

If successive writes are made to the same register, the write made by the first instruction is canceled (see Output Order of Calculation Results for details), and a subsequent instruction that reads that register may still stall.

Example:

dp4     r0.x, r1, r2
mov     r0.x, r1
mul     r4, r0, r3

Here, the write by dp4 is canceled because dp4 and mov both write to the same register component, yet execution of mul still stalls because of the dp4 and mov instructions.
Execution of mul stalls until execution of dp4 completes because, as seen from mul, the latency of dp4 (two instructions earlier) is larger than that of mov (one instruction earlier).

	1	2	3	4	5	6	7	8	9
dp4	read	MUL	ADD	ADD	post	cancel
mov		read	mov	write
mul			STALL			read	MUL	post	write

Calling the mova instruction results in an unconditional stall of 3 clock cycles.

Unlike stalls due to instruction dependencies, this stall occurs unconditionally whenever mova is called. When a mova instruction is followed by an instruction that reads the address register written by that mova instruction, the stall cannot be avoided by placing unrelated instructions (instructions that use registers unrelated to either instruction) between them.

Example:

mova    a0.x, r0
nop
nop
nop
mov     r1, c[a0.x]

Here, a mova instruction is followed by three consecutive nop instructions, in turn followed by a mov instruction that reads the address register that the mova instruction writes to. Execution stalls at the mova instruction whether the nop instructions are included or not.

	1	2	3	4	5	6	7	8	9	10
mova	read	mova
nop		STALL			NOP
nop						NOP
nop							NOP
mov								read	mov	write

2011/12/20: Initial version.
