Instruction Latency

The latency associated with execution of each instruction is given below.

Instruction                  Latency (in clock cycles)
add                          3
dp3                          5
dp4                          5
dph                          5
dst                          3
exp                          4
flr                          2
litp                         4
log                          4
mad                          4
max                          2
min                          2
mov                          2
mova                         4
mul                          3
nop                          1
rcp                          4
rsq                          4
sge                          2
slt                          2
cmp                          4
Other Branch Instructions    1 – 3

Latency of Arithmetic Instructions and the cmp Instruction

The clock cycles given in the table above for arithmetic instructions and the cmp instruction are approximate values. Actual latency depends on the instructions that come before and after each instruction. Latency can be reduced by placing instructions that use unrelated registers between an arithmetic instruction and the instruction that uses its result.


Example where latency increases:

mul     r0, r1, r2
mad     r1, r0, r4, r5
add     r8, r9, r10
mad     r9, r8, r10, r11
add     r7, c1, r12

1 2 3 4 5 6 7 8 9 10 11 12 13 14
mul read MUL post write
mad STALL read MUL ADD post write
add read STALL ADD post write
mad STALL read MUL ADD post write
add read STALL ADD post write


Example where latency decreases:
mul     r0, r1, r2
add     r8, r9, r10
add     r7, c1, r12
mad     r1, r0, r4, r5
mad     r9, r8, r10, r11
1 2 3 4 5 6 7 8 9
mul read MUL post write
add read ADD post write
add read ADD post write
mad read MUL ADD post write
mad read MUL ADD post write

Latency of Branch Instructions

The table above shows a latency of one to three clock cycles for branch instructions. A branch instruction has a latency of 1 if it changes the program counter by +1, 2 if it changes the program counter by +2, and 3 if it changes the program counter by +3.
This is because, when the program counter changes by anything other than +1, the previously read instruction and the instruction scheduled to execute next are canceled and an instruction check is performed again. Note, however, that only the instruction scheduled to execute next is canceled if the program counter changes by +2. The previously read instruction is unaffected.

Example:

ifb     b0
  add     r0, r1, r2
endif
mul     r0, r1, r2
mul     r1, r2, r3
If b0 is true

1 2 3 4 5 6 7
ifb ifb
add read ADD post write
mul read MUL post write
mul read MUL post write

If b0 is false

1 2 3 4 5 6 7
ifb ifb STALL
mul read MUL post write
mul read MUL post write
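
The branch rule described above can be written down as a small lookup. This is only a sketch in Python, not part of the shader toolchain: the function name branch_cost and the wording of the returned strings are hypothetical, and only the three documented cases (+1, +2, +3) are modeled.

```python
def branch_cost(pc_delta):
    """Return (latency in clock cycles, list of canceled instructions).

    Only the program counter changes described in the text are handled.
    """
    if pc_delta == 1:
        # Falls through to the next instruction; nothing is canceled.
        return 1, []
    if pc_delta == 2:
        # Only the instruction scheduled to execute next is canceled.
        return 2, ["instruction scheduled to execute next"]
    if pc_delta == 3:
        # Both in-flight instructions are canceled and fetched again.
        return 3, ["previously read instruction",
                   "instruction scheduled to execute next"]
    raise ValueError("only changes of +1, +2, and +3 are documented")


# ifb example: when b0 is true, execution continues with add (+1).
print(branch_cost(1)[0])  # 1
# When b0 is false, the diagram shows ifb followed by a single STALL cycle,
# which this sketch assumes corresponds to a change of +2.
print(branch_cost(2)[0])  # 2
```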

Output Order of Calculation Results

The result of a later instruction never affects the result of an earlier instruction. This is guaranteed, without stalling, regardless of the latency of the instructions that come before and after, because the registers used by the earlier instruction are always read before those used by a later instruction.

For example, if a write is made by a later instruction to a source register used by an earlier instruction, the result of the write by the later instruction will never be used as input to the earlier instruction.

Example 1

exp     r0, r1.x
mov     r1, c0
In the above example, the source register of the earlier instruction, which has high latency, and the destination register of the later instruction, which has low latency, are the same register. (The latency of exp is 4 clock cycles and that of mov is 2 clock cycles.)
When code of this type is executed, the output result of mov is not used as input to exp. (The result of mov therefore does not affect the calculation made by exp.)

In addition, calculated results are guaranteed to be output in the order that instructions execute.


Example 2
exp     r0, r1.x
mov     r0, c0
In the above example, the same register is used as a destination register of both the earlier instruction, which has high latency, and the later instruction, which has low latency. When code of this type is executed, the output of exp, which has high latency, is never performed after the output of mov.
When this type of register dependency is present, correct operation is guaranteed without any stalling because the write by the earlier instruction is canceled at the point the later instruction is decoded. The write of the earlier instruction is canceled when the earlier and later instructions are found to write to the same register. Detection and cancellation are performed separately for each component. In other words, the earlier instruction's write to a component is canceled only if the later instruction writes to that same component of the same register.

1 2 3 4 5
exp read EXP post cancel
mov read mov write


Stalling will not occur when consecutively writing to the same register, as in the above example, even when the register components are the same.

Example 3
exp     r0.x, r1.x
mov     r0.y, c0
In the above example, r0 is written to twice in a row, but the components written to (r0.x and r0.y) are different so there is no overlap. The write to r0.x by exp is therefore not canceled, and execution does not stall.

1 2 3 4 5
exp read EXP post write
mov read mov write


Example 4
exp     r0.xyz,  r1.x
mov     r0.xyzw, c0
In the example above, the write to r0.xyz by exp is cancelled. Execution does not stall.

1 2 3 4 5
exp read EXP post cancel
mov read mov write
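
The per-component cancellation in Examples 2 through 4 amounts to intersecting the two write masks. The following Python sketch illustrates that rule only; the helper name canceled_components is hypothetical, and it assumes that a destination written without a mask (such as r0) covers all four components.

```python
def canceled_components(earlier_mask, later_mask):
    """Components of the earlier write that the later write cancels.

    Both arguments are write masks such as "x" or "xyz" for two consecutive
    writes to the same register. An empty mask is assumed to mean "xyzw".
    """
    earlier = earlier_mask or "xyzw"
    later = later_mask or "xyzw"
    return "".join(c for c in "xyzw" if c in earlier and c in later)


print(canceled_components("", ""))          # Example 2: xyzw
print(canceled_components("x", "y"))        # Example 3: empty, nothing canceled
print(canceled_components("xyz", "xyzw"))   # Example 4: xyz
```

An empty result corresponds to the "write" stage in the diagram for Example 3, and a non-empty result to the "cancel" stage in Examples 2 and 4.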

Stalling Due to Calculation Result Output Timing Conflicts

If a low-latency instruction is executed after a high-latency instruction and the two instructions would complete at the same time, the result of the later instruction is output 1 clock cycle late, because only one register can be written in any given clock cycle.

Example:

exp     r0, r1.x
mul     r2, c3, r4
If the above code is executed, the result calculated for r0 and r2 would appear to be output at the same time, but output to r2 will actually be delayed by 1 clock cycle.

1 2 3 4 5 6
exp read EXP post write
mul read MUL post STALL write
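
The timing in this example can be reproduced with a very small model. The Python sketch below is an illustration under stated assumptions, not a description of the hardware: it assumes one instruction is read per clock cycle starting at cycle 1, that an unstalled result is written at (read cycle + latency from the table), and that a write is pushed back while its cycle is already taken. The names LATENCY and write_cycles are hypothetical.

```python
# Latencies of the two instructions in the example, taken from the table.
LATENCY = {"exp": 4, "mul": 3}


def write_cycles(ops):
    """Return the clock cycle in which each instruction writes its result."""
    taken = set()
    cycles = []
    for index, op in enumerate(ops):
        read_cycle = index + 1              # assumption: one read per cycle
        cycle = read_cycle + LATENCY[op]    # unstalled write cycle
        while cycle in taken:               # only one write per clock cycle
            cycle += 1
        taken.add(cycle)
        cycles.append(cycle)
    return cycles


print(write_cycles(["exp", "mul"]))  # [5, 6]
```

Both instructions would otherwise write in cycle 5, so the mul write moves to cycle 6, which matches the six-cycle diagram above.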

Stalling Due to Arithmetic Unit Conflicts

The mad, dp3, dp4, dph, and add instructions contend for access to the arithmetic unit.

If these instructions are executed consecutively and the times at which they use the arithmetic unit overlap, latency may increase because a later instruction must wait for the earlier instruction to finish using the arithmetic unit.
The arithmetic unit is used for the first cycle of an add instruction, the second cycle of a mad instruction, and the second and third cycles of a dp3, dp4, or dph instruction.

1 2 3 4 5 6 7 8
dp3 read MUL ADD ADD post write
mad read MUL STALL ADD post write
add read STALL ADD post write


Note, however, that consecutive execution of the dp3, dp4, and dph instructions is an exception. Stalling due to arithmetic unit conflicts does not occur when multiple dp3 (or dp4 or dph) instructions are executed consecutively, or when any combination of dp3, dp4, and dph instructions is executed consecutively.

1 2 3 4 5 6 7 8
dp3 read MUL ADD ADD post write
dp4 read MUL ADD ADD post write
dph read MUL ADD ADD post write
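
The contention rule and its dp3/dp4/dph exception can be checked for a pair of back-to-back instructions as follows. This Python sketch is a simplified model: it assumes the cycles listed in the text are counted from the first execution cycle after the read stage, that the second instruction is read exactly one cycle after the first, and that neither instruction is already stalled for another reason. The names ADDER_CYCLES, DOT_PRODUCTS, and contends are hypothetical.

```python
# Execution cycles (counted from 1, after the read stage) in which each
# instruction uses the shared arithmetic unit, as listed in the text.
ADDER_CYCLES = {
    "add": {1},
    "mad": {2},
    "dp3": {2, 3},
    "dp4": {2, 3},
    "dph": {2, 3},
}
DOT_PRODUCTS = {"dp3", "dp4", "dph"}


def contends(first, second):
    """True if `second`, read one cycle after `first`, needs the arithmetic
    unit in a cycle in which `first` is still using it."""
    if first not in ADDER_CYCLES or second not in ADDER_CYCLES:
        return False                      # e.g. mul does not use the unit
    if first in DOT_PRODUCTS and second in DOT_PRODUCTS:
        return False                      # documented exception
    # Express the second instruction's cycles on the first one's timeline.
    shifted = {cycle + 1 for cycle in ADDER_CYCLES[second]}
    return bool(ADDER_CYCLES[first] & shifted)


print(contends("dp3", "mad"))  # True
print(contends("mad", "add"))  # True
print(contends("add", "mad"))  # False
print(contends("dp3", "dp4"))  # False
```

Under these assumptions, placing add before mad avoids the conflict, whereas mad followed by add does not.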

Stalling Due to Instruction Dependencies

Sometimes execution stalls due to dependency relationships among the instructions being invoked. This problem occurs when the register storing the calculation result of a given instruction is used as a source register by the instruction that immediately follows.
Example:

add     r0, r1, r2
mul     r4, r0, r3
If this type of code is executed, execution will stall because the result output to r0 is being used as a source register by the instruction that immediately follows.

1 2 3 4 5 6 7
add read ADD post write
mul STALL read MUL post write

Execution will stall if the registers are the same, even if the components differ.

Example:
add     r0.x, r1,   r2
mul     r4,   r0.y, r3
With code of this type, the result output to r0.x by the earlier instruction is not accessed by the next instruction, but execution stalls because r0 itself is being accessed (through the use of r0.y).

1 2 3 4 5 6 7
add read ADD post write
mul STALL read MUL post write
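
A quick way to spot this kind of stall in a listing is to compare each instruction's source registers with the destination of the instruction immediately before it, ignoring components. The Python sketch below shows one way to do that; the helper names are hypothetical, and the parsing handles only the simple "op dest, src, ..." form used in the examples on this page.

```python
def base_register(operand):
    """Drop a component suffix such as '.x' or '.xyz' and return the name."""
    return operand.split(".")[0].strip()


def parse(line):
    """Split an instruction into (mnemonic, destination, list of sources)."""
    mnemonic, _, rest = line.strip().partition(" ")
    operands = [base_register(o) for o in rest.split(",") if o.strip()]
    dest = operands[0] if operands else None   # e.g. nop has no operands
    return mnemonic, dest, operands[1:]


def stalls_on_previous(prev_line, line):
    """True if `line` reads the register written by `prev_line`.

    Components are deliberately ignored, because the text states that the
    stall occurs whenever the register is the same.
    """
    _, prev_dest, _ = parse(prev_line)
    _, _, sources = parse(line)
    return prev_dest is not None and prev_dest in sources


print(stalls_on_previous("add     r0, r1, r2", "mul     r4, r0, r3"))        # True
print(stalls_on_previous("add     r0.x, r1,   r2", "mul     r4,   r0.y, r3"))  # True
print(stalls_on_previous("mul     r0, r1, r2", "add     r8, r9, r10"))       # False
```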


If successive writes are made to the same register, the write made by the first instruction is canceled (see Output Order of Calculation Results for details), but a subsequent instruction that reads that register may still stall because of the canceled instruction.

Example:
dp4     r0.x, r1, r2
mov     r0.x, r1
mul     r4, r0, r3
Here, the write by dp4 is canceled because dp4 and mov both write to r0.x, yet mul, which reads r0, still stalls because of both instructions.
Seen from mul, the latency of dp4 (two instructions earlier) is larger than that of mov (one instruction earlier), so mul stalls until execution of dp4 completes.

1 2 3 4 5 6 7 8 9
dp4 read MUL ADD ADD post cancel
mov read mov write
mul STALL read MUL post write

Unconditional Stalls

Calling the mova instruction results in an unconditional stall of 3 clock cycles.

Unlike stalls due to instruction dependencies, the stall caused by mova is unconditional. When a mova instruction is followed by an instruction that reads the address register written by that mova instruction, the stall cannot be avoided by placing unrelated instructions (instructions that use registers unrelated to either instruction) between them.

Example:

mova    a0.x, r0
nop
nop
nop
mov     r1, c[a0.x]
Here, a mova instruction is followed by three consecutive nop instructions, in turn followed by a mov instruction that reads the address register that the mova instruction writes to. Execution stalls at the mova instruction whether the nop instructions are included or not.

1 2 3 4 5 6 7 8 9 10
mova read mova
nop STALL NOP
nop NOP
nop NOP
mov read mov write

Revision History

2011/12/20
Initial version.