The latency associated with execution of each instruction is given below.
| Instruction | Latency (in clock cycles) |
|---|---|
| add | 3 |
| dp3 | 5 |
| dp4 | 5 |
| dph | 5 |
| dst | 3 |
| exp | 4 |
| flr | 2 |
| litp | 4 |
| log | 4 |
| mad | 4 |
| max | 2 |
| min | 2 |
| mov | 2 |
| mova | 4 |
| mul | 3 |
| nop | 1 |
| rcp | 4 |
| rsq | 4 |
| sge | 2 |
| slt | 2 |
| cmp | 4 |
| Other branch instructions | 1 – 3 |
The clock cycles given in the table above for arithmetic instructions and the cmp instruction are approximate values. Actual latency depends on the instructions that come before and after each instruction. Latency can be reduced by placing instructions that use unrelated registers between an instruction and the instruction that depends on its result.
Example where latency increases:
mul r0, r1, r2
mad r1, r0, r4, r5
add r8, r9, r10
mad r9, r8, r10, r11
add r7, c1, r12
Example where latency decreases:
mul r0, r1, r2
add r8, r9, r10
add r7, c1, r12
mad r1, r0, r4, r5
mad r9, r8, r10, r11
A latency of one to three clock cycles is shown for branch instructions in the above table. A branch instruction has a latency of 1 if it causes the program counter to change by +1, 2 if it causes the program counter to change by +2, and 3 if it causes the program counter to change by +3.
This is because, if the program counter changes by anything other than +1, the previously read instruction and the instruction scheduled to execute next are canceled and an instruction check is performed again. Note, however, that only the instruction scheduled to execute next is canceled if the program counter changes by +2. The previously read instruction is unaffected.
Example:
ifb b0
add r0, r1, r2
endif
mul r0, r1, r2
mul r1, r2, r3
[Figure: If b0 is true]
[Figure: If b0 is false]
The result of a later instruction never affects the result of an earlier instruction. This is guaranteed without stalling, regardless of the latency of the instructions that come before and after, because the registers used by an earlier instruction are always read before those used by a later instruction.
For example, if a later instruction writes to a source register used by an earlier instruction, the value written by the later instruction is never used as input to the earlier instruction.
Example 1
exp r0, r1.x
mov r1, c0
In the above example, the source register of the earlier instruction, which has high latency, and the destination register of the later instruction, which has low latency, are the same register. (The latency of exp is 4 clock cycles and that of mov is 2 clock cycles.)
When code of this type is executed, the output result of mov is not used as input to exp. (The result of mov therefore does not affect the calculation made by exp.)
In addition, calculated results are guaranteed to be output in the order that instructions execute.
Example 2
exp r0, r1.x
mov r0, c0
In the above example, the same register is used as the destination register of both the earlier instruction, which has high latency, and the later instruction, which has low latency. When code of this type is executed, the result of exp, which has high latency, is never written after the result of mov.
When this type of register dependency is present, correct operation is guaranteed without stalling because the write by the earlier instruction is canceled at the point the later instruction is decoded. The write of the earlier instruction is canceled when the earlier and later instructions are found to write to the same register. Detection and cancellation are performed separately for each component. In other words, a component write by the earlier instruction is canceled only if the later instruction writes to the same component of the same register.
Stalling does not occur when consecutive instructions write to the same register, as in the above example, even when the register components are the same.
Example 3
exp r0.x, r1.x
mov r0.y, c0
In the above example, r0 is written to twice in a row, but the components written to (r0.x and r0.y) are different, so there is no overlap. The write to r0.x by exp is not canceled, and execution does not stall.
Example 4
exp r0.xyz, r1.x
mov r0.xyzw, c0
In the example above, the write to r0.xyz by exp is canceled. Execution does not stall.
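The per-component cancellation rule can be modeled in a few lines of Python. This is only an illustrative sketch, not a description of the hardware implementation: the function names `split_destination` and `canceled_components` are hypothetical, and it assumes that a destination written without a component suffix (such as r0) covers all of x, y, z, and w, and that only components written by both instructions are canceled.

```python
def split_destination(dst):
    """Return (register, components). Assumption: a bare register means xyzw."""
    register, _, mask = dst.partition(".")
    return register, set(mask or "xyzw")


def canceled_components(earlier_dst, later_dst):
    """Components of the earlier write that the later write cancels.

    Assumption: cancellation is decided per component, so only components
    written by BOTH instructions to the SAME register are canceled.
    """
    earlier_reg, earlier_mask = split_destination(earlier_dst)
    later_reg, later_mask = split_destination(later_dst)
    if earlier_reg != later_reg:
        return ""  # different registers: nothing is canceled
    return "".join(c for c in "xyzw" if c in earlier_mask and c in later_mask)


# Destination pairs modeled on the examples in this section.
print(repr(canceled_components("r0", "r0")))           # both write all of r0
print(repr(canceled_components("r0.x", "r0.y")))       # no shared component
print(repr(canceled_components("r0.xyz", "r0.xyzw")))  # xyz is shared
```

Under these assumptions, a full overlap cancels every component of the earlier write, a partial overlap cancels only the shared components, and non-overlapping writes cancel nothing.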
More than one register cannot be written to at the same time. Therefore, if a low-latency instruction is executed after a high-latency instruction and the two instructions would complete at the same time, the calculation result of the later instruction is output 1 clock cycle late.
Example:
exp r0, r1.x
mul r2, c3, r4
If the above code is executed, the result calculated for r0 and r2 would appear to be output at the same time, but output to r2 will actually be delayed by 1 clock cycle.
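The timing in this example can be worked through with a small Python sketch. It is a simplified model under stated assumptions, not a cycle-accurate simulator: the name `later_output_delay` is hypothetical, the later instruction is assumed to start exactly one clock cycle after the earlier one, and no other stall is assumed to occur. The latencies are copied from the latency table.

```python
# Latencies in clock cycles, copied from the latency table (subset only).
LATENCY = {"add": 3, "exp": 4, "mad": 4, "mov": 2, "mul": 3}


def later_output_delay(earlier_op, later_op):
    """Extra cycles (0 or 1) before the later instruction's result is output.

    Assumptions: the earlier instruction starts at cycle 0, the later one
    at cycle 1, and nothing else stalls either instruction.
    """
    earlier_done = 0 + LATENCY[earlier_op]
    later_done = 1 + LATENCY[later_op]
    # Only one register can be written per cycle, so a tie delays the later one.
    return 1 if later_done == earlier_done else 0


print(later_output_delay("exp", "mul"))  # both would finish at cycle 4
print(later_output_delay("exp", "mov"))  # mov would finish at cycle 3
```

In this model exp followed by mul yields a delay of 1 because both would finish at cycle 4, while exp followed by mov yields no delay.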
The mad, dp3, dp4, dph, and add instructions contend for access to the arithmetic unit.
When these instructions are executed consecutively and the periods during which they use the arithmetic unit overlap, latency may increase because instructions that execute later must wait for earlier instructions to complete execution.
The arithmetic unit is used for the first cycle of an add instruction, the second cycle of a mad instruction, and the second and third cycles of a dp3, dp4, or dph instruction.
Note, however, that consecutive execution of the dp3, dp4, and dph instructions is an exception. Stalling due to arithmetic unit conflicts does not occur when multiple dp3 instructions (or dp4 or dph instructions) are called consecutively, or when some combination of dp3, dp4, and dph is called consecutively.
Sometimes execution stalls due to dependency relationships among the instructions being invoked. This problem occurs when the register storing the calculation result of a given instruction is used as a source register by the instruction that immediately follows.
Example:
add r0, r1, r2
mul r4, r0, r3
If this type of code is executed, execution will stall because the result output to r0 is being used as a source register by the instruction that immediately follows.
Execution will stall if the registers are the same, even if the components differ.
Example:
add r0.x, r1, r2
mul r4, r0.y, r3
With code of this type, the result output to r0.x by the earlier instruction is not accessed by the next instruction, but execution stalls because r0 itself is being accessed (through the use of r0.y).
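The rule described above (an instruction stalls when it reads a register that the immediately preceding instruction writes, regardless of component) can be checked mechanically. The following Python sketch is not part of any toolchain: the helper names `parse`, `base_register`, and `adjacent_dependencies` are hypothetical, and it handles only simple operands of the form used in the examples on this page (indexed operands such as c[a0.x] are not handled).

```python
def parse(line):
    """Split 'op dst, src1, src2' into (op, dst or None, [srcs])."""
    op, _, rest = line.strip().partition(" ")
    operands = [o.strip() for o in rest.split(",") if o.strip()]
    dst = operands[0] if operands else None
    return op, dst, operands[1:]


def base_register(operand):
    """Drop the component suffix: 'r0.x' -> 'r0'."""
    return operand.split(".")[0]


def adjacent_dependencies(program):
    """Indices of instructions that read the register written just before."""
    hits = []
    for i in range(1, len(program)):
        _, prev_dst, _ = parse(program[i - 1])
        _, _, srcs = parse(program[i])
        if prev_dst is None:
            continue  # e.g. nop writes nothing
        if base_register(prev_dst) in {base_register(s) for s in srcs}:
            hits.append(i)
    return hits


slow = ["mul r0, r1, r2", "mad r1, r0, r4, r5", "add r8, r9, r10",
        "mad r9, r8, r10, r11", "add r7, c1, r12"]
fast = ["mul r0, r1, r2", "add r8, r9, r10", "add r7, c1, r12",
        "mad r1, r0, r4, r5", "mad r9, r8, r10, r11"]
print(adjacent_dependencies(slow))  # [1, 3]
print(adjacent_dependencies(fast))  # []
```

Applied to the two orderings from the latency examples earlier on this page, the sketch flags both mad instructions in the first ordering and nothing in the second. It looks only one instruction back, which is the case this section describes.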
If successive writes are made to the same register, the write made by the first instruction is canceled (see Output Order of Calculation Results for details), and any subsequent instruction that tries to read the result written by the canceled instruction may stall.
Example:
dp4 r0.x, r1, r2
mov r0.x, r1
mul r4, r0, r3
Here, the write by dp4 will be canceled because dp4 and mov both write to the same register, and execution of mul will stall due to the dp4 and mov instructions.
Execution of mul stalls until execution of dp4 completes because, as seen from mul, the latency of dp4, occurring two instructions before, is larger than that of mov, occurring one instruction before.
Calling the mova instruction results in an unconditional stall of 3 clock cycles.
Unlike stalls due to instruction dependencies, the stall caused by mova is unconditional. When a mova instruction is followed by an instruction that reads the address register written by that mova instruction, the stall cannot be avoided by placing an unrelated instruction (an instruction that uses a register unrelated to either instruction) between them.
Example:
mova a0.x, r0
nop
nop
nop
mov r1, c[a0.x]
Here, a mova instruction is followed by three consecutive nop instructions, in turn followed by a mov instruction that reads the address register that the mova instruction writes to. Execution stalls at the mova instruction whether the nop instructions are included or not.
Revision History
- 2011/12/20
- Initial version.