Hardware Specifications

This section describes the vertex shader processor and the related hardware specifications.

Vertex Shader Processor

The hardware specifications for the vertex shader processor are as follows.

Features

Below are the main features of the vertex shader processor.

Four vertex shader units built into the GPU (of which one is shared with the geometry shader processor).
Operations are conducted using 24-bit, floating-point numbers: 1 sign bit, 7 exponent bits and 16 mantissa bits.
32-bit fixed length instructions.
Support for flow control instructions as well as instructions for indexing.
Support for masking of output components and rearranging (swizzling) of input components.
Cannot read or write any data other than register data.
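
The 24-bit layout listed above (1 sign bit, 7 exponent bits, 16 mantissa bits) can be illustrated with a small conversion sketch. Only the bit widths come from this document. The exponent bias of 63, the implicit leading 1, the treatment of exponent 0 as zero, truncation of extra mantissa bits, and the helper names `decode_f24` and `encode_f24` are all assumptions made for illustration and are not confirmed by the hardware specification.

```python
import math

SIGN_BITS, EXP_BITS, MANT_BITS = 1, 7, 16   # bit widths from the feature list
EXP_BIAS = (1 << (EXP_BITS - 1)) - 1        # 63: ASSUMED IEEE-754-style bias
EXP_MAX = (1 << EXP_BITS) - 1               # 127: largest exponent field value


def decode_f24(bits: int) -> float:
    """Decode a 24-bit pattern into a Python float (hypothetical helper)."""
    sign = (bits >> (EXP_BITS + MANT_BITS)) & 1
    exponent = (bits >> MANT_BITS) & EXP_MAX
    mantissa = bits & ((1 << MANT_BITS) - 1)
    if exponent == 0:
        # ASSUMPTION: an exponent field of 0 encodes zero (no denormals).
        value = 0.0
    else:
        # ASSUMPTION: implicit leading 1, as in IEEE 754 normal numbers.
        value = (1.0 + mantissa / (1 << MANT_BITS)) * 2.0 ** (exponent - EXP_BIAS)
    return -value if sign else value


def encode_f24(value: float) -> int:
    """Encode a finite float, truncating extra mantissa bits (hypothetical helper)."""
    if not math.isfinite(value):
        raise ValueError("only finite values are handled in this sketch")
    sign_bits = (1 if math.copysign(1.0, value) < 0 else 0) << (EXP_BITS + MANT_BITS)
    magnitude = abs(value)
    if magnitude == 0.0:
        return sign_bits
    frac, exp = math.frexp(magnitude)       # magnitude = frac * 2**exp, 0.5 <= frac < 1
    exponent = exp - 1 + EXP_BIAS           # rebias for a 1.xxx significand
    if exponent <= 0:
        return sign_bits                    # ASSUMPTION: underflow flushes to zero
    if exponent > EXP_MAX:
        raise OverflowError("value too large for the assumed 24-bit format")
    mantissa = int((frac * 2.0 - 1.0) * (1 << MANT_BITS))  # truncate, do not round
    return sign_bits | (exponent << MANT_BITS) | mantissa
```

Under these assumptions, `encode_f24(1.0)` yields `0x3F0000`, and values that need more than 16 mantissa bits lose precision when round-tripped.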

Stage Structure

The vertex shader processor has the following stages.

Stage name	Symbol in timetable	Description
Prefetch	p.fetch	Prefetches instruction from program RAM into cache.
Fetch	fetch	Fetches instruction from instruction cache.
Decode	decode	Decodes fetched instruction.
Read	read	Reads data from register. This stage may not exist for some instructions.
Execute	ifb, mov, MUL, MAX, etc.	Executes the instruction. Conducts such tasks as flow control, copying of registers, and computations in the arithmetic units. Some instructions use multiple arithmetic units for computations, so sometimes this stage can take 2 or more clock cycles.
Post-processing	post	Performs post-processing on the instruction. This stage may not exist for some instructions.
Write-back	write	Writes the result of instruction execution to a register.

The instruction timetables in the Assembler Reference show the stages from read through write-back.
Execution latency is less than the number of clock cycles shown in the timetable. Processing does not stall even when the write-back stage of one instruction takes place in the same clock cycle as the read stage of the next instruction.

Arithmetic Unit Structure

The vertex shader processor incorporates the following arithmetic units.

Unit name	Number installed	Clock cycles required for operation	Symbol in timetable	Description
MUL	4	1	MUL	Calculates the product of two values.
ADD	4	1	ADD	Calculates the sum of two values.
RCP / RSQ	2	2	RCP / RSQ	Calculates the reciprocal or the reciprocal square root.
FLOOR	4	1	FLOOR	Calculates the greatest integer less than or equal to the specified value.
LOG	1	2	LOG	Calculates a binary logarithm.
EXP	1	2	EXP	Calculates a power of two.
MAX	4	1	MAX	Selects the larger of two given values.
MIN	4	1	MIN	Selects the smaller of two given values.
SGE	4	1	SGE	Compares two values to determine whether the first is greater than or equal to the second.
SLT	4	1	SLT	Compares two values to determine whether the first is less than the second.
CMP	2	2	CMP	Compares two values.
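
For tools that reason about the instruction timetables, the arithmetic unit table can be held as plain data. The sketch below copies the unit counts and cycle counts from the table above. The `ArithmeticUnit` type and the `multi_cycle_units` helper are hypothetical names, and the sketch makes no claim about how the hardware schedules operations across units.

```python
from typing import Dict, List, NamedTuple


class ArithmeticUnit(NamedTuple):
    count: int    # number of units installed
    cycles: int   # clock cycles required for one operation


# Values copied from the arithmetic unit table in this section.
UNITS: Dict[str, ArithmeticUnit] = {
    "MUL": ArithmeticUnit(4, 1),
    "ADD": ArithmeticUnit(4, 1),
    "RCP/RSQ": ArithmeticUnit(2, 2),
    "FLOOR": ArithmeticUnit(4, 1),
    "LOG": ArithmeticUnit(1, 2),
    "EXP": ArithmeticUnit(1, 2),
    "MAX": ArithmeticUnit(4, 1),
    "MIN": ArithmeticUnit(4, 1),
    "SGE": ArithmeticUnit(4, 1),
    "SLT": ArithmeticUnit(4, 1),
    "CMP": ArithmeticUnit(2, 2),
}


def multi_cycle_units(units: Dict[str, ArithmeticUnit] = UNITS) -> List[str]:
    """Return the names of units that need more than one clock cycle."""
    return sorted(name for name, unit in units.items() if unit.cycles > 1)
```

Calling `multi_cycle_units()` returns the units whose use can lengthen the execute stage: CMP, EXP, LOG, and RCP/RSQ.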

Register Set

The vertex shader processor incorporates the following register set.

Register type	No. of components	Number	R/W	Bit width	Description
Input registers	4	16	R	24	Floating-point registers storing vertex attribute data.
Temporary registers	4	16	RW	24	Reusable floating-point registers for temporarily holding the results of calculations. The content is maintained until overwritten.
Floating-point constant registers	4	96	R	24	Floating-point registers storing constants for operations. Supports specification of register index offset by address register and loop counter register.
Address registers	2	1	RW	8	Integer-type register for specifying the register index offset for the floating-point constant registers.
Boolean register	1	16	R	1	Boolean registers used for conditional branching and jumping. One of these registers (b15) is reserved for use by the geometry shader.
Integer registers	1	4	R	24	Integer registers used for controlling loop instructions.
Loop-counter register	1	1	R	8	Integer-type register that stores the counter value for loop instructions. Can be used for specifying the register index offset for the floating-point constant registers.
Output registers	4	16	W	24	Floating-point registers for storing the vertex attribute data at the completion of processing by the vertex shader processor. The contents in these registers are output to later stages in the graphics pipeline (or to the geometry shader processor).
Status registers	1	2	RW	1	Boolean registers storing the results of comparison (CMP) operations, used for conditional branching.
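
The register index offset mentioned for the floating-point constant registers can be sketched as a simple index calculation. The register count (96) and the 8-bit width come from the table above. Whether the 8-bit offset is signed, and what the hardware does when the resulting index falls outside the register file, are not stated in this document, so the two's complement interpretation, the raised errors, and the function names below are assumptions for illustration only.

```python
NUM_CONST_REGISTERS = 96   # floating-point constant registers (from the table)
OFFSET_BITS = 8            # bit width of the address and loop-counter registers


def to_signed8(raw: int) -> int:
    """Interpret an 8-bit pattern as two's complement (ASSUMED encoding)."""
    return raw - (1 << OFFSET_BITS) if raw & 0x80 else raw


def resolve_constant_index(base: int, raw_offset: int = 0, signed: bool = True) -> int:
    """Return the constant register index selected by base + offset.

    base:       register index written in the instruction.
    raw_offset: 8-bit content of an address or loop-counter register.
    signed:     ASSUMPTION switch; the document does not state signedness.
    """
    if not 0 <= raw_offset < (1 << OFFSET_BITS):
        raise ValueError("offset does not fit in 8 bits")
    offset = to_signed8(raw_offset) if signed else raw_offset
    index = base + offset
    if not 0 <= index < NUM_CONST_REGISTERS:
        # ASSUMPTION: treated as an error here; real hardware behavior is unknown.
        raise IndexError(f"constant register index {index} out of range")
    return index
```

For example, `resolve_constant_index(10, 5)` selects register 15, which is how a loop over an array of constants (such as a matrix palette) could step through consecutive registers.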

Post-Vertex Cache

The post-vertex cache stores the results of vertex data processing by the vertex shader processor.

When a vertex buffer and vertex indices are used to input vertex data, the processing result of the vertex shader processor is stored in the cache, and this cached result is output when the same vertex index is input again. Cache hits boost performance because processing by the vertex shader processor is skipped.

The cache holds 32 entries. On a cache miss, the least recently referenced entry is evicted from the cache.
Cache hits are more likely when 32 or fewer sets of vertex data are referenced repeatedly. Depending on how the vertex indices have been created, rendering using TRIANGLES is more efficient in some cases than using TRIANGLE_STRIP.
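
The effect of index ordering on the cache can be estimated offline with a small simulation. The sketch below models the cache as 32 entries with least-recently-referenced eviction, as described above, and counts how many times the vertex shader would have to run for a given index list. It assumes that a hit also refreshes an entry's recency and ignores any other hardware details. The function name `count_shader_runs` is hypothetical.

```python
from collections import OrderedDict
from typing import Iterable

CACHE_ENTRIES = 32  # entry count stated in this section


def count_shader_runs(indices: Iterable[int], entries: int = CACHE_ENTRIES) -> int:
    """Count vertex shader invocations for an index list under an LRU model."""
    cache: "OrderedDict[int, bool]" = OrderedDict()
    runs = 0
    for index in indices:
        if index in cache:
            # Hit: shader is skipped. ASSUMPTION: the hit refreshes recency.
            cache.move_to_end(index)
        else:
            # Miss: the vertex shader must process this vertex.
            runs += 1
            cache[index] = True
            if len(cache) > entries:
                cache.popitem(last=False)  # evict the least recently referenced entry
    return runs
```

Two triangles sharing an edge, indexed as `[0, 1, 2, 2, 1, 3]`, need 4 shader runs instead of 6. Cycling twice over 32 distinct indices needs 32 runs, while cycling twice over 33 distinct indices needs 66 under this model, because each entry is evicted just before it is referenced again.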

Revision History

2011/12/20: Initial version.
