Pinned resources
The compiler may optimize `a=b; c=a;` into `c=b;`, eliminating the intermediate read/write of `a`. But `a` is not a standard memory address but an I/O address, so the access must not be eliminated. We need to indicate this using the `volatile` keyword.

An example of an implementation for the address decoder: here the higher-order bits specify the I/O device, and the lower-order bits specify the address within the device. Reading is also handled by the address decoder, to select which device is being read.
```
0x800 == 0b1000_0000_0000
0x9FF == 0b1001_1111_1111
           ^^^ device address (upper bits constant across the range)
```
Note that we avoid byte-addressable memory, as opposed to word-addressable memory, in the FPGA, because of suboptimal implementation for the former (using an entire FF for the storage).
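A sketch of this decode logic, assuming the 0x800-0x9FF window above (device id in address bits [11:9], register offset in bits [8:0]); the bit widths are illustrative, not taken from any specific device:

```python
# Sketch of an address decoder for a memory-mapped I/O region.
# Assumes a 12-bit address space where bits [11:9] select the device
# and bits [8:0] select the register within the device, matching the
# 0x800-0x9FF example above (device id 0b100).

DEVICE_SHIFT = 9     # lower 9 bits address within the device
DEVICE_MASK = 0b111  # upper 3 bits select the device

def decode(addr):
    """Split a 12-bit I/O address into (device id, offset)."""
    device = (addr >> DEVICE_SHIFT) & DEVICE_MASK
    offset = addr & ((1 << DEVICE_SHIFT) - 1)
    return device, offset
```

Any address in 0x800-0x9FF decodes to device `0b100`, with the low 9 bits passed through as the offset within the device.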
Protocol | Serial | Synchronous | Bus | Master-slave | Full-duplex | In-band |
---|---|---|---|---|---|---|
I2C | Y | Y | Y | Y | N | Y |
SPI | Y | Y | Y | Y | Y | N |
UART | Y | N | N | Y | Y | - |
The instruction and data memory are usually external to the processor, and the `WriteData` and `ReadData` lines are typically implemented as a single bidirectional bus interface. Different types of memories:
Type | Implementation | Usage | Access time | Cost (/GB) |
---|---|---|---|---|
SRAM | Flip-flops | Caching | 0.5-2.5 ns | $2k-5k |
DRAM | Capacitors | Main memory | 50-70 ns | $20-75 |
SSD | Flash | Disk storage | - | ~$0.30 |
HDD | Magnetic disk | Disk storage | 5-20 ms | ~$0.10 |
A suitable memory hierarchy needs to be implemented to balance between cost/speed and size. Some units for implementation, with each level holding most-frequently accessed data from next immediate higher level:
Some properties to be observed:
Memory capacity planning involves hit ratios, i.e. the probability a word is found when accessing a memory. Hit ratios are a function of memory capacities (higher capacity -> higher hit ratio), management policies (decision of what data to keep/drop), program behavior (written with higher locality in mind).
A good assumption is that the hit rate for the register level is $$h_0 = 0$$ (an LDR or STR is only issued when the data is not already in a register), and the hit rate for the outermost level is $$h_n = 1$$. We define the access frequency $$f_i$$ for $$M_i$$ as the probability of a series of cache misses before a hit at that level,
$$$$ f_i = h_i \prod_{k<i} (1-h_k), \qquad{} \sum_i f_i = 1 $$$$
By the locality property, the inner levels are thus accessed more often than the outer levels. Some other metrics to quantify the memory performance,
Optimization is thus to minimize $$T_\text{eff}$$ given the cost constraint $$C_\text{total} < C_\text{max}$$ (or the other way around), ideally with $$T_\text{eff} \rightarrow t_1$$ (the access time of the innermost level) and $$C_\text{total} \rightarrow{} c_n$$ (the unit cost of the outermost level).
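The access-frequency formula can be sketched as follows; charging $$t_i$$ to an access that first hits at level $$i$$ is one common simplification of $$T_\text{eff}$$, and the hit ratios used below are made up:

```python
# Sketch of the access-frequency / effective-access-time model above,
# with h[i] the hit ratio of level M_{i+1}; the outermost level is
# forced to always hit (h_n = 1).

def access_frequencies(h):
    """f_i = h_i * prod_{k<i} (1 - h_k)."""
    h = h[:-1] + [1.0]          # outermost level always hits
    freqs, miss_prob = [], 1.0
    for hi in h:
        freqs.append(hi * miss_prob)
        miss_prob *= (1.0 - hi)
    return freqs

def effective_access_time(h, t):
    """T_eff = sum_i f_i * t_i (common simplification)."""
    return sum(f * ti for f, ti in zip(access_frequencies(h), t))
```

For example, a 95% cache hit ratio with a 1 ns cache and 60 ns main memory gives an effective access time of 0.95 × 1 + 0.05 × 60 = 3.95 ns.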
The contents of a block of memory words are transferred to the cache upon receipt of a read request from the CPU, and data is read directly by the cache controller for subsequent reads from locations in that block. Note that cache memory access is ideally transparent to the CPU. The cache controller has mapping functions to determine where a block should be placed in the cache, and replacement algorithms for evicting blocks that have not been referenced recently.
Control hazards also arise from exception handling. Procedure to deal with this:
ARM defines two types of exceptions:
Some examples of exceptions and where they occur in the pipeline (beware that multiple exceptions can occur simultaneously in a single clock cycle!):
Cause | Type | Pipeline stage |
---|---|---|
Arithmetic error (e.g. DIVZERO) | Synchronous | E |
Executing undefined instruction (i.e. Usage Fault) | Synchronous | D |
OS service request (e.g. TLB/page fault) | Synchronous | F, M |
I/O service request | Asynchronous | Any |
Hardware malfunction | Asynchronous | Any |
We need to have a mechanism to load the PC with the address of the exception handler, by extending the PC input mux with:
Exception handling splits the responsibility between hardware and software. In general:
Processors typically stay in thread mode (for application software) which can be unprivileged, and switches to handler mode (for exception handling) which is always privileged.
A typical computer architecture may look like this:
A hierarchy of bus speeds is typically more common:
Peripheral registers are typically accessed like memory:
To interface with peripherals, libraries of driver functions / API are usually provided to encapsulate the process of RW peripheral registers (internally uses pointers to said registers).
Some postulated numbers for the time taken in picoseconds by each instruction, based on each component (stage) of the single cycle processor. In a single-cycle processor, the clock cycle is limited by the slowest instruction. We can perform pipelining by fetching the next instruction before the current instruction has completed, so the clock cycle is now limited to the slowest stage.
The stages are termed "Fetch", "Decode", "Execute", "Memory", "Writeback" (FDEMW).
Since not all instructions use all stages, there will be wasted cycles.
This is easily achieved by inserting state registers between the distinct stages. Note that control signals and the register-write signal need to be delayed to the proper pipeline stage, as well as an appropriate change to the program counter (by propagating the PC+4 value directly to the next stage, whose value belongs to the previous instruction):
This is thus the reason why PC+8 is used instead of PC+4 (so there is no need to store the old value in another 32-bit register).
Several race conditions are created as a result of implementing pipelining, i.e. pipelining hazards. Some of the hazards and their mitigations.
This corresponds to attempts to use the same resource by two different instructions at the same time. One such example is when only a single memory is used, then there can be cases where different instructions attempt to read from it at the same time. Quick fix is to duplicate the problematic resource, e.g. to have separate instruction and data memories.
In a modern computer, the same memory stick is typically used, but the memory also maintains separate instruction and data caches. The cache miss rate is relatively low, so this is effective at mitigating most cache conflicts. Resolution of conflicts can still be done by slightly delaying the pipeline.
This is more common than the instruction-data structural hazard, since data memory is commonly written to as well. This is otherwise known as the RAW (read-after-write) hazard, or true data dependency. For the following 5-stage pipeline, the next three instructions may be potentially impacted:
Several methods of resolution:
- `0x00000000` -> `ANDEQ R0,R0,R0`
- `0xE1A00000` -> `MOV R0,R0` (typical `NOP` expansion)

Stalling strategy and implementation:
Main idea is to forward results directly between pipeline stages, when the next few instructions have to read from the same register that will be written to. In the diagram below, the commonly shared registers
See the example below:
```
ADD R2, R3, R4
ADD R2, R3, R4
ADD R1, R7, R1  // R1 read from E register
ADD R1, R3, R1  // R1 read from M register, i.e. forwarding
ORR R9, R6, R1  // R1 read from M register, prioritize latest write
SUB R2, R1, R1  // R1 read from W register, note need to forward to either operand
SUB R2, R7, R1  // R1 read from E register, register written back to register file already
```
See below for the implementation, where a control line for each of operands A and B determines whether the operand will be loaded from a forwarded line.
```
ADD R1, R3, R4  // WA3M == R1 (R1 will be written to,
                //   will be in M register during next instruction)
ADD R2, R1, R3  // RA1E == R1, so read from M register

# Implementation for register A line
# Similar for register B line
if ((RA1E == WA3M) & RegWriteM)
    ForwardAE = 10;
elif ((RA1E == WA3W) & RegWriteW)
    ForwardAE = 01;
else
    ForwardAE = 00;  // no forwarding
```
Recall:

- `LDR R1, [R4]` loads the register R1 with data read from the address specified in R4.
- `STR R1, [R3]` stores data in register R1 to the address specified in R3.

```
ForwardM = \            // forward from W to M
    ((RA2M == WA3W) \   // if data in W needed in M
     & MemWriteM \      // and M instruction is STR
     & MemtoRegW \      // and W instruction is LDR
     & RegWriteW)       // and the LDR instruction will be executed
```
This is called a memory-memory copy, since data read directly from memory is written directly to memory again. Note that this is not a typical operation for the processor; it can be delegated to a Direct Memory Access (DMA) controller to perform the copy directly.
Since the Memory stage is after the Execute stage, we need some way to forward data read from memory to the ALU. This can be solved by stalling the next instruction.
We need to stall the F and D stages so that the instructions will not advance. This can be achieved simply by disabling the write controls for the F and D registers.
The E stage is now effectively a NOP. The only four control signals that will change the processor state, and hence need to be zeroed in order for the Execute stage to functionally do nothing, are:
- `PCSE`: Write to PC
- `RegWE`: Write to register
- `MemWE`: Write to memory
- `FlagWE`: Write to ALU status flags

```
ldrstall = \             // stall the processor
    ((RA1D == WA3E) | \  // if RA1 is to be loaded from D stage
                         //   and RA1 will eventually be written to
     (RA2D == WA3E)) \   // same for RA2
    & MemtoRegE \        // and E instruction is LDR
    & RegWriteE          // and the E instruction will be executed
```
The branch instruction, as well as writes to R15 (which doubles as PC), can cause control hazards. Control hazards are difficult to mitigate, compared to data hazards.
Possible approaches:
- `LDR R15,...` is problematic since the branch target will only appear in the M stage.

The branch decision and target address come from the E stage:
For early BTA, if the branch prediction is wrong (per the outcome from the E stage), the instructions in the F and D stages need to be flushed. Note that this flushing is not actually necessary if the ISA defines 2 branch delay slots.
```
FlushD = FlushE = PCSrcE

# if including the stalling circuitry, then the FlushE
# control signal is overloaded.
FlushE = ldrstall || PCSrcE
```
A simple division scheme is achieved by performing restoring division:
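The scheme can be sketched as follows for unsigned operands: the partial remainder is shifted left one bit at a time, a trial subtraction of the divisor is made, and a negative result is "restored" by adding the divisor back (giving quotient bit 0):

```python
# Minimal sketch of n-bit restoring division (unsigned operands).

def restoring_divide(dividend, divisor, n=8):
    assert divisor != 0
    r, q = 0, 0
    for i in range(n - 1, -1, -1):
        r = (r << 1) | ((dividend >> i) & 1)  # bring down next dividend bit
        r -= divisor                          # trial subtraction
        if r < 0:
            r += divisor                      # restore: subtraction failed
            q = (q << 1) | 0
        else:
            q = (q << 1) | 1
    return q, r
```

For example, 29 / 5 yields quotient 5 and remainder 4 after 8 iterations.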
For fractional numbers, there are two main encoding systems:

- Fixed point: e.g. `0110.1100` corresponds to 6.75. Has the advantage of being simple. Negative values are encoded either in sign-magnitude form, e.g. `-7.5` as `1111.1000`, or in two's complement, e.g. `-7.5` as `1000.1000`.
```
0011.0000 x 0100.0100   (3.0 x 4.25, fixed point)
 00110000 x  01000100   (48 x 68, representation as integers)
     110011000000       (3264, result as integer)
         11001100       (right shift by 4 bits to account for scaling factor)
        1100.1100       (12.75, fixed point)
```
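The worked example above can be sketched as Q4.4 fixed-point arithmetic (the Q4.4 choice mirrors the 4 fractional bits in the example):

```python
# Sketch of Q4.4 fixed-point multiplication: values are multiplied as
# plain integers, then the product (which has 2x the fractional bits)
# is shifted right to return to Q4.4.

FRAC_BITS = 4

def to_fixed(x):
    return int(round(x * (1 << FRAC_BITS)))

def fixed_mul(a, b):
    return (a * b) >> FRAC_BITS   # rescale: product is in Q8.8

def to_float(x):
    return x / (1 << FRAC_BITS)
```

Here 3.0 and 4.25 become integers 48 and 68; their product 3264, shifted right by 4, gives 204, i.e. 12.75 in Q4.4.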
- Floating point: encoded as `sign * 1.mantissa * 2^(exponent - bias)` (e.g. `-58.25` == `0xC2690000`).
)To perform floating-point addition, we perform the following steps:
```
Number 1: 0x3FC00000  (1.1   * 2^0)
Number 2: 0x40500000  (1.101 * 2^1)

Extract significands:  1.1  , 1.101
Rshift smaller value:  0.11 , 1.101   (precision loss if too small)
Add significands:      10.011
Adjust significand:    1.0011
Round significand:     1.0011         (no need since <32-bits)
Reassemble number:     0x40980000
```
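The worked example can be cross-checked against Python's raw IEEE-754 bit patterns, using `struct` to round-trip between 32-bit words and floats:

```python
# Sketch: verify the float-addition walkthrough above at the bit level.
# 0x3FC00000 is 1.5 (1.1b * 2^0) and 0x40500000 is 3.25 (1.101b * 2^1);
# their sum should reassemble to 0x40980000.
import struct

def bits_to_float(bits):
    return struct.unpack(">f", bits.to_bytes(4, "big"))[0]

def float_to_bits(x):
    return int.from_bytes(struct.pack(">f", x), "big")

a = bits_to_float(0x3FC00000)   # 1.5
b = bits_to_float(0x40500000)   # 3.25
```

The same helpers confirm the earlier encoding example: `0xC2690000` round-trips to -58.25.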
The possible rounding choices are round-up, round-down, towards zero (truncation), or to nearest (ties to even).
One-bit adders are straightforward, and can be chained for an arbitrary number of bits, but the resulting ripple-carry adder is relatively slow due to the propagation of carries.
Carry-lookahead adders can precompute the carry after n-bits. To avoid excessively large fan-outs, as well as the quadratically increasing number of gates, the lookahead can be implemented only for 4-bits, then cascaded to form 16-bit adders.
Lookahead logic is as per the following, with $$g_i$$ generating a carry when A = B = 1 (independent of C), and $$p_i$$ propagating a carry when C = 1 and at least one of A = 1 or B = 1. The goal is to compute $$C_i$$ from the set of $$\{C_0, A_0, \ldots{}, A_{i-1}, B_0, \ldots{}, B_{i-1}\}$$ inputs alone:
$$$$ C_{i+1} = p_i C_i + g_i = (A_i + B_i) C_i + (A_i B_i) $$$$
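The recurrence can be sketched as follows (computed iteratively here for clarity; the hardware evaluates the unrolled product terms in parallel):

```python
# Sketch of the lookahead recurrence: g_i = A_i & B_i generates a
# carry, p_i = A_i | B_i propagates one, and C_{i+1} = g_i | (p_i & C_i).

def lookahead_carries(a, b, n, c0=0):
    """Return [C_0, C_1, ..., C_n] for n-bit operands a and b."""
    carries = [c0]
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        g, p = ai & bi, ai | bi
        carries.append(g | (p & carries[i]))
    return carries
```

The carries agree with ordinary addition: combining them with the sum bits $$s_i = A_i \oplus B_i \oplus C_i$$ reconstructs $$a + b$$.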
Subtraction is simply negating the right operand and performing an addition. All this functionality can be encapsulated in a monolithic Arithmetic Logic Unit (ALU), with the "ALUControl" bits determining which of {SUM, SUB, AND, OR} is performed. The NZCV flags (negative, zero, carry, overflow) can also be output by the ALU:
Equality comparison can be done via subtraction:
Shifting is done either by performing a barrel shift (i.e. wiring each input bit directly to its target output bit, which is relatively expensive), or by cascading shifts (for each set bit in the "shift amount", we perform that power-of-two shift at that stage):
The whole set of shifting and rotating instructions can be packaged into one unit:
Multiplication is performed by computing the partial products (of the multiplicand with the individual digits of the multiplier), shifting them, and summing them up. This is done either as an array multiplier (single cycle), or essentially a cascaded version of shift+product+sum:
Improved sequential multiplier
There are different multiplication instructions in the ARM instruction set:
- `UMULL`: multiplication of 2 n-bit unsigned numbers (output is 2n bits)
- `SMULL`: multiplication of 2 n-bit signed numbers
- `MUL`: n-bit LSB of the product of 2 n-bit numbers (the same regardless of sign).
See the following example, noting that the 2-bit LSB of the output `10` is identical regardless of whether signed or unsigned multiplication is performed. For correct long multiplication, sign extension to 2n bits needs to be performed, i.e. the MSb is propagated.
```
  10  <-- unsigned 2 or signed -2
  11  <-- unsigned 3 or signed -1
----
0010
0100
----
0110  <-- unsigned 6
```
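The claim that the low n bits of the product are sign-independent can be sketched as:

```python
# Sketch: the n-bit LSB of a product is the same whether the n-bit
# operands are read as signed or unsigned, which is why a single MUL
# serves both interpretations.

def to_signed(x, n):
    """Reinterpret an n-bit unsigned pattern as two's complement."""
    return x - (1 << n) if x & (1 << (n - 1)) else x

def low_bits_match(a, b, n):
    mask = (1 << n) - 1
    unsigned = (a * b) & mask
    signed = (to_signed(a, n) * to_signed(b, n)) & mask
    return unsigned == signed
```

For the 2-bit example above: 2 × 3 = 6 and (-2) × (-1) = 2 share the low bits `10`.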
A summary of the ARM instruction set, and making sense of it... Need to remember that this is a 32-bit microarchitecture, so 1 word is 32 bits (4 bytes). Update: Spent 2+ hours summarizing this, yay me.
Today's lecture on the addressing modes for each data processing / memory instruction, etc. Not particularly useful unless one does not have prior experience reading a microarchitecture datasheet.
A special note on the branch instruction (A4.1.5):
- The `signed_immed_24` field is a two's-complement signed 24-bit value; a quick recap below.

```
0...00110 ==  6
1...11010 == -6   (two's complement)
```
The summary of the main instructions we'll cover and try to implement in the course:
We cover the implementation of a single-cycle processor here. This is the super interesting part!
Explanation of the processor diagram
The (barebones) complete single-cycle ARM processor with a control unit is summarized as follows:
The control unit can be abstracted into two separate units, the decoder and the conditional logic circuit (which handles the portion that can conditionally modify the state).
Decomposition of conditional logic
This single cycle processor is relatively simple with fetch + decode + execute instructions executed in a single cycle, but has two main disadvantages:
Execution speed can be improved by other technologies, e.g. pipelining, superscalar, multi-processors.
We ought to view FPGA execution flow as groups of combinatorial logic circuits, separated by "brick-wall" registers. The longest critical-path delay between successive registers determines the maximum clock rate.
Diagram to illustrate max clock rate
Some tips:
- `Z = AND(X,-X)` can result in transient spikes.

Synchronous edge detection circuit
- A `reg` is something that can hold a value in a block, but does not necessarily have to be a physical register. This means regs synthesized as combinational circuits cannot be meaningfully initialized.
- Asynchronous vs synchronous code in `always` blocks:

```verilog
always @(posedge CLK or posedge RST) begin
    if (RST)
        // Asynchronous code here for SET/RST
        Z <= 1'b0;
    else begin
        // Synchronous code here
        Z <= X | Y;
    end
end
```
A small commercial break to explain the differences between blocking and non-blocking assignments:
 | Blocking | Non-blocking |
---|---|---|
Statement | = | <= |
Evaluation order | Ordered (source-code) | Unordered |
Output circuit | Parallel | Series |
A small note on inferred registers:
```verilog
always @(posedge CLK) begin
    B = A;
    C = B ^ Z;
    D <= C;   // D is a reg
    E = C;
    A = E;    // A is an implicit reg, should be written with '<='
end
```
We repeat the distinction between architecture and micro-architecture: the former is defined by the set of architecture instructions, while the latter is about implementing that architecture in hardware.
ARM follows the design principle which includes:
The architecture in a bit more detail:
- `ADD R1, R2, R3` saves result to R1.
- `ADDEQ R1, R2, R3` saves result to R1 only if Z flag is set.

Format:
The rotation value is twice the `rot` value. To allow up to 32-bit rotations (so that we can ideally reach the full range of 32-bit values), we would need 5 bits for `rot`. But only 4 bits are available in this instruction format, so we halve the resolution of the rotation (i.e. the `rot` value has an implicit 0 appended at the end). This, combined with the 8-bit immediate value, means the only values that can be encoded are those whose 1s span at most 8 bits and align with two-bit boundaries.
Example:
```
SUB R2, R3, #0xFF0
+-- cond:  1110     (14)  unconditioned
+-- op:    00       (0)   data processing
+-- I-bit: 1        (1)   src2 is immediate
+-- cmd:   0010     (2)   subtraction
+-- S-bit: 0        (0)   no set flag
+-- Rn:    0011     (3)
+-- Rd:    0010     (2)
+-- rot:   1110     (14)
+-- imm8:  11111111 (255)

See that:

  0xFF0 == 0b0...0_00001111_11110000

which is (0xFF << 4) = (0xFF ROR 28) = (0xFF ROR 0b11100).
So rot == 0b1110 with implicit 0 at the end, and imm8 == 0xFF.

The instruction is thus 0x E2 43 2E FF.
```
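The rot/imm8 rule can be sketched as a search over the 16 even rotations (`encode_imm` is a hypothetical helper, not part of any assembler API):

```python
# Sketch: a 32-bit constant is encodable as a data-processing immediate
# iff it equals some 8-bit value rotated right by an even amount; the
# rot field stores half the rotation.

def ror32(x, r):
    r %= 32
    return ((x >> r) | (x << (32 - r))) & 0xFFFFFFFF

def encode_imm(value):
    """Return (rot, imm8) such that value == imm8 ROR (2*rot), else None."""
    for rot in range(16):
        imm8 = ror32(value, 32 - 2 * rot)   # rotate value LEFT by 2*rot
        if imm8 <= 0xFF and ror32(imm8, 2 * rot) == value:
            return rot, imm8
    return None
```

For the example above, `0xFF0` encodes as rot 14 and imm8 0xFF, while a value like `0x102` (whose set bits cannot be aligned into one byte by an even rotation) is not encodable.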
Tip
Important to note that the number of cycles taken per instruction is not necessarily due to the need to chain instructions, but also due to the settling time of the logic levels. This means that at a higher clock speed, instructions will potentially take more cycles to complete.
A common benchmark for computers is the set of SPEC benchmarks (SPEC CPU 2017), which includes computation time and power consumption. Power consumption is measured across workloads using `ssj_ops` (server-side Java operations) per watt, averaged across workload levels in 10% increments.
Some non-linearities in system metrics:
$$$$ \text{MIPS} = \frac{\text{number of instructions}}{\text{execution time} \times 10^6} = \frac{\text{clock rate}}{\text{cycles per instruction} \times 10^6} $$$$
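As a quick sketch of the formula (the 2 GHz clock and 1.25 CPI figures below are made up):

```python
# Sketch of the MIPS metric: clock rate divided by (CPI * 10^6).

def mips(clock_rate_hz, cpi):
    return clock_rate_hz / (cpi * 1e6)
```

A hypothetical 2 GHz processor averaging 1.25 cycles per instruction rates at 1600 MIPS; note the metric says nothing about how much work each instruction does.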
Some trivia:
Instruction Set Architecture (ISA) is the abstraction between hardware and software; a particular hardware implementation of an ISA is its microarchitecture. The ISA combined with the operating system interface (which defines how data is piped, and how functions should be called and return) is termed the Application Binary Interface (ABI), which provides binary compatibility between computers.
Examples of popular ISAs:
ASIC, ASSP, etc.
Processor/GPU: Intel, AMD, Nvidia, Broadcom, Qualcomm
SoC: Snapdragon, A-Series, AMD APU, etc.
Processor information:
Note the rough power scaling model, with capacitive load $$C$$, supply voltage $$V$$ and clock frequency $$f$$:
$$$$ P = CV^2 f $$$$
Methods to improve performance: