CG3207 Computer Architecture

Pinned resources

  • Rajesh C Panicker (email: rajesh@nus.edu.sg)
  • Canvas inbox
  • References:
    • Harris - Digital Design and Computer Architecture
    • Patterson, Hennessy - Computer Organization and Design
  • Background:
    • Assembly in ARM version 3, or RISC-V
    • HDL in Verilog, or VHDL

Week 9

Addressing I/O device/register

An example implementation of the address decoder: the higher-order bits select the I/O device, and the lower-order bits select the address within the device. The address decoder also handles reads, selecting which device drives the read data.

0x800 == 0b1000_0000_0000
0x9FF == 0b1001_1111_1111
           ---===========
        device  address
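As a quick sketch of the split above (the 3-bit device / 9-bit offset widths are inferred from the 0x800-0x9FF example range; this is an illustration, not the actual HDL):

```python
# Split a 12-bit I/O address into a device field and an in-device offset,
# assuming the top 3 bits select the device (as in the 0x800-0x9FF range).
DEVICE_BITS = 3
OFFSET_BITS = 9

def decode(addr):
    device = (addr >> OFFSET_BITS) & ((1 << DEVICE_BITS) - 1)
    offset = addr & ((1 << OFFSET_BITS) - 1)
    return device, offset

print(decode(0x800))  # (4, 0)   -> device 0b100, offset 0x000
print(decode(0x9FF))  # (4, 511) -> device 0b100, offset 0x1FF
```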

Note that we avoid byte-addressable memory, as opposed to word-addressable memory, in the FPGA, because the former maps suboptimally (using an entire flip-flop per bit of storage rather than block RAM).

Performing I/O

I/O protocols

Protocol  Serial  Synchronous  Bus  Master-slave  Full-duplex  In-band
I2C       Y       Y            Y    Y             N            Y
SPI       Y       Y            Y    Y             Y            N
UART      Y       N            N    Y             Y            -

R/W memory principles

The instruction and data memory are usually external to the processor, and the WriteData and ReadData lines are typically implemented as a single bidirectional bus interface. Different types of memories:

Type  Implementation  Usage         Access time  Cost (/GB)
SRAM  Flip-flops      Caching       0.5-2.5 ns   $2k-5k
DRAM  Capacitors      Main memory   50-70 ns     $20-75
SSD   Flash           Disk storage  -            ~$0.30
HDD   Magnetic disk   Disk storage  5-20 ms      ~$0.10

A suitable memory hierarchy needs to be implemented to balance cost/speed against size. Some units for implementation, with each level holding the most-frequently accessed data from the next (larger, slower) level above it:

Some properties to be observed:

Memory capacity planning involves hit ratios, i.e. the probability that a word is found when accessing a given memory level. Hit ratios are a function of memory capacity (higher capacity -> higher hit ratio), management policies (what data to keep or evict), and program behavior (code written with locality in mind).

A good assumption is that the hit rate at the register level is $$h_0 = 0$$ (if the data were already in a register, no LDR or STR would be needed), and the hit rate at the outermost level is $$h_n = 1$$. We define the access frequency $$f_i$$ for $$M_i$$ as the probability of misses at all inner levels followed by a hit at level $$i$$,

$$$$ f_i = h_i \prod_{k<i} (1-h_k), \qquad{} \sum_i f_i = 1 $$$$

By the locality property, the inner levels are thus accessed more often than the outer levels. Some other metrics to quantify the memory performance,

Optimization is thus to minimize $$T_\text{eff}$$ given a cost constraint $$C_\text{total} < C_\text{max}$$ (or the other way around), the ideal being $$T_\text{eff} \rightarrow t_1$$ (the speed of the fastest level) and $$C_\text{total} \rightarrow c_n$$ (the cost per byte of the cheapest level).
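The access-frequency formula above can be turned into a small sketch (the hit ratios and access times below are illustrative, not from the lecture):

```python
# Effective access time of a memory hierarchy from per-level hit ratios.
# h[i]: hit ratio at level i (outermost must be 1), t[i]: access time in ns.
h = [0.95, 0.99, 1.0]     # e.g. cache, main memory, disk (made-up numbers)
t = [1.0, 60.0, 1e7]

# f_i = h_i * prod_{k<i} (1 - h_k)
f, miss = [], 1.0
for hi in h:
    f.append(hi * miss)
    miss *= 1 - hi

T_eff = sum(fi * ti for fi, ti in zip(f, t))
print(f)      # [0.95, 0.0495..., 0.0005...] -- sums to 1
print(T_eff)  # dominated by the rare but very slow outermost accesses
```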

The contents of a block of memory words are transferred to the cache upon receipt of a read request from the CPU, and subsequent reads from locations in that block are served directly by the cache controller. Note that cache memory access is ideally transparent to the CPU. The cache controller has mapping functions to determine where a block should be placed in the cache, and replacement algorithms for evicting blocks that have not been referenced recently.


Week 8

Control hazard: Exception handling

Control hazards also arise from exception handling. Procedure to deal with this:

  1. Stop execution of the offending instruction midstream
  2. Let all prior instructions complete
  3. Flush all following instructions
  4. Set a register to show the cause of the exception and save the instruction's address
  5. Jump to an exception handler address

ARM defines two types of exceptions:

  1. Interrupts: Asynchronous to program execution (caused by external events), so instructions in the pipeline can be safely completed.
  2. Traps: Synchronous to program execution (caused by internal events), the trap handler must deal with that specific instruction midstream in the pipeline.

Some examples of exceptions and where they occur in the pipeline (beware that multiple exceptions can occur simultaneously in a single clock cycle!):

Cause                                               Type          Pipeline stage
Arithmetic error (e.g. DIVZERO)                     Synchronous   E
Executing undefined instruction (i.e. Usage Fault)  Synchronous   D
OS service request (e.g. TLB/page fault)            Synchronous   F, M
I/O service request                                 Asynchronous  Any
Hardware malfunction                                Asynchronous  Any

We need to have a mechanism to load the PC with the address of the exception handler, by extending the PC input mux with:

Exception handling splits the responsibility between hardware and software. In general:

Processors typically stay in thread mode (for application software), which can be unprivileged, and switch to handler mode (for exception handling), which is always privileged.

Generic computer system

A typical computer architecture may look like this:

A hierarchy of bus speeds is typically more common:

Peripheral registers are typically accessed like memory:

To interface with peripherals, libraries of driver functions / APIs are usually provided to encapsulate reading/writing the peripheral registers (internally using pointers to those registers).


Week 7

Some postulated numbers for the time taken in picoseconds by each instruction, based on each component (stage) of the single cycle processor. In a single-cycle processor, the clock cycle is limited by the slowest instruction. We can perform pipelining by fetching the next instruction before the current instruction has completed, so the clock cycle is now limited to the slowest stage.

The stages are termed "Fetch", "Decode", "Execute", "Memory", "Writeback" (FDEMW). Since not all instructions use all stages, there will be wasted cycles.

This is easily achieved by inserting state registers between the distinct stages. Note that control signals and the register-write signal need to be delayed to the proper pipeline stage, as well as an appropriate change to the program counter (by propagating the PC+4 value directly to the next stage, whose value belongs to the previous instruction):

This is also why reads of R15 return PC+8 rather than PC+4: the PC+4 of the instruction currently in Fetch is exactly PC+8 of the instruction in Decode, so the existing value can be reused without storing it in another 32-bit register.

Several race conditions are created as a result of implementing pipelining, i.e. pipeline hazards. Some of the hazards and their mitigations:

Structural hazards

This corresponds to attempts to use the same resource by two different instructions at the same time. One such example is when only a single memory is used, then there can be cases where different instructions attempt to read from it at the same time. Quick fix is to duplicate the problematic resource, e.g. to have separate instruction and data memories.

In a modern computer, the same main memory is typically shared, but the processor maintains separate instruction and data caches. The cache miss rate is relatively low, so this is effective at mitigating most conflicts; the remainder can still be resolved by briefly stalling the pipeline.

Data hazards

This is more common than the instruction-data structural hazard, since data memory is commonly written to as well. This is otherwise known as the RAW (read-after-write) hazard, or true data dependency. For the following 5-stage pipeline, the next three instructions may be potentially impacted:

Several methods of resolution:

Stalling strategy and implementation:

Data forwarding for Execute (ALU)

Main idea is to forward results directly between pipeline stages, when the next few instructions have to read from the same register that will be written to. In the diagram below, the commonly shared registers are highlighted.

See the example below:

ADD R2, R3, R4
ADD R2, R3, R4
ADD R1, R7, R1  // R1 read from E register
ADD R1, R3, R1  // R1 read from M register, i.e. forwarding
ORR R9, R6, R1  // R1 read from M register, prioritize latest write
SUB R2, R1, R1  // R1 read from W register, note need to forward to either operand
SUB R2, R7, R1  // R1 read from E register, register written back to register already

See below for the implementation, where a control line for each of operands A and B determines whether it is loaded from a forwarded line.

ADD R1, R3, R4  // WA3M == R1  (R1 will be written to,
                // will be in M register during next instruction)
ADD R2, R1, R3  // RA1E == R1, so read from M register

# Implementation for the operand A line
# (similar for the operand B line)
if      ((RA1E == WA3M) & RegWriteM) ForwardAE = 2'b10;  // forward from M
else if ((RA1E == WA3W) & RegWriteW) ForwardAE = 2'b01;  // forward from W
else                                 ForwardAE = 2'b00;  // no forwarding

Data forwarding for Memory (LDR/STR)

Recall:

ForwardM = \           // forward from W to M
    ((RA2M == WA3W) \  // if data in W needed in M
    & MemWriteM \      // and M instruction is STR
    & MemtoRegW \      // and W instruction is LDR
    & RegWriteW)       // and the LDR instruction will be executed

This is called a memory-memory copy, since data read from memory is directly written back to memory. Note that this is not a typical operation for the processor; it can be delegated to a Direct Memory Access (DMA) controller, which performs the copy directly.

Data forwarding (memory-to-ALU)

Since the Memory stage is after the Execute stage, we need some way to forward data read from memory to the ALU. This can be solved by stalling the next instruction.

We need to stall the F and D stages so that the instructions will not advance. This can be achieved simply by disabling the write controls for the F and D registers.

The E stage is now effectively a NOP. The only four control signals that will change the processor state, and hence need to be zeroed in order for the Execute stage to functionally do nothing, are:

ldrstall = \             // stall the processor
    ((RA1D == WA3E) | \  // if the D-stage instruction reads a register
     (RA2D == WA3E))  \  // that the E-stage instruction will write
    & MemtoRegE \        // and the E-stage instruction is an LDR
    & RegWriteE          // and that LDR will actually write back

Control hazards

The branch instruction, as well as writes to R15 (which doubles as PC), can cause control hazards. Control hazards are difficult to mitigate, compared to data hazards.

Possible approaches:

The branch decision and target address comes from the E stage:

For early BTA, if the branch resolves against the fetched path (the outcome comes from the E stage), the instructions in the F and D stages need to be flushed. Note that this flushing is not actually necessary if the ISA has 2 branch delay slots.

FlushD = FlushE = PCSrcE

# if including the stalling circuitry, then the FlushE
# control signal is overloaded.
FlushE = ldrstall || PCSrcE


Week 6

Division arithmetic

A simple division scheme is achieved by performing restoring division:

Worked example of restoring division
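The worked example can be reproduced with a behavioural sketch of restoring division in Python (register width and naming are illustrative):

```python
# Restoring division: shift the remainder/quotient pair left, subtract the
# divisor from the remainder; if it goes negative, add it back ("restore"),
# otherwise record a 1 in the quotient.
def restoring_divide(dividend, divisor, n=8):
    r, q = 0, dividend
    for _ in range(n):
        r = (r << 1) | ((q >> (n - 1)) & 1)  # shift in the next dividend bit
        q = (q << 1) & ((1 << n) - 1)
        r -= divisor
        if r < 0:
            r += divisor                     # restore
        else:
            q |= 1                           # subtraction succeeded
    return q, r

print(restoring_divide(13, 4))   # (3, 1): 13 = 4*3 + 1
```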

Fractional numbers

For fractional numbers, there are two main encoding systems:

0011.0000 x 0100.0100   (3.0 x 4.25, fixed point)
 00110000 x  01000100   (48 x 68, representation as integers)
         110011000000   (3264, result as integer)
             11001100   (right shift by 4 bits to account for scaling factor)
            1100.1100   (12.75, fixed point)
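The same computation as a Python sketch, using a Q4.4 format (4 integer + 4 fractional bits, matching the example above):

```python
# Q4.4 fixed-point multiply: multiply as integers, then shift right by the
# number of fractional bits to remove the doubled scaling factor.
FRAC = 4
def to_fixed(x):   return int(round(x * (1 << FRAC)))
def from_fixed(v): return v / (1 << FRAC)

a, b = to_fixed(3.0), to_fixed(4.25)   # 48, 68 as integers
prod = (a * b) >> FRAC                 # 3264 >> 4 = 204
print(from_fixed(prod))                # 12.75
```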

To perform floating-point addition, we perform the following steps:

Number 1: 0x3FC00000 (1.1   * 2^0)
Number 2: 0x40500000 (1.101 * 2^1)

Extract significands:  1.1  , 1.101
Rshift smaller value:  0.11 , 1.101  (precision loss if too small)
Add significands:     10.011
Adjust significand:    1.0011
Round significand:     1.0011 (no rounding needed; fits in the 24-bit significand)

Reassemble number:  0x40980000
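The worked example can be cross-checked in Python by reinterpreting the bit patterns with the standard struct module:

```python
import struct

# Reinterpret 32-bit patterns as IEEE-754 single-precision floats and back.
def bits_to_float(u):
    return struct.unpack('>f', struct.pack('>I', u))[0]

def float_to_bits(x):
    return struct.unpack('>I', struct.pack('>f', x))[0]

a = bits_to_float(0x3FC00000)      # 1.5  = 1.1b   * 2^0
b = bits_to_float(0x40500000)      # 3.25 = 1.101b * 2^1
print(a + b)                       # 4.75
print(hex(float_to_bits(a + b)))   # 0x40980000
```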

The possible rounding choices are round up, round down, toward zero (truncation), or to nearest (ties to even).


Week 5

Arithmetic

One-bit adders are straightforward, and can be chained for an arbitrary number of bits, but the resulting ripple-carry adder is relatively slow due to the propagation of carries.

Carry-lookahead adders precompute the carry across n bits. To avoid excessively large fan-outs, as well as the quadratically increasing gate count, the lookahead is typically implemented over 4 bits, then cascaded to form 16-bit adders.

Lookahead logic is as follows, with $$g_i$$ generating a carry when A = B = 1 (independent of the incoming carry), and $$p_i$$ propagating a carry when the incoming carry is 1 and at least one of A = 1 or B = 1. The goal is to compute $$C_i$$ from the set of $$\{C_0, A_0, \ldots{}, A_{i-1}, B_0, \ldots{}, B_{i-1}\}$$ inputs:

$$$$ C_{i+1} = p_i C_i + g_i = (A_i + B_i) C_i + (A_i B_i) $$$$
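A quick sketch of the lookahead recurrence, checked against ordinary integer addition:

```python
# Compute every carry from generate (g_i = A_i & B_i) and propagate
# (p_i = A_i | B_i) signals, using C_{i+1} = g_i + p_i * C_i.
def cla_carries(A, B, n, c0=0):
    c = [c0]
    for i in range(n):
        a, b = (A >> i) & 1, (B >> i) & 1
        g, p = a & b, a | b
        c.append(g | (p & c[i]))
    return c

A, B, n = 0b1011, 0b0110, 4
c = cla_carries(A, B, n)
print(c[-1] == (((A + B) >> n) & 1))   # carry-out matches real addition: True
```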

Subtraction is simply negating the right operand and performing an addition. All this functionality can be encapsulated in a monolithic Arithmetic Logic Unit (ALU), with the "ALUControl" bits determining which of {ADD, SUB, AND, OR} is performed. The NZCV flags (negative, zero, carry, overflow) can also be generated by the ALU:

Equality comparison can be done via subtraction:

Shifting is done either with a barrel shifter (routing each input bit directly to its target output bit, which is relatively expensive), or with cascaded shifts (for each set bit in the "shift amount", shift by that power of two at that stage):
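The cascaded-shift idea can be sketched as follows: each bit of the 5-bit shift amount conditionally shifts by its power of two, so a full 32-bit shift needs only 5 stages.

```python
# Logarithmic (cascaded) left shifter: stage k shifts by 2^k if bit k of
# the shift amount is set.
def log_shift_left(x, amount, width=32):
    mask = (1 << width) - 1
    for k in range(5):               # stages: 1, 2, 4, 8, 16
        if (amount >> k) & 1:
            x = (x << (1 << k)) & mask
    return x

print(log_shift_left(1, 13))         # 8192, same as 1 << 13
```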

The whole set of shifting and rotating instructions can be packaged into one unit:

Multiplication

Multiplication is performed by computing the partial products (of the multiplicand with the individual digits of the multiplier), shifting them, and summing them up. This is done either as an array multiplier (single cycle), or as a cascaded/sequential version of shift+product+sum:

Improved sequential multiplier

There are different multiplication instructions in the ARM instruction set:

See the following example, noting that the 2 least significant bits of the output (10) are identical regardless of whether signed or unsigned multiplication is performed. For correct signed long multiplication, the partial products need to be sign-extended to 2n bits, i.e. the MSb propagated.

  10  <-- unsigned 2 or signed -2
  11  <-- unsigned 3 or signed -1
----
0010
0100
----
0110  <-- unsigned 6
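The claim about the low bits can be checked exhaustively in Python: an n-bit pattern and its two's-complement reading are congruent mod 2^n, so the low n bits of the product must agree.

```python
# Low n bits of an n-bit x n-bit product are the same whether the operands
# are read as unsigned or as two's-complement signed values.
def signed(v, n):
    return v - (1 << n) if v & (1 << (n - 1)) else v

def low_bits_match(a, b, n):
    mask = (1 << n) - 1
    return (a * b) & mask == (signed(a, n) * signed(b, n)) & mask

print(all(low_bits_match(a, b, 2) for a in range(4) for b in range(4)))  # True
```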

Week 4

A summary of the ARM instruction set, and making sense of it... Remember that this is a 32-bit microarchitecture, so 1 word is 32 bits (4 bytes). Update: spent 2+ hours summarizing this, yay me.

Today's lecture on the addressing modes for each data processing / memory instruction, etc. Not particularly useful unless one does not have prior experience reading a microarchitecture datasheet.

A special note on the branch instruction (A4.1.5):

0...00110 == 6
1...11010 == -6 (two's complement)
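Sign extension of such a two's-complement offset can be sketched as:

```python
# Sign-extend an n-bit two's-complement pattern to a Python integer:
# subtract the weight of the sign bit instead of adding it.
def sign_extend(value, bits):
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

print(sign_extend(0b00110, 5))   # 6
print(sign_extend(0b11010, 5))   # -6
```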

The summary of the main instructions we'll cover and try to implement in the course:

Architectural state elements

We cover the implementation of a single-cycle processor here. This is the super interesting part!

Explanation of the processor diagram

The (barebones) complete single-cycle ARM processor with a control unit is summarized as follows:

Control unit

The control unit can be abstracted into two separate units, the decoder and the conditional logic circuit (which handles the portion that can conditionally modify the state).

Decomposition of conditional logic

Decomposition of the main decoder

Extension to handle CMP

Extension to handle shifted register operations

Single-cycle processor conclusion

This single-cycle processor is relatively simple, with fetch + decode + execute all completed in a single cycle, but it has two main disadvantages:

  1. No datapath resource can be reused more than once => duplication of resources
  2. Cycle time will be limited by the longest instruction (LDR)

Execution speed can be improved by other technologies, e.g. pipelining, superscalar, multi-processors.


Week 3

Verilog overview

We ought to view an FPGA design as groups of combinational logic circuits, separated by "brick-wall" registers. The longest critical path delay between successive registers determines the maximum clock rate.

Diagram to illustrate max clock rate

Some tips:

Synchronous edge detection circuit

Missing flip-flop

Initialization circuit

always @(posedge CLK or posedge RST) begin
    if (RST)
        // Asynchronous code here for SET/RST
        Z <= 1'b0;
    else begin
        // Synchronous code here
        Z <= X | Y;
    end
end

A small commercial break to explain the differences between blocking and non-blocking assignments:

                  Blocking                Non-blocking
Statement         =                       <=
Evaluation order  Ordered (source-code)   Unordered
Output circuit    Parallel                Series

A small note on inferred registers:

always @(posedge CLK) begin
    B = A;
    C = B ^ Z;
    D <= C;  // D is a reg
    E = C;
    A = E;  // A is an implicit reg, should be written with '<='
end

Solution

(Single-cycle) Processor

We repeat the distinction between architecture and microarchitecture: the former is defined by the instruction set, while the latter is about implementing that architecture in hardware.

ARM follows the design principle which includes:

The architecture in a bit more detail:

Format:

The rotation value is twice the rot field. To allow up to 32-bit rotations (so that we can ideally reach the full range of 32-bit values), we would need a 5-bit rot, but only 4 bits are available in this instruction format, so the rotation resolution is halved (i.e. the rot value has an implicit 0 appended). Combined with the 8-bit immediate, this means the only encodable values are 8-bit-wide patterns of 1s rotated by an even number of bits.

Example:

SUB R2, R3, #0xFF0

+-- cond: 1110 (14) unconditioned
+-- op: 00 (0) data processing
+-- cmd: 0010 (2) subtraction
+-- I-bit: 1 (1) src2 is immediate
+-- S-bit: 0 (0) no set flag
+-- Rn: 0011 (3)
+-- Rd: 0010 (2)
+-- rot: 1110 (14)
+-- imm8: 11111111 (255)

See that:

  0xFF0 == 0b0...0_00001111_11110000

which is (0xFF << 4) = (0xFF ROR 28) = (0xFF ROR 0b11100).
So rot == 0b1110 with implicit 0 at the end, and imm8 == 0xFF.
The instruction is thus,

  0x E2 43 2E FF
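The rot/imm8 pair can be found by brute force over the 16 rotation values (a hypothetical helper for checking encodings by hand, not the assembler's actual algorithm):

```python
# Find (rot, imm8) such that value == imm8 ROR (2*rot) within 32 bits,
# as in the ARM data-processing immediate format above.
def encode_imm(value):
    for rot in range(16):
        r = 2 * rot
        # Rotating the value LEFT by r undoes a right rotation by r.
        imm = ((value << r) | (value >> (32 - r))) & 0xFFFFFFFF
        if imm <= 0xFF:
            return rot, imm
    return None                      # not encodable as an immediate

print(encode_imm(0xFF0))             # (14, 255): 0xFF ROR 28 == 0xFF0
```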

Week 2

Tip

Important to note that the number of cycles taken per instruction is not necessarily due to the need to chain operations, but also due to the settling time of the logic levels. This means that at a higher clock speed, instructions will potentially take more cycles to complete.

A common benchmark for computers is the set of SPEC benchmarks (SPEC CPU 2017), covering computation time and power consumption. Power consumption is measured across workloads using ssj_ops (server-side Java operations) per Watt, averaged over workload levels in 10% increments.

Some non-linearities in system metrics:

$$$$ \text{MIPS} = \frac{\text{number of instructions}}{\text{execution time} \times 10^6} = \frac{\text{clock rate}}{\text{cycles per instruction} \times 10^6} $$$$
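Both forms of the formula give the same figure; e.g. with illustrative numbers:

```python
# MIPS computed two ways (made-up numbers: 2 GHz clock, CPI of 1.25).
clock_rate   = 2e9
cpi          = 1.25
instructions = 1e9
exec_time    = instructions * cpi / clock_rate   # seconds

mips_from_time  = instructions / (exec_time * 1e6)
mips_from_clock = clock_rate / (cpi * 1e6)
print(mips_from_time, mips_from_clock)           # 1600.0 1600.0
```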

Some trivia:


Week 1

The Instruction Set Architecture (ISA) is the abstraction between software and hardware; a particular hardware implementation of an ISA is a microarchitecture. The ISA combined with the operating system interface (which defines how data is piped, and how functions are called and return) is termed the Application Binary Interface (ABI), which provides binary compatibility between computers.

Examples of popular ISAs:

  1. ARM family, e.g. ARMv7-M (Cortex-M series), ARMv7-A
  2. x86 family, e.g. AMD64/x86_64

Some fun facts

ASIC, ASSP, etc.

Processor/GPU: Intel, AMD, Nvidia, Broadcom, Qualcomm

SoC: Snapdragon, A-Series, AMD APU, etc.

Processor information:

Note the rough power scaling model, with capacitance C, supply voltage V, and switching frequency f:

$$$$ P = CV^2 f $$$$
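A quick numeric illustration of the quadratic voltage dependence (the component values are made up):

```python
# Dynamic power P = C * V^2 * f: scaling voltage down pays off quadratically.
def dynamic_power(C, V, f):
    return C * V**2 * f

p_high = dynamic_power(1e-9, 1.2, 2e9)   # 1 nF, 1.2 V, 2 GHz -> 2.88 W
p_low  = dynamic_power(1e-9, 0.9, 2e9)   # same chip at 0.9 V
print(p_low / p_high)                    # (0.9/1.2)^2 = 0.5625
```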

Methods to improve performance: