Pinned resources
The compiler may optimize `a=b; c=a;` into `c=b;`, eliminating the intermediate read/write of `a`. But `a` is not a standard memory address but an I/O address, so the access must not be eliminated. We need to indicate this using the `volatile` keyword.

An example of an implementation for the address decoder: here the higher-order bits specify the I/O device, and the lower-order bits specify the address within the device. Reading is also handled by the address decoder, to select which device is being read.
```
0x800 == 0b1000_0000_0000
0x9FF == 0b1001_1111_1111
           ^^^ device address (upper bits constant across the range)
```
Note that we avoid byte-addressable memory, as opposed to word-addressable memory, in the FPGA, because of suboptimal implementation for the former (using an entire FF for the storage).
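A sketch of this decode logic, assuming the 0x800-0x9FF window above (device id in address bits [11:9], register offset in bits [8:0]); the bit widths are illustrative, not taken from any specific device:

```python
# Sketch of an address decoder for a memory-mapped I/O region.
# Assumes a 12-bit address space where bits [11:9] select the device
# and bits [8:0] select the register within the device, matching the
# 0x800-0x9FF example above (device id 0b100).

DEVICE_SHIFT = 9     # lower 9 bits address within the device
DEVICE_MASK = 0b111  # upper 3 bits select the device

def decode(addr):
    """Split a 12-bit I/O address into (device id, offset)."""
    device = (addr >> DEVICE_SHIFT) & DEVICE_MASK
    offset = addr & ((1 << DEVICE_SHIFT) - 1)
    return device, offset
```

Any address in 0x800-0x9FF decodes to device `0b100`, with the low 9 bits passed through as the offset within the device.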
Protocol | Serial | Synchronous | Bus | Master-slave | Full-duplex | In-band |
---|---|---|---|---|---|---|
I2C | Y | Y | Y | Y | N | Y |
SPI | Y | Y | Y | Y | Y | N |
UART | Y | N | N | Y | Y | - |
The instruction and data memory are usually external to the processor, and the `WriteData` and `ReadData` lines are typically implemented as a single bidirectional bus interface. Different types of memories:
Type | Implementation | Usage | Access time | Cost (/GB) |
---|---|---|---|---|
SRAM | Flip-flops | Caching | 0.5-2.5 ns | $2k-5k |
DRAM | Capacitors | Main memory | 50-70 ns | $20-75 |
SSD | Flash | Disk storage | - | ~$0.30 |
HDD | Magnetic disk | Disk storage | 5-20 ms | ~$0.10 |
A suitable memory hierarchy needs to be implemented to balance between cost/speed and size. Some units for implementation, with each level holding most-frequently accessed data from next immediate higher level:
Some properties to be observed:
Memory capacity planning involves hit ratios, i.e. the probability a word is found when accessing a memory. Hit ratios are a function of memory capacities (higher capacity -> higher hit ratio), management policies (decision of what data to keep/drop), program behavior (written with higher locality in mind).
A good assumption is that the hit rate for the register level is $$h_0 = 0$$ (an LDR or STR is only issued when the data is not already in a register), and the hit rate for the outermost level is $$h_n = 1$$. We define the access frequency $$f_i$$ for $$M_i$$ as the probability of a series of cache misses before a hit at that level,
$$$$ f_i = h_i \prod_{k<i} (1-h_k), \qquad{} \sum_i f_i = 1 $$$$
By the locality property, the inner levels are thus accessed more often than the outer levels. Some other metrics to quantify the memory performance,
Optimization is thus to minimize $$T_\text{eff}$$ given the cost constraint $$C_\text{total} < C_\text{max}$$ (or the other way around), ideally with $$T_\text{eff} \rightarrow t_1$$ (the access time of the innermost level) and $$C_\text{total} \rightarrow{} c_n$$ (the unit cost of the outermost level).
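The access-frequency formula can be sketched as follows; charging $$t_i$$ to an access that first hits at level $$i$$ is one common simplification of $$T_\text{eff}$$, and the hit ratios used below are made up:

```python
# Sketch of the access-frequency / effective-access-time model above,
# with h[i] the hit ratio of level M_{i+1}; the outermost level is
# forced to always hit (h_n = 1).

def access_frequencies(h):
    """f_i = h_i * prod_{k<i} (1 - h_k)."""
    h = h[:-1] + [1.0]          # outermost level always hits
    freqs, miss_prob = [], 1.0
    for hi in h:
        freqs.append(hi * miss_prob)
        miss_prob *= (1.0 - hi)
    return freqs

def effective_access_time(h, t):
    """T_eff = sum_i f_i * t_i (common simplification)."""
    return sum(f * ti for f, ti in zip(access_frequencies(h), t))
```

For example, a 95% cache hit ratio with a 1 ns cache and 60 ns main memory gives an effective access time of 0.95 × 1 + 0.05 × 60 = 3.95 ns.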
The contents of a block of memory words are transferred to the cache upon receipt of a read request from the CPU, and data is read directly by the cache controller for subsequent reads from locations in that block. Note that cache memory access is ideally transparent to the CPU. The cache controller has mapping functions to determine where a block should be placed in the cache, and replacement algorithms for evicting blocks that have not been referenced recently.
Control hazards also arise from exception handling. Procedure to deal with this:
ARM defines two types of exceptions:
Some examples of exceptions and where they occur in the pipeline (beware that multiple exceptions can occur simultaneously in a single clock cycle!):
Cause | Type | Pipeline stage |
---|---|---|
Arithmetic error (e.g. DIVZERO) | Synchronous | E |
Executing undefined instruction (i.e. Usage Fault) | Synchronous | D |
OS service request (e.g. TLB/page fault) | Synchronous | F, M |
I/O service request | Asynchronous | Any |
Hardware malfunction | Asynchronous | Any |
We need to have a mechanism to load the PC with the address of the exception handler, by extending the PC input mux with:
Exception handling splits the responsibility between hardware and software. In general:
Processors typically stay in thread mode (for application software) which can be unprivileged, and switches to handler mode (for exception handling) which is always privileged.
A typical computer architecture may look like this:
A hierarchy of bus speeds is typically more common:
Peripheral registers are typically accessed like memory:
To interface with peripherals, libraries of driver functions / API are usually provided to encapsulate the process of RW peripheral registers (internally uses pointers to said registers).
Some postulated numbers for the time taken in picoseconds by each instruction, based on each component (stage) of the single cycle processor. In a single-cycle processor, the clock cycle is limited by the slowest instruction. We can perform pipelining by fetching the next instruction before the current instruction has completed, so the clock cycle is now limited to the slowest stage.
The stages are termed "Fetch", "Decode", "Execute", "Memory", "Writeback" (FDEMW).
Since not all instructions use all stages, there will be wasted cycles.
This is easily achieved by inserting state registers between the distinct stages. Note that control signals and the register-write signal need to be delayed to the proper pipeline stage, as well as an appropriate change to the program counter (by propagating the PC+4 value directly to the next stage, whose value belongs to the previous instruction):
This is thus the reason why PC+8 is used instead of PC+4 (so there is no need to store the old value in another 32-bit register).
Several race conditions are created as a result of implementing pipelining, i.e. pipelining hazards. Some of the hazards and their mitigations.
This corresponds to attempts to use the same resource by two different instructions at the same time. One such example is when only a single memory is used, then there can be cases where different instructions attempt to read from it at the same time. Quick fix is to duplicate the problematic resource, e.g. to have separate instruction and data memories.
In a modern computer, the same memory stick is typically used, but the memory also maintains separate instruction and data caches. The cache miss rate is relatively low, so this is effective at mitigating most cache conflicts. Resolution of conflicts can still be done by slightly delaying the pipeline.
This is more common than the instruction-data structural hazard, since data memory is commonly written to as well. This is otherwise known as the RAW (read-after-write) hazard, or true data dependency. For the following 5-stage pipeline, the next three instructions may be potentially impacted:
Several methods of resolution:
- `0x00000000` -> `ANDEQ R0,R0,R0`
- `0xE1A00000` -> `MOV R0,R0` (typical `NOP` expansion)

Stalling strategy and implementation:
Main idea is to forward results directly between pipeline stages, when the next few instructions have to read from the same register that will be written to. In the diagram below, the commonly shared registers
See the example below:
```
ADD R2, R3, R4
ADD R2, R3, R4
ADD R1, R7, R1  // R1 read from E register
ADD R1, R3, R1  // R1 read from M register, i.e. forwarding
ORR R9, R6, R1  // R1 read from M register, prioritize latest write
SUB R2, R1, R1  // R1 read from W register, note need to forward to either operand
SUB R2, R7, R1  // R1 read from E register, register written back to register file already
```
See below for the implementation, where a control line for each of operands A and B determines whether the operand will be loaded from a forwarded line.
```
ADD R1, R3, R4  // WA3M == R1 (R1 will be written to,
                //   will be in M register during next instruction)
ADD R2, R1, R3  // RA1E == R1, so read from M register

# Implementation for register A line
# Similar for register B line
if ((RA1E == WA3M) & RegWriteM)
    ForwardAE = 10;
elif ((RA1E == WA3W) & RegWriteW)
    ForwardAE = 01;
else
    ForwardAE = 00;  // no forwarding
```
Recall:

- `LDR R1, [R4]` loads the register R1 with data read from the address specified in R4.
- `STR R1, [R3]` stores data in register R1 to the address specified in R3.

```
ForwardM = \            // forward from W to M
    ((RA2M == WA3W) \   // if data in W needed in M
     & MemWriteM \      // and M instruction is STR
     & MemtoRegW \      // and W instruction is LDR
     & RegWriteW)       // and the LDR instruction will be executed
```
This is called a memory-memory copy, since data read directly from memory is written directly to memory again. Note that this is not a typical operation for the processor; it can be delegated to a Direct Memory Access (DMA) controller to perform the copy directly.
Since the Memory stage is after the Execute stage, we need some way to forward data read from memory to the ALU. This can be solved by stalling the next instruction.
We need to stall the F and D stages so that the instructions will not advance. This can be achieved simply by disabling the write controls for the F and D registers.
The E stage is now effectively a NOP. The only four control signals that will change the processor state, and hence need to be zeroed in order for the Execute stage to functionally do nothing, are:
- `PCSE`: Write to PC
- `RegWE`: Write to register
- `MemWE`: Write to memory
- `FlagWE`: Write to ALU status flags

```
ldrstall = \             // stall the processor
    ((RA1D == WA3E) | \  // if RA1 is to be loaded from D stage
                         //   and RA1 will eventually be written to
     (RA2D == WA3E)) \   // same for RA2
    & MemtoRegE \        // and E instruction is LDR
    & RegWriteE          // and the E instruction will be executed
```
The branch instruction, as well as writes to R15 (which doubles as PC), can cause control hazards. Control hazards are difficult to mitigate, compared to data hazards.
Possible approaches:
- `LDR R15,...` is problematic since the branch target will only appear in the M stage.

The branch decision and target address come from the E stage:
For early BTA, if the branch prediction is wrong (per the outcome from the E stage), the instructions in the F and D stages need to be flushed. Note that this flushing is not actually necessary if the ISA defines 2 branch delay slots.
```
FlushD = FlushE = PCSrcE

# if including the stalling circuitry, then the FlushE
# control signal is overloaded.
FlushE = ldrstall || PCSrcE
```
A simple division scheme is achieved by performing restoring division:
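The scheme can be sketched as follows for unsigned operands: the partial remainder is shifted left one bit at a time, a trial subtraction of the divisor is made, and a negative result is "restored" by adding the divisor back (giving quotient bit 0):

```python
# Minimal sketch of n-bit restoring division (unsigned operands).

def restoring_divide(dividend, divisor, n=8):
    assert divisor != 0
    r, q = 0, 0
    for i in range(n - 1, -1, -1):
        r = (r << 1) | ((dividend >> i) & 1)  # bring down next dividend bit
        r -= divisor                          # trial subtraction
        if r < 0:
            r += divisor                      # restore: subtraction failed
            q = (q << 1) | 0
        else:
            q = (q << 1) | 1
    return q, r
```

For example, 29 / 5 yields quotient 5 and remainder 4 after 8 iterations.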
For fractional numbers, there are two main encoding systems:

- Fixed point: e.g. `0110.1100` corresponds to 6.75. Has the advantage of being simple. Negative values are encoded either in sign-magnitude form, e.g. `-7.5` as `1111.1000`, or in two's complement, e.g. `-7.5` as `1000.1000`.
```
0011.0000 x 0100.0100   (3.0 x 4.25, fixed point)
 00110000 x  01000100   (48 x 68, representation as integers)
     110011000000       (3264, result as integer)
         11001100       (right shift by 4 bits to account for scaling factor)
        1100.1100       (12.75, fixed point)
```
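The worked example above can be sketched as Q4.4 fixed-point arithmetic (the Q4.4 choice mirrors the 4 fractional bits in the example):

```python
# Sketch of Q4.4 fixed-point multiplication: values are multiplied as
# plain integers, then the product (which has 2x the fractional bits)
# is shifted right to return to Q4.4.

FRAC_BITS = 4

def to_fixed(x):
    return int(round(x * (1 << FRAC_BITS)))

def fixed_mul(a, b):
    return (a * b) >> FRAC_BITS   # rescale: product is in Q8.8

def to_float(x):
    return x / (1 << FRAC_BITS)
```

Here 3.0 and 4.25 become integers 48 and 68; their product 3264, shifted right by 4, gives 204, i.e. 12.75 in Q4.4.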
- Floating point: encoded as `sign * 1.mantissa * 2^(exponent - bias)` (e.g. `-58.25` == `0xC2690000`).
)To perform floating-point addition, we perform the following steps:
```
Number 1: 0x3FC00000  (1.1   * 2^0)
Number 2: 0x40500000  (1.101 * 2^1)

Extract significands:  1.1  , 1.101
Rshift smaller value:  0.11 , 1.101   (precision loss if too small)
Add significands:      10.011
Adjust significand:    1.0011
Round significand:     1.0011         (no need since <32-bits)
Reassemble number:     0x40980000
```
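The worked example can be cross-checked against Python's raw IEEE-754 bit patterns, using `struct` to round-trip between 32-bit words and floats:

```python
# Sketch: verify the float-addition walkthrough above at the bit level.
# 0x3FC00000 is 1.5 (1.1b * 2^0) and 0x40500000 is 3.25 (1.101b * 2^1);
# their sum should reassemble to 0x40980000.
import struct

def bits_to_float(bits):
    return struct.unpack(">f", bits.to_bytes(4, "big"))[0]

def float_to_bits(x):
    return int.from_bytes(struct.pack(">f", x), "big")

a = bits_to_float(0x3FC00000)   # 1.5
b = bits_to_float(0x40500000)   # 3.25
```

The same helpers confirm the earlier encoding example: `0xC2690000` round-trips to -58.25.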
The possible rounding choices are round-up, round-down, towards zero (truncation), or to nearest (ties to even).
One-bit adders are straightforward, and can be chained for an arbitrary number of bits, but the resulting ripple-carry adder is relatively slow due to the propagation of carries.
Carry-lookahead adders can precompute the carry after n-bits. To avoid excessively large fan-outs, as well as the quadratically increasing number of gates, the lookahead can be implemented only for 4-bits, then cascaded to form 16-bit adders.
Lookahead logic is as per the following, with $$g_i$$ generating a carry when A = B = 1 (independent of C), and $$p_i$$ propagating a carry when C = 1 and at least one of A = 1 or B = 1. The goal is to compute $$C_i$$ from the set of $$\{C_0, A_0, \ldots{}, A_{i-1}, B_0, \ldots{}, B_{i-1}\}$$ inputs alone:
$$$$ C_{i+1} = p_i C_i + g_i = (A_i + B_i) C_i + (A_i B_i) $$$$
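The recurrence can be sketched as follows (computed iteratively here for clarity; the hardware evaluates the unrolled product terms in parallel):

```python
# Sketch of the lookahead recurrence: g_i = A_i & B_i generates a
# carry, p_i = A_i | B_i propagates one, and C_{i+1} = g_i | (p_i & C_i).

def lookahead_carries(a, b, n, c0=0):
    """Return [C_0, C_1, ..., C_n] for n-bit operands a and b."""
    carries = [c0]
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        g, p = ai & bi, ai | bi
        carries.append(g | (p & carries[i]))
    return carries
```

The carries agree with ordinary addition: combining them with the sum bits $$s_i = A_i \oplus B_i \oplus C_i$$ reconstructs $$a + b$$.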
Subtraction is simply negating the right operand and performing an addition. All this functionality can be encapsulated in a monolithic Arithmetic Logic Unit (ALU), with the "ALUControl" bits determining which of {SUM, SUB, AND, OR} is performed. The NZCV flags (negative, zero, carry, overflow) can also be output by the ALU:
Equality comparison can be done via subtraction:
Shifting is done either by performing a barrel shift (i.e. wiring each input bit directly to its target output bit, which is relatively expensive), or by cascading shifts (for each set bit in the "shift amount", we perform that power-of-two shift at that stage):
The whole set of shifting and rotating instructions can be packaged into one unit:
Multiplication is performed by computing the partial products (of the multiplicand with the individual digits of the multiplier), shifting them, and summing them up. This is done either as an array multiplier (single cycle), or essentially a cascaded version of shift+product+sum:
Improved sequential multiplier
There are different multiplication instructions in the ARM instruction set:
- `UMULL`: multiplication of 2 n-bit unsigned numbers (output is 2n bits)
- `SMULL`: multiplication of 2 n-bit signed numbers
- `MUL`: n-bit LSB of the product of 2 n-bit numbers (the same regardless of sign).
See the following example, noting that the 2-bit LSB of the output `10` is identical regardless of whether signed or unsigned multiplication is performed. For correct long multiplication, sign extension to 2n bits needs to be performed, i.e. the MSb is propagated.
```
  10  <-- unsigned 2 or signed -2
  11  <-- unsigned 3 or signed -1
----
0010
0100
----
0110  <-- unsigned 6
```
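The claim that the low n bits of the product are sign-independent can be sketched as:

```python
# Sketch: the n-bit LSB of a product is the same whether the n-bit
# operands are read as signed or unsigned, which is why a single MUL
# serves both interpretations.

def to_signed(x, n):
    """Reinterpret an n-bit unsigned pattern as two's complement."""
    return x - (1 << n) if x & (1 << (n - 1)) else x

def low_bits_match(a, b, n):
    mask = (1 << n) - 1
    unsigned = (a * b) & mask
    signed = (to_signed(a, n) * to_signed(b, n)) & mask
    return unsigned == signed
```

For the 2-bit example above: 2 × 3 = 6 and (-2) × (-1) = 2 share the low bits `10`.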
A summary of the ARM instruction set, and making sense of it... Need to remember that this is a 32-bit microarchitecture, so 1 word is 32 bits (4 bytes). Update: Spent 2+ hours summarizing this, yay me.
Today's lecture on the addressing modes for each data processing / memory instruction, etc. Not particularly useful unless one does not have prior experience reading a microarchitecture datasheet.
A special note on the branch instruction (A4.1.5):
- The `signed_immed_24` field is a two's-complement signed 24-bit value; a quick recap below.

```
0...00110 ==  6
1...11010 == -6   (two's complement)
```
The summary of the main instructions we'll cover and try to implement in the course:
We cover the implementation of a single-cycle processor here. This is the super interesting part!
Explanation of the processor diagram
The (barebones) complete single-cycle ARM processor with a control unit is summarized as follows:
The control unit can be abstracted into two separate units, the decoder and the conditional logic circuit (which handles the portion that can conditionally modify the state).
Decomposition of conditional logic
This single cycle processor is relatively simple with fetch + decode + execute instructions executed in a single cycle, but has two main disadvantages:
Execution speed can be improved by other technologies, e.g. pipelining, superscalar, multi-processors.
We ought to view FPGA execution flow as groups of combinatorial logic circuits, separated by "brick-wall" registers. The longest critical-path delay between successive registers determines the maximum clock rate.
Diagram to illustrate max clock rate
Some tips:
- `Z = AND(X,-X)` can result in transient spikes.

Synchronous edge detection circuit
- A `reg` is something that can hold a value in a block, but does not necessarily have to be a physical register. This means regs synthesized as combinational circuits cannot be meaningfully initialized.
- Asynchronous vs synchronous code in `always` blocks:

```verilog
always @(posedge CLK or posedge RST) begin
    if (RST)
        // Asynchronous code here for SET/RST
        Z <= 1'b0;
    else begin
        // Synchronous code here
        Z <= X | Y;
    end
end
```
A small commercial break to explain the differences between blocking and non-blocking assignments:
 | Blocking | Non-blocking |
---|---|---|
Statement | = | <= |
Evaluation order | Ordered (source-code) | Unordered |
Output circuit | Parallel | Series |
A small note on inferred registers:
```verilog
always @(posedge CLK) begin
    B = A;
    C = B ^ Z;
    D <= C;   // D is a reg
    E = C;
    A = E;    // A is an implicit reg, should be written with '<='
end
```
We repeat the distinction between architecture and micro-architecture: the former is defined by the set of architecture instructions, while the latter is about implementing that architecture in hardware.
ARM follows the design principle which includes:
The architecture in a bit more detail:
- `ADD R1, R2, R3` saves result to R1.
- `ADDEQ R1, R2, R3` saves result to R1 only if Z flag is set.

Format:
The rotation value is twice the `rot` value. To allow up to 32-bit rotations (so that we can ideally reach the full range of 32-bit values), we would need 5 bits for `rot`. But only 4 bits are available in this instruction format, so we halve the resolution of the rotation (i.e. the `rot` value has an implicit 0 appended at the end). This, combined with the 8-bit immediate value, means the only values that can be encoded are those whose 1s span at most 8 bits and align with two-bit boundaries.
Example:
```
SUB R2, R3, #0xFF0
+-- cond:  1110     (14)  unconditioned
+-- op:    00       (0)   data processing
+-- I-bit: 1        (1)   src2 is immediate
+-- cmd:   0010     (2)   subtraction
+-- S-bit: 0        (0)   no set flag
+-- Rn:    0011     (3)
+-- Rd:    0010     (2)
+-- rot:   1110     (14)
+-- imm8:  11111111 (255)

See that:

  0xFF0 == 0b0...0_00001111_11110000

which is (0xFF << 4) = (0xFF ROR 28) = (0xFF ROR 0b11100).
So rot == 0b1110 with implicit 0 at the end, and imm8 == 0xFF.

The instruction is thus 0x E2 43 2E FF.
```
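The rot/imm8 rule can be sketched as a search over the 16 even rotations (`encode_imm` is a hypothetical helper, not part of any assembler API):

```python
# Sketch: a 32-bit constant is encodable as a data-processing immediate
# iff it equals some 8-bit value rotated right by an even amount; the
# rot field stores half the rotation.

def ror32(x, r):
    r %= 32
    return ((x >> r) | (x << (32 - r))) & 0xFFFFFFFF

def encode_imm(value):
    """Return (rot, imm8) such that value == imm8 ROR (2*rot), else None."""
    for rot in range(16):
        imm8 = ror32(value, 32 - 2 * rot)   # rotate value LEFT by 2*rot
        if imm8 <= 0xFF and ror32(imm8, 2 * rot) == value:
            return rot, imm8
    return None
```

For the example above, `0xFF0` encodes as rot 14 and imm8 0xFF, while a value like `0x102` (whose set bits cannot be aligned into one byte by an even rotation) is not encodable.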
Tip
Important to note that the number of cycles taken per instruction is not necessarily due to the need to chain instructions, but also due to the settling time of the logic levels. This means that at a higher clock speed, instructions will potentially take more cycles to complete.
A common benchmark for computers is the set of SPEC benchmarks (SPEC CPU 2017), which includes computation time and power consumption. Power consumption is measured across workloads using `ssj_ops` (server-side Java operations) per watt, averaged across workload levels in 10% increments.
Some non-linearities in system metrics:
$$$$ \text{MIPS} = \frac{\text{number of instructions}}{\text{execution time} \times 10^6} = \frac{\text{clock rate}}{\text{cycles per instruction} \times 10^6} $$$$
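As a quick sketch of the formula (the 2 GHz clock and 1.25 CPI figures below are made up):

```python
# Sketch of the MIPS metric: clock rate divided by (CPI * 10^6).

def mips(clock_rate_hz, cpi):
    return clock_rate_hz / (cpi * 1e6)
```

A hypothetical 2 GHz processor averaging 1.25 cycles per instruction rates at 1600 MIPS; note the metric says nothing about how much work each instruction does.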
Some trivia:
Instruction Set Architecture (ISA) is the abstraction between hardware and software; a particular hardware implementation of an ISA is its microarchitecture. The ISA combined with the operating system interface (which defines how data is piped, and how functions should be called and return) is termed the Application Binary Interface (ABI), which provides binary compatibility between computers.
Examples of popular ISAs:
ASIC, ASSP, etc.
Processor/GPU: Intel, AMD, Nvidia, Broadcom, Qualcomm
SoC: Snapdragon, A-Series, AMD APU, etc.
Processor information:
Note the rough power scaling model, with capacitive load $$C$$, supply voltage $$V$$ and clock frequency $$f$$:
$$$$ P = CV^2 f $$$$
Methods to improve performance: