Processor Architectures


1 What’s an FPGA?

A Field-Programmable Gate Array (FPGA) is a chip for performing digital logic operations. At its most basic level, an FPGA is a bunch of simple logic elements and a huge network of wires; connections between logic elements and wires are activated by an array of bits that is loaded onto the chip. This makes an FPGA something of a chameleon: a user can rewire the chip to do many different things, depending on the application.
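To make this concrete, here is a minimal sketch in Python of the configurable-logic idea: a single 4-input lookup table (LUT), the basic logic element in most FPGAs, modeled as a 16-entry table of configuration bits. The class and names are illustrative, not any vendor's actual architecture.

    # A 4-input lookup table (LUT): the truth_table list plays the role
    # of the configuration bits loaded onto the chip.
    class LUT4:
        def __init__(self, truth_table):
            assert len(truth_table) == 16  # one output bit per input combination
            self.table = truth_table

        def eval(self, a, b, c, d):
            # Pack the four input bits into an index into the truth table.
            return self.table[(d << 3) | (c << 2) | (b << 1) | a]

    # "Rewiring" the chip is just loading a different array of bits:
    and4 = LUT4([0] * 15 + [1])                            # 4-input AND
    xor2 = LUT4([(i ^ (i >> 1)) & 1 for i in range(16)])   # a XOR b (c, d ignored)
    print(and4.eval(1, 1, 1, 1))  # 1
    print(xor2.eval(1, 0, 0, 0))  # 1

A real FPGA contains anywhere from thousands to millions of such logic elements, along with the programmable routing that connects them.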

The only kind of chip that competes with the FPGA for flexibility is the Central Processing Unit (CPU) that sits at the heart of every modern computer. CPUs are flexible for a completely different reason than FPGAs. CPUs are not configurable chips. Instead, they are hard-wired with a number of registers for holding numbers, and a number of supported operations for modifying or combining the numbers in those registers. What makes a CPU flexible is its ability to execute these operations in arbitrary order, according to the instructions contained in a program in memory that was written by a user. Long ago, CPUs had only a single core, and so every instruction in a program was executed sequentially, one at a time. Nowadays, CPUs have multiple cores, and several instructions can be executed in parallel. This development notwithstanding, CPUs are basically serial processors, with some parallel capability afforded by multicore technology.

In contrast, FPGAs are massively parallel. Every logic element in an FPGA is like a parallel processor. The rub, of course, is that these “processors” do very simple things, and once you configure a logic element to perform an operation, that is the only operation it can perform until the FPGA is reprogrammed. The parallelism of an FPGA makes it incredibly powerful for processing high-speed streaming data, but it also makes FPGAs hard to program. The difference in programming paradigms between CPUs and FPGAs is probably the biggest roadblock to the widespread adoption of FPGA co-processors in computers.
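To make the CPU side of this contrast concrete, here is a toy register machine in Python with a hypothetical three-instruction set. It illustrates the point above: the hardware operations are fixed, and flexibility comes entirely from the order of instructions in the program.

    # A toy CPU: fixed registers, a fixed set of operations, and a
    # program counter that executes instructions one at a time.
    def run(program, registers):
        pc = 0
        while pc < len(program):
            op, dst, a, b = program[pc]
            if op == "movi":
                registers[dst] = a            # load an immediate value
            elif op == "add":
                registers[dst] = registers[a] + registers[b]
            elif op == "mul":
                registers[dst] = registers[a] * registers[b]
            pc += 1                           # strictly sequential execution
        return registers

    # Compute (2 + 3) * 4 by sequencing the same fixed operations.
    regs = run([("movi", 0, 2, None),
                ("movi", 1, 3, None),
                ("add",  0, 0, 1),
                ("movi", 1, 4, None),
                ("mul",  0, 0, 1)], [0, 0])
    print(regs[0])  # 20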

2 Clocks and Timing

FPGAs are spatially parallel processors. Employing spatially separated logic elements to process data in coordination with one another can greatly improve performance.

2.1 Timing: Setup and Hold

Registers cannot make decisions instantaneously. Flip-flops typically have a setup time, during which the D input to the register must be stable before the rising edge of a clock, and a hold time after the rising edge of a clock, during which the D input must remain fixed. Furthermore, there is a clock-to-Q time, which is the time it takes a value to propagate from D to Q following the rising edge of a clock. The details of these mechanisms are not important for this discussion, but understanding that signals have a limited amount of time to propagate from one register to another in order to satisfy these timing demands is essential to designing FPGA circuits.
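These constraints can be rolled into a single rule of thumb. For a signal path between two registers sharing a clock of period T_clk, timing is met when the clock period covers the clock-to-Q time of the launching register, the logic and routing delays along the path, and the setup time of the receiving register:

    T_{\rm clk} \geq t_{\rm clk\text{-}to\text{-}Q} + t_{\rm logic} + t_{\rm routing} + t_{\rm setup}

The margin by which the left-hand side exceeds the right is called the slack of the path.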

A typical FPGA signal might begin as a 0/1 on the D input of a flip-flop when a rising clock edge occurs. A nanosecond later, that state will appear on the Q output of the flip-flop, whereupon it will travel down a wire (incurring a couple more nanoseconds of delay), through a few logic circuits (a few more nanoseconds), to arrive at the D input of another flip-flop in time to satisfy the setup time required by that register before the next rising clock edge.
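Plugging illustrative numbers (assumed here, not taken from any particular chip) into the inequality above gives a quick slack calculation; a sketch in Python:

    # Timing budget for one register-to-register path (times in ns).
    # All delay values below are illustrative assumptions.
    t_clk_to_q = 1.0   # value appears on Q after the rising edge
    t_routing  = 2.0   # delay traveling down wires
    t_logic    = 1.5   # delay through a few logic circuits
    t_setup    = 0.3   # input must settle this long before the next edge
    t_period   = 5.0   # 200 MHz clock

    slack = t_period - (t_clk_to_q + t_routing + t_logic + t_setup)
    print(f"slack = {slack:.1f} ns")  # positive, so this path meets timing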

Timing diagrams are a useful tool for visualizing how signals are sent and recaptured through synchronous digital circuits.

2.2 Pipelining: Latency vs. Throughput vs. Resource Utilization

Suppose you have just designed a circuit that takes two signals from the outside world arriving on FPGA pins, ORs them together, and one clock later, ANDs the result with a signal arriving on a third pin, then outputs the result to a fourth pin. Suppose that you want a clock period of 5 ns so that the AND operation involving the third pin happens at just the right time. After telling the compiler to target a 200 MHz clock rate, you start compiling your design, only to find that it returns with an error: your design has not met timing. What does this mean, and how can you fix it?

When a design fails to meet timing, this means that there is a signal path between two registers whose total delay, through layers of logic and routing down wires, exceeds the 5 ns clock period you were shooting for. Though the compiler may finish compiling your design, you will not be able to run it at 200 MHz. If you do, the behavior of your design will be indeterminate: it will output junk.

To solve this problem, you have two options. If you don’t mind your design processing data more slowly, you can lower the clock frequency. Unfortunately, this is not commonly an option. The usual solution is to add registers to your design to break up long signal paths, a technique called pipelining. In our example, registers could be added at the inputs of the AND and OR. However, when pipelining, one must be careful to keep signals time-aligned. For example, if we place registers at both inputs of the OR, we must also place one between the third pin and the AND. Failure to do so would mean that the “OR” side of the AND block would arrive one clock later than expected, so the circuit would be checking for a signal on pin 3 that is true 10 ns after pin 1 or pin 2 were true, instead of 5 ns.
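To see the alignment problem concretely, the sketch below simulates the pipelined circuit clock by clock, modeling each register as a one-clock delay (the function and signal names are hypothetical). With the balancing register on the pin 3 path, the circuit catches a pin 3 pulse arriving one clock (5 ns) after pin 1, as intended; without it, the AND checks pin 3 one clock too late and misses the coincidence.

    # Clock-by-clock model: each register's output at clock t is its
    # input from clock t-1. Update order below preserves that semantics.
    def balanced(pin1, pin2, pin3):
        out, reg1, reg2, reg_or, reg_p3 = [], 0, 0, 0, 0
        for t in range(len(pin1)):
            out.append(reg_or & reg_p3)       # AND of two registered paths
            reg_or = reg1 | reg2              # register on the OR output
            reg1, reg2 = pin1[t], pin2[t]     # registers at the OR inputs
            reg_p3 = pin3[t]                  # balancing register on pin 3
        return out

    def unbalanced(pin1, pin2, pin3):
        out, reg1, reg2, reg_or = [], 0, 0, 0
        for t in range(len(pin1)):
            out.append(reg_or & pin3[t])      # pin 3 feeds the AND directly
            reg_or = reg1 | reg2
            reg1, reg2 = pin1[t], pin2[t]
        return out

    pin1 = [0, 1, 0, 0, 0]   # pulses at clock 1
    pin2 = [0, 0, 0, 0, 0]
    pin3 = [0, 0, 1, 0, 0]   # pulses one clock later, as the design intends
    print(balanced(pin1, pin2, pin3))    # [0, 0, 0, 1, 0]: coincidence caught
    print(unbalanced(pin1, pin2, pin3))  # [0, 0, 0, 0, 0]: missed by one clock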

Pipelining a design can help you reach higher clock rates, but it comes at a cost. Registers are plentiful on FPGAs, but they do eventually run out. Pipelining a large design raises its resource utilization, which can result in a design demanding more physical resources than are available on the chip (and an associated compiler error). Pipelining also raises the latency of a design: the number of clocks it takes data to flow from input to output. For applications that require fast response to a stimulus, too much latency can pose a problem. However, many applications only require throughput, which is the total sustained rate at which data flows through a design. Throughput is determined by clock rate, and so can be improved by pipelining a design.
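The trade-off is easy to quantify. Suppose, hypothetically, that pipelining lets a design move from a 100 MHz clock with 4 clocks of latency to a 200 MHz clock with 10 clocks of latency, processing one sample per clock in both cases:

    # Throughput scales with clock rate; latency scales with pipeline
    # depth. All numbers here are hypothetical.
    def stats(clock_hz, pipeline_stages, samples_per_clock=1):
        throughput = clock_hz * samples_per_clock  # samples per second
        latency = pipeline_stages / clock_hz       # seconds from input to output
        return throughput, latency

    for label, (tp, lat) in [("shallow", stats(100e6, 4)),
                             ("pipelined", stats(200e6, 10))]:
        print(f"{label}: {tp:.0e} samples/s, {lat * 1e9:.0f} ns latency")
    # pipelined: throughput doubles even though latency rises from 40 ns to 50 ns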

When designing for an FPGA, a programmer must balance resource utilization and throughput by carefully pipelining a design. This involves adding latency to long signal paths, but avoiding unnecessary pipelining that consumes FPGA resources. A bizarre paradox of FPGA design is that consuming too many physical resources forces a design to be spread out across the entire chip. The wide physical spacing between circuit components incurs routing delays as signals are forced to travel farther. Higher routing delays lower the clock rate at which a design can be compiled, and hence, lower throughput. Thus, it is actually possible for excessive pipelining to decrease throughput.