Virtus Cyber Academy

FPGA Cell to Silicon Bridge


VCA-CSA-101 cross-chapter handout. Audience: students who have completed Ch 5 and are curious about what their synthesized design looks like at the physical layer.

Purpose: to connect the LUT4 cells, flip-flops, and block RAM primitives your Tang Primer 25K bitstream configures to the physical silicon structures that implement them. The central idea: your FPGA bitstream reconfigures fixed silicon to act as custom hardware. The configuration changes; the transistors do not.


The three cell types in your synthesized design

When you synthesize a Verilog design for the Tang Primer 25K, the synthesis + place-and-route tools map your logic into three kinds of silicon resources:

| Verilog construct | FPGA primitive | Physical structure |
| --- | --- | --- |
| Combinational logic (`assign`, `always @(*)`) | LUT4 | 16-bit SRAM + 4:1 mux tree |
| Sequential logic (`always @(posedge clk)`) | DFF (D flip-flop) | CMOS transmission-gate latch chain |
| Large memory arrays (`reg [N:0] mem [0:M]`) | BRAM (block RAM) | Dedicated SRAM macro on die |

Each of these is a fixed silicon structure on the GW5AT-138 die. You do not change the transistors. You change the SRAM configuration bits that determine how those transistors behave.


LUT4: the programmable gate

A 4-input lookup table (LUT4) implements any boolean function of 4 inputs. Internally it is a 16-bit SRAM array read out through a mux tree: the four inputs select which of the 16 stored bits drives the output.

When synthesis maps assign out = (a & b) | (c ^ d) to a LUT4, it:

  1. Evaluates the boolean expression for all 16 combinations of a, b, c, d
  2. Writes the 16 results into the LUT4's SRAM array via the bitstream

At runtime, the four input signals select one of the 16 SRAM cells; that cell's programmed value drives the output. The transistors in the LUT4 are always the same transistors. Only the 16 SRAM bits change between bitstreams.
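The two synthesis-time steps and the runtime read above can be sketched in a few lines of Python. This is an illustrative model, not the GOWIN tool flow; the function names are invented for the example.

```python
# Step 1 (synthesis time): evaluate the boolean function for all 16
# input combinations to produce the LUT4's configuration bits.
def synthesize_lut4(func):
    """Return the 16 SRAM bits for a 4-input boolean function."""
    return [func((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1)
            for i in range(16)]

# The bitstream writes these 16 bits into the LUT4's SRAM cells.
config_bits = synthesize_lut4(lambda a, b, c, d: (a & b) | (c ^ d))

# Step 2 (runtime): the four inputs form an address that selects one
# SRAM cell -- a 16:1 read through the mux tree.
def lut4(config_bits, a, b, c, d):
    return config_bits[(a << 3) | (b << 2) | (c << 1) | d]

# The mux-tree "hardware" never changes; only config_bits does.
print(lut4(config_bits, 1, 1, 0, 0))  # a&b = 1 -> 1
print(lut4(config_bits, 0, 0, 0, 0))  # -> 0
```

Reprogramming the same LUT4 to a different function is just a different `config_bits` vector, which is the whole point of the section.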

Why this matters for your CPU: every AND gate, OR gate, and multiplexer in your ALU and control logic is implemented as one or more LUT4 cells. The tools may pack two related 3-input functions into a single LUT4 using a technique called function sharing, but the physical mechanism is the same.

Approximate transistor budget per LUT4: on the order of 120-150 transistors (the 16 six-transistor SRAM cells alone account for 96, plus the mux tree and output buffer). The exact count depends on the GOWIN cell library implementation.


DFF: the programmable flip-flop

A D flip-flop in FPGA fabric is a hardwired CMOS flip-flop next to each LUT4. It captures the LUT4 output (or a direct signal) on the rising edge of the clock.

In CMOS, a D flip-flop is typically built from two level-sensitive transmission-gate latches in series (a master-slave pair), each built around cross-coupled inverters, clocked in opposite phases: the master is transparent while the clock is low, the slave while it is high.

The result: the flip-flop's output changes at most once per clock cycle, just after the rising edge, and then holds steady until the next rising edge. That gives downstream combinational logic nearly the full clock period (minus clock-to-Q and setup time) to settle.
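The master-slave mechanism can be modeled with two level-sensitive latches in Python. This is a behavioral sketch of the principle, not a transistor-level model; the class names are invented for the example.

```python
class Latch:
    """Level-sensitive latch: transparent while enable is high,
    opaque (holding its last value) while enable is low."""
    def __init__(self):
        self.q = 0

    def update(self, d, enable):
        if enable:
            self.q = d   # transparent: output follows input
        return self.q    # opaque: output holds

class DFF:
    """Master transparent while clk is low, slave while clk is high,
    yielding a rising-edge-triggered flip-flop."""
    def __init__(self):
        self.master, self.slave = Latch(), Latch()

    def update(self, d, clk):
        m = self.master.update(d, enable=not clk)
        return self.slave.update(m, enable=clk)

dff = DFF()
dff.update(d=1, clk=0)          # clk low: master tracks d, output holds 0
q = dff.update(d=1, clk=1)      # rising edge: q becomes 1
q = dff.update(d=0, clk=1)      # d changes mid-cycle: q stays 1
print(q)
```

Note that changing `d` while the clock is high has no effect on the output, because the master latch is opaque: this is exactly the edge-triggered behavior described above.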

On the Tang Primer 25K, each DFF cell also has a clock-enable (CE) input and a set/reset (LSR) input, configurable as synchronous or asynchronous.

When your Verilog says always @(posedge clk) if (rst) reg_x <= 0; else reg_x <= next_val;, synthesis maps this to a DFF cell with LSR driven by rst and D driven by next_val.
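The per-edge behavior of that mapped DFF cell can be written out in a few lines of Python. This is a sketch of synchronous-reset semantics, with names taken from the Verilog above:

```python
def dff_posedge(d, lsr):
    """New Q after one rising clock edge: the LSR input (here driven
    by rst) takes priority over the D input (driven by next_val)."""
    return 0 if lsr else d

# reg_x <= rst ? 0 : next_val, evaluated once per rising edge:
reg_x = 0
for rst, next_val in [(1, 5), (0, 5), (0, 7)]:
    reg_x = dff_posedge(next_val, rst)
print(reg_x)  # 7
```

Between edges, `reg_x` simply holds; the only times it can change are the rising edges modeled by the loop iterations.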

Compared to 1970s silicon: the Z80's register bits are NMOS static latch cells (cross-coupled inverters, made transparent by the appropriate control signal). Your FPGA DFFs are CMOS master-slave latches with an explicit clock edge. Both store one bit. The NMOS latch requires fewer transistors; the CMOS DFF is faster, lower-power, and more predictable in timing.


BRAM: the hard-macro memory block

Block RAM (BRAM) cells are dedicated SRAM macros on the FPGA die. Unlike LUT4 cells, which are general-purpose and can implement either logic or tiny memories, BRAM cells are specifically optimized for memory access patterns: synchronous read and write ports, optional output registers, configurable width/depth.

On the Tang Primer 25K (GW5AT-138), these blocks are fixed hard macros: their count and per-block capacity are properties of the die, not of your design, and the datasheet gives the exact figures.

When your Verilog infers a RAM larger than about 32 bits (the exact threshold depends on synthesis heuristics), the tools automatically place it in BRAM rather than in LUT4 cells. You can force or suppress this inference with synthesis attributes.

Your CPU's instruction memory is the most obvious BRAM candidate in the CSA-101 design: the rom or imem array that holds the 32-bit instruction words. The synthesis tool places this in one or more BRAM blocks; the bitstream initializes those blocks with your program image.
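The defining feature of an inferred BRAM is that reads are registered: the data appears one cycle after the address. A hypothetical Python model of a single-port block RAM makes this visible (the `imem` name and NOP initializer are illustrative, not taken from the CSA-101 sources):

```python
class BlockRAM:
    """Behavioral model of a single-port BRAM: synchronous write and
    registered (synchronous) read, with read-before-write semantics."""
    def __init__(self, depth, init=None):
        self.mem = list(init or [0] * depth)
        self.rdata = 0                 # the output register

    def clock(self, addr, wdata=None, we=False):
        """One rising clock edge."""
        self.rdata = self.mem[addr]    # read is registered: data for this
        if we:                         # address appears after this edge
            self.mem[addr] = wdata
        return self.rdata

# Instruction memory, initialized by the bitstream with the program image.
imem = BlockRAM(depth=1024, init=[0x00000013] * 1024)  # RV32I NOPs
print(hex(imem.clock(addr=0)))
```

The registered read is what distinguishes BRAM from LUT-based distributed RAM: if your Verilog reads the array combinationally, the tools generally cannot place it in BRAM.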

Compared to 1970s silicon: the 6502 and Z80 accessed external DRAM or SRAM chips for memory -- there was no on-chip RAM. Modern FPGAs have moved the memory on-die as a hard macro. The tradeoff is the same: on-chip is faster but limited in size; off-chip is slower but can be gigabytes.


Programmable silicon vs fixed-function ASIC

An FPGA is often described as "reconfigurable" or "programmable." What that means at the physical level:

Fixed-function ASIC (like the 6502 or Z80): the logic function is frozen into the photolithographic masks at fabrication. The transistors and the metal routing between them are permanent; changing the design means fabricating a new die.

FPGA: the same transistors are manufactured once, but their behavior is selected by SRAM configuration bits loaded from the bitstream. Changing the design means loading a different bitstream.

Your Tang Primer 25K bitstream is approximately 2-5 MB of configuration data that sets the state of the SRAM bits inside every LUT4 and every routing switch on the die. Loading a different bitstream is the FPGA equivalent of fabricating a different ASIC.

The pedagogical summary: when you synthesize your RV32I-Lite CPU and load the bitstream, you are not doing anything fundamentally different from what TSMC does when it fabricates a fixed-function CPU die. You are configuring a silicon structure to implement a specific logic function. The mechanism is different (SRAM programming vs photolithographic mask); the outcome is the same: silicon that executes instructions.


Routing fabric: the invisible third resource

Between the LUT4 cells and the BRAM blocks, the FPGA die contains a large routing fabric: a grid of programmable switches and wire segments that connect cells to each other. Every wire in your Verilog becomes a route through this fabric.

The routing fabric consumes a significant fraction of the die area. On a typical FPGA, 60-70% of the die area is routing, not logic. This is why FPGAs are larger (in die area) than equivalent ASICs for the same function: the ASIC's routing is custom-designed to be minimal; the FPGA's routing is a general-purpose grid that must support any possible connection.

When you run place-and-route (the step after synthesis), the tools are finding a path through this grid for every wire in your design. The routing delay (the time a signal takes to traverse the grid) is a major component of your design's critical path timing.


Cross-references

Backward-looking

Forward-looking