Repurposing von Neumann Architecture with SRAM-based Register File
By Louie De Luna, Agnisys Chief Product Evangelist
The conventional von Neumann architecture has been the workhorse of computing for several decades, but with the advent of AI applications and big data, the entire industry has put a spotlight on its limitations. Because massive amounts of data must travel back and forth between the CPU and memory, the resulting latency and power consumption have become major issues. AlexNet, a powerful convolutional neural network (CNN), requires 61M total weights (parameters) and 724M total MACs for a single inference – a modest requirement compared to other CNNs such as VGGNet, which requires 138M total weights and 15.5G total MACs.
New chip architectures and technologies are now emerging to address these issues, known as the “von Neumann bottleneck” or the “memory wall” problem. The Google TPU is based on systolic arrays and provides up to 420 teraflops, the Graphcore IPU is based on Bulk Synchronous Parallel (BSP) technology and provides up to 125 teraflops, and IBM’s Zurich lab is working on a new AI chip based on in-memory computing.
But as the world of computing and AI waits for the new chip architectures to mature, the memory wall problem remains a real pain. Startups without deep pockets will need to come up with other ingenious ways to be competitive.
A popular strategy used for datacenter acceleration is to repurpose existing von Neumann architectures by implementing some of the register files as on-chip SRAM as opposed to standard flip-flop cells. This will free up a considerable area on the die that can be used to implement more functional blocks for acceleration. There are two base categories of register files in today’s SoCs: dynamic and static registers.
- Dynamic registers are host-centric control, status and debug registers. They need constant HW/SW accessibility, so they are best suited to a flip-flop implementation. They amount to a small share (~5%) of the total register bits on the chip.
- Static registers are algorithm-specific hyperparameters or configuration registers. They are always programmed by SW at boot or in controlled environments before HW consumes the data, which avoids simultaneous SW writes and HW reads. They account for most of the register bits on the chip, so it is best to realize them in SRAM rather than flip-flops to save considerable cell area.
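The split between the two categories can be sketched as a simple partitioning step. The Python below is only an illustration of the rule described above, not IDesignSpec functionality; the `Register` class, the `hw_sw_concurrent` flag and the sample register names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Register:
    name: str
    width: int               # register width in bits
    hw_sw_concurrent: bool   # True if HW and SW may access it at the same time


def storage_for(reg: Register) -> str:
    """Dynamic (concurrently accessed) registers stay in flip-flops; static ones go to SRAM."""
    return "flip-flop" if reg.hw_sw_concurrent else "sram"


# Hypothetical register list, for illustration only.
regs = [
    Register("ctrl", 32, True),
    Register("status", 32, True),
    Register("layer0_hyperparams", 32, False),
    Register("layer1_hyperparams", 32, False),
]

plan = {r.name: storage_for(r) for r in regs}
sram_bits = sum(r.width for r in regs if storage_for(r) == "sram")
```

In a real design the classification would come from the register specification rather than a hand-written flag.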
Benefits of SRAM-based Register Implementation
Implementing register files as SRAM rather than standard flip-flops provides the following significant benefits:
- Reduce logic utilization area – a register map with 40K register bits infers 206K flip-flop cells, taking up 400µm² of standard cell area. If these registers are implemented as SRAM, the required area is only 10µm² – an area savings of 40x.
- Acceleration – the freed die area can be used to implement more functional blocks for acceleration. More functional blocks for processors, MACs and deep pipeline structures can enable highly parallelized computation for higher system throughput.
- Reduce power consumption per stored bit – register files implemented as SRAM consume less power than if they were implemented as standard flip-flop cells.
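The area figures quoted in the list above can be checked with a few lines of arithmetic. This sketch uses only the two area numbers from the text; the variable names are illustrative.

```python
# Area figures as quoted in the list above (units as given in the text).
FLOP_AREA_UM2 = 400.0   # register map implemented with flip-flops
SRAM_AREA_UM2 = 10.0    # same register map implemented as SRAM

savings_factor = FLOP_AREA_UM2 / SRAM_AREA_UM2        # 40.0
freed_area_um2 = FLOP_AREA_UM2 - SRAM_AREA_UM2        # 390.0
freed_fraction = freed_area_um2 / FLOP_AREA_UM2       # 0.975

print(f"{savings_factor:.0f}x smaller, {freed_fraction:.1%} of the register area freed")
```

In other words, a 40x reduction leaves 97.5% of the original register area available for other logic.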
Sample Implementation in IDesignSpec™
Implementing registers using SRAM is not an easy task, as creating the structure and interface requires a lot of automation. The registers implemented as SRAM also need to connect with the registers implemented as flip-flops. An example of how this can be done in IDesignSpec is described below, with the components shown in Figure 1.
- Dynamic Registers are implemented with one flip-flop per register bit. They are marked as INTERNAL in IDesignSpec, which means the RTL is auto-generated by the tool. The user only needs to define the register widths, fields and default values.
- Memory Access Ports are generated by IDesignSpec and include the ports for data, address and control.
- SRAM Wrapper is created by the user. It connects to the access ports generated by IDesignSpec and consists of the following:
  - SRAM for static registers – the structure is defined in IDesignSpec, including the required access ports. These registers are marked as EXTERNAL in IDesignSpec, which means the RTL will not be auto-generated but the ports will be.
  - Application logic to access the SRAM – this logic is created by the user based on the desired functionality.
  - Application-specific ports to access the SRAM – HW needs visibility of only one section of register bits at any given time.
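To make the division of labor concrete, here is a toy behavioral model of the user-written SRAM wrapper in Python. It is a sketch under stated assumptions, not RTL and not anything generated by IDesignSpec: the class name, method names and the single "active section" selector are hypothetical, chosen only to mirror the three wrapper components listed above.

```python
class SramWrapperModel:
    """Toy model: static registers live in one SRAM array, and HW sees
    only the currently selected section (all names are hypothetical)."""

    def __init__(self, sections: int, regs_per_section: int):
        self.sections = sections
        self.regs_per_section = regs_per_section
        self.mem = [0] * (sections * regs_per_section)  # one 32-bit word per register
        self.active_section = 0

    def _index(self, section: int, reg: int) -> int:
        if not (0 <= section < self.sections and 0 <= reg < self.regs_per_section):
            raise IndexError("section or register out of range")
        return section * self.regs_per_section + reg

    # Software side: requests arrive through the generated memory access ports.
    def sw_write(self, section: int, reg: int, value: int) -> None:
        self.mem[self._index(section, reg)] = value & 0xFFFFFFFF

    def sw_read(self, section: int, reg: int) -> int:
        return self.mem[self._index(section, reg)]

    # Hardware side: application-specific port, one section visible at a time.
    def select_section(self, section: int) -> None:
        self._index(section, 0)  # range check only
        self.active_section = section

    def hw_read(self, reg: int) -> int:
        return self.mem[self._index(self.active_section, reg)]


wrapper = SramWrapperModel(sections=512, regs_per_section=32)
wrapper.sw_write(section=3, reg=5, value=0xDEADBEEF)  # SW programs the configuration first
wrapper.select_section(3)                             # HW then selects the section it needs
value_seen_by_hw = wrapper.hw_read(5)
```

The model reflects the ordering constraint from the text: SW writes happen before HW reads, so no arbitration between the two sides is modeled.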
A more specific example of a register file structure is shown in Figure 2 which consists of:
- 32 Dynamic 32-bit Registers with identical fields and field sizes. These registers will be implemented as INTERNAL in IDesignSpec.
- 512 identical Sections, each consisting of 32 Static 32-bit Registers with a varying number of fields and field sizes. These registers will be implemented as EXTERNAL in IDesignSpec. They represent a static configuration of the chip at any given time. They are not time sensitive and can be made available for HW reads after a defined, predictable number of clock cycles.
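The storage implied by this structure follows directly from the counts above. The short calculation below uses only those counts; treating each register as one SRAM word is an assumption made for illustration.

```python
REG_WIDTH = 32                 # bits per register
DYNAMIC_REGS = 32
SECTIONS = 512
STATIC_REGS_PER_SECTION = 32

dynamic_bits = DYNAMIC_REGS * REG_WIDTH                         # 1,024 flip-flop bits
static_bits = SECTIONS * STATIC_REGS_PER_SECTION * REG_WIDTH    # 524,288 SRAM bits

# Assumption: one SRAM word per 32-bit register.
sram_words = SECTIONS * STATIC_REGS_PER_SECTION                 # 16,384 words
sram_addr_bits = (sram_words - 1).bit_length()                  # 14 address bits
sram_kib = static_bits / 8 / 1024                               # 64.0 KiB
```

So in this example the static registers fit a 16K x 32 memory (64 KiB), while only 1,024 bits remain in flip-flops.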
Figure 3 shows how the Dynamic and Static Registers can be defined in IDesignSpec using the Excel Editor. Setting the properties {EXTERNAL=true; repeat=512} in the description column repeats the Static section 512 times and turns “Static_Section_0, Static_Section_1, … Static_Section_511” into address placeholders only: no physical registers are implemented inside the IDesignSpec-generated block, and only the ports are generated.
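One way to picture the repeated address placeholders is as a table of base addresses plus a decode step. The sketch below assumes a layout the article does not specify: byte addressing, 4 bytes per register, the dynamic block at offset 0, and the static sections packed contiguously right after it. The function names are hypothetical.

```python
BYTES_PER_REG = 4
DYNAMIC_REGS = 32
REGS_PER_SECTION = 32
REPEAT = 512

# Assumed layout: dynamic block first, static sections packed right after it.
STATIC_BASE = DYNAMIC_REGS * BYTES_PER_REG          # 0x80
SECTION_STRIDE = REGS_PER_SECTION * BYTES_PER_REG   # 0x80 bytes per section
ADDRESS_SPACE_END = STATIC_BASE + REPEAT * SECTION_STRIDE


def section_base(i: int) -> int:
    """Base byte address of Static_Section_i under the assumed layout."""
    if not 0 <= i < REPEAT:
        raise ValueError("section index out of range")
    return STATIC_BASE + i * SECTION_STRIDE


# Expanded placeholders, mirroring the effect of repeat=512.
placeholders = {f"Static_Section_{i}": section_base(i) for i in range(REPEAT)}


def decode(addr: int):
    """Map a word-aligned byte address to ('dynamic', reg) or ('static', section, reg)."""
    if addr % BYTES_PER_REG or not 0 <= addr < ADDRESS_SPACE_END:
        raise ValueError("unaligned or out-of-range address")
    if addr < STATIC_BASE:
        return ("dynamic", addr // BYTES_PER_REG)
    section, offset = divmod(addr - STATIC_BASE, SECTION_STRIDE)
    return ("static", section, offset // BYTES_PER_REG)
```

Under these assumptions, a static access decodes to a section and register index that the user's SRAM wrapper would then turn into a memory address.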
Once the specification is completed in the IDesignSpec Excel Editor, the user simply selects the desired system bus (AMBA-APB, AHB, AXI, TileLink or Proprietary) and generates the required output code, including RTL, UVM register model, C/C++ headers, Python and documentation. All outputs are derived from the golden specification, a popular methodology employed by our customers to synchronize various SoC teams.
Startups in AI can easily implement SRAM-based register files. This does not completely solve the von Neumann bottleneck, but it is an inexpensive countermeasure. As the race for AI supremacy continues, new-generation chips will be deployed, new research will be conducted and new requirements will arise. As a result, new bottlenecks will naturally show up, so favoring the least expensive solution to a given problem is a practical choice. Time will tell who will dominate this race, and I’m certainly excited to see!
If you’re in the Bay Area this month, see us at Hot Chips Symposium at Stanford Memorial Auditorium in Palo Alto, California on August 19-20, 2019. We can discuss more details about the SRAM-based register implementation.