A "Snitch" cluster is 8+1 single-stage small integer cpus +1 DMA core. Cores each have their own "decoupled heavily pipelined" FPU. The cluster shares a L1 I-cache, has 32 banks of 4kB scratchpads, and has a 64bit config / peripheral AXI bus, and a 512bit "wide" AXI bus the +1 DMA core can talk to, and it's own connection to all the scratchpads.
Clusters are grouped together: on this "Occamy" chip, four Snitch clusters form a group. Each group shares a 64-bit/512-bit AXI group bus that connects up to the top-level bus, and groups also share an L2 constants cache. There are six groups.
The chiplet also has 8 HBM2e stacks bringing ~400 GB/s, and some other goodies: an 8 GB/s off-die serial link, a system DMA controller, 8 GB/s and 64 GB/s die-to-die serial links for connecting to other chiplets, and a CVA6 management core.
There are some interesting iterations done to make the FPU more decoupleable, such as native hardware-loop instructions and memory-access instructions with address generation built in: "stream semantic registers", a neat RISC-V ISA extension. https://arxiv.org/abs/1911.08356
I love love love the non-cache-coherent design. Even within a cluster there's not really cache coherence, as far as I can tell; there's just shared memory one can use. The early slides in this deck show so much promise. Notably, there's a huge weakness in the ecosystem in general around hard IP for talking to memory, Ethernet, and PCIe, but otherwise: so much glorious stuff, and rapidly improving tools for doing this kind of work.
A "Snitch" cluster is 8+1 single-stage small integer cpus +1 DMA core. Cores each have their own "decoupled heavily pipelined" FPU. The cluster shares a L1 I-cache, has 32 banks of 4kB scratchpads, and has a 64bit config / peripheral AXI bus, and a 512bit "wide" AXI bus the +1 DMA core can talk to, and it's own connection to all the scratchpads.
Clusters are grouped together. On this "Occamy" chip, there are 4 Snitch clusters in a group. These groups share the 64bit/512bit AXI group bus, which connects up to top bus. Groups also share a L2 consts cache. Then 6 groups.
The chiplet also has 8 HBM2e chips bringing ~400GBps, and some other goodies, like a 8GBps off-die serial link, a system DMA controller, 8GBps and 64GBps die-to-die serials to connect to other chiplets, and a CVA6 management core.
There's some interesting iterations done to make the FPU more decoupleable, such as having native loop instructions & some interesting memory access instructions with address generation built in- "streaming semantic registers", a neat risc-v isa extension. https://arxiv.org/abs/1911.08356
I love love love the non-cache coherent design. Even within a cluster, there's still not really cache coherence, is my feel; there's just some shared memory one can use. The early slides in this deck so show much promise. Notably, a huge weakness in the ecosystem in general around hard IP for talking to memory, ethernet, PCIe, but otherwise, so much glorious stuff & rapidly improving tools for doing this kind of work.