The issue I was thinking of is what happens if you need multiple "custom instructions" within a single loop. A related issue, if multiple tiles are considered, is the interconnect between the processor core and the tiles. Interconnect is expensive, so there's a trade-off to be made. Profiling data would help make that trade-off: a curve of the number of unique accelerator functions for the targeted problem domains, plus their temporal relationship (e.g., there may be ten accelerator functions, but they are executed several milliseconds apart, or functions A and B tend to appear in an inner loop together).
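To make the "temporal relationship" part concrete, here is a rough sketch of the kind of analysis I have in mind. Everything in it is an assumption for illustration: the trace format (timestamp in microseconds plus a function name), the `co_occurrence` helper, and the 50 µs window are all made up, not taken from any real profiler.

```python
from collections import Counter


def co_occurrence(trace, window_us):
    """Count how often two different accelerator functions run close together.

    trace: list of (timestamp_us, function_name), sorted by timestamp
           (hypothetical format, not from a real profiler).
    window_us: two calls count as "together" if at most this far apart.
    """
    pairs = Counter()
    start = 0  # index of the oldest event still inside the window
    for i, (t_i, f_i) in enumerate(trace):
        # Slide the window start forward past events that are too old.
        while trace[start][0] < t_i - window_us:
            start += 1
        # Distinct other functions seen shortly before this call.
        partners = {f_j for _, f_j in trace[start:i] if f_j != f_i}
        for f_j in partners:
            pairs[tuple(sorted((f_i, f_j)))] += 1
    return pairs


# Made-up trace: A and B alternate in a tight loop, C and D run
# milliseconds later and far apart from everything else.
sample_trace = [
    (0, "A"), (2, "B"), (10, "A"), (12, "B"),
    (5000, "C"), (9000, "D"),
]

if __name__ == "__main__":
    print(co_occurrence(sample_trace, window_us=50))
```

On the made-up trace, only the A/B pair shows up in the counts. Pairs with high counts are the ones that would want to be resident at the same time (or share a tile), while functions that never pair up, like C and D here, could plausibly be swapped in and out over a cheaper interconnect.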