You can think of OpenCL kernels (or any imperative sequence of low-level operations) as data flowing through math operations. Normally, we leverage a single set of math circuits to perform all of these operations in sequence, and orchestrate the data flow through a register file. You could imagine removing the register file and instantiating an actual circuit that represents the data flow of the program itself. This creates more opportunity for pipelining, which should be plentiful in a highly data-parallel computation. The issue with FPGAs is that they are clocked lower and are not very dense, so the tradeoff is generally not worth it.
Yes, I don't think it's an issue with the compiler. The FPGA approach requires a flexible fabric that just has lots of overhead to give it programmability compared to an ASIC. For an FPGA to have value, you _really_ need to leverage its programmability. Emulating an ASIC design for verification and testing is a good use case.