Putting large data on the stack is, at the least, a standard Fortran optimization (gfortran's -fstack-arrays, and the ifort default, I think). As a result you typically need to raise the stack limit considerably for HPC jobs. I'm not aware of that causing trouble for profiling, and would also be interested to hear how it does.
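
For what it's worth, here is a minimal sketch of raising the stack limit from a Python job wrapper, as an alternative to running something like `ulimit -s unlimited` in the job script. It assumes a POSIX system; the function name `raise_stack_limit` is my own invention, and only the standard-library `resource` module is used. An unprivileged process can raise its soft limit only as far as the hard limit, so that is all this attempts.

```python
import resource


def raise_stack_limit():
    """Raise the soft stack limit to the hard limit (hypothetical helper).

    Returns (old_soft, new_soft). A value of resource.RLIM_INFINITY
    means "unlimited".
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    if soft != hard:
        # Only the soft limit changes; the hard limit is left as it was.
        resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
    new_soft, _ = resource.getrlimit(resource.RLIMIT_STACK)
    return soft, new_soft


if __name__ == "__main__":
    old, new = raise_stack_limit()
    print("soft stack limit:", old, "->", new)
```

Child processes started afterwards (for example with `subprocess.run`) inherit the raised limit, so the call would go before launching the Fortran executable.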