They're not the same type of mistake. I don't remember the details, but when I was reading papers on caption generation, the part that produces coherent sentences seemed more like a hack that happens to kinda work (usually) than a robust solution.