Unless you're replicating someone else's thing exactly, you can't really get by with one training process. You want to be trying different things and running a few samples of each configuration to account for random variation. I'm not even talking about decadent hyperparameter sweeps for fine-tuning. I'm talking about basics like: how wide do my layers need to be, which optimizers work well, how deep should the network be, etc.
I want to be training 5-10 models at a time, minimum. 20-30 would be much more productive. If I can only train one model at a time, it's not really worth the effort; it's better to work on one of the other tickets for the library. Something like the sketch below is what I have in mind.
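Rough sketch of what I mean: a small grid over width/depth/optimizer, a few seeds per configuration, run in parallel. `train_model` here is a hypothetical stand-in for whatever your real training entry point is, and the grid values are just examples.

```python
import itertools
from multiprocessing import Pool

def train_model(width, depth, optimizer, seed):
    # Hypothetical placeholder: build the network, train it, return a validation metric.
    return {"width": width, "depth": depth, "optimizer": optimizer,
            "seed": seed, "val_loss": 0.0}

# Every combination of the knobs I actually care about, with 3 seeds each
# so I can look at mean/spread per configuration rather than single runs.
configs = list(itertools.product(
    [128, 256, 512],     # layer width
    [2, 4, 8],           # network depth
    ["adam", "sgd"],     # optimizer
    range(3),            # seeds
))

if __name__ == "__main__":
    # 8 concurrent runs; bump this up if the hardware is there.
    with Pool(processes=8) as pool:
        results = pool.starmap(train_model, configs)
```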
^ That. For all that people say about spot instances, there's no infrastructure I know of to manage jobs and have them migrate to higher-priced instances without losing state.
You can always snapshot and keep track of state as you go (a little tricky with Spark, though). We use spot instances for training we know isn't vital (as in, it has to be done, but we'd rather run it twice and save money than pay to guarantee a single run). Also, once you know the availability of specific instance types you can choose better (e.g. c3.xlarge may be slightly more expensive as spot than c3.large and you could get by with large... but the xlarge has almost no shutdowns).
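A minimal sketch of the snapshot-as-you-go idea, assuming a PyTorch-style training loop (the same idea applies to Spark or anything else, just with different serialization). The checkpoint path and interval are arbitrary choices here; on a spot instance you'd point the path at S3 or an EBS volume rather than the local disk.

```python
import os
import torch

CKPT_PATH = "checkpoint.pt"  # hypothetical path; use durable storage on spot instances

def save_checkpoint(model, optimizer, step):
    torch.save({
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }, CKPT_PATH)

def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT_PATH):
        return 0  # fresh start
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]

def train(model, optimizer, data_loader, total_steps, ckpt_every=500):
    step = load_checkpoint(model, optimizer)  # resume wherever the last instance died
    data_iter = iter(data_loader)
    while step < total_steps:
        try:
            batch_x, batch_y = next(data_iter)
        except StopIteration:
            data_iter = iter(data_loader)
            batch_x, batch_y = next(data_iter)
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(batch_x), batch_y)
        loss.backward()
        optimizer.step()
        step += 1
        if step % ckpt_every == 0:
            save_checkpoint(model, optimizer, step)  # survives a spot shutdown mid-run
```

If the new instance is a different (pricier) type, the same checkpoint file is all it needs to pick up where the old one left off.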
https://aws.amazon.com/ec2/pricing/on-demand/