Unless you're replicating someone else's thing exactly, you can't really get by with one training process. You want to be trying different things and running a few samples of each configuration to account for random variation. I'm not even talking about decadent hyperparameter sweeps for fine-tuning. I'm talking about basics like: how wide do my layers need to be, which optimizers work well, how deep should the network be, etc.
I want to be training 5-10 models at a time, minimum. 20-30 would be much more productive. If I can only train one model at a time, it's not really worth the effort; it's better to work on one of the other tickets for the library. Something like the sketch below is what I have in mind.
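Rough sketch of what I mean: a small grid over width/depth/optimizer, a few seeds per configuration, run in parallel. `train_model` here is a hypothetical stand-in for whatever your real training entry point is, and the grid values are just examples.

```python
import itertools
from multiprocessing import Pool

def train_model(width, depth, optimizer, seed):
    # Hypothetical placeholder: build the network, train it, return a validation metric.
    return {"width": width, "depth": depth, "optimizer": optimizer,
            "seed": seed, "val_loss": 0.0}

# Every combination of the knobs I actually care about, with 3 seeds each
# so I can look at mean/spread per configuration rather than single runs.
configs = list(itertools.product(
    [128, 256, 512],     # layer width
    [2, 4, 8],           # network depth
    ["adam", "sgd"],     # optimizer
    range(3),            # seeds
))

if __name__ == "__main__":
    # 8 concurrent runs; bump this up if the hardware is there.
    with Pool(processes=8) as pool:
        results = pool.starmap(train_model, configs)
```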
^ That. For all that people say about spot instances, there's no infrastructure I know of to manage jobs and have them migrate to higher-priced instances without losing state.
You can always snapshot and keep track of state as you go (a little tricky with Spark, though). We use spot instances for training we know isn't vital (as in, it has to be done, but we'd rather run it twice and save money than pay to guarantee a single run). Also, once you know the availability of specific instance types you can choose better (e.g. c3.xlarge may be slightly more expensive as spot than c3.large and you could get by with large... but the xlarge has almost no shutdowns).
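A minimal sketch of the snapshot-as-you-go idea, assuming a PyTorch-style training loop (the same idea applies to Spark or anything else, just with different serialization). The checkpoint path and interval are arbitrary choices here; on a spot instance you'd point the path at S3 or an EBS volume rather than the local disk.

```python
import os
import torch

CKPT_PATH = "checkpoint.pt"  # hypothetical path; use durable storage on spot instances

def save_checkpoint(model, optimizer, step):
    torch.save({
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }, CKPT_PATH)

def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT_PATH):
        return 0  # fresh start
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]

def train(model, optimizer, data_loader, total_steps, ckpt_every=500):
    step = load_checkpoint(model, optimizer)  # resume wherever the last instance died
    data_iter = iter(data_loader)
    while step < total_steps:
        try:
            batch_x, batch_y = next(data_iter)
        except StopIteration:
            data_iter = iter(data_loader)
            batch_x, batch_y = next(data_iter)
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(batch_x), batch_y)
        loss.backward()
        optimizer.step()
        step += 1
        if step % ckpt_every == 0:
            save_checkpoint(model, optimizer, step)  # survives a spot shutdown mid-run
```

If the new instance is a different (pricier) type, the same checkpoint file is all it needs to pick up where the old one left off.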
https://aws.amazon.com/ec2/pricing/on-demand/