
  > I've applied for a grant from NVidia so I can fix that.
A g2.xlarge is 65c/hour on AWS, FWIW.

https://aws.amazon.com/ec2/pricing/on-demand/



That's honestly no way to get anything done...

Unless you're replicating someone else's thing exactly, you can't really get by with one training process. You want to be trying different things and running a few samples of each configuration to account for random variation. I'm not even talking about decadent hyper-parameter sweeps to fine-tune; I'm talking about things like how wide the layers need to be, which optimizers are good, how deep to make the network, etc.

I want to be training 5-10 models at a time, minimum; 20-30 would be much more productive. If I can only train one model at a time, it's not really worth the effort; it's better to work on one of the other tickets for the library.
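Concretely, the workflow I mean looks roughly like the sketch below: enumerate a small grid of configurations and farm them out to parallel workers. (train_model is a made-up placeholder, not an API from the library; in practice each worker would be pinned to its own GPU.)

    # Sketch only: sweep a small grid of widths/depths/optimizers in parallel.
    # train_model() is a hypothetical stand-in for the real training entry point.
    from itertools import product
    from multiprocessing import Pool

    WIDTHS = [128, 256, 512]
    DEPTHS = [2, 4]
    OPTIMIZERS = ["sgd", "adam"]

    def train_model(config):
        width, depth, optimizer = config
        # ... build the network, train it, return a validation score ...
        print("training width=%d depth=%d opt=%s" % (width, depth, optimizer))
        return (config, 0.0)  # placeholder score

    if __name__ == "__main__":
        configs = list(product(WIDTHS, DEPTHS, OPTIMIZERS))  # 12 runs
        with Pool(processes=4) as pool:  # e.g. one worker per available GPU
            results = pool.map(train_model, configs)
        for config, score in sorted(results, key=lambda r: r[1]):
            print(config, score)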


Sorry, I had the impression from the GP that you were trying to replicate a CPU-based task on a GPU.


Yeah, for on-demand? That's ~$468/month ($0.65/hour x 24 x 30)...

I'd use a spot instance and stop it whenever possible.


Spot instances are pretty painful for training. It's annoying to have the machine randomly shut down.


^ That. For all that people say about spot instances, there's no infrastructure I know of to manage jobs and have them migrate to higher-priced instances without losing state.


You can always snapshot and keep track of state as you go (a little bit tricky with Spark, though). We use spot instances for training that isn't vital: as in, it has to be done, but we'd rather run it twice and still save money than pay on-demand to guarantee a single run. Also, once you know what availability specific instances have, you can choose better (e.g. maybe c3.xlarge is slightly more expensive as spot than c3.large and you could get by with the large, but the xlarge has almost no shutdowns).
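The snapshot pattern is basically the sketch below: write the training state to disk at regular intervals and, on start-up, resume from whatever checkpoint exists. (Framework-neutral and the names are just illustrative; a real deep-learning job would use its framework's own serialization.)

    # Sketch only: periodic checkpointing so a spot shutdown only costs you
    # the work done since the last snapshot.
    import os
    import pickle

    CKPT = "checkpoint.pkl"

    def save_checkpoint(state, path=CKPT):
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.rename(tmp, path)  # never leave a half-written snapshot behind

    def load_checkpoint(path=CKPT):
        if not os.path.exists(path):
            return None  # fresh start
        with open(path, "rb") as f:
            return pickle.load(f)

    def train(build_model, num_epochs):
        state = load_checkpoint()
        if state is None:
            state = {"epoch": 0, "weights": build_model()}
        for epoch in range(state["epoch"], num_epochs):
            # ... run one epoch of training, updating state["weights"] ...
            state["epoch"] = epoch + 1
            save_checkpoint(state)  # a spot shutdown now only loses this epoch

On spot you'd also copy each checkpoint to durable storage (S3 or similar) so a replacement instance can pick up where the old one left off.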


Presumably that's a decent chunk of what the grant is for?



