
Thanks especially for the "Not All Sunshine and Rainbows" section. It is all too easy to write about the positive things and leave the negative parts out.

Resource leaks in Haskell are perhaps a bit trickier to track down than in other languages, and I would appreciate hearing more about the issue you were experiencing and how you solved it. Many blog posts have warned against long-running Haskell processes, but you seem to have had fairly good success with it.

Also, the problems you were experiencing with Cabal might be fixed by the sandbox feature, which is built into later versions of Cabal.



We deal with Haskell resource leaks the same way you would in C++ or Java.

We have production monitors on every host that show basic metrics like memory, disk, and CPU utilization. Atop that, we added a tracker for the number of suspended Haskell threads (that is, threads that are not blocked on I/O but are also not running).

We found that the machines are usually able to handle requests as soon as they come in, so if the number of suspended Haskell threads stays above zero for any length of time, the machine is about an hour away from melting down.

We can restart the process without losing any connections, so this leaves us a very comfortable margin of error.

Once we know we have a problem, it's usually pretty simple to run the heap profiler on the process and look at recent commits. We continuously deploy, so there's only about a 10 minute delay before a particular commit is running in front of customers. This makes tracking down regressions really fast.

Even in cases where we can't figure out why a bit of code is leaking, we can almost always identify it and revert it until we understand what's going on.


> We can restart the process without losing any connections

Would you mind expanding on this a bit? I'm not too familiar with Haskell, but I am familiar with various ways of blocking new connections while allowing existing connections to complete, either at the load-balancer level or built into each individual process.

What Haskell stack are you using, and how are graceful restarts accomplished?

Thanks.


One of my coworkers wrote a really cool bit of software to do this. I want him to open source it.

Basically, you can share a single socket amongst many servers. The OS ensures that just one process accepts each connection.

You can therefore have a manager process that owns the socket and passes it on to application processes.

To update, start new processes, then politely tell the old ones to go away.


One really cool thing in Linux is that you can actually pass file descriptors between processes over unix domain sockets.
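For what it's worth, the mechanism is SCM_RIGHTS ancillary data over a unix domain socket, and Haskell's `network` package exposes it directly as sendFd/recvFd. Here's a self-contained sketch (not anyone's production code; `managerEnd`/`workerEnd` are illustrative names, and a real manager and worker would be separate processes connected by a unix socket rather than a socketPair in one process):

```haskell
-- Sketch of fd passing with the `network` package, whose sendFd/recvFd
-- wrap sendmsg(2)/recvmsg(2) with SCM_RIGHTS. One socketPair stands in
-- for the manager<->worker control channel, and a second pair stands in
-- for the shared listening socket being handed off.
import Network.Socket
import Network.Socket.ByteString (recv, sendAll)
import qualified Data.ByteString.Char8 as C8

main :: IO ()
main = do
  (managerEnd, workerEnd) <- socketPair AF_UNIX Stream defaultProtocol
  (listener, peer)        <- socketPair AF_UNIX Stream defaultProtocol

  -- Manager side: ship the listener's raw descriptor across the channel.
  fd <- unsafeFdSocket listener      -- `fdSocket` on older `network`
  sendFd managerEnd fd

  -- Worker side: receive the descriptor and wrap it back into a Socket.
  inherited <- mkSocket =<< recvFd workerEnd

  -- Show it refers to the same kernel object as the original listener:
  sendAll peer (C8.pack "hello")
  msg <- recv inherited 5
  C8.putStrLn msg
```

The kernel duplicates the descriptor into the receiving process, so both sides hold the same open socket; the manager can then close its copy and exit without dropping the listener.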


Windows has supported this for ~14 years too.


Good to know. Does it work for everything that's an fd in Linux? I know you've got to treat sockets and files differently in some cases (or at least did once)...


It works for most kernel handles. Sockets may have become more normal citizens starting with Win7, but I stopped doing Windows development around then.

Here are the official docs: http://msdn.microsoft.com/en-us/library/windows/desktop/ms72...


Looks like there's a separate function for sockets.

Still, cool stuff there too.


einhorn [1] implements this model and is pretty effective. Used in production at Stripe and other places. (It's written in Ruby, but can run application processes in any language.)

[1] https://github.com/stripe/einhorn


Basically, catch SIGINT, then stop listening to a socket/port. Finish all current requests and exit. The "watcher" parent process will restart the process with the new executable. Repeat for all other processes listening to the socket/port.
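The signal-handling half of that can be sketched in a few lines with the `unix` package (a toy, not anyone's actual watcher: `raiseSignal` stands in for the parent sending SIGINT, and "finish all current requests" is just waiting on an MVar):

```haskell
-- Minimal sketch of graceful-shutdown signal handling, using only the
-- base and unix libraries. A real server would close its listening
-- socket in the handler and then drain in-flight requests.
import Control.Concurrent.MVar
import System.Posix.Signals

main :: IO ()
main = do
  drained <- newEmptyMVar
  -- Catch SIGINT instead of dying: mark that shutdown was requested.
  _ <- installHandler sigINT (Catch (putMVar drained ())) Nothing
  raiseSignal sigINT    -- stand-in for the watcher process signaling us
  takeMVar drained      -- stand-in for "finish all current requests"
  putStrLn "exiting cleanly"
```

The watcher side then just needs to start the new executable before signaling the old one, so there's never a moment with nobody accepting on the shared socket.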


I can't answer for grandparent, but you should check out https://github.com/notogawa/graceful


Except in Haskell you can build ekg right into your server. http://ocharles.org.uk/blog/posts/2012-12-11-24-day-of-hacka...


"we added a tracker for the number of suspended Haskell threads" - would you mind sharing how you did that? I couldn't see any obvious GHC APIs for it.


It looks like you're right. I misspoke.

We track total threads, working or not. It works great as an indicator because it tends to stay below the number of CPU cores on the server.


I take it you mean OS-level threads then?


We track Haskell threads.

edit: Found the code. :)

We rolled our own implementation. Our WAI application action increments a counter when an HTTP request is received and decrements it when the request completes.

It doesn't track threads created as a part of HTTP request handling, but we don't allow those actions to forkIO anyway. There hasn't been any demand for it.
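The counting pattern itself is tiny and can be sketched with just an IORef and bracket_ (the name `withRequestCounted` is made up; in the real setup this wrapping would live in the WAI application, with the counter exported to the monitoring system):

```haskell
-- Sketch of an in-flight request counter: bracket each handler so the
-- counter is decremented even if the handler throws.
import Control.Exception (bracket_)
import Data.IORef

withRequestCounted :: IORef Int -> IO a -> IO a
withRequestCounted counter =
  bracket_ (atomicModifyIORef' counter (\n -> (n + 1, ())))
           (atomicModifyIORef' counter (\n -> (n - 1, ())))

main :: IO ()
main = do
  counter <- newIORef 0
  -- The "handler" here just observes the counter mid-request.
  peak  <- withRequestCounted counter (readIORef counter)
  after <- readIORef counter
  print (peak, after)   -- one in flight during the request, zero after
```

Using bracket_ rather than a plain increment/decrement pair is what makes the metric trustworthy: an exception in a handler can't leave the counter stuck above zero.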


Ah, right, that makes sense - thanks.


... so, how? Isn't that what he was asking?


I don't believe sandboxes fix the particular problem the OP was describing, which is that of reproducible builds across multiple environments (different team members / CI / prod).

A cabal sandbox means once you've got your dependencies to resolve and your app to build, it'll continue to build and use the same versions of its dependencies, when building from that sandbox (which pretty much means "when building in that working directory"). But it gives you no guarantee that if your dev build got version 0.1.2 of a transitive dependency, then your CI server will also get version 0.1.2, and not 0.1.3.

If it turns out that your app works with 0.1.2 but not with 0.1.3, then your dev machine will reproducibly produce working builds, while your CI server will reproducibly produce broken builds.

What's really needed is an analogue to the Gemfile.lock used by Ruby or npm-shrinkwrap.json in the Node world, which is checked into version control, and freezes the exact versions of all transitive dependencies until explicitly updated. I think there's a "cabal freeze" command in development, but I'm not sure what the status is.
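For reference, a freeze file in that style would just pin exact versions of every transitive dependency, checked into version control. Something like the following (hypothetical cabal.config contents; the package names and version numbers are invented for illustration):

```
constraints: bytestring == 0.10.0.2,
             conduit == 1.0.9.3,
             text == 0.11.3.1
```

Every environment that builds against the same file gets the same solve, and updating a dependency becomes an explicit, reviewable change.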


You can easily lock to a specific dependency version in the cabal file if that is your desire.


Sure - if you can reliably identify the exact required versions of all of your transitive dependencies. That's infeasible for nontrivial applications. (And even that's only if the set of exact versions you find manage to not have conflicting requirements with each other.)

The reason Gemfile.lock works is because it lets you achieve that the same way you create working code - figure out what works in dev, using a combination of skill and trial and error, then lock it down in version control and deploy exactly that to CI/prod/other devs.

People have written shell scripts to scan your sandbox for installed package versions and update your cabal file to require those versions, but it's an inherently approximate process - e.g. if you upgraded a transitive dependency but still have the previous version in the sandbox, the shell script has to guess which one you want, because there's no explicit relationship between your code and a particular version.

There's a more fundamental problem with that approach - it ignores the difference between "my app semantically requires package X at version y" and "I have tested my app with package X at version y". The cabal file expresses the former - which is why it doesn't include transitive dependencies, and why it's more idiomatic to specify broad version ranges than exact version constraints. "cabal freeze", if it existed, would express the latter. Reliable engineering requires both.


We manage all deps in our cabal file and have scripts to make sure that what is in the package database matches the cabal file (this way we can upgrade versions without manually unregistering and reinstalling).

Like you said, there is nothing seamless that is part of cabal ... yet. I would like to improve our workflow and integrate it into cabal.


One Haskell resource leak I've encountered a couple of times comes from opening large numbers of files combined with non-strict semantics. By default Haskell will open an IO handle but not consume it until the contents are needed, and thus not close it. When reading the contents of many files in a directory, the result is thousands of concurrently open file handles and an exhausted OS file descriptor limit. The solution is to add strictness annotations to force evaluation and relinquish the handles, which isn't fun and isn't pretty.
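A small sketch of the fix, using only base (`readStrictly` is an illustrative name, not a standard function):

```haskell
-- Lazy readFile keeps the handle open until the contents are actually
-- demanded, so mapping it over many files can hit the fd limit before
-- anything is read. Forcing the whole string before returning makes the
-- handle close (readFile closes on EOF) before the next file is opened.
readStrictly :: FilePath -> IO String
readStrictly path = do
  s <- readFile path
  length s `seq` return s   -- force the full contents now

main :: IO ()
main = do
  writeFile "demo.txt" "contents"
  s <- readStrictly "demo.txt"
  putStrLn s
```

In practice strict ByteString/Text readFile variants do the same job more cheaply than forcing a String.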


This problem is being addressed by a number of packages like conduit, pipes, and (at a lower level) io-streams. These are second-generation solutions to the problem that was pioneered by the iteratee and enumerator packages.


Seconding. I've used conduit before, and it was a delight to use something so carefully designed. The blog posts about conduit are in themselves an insight into how to think with Haskell.

http://www.yesodweb.com/blog/2013/10/core-flaw-pipes-conduit http://www.yesodweb.com/blog/2013/10/simpler-conduit-core


The library in question was using unsafePerformIO to open a file handle. It was just a bug.


There's apparently a particular instance you're referring to?

But the general issue can be encountered with lazy IO without any use of unsafePerformIO. There has been a lot of discussion about this around the various enumerator-like libraries - in particular, Snoyman has many posts about ensuring timely release of resources.


As for resource leaks, the particular example in the post was a bit unfortunate. The problem in general has solutions, though: you can use functions that acquire a resource within a limited scope and clean it up automatically when done (like 'withFile'), and if you want to do more complicated things there are the resourcet and pipes-safe libraries.
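For instance, with withFile the handle cannot outlive its scope, even on exceptions (one caveat worth a comment: don't return a lazily-read value out of the bracket, since it would be demanded after the handle has closed):

```haskell
import System.IO

main :: IO ()
main = do
  -- withFile brackets open/close: the handle is released when the body
  -- returns or throws, so it can't leak past this scope.
  withFile "out.txt" WriteMode $ \h ->
    hPutStrLn h "scoped write"
  -- Safe to read here; the writing handle is already closed.
  readFile "out.txt" >>= putStr
```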



