That's great news! We are exclusively using Scala at work for the back end, and I wonder if it could be interesting to switch new projects to Scala Native.
Did you test Scala Native against well-known, large open-source Scala projects? Did performance improve or regress? Did you write a brand new Scala compiler for native code?
Never touch a running system ;) Scala on JVM is much more tested than the new shiny thing. Also don't expect improved performance... many people think that the JVM is bloated and makes programs slower (this is mostly not true). The downsides of the JVM are more memory consumption/footprint (when you have e.g. small servers or micro instances) and the cold startup time of the JVM itself (which is not relevant on a server in comparison to desktop Java apps).
I'd be interested to hear whether any backend Scala projects, e.g. Play, work on Scala Native.
> the cold startup time of the JVM itself (which is not relevant on a server in comparison to desktop Java apps).
I disagree somewhat with this.
We found that when we started writing microservices in languages other than Java, the short startup time changed how we did some error handling.
For errors where we, say, lose the connection to the database or RabbitMQ, we'd much rather have the Node.js process die and restart than try to construct reconnect logic.
The problem with reconnect logic is that it's code that may be exercised very rarely. This in turn means it's easy to get strange long-term problems there, like a very slow memory leak due to a listener being added to a connection object each time a connection is initiated.
We did a 180 on reconnect logic in our Node.js processes and now let exceptions just bubble up unhandled and take the entire VM down. With an automatic restart script, the process is back in seconds anyway, and with Docker's built-in back-off timers for auto-restart, we don't necessarily overload the shared resources.
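The same fail-fast approach translates directly to Scala. A minimal sketch (the `connect` function and the supervisor setup are hypothetical, not from this thread):

```scala
object Worker {
  // Hypothetical connect: throws if the broker/database is unreachable.
  def connect(healthy: Boolean): String =
    if (healthy) "connected" else throw new RuntimeException("broker unreachable")

  def main(args: Array[String]): Unit = {
    // Deliberately no try/catch and no reconnect loop: if connect throws,
    // the process dies, and the supervisor (restart script, or Docker's
    // restart policy with back-off) brings it back in seconds.
    val conn = connect(healthy = true)
    println(s"serving requests over $conn")
  }
}
```

The design choice is that the supervisor, not the application, owns retry policy and back-off.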
> For errors where we, say, lose the connection to the database or RabbitMQ, we'd much rather have the Node.js process die and restart than try to construct reconnect logic.
This sounds like a very Erlang-ish way to handle the problem. Another advantage is that if the server/process is in some weird state that's causing problems, killing and restarting it lets you clear out the broken state, and get back into the state that it's most likely been tested under.
Yes. We're now almost risking it going the other way: that some bad programming goes unchecked for a long time because, overall, the process sort of does what it should, even if it restarts ten times a day.
I guess I don't see the difference here if your VM startup time is 3 seconds or 15-30 seconds. If that's the difference between the site remaining stable and the whole thing collapsing then it seems like you're setting yourself up for a big outage one day when the nodejs process isn't able to come back in three seconds for whatever reason.
I think it depends a bit on class of errors. Certainly not everything is suitable for this treatment.
Lost connectivity to RabbitMQ or Elasticsearch would mean our site is dead anyhow (you can't do anything). So either of those errors should arguably result in some static 500 pardon-our-appearance page.
But say someone messes up the network connection or we get a brief problem.
The most effective way to handle these kinds of errors in Java unfortunately requires understanding class loading, thread contexts, wrapping connection primitives in the right kind of references, and then making sure that all resource deallocation/closing always uses the same codepath. Even though you really only need to implement it once, it is both somewhat tricky and technically challenging.
It's a pity that so few Java projects have tried to use these mechanisms without building them as part of massive frameworks, sometimes apparently even without understanding what they have built.
Yes, I did spend large parts of my Java developer career looking at class loaders and class loading delegation in servlet containers etc.
I think it's a bit too hard to get it right.
Like, suddenly some third party library starts pulling in log4j and your whole logging setup goes wrong in subtle yet very bad ways.
Or you screwed up that one reference to a ResultSet, and even though it is closed, that reference keeps an entire object graph of Connection, PreparedStatement, etc. alive.
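One way to make "all resource deallocation uses the same codepath" concrete is a loan pattern, so a ResultSet or Connection reference can never outlive its scope. A sketch (the `using` helper here is my own illustration, not from any comment above; Scala 2.13 later shipped a similar `scala.util.Using`):

```scala
object Loans {
  // Generic loan pattern: the resource is always closed on the same
  // codepath, and the reference cannot escape the callback's scope.
  def using[R <: AutoCloseable, A](open: => R)(f: R => A): A = {
    val resource = open
    try f(resource)
    finally resource.close()
  }
}
```

With JDBC you would nest three of these (Connection, PreparedStatement, ResultSet) so closing always happens innermost-first, and no stray reference keeps the chain alive.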
Isn't this also solved by just load balancing so that the customer ends up reconnected to a healthy node while the downed node is replaced?
We run our Scala apps on Aurora/Mesos behind a load balancer (hundreds of instances for just one app). If there's an issue that can't be handled within the app and error rates breach a given threshold, Aurora just kills the instance and creates a new one on another host.
At the moment, there isn't direct support for multithreading [1], so I'm guessing it would be very difficult to run any of the common web servers or computing frameworks natively. It may be possible for libraries that have pluggable concurrency, for example by creating an `ExecutionContext` that wraps OS threads, but that's waaay beyond my pay grade.
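For reference, wrapping a plain thread pool in an `ExecutionContext` looks like this on the JVM today; whether something equivalent could sit on top of raw OS threads in Scala Native is the open question (the pool size of 4 below is arbitrary):

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

object PluggableEc {
  private val pool = Executors.newFixedThreadPool(4)
  // An ExecutionContext backed by a fixed pool of JVM-managed threads.
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

  def compute(): Int =
    // Await is for demonstration only; real code would compose Futures.
    Await.result(Future { 21 * 2 }, Duration(5, "seconds"))

  def main(args: Array[String]): Unit = {
    println(compute())
    pool.shutdown() // non-daemon workers would otherwise keep the JVM alive
  }
}
```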
Long-time Java lover here. I agree with all your points, but in the context of Java at least (does Scala support this?) there is no simple static binary that can be built and released which includes the JVM. I think Java 9 will have this option, but this is something I didn't realize I missed until I started working with Rust and Go. It makes deployment so much simpler.
At work I'm running 17 different containers- many of which require their own JVM. (That's 3 different JRuby apps, zookeeper, kafka, and ElasticSearch.)
Those JVMs get heavy when you're shipping container images compared to small Go or Rust binaries.
To my mind the JVM is where containers make the least sense. If you build an executable jar you can run with "java -jar ..." then that seems just as simple as "docker run ..." and gets you the single-file deployment, and you can control memory allocation via flags if you need to. You don't get virtual networking but IME that doesn't add value in the first place.
There are still some licensing issues. For instance, Atlassian has an official docker container to evaluate Confluence, but they don't support it in production since it uses OpenJDK and Confluence is still somewhat broken on OpenJDK.
Rather than fix Confluence to work on OpenJDK (I don't want to imagine what type of reflection garbage they've got going on down there that breaks so bad on OpenJDK), their instructions tell you how to make your own Dockerfile using the official Oracle runtime.
Actually, in that situation, if it won't run on OracleJDK it's probably not going to work via a native compiler either.
No downvote from me. But here's an explanation of why compiled binaries can be better. I had a hard time getting a normal, non-fancy Scala Play project running on a 512MB DigitalOcean instance, mostly because it needs a lot more RAM for building. I solved it by using a bigger swap partition. With precompiled single binaries, this problem is more of a developer-machine problem than an infrastructure problem. So I think the deployment step itself (not looking at anything else) is easier with small single binaries.
You shouldn't build it on the deployment server. You build a jar and upload/download that to/from the place you want it to run, just as you'd do with an executable. A jar is "not binary" but what practical difference does that make?
Does it really need to be a binary? Build executable jars (use the maven shade plugin), run them with java -jar foo.jar, that's about as simple as it gets.
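Since this thread is about Scala: the sbt-world equivalent of the maven shade plugin is sbt-assembly. A minimal sketch (the plugin version and main class name are placeholders; check the current release before copying):

```scala
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")

// build.sbt
mainClass in assembly := Some("com.example.Main")
```

Then `sbt assembly` produces a single fat jar you can run with `java -jar`.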
Yeah you can't do that, and I don't necessarily agree with that design decision. But for the sake of the comparison it's worth saying that these "simple" compile-to-binary languages simply don't let you set those parameters at all - it's ridiculous to argue that Go (say) is better than Java because something that's impossible in Go requires fiddling with parameters in Java.
I didn't actually say Go is better than Java. I said that binaries were something I realized I missed because of those languages. That is, it's something I appreciate about Go and Rust.
But what's the advantage of a binary over a (shaded) jar? "java -jar myapp.jar" is a little more typing than "myapp", but only a little (and you can avoid that by prepending a launch script if you want); having the JVM installed on all your servers is a one-time cost.
See my above comment about GC and other runtime options. I always need a script to specify all the options to the JVM. It is never just as easy as a single jar with no options. That makes it better, no doubt, but it still sucks.
I don't understand how you need those other options in Java and avoid needing them with a binary? What's the difference that means you can get by with not passing any options in go or what have you?
(I've used "java -jar myapp.jar" in production and it's been fine; the Java mainstream may favour using lots of -Dblah but it's entirely possible to replace that with code)
Maybe some apps do not have a long classpath. Here is what I see for a running Kafka instance on one of my servers, and it does not look as simple as it gets.
If you use the maven shade plugin (or similar) you can replace the whole "-cp ..." stanza with a "-jar myfile.jar". The "-D" arguments can be set in code instead (though I'd ask why you are allowing remote management without authentication and without SSL?).
The rest of the arguments are about GC tuning and logging. How would you do those things in a language that gives you a "simple" static binary? Either you can't at all, or they'd require an equally complex series of arguments.
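Setting a system property in code, as suggested above, can look like this (the property name is hypothetical; note that properties read at JVM startup by agents, such as the JMX remote flags, still have to go on the command line):

```scala
object Bootstrap {
  def main(args: Array[String]): Unit = {
    // Equivalent to: java -Dmyapp.environment=production -jar myapp.jar
    // sys.props is Scala's mutable view over System.getProperties;
    // set the value before the subsystem that reads it starts up.
    sys.props("myapp.environment") = "production"
    println(sys.props("myapp.environment"))
  }
}
```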
1. What should the default be? Java build systems default to building dynamically linked, though it's a few lines to change. IMO dynamic is a better default for large projects, as you usually have more library modules than executable modules. On the other hand a large project is likely to already involve a fair bit of build config, so maybe the defaults should be optimized for small projects.
2. Whether you allow dynamic at all. To my mind it's always worth having the option, and I think Go will come to regret not having it if and when it ever gets used for large projects.
There are plenty of large projects like kubernetes/docker/rkt/influxdb/tidb/cockroachdb and so on. Go provides considerably better memory efficiency and sub-millisecond GC pauses compared to Java.
As of Go 1.8 it also provides plugin support, though I'm not sure it's anywhere near Java in terms of dynamic library loading support.
I think for server applications, Scala on the JVM will probably beat Scala Native. The benefits of Scala Native over Scala on the JVM are:
- faster startup time
- (drastically) lower memory footprint
- fine hand-tuning of your application
All these things are not super important in server applications. For example, Java trades memory for throughput (higher memory footprint, but also higher throughput; these usually go hand in hand).
You could always bundle the JRE, and with Java 9 one can even make use of the newly introduced linker (jlink) to create a customized runtime image with just the relevant classes.
The large memory footprint of the JVM is memory for classes, profiles, things like that. Those are used to create optimised code and to recover when optimisations were too optimistic. When your program is optimised and running in steady state, this memory isn't actively used and so doesn't contend with your application memory and so has no impact on cache efficiency.
This sounds like a plausible explanation, but is this verified/verifiable? Are there memory profilers that can show me the relative sizes of the young/old/permanent generation segments of the GC?
I'm always blown away at the memory usage of JVM apps. Part of it is the fact that Java has encouraged insanity-inducing inheritance hierarchies... but it is also incredibly hard to do dead-code elimination on such a static (in type and compilation model) language (I blame dynamic classloading, but that's more of a guess than anything). Maybe what you're saying is the reason we don't see noticeable GC pauses until you start seeing large amounts of data... but it is still a huge pain for low-memory environments like phones, embedded devices, IoT, etc. And while memory usage is always going to be higher in a GC'd language, the JVM still consumes vastly more memory than other languages like OCaml, D, Go, etc.
Yes, it's called "perm gen cache," or something like that, on any standard JVM profile. This roughly represents the memory used by the type system. It can get pretty high if you are doing something like auto-generating types (GUI, build systems, etc)
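You can also inspect those pools from inside the process via the standard management API, no external profiler needed. A sketch (pool names vary by JVM and GC algorithm; on a Java 8 HotSpot you'd typically see Eden/Survivor/Old Gen plus Metaspace, which replaced the permanent generation):

```scala
import java.lang.management.ManagementFactory

object MemoryPools {
  // One line per memory pool, e.g. "PS Old Gen", "Metaspace", with current usage.
  def poolReport(): List[String] = {
    val pools = ManagementFactory.getMemoryPoolMXBeans // java.util.List of MXBeans
    var out = List.empty[String]
    pools.forEach { pool =>
      out = f"${pool.getName}%-30s used=${pool.getUsage.getUsed / 1024}%d KiB" :: out
    }
    out.reverse
  }

  def main(args: Array[String]): Unit =
    poolReport().foreach(println)
}
```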
In general (not for server apps), two major benefits of Scala Native are:
- Predictable latency if desired (optional GC)
- Very low call overhead for C ABI
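The C ABI point is the distinctive one: Scala Native can declare and call C functions directly via extern objects, with no JNI layer in between. A sketch that only compiles under the Scala Native toolchain, not on the JVM (the package and type names reflect the early scalanative bindings and may differ across versions):

```scala
import scala.scalanative.native._

// Bind directly to libc: each call compiles to a plain C function call,
// not a JNI transition.
@extern
object mylibc {
  def strlen(str: CString): CSize = extern
}

object Demo {
  def main(args: Array[String]): Unit = {
    // c"..." produces a C string literal; strlen("hello") is 5.
    val len = mylibc.strlen(c"hello")
    println(len)
  }
}
```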
As to your point about memory use, Java trades memory for convenience, not performance. GC requires substantially more memory for similar performance. I read an IBM blog (which I can't find at the moment) within the last week which showed a Swift web service running slightly faster than Java, but using only half the memory.
The following comparison is also interesting, with a JSON serialization example in Swift outpacing Spring/Java by a factor of ten... This is also running on Linux instead of macOS.
To be fair, in that list, the spring entry didn't exactly run circles around the competition either. It's somewhere between the better PHP contenders and even behind grails (which can be pretty accurately described as spring with layer of slowness added on top). Really looks like there is something unfortunate going on with the idiomatic way to implement those examples on spring.
I've used the JVM on 512MB DO instances and in containers, and they run fine. I think for containers there are other issues (most likely you are going in a microservice direction where latency is eventually going to be important, so picking another JVM GC algorithm might be suitable). There may be applications for which 512MB instances and the JVM are not suitable, but you can most likely just upgrade the instance.
Many things pointed out in this article apply to just about every managed language runtime. Implement a TreeSet in any language and you'll see the same overhead from object headers, memory alignment, etc. Java has some oddities that cause it to waste extra memory, but off the top of my head the only one I can think of is 16-bit-character Strings. Java 9 is supposed to help with that by letting Strings internally store characters in a compact single-byte (Latin-1) encoding where possible.
I do like the slide though showing that people tend to assemble abstractions together and completely lose sight of the performance costs of what they are doing. There's also the fallacy commonly held by many that because someone took the time to write a framework or library, they must have also taken the time to ensure it's optimized well.
I'm not involved with this project in any way, but I would expect performance to be overall worse with Scala Native. The advantages of Scala Native are likely:
1) Much faster startup times
2) Smaller memory footprint for small programs.
3) Potential for easier installation since no dependency on the JDK (assuming binaries are statically linked)
So basically you could use scala native to cover some cases that are better covered by golang or rust right now. For large and long-running server-side processes, the JVM is still king.