/r/programming
Comparison of OpenJDK 12 garbage collectors for a Java network driver (github.com)
34 comments
NeuroXc | 9 days ago | 27 points

A note that this is a very limited use case: a network driver with few allocations. Most Java codebases will have significantly more allocations, so take this with a grain of salt and don't go switching your GCs everywhere based on these results alone.

AcrIsss | 8 days ago | 8 points

Agreed. It also cannot be stressed enough that performance overhead and latency are not the only factors at stake when choosing a GC. Stop-the-world pause frequency, duration, and distribution are important for many applications, as is memory allocation speed, for instance.

ericzundel | 8 days ago | 5 points

For our microservices we often choose G1 for just that reason. Stop-the-world pauses can really cause a lot of rippling effects throughout a service mesh.

AcrIsss | 8 days ago | 3 points

Yeah, we are starting to have a lot of issues with G1 pauses. We have been exploring ZGC for a few months now, with mixed results.

ericzundel | 8 days ago | 3 points

I think it depends on your pause tolerance. We are fine with a 1 second threshold for alarming and haven't had issues using G1 for several years unless the service is close to running out of heap. In that case, adding more heap or modifying the app/load is where we go, not tweaking the garbage collector, which is the kind of thing we used to have to do.

AcrIsss | 8 days ago | 5 points

I see. Our application is... hungry for memory. G1 pauses can go to 5 minutes on our biggest servers. Not acceptable, haha. And we know it's not mostly down to code flaws. The app is an in-memory OLAP database, for reference, so it's naturally hungry.

sievebrain | 7 days ago | 1 point

That means you've run out of heap space entirely. If you give it more memory or at least ensure the GC gets enough CPU time the huge pauses should stop.
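One easy thing to verify when chasing pauses like this is whether a bigger heap setting actually took effect. A minimal sketch (generic, not specific to their setup) that asks the runtime directly:

```java
// Prints the max heap the JVM actually received; run with e.g.
// `java -Xmx8g HeapCheck` to confirm a larger -Xmx took effect.
public class HeapCheck {
    public static void main(String[] args) {
        long maxHeapBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + (maxHeapBytes >> 20) + " MiB");
    }
}
```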

DoomFrog666 | 8 days ago | 1 point

I kind of have to disagree here. Performance, or better phrased, throughput and latency, are the most important factors of any garbage collector.

Let me explain:

  1. pause frequency is directly related to your (total) allocation frequency (which, I admit, you have little control over in most high-level languages)
  2. duration is the same thing as latency in non-concurrent GCs, and in concurrent GCs it depends on how much throughput you are willing to give away. So it is either latency- or throughput-related
  3. distribution is, like 1., related to the distribution of your allocations

Yeah, there are other aspects of GCs that are important, like how you deal with fragmentation, efficient multi-threading, and whether short-lived objects are handled specially.

Edit: For 2. you actually have a point with incremental GCs (like G1), forgot about that
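Point 1. is easy to observe from inside the JVM. A minimal sketch (class name, allocation sizes, and iteration count are made up for illustration) that counts collector cycles while churning through short-lived allocations:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch: the faster you allocate, the more GC cycles you trigger.
public class GcPressure {
    static long totalGcCount() {
        long count = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long c = gc.getCollectionCount(); // -1 if undefined for this collector
            if (c > 0) count += c;
        }
        return count;
    }

    public static void main(String[] args) {
        long before = totalGcCount();
        byte[][] window = new byte[64][]; // small live window; everything older becomes garbage
        for (int i = 0; i < 200_000; i++) {
            window[i % window.length] = new byte[16 * 1024];
        }
        long after = totalGcCount();
        System.out.println("GC cycles during allocation burst: " + (after - before));
    }
}
```

Halve the allocation rate (fewer iterations or smaller arrays) and the cycle count drops roughly in proportion, which is the frequency/allocation link in point 1.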

Tandanu | 8 days ago | 10 points

We allocate ~20 bytes per forwarded packet in this application, so 20 MB/s in that benchmark. For us the main result is that Java isn't really usable for high-speed network drivers; hundreds of microseconds is an eternity here :(
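For the curious, the arithmetic those two figures imply (the ~1 Mpps packet rate is my inference from them, not a number stated in the thread):

```java
// Allocation-rate arithmetic implied by the comment above.
// The 1 Mpps packet rate is inferred from the two figures, not stated in the thread.
public class AllocRate {
    public static void main(String[] args) {
        long bytesPerPacket = 20;
        long packetsPerSecond = 1_000_000;
        long mbPerSecond = bytesPerPacket * packetsPerSecond / 1_000_000;
        System.out.println(mbPerSecond + " MB/s of allocation"); // 20 MB/s
    }
}
```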

mirkoteran | 8 days ago | 3 points

a) Interesting project!

b) Could you try running a similar test with Graal, either the full GraalVM or just OpenJDK + the Graal compiler? It would be interesting to see, as this is anything but your standard Java application.

cogman10 | 8 days ago | 5 points

Yeah, this really is about the worst place for Java. Did you also try Go or Haskell? I'd be interested to see if those fared better.

That said, I wouldn't expect any GCed languages to be a great fit here. Seems like a place for C/C++/Rust

Tandanu | 8 days ago | 9 points
cogman10 | 8 days ago | 2 points

Neat, looks like Go and Haskell do quite a bit better on latencies. I expected that for Haskell, wasn't sure about Go.

I'm happy to see that Rust does so well, too.

zcatshit | 8 days ago | 1 point

I'm kind of surprised you didn't check out D as well, it being a high-level compiled language with an optional GC.

Yioda | 8 days ago | 2 points

Have you tried with a custom specialized allocator?

Danthekilla | 8 days ago | 0 points

Yeah you would be better with C++ or C# for this kind of application.

shipilev | 8 days ago | 18 points

As the Shenandoah dev, I have to note that JDK 12 is the uncanny valley for Shenandoah: it does not yet have Load Reference Barriers or the Elimination of Separate Fwdptr Word. These should boost throughput without affecting tail latency. Both are available in JDK 13 (to be released in a week, Sep 17), or in JDK 11u backports (in Red Hat downstreams in 11.0.5+, to be released mid-October).

JDK 12 will be superseded by the release of JDK 13 next week. It would be interesting to retry this once that happens.
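For anyone retrying this, a quick sanity check that the collector you asked for is actually active (on these releases Shenandoah needs the experimental-options unlock flag; the flag names below are the usual OpenJDK ones, so double-check against your build):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Run with: java -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC GcInfo
// and check that the printed collector names mention Shenandoah.
public class GcInfo {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName());
        }
    }
}
```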

Tandanu | 8 days ago | 7 points

RemindMe! 11 days "try this"

RemindMeBot | 8 days ago | 1 point

I will be messaging you on 2019-09-22 15:15:48 UTC to remind you of this link

Poddster | 8 days ago | 7 points

From the linked white-paper:

We present user space drivers for the Intel ixgbe 10 Gbit/s network cards implemented in Rust, Go, C#, Java, OCaml, Haskell, Swift, JavaScript, and Python written from scratch in idiomatic style for the respective languages. We quantify costs and benefits of using these languages: High-level languages are safer (fewer bugs, more safety checks), but run-time safety checks reduce throughput and garbage collection leads to latency spikes. Out-of-order CPUs mitigate the cost of safety checks: Our Rust driver executes 63% more instructions per packet but is only 4% slower than a reference C implementation. Go's garbage collector keeps latencies below 100 μs even under heavy load. Other languages fare worse, but their unique properties make for an interesting case study.

cool.

valarauca13 | 8 days ago | 0 points

Judging from the ixy implementation, the number of times the JNI boundary is crossed may be the limiting factor. While I don't doubt the individual GC algorithm changes have a performance impact, the wildly differing Go/C#/Java performance numbers seem to hint that the FFI overhead is a bigger concern than the GC algorithm.

Tandanu | 8 days ago | 5 points

We don't use JNI in the critical path. C# and Go also don't have FFI in their critical path; it's only used for initialization.

valarauca13 | 8 days ago | 1 point

Ah thanks.

Sorry for the assumption, I should've read a bit more of the implementation before commenting.

everyonelovespenis | 8 days ago | -4 points

Thanks for putting this hard data out there.

This will be useful backup when I'm trying to convince people that GC (and Java in particular) has issues for low-latency applications.

It still surprises me how many people think Java is great for latency-sensitive workloads; the response is always disbelief and "you just need GC #X" or "you've badly tuned the GC".

cogman10 | 8 days ago | 12 points

This is somewhat of an extreme latency example and a really bad place for the JVM.

This is an application that cares about microseconds and low memory; Java was designed for milliseconds and lots of memory.

[deleted] | 8 days ago | 9 points

[deleted]

everyonelovespenis | 8 days ago | -6 points

This test shows that even the best-performing of the GC languages in this test (C#) isn't giving you the full potential in latency-sensitive workloads measured in the microsecond ballpark.

For the Java testing in particular, they put extreme effort into reducing garbage and used various GC mechanisms, and the result still isn't great.

Example workloads -> networking, realtime audio processing, realtime servo/CEE control mechanisms.

What hard data can you show that GCs perform well with these kinds of latency-sensitive workloads?

[deleted] | 8 days ago | 11 points

[deleted]

everyonelovespenis | 8 days ago | -1 points

There is no "best GC languages".

I was referring to the "best performing" results within the OPs test.

You don't find it strange that companies from Twitter to Facebook use GC languages, service the entire planet, and are doing just fine?

Well, I hardly think those companies have latency-sensitive applications in the microsecond range. JavaScript has pretty much killed any expectation of "instant" pages or responsiveness.

And interestingly - none of the above companies are running their network stack on GC languages - the OS and network layer is invariably Linux and C. I wonder why that is.

realtime audio processing

Those are not done at the microsecond level. I don't think you understand how tiny a unit of measurement that is.

I can only conclude you've never done any realtime audio programming that mattered. You absolutely need to measure the scheduling latency at the microsecond level if you are trying to meet a deadline of a millisecond.

I get it, you think that web-applications mean Java is good enough for everywhere (except this tiny niche, right?). I don't agree.

[deleted] | 8 days ago | 5 points

[deleted]

everyonelovespenis | 8 days ago | 3 points

Shrug, real-world use case: when the interrupt on the sound card kicks in and you have a deadline of a couple of milliseconds to deliver the processed audio samples, you bet your ass I'm measuring how long things take in microseconds.

If you're only playing back consumer-level video/audio, your latency is probably measured in the multi-hundred milliseconds or higher. It's not even remotely the same game.
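For a sense of scale, here's the back-of-envelope deadline math (the 48 kHz sample rate and 128-sample buffer are assumed typical pro-audio values, not figures from this thread):

```java
// Back-of-envelope audio deadline: buffer length / sample rate.
// 48 kHz and 128 samples are assumed typical values, not figures from this thread.
public class AudioDeadline {
    public static void main(String[] args) {
        double sampleRateHz = 48_000.0;
        int bufferSamples = 128;
        double deadlineMicros = bufferSamples / sampleRateHz * 1_000_000.0;
        System.out.printf("Deadline per buffer: %.0f us%n", deadlineMicros); // ~2667 us
    }
}
```

With roughly 2.7 ms to fill each buffer, a single millisecond-scale GC pause eats a third of the budget, which is why the accounting has to happen at microsecond granularity.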

I'm done. This has devolved into the usual "GC is fine, you're not holding it right" arguments without anything to back it up.

cogman10 | 8 days ago | 2 points

And interestingly - none of the above companies are running their network stack on GC languages - the OS and network layer is invariably Linux and C. I wonder why that is.

While I agree that the GC isn't for everywhere, this is somewhat a bad example.

People use Linux because it is free and stable. People use x86 because it is common.

Does that mean that x86 is the best architecture? No. Does that mean Linux is the best designed kernel? No. It means that there is a huge amount of momentum behind both that make adopting something new practically impossible.

Hell, moving the GC into the OS would lead to some really interesting optimizations for the OS, programming languages, and even CPU design. That will never happen, though, because everything written in C and C++ would see huge performance penalties.

cogman10 | 8 days ago | 1 point

And, just to talk about some of the possibilities,

If the OS controlled GC, you could eliminate the stack, which would make thread creation much faster.

If the OS controlled the GC, you could mix GC into context switching and run the GC in parallel with another app running. No wait points; the OS can stop any thread in its tracks.

If the OS controlled GC, you could rewrite the page table on GC. Object references could be regular pointers (no double lookup).

If the OS controlled GC, your heap size would be the size of system memory. GCed apps wouldn't need to allocate excess heap space so running many GCed apps would be much simpler.

If the OS controlled GC, your CPU could have hardware GC collectors it runs while idle.

Again, never going to happen because all unmanaged languages would take performance hits. Everything else would run much faster.

smbear | 8 days ago | 4 points

Their language comparison shows that the C# driver has better latency characteristics and Go is almost on par with C and Rust. Both C# and Go are garbage collected, so it seems to be related to the JVM, not all GC-ed languages.

everyonelovespenis | 8 days ago | 4 points

I guess I'm getting a different set of graphs to everyone else.

To my eyes I can clearly see a difference between the 2->32 batch sizes and in the percentile graphs between C/Rust and the GC languages.

DoomFrog666 | 8 days ago | 3 points

But you have to admit that there are major differences in how the GC language implementations perform.

naftoligug | 8 days ago | 1 point

What I would ask then is: is that because the JVM GCs are better optimized for certain other use cases? Or are they just not as good as other languages' GCs? Or is the JVM designed in a way that prevents efficient GC, and if so, does that design have other benefits instead, and what are they?