JVM Performance

4 Jun 2020


Philosophy of traffic light

We can borrow the idea of a traffic light when we talk about optimizations.


Use a green/yellow/red strategy for application performance optimization. When you are just starting, the most useful are the "green" techniques: you can easily improve your metrics without hard work.

Metrics. We will discuss advanced performance approaches in the "Performance measurement" part, but let's start with a simple understanding of measurable values. The first one is throughput (operations per unit of time). Throughput is connected with response time (the time each operation takes from start to finish), but not directly. We can use batching to process multiple samples at once, but then the response time of each one will be different. Here we have the "low latency" vs "high throughput" problem: we can use batches to reach high throughput (like Kafka) or process each sample independently (like RabbitMQ). So in each case we must select a target metric: operations per second, response time, evaluation time, etc.
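To make the tradeoff concrete, here is a toy cost model in Java. It is only a sketch: it assumes every call pays a fixed overhead plus a per-item cost, and that an item's response arrives only when its whole batch finishes. The class name and the 9 ms / 1 ms numbers are hypothetical, not measurements.

```java
/** Toy cost model: each call pays a fixed overhead plus a per-item cost (assumed values). */
final class BatchingModel {

    /** Time in ms to process one batch; also the response time of every item in it. */
    static double batchTimeMs(int batchSize, double overheadMs, double perItemMs) {
        return overheadMs + batchSize * perItemMs;
    }

    /** Items completed per second when work is submitted in batches of the given size. */
    static double throughputPerSec(int batchSize, double overheadMs, double perItemMs) {
        return batchSize * 1000.0 / batchTimeMs(batchSize, overheadMs, perItemMs);
    }

    public static void main(String[] args) {
        double overheadMs = 9.0; // hypothetical fixed cost per call (network round trip, etc.)
        double perItemMs = 1.0;  // hypothetical cost per item
        for (int batch : new int[] {1, 10, 100}) {
            System.out.printf("batch=%d response=%.0f ms throughput=%.1f ops/s%n",
                    batch,
                    batchTimeMs(batch, overheadMs, perItemMs),
                    throughputPerSec(batch, overheadMs, perItemMs));
        }
    }
}
```

With these assumed numbers a batch of 100 gives roughly 9 times the throughput of single calls, while each item waits roughly 11 times longer for its response.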

After that we have to verify metric correctness. Does this metric reflect real production performance? Is this metric reproducible? What is the difference between independent runs without any code changes?

Application load. When we want to optimize performance we should use a load like the one in production. There are two main approaches: simulate it using load testing tools (like JMeter) or mirror traffic from production to a test instance. The second one is more complicated but the measurements will be more accurate.

First step

Better algorithms. You can clean up the code labelled "Technical debt". After that you can recall algorithmic complexity and (for example) replace your O(n) algorithm with an O(log n) one. You can remove excess customization from your code: do you really need to set this up after startup?
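As a small illustration of the O(n) to O(log n) replacement, the sketch below compares a linear scan with a binary search on a sorted array. The class and method names are made up for this example; Arrays.binarySearch is the standard JDK call and requires the array to be sorted.

```java
import java.util.Arrays;

final class SearchComparison {

    /** O(n): checks elements one by one until a match is found. */
    static int linearSearch(int[] values, int key) {
        for (int i = 0; i < values.length; i++) {
            if (values[i] == key) {
                return i;
            }
        }
        return -1;
    }

    /** O(log n): halves the search range on every step; the array must be sorted. */
    static int binarySearch(int[] sortedValues, int key) {
        int index = Arrays.binarySearch(sortedValues, key);
        return index >= 0 ? index : -1; // negative result means "not found"
    }

    public static void main(String[] args) {
        int[] data = {1, 3, 5, 7, 9, 11};
        System.out.println(linearSearch(data, 7));
        System.out.println(binarySearch(data, 7));
    }
}
```

The gain only shows up when the data is already sorted (or is searched often enough to pay for sorting once).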

Use code analysers. You can use the IntelliJ IDEA code analyser or SpotBugs to detect low-quality pieces of code. It may be easy to find something like a loop that joins strings without a StringBuilder.
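The string-joining case looks roughly like this. It is a minimal sketch with hypothetical method names: the first method is the pattern an analyser would typically flag, the second is the usual fix.

```java
import java.util.List;

final class JoinExample {

    /** Flagged pattern: every += builds a new String and copies the old content. */
    static String joinWithConcat(List<String> parts) {
        String result = "";
        for (String part : parts) {
            result += part;
        }
        return result;
    }

    /** Usual fix: append into one growing buffer and build the String once. */
    static String joinWithBuilder(List<String> parts) {
        StringBuilder builder = new StringBuilder();
        for (String part : parts) {
            builder.append(part);
        }
        return builder.toString();
    }
}
```

Both methods return the same result; the difference is only in how much copying happens as the list grows.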

Look at the database. When we use Hibernate not all generated queries are good. You can analyse this from both sides: in the application (log SQL queries and analyse their plans) or in the database (detect long-running queries and dig into their plans). You can try to use a cache (local or remote) where possible.
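A local cache can be as small as a map in front of the slow call. The sketch below is an assumption-heavy illustration: LocalCache is a hypothetical class, the loader stands in for a database query, and there is no eviction or invalidation, which a real cache would need.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Minimal in-process cache: loads a value once per key and then serves it from memory. */
final class LocalCache<K, V> {
    private final Map<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // stands in for the expensive call, e.g. a DB query

    LocalCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, calling the loader only on the first request for a key. */
    V get(K key) {
        return entries.computeIfAbsent(key, loader);
    }
}
```

Usage would look like `new LocalCache<Integer, String>(id -> loadUserName(id))`, where loadUserName is whatever query you want to avoid repeating (a hypothetical name here).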

Find bottleneck

Performance measurement

Performance model. We can discuss performance in terms of "startup time" + "running time". In the first phase the application prepares to run: it starts the Java VM, loads classes and constructs service objects (like beans); in the second phase it executes. You do not need to optimize the first phase for long-running applications. At runtime you have to minimize the time spent in GC, I/O wait, kernel wait and thread locks.

Remember about warm-up. The JVM uses JIT to compile hot methods to native code (and sometimes discards the compiled code again). So you have to put the application under load and wait before measuring performance.
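The idea of warming up before measuring can be sketched with a hand-rolled timer. This is only an illustration with hypothetical names and run counts; it does not protect against dead-code elimination and other JIT effects, so for real microbenchmarks a dedicated harness is the safer choice.

```java
final class WarmupTimer {

    /**
     * Runs the task warmupRuns times without measuring (to give the JIT a chance
     * to compile it), then returns the duration in nanoseconds of each measured run.
     */
    static long[] measure(Runnable task, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        long[] nanos = new long[measuredRuns];
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.run();
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    public static void main(String[] args) {
        // Hypothetical workload and run counts, chosen only for the example.
        long[] timings = measure(() -> Math.sqrt(System.nanoTime()), 10_000, 5);
        for (long t : timings) {
            System.out.println(t + " ns");
        }
    }
}
```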

Benchmarks (micro / macro / meso).
  • Micro benchmark: we try to optimize a small piece of code. For this we recommend JMH.
  • Macro benchmark: we try to optimize a user scenario. For this we recommend JMeter.
  • Meso benchmark: we try to optimize multiple large scenarios. For this we recommend distributed tracing tools like Zipkin.

Profiling.

  • Sampling profiler. Captures the state of the application (typically thread stack traces) at regular intervals and infers what was going on during that time.
  • Instrumented profiler. Injects instructions into the byte code and records all actions. This can slow the application down by 5 times or more.

Latency. We can use the data locality principle: share less data between layers. Every time we fetch data from the DB we spend time on external logic (parsing, fetching, resolving concurrency problems) and on the network. At the global level we can use a data center located close to your users (for US users in the US, for EU users in the EU, for Asian users in Asia, etc.). It is not only CDNs that can use this policy: you can place your own servers in the target region.

Performance measurement pitfalls. When we increase performance 5 times the improvement is easy to detect. But when we get a 5-10% improvement (or less) we run into benchmark sensitivity. In a perfect world we are limited only by clock accuracy. But in the real world we get multiple side effects like GC or OS background jobs. So we have to use the A/B test methodology: compare multiple runs with the changes and without them.

A_GROUP: [T1, ..., Tn]
B_GROUP: [t1, ..., tn]

We can use an A/B test calculator [?] for this: set visitors as the sample count and conversions as the evaluation time.
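If you prefer to compare the two groups in code, one common option (swapped in here instead of the calculator mentioned above) is Welch's t-statistic: the difference of the means divided by the combined standard error. The sketch below uses hypothetical class names and made-up timings.

```java
final class AbComparison {

    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    /** Unbiased sample variance (divides by n - 1); needs at least two samples. */
    static double sampleVariance(double[] xs) {
        double m = mean(xs);
        double squares = 0;
        for (double x : xs) {
            squares += (x - m) * (x - m);
        }
        return squares / (xs.length - 1);
    }

    /** Welch's t: difference of means relative to the run-to-run noise of both groups. */
    static double welchT(double[] a, double[] b) {
        double standardError = Math.sqrt(sampleVariance(a) / a.length + sampleVariance(b) / b.length);
        return (mean(a) - mean(b)) / standardError;
    }

    public static void main(String[] args) {
        double[] withoutChange = {100, 102, 98, 101, 99}; // made-up timings in ms
        double[] withChange = {95, 97, 93, 96, 94};       // made-up timings in ms
        System.out.println(welchT(withoutChange, withChange)); // 5.0 for these numbers
    }
}
```

A larger |t| means the difference is large relative to the noise between runs. Turning it into a p-value needs the t-distribution, which is left out of this sketch.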

Performance tip: CPU utilization.
  • Low utilization
    • Look at disk / network (they may have high utilization)
    • Look at locks
    • Look at OS resources
  • High utilization (some cores)
    • Look at locks
    • Look at kernel calls
  • High utilization (all cores)
    • Look at architecture
    • Look at API usage
    • Look at frequently used methods
    • Look at GC configuration
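For the "look at locks" items, a rough first check is to count how many threads are in each state. The sketch below uses only standard Thread APIs; the class name is hypothetical, and a single snapshot is just a hint, so in practice you would take several or use a thread dump.

```java
import java.util.EnumMap;
import java.util.Map;

final class ThreadStateSnapshot {

    /** Counts live threads by state; many BLOCKED threads hint at lock contention. */
    static Map<Thread.State, Integer> countStates() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread thread : Thread.getAllStackTraces().keySet()) {
            counts.merge(thread.getState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        countStates().forEach((state, count) -> System.out.println(state + ": " + count));
    }
}
```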

Use right Databases

Compare your needs and target load. Be careful when you try a NoSQL solution. Scalability and flexibility are expensive. The most common problem is managing relations between entities. You should always keep the denormalization technique in mind.

Main aspect of any storage system is:

Optimize your service layers. In JPA you can use the second-level cache, or skip it and cache at the service layer using Spring Cache.

Denormalization. You can store an aggregate instead of evaluating it every time. For example, you can save the total article count instead of running COUNT(*) all the time. Or you can avoid multiple joins by duplicating data.
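The article-count example can be mimicked in plain Java to show the tradeoff. This is an in-memory sketch with hypothetical names, not database code: the scan plays the role of COUNT(*), and the extra map is the denormalized counter that must be updated on every write.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ArticleStore {
    private final List<String[]> articles = new ArrayList<>();           // rows: {author, title}
    private final Map<String, Integer> countByAuthor = new HashMap<>(); // denormalized aggregate

    /** Every write now has two steps: store the row and keep the counter in sync. */
    void add(String author, String title) {
        articles.add(new String[] {author, title});
        countByAuthor.merge(author, 1, Integer::sum);
    }

    /** Equivalent of COUNT(*): cost grows with the number of rows. */
    int countByScan(String author) {
        int count = 0;
        for (String[] row : articles) {
            if (row[0].equals(author)) {
                count++;
            }
        }
        return count;
    }

    /** Reads the stored aggregate: constant cost, paid for on every write. */
    int countDenormalized(String author) {
        return countByAuthor.getOrDefault(author, 0);
    }
}
```

Reads get cheaper, but deletes and updates would also have to maintain the counter, which is where denormalized data tends to drift.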

Garbage collection

The JVM manages heap memory itself (compared to C++ where it is managed by the user). So the main objective of GC is to find unused objects in memory and remove them, compacting the holes left behind. There are four main GC algorithms: the serial collector, the throughput collector, the concurrent collector and the G1 collector. We have a tradeoff between memory and CPU consumption, limited by the CPU core count.

GC removes an object from the heap when it is no longer reachable from GC roots such as references on the runtime stack. The main hypothesis of GC is the weak generational hypothesis: most objects are short-lived and old objects rarely reference new ones. So we can split memory into generations: young and old. And we can collect objects in the young generation separately from the old one. Promotion from young to old is determined by an age threshold.

At the same time we can split the young generation into sub-regions. Eden: new objects are allocated here. And two survivor spaces: S0 and S1. When there is not enough space in eden we start a minor collection: live objects from eden and from the currently used survivor space are copied to the other survivor space, and objects that are old enough are moved to the old generation. So in a survivor space we find objects that have survived at least one GC.
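You can see these spaces in a running JVM through the standard management API. The sketch below lists the heap memory pools; the class name is hypothetical, and the pool names you get depend on the collector in use (for example they may contain "Eden Space", "Survivor Space" and "Old Gen" with a collector-specific prefix).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.util.ArrayList;
import java.util.List;

final class HeapPools {

    /** Returns the names of the heap memory pools reported by the running JVM. */
    static List<String> heapPoolNames() {
        List<String> names = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                names.add(pool.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        heapPoolNames().forEach(System.out::println);
    }
}
```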

GC Performance metrics

Algorithms

Serial GC

This collector is the default for single-processor machines. It uses a single thread and stops all application threads (GC pause) during both minor and full GC.

Throughput collector

This collector is the default for multi-CPU machines with a 64-bit JVM (in Java 8; since Java 9 the default is G1). It uses multiple threads to collect the young generation. The throughput collector stops all application threads during both minor and full GC.

CMS collector

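This collector is designed to eliminate the long pauses of full GC cycles by doing most of the old generation work concurrently with the application. It still stops all application threads during a minor GC, which is also performed with multiple threads.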

G1 collector

This collector is designed to process large heaps (4 GB and more) with minimal pauses. It divides the heap into regions. G1 cleans up objects in the old generation by copying them from one region into another.

Tip: you can request a full GC manually using System.gc() when you know it is the right time to wait.
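To check which collectors your JVM actually runs and how often they fire, you can read the standard GC management beans. This is a minimal sketch with a hypothetical class name; the bean names and counts differ between collectors, and System.gc() is only a request (it is ignored when the JVM runs with -XX:+DisableExplicitGC).

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcStats {

    /** Sums the collection counts of all collectors; undefined counts (-1) are skipped. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                    + " collections, " + gc.getCollectionTime() + " ms");
        }
        long before = totalCollections();
        System.gc(); // a request, the JVM may ignore it
        System.out.println("collections before=" + before + " after=" + totalCollections());
    }
}
```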

Tuning

Heap size. We can control the heap size using -Xms{N} (initial) and -Xmx{N} (max).

Generation sizes. We can control the ratio of the old to the young generation with -XX:NewRatio=N (N = old / young), and the young generation size with -XX:NewSize=N (initial) and -XX:MaxNewSize=N (max).
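After changing these flags it is worth checking what the JVM actually picked up. The sketch below reads the effective maximum heap and the startup arguments through standard APIs; the class name and the flag values in the comment are hypothetical examples.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

final class HeapSettings {

    /** Maximum heap the JVM will try to use, in bytes (reflects -Xmx when it is set). */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** JVM arguments passed at startup, e.g. -Xms, -Xmx or -XX:NewRatio if present. */
    static List<String> jvmArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    public static void main(String[] args) {
        // Example launch with hypothetical values:
        //   java -Xms512m -Xmx2g -XX:NewRatio=2 HeapSettings
        System.out.println("max heap: " + maxHeapBytes() / (1024 * 1024) + " MB");
        System.out.println("arguments: " + jvmArguments());
    }
}
```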

Metaspace. The JVM loads classes and saves their metadata into the metaspace (Java 8; in Java 7 it was the permanent generation), which is kept separate from the regular heap (the metaspace lives in native memory). This information contains only JVM-specific metadata (not the classes themselves). We can prevent expensive resizing of the metaspace by using a large initial metaspace size (-XX:MetaspaceSize=N).

JIT Compiler

Java distributes programs as Java byte code, compared to C++, which distributes binaries built for a specific platform, and to Python or PHP, which distribute their sources as is. But HotSpot (the Oracle JVM implementation) can dynamically compile byte code to native code when methods are frequently used. At the same time the JVM can apply optimizations on the fly.

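Java client/server compiler. We can choose the compiler with java -client or java -server, or use java -server -XX:+TieredCompilation to start with the client compiler and recompile hot code with the server compiler (tiered compilation is the default since Java 8). This choice affects startup time, so it matters when you have a GUI application or a short-running application.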

Native code compiled from Java byte code is stored in the "code cache". You can tune its maximum size with -XX:ReservedCodeCacheSize=N. You can observe the compilation process by turning on the -XX:+PrintCompilation flag. Another trick is inlining: it is enabled by default (disable it with -XX:-Inline), tuned with -XX:MaxInlineSize=N and inspected with -XX:+PrintInlining (which also needs -XX:+UnlockDiagnosticVMOptions).
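A tiny program with one hot method is enough to watch these flags at work. The class below is a hypothetical example: twice() is small and called in a tight loop, so it is a likely candidate for compilation and inlining, although the exact decisions and log format depend on the JVM version.

```java
final class HotLoop {

    /** Small method called very often: a typical inlining candidate. */
    static int twice(int x) {
        return x * 2;
    }

    /** Calls twice() n times so the JIT sees it as hot. */
    static long sumOfDoubles(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += twice(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Try:  java -XX:+PrintCompilation HotLoop
        // or:   java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining HotLoop
        // and look for lines that mention HotLoop in the output.
        System.out.println(sumOfDoubles(1_000_000));
    }
}
```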

Native memory

We can allocate native memory via JNI calls or with NIO byte buffers (allocateDirect()). The most common case is buffers for filesystem and socket operations, where writing to an NIO buffer and sending the data to the channel requires no copying between the JVM and C libraries. You can also allocate memory outside GC management. It is useful for large (16 GB+) allocations where GC is not your best friend. Software for big data processing like Spark or Ignite likes this trick.
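A minimal sketch of a direct (off-heap) buffer: it writes a value into native memory and reads it back. The class name is hypothetical; ByteBuffer.allocateDirect, flip and the get/put methods are standard NIO calls.

```java
import java.nio.ByteBuffer;

final class DirectBufferExample {

    /** Writes a long into a direct buffer (native memory) and reads it back. */
    static long roundTrip(long value) {
        ByteBuffer buffer = ByteBuffer.allocateDirect(Long.BYTES); // allocated outside the Java heap
        buffer.putLong(value);
        buffer.flip(); // switch from writing to reading
        return buffer.getLong();
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocateDirect(1024);
        System.out.println("direct: " + buffer.isDirect());
        System.out.println(roundTrip(42L));
    }
}
```

The buffer object itself is a small heap object; only the backing memory is native, and it is released some time after that object becomes unreachable.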


This article was inspired by Aleksey Shipilëv's talks and book "Java Performance: The Definitive Guide".

Author @mrkandreev

Machine Learning Engineer