JVM Performance

4 Jun 2020

The traffic light philosophy

We can use a traffic light as a metaphor when we talk about optimizations.

Green zone - use a profiler and rewrite ugly code (problems are easy to find, measurement accuracy is secondary, and performance improvements are easy to detect).
Yellow zone - use a profiler and write benchmarks (problems are harder to find, and measurement accuracy comes first).
Red zone - hack the JVM.

First step

Better algorithms. Start by cleaning up the code you have labeled as "Technical debt". After that, think about algorithmic complexity and (for example) replace an O(n) algorithm with an O(log n) one. You can also remove excess customization from your code: do you really need to configure this after startup?
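As an illustration of replacing O(n) with O(log n), here is a minimal sketch that looks up a value in a sorted array first with a linear scan and then with the standard library binary search. The class and method names are hypothetical, and the approach only applies when the data is already sorted.

```java
import java.util.Arrays;

class SearchComparison {
    // O(n): checks elements one by one until the target is found.
    static int linearSearch(int[] sorted, int target) {
        for (int i = 0; i < sorted.length; i++) {
            if (sorted[i] == target) {
                return i;
            }
        }
        return -1;
    }

    // O(log n): halves the search range on every step; requires sorted input.
    static int binarySearch(int[] sorted, int target) {
        int index = Arrays.binarySearch(sorted, target);
        return index >= 0 ? index : -1;
    }

    public static void main(String[] args) {
        int[] data = {1, 3, 5, 7, 9, 11};
        System.out.println(linearSearch(data, 7));
        System.out.println(binarySearch(data, 7));
    }
}
```

Both methods return the same index for the same input; the difference only shows up in the number of comparisons on large arrays.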

Use code analysers. You can use the IntelliJ IDEA code analyser or SpotBugs to detect low-quality pieces of code. They make it easy to find things like a loop that joins strings without a StringBuilder.
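The string-joining problem mentioned above can be sketched like this. The class name JoinExample is hypothetical; the point is only the difference between the two loops.

```java
import java.util.Arrays;
import java.util.List;

class JoinExample {
    // Each += builds a new String and copies everything accumulated so far,
    // so the total amount of copying grows quadratically with the input size.
    static String joinSlow(List<String> parts) {
        String result = "";
        for (String part : parts) {
            result += part;
        }
        return result;
    }

    // StringBuilder appends into one growing buffer and copies once at the end.
    static String joinFast(List<String> parts) {
        StringBuilder builder = new StringBuilder();
        for (String part : parts) {
            builder.append(part);
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        List<String> parts = Arrays.asList("a", "b", "c");
        System.out.println(joinSlow(parts).equals(joinFast(parts)));
    }
}
```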

Look at the database. When we use Hibernate, not all generated queries are good. You can analyse them from both sides: in the application (log the SQL queries and analyse their plans) or in the database (detect long-running queries and dig into their plans). Use a cache (local or remote) where possible.
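A local cache can be as small as a map in front of the expensive call. The sketch below is an assumption-laden illustration, not a production cache: LocalCache and its loader function are hypothetical names, and there is no eviction, expiry or invalidation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class LocalCache<K, V> {
    private final Map<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // stands in for a database query
    private final AtomicInteger loads = new AtomicInteger();

    LocalCache(Function<K, V> loader) {
        this.loader = loader;
    }

    // Calls the loader only on the first request for a key.
    V get(K key) {
        return entries.computeIfAbsent(key, k -> {
            loads.incrementAndGet();
            return loader.apply(k);
        });
    }

    // How many times the loader was actually called.
    int loadCount() {
        return loads.get();
    }

    public static void main(String[] args) {
        LocalCache<Integer, String> cache = new LocalCache<>(id -> "user-" + id);
        cache.get(1);
        cache.get(1);
        System.out.println(cache.loadCount());
    }
}
```

If the underlying data can change, a real setup also needs an invalidation strategy, which this sketch leaves out on purpose.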

Find the bottleneck

sys% - threading
irq% - offload interrupts
idle% - find where we wait
iowait% - io optimizations
usr% - use profiler...

Performance measurement

Performance model. We can discuss performance in terms of "startup time" + "running time". In the first phase the application prepares to run: it starts the Java VM, loads classes and constructs service objects (like beans); in the second phase it executes. For long-running applications you do not need to optimize the first phase. At runtime you have to minimize the time spent on GC, I/O waits, kernel waits and thread locks.

Remember about warm-up. The JVM uses a JIT compiler to compile hot methods to native code (and sometimes discards the compiled code again). So you have to put the application under load and wait before you measure performance.
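The effect of warm-up can be observed with a naive timing loop. This is a rough sketch with hypothetical names and arbitrary iteration counts; it is not a reliable benchmark (that is what JMH is for), and the actual timings depend on the machine and JVM.

```java
class WarmUpExample {
    // Some CPU-bound work for the JIT compiler to optimize.
    static long work(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += (long) i * i % 7;
        }
        return sum;
    }

    // Runs work(n) the given number of times and returns the elapsed nanoseconds.
    static long timeNanos(int iterations, int n) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += work(n);
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 42) {
            System.out.println(sink); // keeps the result live so the loop is not removed
        }
        return elapsed;
    }

    public static void main(String[] args) {
        long cold = timeNanos(100, 10_000);   // includes interpretation and compilation
        timeNanos(20_000, 10_000);            // warm-up phase, result ignored
        long warm = timeNanos(100, 10_000);   // measured after warm-up
        System.out.println("cold: " + cold + " ns, warm: " + warm + " ns");
    }
}
```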

Benchmarks (micro / macro / meso).

Micro benchmark - we try to optimize a small piece of code. For this we recommend JMH.
Macro benchmark - we try to optimize a user scenario. For this we recommend JMeter.
Meso benchmark - we try to optimize multiple large scenarios. For this we recommend distributed tracing tools like Zipkin.

Profiling.

Sampling profiler. Looks at the state of the application (typically the thread stacks) at a fixed interval and infers from these samples what was going on in between.
Instrumenting profiler. Injects instructions into the byte code and records every action. This can slow the application down by 5x or more.
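To make the sampling idea concrete, here is a toy sampler that periodically reads the stack of one thread and counts which method is on top. TinySampler is a hypothetical name and this is only a sketch of the principle; real profilers are far more careful about overhead and about where samples can be taken.

```java
import java.util.HashMap;
import java.util.Map;

class TinySampler {
    // Takes `samples` stack snapshots of `target`, `intervalMillis` apart,
    // and counts which method was on top of the stack each time.
    static Map<String, Integer> sample(Thread target, int samples, long intervalMillis) {
        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            StackTraceElement[] stack = target.getStackTrace();
            String top = stack.length == 0
                    ? "<no frames>"
                    : stack[0].getClassName() + "." + stack[0].getMethodName();
            hits.merge(top, 1, Integer::sum);
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break; // stop sampling if the sampler itself is interrupted
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Thread worker = new Thread(() -> {
            double x = 0;
            while (!Thread.currentThread().isInterrupted()) {
                x += Math.sqrt(x + 1);
            }
        });
        worker.setDaemon(true);
        worker.start();
        System.out.println(sample(worker, 20, 10));
        worker.interrupt();
    }
}
```

The more samples land in a method, the more time the thread probably spent there; with few samples the picture is only statistical.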

Latency. We can use the data locality principle: share less data between layers. Every time we fetch data from the db we spend time on external logic (parsing / fetching / resolving concurrency problems) and on the network. At the global level we can use a data center located near the users (for US users in the US, for EU users in the EU, for Asian users in Asia, etc.). This policy is not only for CDNs: you can place your own servers in the target region.

Performance measurement pitfalls. When we try to increase performance 5 times, improvements are easy to detect. But when we get improvements of 5-10% (or less), we run into benchmark sensitivity. In a perfect world we would be limited only by clock accuracy. In the real world we get multiple side effects, like GC or OS background jobs. So we have to use the A/B test methodology: compare multiple runs with the change and without it.

A_GROUP: [T1,..., Tn]
B_GROUP: [t1, ..., tn]

We can use an A/B test calculator [?] for this: set visitors to the sample count and conversions to the evaluation time.
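If no calculator is at hand, the two groups of run times can also be compared directly. The sketch below computes Welch's t statistic, which is a swapped-in technique and not the calculator mentioned above; RunComparison and the sample numbers are hypothetical.

```java
class RunComparison {
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Unbiased sample variance (divides by n - 1), so it needs at least two runs.
    static double variance(double[] xs) {
        double m = mean(xs);
        double squares = 0;
        for (double x : xs) {
            squares += (x - m) * (x - m);
        }
        return squares / (xs.length - 1);
    }

    // Welch's t: difference of the means measured in units of its standard error.
    static double welchT(double[] a, double[] b) {
        double standardError = Math.sqrt(variance(a) / a.length + variance(b) / b.length);
        return (mean(a) - mean(b)) / standardError;
    }

    public static void main(String[] args) {
        double[] withoutChange = {10, 12, 14}; // A_GROUP run times, made-up values
        double[] withChange = {7, 9, 11};      // B_GROUP run times, made-up values
        System.out.println(welchT(withoutChange, withChange));
    }
}
```

A larger absolute value means the difference is less likely to be noise. Turning the statistic into a p-value requires the t distribution with the appropriate degrees of freedom, which is left out here.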

Performance tip: CPU utilization.

Low utilization
- Look at disk / network (they may have high utilization)
- Look at locks
- Look at OS resources
High utilization (some cores)
- Look at locks
- Look at kernel calls
High utilization (all cores)
- Look at architecture
- Look at API usage
- Look at frequently used methods
- Look at GC configuration
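For the "look at locks" items, a quick first check is to count how many threads are currently blocked or waiting. This is a minimal sketch with a hypothetical class name; a thread dump or a profiler gives much more detail.

```java
import java.util.EnumMap;
import java.util.Map;

class ThreadStateSnapshot {
    // Counts the live threads of this JVM by their current state.
    static Map<Thread.State, Integer> countStates() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread thread : Thread.getAllStackTraces().keySet()) {
            counts.merge(thread.getState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Many BLOCKED threads together with low CPU utilization hint at lock contention.
        System.out.println(countStates());
    }
}
```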

Use the right database

Compare your needs and your target load. Be careful when you try a NoSQL solution: scalability and flexibility are expensive. The most common problem is managing relations between entities, so you should always keep the denormalization technique in mind.

The main aspects of any storage system are:

Read pattern: index at the file level, index at the record level, secondary index, reverse indexing, batch operation, random access.
Write pattern: single record write, batch write, mutation (?)
Partitioning: centralized, range, hash
Mutation: append only, file versus records, record size, mutation latency.
Availability vs consistency
Use case: large scans, random access to data, cubing, time series, high mutability.

Optimize your service layers. In JPA you can use the second-level cache, or avoid hitting JPA at all by caching at the service level with Spring Cache.

Denormalization. You can store an aggregate instead of evaluating it every time. For example, you can save the total article count instead of running COUNT(*) all the time. Or you can avoid multiple joins by duplicating data.
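The stored-counter idea can be shown with an in-memory stand-in for a table. ArticleStore is a hypothetical class; in a real database the counter would live in its own column or row and be updated in the same transaction as the write.

```java
import java.util.ArrayList;
import java.util.List;

class ArticleStore {
    private final List<String> articles = new ArrayList<>();
    private long totalCount; // denormalized aggregate, maintained on every write

    void add(String title) {
        articles.add(title);
        totalCount++;
    }

    boolean remove(String title) {
        boolean removed = articles.remove(title);
        if (removed) {
            totalCount--;
        }
        return removed;
    }

    // Constant-time read of the stored aggregate.
    long count() {
        return totalCount;
    }

    // Stands in for COUNT(*): walks over all rows every time.
    long recount() {
        return articles.stream().count();
    }

    public static void main(String[] args) {
        ArticleStore store = new ArticleStore();
        store.add("JVM Performance");
        store.add("Garbage collection");
        System.out.println(store.count() == store.recount());
    }
}
```

The price of the fast read is that every write path has to keep the counter consistent.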

Garbage collection

The JVM manages heap memory itself (in contrast to C++, where memory is managed by the user). So the main task of the GC is to find unused objects in memory and remove them, compacting the heap to fill the holes. There are four main GC algorithms: the serial collector, the throughput collector, the concurrent collector and the G1 collector. There is a tradeoff between memory and CPU consumption, limited by the number of CPU cores.

The GC removes objects from the heap when they are no longer reachable from GC roots such as the runtime stack. The main hypothesis behind GC is the weak generational hypothesis: most objects are short-lived, and old objects rarely reference new ones. So we can split memory into generations, young and old, and collect objects in the young generation separately from the old one. An object moves from young to old once it reaches a certain age.

At the same time we can split the young generation into sub-regions: eden, where new objects are allocated, and two survivor spaces, s0 and s1. When there is not enough space in eden, a minor collection starts: live objects from eden and from the occupied survivor space are copied to the empty survivor space, and objects that are old enough are promoted to the old generation. So the survivor spaces hold objects that have survived at least one GC.

GC Performance metrics

Throughput
Predictability (latency)
Footprint (memory usage)
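A rough view of the throughput metric is available from inside the JVM through the standard management beans. In this sketch GcStats is a hypothetical name, and the overhead figure is only an approximation, because the reported collection time is not pure pause time for concurrent collectors.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcStats {
    // Sum of the accumulated collection time reported by every collector, in milliseconds.
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime(); // -1 when the collector does not report it
            if (time > 0) {
                total += time;
            }
        }
        return total;
    }

    // Approximate share of JVM uptime spent in GC; throughput is roughly 1 - overhead.
    static double gcOverhead() {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptime <= 0 ? 0.0 : (double) totalGcTimeMillis() / uptime;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.printf("GC overhead: %.4f%n", gcOverhead());
    }
}
```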

Algorithms

Serial GC

This collector is the default for single-processor machines. It uses a single thread and stops all application threads (a GC pause) during both minor and full GC.

Throughput collector

This collector is the default for multi-CPU machines with a 64-bit JVM (in Java 8; since Java 9 the default is G1). It uses multiple threads to collect the young generation, and it stops all application threads during both minor and full GC.

CMS collector

This collector eliminates the long pauses of full GC cycles. It stops all application threads during a minor GC, which it also performs with multiple threads.

G1 collector

This collector is designed to process large heaps (4 GB and more) with minimal pauses. It divides the heap into regions. G1 cleans up objects from the old generation by copying them from one region into another.

Tip: you can request a full GC manually with System.gc() when you know it is the right time to wait. The call is only a request, and the JVM may ignore it.

Tuning

Heap size. We can control heap size using -Xms{N} (initial) and -Xmx{N} (max).
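To check which heap limits the running JVM actually ended up with, the values can be read from Runtime. HeapInfo is a hypothetical helper name; the numbers depend on the flags and on the machine.

```java
class HeapInfo {
    // Upper bound the heap may grow to (corresponds to -Xmx).
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    // Heap currently reserved by the JVM (starts around -Xms and can grow).
    static long committedHeapBytes() {
        return Runtime.getRuntime().totalMemory();
    }

    // Part of the committed heap that is occupied right now (includes garbage not yet collected).
    static long usedHeapBytes() {
        Runtime runtime = Runtime.getRuntime();
        return runtime.totalMemory() - runtime.freeMemory();
    }

    public static void main(String[] args) {
        long mb = 1024 * 1024;
        System.out.println("max: " + maxHeapBytes() / mb + " MB");
        System.out.println("committed: " + committedHeapBytes() / mb + " MB");
        System.out.println("used: " + usedHeapBytes() / mb + " MB");
    }
}
```

Running it as java -Xms64m -Xmx256m HeapInfo should show the effect of the two flags.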

Generation sizes. We can control the ratio between the old and young generations with -XX:NewRatio=N (old size / young size), and the young generation size with -XX:NewSize=N (initial) and -XX:MaxNewSize=N (max).

Metaspace. The JVM loads classes and saves their metadata into metaspace (Java 8; in Java 7 it was the permanent generation), a memory area separate from the main heap. It contains only JVM-specific metadata, not the class objects themselves. We can avoid expensive metaspace resizing by setting a large initial metaspace size (-XX:MetaspaceSize=N).
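The memory areas of a running JVM, including metaspace on HotSpot, can be listed through the standard memory pool beans. MemoryPools is a hypothetical name, and the pool names in the output depend on the JVM and on the selected collector.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

class MemoryPools {
    // One line per memory pool: type (heap or non-heap), name and current usage.
    static List<String> describePools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // may be null if the pool is no longer valid
            long usedKb = usage == null ? -1 : usage.getUsed() / 1024;
            lines.add(pool.getType() + " | " + pool.getName() + " | used " + usedKb + " KB");
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describePools()) {
            System.out.println(line);
        }
    }
}
```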

JIT Compiler

Java programs are distributed as Java byte code, in contrast to C++ programs, which are distributed as binaries for a specific platform, and to Python or PHP programs, which are distributed as source code. But HotSpot (the Oracle JVM) can dynamically compile byte code to native code when methods are used frequently. At the same time the JVM can apply optimizations on the fly.

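Java Client/Server compiler. HotSpot has a client compiler (-client) and a server compiler (-server). With java -server -XX:+TieredCompilation we enable tiered compilation, which starts with the client compiler and switches to the server compiler for hot code (in Java 8 it is enabled by default). This choice affects startup time, so it matters when you have a GUI application or a short-running application.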

Compiled native code is stored in the "code cache". You can tune its maximum size with -XX:ReservedCodeCacheSize=. You can watch the compilation process by turning on the -XX:+PrintCompilation flag. Another trick is inlining, which is enabled by default (disable it with -XX:-Inline) and is controlled by -XX:MaxInlineSize= and reported by -XX:+PrintInlining (a diagnostic flag that also requires -XX:+UnlockDiagnosticVMOptions).
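A small program with one obviously hot method is enough to see these flags at work. HotLoop is a hypothetical class; the exact compilation log differs between JVM versions, so no expected output is shown.

```java
class HotLoop {
    // Tiny method: a typical inlining candidate.
    static long square(long x) {
        return x * x;
    }

    // Called many times from main, so the JIT should compile it.
    static long sumOfSquares(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += square(i);
        }
        return total;
    }

    public static void main(String[] args) {
        long checksum = 0;
        for (int round = 0; round < 20_000; round++) {
            checksum += sumOfSquares(1_000);
        }
        System.out.println(checksum);
    }
}
```

After compiling with javac, run it as java -XX:+PrintCompilation HotLoop and look for HotLoop::sumOfSquares in the log; adding -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining should also show the inlining decisions for square.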

Native memory

We can allocate native memory via JNI calls or with NIO byte buffers (allocateDirect()). The most common case is buffers for filesystem and socket operations, where writing to an NIO buffer and sending the data to a channel requires no copying between the JVM and C libraries. You can also allocate memory outside of GC management. This is useful for large (16 GB+) allocations, where the GC is not your best friend. Big data software like Spark or Ignite uses this trick.
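A minimal sketch of the NIO route: allocate a direct buffer, write a value and read it back. DirectBufferExample is a hypothetical name; real code would usually reuse the buffer and pass it to a channel instead of reading it back itself.

```java
import java.nio.ByteBuffer;

class DirectBufferExample {
    // Writes a long into native (off-heap) memory and reads it back.
    static long roundTrip(long value) {
        ByteBuffer buffer = ByteBuffer.allocateDirect(Long.BYTES);
        buffer.putLong(value);
        buffer.flip(); // switch from writing to reading
        return buffer.getLong();
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocateDirect(1024);
        System.out.println("direct: " + buffer.isDirect() + ", capacity: " + buffer.capacity());
        System.out.println(roundTrip(42L));
    }
}
```

The buffer object itself is a small heap object, while the bytes it points to live outside the Java heap.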

This article was inspired by Aleksey Shipilëv's talks and book "Java Performance: The Definitive Guide".

Author @mrkandreev

Machine Learning Engineer