The traffic light is a useful metaphor when we talk about optimizations.
Use a green/yellow/red strategy for application performance optimization. When you are just starting, the "green" techniques are the most useful: they improve your metrics without much hard work.
Metrics. We will discuss advanced approaches in the "Performance measurement" part, but let's start with a simple understanding of measurable values. The first one is throughput (operations per unit of time). Throughput is related to response time (the time each operation takes from start to finish), but not directly. We can use batching to process multiple samples at once, but then the response time of each sample will differ. This is the "low latency" vs "high throughput" problem: we can use batches to reach high throughput (like Kafka) or process each sample independently (like RabbitMQ). So in each case we have to select a target metric: operations per second, response time, evaluation time, etc.
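To make the two metrics concrete, here is a minimal sketch that computes throughput from an operation count and elapsed time, and a response-time percentile from recorded per-operation latencies. The class and method names (`MetricsSketch`, `throughputPerSecond`, `percentile`) are hypothetical, and the nearest-rank percentile is just one common convention.

```java
import java.util.Arrays;

class MetricsSketch {
    /** Throughput: completed operations per second of wall-clock time. */
    static double throughputPerSecond(long operations, long elapsedNanos) {
        return operations * 1_000_000_000.0 / elapsedNanos;
    }

    /** Response-time percentile (nearest-rank method), p in (0, 100]. */
    static long percentile(long[] latenciesNanos, double p) {
        long[] sorted = latenciesNanos.clone();
        Arrays.sort(sorted);
        int index = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(0, index)];
    }
}
```

Two systems can report the same throughput while one of them has a much worse 99th percentile, which is why the target metric has to be chosen explicitly.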
After that we have to verify that the metric is correct. Does this metric reflect real production performance? Is it reproducible? How much does it differ between independent runs without any code changes?
Application load. When we want to optimize performance we should use a load similar to production. There are two main approaches: simulate the load with load testing tools (like JMeter) or mirror traffic from production to a test instance. The second one is more complicated, but the measurements will be more accurate.
Better algorithms. You can clean up the code labeled "technical debt". After that, think about algorithmic complexity and, for example, replace your O(n) algorithm with an O(log n) one. You can also remove excess customization from your code: do you really need to set this up after startup time?
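As an illustration of the O(n) to O(log n) replacement, the sketch below looks up a value in a sorted array first with a linear scan and then with binary search. `SearchSketch` and its method names are hypothetical; the binary search relies on the array being sorted.

```java
import java.util.Arrays;

class SearchSketch {
    /** O(n): checks every element until the key is found. */
    static int linearSearch(int[] values, int key) {
        for (int i = 0; i < values.length; i++) {
            if (values[i] == key) {
                return i;
            }
        }
        return -1;
    }

    /** O(log n): halves the search range each step; requires a sorted array. */
    static int binarySearch(int[] sortedValues, int key) {
        int index = Arrays.binarySearch(sortedValues, key);
        return index >= 0 ? index : -1;
    }
}
```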
Use code analysers. You can use the IntelliJ IDEA code analyser or SpotBugs to detect low-quality pieces of code. It may be easy to find something like a loop that joins strings without a StringBuilder.
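The string-join pattern mentioned above might look like the following sketch. `JoinSketch` is a hypothetical name. The first method creates a new String on every iteration and copies everything accumulated so far, while the second appends into one growing buffer.

```java
class JoinSketch {
    /** Flagged pattern: each += allocates a new String and copies the old content. */
    static String joinWithConcatenation(String[] parts) {
        String result = "";
        for (String part : parts) {
            result += part;
        }
        return result;
    }

    /** Preferred: a single StringBuilder accumulates all parts. */
    static String joinWithBuilder(String[] parts) {
        StringBuilder builder = new StringBuilder();
        for (String part : parts) {
            builder.append(part);
        }
        return builder.toString();
    }
}
```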
Look at the database. When we use Hibernate, not all generated queries are good. You can analyse this from both sides: in the application (log SQL queries and analyse their plans) or in the database (detect long-running queries and dig into their plans). Use a cache (local or remote) where possible.
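A local cache in front of a slow query can be sketched with plain JDK classes. This is only an illustration under assumptions: `LruCacheSketch` and the `loader` function (standing in for the database call) are hypothetical, and a production setup would more likely use an existing cache library.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class LruCacheSketch<K, V> {
    private final Map<K, V> entries;
    private final Function<K, V> loader; // hypothetical stand-in for a slow database query

    LruCacheSketch(int capacity, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder = true keeps the least recently used entry first
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns the cached value, loading and storing it on a miss. */
    synchronized V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key);
            entries.put(key, value);
        }
        return value;
    }
}
```

The cache is bounded so that it cannot grow without limit; choosing the capacity and an invalidation policy is the hard part and depends on the data.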
Performance model. We can discuss performance in terms of "startup time" + "running time". In the first step the application prepares to start: it starts the Java VM, loads classes and constructs service objects (like beans). In the second step it executes. You do not need to optimize the first step for long-running applications. At runtime you have to minimize the time spent in GC, I/O wait, kernel wait and thread locks.
Remember about warm-up. The JVM uses JIT to compile hot methods to native code (and sometimes discards that compiled code again). So you have to put the application under load and wait before you measure performance.
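A hand-rolled measurement loop with a warm-up phase could look like the sketch below. The names (`WarmUpSketch`, `averageNanos`) and the iteration counts are assumptions for illustration. For real benchmarks a dedicated harness such as JMH handles warm-up and dead-code elimination far more carefully.

```java
import java.util.function.LongSupplier;

class WarmUpSketch {
    /** Runs the task unmeasured first, then returns the average time per measured call. */
    static long averageNanos(LongSupplier task, int warmUpIterations, int measuredIterations) {
        long sink = 0; // accumulate results so the JIT cannot drop the calls entirely
        for (int i = 0; i < warmUpIterations; i++) {
            sink += task.getAsLong(); // warm-up: gives the JIT a chance to compile hot code
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredIterations; i++) {
            sink += task.getAsLong();
        }
        long elapsed = System.nanoTime() - start;
        if (sink == Long.MIN_VALUE) {
            System.out.println(sink); // practically never runs; keeps sink observable
        }
        return elapsed / measuredIterations;
    }
}
```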
Profiling.
Performance measurement pitfalls. When we try to increase performance 5 times, improvements are easy to detect. But when we get improvements of 5-10% (or less), we run into benchmark sensitivity. In a perfect world we would be limited only by clock accuracy. In the real world we get multiple side effects like GC or OS background jobs. So we have to use the A/B test methodology: compare multiple runs with the changes against multiple runs without them.
A_GROUP: [T1,..., Tn]
B_GROUP: [t1, ..., tn]
We can use an A/B test calculator [?] for this: set visitors to the sample count and conversions to the evaluation time.
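If no calculator is at hand, the two groups of timings can also be compared directly. The sketch below computes Welch's t-statistic for A_GROUP and B_GROUP. This is a swapped-in technique, not the calculator the text refers to, and `AbCompareSketch` is a hypothetical name. Each group needs at least two samples with some variation.

```java
class AbCompareSketch {
    static double mean(double[] samples) {
        double sum = 0;
        for (double s : samples) {
            sum += s;
        }
        return sum / samples.length;
    }

    /** Unbiased sample variance (divides by n - 1). */
    static double sampleVariance(double[] samples) {
        double m = mean(samples);
        double squares = 0;
        for (double s : samples) {
            squares += (s - m) * (s - m);
        }
        return squares / (samples.length - 1);
    }

    /** Welch's t: difference of means divided by its estimated standard error. */
    static double welchT(double[] aGroup, double[] bGroup) {
        double standardError = Math.sqrt(
                sampleVariance(aGroup) / aGroup.length + sampleVariance(bGroup) / bGroup.length);
        return (mean(aGroup) - mean(bGroup)) / standardError;
    }
}
```

A value of t close to zero suggests that the difference between the groups is within run-to-run noise. How large |t| must be to count as significant depends on the sample sizes and the chosen confidence level.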
Compare your needs and target load. Be careful when you try a NoSQL solution: scalability and flexibility are expensive. The most common problem is managing relationships between entities. Always keep the denormalization technique in mind.
Main aspect of any storage system is:
Optimize your service layers. In JPA you can use the second-level cache, or avoid hitting JPA at all by caching at the service layer with Spring Cache.
Denormalization. You can store an aggregate instead of evaluating it every time. For example, you can save the total article count instead of running COUNT(*) every time. Or you can avoid multiple joins by duplicating data.
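The idea can be sketched in memory: the store below keeps a per-author article counter that is updated on every write, so reading the count does not need a scan (the in-memory analogue of COUNT(*)). `ArticleStoreSketch` and its fields are hypothetical. In a real database the counter would be a column updated in the same transaction as the insert.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ArticleStoreSketch {
    private final List<String[]> articles = new ArrayList<>();               // {author, title}
    private final Map<String, Integer> articleCountByAuthor = new HashMap<>(); // denormalized aggregate

    void addArticle(String author, String title) {
        articles.add(new String[] {author, title});
        articleCountByAuthor.merge(author, 1, Integer::sum); // keep the aggregate in sync on write
    }

    /** Normalized read: scans every article, like COUNT(*) with a filter. */
    int countByScan(String author) {
        int count = 0;
        for (String[] article : articles) {
            if (article[0].equals(author)) {
                count++;
            }
        }
        return count;
    }

    /** Denormalized read: a single map lookup. */
    int countDenormalized(String author) {
        return articleCountByAuthor.getOrDefault(author, 0);
    }
}
```

The price is that every write path must update the stored aggregate, otherwise the two counts drift apart.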
The JVM manages heap memory itself (compared to C++, where it is managed by the user). So the main objective of GC is to find unused objects in memory and remove them, compacting the holes left behind. There are four main GC algorithms: serial collector, throughput collector, concurrent collector and G1 collector. We have a tradeoff between memory and CPU consumption, limited by the number of CPU cores.
GC removes objects from the heap when they can no longer be reached from the runtime stack (or other GC roots). The main hypothesis of GC is the weak generational hypothesis: most objects are short-lived, and old objects rarely reference new ones. So we can split memory into generations, young and old, and collect the young generation separately from the old one. An object moves from young to old once it reaches a certain age.
At the same time we can split the young generation into sub-regions: eden, where new objects are allocated, and two survivor spaces, s0 and s1. When there is not enough space in eden we start a minor collection: live objects are copied from eden to a survivor space, and survivors that are old enough are promoted to the old generation. So the survivor spaces hold objects that have survived at least one GC.
GC Performance metrics
Serial collector. This collector is the default for single-processor machines. It stops all application threads (GC pause) during both minor and full GC.
Throughput collector. This collector is the default for multi-CPU machines with a 64-bit JVM. It uses multiple threads to collect the young generation. The throughput collector stops all application threads during minor and full GC.
Concurrent collector. This collector eliminates the long pauses of full GC cycles. It stops all application threads during a minor GC, which it performs with multiple threads.
G1 collector. This collector is designed to process large heaps (4GB plus) with minimal pauses. It divides the heap into regions. G1 cleans up objects from the old generation by copying them from one region into another.
Tip: You can request a full GC manually using System.gc() when you know it is the right time to wait.
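To see how much work the collectors in a running JVM have done, the standard `java.lang.management` API can be queried. The sketch below sums collection counts and times over all registered collectors. `GcStatsSketch` is a hypothetical name, and the collector names printed by `main` depend on which GC the JVM was started with.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcStatsSketch {
    /** Total number of collections so far, across all collectors. */
    static long totalCollectionCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 when the collector does not report it
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Approximate accumulated collection time in milliseconds, across all collectors. */
    static long totalCollectionTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                    + " collections, " + gc.getCollectionTime() + " ms");
        }
    }
}
```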
Heap size. We can control the heap size using -Xms{N} (initial) and -Xmx{N} (max).
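Whether the heap flags took effect can be checked from inside the application. This is a minimal sketch using `java.lang.Runtime`; `HeapInfoSketch` is a hypothetical name, and the reported maximum is typically close to, not necessarily exactly equal to, the -Xmx value.

```java
class HeapInfoSketch {
    /** Upper bound the heap may grow to (roughly the -Xmx setting). */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Heap currently reserved by the JVM (starts near the -Xms setting). */
    static long committedHeapBytes() {
        return Runtime.getRuntime().totalMemory();
    }

    /** Heap currently occupied by objects, including garbage not yet collected. */
    static long usedHeapBytes() {
        Runtime runtime = Runtime.getRuntime();
        return runtime.totalMemory() - runtime.freeMemory();
    }

    public static void main(String[] args) {
        System.out.println("max=" + maxHeapBytes()
                + " committed=" + committedHeapBytes()
                + " used=" + usedHeapBytes());
    }
}
```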
Generation sizes. We can control the ratio between the old and young generations with -XX:NewRatio=N, and the young generation size with -XX:NewSize=N (initial) and -XX:MaxNewSize=N (max).
Metaspace. The JVM loads classes and saves their metadata into metaspace (Java 8; in Java 7 this was the permanent generation), a memory area separate from the main heap. It contains only JVM-specific metadata (not the class objects themselves). We can prevent expensive metaspace resizing by using a large initial metaspace size.
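The generations and metaspace described above show up as memory pools in the `java.lang.management` API. The sketch below lists pool names by type. `MemoryPoolsSketch` is a hypothetical name, and the exact pool names (for example eden, survivor and old generation pools, or Metaspace among the non-heap pools) vary with the JVM version and the selected collector.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.util.ArrayList;
import java.util.List;

class MemoryPoolsSketch {
    /** Names of all memory pools of the given type (HEAP or NON_HEAP). */
    static List<String> poolNames(MemoryType type) {
        List<String> names = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == type) {
                names.add(pool.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("heap pools:     " + poolNames(MemoryType.HEAP));
        System.out.println("non-heap pools: " + poolNames(MemoryType.NON_HEAP));
    }
}
```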
Java programs are distributed as Java bytecode, compared to C++ programs, which are distributed as binaries for a specific platform, and to Python or PHP programs, which are distributed as source code. But HotSpot (the Oracle JVM distribution) can dynamically compile bytecode to native code when methods are frequently used. At the same time the JVM can apply optimizations on the fly.
Java Client/Server compiler. We can use flags such as java -client or -XX:+TieredCompilation to choose between the client and server compilers. This option affects startup time, so it matters when you have a GUI application or a short-running application.
The native code compiled from Java bytecode is stored in the "code cache". You can tune it with -XX:ReservedCodeCacheSize= (max size of the code cache). You can watch the compilation process by turning on the -XX:+PrintCompilation flag. Another trick is "inlining", which is enabled by default and controlled by -XX:-Inline (disables it), -XX:MaxInlineSize= and -XX:+PrintInlining.
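Besides the print flags, the accumulated JIT compilation time can be read programmatically. The sketch below uses the standard `CompilationMXBean`. `JitTimeSketch` is a hypothetical name, and the method returns -1 as a marker of its own when the JVM has no compiler or does not support time monitoring.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

class JitTimeSketch {
    /** Approximate total time spent in JIT compilation, in milliseconds, or -1 if unavailable. */
    static long totalCompilationMillis() {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean(); // null without a JIT
        if (jit == null || !jit.isCompilationTimeMonitoringSupported()) {
            return -1;
        }
        return jit.getTotalCompilationTime();
    }

    public static void main(String[] args) {
        System.out.println("JIT compilation time so far: " + totalCompilationMillis() + " ms");
    }
}
```

Sampling this value before and after a warm-up phase gives a rough hint of whether compilation is still ongoing.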
We can allocate native memory via JNI calls or with NIO byte buffers (allocateDirect()). The most common case is buffers for filesystem and socket operations, where writing to an NIO buffer and sending the data to a channel requires no copying between the JVM and C libraries. You can also allocate memory outside GC management. This is useful for large (16GB+) allocations where GC is not your best friend. Big data software like Spark or Ignite uses this trick.
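A minimal sketch of a direct buffer: the bytes are written to memory allocated outside the Java heap and read back. `DirectBufferSketch` and `roundTrip` are hypothetical names. In real code the flipped buffer would be handed to a channel instead of being read back into an array.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class DirectBufferSketch {
    /** Writes the text into a direct (off-heap) buffer and reads it back. */
    static String roundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buffer = ByteBuffer.allocateDirect(bytes.length); // native memory, not on the heap
        buffer.put(bytes);
        buffer.flip(); // switch from writing to reading
        byte[] out = new byte[buffer.remaining()];
        buffer.get(out);
        return new String(out, StandardCharsets.UTF_8);
    }
}
```

Direct buffers are more expensive to allocate than heap buffers, so they are usually allocated once and reused.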
Author @mrkandreev
Machine Learning Engineer