AOT cache code generated on an AVX-512 CPU crashes on an AVX2 CPU

TL;DR

We noticed sporadic crashes during internal testing of our pre-launch control plane. Kubernetes did what it does: restarted pods, retried, kept serving. But a restart count climbing while everything looks fine isn't something you take live. So we dug in.

We had added -XX:AOTCache=/deployments/app.aot to a Quarkus 3.35 image on JDK 25 LTS, expecting a free startup win from JEP 483 (Ahead-of-Time Class Loading & Linking). Instead we got a JVM crashing with SIGILL in an AdapterBlob on the JIT/AOT boundary. The hs_err file made it obvious once we'd read it properly: the AOT cache was trained on a GitHub-hosted Actions runner that supports AVX-512, then loaded on a runtime CPU that doesn't. The cache embedded instructions the runtime could not execute.

The cache was buying us about 4.3 s of cold-start time (7.5 s with the cache, 11.6 to 11.9 s without it). Worth having back, but not in exchange for a JVM that occasionally hits an illegal instruction. The immediate fix is one line: drop -XX:AOTCache=... from production's JAVA_OPTS. The lesson is that "same architecture" isn't tight enough when your build host has AVX-512 and your runtime host doesn't.

This post walks through the diagnosis end-to-end and ends with where else the same trap waits. If your hardware is mixed, your VMs are CPU-masked to a baseline (Proxmox cpu: x86-64-v3, VMware EVC), or you run AMD EPYC Rome/Milan, this applies whether your hardware is "old" or not.

About this work

JEP 483 is part of Project Leyden, OpenJDK's effort to make JVM startup less painful. It was integrated as a stable feature in JDK 24, and JDK 25 builds on it with JEP 514 (Ahead-of-Time Command-Line Ergonomics) and JEP 515 (Ahead-of-Time Method Profiling). The work is iterating quickly, and the portability semantics around the cache file are still maturing in the community. This post is one constraint, worked end-to-end, written so it shows up when someone else hits the same SIGILL.

The setup

The crashing app is the admin/portal control plane for our self-service hosting platform: a Quarkus 3.35 app on JDK 25 LTS, running on Talos Linux, fronted by HAProxy, with CloudNativePG underneath. Two replicas, modest 1.5 Gi memory limit, G1, the usual containerised-JVM tuning:

# src/main/kubernetes/deployment.yaml
- name: JAVA_OPTS
  value: >-
    -XX:InitialRAMPercentage=60
    -XX:MaxRAMPercentage=60
    -XX:+UseG1GC
    -XX:MaxGCPauseMillis=100
    -XX:+ExitOnOutOfMemoryError
    -XX:+AlwaysPreTouch
    -XX:AOTCache=/deployments/app.aot       # <-- the trap
    ...

The AOT cache was being baked into the container image during docker build, using the two-step record / create dance from JEP 483. JEP 483 doesn't cache the whole app's bytecode blindly. It caches only the classes the JVM actually loaded and linked during a real run. So the first pass (AOTMode=record) runs the app and writes a configuration file (app.aotconf) describing what was loaded and how. The second pass (AOTMode=create) reads that configuration and materialises the cache file (app.aot) that production pods load at startup:

# src/main/docker/Dockerfile.jvm (before)
RUN java -XX:AOTMode=record -XX:AOTConfiguration=/deployments/app.aotconf \
    -Dquarkus.http.host=0.0.0.0 \
    -jar /deployments/quarkus-run.jar; \
    if [ -f /deployments/app.aotconf ]; then \
      java -XX:AOTMode=create -XX:AOTConfiguration=/deployments/app.aotconf \
        -XX:AOTCache=/deployments/app.aot \
        -jar /deployments/quarkus-run.jar && \
      rm -f /deployments/app.aotconf; \
    fi; \
    exit 0

The training run exits non-zero (the app has no DB at build time), but the Dockerfile's ; and exit 0 swallow that, and the if [ -f ... ] check only proceeds to the create step if the recording actually produced a config file. CI ran clean; the image pushed; ArgoCD rolled.

The symptom: a restart counter climbing while everything served fine

kubectl get pods told the story:

NAME                      READY   STATUS    RESTARTS       AGE
barista-9f94dbc84-qmfmr   1/1     Running   34 (19h ago)   40h
barista-9f94dbc84-x9fsb   1/1     Running   3 (28h ago)    40h

Some restarts were near-instant after pod start, some were hours into a serving session. No OOMKills, no liveness flap, no shared trigger. Kubernetes handled the symptom gracefully, which is what it's supposed to do, and which is exactly why we were glad to catch it pre-launch: graceful restart loops hide root causes, and a control plane that randomly restarts isn't something you want underneath customer workloads.

--previous logs always ended the same way:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGILL (0x4) at pc=0x00007fb12ed1860c, pid=1, tid=145
#
# JRE version: OpenJDK Runtime Environment (Red_Hat-25.0.3.0.9-1) (25.0.3+9)
# Java VM: OpenJDK 64-Bit Server VM (25.0.3+9-LTS, mixed mode, sharing,
#          tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# v  ~AdapterBlob 0x00007fb12ed1860c

Two clues already:

  1. mixed mode, sharing. The AOT/CDS archive is mapped. The cache is live.
  2. The crash is in an AdapterBlob, not in interpreter code, and not in a C2-compiled method. AdapterBlobs are short JVM-generated code blocks that bridge calls between interpreted and JIT-compiled code (their argument-passing conventions differ). They aren't where well-formed Java code crashes during normal operation.

When the crash is in JVM-generated code, it usually means the CPU can't execute the instructions the JVM emitted.
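Those two clues can be pulled out of any crash header mechanically. The following is a minimal sketch, not part of our tooling: crash_header is a hypothetical helper name, and it assumes the header layout shown above (a "#  SIG..." line, a "# Java VM:" line, and the frame on the line after "# Problematic frame:").

```shell
# Summarise a HotSpot crash header read from stdin (an hs_err file or
# `kubectl logs --previous` output). Prints up to three lines: the fatal
# signal, the VM mode line (look for "sharing"), and the problematic frame.
crash_header() {
  awk '
    /^#[ \t]+SIG[A-Z]+ \(/  { sub(/^#[ \t]+/, ""); print "signal: " $0 }
    /^# Java VM:/           { sub(/^# Java VM:[ \t]*/, ""); print "vm:     " $0 }
    /^# Problematic frame:/ { want_frame = 1; next }
    want_frame              { sub(/^#[ \t]+/, ""); print "frame:  " $0; want_frame = 0 }
  '
}
```

Usage would look like `kubectl logs --previous POD | crash_header`, where POD stands in for the pod name.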

Reading the hs_err

Every Hotspot crash drops hs_err_pidN.log somewhere writable. Our pods mount /tmp as an emptyDir, and emptyDir persists across container restarts inside the same pod, so the file from the previous container was still there:

$ kubectl exec barista-9f94dbc84-qmfmr -- ls -la /tmp/
-rw-r--r--. 1 default root 167721 May 19 10:54 hs_err_pid1.log
$ kubectl cp barista/barista-9f94dbc84-qmfmr:/tmp/hs_err_pid1.log ./hs_err_pid1.log

The header was the same as --previous logs. The interesting bit is further down: the crashing thread.

Current thread (0x00007fb141031250):
  JavaThread "JPA Startup Thread"  [_thread_in_Java, id=145, ...]

Stack:
v  ~AdapterBlob 0x00007fb12ed1860c
j  org.hibernate.boot.model.relational.QualifiedNameParser$NameParts.<init>(
      Lorg/hibernate/boot/model/naming/Identifier;
      Lorg/hibernate/boot/model/naming/Identifier;
      Lorg/hibernate/boot/model/naming/Identifier;)V+30
j  org.hibernate.boot.model.relational.QualifiedNameImpl.<init>(...)
j  org.hibernate.boot.model.relational.QualifiedTableName.<init>(...)
j  org.hibernate.mapping.Table.getQualifiedName(...)
j  org.hibernate.persister.entity.JoinedSubclassEntityPersister.<init>(...)
j  java.lang.invoke.LambdaForm$DMH+0x80000003f.newInvokeSpecial(...)
...
j  io.quarkus.hibernate.orm.runtime.FastBootHibernatePersistenceProvider
       .createEntityManagerFactory(...)
j  io.quarkus.hibernate.orm.runtime.JPAConfig$LazyPersistenceUnit.get(...)

We crash in the canonical constructor of a Java record. Hibernate's QualifiedNameParser.NameParts is a record (Identifier, Identifier, Identifier), being invoked through Java's MethodHandle machinery (the LambdaForm$DMH.newInvokeSpecial further up the stack). The AdapterBlob at the top of the stack is the JVM's calling-convention bridge for that MethodHandle invocation.

That's also the moment Quarkus' lazy JPA unit boots Hibernate's metamodel: the heaviest reflective burst of the whole startup. A high-throughput moment in the JIT.

The "it works for a while, then crashes" red herring

One pod crashed at startup, 7 s in. The other pod ran fine for 30 minutes, then crashed during a G1 Concurrent Mark Cycle:

[01:38:13.295][gc,marking] GC(20) Concurrent Mark From Roots
#
#  SIGILL (0x4) at pc=0x00007f73bad51690, pid=1, tid=141
# Problematic frame:
# v  ~AdapterBlob 0x00007f73bad51690

Same signature, different timing. That ruled out "Hibernate is buggy on JDK 25" and pointed back at "something about the generated code itself is unsound." Both crashes happen in adapters; the only other thing they have in common is that the AOT cache is loaded.

So why did one pod sometimes survive for hours? Because the adapters only get hit while methods are still bridging interpreted and compiled code. Once C2 has compiled the hot paths, those adapters drop out of the path. Whether the crash hits at startup or hours in depends on which methods the JIT got to first.

The actual root cause: CPU feature mismatch

siginfo clinched it:

siginfo: si_signo: 4 (SIGILL), si_code: 2 (ILL_ILLOPN), ...

ILL_ILLOPN is "illegal operand"; on x86 Linux it is the code the kernel reports for an invalid-opcode (#UD) trap. Not a memory fault, not a permission fault. The CPU was being asked to execute a byte sequence it doesn't implement.
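Because hs_err prints the numeric si_code next to its name, a small lookup is handy when reading other crashes. This is a sketch with hypothetical function names; the numeric values are the SIGILL codes from Linux's siginfo header.

```shell
# Map a SIGILL si_code number to its Linux name and meaning.
sigill_code_name() {
  case "$1" in
    1) echo "ILL_ILLOPC (illegal opcode)" ;;
    2) echo "ILL_ILLOPN (illegal operand)" ;;
    3) echo "ILL_ILLADR (illegal addressing mode)" ;;
    4) echo "ILL_ILLTRP (illegal trap)" ;;
    5) echo "ILL_PRVOPC (privileged opcode)" ;;
    6) echo "ILL_PRVREG (privileged register)" ;;
    7) echo "ILL_COPROC (coprocessor error)" ;;
    8) echo "ILL_BADSTK (internal stack error)" ;;
    *) echo "unknown si_code $1" ;;
  esac
}

# Read an hs_err "siginfo:" line on stdin, extract the si_code number,
# and decode it with the table above.
decode_siginfo() {
  code=$(sed -n 's/.*si_code: \([0-9][0-9]*\).*/\1/p' | head -n 1)
  sigill_code_name "$code"
}
```

For example, `grep -m1 '^siginfo:' hs_err_pid1.log | decode_siginfo` decodes the line from the crash file copied out earlier.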

So we asked: what's the CPU?

$ kubectl exec barista-9f94dbc84-qmfmr -- \
    grep -m1 'model name' /proc/cpuinfo
model name : Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz

$ kubectl exec barista-9f94dbc84-qmfmr -- \
    grep -m1 flags /proc/cpuinfo | tr ' ' '\n' | grep -iE '^avx[0-9]*$'
avx
avx2

E5-2680 v4 is a Broadwell-EP chip from 2016. AVX2, FMA, BMI2. No AVX-512.

What does the JVM itself pick?

$ kubectl exec ... -- java -XX:+PrintFlagsFinal -version | grep UseAVX
 int UseAVX = 2 {ARCH product} {default}

That's UseAVX=2: Hotspot detected AVX2 and won't emit AVX-512.
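The same check can be approximated from the flags line alone, which is useful on hosts where no JVM is installed. A rough sketch follows; avx_level is a hypothetical helper, and the mapping (avx512f to 3, avx2 to 2, avx to 1) is a simplification, since HotSpot's own choice of UseAVX also looks at the CPU model and other feature bits.

```shell
# Read a /proc/cpuinfo "flags" line on stdin and print a rough AVX level:
# 3 if avx512f is present, 2 for avx2, 1 for avx, 0 otherwise.
# Treat the result as a ceiling, not a prediction of what HotSpot picks.
avx_level() {
  # Normalise tabs/newlines to single spaces and pad both ends so that
  # every flag can be matched as a whole word.
  flags=" $(tr -s ' \t\n' '   ') "
  case "$flags" in
    *" avx512f "*) echo 3 ;;
    *" avx2 "*)    echo 2 ;;
    *" avx "*)     echo 1 ;;
    *)             echo 0 ;;
  esac
}
```

Usage: `kubectl exec POD -- grep -m1 flags /proc/cpuinfo | avx_level`, with POD standing in for the pod name. Running the same pipeline on the build host gives the other side of the comparison.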

Now, the AOT cache was trained during docker build, which on our pipeline runs on a GitHub-hosted ubuntu-latest runner. Those run on Azure VMs whose underlying CPU depends on the SKU: Intel Ice Lake supports AVX-512, AMD EPYC Milan does not. Ours was trained on one that does. No idea why the build still takes 5x as long as on my laptop, but that's a different rant. On such a host the training-run JVM picks UseAVX=3 by default, and its C1/C2/stub generators may emit EVEX-encoded instructions.

That cache then ships in the image, gets loaded on a runtime CPU with a lower feature ceiling, and the first time a cached blob tries to execute an AVX-512 instruction the CPU faults with ILL_ILLOPN. The crashing PC sits inside an AdapterBlob's code range. Whether the offending AVX-512 instruction is in the adapter itself or in a stub the adapter branches into, the code in that blob was generated on an AVX-512-capable CPU.

The documented AOT cache portability requirements are: same JDK release, same operating system, same hardware architecture (x64 or aarch64). What's not pinned down at the same level of detail is what "same architecture" means within x86-64. Our crash shows that two x86-64 CPUs with different feature flags can't share an AOT cache, at least not when the cache was trained on the more capable one. No warning, no fallback, just a SIGILL on the first cached instruction the runtime CPU can't execute.
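Since the JVM offers no fallback here, a launcher script could refuse the cache itself. The sketch below is one way to do that, not something we run: it assumes a Linux x86 host with GNU coreutils (sha256sum), and the "<cache>.cpu" sidecar file and both function names are hypothetical. At training time you would store the builder's fingerprint next to the cache; at startup the option is only emitted when the runtime fingerprint is identical.

```shell
# Fingerprint of the CPU feature set: sorted, de-duplicated flags from a
# cpuinfo-style file (default /proc/cpuinfo), hashed with sha256sum.
cpu_fingerprint() {
  grep -m1 '^flags' "${1:-/proc/cpuinfo}" | cut -d: -f2 |
    tr -s ' \t' '\n\n' | grep -v '^$' | sort -u | sha256sum | cut -d' ' -f1
}

# Print the -XX:AOTCache option only if the fingerprint stored beside the
# cache at training time (hypothetical "<cache>.cpu" file) matches this host.
# Prints nothing otherwise, so the JVM starts without the cache.
aot_cache_opt() {
  cache=$1; cpuinfo=${2:-/proc/cpuinfo}
  if [ -f "$cache" ] && [ -f "$cache.cpu" ] &&
     [ "$(cat "$cache.cpu")" = "$(cpu_fingerprint "$cpuinfo")" ]; then
    echo "-XX:AOTCache=$cache"
  fi
}
```

Exact equality is deliberately conservative: it also rejects a cache trained on a less capable CPU, which would probably be safe, in exchange for never having to reason about which individual flags matter.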

The fix

Two changes. First, drop -XX:AOTCache from prod's JAVA_OPTS:

 - name: JAVA_OPTS
   value: >-
     -XX:InitialRAMPercentage=60
     -XX:MaxRAMPercentage=60
     -XX:+UseG1GC
     -XX:MaxGCPauseMillis=100
     -XX:+ExitOnOutOfMemoryError
     -XX:+AlwaysPreTouch
-    -XX:AOTCache=/deployments/app.aot
     -Dio.netty.machineId=00:00:00:00:00:01
     ...

Second, stop baking the cache into the image. The RUN block goes away, and the JAVA_OPTS_APPEND ENV stops pointing at it:

-RUN java -XX:AOTMode=record ... ; \
-    if [ -f /deployments/app.aotconf ]; then \
-      java -XX:AOTMode=create ... ; \
-    fi; \
-    exit 0
-
 EXPOSE 8090
-ENV JAVA_OPTS_APPEND="-Dquarkus.http.host=0.0.0.0 ... -XX:AOTCache=/deployments/app.aot"
+ENV JAVA_OPTS_APPEND="-Dquarkus.http.host=0.0.0.0 ..."

A note on this second change. While diagnosing this, we found that JAVA_OPTS_APPEND on the Red Hat UBI Java image doesn't actually append the way its name suggests: it's a fallback that's silently discarded the moment you set JAVA_OPTS. That's a separate behaviour worth its own writeup, and we documented it separately. Relevant to anyone on ubi9/openjdk-25, AOT cache or not.

Did the fix actually fix it?

Yes, with the trade-off the cache was supposed to be hiding. Rolled the no-AOT image to both pods, watched the Quarkus startup lines, and caught the deltas:

# OLD image, -XX:AOTCache=/deployments/app.aot:
[io.quarkus] barista 1.0.0-SNAPSHOT on JVM (powered by Quarkus 3.35.3)
             started in 7.497s. Listening on: http://0.0.0.0:8090

# NEW image, no AOT cache, two consecutive pod starts:
[io.quarkus] ... started in 11.929s. Listening on: http://0.0.0.0:8090
[io.quarkus] ... started in 11.602s. Listening on: http://0.0.0.0:8090

Configuration     Quarkus startup    Restarts
With AOT cache    7.497 s            34 / 40 h on pod A, 3 on pod B
No AOT cache      11.6 to 11.9 s     0

So the cache really was buying us a meaningful ~4.3 s: about 36% off the cold-start time (put the other way, the uncached start takes about 57% longer). It just wasn't safe to take this way. The right next move is the in-cluster training pattern described below: keep the 7.5 s number, lose the SIGILLs.
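The percentage depends on which number you treat as the baseline, so it helps to compute both from the measured startup lines. A small sketch (startup_delta is a hypothetical helper) that compares the cached time against the mean of the uncached runs:

```shell
# Compare a cached startup time against the mean of uncached runs.
# usage: startup_delta CACHED_SECONDS UNCACHED_SECONDS...
startup_delta() {
  cached=$1; shift
  printf '%s\n' "$@" | awk -v cached="$cached" '
    { sum += $1; n++ }
    END {
      mean = sum / n; saved = mean - cached
      printf "saved %.2f s: %.1f%% less startup time, uncached takes %.1f%% longer\n",
             saved, 100 * saved / mean, 100 * saved / cached
    }'
}
```

With the three measurements above, `startup_delta 7.497 11.929 11.602` prints "saved 4.27 s: 36.3% less startup time, uncached takes 56.9% longer".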

For now: image is 70 MB lighter, restart churn is gone, on-call sleeps through the night. As the ops person, I'll make that trade.

Where else this bites

"Old hardware + new JVM" isn't the only place this fires. The condition is: whenever your build CPU's feature ceiling exceeds your runtime CPU's feature ceiling, the AOT cache trained on the former can crash on the latter. Hardware age is incidental. A non-exhaustive list of modern places the same trap waits:

  • AMD EPYC Rome (Zen 2, 2019) and Milan (Zen 3, 2021). No AVX-512. These are recent server CPUs still running in plenty of production today. If your CI runner is an Azure-hosted Ice Lake (AVX-512) and your runtime is an EPYC Rome host, you have the same crash surface as we did, without any "old Xeon" excuse. Zen 4 Genoa (2022) and Zen 5 Turin (2024) added AVX-512, but Rome and Milan didn't.
  • Hypervisor CPU masking, especially at the x86-64-v3 baseline. Proxmox does this via cpu: x86-64-v3, VMware via EVC. Both deliberately set the VM's CPU ceiling below the host silicon so VMs can live-migrate across hardware generations. Sensible policy. It's also no coincidence that x86-64-v3 is the most common baseline: it's RHEL 10's ISA floor, and AlmaLinux 10, Rocky 10, and the rest of the family inherited it. Predictable, supported, well-documented. Also: AVX-512 excluded. Train on a laptop with AVX-512, deploy onto an x86-64-v3 VM, same SIGILL.
  • Mixed Intel/AMD Kubernetes node pools. Kubernetes schedules pods across nodes based on resource requests, not CPU feature flags. A pod scheduled on a Skylake-SP node one day might land on a Zen 2 EPYC node the next. Your image carries one AOT cache. One of those landings will work; the other will crash.
  • Mixed-microarchitecture CI runner pools. If your build sometimes lands on a GitHub-hosted runner and sometimes on a self-hosted M-series Mac, even the arch family changes. JEP 483 doesn't pretend to handle cross-architecture portability, but the failure mode of "this build is fine, that build crashes in production" is genuinely confusing if you didn't know to look.

This is an infrastructure problem. Feature ceiling is a policy choice, not something the hardware gives you for free. Pick a level, document it, make every JIT/AOT runtime stick to it. We're rotating out our older hosts on the normal refresh cycle, but new hardware would have hidden this bug rather than fixed it.
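The condition "build ceiling exceeds runtime ceiling" can be checked directly by diffing the two flag sets. A minimal sketch, assuming you have saved the first "flags" line of /proc/cpuinfo from each host into a file; both function names are hypothetical.

```shell
# Print one CPU flag per line, sorted, from a cpuinfo-style file.
flag_list() {
  grep -m1 '^flags' "$1" | cut -d: -f2 | tr -s ' \t' '\n\n' | grep -v '^$' | sort -u
}

# Print the flags the build host has but the runtime host lacks.
# Empty output means the build host's feature set is a subset of the
# runtime's, which is the safe direction.
# usage: features_missing_at_runtime BUILD_CPUINFO RUNTIME_CPUINFO
features_missing_at_runtime() {
  b=$(mktemp); r=$(mktemp)
  flag_list "$1" > "$b"; flag_list "$2" > "$r"
  comm -23 "$b" "$r"       # lines that appear only in the build list
  rm -f "$b" "$r"
}
```

Any avx512* entry in the output is the situation described in this post. Other entries may or may not matter, depending on what the JVM chooses to emit.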

Doing AOT properly later

The honest fix isn't "never use Leyden." It's train the cache where it's going to run. Two safer patterns:

  1. Build-time training on the same CPU class. Either pin the GHA runner to a self-hosted runner on the same hardware as production, or set -XX:UseAVX=2 (or whichever feature level matches prod) explicitly during the AOTMode=record and AOTMode=create steps. The latter is the smallest change: it caps Hotspot's instruction selection at the prod level even when the builder CPU is fancier. Catch: you must keep the cap in sync if you ever upgrade the production fleet.
  2. First-boot training in-cluster. Run the AOTMode=record / AOTMode=create pair as an initContainer on first launch, writing the .aot to a PVC. Subsequent pod starts mount the PVC and load with -XX:AOTCache=.... The cache is generated on a real production node, so it matches the CPUs it runs on, provided every node that mounts that cache shares a feature set (a mixed node pool needs one cache per CPU class). This is what Project Leyden's design actually anticipates; baking the cache into the image is the easy-mode shortcut that bites when your CI runner and your prod nodes diverge.
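Option 1's capped variant could look roughly like the sketch below. It wraps the same record/create pair from the Dockerfile with -XX:UseAVX set on both steps. The function name and the "<cache>.aotconf" path are hypothetical, and whether capping the AVX level alone is sufficient is an assumption this sketch does not verify: UseAVX governs AVX instruction selection only, while other CPU features have their own flags.

```shell
# Run the JEP 483 record/create pair with HotSpot's AVX level capped.
# usage: train_aot_cache JAR CACHE [AVX_LEVEL]   (AVX_LEVEL defaults to 2)
train_aot_cache() {
  jar=$1; cache=$2; avx=${3:-2}; conf="$cache.aotconf"
  # The training run may exit non-zero (in our build the app has no DB),
  # so its status is ignored; the recorded configuration is what matters.
  java -XX:UseAVX="$avx" -XX:AOTMode=record -XX:AOTConfiguration="$conf" \
       -jar "$jar" || true
  [ -f "$conf" ] || { echo "no AOT configuration recorded" >&2; return 1; }
  # Assemble the cache under the same cap, then drop the intermediate file.
  java -XX:UseAVX="$avx" -XX:AOTMode=create -XX:AOTConfiguration="$conf" \
       -XX:AOTCache="$cache" -jar "$jar" && rm -f "$conf"
}
```

In a Dockerfile this would be called as `train_aot_cache /deployments/quarkus-run.jar /deployments/app.aot 2`. Unlike the original RUN block, it returns non-zero when no configuration was recorded, so a silently missing cache shows up in CI.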

We'll likely take option 2 when we move onto more heterogeneous hardware. One image, per-cluster cache.

Prior art and further reading

The OpenJDK Leyden team and the wider community have been talking about portability constraints for a while. This post is what one such constraint looks like in production.

Takeaways

  • SIGILL in any compiled Java code on JDK 25 with -XX:AOTCache set = CPU feature mismatch until proven otherwise. Check UseAVX on the recording host vs the runtime host before chasing GC, library, or Quarkus bugs.
  • hs_err_pidN.log is gold. It survives container restarts when /tmp is an emptyDir. kubectl cp it out before the pod gets rescheduled.
  • emptyDir persists across container restarts within the same pod. Useful for forensic files like Hotspot crash dumps and JFR recordings; surprising if you assumed restartPolicy: Always meant a fresh /tmp.
  • JEP 483 is stable and useful. The cache file is more architecture-pinned than the docs make obvious. Treat it like a microarchitecture-pinned binary, not like bytecode.
  • Your feature ceiling is a policy you choose. Pick one, document it, ensure every JIT/AOT runtime respects it. AMD EPYC Rome and Milan, Proxmox CPU masking, VMware EVC, and the RHEL 10 v3 baseline all sit on the lower side of this line. CI runners often sit on the higher side.

Closing

We're going to keep using Leyden. 4.3 s of startup improvement is worth chasing properly, and the in-cluster training pattern is how JEP 483 was designed to be used. We took the easy-mode path and got reminded why it's marked that way. Writing this up is how we say thanks for it.

If you hit the same SIGILL in an AdapterBlob and Google brought you here: it's almost certainly your CPU feature ceiling. Check UseAVX on both sides.
