Crazy data…

My current research project is the implementation of a failure detection service based on the SNMP standard. One important aspect of this service is sending heartbeat messages over SNMP. The prototype works quite well and I’m currently doing some performance measurements, but last week I stumbled on a small mystery. My current experiment is rather simple: set the service to transmit heartbeat messages at a high rate and measure the CPU load. I ran some preliminary tests between my laptop and the Sun workstation I use here, no problem. But when I run the same code on the Linux cluster, I get something strange. The code runs fine, but the CPU load varies a lot, reaching nearly 50% and then dropping to 2%. So while the average CPU load is reasonable, the distribution is wild.
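To give an idea of what the experiment does, here is a minimal sketch of a heartbeat transmitter built on SNMP4J (which the prototype uses for its transport, as the thread dump below shows). This is not the actual service: the class name, target address, community string, notification OID and sending rate are all made-up placeholders.

import org.snmp4j.CommunityTarget;
import org.snmp4j.PDU;
import org.snmp4j.Snmp;
import org.snmp4j.mp.SnmpConstants;
import org.snmp4j.smi.OID;
import org.snmp4j.smi.OctetString;
import org.snmp4j.smi.TimeTicks;
import org.snmp4j.smi.UdpAddress;
import org.snmp4j.smi.VariableBinding;
import org.snmp4j.transport.DefaultUdpTransportMapping;

// Sketch only: sends SNMPv2c notifications ("heartbeats") at a fixed rate.
public class HeartbeatSender {

    public static void main(String[] args) throws Exception {
        Snmp snmp = new Snmp(new DefaultUdpTransportMapping());
        snmp.listen();

        CommunityTarget target = new CommunityTarget();
        target.setCommunity(new OctetString("public"));
        target.setAddress(new UdpAddress("192.168.0.10/162")); // hypothetical monitor address
        target.setVersion(SnmpConstants.version2c);

        long sequence = 0;
        while (true) {
            // One heartbeat = one notification carrying a sequence number.
            PDU pdu = new PDU();
            pdu.setType(PDU.NOTIFICATION);
            pdu.add(new VariableBinding(SnmpConstants.sysUpTime, new TimeTicks(sequence)));
            pdu.add(new VariableBinding(SnmpConstants.snmpTrapOID,
                                        new OID("1.3.6.1.4.1.999.0.1"))); // placeholder OID
            pdu.add(new VariableBinding(new OID("1.3.6.1.4.1.999.2.1.1.1"), // placeholder OID
                                        new OctetString(Long.toString(sequence++))));
            snmp.send(pdu, target);
            Thread.sleep(10); // ~100 heartbeats per second; "high rate" in the experiment
        }
    }
}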

I plotted the CPU load as a function of time, and I get this really strange curve. The X axis is the time in seconds and the Y axis is the CPU load, as a percentage. I get a very regular pattern, with a period of around 198.5 seconds. This is much too slow to be the garbage collector, and the GC logs show no activity that could even be related to this phenomenon. When the CPU load is around 50%, the CPU spends around 25% of the time in the kernel, so I might be hitting some strange kernel problem. I’m now trying to figure out whether the phenomenon has the same periodicity on other cluster nodes, and whether it is related to the frequency of the heartbeats.

CPU load graph
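To check the other nodes for the same 198.5-second period, and to separate user from kernel time, something like the following sampler is enough. It is just a sketch: it reads the aggregate "cpu" line of /proc/stat once per second (the field layout assumed here is the one 2.6 kernels use) and prints user and system percentages that can be plotted against time.

import java.io.BufferedReader;
import java.io.FileReader;

// Rough CPU-load sampler: prints user vs. kernel (system) time percentages
// once per second, computed from deltas of the first line of /proc/stat.
public class CpuLoadSampler {

    private static long[] readCpuLine() throws Exception {
        BufferedReader in = new BufferedReader(new FileReader("/proc/stat"));
        try {
            // "cpu  user nice system idle iowait irq softirq"
            String[] f = in.readLine().trim().split("\\s+");
            long[] v = new long[f.length - 1];
            for (int i = 1; i < f.length; i++) {
                v[i - 1] = Long.parseLong(f[i]);
            }
            return v;
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws Exception {
        long[] prev = readCpuLine();
        while (true) {
            Thread.sleep(1000);
            long[] cur = readCpuLine();
            long total = 0;
            for (int i = 0; i < cur.length; i++) {
                total += cur[i] - prev[i];
            }
            long user = (cur[0] - prev[0]) + (cur[1] - prev[1]); // user + nice
            long system = cur[2] - prev[2];                      // kernel time
            System.out.println("user=" + (100 * user / total)
                    + "% system=" + (100 * system / total) + "%");
            prev = cur;
        }
    }
}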

I tried running some profiling tools, but no dice. My process is eating away 40% of the CPU with the following stack trace:

Full thread dump Java HotSpot(TM) Client VM (1.5.0_02-b09 mixed mode, sharing):

"Notifier:[c1.jaist.ac.jp, 0, Test, 1000, 3, 1, 2%, 1.3.6.1.4.1.999.2.1.1.1.14.99.49.46.106.97.105.115.116.46.97.99.46.106.112.0, 1, 1, ]" daemon prio=1 tid=0x083938f0 nid=0x7025 in Object.wait() [0xb1e49000..0xb1e49fb0]
    at java.lang.Object.wait(Native Method)
    - waiting on <0x89038e10> (a jp.ac.jaist.snmpd.basic_fd.BasicFailureDetectorTable$FailureDetectorRow$StateChangeNotifier)
    at java.lang.Object.wait(Object.java:474)
    at jp.ac.jaist.snmpd.basic_fd.BasicFailureDetectorTable$FailureDetectorRow$StateChangeNotifier.waitNext(BasicFailureDetectorTable.java:393)
    - locked <0x89038e10> (a jp.ac.jaist.snmpd.basic_fd.BasicFailureDetectorTable$FailureDetectorRow$StateChangeNotifier)
    at jp.ac.jaist.snmpd.basic_fd.BasicFailureDetectorTable$FailureDetectorRow$StateChangeNotifier.run(BasicFailureDetectorTable.java:412)

"DefaultUDPTransportMapping_0.0.0.0/1616" daemon prio=1 tid=0x08310808 nid=0x7022 runnable [0xb1f4b000..0xb1f4c130]
    at java.net.PlainDatagramSocketImpl.receive0(Native Method)
    at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:136)
    - locked <0x8901a5f0> (a java.net.PlainDatagramSocketImpl)
    at java.net.DatagramSocket.receive(DatagramSocket.java:712)
    - locked <0x88b85490> (a java.net.DatagramPacket)
    - locked <0x8901ac28> (a java.net.DatagramSocket)
    at org.snmp4j.transport.DefaultUdpTransportMapping$ListenThread.run(Unknown Source)

"Timer-1" daemon prio=1 tid=0x0817fd88 nid=0x7021 in Object.wait() [0xb1fcd000..0xb1fcd1b0]
    at java.lang.Object.wait(Native Method)
    - waiting on <0x89014cc0> (a java.util.TaskQueue)
    at java.lang.Object.wait(Object.java:474)
    at java.util.TimerThread.mainLoop(Timer.java:483)
    - locked <0x89014cc0> (a java.util.TaskQueue)
    at java.util.TimerThread.run(Timer.java:462)

"Timer-0" prio=1 tid=0x0817a3e0 nid=0x7020 in Object.wait() [0xb204d000..0xb204de30]
    at java.lang.Object.wait(Native Method)
    - waiting on <0x89014d38> (a java.util.TaskQueue)
    at java.util.TimerThread.mainLoop(Timer.java:509)
    - locked <0x89014d38> (a java.util.TaskQueue)
    at java.util.TimerThread.run(Timer.java:462)

"Low Memory Detector" daemon prio=1 tid=0x080a00d8 nid=0x701e runnable [0x00000000..0x00000000]

"CompilerThread0" daemon prio=1 tid=0x0809eb88 nid=0x701d waiting on condition [0x00000000..0xb2409998]

"Signal Dispatcher" daemon prio=1 tid=0x0809dcb0 nid=0x701c runnable [0x00000000..0x00000000]

"Finalizer" daemon prio=1 tid=0x08099348 nid=0x701b in Object.wait() [0xb270b000..0xb270c0b0]
    at java.lang.Object.wait(Native Method)
    - waiting on <0x88fffe90> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
    - locked <0x88fffe90> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
    at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=1 tid=0x08098648 nid=0x701a in Object.wait() [0xb278d000..0xb278d130]
    at java.lang.Object.wait(Native Method)
    - waiting on <0x88ffff10> (a java.lang.ref.Reference$Lock)
    at java.lang.Object.wait(Object.java:474)
    at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
    - locked <0x88ffff10> (a java.lang.ref.Reference$Lock)

"main" prio=1 tid=0x0805b720 nid=0x7018 waiting on condition [0xbfffd000..0xbfffd658]
    at java.lang.Thread.sleep(Native Method)
    at jp.ac.jaist.snmpd.FDAgent.main(FDAgent.java:245)

"VM Thread" prio=1 tid=0x08095b50 nid=0x7019 runnable

"VM Periodic Task Thread" prio=1 tid=0x080a1568 nid=0x701f waiting on condition

On Mac OS X I use Shark for profiling, but I have not found such a nice tool for Linux. The cluster nodes run Linux 2.6.11-1.14_FC3 (i686) with the Sun 1.5.0_02 JVM.
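Until I find a Shark equivalent, one option I could fall back on (I have not verified it on this particular setup) is the HPROF agent that ships with the Sun JVM: with CPU sampling enabled it writes a flat profile to java.hprof.txt, which would at least say where those 40% go. Something along these lines, with the interval and depth values being guesses:

java -Xrunhprof:cpu=samples,interval=10,depth=8 jp.ac.jaist.snmpd.FDAgent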

2 thoughts on “Crazy data…”

  1. Hmm.

    I see two potential culprits: Java 1.5 and the 2.6.11 kernel. Which version of Java are you using? J2SE or J2EE?

    What does Shark do for you?

  2. Try a VM other than Sun’s, like IBM’s Jikes or Kaffe.
