Get 3500 EURO in our Special Referral Programme.

Debugging concurrency issues on a JIRA plugin case

The goal of this blog post is to share experiences regarding analyzing and debugging concurrency issues that occured after integrating JIRA Plugin with the Atmosphere library. Perhaps a few tricks might be useful for a casual developer, perhaps some of them could be avoided or replaced with better techniques. Anyway, the audience of this blogpost are the developers so expect gory technical details.

Background

I'm a part of the team that develops a new plugin from scratch for the JIRA Agile application. The plugin main functionality is to allow people to place estimations on the issues - in the realtime! So whenever you're participating in the planning session, you could instantly see in your JIRA what are the estimations from your colleagues - all this without the need of bringing those passé poker cards to the session or without using external application for managing the voting. Anyway, one of the non trivial development tasks was to implement the 'real-time' communication between different web browsers - for example, when user A places a vote, we would like to transmit that information to user B and show him the vote (without refreshing the page). Technically, we could implement all the stuff (front-end, back-end) by ourselves, but being smart life-experienced engineers, we reached for existing open-source solutions: the Atmosphere library. After all, it should save us a lot of time: no wheel reinventing, no corner case analysis - somebody has already spend his/her time on that.

The Story

Unfortunately for us, after managing to integrate with the aforementioned library in our JIRA plugin, we started to experience very UNSTABLE webdriver test runs on our build server. Yes, I know, webdriver tests are destined to be unstable, but nearly all of the tests were failing/erroring out with unpredictable scheme. That's not acceptable! So after a bit of astonishment (our plugin seemed to be OK when manually tested), we needed to admit that there was something sinister happening. In the end, it turned out to be a concurrency issue: in a non-deterministic condition, two threads were writing over the same object while they shouldn't have done that. Quite a trivial issue but due to the complexity of involved components (Tomcat, Atmosphere, JIRA, our plugin), it took me a few weeks to nail down the problem. And so, I decided to share a few of my experiences with you.

Initial analysis

Okay, so as I said before, the initial symptom was that our webdriver tests were failing. After inspecting build logs and screenshots from the failed tests, the following fact became obvious:

1) the browser usually failed to load the page completely - in the screenshots we could see a white (empty) space, without any html loaded;
2) sometimes the browser failed with the following screenshot:

Yuck, not very much informative.


3) build logs revealed that something wrong was happening with the http communication, as following exceptions were observed in the logs:

12-sty-2015 19:13:40 [INFO] http-bio-2990-exec-1 ERROR admin 1153x85x1 8a9ocx 127.0.0.1 /secure/RapidBoard.jspa  [com.atlassian.velocity.DefaultVelocityManager] Exception getting message body from Velocity: 12-sty-2015 19:13:40 [INFO] java.lang.IllegalStateException: getOutputStream() has already been called for this response 12-sty-2015 19:13:40 [INFO] at org.apache.catalina.connector.Response.getWriter(Response.java:639) 

or these ones:

12-sty-2015 14:16:48 [INFO] SEVERE: Servlet.service() for servlet [default] in context with path [/jira] threw exception 12-sty-2015 14:16:48 [INFO] java.lang.IllegalStateException: Cannot forward after response has been committed 12-sty-2015 14:16:48 [INFO] at org.apache.catalina.core.ApplicationDispatcher.doForward(ApplicationDispatcher.java:349)

I'm sparing you the details of the full exception stacktrace - it was pointing out that the exception came somewhere from the org.apache.catalina.* internals. Not good.


Okay, so the initial observation was that something was corrupting Tomcat (org.apache.catalina.*) Response objects - but who and why?

Let's debug

To make things worse, the problem appeared to be very tentative - it manifested itself nearly on each build run (where all the test are executed). Okay, good, but locally you wouldn't have such luck - there were no steps to reproduce the issue in a deterministic way. So, what were the options? I concluded that it might be a good idea to attach IDEA debugger to a running application and run the tests in a loop. I hoped that at some run the exception would occur and I would be able to inspect things a bit with debugger. So the first task was to put breakpoints at the right code lines. The logs from the builds contained relevant code parts, so it should be as simple as doing CTRL+N CTRL+N and entering, for example, "org.apache.catalina.connector.Response" and go to line 639 (see the first log I've attached above). So simple.

Unfortunately, the Tomcat code (org.apache.catalina.*) is not used explicitly by our plugin - of course, eventually the plugin runs in the JIRA environment, which in turn uses Tomcat as a servlet container, but the IDEA debugger has no idea of it! So we need to add proper dependency in our project's pom.xml for making things obvious to IDEA: first, we deduct which version of Tomcat is used:


...then, we may add a special maven profile in our pom.xml and enable it on our devbox:

Quite a neat trick - you can even commit that to a shared repository without fearing that you will disturb your co-developers.

Breakpoint catching

With above directives, IDEA is now able to see Tomcat sources. We can place breakpoints in the exception triggering places and start looping webdriver tests. Whew. A few coffees later, the breakpoint triggers: the JVM threads are frozen and we can inspect the situation prior to throwing IllegalStateException. As one might expect, there is nothing interesting: after inspecting the Response object (I'll spare the details), we can clearly see that it is 'committed' - whatever that means, but there is no trace of who did that. Let's check how Response.setCommitted() looks like (this is the only place where 'commited' flag is changed):

org.apache.coyote.Response.java:

215 public void setCommitted(boolean v) {

216 this.commited = v;

217 }

A simple setter. If only we could see the trace of who was executing this setter last time. Fortunately, IDEA breakpoints have nice functionalities - let's exert them! We do right click on the breakpoint:

... and there we can:

1) turn off thread suspension - because we don't want the application to stop on each "setCommitted" operation - there are thousands of them being correct while usually only one is wrong. We don't know upfront which one.
2) log evaluated expression - where we can execute one-line java expression so we can start logging something like:

Arrays.toString(new Exception().getStackTrace())

or we can even use existing object structures and methods to keep the IDEA console clean:

setHeader("tony_halik", Arrays.toString(newException().getStackTrace()))

And later retrieve the stacktrace of the offending thread via the debugger when the IllegalStateException does occur again:

... and reformat it in whatever text editor.


This allows us to see what committed the Response object - it is actually some code of the Atmosphere library (org.atmosphere.cpr.*) which we call during processing http request in our plugin (com.spartez.pokerng.atmosphere.SpartezAtmoshpereFilter). However, analyzing the relevant code of those classes leads to no conclusion. We need more data.

Tomcat threads

At this moment, I didn't understand clearly which threads were involved. I knew which thread was throwing the IllegalStateException because I could see it clearly in the debugger:

But was it the same thread that set 'commit' flag some time before? I wasn't sure. To better understand what was going on, I decided to use an old-fashioned System.out.println() technique and collect information about relevant threads and their behavior. I rebuilt Tomcat from scratch with a few additionalSystem.out.println(Thread.currentThread().getName() + " " + System.identityHashCode(someObject)) statements. The Thread.currentThread().getName() is enough to give the answer for the "which thread" question, while System.identityHashCode(obj) is enough to give the answer for the "which object" question. Not going into details of how I chose where to put those statements, let's just say that after some trial & error I had quite decent activity logs, like the example below:

(...)
[INFO] http-bio-2990-exec-6 SocketProcessor.run() - obudzil sie
[INFO] http-bio-2990-exec-6 org.apache.coyote.Response@21839092 sendHeaders()
[INFO] http-bio-2990-exec-6 org.apache.coyote.Response@21839092 setCommitted()
[INFO] http-bio-2990-exec-6 org.apache.coyote.Response@21839092 recycle() null
[INFO] http-bio-2990-exec-6 zaraz oddam responsa do puli, patrz nastepny recycle()
[INFO] http-bio-2990-exec-6 org.apache.coyote.Response@21839092 recycle() null
[INFO] http-bio-2990-exec-6 SocketProcessor.run() - skonczyl sie
[INFO] http-bio-2990-exec-6 SocketProcessor.run() - obudzil sie
[INFO] http-bio-2990-exec-6 org.apache.coyote.Response@1413693b sendHeaders()
[INFO] http-bio-2990-exec-6 org.apache.coyote.Response@1413693b setCommitted()
[INFO] http-bio-2990-exec-6 org.apache.coyote.Response@1413693b recycle() null
[INFO] http-bio-2990-exec-6 zaraz oddam responsa do puli, patrz nastepny recycle()
[INFO] http-bio-2990-exec-6 org.apache.coyote.Response@1413693b recycle() null
[INFO] http-bio-2990-exec-6 SocketProcessor.run() - skonczyl sie
[INFO] http-bio-2990-exec-6 SocketProcessor.run() - obudzil sie
[INFO] http-bio-2990-exec-2 org.apache.coyote.Response@44dfe44e sendHeaders()
[INFO] http-bio-2990-exec-2 org.apache.coyote.Response@44dfe44e setCommitted()
[INFO] http-bio-2990-exec-2 org.apache.coyote.Response@44dfe44e recycle() null
[INFO] http-bio-2990-exec-2 zaraz oddam responsa do puli, patrz nastepny recycle()
[INFO] http-bio-2990-exec-2 org.apache.coyote.Response@44dfe44e recycle() null
[INFO] http-bio-2990-exec-2 SocketProcessor.run() - skonczyl sie (...)

This is normal behaviour of Tomcat threads that handle http requests from the web browsers - usually the communcation goes like this:

1) new request has come from the web browser (via TCP) and Tomcat should handle that;
2) a thread is waking up from the pool - in the example above these threads are named "http-bio-2990-exec-X", and I logged wake up with the "obudzil sie" message written to the System.out;
3) a thread is handling the request - it passes it to the underlying servlet (JIRA side) or whatever and it uses Tomcat's org.apache.coyote.Response to deliver http content back to the web browser;
3.1) in the example above, you can see that http-bio-* threads are serving the headers (sendHeaders()) which result in setting committed flag; I wasn't logging other activity as I wasn't interested in it;
4) after finishing processing the request - the relevant resources (in our example: Response object) are recycled and the thread goes to the pool ("skonczyl sie" message)

... rinse and repeat until there are http requests to serve.

Tomcat recycling

The above example shows how the causal activity looks like. However, I noticed that our IllegalStateException was usually preceded with the (more or less) following activity:

[INFO] http-bio-2990-exec-10 SocketProcessor.run() - obudzil sie
[INFO] http-bio-2990-exec-6 SocketProcessor.run() - obudzil sie
[INFO] http-bio-2990-exec-10 org.apache.coyote.Response@52e93037 sendHeaders()
[INFO] http-bio-2990-exec-10 org.apache.coyote.Response@52e93037 setCommitted()
[INFO] http-bio-2990-exec-10 WARN anonymous 805x1914x2 1p39yc7 127.0.0.1 /plugins/servlet/jpp-ng/atmosphere/board/1
[INFO] [org.atmosphere.cpr.DefaultBroadcaster] Duplicate resource bee1c112-4f54-46dd-ae60-7a9360a0aea3. Could be
[INFO] caused by a dead connection not detected by your server. Replacing the old one with the fresh one
[INFO] http-bio-2990-exec-3 SocketProcessor.run() - obudzil sie
[INFO] http-bio-2990-exec-3 org.apache.coyote.Response@59909aec recycle() null
[INFO] http-bio-2990-exec-3 zaraz oddam responsa do puli, patrz nastepny recycle()
[INFO] http-bio-2990-exec-3 org.apache.coyote.Response@59909aec recycle() null
[INFO] http-bio-2990-exec-10 org.apache.coyote.Response@59909aec sendHeaders()
[INFO] http-bio-2990-exec-10 org.apache.coyote.Response@59909aec setCommitted()
[INFO] http-bio-2990-exec-3 SocketProcessor.run() - skonczyl sie [INFO] http-bio-2990-exec-3 SocketProcessor.run() - obudzil sie [INFO] http-bio-2990-exec-3 zaraz oddam responsa do puli, patrz nastepny recycle()
[INFO] http-bio-2990-exec-3 org.apache.coyote.Response@6d6935c4 recycle() null
[INFO] http-bio-2990-exec-3 SocketProcessor.run() - skonczyl sie
[INFO] http-bio-2990-exec-10 SocketProcessor.run() - skonczyl sie
[INFO] http-bio-2990-exec-6 org.apache.coyote.Response@21839092 setCommitted()
[INFO] http-bio-2990-exec-9 SocketProcessor.run() - obudzil sie [INFO] http-bio-2990-exec-9 org.apache.coyote.Response@52e93037 recycle() null
[INFO] http-bio-2990-exec-2 org.apache.coyote.Response@52e93037 sendHeaders() [INFO] http-bio-2990-exec-2 org.apache.coyote.Response@52e93037 setCommitted()
[INFO] http-bio-2990-exec-9 zaraz oddam responsa do puli, patrz nastepny recycle()
[INFO] http-bio-2990-exec-9 org.apache.coyote.Response@52e93037 recycle() null
[INFO] http-bio-2990-exec-9 SocketProcessor.run() - skonczyl sie
[INFO] http-bio-2990-exec-9 SocketProcessor.run() - obudzil sie
[INFO] http-bio-2990-exec-2 org.apache.coyote.Response@1413693b sendHeaders()
[INFO] http-bio-2990-exec-2 org.apache.coyote.Response@1413693b setCommitted()
[INFO] http-bio-2990-exec-2 org.apache.coyote.Response@1413693b recycle() null
[INFO] http-bio-2990-exec-2 zaraz oddam responsa do puli, patrz nastepny recycle()
[INFO] http-bio-2990-exec-2 org.apache.coyote.Response@1413693b recycle() null
[INFO] http-bio-2990-exec-2 SocketProcessor.run() - skonczyl sie

with the interesting part being actually this:

[INFO] http-bio-2990-exec-10 SocketProcessor.run() - obudzil sie (...)
[INFO] http-bio-2990-exec-3 SocketProcessor.run() - obudzil sie
[INFO] http-bio-2990-exec-3 org.apache.coyote.Response@59909aec recycle() null
[INFO] http-bio-2990-exec-3 zaraz oddam responsa do puli, patrz nastepny recycle()
[INFO] http-bio-2990-exec-3 org.apache.coyote.Response@59909aec recycle() null
[INFO] http-bio-2990-exec-10 org.apache.coyote.Response@59909aec sendHeaders()
[INFO] http-bio-2990-exec-10 org.apache.coyote.Response@59909aec setCommitted()

There are two threads involved; one of them seems to wake up out of the order and recycle the Response@59909aec, and put it back to the pool ("zaraz oddam responsa do puli" message). The other thread, however, is still messing with the Response object after recycle. Okay, so now let's take a look at the recycling mechanism in Tomcat:

// org.apache.coyote.Response code from tomcat 7.0.40. Lines 511-527:

public void recycle() {

contentType = null;

contentLanguage = null;

locale = DEFAULT_LOCALE;

characterEncoding = Constants.DEFAULT_CHARACTER_ENCODING;

charsetSet = false;

contentLength = -1;

status = 200;

message = null;

commited = false;

errorException = null;

headers.clear();

// update counters

contentWritten=0;

}

This is quite interesting. Tomcat actually reuses various objects that are needed for the communication with the web browser. In this blog post, we will focus on the Response object, but Tomcat actually uses the same recycling pattern for other http handling objects too. Back to the example above: you can see that recycle() method resets most of the local variables of the Response object to some default values. So whenever a Tomcat thread finishes handling the request, it actually calls that method and puts the Response object to the pool of ready-to-be-reused Responses. However, there might still be an object in the memory that references that Response object and will manipulate it! And it seems that this was the cause - the Atmosphere library is actually wrapping and storing the Response object in its data structures. Otherwise, it wouldn't be able to conduct asynchronous operations on the http requests as it should. Could it be that in some dire circumstances the Atmosphere code writes to the Response that Tomcat already collected for re-using?

Completing asynchronous request

After some investigating, I managed to nail down what was causing Tomcat to wake up a thread and recycle a Response. It was completing asynchronous request, which is done by calling asyncContext.complete() from the Servlet 3.0 API. I just needed to know now which thread (and why) was completing asynchronous requests. To achieve that I've rebuilt the Atmosphere library with an additional hack:

//my addition:<code data-redactor-tag="code"><code>response.setHeader("asyncCompleted", Thread.currentThread().getName() + " asyncCompleted "<code data-redactor-tag="code"><code><code data-redactor-tag="code"><code data-redactor-tag="code"><code><code data-redactor-tag="code"><code data-redactor-tag="code"><code><code data-redactor-tag="code"><code data-redactor-tag="code"><code><code data-redactor-tag="code"><code>

<code data-redactor-tag="code"><code><code data-redactor-tag="code"><code data-redactor-tag="code"><code><code data-redactor-tag="code"><code data-redactor-tag="code"><code><code data-redactor-tag="code"> <code data-redactor-tag="code"><code> <code data-redactor-tag="code"><code>+ Arrays.toString(new Exception("asyncCompleted").getStackTrace()));

<code data-redactor-tag="code"><code data-redactor-tag="code"><code>//end of my addition<code data-redactor-tag="code"><code>asyncContext.complete();

and tweaking the Tomcat Response:

public void recycle() {

contentType = null;

contentLanguage = null;

locale = DEFAULT_LOCALE;

characterEncoding = Constants.DEFAULT_CHARACTER_ENCODING;

charsetSet = false;

contentLength = -1;

status = 200;

message = null;

commited = false;

errorException = null;

+ String asyncCompleted = headers.getHeader("asyncCompleted");

headers.clear();

+ if (asyncCompleted != null) { setHeader("asyncCompleted", asyncCompleted); }

// update counters

contentWritten=0;

}

This allowed me to check in the debugger who was completing the asyncContext last time. Using the same technique as I used to determine the code that set the committed flag, I just needed to trigger the IllegalStateException once again and dig down to the Response object in the debugger:

This allowed me to see two stacktraces of the same thread:
1) the stacktrace in the moment of calling asyncContext.complete();
2) the stacktrace in the moment of committing the Response.

By manual inspection of these two (and analyzing the difference), I finally arrived at this Atmosphere code:

// org.atmosphere.cpr.AtmosphereResourceImpl: ver 2.2.3 lines 737-782

public void cancel() throws IOException {

try {

if (!isCancelled.getAndSet(true)) {

(...)

asyncSupport.complete(this); // <-- this calls eventually

// asyncContext.complete() (...)

// We must close the underlying WebSocket as well.

if (AtmosphereResponse.class.isAssignableFrom(response.getClass())) {

AtmosphereResponse.class.cast(response).close();

AtmosphereResponse.class.cast(response).destroy();

}

(...)

}

} finally {

(...)

It seems that the Atmosphere 'cancel' implementation first signals that the asynchronous operation is completed (and thus Tomcat recycles the Response with another thread). Later, the Atmosphere calls .close() on the wrapped Response object. Further inspection revealed that .close() implementation leads to changing Response state to committed. Could it be that actually completing the asyncContext should mean that nobody can manipulate the Response object anymore? In order to test that hypothesis, let's use another feature of breakpoints:

Let's allow of the asyncContext.complete() operation and freeze the current thread right in the succeeding statement (line 757 in the above screenshot - note that I've checked Suspend: "Thread" option instead of default "All"). This will allow other threads to work in the background and possibly one of them will collect and recycle() the response object. And in fact - stopping at this breakpoint and resuming after a few seconds made the IllegalStateException problem nearly 100% replicating. Furthermore, commenting out the line

// AtmosphereResponse.class.cast(response).close();


made the webdriver tests to run stable. Whoa. Could it be that the Atmosphere authors didn't know that you can't manipulate Response objects after completing asynchronous operations? Perhaps the problem only manifests itself in the JIRA/Tomcat7 environment - I guess Atmosphere is used occasionally on such a combo. So perhaps nobody has had this problem yet because other containers have different (contention-proof) recycling mechanisms. Or maybe no recycling mechanisms at all. I dunno.

Conclusion

Fixing the problem should be quite easy now - for example, one might modify the Atmosphere code and get rid of these unnecessary .close() actions. Another one could raise an issue in the Atmosphere Github repository and wait till the authors fix the issue. Another one could wrap the Response objects with own wrapper that will be able to cut any action after the asyncContext.complete() is called. Another one might find yet another creative solution.

Anyway, that's a different story. In this blog post, I wanted to focus on analyzing concurrency issues, and as we finally got into the root cause of the problem, I think we can finish the story here. I hope you enjoyed it, thanks for reading.

Oh, and by the way: if you are an active JIRA Agile user and you do estimating of story points for your issues (or plan to add that step to your development process), you might be interested in checking out our new add-on available on the Atlassian Marketplace: Agile Estimates. It's a tool for collaborative estimating within JIRA Agile directly, and incidentally also the product that made me write this blog post in the first place. :)

This website, like many others, uses small files called cookies to help us provide social media features and to analyse our traffic.
You consent to our cookies if you continue to use this website.

OK