Synchronization issue in parallel multi-instance call activity

Description

When using a multi-instance call activity to iterate over a collection in parallel, it is possible for two or more threads to update the nrOfCompletedInstances and nrOfActiveInstances variables at the same time. This leaves the variables with wrong values.

The following lines are from Activiti's debug logs:

As can be seen, two threads try to update the variables at nearly the same time and the second one sets wrong values for them. Instead of setting 2 for nrOfCompletedInstances and 6 for nrOfActiveInstances, it sets 1 and 7, respectively.

This leads to some weird behavior. For example, if the collection that we're iterating over contains the numbers from 1 to 8, and each call activity prints one of these numbers, the output should be similar to:
1 2 3 4 5 6 7 8 (the order would probably be different due to the parallel execution)

However, because of the issue mentioned above, the result is:
2 1 3 4 5 6 7 8 1 4 4 (note the extra numbers)

The issue seems to be in the org.activiti.engine.impl.bpmn.behavior.ParallelMultiInstanceBehavior's leave() method. There, the values of nrOfCompletedInstances and nrOfActiveInstances are retrieved from the execution, incremented (or decremented) and then set back into the execution. These two operations are not in a synchronized block, however, nor are they in a separate DB transaction. This probably leads to the following situation:

1) Thread 1 retrieves the variables and they have values: nrOfCompletedInstances = 0, nrOfActiveInstances = 8.
2) Thread 2 retrieves the variables, before Thread 1 updates them and they have the same values: nrOfCompletedInstances = 0, nrOfActiveInstances = 8.
3) Both of them modify the variables independently, resulting in the values: nrOfCompletedInstances = 1, nrOfActiveInstances = 7.
4) Both of them attempt to set them back into the execution with these wrong values.

For more information, see this forum thread:
https://community.alfresco.com/thread/226908-synchronization-issue-in-parallel-multi-instance-call-activity

Environment

None

Activity

Show:
Johnathan James
February 16, 2017, 6:44 PM
Edited

TL;DR: This issue happens because two threads are updating the database at the same time. An exception occurs which causes the thread to leave the Job table in a dirty state. These are the "extra numbers" you see running the unit test. However, my proposed solution, allowing only a single thread to flush at a time, causes unit tests to break. Help needed.

To see why this issue occurs, use the test project associated with this issue (parallel-activiti-test.zip). In org.actviti..CommandContext.java, change the flushSessions() method so that it reads like this:

And update your log4j.properties in the test project:

Then run MyUnitTest.java.

I have attached an image of the log, demonstrating the issue. You can see that Thread 2 begins flushing, and then before it finishes, Thread 1 begins flushing. This soon results in an OptimisticLockingException; Thread 2 has its transaction rolled back. This causes its JobEntity not to be deleted from the database, even though it has completed. This is an "extra number" which appears later on.

The solution to this, I hope you will agree, would be to ensure that only one thread is flushing at a time.

Update your CommandContext.java with the following flushSessions() method:

When we do this, and re-run MyUnitTest, we still unfortunately see an occassional OptimisticLockingException (another problem somewhere else?), but the error which this issue describes, is no longer reproducible; the extra numbers do not appear.

Only one thread is flushing at a time, and the issue is no longer reproducible. This feels "more correct" to me.

However, when you run the unit tests, at least three tests fail

Are they depending on a design flaw, which now corrected causes them to fail? Or is my "more correct" solution actually invalid? Help!

Assignee

Unassigned

Reporter

Alexander Tsvetkov

Labels

None

Components

Priority

High
Configure