We stress-tested our application with several jobs running in parallel and also tested the restart behavior. In two out of 760 restarts, we saw the following issue:
As far as we could analyze it, the issue comes from jobs that are resumed from timers. The jobs were waiting in a timer (10s) and were suspended. After the restart the timer had expired and the jobs were resumed in parallel.
The issue seems to come from ExecutionEntity.ensureProcessDefinitionInitialized()
This method needs to be synchronized; otherwise several deployments run in parallel and conflict because of shared state in the XML parser engine.
The class TaskListenerParser extends ActivitiListenerParser, which has a field
that holds state. An instance of TaskListenerParser is stored in class BaseBpmnXMLConverter as a static reference in the field
which is accessed in parallel and is therefore not thread-safe.
Synchronizing ExecutionEntity.ensureProcessDefinitionInitialized() or removing the shared state in ActivitiListenerParser should fix the issue. There may be more issues with shared state that we have not observed yet.
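The two options can be illustrated with a minimal, self-contained sketch. It does not use the Activiti classes: StatefulParser, SHARED, parseSynchronized and parseWithFreshInstance are hypothetical stand-ins for a parser that keeps per-parse state in a field and is reachable through a static reference. It only shows the general pattern, assuming the problem is as described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.UnaryOperator;

public class SharedParserSketch {

    // Hypothetical stand-in for a parser that keeps per-parse state in a field.
    static class StatefulParser {
        private final StringBuilder buffer = new StringBuilder(); // mutable state

        String parse(String input) {
            buffer.setLength(0);
            for (char c : input.toCharArray()) {
                buffer.append(c);
            }
            return buffer.toString();
        }
    }

    // One instance shared by all threads through a static reference.
    static final StatefulParser SHARED = new StatefulParser();

    // Option 1: serialize all access to the shared instance.
    static String parseSynchronized(String input) {
        synchronized (SHARED) {
            return SHARED.parse(input);
        }
    }

    // Option 2: remove the shared state by using a fresh instance per call.
    static String parseWithFreshInstance(String input) {
        return new StatefulParser().parse(input);
    }

    // Runs parseFn from several threads and counts results that differ from the input.
    static int countMismatches(UnaryOperator<String> parseFn, int threads, int iterations)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final String input = "definition-" + t;
                results.add(pool.submit(() -> {
                    int mismatches = 0;
                    for (int i = 0; i < iterations; i++) {
                        if (!input.equals(parseFn.apply(input))) {
                            mismatches++;
                        }
                    }
                    return mismatches;
                }));
            }
            int total = 0;
            for (Future<Integer> f : results) {
                total += f.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("synchronized: "
                + countMismatches(SharedParserSketch::parseSynchronized, 8, 10_000));
        System.out.println("fresh instance: "
                + countMismatches(SharedParserSketch::parseWithFreshInstance, 8, 10_000));
    }
}
```

Calling SHARED.parse(...) directly from several threads, without either option, is the unsafe variant: results may be corrupted or an exception may be thrown, and it may also appear to work for many runs, which matches how rarely we saw the problem.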
We cannot provide a test for it. The issue is hard to reproduce.
Theoretically, it can be reproduced in the following way (we did not manage to do so ourselves):
1) Create a workflow with an endless loop over a timer. The timer should suspend the job for several seconds.
2) Create multiple jobs and let them suspend in the timer.
3) Shut down Activiti (suspend all jobs with suspendProcessInstanceById, then call ProcessEngines.destroy).
4) Restart Activiti and observe the debug output (JVM restart, ProcessEngines.init, resume all jobs with activateProcessInstanceById).
5) Let jobs run into timers again.
6) Repeat from step 3).
Do this many times and the issue should occur again. We currently have no idea how to increase the probability of running into the issue.
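One generic technique that might raise the probability is to release all competing threads at the same instant with a latch, so that they reach the shared code as close together as possible. The sketch below is untested against Activiti; runAllAtOnce and the task passed to it are hypothetical, and in a real attempt the task would be whatever call ends up in ExecutionEntity.ensureProcessDefinitionInitialized().

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class SimultaneousStartSketch {

    // Starts `threads` threads, holds them at a gate, releases them together,
    // and returns how many ran the task without throwing.
    static int runAllAtOnce(int threads, Runnable task) throws InterruptedException {
        CountDownLatch ready = new CountDownLatch(threads); // all threads started
        CountDownLatch go = new CountDownLatch(1);          // the gate
        CountDownLatch done = new CountDownLatch(threads);  // all threads finished
        AtomicInteger completed = new AtomicInteger();

        for (int i = 0; i < threads; i++) {
            new Thread(() -> {
                ready.countDown();
                try {
                    go.await();   // wait until every thread is ready
                    task.run();   // hypothetical: the call under test
                    completed.incrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            }).start();
        }

        ready.await();   // every thread has reached the gate
        go.countDown();  // open the gate for all of them at once
        done.await();
        return completed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger calls = new AtomicInteger();
        int ok = runAllAtOnce(16, calls::incrementAndGet);
        System.out.println(ok + " threads completed, " + calls.get() + " calls made");
    }
}
```

A return value lower than the number of threads means at least one task threw, which is the signal to look for in the debug output.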