For reasons we have not yet identified, we are experiencing a very high CPU load on our PostgreSQL DB server when processing workflows with Activiti. We are still investigating that, but along the way we discovered the following problem, which leads to a fairly high number of aborted jobs during workflow processing:
In the performOperation method of the org.activiti.engine.impl.interceptor.CommandContext class, the passed-in execution instance is accessed without being checked for null. However, that execution instance is loaded from the DB further down the stack trace, in the execute method of the org.activiti.engine.impl.persistence.entity.JobEntity class, via the findExecutionById method of org.activiti.engine.impl.persistence.entity.ExecutionEntityManager. If no entry is found in the DB, that method returns null. In the end, this leads to a NullPointerException in the above-mentioned CommandContext class without any further logging (only the NullPointerException itself is logged).
Investigating the source code of the Activiti Engine, one can find that in other locations where a similar process is followed, there are checks for the execution instance being null, for example in the org.activiti.engine.impl.cmd.SignalEventReceivedCmd class and others.
We currently only know that our DB does not find the expected execution instance and returns null whenever the CPU load of the DB server increases above a critical level (more than 100% CPU usage). Of course, this should not be the case, but we also think that no entity loaded from the DB should be accessed directly without checking it for being null.
Below, you can find a stack trace of the NullPointerException:
Activiti version 5.21.0 in an OSGi (Eclipse Equinox) context on CentOS 6.8 64-Bit.
PostgreSQL 9.4 DB server.
I can confirm the potential NPE, but I disagree that the solution is to add null pointer checks. The stated example of the SignalEventReceivedCmd class did not demonstrate null pointer checks on passed-in variables, at least at press time.
The problem is that the JobEntity has an executionId, which no longer refers to an actual ExecutionEntity object.
JobEntity uses its executionId member variable to call the ExecutionEntityManager to retrieve the associated entity. The ExecutionEntityManager reports that no such entity with the JobEntity's executionId can be found, and returns null.
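To make the failure path concrete, here is a minimal sketch of the flow described above. It does not use the Activiti classes: Execution, ExecutionStore, performOperation and executeJob are hypothetical stand-ins for ExecutionEntity, ExecutionEntityManager, CommandContext.performOperation and JobEntity.execute, and the in-memory map stands in for the DB table.

```java
import java.util.HashMap;
import java.util.Map;

public class StaleExecutionIdSketch {

    /** Hypothetical stand-in for an ExecutionEntity row. */
    static final class Execution {
        final String id;
        Execution(String id) { this.id = id; }
    }

    /** Hypothetical stand-in for ExecutionEntityManager; the map plays the role of the table. */
    static final class ExecutionStore {
        private final Map<String, Execution> rows = new HashMap<>();

        void insert(Execution execution) { rows.put(execution.id, execution); }

        /** Returns null when no row matches, like the lookup described above. */
        Execution findExecutionById(String id) { return rows.get(id); }
    }

    /** Stand-in for performOperation: dereferences its argument without a null check. */
    static String performOperation(Execution execution) {
        return "performed on " + execution.id; // NullPointerException if execution is null
    }

    /** Stand-in for the job's execute method: looks up the execution, then passes it on. */
    static String executeJob(String executionId, ExecutionStore store) {
        Execution execution = null;
        if (executionId != null) {
            execution = store.findExecutionById(executionId);
        }
        return performOperation(execution);
    }

    public static void main(String[] args) {
        ExecutionStore store = new ExecutionStore();
        store.insert(new Execution("exec-1"));

        System.out.println(executeJob("exec-1", store));

        try {
            executeJob("exec-gone", store); // executionId with no matching row
        } catch (NullPointerException e) {
            System.out.println("NPE for stale executionId; the exception names neither job nor id");
        }
    }
}
```

The point of the sketch is only that the null is produced in one place and dereferenced in another, so the resulting exception carries no hint about which job or executionId was involved.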
To be 100% sure, I need to know whether, in the method CommandContext.performOperation(AtomicOperation, InterpretableExecution), when the InterpretableExecution is null,
1) does that indicate an error condition, because a NULL InterpretableExecution param should never happen? (In which case JobEntity needs better handling of its no-longer-valid executionId)
or 2) should the net effect of the call to .performOperation() be a NO-OP, because we should expect the occasional NULL InterpretableExecution object, and should just continue gracefully.
From my cursory perusal of the code, it seems the intent is #1, that the NPE should be thrown because NULL InterpretableExecution objects are not to be expected.
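The two readings can be sketched side by side. This is an illustration only, not Activiti code: the Policy enum, the executeJob signature and the use of IllegalStateException are assumptions made for the example, and the map again stands in for the execution table.

```java
import java.util.HashMap;
import java.util.Map;

public class MissingExecutionPolicySketch {

    /** Hypothetical switch between the two readings discussed above. */
    enum Policy {
        FAIL_FAST, // reading 1: a missing execution is an error condition
        NO_OP      // reading 2: a missing execution is expected and silently skipped
    }

    /**
     * Hypothetical job execution: maps executionId to an execution name.
     * A missing entry is handled according to the chosen policy.
     */
    static String executeJob(String jobId, String executionId,
                             Map<String, String> executions, Policy policy) {
        String execution = executions.get(executionId);
        if (execution == null) {
            if (policy == Policy.NO_OP) {
                return "skipped";
            }
            // Still a failure, but one that names the job and the stale id.
            throw new IllegalStateException("Job " + jobId + " references execution "
                    + executionId + ", which no longer exists");
        }
        return "performed on " + execution;
    }

    public static void main(String[] args) {
        Map<String, String> executions = new HashMap<>();
        executions.put("exec-1", "first execution");

        System.out.println(executeJob("job-1", "exec-1", executions, Policy.FAIL_FAST));
        System.out.println(executeJob("job-2", "exec-gone", executions, Policy.NO_OP));
        try {
            executeJob("job-3", "exec-gone", executions, Policy.FAIL_FAST);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under reading 1 the job still fails, but the failure is raised where the stale executionId is detected and says what went wrong. Under reading 2 the job completes without doing anything, which hides the inconsistency.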
As for ExecutionEntityManager, then: we assume it is making its SELECT calls correctly and that the ExecutionEntity truly does not exist in the table. Returning null seems the correct behavior to indicate that no entity exists.
The question is what to do when the executionId currently set on the JobEntity object does not reference a valid object and JobEntity.execute() is then called. What should it pass to jobHandler.execute()?