Wick Technology Blog

Distributed Spring State Machines

June 12, 2019

Race conditions are annoyingly difficult to deal with. When something works only some of the time, and depends on events arriving with the right timing, it is hard to diagnose and fix. The usual remedy is more checks and guards, and the problem gets harder still when the logic is distributed across multiple processes.

Spring State Machine is excellent for representing Sagas (as referred to in Domain Driven Design) or long-lived business process managers, but it can fall into the trap of race conditions, where a sequence of events only works when the incoming events arrive in an orderly fashion.

While testing a new Saga, I found that when a transition action sent a message to a Kafka topic, the response message would sometimes come back before the transition to the new state had completed. This meant that when my StreamListener received the response message and tried to send a new event to the state machine, the state machine was still in the initial state and therefore didn’t handle it.

Deferred Events

Deferred events allow you to easily sidestep this problem. Spring State Machine will hold the event back until a state is reached that doesn’t list it as a deferred event. This allows an event to be played on the next state change or many state changes down the line. It greatly simplifies coordinating a sequence of events that arrive out of order.
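To make the idea concrete, here is a minimal plain-Java sketch of event deferral. It is a toy model written for this post's states and events, not Spring State Machine's implementation; the class name `DeferredEventSketch` and its transition table are assumptions made for illustration. In Spring State Machine itself, deferral is declared in the state configuration by listing the deferred events alongside the state, e.g. `state(INITIAL, RECEIVED_EVENT)`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of event deferral; NOT Spring State Machine's implementation.
class DeferredEventSketch {
    enum State { INITIAL, COMMAND_SENT, ACTION_COMPLETED }
    enum Event { ACTION_REQUESTED, RECEIVED_EVENT }

    private State current = State.INITIAL;
    // Events that each state holds back instead of dropping.
    private final Map<State, Set<Event>> deferredBy = new HashMap<>();
    // Events waiting for a state that can handle them.
    private final Deque<Event> held = new ArrayDeque<>();

    DeferredEventSketch() {
        // Assumption: INITIAL defers the reply, because it can arrive too early.
        deferredBy.put(State.INITIAL, EnumSet.of(Event.RECEIVED_EVENT));
    }

    State state() {
        return current;
    }

    // Hypothetical transition table using the states from this post.
    private static State next(State s, Event e) {
        if (s == State.INITIAL && e == Event.ACTION_REQUESTED) return State.COMMAND_SENT;
        if (s == State.COMMAND_SENT && e == Event.RECEIVED_EVENT) return State.ACTION_COMPLETED;
        return null;
    }

    void send(Event e) {
        State target = next(current, e);
        if (target == null) {
            // No transition from here: hold the event if this state defers it, otherwise drop it.
            if (deferredBy.getOrDefault(current, EnumSet.noneOf(Event.class)).contains(e)) {
                held.add(e);
            }
            return;
        }
        current = target;
        // After every state change, replay whatever was held back.
        List<Event> pending = new ArrayList<>(held);
        held.clear();
        for (Event p : pending) {
            send(p);
        }
    }
}
```

Sending `RECEIVED_EVENT` first and `ACTION_REQUESTED` second still ends in `ACTION_COMPLETED`, because the early reply is held in `INITIAL` and replayed once `COMMAND_SENT` is reached.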

This can even help deal with race conditions across microservice applications, as long as only one process picks up the message (e.g. when using exclusive consumers in ActiveMQ).

Distributed running

Deferred events work well for single node state machines but can still fall foul of inconsistent state when run across multiple nodes.

The problem is that if the first node sends a command, the second node can receive the event saying it’s complete. That event could be received before the first node has committed the state transition triggered by sending the command.

The safest way to use state machines in a distributed way is to do things sequentially (no forks or joins) and only send messages once the state transition has been committed. To achieve this, use state actions rather than actions on transitions, e.g.

builder.configureStates()
        .withStates()
        .state(COMMAND_SENT, Actions.errorCallingAction(
                action(this::sendCommand), action(this::moveToFailedState)))

will work fine, however, don’t use:

builder.configureTransitions()
        .withExternal()
        .source(INITIAL).target(COMMAND_SENT)
        .event(ACTION_REQUESTED)
        .action(action(this::sendCommand))

since it could cause issues with ordering. Even if the message sending fails, you can transition to a “failed” state in an error action, as shown in this state machine configuration:

stateMachineBuilder.configureStates()
    .withStates()
    .initial(INITIAL)
    // The command is sent on entering COMMAND_SENT; if sending throws, the error action runs
    .state(COMMAND_SENT, Actions.errorCallingAction(
            action(this::sendCommand), action(this::moveToFailedState)))
    .state(COMMAND_SENDING_FAILED)
    .end(ACTION_COMPLETED);

stateMachineBuilder.configureTransitions()
    .withExternal()
    .source(INITIAL).target(COMMAND_SENT)
    // This event is sent to the state machine in the handler for an HTTP request
    .event(ACTION_REQUESTED)
    .and().withExternal()
    .source(COMMAND_SENT).target(ACTION_COMPLETED)
    // This event is sent to the state machine in a message handler (could be ActiveMQ, RabbitMQ)
    .event(RECEIVED_EVENT)
    .and().withExternal()
    .source(COMMAND_SENT).target(COMMAND_SENDING_FAILED)
    .event(FAILED_TO_SEND_COMMAND);

Now the command will only be sent once the state machine has moved to the COMMAND_SENT state. This means the second node, even if it receives a very quick reply, will always be in the correct state to deal with the event.
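The difference in ordering can be sketched in plain Java. This is a toy model, not Spring State Machine code: `CommitOrderSketch`, its `store` field (standing in for the persisted state every node reads) and the instantaneous reply are assumptions chosen to show the worst case, where the reply arrives the moment the command is sent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Toy model of "send before commit" versus "commit before send"; NOT Spring State Machine code.
class CommitOrderSketch {
    enum State { INITIAL, COMMAND_SENT, ACTION_COMPLETED }

    // Stands in for the persisted state that every node reads.
    private final AtomicReference<State> store = new AtomicReference<>(State.INITIAL);
    final List<String> log = new ArrayList<>();

    // Worst case: the reply reaches another node the instant the command is sent.
    private void sendCommand() {
        log.add("command sent");
        onReply();
    }

    // Second node: the reply can only be handled from COMMAND_SENT.
    private void onReply() {
        if (store.compareAndSet(State.COMMAND_SENT, State.ACTION_COMPLETED)) {
            log.add("reply handled");
        } else {
            log.add("reply dropped in " + store.get());
        }
    }

    // Like a transition action: the command goes out before the new state is committed.
    State withTransitionAction() {
        sendCommand();
        store.compareAndSet(State.INITIAL, State.COMMAND_SENT);
        return store.get();
    }

    // Like a state action: the new state is committed first, then the command goes out.
    State withStateAction() {
        store.set(State.COMMAND_SENT);
        sendCommand();
        return store.get();
    }
}
```

On a fresh instance, `withTransitionAction()` ends in `COMMAND_SENT` with the reply dropped, while `withStateAction()` ends in `ACTION_COMPLETED`.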

Combining this technique with deferred events may seem like a good idea, but the process can still get stuck. One node could send the command, the second node receives the related event and defers it, and the first node then transitions to the next state. Nothing tells the second node that a transition has occurred on the first, so the deferred event is never replayed.
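A rough plain-Java sketch of that stuck scenario follows. It assumes, purely for illustration, that both nodes share the persisted state but keep deferred events in local memory; `StuckDeferralSketch`, `SharedStore` and `Node` are hypothetical names, not Spring State Machine types.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicReference;

// Toy model of two nodes sharing persisted state; NOT Spring State Machine's implementation.
class StuckDeferralSketch {
    enum State { INITIAL, COMMAND_SENT, ACTION_COMPLETED }

    // Stands in for the persisted state that both nodes read and write.
    static final class SharedStore {
        final AtomicReference<State> state = new AtomicReference<>(State.INITIAL);
    }

    static final class Node {
        private final SharedStore store;
        // Assumption: deferred events live in this node's memory only.
        final Deque<String> deferred = new ArrayDeque<>();

        Node(SharedStore store) {
            this.store = store;
        }

        // Handle the reply if the shared state allows it, otherwise defer it locally.
        void onReply() {
            if (!store.state.compareAndSet(State.COMMAND_SENT, State.ACTION_COMPLETED)) {
                deferred.add("RECEIVED_EVENT");
            }
        }

        // Commit a new state, then replay only the events this node deferred.
        void commit(State next) {
            store.state.set(next);
            int pending = deferred.size();
            for (int i = 0; i < pending; i++) {
                deferred.poll();
                onReply();
            }
        }
    }

    // Plays the sequence described above and returns the final shared state.
    static State run() {
        SharedStore store = new SharedStore();
        Node first = new Node(store);
        Node second = new Node(store);
        second.onReply();                 // the reply lands on the second node too early
        first.commit(State.COMMAND_SENT); // the first node has nothing deferred to replay
        return store.state.get();
    }
}
```

`run()` finishes in `COMMAND_SENT`: the reply is still sitting in the second node's queue, and only a state change on that node would replay it.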

The simpler the process, the better

Running state machines in replicated processes is difficult and requires consideration of many different scenarios. If you can make your process flow simpler, you will decrease the chance of problems.

Written by Phil Hardwick