Wick Technology Blog

Agreeing on state across microservices

April 24, 2020

At Mettle, our system is event driven. Services communicate and collaborate by sending events. But when information is sent asynchronously, how does our system come to an agreed state?

One key decision is how to manage business processes which effect a state change in multiple services. For this, we chose to use orchestration rather than choreography.

What is orchestration?

Orchestration is where a single process tells each component directly to do something using a command. Usually this is implemented as a saga, as in DDD. A command is simply a message to another service which tells it to do something, as opposed to an event which contains the state of the thing which changed. The business process is coded completely in one service and this one service knows about all the services it is sending commands to.
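To make the command/event distinction concrete, here is a minimal sketch of an orchestrator. It is illustrative only, not our production code: the `Command` and `Event` classes, the service names, the command names and the `send` callable are all assumptions made up for this example, and a real saga would also handle failures and persistence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    """An instruction addressed to one service (imperative name)."""
    target: str
    name: str
    payload: dict


@dataclass(frozen=True)
class Event:
    """A record of something that has already happened (past-tense name)."""
    source: str
    name: str
    payload: dict


class ChangeAddressOrchestrator:
    """Hypothetical orchestrator: it alone knows every step and every target."""

    def __init__(self, send):
        # `send` delivers a Command and returns the Event the target replies with.
        self.send = send

    def run(self, customer_id, address):
        payload = {"customer_id": customer_id, "address": address}
        steps = [
            Command("customer-service", "UpdateCustomerAddress", payload),
            Command("card-service", "UpdateCardPostalAddress", payload),
        ]
        # The business process is simply this ordered list of commands.
        return [self.send(step) for step in steps]


def fake_send(command):
    # Stand-in for real messaging: every command succeeds immediately.
    return Event(command.target, command.name + "Succeeded", command.payload)


events = ChangeAddressOrchestrator(fake_send).run("c1", "2 New Street")
print([event.name for event in events])
```

The point of the sketch is that the ordering and the list of targets live in one class, while the services only need to understand the commands addressed to them.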

What is choreography?

Choreography aims for a more fluid, more decoupled system by sending events which ripple through the system to effect change. When a change happens, an event is sent, and any services which want to know about that change subscribe and update their state. They may in turn send events which other services listen to. This way each service only knows about the events it is listening to. The business process is implicit in the way the events are listened to and handled by each service.
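A small sketch of that ripple effect, using an in-memory publish/subscribe bus in place of a real broker. The `EventBus` class, the event names (`CustomerCreated`, `ApplicantCreated`) and the handler names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class EventBus:
    """Tiny in-memory stand-in for a message broker (an assumption for this sketch)."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self.published = []  # names of events, in the order they were published

    def subscribe(self, event_name, handler):
        self._subscribers[event_name].append(handler)

    def publish(self, event_name, payload):
        self.published.append(event_name)
        for handler in list(self._subscribers[event_name]):
            handler(payload)


bus = EventBus()
applicants = set()
ready_for_credit_check = set()


def applicant_service_on_customer_created(payload):
    # Reacts to one event, updates its own state, then emits its own event.
    applicants.add(payload["customer_id"])
    bus.publish("ApplicantCreated", payload)


def credit_check_service_on_applicant_created(payload):
    ready_for_credit_check.add(payload["customer_id"])


bus.subscribe("CustomerCreated", applicant_service_on_customer_created)
bus.subscribe("ApplicantCreated", credit_check_service_on_applicant_created)

bus.publish("CustomerCreated", {"customer_id": "c1"})
print(bus.published)
```

No single place in this code describes the whole process; it only emerges from which handler is subscribed to which event.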

Why didn’t we choose choreography?

Choreography sounds pretty good - it’s decoupled and simple. But what happens when failures occur?

Events are immutable - they’re records of things which have definitely happened and which cannot be changed.

When failures occur in choreography the only way to undo unchangeable events is to send more events.

For example, in a credit card application, when a customer changes their address, the customer service might send an AddressChanged event. The card service listens for that event and changes the address it sends the customer’s new cards to. But if the card service fails to update the card postal address, it needs to send a CardPostalAddressUpdateFailed event, which the customer service then needs to listen for so that it can revert the address change it just made.
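The address example can be sketched roughly as follows, with the card service failure simulated. `AddressChanged` and `CardPostalAddressUpdateFailed` come from the example above; the `AddressChangeReverted` event, the `previous_address` field and the in-memory `publish` helper are assumptions added for illustration.

```python
from collections import defaultdict

handlers = defaultdict(list)  # event name -> subscribed handlers
published = []                # event names, in publication order


def publish(name, payload):
    published.append(name)
    for handler in list(handlers[name]):
        handler(payload)


customer_addresses = {"c1": "1 Old Road"}


def change_address(customer_id, new_address):
    # Customer service: apply the change, then announce it.
    previous = customer_addresses[customer_id]
    customer_addresses[customer_id] = new_address
    publish("AddressChanged", {
        "customer_id": customer_id,
        "address": new_address,
        "previous_address": previous,  # needed later to undo the change
    })


def card_service_on_address_changed(payload):
    # Card service: the update is simulated as always failing.
    publish("CardPostalAddressUpdateFailed", payload)


def customer_service_on_card_update_failed(payload):
    # Customer service must now know about a card service event in order to undo.
    customer_addresses[payload["customer_id"]] = payload["previous_address"]
    publish("AddressChangeReverted", payload)


handlers["AddressChanged"].append(card_service_on_address_changed)
handlers["CardPostalAddressUpdateFailed"].append(customer_service_on_card_update_failed)

change_address("c1", "2 New Street")
print(published)
print(customer_addresses)
```

One failed update already costs three events and makes the customer service subscribe to an event owned by the card service.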

If handling an event fails, the originating service has to listen for the failure event and then undo the original change with another event. This means the originating service now has to listen to many more events than before and knows about far more of the system than intended. It’s no longer decoupled. This gets trickier when multiple services listen to an event and act on it, and then one fails and the other doesn’t. During failure scenarios, choreography requires many more messages for the system to come to an agreed state.

Choreography should only be used when the event being propagated cannot fail, which is basically impossible to guarantee in a distributed system. When sending to an internal service, failure may be less likely and choreography could be used, but it definitely shouldn’t be used to propagate state to an external system or third party. Sending data to a third party is very likely to fail.

Using choreography, it’s also more difficult to sequence dependent actions reliably. For example, if creating a customer also creates an applicant in a third party system, and issuing a credit card then causes a credit check to be done on that applicant in the third party system, how does the whole system know the applicant has been created successfully and is ready to be credit checked? To know, it needs to listen to events from the service that integrates with the third party, and the whole system becomes less decoupled.

It’s also difficult to flex processes and only perform part of the process. In the above example, if you wanted to issue a credit card without doing a credit check, for example because you’ve done a check within the last 24 hours, the logic for that needs to be in the credit check service. The credit check service is listening to events and now also deciding whether it should act on them. This results in the business process logic being scattered across services.

Why choose orchestration then?

Orchestration still allows decoupling, but the orchestrating service knows about all the services it’s instructing and listening to. Better to have that knowledge in one place than scattered across services.

We are able to look at our orchestration services and completely understand the way the system implements the business process. All of the ordering, dependencies, data and steps taken are in one place, so it’s easy to comprehend and also easy to change.
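As a rough sketch of what "in one place" can look like, take the earlier case of skipping a credit check that was done within the last 24 hours. The function name, the service and command names and the tuple format are all hypothetical; the point is only that the decision sits in the orchestrator rather than in the credit check service.

```python
from datetime import datetime, timedelta

CREDIT_CHECK_VALIDITY = timedelta(hours=24)


def plan_card_issuance(customer_id, last_credit_check, now):
    """Return the ordered commands for issuing a card as (target, name, payload) tuples."""
    payload = {"customer_id": customer_id}
    commands = []
    recently_checked = (
        last_credit_check is not None
        and now - last_credit_check < CREDIT_CHECK_VALIDITY
    )
    if not recently_checked:
        # The decision to skip the check is made here, not in the credit check service.
        commands.append(("credit-check-service", "RunCreditCheck", payload))
    commands.append(("card-service", "IssueCard", payload))
    return commands


now = datetime(2020, 4, 24, 12, 0)
never_checked = plan_card_issuance("c1", None, now)
checked_an_hour_ago = plan_card_issuance("c1", now - timedelta(hours=1), now)
print([name for _, name, _ in never_checked])
print([name for _, name, _ in checked_an_hour_ago])
```

Changing the rule, for instance to a different validity window, would then mean editing this one function.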

The one downside is that it takes longer to implement and more messages need to be sent, since the orchestrator is sending commands as well as listening to events. When implementing something like the deletion of data, orchestration is more work: every time you add a new service with user data, you also need to go to the data deletion orchestrator and add a new step to send a delete command to the new service. In this example it would be easier to use choreography and just listen to the deletion requested event in every service that needs it.
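The difference in maintenance cost can be seen in a small sketch. All names here (`SERVICES_WITH_USER_DATA`, `DeleteUserData`, the subscriber list and the handlers) are hypothetical and the messaging is simulated in memory.

```python
# Orchestrated: the orchestrator owns an explicit list of targets.
# Adding a service with user data means editing this list as well.
SERVICES_WITH_USER_DATA = ["customer-service", "card-service"]


def orchestrate_deletion(user_id, send_command):
    for service in SERVICES_WITH_USER_DATA:
        send_command(service, "DeleteUserData", {"user_id": user_id})


# Choreographed: each service registers its own handler for the
# deletion requested event, so nothing central has to change.
deletion_subscribers = []


def publish_deletion_requested(user_id):
    for handler in deletion_subscribers:
        handler(user_id)


sent_commands = []
orchestrate_deletion("u1", lambda target, name, payload: sent_commands.append((target, name)))

deleted_by = []
deletion_subscribers.append(lambda user_id: deleted_by.append(("customer-service", user_id)))
deletion_subscribers.append(lambda user_id: deleted_by.append(("card-service", user_id)))
# A new service only has to subscribe itself:
deletion_subscribers.append(lambda user_id: deleted_by.append(("statement-service", user_id)))
publish_deletion_requested("u1")

print(sent_commands)
print(deleted_by)
```

In the orchestrated version the hypothetical statement service would receive nothing until someone remembers to add it to the list.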

Conclusion

Use choreography when the state changes in each service can’t fail and when the business process is very simple, with no dependencies between steps. Use orchestration when the business process is even a little complex, involves third parties or can fail, and when you want the process to be understandable and flexible without increasing complexity in other services.

Written by Phil Hardwick