Life without distributed transactions

Occasionally I get questions about the issue of transactional messaging – why it is so important, why NServiceBus defaults to this behavior, and what bad things could happen if we didn’t use it. I’m talking specifically about the ability to enlist a queue in a distributed transaction here.

I think the reason for this interest is the rise in popularity of cloud platforms and of queuing systems like RabbitMQ (which don’t support distributed transactions), as well as the difficulty of setting up distributed transactions even in on-premise environments.

Of course, there’s also the regular scalability hand-wringing going on even though most people wouldn’t bump up against those limits anyway.

In this post, I’ll talk about the nature of the problem and explain the pitfalls in some of the common solutions, but I’ll put off describing how to provide consistency without distributed transactions to a future post, as this one is already going to be quite long.

I’ll start with the basic fault-tolerance issues and then explain how things spiral out from there.

Starting with the basics

OK, so we have a queuing system in place that dispatches messages to our business logic which does some transactional work against a database.

Let’s say that we completed the transaction against our database, but before we could acknowledge to the queue that the message was processed successfully, our machine crashed. When our machine comes up again, the queue will once again dispatch us the same message. Unless we have some logic to detect that we’ve already processed it (called “idempotence” in the REST community), we will end up processing it again.
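To make that failure window concrete, here’s a minimal sketch of the typical receive/process/acknowledge loop; the `queue` and `db` objects and `apply_business_logic` are hypothetical stand-ins, not any particular client library:

```python
# A minimal sketch of the basic loop; `queue`, `db`, and
# `apply_business_logic` are hypothetical stand-ins.

def handle_next_message(queue, db):
    message = queue.receive()      # the queue dispatches us a message

    with db.transaction() as tx:   # commits on exiting the block
        apply_business_logic(tx, message)

    # ---- crash here: the database work is committed, but the queue ----
    # ---- never hears back from us, so it will redeliver the same   ----
    # ---- message when we come back up.                             ----

    queue.acknowledge(message)     # only now does the queue consider it done
```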

In short, the problem is duplicates.

Attempted solutions to the duplicate problem

Most queuing systems don’t do anything about duplicates, actually giving the behavior a proper architectural name: At-least-once message delivery, as opposed to the Once-and-only-once model that a queue supporting distributed transactions provides.

The solution often suggested is to have your logic check to see if it has already processed a message with that ID before – in essence storing the ID of each message processed for some period of time. Of course, there is some performance overhead with that, but it might be a small price to pay compared to dealing with it in the logic of every use case.
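As a rough illustration of what that looks like (this is not NServiceBus code; SQLite stands in for the real database, and the `processed_messages` table and `apply_business_logic` helper are made up), the duplicate check and the ID recording can ride on the same database transaction as the business work itself:

```python
import sqlite3

def handle_message(conn: sqlite3.Connection, message_id: str, payload: dict):
    """Process a message at most once by recording its ID transactionally."""
    with conn:  # one transaction covers the check, the work, and the record
        duplicate = conn.execute(
            "SELECT 1 FROM processed_messages WHERE message_id = ?",
            (message_id,),
        ).fetchone()
        if duplicate:
            return  # we've seen this ID before: skip the business logic

        apply_business_logic(conn, payload)  # the real work, same transaction

        # Remember the ID so a redelivery of this message gets filtered out.
        # (A real system would also purge IDs older than some retention window.)
        conn.execute(
            "INSERT INTO processed_messages (message_id) VALUES (?)",
            (message_id,),
        )
```

Because the ID is recorded in the same transaction as the work, a crash either rolls back both (and the retry redoes everything) or commits both (and the retry is filtered out).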

On the other hand, you’ll often have some messages (like Update commands) which it looks like you can safely process multiple times, in which case you might not want to pay the performance overhead there. The thing is, if your logic publishes an event in addition to the regular database work (something that is quite common) and you process the same message twice, you will probably end up publishing the event twice as well.

These duplicates are different in that here we have two distinct messages with different IDs that contain the same business data. This means that recipients of these messages will not be able to filter them out at an infrastructure level anymore.

NOTE: Deduplication abilities in queues

Although Azure Service Bus doesn’t support distributed transactions (meaning you still have the issue mentioned above), Microsoft added the ability to detect and filter out duplicates based on message contents rather than just the ID. This helps quite a bit, but it’s important to understand that it doesn’t cover everything for you. Let me explain:

More complex logic

In some of your most important use cases, you may have both entity updates as well as entity creation happening together in your domain model. You might be using some kind of event model (like I wrote about here) to percolate out the information that an entity was created in order to keep your service layer decoupled from the internals of the domain model.

In the callback code from these domain events, you will likely publish out an event on the queuing system containing information like the ID of the entities created as well as other business data. And there’s the rub.
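Roughly, that wiring has this shape (a sketch only; every name here is made up for illustration):

```python
# A sketch only; the class, the callback, and the queue API are all made up.

class Order:
    def __init__(self, raise_event):
        self._raise_event = raise_event   # injected domain-event callback

    def saved(self, order_id, customer_id):
        # Called after persistence assigns the database-generated ID.
        self._raise_event("OrderPlaced",
                          {"order_id": order_id, "customer_id": customer_id})

# The service layer supplies the callback, which publishes on the queuing
# system, keeping it decoupled from the internals of the domain model:
def make_event_publisher(queue):
    def on_domain_event(event_name, event_data):
        queue.publish(event_name, event_data)
    return on_domain_event
```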

You see, without distributed transactions, you can run into some problematic scenarios:

For example, if you don’t make sure that your event-publishing calls to the queuing system include the same transaction object as the one you used when retrieving the original message from the queue, then those calls could “escape” before you know whether the database transaction is going to succeed. Deadlocks always happen at the lousiest times. Anyway, if you’re using database-generated IDs for your entities, those IDs will get published out in events despite the database rolling back, and your subscribers will now be making decisions on wrong data – not just eventually consistent data.
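Here’s a hedged sketch of the two orderings; the transaction-enlistment API shown is illustrative, not any specific library’s:

```python
# Illustrative only: `queue`, `db`, `create_entity`, and the transaction
# parameters are made up.

def handle_unsafe(queue, db):
    message = queue.receive()
    with db.transaction() as tx:
        entity_id = create_entity(tx, message)   # database-generated ID
        # WRONG: this publish isn't enlisted in any transaction, so the
        # event "escapes" immediately. If tx then deadlocks and rolls back,
        # subscribers are already acting on an ID that was never committed.
        queue.publish("EntityCreated", {"id": entity_id})

def handle_safer(queue, db):
    receive_tx = queue.begin_transaction()
    message = queue.receive(transaction=receive_tx)
    with db.transaction() as tx:
        entity_id = create_entity(tx, message)
        # Better: the publish rides on the same queue transaction as the
        # receive, so nothing goes out until that transaction commits.
        queue.publish("EntityCreated", {"id": entity_id},
                      transaction=receive_tx)
    receive_tx.commit()  # still a separate commit from the database's, though
```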

In this case, processing the message again doesn’t really solve the problem – it just means that you’ll be publishing events with different IDs, so an infrastructure like Azure Service Bus couldn’t really de-duplicate them.

On the other hand, if you do use the same transaction and combine it with the infrastructural message-ID-based de-duplication described above (as identifying duplicate calls in complex business logic is damn hard), you’ll run into another problem.

Consider what would happen if your server crashes right after finishing its database work but before it completes the transaction against the queuing system. When the message is retried, the infrastructure filtering would know not to call your business logic again, and the message would be quietly swallowed. Unfortunately, the event-publishing calls to the queuing system from the first time the message was processed were rolled back, and since your business logic isn’t called again, the event publishing won’t happen again either.
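Spelled out with the same made-up API as the sketch above, the interleaving looks like this:

```python
# Same made-up API as the earlier sketch, now with the ID-based filter added.

def handle_with_dedup(queue, db):
    receive_tx = queue.begin_transaction()
    message = queue.receive(transaction=receive_tx)

    with db.transaction() as tx:
        if already_processed(tx, message.id):    # infrastructure-level filter
            receive_tx.commit()                  # swallow the duplicate...
            return                               # ...publishing nothing at all
        entity_id = create_entity(tx, message)
        queue.publish("EntityCreated", {"id": entity_id},
                      transaction=receive_tx)
        record_processed(tx, message.id)         # same DB transaction as the work

    # ---- crash here: the database commit (including the processed-ID  ----
    # ---- row) sticks, but receive_tx, and the publish riding on it,   ----
    # ---- is lost. On redelivery the filter fires, we return early,    ----
    # ---- and the EntityCreated event is never published.              ----

    receive_tx.commit()
```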

Oops.

In closing

I hope I’ve been able to clarify what kind of scenarios distributed transactions solve for you and some of the difficulties in solving them yourself.

Now, to be clear, you could solve these problems by going in-depth on each of your use cases, analyzing the consistency needs, and structuring the code differently to address those needs. But give this another thought: if our consistency depends on calling otherwise independent APIs in exactly the right order, and a change in that order would not cause any visible functional effects, what will happen when developers with less expertise maintain this code?

The folks in the event sourcing community have their own solution to this, which is based on writing their business logic differently. As the adoption of this pattern is still pretty limited (probably still in the Innovators section of the Technology Adoption Curve), it’ll be interesting to see how it holds up with larger teams in the mainstream.

Oh, and in case it wasn’t clear from before, the guys in the REST community haven’t even begun addressing this problem when it comes to server-to-server integration.

We’re working on a solution for this with NServiceBus that won’t require you to change how you write business logic. We’ve got one big release to do before we can roll this in, and that’s coming soon (with all sorts of cool things like support for ActiveMQ and queues in the database). The solution we’ve found is architecturally sound but you’ll have to wait for my next post to hear about it.

Stay tuned.

