Shiny, new feature: Nimbus now gives command/event/request handlers the option to run for extended periods of time.

From day zero Nimbus has supported competing command handlers, allowing us to spin up an arbitrary number of handlers to increase throughput. One issue that we’ve run into relates to the way we handle messages reliably, and how and when retries are attempted.

You’d think (naively) that a normal workflow would look something like this:

  1. Pop a command from the queue.
  2. Handle that command.

But what happens when the command handler goes bang? We need some way of putting that command back onto the queue for someone else to attempt. Again, a naive approach would be something like this:

  1. Pop a command from the queue.
  2. Handle that command.
  3. If that command goes bang, put it back onto the queue.
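
In code, that naive approach looks roughly like the sketch below. (IWorkQueue and ICommandHandler are made-up stand-ins, purely for illustration - they’re not Nimbus types.)

public static void HandleNextCommand(IWorkQueue queue, ICommandHandler handler)
{
    var command = queue.Pop();      // 1. the command now exists only in this process's memory
    try
    {
        handler.Handle(command);    // 2. attempt to handle it
    }
    catch (Exception)
    {
        queue.Push(command);        // 3. only possible if this process is still alive to do it
    }
}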

So… where does the command live during Step #2? The only place for it to live is on the node that’s actually doing the work - and this is a problem. If the handler on that node simply throws an exception then we can catch it and put the message back onto the queue. But what if the power goes out? Or a disk goes crunch? (Or crackle, given that we’re in SSD-land now?) What if that node never comes back?

If that node never comes back, the message never gets re-enqueued, which means we’ve violated our delivery guarantee. Oops.

Thankfully, that’s not how it works. Under the covers, the Azure Service Bus does some clever stuff for us. The actual workflow looks something like this:

  1. Tentatively pop a message from the head of the queue.
  2. Attempt to handle that message.
  3. If we succeed, call BrokeredMessage.Complete().
  4. If we fail, call BrokeredMessage.Abandon().

The missing piece in this puzzle is still what happens if the power goes out. In this case, the Azure Service Bus will automatically re-queue the message after a certain time period (called the peek-lock timeout) and won’t allow the original (now-timed-out) handler to call either .Complete() or .Abandon() on the message any more. In essence, it’s saying “You get XX seconds to handle the message and if I don’t hear back from you one way or the other before that time elapses then I’ll assume you’ve vanished and will give someone else a chance to handle it.”
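
In code, that workflow looks roughly like the sketch below, using the classic Azure Service Bus SDK (Microsoft.ServiceBus.Messaging). This isn’t Nimbus’s actual internals - it just shows the shape of the pattern, and it assumes the QueueClient was created in ReceiveMode.PeekLock (the default):

public static void ReceiveAndHandleOne(QueueClient queueClient, Action<BrokeredMessage> dispatchToHandler)
{
    var message = queueClient.Receive();    // 1. tentatively pop; the message stays on the queue, locked
    if (message == null) return;

    try
    {
        dispatchToHandler(message);         // 2. attempt to handle the message
        message.Complete();                 // 3. success: the message is removed from the queue for good
    }
    catch (Exception)
    {
        message.Abandon();                  // 4. failure: release our lock so someone else can have a go
    }
}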

So what’s the problem, then?

The problem arises when we have a command handler that legitimately takes longer than the peek-lock timeout to do its thing. We’ve seen this in the wild with people doing things like Selenium-based screen-scraping of legacy web sites, really long-running aggregate queries or ETL operations on databases, and a bunch of other long-running jobs.

Let’s have a look at our PizzaMaker as an example. Here’s our IncomingOrderHandler class:

public class IncomingOrderHandler : IHandleCommand<OrderPizzaCommand>
{
    private readonly IPizzaMaker _pizzaMaker;

    public IncomingOrderHandler(IPizzaMaker pizzaMaker)
    {
        _pizzaMaker = pizzaMaker;
    }

    public async Task Handle(OrderPizzaCommand busCommand)
    {
        await _pizzaMaker.MakePizzaForCustomer(busCommand.CustomerName);
    }
}

and our PizzaMaker looks something like this:

public class PizzaMaker : IPizzaMaker
{
    private readonly IBus _bus;

    public PizzaMaker(IBus bus)
    {
        _bus = bus;
    }

    public async Task MakePizzaForCustomer(string customerName)
    {
        await _bus.Publish(new NewOrderReceived {CustomerName = customerName});
        Console.WriteLine("Hi {0}! I'm making your pizza now!", customerName);

        await Task.Delay(TimeSpan.FromSeconds(45));

        await _bus.Publish(new PizzaIsReady {CustomerName = customerName});
        Console.WriteLine("Hey, {0}! Your pizza's ready!", customerName);
    }
}

Let’s say that the peek-lock timeout is set to 30 seconds and making a pizza takes 45 seconds. In this case, the first handler will be spun up and given the command instance to handle. It will start to do its thing and all will be well and good. Thirty seconds later, though, the bus will decide that that handler has died, so it will revoke its lock, put the message back at the head of the queue and promptly give it to someone else.

After another 15 seconds, the first handler will finish (presumably successfully) and will attempt to call .Complete() on its message. That call will throw an exception, because the handler no longer holds a lock on the message. What’s worse is that this whole sequence will repeat until the maximum number of delivery attempts has been exceeded.

We’ve just made five pizzas for the one order. And none of them has been recorded as successful. Oops.

What do I have to do to make it all Just Work™?

All you need to do is implement the ILongRunningHandler interface on your handler class. Let’s update our IncomingOrderHandler example from earlier:

public class IncomingOrderHandler : IHandleCommand<OrderPizzaCommand>, ILongRunningHandler    // Note the additional interface
{
    private readonly IPizzaMaker _pizzaMaker;

    public IncomingOrderHandler(IPizzaMaker pizzaMaker)
    {
        _pizzaMaker = pizzaMaker;
    }

    public async Task Handle(OrderPizzaCommand busCommand)
    {
        await _pizzaMaker.MakePizzaForCustomer(busCommand.CustomerName);
    }

    // Note the new method
    public bool IsAlive
    {
        get { return true; }
    }
}

The ILongRunningHandler interface has a single, read-only property on it: IsAlive. All you need to do is return true if your handler is still happily executing or false if it’s not. In this case, we’ve taken the very naive approach of just returning true, but it might make more sense, for instance, to ask our PizzaMaker instance whether it still has an order for the customer in the works.
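
For example, a slightly less naive version might look something like this. (The HasOrderInProgress method is hypothetical - it isn’t part of the IPizzaMaker sample above - but it shows the idea of delegating the liveness check to the thing doing the actual work.)

public class IncomingOrderHandler : IHandleCommand<OrderPizzaCommand>, ILongRunningHandler
{
    private readonly IPizzaMaker _pizzaMaker;
    private volatile string _customerName;     // remember whose order we're working on

    public IncomingOrderHandler(IPizzaMaker pizzaMaker)
    {
        _pizzaMaker = pizzaMaker;
    }

    public async Task Handle(OrderPizzaCommand busCommand)
    {
        _customerName = busCommand.CustomerName;
        await _pizzaMaker.MakePizzaForCustomer(busCommand.CustomerName);
    }

    public bool IsAlive
    {
        // HasOrderInProgress is hypothetical: ask the PizzaMaker whether it's still
        // working on this customer's order.
        get { return _customerName != null && _pizzaMaker.HasOrderInProgress(_customerName); }
    }
}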

Under the covers, Nimbus will automatically renew the lock it’s taken out on the message for you so that you can take as long as you like to handle it.
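
Conceptually, that renewal loop looks a bit like the sketch below. This isn’t Nimbus’s actual code - it just shows the idea, using the classic SDK’s BrokeredMessage.RenewLock() call - but it’s roughly what happens on your behalf while your handler keeps reporting that it’s alive:

public static async Task KeepLockAliveWhileHandling(
    BrokeredMessage message,
    ILongRunningHandler handler,
    Task handlingTask,
    TimeSpan renewalInterval)
{
    while (!handlingTask.IsCompleted)
    {
        await Task.Delay(renewalInterval);
        if (handlingTask.IsCompleted) break;

        // If the handler says it's no longer alive, stop renewing and let the
        // peek-lock expire so the bus can give the message to someone else.
        if (!handler.IsAlive) break;

        message.RenewLock();    // extend our peek-lock on the message
    }
}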