Instance A grabs the lock and makes an API call that takes 120 seconds. Instance B sees the lock but considers it expired once the lock times out at 100 seconds. Instance B falsely concludes A died, overwrites A's lock so the system doesn't deadlock waiting for A, and makes its own request. Unfortunately, A's request is still processing, and B's accidentally concurrent request causes corruption.
I disagree here. Instance B did the right thing given the information it had. Instance A should realize it no longer owns the lock and stop proceeding. But in reality this also points to concurrency limitations in the API itself (no ability to perform a "do" and then a "commit" call). https://microservices.io/patterns/data/saga.html
I think we both agree that A did something wrong and that B followed the locking algorithm correctly. "Falsely" refers to a matter of fact: A is not dead.
You're right that A could try to stop, but I think it's more complicated than that. A is calling a third party API, which may not have a way to cancel an in-flight request. If A can't cancel, then A should refresh its claim on the lock. A must have done neither in the example.
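To make "A should refresh its claim" concrete, here's a minimal sketch of the heartbeat pattern: the lock holder extends its lease from a background thread while the long call runs, and learns immediately if a refresh ever fails. The `LeaseLock` class is a hypothetical in-memory stand-in for a distributed store, and all names here are illustrative, not from any particular library.

```python
import threading
import time

class LeaseLock:
    """Toy in-memory lock with an expiry, standing in for a
    distributed store (hypothetical names throughout)."""
    def __init__(self):
        self._mu = threading.Lock()
        self._owner = None
        self._expires_at = 0.0

    def acquire(self, owner, ttl):
        with self._mu:
            now = time.monotonic()
            if self._owner is None or now >= self._expires_at:
                self._owner = owner
                self._expires_at = now + ttl
                return True
            return False

    def refresh(self, owner, ttl):
        """Extend the lease, but only if we still own it and it
        hasn't already expired."""
        with self._mu:
            if self._owner == owner and time.monotonic() < self._expires_at:
                self._expires_at = time.monotonic() + ttl
                return True
            return False

def hold_with_heartbeat(lock, owner, ttl, work):
    """Run work() while a background thread refreshes the lease.
    Returns (result, still_owned); a failed refresh means the
    lock was lost mid-flight, exactly A's situation above."""
    lost = threading.Event()
    stop = threading.Event()

    def heartbeat():
        # Refresh well before expiry (ttl / 3 is a common choice).
        while not stop.wait(ttl / 3):
            if not lock.refresh(owner, ttl):
                lost.set()
                return

    t = threading.Thread(target=heartbeat, daemon=True)
    t.start()
    try:
        result = work()
    finally:
        stop.set()
        t.join()
    return result, not lost.is_set()

lock = LeaseLock()
assert lock.acquire("A", ttl=0.2)
# A 0.5s "API call" outlives the 0.2s lease, but the heartbeat
# keeps extending it, so A still owns the lock at the end.
result, still_owned = hold_with_heartbeat(
    lock, "A", ttl=0.2, work=lambda: time.sleep(0.5) or "done")
```

Note this only helps when A is alive but slow; if A actually dies, refreshes stop and B's takeover is correct.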
I have implemented a distributed lock using DynamoDB, and the timeout for the lock release needs to be comfortably greater than the time taken to process the work under the lock. Otherwise, exactly the failure you describe will happen.
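Padding the timeout helps, but a holder can always overrun it. The standard defence for that case is a fencing token: each lock acquisition hands out a monotonically increasing number, and the downstream resource rejects writes carrying a stale one. This only works if the downstream API accepts a token, which, as noted above, it may not. A minimal sketch with a hypothetical in-memory lock (not DynamoDB's actual API):

```python
import itertools
import threading

class FencedLock:
    """Toy lock that issues a monotonically increasing fencing
    token on each acquisition (hypothetical stand-in for a
    distributed store)."""
    def __init__(self):
        self._tokens = itertools.count(1)
        self._mu = threading.Lock()

    def acquire(self):
        # Expiry/takeover logic elided; each takeover just gets
        # the next, higher token.
        with self._mu:
            return next(self._tokens)

class Resource:
    """Downstream service that rejects stale-token writes."""
    def __init__(self):
        self._highest_seen = 0
        self.data = []

    def write(self, token, value):
        if token < self._highest_seen:
            raise RuntimeError("stale fencing token: lock was lost")
        self._highest_seen = token
        self.data.append(value)

lock = FencedLock()
res = Resource()

a = lock.acquire()        # A holds token 1
b = lock.acquire()        # lease expires; B takes over with token 2
res.write(b, "B's request")
try:
    res.write(a, "A's late request")   # rejected: 1 < 2
    a_rejected = False
except RuntimeError:
    a_rejected = True
```

With this in place, A's late request in the original scenario is rejected instead of corrupting state, regardless of how the timeout was tuned.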