It goes without saying that a 5000 ms latency is… unacceptable in a real-time environment. Honestly, we first blamed our home-grown DynamoDB-Mapper and indeed found, and fixed, a nasty design flaw. Here is the specific commit for those who like juicy details.
Ok, so “case closed”, you might think. Sadly not: it did not change anything. But since this behavior was random and the application was still under very low load (development environment), it took some time to spot it again.
Case re-opened.
Diving deeper into CloudWatch, I saw these latency spikes in the stats. Interestingly enough, the spike was always 5000 + n milliseconds, where ‘n’ is small and pretty close to the normal average latency observed on DynamoDB. After mailing Amazon’s support directly about this specific issue, it appeared that this intuition was right: 5 seconds is the failure timeout on their side.
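As a rough illustration of how this shows up on the client side, here is a minimal timing sketch using boto; the table name and key are hypothetical, and the 5 second threshold simply reflects the pattern above:

```python
import time

import boto

# Hypothetical table and key, just for illustration.
TABLE_NAME = "my_table"
HASH_KEY = "some_id"

conn = boto.connect_dynamodb()      # credentials come from the usual boto config
table = conn.get_table(TABLE_NAME)

start = time.time()
table.get_item(hash_key=HASH_KEY)
elapsed_ms = (time.time() - start) * 1000

# Normal calls stay around the usual single-request latency; the pathological
# ones land around 5000 + n ms, which smells like a 5 second failover timeout.
if elapsed_ms >= 5000:
    print("suspicious latency: %.0f ms (~5000 + %.0f ms)" % (elapsed_ms, elapsed_ms - 5000))
else:
    print("normal latency: %.0f ms" % elapsed_ms)
```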
We already know that data is spread over partitions. But these “partitions” might simply be instances running as part of a cluster. This cluster would then be exposed by an ELB with a failure timeout set to 5 seconds. From my early tests, I noticed that there are 2 exposed partitions on a nearly empty table at a throughput of 1000. It now appears that both partitions contain the whole dataset. Good!
Mystery solved? Dunno. I have no clue how DynamoDB is actually built, and all this is jealously kept as an ‘IP’ secret, which I can understand.
So, most of the time, DynamoDB is indeed a great choice. But sometimes you may experience unusual latencies. In that case, feel free to tell the support team so that they can take a look.
Last piece of advice: always keep profiling information. I was asked for the ‘TransactionID’ and suddenly felt stupid as we had none. Bad luck. If you use the great Boto library, do not forget to configure ‘boto.perflog’; I contributed it for this very purpose.
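Assuming ‘boto.perflog’ is exposed as a standard Python logger (which the dotted name suggests), a minimal configuration sketch could look like this; the log file path and format are arbitrary choices, adapt them to your own setup:

```python
import logging

# Route the 'boto.perflog' logger to a dedicated file so that request timings
# (and ids) are kept around for the day AWS support asks for them.
perflog = logging.getLogger("boto.perflog")
perflog.setLevel(logging.INFO)

handler = logging.FileHandler("/var/log/myapp/boto-perf.log")  # arbitrary path
handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
perflog.addHandler(handler)
```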
Oh, and if you have read down to this point, you may be interested in some common DynamoDB recipes too.