In the last post I outlined the architecture of how we use DynamoDB to store data for PureClarity. Before I mention how we use Redis to reduce our write costs and allow for near real time analytics, let me explain how we store our time based data.
Time based records in DynamoDB
Our time based data is stored in DynamoDB, and is “binned” into days, months, years and an all-time total. We do this for speed when we retrieve data for each visitor when they visit the e-commerce store. We can quickly pull the right records needed to determine their past activity and which customer segments they belong to.
This format means each update to DynamoDB (for example the number of page impressions or orders) is written in multiple rows.
For example, we would hold daily and monthly records for a given store for each period:
May 3rd 2023 (day 123)
May 4th 2023 (day 124)
All of May 2023
We can then easily query DynamoDB to get data for a given range. Want the last 30 days for StoreA? Pull records where the PartitionKey is “StoreA” and the SortKey is between “D2023-100” and “D2023-130”. Last 3 months? Pull records where the SortKey is between “M2023-02” and “M2023-05”.
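A quick sketch of how those sort keys can be built, assuming keys shaped like the examples above (the helper names are mine, not PureClarity’s, and I’ve zero-padded the day-of-year so lexicographic order matches chronological order):

```python
from datetime import date

# Hypothetical helpers matching the key shapes above; zero-padding
# keeps string comparison aligned with date order.
def day_key(d: date) -> str:
    return f"D{d.year}-{d.timetuple().tm_yday:03d}"   # e.g. "D2023-123"

def month_key(d: date) -> str:
    return f"M{d.year}-{d.month:02d}"                 # e.g. "M2023-05"

# Bounds for a BETWEEN condition on the sort key: low first, high second.
def day_range(start: date, end: date) -> tuple[str, str]:
    return day_key(start), day_key(end)

# With boto3 this would feed a Query, along the lines of:
# table.query(KeyConditionExpression=Key("PartitionKey").eq("StoreA")
#             & Key("SortKey").between(*day_range(start, end)))
print(day_range(date(2023, 4, 10), date(2023, 5, 10)))  # ('D2023-100', 'D2023-130')
```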
Over the years we’ve used Redis more and more to help reduce the costs of reading and writing data in this format. Data for each active user session is stored in Redis, for example, and the updates for DynamoDB are generated when the session ends. Our API endpoints can talk to both Redis and DynamoDB, and we use Redis to cache historic data about a user for each session. If a segment condition is “number of views of product X in the last 3 months”, then this data is fetched from DynamoDB once and cached for the user in Redis.
One of the most significant changes we have made in the last few years is to store analytical data in Redis before writing it to DynamoDB.
The “reduce” microservice I mentioned in the last post is very good at reducing these updates - but even so, we wanted to do better. It would pull a large number of records, try to find duplicates, and then write back. Not very efficient.
The solution was to store these “binned” aggregates in Redis, in small time based buckets. Each 15 minute window is a different bucket, and all data for this period is stored within it. Data is constantly written into new buckets as the day progresses.
A separate job then takes the older buckets, which contain the totals for previous periods, and generates all the updates for DynamoDB. It’s obvious, but it made a huge difference to our costs.
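That flush job boils down to merging the closed buckets and fanning each merged count out into the binned rows. A minimal sketch, assuming each bucket is a mapping of “store:metric” to a count (names are illustrative):

```python
from collections import Counter

# Merge all closed 15-minute buckets into one set of totals.
def merge_buckets(buckets):
    totals = Counter()
    for bucket in buckets:
        totals.update(bucket)
    return totals

# Fan each total out into the "binned" updates: one increment for the
# day row, one for the month row, one for the all-time row.
def dynamo_updates(totals, day_sk, month_sk):
    updates = []
    for field, count in totals.items():
        for sort_key in (day_sk, month_sk, "TOTAL"):
            updates.append((sort_key, field, count))
    return updates

merged = merge_buckets([{"StoreA:views": 3}, {"StoreA:views": 2, "StoreA:orders": 1}])
print(dynamo_updates(merged, "D2023-123", "M2023-05"))
```

The win is that however many events landed in a bucket, each metric produces a fixed, small number of DynamoDB writes.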
The other benefit is we can show real time stats in our dashboard! The data for a particular period is loaded from DynamoDB, and if it includes the current active period (i.e. “today”) the relevant data from Redis is merged in. The records we write to DynamoDB are now far fewer, and are written almost instantly. No more need for the “reduce” microservice, and our write costs are as low as they can be. We still add the records to our queues, and our write service writes as before. This means we are protected if part of the system goes down, as the data will still just back up in the queues and will eventually be written.
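The dashboard merge itself is simple: take the historical bins from DynamoDB and add on the counts from today’s still-open Redis buckets (the shapes here are illustrative, not the real schema):

```python
# Combine historical counts from DynamoDB with the live counts held in
# Redis for the current, not-yet-flushed period.
def merge_live(historic: dict, live: dict) -> dict:
    merged = dict(historic)
    for field, count in live.items():
        merged[field] = merged.get(field, 0) + count
    return merged

historic = {"views": 1200, "orders": 40}   # loaded from DynamoDB
live = {"views": 37, "orders": 2}          # today's open Redis buckets
print(merge_live(historic, live))          # {'views': 1237, 'orders': 42}
```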
Storing fewer rows = cheaper write costs
The next logical step is to take advantage of how DynamoDB charges for writes: each write consumes at least one write unit per row, with one unit covering an item of up to 1KB. By storing all the time based data as a row per period, we are paying for several rows when one would do. The solution is to store the data as attributes on the same row. For some data we now write not into 3 rows (day, month and total), but into 2:
Monthly and daily data
Total (All time)
Retrieving the data is much the same - you work out which months the range you want falls into, pull those rows back, and read the attributes you need. This has made a big impact on our write costs and storage sizes, as it’s far more economical to store.
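To make the consolidated layout concrete, here’s a sketch of one monthly item holding its daily values as attributes, plus the helper that works out which monthly rows a date range touches (attribute and key names are my own, for illustration):

```python
from datetime import date

# One item per month; each day's value becomes an attribute on that item.
def monthly_item(store: str, year: int, month: int, daily: dict):
    item = {"PartitionKey": store, "SortKey": f"M{year}-{month:02d}"}
    item.update({f"d{day:02d}": value for day, value in daily.items()})
    return item

# Which monthly rows does a date range span?
def months_in_range(start: date, end: date):
    keys, y, m = [], start.year, start.month
    while (y, m) <= (end.year, end.month):
        keys.append(f"M{y}-{m:02d}")
        m += 1
        if m > 12:
            y, m = y + 1, 1
    return keys

print(months_in_range(date(2023, 2, 15), date(2023, 5, 1)))
# -> ['M2023-02', 'M2023-03', 'M2023-04', 'M2023-05']
```

A month of daily counters written this way is one item (and typically one write unit per update) instead of up to 31 separate rows.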
Using a NoSQL datastore such as DynamoDB can have huge advantages for you in terms of cost, reliability and speed. Be mindful of how you are using it - and design accordingly. Reduce the number of writes where you can with caching and design choices, and you will see significant cost savings.