Indexing the notification table in DynamoDB

I am about to implement a notification system and I am trying to find a good way to store notifications in a database. I have a web application that uses a PostgreSQL database, but a relational database doesn't seem ideal for this use case; I want to support different types of notifications, each containing different data, although a subset of the data is common to all types of notifications. So I thought that a NoSQL database is probably better than trying to normalize the schema in a relational database as that would be quite difficult.

My application is hosted on Amazon Web Services (AWS) and I have been looking at DynamoDB a bit for storing the notifications, mainly because it is fully managed, so I don't need to deal with operating it myself. Ideally I would like to use MongoDB, but I would rather not deal with database operations. I tried to think of a way to do what I want in DynamoDB, but I was struggling, so I have a few questions.

Suppose I want to store the following data for each notification:

  • Identifier
  • User ID of the recipient of the notification
  • Notification type
  • Timestamp
  • Whether it was read / viewed
  • Notification / event metadata (does not need to be queried)

Now I would like to be able to query for the most recent X notifications for a given user. Also, in a separate request, I would like to get the number of unread notifications for a specific user. I am trying to figure out how to index the table in order to do this efficiently.

I can rule out having just a hash primary key, since I would not be doing simple hash key lookups. I don't know if a "hash and range primary key" would help me here, as I don't know which attribute should be used as the range key. Could I have a unique notification ID as the hash key and a user ID as the range key? Would that allow me to query using the range key only, i.e. without providing a hash key? Then maybe a secondary index could help me sort by timestamp, if that is possible.

I also looked at global secondary indexes, but the problem is that when querying an index, DynamoDB can only return attributes that are projected into the index - and since I want all attributes to be returned, I would effectively have to duplicate all of my data, which seems pretty wasteful.

How can I index my notification table to support my use case? Is this possible or do you have other recommendations?





2 answers


Note on motivation: When using a cloud storage service like DynamoDB, you need to understand the storage model, because it directly affects your efficiency, scalability, and financial costs. It is different from working with a local database, because you pay not only for the data you store, but also for the operations you perform against that data. For example, deleting a record is a WRITE operation, so if you don't have an effective cleanup plan (and a time series case like yours specifically needs one), you will pay the price. Your data model won't show problems when dealing with small amounts of data, but it can definitely ruin your plans when you need to scale. That said, decisions like creating (or not creating) an index, choosing the right attributes for your keys, segmenting tables, etc. make all the difference down the road.

Choosing DynamoDB (or, more generally, a key-value store), like any other architectural decision, comes with a trade-off: you need to clearly understand some concepts of the storage model in order to use the tool effectively. Choosing the right keys is really important, but it is only the tip of the iceberg. For example, if you ignore the fact that you are dealing with time series data, then no matter which primary keys or indexes you define, your provisioned throughput will not be used optimally, because it is spread across your entire table (and its partitions) and NOT ONLY ACROSS THE DATA THAT IS FREQUENTLY ACCESSED. That means unused data directly affects your throughput just because it is part of the same table. This leads to cases where a ProvisionedThroughputExceededException

is thrown "unexpectedly", when you know for sure that your provisioned throughput should be sufficient for your demand, but a TABLE PARTITION that is unevenly accessed has reached its limits (more details here).

The answer below has more details, but I would like to give you some motivation to read it: while you can certainly find an easier solution right now, it could mean starting from scratch in the near future when you hit a wall (and that wall can be high financial costs, performance and scalability limitations, or a combination of them all).

Q: Can I have a unique notification ID as a hash key and a user ID as a range key? Would that allow me to query using the range key only, i.e. without providing a hash key?

A: DynamoDB is a key-value store, meaning that the most efficient queries use the entire key (hash, or hash + range). Having to use a Scan operation to effectively run a query, just because you don't have your key available, is definitely a sign of a flaw in your data model with respect to your requirements. There are several things to consider and many options to avoid this problem (see below for details).

Now, before moving on, I suggest you read this quick post to clearly understand the difference between Hash Key and Hash + Range Key:

DynamoDB: When to Use Which PK Type?

Your case is a typical time series scenario, where your records become obsolete as time passes. There are two main factors to watch out for:

  • Make sure your tables are segmented by access pattern.

If you put all of your notifications in a single table, and the most recent ones are accessed more frequently, your provisioned throughput will not be used efficiently. You should group the most-accessed items into their own table, so that the provisioned throughput can be tuned to the required access. Also, make sure you define the hash key correctly, so that it distributes your data evenly across partitions.

  • Make sure stale data is removed in the most efficient way (in effort, performance, and cost).

The documentation suggests segmenting data into different tables, so that you can delete or back up an entire table once its records have become stale (see more below).

Here is a section from the documentation that describes best practices related to Time Series data:

Understanding Access Patterns for Time Series Data

For each table that you create, you specify the throughput requirements. DynamoDB allocates and reserves resources to handle your throughput requirements with sustained low latency. When you design your application and tables, you should consider your application's access pattern to make the most efficient use of your table's resources.

Suppose you design a table to track customer behavior on your site, such as the URLs that they click. You might design the table with a hash and range type primary key, with Customer ID as the hash attribute and date/time as the range attribute. In this application, customer data grows over time; however, the application might show an uneven access pattern across all the items in the table, where the latest customer data is more relevant: your application might access the latest items more frequently, and as time passes these items are accessed less often, until eventually the oldest items are rarely accessed. If this is a known access pattern, you can take it into consideration when designing your table schema. Instead of storing all items in a single table, you could use multiple tables to store those items. For example, you could create tables to store monthly or weekly data. For the table storing data from the latest month or week, where the data access rate is high, request higher throughput; for tables storing older data, you can dial down the throughput and save resources.

You can conserve resources by keeping "hot" items in one table with higher throughput settings, and "cold" items in another table with lower throughput settings. You can remove old items by simply deleting the tables. You can optionally back up these tables to other storage, such as Amazon Simple Storage Service (Amazon S3). Deleting an entire table is significantly more efficient than removing items one by one, which essentially doubles the write throughput, as you perform as many delete operations as put operations.

Source:

http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns

For example, you could segment your tables by month:

Notifications_April, Notifications_May, etc

      
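If you go this route, the table naming and table creation can be automated. Here is a rough sketch in Python with boto3; the year/month naming scheme and the throughput values are assumptions for illustration, not part of the answer:

import boto3
from datetime import datetime, timezone

client = boto3.client("dynamodb")

def monthly_table_name(prefix="Notifications"):
    # e.g. "Notifications_2015_04"; the example above uses month names instead
    return f"{prefix}_{datetime.now(timezone.utc):%Y_%m}"

def create_monthly_table():
    # Hash + range schema as suggested below: UserId / Timestamp
    return client.create_table(
        TableName=monthly_table_name(),
        AttributeDefinitions=[
            {"AttributeName": "UserId", "AttributeType": "S"},
            {"AttributeName": "Timestamp", "AttributeType": "N"},
        ],
        KeySchema=[
            {"AttributeName": "UserId", "KeyType": "HASH"},
            {"AttributeName": "Timestamp", "KeyType": "RANGE"},
        ],
        ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 10},
    )

Old months can then be archived to S3 and dropped with a single DeleteTable call (see the links at the end of this answer).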

Q: I would like to be able to query for the most recent X notifications for a given user.



A: I would suggest using a Query operation, querying with only the Hash Key (UserId) and using a Range Key to sort the notifications by Timestamp (date and time).

Hash Key: UserId
Range Key: Timestamp

      
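As an illustration, here is a minimal sketch of that query in Python with boto3; the table name and the page size are assumptions based on the schema above:

import boto3
from boto3.dynamodb.conditions import Key

# Assumed monthly table name, following the segmentation idea above
table = boto3.resource("dynamodb").Table("Notifications_April")

def latest_notifications(user_id, limit=20):
    # Query by hash key only; the Timestamp range key provides the sort order.
    response = table.query(
        KeyConditionExpression=Key("UserId").eq(user_id),
        ScanIndexForward=False,  # newest first
        Limit=limit,             # "the most recent X notifications"
    )
    return response["Items"]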

Note: An even better solution would be for the Hash Key to contain not only the UserId, but also some additional concatenated information that you can compute before querying, to make sure your Hash Key evens out the access pattern over your data. For example, you might start to get hot partitions if notifications for certain users are accessed far more often than others... having more information in the Hash Key reduces that risk.
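One common way to add that extra information is write sharding: suffix the hash key with a small shard number on write, then query every shard and merge on read. This is a hypothetical sketch; the shard count and key format are assumptions, not part of the answer above:

import random
from boto3.dynamodb.conditions import Key

NUM_SHARDS = 4  # assumption; tune to your write volume

def sharded_user_key(user_id):
    # Called on write: spreads one user's items over NUM_SHARDS partitions.
    return f"{user_id}#{random.randrange(NUM_SHARDS)}"

def latest_notifications_sharded(table, user_id, limit=20):
    # Called on read: query each shard, then merge and trim the results.
    items = []
    for shard in range(NUM_SHARDS):
        response = table.query(
            KeyConditionExpression=Key("UserId").eq(f"{user_id}#{shard}"),
            ScanIndexForward=False,
            Limit=limit,
        )
        items.extend(response["Items"])
    items.sort(key=lambda item: item["Timestamp"], reverse=True)
    return items[:limit]

The trade-off is N queries per read instead of one, so this is only worth it if individual users are hot enough to skew a partition.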

Q: I would like to get the number of unread notifications for a specific user.

A: Create a Global Secondary Index as a sparse index, with UserId as the Hash Key and Unread as the Range Key.

Example:

Index Name: Notifications_April_Unread
Hash Key: UserId
Range Key: Unread

      

When you query this index using the Hash Key (UserId), you will automatically get all unread notifications, without unnecessary scanning through notifications that are not relevant to this case. Keep in mind that the table's primary key attributes are automatically projected into the index, so if you need more information about a notification, you can always use those attributes to perform a GetItem or BatchGetItem against the original table.
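Since this question asks for the number of unread notifications rather than the items themselves, you could query the sparse index with Select set to COUNT. A minimal sketch in Python/boto3, assuming the index name from the example above:

from boto3.dynamodb.conditions import Key

def unread_count(table, user_id):
    # The sparse index only contains items that still have the Unread
    # attribute, so the count of matching items is the unread count.
    response = table.query(
        IndexName="Notifications_April_Unread",  # assumed name, see above
        KeyConditionExpression=Key("UserId").eq(user_id),
        Select="COUNT",  # return only the count, not the items
    )
    return response["Count"]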

Note: You can explore the idea of using a different attribute instead of the unread flag; the important thing to remember is that a sparse index can help you with this use case (more details below).

Detailed explanation:

I would use a sparse index to make sure you can query a reduced dataset to do the count. In your case, you can set an "Unread" attribute when a notification has not been read yet, and use that attribute to create the sparse index. When a user reads a notification, you simply remove the attribute from the item, so that it no longer appears in the index.
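For example, marking a notification as read could be a single UpdateItem call that removes the attribute; a sketch assuming the UserId/Timestamp table schema suggested earlier:

def mark_as_read(table, user_id, timestamp):
    # REMOVE deletes the Unread attribute, which drops the item from
    # the sparse index so it no longer counts as unread.
    table.update_item(
        Key={"UserId": user_id, "Timestamp": timestamp},
        UpdateExpression="REMOVE Unread",
    )

Here are some guidelines from the documentation that clearly apply to your scenario: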

Take Advantage of Sparse Indexes

For any item in a table, DynamoDB will only write a corresponding index entry if the index range key attribute value is present in the item. If the range key attribute does not appear in every table item, the index is said to be sparse. [...]

To track open orders, you could create an index on CustomerId (hash) and IsOpen (range). Only those orders in the table with IsOpen defined will appear in the index. Your application can then quickly and efficiently find the orders that are still open by querying the index. If you had thousands of orders, for example, but only a small number that are open, the application can query the index and return the OrderId of each open order. Your application will perform significantly fewer reads than it would take to scan the entire CustomerOrders table. [...]

Instead of writing an arbitrary value into the IsOpen attribute, you could use a different attribute that results in a useful sort order in the index. To do this, you could create an OrderOpenDate attribute, set it to the date on which each order was placed (and still delete the attribute once the order is fulfilled), and create the OpenOrders index with the schema CustomerId (hash) and OrderOpenDate (range). This way, when you query your index, the items will be returned in a more useful sort order. [...]

Such a query can be very efficient, because the number of items in the index will be significantly fewer than the number of items in the table. In addition, the fewer table attributes you project into the index, the less read capacity you will consume from the index.

Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForGSI.html#GuidelinesForGSI.SparseIndexes

Below are links to the operations you will need to create and drop tables programmatically:

Create table http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_CreateTable.html

Delete table http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_DeleteTable.html





I am an active DynamoDB user and here is what I would do... First, I am assuming that you need to access notifications individually (e.g. to mark them as read/viewed), in addition to getting the latest notifications by user_id.

Table design:

NotificationsTable
id - Hash key
user_id
timestamp
...

UserNotificationsIndex (Global Secondary Index)
user_id - Hash key
timestamp - Range key
id

      

When you query the UserNotificationsIndex, you set user_id to the user whose notifications you want and ScanIndexForward to false, and DynamoDB will return the notification ids for that user in reverse chronological order. You can optionally Limit the number of results returned, or get a maximum of 1 MB.



As for projecting attributes, you will either have to project the attributes you want into the index, or project only the id and then write "hydrate" functionality in your code that takes each id and fetches the fields you want.
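A minimal sketch of that two-step read in Python/boto3; the table, index, and attribute names follow the design above, everything else is an assumption:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("NotificationsTable")

def latest_notification_ids(user_id, limit=20):
    # Step 1: the GSI gives us ids in reverse chronological order.
    response = table.query(
        IndexName="UserNotificationsIndex",
        KeyConditionExpression=Key("user_id").eq(user_id),
        ScanIndexForward=False,
        Limit=limit,
    )
    return [item["id"] for item in response["Items"]]

def hydrate(ids):
    # Step 2: fetch the full items for those ids from the base table.
    response = dynamodb.batch_get_item(
        RequestItems={"NotificationsTable": {"Keys": [{"id": i} for i in ids]}}
    )
    return response["Responses"]["NotificationsTable"]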

If you really don't like that, here is an alternative solution for you... Set the id to be your timestamp. For example, I would use the number of milliseconds since a custom epoch (e.g. January 1, 2015). Here is the alternative table design:

NotificationsTable
user_id - Hash key
id/timestamp - Range key

      

You can now query NotificationsTable directly, setting user_id appropriately and setting ScanIndexForward to false to sort on the Range Key. Of course, this assumes that you won't have a collision where a user receives 2 notifications in the same millisecond. This is unlikely, but I don't know the scale of your system.
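Generating such an id is nearly a one-liner; here is a sketch, assuming the January 1, 2015 epoch mentioned above:

from datetime import datetime, timezone

CUSTOM_EPOCH = datetime(2015, 1, 1, tzinfo=timezone.utc)  # epoch from the answer

def notification_id(now=None):
    # Milliseconds since the custom epoch; doubles as the id and the sort key.
    now = now or datetime.now(timezone.utc)
    return int((now - CUSTOM_EPOCH).total_seconds() * 1000)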









