Surrogate keys at Datawarehouse

I want to understand how surrogate keys are used in real time DWH environments. I get that they add the benefit of not relying on the original data to store each dimensional key, and also avoiding having a composite key built from natural keys from dimensions in fact: for example (prod id + cust id + id)

But doesn't that add complexity to the need to support the lookup (natural key, surrogate key) while we load data into facts. I have been working in the BI / DW teams for the past 3 years and we do not support any surrogate keys on our systems. We use natural keys to create our databases. One example of a usecase is revenue information that is stored in a transactional system that is loaded into the warehouse by customer, product, time period granularity, using the same natural keys from the source. We use the same to connect with the appropriate parameters to build the STAR circuit.

The main reason I think it makes sense in our case is that the business is using EDW data to micro-analyze data at the account level, not just trend analysis. In this case, we need to maintain the integrity of the data that we achieve with natural keys. I want to understand how other DW environments work. How do you use surrogate keys or natural keys in your systems.

Thank!

+3


source to share


4 answers


One of the reasons is to maintain and compare historical changes.

For example, if one of your product attributes changed, and you wanted to view and compare revenue before and after the attribute change, how would you do it without using surrogate product keys? Using the natural key will just overwrite the old value when you ETL.

The search doesn't have to be very difficult to maintain. Most ETL tools support this feature and usually have a built-in caching mechanism for caching lookup values.



Also, what do you mean when you say "real-time data warehouse"? Are you using ROLAP, DirectQuery or something similar? If so, you can create your storefronts directly on your OLTP system and de-normalize in some semantic model. Then you can use your natural keys because there is no traditional ETL / data store to find and store your surrogate keys.

Finally, granularity is not related to what type of key you are using.

+3


source


If your business is stable and runs on top of one application for everything, natural keys will work very well, as your experience suggests.

Most businesses have not been in this state or not for very long. Mergers happen, new applications are introduced, obsolete things refuse to die. New lines of business start or shut down and require wholesale renaming of existing natural key schemes.

Surrogate keys provide great advantages in providing stable and convenient reporting in a business when you have a bunch of separate new and legacy applications, each with its own versions of its customers and products and are regularly migrated or replaced for similar systems with new ones. definitions of natural keys. The main work has to do with the various natural keys of the customer / product / regardless of the fact that assigning a surrogate key is just a simple and very useful step in this.

Even in your scenario, I would use surrogate keys as they prepare you for future changes and are very useful with historical data (as NITHIN B also answered) in Type 2 Dimensions.

It is entirely possible to do versioning with natural keys by adding a version field to your size and fact tables, but this makes it difficult to write messages for reporting, and your entire system is still confused if changes in the business or application cause the natural keys to change.



To illustrate:

Select bla from Fact F inner join Dim_Customer DC on F.Surrogate_key = DC.Surrogate_Key

almost flawless. If you screw it up, it will be immediately apparent in your report.

Select bla from Fact F inner join Dim_Customer DC on F.Natural_key = DC.Natural_Key and F.Version = DC.Version

does the same job, but if you forget that last line everything will look fine, but your numbers will be overstated depending on how many versions are on average. Soreness when a 25% increase in sales turns out to be a mistake.

+2


source


Another reason that hasn't been mentioned yet is performance. Sometimes (very often in my experience) strings are natural keys, sometimes long strings.

It doesn't seem to be worth using a 10, 20, or 30 byte string instead of a 4 byte integer, but when you have 10 dimensions and hundreds of millions of strings, it adds up quickly.

+2


source


Could you please post a sample design.

I would be interested to see how you can load the fact table using dimension keys, which are natural keys. Kimball's design never recommends.

My stand on surrogate keys at DWH.

  • Surrogate keys give you a lot of flexibility when using sizes 2, i.e. if you have size type 2. For example: you can track the client's changes if he or she changes their middle name. You can have strings with old values ​​and new values.
  • Fact tables usually contain keys that are surrogate keys. This makes your star circuit neat, tidy and reliable.

However, I am not skipping queues, I will wait for your design before going against your booth.

Cheers to Nithin

+1


source







All Articles