Modeling relational data for subtypes

I am studying relational model and data modeling.
And I have some confusion in my mind regarding subtypes.

I know that data modeling is an iterative process and there are many different ways to model things.
But I don't know how to choose between the different options.

Example

Suppose we want to model particles (molecule, atom, proton, neutron, electron, ...).
Let's ignore Quark and other particles for simplicity.

Since all particles of the same type behave the same way, we are not going to model individual particles. Put it another way, we won't store every hydrogen atom.
Instead, we will store hydrogen, oxygen, and other types of atoms.
What we're going to model is actually the types of particles and the relationship between them.

I use the word " type " casually.
The hydrogen atom is an example. Hydrogen is a type. Hydrogen is also a type of atom.
Yes, there is a type hierarchy. And we ignore the lowest level (individual particles).

Approaches

I can think of several approaches to model them.

1. One table (relation, entity) for each type of thing (types of particles).

1.1 The first approach that comes to my mind.

Proton (Proton)
Neutron (neutron)
Electron (Electron)

Atom (Atom)
Atom_Proton (Atom, Proton, Quantity)
Atom_Neutron (Atom, Neutron, Quantity)
Atom_Electron (Atom, Electron, Quantity)

Molecule (molecule)
Molecule_Atom (Molecule, atom, number)

1.2 Since there is only one kind of proton / neutron / electron, we can simplify it.

Atom (Atom, ProtonQuantity, NeutronQuantity, ElectronQuantity)
Molecule (molecule)
Molecule_Atom (Molecule, atom, number)

In this simplified model, the facts about the Proton are lost.

2. All things in one table, with associative tables representing the relationship between them.

2.1 one associative table for each relationship

Particle

Atom_Proton (Particle, Particle, ProtonQuantity)
Atom_Neutron (Particle, Particle, NeutronQuantity)
Atom_Electron (Particle, Particle, ElectronQuantity)
Molecule_Atom (Particle, Particle, AtomQuantity)

2.2 single associative table

Particle (Particle)
ParticleComposition (Particle, Particle, Quantity)

This simplification loses nothing. I think this is better.
But if there are facts specific to Atom_Proton / Atom_Neutron / Atom_Electron, 2.1 might be better.

2.3 combine 2.1 and 2.2

Particle

Atom_Proton (Particle, Particle, other attributes)
Atom_Neutron (Particle, Particle, other attributes)
Atom_Electron (Particle, Particle, other attributes) Molecule_Atom (Particle, Particle, other attributes)

ParticleComposition (Particle, Particle, Quantity, other attributes)

In this approach, general attributes about particle composition are given in ParticleComposition,
while special attributes about particle composition go to special tables.

3. Use subtype tables.

3.1 Table for base type Particle and additional tables for subtypes (Atom, Molecule, ...).

Particle

Proton (Particle, other attributes)
Neutron (Particle, other attributes)
Electron (Particle, other attributes)
Atom (Particle, other attributes)
Molecule (Particle, other attributes)

Atom_Proton (Particle, Particle, ProtonQuantity)
Atom_Neutron (Particle, Particle, NeutronQuantity)
Atom_Electron (Particle, Particle, ElectronQuantity)
Molecule_Atom (Particle, Particle, AtomQuantity)

3.2 We can also combine Atom_XXXQuantity tables in Atom and drop Pronton / Neutron / Electron.

Particle

Atom (Particle, ProtonQuantity, NeutronQuantity, ElectronQuantity)
Molecule (Particle, other attributes)

Molecule_Atom (Particle, Particle, AtomQuantity)

It's easier, but the information about Proton / Neutron / Electron is lost as in 1.2.

3.3 We can change the name of Molecule_Atom to make it more generic.

Particle

Atom (Particle, ProtonQuantity, NeutronQuantity, ElectronQuantity)
Molecule (Particle, other attributes)

ParticleComposition (Particle, Particle, Quantity)

It looks like 2.2, with additional tables for subtypes (Atom, Molecule).
It seems that 2.2 is a special case of 3.3.

3.4 We can combine all of the above approaches and get a general model.

Particle

Proton (Particle, other attributes)
Neutron (Particle, other attributes)
Electron (Particle, other attributes)
Atom (Particle, other attributes)
Molecule (Particle, other attributes)

ParticleComposition (Particle, Particle, Quantity, other attributes)

Atom_Proton (Particle, Particle, other attributes)
Atom_Neutron (Particle, Particle, other attributes)
Atom_Electron (Particle, Particle, other attributes)
Molecule_Atom (Particle, Particle, other attributes)

It seems that Atom_Proton, Atom_Neutron, Atom_Electron and Molecule_Atom can be thought of as subtypes for ParticleComposition.

This approach is the most complex, it contains many tables, but each table has a different role.

Questions

  • Does any of the above constructs call the rules of the relational model?
  • Which approach is better? Does it depend on how we think about the data? Does it depend on the requirements?
  • If it depends on the requirements, will we choose the simplest design first and then make it more versatile to meet the new requirements?
    While the resulting data models have a lot in common, the original design may affect the naming of tables / columns, and the domains for the keys are different.

    • If we decided to use one table for each type of thing, we could choose incompatible keys for Atom and Molecule, such as atomic weight for Atom and molecule name for Molecule.
    • If we decide to use a generic approach, we can choose a common key for all particles.
      Changing keys can have a greater impact on the system, so it can be difficult to transition from a simple design to a generic one. What do you think?

PS: This may not be an appropriate example and solutions may be problematic and there may be more options for approaches, but you can say with confidence. If you have any better projects, please share with me.


Update 1

What is the data for the simulation?

At first I was trying to simulate the particles because

  • I think there are subtypes in between, which is exactly what I'm looking for.
  • They are well understood (?) By people.
  • This is a good example of how people understand the world.

Here is a picture in my opinion. Particle Hierarchy

I didn't say this clearly because I was not very clear about what I was trying to model.
First, I thought Atom is the parent of Proton / Neutron / Electron and Molecule is the parent of Atom.
Then I realized that it was about composition, not subtyping, and not about type hierarchy .

Types

I've been thinking about types for a while, and grouping and classification.

Here's a quote from "SQL and Relational Theory":

So what is a type, exactly? In essence, it is a named finite set of values ​​─ all possible values ​​of some particular kind: for example, all possible integers or all possible character strings, or all possible vendor numbers, or all possible XML documents, or all possible relationships with a specific header (etc. .).

People came up with the name "Integer" to represent a collection of integer values.
In fact, people have come up with concepts and names to identify things by grouping things so that we can understand / model the world.

A proton is a set of real protons, hydrogen is a set of hydrogen atoms, etc.
In this sense, real particles remain at the lowest level of the type hierarchy.
At first I tried to simulate all the particles, but then I got stuck because

  • I couldn't think of a suitable key to identify each real particle;
  • there are too many of them to store in the database.

So I decided to ignore real particles and simulate types.

When we say "a molecule is made up of atoms" it means that "a real H2O molecule is made up of two real hydrogen atoms and one oxygen atom", it also means that "any (type) molecule is made up of (some types).

Instead of specifying every fact about real particles, we can simply point out facts about the types of particles.
This is the advantage we get by grouping things and coming up with names (types).

Particle type hierarchy as sets

The hierarchy can be translated into specific definitions.

The second level is types over real particles:

S_proton = { p | p satisfied the definition of a proton }  
S_neutron = { n | n satisfied the definition of a neutron }  
S_electron = { e | e satisfied the definition of an electron }  
S_hydrogen = { h | h satisfied the definition of a hydrogen }  
S_oxygen = { o | o satisfied the definition of an oxygen }  
S_h2o = { w | w satisfied the definition of a h2o }  
S_o2 = { o | o satisfied the definition of a o2 }

      

Higher levels

Using set theory terminology, a type A is a subtype of B if A is a subset of B.

At first I thought we could define an Atom type as:

S_atom = S_hydrogen union S_oxygen union ...

      

However, sets are relationships and items are tuples, so joining does not work if the tuples in the relationship are not compatible.

Approaches using subtype tables solve the problem and model the subset relationship.

But in subtyping, Atom's approach is still at the second level .

enter image description here

Higher-level types are defined as sets of sets.

S_atom = { S_hydrogen, S_oxygen, ... }
S_molecule = { S_h2o, S_o2, ... }
S_particle = { S_proton, S_neutron, S_electron, S_atom, S_molecule }

      

which means Particle is an Atom type and Atom is a hydrogen type.

Thus, the relationship between particles can be represented at a high level.

New data model

4. Treat types as a type hierarchy

ParticleType (ParticleType, Name)
ParticleTypeHierarchy (ParticleType, ParentType)
ParticleComposition (PartileType, SubParticleType, Quantity)

Sample data:

ParticleType

| ParticleType | Name |
| -------------- + ---------- |
| Particle | Particle |
| Proton | Proton |
| Neutron | Neutron |
| Electron | Electron |
| Atom | Atom |
| Molecule | Molecule |
| H | Hydrogen |
| O | Oxygen |
| H2O | Water |
| O2 | Oxygen |

ParticleTypeHierarchy

| ParticleType | ParentType |
| -------------- + ------------ |
| Proton | Particle |
| Neutron | Particle |
| Electron | Particle |
| Atom | Particle |
| Molecule | Particle |
| Hydrogen | Atom |
| Oxygen | Atom |
| H2O | Molecule |
| O2 | Molecule |

ParticleComposition

| PartileType | SubParticleType | Quantity |
| ------------- + ----------------- + ---------- |
| H | Proton | 1 |
| H | Electron | 1 |
| He | Proton | 2 |
| He | Neutron | 2 |
| He | Electron | 2 |
| H2O | H | 2 |
| H2O | H | 2 |
| H2O | O | 1 |
| CO2 | C | 1 |
| CO2 | O | 2 |

For comparison, this is sample data for a table subtype.

Particle

| ParticleId | ParticleName |
| ------------ + ---------------- |
| H | Hydrogen |
| He | Helium |
| Li | Lithium |
| Be | Beryllium |
| H2O | Water |
| O2 | Oxygen |
| CO2 | Carbon Dioxide |

Molecule

| MoleculeId | some_attribute |
| ------------ + ---------------- |
| H2O | ... |
| O2 | ... |
| CO2 | ... |

Atom

| AtomId | ProtonQuantity | NeutronQuantity | ElectronQuantity |
| -------- + ---------------- + ----------------- + ----- ------------- |
| H | 1 | 0 | 1 |
| He | 2 | 2 | 2 |
| Li | 3 | 4 | 3 |
| Be | 4 | 5 | 4 |

ParticleComposition

| ParticleId | ComponentId | Quantity |
| ------------ + ------------- + ---------- |
| H2O | H | 2 |
| H2O | O | 1 |
| CO2 | C | 1 |
| CO2 | O | 2 |
| O2 | O | 2 |

sub-atom

These types of particles are defined by humans, and humans continue to define new concepts to model new aspects of reality.
We can define a "subatom" and the hierarchy looks like this:

Particle Type Hierarchy with Sub-Atoms

Approach 4 can facilitate changing this type of hierarchy.


Update 2

Facts to record

  • There are different types of particles in the world: protons, neutrons, electrons, atoms, molecules.
  • Atoms are made up of protons, neutrons, and electrons.
  • Molecules are made up of atoms.
  • There are many different types of atoms: hydrogen, oxygen, ...
  • There are many different types of molecules: H2O, O2, ....
  • And the hydrogen atom consists of one proton and one electron; ...
  • The H2O molecule consists of two hydrogen atoms and one oxygen atom; ...
  • Different types of particles can have special properties, for example. An atom has an atomic weight, etc.
  • ...
+3


source to share


3 answers


Preliminary

Good question, very thoughtful for a student. I think you are actually after this - discussion to get clarity, and this is a data modeling exercise.

  • I understand your progression up to 3.3. What, how do you get 3.4 (after a phased progression to 3.3)? For me, the combination of all of the above does not equal Generic.

  • Instead of following your progression and setting a model for each step, let me TRD answer the appropriate steps in your discussion.

  • TRD Only tables that are identified by keys and relationships are relevant at this stage, I think you are well aware of the Attributes, if any, and what keys they will be expanded from. After you reach a stable TRD, you can expand it to full DM.

  • After the model is set up from the previous one, and after evaluation, if it is clear that it is losing information, it can be safely discarded. There is value for such models, so this step is not wrong. But the ongoing discussion of this is a waste. I believe I demonstrated this in the previous question.

Consider this set of Table Link Diagrams .

1.x

From my point of view, A First will be the first sensible TRD worthy of contemplation.

  • I don't see how or why Proton / Neutron / Electron are independent tables. They do not exist by themselves, their weights; etc. fixed. They only exist in the Atom context.

  • Since each Atom contains at least one proton / neutron / electron, the Proton / Neutron / Electron columns can be expanded into an Atom. Not drawn. Later.

2.x

Your progression is fine except for one glaring error.

general attributes about particle composition are listed in ParticleComposition, while special attributes about particle composition go to special tables.

Not. General attributes about particles go to Particle. Relationship-specific attributes (i.e., not generic) refer to ParticleComposition. And then there are no "special attributes about the composition of particles", there are no "special tables".

3.x

Consider the B Subtype . Your [3.1] is mostly correct, except:

  • I don't see Particle having kids like Proton / Neutron / Electron. Only Atom has it.

  • I don't see how Particles are related to other Particles (i.e. what is it?). For the data discussed, a molecule is composed of atoms; An atom consists of a proton / neutron / electron; and Particle is either Molecule xor Atom (Exclusive subtype).

  • Please correct me if this is not correct.

  • See document subtype for details .

It can be C Decreased as you stated. This states that proton / neutron / electronics information is captured for the Atom: there is one entry for each entry. For example. each sheath / energy level is not differentiated; zero is acceptable for neutrons (instead of Null).

  • I have already discussed the great value of Predicates. The main thing here is that the model identifies the predicates. and Predicates validate the model; this is a great feedback loop. I have given Predicates so you can evaluate them for yourself and check if the model is correct.

3.3

If it was completely D Normalized : Atom always has at least a Proton record; the Neutron record is optional; and each Shell / Energy level is differentiated.

  • Note the difference in predicates.

  • Note that while Reduction is a valid method, it does not conform to normalization.

3.4

This appears to be the sum of everything, laid out flat or flattened out (derived relations, perspective, result set). How good it is for understanding. But if you proposed it as a set of tables, then it is terribly incorrect due to various normalization errors. Which, if fixed, would go into [3.3] and mine [D Normalized].

Question

Does any of the above constructs break the rules of the relational model?

All of them, except [3.3], violate a number of rules. They mostly fall into the category of normalization errors. There would be associated identification errors if you provided the complete model or CREATE TABLE statements.

But it doesn't matter if the context is a data modeling exercise, to understand. If this exercise was serious, then the paragraph above.

This section is presented in accordance with SO guidelines, specifically: correct misinformation when you see it . I have commented on this thread, but they keep disappearing. So I posted it here.

Erwin Smout:
  When shortened to its core, the relational data model has at most one "rule": all information in the database must be represented as attribute values ​​in tuples in relationships.

This is one of the rules, yes, but the supplied statement is clearly false.

First, there are many first or first order rules in the relational model. From memory I would say about forty.

Second, there are many second-order rules that are logically implied by first-order rules.

  • People with technical qualifications and experience, who can understand RM, and who follow spirit and intent, follow all of them.

  • Others may not recognize some first-order rules or most of the implied rules.

  • And there is, as evidenced by books that claim to be about RM and others are people who are actively destroying and shrinking RM. They ignore second-order rules, and worse, they use pharisaic "logic" to undermine first-order rules.

  • Here Erwin, well known for his RM efforts on comp.databases.theory and TTM, reduces RM to one correct rule and thus undermines the full ruleset, and RM himself. In particular, in response to your question, which, if not for my answer, would lead readers to believe that RM is what it does: just one rule that everything, attitude and non-attitude, "satisfies" ...

  • The relational model is freely available, you can read it yourself. Let me know if you would like a copy. The caveat is that the terminology is out of date and needs to be explained.

Secondly, even if it was necessary to weld it down to one rule (impossible, too reductionist) or the most important rule (perhaps, but humiliating), this rule will not exist. It's one of forty or so first-order rules, but it certainly wasn't close to the top.

  • However, I suggest that other people might have different rankings according to their own goals.

  • What people who understand RM are discussing, as the main difference (not a rule) between RM and its predecessors is this:

    • He was the first to receive a complete mathematical definition (which forms its basis, and everything in it flows from this).

    • Whereas predecessors map records using physical record identifiers, RM requires (a) logical keys made up of data, and (b) linking strings (not records) with those logical keys.



It should be mentioned that the basis for systems that are characterized by record identifiers in each file, declared as "primary key" are completely non-relational, a regression to pre-1970 ISAM record systems that RM is obsolete. Also notice how you can make these primitive systems appear "relational" because, according to schizophrenic "logic," they "satisfy" one quoted rule. Honest logic destroys such nonsense.

Such identity-based systems have become a burrow at the bottom of the industry precisely because of disinformation. Hence my willingness to fix it.

Complete the section on correcting misinformation.

What's the best approach?

Modeling formal data, including relational normalization. Method, science, principle, not fragments of NF definitions.

I don’t see sentences as different approaches, and this puts all of your thoughts into one modeling exercise. And the point at which the model starts to take on a serious, acceptable form is [3.3].

Does it depend on how we think about the data?

Sure. Your marriage will be successful or unsuccessful based on your wife's perception, because that perception is the seat of all your actions. The model will succeed or fail based on your perception of the data.

One of the great things about the relational model is that it teaches us to view (perceive, think, model) data as data and nothing but data. First, it forms the concept of a logical key.

Does it depend on the requirements?

The first answer is no, it shouldn't depend on requirements. It should take into account data that is limited by the enterprise (a requirement, yes, but not a functional requirement) and only data.

And, of course, for reasons that I have detailed elsewhere, the data model must match the real world, it must not be limited by the functional need for data analysis.

A massive error, a common reason for failure in the OO / ORM model, is that it takes data from the tiny lens of the OO / ORM model. It cannot decouple Data vs Process and treats data as simple "persistent" submission of objects. There are many other errors in this model that I will not list here, the point is that they start from the position of the requirement and ignore the data.

Second answer: the project is not commissioned until the requirement is established, reality if the funds are requirements based. In this way, a mature project manager ensures that the requirement contains sufficient justification for analyzing and modeling data as data separate from functions.

If it depends on the requirements, should we choose the simplest design first and then make it more versatile to meet the new requirements?

You could, but it will cost a lot. The mature consistency is to get the data model as soon as possible.

If the data model matches the real world, when changes and additions appear, it is easy to extend it. Conversely, if the data model was minimal for a functional requirement, or if it does not match the real world, then changes will be complex and costly.

While the resulting data models have a lot in common, the original design can influence the table / column naming and the key domains differ.

Sure.

If we chose to use one table for each type of thing, we could choose incompatible keys for Atom and Molecule, such as atomic atom for Atom and molecule name for Molecule.

It would be a terrible mistake. Never put anything in a container that doesn't match the label. Never put two different things in the same container (which has the same label). The correct method would be to use the common identifier name (which is Atom- or Molecule- or Particle-name) and use subtypes.

If we decide to use a generic approach, we can choose a common key for all particles.

Only if there is. And if this does not happen, it means that the entities do not match, that the universal model cannot be used.

Changing keys can have a greater impact on the system, so it can be difficult to transition from a simple design to a generic one.

Well, the idea is to select data items that are stable (not static) to form a key. And yes, key design is an important aspect of modeling. If you follow the Relational Model, Keys form the logical structure of the database. The domain is very important (I think you understand that). And yes, it's expensive.

This brings us back to the main question. That is why Keys must be modeled and selected correctly, for each table, as well as for all of its children.

Update 1 and 2

I noticed your two updates just now. This is not a complete answer, it is very short now.

  • So far, I have understood Particle as a set of Atoms plus a set of Molecules. This is what I modeled in D Normalized . Both have a name, a common key. He is a subtype.

But now, given your hierarchical charts and data samples (thanks), I understand that what I thought you meant and what you meant are two different things. Consider the Updated TRD and hierarchy :

  • Your part is a collection of Molecules plus many Atoms plus many subatomic particles.

    • This is not true

    • There is a hierarchy, yes, but it still exists in a sequence of tables, not as a hierarchy within a single table.

    • Otherwise, the two sets (atoms, molecules) are discrete, each with its own set of components that are different. There is no set that includes everything (except the theoretical universal set).

    Updated table mapping model E Normalized β€’ Update 2 . Subtypes have been removed along with Particle. It provides all of the requirements in Update 2. Note the updated predicates.

  • Your hierarchical diagram is wrong.

    • Your mistake is because you have combined the classifier hierarchy (structure, container) with data (classifier instances, content). You cannot do this. You need two separate charts, one for the container and one for the content.

    • This is a common mistake in OO / ORM thinking. Failure to adhere to scientific principle separates data and process . The exact error described in my answer to Hidders in the previous question. Results are complex objects that never work.

    • So your hierarchy diagram is illegal, they are two completely different diagrams combined into one.

    The hierarchy (classification) reflects this and only this.

    Hierarchy G (sample data) illustrates this and that alone.

    There is a difference in style between the way you depict hierarchies (organization chart) and the way you display them (explorer). One ends up very wide, the other more compact. I think you can figure it out.

  • You had some clarity at the end of the previous question. The new concept of Type in this poisonous book has completely puzzled you. This problem, these problems, has nothing to do with the type.

More words are required, I will answer more fully as time permits.

+5


source


Abbreviated to its core entity, the relational data model has at most one "rule": all information in the database must be represented as attribute values ​​in tuples in relationships.

All your "alternatives" potentially conform to this rule if: - that every attribute has an associated data type,
- and that every tuple in every relation in the database will always have a value for each of its attributes,
- and that value is a value that is a member of the data type associated with this attribute.

EDIT: You also haven't provided any details on what the exact nature of the facts you want to record in your system is.

EDIT 2: The first comment by Walter M. still applies. Your facts seem to point to different levels (which will differ markedly in typical use cases):

"6. A hydrogen atom consists of one proton and one electron

After a little rewriting to remove the "AND" in it:

"6. A <atom_id> atom contains <qty β†’ β†’ β†’ β†’ β†’ β†’ β†’ β†’ β†’ β†’ β†’ <

This one looks like something that will end up in your database (assuming your use case is typical / mundane as one might assume):

A 'H' atom contains 1 proton A 'H' atom contains 1 electron A 'H' atom contains 1 neutron



(Note that the elimination of "AND" involves splitting the conjunction into "atomic" parts (pun intended).)

From this, we can start to wonder what to do with <subatomicparticletype>. If your use case is such that the existence of a proton / neutron / electron is only given and it will never change, then you can just use a datatype for it, and for modeling this will not involve not only type identification, your model readers will know the set value. If, however, your use case is such that, say, you are experimenting to try and find a completely new model of chemistry, which may also have foobarons along with protons [or their existence may be deleted again for the sake of experiment) then you will need to include the table where "<subatomicparticletype> is part of my atom model".

In addition, you also need to include a rule in your model that any <subatomicparticletype> that claims to be part of an atom should indeed be one that is part of your atom model. In SQLspeak: you will need the FK from the ATOM_CONTAINS_PARTICLE table to this EXISTING_PARTICLES table.

In a sense, the declaration of this rule is similar to yours

"2. Atoms are made up of protons, neutrons and electrons.

But note that you won't have a table in your own database that talks about this. Instead, by declaring FK to the system, that particular statement will be made in the directory.

You need to make the right distinction between the type of operators that directly point to things that are in the domain of interest (these are ultimately entities / classes / ... in your conceptual models and most likely tables in your database) and the type operators that define about objects of the region of interest (for example, your FK rule).

(In cases where the domain itself is very abstract, the line between them can be extremely thin or even non-expansive.)

+3


source


I love the way Fowler treated class inheritance and single table inheritance. You touched on both of these projects here. Each of them has its own capabilities and disadvantages. You can use them as search terms and you will find it useful to read a lot. Some of them are noteworthy. SO also has a few tags with these names.

I'm not sure today, but subtypes were often overlooked from Database 101 courses back about 20 years ago, and this was something every database creator faced once they got into the "real world".

+1


source







All Articles