MySQL design question - which is better, long tables or multiple databases?

So, I have an interesting problem that was the fruit of many good discussions in my group at work.

We have scientific software that generates SQLite files, and this software is essentially a black box: we have no control over its table design, formats, etc. It is understood that this black box's output may change, and our design should be able to handle that.

Each SQLite file represents an entire database that the user would like to query across. The two implementations we are weighing are (1) a single database plus a Python backend that adds the tables from each SQLite database into the main database, and (2) querying the separate databases individually and combining the results in Python.
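A rough sketch of option (1), assuming the central store is itself just another SQLite file (all file and table names here are placeholders), would be something like:

```python
import sqlite3

def consolidate(central_path, source_paths):
    """Copy every table from each source SQLite file into one central SQLite file.
    Tables are prefixed with the source index to avoid name collisions."""
    con = sqlite3.connect(central_path)
    for i, src_path in enumerate(source_paths):
        con.execute("ATTACH DATABASE ? AS src", (src_path,))
        tables = [r[0] for r in con.execute(
            "SELECT name FROM src.sqlite_master WHERE type = 'table'")]
        for name in tables:
            # Hypothetical naming convention: run0_results, run1_results, ...
            con.execute(f'CREATE TABLE "run{i}_{name}" AS SELECT * FROM src."{name}"')
        con.commit()
        con.execute("DETACH DATABASE src")
    con.close()

# consolidate("main.db", ["run0.sqlite", "run1.sqlite"])
```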

Both methods run into difficulties when the black box changes its table structures, such as renaming a column, splitting a table, etc. We have to keep this in mind, and we have discussed translation tables that map column queries from one table format to another.
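For the translation-table idea, we are picturing something along these lines (the column names and version tags are made up for illustration):

```python
# Hypothetical mapping: logical column name -> physical column name, per schema version.
COLUMN_MAP = {
    "v1": {"sample_id": "sample_id", "intensity": "intensity"},
    "v2": {"sample_id": "sample",    "intensity": "signal_strength"},
}

def build_select(logical_columns, table, version):
    """Build a SELECT that aliases the physical columns back to the logical names."""
    mapping = COLUMN_MAP[version]
    cols = ", ".join(f'"{mapping[c]}" AS "{c}"' for c in logical_columns)
    return f'SELECT {cols} FROM "{table}"'

# build_select(["sample_id", "intensity"], "results", "v2")
# -> SELECT "sample" AS "sample_id", "signal_strength" AS "intensity" FROM "results"
```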

We are interested in ease of implementation, how well the design handles changes in database/table layout, and speed. A final dimension is how well it will work with existing Python web frameworks (Django doesn't support cross-database queries, and SQLAlchemy doesn't either, so we know we are in for a lot of programming).

+2




4 answers


If you are going to be querying across the databases, you should look into consolidation. Cross-database queries are evil.



If your queries essentially stay within the separate databases, you may want to stick with multiple databases, since there is apparently a clear reason to keep them separate.

+3




You cannot make arbitrary changes to the database schema without classifying and anticipating the change in some way. At best, with non-trivial changes, you can sometimes simply ignore new data or tables; at worst, your interpretation of the data will be completely broken.

I have run into similar problems where users need output data derived from a normalized schema. The schema does NOT change. However, their required output format demands a fixed number of hierarchical levels. So while the database design accommodates all the changes they want to make, their chosen representation of that data cannot be maintained in the face of those changes. Thus it is not possible to guarantee the output schema ahead of the data changes (even with no schema change at all). This does not mean it is not a valid output or input schema, but there are limits beyond which their chosen schema cannot stretch. At that point they have to revise the output contract, and the pivoting program (which can anticipate this and generate new columns) then has somewhere to put the data in the output schema.

My point is that the semantics and interpretation of new columns and new tables (or of dropping columns and tables that existing logic may depend on) are non-trivial unless the new columns or tables are expected. And in those cases there are usually better database designs that obviate those strategies in the first place:

For instance, a particular database schema might contain any number of tables, all with the same structure (although there is no theoretical reason they could not be combined into a single table). Similarly, a table might contain a set of similarly named columns (although this "array" violates normalization principles and could be normalized into a common key/code/value schema).
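As a rough sketch of that key/code/value normalization (all table and column names below are invented), the repeated "array" columns can be melted into rows:

```python
import sqlite3

def melt_array_columns(con, table, id_col, array_cols):
    """Rewrite repeated 'array' columns (peak_1, peak_2, ...) as
    (row_id, key, value) rows in a generic key/value side table."""
    con.execute(f'CREATE TABLE IF NOT EXISTS "{table}_values" (row_id, key, value)')
    for col in array_cols:
        con.execute(
            f'INSERT INTO "{table}_values" (row_id, key, value) '
            f'SELECT "{id_col}", ?, "{col}" FROM "{table}"', (col,))
    con.commit()

# con = sqlite3.connect("main.db")
# melt_array_columns(con, "results", "sample_id", ["peak_1", "peak_2", "peak_3"])
```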



Even in a data-warehouse ETL situation, it must be determined whether a new column is a fact or a dimension attribute, and then, if it is a dimension attribute, which dimension table it best belongs in. This can be somewhat automated for facts (scalars such as decimal/numeric would be obvious candidates) by checking the metadata for unrecognized columns, altering the DW table (yikes), and then loading accordingly. But for dimensions, I would be very wary of automating anything like that.
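As a sketch of that metadata check (the expected column names here are hypothetical), something like this could flag unrecognized or missing columns before loading:

```python
import sqlite3

EXPECTED = {"results": {"sample_id", "intensity", "timestamp"}}  # hypothetical

def find_schema_drift(path):
    """Report columns in the SQLite file that we do not expect, and vice versa."""
    con = sqlite3.connect(path)
    drift = {}
    for table, expected_cols in EXPECTED.items():
        rows = con.execute(f'PRAGMA table_info("{table}")').fetchall()
        actual = {row[1] for row in rows}  # row[1] is the column name
        drift[table] = {"new": actual - expected_cols,
                        "missing": expected_cols - actual}
    con.close()
    return drift
```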

So, overall, I would say that unexpected schema changes in a well-normalized database design are the least likely to be easy to accommodate, because: 1) the database design already anticipates and accommodates a great deal of change and flexibility, and 2) schema changes to such designs can hardly be anticipated very easily. Conversely, schema changes in a poorly normalized database structure are actually rather more predictable, because the flaws in the database design are more visible.

So my question to you is, how well designed is the database you are working from?

+1




You say you know you are in for a lot of programming...

I'm not sure about that. I would go for the quick and dirty solution over a "generic" solution, because generic solutions like the entity-attribute-value model often perform poorly. Don't do client-side joins (merging results) inside your Python code, because that is very slow. Use SQL for the join; that is what it is for. Users can also create their own reports with all kinds of reporting tools that generate SQL statements. You don't have to do everything in your application; just start by solving 80% of the problems, not 100%.
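To make the "use SQL for the join" point concrete: with SQLite you can attach a second file and let the engine do the join, rather than fetching two result sets and merging them in Python (file, table and column names are made up):

```python
import sqlite3

con = sqlite3.connect("run_a.sqlite")
con.execute("ATTACH DATABASE 'run_b.sqlite' AS b")

# One server-side join instead of two fetches plus a Python-side merge.
rows = con.execute("""
    SELECT a.sample_id, a.intensity AS intensity_a, bb.intensity AS intensity_b
    FROM main.results AS a
    JOIN b.results AS bb ON bb.sample_id = a.sample_id
""").fetchall()
```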

If something breaks because something inside the black box changes, you can define backward-compatibility views that preserve what your application expects.
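Such a backward-compatibility view could look something like this: suppose the black box renames a column, and a view restores the old name so the rest of the application does not notice (all names are invented):

```python
import sqlite3

con = sqlite3.connect("main.db")

# The new files call the column "signal_strength", but the application still
# queries "intensity": expose the old shape through a view.
con.execute("""
    CREATE VIEW IF NOT EXISTS results_v1 AS
    SELECT sample AS sample_id,
           signal_strength AS intensity
    FROM results
""")
```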

Maybe the scientific software will add many new features and change its data model because of them? That is possible, but then you would still have to change your application to benefit from those new features anyway.

+1




It seems to me that your problem is not really about MySQL or SQLite. It is about sharing data, and the contract that must exist between the supplier of the data and the consumer of that same data.

To the extent that databases exist so that data can be shared, that contract is fundamental to everything about databases. When databases were first being built and database theory was first being solidified, in the 1960s and 1970s, data sharing was the central purpose of database design. Today databases are often used where files would serve equally well. Your situation may be a case in point.

In your situation, you have a poor contract with your data suppliers. They can change the format of the data, and maybe even its semantics, and all you can do is suck it up and deal with it. This situation is by no means uncommon.

I do not know the specifics of your situation, so what follows may be wide of the mark.

If it were up to me, I would want to create a database that is as general, flexible and stable as possible without losing the core benefits of structured and managed data. Perhaps some design like a star schema would make sense, but I might adopt a completely different design if I were in your shoes.

That leaves the problem of extracting the data from the databases supplied to you, transforming it into the stable format your central database supports, and loading it into the central database. You are right to suspect that this involves a lot of programming. This process, known as "ETL" in data-warehousing texts, is not the easiest of programming problems.
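A bare-bones sketch of that ETL step, assuming the central store is also reachable through Python's sqlite3 module, that a stable `results` table already exists there, and that the column mapping is maintained by hand (every name here is hypothetical):

```python
import sqlite3

# Hypothetical hand-maintained mapping from source columns to the stable schema.
TRANSFORM = {"sample": "sample_id", "signal_strength": "intensity"}

def etl(source_path, central_con):
    """Extract rows from one black-box SQLite file, rename columns to the
    stable schema, and load them into the central database."""
    src = sqlite3.connect(source_path)
    src.row_factory = sqlite3.Row
    for row in src.execute("SELECT * FROM results"):                  # extract
        record = {TRANSFORM.get(k, k): row[k] for k in row.keys()}    # transform
        cols = ", ".join(record)
        params = ", ".join("?" * len(record))
        central_con.execute(
            f"INSERT INTO results ({cols}) VALUES ({params})",        # load
            tuple(record.values()))
    central_con.commit()
    src.close()
```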

At least ETL gathers all the hard problems in one place. Once you have loaded the data into a database built for your needs rather than your suppliers' needs, turning the data into valuable information should be relatively easy, at least at the programming or SQL level. There are even OLAP tools that make using the data as easy as playing a video game. There are problems at that level, but they are not the same problems I am talking about here.

Read up on data warehousing, and especially data marts. The descriptions may seem daunting at first, but they can be scaled down to suit your needs.

0








