Best way to create nested array from tables: multiple queries / loops VS single query / loop style
Let's say I have 2 tables that I can "concatenate" and represent in one nested array.
I am wandering what would be the best way to do this, given:
- efficiency
- best practics
- DB / server side use combination
- what should you do in real life
- the same case for 3, 4 or more tables that can be "merged" this way
The question is about ANY server side / relational-db.
2 easy ways I was thinking (if you have others please suggest! Notice I am asking for a simple SERVER-SIDE and RELATIONAL-DB so please don't waste your time explaining why I shouldn't use this kind of DB. use an MVC project, etc., etc.):
- 2 loops, 5 simple "SELECT" queries
- 1 cycle, "JOIN" request
I tried to give a simple and detailed example to explain myself and understand your answers better (although how to write code and / or search for possible errors is not a problem here, so try not to focus on that ...)
SQL SCRIPTS FOR CREATING AND INSERTING DATA INTO TABLES
CREATE TABLE persons
(
id int NOT NULL AUTO_INCREMENT,
fullName varchar(255),
PRIMARY KEY (id)
);
INSERT INTO persons (fullName) VALUES ('Alice'), ('Bob'), ('Carl'), ('Dan');
CREATE TABLE phoneNumbers
(
id int NOT NULL AUTO_INCREMENT,
personId int,
phoneNumber varchar(255),
PRIMARY KEY (id)
);
INSERT INTO phoneNumbers (personId, phoneNumber) VALUES ( 1, '123-456'), ( 1, '234-567'), (1, '345-678'), (2, '456-789'), (2, '567-890'), (3, '678-901'), (4, '789-012');
A JSON REPRESENTATION OF TABLES AFTER I "RANDOM" THEM:
[
{
"id": 1,
"fullName": "Alice",
"phoneNumbers": [
"123-456",
"234-567",
"345-678"
]
},
{
"id": 2,
"fullName": "Bob",
"phoneNumbers": [
"456-789",
"567-890"
]
},
{
"id": 3,
"fullName": "Carl",
"phoneNumbers": [
"678-901"
]
},
{
"id": 4,
"fullName": "Dan",
"phoneNumbers": [
"789-012"
]
}
]
PSEUDO CODE FOR 2 WAYS:
1.
query: "SELECT id, fullName FROM persons"
personList = new List<Person>()
foreach row x in query result:
current = new Person(x.fullName)
"SELECT phoneNumber FROM phoneNumbers WHERE personId = x.id"
foreach row y in query result:
current.phoneNumbers.Push(y.phoneNumber)
personList.Push(current)
print personList
2.
query: "SELECT persons.id, fullName, phoneNumber FROM persons
LEFT JOIN phoneNumbers ON persons.id = phoneNumbers.personId"
personList = new List<Person>()
current = null
previouseId = null
foreach row x in query result:
if ( x.id != previouseId )
if ( current != null )
personList.Push(current)
current = null
current = new Person(x.fullName)
current.phoneNumbers.Push(x.phoneNumber)
print personList
IMPLEMENTING THE CODE IN PHP / MYSQL:
1.
/* get all persons */
$result = mysql_query("SELECT id, fullName FROM persons");
$personsArray = array(); //Create an array
//loop all persons
while ($row = mysql_fetch_assoc($result))
{
//add new person
$current = array();
$current['id'] = $row['id'];
$current['fullName'] = $row['fullName'];
/* add all person phone-numbers */
$id = $current['id'];
$sub_result = mysql_query("SELECT phoneNumber FROM phoneNumbers WHERE personId = {$id}");
$phoneNumbers = array();
while ($sub_row = mysql_fetch_assoc($sub_result))
{
$phoneNumbers[] = $sub_row['phoneNumber']);
}
//add phoneNumbers array to person
$current['phoneNumbers'] = $phoneNumbers;
//add person to final result array
$personsArray[] = $current;
}
echo json_encode($personsArray);
2.
/* get all persons and their phone-numbers in a single query */
$sql = "SELECT persons.id, fullName, phoneNumber FROM persons
LEFT JOIN phoneNumbers ON persons.id = phoneNumbers.personId";
$result = mysql_query($sql);
$personsArray = array();
/* init temp vars to save current person data */
$current = null;
$previouseId = null;
$phoneNumbers = array();
while ($row = mysql_fetch_assoc($result))
{
/*
if the current id is different from the previous id:
you've got to a new person.
save the previous person (if such exists),
and create a new one
*/
if ($row['id'] != $previouseId )
{
// in the first iteration,
// current (previous person) is null,
// don't add it
if ( !is_null($current) )
{
$current['phoneNumbers'] = $phoneNumbers;
$personsArray[] = $current;
$current = null;
$previouseId = null;
$phoneNumbers = array();
}
// create a new person
$current = array();
$current['id'] = $row['id'];
$current['fullName'] = $row['fullName'];
// set current as previous id
$previouseId = $current['id'];
}
// you always add the phone-number
// to the current phone-number list
$phoneNumbers[] = $row['phoneNumber'];
}
}
// don't forget to add the last person (saved in "current")
if (!is_null($current))
$personsArray[] = $current);
echo json_encode($personsArray);
PS this link is an example of another question here where I tried to suggest a second way: tables for single json
source to share
Preliminary
First, thanks for putting in a lot of effort in explaining the problem and formatting. It's great to see someone who is clear about what they are doing and what they are asking.
But it should be noted that this in itself creates a limitation: you are assuming that this is the right decision, and that with some small correction or guidance it will work. This is not true. Therefore, I must ask you to give this concept, take a big step back and consider (a) the whole problem and (b) my answer without this concept.
The context of this answer:
-
all the explicit considerations you have given that are very important, which I will not repeat
-
the two most important are: what is the best practice and what will I do in real life
This answer is based on standards, higher order or frame of reference for best practice. This is what I have been doing in real life since 1990, which means that since 1990 I have never had to write code like yours. This is what the commercial Client / Server world does or should do.
This problem, all this problem space, is becoming a common problem. I will go into this question in detail here and thus answer another question. Therefore, it may contain a little more detail that you require. If so, please forgive this.
Consideration
-
A database is a server resource available to many users. In the online system, the database is constantly changing. It contains one version of the truth (as opposed to one fact in one place, which is a separate normalization problem) of each fact.
- the fact that mickey mouse NONsqls does not have a server architecture and therefore the concept of a server in such software is false and misleading are separate but marked points.
-
As I understand it, JSON and JSON structures are needed for "performance reasons", precisely because "server" does not execute as server cannot. The concept is to cache data for each (each) client, so you don't collect it from the "server" all the time.
-
This opens up a smelly can of worms. If you don't design and implement it correctly, worms will flood your application and the stench will kill you.
-
This implementation is a flagrant violation of the Client / Server architecture, which provides simple code on both sides and appropriate deployment of software and data components, so implementation time is short and efficiency is high.
-
In addition, such an implementation requires significant implementation effort and is complex with many parts. Each of these parts must be properly designed.
-
The web and the many books written in this subject area provide cesspool tiers of techniques marketed based on perceived simplicity; ease; someone can do something; free software, can-do anything; etc. There is no scientific basis for any of these proposals.
-
Nonarchitecture and Substandard
As you can see, you learned that this selling mythology is scam. You are faced with one problem: one case that this advice is incorrect. Once you solve this problem, the next problem will be uncovered, which is not obvious to you right now. These concepts are an endless set of problems.
I will not list all the misconceptions that these are pretending experts (in reality, circus freaks who do not know technology). I believe that as you progress through my answer, you will notice that one market concept after another is false.
The bottom two lines:
-
The concepts violate architecture and design standards, namely: Client / Server architecture; Open architecture ; Engineering principles; and to a lesser extent in this particular issue, Database Design Principles.
-
This leads to people like you trying to do an honest job, cheating, cheating, seducing, introducing simple concepts that turn into mass realizations. Implementations that will never work, so they require substantial ongoing maintenance and will eventually be replaced by the wholesale.
Architecture
The central principle violated never duplicates anything. The moment you have a place where data is duplicated (due to caching or replication or two separate monolithic applications, etc.), you create a duplicate that will sync in the internet situation. So the principle is not to.
- Of course, for serious third party software like a crude report tool, by design, they can cache server data on the client. But keep in mind that they have put hundreds of man-years in the right place with the above in mind. Yours is not such a software product.
Rather than giving a lecture about the principles to be understood, or the flaws and costs of each mistake, the rest of this answer asks what you would do in real life using the correct architectural method (step above is best practice) ...
Architecture 1
Do not confuse
- the data
to be normalized
from
- result set
which, by definition, is a smoothed ("unnormalized" not entirely correct) kind of data.
The data, if normalized, will not contain duplicate values; repeating groups. The result set will contain duplicate values; repeating groups. This is a pedestrian.
-
Note that the concept of nested sets (or nested relationships), which is largely marketed by schizophrenics, is based precisely on this confusion.
-
In the forty-five years since the advent of RM, they have been unable to differentiate base relations (for which normalization applies ) from derived relations (for which normalization does not apply ).
-
Currently, two of the freaks are attacking the definition of the first normal form. This is an attack on the intellect. This would (if accepted) normalize the madness. 1NF is the backbone of other NFs, if this madness is accepted, all NFs will be damaged, humiliated, irrelevant. The result is that normalization itself (loosely defined in mathematical terms, but understood as a science by professionals) will be severely damaged if not destroyed.
Architecture 2
There is a centuries-old scientific and technical principle that content (data) should be separated from control (software elements). This is because analysis; design; and the implementation of these two is completely different. This principle is equally important in the software sciences, where it has a certain articulation.
To keep this short (ha ha), instead of discourse, I will assume that you understand:
-
That there is a scientifically required boundary between data and program elements. Mixing results leads to complex, error-prone objects and difficult to maintain.
-
The confusion of this principle has reached epidemic proportions in the OO / ORM world, with consequences worldwide.
-
Only educated professionals avoid this madness. Otherwise, the vast majority, they accept this madness as "normal" and they spend their lives fixing problems that we simply do not have.
-
-
Architectural excellence, great value, data stored and presented in Tabular form on the Dr EF Codd relational model. There are certain rules for data normalization.
-
And importantly, you can tell when people in the madhouse who write and sell books are advising non-relational or anti-relational methods.
Architecture 3
If you are caching data on the client:
-
Load the absolute minimum.
This means that the cache is only data that does not change in the network environment. This means only Reference and Lookup tables, tables that populate higher level classifiers, drop-down lists, etc.
-
Currency
For each cached table, you must have a method to (a) determine if the cached data is outdated compared to one version of the truth that exists on the server, and (b) update it from the server, (c) table by table.
Typically this involves a background process that runs every (e) five minutes, which queries the updated MAX DateTime for each cached table on the client versus the DateTime on the server, and when changed, updates the table and all its child tables that depend from the modified table.
This of course requires a column
UpdatedDateTime
for each table. It's not a burden, because you need it for OLTP ACID transactions anyway (assuming you have a real database and not a bunch of substandard files).
Which really means never replicated, the encoding limitation is prohibitive.
Architecture 4
In a sub-commercial, non-server world, I understand that freaks advise the opposite (insane people always contradict common sense), caching "everything."
-
This is the only way that programs like PusGresQl, created by their cohorts in the same hideout, can be used in a multiplayer system, the only way they can spread their cancer.
-
You always get what you pay for: you pay for peanuts, you get monkeys; you pay zero, you get zero.
Architecture Corollary 3 - If you are caching data on the client, do not cache tables that change frequently. These are transaction and history tables. The notion of caching such tables or all tables on the client is completely bankrupt.
In a true Client / Server deployment, due to the use of applicable standards for each data window, the application should only ask for the rows that are required for that specific need at that specific time, based on context or filter values, etc. The application should never load the entire table.
If the same user, using the same window, checked its contents, 15 minutes after the first check, the data will be out of date by 15 minutes.
-
For freeware / shareware / vapourware platforms that define themselves due to the lack of a server architecture and hence, by the end result, this performance does not exist, of course you should cache more than the minimum tables per client.
-
If you do, you must consider all of the above and implement it correctly, otherwise your application will break and the stench will make users look for your termination. If there is more than one user, they will have the same reason and will soon form an army.
Architecture 5
We now move on to how you cache these carefully selected tables on the client.
Note that databases are growing, they are expanding.
-
If the system is broken, malfunctioning, it will grow in small increments and require more effort.
-
If the system has little success, it will grow exponentially.
-
If the system (each of the database and application separately) is well designed and implemented, changes will be easy, there will be few bugs.
Therefore, all components of the application must be properly designed, comply with applicable standards, and the database must be completely normalized. This, in turn, minimizes the impact of changes to the database, application, and vice versa.
-
The application will consist of simple, uncomplicated objects that are easy to maintain and modify.
-
For data that you use on the client, you will use arrays of some form: multiple instances of a class on the OO platform; DataWindows (TM, google for it) or similar in 4GL; simple arrays in PHP.
(Also, note that people in situations like yours produce in one year, professional vendors like me produce in one week using a commercial SQL platform, a commercial version of 4GL, and is architecture and standards compliant.)
Architecture 6
So, let me assume that you understand all of this and appreciate its value, especially Architecture 1 and 2.
- If you don't, stop here and ask questions, don't go to the one below.
Now that we have established the full context, we can solve the root of your problem.
-
In those application arrays, why on Earth would you store flattened representations of data?
- and therefore mess and agonize problems
-
instead of keeping copies of the normalized tables?
Answer
-
Never duplicate everything you can get. This is an architectural principle, not limited to database normalization.
-
Never drain anything.
If you do this, you will be creating :
-
duplication of data and their mass on the client. The client will not only be fat and slow, he will be tied to the floor with a ballast of duplicate data.
-
additional code that is completely unnecessary
-
complexity in this code
-
code that is fragile that will constantly change.
This is the exact problem you are suffering from, a consequence of a method that you know intuitively wrongly that there must be a better way. You know that this is a common and common problem.
Note that this method (the poison that is being sold), this code, is a mental anchor for you. See how you formatted it and present it so nicely: this is important to you. I am reluctant to inform you of all this.
- What a reluctance to overcome easily because of your sincere and frank attitude and the knowledge that you did not invent this method, that you followed "teachers" who, apparently, are completely unaware of the relevant science; non-science; nonsense like "science".
-
-
In each segment of code, during presentation, as needed:
and. In a Client / Server commercial context
Execute a query that joins simple, normalized, unencrypted tables and retrieves only qualifying strings. Thus, getting the current data values. The user never sees stale data. Representations (smoothed views of normalized data) are often used here.b. In the sub-context of the commercial non-server
Create a temporary array result array and connect simple unallocated arrays (up tables, which are cached), and fill it only qualifying rows from the original array. The currency of which is maintained by the background process.-
Use keys to form joins between arrays in the same way keys are used to form joins in relational tables in a database.
-
Destroy these components when the user closes the window.
-
The smart version will eliminate the result set array and concatenate the original arrays with keys and limit the result to qualifying strings.
-
Separated from the architectural madness, nested arrays or nested sets or JSON or JSON-like structures are simply not required. This is due to the confusion of Architecture Principle 1.
- If you choose to use such structures, use them only for temporary result arrays.
Finally, I'm sure this discourse demonstrates that n tables is not a problem. More importantly, there are m levels deep in the data hierarchy, "nesting" is not a problem.
Answer 2
Now that I've given the full context (and not before) that removes the implications in your question and makes it general, core.
The question is about ANY server side / relational-db. [What's better]:
2 loops, 5 simple "SELECT" queries
1, 1 JOIN request
The detailed examples you provided are not exactly described above. Exact descriptions:
-
Your option 1 2 loops, each loop to load each array 1 one-segment SELECT query for each loop (execute nxm times ... only outer loop is a one-time execution)
-
Your option 2 1 Execute the SELECT query once and then 2 loops, each loop to load each array
For commercial SQL platforms, none of these apply.
- Commercial SQL Server is a job processing engine. Use a single query with any joins that return a result set. Never step through lines using a loop that shrinks the pre-1970 ISAM job engine. Use View on the server because it provides the best performance and your code is in one place.
However, for non-commercial, non-server platforms, where:
-
your "server" is not a job processing engine; it returns single lines, so you need to get each line and fill the array manually, or
-
your "server" does not provide Client / Server binding, i.e. it does not provide a means for the client to bind the incoming result set to the receiving array, and so you have to go through the returned result set line by line and populate the array manually
as per your example , then the answer will be with a big difference to your option 2.
Please study carefully and comment or ask questions.
Reply to comment
Let's say I need to print this json (or some other html page) for some STOUT (example: http: GET / allUsersPhoneNumbers response. This is just an example to clarify what I expect to get), should return this json, I have a php function, which got these 2 sets of results (1). now it should print this json - how should i do that ?. this report can represent the employee's monthly salary for an entire year, and so on. one way or duster, I need to collect this information and present it in the "JOIN" view
Perhaps I was not clear enough.
-
Basically, don't use JSON. If you absolutely do not need it. Which means sending to some system that requires it, which means that the receiving system and that demand is very, very, stupid.
-
Make sure that your system does not place such demands on others.
-
Save normalized data. And in the database, and in any program elements that you write. This means (in this example) using one SELECT per table or array. That is, for download purposes, so that you can link and check them at any point in the program.
-
When you need a connection, understand that it is:
- set of results; derived relation; view
- therefore, temporarily, it exists for the duration of this element, only
and. For tables, join them in the usual way using the keys. One query joining two (or more) tables.
b. For arrays, concatenate arrays in the program, just as you would concatenate tables in a database using keys.
-
In the given example, which is the answer to some query, first understand that this is category [4], and then execute it.
Why even consider JSON ??? What does JSON have to do with this?
JSON is misunderstood and people are interested in the wow factor. This is the solution looking for the problem. If you don't have this problem, it doesn't matter. Check out these two links:
Copter - What is JSON
fooobar.com/questions/15579 / ...Now, if you get it, this is mostly for incoming channels. Never outgoing. Also, it requires parsing, deconstructing, etc. before it can be used.
Recall:
I need to collect this information and present it in the "JOIN" view
Yes. This is a pedestrian. Joins doesn't mean JSONed.
In your example, the receiver expects a smoothed view (like a spreadsheet), with all cells filled, and yes, for users with more than one phone number, their user data will be repeated in the second and subsequent set row results. For any type print,
eg. for debugging I want a flattened look. It's simple:
SELECT ... FROM Person JOIN PhoneNumber
And let's get it back. Or, if you are querying from arrays, join the Person and PhoneNumber array, which might require a temporary array for the result set, and return it.
please don't tell me you should only be getting 1 user at a time, etc. etc.
Right. If someone tells you to go back to procedural processing (i.e. line by line, in a WHILE loop) where the engine or your program has the processing set up (i.e. processes the entire set in one command), that marks them as an idiot ... Stop listening to them completely.
I already said that your option 2 is correct, Option 1 is wrong. This goes for GET or SELECT.
On the other hand, for programming languages that have no job handling capability (ie cannot print / set / check an array in a single command) or "servers" that don't provide client-side array binding, you need to write loops. one loop deep into the data hierarchy (in your example, two loops, one for Person and one for PhoneNumber for user).
- You must do this in order to parse the incoming JSON object.
- You must do this in order to load each array from the result set that is returned in your Option 2.
- You must do this to print each array from the result set that is returned in your Option 2.
Reply to comment 2
iv'e ment I need to return the result presented in a nested version (let's say I am printing a report on a page), json was just an example for such a view.
I don't think you understand the reasoning and conclusions I have given in this answer.
- For printing and display never nest. Print flattened view , rows returned from SELECT for option 2. This is what we have been doing when printing or displaying data Relationally for 31 years. Easier to read, debug, search, find, fold, stitch, maim. You can't do anything with a nested array other than look at it and say it's interesting.
I gave you directions so you can write the required code, but it doesn't work. It looks like I have to write some code for you.
code
Caveat
I would rather take your code and change it, but actually looking at your code, it is poorly written or structured, it cannot be intelligently changed. Secondly, if I use this it will be a poor learning tool. Therefore, I will have to give you fresh clean code, otherwise you will not learn the correct methods.
These code examples follow my advice, so I won't repeat it. And that goes beyond the original question.
-
Your query using your option 2. One SELECT is executed once. This is followed by one cycle. What you can "come up" if you want.
source to share
In general, your best bet is to grab the data you want in multiple trips to the database and then map the data to the relevant entities. (Option 2)
But to answer your question, I would like to ask myself what is the use case for your data. If you know for sure that you will need your person and your phone number details, I would say that the second method is your best option.
However, option one can also have a precedent when the combined data is optional. One example of this might be that in the user interface you have a table of all your people, and if the user wants to see the phone number for a specific person they should click on that person. Then it would be acceptable to "lazy-load" all phone numbers.
source to share
This is a common problem especially if you are building WebAPI, converting these sets of tables to nested arrays is a big problem.
I always suggest the second option for you (albeit a little differently) because the first is the worst possible way to do it ... One thing I've learned from my experience is never a query within a loop, i.e. a waste of DB calls, well , you know what I am trying to say.
While I don't accept everything that PerformanceDBA has said, there are two main things I need, 1. Don't use duplicate data 2. Extract only the data you want.
The only problem I see with joining a table is that we duplicate a lot of data from them, take data, for example, join personal numbers an phoneNumber, and we duplicate each person for each of their phone numbers, for two tables with a few hundred rows, that's fine, imagine we need to concatenate 5 tables with thousands of rows which are huge ... So here's my solution:
Query:
SELECT id, fullName From Person;
SELECT personId, phoneNumber FROM phoneNumbers
WHERE personId IN (SELECT id From Person);
So I get the tables in my result set, now I assign Table [0] to the Person list, and use 2 loops to put the correct phone number on the right person ...
code:
personList = ConvertToEntity<List<Person>>(dataset.Table[0]);
pnoList = ConvertToEntity<List<PhoneNumber>>(dataset.Table[1]);
foreach (person in personList) {
foreach (pno in pnoList) {
if(pno.PersonId = person.Id)
person.PhoneNumer.Add(pno)
}
}
I think the method above reduces the amount of duplication and only gets me what I wanted, if there is any downside to the above method please let me know ... and thanks for asking questions like this ...
source to share