Better data structure / db / binary format for storing hierarchical items
I have the following hierarchical objects:
Country, City, Street.
Every country has cities, every city has streets.
I wrote pseudocode for what I want to run:
handle_country_before(country)
for city in country:
handle_city_before(city)
for street in city:
handle_street_before(street)
handle_street_after(street)
handle_city_after(city)
handle_country_after(country)
I've tried the following approaches:
Approach document
(noSql):
I have saved all my data tightly:
{
country : {
# Country "x1" info corresponding with the street
},
city : {
# City "y1" info corresponding with the street
},
street : {
# street info...
}
}
{
country : {
# Country "x1" info corresponding with the street
},
city : {
# City "y1" info corresponding with the street
},
street : {
# street info...
}
}
{
country : {
# Country "x1" info corresponding with the street
},
city : {
# City "y1" info corresponding with the street
},
street : {
# street info...
}
}
{
country : {
# Country "x1" info corresponding with the street
},
city : {
# City "y2" info corresponding with the street
},
street : {
# street info...
}
}
{
country : {
# Country "x1" info corresponding with the street
},
city : {
# City "y2" info corresponding with the street
},
street : {
# street info...
}
}
Using this method, I had to use the following pseudocode:
last_country = 0
last_city = 0
last_street = 0
for element in elements:
if element.country.id != last_country_id:
if (0 != last_country) :
handle_country_after(last_country)
handle_country_before(element.country)
if element.city.id != city:
if (0 != last_city) :
handle_country_after(last_city)
handle_country_before(element.city)
if element.street.id != street:
if (0 != last_street) :
handle_country_after(last_street)
handle_country_before(element.street)
Disadvantage: I feel like this approach is a little overwhelmed, and using a flat structure is not suitable for my case, and it was also very slow and inefficient.
SQL approach:
I saved each entity in a table: Country, City, Street and repeated it with the following code:
country_cursor = query('select * from countries')
for country in country_cursor:
handle_country_before(country)
city_cursor = query('select * from cities where parent_country_ref=%s' % (country.id))
for city in city_cursor:
street_cursor = query('select * from streets where parent_city_ref=%s' % (city.id))
...
...
...
handle_country_after(country)
in the beginning, it looked like the best approach. But since I added more metadata tables and had to use JOIN statements, it got slower and I tried to use materialized view to speed things up a bit, but got the same result as when using docs.
Custom format method:
I tried to store the information in my own binary serialization format:
<number of countries>[1st-country-data]<number of citieis>[1nd-city-data]<number of streets>[1st-street-data][2nd-street-data][3rd-street...]...
Disadvantage: it couldn't scale, I couldn't update information, I couldn't get a specific city / street, each search was O (n).
I'm looking for a serialization / DB format that would look like this:
-
Ability to add / update fields for existing elements
-
Efficiency in speed, space and memory
-
C compatible (no CPP)
source to share
The fastest approach in your case is to use unnormalized data, create a flat file / table with the following information:
country city street current mayor additional information
of course the data must be sorted, of course you won't be using heavy parsing formats like json / xml, only plain text
then you will be able to use a single loop to iterate over that array
to speed up the iteration a bit, you can try to split one file into multiple fixed-width lines.
source to share