Does it make sense to "spoof" bitmap indexes?

I am planning the software to have an OLAP application at its core (it helps to analyze the measurement data) and it will have some kind of star schema for its database because the stored values โ€‹โ€‹will look from different angles (time, source, type and etc.), and queries will query aggregated data for these dimensions. Queries tend to deliver many rows (up to 100,000).

My research on this topic (see also my question here ) seems to indicate that bitmap indexes are a good way to search data the way I plan, However, I want to support multiple db engines, some of which do not offer bitmap indexes in their tables (specifically MySQL).

Now I can create and maintain my own bitmap index and use it to find row ids that point to the fact table. However, I suspect this will trump the whole purpose of the index, because the database will still look for row ids in the B-Tree. Can anyone with a deeper theoretical background or more experience tell me if I am still getting anything, such as not having to do slow JOINs on size tables?

I would also appreciate what I have to rate if the answer is not straightforward.

+1


source to share


2 answers


Some database engines that do not directly support bitmap indexes still have star optimizations that can execute this type of query without hitting the fact table. SQL Server, for example, has a feature called Index Intersection that does something similar by generating bitmaps on the fly to perform resolution. Microsoft claims the performance of this is comparable to bitmap indexes. See this post for a little talk on this topic.

I'm not sure if MySQL does this, but Postgresql certainly does. IIRC some of the variants (Greenplum I think) also support bitmap indexes directly, and there has been some talk of including it in the main DB engine. I don't remember if this has been done yet.



I think you'll find that most modern DBMS platforms offer some type of search optimization, so you probably don't need to reinvent the wheel. You may find one or two that cannot do this, but you always have the option of simply not supporting them.

+1


source


I have had good luck with raster indexes when manipulating large amounts of in-memory data using custom data structures, but they are rather awkward to implement through a third party database that does not have a good (postgresql-like) API to extend its index structures.

In general, since you will be looking in the B-Tree index, you will get nothing if my experience is some kind of reference.

So no.

If your application is OLAP by nature and you have a small number of dimensions that naturally group into ordered ranges and you really need to change the asymptotic behavior of your problem, you might consider creating a table like structure, then you can query it for any hierarchical answer with 2 ^ d operations, and you can amortize this if you are running a series of related queries.

An example in 2d with x and y coordinates, where you are interested in a sum ranging from (x1, y1) to (x2, y2).

Stored separately, you will need to sum the number of records proportional to the area.



Using summation, for each position (x, y) do not store the value of that position, but instead store the sum from (0,0) to (x, y).

You can then answer any range query by asking:

sum (x2, y2) - sum (x1, y2) - sum (x2, y1) + sum (x1, y1)

constant amount of overhead (well, logarithmic in dataset size if you have an index on x and y and are stored in SQL)

This, of course, breaks down if you have complex attributes that do not split into ranges, but can handle simple lexicographic indices, dates, etc.

+2


source







All Articles