What tools in MongoDB / NoSQL allow faster aggregation (MapReduce) compared to MySQL?


I have the following problem. I have a table with a huge number of rows that I need to search, and then I need to group the search results by many parameters. Let's say the table is

id, big_text, price, country, field1, field2, ..., fieldX


And we run a query like this

[use FULLTEXT index to MATCH() big_text] AND 
[use some random clauses that anyway render indexes useless, 
like: country IN (1,2,65,69) and price<100]


We display this as search results, and then we need to take these search results and group them by several fields to build the search filters.

(results) GROUP BY field1
(results) GROUP BY field2
(results) GROUP BY field3
(results) GROUP BY field4


This is a simplified example of what I need. The actual problem is even harder; for example, sometimes the first query for results also has its own GROUP BY. An example of such functionality is this site: http://www.indeed.com/q-sales-jobs.html (search results plus filters on the left).

I have done and continue to do in-depth research on how MySQL works, and at the moment I don't see how to do this in MySQL at all. Roughly speaking, a MySQL table is just a bunch of rows lying on the hard disk, and indexes are tiny versions of those tables, sorted by the index fields and pointing to the actual rows. That is a super simplification, of course, but the bottom line is that I don't see how this can be fixed, i.e. how to use more than one index and still do the GROUP BYs quickly (by the time the query reaches the GROUP BY, the index is absolutely useless because of range lookups and other things). I know MySQL (and similar databases) have various useful features like index merges, index scans, etc., but that just doesn't fit here: the queries above will run forever.

I was told that the problem can be solved by NoSQL, which uses some fundamentally new ways of storing and processing data, including aggregation tasks. What I want is a short schematic explanation of how this is done. I just want a quick look at it so that I can see that it really is possible, because at the moment I can't figure out how it can be done at all. Data is still data and must be placed in memory, and indexes are still indexes with all their constraints. If this is really possible, I will start exploring NoSQL in detail.

PS. Please don't tell me to go and read a great book on NoSQL. I've already done this for MySQL, only to find out that it doesn't apply in my case :) So I want some preliminary understanding of the technology before getting the big book.




1 answer

Basically there are four types of "NoSQL", but three of the four are similar enough to SQL databases that SQL syntax could be written on top of them (including MongoDB and its crazy query syntax, and I say that even though JavaScript is one of my favorite languages).

Key-value store

These are simple NoSQL systems like Redis, which are basically really fancy hash tables. You have a value that you want to retrieve later, so you assign a key to it and put it in the database. You can only query one object at a time, and only by a single key.
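
To make the limitation concrete, here is a minimal sketch that uses a plain Python dict as a stand-in for a key-value store. The key names and the put/get helpers are made up for illustration and are not any real product's API.

```python
# Toy key-value store: a plain dict standing in for something like Redis.
store = {}

def put(key, value):
    store[key] = value

def get(key):
    # Lookup is by exact key only; there is no way to ask
    # "which values have price < 100" without scanning everything.
    return store.get(key)

put("job:1", {"country": 1, "price": 50})
put("job:2", {"country": 2, "price": 150})

one = get("job:1")

# Filtering by a field means walking every value by hand.
cheap = [k for k, v in store.items() if v["price"] < 100]
```

The last line is the whole problem: any filter that is not "give me this key" turns into a full scan done by the client.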

You definitely don't want this.

Document store

This is one step up from the key-value store, and it is what most people mean when they talk about NoSQL (MongoDB, for example).

They are essentially hierarchical objects (for example XML files, JSON files, and any other kind of tree structure in computer science), but the values of various nodes in the tree can be indexed. They are "faster" than traditional SQL databases on lookups because they sacrifice performance on joins.
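
As a rough sketch of what "indexing a node in the tree" means, the following builds a lookup table over a nested field. The documents, the dotted-path convention, and the helper names are assumptions for illustration; a real document database builds and maintains such indexes internally.

```python
from collections import defaultdict

docs = {
    1: {"title": "Sales manager", "location": {"country": 1, "city": "Austin"}},
    2: {"title": "Sales clerk", "location": {"country": 65, "city": "Lyon"}},
    3: {"title": "Engineer", "location": {"country": 1, "city": "Boston"}},
}

def value_at(doc, path):
    """Follow a dotted path such as 'location.country' into a nested dict."""
    node = doc
    for part in path.split("."):
        node = node[part]
    return node

def build_index(docs, path):
    """Map each value found at `path` to the ids of the documents holding it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        index[value_at(doc, path)].add(doc_id)
    return index

country_index = build_index(docs, "location.country")
ids_in_country_1 = sorted(country_index[1])
```

Note that this is still just an index: it answers an equality lookup on one field quickly, with the same limits an index has in MySQL.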

If you are querying a single MySQL table with tons of columns (assuming it is not a view / virtual table), and you have indexed it properly for your query (this could be your real problem here), document databases such as MongoDB won't give you any Big-O advantage over MySQL, so you probably don't want to migrate for that reason alone.

Column store

These are the most similar to SQL databases. In fact, some (like Sybase) implement SQL syntax while others (Cassandra) do not. They store data in columns rather than rows, so inserts and updates are expensive, but most queries are cheap because every column is implicitly indexed.

But if your query cannot use an index, you are in no better shape with a column store than with a regular SQL database.
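
A minimal sketch of the row-versus-column trade-off, using plain Python lists. The field names come from the question; the data and the insert helper are made up, and real column stores add compression and sorting on top of this.

```python
rows = [
    {"id": 1, "price": 50, "country": 1},
    {"id": 2, "price": 150, "country": 2},
    {"id": 3, "price": 80, "country": 1},
]

# Row layout: one record after another.
# Column layout: one list per field, all aligned by position.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Aggregating one field touches only that field's list.
total_price = sum(columns["price"])

# A filter on one column yields positions, which are then used to read others.
cheap_positions = [i for i, p in enumerate(columns["price"]) if p < 100]
cheap_ids = [columns["id"][i] for i in cheap_positions]

def insert(record):
    # An insert has to touch every column list, which is the expensive part.
    for name, values in columns.items():
        values.append(record[name])

insert({"id": 4, "price": 20, "country": 65})
```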

Graph store

Graph databases extend beyond SQL. Anything that can be represented by graph theory, including key-value stores, document databases, and SQL databases, can be represented by a graph database such as neo4j.

To do this, graph databases make joins as cheap as possible (as opposed to document databases), but they have to, because even a simple "row" query requires a lot of joins.

A table-scan type of query is likely to be slower than in a standard SQL database because of all the extra joins needed to retrieve the data (which is stored in a disjointed fashion).
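
To illustrate why even one "row" costs several joins, here is a toy adjacency map. The node names, relation names, and helpers are invented for this sketch and do not reflect how neo4j stores or queries data.

```python
# Each field of a "row" is its own node, linked to the record node by an edge.
edges = {
    "job:1": {"HAS_PRICE": "price:50", "IN_COUNTRY": "country:1"},
    "country:1": {"HAS_NAME": "name:US"},
}

def follow(node, *relations):
    """Walk one edge per relation name, starting from `node`."""
    for relation in relations:
        node = edges[node][relation]
    return node

def load_row(record):
    # Rebuilding one flat row costs at least one traversal (a "join") per field.
    return {
        "price": follow(record, "HAS_PRICE"),
        "country_name": follow(record, "IN_COUNTRY", "HAS_NAME"),
    }

row = load_row("job:1")
```

Multiply those traversals by every record in a scan and the extra cost compared to reading flat rows becomes clear.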

So what's the solution?

You may have noticed that I have not exactly answered your question. I'm not saying you're out of luck, but the real problem is how the query is being performed.

  1. Are you absolutely sure you can't index your data better? There are things like multi-column keys that can improve the performance of your particular query. Microsoft SQL Server has a full-text key type that would apply to your example, and PostgreSQL can emulate it.
  2. The real advantage of most NoSQL databases over SQL databases is Map-Reduce; specifically, the integration of a full Turing-complete language that runs at high speed and in which the query constraints can be written. The query function can be written to "fail" quickly on records that don't match, or to return success quickly for records that meet "priority" requirements, while doing the same in SQL is a little more difficult.
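
The second point can be sketched in plain Python. This is only an illustration of the map and reduce phases with a quick-fail map function; the records, the substring test standing in for full-text matching, and the function names are assumptions, not MongoDB's actual API.

```python
from collections import defaultdict

records = [
    {"id": 1, "big_text": "sales manager", "price": 50, "country": 1, "field1": "full-time"},
    {"id": 2, "big_text": "sales clerk", "price": 150, "country": 2, "field1": "part-time"},
    {"id": 3, "big_text": "senior sales rep", "price": 80, "country": 65, "field1": "full-time"},
    {"id": 4, "big_text": "engineer", "price": 40, "country": 1, "field1": "full-time"},
]

ALLOWED_COUNTRIES = {1, 2, 65, 69}

def map_record(record):
    """Emit (key, 1) pairs for a matching record, or nothing at all."""
    # Cheapest checks first, so non-matching records "fail" quickly.
    if record["country"] not in ALLOWED_COUNTRIES:
        return
    if record["price"] >= 100:
        return
    if "sales" not in record["big_text"]:
        return
    yield record["field1"], 1

def reduce_counts(pairs):
    """Sum the emitted values per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

pairs = (pair for record in records for pair in map_record(record))
facet_counts = reduce_counts(pairs)
```

The map step still visits every record, so this is a scan. What a real engine adds is running that scan in parallel across many machines, not a cleverer index.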

Finally, however, the specific problem you are trying to solve, text search with optional filtering parameters, is more commonly known as a search engine, and there are very specialized engines built to address exactly this problem. I would recommend Apache Solr for these queries.

Basically, dump the text field, the "filter" fields, and the primary key of the table into Solr, let it index the text field, and run the queries through it. If you need the full record after that, query the SQL database for the specific primary key you got back from Solr. It uses more memory and requires a second process, but it will probably work best for you here.
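
The flow described above can be sketched as follows. To stay self-contained, this uses a toy in-memory inverted index as a stand-in for Solr and sqlite3 as a stand-in for MySQL; the table name `jobs`, the `search` helper, and the sample data are all hypothetical, and none of this is Solr's real API.

```python
import sqlite3
from collections import Counter, defaultdict

rows = [
    (1, "sales manager wanted", 50, 1, "full-time"),
    (2, "sales clerk", 150, 2, "part-time"),
    (3, "senior sales rep", 80, 65, "full-time"),
]

# The SQL database keeps the full records.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, big_text TEXT, "
    "price INTEGER, country INTEGER, field1 TEXT)"
)
db.executemany("INSERT INTO jobs VALUES (?, ?, ?, ?, ?)", rows)

# Stand-in for the search engine: word -> ids, plus the filter fields per id.
inverted = defaultdict(set)
filters = {}
for row_id, text, price, country, field1 in rows:
    for word in text.split():
        inverted[word].add(row_id)
    filters[row_id] = {"price": price, "country": country, "field1": field1}

def search(word, countries, max_price):
    """Return matching ids and the facet counts for field1."""
    ids = sorted(
        i for i in inverted.get(word, set())
        if filters[i]["country"] in countries and filters[i]["price"] < max_price
    )
    facets = Counter(filters[i]["field1"] for i in ids)
    return ids, dict(facets)

ids, facets = search("sales", {1, 2, 65, 69}, 100)

# Only now hit SQL, by primary key, for the full records to display.
placeholders = ",".join("?" * len(ids))
full_rows = db.execute(
    f"SELECT id, big_text FROM jobs WHERE id IN ({placeholders}) ORDER BY id", ids
).fetchall()
```

The search side returns both the result ids and the filter counts in one call, and the SQL database is only asked for rows by primary key, which is the one thing it is guaranteed to do fast.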

Why all this text to get to this answer?

Because the title of your question has nothing to do with the content of your question, I answered both questions. :)


