Database selection and design for a Google Sheets alternative

We are building a spreadsheet web application for our clients. They can upload any CSV (20 MB+) and then perform operations (listed below) on the data. The data is unstructured.

Over the past few months, we have experimented with several architectures:

  • First, we stored the entire grid as a 2D array, e.g. [{a: 'b', x: 'y'}, {a: 'e'}], inside a PostgreSQL JSON column. But then every cell update required re-storing the entire CSV in the database. This made the application extremely slow.

  • Then we moved to MongoDB. This has improved performance, but we still run into performance and scalability issues. Below is our structure.

Our current database design:
PostgreSQL structure:
Table - datasets
id, name, description, etc.

Mongo structure (one document per CSV row):
Row 1
_id, column1: value1, column2: value2, _data_set_id = datasets.id
Row 2
_id, column1: value1, column2: value2, _data_set_id = datasets.id
and so on ...
We also have a Mongo index on the _data_set_id key to support faster queries of the following type:

db.coll.find({_data_set_id: xyz})
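
For reference, such an index could be created in the mongo shell like this (coll is the collection name from the query above):

    // Index the dataset id so per-dataset lookups avoid a full collection scan.
    db.coll.createIndex({ _data_set_id: 1 })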


We also use Mongo hosting from a third party that takes care of sharding, backups, uptime, etc. (we have no devops).

Data operations are of 2 types:

  • Row operations, such as adding or deleting a row.
  • Column operations, such as adding or removing a column. Most data operations are column-level operations, i.e. they update a single column across all of the rows (see the sketch below).
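
A minimal mongo shell sketch of what those column operations could look like against the row-per-document structure above (coll and xyz are the placeholder names from the question; newColumn and oldColumn are illustrative):

    // Add a column: set a default value on every row document of one dataset.
    db.coll.updateMany({ _data_set_id: xyz }, { $set: { newColumn: null } })

    // Remove a column: unset the field on every row document of one dataset.
    db.coll.updateMany({ _data_set_id: xyz }, { $unset: { oldColumn: "" } })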

We've optimized to the point where Mongo works reasonably well with datasets of fewer than 10k rows, but beyond that we cannot scale. We currently have ~25 GB of data in Mongo and will reach 50 GB over the next few weeks. Our current product is a prototype, and we are now revising our database architecture to scale better.

The most important requirements for our database are:

  • Fast reads and writes.
  • Column queries and updates.
  • Updating the value of a single cell (e.g. row x, column y); see the sketch below.
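
Assuming one document per row as above, the single-cell update would be roughly the following (rowId, columnY, and newValue are illustrative placeholders):

    // Update one cell: match the row document by _id and set one field.
    db.coll.updateOne({ _id: rowId }, { $set: { columnY: newValue } })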

So,

  • Is Mongo the right database for this use case?
  • If so, what else (besides indexing, sharding) can we do to scale Mongo?

PS
We understand that we can only get 2 of the 3 CAP guarantees, and we have been through the usual Cassandra vs. MongoDB vs. CouchDB vs. Redis comparisons. We also evaluated CouchDB (master-master replication, MVCC, etc., but no dynamic querying), Cassandra (querying unstructured data is not possible), and HBase (a column store) as alternatives.



1 answer


I strongly suspect that your database is not actually sharded. If you are paying for shards, you are probably not getting the benefit of them.

You could then shard on that index, which should save you time: the data for each dataset would be stored on only one or two shard servers, which could then respond faster to queries on your _data_set_id.

Try running:

sh.status()

This will show how well your database is distributed. It is probably all on one shard.

Read the following carefully before choosing a shard key. It is very difficult to change the shard key later without rebuilding your entire collection!

http://docs.mongodb.org/manual/tutorial/choose-a-shard-key/
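
Once you have chosen a key, enabling sharding looks roughly like this in the mongo shell (mydb and coll are placeholder names; sharding on _data_set_id alone is only a sketch of the suggestion above, and the linked docs discuss cardinality concerns to check first):

    // Sketch: shard the row collection on the dataset id.
    sh.enableSharding("mydb")
    sh.shardCollection("mydb.coll", { _data_set_id: 1 })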
