Efficient Datomic Query for Filtering on Paginated Sets
Given that Datomic does not support pagination, I am wondering how to efficiently support a query such as:
Take the first 30 entities on :history/body, and find those whose :history/body matches some regexp.
This is how I would write the regex match on its own:
{:find [?e] :where [[?e :history/body ?body] [(re-find #"foo.*bar$" ?body)]]}
Remarks:
- Then I could (take ...) from the result, but this is not the same as matching within the first 30 entities.
- I could get all the entities, take 30, and then manually filter with re-find, but if I have 30M entities, getting all of them just to take 30 seems wildly inefficient. Also: what if I wanted to take 20M from my 30M entities and filter them through re-find?
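To make the first remark concrete, here is a minimal sketch in plain Clojure (no Datomic involved) of why the two orderings differ. The `bodies` data and the helper names `take-then-filter` and `filter-then-take` are made up for illustration only.

```clojure
;; Hypothetical in-memory stand-in for [entity-id body] pairs; not real Datomic data.
(def bodies
  [[1 "foo and bar"]
   [2 "nothing here"]
   [3 "foo then baz"]
   [4 "foo.bar"]
   [5 "food at the bar"]])

(def pattern #"foo.*bar$")

(defn matches?
  "True when the body of a [id body] pair matches the pattern."
  [[_ body]]
  (boolean (re-find pattern body)))

(defn take-then-filter
  "Match only within the first n rows (what the question asks for)."
  [n rows]
  (->> rows (take n) (filter matches?) (mapv first)))

(defn filter-then-take
  "Match across all rows, then keep the first n hits (what (take ...) on a query result gives)."
  [n rows]
  (->> rows (filter matches?) (take n) (mapv first)))

(take-then-filter 3 bodies) ; => [1]
(filter-then-take 3 bodies) ; => [1 4 5]
```

Only the first form restricts the regexp to the "page"; the second scans everything before truncating.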
The Datomic docs talk about how queries are performed locally, but I tried doing in-memory transformations on a set of 52913 entities (granted, they are fully touched) and it takes ~5 seconds. Imagine how bad it will be in the millions or tens of millions.
(Just brainstorming, here)
First of all, if you ever use a regexp, you might want to consider a fulltext index on :history/body so you can do:
[(fulltext $ :history/body "foo*bar") [[?e]]]
(Note: you cannot change :db/fulltext true/false on an attribute that already exists in the schema)
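Because of that restriction, the flag has to be set when the attribute is first installed. A sketch of what that schema entry might look like, assuming a Datomic version that accepts attribute maps without an explicit :db/id, and assuming the value type and cardinality shown here (the thread does not state them):

```clojure
;; Schema data for a fulltext-enabled string attribute.
;; :db.type/string and :db.cardinality/one are assumptions about :history/body.
[{:db/ident       :history/body
  :db/valueType   :db.type/string
  :db/cardinality :db.cardinality/one
  :db/fulltext    true}]
```

This vector would be passed as transaction data when the schema is created.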
Sorting is something you need to do outside of the query. But depending on your data, you may be able to constrain your query to a single "page" and then apply your predicate to only those entities.
For example, if we were only paging :history entities by an auto-incrementing :history/id, then we know in advance that "Page 3" is :history/id between 61 and 90.
[:find ?e
 :in $ ?min-id ?max-id
 :where
 [?e :history/id ?id]
 [(<= ?min-id ?id ?max-id)]
 [(fulltext $ :history/body "foo*bar") [[?e]]]]
Perhaps something like this:
(defn get-filtered-history-page [page-n match]
  (let [per-page 30
        ;; Page n covers ids (n-1)*per-page + 1 through n*per-page, inclusive.
        min-id (inc (* (dec page-n) per-page))
        max-id (* page-n per-page)]
    (d/q '[:find ?e
           :in $ ?min-id ?max-id ?match
           :where
           [?e :history/id ?id]
           [(<= ?min-id ?id ?max-id)]
           [(fulltext $ :history/body ?match) [[?e]]]]
         (get-db) min-id max-id match)))
But of course the problem is that the boundaries of a paginated set usually depend on an ordering you don't know in advance, so this is not very useful in general.
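When the ordering is not known up front, one option (still brainstorming) is to walk an already-ordered sequence lazily and filter only the window you need. The sketch below is plain Clojure over maps with :e and :v keys; `filtered-window` and `fake-datoms` are hypothetical names. With the peer library the ordered input might come from something like (d/datoms db :avet :history/body), which assumes the attribute is indexed; that call is not exercised here.

```clojure
(defn filtered-window
  "Skip `offset` items of an ordered seq, look at the next `limit` items,
   and return the :e of those whose :v matches `re`."
  [ordered-datoms offset limit re]
  (->> ordered-datoms
       (drop offset)
       (take limit)
       (filter #(re-find re (:v %)))
       (mapv :e)))

;; Stand-in for datoms sorted by value; real datoms are assumed to expose :e and :v.
(def fake-datoms
  [{:e 10 :v "a foo bar"}
   {:e 11 :v "b nothing"}
   {:e 12 :v "c foo bar"}
   {:e 13 :v "d foo bar"}])

(filtered-window fake-datoms 0 2 #"foo.*bar$") ; => [10]
(filtered-window fake-datoms 2 2 #"foo.*bar$") ; => [12 13]
```

Since the sequence is lazy, only offset + limit items are realized, though a deep offset still has to walk past everything it skips.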