Efficient import of Cypher statements
I am trying to export the database to a file and import it again, without copying the database files or stopping the database. I realize there are many great (and effective) neo4j-shell-tools out there; however, the export-* and import-* commands require the files to be on the remote client, whereas in my case they are located locally on the server.
The following post explains alternative methods for exporting / importing data; however, the import side is not very efficient.
The following examples use a subset of our datastore of 10,000 nodes with different labels / properties for testing purposes. First, the database was exported via
> time cypher-shell 'CALL apoc.export.cypher.all("/tmp/graph.db.cypher", {batchSize: 1000, format: "cypher-shell", separateFiles: true})'
real 0m1.703s
and then wiped it,
neo4j stop
rm -rf /var/log/neo4j/data/databases/graph.db
neo4j start
before re-import,
time cypher-shell < /tmp/graph.db.nodes.cypher
real 0m39.105s
which doesn't look too impressive. I have also tried the Python route, exporting the Cypher in plain format:
CALL apoc.export.cypher.all("/tmp/graph.db.cypher", {format: "plain", separateFiles: true})
The next snippet runs in ~ 30 s (using a batch size of 1000):
from itertools import izip_longest

from neo4j.v1 import GraphDatabase

with GraphDatabase.driver('bolt://localhost:7687') as driver:
    with driver.session() as session:
        with open('/tmp/graph.db.nodes.cypher') as file:
            for chunk in izip_longest(*[file] * 1000):
                with session.begin_transaction() as tx:
                    for line in chunk:
                        if line:
                            tx.run(line)
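The batching trick above groups the file into chunks of 1000 lines by zipping 1000 references to the same iterator; the last chunk is padded with None, which is why the inner loop checks the line before running it. A minimal sketch of the idiom, using zip_longest (the Python 3 name for Python 2's izip_longest) and a small batch size for illustration:

```python
from itertools import zip_longest

# Ten "lines" batched into chunks of 4: all 4 zip slots consume the
# same iterator, so each chunk advances it by 4 items.
lines = iter(['line%d\n' % i for i in range(10)])
chunks = list(zip_longest(*[lines] * 4))

# The final chunk is padded with None up to the batch size.
print(chunks[-1])  # ('line8\n', 'line9\n', None, None)
```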
I understand that parameterized Cypher queries perform better. I used some kludgy logic (note that string replacement is not sufficient in all cases) to extract labels and properties from the Cypher statements (this runs in ~ 8 s):
from itertools import izip_longest
import json
import re

from neo4j.v1 import GraphDatabase

def decode(statement):
    m = re.match('CREATE \((.*?)\s(.*?)\);', statement)
    labels = m.group(1).replace('`', '').split(':')[1:]
    properties = json.loads(m.group(2).replace('`', '"'))  # kludgy
    return labels, properties

with GraphDatabase.driver('bolt://localhost:7687') as driver:
    with driver.session() as session:
        with open('/tmp/graph.db.nodes.cypher') as file:
            for chunk in izip_longest(*[file] * 1000):
                with session.begin_transaction() as tx:
                    for line in chunk:
                        if line:
                            labels, properties = decode(line)
                            tx.run(
                                'CALL apoc.create.node({labels}, {properties})',
                                labels=labels,
                                properties=properties,
                            )
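To show what decode does, here is a standalone sketch run on a hypothetical statement in the shape that apoc.export.cypher.all emits (backtick-quoted labels and property keys); the sample statement and its values are made up for illustration:

```python
import json
import re

def decode(statement):
    # Split "CREATE (:`Label` {...});" into the label part and the
    # property map, then coerce the map into JSON by swapping the
    # backticks around keys for double quotes (the kludgy part).
    m = re.match(r'CREATE \((.*?)\s(.*?)\);', statement)
    labels = m.group(1).replace('`', '').split(':')[1:]
    properties = json.loads(m.group(2).replace('`', '"'))
    return labels, properties

# Hypothetical exported statement:
stmt = 'CREATE (:`Person`:`User` {`name`:"Alice", `age`:42});'
print(decode(stmt))  # (['Person', 'User'], {'name': 'Alice', 'age': 42})
```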
Using UNWIND instead of per-statement transactions further improves performance, to ~ 5 s:
with GraphDatabase.driver('bolt://localhost:7687') as driver:
    with driver.session() as session:
        with open('/tmp/graph.db.nodes.cypher') as file:
            for chunk in izip_longest(*[file] * 1000):
                rows = []
                for line in chunk:
                    if line:
                        labels, properties = decode(line)
                        rows.append({'labels': labels, 'properties': properties})
                session.run(
                    """
                    UNWIND {rows} AS row
                    WITH row.labels AS labels, row.properties AS properties
                    CALL apoc.create.node(labels, properties) YIELD node
                    RETURN true
                    """,
                    rows=rows,
                )
Is this the right approach to speeding up Cypher imports? Ideally, I would not want this kind of manipulation in Python, because it is potentially error-prone, and I would need to do something similar for relationships.
Also, does anyone know the correct approach for decoding Cypher to extract labels and properties? My method fails if a property value contains a backtick (`). Note: I don't want to go the GraphML route, as I also need the schema, which is exported via the Cypher format. Although it does feel strange to parse Cypher this way.
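To illustrate the failure mode: the blanket backtick replacement in decode also rewrites backticks inside string values to double quotes, so the property map is no longer valid JSON. A sketch with a made-up statement:

```python
import json
import re

def decode(statement):
    m = re.match(r'CREATE \((.*?)\s(.*?)\);', statement)
    labels = m.group(1).replace('`', '').split(':')[1:]
    # This replace also hits backticks inside string values,
    # producing invalid JSON.
    properties = json.loads(m.group(2).replace('`', '"'))
    return labels, properties

stmt = 'CREATE (:`Note` {`text`:"uses a ` in the value"});'
try:
    decode(stmt)
except ValueError as e:  # json.JSONDecodeError is a ValueError subclass
    print('decode failed:', e)
```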
Finally, for reference, the import-binary shell command takes ~ 3 s to do the same import:
> neo4j-shell -c "import-binary -b 1000 -i /tmp/graph.db.bin"
...
finish after 10000 row(s) 10. 100%: nodes = 10000 rels = 0 properties = 106289 time 3 ms total 3221 ms