How to return maximum points for another node properties

Question

How to return maximum points for another node properties

I need to calculate how many times a piece of composer's music has been performed in a decade and then only return the one piece with the most performances in a decade.

This cypher does everything but the filter but the highest scores in a decade.

match (c:Composer)-[:CREATED_BY]-(w:Work)<-[*..2]-(prog:Program) 
WHERE c.lastname =~ '(?i).*stravinsky.*' 
WITH w.title AS Title, prog.title AS Program, LEFT(prog.date, 3)+"0" AS Decade
RETURN Decade, Title, COUNT(Program) AS Total
ORDER BY Decade, Total DESC, Title

I've been banging my head for hours about variations but can't seem to find a solution.

+3

neo4j cypher

roymond 08 jan. 15 at 22:22

source to share

1 answer

subvertallchris · Accepted Answer · 2015-01-09T00:52:27+0000

This seems to return what you are looking for, but it can probably be improved.

MATCH (c:Composer)-[r:CREATED_BY]-(w:Work)<-[*..2]-(prog:Program)
WHERE c.lastname =~ '(?i).*stravinsky.*'
WITH LEFT(prog.date, 3)+"0" AS Decade, w.title AS Title, COUNT(prog.title) AS Total
ORDER BY Decade, Total DESC, Title
RETURN Decade, HEAD(COLLECT(Total)) AS Total, HEAD(COLLECT(Title)) AS Title
ORDER BY Decade

It only returns one result per decade, but it doesn't account for connections, so it's a bit incomplete. I'll think about how to do it and edit it if I come up with something good.

I used this line from http://graphgen.neoxygen.io to generate sampled data locally.

(c:Composer {firstname: firstName, lastname: lastName} *10)<-[:CREATED_BY *n..1]-(w:Work {title: progLanguage} *75)<-[:PERFORMED *n..1]-(prog:Program {title: catchPhrase, date: date} *400)

USER PICTURE

This is a raw version of the above query and will show multiple Jobs when there are links.

MATCH (c:Composer)-[r:CREATED_BY]-(w:Work)<-[*..2]-(prog:Program)
WHERE c.lastname =~ '(?i).*stravinsky.*'
WITH LEFT(prog.date, 3)+"0" AS Decade, w.title AS Title, COUNT(prog.title) AS Total
ORDER BY Decade, Total DESC, Title
WITH Decade, Title, Total, HEAD(COLLECT(Total)) AS PerformedTotal
WITH Decade, [title in COLLECT(Title) WHERE Total = PerformedTotal] as Title, Total, PerformedTotal
ORDER BY PerformedTotal DESC
return Decade, HEAD(COLLECT(PerformedTotal)) as Totals, HEAD(COLLECT(Title)) as Titles
ORDER BY Decade

It seems to me that it should be refactored, but I cannot simplify it.

I have a ton of notes on the process of writing this answer. Even if that's not exactly what you're looking for, TL; DR is of interest here.

Get rid of this fuzzy search if you can, find a way to index this property, or use an external index like Elasticsearch. You get massive performance gains with this regex.
There's a bug in Neo4j 2.2.M02 that makes the request fail if <-[*..2]-

changed to pretty much anything else. If you set Cypher Query Scheduler to Cypher 2.1

, performance will be better if the first line is MATCH (c:Composer)-[r:CREATED_BY]-(w)<-[r2:REL_TYPE]-(prog)

. Use only the first node label to help WHERE

get your job done. Always always use node and rel identifiers.
Cypher has amazing behavior. This all [title in COLLECT(Title) WHERE Total = PerformedTotal]

uses variables later on the same line. If I pull them out, it works.

The more surprising behavior was that it failed to refactor what I expected. I would expect to do this, but cannot:

MATCH (c:Composer)-[r:CREATED_BY]-(w:Work)<-[*..2]-(prog:Program)
WHERE c.lastname =~ '(?i).*stravinsky.*'
WITH LEFT(prog.date, 3)+"0" AS Decade, w.title AS Title, COUNT(prog.title) AS Total
ORDER BY Decade, Total DESC, Title
WITH Decade, [title in COLLECT(Title) WHERE Total = HEAD(COLLECT(Total))] as Title, Total, HEAD(COLLECT(Total)) AS PerformedTotal
ORDER BY PerformedTotal DESC
return Decade, HEAD(COLLECT(PerformedTotal)) as Totals, HEAD(COLLECT(Title)) as Titles
ORDER BY Decade

ANOTHER EDIT: AS POSSIBLE SPEED THIS UP

If you have multiple potential paths that your request might need but want to avoid [*..2]

, you can speed up the process a bit by specifying the paths that should be taken when trying to find a match. Whether or not it is actually faster depends on how many branches it can take, it will be a dead end. If you can only give him two or three paths so that he can completely ignore half a dozen other relationships, that will probably make up for the filtering and whatever happens later. Of course, if the paths are difficult enough, it could be more trouble than it is worth.

You have to put this in the neo4j wrapper and add PROFILE

, add a semicolon to the end and look at the number of database hits to determine what works best for your dataset.

MATCH (c:Composer)-[r:CREATED_BY]-(w)
WHERE c.lastname =~ '(?i).*Denesik.*'
OPTIONAL MATCH (w)-[r2:CONNECTED_TO]-(this_node)<-[r3:ONE_MORE]-(prog1)
OPTIONAL MATCH (w)<-[r4:PERFORMED]-(prog2)
OPTIONAL MATCH (w)-[r5:THIS_REL]->(this_node)-[r6:AGAIN_WITH_THE_RELS]->(prog3)
WITH FILTER(program in [prog1, prog2, prog3] WHERE program IS NOT NULL) AS progarray, w.title AS Title
UNWIND(progarray) as prog
WITH LEFT(prog.date, 3)+"0" AS Decade, COUNT(prog.title) AS Total, Title
ORDER BY Decade, Total DESC, Title
WITH Decade, Title, Total, HEAD(COLLECT(Total)) AS PerformedTotal
WITH Decade, [title in COLLECT(Title) WHERE Total = PerformedTotal] as Title, Total, PerformedTotal
ORDER BY PerformedTotal DESC
return Decade, HEAD(COLLECT(PerformedTotal)) as Totals, HEAD(COLLECT(Title)) as Titles
ORDER BY Decade;

The hardest part of this is that if we reuse a variable prog

it will drag the results from each OPTIONAL MATCH to the next, essentially trying to filter, and we won't end up with completely separate paths. (Why can we reuse w

now outside of me ...) It's okay. We take the results, put them in an array, filter out empty results, and then expand back to a single variable containing all valid results. After that, we continue as usual.

In my tests, it looks like it could be significantly faster with the correct dataset. YMMV.

How to return maximum points for another node properties

More articles: