Improve performance on Properties Step
Claire <bobo...@...>
Hello, I am trying to optimize a simple recommendation query and am somewhat stuck.

*Our Setup*
- JanusGraph 0.5.1
- Storage backend: ScyllaDB 3.2.4

*The Graph*
Our graph contains millions of vertices and edges. In the relevant part we have:
- Vertex: user (with several properties)
- Vertex: query (with several properties, one being "title")

A user is linked to a query by a "searched" edge. Each user can have multiple searches, and a user may have different searches with the same title (in which case other properties differ).

*The Scenario*
I know that a user searched for something, let's say "Snowboard", and I want to present him with related search terms by doing an "other users searching for Snowboard also searched for" query. Originally I started with the following query:

g.V().has('query', 'title', 'snowboard').
  in('searched').out('searched').
  has('query', 'title', neq('snowboard')).has('title').
  dedup().as("related").select("related").by('title').
  groupCount().
  order(Scope.local).by(Column.values, Order.desc).
  profile()

But the query time was beyond acceptable, so I decided to do the grouping and counting in the code (Java) rather than in the Gremlin query (a timeLimit() did not bring the hoped-for improvement and gave really bad results). The simplified Gremlin query now looks as follows:

g.V().has('search', 'title', within('snowboard', 'Snowboard')).
  in('searched').dedup().
  out('searched').values('title')

Doing a profile on that query, I see that the "values" step costs a lot of time. I already tried the query.fast-property option, but that did not help. The profile step returns the following:

==>Traversal Metrics
Step                                             Count  Traversers   Time (ms)  % Dur
=====================================================================================
JanusGraphStep([],[~label.eq(search), sear...    30246       30246     290.388   3.72
    \_condition=(~label = search AND (title = snowboard OR title = Snowboard))
    \_orders=[]
    \_isFitted=true
    \_isOrdered=true
    \_query=multiKSQ[2]@4000
    \_index=bySearchTitle
  optimization                                                           0.030
  optimization                                                          14.736
  backend-query                                                          0.000
    \_query=bySearchTitle:multiKSQ[2]@4000
    \_limit=4000
JanusGraphVertexStep(IN,[searched],...           30245       30245    1811.447  23.22
    \_condition=type[searched]
    \_orders=[]
    \_isFitted=true
    \_isOrdered=true
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@a4661abd
    \_multi=true
    \_vertices=30246
  optimization                                                           4.640
  backend-query                                   30245               1592.684
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@a4661abd
DedupGlobalStep                                  10296       10296      11.328   0.15
JanusGraphVertexStep(OUT,[searched]...           79241       79241    1293.578  16.58
    \_condition=type[searched]
    \_orders=[]
    \_isFitted=true
    \_isOrdered=true
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@a46616dd
    \_multi=true
    \_vertices=10296
  optimization                                                           0.174
  backend-query                                   79241                557.447
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@a46616dd
NoOpBarrierStep(2500)                            79241       79241      52.760   0.68
JanusGraphPropertiesStep([title],value)          47709       47709    4322.742  55.42
    \_condition=type[title]
    \_orders=[]
    \_isFitted=true
    \_isOrdered=true
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@8121f1dd
    \_multi=true
    \_vertices=79241
  optimization                                                           2.133
  backend-query                                   47709               3969.057
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@8121f1dd
NoOpBarrierStep(2500)                            47709        2293      18.347   0.24
                                        >TOTAL       -           -    7800.592      -

What am I missing? Where is there room for improvement? Gladly looking forward to any hint.

Regards
Claire
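Since the grouping and counting is now done on the client side, a minimal Java sketch of that part may be useful for comparison. The class and method names here are hypothetical, and it assumes the titles have already been streamed back from the values('title') traversal; it is the client-side equivalent of groupCount().order(Scope.local).by(Column.values, Order.desc) with a top-N cutoff:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class RelatedSearches {

    // Hypothetical helper: given the raw titles returned by the
    // values('title') traversal, count occurrences and return the
    // top n titles in descending order of frequency.
    static LinkedHashMap<String, Long> topRelated(List<String> titles, int n) {
        return titles.stream()
                .collect(Collectors.groupingBy(t -> t, Collectors.counting()))
                .entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .collect(Collectors.toMap(
                        Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        List<String> titles = List.of("boots", "helmet", "boots",
                                      "goggles", "boots", "helmet");
        // prints {boots=3, helmet=2}
        System.out.println(topRelated(titles, 2));
    }
}
```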
Stephen Mallette <spmal...@...>
I'm not sure I have a complete answer for you, but your original traversal could have been improved/simplified a bit:

g.V().has('query', 'title', 'snowboard').
  in('searched').
  out('searched').has('query', 'title', neq('snowboard')).
  dedup().
  groupCount().
    by('title').
  order(Scope.local).
    by(Column.values, Order.desc).
  limit(Scope.local, 10)

I've left it there, but I'm not sure I understand the use of dedup() in this case. Won't that make all your counts go to one? I got rid of as().select(), which should turn off path tracking and reduce the resources required to run the traversal. I also tacked on a limit(Scope.local, 10), which was arbitrary, but if you return fewer of those results you will have far less serialization cost, which can make a big difference in performance.

You might also try a full barrier() to take greater advantage of bulking, assuming dedup() should have gone after in('searched') as shown in your second traversal:

g.V().has('query', 'title', 'snowboard').
  in('searched').
  barrier().
  out('searched').has('query', 'title', neq('snowboard')).
  groupCount().
    by('title').
  order(Scope.local).
    by(Column.values, Order.desc).
  limit(Scope.local, 10)

You don't show your whole profile(), but it seems you are just gathering a lot of data. You may need to find ways to better limit the paths you have to traverse in order to get a reasonable answer. For example, perhaps your recommendation can be based on the most recent data rather than all of it. Could you store a timestamp on the "searched" edges and then do:

g.V().has('query', 'title', 'snowboard').
  inE('searched').has('timestamp', gt(oneWeekAgo)).outV().
  barrier().
  outE('searched').has('timestamp', gt(oneMonthAgo)).inV().
  has('query', 'title', neq('snowboard')).
  groupCount().
    by('title').
  order(Scope.local).
    by(Column.values, Order.desc).
  limit(Scope.local, 10)

With a vertex-centric index on that timestamp you could probably get some fast results that way. Perhaps you could even write a more complex limit that is both timestamp- and limit()-oriented, depending on how your data is structured.

On Fri, May 8, 2020 at 9:19 PM Claire <bobo...@...> wrote:
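For reference, the vertex-centric (edge) index on that timestamp could be built with the JanusGraph management API along these lines. This is only a sketch against a hypothetical already-open `graph` instance: the index name "searchedByTimestamp" is an assumption, the "timestamp" key may already exist in your schema, and a live cluster would also need the usual index-enable/reindex steps for pre-existing data:

```java
import org.apache.tinkerpop.gremlin.process.traversal.Order;
import org.apache.tinkerpop.gremlin.structure.Direction;
import org.janusgraph.core.EdgeLabel;
import org.janusgraph.core.PropertyKey;
import org.janusgraph.core.schema.JanusGraphManagement;

JanusGraphManagement mgmt = graph.openManagement();

// Reuse the 'timestamp' key if it already exists, otherwise create it.
PropertyKey timestamp = mgmt.containsPropertyKey("timestamp")
        ? mgmt.getPropertyKey("timestamp")
        : mgmt.makePropertyKey("timestamp").dataType(Long.class).make();

EdgeLabel searched = mgmt.getEdgeLabel("searched");

// Sort 'searched' edges by timestamp (descending) on both endpoints, so
// inE/outE steps with a timestamp filter can be answered from the index.
mgmt.buildEdgeIndex(searched, "searchedByTimestamp",
        Direction.BOTH, Order.desc, timestamp);

mgmt.commit();
```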