JanusGraph meetup topic discussion - graph OLAP & algorithms
Ted Wilmes
Hello,
I'm working on planning another JanusGraph community meetup and wanted to gauge community interest in doing an in-depth focus on tackling OLAP/graph algorithmic work with JanusGraph. This has been covered briefly in previous meetups, but I think it is worthy of more focus due to the challenges folks face getting JanusGraph/Spark up and running and working performantly. I'm particularly interested in hearing whether others have had success with this route in production and, if not, whether they've employed other techniques to serve their analytics needs (shortest path, clustering, centrality, data science workflows, etc.). In one case on our side, we had good success deploying a separate, custom C++ in-memory graph alongside JG that serves shortest path requests with much lower latency than JG and Spark could. Please reach out on this thread or directly to me if you're interested in presenting on this topic or taking part in a panel discussion. I'm currently targeting the March timeframe for the meetup.
Thanks,
Ted
hadoopmarc@...
Hi Ted,
Most probably you recognize my nickname from the answers I provided on this user forum on OLAP attempts with JanusGraph. I also co-authored:
https://tinkerpop.apache.org/docs/current/recipes/#connected-components
showing the need to test the scalability of graph algorithms.
I am interested in participating in the meeting and I am open to suggestions on where contributions are most needed (no new material, so as part of a panel or presenting old material).
Best wishes, Marc
Dylan Bethune-Waddell
Hi Ted,
Great idea Ted. Wanted to mention KatanaGraph (website, github). It's basically a port of this codebase called Galois (website, github). Appears to be a group of UT Austin researchers taking their impressive results (paper) solving various OLAP graph computing problems into open source (3-Clause BSD License). From what I've gathered poking around the new codebase vs. old, and the demo server you can launch a notebook on, they aim to commercialize the distributed GPU aspect of Galois after getting it production ready as katana "enterprise". The guts of it exist in the Galois codebase and they do refer to it - could be a good conversation to have in the JanusGraph community.
Seems like KatanaGraph and cool stuff like rapids.ai spark-rapids are all using the Apache Arrow format, which might be an integration to consider. Another interesting project is GraphBLAS, which is a spec but now has concrete implementations, including this one which is from a "competitor" to KatanaGraph, gunrock. IIRC the gunrock direction-optimized BFS code is faster on power-law graphs than the implementation of BFS in katana/galois, which might be interesting in terms of how Gremlin expects to do its OLAP traversals.
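For readers unfamiliar with the term: direction-optimized BFS alternates between a top-down step (frontier vertices push to their unvisited neighbours) and a bottom-up step (unvisited vertices look for a parent in the frontier), which tends to help when the frontier becomes large, as it does on power-law graphs. Below is a minimal Python sketch of that idea. It is not the gunrock or katana/galois code; the `threshold` parameter and the simple size-based switching rule are assumptions made for illustration, and the graph is assumed to be an undirected adjacency dict in which every vertex appears as a key.

```python
def direction_optimizing_bfs(adj, source, threshold=0.25):
    """Return hop distances from `source` for all reachable vertices.

    `adj` maps each vertex to a set of neighbours (assumed symmetric).
    `threshold` is a hypothetical tuning knob: the bottom-up step is used
    when the frontier is larger than threshold * (number of unvisited vertices).
    """
    dist = {source: 0}
    frontier = {source}
    unvisited = set(adj) - {source}
    level = 0
    while frontier and unvisited:
        level += 1
        next_frontier = set()
        if len(frontier) > threshold * len(unvisited):
            # Bottom-up: every unvisited vertex checks for a neighbour in the frontier.
            for v in unvisited:
                if any(u in frontier for u in adj[v]):
                    next_frontier.add(v)
        else:
            # Top-down: every frontier vertex pushes to its unvisited neighbours.
            for u in frontier:
                for v in adj[u]:
                    if v in unvisited:
                        next_frontier.add(v)
        for v in next_frontier:
            dist[v] = level
        unvisited -= next_frontier
        frontier = next_frontier
    return dist


example = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
    "f": set(),  # isolated vertex, never reached
}
print(direction_optimizing_bfs(example, "a"))
```

Both step types yield the same distances; only the amount of edge inspection differs, which is why the switching rule matters for performance rather than correctness.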
Best,
Dylan
Ted Wilmes
Hi Marc,
Yes, I most definitely recognize your nickname and have been a beneficiary of many of your answers, blog posts, etc. Glad to hear you're interested in participating. You've been prolific on the lists, and I'm wondering if you have a top 5 list of OLAP items that you see people have trouble with over and over? A brief presentation of your responses and pointers to what you've already written would probably be very helpful for folks who are attempting the Spark path.
Thanks,
Ted
Ted Wilmes
Hey Dylan,
Thanks for the links. That's a promising set of projects. I think a brief survey of OLAP graph engines that may be applicable to JG users would be very interesting. In addition to looking at alternative OLAP engines, I think the question of integration is an interesting one. For example, TP Spark pulls data directly out of JG. I find this attractive from the standpoint of not having to maintain a mirror image of the OLTP graph, but we pay a large performance penalty. Alternatively, a mirror image OLAP graph can be maintained, likely using the same change feed that JG ingests. A third alternative, which may be feasible using the in-memory storage backend and one of the darker corners of the JG code base, the FulgoraGraphComputer, could possibly be made to work in a zero-copy fashion. Anyway, this is not as exciting as the selection/development of the OLAP engine itself, but I think the integration will play a big part in ease of use and adoption.
--Ted
hadoopmarc@...
Hi Ted,
Yes, a short overview of OLAP questions from the user list sounds like a good idea and is easy to prepare. It need not be long; 10 minutes including a few questions for clarifications would do. If you want to discuss these issues in more depth, more time is needed, of course.
Best wishes, Marc
Ted Wilmes
Great! We've done 20-minute slots in the past; that may work well for this if we do around 10-15 minutes of presentation and 5-10 for discussion/Q&A? In reality, that'll just scratch the surface, but it will give folks some jumping-off points.
For others, what graph algorithms have you operationalized or would like to? What worked, what didn't? Real world use cases (successes or failures!) are always of keen interest to the group.
--Ted
hadoopmarc@...
Hi Ted,
Saw these two interesting threads on the dev list the other day:
https://lists.lfaidata.foundation/g/janusgraph-dev/topic/performance_optimization/80653320
https://lists.lfaidata.foundation/g/janusgraph-dev/topic/performance_issue_large/80821002
Apparently, the people at Zeotap do analytics on JanusGraph at a massive scale by having many Spark executors individually connect to JanusGraph (skipping SparkGraphComputer/HadoopGraph). It would be interesting to have them at the meeting and hear what kind of analytic queries they do, in particular:
- how do they access the table with JanusGraph IDs?
- how do they aggregate the results of individual Spark partitions into the end result of the Gremlin query?
- how do they retrieve vertex data for steps 2, 3, ... of the traversal (Spark shuffle vs. each executor retrieving additional vertex data from JanusGraph)?
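
To make the three questions concrete, here is a toy Python sketch of the general pattern (partition an ID table, let each worker query the graph on its own, then merge partial results). It says nothing about how the setup in the linked threads actually works: the `GRAPH` dict stands in for JanusGraph, a thread pool stands in for Spark executors, and every function name is hypothetical. In this variant each worker fetches the second-step vertex data itself instead of relying on a shuffle.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for the graph store: vertex id -> (label, out-neighbour ids).
GRAPH = {
    1: ("person", [2, 3]),
    2: ("person", [3]),
    3: ("company", []),
    4: ("person", [3, 5]),
    5: ("company", []),
}


def fetch_neighbours(vertex_id):
    """Hypothetical lookup; a real job would query the graph database here."""
    return GRAPH[vertex_id][1]


def fetch_label(vertex_id):
    """Hypothetical lookup of vertex data needed for the second step."""
    return GRAPH[vertex_id][0]


def split(ids, n):
    """Split the id table into n roughly equal partitions."""
    return [ids[i::n] for i in range(n)]


def run_partition(ids):
    """Work done independently by one worker: count labels of out-neighbours."""
    counts = Counter()
    for vid in ids:
        for nid in fetch_neighbours(vid):
            # Second-step data is fetched by the same worker (no shuffle).
            counts[fetch_label(nid)] += 1
    return counts


def run_job(all_ids, partitions=2):
    """Run all partitions in parallel and merge the partial counts."""
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        partials = list(pool.map(run_partition, split(all_ids, partitions)))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


print(run_job(sorted(GRAPH)))
```

The merge step works here because label counts are additive across partitions; queries whose results are not simply additive (for example, deduplicated paths) would need a different aggregation, which is part of what the questions above are asking.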