So I've been toying with this idea for a while, but I want to make it concrete here so we can shoot it down, mutate, or build it up as needed:
I think that we should on-board an additional backend for distributed, asynchronous OLAP/OLTP "hybrid" processing, for which Marko Rodriguez is writing a reference implementation right now in TinkerPop (link to branch). Here's a link to a figure he attached to the JIRA issue. We need to drop or change the name of FulgoraGraphComputer anyway, so my suggestion is "repeal and replace": let's take a stab at providing our own implementation of Marko's new work, which he is doing with Akka, using Apache Ignite as the backend instead, with the goal of addressing a concern raised/discussed in the JIRA issue:
- If transactions are worked out, then distributed OLTP Gremlin provides mutation capabilities (something currently not implemented for GraphComputer). That is addV, addE, drop, etc. just works. *Caveat, transactions in this environment across GremlinServer seems difficult.*
Indeed, which is why I propose we try out Apache Ignite as a way to achieve this functionality: we would replace our in-memory, single-machine graph computer (Fulgora) with a distributed in-memory graph computer that should suit most OLAP-style analytics needs, with the benefit of full OLTP TinkerPop API functionality and integration points with Spark and tools that work with data stored in HDFS. Quoting from this article, which admittedly reads as a long-form advertisement for Apache Ignite:
- Ignite can provide shared storage, so state can be passed from one Spark application or job to another
- Ignite can provide SQL with indexing so Spark SQL can be accelerated over 1,000x
- When working with files instead of RDDs, the Apache Ignite In-Memory File System (IGFS) can also share state between Spark jobs and applications
My understanding is that Apache Ignite essentially provides a distributed in-memory cache that is compatible with, and a "drop-in" value add to, both HDFS and Apache Spark, with an integration point for Cassandra as well. To me, the RDD-like data structure of Ignite, which maintains state complete with ACID transactions via an SQL interface, would therefore address Marko's concern about distributed transactions. Here are a few jumping-off points we could learn from:
- MIT Licensed by Pieter Martin, who is quite active on the TinkerPop mailing lists and JIRA. We could ask nicely if he would like to lend a hand, as I am looking at the SQL interface as the most sensible target here.
3. Ted Wilmes' SQL-gremlin - while this goes "the other way" and maps a TinkerPop-enabled database to a tabular representation so that you can run SQL queries over the data, I'm guessing we'll see plenty of overlapping gotchas that Ted already ran into.
4. SparkGraphComputer - or "the thing that already works". Apache Ignite shadowing the Spark/Hadoop APIs might put a drop-in IgniteGraphComputer within reach. That would give us an idea of how performant and useful we could expect the system to be overall before we invest in the "big change" of IgniteGraphActors, or whatever middle ground between the GraphActors, GraphComputer, and GraphProvider frameworks we'll need to find to realise an Apache Ignite backend within JanusGraph.
I also wanted to mention my thoughts on IGFS (Ignite File System) which runs either on top of HDFS, between HDFS and Spark, or standalone (I believe). My thinking is that we can store side-effect data structures in IGFS and it will enable the same ACID-ness on distributed side-effect data structures we would be getting for elements in the Graph/RDD data structure via IgniteRDD or what have you.
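To make the hoped-for behaviour concrete, here is a minimal sketch of the semantics I mean by "ACID-ness" for graph elements and side-effect data alike: writes are buffered, become visible all at once on commit, and vanish on rollback. Everything in it (`ToyTxCache`, `Tx`, the key/value strings) is a hypothetical stand-in written against plain Java collections; it is not the Ignite API, it runs on a single JVM, and it ignores conflict detection and durability. A real implementation would presumably delegate these steps to Ignite's transactional caches.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy stand-in for a transactional key-value cache. Hypothetical; NOT the Ignite API. */
class ToyTxCache {
    private final Map<String, String> committed = new HashMap<>();

    /** Unit of work: writes and removals stay private until commit(). */
    final class Tx {
        private final Map<String, String> writes = new HashMap<>();
        private final Set<String> removals = new HashSet<>();

        void put(String key, String value) {   // roughly what addV()/addE() would need
            removals.remove(key);
            writes.put(key, value);
        }

        void remove(String key) {              // roughly what drop() would need
            writes.remove(key);
            removals.add(key);
        }

        void commit() {                        // apply every buffered change in one step
            synchronized (committed) {
                committed.keySet().removeAll(removals);
                committed.putAll(writes);
            }
            writes.clear();
            removals.clear();
        }

        void rollback() {                      // discard buffered changes, store untouched
            writes.clear();
            removals.clear();
        }
    }

    Tx begin() { return new Tx(); }

    String get(String key) {
        synchronized (committed) { return committed.get(key); }
    }

    int size() {
        synchronized (committed) { return committed.size(); }
    }
}

class ToyTxDemo {
    public static void main(String[] args) {
        ToyTxCache vertices = new ToyTxCache();

        ToyTxCache.Tx tx = vertices.begin();
        tx.put("v1", "person:marko");
        tx.put("v2", "person:pieter");
        System.out.println(vertices.get("v1")); // prints "null": nothing is visible before commit
        tx.commit();
        System.out.println(vertices.get("v1")); // prints "person:marko"

        ToyTxCache.Tx failed = vertices.begin();
        failed.remove("v1");
        failed.rollback();
        System.out.println(vertices.size());    // prints "2": the rolled-back drop never happened
    }
}
```

The open question for this proposal is whether Ignite can give us the same all-or-nothing behaviour across machines, which is exactly the part this toy leaves out.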
From there, persistence to HDFS via a BulkDumperVertexProgram would be possible, as would running Spark job chains with addE()/V() on the IgniteRDD or transformations thereof, opening up a path to ETL type workflows involving other "Big Data" creatures/tools. Further, we could then persist back into JanusGraph with a BulkLoaderVertexProgram implementation. So again, this is somewhat of a GraphComputer/GraphActors hybrid, but I'm not sure I mind.
Jonathan Ellithorpe mentioned his implementation of TinkerPop over RAMCloud a while back on the TinkerPop mailing list as part of a benchmarking effort - we could ask him about how performant that was, as it sounds similar to what this would be. Benchmarks would be nice, too :)
I'm interested in what people think of on-boarding this kind of processing engine in principle, even if all the optimistic assumptions I have made here about the feasibility of Ignite turn out to be unfounded. Are there other options we should consider besides Ignite, or should we stick closer to home and simply implement the GraphActors/Partitioner/Partitions standard Marko is working on directly with Cassandra/HBase as a giant refactor over time? Clearly, this is a change we can move up or down our development schedule and spend a while getting right, but if it proves performant, I see a lot of value here.