[PROPOSAL] Replace FulgoraGraphComputer by IgniteGraphActors/IgniteGraphComputer


Dylan Bethune-Waddell <dylan.bet...@...>
 

So I've been toying with this idea for a while, but I want to make it concrete here so we can shoot it down, mutate, or build it up as needed:

I think that we should on-board an additional backend for distributed, asynchronous OLAP/OLTP "hybrid" processing, for which Marko Rodriguez is building a reference implementation right now in TinkerPop (link to branch and JIRA). Here's a link to a figure he attached to the JIRA issue. We need to drop or change the name of FulgoraGraphComputer anyway, so my suggestion is "repeal and replace": let's take a stab at providing an implementation of Marko's new work, which he is doing with Akka, but using Apache Ignite as the backend instead, with the goal of addressing a concern raised/discussed in the JIRA issue:
  • If transactions are worked out, then distributed OLTP Gremlin provides mutation capabilities (something currently not implemented for GraphComputer). That is, addV, addE, drop, etc. just work. *Caveat: transactions in this environment across GremlinServer seem difficult.*

Indeed, which is why I propose we try out Apache Ignite as a way to achieve this functionality. It would replace our in-memory, single-machine graph computer (Fulgora) with a distributed in-memory graph computer that should suit most OLAP-style analytics needs, with the benefit of full OLTP TinkerPop API functionality and integration points with Spark and tools that work with data stored in HDFS. Quoting from this article, which admittedly reads as a long-form advertisement for Apache Ignite:
  • Ignite can provide shared storage, so state can be passed from one Spark application or job to another
  • Ignite can provide SQL with indexing so Spark SQL can be accelerated over 1,000x
  • When working with files instead of RDDs, the Apache Ignite In-Memory File System (IGFS) can also share state between Spark jobs and applications

My understanding is that Apache Ignite essentially provides a distributed in-memory cache that is compatible with, and a "drop-in" value-add to, both HDFS and Apache Spark, with an integration point for Cassandra as well. To me, the RDD-like data structure of Ignite, which maintains state complete with ACID transactions via an SQL interface, would therefore address Marko's concern about distributed transactions. Here are a few jumping-off points we could learn from:

1. SQLG - MIT Licensed by Pieter Martin, who is quite active on the TinkerPop mailing lists and JIRA. We could ask nicely if he would like to lend a hand, as I am looking at the SQL interface as the most sensible target here.
2. Marko Rodriguez's reference implementation of GraphActors with Akka - as mentioned above this is an active body of work that has not yet been released or merged into TinkerPop master, but is what I suggest we target here.
3. Ted Wilmes' SQL-gremlin - while this goes "the other way" and maps a TinkerPop-enabled database to a tabular representation so that you can run SQL queries over the data, I'm guessing we'll see plenty of overlapping gotchas that Ted already ran into.
4. SparkGraphComputer - or "the thing that already works". Apache Ignite shadowing the Spark/Hadoop APIs might put a drop-in IgniteGraphComputer within reach. That would give us an idea of how performant and useful we could expect the system to be overall before we invest in the "big change" of IgniteGraphActors, or whatever middle ground between the GraphActors, GraphComputer, and GraphProvider frameworks we'll need to find to realise an Apache Ignite backend within JanusGraph.

I also wanted to mention my thoughts on IGFS (Ignite File System), which runs either on top of HDFS, between HDFS and Spark, or standalone (I believe). My thinking is that we can store side-effect data structures in IGFS, and that this will give distributed side-effect data structures the same ACID-ness we would be getting for elements in the Graph/RDD data structure via IgniteRDD or what have you. From there, persistence to HDFS via a BulkDumperVertexProgram or BulkExportVertexProgram would be possible, as would running Spark job chains with addE()/V() and drop() on the IgniteRDD or transformations thereof, opening up a path to ETL-type workflows involving other "Big Data" creatures/tools. Further, we could then persist back into JanusGraph with a BulkLoaderVertexProgram implementation. So again, this is somewhat of a GraphComputer/GraphActors hybrid, but I'm not sure I mind. Jonathan Ellithorpe mentioned his implementation of TinkerPop over RAMCloud a while back on the TinkerPop mailing list as part of a benchmarking effort - we could ask him how performant that was, as it sounds similar to what this would be. Benchmarks would be nice, too :)

I'm interested in what people think of on-boarding this kind of processing engine in principle, even if all the optimistic assumptions of the feasibility of Ignite I have made here turn out to be unfounded. Are there other options we should consider besides Ignite, or should we stick closer to home and simply implement the GraphActors/Partitioner/Partitions standard Marko is working on directly with Cassandra/HBase as a giant refactor over time? Clearly, this is a change we can move up or down our development schedule and spend a while getting right, but if performant I see a lot of value here.


Dylan Bethune-Waddell <dylan.bet...@...>
 

Should also have linked to the thread on the gremlin-users list - here you go: https://groups.google.com/forum/#!topic/gremlin-users/GNFgkaKjnFc




Ted Wilmes <twi...@...>
 

Great write-up, Dylan. I'm not familiar with Apache Ignite beyond name recognition, so I'll be doing some more reading.
Here are a few thoughts that popped into my head.

* Transactions - what concurrency control methods does Ignite use, and will it be feasible to run queries that have large numbers of mutations (millions to billions) in the context of a single transaction (single meaning it would be distributed here, but with one "commit")?
* What portion of existing JanusGraph code will be shared across the current execution engine and the new distributed OLTP?
* Is it worth taking a crack at a distributed query mode against the existing backends?
* I think your point on ETL is critical. I frequently run into the desire to persist OLAP results back into the source graph, so the more painless a path we can provide here, the better. Along these lines, if we're talking about a separate ETL step back into the source graph, having ACID support right out of the gate may not be critical, because the ETL will be occurring in its own eventually consistent fashion for the majority of our backends. In addition to that, the copy of the source graph running in Ignite will most likely be out of date almost immediately, as the source will continue to diverge, which will introduce its own challenges.

Thanks,
Ted




okram...@...
 

Hi,

I have never heard of Apache Ignite and Kuppitz just introduced me to Apache Geode too. Crazy cool technologies.

Personally, I think it would be great to try and build "distributed/transactional/persisted TinkerGraph" using:

  Apache Geode + TinkerPop

The idea would be that each "region" (partition) maintains vertex-id/vertex(properties/edges) entries in its key/value structure. Then, if you have GraphActors overlaid across the cluster, you get data-local traversals. Finally, because Apache Geode supports transactions and disk persistence (and cache spillover), you get "database" features. From reviewing their docs, it seems it would be pretty easy to implement.
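
A minimal sketch of the layout described above, written as plain Python rather than against Geode, Ignite, or TinkerPop. Every name here (StarVertex, PartitionedGraph, region_of, the modulo partitioner) is hypothetical and chosen only for illustration: each region is a dict mapping a vertex id to a vertex with its properties and outgoing edges, and one traversal hop groups the incoming vertex ids by owning region so that each adjacency list is read where it is stored.

```python
from dataclasses import dataclass, field


@dataclass
class StarVertex:
    """A vertex stored together with its properties and outgoing edges."""
    vid: int
    properties: dict = field(default_factory=dict)
    out_edges: list = field(default_factory=list)  # (label, target vertex id)


class PartitionedGraph:
    """Toy stand-in for a set of regions; each region is a plain dict."""

    def __init__(self, num_regions):
        self.regions = [dict() for _ in range(num_regions)]

    def region_of(self, vid):
        # Assumption: integer ids and a simple modulo partitioner.
        return vid % len(self.regions)

    def add_vertex(self, vid, **properties):
        self.regions[self.region_of(vid)][vid] = StarVertex(vid, dict(properties))

    def add_edge(self, src, label, dst):
        # The edge lives with its source vertex ("star vertex" layout).
        self.regions[self.region_of(src)][src].out_edges.append((label, dst))

    def out(self, vids, label):
        """One hop: group ids by owning region, then read each region locally."""
        by_region = {}
        for vid in vids:
            by_region.setdefault(self.region_of(vid), []).append(vid)
        result = []
        for region_id, local_vids in sorted(by_region.items()):
            region = self.regions[region_id]
            for vid in local_vids:
                result.extend(dst for lbl, dst in region[vid].out_edges if lbl == label)
        return result


g = PartitionedGraph(num_regions=2)
for vid, name in [(1, "marko"), (2, "vadas"), (3, "lop"), (4, "josh")]:
    g.add_vertex(vid, name=name)
g.add_edge(1, "knows", 2)
g.add_edge(1, "knows", 4)
g.add_edge(1, "created", 3)
g.add_edge(4, "created", 3)

friends = g.out([1], "knows")
projects = g.out(friends, "created")
```

In a real cluster the per-region loop would run on the node that owns the region, with only the resulting vertex ids passed on as messages; here it runs in one process purely to show the data layout.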

The benefit of building this would be to learn how TinkerPop should define the Partitioner and Partition interfaces that are being developed for TinkerPop 3.3.0. I have been biased by Akka so it would be good to be biased by Geode/Ignite as well.

As an aside: I didn't see anything about "distributed cache aware transactions" in Apache Geode like I see in Apache Ignite.

Anywho, given that TinkerPop already provides the Gremlin language and machine, where the machine can execute in an iterative "Hadoop-based" fashion (GraphComputer) and a peer-to-peer message-passing fashion (GraphActors), there really isn't much work to do besides backing the TinkerPop structure API with Apache Geode/Ignite. Once you do that, you can start to get fancy by not requiring "star vertices" as the values of the keys and instead supporting keys that are ids for vertices, "edge-blocks", and "vertex property blocks". In this way, you get not only vertex-, but also edge-cuts. The Partitioner/Partition API of TinkerPop will be able to support having the graph data represented across the cluster as you please and thus, unlike GraphComputer, you are not tied to "star vertex blocks."
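
To make the "edge-block" idea above concrete, here is a rough sketch under stated assumptions; it is not the TinkerPop Partitioner/Partition API, which was still being designed at the time. The key shapes ("v", id) and ("e", id, block number), the block size, and the toy partitioner are all invented for illustration. Because each edge block has its own key, the blocks of a single vertex can land on different partitions instead of travelling with the vertex.

```python
NUM_PARTITIONS = 3
EDGE_BLOCK_SIZE = 2  # hypothetical number of edges per block


def partition_of(key):
    # Toy deterministic partitioner: sum the integer parts of the key.
    return sum(part for part in key if isinstance(part, int)) % NUM_PARTITIONS


def store_vertex(partitions, vid, edges):
    """Store the vertex under one key and its edges under separate block keys."""
    vertex_key = ("v", vid)
    partitions[partition_of(vertex_key)][vertex_key] = {"id": vid}
    for start in range(0, len(edges), EDGE_BLOCK_SIZE):
        block_key = ("e", vid, start // EDGE_BLOCK_SIZE)
        partitions[partition_of(block_key)][block_key] = edges[start:start + EDGE_BLOCK_SIZE]


def out_edges(partitions, vid):
    """Collect a vertex's edges by walking its blocks in order."""
    edges = []
    block_no = 0
    while True:
        block_key = ("e", vid, block_no)
        block = partitions[partition_of(block_key)].get(block_key)
        if block is None:
            return edges
        edges.extend(block)
        block_no += 1


partitions = {p: {} for p in range(NUM_PARTITIONS)}
all_edges = [("knows", 2), ("knows", 4), ("created", 3), ("knows", 5), ("knows", 6)]
store_vertex(partitions, 1, all_edges)

# Which partitions ended up holding at least one edge block of vertex 1?
edge_partitions = sorted(
    p for p, entries in partitions.items() if any(k[0] == "e" for k in entries)
)
```

With these toy settings the three edge blocks of vertex 1 end up on three different partitions, while the star-vertex layout would keep all five edges next to the vertex.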

Thanks for the thoughts. I enjoyed reading your email and spending my afternoon learning about Ignite and Geode.

Take care,
Marko.

http://markorodriguez.com




Jason Plurad <plu...@...>
 

David Robinson presented The Many Faces of Apache Ignite (PDF) at Apache: Big Data NA 2016. He has since moved on from IBM, so his work on Genesis Graph was unfinished and unreleased. It predated the cool stuff coming with GraphActors, but it seems in line with Dylan's proposal.

-- Jason




mathias...@...
 

I will be testing JanusGraph on ScyllaDB.


On Friday, 27 January 2017 21:55:13 UTC, Jason Plurad wrote:
David Robinson presented The Many Faces of Apache Ignite (PDF) at Apache: Big Data NA 2016. He has since moved on from IBM, so his work on Genesis Graph was unfinished and unreleased. It predated the cool stuff coming with GraphActors, but it seems in line with Dylan's proposal.

-- Jason

On Friday, January 27, 2017 at 4:01:36 PM UTC-5, ok...@... wrote:
Hi,

I have never heard of Apache Ignite and Kuppitz just introduced me to Apache Geode too. Crazy cool technologies.

Personally, I think it would be great to try and build "distributed/transactional/persisted TinkerGraph" using:

  Apache Geode + TinkerPop

The idea would be that each "region" (partition) maintains, in its key/value structure, a mapping from vertex id to vertex (properties and edges). Then, if you have GraphActors overlaid across the cluster, you get data-local traversals. Finally, because Apache Geode supports transactions and disk persistence (and cache spillover), you get "database" features. From reviewing their docs, it seems it would be pretty easy to implement.
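To make the region layout above concrete, here is a minimal sketch in plain Python rather than Geode or Ignite code. Each partition is a dictionary from vertex id to a star vertex (properties plus outgoing edges), and one traversal step counts how many hops leave the partition that owns the starting vertex. The class names, the modulo placement rule, and the message counter are all assumptions made for illustration; they are not part of TinkerPop, Geode, or Ignite.

```python
class Partition:
    """One "region": a key/value map from vertex id to a star vertex."""

    def __init__(self):
        # vertex_id -> {"properties": {...}, "out": [(label, target_id), ...]}
        self.store = {}


class PartitionedGraph:
    """Hypothetical graph whose star vertices are spread over partitions."""

    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def partition_of(self, vertex_id):
        # Assumed placement rule: modulo on an integer vertex id.
        return vertex_id % len(self.partitions)

    def add_vertex(self, vertex_id, **properties):
        home = self.partitions[self.partition_of(vertex_id)]
        home.store[vertex_id] = {"properties": properties, "out": []}

    def add_edge(self, source_id, label, target_id):
        # The edge lives with its source vertex (star-vertex layout).
        home = self.partitions[self.partition_of(source_id)]
        home.store[source_id]["out"].append((label, target_id))

    def out(self, start_ids, label):
        """One traversal step.

        Returns (neighbour ids, number of hops that cross a partition).
        """
        neighbours, messages = [], 0
        for vertex_id in start_ids:
            home = self.partition_of(vertex_id)
            star = self.partitions[home].store[vertex_id]
            for edge_label, target_id in star["out"]:
                if edge_label != label:
                    continue
                neighbours.append(target_id)
                if self.partition_of(target_id) != home:
                    messages += 1  # would be a message to another actor
        return neighbours, messages


graph = PartitionedGraph(num_partitions=2)
for vid in (1, 2, 3, 4):
    graph.add_vertex(vid, name=f"v{vid}")
graph.add_edge(1, "knows", 3)  # 1 and 3 share partition 1: local hop
graph.add_edge(1, "knows", 2)  # 2 lives on partition 0: remote hop

neighbours, messages = graph.out([1], "knows")
print(neighbours, messages)
```

Reading the adjacency list touches only the partition that owns the starting vertex; only the hop to a vertex stored elsewhere would need a message, which is the "data-local traversal" property described above.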

The benefit of building this would be to learn how TinkerPop should define the Partitioner and Partition interfaces that are being developed for TinkerPop 3.3.0. I have been biased by Akka so it would be good to be biased by Geode/Ignite as well.

As an aside: I didn't see anything about "distributed cache aware transactions" in Apache Geode like I see in Apache Ignite.

Anywho, given that TinkerPop already provides the Gremlin language and machine where the machine can execute in an iterative "Hadoop-based" fashion (GraphComputer) and a peer-to-peer message passing fashion (GraphActors), there really isn't much work to do besides back the TinkerPop structure API by Apache Geode/Ignite. Once you do that, you can start to get fancy with not requiring "star vertices" as the values of the keys, but instead, support keys that are ids for vertices, "edge-blocks", and "vertex property blocks". In this way, you get not only vertex-, but also edge-cuts. The Partitioner/Partition API of TinkerPop will be able to support the ability to have the graph data represented across the cluster as you please and thus, unlike GraphComputer, you are not tied to "star vertex blocks."
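A small sketch may help show the difference between "star vertex" values and finer-grained keys. It is plain Python, and the key shape and both placement rules are hypothetical: nothing here comes from the TinkerPop Partitioner/Partition API, which was still being designed at the time. With star placement every piece of a vertex lands on one partition; with block placement the edge-blocks and property-blocks of a single large vertex can spread across the cluster.

```python
from typing import NamedTuple


class Key(NamedTuple):
    """Hypothetical cache key: which piece of which vertex."""
    kind: str        # "vertex", "edge-block" or "property-block"
    vertex_id: int
    block: int = 0   # block number within the vertex


def place(key, num_partitions, star=True):
    """Assumed placement rules, chosen only to keep the example deterministic."""
    if star:
        # Star layout: everything about a vertex follows the vertex id.
        return key.vertex_id % num_partitions
    # Block layout: the block number also takes part in placement,
    # so one vertex can be cut across several partitions.
    return (key.vertex_id + key.block) % num_partitions


# One high-degree vertex (id 7) split into a header, three edge-blocks
# and one property-block.
keys = [Key("vertex", 7)]
keys += [Key("edge-block", 7, b) for b in range(1, 4)]
keys += [Key("property-block", 7, 4)]

star_partitions = {place(k, 4, star=True) for k in keys}
block_partitions = {place(k, 4, star=False) for k in keys}
print(star_partitions, block_partitions)
```

The trade-off is the usual one: the star layout keeps a one-hop read on a single partition, while the block layout avoids a hot spot for a very large vertex at the cost of reads that touch several partitions.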

Thanks for the thoughts. I enjoyed reading your email and spending my afternoon learning about Ignite and Geode.

Take care,
Marko.





Pieter Martin <pieter...@...>
 

Hi,

I'd be happy to help where I can.
I'll read up a bit on Apache Ignite.

Cheers
Pieter




Henry Saputra <henry....@...>
 

I just want to clarify the proposal: would we need only the IGFS portion of Apache Ignite, to add an in-memory cache on top of HDFS?




dzso...@...
 

Hey, we already have a tinkerpop-ignite implementation as part of an in-house project. We'd be happy to merge it with Janus, if we can work together.

You are right about the convergence of OLTP and OLAP via Ignite. That's why we chose it in the first place. We avoided using IgniteRDD and based everything on IgniteCache, for this reason as well. When you have TP and AP tasks running on the same data, you need a series of new designs and mechanisms to manage consistency.

For the distributed TP implementation (or, actually, asynchronous parallelism), we may not follow the actor model. Consider this: http://vertx.io/. So it will involve changes to Janus and TinkerPop.

Hence, this merger won't be straightforward. Merging software never is...

I'm sure we can work out the details. But that requires that we all know and agree on where we are heading. This approach leads to an alternative data system, with a unique set of features and tradeoffs. We think of it as a data system designed for data science, enabling data-centric computing, end-to-end.

And that's why we didn't fork Titan, because it contained design decisions that don't quite align with the goal of OLTP-OLAP convergence.

Anyway, let's talk :)

song




Demai Ni <nid...@...>
 

Song,

Great that you guys already have an implementation with Ignite. Would you please share some performance numbers? For example, compared to Titan-on-HBase, how much does Ignite improve things? My team is considering Janus+HBase at the moment, but we are open to better solutions. Thanks

Demai

 



song <dzso...@...>
 

Hi Demai,

Preliminary tests showed slower performance than tinkerpop-spark in OLAP (largely due to inherent constraints of Ignite) and faster OLTP than titan-cassandra in most cases (largely because we use a cache). We will prepare benchmark numbers as we draw closer to the GA release (early March).

Frankly speaking, this won't be the fastest OLTP or OLAP engine in either category. We can't beat application-specific constructs and algorithms on top of Ignite or Spark, because we are adding a general-purpose abstraction. With a graph system that truly integrates OLTP and OLAP, we should expect to lose some performance, but gain a lot of flexibility.

Our first version was actually based on Titan, and we killed it after a few months and a lot of debate. Its constructs became constraints as we were integrating OLTP and OLAP. It made more sense to start with a blank slate.

Song


On Thursday, February 16, 2017 at 2:40:49 AM UTC+8, Demai wrote:
Song,

Great that you already have an implementation with Ignite. Would you please share some performance numbers? For example, compared to Titan-on-HBase, how much does Ignite improve performance? My team is considering Janus+HBase at the moment, but we are open to better solutions. Thanks.

Demai

 

On Wed, Feb 15, 2017 at 1:55 AM, <dz...@...> wrote:
Hey, we already have a tinkerpop-ignite implementation as part of an in-house project. We'd be happy to merge it with Janus, if we can work together.

You are right about the convergence of OLTP and OLAP via Ignite. That's why we chose it in the first place. We avoided using IgniteRDD and based everything on IgniteCache, for this reason as well. When you have TP and AP tasks running on the same data, you need a series of new designs and mechanisms to manage consistency.
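
To make the consistency problem concrete, here is a minimal sketch of one common approach: multi-version snapshots, where OLTP writes commit new versions while a long-running OLAP job keeps reading from a pinned version. This is only an illustration and not a description of the in-house implementation or of any Ignite API; the names VersionedStore and Snapshot are hypothetical, and deletes and garbage collection of old versions are left out.

```python
import threading
from typing import Any, Dict, List, Optional, Tuple


class VersionedStore:
    """Toy multi-version key-value store (hypothetical, not an Ignite API)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._version = 0  # number of the last committed write batch
        # key -> list of (commit_version, value), oldest first
        self._history: Dict[str, List[Tuple[int, Any]]] = {}

    def commit(self, writes: Dict[str, Any]) -> int:
        """OLTP path: apply a batch of writes atomically as one new version."""
        with self._lock:
            self._version += 1
            for key, value in writes.items():
                self._history.setdefault(key, []).append((self._version, value))
            return self._version

    def snapshot(self) -> "Snapshot":
        """OLAP path: pin the current version for a long-running read."""
        with self._lock:
            return Snapshot(self, self._version)

    def read_at(self, key: str, version: int) -> Optional[Any]:
        """Return the newest value of key committed at or before version."""
        with self._lock:
            versions = list(self._history.get(key, []))
        for commit_version, value in reversed(versions):
            if commit_version <= version:
                return value
        return None


class Snapshot:
    """Read-only view of the store as of one committed version."""

    def __init__(self, store: VersionedStore, version: int) -> None:
        self._store = store
        self._version = version

    def get(self, key: str) -> Optional[Any]:
        return self._store.read_at(key, self._version)


store = VersionedStore()
store.commit({"v1.name": "marko", "v1.age": 29})

olap_view = store.snapshot()      # analytics job starts here
store.commit({"v1.age": 30})      # a transaction commits while the job runs

print(olap_view.get("v1.age"))         # 29: the job still sees its snapshot
print(store.snapshot().get("v1.age"))  # 30: new readers see the update
```

The trade-off in this sketch is memory: every write keeps the old value around until no snapshot needs it, which is one of the mechanisms a real system would have to manage.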

For the distributed TP implementation (or, more precisely, asynchronous parallelism), we may not follow the actor model; consider Vert.x (http://vertx.io/) as an alternative. That would involve changes to both Janus and TinkerPop.

Hence, this merger won't be straightforward. Merging software never is...

I'm sure we can work out the details, but that requires that we all know and agree on where we are heading. This approach leads to an alternative data system, with a unique set of features and tradeoffs. We think of it as a data system designed for data science, enabling data-centric computing, end-to-end.

And that's why we didn't fork Titan, because it contained design decisions that don't quite align with the goal of OLTP-OLAP convergence.

Anyway, let's talk :)

song

On Friday, January 27, 2017 at 5:34:48 AM UTC+8, Dylan Bethune-Waddell wrote: