
Re: [PROPOSAL] Replace FulgoraGraphComputer by IgniteGraphActors/IgniteGraphComputer

okram...@...
 

Hi,

I have never heard of Apache Ignite and Kuppitz just introduced me to Apache Geode too. Crazy cool technologies.

Personally, I think it would be great to try and build "distributed/transactional/persisted TinkerGraph" using:

  Apache Geode + TinkerPop

The idea would be that each "region" (partition) maintains, in its key/value structure, vertex-id/vertex(properties/edges) entries. Then, if you have GraphActors overlaid across the cluster, you get data-local traversals. Finally, because Apache Geode supports transactions and disk persistence (and cache spillover), you get "database" features. From reviewing their docs, it seems it would be pretty easy to implement.
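A minimal sketch of the layout described above, and not Geode or Ignite code: plain Python dictionaries stand in for regions, and `StarVertex`, `PartitionedGraph`, and the modulo partitioning rule are hypothetical names and choices made only for illustration. Each region maps a vertex id to a star vertex (its properties plus its outgoing edges), so a worker co-located with a region can read a vertex's neighbourhood from that region alone.

```python
from dataclasses import dataclass, field


@dataclass
class StarVertex:
    """A vertex together with its properties and outgoing edges."""
    vertex_id: int
    properties: dict = field(default_factory=dict)
    out_edges: list = field(default_factory=list)  # (label, target_vertex_id)


class PartitionedGraph:
    """Hypothetical stand-in for a partitioned key/value store."""

    def __init__(self, num_regions):
        # One dict per region: vertex id -> StarVertex.
        self.regions = [{} for _ in range(num_regions)]

    def region_of(self, vertex_id):
        # Assumed partitioning rule: modulo on the integer id.
        return vertex_id % len(self.regions)

    def add_vertex(self, vertex_id, **properties):
        star = StarVertex(vertex_id, dict(properties))
        self.regions[self.region_of(vertex_id)][vertex_id] = star
        return star

    def add_edge(self, source_id, label, target_id):
        # The edge is stored with its source vertex, inside the source's region.
        source = self.regions[self.region_of(source_id)][source_id]
        source.out_edges.append((label, target_id))

    def local_out(self, region_index, vertex_id):
        """Read a vertex's out-neighbours using only one region's data."""
        star = self.regions[region_index][vertex_id]
        return [target for _, target in star.out_edges]


graph = PartitionedGraph(num_regions=2)
graph.add_vertex(1, name="marko")
graph.add_vertex(2, name="vadas")
graph.add_edge(1, "knows", 2)
print(graph.region_of(1), graph.local_out(graph.region_of(1), 1))
```

Transactions, persistence, and cache spillover are left out here; in the proposal those would come from the backing store rather than from this structure.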

The benefit of building this would be to learn how TinkerPop should define the Partitioner and Partition interfaces that are being developed for TinkerPop 3.3.0. I have been biased by Akka so it would be good to be biased by Geode/Ignite as well.

As an aside: I didn't see anything about "distributed cache aware transactions" in Apache Geode like I see in Apache Ignite.

Anywho, given that TinkerPop already provides the Gremlin language and machine where the machine can execute in an iterative "Hadoop-based" fashion (GraphComputer) and a peer-to-peer message passing fashion (GraphActors), there really isn't much work to do besides back the TinkerPop structure API by Apache Geode/Ignite. Once you do that, you can start to get fancy with not requiring "star vertices" as the values of the keys, but instead, support keys that are ids for vertices, "edge-blocks", and "vertex property blocks". In this way, you get not only vertex-, but also edge-cuts. The Partitioner/Partition API of TinkerPop will be able to support the ability to have the graph data represented across the cluster as you please and thus, unlike GraphComputer, you are not tied to "star vertex blocks."
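To make the "edge-blocks" idea concrete, here is a small sketch under stated assumptions: the key format `("edge-block", vertex_id, block_no)`, the block size, and the placement rule are all invented for illustration and are not part of TinkerPop's Partitioner/Partition API. It shows one vertex's edge list being split into blocks that are placed independently, so the vertex's edges need not all live in one partition.

```python
from collections import defaultdict

BLOCK_SIZE = 2  # assumed number of edges per edge-block


def partition_for(vertex_id, block_no, num_partitions):
    # Assumed placement rule, chosen only to keep the example deterministic.
    return (vertex_id + block_no) % num_partitions


def store_edges(vertex_id, edges, num_partitions):
    """Split a vertex's edge list into blocks and place each block by its own key."""
    partitions = defaultdict(dict)
    for start in range(0, len(edges), BLOCK_SIZE):
        block_no = start // BLOCK_SIZE
        key = ("edge-block", vertex_id, block_no)
        target = partition_for(vertex_id, block_no, num_partitions)
        partitions[target][key] = edges[start:start + BLOCK_SIZE]
    return partitions


edges_of_v1 = [("knows", 2), ("knows", 4), ("created", 3), ("created", 5)]
layout = store_edges(1, edges_of_v1, num_partitions=3)
for partition, blocks in sorted(layout.items()):
    print(partition, blocks)
```

With star vertices as values, all four edges would sit under the single key for vertex 1; with block keys they end up in two different partitions.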

Thanks for the thoughts. I enjoyed reading your email and spending my afternoon learning about Ignite and Geode.

Take care,
Marko.

http://markorodriguez.com


On Thursday, January 26, 2017 at 2:34:48 PM UTC-7, Dylan Bethune-Waddell wrote:
So I've been toying with this idea for a while, but I want to make it concrete here so we can shoot it down, mutate, or build it up as needed:

I think that we should on-board an additional backend for distributed, asynchronous OLAP/OLTP "hybrid" processing, for which Marko Rodriguez is building a reference implementation right now in TinkerPop (link to branch and JIRA). Here's a link to a figure he attached to the JIRA issue. We need to drop or change the name of FulgoraGraphComputer anyways, so my suggestion is "repeal and replace": let's take a stab at providing an implementation of Marko's new work, which he is doing with Akka, but using Apache Ignite as the backend instead, with the goal of addressing a concern raised/discussed in the JIRA issue:
  • If transactions are worked out, then distributed OLTP Gremlin provides mutation capabilities (something currently not implemented for GraphComputer). That is, addV, addE, drop, etc. just work. *Caveat: transactions in this environment across GremlinServer seem difficult.*

Indeed, which is why I propose we try out Apache Ignite as a way to achieve this functionality while replacing our in-memory, single-machine graph computer (Fulgora) with a distributed in-memory graph computer that will suit most OLAP-style analytics needs with the benefit of full OLTP TinkerPop API functionality and integration points with Spark and tools that work with data stored in HDFS. Quoting from this article, which admittedly reads as a long-form advertisement for Apache Ignite:
  • Ignite can provide shared storage, so state can be passed from one Spark application or job to another
  • Ignite can provide SQL with indexing so Spark SQL can be accelerated over 1,000x
  • When working with files instead of RDDs, the Apache Ignite In-Memory File System (IGFS) can also share state between Spark jobs and applications
My understanding is that Apache Ignite essentially provides a distributed in-memory cache that is compatible with, and a "drop-in" value add to, both HDFS and Apache Spark, with an integration point for Cassandra as well. To me, the RDD-like data structure of Ignite, which maintains state complete with ACID transactions via an SQL interface, would therefore address Marko's concern about distributed transactions. Here are a few jumping-off points we could learn from:

1. SQLG - MIT Licensed by Pieter Martin, who is quite active on the TinkerPop mailing lists and JIRA. We could ask nicely if he would like to lend a hand, as I am looking at the SQL interface as the most sensible target here.
2. Marko Rodriguez's reference implementation of GraphActors with Akka - as mentioned above this is an active body of work that has not yet been released or merged into TinkerPop master, but is what I suggest we target here.
3. Ted Wilmes' SQL-gremlin - while this goes "the other way" and maps a TinkerPop-enabled database to a tabular representation so that you can run SQL queries over the data, I'm guessing we'll see plenty of overlapping gotchas that Ted already ran into.
4. SparkGraphComputer - or "the thing that already works". Apache Ignite shadowing the Spark/Hadoop APIs might put a drop-in IgniteGraphComputer within reach which would give us an idea of how performant and useful we could expect the system to be overall before we invest in the "big change" of IgniteGraphActors or whatever middle-ground between the GraphActors, GraphComputer, and GraphProvider frameworks we'll need to find to realise an Apache Ignite backend within JanusGraph.

I also wanted to mention my thoughts on IGFS (Ignite File System), which runs either on top of HDFS, between HDFS and Spark, or standalone (I believe). My thinking is that we can store side-effect data structures in IGFS and it will enable the same ACID-ness on distributed side-effect data structures we would be getting for elements in the Graph/RDD data structure via IgniteRDD or what have you. From there, persistence to HDFS via a BulkDumperVertexProgram or BulkExportVertexProgram would be possible, as would running Spark job chains with addE()/V() and drop() on the IgniteRDD or transformations thereof, opening up a path to ETL-type workflows involving other "Big Data" creatures/tools. Further, we could then persist back into JanusGraph with a BulkLoaderVertexProgram implementation. So again, this is somewhat of a GraphComputer/GraphActors hybrid, but I'm not sure I mind. Jonathan Ellithorpe mentioned his implementation of TinkerPop over RAMCloud a while back on the TinkerPop mailing list as part of a benchmarking effort - we could ask him about how performant that was, as it sounds similar to what this would be. Benchmarks would be nice, too :)

I'm interested in what people think of on-boarding this kind of processing engine in principle, even if all the optimistic assumptions of the feasibility of Ignite I have made here turn out to be unfounded. Are there other options we should consider besides Ignite, or should we stick closer to home and simply implement the GraphActors/Partitioner/Partitions standard Marko is working on directly with Cassandra/HBase as a giant refactor over time? Clearly, this is a change we can move up or down our development schedule and spend a while getting right, but if performant I see a lot of value here.


[DISCUSS] Moving toward an initial release

"P. Taylor Goetz" <ptg...@...>
 

I’d like to start a discussion to flesh out a list of tasks necessary to make an initial release. Having a downloadable binary release and associated Maven artifacts, IMO, would help in terms of expanding the community by allowing potential users to “kick the tires” of JanusGraph and get an idea of what’s required to migrate off of TitanDB.

From a process perspective, we should figure out how releases are planned and executed. In Apache projects, that typically falls to the Release Manager. That role typically rotates among committers, and any committer is free to propose a release at any time. The process typically starts with a DISCUSS thread where the community discusses what features, bug fixes, etc. should go into the release, and decides who will act as release manager for that release. Once the release is ready to be cut, the release manager builds the source release and binary release, and stages the Maven artifacts in Nexus. The next step is to start a VOTE thread to approve the release (the VOTE message contains links to all release artifacts: source/binary archives, associated checksums and signatures, and links to the Nexus staging repository). For the vote to pass, it requires a minimum of 3 positive votes, and more positive than negative votes. If the vote passes, the release artifacts are made available for download, and the Maven artifacts are released from Nexus.
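The pass condition stated above can be written down as a small check. This is only a sketch of the rule as described in this message; the function name `vote_passes` is made up for illustration, and the question of whose votes count is not covered here.

```python
def vote_passes(positive, negative):
    """Rule from the message: at least 3 positive votes, and more positive than negative."""
    return positive >= 3 and positive > negative


for tally in [(3, 0), (2, 0), (4, 4), (5, 2)]:
    print(tally, vote_passes(*tally))
```

For example, two positive votes with no objections would not pass, and neither would a 4-4 tie.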

The Apache release policy can be found here: http://www.apache.org/legal/release-policy.html

Is that a process we would like to adopt?


From an operational perspective, I can think of a few things:

Any potential release managers would need a Nexus account. I already have one, so I can help facilitate the creation of the JanusGraph organization in Nexus (presumably using the “org.janusgraph” maven group ID). I would just need JIRA usernames for everyone. (You can create a JIRA account here: https://issues.sonatype.org/secure/Dashboard.jspa) I would want to do this in one fell swoop because the process requires human review and can take up to 48 hours.

Where would we host the downloads? The janusgraph.org website is in git, but I’m not sure we would want to store potentially big binaries there.

In terms of technical issues that should be resolved for the first release, do we want to list them out in this thread, use a GitHub issue label, or something else?

-Taylor



Re: [PROPOSAL] Replace FulgoraGraphComputer by IgniteGraphActors/IgniteGraphComputer

Jason Plurad <plu...@...>
 

David Robinson presented The Many Faces of Apache Ignite (PDF) at Apache: Big Data NA 2016. He has since moved on from IBM, so his work on Genesis Graph was unfinished and unreleased. It predated the cool stuff coming with GraphActors, but it seems in line with Dylan's proposal.

-- Jason


On Friday, January 27, 2017 at 4:01:36 PM UTC-5, okram...@... wrote:
Hi,

I have never heard of Apache Ignite and Kuppitz just introduced me to Apache Geode too. Crazy cool technologies.

Personally, I think it would be great to try and build "distributed/transactional/persisted TinkerGraph" using:

  Apache Geode + TinkerPop

The idea would be that each "region" (partition) maintains, in its key/value structure, vertex-id/vertex(properties/edges) entries. Then, if you have GraphActors overlaid across the cluster, you get data-local traversals. Finally, because Apache Geode supports transactions and disk persistence (and cache spillover), you get "database" features. From reviewing their docs, it seems it would be pretty easy to implement.

The benefit of building this would be to learn how TinkerPop should define the Partitioner and Partition interfaces that are being developed for TinkerPop 3.3.0. I have been biased by Akka so it would be good to be biased by Geode/Ignite as well.

As an aside: I didn't see anything about "distributed cache aware transactions" in Apache Geode like I see in Apache Ignite.

Anywho, given that TinkerPop already provides the Gremlin language and machine where the machine can execute in an iterative "Hadoop-based" fashion (GraphComputer) and a peer-to-peer message passing fashion (GraphActors), there really isn't much work to do besides back the TinkerPop structure API by Apache Geode/Ignite. Once you do that, you can start to get fancy with not requiring "star vertices" as the values of the keys, but instead, support keys that are ids for vertices, "edge-blocks", and "vertex property blocks". In this way, you get not only vertex-, but also edge-cuts. The Partitioner/Partition API of TinkerPop will be able to support the ability to have the graph data represented across the cluster as you please and thus, unlike GraphComputer, you are not tied to "star vertex blocks."

Thanks for the thoughts. I enjoyed reading your email and spending my afternoon learning about Ignite and Geode.

Take care,
Marko.





Re: [DISCUSS] Moving toward an initial release

Misha Brukman <mbru...@...>
 

Hi Taylor,

On Fri, Jan 27, 2017 at 4:38 PM, P. Taylor Goetz <ptg...@...> wrote:
I’d like to start a discussion to flesh out a list of tasks necessary to make an initial release. Having a downloadable binary release and associated Maven artifacts, IMO, would help in terms of expanding the community by allowing potential users to “kick the tires” of JanusGraph and get an idea of what’s required to migrate off of TitanDB.

Definitely agree on this! And we've had a number of users ask for this explicitly as well.
 
From a process perspective, we should figure out how releases are planned and executed. In Apache projects, that typically falls to the Release Manager. That role typically rotates among committers, and any committer is free to propose a release at any time. The process typically starts with a DISCUSS thread where the community discusses what features, bug fixes, etc. should go into the release, and decides who will act as release manager for that release. Once the release is ready to be cut, the release manager builds the source release and binary release, and stages the Maven artifacts in Nexus. The next step is to start a VOTE thread to approve the release (the VOTE message contains links to all release artifacts: source/binary archives, associated checksums and signatures, and links to the Nexus staging repository). For the vote to pass, it requires a minimum of 3 positive votes, and more positive than negative votes. If the vote passes, the release artifacts are made available for download, and the Maven artifacts are released from Nexus.

The Apache release policy can be found here: http://www.apache.org/legal/release-policy.html

Is that a process we would like to adopt?

No strong feelings on this matter, the Apache release process sounds good to me.
 
From an operational perspective, I can think of a few things:

Any potential release managers would need a Nexus account. I already have one, so I can help facilitate the creation of the JanusGraph organization in Nexus (presumably using the “org.janusgraph” maven group ID). I would just need JIRA usernames for everyone. (You can create a JIRA account here: https://issues.sonatype.org/secure/Dashboard.jspa) I would want to do this in one fell swoop because the process requires human review and can take up to 48 hours.

We can start a GitHub issue and request that everyone posts their desired usernames there, collect them all after some period of time, and submit a single request. Do you want to own this?

Where would we host the downloads? The janusgraph.org website is in git, but I’m not sure we would want to store potentially big binaries there.

We don't need to host binaries in version control, GitHub provides binary hosting outside of version control, so it won't affect the source repo: https://github.com/blog/1547-release-your-software
 
In terms of technical issues that should be resolved for the first release, do we want to list them out in this thread, use a GitHub issue label, or something else?

I'm fine with either, but for discussion purposes, email might be easier (?).

Misha


Re: [PROPOSAL] Replace FulgoraGraphComputer by IgniteGraphActors/IgniteGraphComputer

mathias...@...
 

I will be testing JanusGraph on ScyllaDB.


On Friday, 27 January 2017 21:55:13 UTC, Jason Plurad wrote:
David Robinson presented The Many Faces of Apache Ignite (PDF) at Apache: Big Data NA 2016. He has since moved on from IBM, so his work on Genesis Graph was unfinished and unreleased. It predated the cool stuff coming with GraphActors, but it seems in line with Dylan's proposal.

-- Jason



Re: [DISCUSS] Moving toward an initial release

Ted Wilmes <twi...@...>
 

Apache release process sounds good to me. Perhaps we can discuss what to include on this list and then mark the relevant issues in GitHub with the 0.1 milestone.

As far as what to include, I think it would be good to at a minimum get the build & tests in a repeatedly passing state and then possibly update things to TinkerPop 3.2.4, which will be released shortly.
I just entered this other issue which is serious, but I'm torn on whether it's serious enough to require its inclusion in the first release. Essentially, if batch queries to the backends are flipped on, certain traversals will fail. At this point, getting a release that is at least at parity with the Titan 1.1 branch + a TinkerPop bump is probably a good first step. What are you all's thoughts?

--Ted


On Friday, January 27, 2017 at 5:22:34 PM UTC-6, Misha Brukman wrote:
Hi Taylor,

On Fri, Jan 27, 2017 at 4:38 PM, P. Taylor Goetz <ptg...@...> wrote:
I’d like to start a discussion to flesh out a list of tasks necessary to make an initial release. Having a downloadable binary release and associated Maven artifacts, IMO, would help in terms of expanding the community by allowing potential users to “kick the tires” of JanusGraph and get an idea of what’s required to migrate off of TitanDB.

Definitely agree on this! And we've had a number of users ask for this explicitly as well.
 
From a process perspective, we should figure out how releases are planned and executed. In Apache projects, that typically falls to the Release Manager. That role typically rotates among committers, and any committer is free to propose a release at any time. The process typically starts with a DISCUSS thread where the community discusses what features, bug fixes, etc. should go into the release, and decides who will act as release manager for that release. Once the release is ready to be cut, the release manager builds the source release and binary release, and stages the Maven artifacts in Nexus. The next step is to start a VOTE thread to approve the release (the VOTE message contains links to all release artifacts: source/binary archives, associated checksums and signatures, and links to the Nexus staging repository). For the vote to pass, it requires a minimum of 3 positive votes, and more positive than negative votes. If the vote passes, the release artifacts are made available for download, and the Maven artifacts are released from Nexus.

The Apache release policy can be found here: http://www.apache.org/legal/release-policy.html

Is that a process we would like to adopt?

No strong feelings on this matter, the Apache release process sounds good to me.
 
From an operational perspective, I can think of a few things:

Any potential release managers would need a Nexus account. I already have one, so I can help facilitate the creation of the JanusGraph organization in Nexus (presumably using the “org.janusgraph” maven group ID). I would just need JIRA usernames for everyone. (You can create a JIRA account here: https://issues.sonatype.org/secure/Dashboard.jspa) I would want to do this in one fell swoop because the process requires human review and can take up to 48 hours.

We can start a GitHub issue and request that everyone posts their desired usernames there, collect them all after some period of time, and submit a single request. Do you want to own this?

Where would we host the downloads? The janusgraph.org website is in git, but I’m not sure we would want to store potentially big binaries there.

We don't need to host binaries in version control. GitHub provides binary hosting outside of version control, so it won't affect the source repo: https://github.com/blog/1547-release-your-software
 
In terms of technical issues that should be resolved for the first release, do we want to list them out in this thread, use a GitHub issue label, or something else?

I'm fine with either, but for discussion purposes, email might be easier (?).

Misha


Re: [DISCUSS] Moving toward an initial release

Jerry He <jerr...@...>
 

+1
We are on the same page.
The first release will be an introductory release: new face, updated dependencies as appropriate, on par with Titan 1.0/1.1 in terms of testing and quality, with consideration/documentation for migration from Titan.

Jerry  


Re: [DISCUSS] Moving toward an initial release

Misha Brukman <mbru...@...>
 

FYI, I created a milestone for the 0.1.0 release (since we're on 0.1.0-SNAPSHOT now): https://github.com/JanusGraph/janusgraph/milestone/1

Let's start adding all blockers to that milestone so that it's easy to track where we stand. For example, I just added the following issue:
I think some of these should also be blockers:
though what it's really trying to get to is to get a stable set of tests that we can rely on, that Travis also runs for every PR. Right now, we're just verifying builds, but since the full test suite times out, we are not running any tests continuously at all, which is dangerous.


--
You received this message because you are subscribed to the Google Groups "JanusGraph developer list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to janusgraph-dev+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [PROPOSAL] Replace FulgoraGraphComputer by IgniteGraphActors/IgniteGraphComputer

Pieter Martin <pieter...@...>
 

Hi,

I'd be happy to help where I can.
I'll read up a bit on Apache Ignite.

Cheers
Pieter


On Thursday, 26 January 2017 23:34:48 UTC+2, Dylan Bethune-Waddell wrote:
So I've been toying with this idea for a while, but I want to make it concrete here so we can shoot it down, mutate, or build it up as needed:

I think that we should on-board an additional backend for distributed, asynchronous OLAP/OLTP "hybrid" processing, which Marko Rodriguez is implementing a reference implementation for right now in TinkerPop (link to branch and JIRA). Here's a link to a figure he attached to the JIRA issue. We need to drop or change the name of FulgoraGraphComputer anyways, so my suggestion is "repeal and replace" - let's take a stab at providing an implementation of Marko's new work, which he is doing with Akka, whereby we will use Apache Ignite as a backend instead with the goal of addressing a concern raised/discussed in the JIRA issue:
  • If transactions are worked out, then distributed OLTP Gremlin provides mutation capabilities (something currently not implemented for GraphComputer). That is, addV, addE, drop, etc. just work. *Caveat: transactions in this environment across GremlinServer seem difficult.*

Indeed, which is why I propose we try out Apache Ignite as a way to achieve this functionality while replacing our in-memory, single-machine graph computer (Fulgora) with a distributed in-memory graph computer that will suit most OLAP-style analytics needs with the benefit of full OLTP TinkerPop API functionality and integration points with Spark and tools that work with data stored in HDFS. Quoting from this article, which admittedly reads as a long-form advertisement for Apache Ignite:
  • Ignite can provide shared storage, so state can be passed from one Spark application or job to another
  • Ignite can provide SQL with indexing so Spark SQL can be accelerated over 1,000x
  • When working with files instead of RDDs, the Apache Ignite In-Memory File System (IGFS) can also share state between Spark jobs and applications
My understanding is that Apache Ignite essentially provides a distributed in-memory cache that is compatible with, and a "drop-in" value-add to, both HDFS and Apache Spark, with an integration point for Cassandra as well. To me, the RDD-like data structure of Ignite, which maintains state complete with ACID transactions via an SQL interface, would therefore address Marko's concern about distributed transactions. Here are a few jumping-off points we could learn from:

1. SQLG - MIT Licensed by Pieter Martin, who is quite active on the TinkerPop mailing lists and JIRA. We could ask nicely if he would like to lend a hand, as I am looking at the SQL interface as the most sensible target here.
2. Marko Rodriguez's reference implementation of GraphActors with Akka - as mentioned above this is an active body of work that has not yet been released or merged into TinkerPop master, but is what I suggest we target here.
3. Ted Wilmes' SQL-gremlin - while this goes "the other way" and maps a TinkerPop-enabled database to a tabular representation so that you can run SQL queries over the data, I'm guessing we'll see plenty of overlapping gotchas that Ted already ran into.
4. SparkGraphComputer - or "the thing that already works". Apache Ignite shadowing the Spark/Hadoop APIs might put a drop-in IgniteGraphComputer within reach which would give us an idea of how performant and useful we could expect the system to be overall before we invest in the "big change" of IgniteGraphActors or whatever middle-ground between the GraphActors, GraphComputer, and GraphProvider frameworks we'll need to find to realise an Apache Ignite backend within JanusGraph.

I also wanted to mention my thoughts on IGFS (Ignite File System), which runs either on top of HDFS, between HDFS and Spark, or standalone (I believe). My thinking is that we can store side-effect data structures in IGFS, and it will enable the same ACID-ness on distributed side-effect data structures that we would be getting for elements in the Graph/RDD data structure via IgniteRDD or what have you. From there, persistence to HDFS via a BulkDumperVertexProgram or BulkExportVertexProgram would be possible, as would running Spark job chains with addE()/V() and drop() on the IgniteRDD or transformations thereof, opening up a path to ETL-type workflows involving other "Big Data" creatures/tools. Further, we could then persist back into JanusGraph with a BulkLoaderVertexProgram implementation. So again, this is somewhat of a GraphComputer/GraphActors hybrid, but I'm not sure I mind. Jonathan Ellithorpe mentioned his implementation of TinkerPop over RAMCloud a while back on the TinkerPop mailing list as part of a benchmarking effort; we could ask him how performant that was, as it sounds similar to what this would be. Benchmarks would be nice, too :)

I'm interested in what people think of on-boarding this kind of processing engine in principle, even if all the optimistic assumptions of the feasibility of Ignite I have made here turn out to be unfounded. Are there other options we should consider besides Ignite, or should we stick closer to home and simply implement the GraphActors/Partitioner/Partitions standard Marko is working on directly with Cassandra/HBase as a giant refactor over time? Clearly, this is a change we can move up or down our development schedule and spend a while getting right, but if performant I see a lot of value here.


Re: [PROPOSAL] Replace FulgoraGraphComputer by IgniteGraphActors/IgniteGraphComputer

Henry Saputra <henry....@...>
 

I just want to clarify the proposal: will we need only the IGFS portion of Apache Ignite, to add an in-memory cache on top of HDFS?




Development process - testing

sjudeng <sju...@...>
 

Hello JanusGraph developers,

I'd like to request that the PR evaluation/approval process be updated to require successful local testing (mvn clean install) of all relevant test suites before merging into master. Currently Travis CI can only reliably test one or two modules and it's unlikely to ever be able to handle larger modules or the TinkerPop test suite (mvn clean install -Dtest.skip.tp=false).

I think there should be a test sign off as part of the review process of any non-trivial PR. You can see this done in TinkerPop PR comments. The scope of the PR can determine what modules need tests run and whether for example the TinkerPop suite needs to be run. If running tests is too much of a burden on reviewers an option would be to require submitters to indicate which test suites were run and provide the "Build Successful" snippet showing the time.

Currently all tests except those in the janusgraph-solr module should pass, and there's an open PR (#76) to resolve this. The TinkerPop test suite should be passing as well; if it is not, it does pass on the branch in open PR #78.


Re: Development process - testing

Jerry He <jerr...@...>
 

+1, given the current limitation with Travis CI.

Thanks,

Jerry


Re: [DISCUSS] Moving toward an initial release

"P. Taylor Goetz" <ptg...@...>
 

I created the following issue to get Sonatype Nexus set up so we can publish artifacts to Maven Central.


Any committers interested in having the ability to perform a release should follow those instructions. Once we get a list of usernames, I can go ahead and set up the Nexus account for the “org.janusgraph” group ID.

Thanks,

-Taylor




HBase table definition and how flexible to change it?

Demai <nid...@...>
 

hi, Guys

I am new to this forum and looking for a few pointers.

I am fairly familiar with HBase, hence I plan to use it as the backend. I went through the 'getting-started' tutorial and have the example up and running. I am looking for a few pointers about the design (of column families and keys), and how/why it was designed that way. That leads to my next question: is it flexible (and reasonable, beneficial) to store vertices, edges, and properties separately? If so, which code should I pay attention to?

Many thanks

Demai


Re: HBase table definition and how flexible to change it?

Irving Duran <irvin...@...>
 

This is a good video that I would recommend -> https://youtu.be/tLR-I53Gl9g

I would keep your vertex, edges, and properties together.


Thank You,

Irving Duran



Re: HBase table definition and how flexible to change it?

Demai Ni <nid...@...>
 

Irving, 

thanks for the pointers. 

Neo4j still runs on a single server, though there are efforts on data distribution/partitioning to support a true cluster. Maybe Neo4j doesn't worry about scalability yet. I don't really know much about it either, so I should leave it to the experts to comment.

But scalability is something I care about. That's why I am looking at JanusGraph.

Demai 

On Thu, Feb 9, 2017 at 2:06 PM, Irving Duran <irvin...@...> wrote:
Hi Demai,
I think it goes all the way back to when graph theory started.

Maybe one of these links will give you the answer that you are seeking.

I am not sure about Neo4j. I played with it a couple of years ago (when Hadoop support was being looked at). The problem that I ran into was scalability. That's why I made the switch to GraphX (Apache Spark) and am now looking back into JanusGraph.

I hope this helps.



Thank You,

Irving Duran

On Thu, Feb 9, 2017 at 3:54 PM, Demai Ni <nid...@...> wrote:
Irving, 


Thanks. I just watched the whole presentation. It is pretty helpful for understanding TinkerPop. However, it doesn't mention the storage design or why vertices, edges, and properties should be kept together. On another note, does Neo4j keep the three in separate files?

Anyway, appreciate the pointer






Re: HBase table definition and how flexible to change it?

Jerry He <jerr...@...>
 

The edges, vertices, properties and indexes are stored in a fixed table and a fixed set of column families within the table in HBase. i.e. edges are in one CF, properties are in another CF.
They are linked via the rowkey / ID.
It is probably not clearly documented anywhere. You may need to look into the org.janusgraph.diskstorage.hbase package.
Also you can start up JanusGraph with HBase, create the sample graph.  Then look at and scan the table to get a feeling.
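
A minimal sketch of the layout described above, modeled with plain dictionaries rather than the HBase client: one table, a fixed set of column families, and rows linked by rowkey. The column family names `edges_cf` and `props_cf`, the helper functions, and the sample values are all made up for illustration; this message does not say which real column family name holds which kind of data.

```python
# Hypothetical column family names, used for illustration only.
EDGES_CF = "edges_cf"
PROPS_CF = "props_cf"

# One table: rowkey (vertex ID) -> column family -> {column: value}
table = {}


def put(rowkey, family, column, value):
    """Store one cell under the given rowkey and column family."""
    table.setdefault(rowkey, {}).setdefault(family, {})[column] = value


def get_family(rowkey, family):
    """Return a copy of all cells of one column family for a rowkey."""
    return dict(table.get(rowkey, {}).get(family, {}))


# Properties and edges of the same vertex share a rowkey but live in
# different column families.
put(b"v1", PROPS_CF, b"name", b"hercules")
put(b"v1", EDGES_CF, b"father->v2", b"")
put(b"v2", PROPS_CF, b"name", b"jupiter")
```

Reading `get_family(b"v1", EDGES_CF)` and `get_family(b"v1", PROPS_CF)` returns the two halves of the same vertex, joined only by the shared rowkey.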

Jerry


Re: HBase table definition and how flexible to change it?

Demai <nid...@...>
 

Jerry,

thanks for the pointer. Since they are stored in different CFs, it is a bit similar to Neo4j. I have JanusGraph up and running on my Mac on top of HBase 1.2; thanks for the effort to make it support newer HBase versions. I will play with it as you suggested.

Demai




Re: HBase table definition and how flexible to change it?

Demai <nid...@...>
 

Interesting. I followed the example from the 'getting-started' section. The HBase table contains quite a few (9 to be exact) column families, with names ranging from 'e' to 't', which I guess are generated. It is kind of hard to tell what's in there, let alone figure out which ones contain vertices, edges, and so on.


hbase(main):025:0> describe 'janusgraph'
Table janusgraph is ENABLED
janusgraph
COLUMN FAMILIES DESCRIPTION
{NAME => 'e', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_V
ERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'f', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_V
ERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'g', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_V
ERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'h', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_V
ERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'i', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_V
ERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'l', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => '604800 SECONDS (7 DAYS)', COMPRESSIO
N => 'GZ', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'm', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_V
ERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 's', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_V
ERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 't', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_V
ERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
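
One way to make a listing like the one above easier to read is to parse it and show only the attributes that differ between families. The sketch below is a hedged illustration, not part of any HBase tooling: `parse_family` and `non_default_ttl` are hypothetical helper names, the sample lines are shortened copies of the output, and the code assumes each descriptor sits on a single line (the shell wraps long lines at the terminal width).

```python
import re

# Matches KEY => 'value' pairs as printed by the HBase shell `describe` command.
PAIR = re.compile(r"(\w+) => '([^']*)'")


def parse_family(line):
    """Turn one '{NAME => ..., ...}' descriptor line into a dict."""
    return dict(PAIR.findall(line))


def non_default_ttl(lines):
    """Return {family name: TTL} for families whose TTL is not FOREVER."""
    result = {}
    for line in lines:
        attrs = parse_family(line)
        if attrs.get("TTL", "FOREVER") != "FOREVER":
            result[attrs["NAME"]] = attrs["TTL"]
    return result


# Shortened copies of two descriptor lines from the listing.
sample = [
    "{NAME => 'e', BLOOMFILTER => 'ROW', TTL => 'FOREVER', COMPRESSION => 'GZ'}",
    "{NAME => 'l', BLOOMFILTER => 'ROW', TTL => '604800 SECONDS (7 DAYS)', COMPRESSION => 'GZ'}",
]
```

Applied to the listing, this singles out 'l' as the only family with a finite TTL. The sketch does not say what any family contains.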



Re: [DISCUSS] Development Process

Ted Wilmes <twi...@...>
 

A pull request has been submitted for this work: https://github.com/JanusGraph/janusgraph/pull/106.

--Ted


On Friday, January 27, 2017 at 1:22:16 PM UTC-6, Ted Wilmes wrote:
Great input, thanks everyone.  I think we're all pretty much on the same page.  If you guys are okay with it,
I'll put together a PR to add some candidate documentation around our policies.  I'll hit on:

* policy for commits
* design docs and project decision making
* release policy

I'll take a crack at harmonizing the points noted here and within the other referenced Apache
projects and then pass it along for review.

Thanks,
Ted

On Thursday, January 26, 2017 at 4:17:41 PM UTC-6, P. Taylor Goetz wrote:
A couple of points to consider:

+1 policy for commits:
A lot of Apache projects have guidelines for how long a pull request must be open before it can be merged. Guidelines typically range from 24-72 hours after opening, or after the last code change commit in the pull request. The idea is that with a community spread out over many timezones, it’s best to let the earth rotate at least once so everyone has a chance to review the change during their working hours.

For -1 votes (vetoes) on code changes, a lot of projects specify that it must be accompanied by a justification. Without a justification the veto is invalid.
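
A minimal sketch of how such a guideline could be checked, under stated assumptions: the function names are hypothetical, the default of 24 hours is just the low end of the range mentioned above, and starting the clock at the later of "PR opened" and "last code change" is one reading of the guideline rather than a settled rule.

```python
from datetime import datetime, timedelta, timezone


def earliest_merge_time(opened_at, last_code_change_at=None, wait_hours=24):
    """Earliest time a PR may be merged under a fixed waiting period.

    Assumption: the clock starts at the later of the PR being opened and
    its most recent code change.
    """
    start = opened_at
    if last_code_change_at is not None and last_code_change_at > start:
        start = last_code_change_at
    return start + timedelta(hours=wait_hours)


def veto_is_valid(justification):
    """A -1 on a code change only counts if it carries a justification."""
    return bool(justification and justification.strip())
```

For instance, a PR opened at 12:00 UTC with no later commits becomes mergeable at 12:00 UTC the next day under the 24-hour setting.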


Project Direction:
I would suggest a guideline that all decision-making take place on the mailing list (similar to the ASF mantra “if it didn’t happen on an email list, it didn’t happen”). Gitter is fine for informal communication, questions, etc., but any project-related decisions should be brought to the mailing lists. And again, a waiting period for decisions gives everyone in the community a chance to participate.

I’m also in favor of some sort of “JanusGraph Improvement Proposal” (JIP) process for large changes, modeled on Apache Kafka’s KIP process. It’s a good way to get community feedback on big changes.

Feature Branches:
Another idea I support. In Apache Storm we use feature branches as sort of a “distributed pull request” — commit rules are relaxed for those branches, and when the feature is complete and ready to be merged to a main branch, the changes go up for review with the same merge rules as a main branch.

Pull Request Templates:
It’s good to have guidelines around what information needs to go into a pull request title and description. Otherwise you will inevitably get pull requests with a title like “fix bug” and no description, which forces you to look at the code changes.

-Taylor