
Requirements gathering and roadmap development

Haikal Pribadi <hai...@...>
 

Hi everyone,

Last week was very exciting with the launch of our beloved JanusGraph project. I believe some of you are working hard to get the first build out for JanusGraph. Please shout out if you need any help. I may be a little out of touch on that topic but our team is more than happy to help out.

Now that we've launched, I would like to kickstart the process of us gathering all the requirements (features, bugs) to be developed in JanusGraph. Let's gather all the problems we've known from Titan that we want to fix in Janus, as well as the features we wish existed.

And to get the ball rolling, here's how I propose we should start.

1) Since we already have GitHub, let's start by listing the features/bugs on our GitHub repository as issues. 
2) We should describe each issue thoroughly. Examples are good, use cases are good. Use labels, e.g. Cassandra, HBase, Graph Computer, ElasticSearch.
3) We should all start reviewing and discussing them through the comments. 
4) We should make an effort to merge requirements to make sure the list is clean and efficient. 
5) We should refrain from making a decision on whether an issue gets accepted or rejected in the discussion thread, unless there is a natural consensus.
6) Once we have a decent list of requirements, we then start scheduling TSC meetings to review them and decide yes/no.

I believe these steps would be a good start for our work on JanusGraph. We can distribute and manage the work properly. We can also include whatever work is required to release the first build.

TIP: I'm using https://waffle.io/JanusGraph/janusgraph to visualise the issues on GitHub as a kanban board, if anyone fancies a board. :)

Let's get this started!


Re: Requirements gathering and roadmap development

Jerry He <jerr...@...>
 

This is a great suggestion, well said.
There are already some requirements/requests from the user list :-)


Re: Requirements gathering and roadmap development

Haikal Pribadi <hai...@...>
 

Let's migrate them to the GitHub issue list on the repository, so we can keep them organised and execute on them!

Let me know of any suggestions to the initiative!

Cheers,



JanusGraph Meetup in NYC

Haikal Pribadi <hai...@...>
 

Hi everyone,

Susan and I are planning to organise a JanusGraph meetup in NYC, around 28th of February.

We think it would be a very good idea for teams from our contributors to present the work that they do with JanusGraph at this meetup and invite our users to collaborate as well.

I would like to start the discussion on this thread. We would love to hear your thoughts and whether or not you will be able to make it to the event! So let's discuss:

1) Interest in talking at the meetup, followed by a rough idea of a title of your talk (if you can already think of one, but not a must)
2) Date availability for the event.
3) Ideas on venues
4) Ideas on promoting the event to our user community.

Once we have a clear plan on the amount of content coming from our contributors, I think it would also be a very good idea to invite our users to speak as well if there is still space. What do you think?

I've also created a new meetup group: Open Source Graph Technologies, where we can host various meetups on open-source graphs around the world: https://www.meetup.com/graphs/
You can see there's already the first meetup for 28th of February set as a template. Title, description, place and date are still up for discussion! https://www.meetup.com/graphs/events/237100744/

Looking forward to hearing all your ideas, everyone!


Re: JanusGraph Meetup in NYC

Haikal Pribadi <hai...@...>
 

The date has been changed to Wednesday, 1st of March 2017.




[DISCUSS] Development Process

Ted Wilmes <twi...@...>
 

Hello,
I'd like to hear what folks think about the development process for JanusGraph.  I think
it would be helpful if we had some basic guidelines documented.

I'll throw a few ideas out to get the brainstorming going.

Proposed Functionality and Project Direction
-----------------------------------------------------------
Proposed functionality should be discussed on the dev list.  If it is a large change (how 
should we define large?), it should be voted on, requiring 3 +1 votes from project committers.  
One example of a large change would be adding a new storage adapter.  Bug fixes 
and enhancements that do not change APIs or require a large amount of refactoring 
should not require a vote.


Pull Requests
-------------------
I think the Apache TinkerPop project could be one source of inspiration for our dev 
processes [1].  For PRs in particular, votes are required before a PR can be merged.  
3 votes by committers are required and the submitter can (and usually does) vote for 
their own PR if they are a committer.  Since we have folks tagged as "maintainers" for 
the different modules of JanusGraph, maybe we could also require one of the votes
to come from the relevant module maintainer.

On the other end of the spectrum, we could lower the vote requirement or allow for
commit then review (CTR), skipping votes altogether.  We could also stick with voting
but allow for CTR when a committer deems a PR low risk.

Thoughts?
Ted 


Re: JanusGraph Meetup in NYC

Dylan Bethune-Waddell <dylan.bet...@...>
 

Excellent idea Haikal, I will have to see if I can make it, but I do hope to see you there. Assuming that the entities involved consent, this would be a great opportunity to start publishing relevant talks and slides featuring JanusGraph online. Personally I think the timing is quite good. Is the date still flexible, in case we can be more inclusive by moving it up or down a bit?



Re: JanusGraph Meetup in NYC

Haikal Pribadi <hai...@...>
 

Hi Dylan,

The date is flexible if any major roadblock occurs, or if the majority of people could make it on a different day.

Looking forward to meeting you there, Dylan!

Cheers,




Re: [DISCUSS] Development Process

Dylan Bethune-Waddell <dylan.bet...@...>
 

Hi Ted,

Thanks for getting this discussion started - I personally like the idea of following the TinkerPop/Apache Way for accepting/rejecting changes, but I'm not familiar with alternatives, so I'm looking forward to hearing from others on that. One thing I will say is that I believe the approval of a specific person should be required at times, but only at the request of the person proposing/committing a "big change", or at the request of another committer, so long as that request for a specific pair of eyes is not vetoed by a 3+ vote from other committers. I'm imagining situations in which the current maintainer is indisposed but should (or shouldn't) look at it, or situations in which the benevolent dictator of such and such feature throws a fit and starts vetoing the will of the people. Mostly joking - but it is my understanding that no one should get a benevolent dictatorship in FOSS, because sometimes weird things do happen.

Regarding new features and design discussions, I like what I have seen on the Mesos JIRA before, where a link is posted in the parent JIRA issue to a public Google Docs design document where editing/discussion can take place. The Google Docs approach for "Technical Steering" seems good to me because it is (a) flexible enough to detail anything, (b) in the public eye, and (c) provides an opportunity for the same kind of line-by-line review/discuss iteration we can do on GitHub for code but can't really do on GitHub for arbitrary documents (that would be crazy... right?). Along the same lines, I liked the Google Sheet that WikiMedia produced to compare alternatives when they were looking for a graph database to back their query service after Titan was acquired by DataStax. This would also make the definition of "big change" pretty clear: anything requiring a design doc (with a 3+ committer vote if the requirement for a doc is in dispute) is "big", and everything else is not. I could imagine us doing something similar when considering new backends/libraries and such. Generally I like the link-out to Google Docs approach because issues of design are usefully separated from development.

We might also want to use JIRA (and apply for an open source license to get it hosted by Atlassian). I am in favour of this because I believe JanusGraph consists of enough moving pieces and sub-projects that, in the long run, setting it up will benefit us compared to GitHub issue tracking/management. My fear is that at some point, what is provided by GitHub (just labels and links mostly, no?) will not be sufficient to properly delineate issues and efforts within the project, and migrating to something like JIRA will no longer be trivial, and may even be infeasible after a year or two.

What do others think?




Re: [DISCUSS] Development Process

Jerry He <jerr...@...>
 

I like the idea of having a formal 'development process' or 'developer' document for the project.
Regarding the process to follow to pull in large features, or regular patches or pull requests, I am adding here the policy [1] from HBase. FYI. 

To quote it:

Feature Branches

Feature Branches are easy to make. You do not have to be a committer to make one. Just request the name of your branch be added to JIRA up on the developer’s mailing list and a committer will add it for you. Thereafter you can file issues against your feature branch in Apache HBase JIRA. Your code you keep elsewhere — it should be public so it can be observed — and you can update dev mailing list on progress. When the feature is ready for commit, 3 +1s from committers will get your feature merged.

Patch +1 Policy

The below policy is something we put in place 09/2012. It is a suggested policy rather than a hard requirement. We want to try it first to see if it works before we cast it in stone.
Apache HBase is made of components.
Patches that fit within the scope of a single Apache HBase component require, at least, a +1 by one of the component’s owners before commit. If owners are absent — busy or otherwise — two +1s by non-owners will suffice.
Patches that span components need at least two +1s before they can be committed, preferably +1s by owners of components touched by the x-component patch (TODO: This needs tightening up but I think fine for first pass).

Any -1 on a patch by anyone vetoes a patch; it cannot be committed until the justification for the -1 is addressed.

I think we can reconcile/merge the TinkerPop policy and the HBase policy so that we have a simple yet fine-tuned one.

Thanks,

Jerry


[1] https://hbase.apache.org/book.html#_decisions


Re: [DISCUSS] Development Process

Jason Plurad <plu...@...>
 

+1 on design docs for major feature changes. It will help retain the underlying goals and decisions, which will be important as new people join and others roll off. Kafka also requires them: https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals



[DISCUSS] Issue Tracking - Github, JIRA, YouTrack, what suits?

Dylan Bethune-Waddell <dylan.bet...@...>
 

Both JIRA and YouTrack have free-for-open-source licensing options we could apply for (linked).
Haikal has mentioned on another thread https://waffle.io as a nice value-add to GitHub issue tracking.

In Ted's thread about development practices, we've mentioned some top-level Apache projects that link to design docs on JIRA issues, where the team iterates/comments on that one document directly instead of forum-style comment/reply communication, while preserving that option via the JIRA issue itself. Of course we have Gitter for more interactive discussion, but the clarity a document with a history of revisions and comments will provide is certainly worth it to me for the big stuff - could we achieve a natural integration of this on GitHub's issue tracker without the overhead of outsourcing issue tracking to another platform?

I certainly like the idea that all the issues are actually part of the repo, so IMO the benefits and must-have features provided by an alternative to just using GitHub would have to be substantial and exclusive. I would like to ask everyone if they think we should switch, and if there is enough of a split, we should then, separately from this discussion, vote on the matter. WDYT?


[PROPOSAL] Replace FulgoraGraphComputer by IgniteGraphActors/IgniteGraphComputer

Dylan Bethune-Waddell <dylan.bet...@...>
 

So I've been toying with this idea for a while, but I want to make it concrete here so we can shoot it down, mutate, or build it up as needed:

I think that we should on-board an additional backend for distributed, asynchronous OLAP/OLTP "hybrid" processing, for which Marko Rodriguez is building a reference implementation right now in TinkerPop (link to branch and JIRA). Here's a link to a figure he attached to the JIRA issue. We need to drop or change the name of FulgoraGraphComputer anyway, so my suggestion is "repeal and replace" - let's take a stab at providing an implementation of Marko's new work, which he is doing with Akka, but using Apache Ignite as the backend instead, with the goal of addressing a concern raised/discussed in the JIRA issue:
  • If transactions are worked out, then distributed OLTP Gremlin provides mutation capabilities (something currently not implemented for GraphComputer). That is, addV(), addE(), drop(), etc. just works. *Caveat: transactions in this environment across GremlinServer seem difficult.*

Indeed, which is why I propose we try out Apache Ignite as a way to achieve this functionality while replacing our in-memory, single-machine graph computer (Fulgora) with a distributed in-memory graph computer that will suit most OLAP-style analytics needs with the benefit of full OLTP TinkerPop API functionality and integration points with Spark and tools that work with data stored in HDFS. Quoting from this article, which admittedly reads as a long-form advertisement for Apache Ignite:
  • Ignite can provide shared storage, so state can be passed from one Spark application or job to another
  • Ignite can provide SQL with indexing so Spark SQL can be accelerated over 1,000x
  • When working with files instead of RDDs, the Apache Ignite In-Memory File System (IGFS) can also share state between Spark jobs and applications
My understanding is that Apache Ignite essentially provides a distributed in-memory cache that is compatible with, and a "drop-in" value add to, both HDFS and Apache Spark, with an integration point for Cassandra as well. To me, the RDD-like data structure of Ignite, which maintains state complete with ACID transactions via an SQL interface, would therefore address Marko's concern about distributed transactions. Here are a few jumping-off points we could learn from:

1. SQLG - MIT Licensed by Pieter Martin, who is quite active on the TinkerPop mailing lists and JIRA. We could ask nicely if he would like to lend a hand, as I am looking at the SQL interface as the most sensible target here.
2. Marko Rodriguez's reference implementation of GraphActors with Akka - as mentioned above this is an active body of work that has not yet been released or merged into TinkerPop master, but is what I suggest we target here.
3. Ted Wilmes' SQL-gremlin - while this goes "the other way" and maps a TinkerPop-enabled database to a tabular representation so that you can run SQL queries over the data, I'm guessing we'll see plenty of overlapping gotchas that Ted already ran into.
4. SparkGraphComputer - or "the thing that already works". Apache Ignite shadowing the Spark/Hadoop APIs might put a drop-in IgniteGraphComputer within reach which would give us an idea of how performant and useful we could expect the system to be overall before we invest in the "big change" of IgniteGraphActors or whatever middle-ground between the GraphActors, GraphComputer, and GraphProvider frameworks we'll need to find to realise an Apache Ignite backend within JanusGraph.

I also wanted to mention my thoughts on IGFS (Ignite File System) which runs either on top of HDFS, between HDFS and Spark, or standalone (I believe). My thinking is that we can store side-effect data structures in IGFS and it will enable the same ACID-ness on distributed side-effect data structures we would be getting for elements in the Graph/RDD data structure via IgniteRDD or what have you. From there, persistence to HDFS via a BulkDumperVertexProgram or BulkExportVertexProgram would be possible, as would running spark job chains with addE()/V() and drop() on the IgniteRDD or transformations thereof, opening up a path to ETL type workflows involving other "Big Data" creatures/tools. Further, we could then persist back into JanusGraph with a BulkLoaderVertexProgram implementation. So again, this is somewhat of a GraphComputer/GraphActors hybrid, but I'm not sure I mind. Jonathan Ellithorpe mentioned his implementation of TinkerPop over RAMcloud a while back on the TinkerPop mailing list as part of a benchmarking effort - we could ask him about how performant that was as it sounds similar to what this would be. Benchmarks would be nice, too :)
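
To make the shape of this concrete, here is a minimal sketch (Java/TinkerPop) of how swapping the graph computer could look from the user's side. GraphFactory, SparkGraphComputer and withComputer() are existing TinkerPop pieces, but the properties file path is only illustrative, and the IgniteGraphComputer class does not exist yet - it is purely hypothetical:

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.util.GraphFactory;

public class GraphComputerSwapSketch {
    public static void main(String[] args) throws Exception {
        // Today: a read-only OLAP traversal over a HadoopGraph that reads JanusGraph data.
        // The properties file path is illustrative, not something shipped with the project.
        Graph graph = GraphFactory.open("conf/hadoop-graph/read-cassandra.properties");
        GraphTraversalSource g = graph.traversal().withComputer(SparkGraphComputer.class);
        long people = g.V().hasLabel("person").count().next();
        System.out.println("person vertices: " + people);

        // Proposed: the same traversal API, backed by a distributed, transactional engine,
        // so that mutations (addV/addE/drop) also work in OLAP scope. IgniteGraphComputer
        // is hypothetical and shown only to illustrate the intended drop-in swap.
        // GraphTraversalSource gi = graph.traversal().withComputer(IgniteGraphComputer.class);
        // gi.V().hasLabel("stale").drop().iterate();

        graph.close();
    }
}

The point of the sketch is that, if the swap really is drop-in at the traversal-source level, the change is invisible to end users and the work is entirely on the provider side.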

I'm interested in what people think of on-boarding this kind of processing engine in principle, even if all the optimistic assumptions of the feasibility of Ignite I have made here turn out to be unfounded. Are there other options we should consider besides Ignite, or should we stick closer to home and simply implement the GraphActors/Partitioner/Partitions standard Marko is working on directly with Cassandra/HBase as a giant refactor over time? Clearly, this is a change we can move up or down our development schedule and spend a while getting right, but if performant I see a lot of value here.


Re: [DISCUSS] Development Process

"P. Taylor Goetz" <ptg...@...>
 

A couple of points to consider:

+1 policy for commits:
A lot of Apache projects have guidelines for how long a pull request must be open before it can be merged. Guidelines typically range from 24-72 hours after opening, or after the last code change commit in the pull request. The idea is that with a community spread out over many timezones, it’s best to let the earth rotate at least once so everyone has a chance to review the change during their working hours.

For -1 votes (vetoes) on code changes, a lot of projects specify that the veto must be accompanied by a justification. Without a justification, the veto is invalid.


Project Direction:
I would suggest a guideline that all decision-making take place on the mailing list (similar to the ASF mantra "if it didn't happen on an email list, it didn't happen"). Gitter is fine for informal communication, questions, etc., but any project-related decisions should be brought to the mailing lists. And again, a waiting period for decisions gives everyone in the community a chance to participate.

I'm also in favor of some sort of "JanusGraph Improvement Proposal" (JIP), modeled similarly to Apache Kafka's KIP process, for large changes. It's a good way to get community feedback for big changes.

Feature Branches:
Another idea I support. In Apache Storm we use feature branches as sort of a “distributed pull request” — commit rules are relaxed for those branches, and when the feature is complete and ready to be merged to a main branch, the changes go up for review with the same merge rules as a main branch.

Pull Request Templates:
It's good to have guidelines around what information needs to go into a pull request title and description. Otherwise you will inevitably get pull requests with a title like "fix bug" and no description, which forces you to look at the code changes.

-Taylor


Re: [PROPOSAL] Replace FulgoraGraphComputer by IgniteGraphActors/IgniteGraphComputer

Dylan Bethune-Waddell <dylan.bet...@...>
 

Should also have linked to the thread on the gremlin-users list - here you go: https://groups.google.com/forum/#!topic/gremlin-users/GNFgkaKjnFc




Re: [DISCUSS] Issue Tracking - Github, JIRA, YouTrack, what suits?

"P. Taylor Goetz" <ptg...@...>
 

I’d lean toward keeping it simple and starting out with GitHub issues, especially since the community is still small. If we outgrow or discover difficulty with GitHub, we can always change later.

-Taylor




Re: [DISCUSS] Issue Tracking - Github, JIRA, YouTrack, what suits?

Henry Saputra <henry....@...>
 

I would recommend sticking with GitHub issues for now, until we find they are not enough for us to track changes and enhancement requests.

Keep it simple.




Re: [DISCUSS] Issue Tracking - Github, JIRA, YouTrack, what suits?

Dylan Bethune-Waddell <dylan.bet...@...>
 

So it is (quite obviously and plainly stated at the bottom of every comment box, so I'm sorry for asking) possible to attach design docs to GitHub issues. Doh. Also, if we do ever want to migrate to JIRA on an open source license, we can use their tool for importing from GitHub.



Re: [DISCUSS] Development Process

Ted Wilmes <twi...@...>
 

Great input, thanks everyone.  I think we're all pretty much on the same page.  If you guys are okay with it,
I'll put together a PR to add some candidate documentation around our policies.  I'll hit on:

* policy for commits
* design docs and project decision making
* release policy

I'll take a crack at harmonizing the points noted here and within the other referenced Apache
projects and then pass it along for review.

Thanks,
Ted






Re: [PROPOSAL] Replace FulgoraGraphComputer by IgniteGraphActors/IgniteGraphComputer

Ted Wilmes <twi...@...>
 

Great write-up Dylan.  I'm not familiar with Apache Ignite beyond name recognition, so I'll be doing some more reading.
Here are a few thoughts that popped into my head.

* Transactions - what concurrency control methods does Ignite use, and will it be feasible to run queries that have large numbers of
mutations (millions to billions) in the context of a single transaction (single meaning it would be distributed here but one "commit")? See the sketch after this list.
* What portion of existing JanusGraph code will be shared across the current execution engine and the new distributed OLTP? 
* Is it worth taking a crack at a distributed query mode against the existing backends?
* I think your point on ETL is critical; I frequently run into the desire to persist OLAP results back into the source graph, so the more painless
of a path we can provide here, the better.  Along these lines, if we're talking about a separate ETL step back into the source graph,
having ACID support right out of the gate may not be critical because the ETL will be occurring in its own eventually
consistent fashion for the majority of our backends.  In addition to that, the copy of the source graph running in Ignite will most
likely be out of date almost immediately as the source will continue to diverge, which will introduce its own challenges.
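
For reference on the first question, here is a minimal sketch of Ignite's key-value transaction API, assuming a plain TRANSACTIONAL cache. The cache name and entries are made up for illustration, and this says nothing about how a hypothetical IgniteGraphComputer would actually surface transactions:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionIsolation;

public class IgniteTxSketch {
    public static void main(String[] args) throws Exception {
        // Start a local Ignite node; in a real deployment this would join a cluster.
        try (Ignite ignite = Ignition.start()) {
            // The cache must be TRANSACTIONAL for ACID semantics; the name is illustrative.
            CacheConfiguration<Long, String> cfg = new CacheConfiguration<>("vertexProps");
            cfg.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL);
            IgniteCache<Long, String> cache = ignite.getOrCreateCache(cfg);

            // A pessimistic, repeatable-read transaction spanning multiple puts.
            try (Transaction tx = ignite.transactions().txStart(
                    TransactionConcurrency.PESSIMISTIC, TransactionIsolation.REPEATABLE_READ)) {
                cache.put(1L, "name=marko");
                cache.put(2L, "name=vadas");
                tx.commit(); // changes are rolled back if commit is never reached
            }
        }
    }
}

Whether this model holds up for millions to billions of mutations in one logical commit is exactly the open question; the sketch only shows what the transaction interface looks like at the cache level.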

Thanks,
Ted


