[DISCUSS] Support of CQL backend for Spark #985


Florian Hockmann <f...@...>
 

Hi,
Jan Jansen (farodin91) and I took a first try at implementing a CQL input format for Spark to enable OLAP jobs to work without Thrift (#985). Unfortunately, we can't just add wrappers for the CQL OLAP support in org.apache.cassandra and be done with it. The reason is that JanusGraph still uses version 2.1.20 of org.apache.cassandra cassandra-all which is not compatible with the version of the DataStax Cassandra driver JanusGraph is using. So, we need to update the org.apache.cassandra dependency to major version 3.
With this update we would however lose support for Thrift as that was thrown out completely in version 3.0.0 (see CASSANDRA-9353). That leaves us with the following options:
  1. Update the dependency and drop support for Thrift. Thrift has been deprecated since Cassandra 3.0, which was released in November 2015. Users who want to continue using Thrift could stay on a JanusGraph version that still supports it.
  2. Create a separate project janusgraph-hadoop-cql that can use newer versions of these two dependencies without affecting the Thrift support in janusgraph-hadoop-core.
  3. Update the dependency and include the classes for Thrift support from org.apache.cassandra.hadoop package in janusgraph-hadoop-core.
The first option would of course be the easiest to implement and I think that dropping support for Thrift over 3 years after deprecation started and only in a new major release of JanusGraph would be acceptable. However, this comes with some risk as we would make CQL the only way to use OLAP with Cassandra in the same version in which we introduce the CQL OLAP support. So, if we find any problems with the implementation, then users could only downgrade to a lower version of JanusGraph that still supports Thrift as there wouldn’t be an alternative input format left.
We should probably update the dependency at some point anyway, irrespective of the support for CQL, so creating a new project now just to avoid updating the dependency in janusgraph-hadoop-core probably isn’t the best idea.
That is why option 3 sounds like the best one to us. We would add support for OLAP with CQL without losing support for Thrift which we could drop at any time in the future.

Any thoughts or other opinions on this topic?


Oleksandr Porunov <alexand...@...>
 

I think, over time Thrift support should be dropped in JanusGraph as it is a deprecated protocol.
Right now I think that option 3 is the best one. 
Option 1 also sounds OK to me. I don't think there will be problems, but as you noticed they may appear.

(In reply to Florian Hockmann's message of Wednesday, January 9, 2019 at 1:02:50 PM UTC+2.)


Jeff Callahan <cal...@...>
 

Option 3 balances the various interests nicely, +1 for it.


(In reply to Oleksandr Porunov's message of Wednesday, January 9, 2019 at 4:19:56 AM UTC-8.)


Chris Hupman <chris...@...>
 

+1 for number 3.

Number 3 is probably the best option, but I do wonder how long we should continue to support things like Thrift and upgrading from Titan.


(In reply to Florian Hockmann's message of Wednesday, January 9, 2019.)


Florian Hockmann <f...@...>
 

Chris Hupman wrote: "but I do wonder how long we should continue to support things like thrift and upgrading from titan."

I think we should deprecate Thrift soon and then remove it in the next minor version of JanusGraph after that, but we probably need to offer the same functionality for CQL as for Thrift before we can deprecate Thrift. Since we currently don't support Spark without Thrift, we can't really deprecate it yet in my opinion. Another issue we may want to take into account when talking about deprecation / removal of Thrift is the reported performance decrease of CQL compared to Thrift: JanusGraph/janusgraph#1249

Now, regarding how to support CQL for Spark: in the meantime I found out that Cassandra already removed all Thrift code from its CQL components in version 2.2 with CASSANDRA-8358. This most likely means that we only have to update to Cassandra 2.2 to get full CQL support for Spark, which also means that we don't have to do anything about Thrift right now, as Thrift is still fully supported in Cassandra 2.2. We only have to decide about that when we actually update to Cassandra 3.0, but we can do that at a later point (ideally before Cassandra 4.0 is released, as support for Cassandra 2.1 and 2.2 will end with that release).

The current state of this work is that the update to Cassandra 2.2 led to build failures whose cause I haven't been able to identify yet. janusgraph-solr and some other projects that all depend on janusgraph-cassandra now fail with this error:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-failsafe-plugin:2.15:integration-test (default) on project janusgraph-solr: Execution default of goal org.apache.maven.plugins:maven-failsafe-plugin:2.15:integration-test failed: The forked VM terminated without saying properly goodbye. VM crash or System.exit called ?

This commit introduced the problem. I found a lot of reports with similar problems, but none of the suggested solutions helped so far. It's also strange that janusgraph-cassandra itself doesn't fail with this error.

I already created a branch that includes the update to Cassandra 2.2 and an initial version of a CQL input format for Spark, but for some reason my tests aren't executed:

Running org.janusgraph.hadoop.CqlInputFormatIT
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/travis/.m2/repository/org/slf4j/slf4j-log4j12/1.7.12/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/travis/.m2/repository/ch/qos/logback/logback-classic/1.1.3/logback-classic-1.1.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Running org.janusgraph.hadoop.CassandraInputFormatIT
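
The SLF4J warning in the log above may well be unrelated to the tests not running, but it is easy to confirm which bindings end up on a module's test classpath. The following is only a rough diagnostic sketch and not part of the JanusGraph build: the class name BindingScan and the helper findResources are made up for illustration. It lists every copy of the StaticLoggerBinder class, which is the resource path shown in the "Found binding" lines above, that the current class loader can see.

```java
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper class, not part of JanusGraph.
public class BindingScan {
    // Resource path taken from the "Found binding in [...]" lines of the log.
    static final String BINDER_PATH = "org/slf4j/impl/StaticLoggerBinder.class";

    // Returns every location on the class path that provides the given resource.
    static List<URL> findResources(ClassLoader loader, String path) throws IOException {
        return new ArrayList<>(Collections.list(loader.getResources(path)));
    }

    public static void main(String[] args) throws IOException {
        List<URL> bindings = findResources(BindingScan.class.getClassLoader(), BINDER_PATH);
        System.out.println(bindings.size() + " SLF4J binding(s) found");
        for (URL url : bindings) {
            System.out.println("  " + url);
        }
    }
}
```

Run with the same class path as the failing module, more than one printed URL corresponds to the "multiple SLF4J bindings" message; which of the two jars should be excluded is a separate decision that this sketch does not make.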

If anyone here has any helpful input on one of these two problems, then I would really appreciate that as I'm starting to run out of ideas here...

(In reply to Chris Hupman's message of Wednesday, January 23, 2019 at 03:44:40 UTC+1.)


Chris Hupman <chris...@...>
 

I was looking into libthrift again and realized I did a bad job keeping up with this. Were most of the issues figured out in PR 1400 and PR 1436? What else is there to do before we can remove libthrift as a dependency in 0.4.0?


(In reply to Florian Hockmann's message of Wednesday, January 23, 2019 at 1:56:54 AM UTC-8.)


Florian Hockmann <f...@...>
 

I think that we should not remove support for Thrift in 0.4.0, as we only gain full support for its successor, CQL, in 0.4.0 with the added support for Spark (PR #1436). I think we should first deprecate Thrift in a following 0.4.z version, assuming that CQL doesn't introduce bigger problems.
Then we can drop support for Thrift completely in 0.5.0 and update to Cassandra 3.

We should probably target a release date relatively soon for 0.5.0, as Cassandra 2.2 will be EOL once Cassandra 4 comes out and Cassandra 3 will only be supported for 6 months after that (source). Since we already have a PR for an improved in-memory storage backend that should probably also not be introduced in a patch version, it makes sense in general to start working on 0.5.0 soon.

(In reply to Chris Hupman's message of Friday, May 24, 2019 at 00:27:14 UTC+2.)