Re: Issues with controlling partitions when using Apache Spark


Florian Hockmann
 

Hi Mladen,

 

I wasn’t aware that the CqlInputFormat we’re using is considered legacy. It looks like we should migrate to the spark-cassandra-connector, then. Could you please create an issue on GitHub for this?

And if you already have an implementation ready for this, it would of course be really great if you could contribute it as a PR.

 

Regards,

Florian

 

From: janusgraph-users@... <janusgraph-users@...> On Behalf Of Mladen Marovic
Sent: Monday, January 25, 2021 17:34
To: janusgraph-users@...
Subject: [janusgraph-users] Issues with controlling partitions when using Apache Spark

 

Hey there!

 

I've recently been working on some Apache Spark jobs for JanusGraph via hadoop-gremlin (as described at https://docs.janusgraph.org/advanced-topics/hadoop/) and encountered several issues. In general, I kept running into memory problems because the partitions were too big to be loaded into my Spark executors, even after increasing executor memory to 16 GB.

 

After analysing the code, I found two parameters that could be used to further subsplit the partitions: cassandra.input.split.size and cassandra.input.split.size_mb. However, when I tried to use these parameters and debugged further after the memory issues persisted, I noticed several bugs in the underlying org.apache.cassandra.hadoop.cql3.CqlInputFormat used to load the data. I posted the question on the DataStax community forums (see https://community.datastax.com/questions/10153/how-to-control-partition-size-when-reading-data-wi.html). There, the suggestion was ultimately to migrate to the spark-cassandra-connector, since the issues I encountered were probably genuine bugs, but in legacy code that is likely no longer maintained.
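For anyone who wants to try the same knobs: below is a minimal sketch of how the two keys can be set. They are plain Hadoop configuration entries, so in a hadoop-gremlin job they should also be settable directly in the HadoopGraph properties file. The class name and values are only illustrative, and the exact semantics of each key depend on the Cassandra version in use.

    import org.apache.hadoop.conf.Configuration;

    public class SplitSizeExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // Rough target number of rows per input split
            // (exact semantics depend on the Cassandra version).
            conf.setInt("cassandra.input.split.size", 65536);

            // Rough target split size in megabytes.
            conf.setInt("cassandra.input.split.size_mb", 64);

            System.out.println("cassandra.input.split.size    = "
                    + conf.get("cassandra.input.split.size"));
            System.out.println("cassandra.input.split.size_mb = "
                    + conf.get("cassandra.input.split.size_mb"));
        }
    }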

 

In the meantime, I reimplemented the InputFormat classes in my app to fix the issues, and testing so far shows that this works as intended (a rough sketch of the approach follows the questions below). However, I was wondering about the following:

 

1. Does anyone else have experience using Apache Spark and JanusGraph with graphs too big to fit into memory without subsplitting? Did you also encounter this issue, and if so, how did you deal with it?

2. Is there an "official" solution to this issue?

3. Are there any plans to migrate to the spark-cassandra-connector for this use case?
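To give an idea of the shape of the InputFormat change mentioned above, here is a simplified, hypothetical skeleton rather than the actual code: the class name, the configuration key, and the default threshold are made up, the token-range arithmetic that does the real subsplitting is elided (it depends on the partitioner and on the split class of the Cassandra version in use), and it assumes CqlInputFormat's getSplits(JobContext) declares only IOException, as in recent Cassandra versions.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    import org.apache.cassandra.hadoop.cql3.CqlInputFormat;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;

    // Hypothetical skeleton: delegate split computation to the stock CqlInputFormat
    // and post-process the result so that oversized splits can be subdivided before
    // the Spark executors try to materialise them.
    public class SubsplittingCqlInputFormat extends CqlInputFormat {

        // Made-up key; not an official Cassandra or JanusGraph option.
        private static final String MAX_SPLIT_BYTES = "custom.input.max.split.bytes";

        @Override
        public List<InputSplit> getSplits(JobContext context) throws IOException {
            Configuration conf = context.getConfiguration();
            long maxBytes = conf.getLong(MAX_SPLIT_BYTES, 64L * 1024 * 1024);

            List<InputSplit> result = new ArrayList<>();
            for (InputSplit split : super.getSplits(context)) {
                result.addAll(subdivide(split, maxBytes));
            }
            return result;
        }

        private List<InputSplit> subdivide(InputSplit split, long maxBytes) throws IOException {
            try {
                if (split.getLength() <= maxBytes) {
                    return Collections.singletonList(split);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException(e);
            }
            // The actual subsplitting (cutting this split's token range into smaller
            // sub-ranges and wrapping each one in a new split) is omitted here.
            return Collections.singletonList(split);
        }
    }

Wiring such a class into the hadoop-gremlin/JanusGraph input format chain is a separate step that I'm leaving out here.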

 

Thanks,

 

Mladen
