Issues with controlling partitions when using Apache Spark
Hi Mladen,
I wasn’t aware that the CqlInputFormat we’re using is considered legacy. It looks like we should migrate to spark-cassandra-connector, then. Could you please create an issue on GitHub for this?
And if you already have an implementation ready for this, it would of course be great if you could contribute it as a PR.
Regards,
Florian
From: janusgraph-users@... <janusgraph-users@...> On Behalf Of Mladen Marovic
Sent: Monday, January 25, 2021 17:34
To: janusgraph-users@...
Subject: [janusgraph-users] Issues with controlling partitions when using Apache Spark
Hey there!
I've recently been working on some Apache Spark jobs for JanusGraph via hadoop-gremlin (as described at https://docs.janusgraph.org/advanced-topics/hadoop/) and encountered several issues. Generally, I kept running into memory problems because the partitions were too big to be loaded into my Spark executors (even after increasing executor memory to 16GB).
After analysing the code, I found two parameters that can be used to further subsplit the partitions: cassandra.input.split.size and cassandra.input.split.size_mb. However, when I tried to use these parameters the memory issues persisted, and debugging revealed several bugs in the underlying org.apache.cassandra.hadoop.cql3.CqlInputFormat used to load the data. I posted a question on the DataStax community forums (see https://community.datastax.com/questions/10153/how-to-control-partition-size-when-reading-data-wi.html). There I was ultimately advised to migrate to spark-cassandra-connector: the issues I encountered probably were bugs, but CqlInputFormat is legacy code (and probably no longer maintained).
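For reference, the idea is to set these parameters in the hadoop-gremlin properties file alongside the reader configuration. The fragment below shows what I mean; the values are purely illustrative, and the reader class is the CQL one from the JanusGraph Hadoop docs:

    # hadoop-gremlin graph configuration
    gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
    gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cql.CqlInputFormat
    # target number of rows per input split (illustrative value)
    cassandra.input.split.size=1024
    # target input split size in megabytes (illustrative value)
    cassandra.input.split.size_mb=64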
In the meantime, I reimplemented the InputFormat classes in my app to fix the issues, and testing so far shows that this works as intended (a sketch of the general idea follows the questions below). However, I was wondering the following:
1. Does anyone else have experience using Apache Spark and JanusGraph with graphs too big to fit into memory without subsplitting? Did you also encounter this issue? If so, how did you deal with it?
2. Is there an "official" solution to this issue?
3. Are there any plans to migrate to spark-cassandra-connector for this use case?
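To illustrate the general idea (this is only a simplified sketch, not the exact code from my app): each oversized input split covers a token range, and the fix boils down to dividing that range into smaller contiguous sub-ranges, each of which becomes its own split. Assuming the Murmur3Partitioner, where tokens are signed 64-bit longs, and ignoring wraparound ranges, the core of it looks roughly like this:

    import java.math.BigInteger;
    import java.util.ArrayList;
    import java.util.List;

    public final class TokenRangeSplitter {

        /** A Murmur3 token range, start exclusive, end inclusive (Cassandra's convention). */
        public static final class TokenRange {
            public final long start;
            public final long end;
            public TokenRange(long start, long end) { this.start = start; this.end = end; }
            @Override public String toString() { return "(" + start + ", " + end + "]"; }
        }

        /**
         * Divides a non-wrapping token range into {@code parts} contiguous sub-ranges
         * of roughly equal width. Assumes the range is wider than {@code parts} tokens.
         */
        public static List<TokenRange> subsplit(TokenRange range, int parts) {
            if (parts < 1 || range.start >= range.end) {
                throw new IllegalArgumentException("need parts >= 1 and a non-wrapping range");
            }
            // BigInteger avoids overflow: tokens span the full signed 64-bit space,
            // so (end - start) can overflow a plain long subtraction.
            BigInteger start = BigInteger.valueOf(range.start);
            BigInteger step = BigInteger.valueOf(range.end).subtract(start)
                    .divide(BigInteger.valueOf(parts));
            List<TokenRange> result = new ArrayList<>(parts);
            long prev = range.start;
            for (int i = 1; i < parts; i++) {
                long boundary = start.add(step.multiply(BigInteger.valueOf(i))).longValueExact();
                result.add(new TokenRange(prev, boundary));
                prev = boundary;
            }
            result.add(new TokenRange(prev, range.end)); // last sub-range absorbs the remainder
            return result;
        }
    }

In the actual InputFormat, each sub-range would then be wrapped back into its own split (e.g. a ColumnFamilySplit), so that Spark schedules one task per sub-range instead of one task per oversized partition.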
Thanks,
Mladen
Hello Mladen,
Yes, we have experienced this issue as well, although we weren't able to fix it.
Your solution sounds very interesting. Could you share your enhancement as a PR (even an unfinished one)?
We did some analysis of the source code back then, so I might be able to help with the PR/tests - feel free to contact me.
Best regards,
Evgenii Ignatev.
Having answered several questions about the JanusGraph InputFormats, I can confirm that many users run into problems with the size of the input splits. This is the case in particular for the HBaseInputFormat, where input splits are equal to HBase regions, and HBase wants regions to be on the order of 10GB in size (of compressed binary data!). Users could only work around this by manually and temporarily splitting the HBase regions. Problems surface less often for the CassandraInputFormat, because it defaults to about 500 partitions, so you need a lot of data before partition size becomes a limitation.
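As an illustration of that manual workaround (this is not an official JanusGraph tool; "janusgraph" is simply the default value of storage.hbase.table): the region splits can be scripted against the HBase Admin API, and the regions can be merged back once the OLAP job has finished. A minimal sketch:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class SplitJanusGraphRegions {
        public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml from the classpath for cluster coordinates.
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Admin admin = connection.getAdmin()) {
                // Ask HBase to split every region of the storage table.
                // The split request is asynchronous; progress is visible in the HBase UI.
                admin.split(TableName.valueOf("janusgraph"));
            }
        }
    }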
So, I also encourage you to contribute, if possible!
Also note that there is a fundamental problem with OLAP on graphs: traversing a graph implies shuffling between partitions, and this is only efficient if the entire graph fits in cluster memory. So, where the scalability of JanusGraph OLTP queries is limited by disk space and the performance of the indexing backend, the scalability of OLAP queries is limited by cluster memory.
Best wishes, Marc
Thanks for the responses.
I'll create a GitHub issue for this then, and also a PR with the changes that fixed the issue for me, in case anyone finds it useful.
I'm also interested in working on the spark-cassandra-connector implementation; however, it might take a while until I get around to it.
Just a quick update: I opened an issue for this and did some additional research. See https://github.com/JanusGraph/janusgraph/issues/2420 for more details.