Topics

Issues with controlling partitions when using Apache Spark


Mladen Marović
 

Hey there!

I've recently been working on some Apache Spark jobs for Janusgraph via hadoop-gremlin (as described on https://docs.janusgraph.org/advanced-topics/hadoop/) and encountered several issues. Generally, I kept having memory issues as the partitions were too big to be loaded into my spark executors (which I increased up to 16GB per executor).

After analysing the code, I found two parameters that could be used to further subsplit the partitions: cassandra.input.split.size and cassandra.input.split.size_mb. However, when trying to use these parameters, and debugging when the memory issues persisted, I noticed several bugs in the underlying org.apache.cassandra.hadoop.cql3.CqlInputFormat used to load the data. I posted the question on the datastax community forums (see https://community.datastax.com/questions/10153/how-to-control-partition-size-when-reading-data-wi.html). There I was ultimately suggested to migrate to the spark-cassandra-connector because the issues I encountered were probably bugs, but that was legacy code (and probably not maintained anymore).

In the meantime, I reimplemented the InputFormat classes in my app to fix the issues, and testing so far showed that this now works as intended. However, I was wondering the following:

1. Does anyone else have any experience with using Apache Spark, Janusgraph, and graphs too big to fit into memory without subsplitting? Did you also encounter this issue? If so, how did you deal with it?
2. Is there an "official" solution to this issue?
3. Are there any plans to migrate to the spark-cassandra connector for this use case?

Thanks,

Mladen


Florian Hockmann
 

Hi Mladen,

 

I wasn’t aware that the CqlInputFormat we’re using is considered legacy. Looks then like we should migrate to spark-cassandra-connector. Could you please create an issue on GitHub for this?

And if you already have an implementation ready for this, then it would of course be really great if you could contribute it with a PR.

 

Regards,

Florian

 

Von: janusgraph-users@... <janusgraph-users@...> Im Auftrag von Mladen Marovic
Gesendet: Montag, 25. Januar 2021 17:34
An: janusgraph-users@...
Betreff: [janusgraph-users] Issues with controlling partitions when using Apache Spark

 

Hey there!

 

I've recently been working on some Apache Spark jobs for Janusgraph via hadoop-gremlin (as described on https://docs.janusgraph.org/advanced-topics/hadoop/) and encountered several issues. Generally, I kept having memory issues as the partitions were too big to be loaded into my spark executors (which I increased up to 16GB per executor).

 

After analysing the code, I found two parameters that could be used to further subsplit the partitions: cassandra.input.split.size and cassandra.input.split.size_mb. However, when trying to use these parameters, and debugging when the memory issues persisted, I noticed several bugs in the underlying org.apache.cassandra.hadoop.cql3.CqlInputFormat used to load the data. I posted the question on the datastax community forums (see https://community.datastax.com/questions/10153/how-to-control-partition-size-when-reading-data-wi.html). There I was ultimately suggested to migrate to the spark-cassandra-connector because the issues I encountered were probably bugs, but that was legacy code (and probably not maintained anymore).

 

In the meantime, I reimplemented the InputFormat classes in my app to fix the issues, and testing so far showed that this now works as intended. However, I was wondering the following:

 

1. Does anyone else have any experience with using Apache Spark, Janusgraph, and graphs too big to fit into memory without subsplitting? Did you also encounter this issue? If so, how did you deal with it?

2. Is there an "official" solution to this issue?

3. Are there any plans to migrate to the spark-cassandra connector for this use case?

 

Thanks,

 

Mladen


Evgenii Ignatev
 

Hello Mladen,

Yes, we have experienced this issue as well, although we weren't able to fix it.

You solution sounds very interesting, could you share your enhacement as a PR (even not finished one)?
We have done some analysis of source code back then, I might be able to help with PR/tests - feel free to contact me.

Best regards,
Evgenii Ignatev.

On 26.01.2021 16:34, Florian Hockmann wrote:

Hi Mladen,

 

I wasn’t aware that the CqlInputFormat we’re using is considered legacy. Looks then like we should migrate to spark-cassandra-connector. Could you please create an issue on GitHub for this?

And if you already have an implementation ready for this, then it would of course be really great if you could contribute it with a PR.

 

Regards,

Florian

 

Von: janusgraph-users@... <janusgraph-users@...> Im Auftrag von Mladen Marovic
Gesendet: Montag, 25. Januar 2021 17:34
An: janusgraph-users@...
Betreff: [janusgraph-users] Issues with controlling partitions when using Apache Spark

 

Hey there!

 

I've recently been working on some Apache Spark jobs for Janusgraph via hadoop-gremlin (as described on https://docs.janusgraph.org/advanced-topics/hadoop/) and encountered several issues. Generally, I kept having memory issues as the partitions were too big to be loaded into my spark executors (which I increased up to 16GB per executor).

 

After analysing the code, I found two parameters that could be used to further subsplit the partitions: cassandra.input.split.size and cassandra.input.split.size_mb. However, when trying to use these parameters, and debugging when the memory issues persisted, I noticed several bugs in the underlying org.apache.cassandra.hadoop.cql3.CqlInputFormat used to load the data. I posted the question on the datastax community forums (see https://community.datastax.com/questions/10153/how-to-control-partition-size-when-reading-data-wi.html). There I was ultimately suggested to migrate to the spark-cassandra-connector because the issues I encountered were probably bugs, but that was legacy code (and probably not maintained anymore).

 

In the meantime, I reimplemented the InputFormat classes in my app to fix the issues, and testing so far showed that this now works as intended. However, I was wondering the following:

 

1. Does anyone else have any experience with using Apache Spark, Janusgraph, and graphs too big to fit into memory without subsplitting? Did you also encounter this issue? If so, how did you deal with it?

2. Is there an "official" solution to this issue?

3. Are there any plans to migrate to the spark-cassandra connector for this use case?

 

Thanks,

 

Mladen


hadoopmarc@...
 

Hi Mladen,

Having answered several questions about the JanusGraph InputFormats, I can confirm that many users encounter problems with the size of the input splits. This is the case in particular for the HBaseInputFormat where input splits are equal to HBase regions and HBase requires regions to have a size of the order of 10GB (compressed binary data!). Users could only work around this by manually and temporarily splitting the HBase regions. For the CassandraInputFormat problems surface less often, because there a default number of about 500 partitions is used, so you need a lot of data before partition size becomes a limitation.

So, I also encourage you to contribute, if possible!

Also note that there is a fundamental problem to OLAP in graphs: traversing a graph implies shuffling between partitions and this is only efficient if the entire graph fits in the cluster memory. So, where the scalability of JanusGraph OLTP queries is limited by disk space and the performance of the indexing backend, the scalability of OLAP queries is limited by cluster memory.

Best wishes,    Marc


Mladen Marović
 

Thanks for the responses.

I'll create a github issue for this then, and also create a PR with the changes that fixed this issue for me, in case anyone finds it useful.

I'm also interested in doing the spark-cassandra-connector implementation, however it might take a while until I get around to it.


Mladen Marović
 

Just a quick info: I opened an issue for this and did some additional research. See https://github.com/JanusGraph/janusgraph/issues/2420 for more details.