Yes, we have experienced this issue as well, although we weren't
able to fix it.
Your solution sounds very interesting. Could you share your
enhancement as a PR (even an unfinished one)?
We did some analysis of the source code back then, so I might be
able to help with the PR/tests. Feel free to contact me.
On 26.01.2021 16:34, Florian Hockmann wrote:
I wasn’t aware that the CqlInputFormat we’re
using is considered legacy. It then looks like we should
migrate to spark-cassandra-connector. Could you please
create an issue on GitHub for this?
And if you already have an implementation ready
for this, then it would of course be really great if you
could contribute it with a PR.
I've recently been working on some
Apache Spark jobs for JanusGraph via hadoop-gremlin
(as described on https://docs.janusgraph.org/advanced-topics/hadoop/)
and encountered several issues. Generally, I kept having
memory issues as the partitions were too big to be loaded
into my Spark executors (which I increased up to 16GB per
After analysing the code, I found two
parameters that could be used to further subsplit the
partitions: cassandra.input.split.size and cassandra.input.split.size_mb.
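For illustration, here is a minimal sketch of how these two keys might be merged into the properties passed to hadoop-gremlin. The helper names `with_split_settings` and `render_properties` are hypothetical, the values are arbitrary examples, and the assumption that `cassandra.input.split.size` counts rows while `cassandra.input.split.size_mb` is a size in megabytes should be checked against the Cassandra version in use. It only shows where the settings go, not that the input format honours them.

```python
# Hypothetical helpers, not part of JanusGraph, TinkerPop, or Cassandra.
# Only the two property keys come from the message above; the values
# used below are arbitrary examples, not recommendations.

def with_split_settings(base, split_size=None, split_size_mb=None):
    """Return a copy of `base` with the sub-split keys added when given."""
    props = dict(base)
    if split_size is not None:
        # Assumed to be a target number of rows per split.
        props["cassandra.input.split.size"] = str(split_size)
    if split_size_mb is not None:
        # Assumed to be a target split size in megabytes.
        props["cassandra.input.split.size_mb"] = str(split_size_mb)
    return props


def render_properties(props):
    """Render key=value lines in the style of a Java .properties file."""
    return "\n".join(f"{key}={value}" for key, value in sorted(props.items()))


if __name__ == "__main__":
    # The entries of your existing hadoop-gremlin properties would go in `base`.
    base = {}
    print(render_properties(with_split_settings(base, split_size=10000, split_size_mb=64)))
```

The rendered lines could then be appended to the properties file that the Spark job already loads.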
However, when I tried these parameters and debugged why
the memory issues persisted, I noticed
several bugs in the underlying org.apache.cassandra.hadoop.cql3.CqlInputFormat
used to load the data. I posted a question on the
DataStax community forums (see https://community.datastax.com/questions/10153/how-to-control-partition-size-when-reading-data-wi.html).
There, I was ultimately advised to migrate to the spark-cassandra-connector
because the issues I encountered were probably bugs, but
that was legacy code (and probably not maintained
In the meantime, I reimplemented the InputFormat
classes in my app to fix the issues, and testing so far
has shown that this now works as intended. However, I was
wondering the following:
1. Does anyone else have any experience
with using Apache Spark, JanusGraph, and graphs too big to
fit into memory without subsplitting? Did you also
encounter this issue? If so, how did you deal with it?
2. Is there an "official" solution to
3. Are there any plans to migrate to
connector for this use case?