I am working on OLAP using Spark and Hadoop. I have a couple of questions.
1. How to execute a filter step on the driver and create an RDD of internal ids?
2. How to distribute the collected ids to multiple Spark executors?
3. How to execute Gremlin in parallel?
Thanks & Regards,
JanusGraph defines Hadoop InputFormats for its storage backends to run OLAP queries; see https://docs.janusgraph.org/advanced-topics/hadoop/
However, these InputFormats have several problems regarding performance (see the old questions on this list), so your approach could be worthwhile:
1. It is best to create these IDs when the data is ingested into JanusGraph and to add them as a vertex property. If you create an index on this property, you can use these id properties for retrieval during OLAP queries.
2. Spark does this automatically: the RDD with ids is split into partitions, and if you call rdd.mapPartitions on it, your function runs on each partition on the executors.
3. This is where the disadvantage of the approach lies. You simply run the Gremlin query per partition of ids, but you have to merge the per-partition results afterwards, outside Gremlin. The merge logic differs per type of query.
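
To make point 3 a bit more concrete, here is a rough sketch in plain Python (no Spark or JanusGraph needed to run it). Everything in it is an assumption for illustration: VERTEX_LABELS is a hypothetical stand-in for the graph, split_into_partitions mimics how Spark would spread the ids over partitions, and the two per-partition functions stand in for Gremlin queries roughly like g.V(ids).count() and g.V(ids).groupCount().by(label). In a real job those functions would be passed to rdd.mapPartitions and would run the traversal against JanusGraph.

```python
from collections import Counter
from functools import reduce

# Hypothetical stand-in for the graph: vertex id -> label.
# A real job would query JanusGraph from each partition instead.
VERTEX_LABELS = {
    1: "person", 2: "person", 3: "software",
    4: "person", 5: "software", 6: "company",
}


def split_into_partitions(ids, num_partitions):
    """Round-robin split, mimicking how an RDD of ids is spread over partitions."""
    return [ids[i::num_partitions] for i in range(num_partitions)]


def count_in_partition(ids):
    """Stand-in for a per-partition count query over the given ids."""
    return sum(1 for vertex_id in ids if vertex_id in VERTEX_LABELS)


def group_count_in_partition(ids):
    """Stand-in for a per-partition group-count-by-label query."""
    return Counter(VERTEX_LABELS[v] for v in ids if v in VERTEX_LABELS)


def merge_counts(partials):
    """Merging a count: add up the partial counts."""
    return sum(partials)


def merge_group_counts(partials):
    """Merging a group count: add the partial maps key by key."""
    return reduce(lambda left, right: left + right, partials, Counter())


partitions = split_into_partitions(sorted(VERTEX_LABELS), 3)

# Each partition is processed independently, then merged outside the query.
total = merge_counts(count_in_partition(p) for p in partitions)
by_label = merge_group_counts(group_count_in_partition(p) for p in partitions)

print(total)
print(dict(by_label))
```

The two merge functions are different on purpose: a count is merged by summing numbers, a group count by adding maps key by key. Other query types (for example a dedup, an ordering or a limit) would each need their own merge step.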
Best wishes, Marc