My Experience with JanusGraph Bulk Loading Using SparkGraphComputer

I am sharing my experience with JanusGraph bulk loading:

Data Store: Cassandra
Index Store: Elasticsearch
Audit Data: Kafka Topic

Data loaded:  ~300M Nodes & ~371M Edges.
Spark Cluster: 33 Executors, 6 Cores

Cassandra + Elasticsearch IOPS: 30k per node
Total Time Taken: ~2 months. Most of the delay came from the edge creation process, because for every edge inserted, all nodes have to be scanned according to business rules. Performance degraded a little further because we send audit data to a Kafka topic for every node and edge created.
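
As a rough back-of-the-envelope check on the figures above, the average sustained rate can be estimated as below. This is only a sketch: it assumes "~2 months" means about 60 days of continuous loading, which is not stated in this post, and the variable names are purely illustrative.

```python
# Rough average-throughput estimate from the figures in this post.
# ASSUMPTION: "~2 months" is treated as 60 days of continuous loading.
nodes = 300_000_000
edges = 371_000_000
executors = 33
assumed_days = 60

total_elements = nodes + edges
seconds = assumed_days * 24 * 60 * 60
rate_per_second = total_elements / seconds
rate_per_executor = rate_per_second / executors

print(f"total elements: {total_elements:,}")
print(f"average rate: {rate_per_second:.0f} elements/s overall")
print(f"average rate: {rate_per_executor:.1f} elements/s per executor")
```

Under that assumption the average works out to roughly 130 elements per second across the whole cluster, far below the 30k IOPS per node available, which fits with the edge creation process rather than raw write capacity dominating the total time.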

Problems Faced:
-> Elasticsearch timeouts due to the high IOPS generated by Spark
-> Cassandra timeouts due to the high IOPS generated by Spark
-> 465 bad nodes created (probably due to a race condition in which the properties of 2 nodes were merged into a single node)
-> Count mismatch in the index data
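
One common way to soften backend timeouts like the ones listed above is to retry the failed write with exponential backoff and jitter, so that many Spark tasks do not all retry at the same moment. The sketch below is not what our loader did; it only illustrates the general technique. `TransientTimeout`, `with_backoff` and `flaky_commit` are hypothetical names, and a real job would catch whatever exception its storage or index client actually raises.

```python
import random
import time


class TransientTimeout(Exception):
    """Hypothetical stand-in for a storage or index backend timeout."""


def with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                 sleep=time.sleep):
    """Call operation(), retrying on TransientTimeout with backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientTimeout:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # Exponential delay, capped at max_delay; the actual wait is a
            # random value below that cap ("full jitter").
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


# Usage: a simulated commit that times out twice and then succeeds.
calls = {"n": 0}


def flaky_commit():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientTimeout("simulated timeout")
    return "committed"


# sleep is replaced by a no-op here so the example runs instantly.
result = with_backoff(flaky_commit, sleep=lambda seconds: None)
print(result, "after", calls["n"], "attempts")
```

Retrying only helps with transient overload. If the cluster is saturated for long periods, reducing the number of concurrent writers is likely the more effective change.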


Custom VertexPrograms Built:

1>   Node Creation VertexProgram

2>   Edge Creation VertexProgram

3>   Bulk Drop VertexProgram

4>   Reprocess VertexProgram for any missing data.

5>   CountVertexProgram based upon some business rule

6>   MultiVertexCreation: creates multiple vertices from a single input record and links them before committing (still to be tested on millions of records)

7>   A few more, based on requirements
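
The idea behind a reprocess step such as item 4 can be illustrated as a reconciliation between the keys expected from the input and the keys actually found in the graph. The sketch below uses small in-memory collections and hypothetical function names (`find_missing`, `find_duplicates`); it is not the VertexProgram itself, and at the scale described here the same comparison would have to run as a distributed job.

```python
from collections import Counter


def find_missing(expected_keys, loaded_keys):
    """Return keys present in the input but absent from the graph, sorted."""
    return sorted(set(expected_keys) - set(loaded_keys))


def find_duplicates(loaded_keys):
    """Return keys that appear more than once among the loaded vertices, sorted."""
    counts = Counter(loaded_keys)
    return sorted(key for key, n in counts.items() if n > 1)


# Usage with made-up business keys.
expected = ["a1", "a2", "a3", "a4"]
loaded = ["a1", "a2", "a2", "a4"]

missing = find_missing(expected, loaded)
duplicates = find_duplicates(loaded)
print("to reprocess:", missing)
print("loaded more than once:", duplicates)
```

The missing keys are the candidates for reprocessing, while duplicated keys point to records that may need cleanup before the counts can match.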


