My Experience with JanusGraph Bulk Loading Using SparkGraphComputer
I am sharing my experience with JanusGraph bulk loading:
Data Store: Cassandra
Index Store: Elasticsearch
Audit Data: Kafka Topic
Data loaded: ~300M Nodes & ~371M Edges.
Spark Cluster: 33 Executors, 6 Cores
Cassandra + Elasticsearch IOPS: 30k per node
Total Time Taken: ~2 months. (Most of the delay was in the edge-creation process, since inserting any edge required scanning nodes based on business rules. Performance degraded a little further because, for each node and edge created, we send audit data to a Kafka topic.)
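The edge-creation bottleneck above came from scanning nodes per edge to satisfy the matching rules. One common way to avoid that (a minimal, hypothetical sketch — `build_lookup`, `resolve_edge_endpoints`, and the key functions are illustrative names, not the actual pipeline) is to index the nodes once on the attribute the business rule matches on, so each edge does an O(1) lookup instead of a scan:

```python
# Hypothetical sketch: pre-build an in-memory lookup keyed on the business
# attribute the matching rule uses, instead of re-scanning nodes per edge.

def build_lookup(nodes, key_fn):
    """Index nodes once; key_fn encodes the business rule's match key."""
    lookup = {}
    for node in nodes:
        lookup.setdefault(key_fn(node), []).append(node)
    return lookup

def resolve_edge_endpoints(edge_row, lookup, key_fn):
    """O(1) candidate lookup instead of a full node scan per edge."""
    return lookup.get(key_fn(edge_row), [])

nodes = [{"id": 1, "acct": "A"}, {"id": 2, "acct": "B"}, {"id": 3, "acct": "A"}]
lookup = build_lookup(nodes, lambda n: n["acct"])
print([n["id"] for n in resolve_edge_endpoints({"acct": "A"}, lookup, lambda e: e["acct"])])
# prints [1, 3]
```

In a Spark job the same idea usually shows up as a broadcast variable or a pre-joined RDD rather than a plain dict, but the trade-off is the same: one pass to build the index, constant-time lookups afterwards.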
Problems Faced:
- Elasticsearch timeouts due to the high IOPS generated by Spark
- Cassandra timeouts due to the high IOPS generated by Spark
- 465 bad nodes created (likely due to a race condition where the properties of 2 nodes got merged into a single node)
- Mismatch of counts in the index data
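For the backend timeouts, a few JanusGraph settings can reduce the pressure Spark executors put on Cassandra and Elasticsearch. This is a hedged sketch, not our exact configuration; the values are illustrative and need tuning against your own cluster limits:

```properties
# Sketch of bulk-load settings that reduce backend pressure; values illustrative.
storage.backend=cql
storage.batch-loading=true      # relaxes consistency checks during bulk load
storage.buffer-size=2048        # larger mutation batches, fewer round trips
ids.block-size=1000000          # fewer ID-allocation round trips from executors
index.search.backend=elasticsearch
# Also consider throttling Spark-side parallelism (executor count/cores in the
# Hadoop-Gremlin config) rather than letting executors saturate backend IOPS.
```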
Custom VertexPrograms Built:
1> Node Creation VertexProgram
2> Edge Creation VertexProgram
3> Bulk Drop VertexProgram
4> Reprocess VertexProgram for any missing data.
5> CountVertexProgram based on a business rule
6> MultiVertexCreation with a single input record, linking before committing (still to be tested on millions of records)
7> A few more based on requirements
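The reprocess program above pairs naturally with the index-count mismatch we hit: diff the IDs committed to the primary store against those visible in the index, and re-emit only the missing ones. A minimal sketch of that reconciliation idea (plain sets stand in for the JanusGraph and Elasticsearch exports; `find_missing` is an illustrative name):

```python
# Hypothetical sketch of the reprocessing idea: IDs present in the primary
# store but absent from the index are the ones to re-emit.

def find_missing(store_ids, index_ids):
    """Return IDs committed to the store but missing from the index."""
    return sorted(set(store_ids) - set(index_ids))

store_ids = [101, 102, 103, 104]
index_ids = [101, 103, 104]
print(find_missing(store_ids, index_ids))
# prints [102]
```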
Thanks,
Abhay Kumar Pandit