My Experience with JanusGraph Bulk Loading Using SparkGraphComputer


Abhay Pandit <abha...@...>
 

Hi,

I am sharing my experience with JanusGraph bulk loading:

Data Store: Cassandra
Index Store: Elasticsearch
Audit Data: Kafka Topic
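
For anyone curious how the backends above are wired together, here is a minimal sketch of the kind of JanusGraph configuration this setup implies (hostnames and options are placeholders, not our exact values):

import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class GraphConfigSketch {
    public static JanusGraph open() {
        return JanusGraphFactory.build()
                .set("storage.backend", "cql")                 // Cassandra via the CQL driver
                .set("storage.hostname", "cassandra-host")     // placeholder hostname
                .set("storage.batch-loading", true)            // relax consistency checks for bulk loading
                .set("index.search.backend", "elasticsearch")
                .set("index.search.hostname", "elastic-host")  // placeholder hostname
                .open();
    }
}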

Data loaded:  ~300M Nodes & ~371M Edges.
Spark Cluster: 33 Executors, 6 Cores

Cassandra + Elasticsearch IOPS: ~30k per node
Total Time Taken: ~2 months (most of the delay came from the edge creation process, because inserting an edge requires scanning the existing nodes according to business rules. Performance dropped a bit further because we send audit data to a Kafka topic for every node and edge creation.)
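
The Kafka audit hook is conceptually just one producer call per created element. A rough sketch only (topic name, serializers and payload format are assumptions, not our actual implementation):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AuditPublisher {
    private final Producer<String, String> producer;

    public AuditPublisher(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    // publish one audit record per created node or edge ("graph-audit" is a placeholder topic)
    public void audit(String elementType, String elementId) {
        producer.send(new ProducerRecord<>("graph-audit", elementId,
                elementType + " created: " + elementId));
    }
}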

Problems Faced:
- Elasticsearch timeouts due to the high IOPS generated by Spark
- Cassandra timeouts due to the high IOPS generated by Spark
- 465 bad nodes created (most likely due to a race condition in which the properties of 2 nodes got merged into a single node)
- Mismatch between the index counts and the underlying data
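
One general way to keep the per-task write pressure (and hence the timeouts above) under control is to commit in bounded batches inside each Spark task. A rough sketch, not our actual code, with the batch size, vertex label and input shape assumed for illustration:

import java.util.Iterator;
import java.util.Map;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphTransaction;
import org.janusgraph.core.JanusGraphVertex;

public class BatchedWriter {
    private static final int BATCH_SIZE = 1_000;  // assumed batch size; tune against backend capacity

    // called once per Spark partition with the raw input rows for that partition
    public static void writePartition(JanusGraph graph, Iterator<Map<String, Object>> rows) {
        JanusGraphTransaction tx = graph.newTransaction();
        int pending = 0;
        while (rows.hasNext()) {
            JanusGraphVertex v = tx.addVertex("entity");   // placeholder vertex label
            rows.next().forEach(v::property);              // copy input fields onto the vertex
            if (++pending >= BATCH_SIZE) {
                tx.commit();                               // flush a bounded batch of mutations
                tx = graph.newTransaction();
                pending = 0;
            }
        }
        tx.commit();
    }
}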

 

Custom VertexPrograms Built (a sketch of how these get submitted follows the list):

1>   Node Creation VertexProgram

2>   Edge Creation VertexProgram

3>   Bulk Drop VertexProgram

4>   Reprocess VertexProgram for any missing data.

5>   CountVertexProgram based upon some business rule

6>   MultiVertexCreation VertexProgram that creates multiple vertices from a single input record and links them before committing (still to be tested on millions of records)

7>   A few more based on requirements
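
The submission pattern for these programs is roughly the following (a sketch only: the properties file path and worker count are placeholders, and TinkerPop's built-in PageRank program stands in here for one of our custom programs):

import org.apache.tinkerpop.gremlin.process.computer.ComputerResult;
import org.apache.tinkerpop.gremlin.process.computer.ranking.pagerank.PageRankVertexProgram;
import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.util.GraphFactory;

public class SubmitProgram {
    public static void main(String[] args) throws Exception {
        // the properties file (placeholder path) configures HadoopGraph, the JanusGraph
        // input format for Cassandra, and the Spark master / executor settings
        Graph graph = GraphFactory.open("conf/hadoop-graph/read-cql.properties");

        ComputerResult result = graph.compute(SparkGraphComputer.class)
                .workers(198)                                          // e.g. 33 executors x 6 cores
                .program(PageRankVertexProgram.build().create(graph))  // stand-in for a custom program
                .submit()
                .get();

        System.out.println("runtime (ms): " + result.memory().getRuntime());
        graph.close();
    }
}

A custom program replaces the stand-in by implementing TinkerPop's VertexProgram interface (setup / execute / terminate), with the node- or edge-creation logic living in execute().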

 

Thanks,
Abhay Kumar Pandit


natali2...@...
 

Can you please provide a basic code example of applying SparkGraphComputer for loading data? I cannot find any code examples to understand how Spark is used for bulk loading (creating nodes and edges). Thanks.

On Thursday, 20 February 2020 at 22:03:21 UTC+3, Abhay Pandit wrote:
