Bulk Loading with Spark


Joe Obernberger
 

Hi All - I'm trying to use Spark to do a bulk load, but it's very slow.
The Cassandra cluster I'm connecting to is a bare-metal, 15-node cluster.

I'm using Java code to do the loading, calling
GraphTraversalSource.addV and Vertex.addEdge in a loop.

Is there a better way?
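In rough form, the loop described above looks like this (a minimal sketch; `Record`, `records`, and the `person`/`childOf` label and property names are invented for illustration and are not from the message):

```java
// Per-element pattern: one traversal per vertex/edge, which under the
// default transaction settings means many small round trips to Cassandra.
GraphTraversalSource g = graph.traversal();
for (Record r : records) {                    // `records` is hypothetical input
    Vertex v = g.addV("person")               // invented vertex label
                .property("name", r.getName()) // invented property
                .next();
    if (r.getParent() != null) {
        v.addEdge("childOf", r.getParent());  // Vertex.addEdge, as in the message
    }
}
g.tx().commit();
```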

Thank you!

-Joe




Joe Obernberger
 

Should have added - I'm connecting with:

JanusGraph graph = JanusGraphFactory.build()
                .set("storage.backend", "cql")
                .set("storage.hostname", "charon:9042, chaos:9042")
                .set("storage.cql.keyspace", "graph")
                .set("storage.cql.cluster-name", "JoeCluster")
                .set("storage.cql.only-use-local-consistency-for-system-operations", "true")
                .set("storage.cql.batch-statement-size", 256)
                .set("storage.cql.local-max-connections-per-host", 8)
                .set("storage.cql.read-consistency-level", "ONE")
                .set("storage.batch-loading", true)
                .set("schema.default", "none")
                .set("ids.block-size", 100000)
                .set("storage.buffer-size", 16384)
                .open();
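For what it's worth, the usual pattern with a graph opened this way is to buffer mutations and commit in batches, so each commit flushes many rows per CQL round trip rather than one. A minimal sketch, assuming the `graph` opened above; `records`, the label/property names, and the batch size are all invented for illustration:

```java
// Batched-commit pattern: accumulate additions and commit every
// BATCH_SIZE elements so the storage backend sees large writes.
final int BATCH_SIZE = 10_000;            // tuning value, not from the thread
GraphTraversalSource g = graph.traversal();
long count = 0;
for (Record r : records) {                // hypothetical input
    g.addV("person")                      // invented label
     .property("name", r.getName())
     .iterate();                          // iterate() when the Vertex isn't needed
    if (++count % BATCH_SIZE == 0) {
        g.tx().commit();                  // flush the batch
    }
}
g.tx().commit();                          // commit the tail
```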


-Joe

On 5/20/2022 5:28 PM, Joe Obernberger via lists.lfaidata.foundation wrote:


hadoopmarc@...
 

Hi Joe,

What is slow? Can you please check the Expero blog series and compare to their reference numbers (per parallel spark task):

https://www.experoinc.com/post/janusgraph-nuts-and-bolts-part-1-write-performance

Best wishes,

Marc


Joe Obernberger
 

Thank you Marc - something isn't right with my code; I'm debugging it now.  Right now the graph has 4,339,690 vertices and 15,707,179 edges, but that took days to build, and it is probably only 5% of the data.
Querying the graph is fast.

-Joe

On 5/22/2022 7:53 AM, hadoopmarc@... wrote:

