Re: Bulk loading into JanusGraph with HBase


Jason Plurad <plu...@...>
 

Thanks for sharing all that info; it makes it much easier to have a constructive conversation.

Your default batch size of 100,000 between commits looks really large. After dropping that down to 5,000, these were my results running on my machine (2015 MacBook Pro, 2.8 GHz Intel Core i7 quad core, 16 GB RAM, 1 TB SSD):

Time needed for loading schema into the graph in milliseconds: 94592
Time needed for loading data into the graph in milliseconds: 4587774
Time needed for loading vertices into the graph in milliseconds: 302718
Time needed for loading properties into the graph in milliseconds: 13071
Time needed for loading edges into the graph in milliseconds: 4271985
Total duration in milliseconds: 4682366

Time Elapsed for loading schema into the graph: 000h.01m.34s
Time Elapsed for loading data into the graph: 001h.16m.27s
Total duration: 001h.18m.2s
vertices: 3181724, edges: 17436661
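
For reference, the ingestion rates can be worked out from the counts and timings above. This is only a minimal sketch: the class and method names are made up for illustration, and it assumes that all vertices were loaded during the vertex phase and all edges during the edge phase. With those numbers it comes out to roughly 10,500 vertices per second and roughly 4,080 edges per second.

```java
public class IngestionRate {

    // Items loaded per second, given an item count and an elapsed time in milliseconds.
    static double perSecond(long count, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsedMillis must be positive");
        }
        return count * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) {
        // Counts and timings copied from the run reported above.
        long vertices = 3_181_724L;
        long vertexMillis = 302_718L;
        long edges = 17_436_661L;
        long edgeMillis = 4_271_985L;

        System.out.printf("vertices/s: %.1f%n", perSecond(vertices, vertexMillis));
        System.out.printf("edges/s:    %.1f%n", perSecond(edges, edgeMillis));
    }
}
```

The same calculation works for properties if the importer also reports how many properties it wrote; that count is not in the output above.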

Not sure what your machine specs are, but that's already more than 2x faster than the 3h you saw on your laptop. I didn't spend much more time on it, but experimenting with the batch size could get you better results.
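
To make the batch size easy to experiment with, the commit-every-N pattern can be isolated in one place. The following is a rough sketch, not the code from the importer repo: `BatchedLoader`, `load`, `addItem`, and `commit` are hypothetical names. In a real loader, `addItem` would presumably wrap the call that adds a vertex or edge, and `commit` would wrap the transaction commit (for example `graph.tx().commit()`).

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class BatchedLoader {

    /**
     * Applies addItem to every row and runs commit after each full batch,
     * plus once more for a trailing partial batch.
     * Returns the number of commits performed.
     */
    static <T> int load(Iterable<T> rows, int batchSize,
                        Consumer<? super T> addItem, Runnable commit) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int pending = 0;
        int commits = 0;
        for (T row : rows) {
            addItem.accept(row);
            pending++;
            if (pending == batchSize) {
                commit.run();
                commits++;
                pending = 0;
            }
        }
        // Flush whatever is left over after the last full batch.
        if (pending > 0) {
            commit.run();
            commits++;
        }
        return commits;
    }

    public static void main(String[] args) {
        // Stand-in rows; a real run would iterate over the parsed CSV records.
        List<String> rows = Arrays.asList("a", "b", "c", "d", "e");
        int commits = load(rows, 2, row -> { }, () -> { });
        System.out.println("commits: " + commits);
    }
}
```

With the batch size passed in as a parameter, trying 1,000, 5,000, or 20,000 is a one-argument change, and the timings for each run can be compared directly.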

You mentioned you saw 3h on your local laptop vs 12h on the HBase cluster. This sounds like either your cluster is misconfigured/unoptimized or there is high latency between your client application and the cluster.


On Friday, October 6, 2017 at 11:51:03 AM UTC-4, Michele Polonioli wrote:
https://drive.google.com/file/d/0B-f-jjH6bDhnZUx1RkoyOElEQlE/view?usp=sharing

Here is a zip containing the data that took 12h to load on a cluster.

I also tried to load that data into JanusGraph-HBase with the default configuration on a laptop, and the loading took 3h.

On Friday, October 6, 2017 at 17:36:39 UTC+2, Michele Polonioli wrote:
I created a repository on GitHub with my code and a very small CSV sample here: https://github.com/mpolonioli/JanusGraph-importer-example.

The CSV files that I provided with the repo are a very small example. The full dataset is 1.2 GB of files, and loading it took about 12 hours.

My deployment of JanusGraph is on a Hadoop cluster composed of 4 nodes, with HBase installed through Cloudera Manager.

I didn't measure the ingestion rate for vertices, properties, and edges, and I don't actually know how to do that.

I apologize for the wrong comment in my code. That code partially comes from an implementation of a titan-importer, and I forgot to delete the comment.

I'm wondering whether there is a way to load the data directly into HBase without using the JanusGraph API, or whether my code can be optimized.

Hope this helps to solve my problem, thank you.
On Friday, October 6, 2017 at 16:30:42 UTC+2, Jason Plurad wrote:
Thanks for providing the code. It would be even better if you shared everything as a GitHub project that's easy to clone and build, contains the CSV files, and also includes the specific parameters you're passing into the program, like batchSize.

You didn't mention how slow is slow. What is the ingestion rate for vertices, properties, and edges? Some more concrete details would be helpful. What does your HBase deployment look like?

         * Note: For unknown reasons, it seems that each modification to the
         * schema must be committed in its own transaction.

I noticed this comment in the code. I don't think that's true, and GraphOfTheGodsFactory does all of its schema updates in one mgmt transaction. I'd be interested to hear more details on this scenario too.

On Thursday, October 5, 2017 at 10:17:06 AM UTC-4, Michele Polonioli wrote:
I have JanusGraph using HBase as the storage backend on a Hadoop cluster.

I need to load a very large quantity of data that represents a social network graph stored in CSV files.
So far I have created a Java program that creates the schema and loads vertices and edges using Gremlin.


The problem is that this method is very slow.


Is there a way to perform bulk loading into HBase in order to significantly reduce the loading times?


The CSV files come from ldbc_snb_datagen: https://github.com/ldbc/ldbc_snb_datagen

I'll attach a small portion of the files I need to load and the Java classes that I wrote.

Thanks.
