Re: Bulk loading into JanusGraph with HBase


Michele Polonioli <michele....@...>
 

https://drive.google.com/file/d/0B-f-jjH6bDhnZUx1RkoyOElEQlE/view?usp=sharing

Here is a zip containing the data that took 12 hours to load on the cluster.

I also tried to load that data into JanusGraph-HBase with the default configuration on a laptop, and the loading took 3 hours.


On Friday, October 6, 2017 at 17:36:39 UTC+2, Michele Polonioli wrote:
I created a repository on GitHub with my code and some very small CSV samples here: https://github.com/mpolonioli/JanusGraph-importer-example.

The CSV files that I provided with the repo are a very small example; I loaded 1.2 GB of files in about 12 hours.

My deployment of JanusGraph is on a Hadoop cluster composed of 4 nodes, with HBase installed via Cloudera Manager.

I didn't measure the ingestion rate for vertices, properties, and edges, and I actually don't know how to do that; presumably timing the load loop and dividing the element count by the elapsed seconds would work, as in the sketch below.
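A rough sketch of what I mean, where readCsv(), the person label, the personId key, and the batch size are placeholders rather than my importer's real names:

    import org.apache.tinkerpop.gremlin.structure.T;
    import org.janusgraph.core.JanusGraph;
    import org.janusgraph.core.JanusGraphFactory;

    public class RateProbe {
        public static void main(String[] args) throws Exception {
            JanusGraph graph = JanusGraphFactory.open("conf/janusgraph-hbase.properties");
            int batchSize = 10_000;                   // illustrative batch size
            long count = 0;
            long start = System.nanoTime();
            for (String[] row : readCsv()) {          // readCsv(): hypothetical CSV reader
                graph.addVertex(T.label, "person", "personId", Long.valueOf(row[0]));
                if (++count % batchSize == 0) {
                    graph.tx().commit();              // commit in batches, not per element
                }
            }
            graph.tx().commit();
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("%d vertices in %.1f s (%.0f vertices/s)%n",
                    count, seconds, count / seconds);
            graph.close();
        }

        // Stand-in for the real CSV parsing; returns no rows here.
        private static Iterable<String[]> readCsv() {
            return java.util.Collections.emptyList();
        }
    }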

I apologize for the wrong comment in my code; that code partially comes from an implementation of a Titan importer, and I forgot to delete that comment.

I'm wondering if there is a way to load the data directly into HBase, without using the JanusGraph API, or if my code can be optimized.
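In case tuning is the answer, here is a minimal sketch of opening the graph with the bulk-loading settings JanusGraph exposes (the hostname and the numeric values are placeholders, not tested recommendations):

    import org.janusgraph.core.JanusGraph;
    import org.janusgraph.core.JanusGraphFactory;

    public class BulkOpen {
        public static JanusGraph open() {
            // storage.batch-loading disables locking and some consistency
            // checks, so it assumes the input data is already consistent.
            return JanusGraphFactory.build()
                    .set("storage.backend", "hbase")
                    .set("storage.hostname", "zk-host")   // placeholder ZooKeeper quorum
                    .set("storage.batch-loading", true)
                    .set("ids.block-size", 1_000_000)     // bigger ID blocks for insert-heavy loads
                    .open();
        }
    }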

I hope this helps in solving my problem. Thank you.
On Friday, October 6, 2017 at 16:30:42 UTC+2, Jason Plurad wrote:
Thanks for providing the code. It would be even better if you shared everything as a GitHub project that's easy to clone and build, contains the CSV files, and also the specific parameters you're sending into the program, like batchSize.

You didn't mention how slow is slow. What is the ingestion rate for vertices, properties, and edges? Some more concrete details would be helpful. What does your HBase deployment look like?

         * Note: For unknown reasons, it seems that each modification to the
         * schema must be committed in its own transaction.

I noticed this comment in the code. I don't think that's true; GraphOfTheGodsFactory does all of its schema updates in one mgmt transaction. I'd be interested to hear more details on this scenario too.
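For reference, here is a minimal sketch in that style, with several schema elements defined and committed in a single management transaction (the labels and keys are illustrative):

    import org.janusgraph.core.JanusGraph;
    import org.janusgraph.core.schema.JanusGraphManagement;

    public class SchemaInOneTx {
        public static void define(JanusGraph graph) {
            JanusGraphManagement mgmt = graph.openManagement();
            mgmt.makeVertexLabel("person").make();
            mgmt.makeEdgeLabel("knows").make();
            mgmt.makePropertyKey("firstName").dataType(String.class).make();
            mgmt.makePropertyKey("creationDate").dataType(Long.class).make();
            mgmt.commit();  // a single commit covering all schema updates
        }
    }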

On Thursday, October 5, 2017 at 10:17:06 AM UTC-4, Michele Polonioli wrote:
I have JanusGraph using HBase as the storage backend on a Hadoop cluster.

I need to load a very large quantity of data that represents a social network graph, mapped in CSV files.
So far I have created a Java program that creates the schema and loads vertices and edges using Gremlin.
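In outline, loading an edge through the Gremlin API follows this pattern (the labels and the personId key are illustrative stand-ins, not my exact code):

    import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
    import org.apache.tinkerpop.gremlin.structure.Vertex;
    import org.janusgraph.core.JanusGraph;

    public class EdgeLoad {
        public static void link(JanusGraph graph, long fromId, long toId) {
            GraphTraversalSource g = graph.traversal();
            // Without a composite index on personId, these lookups scan all vertices.
            Vertex from = g.V().has("person", "personId", fromId).next();
            Vertex to = g.V().has("person", "personId", toId).next();
            from.addEdge("knows", to);
            graph.tx().commit();  // committing per edge is simple but slow; batching helps
        }
    }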


The problem is that this method is very slow.


Is there a way to perform bulk loading into HBase in order to significantly reduce the loading times?


The CSV files come from ldbc_snb_datagen: https://github.com/ldbc/ldbc_snb_datagen

I'll attach a small portion of the files I need to load and the Java classes that I wrote.

Thanks.
