Re: Bulk loading into JanusGraph with HBase


Joe Obernberger <joseph.o...@...>
 

We are having similar issues with performance when loading graph data into JanusGraph backed by HBase.  I agree with Jason: we didn't have any issues doing all the mgmt calls in one transaction.

One thing that we did was to multi-thread the Java code, which certainly helped performance.  HBase seems to respond well to multiple concurrent calls.  For example, in your loadVerticies method, you may want to spawn a task inside the main for loop and hand it to a pool of maybe 32 threads (depends on the machine you're running on).  I use the Java ExecutorService, like:

ExecutorService doWork = Executors.newFixedThreadPool(MAX_WORK_CALLS);
Semaphore smDoWork = new Semaphore(MAX_WORK_CALLS);

try {
    smDoWork.acquire();                  // blocks when all workers are busy
} catch (InterruptedException ex) {
    log.error("Interrupt: " + ex);
}
Runnable someThread = new doJanusStuff(this);  // your worker Runnable
doWork.execute(someThread);

Just make sure to release the semaphore when the thread completes.  All that said, performance was then limited by the one machine doing the ingesting, and it still seemed slower than one would expect.  In our case, generating a graph with 154 million vertices and ~275 million edges took 3 days on a 5-node Hadoop cluster.
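The bounded-pool pattern above can be sketched end to end with only the JDK, releasing the semaphore in a finally block so a failed task can never starve the pool.  The loop body and the count of 100 tasks are stand-ins for the real vertex-loading work:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedIngest {
    static final int MAX_WORK_CALLS = 32;   // tune to your machine

    public static void main(String[] args) throws InterruptedException {
        ExecutorService doWork = Executors.newFixedThreadPool(MAX_WORK_CALLS);
        Semaphore smDoWork = new Semaphore(MAX_WORK_CALLS);
        AtomicInteger loaded = new AtomicInteger();

        for (int i = 0; i < 100; i++) {      // stand-in for the vertex loop
            smDoWork.acquire();              // block if all workers are busy
            doWork.execute(() -> {
                try {
                    // ... load one batch of vertices/edges here ...
                    loaded.incrementAndGet();
                } finally {
                    smDoWork.release();      // always release, even on failure
                }
            });
        }
        doWork.shutdown();
        doWork.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println(loaded.get());
    }
}
```

The semaphore caps how far the producer loop can run ahead of the workers, which keeps memory bounded when the CSV reader is much faster than the HBase writes.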

-Joe


On 10/6/2017 10:30 AM, Jason Plurad wrote:

Thanks for providing the code. It would be even better if you shared everything as a GitHub project that's easy to clone and build, contains the CSV files, and also the specific parameters you're sending into the program, like batchSize.

You didn't mention how slow is slow. What is the ingestion rate for vertices, properties, and edges? Some more concrete details would be helpful. What does your HBase deployment look like?

         * Note: For unknown reasons, it seems that each modification to the
         * schema must be committed in its own transaction.

I noticed this comment in the code. I don't think that's true, and GraphOfTheGodsFactory does all of its schema updates in one mgmt transaction. I'd be interested to hear more details on this scenario too.
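A minimal sketch of doing all schema definitions in a single management transaction, the way GraphOfTheGodsFactory does.  This assumes JanusGraph's standard management API; the property keys, labels, and index name here are made up for illustration, and it needs a live JanusGraph instance to actually run:

```java
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.PropertyKey;
import org.janusgraph.core.schema.JanusGraphManagement;

public class SchemaInOneTxn {
    public static void defineSchema(JanusGraph graph) {
        JanusGraphManagement mgmt = graph.openManagement();
        try {
            // several schema elements, one management transaction
            PropertyKey name = mgmt.makePropertyKey("name").dataType(String.class).make();
            mgmt.makePropertyKey("birthday").dataType(Long.class).make();
            mgmt.makeVertexLabel("person").make();
            mgmt.makeEdgeLabel("knows").make();
            mgmt.buildIndex("byName", Vertex.class).addKey(name).buildCompositeIndex();
            mgmt.commit();           // one commit covers all of the above
        } catch (RuntimeException e) {
            mgmt.rollback();         // leave no half-defined schema behind
            throw e;
        }
    }
}
```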

On Thursday, October 5, 2017 at 10:17:06 AM UTC-4, Michele Polonioli wrote:
I have JanusGraph using HBase as the backend storage on a Hadoop cluster.

I need to load a very large quantity of data representing a social network graph, mapped in CSV files.
So far I have created a Java program that creates the schema and loads vertices and edges using Gremlin.


The problem is that this method is very slow.


Is there a way to perform bulk loading into HBase in order to significantly reduce the loading times?


The CSV files come from ldbc_snb_datagen: https://github.com/ldbc/ldbc_snb_datagen

I'll attach a small portion of the files I need to load and the Java classes that I wrote.

Thanks.

