Connect JanusGraph 0.5.2 to Spark standalone 2.4.0


Petr Stentor <kiri...@...>

Hi!
Can anybody help me connect JanusGraph 0.5.2 to a standalone Spark 2.4.0 cluster?

I am trying to load data into HBase (2.1.0-cdh6.3.2) with these commands:

graph = GraphFactory.open('conf/hadoop-graph/hadoop-load.properties')
blvp = BulkLoaderVertexProgram.build().writeGraph('conf/janusgraph-hbase-es.properties').create(graph)
graph.compute(SparkGraphComputer).program(blvp).submit().get()
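
(In case it matters: I run the commands above from the Gremlin Console, and I assume the standard TinkerPop plugin activation is all that is needed on that side, i.e. something like:)

:plugin use tinkerpop.hadoop
:plugin use tinkerpop.spark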

I resolved many dependency conflicts just by copying jar files from the JanusGraph distribution into /usr/local/spark/jars (maybe this is the wrong way to resolve dependencies).
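
(If it is relevant: my understanding is that the usual alternative to copying jars is to point the Spark driver and executors at the JanusGraph lib directory through the classpath properties, roughly like the lines below. The path is just a placeholder for wherever JanusGraph is unpacked; I have not verified that this avoids the conflicts:)

spark.driver.extraClassPath=/opt/janusgraph-0.5.2/lib/*
spark.executor.extraClassPath=/opt/janusgraph-0.5.2/lib/*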
But now I am stuck on this error:

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1887)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1875)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1874)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1874)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2108)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:935)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:933)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:933)
        at org.apache.spark.api.java.JavaRDDLike$class.foreachPartition(JavaRDDLike.scala:219)
        at org.apache.spark.api.java.AbstractJavaRDDLike.foreachPartition(JavaRDDLike.scala:45)
        at org.apache.tinkerpop.gremlin.spark.process.computer.SparkExecutor.executeVertexProgramIteration(SparkExecutor.java:179)
        at org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer.lambda$submitWithExecutor$1(SparkGraphComputer.java:399)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException
        at org.apache.hadoop.hbase.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:129)
        at org.apache.hadoop.ipc.Client.setCallIdAndRetryCount(Client.java:118)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:99)
        at com.sun.proxy.$Proxy20.getBlockLocations(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1226)
        at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1213)
        at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1201)
        at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:306) 
        at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:272)
        at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:264)
        at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1526)
        at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
        at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
        at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoRecordReader.initialize(GryoRecordReader.java:83)
        at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat.createRecordReader(GryoInputFormat.java:38)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:197)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:196)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:151)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
        ... 3 more


My hadoop-load.properties config file:

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat
gremlin.hadoop.graphWriter=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.inputLocation=./data
gremlin.hadoop.outputLocation=output
gremlin.hadoop.jarsInDistributedCache=true

#
# SparkGraphComputer Configuration
#
spark.master=spark://192.168.1.13:7077
spark.driver.host=192.168.1.13
spark.driver.bindAddress=0.0.0.0
spark.executor.instances=1
spark.executor.cores=12
spark.executor.memory=8g
spark.dynamicAllocation.enabled=false
#spark.io.compression.codec=snappy
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoRegistrator
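
(One thing I was unsure about: whether gremlin.hadoop.inputLocation should be a local path or an HDFS path when the executors run on a standalone Spark cluster. If it has to be fully qualified, I assume it would look something like the lines below; the host, port and path here are only guesses for illustration:)

gremlin.hadoop.inputLocation=hdfs://192.168.1.13:8020/user/janusgraph/data
gremlin.hadoop.outputLocation=hdfs://192.168.1.13:8020/user/janusgraph/output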



My janusgraph-hbase-es.properties file:
gremlin.graph=org.janusgraph.core.JanusGraphFactory
schema.default = none
ids.block-size = 29000
storage.backend=hbase
storage.hostname=192.168.1.13
storage.hbase.region-count=8
storage.hbase.regions-per-server=2
storage.batch-loading=true
storage.hbase.table=jstest1
cache.db-cache = true
cache.db-cache-clean-wait = 20
cache.db-cache-time = 180000
cache.db-cache-size = 0.5
index.search.backend=elasticsearch
index.search.hostname=192.168.1.13
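
(For completeness, I believe the same properties file can also be opened directly, without Spark, from the Gremlin Console to check the HBase/Elasticsearch side in isolation; this is just the standard JanusGraphFactory usage, nothing custom:)

graph = JanusGraphFactory.open('conf/janusgraph-hbase-es.properties')
g = graph.traversal()
g.V().count()
graph.close()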

Thanks!