Guava version issue when using a Spark job to write to JanusGraph with HBase configured as backend
Yifeng Liu <laitu...@...>
Hi folks,

My team is building a knowledge graph pipeline with JanusGraph. Since the team already has a few years of experience with HBase, we'd love to have HBase as the JanusGraph backend. Here is the cluster setup:

- OS: Ubuntu 16.04.5 LTS/Xenial Xerus
- JanusGraph version: 0.3.1
- HBase cluster version: 2.1.1
- Spark version: 2.3.1

When submitting the Spark job with an uber jar, the job fails with this error:

    java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.createStarted()Lcom/google/common/base/Stopwatch;
        at org.janusgraph.graphdb.database.idassigner.StandardIDPool$IDBlockGetter.<init>(StandardIDPool.java:269)
        at org.janusgraph.graphdb.database.idassigner.StandardIDPool.startIDBlockGetter(StandardIDPool.java:251)
        at org.janusgraph.graphdb.database.idassigner.StandardIDPool.nextBlock(StandardIDPool.java:178)
        at org.janusgraph.graphdb.database.idassigner.StandardIDPool.nextID(StandardIDPool.java:208)
        at org.janusgraph.graphdb.database.idassigner.VertexIDAssigner.assignID(VertexIDAssigner.java:333)
        at org.janusgraph.graphdb.database.idassigner.VertexIDAssigner.assignID(VertexIDAssigner.java:182)
        at org.janusgraph.graphdb.database.idassigner.VertexIDAssigner.assignID(VertexIDAssigner.java:153)
        at org.janusgraph.graphdb.database.StandardJanusGraph.assignID(StandardJanusGraph.java:460)
        at org.janusgraph.graphdb.transaction.StandardJanusGraphTx.addVertex(StandardJanusGraphTx.java:514)
        at org.janusgraph.graphdb.transaction.StandardJanusGraphTx.addVertex(StandardJanusGraphTx.java:532)
        at org.janusgraph.graphdb.transaction.StandardJanusGraphTx.addVertex(StandardJanusGraphTx.java:528)
        at com.snafu.BulkLoader.addVertex(BulkLoader.java:189)
        at com.snafu.BulkLoader.bulkLoad(BulkLoader.java:130)
        at com.snafu.SparkBulkLoader.lambda$main$1282d8df$1(SparkBulkLoader.java:33)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:929)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:929)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

I've been googling around and found a few posts that seem to describe the same problem: both Spark and HBase use an older Guava version than JanusGraph does. A Stack Overflow post suggests a promising approach, but I don't really want to rebuild JanusGraph to shade Guava, along with the chain-reaction-ish version tweaks that would follow. Issue 488 seems close to the issue I am having right now, but I am not 100% sure.

Any help will be super appreciated.

Cheers,
~Yifeng
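For context on the error above: Stopwatch.createStarted() only exists in Guava 15.0 and later, so the NoSuchMethodError means an older Guava (the one shipped by Spark or HBase) is shadowing the version JanusGraph was built against. A quick generic JVM check, not taken from this thread, to confirm which jar actually supplies the class at runtime:

    // GuavaProbe.java -- a generic diagnostic sketch, not code from this thread.
    // Requires Guava on the compile classpath. Prints the jar the JVM loaded
    // com.google.common.base.Stopwatch from; run it (or inline the println
    // inside the Spark job) with the executors' classpath to see which
    // Guava version wins.
    public class GuavaProbe {
        public static void main(String[] args) {
            System.out.println(com.google.common.base.Stopwatch.class
                    .getProtectionDomain().getCodeSource().getLocation());
        }
    }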
HadoopMarc <bi...@...>
Hi Yifeng,

The good news: the JanusGraph team has already done the shading for you, but they put it in the janusgraph-hbase jar. If you want to benefit from it, stop using spark-submit with the uber jar and configure your Spark application as described in the links provided in another running thread on janusgraph-hbase with Spark. Let the error messages guide you, and if they are hard to interpret, feel free to come back to this thread.

Cheers, Marc

On Monday, March 11, 2019 at 14:55:49 UTC+1, Yifeng Liu wrote:
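The links Marc refers to were not preserved in this archived copy. The sketch below is an assumption about their gist, not a quote from them: one common way to let the distribution's pre-shaded janusgraph-hbase jars win over an uber jar is to put the JanusGraph lib directory on the Spark classpaths directly. The two property names are standard Spark configuration; the install path is illustrative.

    # spark-defaults.conf (or equivalent --conf flags); path is hypothetical
    spark.driver.extraClassPath      /opt/janusgraph-0.3.1/lib/*
    spark.executor.extraClassPath    /opt/janusgraph-0.3.1/lib/*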
Yifeng Liu <laitu...@...>
Hi Marc,

Thanks for your reply. I went through the post and your blogs thoroughly. Unfortunately, the infrastructure is maintained by another team, so some of the setup mentioned in your blog is pretty tough to make happen, and even where it is possible, it takes forever via a ticketing system.

For the bulk loading part, we still stick to running embedded JanusGraph on top of a Spark job with HBase configured as the backend. We came up with an aggressive relocation of the entire com.google.* package, which resolves the Guava version issue. Such shading might backfire on us one day, but for now it certainly unblocks us and we can continue the work. The following is an example of the shade plugin we are using:

<plugin> </plugin>

On Tuesday, March 12, 2019 at 2:19:54 AM UTC+8, HadoopMarc wrote:
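The <plugin> body above came through empty in the archived copy, so the following is a reconstruction rather than Yifeng's actual POM: a minimal sketch of an aggressive com.google relocation with the maven-shade-plugin, where the plugin version and the shadedPattern name are assumptions.

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.2.1</version> <!-- assumed version -->
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <relocations>
              <!-- Rewrite every com.google.* reference bundled into the
                   uber jar so JanusGraph's Guava no longer collides with
                   the older Guava shipped by Spark and HBase. -->
              <relocation>
                <pattern>com.google</pattern>
                <shadedPattern>shaded.com.google</shadedPattern> <!-- assumed name -->
              </relocation>
            </relocations>
          </configuration>
        </execution>
      </executions>
    </plugin>

With a relocation like this, the shade plugin copies the com.google classes into the uber jar under the new package name and rewrites all bytecode references to them, so whatever Guava version Spark and HBase place on the executor classpath can no longer shadow the one JanusGraph needs.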
HadoopMarc <bi...@...>
Hi Yifeng,

Thanks for posting back your alternative approach. I will certainly keep it in mind for when I get stuck!

Cheers, Marc

On Friday, March 22, 2019 at 03:56:50 UTC+1, Yifeng Liu wrote:
Ryan Stauffer <ry...@...>
Yifeng,

+1 on this alternative approach! Here's an additional Google reference on managing Spark dependencies through shading:

Ryan

On Sunday, March 24, 2019 at 7:34:54 AM UTC-7, HadoopMarc wrote:
Yash Datta <sau...@...>
Hello Yifeng!

Thanks for this; I encountered the same problem and your post saved my day. If you do not mind, could you take a look at my approach to loading data into JanusGraph using Spark and let me know if there are any issues with it? https://github.com/astrolabsoftware/grafink/blob/master/docs/LoadAlgorithm.md

PS: This is an open source project I am working on as part of GSoC 2020.

Thanks and best regards,
Yash

On Friday, 22 March 2019 at 10:56:50 UTC+8, Yifeng Liu wrote: