JanusGraph with YARN and HBase


Fábio Dapper <fda...@...>
 

Hello, we have a cluster running Cloudera CDH 6.3.2, and I'm trying to run JanusGraph on it with YARN and HBase, so far without success.
(It works with Spark in local mode.)

Spark: 2.4.2
HBase: 2.1.0-cdh6.3.2
JanusGraph: 0.5.2 and 0.4.1

I searched extensively but found no recent references; the ones I did find all use older versions of Spark and JanusGraph.

Some examples:
1) https://docs.janusgraph.org/advanced-topics/hadoop/  
2) http://tinkerpop.apache.org/docs/current/recipes/#olap-spark-yarn
3) http://yaaics.blogspot.com/2017/07/configuring-janusgraph-for-spark-yarn.html

Following these references, I took these steps:

  1. Copy the following files to the JanusGraph "lib" directory:
    1. spark-yarn-2.11-2.4.0.jar
    2. scala-reflect-2.10.5.jar
    3. hadoop-yarn-server-web-proxy-2.7.2.jar
    4. guice-servlet-3.0.jar
  2. Generate a "/tmp/spark-gremlin-0.5.2.zip" file containing all the .jar files from "janusgraph/lib/".
  3. Create a configuration file called 'test.properties' from conf/hadoop-graph/read-hbase-standalone-cluster.properties by adding (or modifying) the properties below:

    janusgraphmr.ioformat.conf.storage.hostname=XXX.XXX.XXX.XXX
    spark.master=yarn
    #spark.deploy-mode=client
    spark.submit.deployMode=client
    spark.executor.memory=1g
    spark.yarn.dist.jars=/tmp/spark-gremlin-0-5-2.zip

    spark.yarn.archive=/tmp/spark-gremlin-0-5-2.zip
    spark.yarn.appMasterEnv.CLASSPATH=./__spark_libs__/*:[hadoop_conf_dir]
    spark.executor.extraClassPath=./__spark_libs__/*:/[hadoop_conf_dir]
    spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native



Then I ran the following commands:

    graph = GraphFactory.open('conf/hadoop-graph/test.properties')
    g = graph.traversal().withComputer(SparkGraphComputer)
    g.V().count()

Can someone help me?
a) Are these problems related to version incompatibility?
b) Has anyone successfully used a similar infrastructure?
c) Does anyone know how to determine the correct versions of the necessary libraries?
d) Any suggestions?


Thank you all !!!

Below is a copy of the YARN log from my last attempt.

ERROR org.apache.spark.scheduler.TaskSetManager  - Task 0 in stage 0.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, [SERVER_NAME], executor 1): java.lang.NoSuchMethodError: net.jpountz.lz4.LZ4BlockInputStream.<init>(Ljava/io/InputStream;Z)V
    at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:122)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$6.apply(TorrentBroadcast.scala:304)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$6.apply(TorrentBroadcast.scala:304)
    at scala.Option.map(Option.scala:146)
    at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:304)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:235)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)
    at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
    at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
    at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
    at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
    at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:89)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Thank you!!


Petr Stentor <kiri...@...>
 


Hi!

Try this:

    spark.io.compression.codec=snappy
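
The stack trace in the original post fails inside Spark's LZ4CompressionCodec, on an LZ4BlockInputStream constructor that the lz4 jar on the executor classpath apparently does not provide. Switching the codec to snappy avoids that code path. A minimal sketch of the change, assuming the line goes into the test.properties file described in the original post, next to the other spark.* settings (the placement within the file is an assumption):

```properties
# conf/hadoop-graph/test.properties (excerpt; other settings unchanged)
spark.master=yarn
spark.submit.deployMode=client
# Use Snappy instead of Spark's default LZ4 codec for broadcast and shuffle data.
spark.io.compression.codec=snappy
```

This works around the failing class rather than removing the underlying jar mismatch, so LZ4 would presumably still fail if it were selected again.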

(In reply to Fábio Dapper's message of Thursday, 23 July 2020, 1:57:38 UTC+3.)



Fábio Dapper <fda...@...>
 

Perfect!!!
That's it!
Thank you very much!!!

(In reply to Petr Stentor's message of Thursday, 23 July 2020, 05:20.)


