Topics

Hadoop and Spark not working with Janusgraph


Vinayak Bali <vinayakbali16@...>
 



---------- Forwarded message ---------
From: Vinayak Bali <vinayakbali16@...>
Date: Wed, 13 Jan 2021, 7:42 pm
Subject: Hadoop and Spark not working with Janusgraph
To: <janusgraph-users@...>


Hi,
Installed and configured Apache Hadoop(3.3.0) and Apache Spark(3.0.1) with Janusgraph(0.4.0). When I try to execute a query it is not working. Earlier it was working with native spark but took a long time(in minutes) to return a single record. How can the time be optimized and also make it run?

gremlin> hdfs
==>storage[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-568319645_1, ugi=fusionops (auth:SIMPLE)]]]
gremlin> graph = GraphFactory.open('conf/hadoop-graph/read-cql-standalone-cluster.properties')
==>hadoopgraph[cqlinputformat->nulloutputformat]
gremlin> g=graph.traversal().withComputer(SparkGraphComputer)
==>graphtraversalsource[hadoopgraph[cqlinputformat->nulloutputformat], sparkgraphcomputer]
gremlin> g.V().limit(1).valueMap()
13:52:00 WARN  org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer  - class org.apache.hadoop.mapreduce.lib.output.NullOutputFormat does not implement PersistResultGraphAware and thus, persistence options are unknown -- assuming all options are possible
13:52:00 WARN  org.apache.spark.SparkContext  - Another SparkContext is being constructed (or threw an exception in its constructor).  This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.SparkContext.getOrCreate(SparkContext.scala)
org.apache.tinkerpop.gremlin.spark.structure.Spark.create(Spark.java:52)
org.apache.tinkerpop.gremlin.spark.structure.Spark.create(Spark.java:60)
org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer.lambda$submitWithExecutor$1(SparkGraphComputer.java:313)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
13:53:00 ERROR org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend  - Application has been killed. Reason: All masters are unresponsive! Giving up.
13:53:00 WARN  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend  - Application ID is not initialized yet.
13:53:00 WARN  org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint  - Drop UnregisterApplication(null) because has not yet connected to master
13:53:00 ERROR org.apache.spark.SparkContext  - Error initializing SparkContext.
java.lang.NullPointerException

read-cql-standalone-cluster.properties

# Copyright 2020 JanusGraph Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cql.CqlInputFormat
gremlin.hadoop.graphWriter=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat

gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
gremlin.spark.persistContext=true

#
# JanusGraph Cassandra InputFormat configuration
#
# These properties defines the connection properties which were used while write data to JanusGraph.
janusgraphmr.ioformat.conf.storage.backend=cql
# This specifies the hostname & port for Cassandra data store.
janusgraphmr.ioformat.conf.storage.hostname=127.0.0.1
janusgraphmr.ioformat.conf.storage.port=9042
# This specifies the keyspace where data is stored.
janusgraphmr.ioformat.conf.storage.cql.keyspace=janusgraph
# This defines the indexing backend configuration used while writing data to JanusGraph.
janusgraphmr.ioformat.conf.index.search.backend=elasticsearch
janusgraphmr.ioformat.conf.index.search.hostname=127.0.0.1
# Use the appropriate properties for the backend when using a different storage backend (HBase) or indexing backend (Solr).

#
# Apache Cassandra InputFormat configuration
#
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.widerows=true

#
# SparkGraphComputer Configuration
#
spark.master=spark://127.0.0.1:7077
spark.executor.memory=1g
spark.executor.extraClassPath=/opt/lib/janusgraph/*
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.janusgraph.hadoop.serialize.JanusGraphKryoRegistrator

Thanks & Regards,
Vinayak


hadoopmarc@...
 

Hi Vinayak,

JanusGraph itself has spark as a dependency and the spark versions of the spark jars in the janusgraph/lib folder and of your spark cluster must match.

Best wishes,   Marc