JanusGraph: how to bulk load into HBase using Spark on YARN when BulkLoaderVertexProgram is deprecated?


zjx...@...
 

I want to use Scala and Spark on YARN to bulk load data into JanusGraph with HBase as the storage backend, but the classes I found for this, BulkLoaderVertexProgram and OneTimeBulkLoader, are deprecated: "@deprecated As of release 3.2.10, not directly replaced - consider graph provider specific bulk loading methods". How should I write new code? The TinkerPop 3.4.1 documentation did not help.
The details follow.
My hadoop-graphson.properties config:
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
# i/o formats for graphs
gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONInputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat
# i/o locations
gremlin.hadoop.inputLocation=data/tinkerpop-modern.json
gremlin.hadoop.outputLocation=output
# if the job jars are not on the classpath of every hadoop node, then they must be provided to the distributed cache at runtime
gremlin.hadoop.jarsInDistributedCache=true
# the vertex program to execute
gremlin.vertexProgram=org.apache.tinkerpop.gremlin.process.computer.ranking.pagerank.PageRankVertexProgram

####################################
# SparkGraphComputer Configuration #
####################################
spark.master=yarn
spark.deploy-mode=client
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoRegistrator

My ws-http-jhe-test.properties config:
gremlin.graph=org.janusgraph.core.JanusGraphFactory
schema.default=none
storage.backend=hbase
storage.batch-loading=true
storage.hbase.table=testgraph
storage.hbase.region-count=50
storage.buffer-size=102400
storage.hostname=tcd-***:2181,tcd-***:2181,tcd-***:2181
cache.db-cache=true
cache.db-cache-clean-wait=20
cache.db-cache-time=180000
cache.db-cache-size=0.5

index.search.backend=elasticsearch
index.search.index-name=testgraph
index.search.hostname=tcd-***,tcd-***,tcd-***
graph.set-vertex-id=true

ids.block-size=100000000

My Scala/Spark code; using BulkLoaderVertexProgram and OneTimeBulkLoader triggers deprecation warnings:

import org.apache.tinkerpop.gremlin.process.computer.bulkloading.{BulkLoaderVertexProgram, OneTimeBulkLoader}
import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer
import org.apache.tinkerpop.gremlin.structure.util.GraphFactory

def bulkLoad(): Unit = {
  val readGraph = GraphFactory.open("janusgraph/hadoop-graphson.properties")
  val blvp = BulkLoaderVertexProgram.build()
    .bulkLoader(classOf[OneTimeBulkLoader])
    .writeGraph("janusgraph/ws-http-jhe-test.properties")
    .create(readGraph)
  readGraph.compute(classOf[SparkGraphComputer]).program(blvp).submit().get()
  readGraph.close()
}


marc.d...@...
 


This is basically the question: "who will do the work in an open source community?" Apache TinkerPop concluded that a generic BulkLoaderVertexProgram ran into too many implementation-specific issues (see here for the JanusGraph case), and they deprecated the library.

If the deprecated BulkLoaderVertexProgram works for you, it would be easy to copy the existing Java source code into your Scala project and make minor fixes in case of API changes in future TinkerPop versions. Reworking the BulkLoaderVertexProgram into a general, well documented tool for JanusGraph would be a significant piece of work. Also note that the current BulkLoaderVertexProgram does not do much preprocessing of your data for efficient inserts (JanusGraph cache hits only occur for vertices that have so many edges that the vertex is present in the JanusGraph cache on each executor). I believe this is the reason that most JanusGraph users simply use the Gremlin addV and addE steps in their Spark executor code, close to the more complex code where the data is prepared for ingestion.
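A minimal sketch of that addV-on-the-executor approach. This is not from the thread: the CSV path, the "person" label, and the property names are made-up illustrations, and it assumes the same ws-http-jhe-test.properties file shown above points at a reachable JanusGraph/HBase cluster:

```scala
import org.apache.spark.sql.SparkSession
import org.janusgraph.core.JanusGraphFactory

// Sketch: bulk load person vertices with plain addV inside each partition.
object AddVBulkLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("janusgraph-bulkload").getOrCreate()

    spark.read.option("header", "true").csv("hdfs:///data/persons.csv")  // hypothetical input
      .rdd
      .foreachPartition { rows =>
        // One graph connection and one transaction per partition.
        val graph = JanusGraphFactory.open("janusgraph/ws-http-jhe-test.properties")
        val g = graph.traversal()
        try {
          rows.foreach { row =>
            g.addV("person")
              .property("name", row.getString(0))
              .property("age", row.getString(1).toInt)
              .iterate()
          }
          g.tx().commit()
        } finally {
          graph.close()
        }
      }
    spark.stop()
  }
}
```

Edges would follow the same pattern, looking up both endpoints first, e.g. `g.V().has("name", a).as("a").V().has("name", b).addE("knows").from("a").iterate()`, which is why a property with a unique index on the lookup key matters for ingest speed.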

So, in conclusion: if you have little time, using the deprecated BulkLoaderVertexProgram is not a large risk. If resource usage (and thus ingest speed) is important to you, investing in a targeted solution may be worthwhile (look here for inspiration).

HTH,    Marc



pandagungun <zjx...@...>
 

Thank you.



pandagungun <zjx...@...>
 

I want to know: if I use addV and addE in Spark executor code, how do I write the new code?



natali2...@...
 

I also need a basic example of using addV and addE with Spark. Have you found a solution? Or any other way to bulk load vertices and edges?



Nitin Poddar <hitk.ni...@...>
 



HadoopMarc <bi...@...>
 

Hi Nitin,
Good stuff. I think you could still improve the Spark loading in two ways:
  1. Instead of just catching and printing exceptions in the Spark executor, you probably want to:
     - catch the exception
     - log the exception
     - roll back the transaction
     - raise a new exception so that Spark can retry the task/partition
  2. Partitions that run on the same executor can reuse the graph connection, so opening the graph can be done in a singleton class on each executor.
Also note that this code assumes that partitions are small, because you probably do not want more than 1,000-10,000 vertices and edges in a single transaction.
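The points above could be sketched as follows. The `GraphHolder` object, the "person" label, and the batch size of 1000 are illustrative assumptions, not code from this thread; a Scala `object` with a `lazy val` is initialized once per executor JVM, which gives the per-executor singleton:

```scala
import org.janusgraph.core.{JanusGraph, JanusGraphFactory}

// One graph instance per executor JVM: a Scala object's lazy val is
// initialized exactly once per classloader, so all partitions running
// on the same executor reuse this connection.
object GraphHolder {
  lazy val graph: JanusGraph =
    JanusGraphFactory.open("janusgraph/ws-http-jhe-test.properties")
}

// Called from rdd.foreachPartition; rows are (name, age) pairs here.
def loadPartition(rows: Iterator[(String, Int)]): Unit = {
  val g = GraphHolder.graph.traversal()
  try {
    rows.grouped(1000).foreach { batch =>        // keep each transaction small
      batch.foreach { case (name, age) =>
        g.addV("person").property("name", name).property("age", age).iterate()
      }
      g.tx().commit()
    }
  } catch {
    case e: Exception =>
      g.tx().rollback()                          // leave the graph consistent
      System.err.println(s"partition load failed: ${e.getMessage}")  // log it
      throw e                                    // rethrow so Spark retries the task
  }
}
```

Note that the singleton graph is deliberately not closed inside the task, since later partitions on the same executor will reuse it.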

Cheers,    Marc
