Failed to find all paths between 2 vertices on a graph with 100 million vertices and 100 million edges using SparkGraphComputer
Roy Yu <7604...@...>
The graph has 100 million vertices and 100 million edges. The graph data is stored in the HBase table MyHBaseTable, whose size is 16.2 GB:

root@~$ hdfs dfs -du -h /apps/hbase/data/data/default/
16.2 G  32.4 G  /apps/hbase/data/data/default/MyHBaseTable

MyHBaseTable has 190 regions, and the edge data (HBase column family e) of every region is less than 100 MB. (One Spark task processes one region; to avoid Spark OOM while loading region data, I use HBaseAdmin to split HBase regions so that the edge data of every region stays below 100 MB.) Below, region 077288f4be4c439443bb45b0c2369d5b is larger than 100 MB because it also holds index data.

root@~$ hdfs dfs -du -h /apps/hbase/data/data/default/MyHBaseTable
3.8 K    7.6 K    /apps/hbase/data/data/default/MyHBaseTable/.tabledesc
0        0        /apps/hbase/data/data/default/MyHBaseTable/.tmp
78.3 M   156.7 M  /apps/hbase/data/data/default/MyHBaseTable/007e9dbf74f5d35862b68d6434f1d6f2
92.2 M   184.3 M  /apps/hbase/data/data/default/MyHBaseTable/077288f4be4c439443bb45b0c2369d5b
102.4 M  204.8 M  /apps/hbase/data/data/default/MyHBaseTable/0782782071e4a7f2d17800d4a0989a7f
50.6 M   101.3 M  /apps/hbase/data/data/default/MyHBaseTable/07e795022e56a969ede48c9c23fbbc7c
50.6 M   101.3 M  /apps/hbase/data/data/default/MyHBaseTable/084e54e61bbcfc2decd14dcbac55bc50
99.7 M   199.4 M  /apps/hbase/data/data/default/MyHBaseTable/0a85ae356b19c605d9a32b9bf513bcbb
431.3 M  862.6 M  /apps/hbase/data/data/default/MyHBaseTable/0b024c812acfa6efaa40e1cca232e192
5.0 K    10.1 K   /apps/hbase/data/data/default/MyHBaseTable/0c2d8e3a6daaa8ab30c399783e343890
...

The properties of the graph:

gremlin.graph=org.janusgraph.core.JanusGraphFactory
cluster.max-partitions=16
storage.backend=hbase
storage.hbase.table=MyHBaseTable
storage.hbase.ext.zookeeper.znode.parent=/hbase-unsecure
schema.default=none
storage.hostname=master001,master002,master003
storage.port=2181
storage.hbase.region-count=64
storage.write-time=1000000
storage.read-time=100000
ids.block-size=200000
ids.renew-timeout=600000
ids.renew-percentage=0.4
ids.authority.conflict-avoidance-mode=GLOBAL_AUTO
index.search.backend=elasticsearch
index.search.hostname=es001,es002,es003
index.search.elasticsearch.create.ext.index.number_of_shards=15
index.search.elasticsearch.create.ext.index.refresh_interval=-1
index.search.elasticsearch.create.ext.index.translog.sync_interval=5000s
index.search.elasticsearch.create.ext.index.translog.durability=async
index.search.elasticsearch.create.ext.index.number_of_replicas=0
index.search.elasticsearch.create.ext.index.shard.check_on_startup=false

The schema of the graph:

def defineSchema(graph) {
    m = graph.openManagement()
    node = m.makeVertexLabel("node").make()
    relation = m.makeEdgeLabel("relation").make()
    obj_type_value = m.makePropertyKey("obj_type_value").dataType(String.class).make()
    // edge props
    start_time = m.makePropertyKey("start_time").dataType(Date.class).make()
    end_time = m.makePropertyKey("end_time").dataType(Date.class).make()
    count = m.makePropertyKey("count").dataType(Integer.class).make()
    rel_type = m.makePropertyKey("rel_type").dataType(String.class).make()
    // index
    m.buildIndex("MyHBaseTable_obj_type_value_Index", Vertex.class).addKey(obj_type_value).unique().buildCompositeIndex()
    m.buildIndex("MyHBaseTable_rel_type_index", Edge.class).addKey(rel_type).buildCompositeIndex()
    m.buildIndex("MyHBaseTable_count_index", Edge.class).addKey(count).buildMixedIndex("search")
    m.buildIndex("MyHBaseTable_start_time_index", Edge.class).addKey(start_time).buildMixedIndex("search")
    m.buildIndex("MyHBaseTable_end_time_index", Edge.class).addKey(end_time).buildMixedIndex("search")
    m.commit()
}

The Gremlin I use to find all paths between the 2 vertices:

import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__;
import org.apache.tinkerpop.gremlin.process.traversal.P;

def executeScript(graph) {
    traversal = graph.traversal().withComputer(SparkGraphComputer.class);
    return traversal.V(624453904).repeat(__.both().simplePath()).until(__.hasId(192204064).or().loops().is(200)).hasId(192204064).path().dedup().limit(1000).toList()
    //return traversal.V().where(__.outE().count().is(P.gte(50000))).id().toList()
};

The OLAP Spark graph conf:

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
gremlin.spark.graphStorageLevel=DISK_ONLY
gremlin.spark.persistStorageLevel=DISK_ONLY

####################################
# JanusGraph HBase InputFormat configuration
####################################
janusgraphmr.ioformat.conf.storage.backend=hbase
janusgraphmr.ioformat.conf.storage.hostname=master002,master003,master001
janusgraphmr.ioformat.conf.storage.hbase.table=MyHBaseTable
janusgraphmr.ioformat.conf.storage.hbase.ext.zookeeper.znode.parent=/hbase-unsecure

####################################
# SparkGraphComputer Configuration #
####################################
spark.master=yarn
spark.submit.deployMode=client
spark.yarn.jars=hdfs://GRAPHOLAP/user/spark/jars/*.jar
# the Spark YARN ApplicationManager needs this to resolve the classpath it sends to the executors
spark.yarn.appMasterEnv.JAVA_HOME=/usr/local/jdk1.8.0_191/
spark.yarn.appMasterEnv.HADOOP_CONF_DIR=/usr/hdp/3.1.4.0-315/hadoop/conf
spark.yarn.am.extraJavaOptions=-Diop.version=3.1.4.0-315 -Djava.library.path=/usr/hdp/current/hadoop-client/lib/native
spark.executor.memoryOverhead=5G
spark.driver.extraJavaOptions=-Diop.version=3.1.4.0-315 -Djava.library.path=/usr/hdp/current/hadoop-client/lib/native
# the Spark executors (on the worker nodes) need this to resolve the classpath to run Spark tasks
spark.executorEnv.JAVA_HOME=/usr/local/jdk1.8.0_191/
#spark.executorEnv.HADOOP_CONF_DIR=/usr/hdp/3.1.4.0-315/hadoop/conf
spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=500 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/mnt/data_1/log/spark2/gc-spark%p.log
spark.executor.cores=1
spark.executor.memory=80G
spark.executor.instances=3
spark.executor.extraClassPath=/etc/hadoop/conf:/usr/spark/jars:/usr/hdp/current/hbase-client/lib:/usr/janusgraph/0.4.0/lib
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.network.timeout=1000000
spark.rpc.askTimeout=1000000
spark.shuffle.service.enabled=true
spark.shuffle.service.port=7447
spark.maxRemoteBlockSizeFetchToMem=10485760
spark.memory.useLegacyMode=true
spark.shuffle.memoryFraction=0.1
spark.storage.memoryFraction=0.1
spark.memory.fraction=0.1
spark.memory.storageFraction=0.1
spark.shuffle.accurateBlockThreshold=1048576

The Spark job failed at stage 50:

20/12/30 01:53:00 ERROR executor.Executor: Exception in task 40.0 in stage 50.0 (TID 192084)
java.lang.OutOfMemoryError: Java heap space
    at sun.reflect.generics.repository.ClassRepository.getSuperInterfaces(ClassRepository.java:114)
    at java.lang.Class.getGenericInterfaces(Class.java:913)
    at java.util.HashMap.comparableClassFor(HashMap.java:351)
    at java.util.HashMap$TreeNode.treeify(HashMap.java:1932)
    at java.util.HashMap.treeifyBin(HashMap.java:772)
    at java.util.HashMap.putVal(HashMap.java:644)
    at java.util.HashMap.put(HashMap.java:612)
    at java.util.Collections$SynchronizedMap.put(Collections.java:2588)
    at org.apache.tinkerpop.gremlin.process.traversal.traverser.util.TraverserSet.add(TraverserSet.java:90)
    at org.apache.tinkerpop.gremlin.process.computer.traversal.WorkerExecutor.lambda$drainStep$4(WorkerExecutor.java:232)
    at org.apache.tinkerpop.gremlin.process.computer.traversal.WorkerExecutor$$Lambda$86/877696627.accept(Unknown Source)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at org.apache.tinkerpop.gremlin.process.computer.traversal.WorkerExecutor.drainStep(WorkerExecutor.java:221)
    at org.apache.tinkerpop.gremlin.process.computer.traversal.WorkerExecutor.execute(WorkerExecutor.java:151)
    at org.apache.tinkerpop.gremlin.process.computer.traversal.TraversalVertexProgram.execute(TraversalVertexProgram.java:307)
    at org.apache.tinkerpop.gremlin.spark.process.computer.SparkExecutor.lambda$null$4(SparkExecutor.java:118)
    at org.apache.tinkerpop.gremlin.spark.process.computer.SparkExecutor$$Lambda$72/1209554928.apply(Unknown Source)
    at org.apache.tinkerpop.gremlin.util.iterator.IteratorUtils$3.next(IteratorUtils.java:247)
    at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)

From the log it seems there is too much data; even the 80 GB executor heap is not enough. Can anybody help me? Does anybody have an idea how to find all paths between 2 vertices on a large graph?
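For what it's worth, a bounded-depth variant of the traversal (sketch only; the depth bound of 6 is an arbitrary example, not a tested fix). With loops().is(200), the simple paths between two vertices on a graph of this size explode combinatorially, and the repeat/simplePath traversers carry their full path history and accumulate in the TraverserSet seen in the stack trace, so a much smaller bound is usually the first thing to try:

import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__

def executeScript(graph) {
    // same query shape as above, but stop expanding after 6 hops
    def traversal = graph.traversal().withComputer(SparkGraphComputer.class)
    return traversal.V(624453904).
        repeat(__.both().simplePath()).
        until(__.hasId(192204064).or().loops().is(6)).    // 6 is illustrative
        hasId(192204064).
        path().dedup().limit(1000).
        toList()
}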
|
Re: Running OLAP on HBase with SparkGraphComputer fails with Error Container killed by YARN for exceeding memory limits
Roy Yu <7604...@...>
Thanks, Evgenii
|
|
Re: Remote Traversal with Java
HadoopMarc <bi...@...>
Hi Peter,

This seems more relevant: https://docs.janusgraph.org/basics/configured-graph-factory/#graph-and-traversal-bindings

So, some JanusGraph and GraphTraversalSource objects are created remotely with names following a convention; you cannot assign the instances locally.

Best wishes, Marc
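For illustration, a minimal sketch of using such a binding over the driver (assuming the graph was created as "test", so the server exposes a "test_traversal" traversal binding; host and port are illustrative):

import org.apache.tinkerpop.gremlin.driver.Cluster
import org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection
import org.apache.tinkerpop.gremlin.process.traversal.AnonymousTraversalSource
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource

Cluster cluster = Cluster.build("janusgraph-host").port(8182).create()
GraphTraversalSource g = AnonymousTraversalSource.traversal().
        withRemote(DriverRemoteConnection.using(cluster, "test_traversal"))
println g.V().count().next()   // runs remotely against the "test" graph
g.close()
cluster.close()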
|
|
Slow convert to Java object
Maxim Milovanov <milov...@...>
Hi! I am trying to write a select query, and performance is very slow when I call toList(). My example:

long start;
long end;
try (GraphTraversalSource g = graph.traversal()) {
    start = System.currentTimeMillis();
    GraphTraversal<Vertex, Vertex> list2 = g.V().hasLabel("Entity");
    end = System.currentTimeMillis();
    System.out.printf("getGremlinTime: %d ms\n", (end - start));
    start = end;
    List<Vertex> res2 = list2.toList();
    end = System.currentTimeMillis();
    System.out.printf("toList: %d ms\n gremlin count: %d\n", (end - start), res2.size());
}

Debug log:

getGremlinTime: 13 ms
2020-12-31 14:19:01.336 WARN 14144 --- [ main] o.j.g.transaction.StandardJanusGraphTx : Query requires iterating over all vertices [(~label = Entity)]. For better performance, use indexes
toList: 12025 ms
 gremlin count: 105
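For reference, a small sketch (illustrative only) of checking where the time actually goes: the 13 ms above only measures building the traversal, nothing hits the backend until toList() iterates it, and without an index the label-only filter scans all vertices:

def g = graph.traversal()

// profile() forces iteration and shows the full-scan cost per step
println g.V().hasLabel("Entity").profile().next()

// if only the number of "Entity" vertices is needed, count inside the traversal
// instead of materialising every Vertex into a List
println g.V().hasLabel("Entity").count().next()

g.close()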
|
Re: Remote Traversal with Java
HadoopMarc <bi...@...>
Hi Peter,

Have you tried the suggestions from an earlier thread: https://groups.google.com/g/janusgraph-users/c/w2_qMchATnw/m/zoMSlMO5BwAJ

Best wishes, Marc
|
|
Re: Degree-Centrality Filtering & Search – Scalable Strategies for OLTP
BO XUAN LI <libo...@...>
Hi Zach,
If you want to run the query in a multi-threaded manner, try enabling "query.batch" (ref: https://docs.janusgraph.org/basics/configuration-reference/#query). Since you are using Cassandra, which does not support batch reading natively, JanusGraph will use a thread pool to fire the backend queries. This should reduce the latency of this single query but might impact overall application performance if your application is already handling heavy workloads. Best regards, Boxuan
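For reference, a minimal sketch of switching it on when opening the graph programmatically (hostname is illustrative); the same key can equally be set in the graph's properties file:

import org.janusgraph.core.JanusGraphFactory

// sketch: enable batched backend reads for this graph instance
graph = JanusGraphFactory.build().
    set("storage.backend", "cql").
    set("storage.hostname", "cassandra-host").   // illustrative
    set("query.batch", true).
    open()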
|
|
Re: Degree-Centrality Filtering & Search – Scalable Strategies for OLTP
"zb...@gmail.com" <zblu...@...>
Hi Marc, Boxuan,

Thank you for the discussion. I have been experimenting with different queries, including your id() suggestion, Marc. In line with Boxuan's feedback, the where() step performs about the same (maybe slightly slower) when adding the .id() step.

My bigger concern for my use case is how this type of operation scales, in a manner that seems relatively linear with sample size, i.e.:

g.V().limit(10).where(inE().count().is(gt(6))).profile() => ~30 ms
g.V().limit(100).where(inE().count().is(gt(6))).profile() => ~147 ms
g.V().limit(1000).where(inE().count().is(gt(6))).profile() => ~1284 ms
g.V().limit(10000).where(inE().count().is(gt(6))).profile() => ~13779 ms
g.V().limit(100000).where(inE().count().is(gt(6))).profile() => ? > 120000 ms (timeout)

This behavior makes sense when I think about it, and also when I inspect the profile (example profile of the limit(10) traversal below). I know the above traversal seems a bit funky, but I am trying to consistently analyze the effect of sample size on the edge-count portion of the query. Looking at the profile, it seems like JG needs to perform a sliceQuery operation on each vertex sequentially, which isn't well optimized for my use case.

I know that if centrality properties were included in a mixed index then it could be configured for scalable performance. However, going back to the original post, I am not sure that is the best/only way. Are there other configurations that could be optimized to make this operation more scalable without adding an additional index property?

In case it is relevant, I am using JanusGraph v0.5.2 with the Cassandra-CQL backend v3.11.

Thank you,
Zach

Example Profile

gremlin> g.V().limit(10).where(inE().count().is(gt(6))).profile()
==>Traversal Metrics
Step                                                  Count  Traversers    Time (ms)    % Dur
=============================================================================================
JanusGraphStep(vertex,[])                                10          10        8.684    28.71
    \_condition=()
    \_orders=[]
    \_limit=10
    \_isFitted=false
    \_isOrdered=true
    \_query=[]
  optimization                                                              0.005
  optimization                                                              0.001
  scan                                                                      0.000
    \_query=[]
    \_fullscan=true
    \_condition=VERTEX
TraversalFilterStep([JanusGraphVertexStep(IN,ed...                            21.564    71.29
  JanusGraphVertexStep(IN,edge)                          13          13       21.350
    \_condition=(EDGE AND visibility:normal)
    \_orders=[]
    \_limit=7
    \_isFitted=false
    \_isOrdered=true
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_vertices=1
  optimization                                                              0.003
  backend-query                                           3                 4.434
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
  optimization                                                              0.001
  backend-query                                           1                 1.291
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
  optimization                                                              0.001
  backend-query                                           2                 1.311
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
  optimization                                                              0.001
  backend-query                                           1                 2.483
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
  optimization                                                              0.001
  backend-query                                           2                 1.310
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
  optimization                                                              0.001
  backend-query                                           2                 1.313
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
  optimization                                                              0.001
  backend-query                                           2                 1.192
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
  optimization                                                              0.001
  backend-query                                           4                 1.287
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
  optimization                                                              0.001
  backend-query                                           3                 1.231
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
  optimization                                                              0.001
  backend-query                                           2                 3.546
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
RangeGlobalStep(0,7)                                     13          13        0.037
CountGlobalStep                                          10          10        0.041
IsStep(gt(6))                                                                  0.022
                                            >TOTAL        -           -       30.249        -
|
|
Remote Traversal with Java
Peter Borissow <peter....@...>
Dear All,

I have installed/configured a single-node JanusGraph Server with a Berkeley database backend and ConfigurationManagementGraph support so that I can create/manage multiple graphs on the server. In a Gremlin console on my desktop I can connect to the remote server, create graphs, create vertices, etc.:

gremlin> :remote connect tinkerpop.server conf/remote.yaml session
gremlin> :remote console
gremlin> graph = ConfiguredGraphFactory.open("test");
gremlin> g = graph.traversal();

In Java, once I connect to the server/cluster and create a client connection, I think it should be as simple as this:

DriverRemoteConnection conn = DriverRemoteConnection.using(client, name);
GraphTraversalSource g = AnonymousTraversalSource.traversal().withRemote(conn);

More info here: https://stackoverflow.com/questions/65486512/janusgraph-remote-traversal-with-java

Any help/guidance would be greatly appreciated!

Thanks,
Peter
|
Re: Degree-Centrality Filtering & Search – Scalable Strategies for OLTP
BO XUAN LI <libo...@...>
Hi Marc,
I think it will be just as slow as the initial one, if not slower. If I recall correctly, JanusGraph always fetches the whole edge (column + value), even if you only need the count (in which case neither column nor value is really needed) or you only need the edge id (in which case only the column is needed). I created https://github.com/JanusGraph/janusgraph/discussions/2315 to discuss this potential optimization. Btw, even if we assume this optimization is in place, I don't expect a significant performance boost for Zach's use case. Best regards, Boxuan
|
|
Re: Degree-Centrality Filtering & Search – Scalable Strategies for OLTP
HadoopMarc <bi...@...>
Hi Zach, Boxuan,

There is one thing I do not understand. According to the JanusGraph data model, the outE relation identifiers are stored in the vertex. So, retrieving all outE() relation identifiers with the vertex and counting them should not take long, even if there are hundreds of thousands of them. What happens if you do:

g.V()
 .has("someProperty", eq("someValue"))
 .where(outE().id().count().is(gt(10)));

If this does not work, it should be possible to configure/modify JanusGraph so that it does not start fetching edge properties that are not needed for the count.

Best wishes, Marc
|
|
Re: Degree-Centrality Filtering & Search – Scalable Strategies for OLTP
BO XUAN LI <libo...@...>
Hi Zach,
I have some concerns over concurrency and consistency issues, but this might still be a nice feature to have. I think you could open a new discussion on https://github.com/JanusGraph/janusgraph/discussions. That would be a better place for brainstorming. It would be awesome if you can share more context on why you think this is a very common business requirement. Cheers, Boxuan
|
|
Re: Degree-Centrality Filtering & Search – Scalable Strategies for OLTP
"zb...@gmail.com" <zblu...@...>
Thank you Boxuan,

I was using the term "job" pretty loosely. Your inference about doing these things within the ingest/deletion process makes sense.

I know there is a lot on the community's plate right now, but if my above solution is truly optimal for the current state, I wonder if a JG feature addition might help tackle this problem more consistently. Something like an additional, third index type (in addition to "graph" and "vertex-centric" indices), e.g. an "edge-connection" or "degree-centrality" index. The feature would require a mixed indexing backend and, minimally, a mechanism to choose the vertex and edge label combinations for which to count IN, OUT, and/or BOTH degree centrality. Not sure what the level of effort or implementation details would be, but this is a very common business requirement for graph-based search. If JanusGraph had native/tested support for it, it would make JanusGraph even easier to champion. 😊

Best, Zach
|
|
Re: Degree-Centrality Filtering & Search – Scalable Strategies for OLTP
BO XUAN LI <libo...@...>
Hi Zach,
Personally I think your workaround is the most optimal one. JanusGraph does not store the number of edges as metadata in the vertex (there are both pros and cons to doing / not doing this). Btw, do you have to have another job doing the centrality calculation separately? If your application is built on top of JanusGraph, then you can probably maintain the "outDegree" property when inserting/deleting edges. Best regards, Boxuan
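For illustration, a rough sketch of that write-time bookkeeping (method and property names are assumptions, and concurrent writers would need locking or a periodic recount):

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__

// sketch: add the edge and bump the source vertex's counter in the same transaction
def addEdgeWithDegree(g, fromId, toId, label) {
    g.V(fromId).addE(label).to(__.V(toId)).iterate()
    def current = g.V(fromId).values("outDegree").tryNext().orElse(0)
    g.V(fromId).property("outDegree", current + 1).iterate()
    g.getGraph().tx().commit()
}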
|
|
Degree-Centrality Filtering & Search – Scalable Strategies for OLTP
"zb...@gmail.com" <zblu...@...>
Hello all,

Curious about best approaches/practices for scalable degree-centrality search filters on large (millions to billions of nodes) JanusGraphs, i.e. something like:

g.V()
 .has("someProperty", eq("someValue"))
 .where(outE().count().is(gt(10)));

Suppose the has-step narrows down to a large number of vertices (hundreds of thousands); then performing that form of count on that many vertices will result in timeouts and inefficiencies (at least in my experience). My workaround for this has been pre-calculating centrality in another job and writing it to a vertex property that can subsequently be included in a mixed index, so we can do:

g.V()
 .has("someProperty", eq("someValue"))
 .has("outDegree", gt(10))

This works, but it is yet another calculation we must maintain in our pipeline, and while it suffices, it seems like more of a workaround than a great solution. I was hoping there was a more optimal approach/strategy. Please let me know.

Thank you,
Zach
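For reference, the schema side of that workaround looks roughly like this (property and index names are illustrative):

// sketch: pre-computed degree property backed by a mixed index, so that
// has("outDegree", gt(10)) is answered by the index backend instead of a scan
mgmt = graph.openManagement()
outDegree = mgmt.makePropertyKey("outDegree").dataType(Integer.class).make()
mgmt.buildIndex("vertexByOutDegree", Vertex.class).addKey(outDegree).buildMixedIndex("search")
mgmt.commit()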
|
Re: JanusGraph 0.5.2 and BigTable
Assaf Schwartz <schw...@...>
Hi Boxuan!
I perhaps wasn't clear. The composite indexing didn't solve the locking issue (it went away by itself 🙄, as if it were a cold-start issue). However, my actual problem, the failing lookup, was indeed solved. Again, many thanks for the information and the prompt replies. Assaf
|
|
Re: How to upload rdf bulk data to janus graph
Arpan Jain <arpan...@...>
Actually I have around 70 fields. So my doubt is: is it possible to insert some data without bulk loading, so that JanusGraph will create its own schema, and later, for the remaining data, set bulk loading to true? Will this process give an error?
|
Re: How to upload rdf bulk data to janus graph
"alex...@gmail.com" <alexand...@...>
That's right
|
|
Re: How to upload rdf bulk data to janus graph
Arpan Jain <arpan...@...>
All these properties I need to set in the JanusGraph properties file, right? I mean the config the server starts with, i.e. the file where we set the storage backend, host, etc.
|
|
Re: How to upload rdf bulk data to janus graph
"alex...@gmail.com" <alexand...@...>
Hi,

Try to enable batch loading: "storage.batch-loading=true". Increase your batch mutations buffer: "storage.buffer-size=20480". Increase the ids block size: "ids.block-size=10000000". I am not sure whether your flow just adds data or upserts it; in case it upserts, you may also set "query.batch=true".

That said, I haven't used rdf2gremlin and can't suggest much; the above configurations are just the options I can immediately think of. Of course, a proper investigation should be done to suggest performance improvements. You may additionally optimize your ScyllaDB for your use cases.

Best regards,
Oleksandr
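For reference, a sketch of the same settings applied when opening the graph programmatically (the hostname is illustrative); when loading through Gremlin Server, as rdf2gremlin does, they go into the graph's properties file on the server instead:

import org.janusgraph.core.JanusGraphFactory

// sketch: bulk-load friendly settings for a Scylla (CQL) backed graph
graph = JanusGraphFactory.build().
    set("storage.backend", "cql").
    set("storage.hostname", "scylla-host").    // illustrative
    set("storage.batch-loading", true).
    set("storage.buffer-size", 20480).
    set("ids.block-size", 10000000).
    set("query.batch", true).                  // only useful if the flow upserts
    open()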
|
How to upload rdf bulk data to janus graph
Arpan Jain <arpan...@...>
I have data in RDF (ttl) format, with around 6 million triples. Currently I use the rdf2gremlin Python script for the conversion, but it is taking too much time: about 1 hour for 10k records. I am using ScyllaDB as the JanusGraph backend. Below is the Python code I am using:

from rdf2g import setup_graph

DEFAULT_LOCAL_CONNECTION_STRING = "ws://localhost:8182/gremlin"
g = setup_graph(DEFAULT_LOCAL_CONNECTION_STRING)

import rdflib
import pathlib

OUTPUT_FILE_LAM_PROPERTIES = pathlib.Path("path/to/ttl/file/.ttl").resolve()
rdf_graph = rdflib.Graph()
rdf_graph.parse(str(OUTPUT_FILE_LAM_PROPERTIES), format="ttl")

The same RDF data takes only about 10 minutes to load into Neo4j, but I want to use JanusGraph. Kindly suggest the best way to upload bulk RDF data to JanusGraph using Python or Java.
|