
Failed to find all paths between 2 vertices upon the graph having 100 million vertices and 100 million edges using SparkGraphComputer

Roy Yu <7604...@...>
 

The graph has 100 million vertices and 100 million edges.
The graph data is stored in the HBase table MyHBaseTable.

The size of MyHBaseTable is 16.2GB:
root@~$ hdfs dfs -du -h /apps/hbase/data/data/default/
16.2 G   32.4 G   /apps/hbase/data/data/default/MyHBaseTable

MyHBaseTable has 190 regions. The edge data (HBase column family e) of every region is less than 100 MB: one Spark task processes one region, so to avoid Spark OOM while loading region data I used HBaseAdmin to split the HBase regions until the edge data (column family e) of each region was below 100 MB. Below, the size of region 077288f4be4c439443bb45b0c2369d5b is more than 100 MB because it also contains index data.
root@~$ hdfs dfs -du -h /apps/hbase/data/data/default/MyHBaseTable
3.8 K    7.6 K    /apps/hbase/data/data/default/MyHBaseTable/.tabledesc
0        0        /apps/hbase/data/data/default/MyHBaseTable/.tmp
78.3 M   156.7 M  /apps/hbase/data/data/default/MyHBaseTable/007e9dbf74f5d35862b68d6434f1d6f2
92.2 M   184.3 M  /apps/hbase/data/data/default/MyHBaseTable/077288f4be4c439443bb45b0c2369d5b
102.4 M  204.8 M  /apps/hbase/data/data/default/MyHBaseTable/0782782071e4a7f2d17800d4a0989a7f
50.6 M   101.3 M  /apps/hbase/data/data/default/MyHBaseTable/07e795022e56a969ede48c9c23fbbc7c
50.6 M   101.3 M  /apps/hbase/data/data/default/MyHBaseTable/084e54e61bbcfc2decd14dcbac55bc50
99.7 M   199.4 M  /apps/hbase/data/data/default/MyHBaseTable/0a85ae356b19c605d9a32b9bf513bcbb
431.3 M  862.6 M  /apps/hbase/data/data/default/MyHBaseTable/0b024c812acfa6efaa40e1cca232e192
5.0 K    10.1 K   /apps/hbase/data/data/default/MyHBaseTable/0c2d8e3a6daaa8ab30c399783e343890
...
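(For reference, a split like the one described above can be issued through the HBase Admin API. The sketch below is illustrative only; the split key is a placeholder and error handling is omitted.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class SplitRegions {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "master001,master002,master003");
        conf.set("zookeeper.znode.parent", "/hbase-unsecure");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // split at an explicit row key (placeholder) so that the edge column
            // family 'e' of each resulting region stays below ~100 MB
            admin.split(TableName.valueOf("MyHBaseTable"), Bytes.toBytes("someSplitKey"));
        }
    }
}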


the properties of the graph:
gremlin.graph=org.janusgraph.core.JanusGraphFactory
cluster.max-partitions = 16
storage.backend=hbase
storage.hbase.table=MyHBaseTable
storage.hbase.ext.zookeeper.znode.parent=/hbase-unsecure
schema.default=none

storage.hostname=master001,master002,master003
storage.port=2181
storage.hbase.region-count=64
storage.write-time=1000000
storage.read-time=100000

ids.block-size=200000
ids.renew-timeout=600000
ids.renew-percentage=0.4
ids.authority.conflict-avoidance-mode=GLOBAL_AUTO

index.search.backend=elasticsearch
index.search.hostname=es001,es002,es003
index.search.elasticsearch.create.ext.index.number_of_shards=15
index.search.elasticsearch.create.ext.index.refresh_interval=-1
index.search.elasticsearch.create.ext.index.translog.sync_interval=5000s
index.search.elasticsearch.create.ext.index.translog.durability=async
index.search.elasticsearch.create.ext.index.number_of_replicas=0
index.search.elasticsearch.create.ext.index.shard.check_on_startup=false


the schema of the graph:
def defineSchema(graph) {
    m = graph.openManagement()

    node = m.makeVertexLabel("node").make()

    relation = m.makeEdgeLabel("relation").make()
    obj_type_value = m.makePropertyKey("obj_type_value").dataType(String.class).make()

    // edge props
    start_time = m.makePropertyKey("start_time").dataType(Date.class).make()
    end_time = m.makePropertyKey("end_time").dataType(Date.class).make()
    count = m.makePropertyKey("count").dataType(Integer.class).make()
    rel_type = m.makePropertyKey("rel_type").dataType(String.class).make()

    // index
    m.buildIndex("MyHBaseTable_obj_type_value_Index", Vertex.class).addKey(obj_type_value).unique().buildCompositeIndex()
    m.buildIndex("MyHBaseTable_rel_type_index", Edge.class).addKey(rel_type).buildCompositeIndex()
    m.buildIndex("MyHBaseTable_count_index", Edge.class).addKey(count).buildMixedIndex("search")
    m.buildIndex("MyHBaseTable_start_time_index", Edge.class).addKey(start_time).buildMixedIndex("search")
    m.buildIndex("MyHBaseTable_end_time_index", Edge.class).addKey(end_time).buildMixedIndex("search")

    m.commit()
}

the Gremlin I use to find all paths between 2 vertices:

import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__;
import org.apache.tinkerpop.gremlin.process.traversal.P;
def executeScript(graph){
    traversal = graph.traversal().withComputer(SparkGraphComputer.class);
    return traversal.V(624453904).repeat(__.both().simplePath()).until(__.hasId(192204064).or().loops().is(200)).hasId(192204064).path().dedup().limit(1000).toList()
    //return traversal.V().where(__.outE().count().is(P.gte(50000))).id().toList()
};

The OLAP spark graph conf:
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat

gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
gremlin.spark.graphStorageLevel=DISK_ONLY
gremlin.spark.persistStorageLevel=DISK_ONLY
####################################
# JanusGraph HBase InputFormat configuration
####################################
janusgraphmr.ioformat.conf.storage.backend=hbase
janusgraphmr.ioformat.conf.storage.hostname=master002,master003,master001
janusgraphmr.ioformat.conf.storage.hbase.table=MyHBaseTable
janusgraphmr.ioformat.conf.storage.hbase.ext.zookeeper.znode.parent=/hbase-unsecure

####################################
# SparkGraphComputer Configuration #
####################################
spark.master=yarn
spark.submit.deployMode=client
spark.yarn.jars=hdfs://GRAPHOLAP/user/spark/jars/*.jar

# the Spark YARN ApplicationManager needs this to resolve classpath it sends to the executors
spark.yarn.appMasterEnv.JAVA_HOME=/usr/local/jdk1.8.0_191/
spark.yarn.appMasterEnv.HADOOP_CONF_DIR=/usr/hdp/3.1.4.0-315/hadoop/conf
spark.yarn.am.extraJavaOptions=-Diop.version=3.1.4.0-315 -Djava.library.path=/usr/hdp/current/hadoop-client/lib/native
spark.executor.memoryOverhead=5G
spark.driver.extraJavaOptions=-Diop.version=3.1.4.0-315 -Djava.library.path=/usr/hdp/current/hadoop-client/lib/native

# the Spark Executors (on the work nodes) needs this to resolve classpath to run Spark tasks
spark.executorEnv.JAVA_HOME=/usr/local/jdk1.8.0_191/
#spark.executorEnv.HADOOP_CONF_DIR=/usr/hdp/3.1.4.0-315/hadoop/conf
spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=500 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/mnt/data_1/log/spark2/gc-spark%p.log
spark.executor.cores=1
spark.executor.memory=80G
spark.executor.instances=3
spark.executor.extraClassPath=/etc/hadoop/conf:/usr/spark/jars:/usr/hdp/current/hbase-client/lib:/usr/janusgraph/0.4.0/lib

spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.network.timeout=1000000
spark.rpc.askTimeout=1000000
spark.shuffle.service.enabled=true
spark.shuffle.service.port=7447
spark.maxRemoteBlockSizeFetchToMem=10485760
spark.memory.useLegacyMode=true
spark.shuffle.memoryFraction=0.1
spark.storage.memoryFraction=0.1
spark.memory.fraction=0.1
spark.memory.storageFraction=0.1
spark.shuffle.accurateBlockThreshold=1048576


The Spark job failed at stage 50:
20/12/30 01:53:00 ERROR executor.Executor: Exception in task 40.0 in stage 50.0 (TID 192084)
java.lang.OutOfMemoryError: Java heap space
        at sun.reflect.generics.repository.ClassRepository.getSuperInterfaces(ClassRepository.java:114)
        at java.lang.Class.getGenericInterfaces(Class.java:913)
        at java.util.HashMap.comparableClassFor(HashMap.java:351)
        at java.util.HashMap$TreeNode.treeify(HashMap.java:1932)
        at java.util.HashMap.treeifyBin(HashMap.java:772)
        at java.util.HashMap.putVal(HashMap.java:644)
        at java.util.HashMap.put(HashMap.java:612)
        at java.util.Collections$SynchronizedMap.put(Collections.java:2588)
        at org.apache.tinkerpop.gremlin.process.traversal.traverser.util.TraverserSet.add(TraverserSet.java:90)
        at org.apache.tinkerpop.gremlin.process.computer.traversal.WorkerExecutor.lambda$drainStep$4(WorkerExecutor.java:232)
        at org.apache.tinkerpop.gremlin.process.computer.traversal.WorkerExecutor$$Lambda$86/877696627.accept(Unknown Source)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at org.apache.tinkerpop.gremlin.process.computer.traversal.WorkerExecutor.drainStep(WorkerExecutor.java:221)
        at org.apache.tinkerpop.gremlin.process.computer.traversal.WorkerExecutor.execute(WorkerExecutor.java:151)
        at org.apache.tinkerpop.gremlin.process.computer.traversal.TraversalVertexProgram.execute(TraversalVertexProgram.java:307)
        at org.apache.tinkerpop.gremlin.spark.process.computer.SparkExecutor.lambda$null$4(SparkExecutor.java:118)
        at org.apache.tinkerpop.gremlin.spark.process.computer.SparkExecutor$$Lambda$72/1209554928.apply(Unknown Source)
        at org.apache.tinkerpop.gremlin.util.iterator.IteratorUtils$3.next(IteratorUtils.java:247)
        at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)


From the log it seems there is too much data; even the 80 GB executor heap is not enough.
Can anybody help me? Does anybody have an idea how to find all paths between two vertices on a large graph?


Re: Running OLAP on HBase with SparkGraphComputer fails with Error Container killed by YARN for exceeding memory limits

Roy Yu <7604...@...>
 

Thanks  Evgenii  


On Tuesday, December 15, 2020 at 8:24:11 PM UTC+8 yevg...@... wrote:

Oh, I recall that we once tried to debug the same issue with JanusGraph-HBase; we had clear supernodes in the graph. None of our attempts at repartitioning, including analyzing the SparkGraphComputer code and tinkering to make it work for partitioned vertices, were successful. Using Cassandra (the latest 3.x version at the time) apparently did not lead to OOM, but it was noticeably slower than HBase when we used it with smaller graphs.

Best regards,
Evgenii Ignatev.

On 15.12.2020 07:07, Roy Yu wrote:
Thanks Marc

On Friday, December 11, 2020 at 3:40:25 PM UTC+8 HadoopMarc wrote:
Hi Roy,

I think I would first check whether the skew is absent when you count the rows reading the HBase table directly from Spark (so, without using JanusGraph), e.g.:
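(The original example snippet is not included in this digest. A rough sketch of such a count, assuming Spark's Java API and HBase's TableInputFormat, could look like the following; it is not the snippet Marc posted.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HBaseRowCount {
    public static long count(JavaSparkContext sc) {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "master001,master002,master003");
        conf.set("zookeeper.znode.parent", "/hbase-unsecure");
        conf.set(TableInputFormat.INPUT_TABLE, "MyHBaseTable");
        // one Spark partition per HBase region; the per-task record counts in the
        // Spark UI show whether one region holds most of the rows (i.e. skew)
        JavaPairRDD<ImmutableBytesWritable, Result> rows = sc.newAPIHadoopRDD(
                conf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class);
        return rows.count();
    }
}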


If this works all right, then you know that somehow in the janusgraph HBaseInputFormat the mappers do not get the right key ranges to read from.

I also thought about the storage.hbase.region-count property of janusgraph-hbase. If you specified this as 40 while creating the graph, janusgraph-hbase would create many small regions that would be compacted by HBase later on. But maybe this creates a different structure in the row keys that can be leveraged by hbase.mapreduce.tableinput.mappers.per.region.

Best wishes,     Marc


Op woensdag 9 december 2020 om 17:16:35 UTC+1 schreef Roy Yu:
Hi Marc, 

The parameter hbase.mapreduce.tableinput.mappers.per.region is effective. I set it to 40, and there are 40 tasks processing every region. But now comes a new problem: data skew. I use g.E().count() to count all the edges of the graph. While counting one region, one Spark task holds all 2.6 GB of data while the other 39 tasks hold no data, and the task failed again. I checked my data: there are some vertices with more than 1 million incident edges. So I tried to solve this problem using a vertex cut (https://docs.janusgraph.org/advanced-topics/partitioning/); my graph schema is something like [mgmt.makeVertexLabel('product').partition().make()]. But when I used MR to load data into the new graph, it took more than 10 times as long as the attempt without partition(), and from the HBase table detail page I saw that the loading process was busy reading from and writing to the first region. The first region became a hot spot. I guess it relates to vertex ids. Could you help me again?
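(For context, the vertex-cut setup mentioned above combines a graph-level setting with the schema call. A minimal, illustrative sketch only, not Roy's actual loading code:)

import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;
import org.janusgraph.core.schema.JanusGraphManagement;

public class PartitionedLabel {
    public static void main(String[] args) {
        // cluster.max-partitions has to be set when the graph is first created
        JanusGraph graph = JanusGraphFactory.build()
                .set("storage.backend", "hbase")
                .set("storage.hbase.table", "MyHBaseTable")
                .set("cluster.max-partitions", 16)
                .open();

        JanusGraphManagement mgmt = graph.openManagement();
        // partition() spreads vertices of this label across partitions (vertex cut)
        mgmt.makeVertexLabel("product").partition().make();
        mgmt.commit();
        graph.close();
    }
}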

On Tuesday, December 8, 2020 at 3:13:42 PM UTC+8 HadoopMarc wrote:
Hi Roy,

As I mentioned, I did not keep up with possibly new janusgraph-hbase features. From the HBase source, I see that HBase now has a "hbase.mapreduce.tableinput.mappers.per.region" config parameter.


It should not be too difficult to adapt the janusgraph HBaseInputFormat to leverage this feature (or maybe it even works without change???).

Best wishes,

Marc

Op dinsdag 8 december 2020 om 04:21:19 UTC+1 schreef Roy Yu:
you seem to run on cloud infra that reduces your requested 40 Gb to 33 Gb (see https://databricks.com/session_na20/running-apache-spark-on-kubernetes-best-practices-and-pitfalls). Fact of life. 
---------------------
Sorry Marc, I misled you. The error message was generated when I set spark.executor.memory to 30G; when that failed, I increased spark.executor.memory to 40G and it failed as well. I felt desperate and came here to ask for help.
On Tuesday, December 8, 2020 at 10:35:19 AM UTC+8 Roy Yu wrote:
Hi Marc

Thanks for your immediate response.
I've tried setting spark.yarn.executor.memoryOverhead=10G and re-running the task, and it still failed. From the Spark task UI, I saw that 80% of the processing time was full GC time. As you said, the 2.6 GB (GZ compressed) region exploding in memory is my root cause. Now I'm trying to reduce my region size to 1 GB; if that still fails, I'm going to configure the HBase HFiles to not use a compressed format.
This was my first time running JanusGraph OLAP, and I think this is a common problem: an HBase region size of 2.6 GB (compressed) is not large, and 20 GB is very common in our production. If the community does not solve this problem, the JanusGraph HBase-based OLAP solution cannot be adopted by other companies either.
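(For reference, switching the edge column family to uncompressed HFiles can be done with the HBase 2.x Admin API; the sketch below is an assumption about how that would look and is untested against this cluster. For a freshly created table, JanusGraph's storage.hbase.compression-algorithm option controls the same setting at creation time.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class DisableCompression {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            TableName table = TableName.valueOf("ky415");
            // rebuild the descriptor of the edge column family 'e' without block compression
            ColumnFamilyDescriptor existing = admin.getDescriptor(table).getColumnFamily(Bytes.toBytes("e"));
            ColumnFamilyDescriptor updated = ColumnFamilyDescriptorBuilder
                    .newBuilder(existing)
                    .setCompressionType(Compression.Algorithm.NONE)
                    .build();
            admin.modifyColumnFamily(table, updated);
            // existing HFiles are only rewritten after a major compaction
            admin.majorCompact(table);
        }
    }
}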

On Tuesday, December 8, 2020 at 12:40:40 AM UTC+8 HadoopMarc wrote:
Hi Roy,

There seem to be three things bothering you here:
  1. you did not specify spark.yarn.executor.memoryOverhead, as the exception message says. Easily solved.
  2. you seem to run on cloud infra that reduces your requested 40 Gb to 33 Gb (see https://databricks.com/session_na20/running-apache-spark-on-kubernetes-best-practices-and-pitfalls). Fact of life.
  3. the janusgraph HBaseInputFormat uses entire HBase regions as hadoop partitions, which are fed into spark tasks. The 2.6 GB region size is for compressed binary data, which explodes when expanded into java objects. This is your real problem.
I did not follow the latest status of janusgraph-hbase features for the HBaseInputFormat, but you have to somehow use spark with smaller partitions than an entire HBase region.
A long time ago, I had success with skipping the HBaseInputFormat and having spark executors connect to JanusGraph themselves. That is not a quick solution, though.
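(As an illustration of that idea only — not the code Marc used — a sketch with a placeholder properties file might look like this, here computing out-degrees for a list of vertex ids:)

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class DirectJanusGraphJob {
    // one JanusGraph connection per Spark partition instead of reading via HBaseInputFormat
    public static JavaRDD<Long> outDegrees(JavaSparkContext sc, List<Object> vertexIds) {
        return sc.parallelize(vertexIds, 100).mapPartitions(ids -> {
            JanusGraph graph = JanusGraphFactory.open("conf/janusgraph-hbase.properties"); // placeholder path
            GraphTraversalSource g = graph.traversal();
            List<Long> counts = new ArrayList<>();
            while (ids.hasNext()) {
                counts.add(g.V(ids.next()).outE().count().next());
            }
            graph.close();
            return counts.iterator();
        });
    }
}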

Best wishes,

Marc

Op maandag 7 december 2020 om 14:10:55 UTC+1 schreef Roy Yu:
Error message:
ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 33.1 GB of 33 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714. 

graph config:
spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=500 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/mnt/data_1/log/spark2/gc-spark%p.log
spark.executor.cores=1
spark.executor.memory=40960m
spark.executor.instances=3

Region info:
hdfs dfs -du -h /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc
67     134    /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/.regioninfo
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/.tmp
2.6 G  5.1 G  /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/e
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/f
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/g
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/h
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/i
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/l
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/m
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/recovered.edits
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/s
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/t
root@~$

Can anybody help me?


Re: Remote Traversal with Java

HadoopMarc <bi...@...>
 

Hi Peter,

This seems more relevant:
https://docs.janusgraph.org/basics/configured-graph-factory/#graph-and-traversal-bindings

So, some JanusGraph and GraphTraversalSource objects are created remotely with names following a convention. You cannot assign the instances locally.
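Concretely (a hedged sketch, assuming the graph was created as "test" so the server binds a traversal source named "test_traversal"; adjust host and port to your setup):

import org.apache.tinkerpop.gremlin.driver.Cluster;
import org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection;
import org.apache.tinkerpop.gremlin.process.traversal.AnonymousTraversalSource;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;

public class RemoteTraversalExample {
    public static void main(String[] args) throws Exception {
        Cluster cluster = Cluster.build("localhost").port(8182).create();
        // "test_traversal" is the traversal source the server exposes for the graph named "test"
        GraphTraversalSource g = AnonymousTraversalSource.traversal()
                .withRemote(DriverRemoteConnection.using(cluster, "test_traversal"));
        System.out.println(g.V().count().next());
        g.close();
        cluster.close();
    }
}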

Best wishes,    Marc

Op donderdag 31 december 2020 om 12:25:10 UTC+1 schreef HadoopMarc:

Hi Peter,

Have you tried the suggestions from an earlier thread:

https://groups.google.com/g/janusgraph-users/c/w2_qMchATnw/m/zoMSlMO5BwAJ
Best wishes,    Marc

Op woensdag 30 december 2020 om 20:06:36 UTC+1 schreef Peter Borissow:
Dear All,
    I have installed/configured a single node JanusGraph Server with a Berkeley database backend and ConfigurationManagementGraph support so that I can create/manage multiple graphs on the server. 

In a Gremlin console on my desktop I can connect to the remote server, create graphs, create vertexes, etc.   

In Java code on my desktop, I can connect to the remote server and issue commands via Client.submit() method. However, I cannot figure out how to open a specific graph on the server and get a traversal. In the Gremlin console it is as simple as this:

gremlin> :remote connect tinkerpop.server conf/remote.yaml session 
gremlin> :remote console  
gremlin> graph = ConfiguredGraphFactory.open("test"); 
gremlin> g = graph.traversal();  

In Java, once I connect to the server/cluster and create a client connection, I think it should be as simple as this:

DriverRemoteConnection conn = DriverRemoteConnection.using(client, name);
GraphTraversalSource g = AnonymousTraversalSource.traversal().withRemote(conn);  

More info here:
https://stackoverflow.com/questions/65486512/janusgraph-remote-traversal-with-java
Any help/guidance would be greatly appreciated!

Thanks,
Peter


Slow convert to Java object

Maxim Milovanov <milov...@...>
 

Hi!

I am trying to write a select query and the performance is very slow when I call the toList() method.
How can I improve this performance?


My example:

        long start;
        long end;
        try (GraphTraversalSource g = graph.traversal()) {
            start = System.currentTimeMillis();
            GraphTraversal<Vertex, Vertex> list2 = g.V().hasLabel("Entity");
            end = System.currentTimeMillis();
            System.out.printf("getGremlinTime: %d ms\n", (end - start));

            start = end;
            List<Vertex> res2 = list2.toList();
            end = System.currentTimeMillis();
            System.out.printf("toList: %d ms\n gremlin count: %d\n", (end - start), res2.size());
        }


Debug log:

getGremlinTime: 13 ms
2020-12-31 14:19:01.336  WARN 14144 --- [           main] o.j.g.transaction.StandardJanusGraphTx   : Query requires iterating over all vertices [(~label = Entity)]. For better performance, use indexes
toList: 12025 ms
 gremlin count: 105  


Re: Remote Traversal with Java

HadoopMarc <bi...@...>
 

Hi Peter,

Have you tried the suggestions from an earlier thread:

https://groups.google.com/g/janusgraph-users/c/w2_qMchATnw/m/zoMSlMO5BwAJ

Best wishes,    Marc

Op woensdag 30 december 2020 om 20:06:36 UTC+1 schreef Peter Borissow:

Dear All,
    I have installed/configured a single node JanusGraph Server with a Berkeley database backend and ConfigurationManagementGraph support so that I can create/manage multiple graphs on the server. 

In a Gremlin console on my desktop I can connect to the remote server, create graphs, create vertexes, etc.   

In Java code on my desktop, I can connect to the remote server and issue commands via Client.submit() method. However, I cannot figure out how to open a specific graph on the server and get a traversal. In the Gremlin console it is as simple as this:

gremlin> :remote connect tinkerpop.server conf/remote.yaml session 
gremlin> :remote console  
gremlin> graph = ConfiguredGraphFactory.open("test"); 
gremlin> g = graph.traversal();  

In Java, once I connect to the server/cluster and create a client connection, I think it should be as simple as this:

DriverRemoteConnection conn = DriverRemoteConnection.using(client, name);
GraphTraversalSource g = AnonymousTraversalSource.traversal().withRemote(conn);  

More info here:
https://stackoverflow.com/questions/65486512/janusgraph-remote-traversal-with-java
Any help/guidance would be greatly appreciated!

Thanks,
Peter


Re: Degree-Centrality Filtering & Search – Scalable Strategies for OLTP

BO XUAN LI <libo...@...>
 

Hi Zach,

If you want to run the query in a multi-threading manner, try enabling “query.batch” (ref: https://docs.janusgraph.org/basics/configuration-reference/#query).

Since you are using Cassandra which does not support batch reading natively, JanusGraph will use a thread pool to fire the backend queries. This should reduce latency of this single query but might impact overall application performance if your application is already handling heavy workloads.
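(A hedged example of what enabling that option can look like when the graph is opened programmatically; the backend settings are placeholders, not Zach's actual configuration:)

import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class BatchQueryGraph {
    public static JanusGraph open() {
        return JanusGraphFactory.build()
                .set("storage.backend", "cql")
                .set("storage.hostname", "127.0.0.1")  // placeholder
                .set("query.batch", true)              // fire backend queries in parallel via a thread pool
                .open();
    }
}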

Best regards,
Boxuan

On Dec 31, 2020, at 8:51 AM, zblu...@gmail.com <zblu...@...> wrote:

Hi Marc, Boxuan,

Thank you for the discussion. I have been experimenting with different queries, including your id() suggestion, Marc. In line with Boxuan’s feedback, the where() step performs about the same (maybe slightly slower) when adding the .id() step.

My bigger concern for my use case is how this type of operation scales, in a manner that seems relatively linear with sample size, i.e.

g.V().limit(10).where(inE().count().is(gt(6))).profile() => ~30 ms

g.V().limit(100).where(inE().count().is(gt(6))).profile() => ~147 ms

g.V().limit(1000).where(inE().count().is(gt(6))).profile()  => ~1284 ms

g.V().limit(10000).where(inE().count().is(gt(6))).profile()  => ~13779 ms

g.V().limit(100000).where(inE().count().is(gt(6))).profile()  => ? > 120000 ms (timeout)

 

This behavior makes sense when I think about it and also when I inspect the profile (example profile of limit(10) traversal below)

I know the above traversal seems a bit funky, but I am trying to consistently analyze the effect of sample size on the edge count portion of the query.

Looking at the profile, it seems like JG needs to perform a sliceQuery operation on each vertex sequentially, which isn’t well optimized for my use case. I know that if centrality properties were included in a mixed index then it could be configured for scalable performance. However, going back to the original post, I am not sure that is the best/only way. Are there other configurations that could be optimized to make this operation more scalable without adding an additional index property?

In case it is relevant, I am using JanusGraph v 0.5.2 with Cassandra-CQL backend v3.11.

Thank you,

Zach

Example Profile

gremlin> g.V().limit(10).where(inE().count().is(gt(6))).profile()

==>Traversal Metrics

Step                                                               Count  Traversers       Time (ms)    % Dur

=============================================================================================================

JanusGraphStep(vertex,[])                                             10          10           8.684    28.71

    \_condition=()

    \_orders=[]

    \_limit=10

    \_isFitted=false

    \_isOrdered=true

    \_query=[]

  optimization                                                                                 0.005

  optimization                                                                                 0.001

  scan                                                                                         0.000

    \_query=[]

    \_fullscan=true

    \_condition=VERTEX

TraversalFilterStep([JanusGraphVertexStep(IN,ed...                                            21.564    71.29

  JanusGraphVertexStep(IN,edge)                                       13          13          21.350

    \_condition=(EDGE AND visibility:normal)

    \_orders=[]

    \_limit=7

    \_isFitted=false

    \_isOrdered=true

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_vertices=1

    optimization                                                                               0.003

    backend-query                                                      3                       4.434

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      1                       1.291

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      2                       1.311

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      1                       2.483

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      2                       1.310

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      2                       1.313

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      2                       1.192

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      4                       1.287

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      3                       1.231

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      2                       3.546

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

  RangeGlobalStep(0,7)                                                13          13           0.037

  CountGlobalStep                                                     10          10           0.041

  IsStep(gt(6))                                                                                0.022

                                            >TOTAL                     -           -          30.249        -


On Wednesday, December 30, 2020 at 4:59:20 AM UTC-5 libo...@connect.hku.hk wrote:
Hi Marc,

I think it will be just as slow as the initial one, if not slower. If I recall correctly, JanusGraph always fetches the whole edge (column + value) even if you only need the count (in which case neither column nor value is really needed), or you only need the edge id (in which case only the column is needed). I created https://github.com/JanusGraph/janusgraph/discussions/2315 to discuss this potential optimization. Btw, even if we assume this optimization is in place, I don’t expect a significant performance boost for Zach’s use case.

Best regards,
Boxuan


On Dec 30, 2020, at 4:44 PM, HadoopMarc <b...@...> wrote:

Hi Zach, Boxuan,

There is one thing I do not understand. According to the JanusGraph data model, the outE relationIdentifiers are stored in the vertex. So, retrieving all outE() relationIdentifiers with the vertex for counting them should not take long, even if there are hundreds of thousands of them. What happens if you do:

g.V()

   .has("someProperty",eq("someValue"))

   .where(outE().id().count().is(gt(10)));

If this does not work, it should be possible to configure/modify JanusGraph such that it does not fetch edge properties that are not needed for the count.

Best wishes,   Marc

Op woensdag 30 december 2020 om 04:15:46 UTC+1 schreef libo...@connect.hku.hk:
Hi Zach,

I have some concerns over concurrency and consistency issues, but this might still be a nice feature to have. I think you could open a new discussion on https://github.com/JanusGraph/janusgraph/discussions. That would be a better place for brainstorming. It would be awesome if you can share more context on why you think this is a very common business requirement.

Cheers,
Boxuan

On Dec 30, 2020, at 4:42 AM, zblu...@gmail.com <zb...@...> wrote:

Thank you Boxuan,

Was using the term “job” pretty loosely.  Your inference about doing these things within ingest/deletion process makes sense.

I know there is a lot on the community’s plate now, but if my above solution is truly optimal for current state, I wonder if a JG feature addition may help tackle this problem more consistently. Something like an additional, 3rd , index type (in addition to “graph” and “vertex-centric” indices) . i.e. “edge-connection” or “degree-centrality” index. The feature would require a mixed indexing backend, and minimally a mechanism to choose vertex and edge label combinations to count IN, OUT, and/or BOTH degree centrality.

Not sure what the level of effort or implementation details would be, but this is a very common business requirement for graph-based search.  If JanusGraph has native/tested support for it, it would make JanusGraph even easier to champion.

😊

Best,

Zach


On Tuesday, December 29, 2020 at 3:19:46 AM UTC-5 libo...@connect.hku.hk wrote:
Hi Zach,

Personally I think your workaround is the most optimal one. JanusGraph does not store number of edges as metadata in the vertex (there are both Pros & Cons for doing / not doing this).

Btw do you have to have another job doing centrality calculation separately? If your application is built on top of JanusGraph, then probably you can maintain the “outDegree” property when inserting/deleting edges.
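(A rough sketch of that bookkeeping, with a hypothetical helper; concurrent writers would still need extra care, e.g. locking or eventual reconciliation:)

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;

public class DegreeBookkeeping {
    // hypothetical helper: add an edge and bump the cached outDegree on the source vertex
    // (not safe under concurrent writers without additional consistency handling)
    public static void addEdgeWithDegree(GraphTraversalSource g, Object fromId, Object toId, String label) {
        g.V(fromId).as("a").V(toId).addE(label).from("a").iterate();
        long current = g.V(fromId).values("outDegree").tryNext()
                .map(v -> ((Number) v).longValue()).orElse(0L);
        g.V(fromId).property("outDegree", current + 1).iterate();
    }
}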

Best regards,
Boxuan

On Dec 29, 2020, at 6:49 AM, zblu...@gmail.com <zb...@...> wrote:

Hello all,

Curious about best approaches/practices for scalable degree-centrality search filters on large (millions to billions of nodes) JanusGraphs.  i.e. something like :

g.V()

   .has("someProperty",eq("someValue"))

   .where(outE().count().is(gt(10)));                            

Suppose the has-step narrows down to a large number of vertices (hundreds of thousands), then performing that form of count on that many vertices will result in timeouts and inefficiencies (at least in my experience).  My workaround for this has been pre-calculating centrality in another job and writing to a Vertex Property that can subsequently be included in a mixed index. So we can do:

g.V()

   .has("someProperty",eq("someValue"))

   .has(“outDegree”,gt(10))

This works, but it is yet another calculation we must maintain in our pipeline, and while it suffices, it seems like more of a workaround than a great solution. I was hoping there was a more optimal approach/strategy. Please let me know.
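(For reference, a minimal sketch of the schema piece behind that workaround, assuming a property named "outDegree" and a mixed index backend called "search"; the index and key names are illustrative, not Zach's actual schema:)

import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.PropertyKey;
import org.janusgraph.core.schema.JanusGraphManagement;

public class OutDegreeIndex {
    public static void define(JanusGraph graph) {
        JanusGraphManagement mgmt = graph.openManagement();
        PropertyKey outDegree = mgmt.makePropertyKey("outDegree").dataType(Integer.class).make();
        // a mixed index lets range predicates like has("outDegree", gt(10)) be answered by the index backend
        mgmt.buildIndex("byOutDegree", Vertex.class).addKey(outDegree).buildMixedIndex("search");
        mgmt.commit();
    }
}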

Thank you,

Zach








Re: Degree-Centrality Filtering & Search – Scalable Strategies for OLTP

"zb...@gmail.com" <zblu...@...>
 

Hi Marc, Boxuan,

Thank you for the discussion. I have been experimenting with different queries, including your id() suggestion, Marc. In line with Boxuan’s feedback, the where() step performs about the same (maybe slightly slower) when adding the .id() step.

My bigger concern for my use case is how this type of operation scales, in a manner that seems relatively linear with sample size, i.e.

g.V().limit(10).where(inE().count().is(gt(6))).profile() => ~30 ms

g.V().limit(100).where(inE().count().is(gt(6))).profile() => ~147 ms

g.V().limit(1000).where(inE().count().is(gt(6))).profile()  => ~1284 ms

g.V().limit(10000).where(inE().count().is(gt(6))).profile()  => ~13779 ms

g.V().limit(100000).where(inE().count().is(gt(6))).profile()  => ? > 120000 ms (timeout)

 

This behavior makes sense when I think about it and also when I inspect the profile (example profile of limit(10) traversal below)

I know the above traversal seems a bit funky, but I am trying to consistently analyze the effect of sample size on the edge count portion of the query.

Looking at the profile, it seems like JG needs to perform a sliceQuery operation on each vertex sequentially, which isn’t well optimized for my use case. I know that if centrality properties were included in a mixed index then it could be configured for scalable performance. However, going back to the original post, I am not sure that is the best/only way. Are there other configurations that could be optimized to make this operation more scalable without adding an additional index property?

In case it is relevant, I am using JanusGraph v 0.5.2 with Cassandra-CQL backend v3.11.

Thank you,

Zach

Example Profile

gremlin> g.V().limit(10).where(inE().count().is(gt(6))).profile()

==>Traversal Metrics

Step                                                               Count  Traversers       Time (ms)    % Dur

=============================================================================================================

JanusGraphStep(vertex,[])                                             10          10           8.684    28.71

    \_condition=()

    \_orders=[]

    \_limit=10

    \_isFitted=false

    \_isOrdered=true

    \_query=[]

  optimization                                                                                 0.005

  optimization                                                                                 0.001

  scan                                                                                         0.000

    \_query=[]

    \_fullscan=true

    \_condition=VERTEX

TraversalFilterStep([JanusGraphVertexStep(IN,ed...                                            21.564    71.29

  JanusGraphVertexStep(IN,edge)                                       13          13          21.350

    \_condition=(EDGE AND visibility:normal)

    \_orders=[]

    \_limit=7

    \_isFitted=false

    \_isOrdered=true

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_vertices=1

    optimization                                                                               0.003

    backend-query                                                      3                       4.434

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      1                       1.291

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      2                       1.311

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      1                       2.483

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      2                       1.310

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      2                       1.313

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      2                       1.192

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      4                       1.287

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      3                       1.231

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      2                       3.546

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

  RangeGlobalStep(0,7)                                                13          13           0.037

  CountGlobalStep                                                     10          10           0.041

  IsStep(gt(6))                                                                                0.022

                                            >TOTAL                     -           -          30.249        -


On Wednesday, December 30, 2020 at 4:59:20 AM UTC-5 li...@... wrote:
Hi Marc,

I think it will be just as slow as the initial one, if not slower. If I recall correctly, JanusGraph always fetches the whole edge (column + value) even if you only need the count (in which case neither column nor value is really needed), or you only need the edge id (in which case only the column is needed). I created https://github.com/JanusGraph/janusgraph/discussions/2315 to discuss this potential optimization. Btw, even if we assume this optimization is in place, I don’t expect a significant performance boost for Zach’s use case.

Best regards,
Boxuan


On Dec 30, 2020, at 4:44 PM, HadoopMarc <b...@...> wrote:

Hi Zach, Boxuan,

There is one thing I do not understand. According to the JanusGraph data model, the outE relationIdentifiers are stored in the vertex. So, retrieving all outE() relationIdentifiers with the vertex for counting them should not take long, even if there are hundreds of thousands of them. What happens if you do:

g.V()

   .has("someProperty",eq("someValue"))

   .where(outE().id().count().is(gt(10)));

If this does not work, it should be possible to configure/modify JanusGraph such that it does not fetch edge properties that are not needed for the count.

Best wishes,   Marc

Op woensdag 30 december 2020 om 04:15:46 UTC+1 schreef libo...@connect.hku.hk:
Hi Zach,

I have some concerns over concurrency and consistency issues, but this might still be a nice feature to have. I think you could open a new discussion on https://github.com/JanusGraph/janusgraph/discussions. That would be a better place for brainstorming. It would be awesome if you can share more context on why you think this is a very common business requirement.

Cheers,
Boxuan

On Dec 30, 2020, at 4:42 AM, zblu...@gmail.com <zb...@...> wrote:

Thank you Boxuan,

Was using the term “job” pretty loosely.  Your inference about doing these things within ingest/deletion process makes sense.

I know there is a lot on the community’s plate now, but if my above solution is truly optimal for current state, I wonder if a JG feature addition may help tackle this problem more consistently. Something like an additional, 3rd , index type (in addition to “graph” and “vertex-centric” indices) . i.e. “edge-connection” or “degree-centrality” index. The feature would require a mixed indexing backend, and minimally a mechanism to choose vertex and edge label combinations to count IN, OUT, and/or BOTH degree centrality.

Not sure what the level of effort or implementation details would be, but this is a very common business requirement for graph-based search.  If JanusGraph has native/tested support for it, it would make JanusGraph even easier to champion.

😊

Best,

Zach


On Tuesday, December 29, 2020 at 3:19:46 AM UTC-5 libo...@connect.hku.hk wrote:
Hi Zach,

Personally I think your workaround is the most optimal one. JanusGraph does not store number of edges as metadata in the vertex (there are both Pros & Cons for doing / not doing this).

Btw do you have to have another job doing centrality calculation separately? If your application is built on top of JanusGraph, then probably you can maintain the “outDegree” property when inserting/deleting edges.

Best regards,
Boxuan

On Dec 29, 2020, at 6:49 AM, zblu...@gmail.com <zb...@...> wrote:

Hello all,

Curious about best approaches/practices for scalable degree-centrality search filters on large (millions to billions of nodes) JanusGraphs.  i.e. something like :

g.V()

   .has("someProperty",eq("someValue"))

   .where(outE().count().is(gt(10)));                            

Suppose the has-step narrows down to a large number of vertices (hundreds of thousands), then performing that form of count on that many vertices will result in timeouts and inefficiencies (at least in my experience).  My workaround for this has been pre-calculating centrality in another job and writing to a Vertex Property that can subsequently be included in a mixed index. So we can do:

g.V()

   .has("someProperty",eq("someValue"))

   .has(“outDegree”,gt(10))

This works, but it is yet another calculation we must maintain in our pipeline, and while it suffices, it seems like more of a workaround than a great solution. I was hoping there was a more optimal approach/strategy. Please let me know.

Thank you,

Zach







Remote Traversal with Java

Peter Borissow <peter....@...>
 

Dear All,
    I have installed/configured a single node JanusGraph Server with a Berkeley database backend and ConfigurationManagementGraph support so that I can create/manage multiple graphs on the server. 

In a Gremlin console on my desktop I can connect to the remote server, create graphs, create vertexes, etc.   

In Java code on my desktop, I can connect to the remote server and issue commands via Client.submit() method. However, I cannot figure out how to open a specific graph on the server and get a traversal. In the Gremlin console it is as simple as this:

gremlin> :remote connect tinkerpop.server conf/remote.yaml session 
gremlin> :remote console  
gremlin> graph = ConfiguredGraphFactory.open("test"); 
gremlin> g = graph.traversal();  

In Java, once I connect to the server/cluster and create a client connection, I think it should be as simple as this:

DriverRemoteConnection conn = DriverRemoteConnection.using(client, name);
GraphTraversalSource g = AnonymousTraversalSource.traversal().withRemote(conn);  

More info here:
https://stackoverflow.com/questions/65486512/janusgraph-remote-traversal-with-java

Any help/guidance would be greatly appreciated!

Thanks,
Peter


Re: Degree-Centrality Filtering & Search – Scalable Strategies for OLTP

BO XUAN LI <libo...@...>
 

Hi Marc,

I think it will be just as slow as the initial one, if not slower. If I recall correctly, JanusGraph always fetches the whole edge (column + value) even if you only need the count (in which case neither column nor value is really needed), or you only need the edge id (in which case only the column is needed). I created https://github.com/JanusGraph/janusgraph/discussions/2315 to discuss this potential optimization. Btw, even if we assume this optimization is in place, I don’t expect a significant performance boost for Zach’s use case.

Best regards,
Boxuan


On Dec 30, 2020, at 4:44 PM, HadoopMarc <bi...@...> wrote:

Hi Zach, Boxuan,

There is one thing I do not understand. According to the JanusGraph data model, the outE relationIdentifiers are stored in the vertex. So, retrieving all outE() relationIdentifiers with the vertex for counting them should not take long, even if there are hundreds of thousands of them. What happens if you do:

g.V()

   .has("someProperty",eq("someValue"))

   .where(outE().id().count().is(gt(10)));

If this does not work, it should be possible to configure/modify JanusGraph such that it does not fetch edge properties that are not needed for the count.

Best wishes,   Marc

Op woensdag 30 december 2020 om 04:15:46 UTC+1 schreef libo...@connect.hku.hk:
Hi Zach,

I have some concerns over concurrency and consistency issues, but this might still be a nice feature to have. I think you could open a new discussion on https://github.com/JanusGraph/janusgraph/discussions. That would be a better place for brainstorming. It would be awesome if you can share more context on why you think this is a very common business requirement.

Cheers,
Boxuan

On Dec 30, 2020, at 4:42 AM, zblu...@gmail.com <zb...@...> wrote:

Thank you Boxuan,

Was using the term “job” pretty loosely.  Your inference about doing these things within ingest/deletion process makes sense.

I know there is a lot on the community’s plate now, but if my above solution is truly optimal for current state, I wonder if a JG feature addition may help tackle this problem more consistently. Something like an additional, 3rd , index type (in addition to “graph” and “vertex-centric” indices) . i.e. “edge-connection” or “degree-centrality” index. The feature would require a mixed indexing backend, and minimally a mechanism to choose vertex and edge label combinations to count IN, OUT, and/or BOTH degree centrality.

Not sure what the level of effort or implementation details would be, but this is a very common business requirement for graph-based search.  If JanusGraph has native/tested support for it, it would make JanusGraph even easier to champion.

😊

Best,

Zach


On Tuesday, December 29, 2020 at 3:19:46 AM UTC-5 libo...@connect.hku.hk wrote:
Hi Zach,

Personally I think your workaround is the most optimal one. JanusGraph does not store number of edges as metadata in the vertex (there are both Pros & Cons for doing / not doing this).

Btw do you have to have another job doing centrality calculation separately? If your application is built on top of JanusGraph, then probably you can maintain the “outDegree” property when inserting/deleting edges.

Best regards,
Boxuan

On Dec 29, 2020, at 6:49 AM, zblu...@gmail.com <zb...@...> wrote:

Hello all,

Curious about best approaches/practices for scalable degree-centrality search filters on large (millions to billions of nodes) JanusGraphs.  i.e. something like :

g.V()

   .has("someProperty",eq("someValue"))

   .where(outE().count().is(gt(10)));                            

Suppose the has-step narrows down to a large number of vertices (hundreds of thousands), then performing that form of count on that many vertices will result in timeouts and inefficiencies (at least in my experience).  My workaround for this has been pre-calculating centrality in another job and writing to a Vertex Property that can subsequently be included in a mixed index. So we can do:

g.V()

   .has("someProperty",eq("someValue"))

   .has(“outDegree”,gt(10))

This works, but it is yet another calculation we must maintain in our pipeline, and while it suffices, it seems like more of a workaround than a great solution. I was hoping there was a more optimal approach/strategy. Please let me know.

Thank you,

Zach







Re: Degree-Centrality Filtering & Search – Scalable Strategies for OLTP

HadoopMarc <bi...@...>
 

Hi Zach, Boxuan,

There is one thing I do not understand. According to the JanusGraph data model, the outE relationIdentifiers are stored in the vertex. So, retrieving all outE() relationIdentifiers with the vertex for counting them should not take long, even if there are hundreds of thousands of them. What happens if you do:

g.V()

   .has("someProperty",eq("someValue"))

   .where(outE().id().count().is(gt(10)));

If this does not work, it should be possible to configure/modify JanusGraph such that it does not fetch edge properties that are not needed for the count.

Best wishes,   Marc

Op woensdag 30 december 2020 om 04:15:46 UTC+1 schreef li...@...:

Hi Zach,

I have some concerns over concurrency and consistency issues, but this might still be a nice feature to have. I think you could open a new discussion on https://github.com/JanusGraph/janusgraph/discussions. That would be a better place for brainstorming. It would be awesome if you can share more context on why you think this is a very common business requirement.

Cheers,
Boxuan

On Dec 30, 2020, at 4:42 AM, zblu...@gmail.com <zb...@...> wrote:

Thank you Boxuan,

Was using the term “job” pretty loosely.  Your inference about doing these things within ingest/deletion process makes sense.

I know there is a lot on the community’s plate now, but if my above solution is truly optimal for current state, I wonder if a JG feature addition may help tackle this problem more consistently. Something like an additional, 3rd , index type (in addition to “graph” and “vertex-centric” indices) . i.e. “edge-connection” or “degree-centrality” index. The feature would require a mixed indexing backend, and minimally a mechanism to choose vertex and edge label combinations to count IN, OUT, and/or BOTH degree centrality.

Not sure what the level of effort or implementation details would be, but this is a very common business requirement for graph-based search.  If JanusGraph has native/tested support for it, it would make JanusGraph even easier to champion.

😊

Best,

Zach


On Tuesday, December 29, 2020 at 3:19:46 AM UTC-5 libo...@connect.hku.hk wrote:
Hi Zach,

Personally I think your workaround is the most optimal one. JanusGraph does not store number of edges as metadata in the vertex (there are both Pros & Cons for doing / not doing this).

Btw do you have to have another job doing centrality calculation separately? If your application is built on top of JanusGraph, then probably you can maintain the “outDegree” property when inserting/deleting edges.

Best regards,
Boxuan

On Dec 29, 2020, at 6:49 AM, zblu...@gmail.com <zb...@...> wrote:

Hello all,

Curious about best approaches/practices for scalable degree-centrality search filters on large (millions to billions of nodes) JanusGraphs.  i.e. something like :

g.V()

   .has("someProperty",eq("someValue"))

   .where(outE().count().is(gt(10)));                            

Suppose the has-step narrows down to a large number of vertices (hundreds of thousands), then performing that form of count on that many vertices will result in timeouts and inefficiencies (at least in my experience).  My workaround for this has been pre-calculating centrality in another job and writing to a Vertex Property that can subsequently be included in a mixed index. So we can do:

g.V()

   .has("someProperty",eq("someValue"))

   .has(“outDegree”,gt(10))

This works, but it is yet another calculation we must maintain in our pipeline, and while it suffices, it seems like more of a workaround than a great solution. I was hoping there was a more optimal approach/strategy. Please let me know.

Thank you,

Zach






Re: Degree-Centrality Filtering & Search – Scalable Strategies for OLTP

BO XUAN LI <libo...@...>
 

Hi Zach,

I have some concerns over concurrency and consistency issues, but this might still be a nice feature to have. I think you could open a new discussion on https://github.com/JanusGraph/janusgraph/discussions. That would be a better place for brainstorming. It would be awesome if you can share more context on why you think this is a very common business requirement.

Cheers,
Boxuan

On Dec 30, 2020, at 4:42 AM, zblu...@gmail.com <zblu...@...> wrote:

Thank you Boxuan,

Was using the term “job” pretty loosely.  Your inference about doing these things within ingest/deletion process makes sense.

I know there is a lot on the community’s plate now, but if my above solution is truly optimal for current state, I wonder if a JG feature addition may help tackle this problem more consistently. Something like an additional, 3rd , index type (in addition to “graph” and “vertex-centric” indices) . i.e. “edge-connection” or “degree-centrality” index. The feature would require a mixed indexing backend, and minimally a mechanism to choose vertex and edge label combinations to count IN, OUT, and/or BOTH degree centrality.

Not sure what the level of effort or implementation details would be, but this is a very common business requirement for graph-based search.  If JanusGraph has native/tested support for it, it would make JanusGraph even easier to champion.

😊

Best,

Zach


On Tuesday, December 29, 2020 at 3:19:46 AM UTC-5 libo...@connect.hku.hk wrote:
Hi Zach,

Personally I think your workaround is the optimal one. JanusGraph does not store the number of edges as metadata in the vertex (there are both pros and cons to doing / not doing this).

Btw, do you have to have another job doing the centrality calculation separately? If your application is built on top of JanusGraph, then you can probably maintain the “outDegree” property when inserting/deleting edges.

Best regards,
Boxuan

On Dec 29, 2020, at 6:49 AM, zblu...@gmail.com <zb...@...> wrote:

Hello all,

Curious about best approaches/practices for scalable degree-centrality search filters on large (millions to billions of nodes) JanusGraphs.  i.e. something like :

g.V()

   .has("someProperty",eq("someValue"))

   .where(outE().count().is(gt(10)));                            

Suppose the has-step narrows down to a large number of vertices (hundreds of thousands); performing that kind of count on that many vertices results in timeouts and inefficiencies (at least in my experience). My workaround has been pre-calculating centrality in another job and writing it to a vertex property that can subsequently be included in a mixed index. So we can do:

g.V()

   .has("someProperty",eq("someValue"))

   .has("outDegree", gt(10))

This works, but it is yet another calculation we must maintain in our pipeline, and while it suffices, it feels more like a workaround than a great solution. I was hoping there is a more optimal approach/strategy. Please let me know.

Thank you,

Zach




Re: Degree-Centrality Filtering & Search – Scalable Strategies for OLTP

"zb...@gmail.com" <zblu...@...>
 

Thank you Boxuan,

Was using the term “job” pretty loosely.  Your inference about doing these things within ingest/deletion process makes sense.

I know there is a lot on the community’s plate now, but if my above solution is truly optimal for the current state, I wonder if a JG feature addition may help tackle this problem more consistently: something like an additional, third index type (in addition to “graph” and “vertex-centric” indices), i.e. an “edge-connection” or “degree-centrality” index. The feature would require a mixed indexing backend and, minimally, a mechanism to choose the vertex and edge label combinations for which IN, OUT, and/or BOTH degree centrality is counted.

Not sure what the level of effort or implementation details would be, but this is a very common business requirement for graph-based search.  If JanusGraph has native/tested support for it, it would make JanusGraph even easier to champion.

😊

Best,

Zach


On Tuesday, December 29, 2020 at 3:19:46 AM UTC-5 li...@... wrote:
Hi Zach,

Personally I think your workaround is the optimal one. JanusGraph does not store the number of edges as metadata in the vertex (there are both pros and cons to doing / not doing this).

Btw, do you have to have another job doing the centrality calculation separately? If your application is built on top of JanusGraph, then you can probably maintain the “outDegree” property when inserting/deleting edges.

Best regards,
Boxuan

On Dec 29, 2020, at 6:49 AM, zblu...@gmail.com <zb...@...> wrote:

Hello all,

Curious about best approaches/practices for scalable degree-centrality search filters on large (millions to billions of nodes) JanusGraphs.  i.e. something like :

g.V()

   .has("someProperty",eq("someValue"))

   .where(outE().count().is(gt(10)));                            

Suppose the has-step narrows down to a large number of vertices (hundreds of thousands); performing that kind of count on that many vertices results in timeouts and inefficiencies (at least in my experience). My workaround has been pre-calculating centrality in another job and writing it to a vertex property that can subsequently be included in a mixed index. So we can do:

g.V()

   .has("someProperty",eq("someValue"))

   .has("outDegree", gt(10))

This works, but it is yet another calculation we must maintain in our pipeline, and while it suffices, it feels more like a workaround than a great solution. I was hoping there is a more optimal approach/strategy. Please let me know.

Thank you,

Zach




Re: Degree-Centrality Filtering & Search – Scalable Strategies for OLTP

BO XUAN LI <libo...@...>
 

Hi Zach,

Personally I think your workaround is the optimal one. JanusGraph does not store the number of edges as metadata in the vertex (there are both pros and cons to doing / not doing this).

Btw, do you have to have another job doing the centrality calculation separately? If your application is built on top of JanusGraph, then you can probably maintain the “outDegree” property when inserting/deleting edges.

Best regards,
Boxuan
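
For anyone wondering what maintaining that counter at write time could look like, below is a minimal Gremlin console sketch rather than an established JanusGraph API: the “outDegree” property, the “relation” label and the ids are placeholders from this thread, and, as noted above, concurrent writers would need locking or single-writer discipline to keep the counter accurate.

// placeholder vertex ids; substitute the real endpoints of the edge being added
fromId = 4104L; toId = 8192L

// add the edge and bump the pre-computed out-degree of the source vertex in the same transaction
g.V(fromId).addE('relation').to(__.V(toId)).iterate()
current = g.V(fromId).values('outDegree').tryNext().orElse(0)   // 0 if the property was never set
g.V(fromId).property('outDegree', current + 1).iterate()
g.tx().commit()

// an edge-removal code path would drop the edge and decrement 'outDegree' in the same way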

On Dec 29, 2020, at 6:49 AM, zblu...@gmail.com <zblu...@...> wrote:

Hello all,

Curious about best approaches/practices for scalable degree-centrality search filters on large (millions to billions of nodes) JanusGraphs.  i.e. something like :

g.V()

   .has("someProperty",eq("someValue"))

   .where(outE().count().is(gt(10)));                            

Suppose the has-step narrows down to a large number of vertices (hundreds of thousands); performing that kind of count on that many vertices results in timeouts and inefficiencies (at least in my experience). My workaround has been pre-calculating centrality in another job and writing it to a vertex property that can subsequently be included in a mixed index. So we can do:

g.V()

   .has("someProperty",eq("someValue"))

   .has("outDegree", gt(10))

This works, but it is yet another calculation we must maintain in our pipeline, and while it suffices, it feels more like a workaround than a great solution. I was hoping there is a more optimal approach/strategy. Please let me know.

Thank you,

Zach




Degree-Centrality Filtering & Search – Scalable Strategies for OLTP

"zb...@gmail.com" <zblu...@...>
 

Hello all,

Curious about best approaches/practices for scalable degree-centrality search filters on large (millions to billions of nodes) JanusGraphs.  i.e. something like :

g.V()

   .has("someProperty",eq("someValue"))

   .where(outE().count().is(gt(10)));                            

Suppose the has-step narrows down to a large number of vertices (hundreds of thousands); performing that kind of count on that many vertices results in timeouts and inefficiencies (at least in my experience). My workaround has been pre-calculating centrality in another job and writing it to a vertex property that can subsequently be included in a mixed index. So we can do:

g.V()

   .has("someProperty",eq("someValue"))

   .has("outDegree", gt(10))

This works, but it is yet another calculation we must maintain in our pipeline, and while it suffices, it feels more like a workaround than a great solution. I was hoping there is a more optimal approach/strategy. Please let me know.

Thank you,

Zach
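
For reference, the pre-computed-degree workaround described above boils down to a schema along these lines. This is only a sketch: the property name “outDegree”, the index name and the “search” backend name are assumptions based on this thread and the usual JanusGraph defaults.

mgmt = graph.openManagement()
someProperty = mgmt.getPropertyKey('someProperty')                  // assumed to exist already
outDegree = mgmt.makePropertyKey('outDegree').dataType(Integer.class).make()
// a mixed index lets the range predicate gt(10) be answered by the index backend
mgmt.buildIndex('vertexByPropertyAndOutDegree', Vertex.class).
     addKey(someProperty).
     addKey(outDegree).
     buildMixedIndex('search')                                      // 'search' = the configured index.search.* backend
mgmt.commit()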


Re: JanusGraph 0.5.2 and BigTable

Assaf Schwartz <schw...@...>
 

Hi Boxuan!

Perhaps I wasn't clear. The composite indexing didn't solve the locking issue (it went away by itself 🙄, as if there were a cold-start issue).
However, my actual problem, the failing lookups, was indeed solved.

Again, many thanks for the information and the prompt reply.
Assaf
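
For anyone hitting the same lookup problem, the fix amounts to backing the equality lookups on r and w with a composite index. A rough sketch of that schema change follows; the index name is illustrative, and on a graph with existing data the new index still has to be registered/enabled and reindexed before it is used.

mgmt = graph.openManagement()
r = mgmt.getPropertyKey('r') ?: mgmt.makePropertyKey('r').dataType(String.class).make()
w = mgmt.getPropertyKey('w') ?: mgmt.makePropertyKey('w').dataType(String.class).make()
// composite indexes are served directly by the storage backend and are ideal for exact-match lookups
mgmt.buildIndex('vertexByRandW', Vertex.class).addKey(r).addKey(w).buildCompositeIndex()
mgmt.commit()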

On Saturday, December 19, 2020 at 10:09:42 AM UTC+2 li...@... wrote:
> About the locking, what do you consider a JVM instance? An instance of the Gremlin server? JanusGraph itself? If I try to use Janus as a cluster (multiple Docker containers instead of one), will that translate to having more than 1 JVM?

Sorry I wasn’t very clear about this. I meant the JVM instance where JanusGraph itself runs. To be accurate, you see local lock contention when multiple threads within the same process contend for the same lock. This is due to JanusGraph’s locking mechanism:

Step 1: Local lock resolution (inter-thread synchronization), using in-memory data structures (a concurrent hash map). If a conflict is detected, you typically see an error message like “Local lock contention”.
Step 2: Inter-process synchronization, using the storage backend (e.g. HBase). If a conflict is detected, you typically see other error messages like “Lock write retry count exceeded”.


If you have multiple transactions contending for the same lock, then it’s better to have them running on the same JVM instance because local lock synchronization is faster and can let conflicting transactions fail early.

Glad to hear you don’t have the problem anymore. To be honest, I don’t know why switching to composite indexes helped you resolve the locking exception issues.

Cheers,
Boxuan
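
As a side note, the retry behaviour behind the “Lock write retry count exceeded” message in step 2 is configurable; a sketch of the relevant settings in the graph properties file (the value shown is illustrative, not a recommendation):

# number of lock-write attempts before "Lock write retry count exceeded" is raised
storage.lock.retries=5
# storage.lock.wait-time controls the pause between those attempts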

On Dec 19, 2020, at 3:41 PM, Assaf Schwartz <sc...@...> wrote:

Thanks a lot Boxuan!

For some reason I missed being notified on your response.
The indexes were indeed the issue (as I had begun to suspect); switching them to composite indexes (there was no real need for them to be mixed) solved the issue :)

About the locking, what do you consider a JVM instance? An instance of the Gremlin server? JanusGraph itself? If I try to use Janus as a cluster (multiple Docker containers instead of one), will that translate to having more than 1 JVM?

Again thanks,
Assaf

On Thursday, December 17, 2020 at 12:39:48 PM UTC+2 libo...@connect.hku.hk wrote:
Hi Assaf,

I am not familiar with GKE but I can try to answer some of your questions:

> how does a traversal behave when looking up based on an index key when the key is not yet indexed

Assuming the index has been enabled: if a particular key is still in the indexing process (e.g. you are in the middle of committing) in one thread, then another thread will not be able to find the data because the index key lookup returns nothing. Note that when you are using a mixed index, the data is written to your primary backend (e.g. HBase) first, and then to the mixed index backend (e.g. Elasticsearch). If the data has already been written into HBase but not into Elasticsearch yet, the querying thread cannot find it (if JanusGraph decides your query can be satisfied by a mixed index).

> org.janusgraph.diskstorage.locking.PermanentLockingException: Local lock contention at org.janusgraph.diskstorage.locking.AbstractLocker.writeLock(AbstractLocker.java:327) 

This usually happens when you have multiple local threads (running on the same JVM instance) contending for the same lock. You might want to check your application logic.

Best regards,
Boxuan
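
One quick way to see whether JanusGraph decides to answer such a lookup from the mixed index (or falls back to a full scan) is the profile() step; a small sketch with the property names from this thread:

// run from the Gremlin console; the metrics output names the graph index (if any) backing each step
g.V().has('r', 'some-r-value').has('w', 'some-w-value').profile()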

On Dec 17, 2020, at 6:16 PM, Assaf Schwartz <sc...@...> wrote:

Could this be related to delays in indexing? I don't know how to figure out whether such a delay exists, but assuming it happens -
how does a traversal behave when looking up an index key that is not yet indexed?

On Thursday, December 17, 2020 at 10:54:32 AM UTC+2 Assaf Schwartz wrote:
Hi All,

I'm experiencing an issue with running JanusGraph (on top of GKE) against BigTable.
This is the general setup description:
  • We are using a single-node BigTable cluster (for development / integration purposes) with the vanilla 0.5.2 Docker image.
  • Indexing is configured to be done with ES (also running on GKE)
  • JanusGraph is configured through environment variables:
  • Interactions with JanusGraph are done only through a single gRPC server running gremlin-python; let's call it DB-SERVER.
  • The last time we tested against BT was with JanusGraph 0.4.1, precompiled to support HBase 1.
  • All of our components communicate via gRPC.
Description of the problem:
  1. The DB-SERVER creates a Vertex i, generates some XML to represent work to be done, and sends it to another service for processing; let's call it ORCHESTRATOR.
  2. The ORCHESTRATOR generates two properties, w and r (local identifiers), and sends them back to the DB-SERVER so they will be set as properties on Vertex i. These two properties are also covered by mixed String indexes.
  3. After setting the properties, DB-SERVER will ack ORCHESTRATOR, which will start processing. As part of the processing, ORCHESTRATOR will send updates back to the DB-SERVER using w and r.
  4. On getting these updates, DB-SERVER will try looking up Vertex i based on w and r, like so:
    g.V().has("r", <some_r>).has("w", <some_w>).next()
  5. At that point, a null / None is returned as the traversal fails to find Vertex i.
  6. Trying the same traversal in a separate console (python and gremlin) does fetch the vertex. Since it's a single instance cluster, I ruled out any eventual consistency issues.
I'm not sure if it's a regression introduced after 0.4.1.
I've also validated that db-caching is turned off.

Help! :)
Many thanks in advance,
Assaf






Re: How to upload rdf bulk data to janus graph

Arpan Jain <arpan...@...>
 

Actually I have around 70 fields. So my doubt is: is it possible to insert some data without bulk loading first, so that JanusGraph creates its own schema, and then use bulk loading (storage.batch-loading=true) for the remaining data?
Will this process give an error?

On Thu, 24 Dec, 2020, 5:14 pm alex...@..., <alexand...@...> wrote:
That's right

On Thursday, December 24, 2020 at 1:43:42 PM UTC+2 ar...@... wrote:
I need to set all these properties in the JanusGraph properties file, right? I mean the config the server starts with, i.e. the file where we set the storage backend, host, etc.

On Thu, 24 Dec, 2020, 4:05 pm alex...@..., <alex...@...> wrote:
Hi,

Try to enable batch loading: "storage.batch-loading=true".
Increase your batch mutations buffer: "storage.buffer-size=20480".
Increase ids block size: "ids.block-size=10000000".
Not sure if your flow just adds data or upserts it. In case it upserts, you may also set "query.batch=true".
That said, I didn't use rdf2gremlin and can't suggest much. The above configurations are just the options I can immediately think of; of course, a proper investigation would be needed to suggest real performance improvements. You may additionally optimize your ScyllaDB for your use cases.

Best regards,
Oleksandr
On Thursday, December 24, 2020 at 12:24:10 PM UTC+2 ar...@... wrote:
I have data in RDF (ttl) format, with around 6 million triples. Currently I use the rdf2gremlin Python script for this conversion, but it is taking too much time: for 10k records it took around 1 hour. I am using ScyllaDB as the JanusGraph backend. Below is the Python code I am using.

import pathlib
import rdflib
from rdf2g import setup_graph

DEFAULT_LOCAL_CONNECTION_STRING = "ws://localhost:8182/gremlin"
g = setup_graph(DEFAULT_LOCAL_CONNECTION_STRING)

OUTPUT_FILE_LAM_PROPERTIES = pathlib.Path("path/to/ttl/file/.ttl").resolve()
rdf_graph = rdflib.Graph()
rdf_graph.parse(str(OUTPUT_FILE_LAM_PROPERTIES), format="ttl")

The same RDF data loads into Neo4j in only about 10 minutes, but I want to use JanusGraph.

Kindly suggest the best way to upload bulk RDF data to JanusGraph using Python or Java.



Re: How to upload rdf bulk data to janus graph

"alex...@gmail.com" <alexand...@...>
 

That's right


On Thursday, December 24, 2020 at 1:43:42 PM UTC+2 ar...@... wrote:
I need to set all these properties in the JanusGraph properties file, right? I mean the config the server starts with, i.e. the file where we set the storage backend, host, etc.

On Thu, 24 Dec, 2020, 4:05 pm alex...@..., <alex...@...> wrote:
Hi,

Try to enable batch loading: "storage.batch-loading=true".
Increase your batch mutations buffer: "storage.buffer-size=20480".
Increase ids block size: "ids.block-size=10000000".
Not sure if your flow just adds data or upserts it. In case it upserts, you may also set "query.batch=true".
That said, I didn't use rdf2gremlin and can't suggest much. The above configurations are just the options I can immediately think of; of course, a proper investigation would be needed to suggest real performance improvements. You may additionally optimize your ScyllaDB for your use cases.

Best regards,
Oleksandr
On Thursday, December 24, 2020 at 12:24:10 PM UTC+2 ar...@... wrote:
I have data in RDF (ttl) format, with around 6 million triples. Currently I use the rdf2gremlin Python script for this conversion, but it is taking too much time: for 10k records it took around 1 hour. I am using ScyllaDB as the JanusGraph backend. Below is the Python code I am using.

import pathlib
import rdflib
from rdf2g import setup_graph

DEFAULT_LOCAL_CONNECTION_STRING = "ws://localhost:8182/gremlin"
g = setup_graph(DEFAULT_LOCAL_CONNECTION_STRING)

OUTPUT_FILE_LAM_PROPERTIES = pathlib.Path("path/to/ttl/file/.ttl").resolve()
rdf_graph = rdflib.Graph()
rdf_graph.parse(str(OUTPUT_FILE_LAM_PROPERTIES), format="ttl")

The same RDF data loads into Neo4j in only about 10 minutes, but I want to use JanusGraph.

Kindly suggest the best way to upload bulk RDF data to JanusGraph using Python or Java.



Re: How to upload rdf bulk data to janus graph

Arpan Jain <arpan...@...>
 

I need to set all these properties in the JanusGraph properties file, right? I mean the config the server starts with, i.e. the file where we set the storage backend, host, etc.


On Thu, 24 Dec, 2020, 4:05 pm alex...@..., <alexand...@...> wrote:
Hi,

Try to enable batch loading: "storage.batch-loading=true".
Increase your batch mutations buffer: "storage.buffer-size=20480".
Increase ids block size: "ids.block-size=10000000".
Not sure if your flow just adds data or upserts it. In case it upserts, you may also set "query.batch=true".
That said, I didn't use rdf2gremlin and can't suggest much. The above configurations are just the options I can immediately think of; of course, a proper investigation would be needed to suggest real performance improvements. You may additionally optimize your ScyllaDB for your use cases.

Best regards,
Oleksandr
On Thursday, December 24, 2020 at 12:24:10 PM UTC+2 ar...@... wrote:
I have data in RDF (ttl) format, with around 6 million triples. Currently I use the rdf2gremlin Python script for this conversion, but it is taking too much time: for 10k records it took around 1 hour. I am using ScyllaDB as the JanusGraph backend. Below is the Python code I am using.

import pathlib
import rdflib
from rdf2g import setup_graph

DEFAULT_LOCAL_CONNECTION_STRING = "ws://localhost:8182/gremlin"
g = setup_graph(DEFAULT_LOCAL_CONNECTION_STRING)

OUTPUT_FILE_LAM_PROPERTIES = pathlib.Path("path/to/ttl/file/.ttl").resolve()
rdf_graph = rdflib.Graph()
rdf_graph.parse(str(OUTPUT_FILE_LAM_PROPERTIES), format="ttl")

The same RDF data loads into Neo4j in only about 10 minutes, but I want to use JanusGraph.

Kindly suggest the best way to upload bulk RDF data to JanusGraph using Python or Java.



Re: How to upload rdf bulk data to janus graph

"alex...@gmail.com" <alexand...@...>
 

Hi,

Try to enable batch loading: "storage.batch-loading=true".
Increase your batch mutations buffer: "storage.buffer-size=20480".
Increase ids block size: "ids.block-size=10000000".
Not sure if your flow just adds data or upserts it. In case it upserts, you may also set "query.batch=true".
That said, I didn't use rdf2gremlin and can't suggest much. The above configurations are just the options I can immediately think of; of course, a proper investigation would be needed to suggest real performance improvements. You may additionally optimize your ScyllaDB for your use cases.

Best regards,
Oleksandr
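
Put together in the JanusGraph server's properties file, the suggestions above would look roughly like this (the buffer and block sizes are the example values from this thread, not universal recommendations):

storage.batch-loading=true
storage.buffer-size=20480
ids.block-size=10000000
# only if the load upserts existing elements rather than blindly adding new ones
query.batch=true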

On Thursday, December 24, 2020 at 12:24:10 PM UTC+2 ar...@... wrote:
I have data in RDF (ttl) format, with around 6 million triples. Currently I use the rdf2gremlin Python script for this conversion, but it is taking too much time: for 10k records it took around 1 hour. I am using ScyllaDB as the JanusGraph backend. Below is the Python code I am using.

import pathlib
import rdflib
from rdf2g import setup_graph

DEFAULT_LOCAL_CONNECTION_STRING = "ws://localhost:8182/gremlin"
g = setup_graph(DEFAULT_LOCAL_CONNECTION_STRING)

OUTPUT_FILE_LAM_PROPERTIES = pathlib.Path("path/to/ttl/file/.ttl").resolve()
rdf_graph = rdflib.Graph()
rdf_graph.parse(str(OUTPUT_FILE_LAM_PROPERTIES), format="ttl")

The same RDF data loads into Neo4j in only about 10 minutes, but I want to use JanusGraph.

Kindly suggest the best way to upload bulk RDF data to JanusGraph using Python or Java.


How to upload rdf bulk data to janus graph

Arpan Jain <arpan...@...>
 

I have data in RDF (ttl) format, with around 6 million triples. Currently I use the rdf2gremlin Python script for this conversion, but it is taking too much time: for 10k records it took around 1 hour. I am using ScyllaDB as the JanusGraph backend. Below is the Python code I am using.

import pathlib
import rdflib
from rdf2g import setup_graph

DEFAULT_LOCAL_CONNECTION_STRING = "ws://localhost:8182/gremlin"
g = setup_graph(DEFAULT_LOCAL_CONNECTION_STRING)

OUTPUT_FILE_LAM_PROPERTIES = pathlib.Path("path/to/ttl/file/.ttl").resolve()
rdf_graph = rdflib.Graph()
rdf_graph.parse(str(OUTPUT_FILE_LAM_PROPERTIES), format="ttl")

The same RDF data loads into Neo4j in only about 10 minutes, but I want to use JanusGraph.

Kindly suggest the best way to upload bulk RDF data to JanusGraph using Python or Java.
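
Beyond the configuration discussed above, the usual way to speed up a bulk load from the JVM side is to reuse one transaction for many elements and commit in batches instead of per triple. Below is a minimal Gremlin console sketch, not the rdf2gremlin internals: the 'node' label, the 'iri' property, the batch size and the parsedTriples collection are all placeholders, it assumes the property keys either exist in the schema or automatic schema creation is enabled, and mapping object triples to edges is left out for brevity.

// assumes a properties file with the batch-loading settings suggested earlier in this thread
graph = JanusGraphFactory.open('conf/janusgraph-scylla.properties')
g = graph.traversal()

batchSize = 10000
count = 0
parsedTriples.each { t ->                                 // placeholder for your parsed RDF statements
    // get-or-create the subject vertex (a composite index on 'iri' keeps this lookup cheap),
    // then attach the predicate/object as a plain property
    g.V().has('node', 'iri', t.subject).fold().
      coalesce(unfold(), addV('node').property('iri', t.subject)).
      property(t.predicate, t.object).iterate()
    if (++count % batchSize == 0) g.tx().commit()         // commit per batch, not per triple
}
g.tx().commit()
graph.close()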